<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Erik-Jan van Baaren &#187; Research</title>
	<atom:link href="http://www.eriky.com/category/research/feed" rel="self" type="application/rss+xml" />
	<link>http://www.eriky.com</link>
	<description>Just another geek with a blog</description>
	<lastBuildDate>Fri, 07 May 2010 19:13:28 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.1</generator>
		<item>
		<title>WikiBench master thesis</title>
		<link>http://www.eriky.com/2009/05/wikibench-master-thesis</link>
		<comments>http://www.eriky.com/2009/05/wikibench-master-thesis#comments</comments>
		<pubDate>Wed, 20 May 2009 08:24:46 +0000</pubDate>
		<dc:creator>Erik-Jan</dc:creator>
				<category><![CDATA[Research]]></category>
		<category><![CDATA[WikiBench]]></category>
		<category><![CDATA[wikipedia]]></category>

		<guid isPermaLink="false">http://www.eriky.com/?p=107</guid>
		<description><![CDATA[I officially got my WikiBench project graded with an 8, with which I&#8217;m of course very satisfied. You can now read my thesis called WikiBench: A distributed, Wikipedia based web application benchmark. People interested in this project already found their way to my blog. For those who are wondering: I will publish the code and [...]]]></description>
			<content:encoded><![CDATA[<p>I officially got my WikiBench project graded with an 8, with which I&#8217;m of course very satisfied. You can now read my thesis called <em><a href="http://www.eriky.com/wp-content/uploads/2009/05/wikibench.pdf">WikiBench: A distributed, Wikipedia based web application benchmark</a>.</em></p>
<p>People interested in this project already found their way to my blog. For those who are wondering: I will publish the code and I will do so very shortly (within days). It will most probably appear on Google Code and you won&#8217;t have to search for it since I will devote a post to it right here and include the URL. I will probably release it under a BSD-style license which should give you lots of freedom. Unfortunately I&#8217;m not sure yet if I am allowed to release some of the trace files obtained from Wikipedia.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.eriky.com/2009/05/wikibench-master-thesis/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Bloom filters</title>
		<link>http://www.eriky.com/2009/03/bloom-filters</link>
		<comments>http://www.eriky.com/2009/03/bloom-filters#comments</comments>
		<pubDate>Tue, 03 Mar 2009 22:19:11 +0000</pubDate>
		<dc:creator>Erik-Jan</dc:creator>
				<category><![CDATA[Research]]></category>
		<category><![CDATA[bloom filters]]></category>
		<category><![CDATA[distributed systems]]></category>
		<category><![CDATA[math]]></category>

		<guid isPermaLink="false">http://www.eriky.com/?p=93</guid>
		<description><![CDATA[This article by Broder and Mitzenmacher gives a good description of how Bloom filters work and what they can do for you. A Bloom filter basically replaces a dataset with a compact filter that can tell you whether an item is a member of that set. It will not give false negatives, but it [...]]]></description>
			<content:encoded><![CDATA[<p><a title="Bloom filters" href="http://security.riit.tsinghua.edu.cn/seminar/2006_11_23/Bloomfilter_survey_Broder.pdf">This article</a> by Broder and Mitzenmacher gives a good description of how Bloom filters work and what they can do for you. A Bloom filter basically replaces a dataset with a compact filter that can tell you whether an item is a member of that set. It will not give false negatives, but it might give false positives. In practice, this drawback can be outweighed by the space savings a Bloom filter offers; after all, you do not need to query the dataset to determine membership. The most important quote to remember from the article, which also summarizes it:</p>
<p style="padding-left: 30px;"><strong>The Bloom filter principle:</strong> Wherever a list or set is used, and space is at a premium, consider using a Bloom filter if the effect of false positives can be mitigated.</p>
<p>The article also gives a number of examples in which Bloom filters are used, for example to aid resource location in P2P and cache systems.</p>
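<p>To make the membership test concrete, here is a minimal Python sketch of a Bloom filter. It is not taken from the article: the class name, the sizes, and the choice of salted SHA-256 as the hash family are all assumptions made for illustration, and a real implementation would pack bits instead of spending a whole byte per slot.</p>

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: num_slots slots, num_hashes salted SHA-256 hashes."""

    def __init__(self, num_slots, num_hashes):
        self.num_slots = num_slots
        self.num_hashes = num_hashes
        # One byte per slot keeps the sketch readable; a real filter packs bits.
        self.slots = bytearray(num_slots)

    def _positions(self, item):
        # Salt the item with the hash index to simulate independent hashes.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_slots

    def add(self, item):
        for pos in self._positions(item):
            self.slots[pos] = 1

    def might_contain(self, item):
        # False means definitely absent; True means present or a false positive.
        return all(self.slots[pos] for pos in self._positions(item))


bloom = BloomFilter(num_slots=1024, num_hashes=3)
for title in ("Amsterdam", "Bloom_filter", "MediaWiki"):
    bloom.add(title)

print(bloom.might_contain("Bloom_filter"))  # always True: no false negatives
```

<p>An item that was added always sets all of its slots, so a lookup for it can never fail. An item that was never added may still find all of its slots set by other items, which is where the false positives come from.</p>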
]]></content:encoded>
			<wfw:commentRss>http://www.eriky.com/2009/03/bloom-filters/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>WikiBench presentation</title>
		<link>http://www.eriky.com/2009/02/wikibench-presentation</link>
		<comments>http://www.eriky.com/2009/02/wikibench-presentation#comments</comments>
		<pubDate>Thu, 19 Feb 2009 17:29:15 +0000</pubDate>
		<dc:creator>Erik-Jan</dc:creator>
				<category><![CDATA[Research]]></category>

		<guid isPermaLink="false">http://www.eriky.com/?p=56</guid>
		<description><![CDATA[Today I presented my master&#8217;s research project to a group of people at the Vrije Universiteit. The project is called &#8220;WikiBench, a distributed Wikipedia based web application benchmark&#8221;. You can view my slides at this URL if you are interested. The thesis (and source code!) will be released towards the end of March.]]></description>
			<content:encoded><![CDATA[<p>Today I presented my master&#8217;s research project to a group of people at the Vrije Universiteit. The project is called <strong>&#8220;WikiBench, a distributed Wikipedia based web application benchmark&#8221;</strong>. You can view my slides at <a href="http://docs.google.com/Presentation?id=dcn3ghxm_58fgfxvmds">this URL</a> if you are interested. The thesis (and source code!) will be released towards the end of March.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.eriky.com/2009/02/wikibench-presentation/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Importing the complete Wikipedia database in 5 hours</title>
		<link>http://www.eriky.com/2008/11/importing-the-complete-english-wikipedia-database</link>
		<comments>http://www.eriky.com/2008/11/importing-the-complete-english-wikipedia-database#comments</comments>
		<pubDate>Sat, 08 Nov 2008 15:10:50 +0000</pubDate>
		<dc:creator>Erik-Jan</dc:creator>
				<category><![CDATA[Research]]></category>
		<category><![CDATA[mediawiki]]></category>
		<category><![CDATA[wikipedia]]></category>

		<guid isPermaLink="false">http://www.eriky.com/?p=3</guid>
		<description><![CDATA[Say you want to set up a local installation of Wikipedia. Installing the software is easy: just go to www.mediawiki.org and download the MediaWiki software, created by the Wikimedia Foundation, which is the foundation of Wikipedia. That&#8217;s right, you&#8217;re guaranteed to mess up the names at some point in your life. OK, we have it running [...]]]></description>
			<content:encoded><![CDATA[<p>Say you want to set up a local installation of Wikipedia. Installing the software is easy: just go to www.mediawiki.org and download the MediaWiki software, created by the Wikimedia Foundation, which is the foundation of Wikipedia. That&#8217;s right, you&#8217;re guaranteed to mess up the names at some point in your life.</p>
<p>OK, we have it running on a local Ubuntu installation, on Apache and MySQL5 with PHP5. Not a big deal. Now it&#8217;s time to download the Wikipedia data, which is licensed in such a way that you are free to use it. In my case, I will be using it for scientific research, about which I&#8217;ll surely post more in the coming months. The English data of just the current version of all the articles is 4+ gigabytes in compressed bzip2 format. Let&#8217;s decompress: you are now left with a whopping 18.2GB XML file. Ouch. Importing this XML file with the importDump.php tool from MediaWiki will take lots of time. It starts importing at a rate of 4 pages per second, but this rate goes down to 2.5 per second after about 20,000 pages. After 30,000 pages, it seemed to stabilize at 2.25 pages/sec, so I started to do some math. There are about 15 million pages in total, if I remember correctly. That is 15,000,000 / (2.25 * 60*60*24) = 77 days.</p>
<p>Unfortunately, but as expected, the rate keeps going down. After 150,000 pages I&#8217;m now at 1.7 pages/sec. Maybe this wasn&#8217;t such a good idea...</p>
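<p>The back-of-the-envelope estimate is easy to redo for each of the observed rates. The small Python sketch below does just that; the function and constant names are made up for this example, and the page count is the same rough 15 million used in the text.</p>

```python
SECONDS_PER_DAY = 60 * 60 * 24


def import_days(total_pages, pages_per_second):
    """Days needed to import total_pages at a constant rate."""
    return total_pages / (pages_per_second * SECONDS_PER_DAY)


TOTAL_PAGES = 15_000_000  # rough figure quoted in the post, not an exact count

for rate in (4, 2.5, 2.25, 1.7):
    print(f"{rate} pages/sec: {import_days(TOTAL_PAGES, rate):.1f} days")
```

<p>At 2.25 pages/sec this gives the 77 days mentioned above, and at 1.7 pages/sec it is already about 102 days, assuming the rate would not drop any further.</p>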
<h2>Plan B</h2>
<p>I can almost hear you screaming now: &#8216;thank god, there is a plan B&#8217;.</p>
<p>Head over to the <a href="http://www.mediawiki.org/wiki/MWDumper">MWDumper page</a> and download the jar file. Follow the instructions and use the example they give, which looks like this:</p>
<pre>java -jar mwdumper.jar --format=sql:1.5 pages_full.xml.bz2 |
   mysql -u &lt;username&gt; -p &lt;databasename&gt;</pre>
<p>This is a lot better already: it will import at 200 to 300 pages/sec. But it gets better. If you remove ALL indexes and auto_increments, the speed goes up beyond 2000 pages per second! Don&#8217;t forget to re-add the indexes and auto_increment fields when the import is done.</p>
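<p>Dropping and re-adding indexes by hand is error-prone, so it can help to generate the statements. The following Python sketch is only an illustration: the helper name is made up, the index names and columns are assumptions that should be checked against your own schema with SHOW INDEX, and it only covers plain (non-unique) indexes, not unique keys or auto_increment columns.</p>

```python
def index_statements(table, indexes):
    """Return (drop, recreate) SQL statement lists for plain indexes.

    indexes maps an index name to the tuple of columns it covers.
    """
    drops = [f"ALTER TABLE {table} DROP INDEX {name};" for name in indexes]
    creates = [
        f"ALTER TABLE {table} ADD INDEX {name} ({', '.join(cols)});"
        for name, cols in indexes.items()
    ]
    return drops, creates


# Assumed index definitions; verify the real ones with SHOW INDEX FROM page.
page_indexes = {
    "page_random": ("page_random",),
    "page_len": ("page_len",),
}

drops, creates = index_statements("page", page_indexes)
print("\n".join(drops))    # run these before the import
print("\n".join(creates))  # run these after the import
```

<p>Saving the generated statements in two files, one to run before the import and one to run after it, makes it harder to forget an index when the import is done.</p>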
<p>You can have your own locally installed Wikipedia in about 5 hours or less if your PC is fast. The example PC I used is a relatively dated AMD Athlon 3200+ with 1GB of memory and a regular SATA disk.</p>
<p>I&#8217;ve not been entirely fair with you so far. If you want a working, complete copy of Wikipedia you will also need to import a number of SQL dumps for different tables, especially the tables that provide information about link structure. Although these tables are large, they are not that difficult to import. Just disable or remove the indexes again and import the data. After importing you can recreate the indexes.</p>
<p>P.S.: you might want to look into MySQL&#8217;s binary logging. Turning it off or reducing the maximum log size to 1MB will increase performance too!</p>
<h2>Plan C</h2>
<p>&#8220;What?! There is a plan C?!&#8221;</p>
<p>Yes, there is. There is another tool called <a title="xml2sql" href="http://meta.wikimedia.org/wiki/Xml2sql">xml2sql</a>. If MWDumper does not give you the speed you need, you can use this tool to extract data from the XML file too. It&#8217;s fast, but you will have to apply a small patch to the source code before you compile it.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.eriky.com/2008/11/importing-the-complete-english-wikipedia-database/feed</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
	</channel>
</rss>
