<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Eriky.com &#187; wikipedia</title>
	<atom:link href="http://www.eriky.com/tag/wikipedia/feed" rel="self" type="application/rss+xml" />
	<link>http://www.eriky.com</link>
	<description>Just another geek with a blog</description>
	<lastBuildDate>Fri, 16 Dec 2011 23:04:27 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>WikiBench master thesis</title>
		<link>http://www.eriky.com/2009/05/wikibench-master-thesis</link>
		<comments>http://www.eriky.com/2009/05/wikibench-master-thesis#comments</comments>
		<pubDate>Wed, 20 May 2009 08:24:46 +0000</pubDate>
		<dc:creator>Erik-Jan</dc:creator>
				<category><![CDATA[Research]]></category>
		<category><![CDATA[WikiBench]]></category>
		<category><![CDATA[wikipedia]]></category>

		<guid isPermaLink="false">http://www.eriky.com/?p=107</guid>
		<description><![CDATA[<a href="http://www.eriky.com/2009/05/wikibench-master-thesis" title="WikiBench master thesis"></a>I officially got my WikiBench project graded with an 8, with which I&#8217;m of course very satisfied. You can now read my thesis called WikiBench: A distributed, Wikipedia based web application benchmark. People interested in this project already found their &#8230;<p class="read-more"><a href="http://www.eriky.com/2009/05/wikibench-master-thesis">Read more &#187;</a></p>]]></description>
			<content:encoded><![CDATA[<a href="http://www.eriky.com/2009/05/wikibench-master-thesis" title="WikiBench master thesis"></a><p>I officially got my WikiBench project graded with an 8, with which I&#8217;m of course very satisfied. You can now read my thesis called <em><a href="http://www.eriky.com/wp-content/uploads/2009/05/wikibench.pdf">WikiBench: A distributed, Wikipedia based web application benchmark</a>.</em></p>
<p>People interested in this project already found their way to my blog. For those who are wondering: I will publish the code and I will do so very shortly (within days). It will most probably appear on Google Code and you won&#8217;t have to search for it since I will devote a post to it right here and include the URL. I will probably release it under a BSD-style license which should give you lots of freedom. Unfortunately I&#8217;m not sure yet if I am allowed to release some of the trace files obtained from Wikipedia.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.eriky.com/2009/05/wikibench-master-thesis/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Importing the complete wikipedia database in 5 hours</title>
		<link>http://www.eriky.com/2008/11/importing-the-complete-english-wikipedia-database</link>
		<comments>http://www.eriky.com/2008/11/importing-the-complete-english-wikipedia-database#comments</comments>
		<pubDate>Sat, 08 Nov 2008 15:10:50 +0000</pubDate>
		<dc:creator>Erik-Jan</dc:creator>
				<category><![CDATA[Research]]></category>
		<category><![CDATA[mediawiki]]></category>
		<category><![CDATA[wikipedia]]></category>

		<guid isPermaLink="false">http://www.eriky.com/?p=3</guid>
		<description><![CDATA[<a href="http://www.eriky.com/2008/11/importing-the-complete-english-wikipedia-database" title="Importing the complete wikipedia database in 5 hours"></a>Say you want to set up a local installation of Wikipedia. Installing the software is easy: just go to www.mediawiki.org and download the MediaWiki software, created by the Wikimedia Foundation, which is the foundation of Wikipedia. That&#8217;s right, you&#8217;re guaranteed to &#8230;<p class="read-more"><a href="http://www.eriky.com/2008/11/importing-the-complete-english-wikipedia-database">Read more &#187;</a></p>]]></description>
			<content:encoded><![CDATA[<a href="http://www.eriky.com/2008/11/importing-the-complete-english-wikipedia-database" title="Importing the complete wikipedia database in 5 hours"></a><p>Say you want to set up a local installation of Wikipedia. Installing the software is easy: just go to www.mediawiki.org and download the MediaWiki software, created by the Wikimedia Foundation, which is the foundation of Wikipedia. That&#8217;s right, you&#8217;re guaranteed to mess up the names at some point in your life.</p>
<p>OK, we have it running on a local Ubuntu installation, on Apache and MySQL5 with PHP5. Not a big deal. Now it&#8217;s time to download the Wikipedia data, which is licensed in such a way that you are free to use it. In my case, I will be using it for scientific research, about which I&#8217;ll surely post more in the coming months. The English data of just the current version of all the articles is 4+ gigabytes in compressed bzip2 format. Let&#8217;s decompress &#8211; you are now left with a whopping 18.2 GB XML file. Ouch. Importing this XML file with the importDump.php tool from MediaWiki will take lots of time. It starts importing at a rate of 4 pages per second, but this rate goes down to 2.5 per second after about 20,000 pages. After 30,000 pages, it seemed to stabilize at 2.25 pages/sec, so I started to do some math. There are about 15 million pages in total, if I remember correctly. That is 15,000,000 / (2.25 * 60*60*24) = 77 days.</p>
<p>Unfortunately, but to be expected, the rate keeps going down. After 150,000 pages I&#8217;m now at 1.7 pages/sec. Maybe this wasn&#8217;t such a good idea.</p>
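<p>The back-of-the-envelope math above can be redone for each of the observed rates. This is a minimal sketch that assumes the rate stays constant for the whole import and that the total really is about 15 million pages (a figure quoted from memory); the function name <code>import_days</code> is made up for illustration.</p>

```python
SECONDS_PER_DAY = 60 * 60 * 24


def import_days(total_pages: int, pages_per_second: float) -> float:
    """Days needed to import total_pages at a constant rate."""
    return total_pages / (pages_per_second * SECONDS_PER_DAY)


# Assumed total: about 15 million pages, as quoted in the post.
TOTAL_PAGES = 15_000_000

# Rates observed during the import, in pages per second.
for rate in (4, 2.5, 2.25):
    print(f"{rate} pages/sec: {import_days(TOTAL_PAGES, rate):.1f} days")
```

<p>Even at the initial rate of 4 pages per second the import would take well over a month, so the slowdown only makes a bad situation worse.</p>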
<h2>Plan B</h2>
<p>I can almost hear you screaming now: &#8216;thank god, there is a plan B&#8217;.</p>
<p>Head over to the <a href="http://www.mediawiki.org/wiki/MWDumper">MWDumper page</a> and download the jar file. Follow the instructions and use the example they give, which looks like this:</p>
<pre>java -jar mwdumper.jar --format=sql:1.5 pages_full.xml.bz2 |
   mysql -u &lt;username&gt; -p &lt;databasename&gt;</pre>
<p>This is a lot better already: it will import at 200 to 300 pages/sec. But it gets better. If you remove ALL indexes and auto_increments, the speed goes up beyond 2000 pages per second! Don&#8217;t forget to re-add the indexes and auto_increment fields when the import is done.</p>
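<p>One way to keep track of the indexes you remove is to generate both the drop and the re-add statements from a single list. The sketch below only builds SQL strings and does not touch a database. The table name, index names and columns are hypothetical placeholders, not the real MediaWiki schema, and it covers plain secondary indexes only (primary keys, unique indexes and auto_increment columns need their own statements).</p>

```python
def drop_index_statements(table, indexes):
    """Build ALTER TABLE statements that drop each named index."""
    return [f"ALTER TABLE {table} DROP INDEX {name};" for name in indexes]


def add_index_statements(table, indexes):
    """Build ALTER TABLE statements that recreate each index afterwards."""
    return [
        f"ALTER TABLE {table} ADD INDEX {name} ({', '.join(columns)});"
        for name, columns in indexes.items()
    ]


# Hypothetical table and index definitions, for illustration only.
example_indexes = {
    "idx_title": ["ns", "title"],
    "idx_len": ["len"],
}

for statement in drop_index_statements("example_page", example_indexes):
    print(statement)
# ... run the MWDumper import here ...
for statement in add_index_statements("example_page", example_indexes):
    print(statement)
```

<p>Look up the actual index definitions in your own installation before dropping anything, so the re-add step restores exactly what was there.</p>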
<p>You can have your own locally installed Wikipedia in about 5 hours or less if your PC is fast. The example PC I used is a relatively dated AMD athlon 3200+ with 1GB of memory and a regular sata disk.</p>
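<p>As a rough sanity check on the 5 hour figure, here is the same kind of arithmetic as before, again assuming about 15 million pages and a constant rate. The helper names are made up for illustration.</p>

```python
SECONDS_PER_HOUR = 60 * 60


def import_hours(total_pages: int, pages_per_second: float) -> float:
    """Hours needed to import total_pages at a constant rate."""
    return total_pages / (pages_per_second * SECONDS_PER_HOUR)


def required_rate(total_pages: int, hours: float) -> float:
    """Average pages per second needed to finish within the given hours."""
    return total_pages / (hours * SECONDS_PER_HOUR)


# Assumed total: about 15 million pages, as quoted in the post.
TOTAL_PAGES = 15_000_000

print(f"With indexes, 250 pages/sec: {import_hours(TOTAL_PAGES, 250):.1f} hours")
print(f"Without indexes, 2000 pages/sec: {import_hours(TOTAL_PAGES, 2000):.1f} hours")
print(f"Rate needed for 5 hours: {required_rate(TOTAL_PAGES, 5):.0f} pages/sec")
```

<p>Under these assumptions a 5 hour import needs an average of roughly 830 pages per second, so the estimate leaves room for the rate to dip well below the 2000 pages per second peak.</p>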
<p>I&#8217;ve not been entirely fair with you so far. If you want a working, complete copy of Wikipedia you will also need to import a number of SQL dumps for other tables, especially the tables that provide information about link structure. Although these tables are large, they are not that difficult to import. Just disable or remove the indexes again and import the data. After importing you can recreate the indexes.</p>
<p>P.S.: you might want to look into MySQL&#8217;s binary logging. Turning it off or reducing the maximum log size to 1MB will increase performance too!</p>
<h2>Plan C</h2>
<p>&#8220;What?! There is a plan C?!&#8221;</p>
<p>Yes, there is. There is another tool called <a title="xml2sql" href="http://meta.wikimedia.org/wiki/Xml2sql">xml2sql</a>. If MWDumper does not give you the speed you need, you can use this tool to extract data from the XML file too. It&#8217;s fast, but you will have to apply a small patch to the source code before you compile it.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.eriky.com/2008/11/importing-the-complete-english-wikipedia-database/feed</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
	</channel>
</rss>

