Category Archives: Research

WikiBench master thesis

I officially got my WikiBench project graded with an 8, which of course makes me very happy. You can now read my thesis, called "WikiBench: A distributed, Wikipedia based web application benchmark".

People interested in this project have already found their way to my blog. For those who are wondering: I will publish the code, and I will do so very shortly (within days). It will most probably appear on Google Code, and you won't have to search for it, since I will devote a post to it right here and include the URL. I will probably release it under a BSD-style license, which should give you lots of freedom. Unfortunately, I am not yet sure whether I am allowed to release some of the trace files obtained from Wikipedia.

Bloom filters

The survey article "Network Applications of Bloom Filters: A Survey" by Broder and Mitzenmacher gives a good description of how Bloom filters work and what they can do for you. A Bloom filter basically replaces a dataset with a compact filter that can tell you whether an item is a member of that set or not. It will never give false negatives, but it might give false positives. In practice, this downside can be outweighed by the space savings a Bloom filter brings; after all, you no longer need to query the dataset itself to determine membership. The most important, summarizing quote to remember from the article:

The Bloom filter principle: Wherever a list or set is used, and space is at a premium, consider using a Bloom filter if the effect of false positives can be mitigated.

The article also gives a number of examples in which Bloom filters are used, e.g. to aid resource location in peer-to-peer and cache systems.
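To make the idea a bit more concrete, here is a minimal Bloom filter sketch in Python. This is purely my own illustration, not code from the article: it hashes each item to k positions in an m-bit array, sets those bits when adding, and reports membership only if all k bits are set.

import hashlib

class BloomFilter:
    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k      # m bits, k hash functions
        self.bits = 0              # the m-bit array, stored in a Python int

    def _positions(self, item):
        # Derive k bit positions from one SHA-256 digest of the item.
        digest = hashlib.sha256(item.encode()).digest()
        for i in range(self.k):
            yield int.from_bytes(digest[4 * i:4 * i + 4], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        # False means "definitely not in the set", True means "probably in the set".
        return all(self.bits >> pos & 1 for pos in self._positions(item))

bf = BloomFilter()
bf.add("Amsterdam")
print("Amsterdam" in bf)   # always True: no false negatives
print("Rotterdam" in bf)   # normally False, very occasionally a false positive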

WikiBench presentation

Today I presented my master research project to a group of people at the Vrije Universiteit. The project is called "WikiBench, a distributed Wikipedia based web application benchmark". You can view my slides at this URL if you are interested. You can find more information, and the master thesis I wrote about it, at http://www.wikibench.eu

Importing the complete Wikipedia database in 5 hours

Say you want to set up a local installation of Wikipedia. Installing the software is easy: just go to www.mediawiki.org and download the MediaWiki software, created by the Wikimedia Foundation, which is the foundation behind Wikipedia. That's right, you're guaranteed to mix up those names at some point in your life.

OK, we have it running on a local Ubuntu installation, on Apache and MySQL 5 with PHP 5. Not a big deal. Now it's time to download the Wikipedia data, which is licensed in such a way that you are free to use it. In my case, I will be using it for scientific research, about which I'll surely post more in the coming months. The English dump of just the current version of all the articles is more than 4 gigabytes in compressed bzip2 format. Decompress it and you are left with a whopping 18.2 GB XML file. Ouch. Importing this XML file with MediaWiki's importDump.php maintenance script will take lots of time. It starts importing at a rate of 4 pages per second, but this rate goes down to 2.5 pages per second after about 20,000 pages. After 30,000 pages it seemed to stabilize at 2.25 pages/sec, so I started to do some math. There are about 15 million pages in total, if I remember correctly. That is 15,000,000 / (2.25 * 60 * 60 * 24) ≈ 77 days.

Unfortunately, though not unexpectedly, the rate keeps going down. After 150,000 pages I am down to 1.7 pages/sec. Maybe this wasn't such a good idea...
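For reference, the slow route I was taking boils down to something like the following; the dump file name matches the one used further below, and the script path assumes a stock MediaWiki checkout.

# decompress the dump, then feed it to MediaWiki's standard (but slow) XML importer
# run the second command from the MediaWiki root directory
bzip2 -d pages_full.xml.bz2
php maintenance/importDump.php < pages_full.xml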

Plan B

I can almost hear you screaming now: ‘thank god, there is a plan B’.

Head over to the MWDumper page and download the jar file. Follow the instructions and use the example they give, which looks like this:

java -jar mwdumper.jar --format=sql:1.5 pages_full.xml.bz2 |
   mysql -u <username> -p <databasename>

This is a lot better already: it imports at 200 to 300 pages/sec. But it gets even better. If you remove ALL indexes and auto_increment fields, the speed goes up to beyond 2,000 pages per second! Don't forget to re-add the indexes and auto_increment fields when the import is done; a sketch of what that looks like for one table follows below.
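As an illustration, this is roughly what it comes down to for the page table. The index and column names below are what I would expect on a MediaWiki 1.5-style schema, not something to copy blindly; check SHOW CREATE TABLE page on your own installation first.

-- before the import: drop the secondary indexes and the auto_increment
ALTER TABLE page
  DROP INDEX name_title,
  DROP INDEX page_random,
  DROP INDEX page_len,
  MODIFY page_id INT UNSIGNED NOT NULL;

-- after the import: restore them
ALTER TABLE page
  ADD UNIQUE INDEX name_title (page_namespace, page_title),
  ADD INDEX page_random (page_random),
  ADD INDEX page_len (page_len),
  MODIFY page_id INT UNSIGNED NOT NULL AUTO_INCREMENT;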

You can have your own locally installed Wikipedia in about 5 hours, or less if your PC is fast. The example PC I used is a relatively dated AMD Athlon 3200+ with 1 GB of memory and a regular SATA disk.

I've not been entirely fair with you so far. If you want a working, complete copy of Wikipedia, you will also need to import a number of SQL dumps of additional tables, especially the tables that describe the link structure. Although these tables are large, they are not that difficult to import: just disable/remove the indexes again, import the data, and recreate the indexes afterwards. An example follows below.
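For example, importing the pagelinks table comes down to the command below. The file name follows the pattern used for the downloadable Wikipedia dumps; substitute the dump you actually fetched.

# pagelinks holds the internal link structure; other table dumps work the same way
gunzip -c enwiki-latest-pagelinks.sql.gz | mysql -u <username> -p <databasename>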

P.S.: you might want to look into MySQL's binary logging. Turning it off, or reducing the maximum log size to 1 MB, will increase performance too!
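In my.cnf that comes down to something like the snippet below; the option names are the MySQL 5.x ones, so double-check them against your server version.

# in the [mysqld] section of my.cnf:
# disable binary logging by leaving log-bin commented out ...
# log-bin = mysql-bin
# ... or keep it enabled but cap the size of each binary log file
max_binlog_size = 1M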

Plan C

“What?! There is a plan C?!”

Yes, there is. There is another tool called xml2sql. If mwdumper does not give you the speed you need, you can use this tool to extract the data from the XML file too. It's fast, but you will have to apply a small patch to the source code before you can compile it.