Say you want to set up a local installation of Wikipedia. Installing the software is easy: just go to www.mediawiki.org and download the MediaWiki software, created by the Wikimedia Foundation, which is the foundation of Wikipedia. That’s right, you’re guaranteed to mess up the names at some point in your life.
OK, we have it running on a local Ubuntu installation, on Apache and MySQL 5 with PHP 5. Not a big deal. Now it’s time to download the Wikipedia data, which is licensed in such a way that you are free to use it. In my case, I will be using it for scientific research, about which I’ll surely post more in the coming months.
The English data, containing just the current version of every article, is 4+ gigabytes in compressed bzip2 format. Decompress it and you are left with a whopping 18.2 GB XML file. Ouch. Importing this XML file with the importDump.php tool from MediaWiki will take lots of time. It starts importing at a rate of 4 pages per second, but this rate goes down to 2.5 per second after about 20,000 pages. After 30,000 pages it seemed to stabilize at 2.25 pages/sec, so I started to do some math. There are about 15 million pages in total, if I remember correctly. That is 15,000,000 / (2.25 * 60*60*24) = 77 days.
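The back-of-the-envelope math above is easy to redo for other rates. Here is a minimal Python sketch; the function name import_days is made up for this example, and the 15 million page count is the same from-memory figure as above. It assumes a constant import rate, which is optimistic here, and it reproduces the roughly 77 days at 2.25 pages/sec.

```python
SECONDS_PER_DAY = 60 * 60 * 24


def import_days(total_pages, pages_per_second):
    """Days needed to import total_pages at a constant rate (hypothetical helper)."""
    return total_pages / (pages_per_second * SECONDS_PER_DAY)


if __name__ == "__main__":
    total = 15_000_000  # rough page count quoted in the post, not a measured value
    # The three rates observed during the import.
    for rate in (4.0, 2.5, 2.25):
        print(f"{rate} pages/sec -> {import_days(total, rate):.1f} days")
```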
Unfortunately, but as expected, the rate keeps going down. After 150,000 pages I’m now at 1.7 pages/sec. Maybe this wasn’t such a good idea...
I can almost hear you screaming now: ‘thank god, there is a plan B’.
Head over to the MWDumper page and download the jar file. Follow the instructions and use the example they give, which looks like this:
java -jar mwdumper.jar --format=sql:1.5 pages_full.xml.bz2 |
mysql -u <username> -p <databasename>
This is a lot better already: it will import at 200 to 300 pages/sec. But it gets better. If you remove ALL indexes and auto_increments from the tables you are importing into, the speed goes up beyond 2000 pages per second! Don’t forget to re-add the indexes and auto_increment fields when the import is done.
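To avoid forgetting an index when you re-add them, it helps to generate the drop and re-create statements from one list. The Python sketch below is only an illustration: the table and index names in INDEXES are placeholders I filled in for the example, so check them against the output of SHOW INDEX on your own tables before running anything. It only prints SQL, it does not touch the database, and it does not cover primary keys or the auto_increment columns, which need an ALTER TABLE ... MODIFY with the full column definition.

```python
# Placeholder index list: (index name, columns, is_unique) per table.
# These names are assumptions for illustration; verify with SHOW INDEX FROM <table>.
INDEXES = {
    "page": [
        ("name_title", ["page_namespace", "page_title"], True),
        ("page_random", ["page_random"], False),
    ],
}


def drop_statements(indexes):
    """Build the ALTER TABLE statements that drop every listed index."""
    return [
        f"ALTER TABLE `{table}` DROP INDEX `{name}`;"
        for table, entries in indexes.items()
        for name, _columns, _unique in entries
    ]


def add_statements(indexes):
    """Build the ALTER TABLE statements that re-create every listed index."""
    statements = []
    for table, entries in indexes.items():
        for name, columns, unique in entries:
            kind = "UNIQUE INDEX" if unique else "INDEX"
            cols = ", ".join(f"`{c}`" for c in columns)
            statements.append(f"ALTER TABLE `{table}` ADD {kind} `{name}` ({cols});")
    return statements


if __name__ == "__main__":
    print("-- run before the import")
    print("\n".join(drop_statements(INDEXES)))
    print("-- run after the import")
    print("\n".join(add_statements(INDEXES)))
```

Saving the printed output to two .sql files, one for before and one for after the import, keeps both halves in sync.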
You can have your own locally installed Wikipedia in about 5 hours or less if your PC is fast. The example PC I used is a relatively dated AMD Athlon 3200+ with 1 GB of memory and a regular SATA disk.
I’ve not been entirely fair with you so far. If you want a working, complete copy of Wikipedia you will also need to import a number of SQL dumps for other tables, especially the tables that provide information about the link structure. Although these tables are large, they are not that difficult to import. Just disable or remove the indexes again, import the data, and recreate the indexes afterwards.
P.S.: you might want to look into MySQL’s binary logging. Turning it off or reducing the maximum log size to 1 MB will increase performance too!
“What?! There is a plan C?!”
Yes, there is. There is another tool called xml2sql. If MWDumper does not give you the speed you need, you can use this tool to extract the data from the XML file too. It’s fast, but you will have to make a small patch to the source code before you compile it.