Return home early

OK, this is a bit old, but I wanted to link to it anyway, just in case you haven’t heard about this programming style yet. It is called “return home early” (you may know it as “return early” or guard clauses) and it basically means restructuring your logic so that you bail out of a function as soon as you can, which leaves you with a lot less nesting. I tend to think about this when I see lots of curly braces, and it often helps me reduce code size and complexity. If it sounds interesting enough for you, read about it here!
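Here is a minimal sketch of the idea in a shell function (the function name and checks are made up for illustration). The nested version buries the real work inside the conditions; the early-return version checks its preconditions first and bails out, so the interesting line ends up at the top level:

# nested version: the actual work hides two levels deep
count_errors_nested() {
    if [ -n "$1" ]; then
        if [ -r "$1" ]; then
            grep -c 'ERROR' "$1"
        fi
    fi
}

# early-return version: guard clauses first, then the real work, no nesting
count_errors_early() {
    [ -n "$1" ] || return 1    # no filename given, bail out
    [ -r "$1" ] || return 1    # file not readable, bail out
    grep -c 'ERROR' "$1"
}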

Sniffing HTTP headers with Wireshark

If you are ever in need of seeing HTTP requests and responses, you can use this little snippet that I “borrowed” directly from this blog. You need to install Wireshark first. On a Mac you can use MacPorts (formerly DarwinPorts): sudo port install wireshark. You can also install it on most Linux distributions, and there is even a Windows version available for download 😉

tshark -i wlan0 -f 'host 1.2.3.4' -R 'http' -S -V -l | \
awk '/^[HL]/ {p=30} /^[^ HL]/ {p=0} /^ / {--p} {if (p>0) print}'

Replace wlan0 with the name of the network interface you use, and the IP 1.2.3.4 with the IP of the destination machine.
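One heads-up in case you are on a newer Wireshark: recent tshark releases replaced the single-pass -R read filter with -Y for display filters, so there the equivalent would look something like this (a sketch, check your tshark version and man page):

tshark -i wlan0 -f 'host 1.2.3.4' -Y 'http' -V -l | \
awk '/^[HL]/ {p=30} /^[^ HL]/ {p=0} /^ / {--p} {if (p>0) print}'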

Top 25 Most Dangerous Programming Errors

Top 25 Most Dangerous Programming Errors is a list of the most significant programming errors that can lead to serious software vulnerabilities. They occur frequently, are often easy to find, and easy to exploit. They are dangerous because they will frequently allow attackers to completely take over the software, steal data, or prevent the software from working at all. The list is the result of collaboration between the SANS Institute, MITRE, and many top software security experts in the US and Europe.

MySQL: Error No. 1033 Incorrect information in file: ‘filename’

This is one of those ‘OMG Eriky, you saved my ass!’ posts.

You probably came here after searching for this error. Before you try anything else, check whether your /tmp directory exists and has the right permissions. If not, create it and do a “chmod 1777 /tmp” (the leading 1 is the sticky bit, which /tmp should have anyway). MySQL will give you the weirdest errors if it cannot use the /tmp folder.
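A quick way to check and fix it (assuming a standard Linux layout):

ls -ld /tmp              # should show drwxrwxrwt, i.e. mode 1777
sudo mkdir -p /tmp       # create it if it is missing
sudo chmod 1777 /tmp     # world-writable plus the sticky bit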

If this does not help, check your /etc/mysql/my.cnf. There could be a line somewhere stating which directory to use as a temporary directory. Create that directory, or remove/change the line so MySQL uses your /tmp folder instead.
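The line to look for in my.cnf is roughly this (the path is just an example):

[mysqld]
tmpdir = /tmp            # must exist and be writable by the mysql user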

Importing the complete Wikipedia database in 5 hours

Say you want to set up a local installation of Wikipedia. Installing the software is easy: just go to www.mediawiki.org and download the MediaWiki software, created by the Wikimedia Foundation, which is the foundation behind Wikipedia. That’s right, you’re guaranteed to mess up the names at some point in your life.

OK, we have it running on a local Ubuntu installation, on Apache and MySQL 5 with PHP 5. Not a big deal. Now it’s time to download the Wikipedia data, which is licensed in such a way that you are free to use it. In my case, I will be using it for scientific research, about which I’ll surely post more in the coming months. The English dump of just the current version of all the articles is 4+ gigabytes in compressed bzip2 format. Let’s decompress – you are now left with a whopping 18.2 GB XML file. Ouch.

Importing this XML file with the importDump.php tool from MediaWiki will take a lot of time. It starts importing at a rate of 4 pages per second, but this rate goes down to 2.5 per second after about 20,000 pages. After 30,000 pages it seemed to stabilize at 2.25 pages/sec, so I started to do some math. There are about 15 million pages in total, if I remember correctly. That is 15,000,000 / (2.25 * 60 * 60 * 24) = 77 days.
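In case you want to follow along: assuming a stock MediaWiki checkout, the import described above boils down to something like this (paths and the dump filename are placeholders):

cd /path/to/mediawiki/maintenance
bunzip2 -k enwiki-latest-pages-articles.xml.bz2        # keep the .bz2 around
php importDump.php < enwiki-latest-pages-articles.xml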

Unfortunately, though to be expected, the rate keeps going down. After 150,000 pages I’m now at 1.7 pages/sec. Maybe this wasn’t such a good idea…

Plan B

I can almost hear you screaming now: ‘thank god, there is a plan B’.

Head over to the MWDumper page and download the jar file. Follow the instructions and use the example they give, which looks like this:

java -jar mwdumper.jar --format=sql:1.5 pages_full.xml.bz2 |
   mysql -u <username> -p <databasename>

This is a lot better already: it imports at 200 to 300 pages/sec. But it gets better. If you remove ALL indexes and auto_increments, the speed climbs beyond 2,000 pages per second! Don’t forget to re-add the indexes and auto_increment fields when the import is done.
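To give you an idea, dropping and later restoring the bits on the page table looks like this (index and column names follow the stock MediaWiki schema, so double-check them against your version):

ALTER TABLE page DROP INDEX name_title;
ALTER TABLE page MODIFY page_id INT UNSIGNED NOT NULL;   -- strips AUTO_INCREMENT

-- ... run the mwdumper | mysql import here ...

ALTER TABLE page ADD UNIQUE INDEX name_title (page_namespace, page_title);
ALTER TABLE page MODIFY page_id INT UNSIGNED NOT NULL AUTO_INCREMENT;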

You can have your own locally installed Wikipedia in about 5 hours, or less if your PC is fast. The example PC I used is a relatively dated AMD Athlon 3200+ with 1 GB of memory and a regular SATA disk.

I’ve not been entirely fair with you so far. If you want a working, complete copy of Wikipedia, you will also need to import a number of SQL dumps for different tables, especially the ones that describe the link structure. Although these tables are large, they are not that difficult to import. Just disable/remove the indexes again and import the data. After importing, you can recreate the indexes.
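Importing those dumps is more of the same (the filename is a placeholder; the real dumps are gzipped, per-table .sql files):

gunzip -c enwiki-latest-pagelinks.sql.gz | mysql -u <username> -p <databasename>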

P.S.: you might want to look into MySQL’s binary logging. Turning it off, or reducing the maximum log size to 1 MB, will increase performance too!
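In my.cnf that comes down to something like this (option names as in MySQL 5.x, so treat it as a sketch):

[mysqld]
# log-bin = mysql-bin       # commenting this out disables binary logging
max_binlog_size = 1M        # or cap the size of each log file instead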

Plan C

“What?! There is a plan C?!”

Yes, there is. There is another tool called xml2sql. If MWDumper does not give you the speed you need, you can use this tool to extract the data from the XML file as well. It’s fast, but you will have to apply a small patch to the source code before you compile it.