Someone had to do it: a Map-Reduce system build around the browser. Just point your browser to a URL and you are instantly helping someone to solve large problems by taking part in the process and running a number of jobs. If you think about this, it can even be used to replace advertisements. Instead of looking at flashy ads, a site can load a few tasks in the background (a frame would be best) and use some of your CPU power
That would probably even be cheaper for the visitor than running the CPU power drain called “Abobe Flash” to show the usual “OMG you just won an iPod!!!” ads.
Author Archives: Erik-Jan - Page 2
Map-Reduce in the browser
Note to self: randomly drop lines in a text file
If you ever need to drop lines from a stream of text randomly, you can use this simple and short awk command:
Example: cat file | awk '{if (int(rand()*100) < 10) print $0;}'
This example keeps only 10%. You can change the 10 to any other percentage to drop more or less.
As an example, I use this to warmup my MediaWiki installation before doing a real WikiBench benchmark:
cat benchmarks/1pct.trace | head -n 100000 | grep "\-$" | \
awk '{if (int(rand()*100) < 10) print $0;}' | ./start_controller.sh -verbose
MediaWiki: “1048: Column ‘old_id’ cannot be null”
If you get this error with MediaWiki (or any other software) you should look at the properties of your table structure. Most probable is that an “auto_increment” is missing. This problem took me quite a while to find, especially because of lots of people on the web come up with the weirdest explanations and solutions, like simply reinstalling and re-importing the data. Not fun if you are in a hurry and have a table with 7.5 million text files.
WikiBench presentation
Today I presented my master research project to a group of people at the Vrije Universiteit. The project is called “WikiBench, a distributed Wikipedia based web application benchmark“. You can view my slides on this url if you are interested. The thesis (and source code!) will be released towards the end of March.
Oh crap.. Debian 5 has been released
The Debian Project is pleased to announce the official release of Debian GNU/Linux version 5.0 (codenamed Lenny
) after 22 months of constant development.
Although I should be glad, I’m not since I’m still running a server with Debian 3.1 on it. It’s stable as as rock though.
The availability and updates of OpenJDK, GNU Java compiler, GNU Java bytecode interpreter, Classpath and other free versions of Sun’s Java technology, into Debian GNU/Linux 5.0 allow us to ship Java-based applications in Debian’s main
repository.
That is good news for Java as a language. This will obviously make it easier to install Java software.
Return home early
OK this is a bit old but I wanted to link to it anyway, just in case you haven’t heart about this programming style yet. It is called “return home early” and it basically means that you can change the logic of your code in such a way that you get less nesting. I tend to think about this when I see lots of curly braces and it often helps me reduce code size and complexity. If it sounds interesting enough for you, read about it here!
Sniffing http headers with Wireshark
If you are ever in need of seeing http requests and responses, you can use this little snippet that I “borrowed” directly from this blog. You need to install WireShark first. On a mac, you can use Darwin ports, use the command sudo port install wireshark. You can also install it on most Linux distributions and there is even a Windows version available for download
tshark -i wlan0 -f 'host 1.2.3.4' -R 'http' -S -V -l | \
awk '/^[HL]/ {p=30} /^[^ HL]/ {p=0} /^ / {--p} {if (p>0) print}'
Replace wlan0 with the network interface name you use and the ip 1.2.3.4 with the ip of the destination machine.
Top 25 Most Dangerous Programming Errors
Top 25 Most Dangerous Programming Errors is a list of the most significant programming errors that can lead to serious software vulnerabilities. They occur frequently, are often easy to find, and easy to exploit. They are dangerous because they will frequently allow attackers to completely take over the software, steal data, or prevent the software from working at all. The list is the result of collaboration between the SANS Institute, MITRE, and many top software security experts in the US and Europe.
MySQL: Error No. 1033 Incorrect information in file: ‘filename’
This is one of those ‘OMG Eriky, you saved my ass!’ posts.
You probably came here after searching for this error. Before you try anything else, check if your /tmp directory exists and if it has the right permissions. If not, create it and do a “chmod 777 /tmp”. MySQL will give you the weirdest errors if it can not use the /tmp folder.
If this does not help, check your /etc/mysql/my.cnf. There could be a line somewhere, stating which directory to use as a temporary directory. Create that directory, or remove/change the line so MySQL uses your /tmp folder instead.
Importing the complete wikipedia database in 5 hours
Say you want to setup a local installation of Wikipedia. Installing the software is easy, just go to www.mediawiki.org and download the MediaWiki software, created by the WikiMedia foundation, which is the foundation of Wikipedia. That’s right, you’re guaranteed to mess up the names at some point in your life.
OK, we have it running on a local Ubuntu installation, on Apache and MySQL5 with PHP5. Not a big deal. Now it’s time to download the Wikipedia data, which is licensed in such a way that you are free to use it. In my case, I will be using it for scientific reseach, about which I’ll surely post more in the coming months. The english data of just the current version of all the articles is 4+ gigabytes in compressed bzip2 format. Let’s decompress – you are now left with a whopping 18.2GB xml file. Outch. Importing this xml file, with the importData.php tool from mediawiki, will take lots of time. It starts importing at a rate of 4 pages per second, but this rate will go down to 2.5 per second after about 20,000 pages. After 30,000 pages, it seemed to stabilize at 2,25 pages/sec, so I started to do some math. There are about 15 million pages in total, if I remember correctly. That is 15,000,000 / (2.25 * 60*60*24) = 77 days.
Unfortunately, but to be expected, the rate keeps going down. After 150,000 pages I’m now at 1.7 pages/sec. Maybe this wasn’t such a good idea..
Plan B
I can almost hear you screaming now: ‘thank god, there is a plan B’.
Head over the MWDumper page and download the jar file. Follow the instructions and use the example they give, looking like:
java -jar mwdumper.jar --format=sql:1.5 pages_full.xml.bz2 | mysql -u <username> -p <databasename>
This is a lot better already, it will import at 200 to 300 pages/sec. But it gets better. If you remove ALL indexes and auto_increments, the speed goes up beyond 2000 pages per second! Don’t forget to re-add the indexes and auto_increment fields when the import is done.
You can have your own locally installed Wikipedia in about 5 hours or less if your PC is fast. The example PC I used is a relatively dated AMD athlon 3200+ with 1GB of memory and a regular sata disk.
I’ve not been entirely fair with you so far. If you want a working, complete copy of Wikipedia you will also need to import a number of SQL dumps from different tables. Especially the tables that provide information about link structure. Although these tables are large, they are not that difficult to import. Just disable/remove indexes again and import the data. After importing you can recreate the indexes.
P.S.: you might want to look into MySQLs binary logging. Turning it off or reducing the maximum log size to 1MB will increase performance too!
Plan C
“What?! There is a plan C?!”
Yes there is. There is another tool called xml2sql. If mvdumper does not give you the speed you need, you can use this tool to extract data from the XML file too. It’s fast but you will have to do a little patch to the source code before you compile it.