WikiBench master thesis

May 20th, 2009 Posted in Research | no comment »

My WikiBench project has officially been graded with an 8, which I’m of course very satisfied with. You can now read my thesis, called “WikiBench: A distributed, Wikipedia based web application benchmark”.

People interested in this project have already found their way to my blog. For those who are wondering: I will publish the code, and I will do so very shortly (within days). It will most probably appear on Google Code, and you won’t have to search for it since I will devote a post to it right here and include the URL. I will probably release it under a BSD-style license, which should give you lots of freedom. Unfortunately, I’m not sure yet whether I am allowed to release some of the trace files obtained from Wikipedia.

Ghostnet

Apr 14th, 2009 Posted in Computer networks | no comment »

After all the buzz around Ghostnet, it’s fun to look back and read the original document describing the spy network. It’s an interesting read, and if you don’t have the time to read it you can also check out the Security Now! podcast from April 9th, in which Steve Gibson explains how the research group found out about the spy network and how amateurish the (open source) Gh0st RAT software actually is.

One very important lesson learned from this story is that attackers no longer control these networks by using IRC, as we have seen in the past. Ghostnet used plain old HTTP requests to periodically check for new commands. The startling thing is that this is exactly the kind of traffic that gets through firewalls and even proxy servers without any problems. The actual commands were encoded in HTTP replies consisting of jpg images.

Bloom filters

Mar 3rd, 2009 Posted in Research | no comment »

This article by Broder and Mitzenmacher gives a good description of how Bloom filters work and what they can do for you. A Bloom filter basically replaces a dataset with a compact filter that can tell you whether an item is a member of that set or not. It will not give false negatives, but it might give false positives. In practice, the false positives are a drawback that can be outweighed by the space savings a Bloom filter introduces; after all, you do not need to query the dataset itself to determine membership. The most important and summarizing quote you should remember from the article:

The Bloom filter principle: Wherever a list or set is used, and space is at a premium, consider using a Bloom filter if the effect of false positives can be mitigated.

The article also gives a number of examples in which Bloom filters are used, e.g. to aid resource location in P2P and cache systems.
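To make the idea a bit more concrete, here is a minimal Python sketch of a Bloom filter. It is not taken from the article: the class name, the default sizes and the trick of deriving several hash positions from salted SHA-256 digests are all my own assumptions, chosen only to keep the example short.

```python
import hashlib


class BloomFilter:
    """Illustrative Bloom filter: num_hashes bit positions per item in a num_bits bit array."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Assumption: salt a single hash function with the index i to get
        # num_hashes different positions for the same item.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        # Set every bit position that belongs to this item.
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # An item can only be a member if all of its bits are set.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


cache_index = BloomFilter()
for url in ("/wiki/Main_Page", "/wiki/Bloom_filter"):
    cache_index.add(url)

# Items that were added are never reported as missing (no false negatives).
print("/wiki/Bloom_filter" in cache_index)
# Items that were never added are usually reported as missing,
# but a false positive is possible.
print("/wiki/Unknown" in cache_index)
```

The filter above takes 128 bytes no matter how long the URLs are, which is where the space savings come from. The price is that a "yes" only means "probably", so the caller still has to cope with the occasional false positive.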

Map-Reduce in the browser

Mar 3rd, 2009 Posted in Computer networks | no comment »

Someone had to do it: a Map-Reduce system built around the browser. Just point your browser to a URL and you are instantly helping someone to solve large problems by taking part in the process and running a number of jobs. If you think about this, it can even be used to replace advertisements. Instead of looking at flashy ads, a site can load a few tasks in the background (a frame would be best) and use some of your CPU power :-) That would probably even be cheaper for the visitor than running the CPU power drain called “Adobe Flash” to show the usual “OMG you just won an iPod!!!” ads.

Note to self: randomly drop lines in a text file

Feb 25th, 2009 Posted in Programming | no comment »

If you ever need to drop lines from a stream of text randomly, you can use this simple and short awk command:

Example: cat file | awk '{if (int(rand()*100) < 10) print $0;}'

This example keeps only about 10% of the lines. You can change the 10 to any other percentage to keep more or fewer lines.

As an example, I use this to warm up my MediaWiki installation before doing a real WikiBench benchmark:

cat benchmarks/1pct.trace | head -n 100000 | grep "\-$" | \
awk '{if (int(rand()*100) < 10) print $0;}' | ./start_controller.sh -verbose
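The same sampling idea can also be sketched in Python, which is handy when you want the exact same subset on every run. Some awk implementations give rand() the same starting seed on every run unless you call srand(), while others seed it from the clock, so it is worth checking yours if repeatability matters. The function name and the seed parameter below are made up for this example; they are not part of WikiBench.

```python
import random


def sample_lines(lines, keep_percent=10, seed=None):
    """Yield each line independently with a probability of keep_percent / 100."""
    # A private generator, so a fixed seed always selects the same lines.
    rng = random.Random(seed)
    for line in lines:
        if rng.random() * 100 < keep_percent:
            yield line


# Hypothetical input: 1000 fake trace lines instead of a real trace file.
trace = [f"request {n}" for n in range(1000)]
kept = list(sample_lines(trace, keep_percent=10, seed=42))
print(len(kept), "of", len(trace), "lines kept")
```

Like the awk version, this decides per line, so you get roughly 10% of the input rather than exactly 10%.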

MediaWiki: “1048: Column ‘old_id’ cannot be null”

Feb 25th, 2009 Posted in Software | no comment »

If you get this error with MediaWiki (or any other software), you should look at the properties of your table structure. Most probably, an “auto_increment” is missing on the column named in the error. This problem took me quite a while to find, especially because lots of people on the web come up with the weirdest explanations and solutions, like simply reinstalling and re-importing the data. Not fun if you are in a hurry and have a table with 7.5 million text files.

WikiBench presentation

Feb 19th, 2009 Posted in Research | one comment »

Today I presented my master’s research project to a group of people at the Vrije Universiteit. The project is called “WikiBench, a distributed Wikipedia based web application benchmark”. You can view my slides at this URL if you are interested. The thesis (and source code!) will be released towards the end of March.

Oh crap.. Debian 5 has been released

Feb 15th, 2009 Posted in Unix/Linux | no comment »

The Debian Project is pleased to announce the official release of Debian GNU/Linux version 5.0 (codenamed Lenny) after 22 months of constant development.

Although I should be glad, I’m not, since I’m still running a server with Debian 3.1 on it. It’s stable as a rock though.

The availability and updates of OpenJDK, GNU Java compiler, GNU Java bytecode interpreter, Classpath and other free versions of Sun’s Java technology, into Debian GNU/Linux 5.0 allow us to ship Java-based applications in Debian’s main repository.

That is good news for Java as a language. This will obviously make it easier to install Java software.

Return home early

Feb 15th, 2009 Posted in Programming | no comment »

OK, this is a bit old, but I wanted to link to it anyway, just in case you haven’t heard about this programming style yet. It is called “return home early” and it basically means that you can change the logic of your code in such a way that you get less nesting. I tend to think about this when I see lots of curly braces and it often helps me reduce code size and complexity. If it sounds interesting enough for you, read about it here!
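A small illustration of what this looks like in practice, written in Python. The two functions and the request dictionary are invented for this example and do not come from the linked article; both functions are meant to behave identically, only the shape of the code differs.

```python
def describe_nested(request):
    """Nested version: the normal case is buried three levels deep."""
    if request is not None:
        if request.get("user"):
            if request.get("page"):
                return f"{request['user']} requested {request['page']}"
            else:
                return "missing page"
        else:
            return "missing user"
    else:
        return "no request"


def describe_early(request):
    """Early-return version: every failure case leaves the function right away."""
    if request is None:
        return "no request"
    if not request.get("user"):
        return "missing user"
    if not request.get("page"):
        return "missing page"
    # Only the normal case is left, at the outermost indentation level.
    return f"{request['user']} requested {request['page']}"


print(describe_early({"user": "alice", "page": "Main_Page"}))
print(describe_early({"user": "alice"}))
```

The second version has the same four outcomes, but each check sits next to its own result and the nesting is gone.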

Sniffing http headers with Wireshark

Jan 15th, 2009 Posted in Computer networks | no comment »

If you ever need to see HTTP requests and responses, you can use this little snippet that I “borrowed” directly from this blog. You need to install Wireshark first. On a Mac, you can use MacPorts (formerly DarwinPorts) with the command sudo port install wireshark. You can also install it on most Linux distributions and there is even a Windows version available for download ;-)

tshark -i wlan0 -f 'host 1.2.3.4' -R 'http' -S -V -l | \
awk '/^[HL]/ {p=30} /^[^ HL]/ {p=0} /^ / {--p} {if (p>0) print}'

Replace wlan0 with the name of the network interface you use and the IP 1.2.3.4 with the IP of the destination machine.