Author Archives: Erik-Jan - Page 2

Another privacy invading monster (LSO cookies)

Recently I came across something called LSO cookies (Local Shared Objects, better known as Flash cookies; read about it, I’m not going to explain them in detail here). Since more and more browsers, virus scanners and security software block regular cookies, these LSO cookies are a real treat for advertisers and tracking companies for several reasons:

  • most people have never heard of them
  • they are difficult to block
  • they provide lots of storage (100KB per site)
  • they are not removed by your browser, ever
  • they work and track you, even in your browser’s “privacy mode”


There are several dangers to these cookies. First and foremost, we block cookies for a reason. We don’t want to be tracked everywhere on the web and we don’t want companies to build profiles of our web usage for whatever reasons they have. These companies shamelessly track us anyway by using all kinds of tricks, like these LSO cookies, instead of respecting our explicit choice to not be tracked and monitored.

Another problem is that this will leave tracks of your Internet usage on your computer, even if you try to cover those tracks by deleting cookies, browser cache and temporary files.

So what can we do about this?

First of all, the best thing would be to not use Flash, but that ain’t an option (we want our YouTube to work!). So the second best option is to block or at least remove the cookies. There is an excellent Firefox plugin called Better Privacy that will give you all kinds of options to remove or block LSO cookies.

If you don’t have Firefox, your third option is to go to Adobe’s Flash player settings page (you never heard of it, neither did I) and set the storage space to zero KB. Next, go to the last tab there or use this link, and be amazed at the number of sites that use LSO cookies to store whatever they want to store on your PC. Then click the remove all button to remove it all. Note that setting the storage to zero prevents sites from storing cookies, but Flash will still create directories for each site that tries. So next time you visit that shameful pr0n site, be aware that Flash will keep track of it.

WikiBench master thesis

I officially got my WikiBench project graded with an 8, with which I’m of course very satisfied. You can now read my thesis called WikiBench: A distributed, Wikipedia based web application benchmark.

People interested in this project already found their way to my blog. For those who are wondering: I will publish the code and I will do so very shortly (within days). It will most probably appear on Google Code and you won’t have to search for it since I will devote a post to it right here and include the URL. I will probably release it under a BSD-style license which should give you lots of freedom. Unfortunately I’m not sure yet if I am allowed to release some of the trace files obtained from Wikipedia.


After all the buzz around Ghostnet, it’s fun to look back and read the original document describing the spy network. It’s an interesting read, and if you don’t have the time to read it you can also check out the Security Now! podcast from April 9th, in which Steve Gibson explains how the research group found out about the spy network and how amateurish the (open source) Gh0st RAT software actually is.

One very important lesson learned from this story is that attackers no longer control these networks by using IRC, as we have seen in the past. Ghostnet used plain old HTTP requests to periodically check for new commands. The startling thing is that this is exactly the kind of traffic that gets through firewalls and even proxy servers without any problems. HTTP replies consisting of JPEG images contained the actual, encoded commands.

Bloom filters

This article by Broder and Mitzenmacher gives a good description of how Bloom filters work and what they can do for you. A Bloom filter basically replaces a dataset with a compact filter that can tell you whether an item is a member of that set or not. It will not give false negatives, but it might give false positives. In practice, this drawback can be outweighed by the space savings a Bloom filter introduces; after all, you do not need to query the dataset to determine membership. The most important and summarizing quote you should remember from the article:

The Bloom filter principle: Wherever a list or set is used, and space is at a premium, consider using a Bloom filter if the effect of false positives can be mitigated.

The article also gives a number of examples in which Bloom filters are used, e.g. to aid resource location in P2P and cache systems.
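To make the "no false negatives, possibly false positives" behaviour concrete, here is a minimal Python sketch of a Bloom filter. It is a toy, not the construction from the article: the class name, the default sizes and the trick of salting SHA-256 to get several hash functions are all assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: a bit array plus a few hash functions (illustrative only)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # One byte per bit keeps the code readable; a real filter would pack bits.
        self.bits = bytearray(num_bits)

    def _positions(self, item):
        # Assumption: derive several positions by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # False means "definitely not in the set"; True means "probably in the set".
        return all(self.bits[pos] for pos in self._positions(item))


seen = BloomFilter()
for url in ["/wiki/Amsterdam", "/wiki/Bloom_filter"]:
    seen.add(url)

print(seen.might_contain("/wiki/Amsterdam"))  # True: added items are never missed
print(seen.might_contain("/wiki/Rotterdam"))  # usually False, but a false positive is possible
```

The filter only stores the bit array, never the items themselves, which is where the space savings come from. A larger bit array lowers the false positive rate at the cost of more memory.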

Map-Reduce in the browser

Someone had to do it: a Map-Reduce system built around the browser. Just point your browser to a URL and you are instantly helping someone to solve large problems by taking part in the process and running a number of jobs. If you think about this, it can even be used to replace advertisements. Instead of looking at flashy ads, a site can load a few tasks in the background (a frame would be best) and use some of your CPU power :-) That would probably even be cheaper for the visitor than running the CPU power drain called “Adobe Flash” to show the usual “OMG you just won an iPod!!!” ads.

Note to self: randomly drop lines in a text file

If you ever need to drop lines from a stream of text randomly, you can use this simple and short awk command:

Example: cat file | awk '{if (int(rand()*100) < 10) print $0;}'

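This example keeps roughly 10% of the lines; each line is kept or dropped independently, so the exact count varies. Change the 10 to any other percentage to keep more or fewer lines. Note that some awk implementations (gawk, for example) start rand() from the same seed on every run, so call srand() in a BEGIN block if you want a different selection each time.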

As an example, I use this to warm up my MediaWiki installation before doing a real WikiBench benchmark:

cat benchmarks/1pct.trace | head -n 100000 | grep "\-$" | \
awk '{if (int(rand()*100) < 10) print $0;}' | ./ -verbose
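For comparison, the same random line dropping can be sketched in Python, which is handy when you want a repeatable sample. The function name `keep_fraction` and its parameters are made up for this sketch; it is not part of WikiBench.

```python
import random


def keep_fraction(lines, fraction=0.10, seed=None):
    """Yield each line with probability `fraction` and drop the rest."""
    # A fixed seed makes the selection repeatable; seed=None gives a new one each run.
    rng = random.Random(seed)
    for line in lines:
        if rng.random() < fraction:
            yield line


# Hypothetical input: 10000 fake trace lines instead of a real trace file.
lines = [f"request {i}\n" for i in range(10000)]
sample = list(keep_fraction(lines, fraction=0.10, seed=42))
print(len(sample))  # roughly 1000, the exact count depends on the seed
```

Because every line is decided independently, this works on a stream of any length without knowing the number of lines up front, just like the awk version.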

MediaWiki: “1048: Column ‘old_id’ cannot be null”

If you get this error with MediaWiki (or any other software), you should look at the properties of your table structure. The most likely cause is that an “auto_increment” attribute is missing on the column named in the error (old_id in this case). This problem took me quite a while to find, especially because lots of people on the web come up with the weirdest explanations and solutions, like simply reinstalling and re-importing the data. Not fun if you are in a hurry and have a table with 7.5 million text files.

WikiBench presentation

Today I presented my master research project to a group of people at the Vrije Universiteit. The project is called “WikiBench, a distributed Wikipedia based web application benchmark”. You can view my slides on this URL if you are interested. You can get more information and find the master thesis I wrote on it at this URL:

Oh crap.. Debian 5 has been released

The Debian Project is pleased to announce the official release of Debian GNU/Linux version 5.0 (codenamed Lenny) after 22 months of constant development.

Although I should be glad, I’m not, since I’m still running a server with Debian 3.1 on it. It’s stable as a rock though.

The availability and updates of OpenJDK, GNU Java compiler, GNU Java bytecode interpreter, Classpath and other free versions of Sun’s Java technology, into Debian GNU/Linux 5.0 allow us to ship Java-based applications in Debian’s main repository.

That is good news for Java as a language. This will obviously make it easier to install Java software.

Return home early

OK, this is a bit old but I wanted to link to it anyway, just in case you haven’t heard about this programming style yet. It is called “return home early” and it basically means that you can change the logic of your code in such a way that you get less nesting. I tend to think about this when I see lots of curly braces and it often helps me reduce code size and complexity. If it sounds interesting enough for you, read about it here!
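A small sketch of the idea, written in Python, so the nesting shows up as indentation rather than curly braces. The function names and the `user` dictionary are invented for this example; both functions are meant to behave identically.

```python
def describe_nested(user):
    # Nested style: every check adds another level of indentation.
    if user is not None:
        if user.get("active"):
            if user.get("email"):
                return "send mail to " + user["email"]
            else:
                return "no email address"
        else:
            return "inactive user"
    else:
        return "no user"


def describe_early(user):
    # Return-early style: handle the special cases first and leave,
    # so the main path stays at the top indentation level.
    if user is None:
        return "no user"
    if not user.get("active"):
        return "inactive user"
    if not user.get("email"):
        return "no email address"
    return "send mail to " + user["email"]


print(describe_early({"active": True, "email": "someone@example.org"}))
```

The second version reads top to bottom as a list of preconditions, and each error message sits right next to the check that triggers it.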