It’s been a while since I did some real coding. Actually, I haven’t done any coding in ages! Time to change all that. I saw that Marieke had posted that she had written a script for doing tf.idf calculations in Perl. The idea is simple: you have a couple of documents with a bunch of words in them, and for each word you want to calculate how often it occurs in a document (term frequency, tf) divided by how often it occurs in the total collection (document frequency, df). A couple of years ago I wrote the same thing in Perl for a course in IR. A while later I rewrote it in Perl. Since I’m mostly interested in Ruby these days, I rewrote it again, this time in Ruby. The code can be found here and the nicely formatted version can be found here.
I ran the script on three test documents with such texts as “fishes swim in the morning”. Nothing spectacular. The output looks something like this:
cat 0.2500 0.0000 0.7500
but 0.0000 1.0000 0.0000
dogs 0.0000 0.5000 0.5000
rain 1.0000 0.0000 0.0000
funny 0.0000 1.0000 0.0000
morning 1.0000 0.0000 0.0000
sweet 0.0000 1.0000 0.0000
in 1.0000 0.0000 0.0000
the 1.0000 0.0000 0.0000
dog 1.0000 0.0000 0.0000
cats 0.0000 1.0000 0.0000
swim 1.0000 0.0000 0.0000
are 0.0000 1.0000 0.0000
suck 0.0000 0.0000 1.0000
fishes 1.0000 0.0000 0.0000
Sure, the script is still a little crude. I’m not doing any stemming or other neat tricks, but still… not bad for an hour of hacking.
I must say that Ruby makes it pretty easy to elegantly write this little script even though it took a bit of tweaking to get the syntax right!
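To give a flavour of the calculation, here is a minimal sketch in Ruby. It is not the script linked above, just an illustration. It assumes each figure is a word's count in one document divided by its count across all documents, which is what the output suggests (every row adds up to 1). The method name `relative_frequencies` and the sample texts are made up for this example.

```ruby
# Minimal sketch, not the original script.
# relative_frequencies and the sample texts below are hypothetical.
def relative_frequencies(documents)
  # tf: one hash per document, mapping word => number of occurrences
  counts = documents.map do |text|
    text.downcase.scan(/[a-z]+/).each_with_object(Hash.new(0)) do |word, tf|
      tf[word] += 1
    end
  end

  # Total occurrences of each word across the whole collection
  totals = Hash.new(0)
  counts.each { |tf| tf.each { |word, n| totals[word] += n } }

  # word => [share in document 1, share in document 2, ...]
  totals.each_with_object({}) do |(word, total), table|
    table[word] = counts.map { |tf| tf[word].to_f / total }
  end
end

docs = [
  "the dog and the cat",
  "cats are funny",
  "the cat"
]

# Print one row per word, one column per document
relative_frequencies(docs).each do |word, shares|
  puts format("%-8s %s", word, shares.map { |s| format("%.4f", s) }.join(" "))
end
```

Like the original, this does no stemming, so "cat" and "cats" are counted as different words.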





