Programming Ruby

It’s been a while since I did some real coding. Actually, I haven’t done any coding in ages! Time to change all that. I saw that Marieke had posted that she had written a script for doing tf.idf calculations in Perl. The idea is simple: you have a couple of documents with a bunch of words in them, and for each word you want to calculate how often it occurs in a document (term frequency, tf) divided by how often it occurs in the total collection (document frequency, df). A couple of years ago I wrote the same thing in Perl for a course in IR. A while later I rewrote it in Perl. Since I’m mostly interested in Ruby these days, I rewrote it again. The code can be found here and the nicely formatted version can be found here.
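For readers who just want the gist, here is a minimal sketch of the calculation described above. It is not the script from the links, only an illustration: the method names (`term_counts`, `relative_frequencies`, `format_table`) and all sample documents except the first are made up, and the tokenizer is assumed to do nothing more than lowercase the text and split on non-letters.

```ruby
# Hypothetical sample documents; only the first text appears in this post.
DOCS = [
  "fishes swim in the morning",
  "cats are funny",
  "the cat chased the other cat"
]

# Count how often each word occurs in one text (lowercased, letters only).
def term_counts(text)
  text.downcase.scan(/[a-z]+/).each_with_object(Hash.new(0)) do |word, counts|
    counts[word] += 1
  end
end

# For every word, divide its count in each document by its count
# in the whole collection, giving one score per document.
def relative_frequencies(docs)
  per_doc = docs.map { |text| term_counts(text) }

  totals = Hash.new(0)
  per_doc.each { |counts| counts.each { |word, n| totals[word] += n } }

  totals.each_with_object({}) do |(word, total), table|
    table[word] = per_doc.map { |counts| counts[word].to_f / total }
  end
end

# Right-align the word in 15 columns, then print one score per document.
def format_table(table)
  table.map do |word, scores|
    format("%15s %s", word, scores.map { |s| format("%.4f", s) }.join(" "))
  end
end

puts format_table(relative_frequencies(DOCS))
```

Because every score is a count divided by the collection total, the scores for a word always add up to 1 across the documents.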

I ran the script on three test documents with such texts as “fishes swim in the morning”. Nothing spectacular. The output looks something like this:

            cat 0.2500 0.0000 0.7500
            but 0.0000 1.0000 0.0000
           dogs 0.0000 0.5000 0.5000
           rain 1.0000 0.0000 0.0000
          funny 0.0000 1.0000 0.0000
        morning 1.0000 0.0000 0.0000
          sweet 0.0000 1.0000 0.0000
             in 1.0000 0.0000 0.0000
            the 1.0000 0.0000 0.0000
            dog 1.0000 0.0000 0.0000
           cats 0.0000 1.0000 0.0000
           swim 1.0000 0.0000 0.0000
            are 0.0000 1.0000 0.0000
           suck 0.0000 0.0000 1.0000
         fishes 1.0000 0.0000 0.0000

Sure, the script is still a little crude. I’m not doing any stemming or other neat tricks, but still… not bad for an hour of hacking :-) I must say that Ruby makes it pretty easy to write this little script elegantly, even though it took a bit of tweaking to get the syntax right!
