• comp 29.11.2006 No Comments

    It’s been a while since I did some real coding. Actually, I haven’t done any coding in ages! Time to change all that. I saw that Marieke had posted that she had written a script for doing tf.idf calculations in Perl. The idea is simple: you have a couple of documents with a bunch of words in them, and for each word you want to calculate how often it occurs in a document (term frequency, tf) divided by how often it occurs in the total collection (document frequency, df). A couple of years ago I wrote the same thing in Perl for a course in IR. A while later I rewrote it in Perl. Since I’m mostly interested in Ruby these days I rewrote it again. The code can be found here and the nicely formatted version can be found here.
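
    For readers who just want the gist, here is a minimal sketch of the ratio described above. It is not the linked script: the method names (tokenize, relative_frequencies, format_table) and the three sample documents are made up for illustration, and it computes only the plain "count in document divided by count in collection" ratio, not the log-scaled tf-idf weighting found in IR textbooks.

```ruby
# Hypothetical test documents; only the first text is quoted in the post.
DOCS = [
  "fishes swim in the morning",
  "cats are funny",
  "dogs suck"
].freeze

# Split a text into lowercase words (no stemming or other tricks).
def tokenize(text)
  text.downcase.scan(/[a-z]+/)
end

# For every word, return one ratio per document:
# occurrences in that document / occurrences in the whole collection.
def relative_frequencies(docs)
  # One word => count hash per document.
  counts = docs.map do |doc|
    tokenize(doc).each_with_object(Hash.new(0)) { |word, h| h[word] += 1 }
  end

  # Word => count over the whole collection.
  totals = Hash.new(0)
  counts.each { |c| c.each { |word, n| totals[word] += n } }

  totals.map do |word, total|
    [word, counts.map { |c| c.fetch(word, 0).to_f / total }]
  end.to_h
end

# Render one right-aligned line per word, four decimals per document.
def format_table(freqs)
  freqs.map do |word, ratios|
    format("%20s %s", word, ratios.map { |r| format("%.4f", r) }.join(" "))
  end
end

puts format_table(relative_frequencies(DOCS))
```

    Each row sums to 1 because every occurrence of a word belongs to exactly one document.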

    I ran the script on three test documents with such texts as “fishes swim in the morning”. Nothing spectacular. The output looks something like this:

                cat 0.2500 0.0000 0.7500
                but 0.0000 1.0000 0.0000
               dogs 0.0000 0.5000 0.5000
               rain 1.0000 0.0000 0.0000
              funny 0.0000 1.0000 0.0000
            morning 1.0000 0.0000 0.0000
              sweet 0.0000 1.0000 0.0000
                 in 1.0000 0.0000 0.0000
                the 1.0000 0.0000 0.0000
                dog 1.0000 0.0000 0.0000
               cats 0.0000 1.0000 0.0000
               swim 1.0000 0.0000 0.0000
                are 0.0000 1.0000 0.0000
               suck 0.0000 0.0000 1.0000
             fishes 1.0000 0.0000 0.0000
    

    Sure, the script is still a little crude. I’m not doing any stemming or other neat tricks but still… not bad for an hour of hacking :-) I must say that Ruby makes it pretty easy to elegantly write this little script even though it took a bit of tweaking to get the syntax right!

  • comp, general 25.11.2006 No Comments

    For those of you interested: I just published the digital version of my dissertation on my publications page. There’s a PDF version available as well as the LaTeX sources.

  • comp, general 24.11.2006 No Comments

    The last week has been busy busy busy … again. Monday and Tuesday were really busy at work. Last week we finished a big project (Travis) and it seems that wrapping up such projects always takes more time than you expect. Either way, I was happy that the project was over. It wasn’t really my cup of tea. Now I can work on a much more interesting project where we can actually do some requirements specification for a change.

    Speaking of which, I selected some more books (from my reading list) to buy from Addison-Wesley. I’ve ordered with them before and now I get a discount of 50% when I order 3 books or more. That means that it is cheaper to buy 3 books than 2 :-) Interesting! Also, I found a really good blog about software and software development: Tyner Blain. I sent the author (Scott Sehlhorst) an E-mail with some ideas I had and got a really nice response. He even read some of the papers on my personal page. Cool :-)

    Anyway, Wednesday I was home most of the day. I had to rush-shop in the morning (groceries, more building materials, hairdressers with Koen) as the construction company came over in the early afternoon to fix our roof. Looks like they’ve done a good job by the way; so far we haven’t had any more water in the attic. Thursday was… well… Thursday and today I went to Nijmegen again to fit my outfit for the defense of my dissertation. Way cool! I also discussed a review of one of our papers that hasn’t been published yet. The review is really good this time: a high-quality review as well as a positive outcome (we have to fix some minor stuff but the two reviewers were extremely positive about the contribution). So: nothing but good news!

    I’ve also started studying again. For our competence center I have to read some stuff on software architecture. We’re using the books from the “Open University” as well as some books on DYA (Dynamic Architecture). I’ve read a couple of chapters. The DYA book is nice but nothing new so far. Still a good read though :-)