alexn.org

Using the Best Tools in Programming: Not Really Doable

There's something that bothers me when it comes to starting a new project: you can't really use the best tool for a job if that tool isn't integrated with the rest of your platform. Let me explain.

At our startup we pride ourselves on our pragmatism. We are true polyglots :) capable of diving into any project, no matter what language it was written in. This also gives us the power to make educated choices about the technologies we use for our own gigs.

Our programming language of choice is Perl, because of its flexibility and because usually there's no need to reinvent the wheel since you can find a CPAN module for almost anything.

But recently I began experimenting with data-mining techniques, flirting with various NLP libraries. You can find almost anything in CPAN's AI:: namespace. But I also knew about NLTK, a Python collection of libraries with excellent documentation, and I also found OpenNLP, MontyLingua, ConceptNet, link-grammar and various Ruby modules.

And all of a sudden I got cold feet. Java packages in OpenNLP may have the advantage of speed (just a guess and it doesn't matter for the purpose of this discussion). NLTK has pedigree and great documentation, not to mention that many books related to NLP, AI and data mining have Python samples (for example I own Programming Collective Intelligence and AIMA). Usually the solution is straightforward: you test all the options, and choose the best one.

But what if you want to combine them?

Well, then you're shit out of luck. Sure, you can do it with inter-process communication, but then you have to write glue code and pay the price in extra latency, bandwidth and memory. When you're parsing millions of documents and moving the results between processes, that's not really practical. Perl does have Inline::Java, but I would only use it in extreme situations.
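To make the glue-code cost concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the worker is just a second Python interpreter standing in for a tool written in another language, and the names `WORKER_SOURCE` and `tokenize_via_subprocess` are made up. The point is only that every document gets serialized, pushed through a pipe, and parsed again on the other side.

```python
import json
import subprocess
import sys

# Hypothetical stand-in for a tool written in another language: it reads one
# JSON document per line on stdin and writes one JSON result per line on stdout.
WORKER_SOURCE = r"""
import json, sys
for line in sys.stdin:
    doc = json.loads(line)
    result = {"id": doc["id"], "tokens": doc["text"].split()}
    sys.stdout.write(json.dumps(result) + "\n")
    sys.stdout.flush()
"""


def tokenize_via_subprocess(docs):
    """Send documents to the worker process one by one and collect its results."""
    worker = subprocess.Popen(
        [sys.executable, "-c", WORKER_SOURCE],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        text=True,
    )
    results = []
    try:
        for doc in docs:
            # Each document is serialized, copied through a pipe, and parsed
            # again by the worker; the result makes the same trip back.
            worker.stdin.write(json.dumps(doc) + "\n")
            worker.stdin.flush()
            results.append(json.loads(worker.stdout.readline()))
    finally:
        worker.stdin.close()
        worker.stdout.close()
        worker.wait()
    return results


if __name__ == "__main__":
    print(tokenize_via_subprocess([{"id": 1, "text": "best tool for the job"}]))
```

It works, but the round trip per document is exactly the overhead I was complaining about, and a real setup would also need error handling and batching on top of this.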

That's why there's so much wheel reinvention around. Unless a module is written in C, for which practically every language has an FFI, almost nobody wants to use a Java module from Ruby, or a Python module from Perl. That's why there's Lucene, and then there's Lucene.NET, CLucene, Ferret, Zend_Search_Lucene, Plucene and Lucene4c.
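For contrast, this is roughly what the C route looks like, sketched with Python's standard `ctypes` module. I'm assuming a POSIX-like system where the C standard library can be located and loaded; the wrapper name `c_strlen` is made up for the example. No separate process, no serialization, just a function call into a shared library.

```python
import ctypes
import ctypes.util

# Assumption: a POSIX-like system. find_library("c") returns the C library's
# name, or None, in which case CDLL(None) exposes the symbols already loaded
# into the current process (which include libc on Linux and macOS).
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# Declare the C signature: size_t strlen(const char *s);
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t


def c_strlen(data: bytes) -> int:
    """Return the length of a byte string as computed by C's strlen."""
    return libc.strlen(data)


if __name__ == "__main__":
    print(c_strlen(b"Lucene"))
```

That's the whole integration story for a C library, which is why C modules get reused everywhere while Java, Python and Ruby modules get rewritten.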

What is really needed is a universal virtual machine with a flexible MOP (meta-object protocol), allowing seamless communication between languages. I'm happy there are a couple of efforts in this space, including Parrot and the DLR. Also, the biggest obstacle for alternative language implementations is the modules written in C. Fortunately, JRuby and Rubinius have a brand new implementation-independent FFI, and Ironclad will allow IronPython users to use CPython extensions (number one on their list being numpy).

These developments make me happy :)