Flickr

  • www.flickr.com
    Abiola_Lapite's photos More of Abiola_Lapite's photos

April 19, 2008

Building Search Engines is Hard

Having built a search engine from scratch, I can verify that the task is far harder to do right than most people assume, and that's even leaving aside the need to do something cleverer than everyone else with all of the data you're busy gathering and indexing. Rather than go on at length about the precise whys and wherefores, I'll let this ACM Queue article do a little of the work for me; I say "a little" because it addresses only the basic technical aspects of the exercise, while even a stirringly successful development push will still leave the daunting task of commercializing the product. There's nothing to sour one on the competence and vision of the supposed leading lights of the financial world like dealing with venture capitalists who insist on telling you that "search is dead" just 2 years before Google goes public ...

June 20, 2007

Regexp Magic

Anyone care to guess what the following Perl code does?

perl -wle '$_ = 1; (1 x $_) !~ /^(11+)\1+$/ && print while $_++'
Impressive, isn't it?

February 18, 2007

An Introduction to the Lambda Calculus

And now for something different. Are you a programmer who's made a run in or two with esoteric languages like Haskell, Erlang and OCaml, and do you find yourself confused by closures, maddened by monads and feeling downright cantankerous whenever you hear about currying? If you are, I have a suggestion for you: instead of wasting your time struggling through yet another functional programming tutorial whose writers implicitly assume that you already understand half of the new things they're supposedly trying to teach, you'd be better off giving a close reading to a paper by Henk Barendregt and Erik Barendsen called "Introduction to Lambda Calculus". This paper gives a ground-up, axiomatic treatment of the subject from first principles, and what you learn from it will have immediate application with any programming language with functional aspects, not just one particular dialect like Ruby or Lisp. You might also find this online tutorial useful.

On a tangential note, I definitely get the feeling these days that functional programming has finally come of age. First you have languages like Java and C# starting to tackle closures, then we see OCaml and (especially) Haskell showing up more and more on places like Google Blogsearch and programming.reddit.com. This is a Good Thing™ for software development as a whole (no side effects => many fewer obscure bugs), though I believe the additional intellectual sophistication required to truly understand what functional programming is all about will likely prove a barrier too high for many programmers to surmount, just as many a procedural coder never really figured out object-oriented programming. Not that "OOP only" types are going to be out of work any time soon - just look at how many PHP coders there are out there churning out write-only procedural spaghetti.

August 04, 2006

Google Shares its Corpus

All who are interested in computational linguists will be cheering this Google Research announcement.

... we have been harnessing the vast power of Google's datacenters and distributed processing infrastructure to process larger and larger training corpora. We found that there's no data like more data, and scaled up the size of our data by one order of magnitude, and then another, and then one more - resulting in a training corpus of one trillion words from public Web pages.

We believe that the entire research community can benefit from access to such massive amounts of data [...] We processed 1,011,582,453,213 words of running text and are publishing the counts for all 1,146,580,664 five-word sequences that appear at least 40 times. There are 13,653,070 unique words, after discarding words that appear less than 200 times.

And to think it wasn't so long ago that the British National Corpus seemed large. At any rate, while this is obviously very good news, the true measure of the usefulness of all this data will lie in the quality of the annotation; given the sheer size of Google's corpus, I doubt the quality of the tagging will be such as to make the BNC or even the Brown Corpus irrelevent any time soon.

July 03, 2006

A Common Fallacy in Computational Linguistics

If you want a paper which shows the importance of careful, rigorous reasoning even about "obvious" matters, this one by Mehryar Mohri and Richard Sproat ought to do the trick. From the abstract comes the following:

Continue reading "A Common Fallacy in Computational Linguistics" »

October 19, 2005

A Filing System for Brilliant People

It's safe to say that there are two types of people in this world of ours - stuck-in-the-mud, anal-retentive neat-freaks who obsessively keep their desks prim and proper at all times, and creative geniuses so engrossed in "voyaging through strange seas of thought" that they can't help accumulating piles of assorted "stuff." Those who fall into the former category will find that alphabetical file drawers suit them just fine, but if you're lucky enough to belong in the latter group, you'll likely appreciate this clever document filing technique. The best thing of all? It actually benefits from a dash of that creativity the less inspired like to call "randomness" ...

September 24, 2005

Dumbest. OOP. Critique. Ever

If you're a software developer who doesn't live in the stone age, you're going to find yourself doubled over with laughter on reading this ridiculous article written by a machine-code troglodyte "enthusiast" named Richard Mansfield. Object-oriented programming may not be the final word on software development, but the ideas pushed by this guy are so inane as to be grounds for questioning whether he's ever had any actual experience of modern software development on a large scale: who wants to have to know the implementation details of third-party libraries, and how can any thinking person believe that "cut and paste" constitutes a viable technique for code reuse? Yet Mr. Mansfield's rant is littered with such braindead notions.

The really funny thing about people like Richard Mansfield who think procedural programming constitutes the last word on software development is that with C++, Java and C#, it's actually pretty easy to program in a purely procedural style while pretending to be doing OOP: just write one big class, sign all your methods as static, and dump all the logic of your code into the main() function as if you were working with good ole C. Object-oriented programming isn't the problem, OOP languages which allow hidebound programmers to ignore the whole notion of objects are - too many so-called "developers" fail entirely to grasp the benefits of concepts like polymorphism, encapsulation and inheritance at a gut level, and as for design patterns, it's amazing the number of programmers who display an ignorance of something as simple and ubiquitous as the singleton, let alone an abstract factory or a decorator. Mansfield's rant is nothing but a rehash of "Real Programmers Don't Use Pascal" for the 21st century, without any of the self-aware humor of the original.

August 05, 2005

Why Women Dislike Computing

Why do women still avoid computer science like the plague when they're now already near parity representation in undergraduate mathematics? Ernie's 3D Pancakes gives the real, unvarnished answer:

No, Bill, it's because most people in the software industry are glorified factory workers. Highly trained specialists, yes, but in the end, most of them are just putting more bolts on more widgets (and trying to deal with with someone else's crappy widget design) to make more money for people like you. Booooring. Boring boring boring.
Let's face it, the IT industry has a serious image problem, one no amount of dotcom mania has managed to eradicate: who can blame young women for not wanting to sit indoors all day chugging out yet another "enterprise application", alongside spotty-faced geeks with minimal social skills, even worse hygiene, and inordinate obsessions with Star Trek, Dungeons & Dragons and Natalie Portman?*

It's at times like these that I feel a renewed appreciation for David Krumholtz's work on "Numb3rs", the show's silly name notwithstanding ...

*Er, not that there's anything wrong with the last - hey, what can I say, she's hot!

July 13, 2005

The Physical Limits of Computing

Enough with the terrorism for now, and let's turn our sights to a more interesting topic. I'm sure I don't need to introduce anyone here to Moore's Law, according to which computing power can be reliably expected to double every 2 years or so. This guideline has proven to be a reliable one for going on 35 years now, vagaries of CPU architecture notwithstanding, and there's good reason to believe that there's still some life left to it yet. Nonetheless, anyone familiar with the amazing rapidity of exponential growth will realize that this is a trend which ultimately must come to a halt someday, if for no reason other than because the laws of physics say so. It is therefore worthwhile asking just what the fundamental limits said laws impose upon us are: how fast can computers possibly get, even under ideal conditions, and how densely can we store the information we manipulate with them?

The first time I asked myself this question was about 10 years ago, and one answer I was able to come up with at the time was the Bekenstein bound, which sets a firm limit on the amount of information which can be stored in a given volume of space; beyond that, I made little further progress. It therefore comes as a very pleasant surprise to run into this page by Michael Frank, who is currently an Assistant Professor at FSU. Not only does Frank give an extensive reading list of research articles one can consult for more information, but he also provides lecture notes of his own, as well as information leading to this page containing related material by Warren D. Smith.

It may well be that the limits of computation are so great that we are unlikely to approach them at any point in time we can currently envision, but I have a feeling that the opposite is true, and all the extropians, transhumanists and other assorted techno-boosters are going to find their fondest hopes disappointed in the space of a few decades. At any rate, the seemingly straightforward question of just how fast and how big computers can get turns out to be tied intimately to the question of just what the fundamental laws governing our universe are ...

PS: Check out this paper by Seth Lloyd, and take a look at this blog entry by Sun Distinguished Engineer Jeff Bonwick on why 128-bits really should be enough for any conceivable filesystem.

September 11, 2004

The Problem of Search

A fortuitous discovery I've just made is this April 2004 ACM Queue edition focusing on enterprise search. Despite the ostensible focus, the edition covers a lot more ground than just search, though, including the problematic issue of electronic voting and the heights game graphics will need to scale to attain to cinematic quality.

What's most interesting about the last point is that thanks to the high-volume economics of the video game industry, game and graphics card developers and are now the drivers of graphics research, rather than the followers they used to be. When was the last time anyone heard anything of note from the likes of Evan & Sutherland, or even the once-mighty SGI?

Notes for Readers