Wednesday, May 20, 2009

The New Real-Time Search Buzz Isn't That New

In the past couple of months, we've been reading a lot about real-time search in the start-up internet world. With Twitter's amazing ascent, the interest in real-time search -- even beyond the microblogging phenom -- has increased dramatically. Every other day, it seems, another start-up launches that is either wholly dedicated to it or has some angle on it. And when I read and hear all the hoopla, I can't help but think: Technorati has been doing real-time search of the live web -- specifically the blogosphere -- for three years already. We know a lot about real-time search!

BTW, I don't have any grand point in this post; rather, this is just the curious observation that the new-new thing is really not that new. Of course, most of the new forays into real-time are taking new and different angles, but essentially, Technorati was one of the first to crack the code on it, and we've learned a ton doing it.

Like what, you ask? That's a long story, but the short version is that it's incredibly challenging to do well. Making real-time search work at scale is very complex, and there's a reason there are only two real-time blog search indexes (Google Blog Search and Technorati). The volume of data presents multiple challenges: the data becomes nearly irrelevant shortly after it appears (often within days and certainly within weeks -- over 90% of all searches on Technorati are looking for something less than a month old); it's much easier to spam* (Twitter is just beginning to experience this -- just wait...); it's hard to balance recency and relevancy; and lastly, it's expensive -- keeping large quantities of data spinning so it's readily available to query costs a lot, and the entire live web is a really large place (Technorati only focuses on the blogosphere).
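
To make the recency-versus-relevancy tension a bit more concrete, here is a minimal sketch of one common way to blend the two: a weighted mix of a text-match score and a freshness term that decays exponentially with age. This is an illustration only, not a description of how Technorati or anyone else ranks results; the `Post` fields, `blended_score`, `half_life_days`, and `recency_weight` are all hypothetical names and values picked for the example.

```python
from dataclasses import dataclass


@dataclass
class Post:
    title: str
    relevance: float  # hypothetical text-match score in [0, 1]
    age_days: float   # days since the post was published


def blended_score(post, half_life_days=7.0, recency_weight=0.5):
    """Mix relevance with a freshness term that halves every half_life_days."""
    freshness = 0.5 ** (post.age_days / half_life_days)
    return (1.0 - recency_weight) * post.relevance + recency_weight * freshness


def rank(posts, **kwargs):
    """Return posts sorted from highest to lowest blended score."""
    return sorted(posts, key=lambda p: blended_score(p, **kwargs), reverse=True)


# Made-up sample data, purely for illustration.
posts = [
    Post("Old but on-topic", relevance=0.9, age_days=60),
    Post("Fresh and decent", relevance=0.6, age_days=1),
    Post("Fresh but off-topic", relevance=0.1, age_days=0),
]

for p in rank(posts):
    print(f"{blended_score(p):.3f}  {p.title}")
```

With an even weighting, a brand-new but barely relevant post can outrank an older, highly relevant one, and setting `recency_weight=0.0` flips the order back to pure relevance. Picking the weight and the half-life is exactly the balancing act described above.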

Anyway, it's interesting to see so much activity focused on something we've been working on for three years, and I'm even more interested to see how these various initiatives approach some of the complexities.

* Spam filtering for real-time blog search just might be Technorati's most valuable IP!!

3 comments:

  1. Thanks for the post ...

    It seems all Relational databases are designed with "Real Time" or "as close to Real Time" data delivery as just one of the development goals (it should be!). This is often requested by clients. It seems hard to believe that Real Time Search sprang up 2/3 years ago!

    This is not to say that the 2 entities mentioned, Google and Technorati, do not currently provide their niches well.

  2. It seems all Relational databases are designed with "Real Time" or "as close to Real Time" data delivery as just one of the development goals (it should be!). This is often requested by clients. It seems hard to believe that Real Time Search sprang up 2/3 years ago!

    This is not to say that the 2 entities mentioned, Google and Technorati do not currently provide their niches well.

  3. Enjoyed your comments about blogging in the NYT.
    Marty Davis
    chickaboomer.blogspot.com
