Two very different web-based activities are traditional content search (Google) and automatic delivery of a filtered version of the web, condensed and focused according to your personal interests and preferences: search versus correlation. This article contrasts the automatic delivery model with search in order to explain why stream correlation and delivery of matched content is a viable alternative to search.
Automatic delivery of filtered streams of web content, greatly condensed through correlation and matching, is a compelling technique for “bringing the web to you”. Whereas search is “pull”, the correlation and matching done by Real Time Matrix is “push”. Like a newspaper that is delivered to your door, edited and reduced to retain only those articles relevant to your interests, correlation with push is convenient. Why rely on newsstands and generic versions of the paper when the same paper can be delivered in a condensed version in which each item is likely to match your interests? Over the last 2 years, RSS has provided a number of technical components that complement the process of web delivery of news. Under the covers, RSS is powered by streams: rivers of data more mobile and more flexible than the analogous web pages used by search engines.
The remainder of the post describes stream processing for web content as it has been adopted by Real Time Matrix and, in doing so, contrasts traditional search with matching and correlation performed on the “live” data underlying the river of new web pages as those pages are posted to the web. For the purpose of making a general comparison, without getting too involved in low-level technical details, the web can be represented by a freeway lined with a series of billboards. To use an example, the New York Times has a series of billboards representing sections of the news. The NYT Sports section can be perceived as a billboard represented by a deep stack of pages: a multi-page billboard in which every sports article is a billboard page, organized in LIFO order so that the “hot off the press” articles float to the top of the page stack. It will be of particular interest to examine search and correlation techniques as they relate to a single, new page as it is “posted” to the web.
Search builds an Index for the new page
The existence of an entry for the new page of content within a search index is a prerequisite. Before the new article's page is indexed, and before that new index entry gets added back to the existing indices for NYT Sports, no one can see the new page via search. Index building and pre-processing are not religious about “fresh content” and what sits at the top of the page stack in the billboard. In building and refreshing its indices, Google employs crawlers on tens of thousands of servers. The crawlers act as robots that know the location of every billboard and how to schedule re-visits in order to pick up changes posted to each billboard. Recurring visits follow their own schedule, determined by Google algorithms, and may miss the exact time the freshest page was posted by weeks.
When they visit a site like NYT Sports, crawlers focus on the HTML format of the data in a page (more on HTML and RSS data formats below). Any new page that has appeared since the last visit gets processed by the indexer: links to and from other web pages are analyzed, meta information on the page is read, and the page is scanned for keywords to be included in an index entry for the new page. Before this new entry can update the main Google search indices, and before the new page can begin showing up among the results presented as “hits” for web queries, the new index entry needs to be shipped along the freeway to the nearest Google data center, where it can be included with other new index entries and scheduled for a process that will eventually update the main search indices. After all these pre-processing steps are complete, a sports search could include the new page within the list of search “hits.”
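The crawl, index, and batch-update steps described above can be sketched in a few lines of Python. This is a toy illustration, not how Google's indexer works: the `SearchIndex` class, its `pending` queue, and the example.com URL are all hypothetical names invented for the sketch. The point it shows is that a crawled page stays invisible to search until the batch merge runs.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric keywords."""
    return re.findall(r"[a-z0-9]+", text.lower())


class SearchIndex:
    """Toy inverted index (hypothetical): keyword -> set of page URLs."""

    def __init__(self):
        self.postings = defaultdict(set)
        self.pending = []  # crawled pages waiting for the next batch merge

    def crawl(self, url, text):
        """Crawler visit: queue the page. It is not searchable yet."""
        self.pending.append((url, text))

    def merge_pending(self):
        """Batch update: fold every queued page into the main index."""
        for url, text in self.pending:
            for word in tokenize(text):
                self.postings[word].add(url)
        self.pending.clear()

    def search(self, query):
        """Return the URLs that contain every keyword in the query."""
        words = tokenize(query)
        if not words:
            return set()
        return set.intersection(*(self.postings.get(w, set()) for w in words))


index = SearchIndex()
url = "http://example.com/sports/new-article"  # placeholder URL
index.crawl(url, "Bryant Barred for One Game After Hard Hit")
before = index.search("bryant barred")  # empty: the merge has not run yet
index.merge_pending()
after = index.search("bryant barred")   # now contains the placeholder URL
```

The gap between `crawl` and `merge_pending` stands in for the shipping and scheduling delay described above; in a real system that gap is on top of the wait for the crawler's visit itself.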
RSS Sample Data, Streaming Data
To understand the streaming data approach, look more closely at the example of a new Sports page article as it is posted to the NYTimes web site. As described above, there are two different data formats (HTML or RSS) that can be used to represent the sample page from NYT Sports. The HTML version of the page is the data used by the crawlers described above. Besides this format, there is another data format of the same page that is more mobile, more easily distributed, and more easily exchanged with other cooperating computer systems. The RSS version, also from the NYTimes, contains the same articles as the HTML version, but these articles are organized in a very different data format. See the sample of RSS format data below:
<div> <pre>
Bryant Barred for One Game After Hard Hit
http://www.nytimes.com/2007/01/31/sports/basketball/31suspended.html?ex=1327899600&en=7ccd90d2e67129e2&ei=5088&partner=rssnyt&emc=rss
The N.B.A. has shown little tolerance this season for any action that is illegal and overtly ugly.
LIZ ROBBINS
http://www.nytimes.com/2007/01/31/sports/basketball/31suspended.html
Wed, 31 Jan 2007 01:33:30 EDT</pre></div>
Visualize the data above, drawn from an RSS sample with iTunes-type namespace additions, as a stream. Scan each data line, starting with <itms:artist>, as if all of them were placed in one long line. Next, animate that one big data line like a ticker tape so the data scrolls across a given place. A ticker tape is a data stream containing words or items that can be correlated to other patterns. Look closely at the animated version of the sample data above: at some point, you would see the words Jack Johnson. If you were interested in musical works by Jack Johnson, these 2 words could be found in the ticker stream, “matched” by a correlation process on your behalf.
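The ticker-tape idea can be sketched as follows. This is a minimal illustration, not Real Time Matrix's matching engine: the XML wrapper around the NYT sample item is an assumption (standard RSS 2.0 element names; the actual feed markup is not shown in this post), and `ticker_lines` and `matches` are hypothetical helper names.

```python
import re
import xml.etree.ElementTree as ET

# Assumed RSS 2.0 markup wrapped around the sample item shown earlier.
SAMPLE_RSS = """<rss version="2.0"><channel>
<title>Sports</title>
<item>
<title>Bryant Barred for One Game After Hard Hit</title>
<link>http://www.nytimes.com/2007/01/31/sports/basketball/31suspended.html</link>
<description>The N.B.A. has shown little tolerance this season for any action that is illegal and overtly ugly.</description>
<author>LIZ ROBBINS</author>
<pubDate>Wed, 31 Jan 2007 01:33:30 EDT</pubDate>
</item>
</channel></rss>"""


def ticker_lines(rss_text):
    """Yield each feed item flattened into one long 'ticker tape' line."""
    root = ET.fromstring(rss_text)
    for item in root.iter("item"):
        fields = [child.text.strip() for child in item
                  if child.text and child.text.strip()]
        yield " ".join(fields)


def matches(line, interests):
    """Return the interest phrases that appear as whole words in the line."""
    lowered = line.lower()
    return [phrase for phrase in interests
            if re.search(r"\b" + re.escape(phrase.lower()) + r"\b", lowered)]


for line in ticker_lines(SAMPLE_RSS):
    hits = matches(line, ["Jack Johnson", "Bryant"])
    if hits:
        print("matched", hits, "->", line[:45])
```

A reader interested in both “Jack Johnson” and “Bryant” would get a hit on this sports item for “Bryant” only; an item from a music feed that carries the artist's name would match the other phrase in the same way.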
RSS Capabilities, Advantages
In contrast to the static HTML form of the page used by traditional crawlers, the RSS version of a page is fast and flexible. For the purpose of this example, the stream of RSS data mentioned above “flows between” arbitrary source and destination points on the freeway. What does that mean? Mobile, RSS-friendly stream data leverages “feed-enabled” software from any number of 3rd parties, compatible with even more 3rd-party device types. Equipped with a “reader” that consumes RSS, my phone can locate itself at a suitable off-ramp, “subscribing” to just the data that has been correlated and matched to me as it streams past the off-ramp among a much larger “firehose” of generic data. Infrastructure supporting RSS distribution is built into the phone and it just works.
In addition, the RSS form of newly posted content does its own PR. As articles post to the NYTimes sports pages, the RSS format has the ability to inject itself into the syndication infrastructure with an announcement that simply says “I’ve arrived”. That simple statement triggers data flows and complementary filtering or “personalization” of the data in transit. The practical result is that relevant data arrives where it is supposed to go on an intelligent freeway. Again with the aid of lots of 3rd-party software supporting announcement pings and the FeedMesh, companies like Technorati and Syndic8 respond to the data that was just posted with all sorts of automated, “just arrived” type tools. The discovery method for newly posted HTML is very different, involving higher cost and greater latency.
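One common form of the “I’ve arrived” announcement is the `weblogUpdates.ping` XML-RPC call, which takes a site name and a site URL. The sketch below only builds the request payload with Python's standard `xmlrpc.client` module and decodes it again; nothing is sent over the network. The site name and example.com URL are placeholders, and whether a given publisher or ping service uses exactly this call is an assumption.

```python
import xmlrpc.client


def build_ping(site_name, site_url):
    """Build the XML-RPC request body for a weblogUpdates.ping call."""
    return xmlrpc.client.dumps((site_name, site_url),
                               methodname="weblogUpdates.ping")


# Placeholder values; a real publisher would use its own name and feed URL.
payload = build_ping("Example Sports Feed", "http://example.com/sports/rss.xml")

# Decode the payload again to show what a ping service would receive.
params, method = xmlrpc.client.loads(payload)
print(method, params)

# Sending it would look roughly like this (the endpoint is hypothetical):
#   server = xmlrpc.client.ServerProxy("http://ping.example.com/RPC2")
#   server.weblogUpdates.ping("Example Sports Feed",
#                             "http://example.com/sports/rss.xml")
```

The publisher pushes one small message at posting time, and every downstream service that listens for pings can react immediately instead of waiting to rediscover the page.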
Compared to streaming RSS data, which moves out to the interested consumer from the billboard at the same time that the article is posted, traditional search indices can make you wait. You wait after the post while the crawlers are idle for a period of time determined by the algorithms at Google. For example, if you have a low-traffic e-commerce site and you begin offering new categories of merchandise in your on-line catalog, it may be 30 days before the new category gets picked up by crawlers and added to the indices for your site. NYTimes Sports, although larger and of more interest to the dispatchers behind the crawlers, is still going to be affected by considerable latency between posting and indexing. In contrast, standard practice for CMS and newly posted RSS versions of articles calls for a notice of the post and a concurrent flow of data that mimics email. Registered parties and groups of people will be notified immediately after the RSS version of the article arrives on the web.
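The waiting time can be made concrete with a little arithmetic. The sketch below assumes, purely for illustration, that a crawler revisits a site on a fixed interval; real crawl schedules are set by the search engine and are not fixed like this. `crawl_wait` is a hypothetical helper, and times are in whole days.

```python
def crawl_wait(post_day, first_visit_day, interval_days):
    """Days a new post waits until the next scheduled crawler visit."""
    if post_day <= first_visit_day:
        return first_visit_day - post_day
    elapsed = post_day - first_visit_day
    # Integer ceiling division: number of intervals until the next visit.
    intervals = -(-elapsed // interval_days)
    return first_visit_day + intervals * interval_days - post_day


# A site crawled every 30 days, with the last visit on day 0.
wait_if_posted_day_1 = crawl_wait(1, 0, 30)    # 29 days until the next visit
wait_if_posted_day_30 = crawl_wait(30, 0, 30)  # 0 days: posted on a visit day

# With push, the notice goes out when the article is posted, so the wait
# does not depend on any visit schedule.
print(wait_if_posted_day_1, wait_if_posted_day_30)
```

Under this assumed schedule, a post made the day after a visit waits almost the full interval before it is even seen by the crawler, and the index update still has to follow.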