Thoughts on a Sewage-Filled Spam Hose


Earlier today Josh Stylman pointed out an interesting post by Mark Pilatowski entitled Search and Social: Will the Twitter Firehose Become a Sewage-Filled Spam Hose? While the short answer to that question is “what do you mean will become a sewage-filled spam hose?” I’d like to dig a little deeper than that.

It’s well worth clicking through above and reading the entire post, but I’ll give you the first paragraph here:

As most of you probably know Bing and Google announced that they have finalized agreements with Twitter to begin incorporating Tweets into their search engine results. Everyone seems to be overjoyed and excited about this. Search engines are excited because they get access to the Twitter firehose and they can begin providing real time results in the SERPs. Twitter is happy because they are finally getting paid. Searchers are happy because they can now get real time results for queries that deserve it, like breaking news. Everyone seems to be overjoyed about the possibilities and I myself am very interested to see how this all plays out. I do have one concern and that is how are Bing and Google going to deal with the issue of spam when it comes to real time search via Twitter results?

Two issues jump out at me from that first paragraph: the phrase “real time results for queries that deserve it, like breaking news,” and the question “how are Bing and Google going to deal with the issue of spam when it comes to real time search via Twitter results?” The two are very much intertwined, to the point where I think they’re aspects of a single question: how can you use Twitter’s data to enhance a search of the web?

Taking on the real time results view of things first, I question whether displaying tweets at the top of a search results page is necessarily where the interesting stuff has to happen first. Yes, that’s real time and kind of cool, but does it enhance my experience when searching the web? I’m not sure about that. More recent isn’t necessarily better. As Pilatowski points out, breaking news is the poster child here, but outside of the oft-cited earthquakes example, how often does Twitter bring you the substance of breaking news, and how often does it just bring notification of breaking news and a link?

A lot of links, not a lot of meaty, 140 character eyewitness reports, right?

So if you’re a search engine, just having access to that Twitter firehose, even if you never display a single tweet in your result set, does get you some hugely useful real time information to work with, in a format you’re used to dealing with. Your real time data set from Twitter is pointing out significant links (many of which may not have been crawled recently, or at all) and effectively associating them with a few keywords. So when that “earthquake” or “kanye west” search term is submitted, you know that topic just became hot on Twitter a few minutes ago, and you’ve got a frequency count on a bunch of URLs that can either supplement or help rank the URLs you would have returned pre-Twitter. Real time relevance, right there.
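To make that frequency count idea a little more concrete, here is a minimal sketch in Python. Everything in it is my own illustration rather than anything Bing, Google, or Twitter has described: the `hot_urls` function, the ten minute window, the `(posted_at, text)` tweet shape, and the naive substring keyword match are all assumptions.

```python
import re
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical, deliberately naive link matcher for this sketch.
URL_PATTERN = re.compile(r"https?://\S+")


def hot_urls(tweets, term, now, window=timedelta(minutes=10)):
    """Count links in recent tweets that mention `term`.

    `tweets` is an iterable of (posted_at, text) pairs; that shape is an
    assumption of this sketch, not a real firehose format.
    Returns (url, count) pairs, most frequently tweeted first.
    """
    cutoff = now - window
    counts = Counter()
    for posted_at, text in tweets:
        if posted_at < cutoff:
            continue  # too old to count as "real time"
        if term.lower() not in text.lower():
            continue  # tweet is not about this search term
        counts.update(URL_PATTERN.findall(text))
    return counts.most_common()


if __name__ == "__main__":
    now = datetime(2009, 10, 22, 12, 0)
    sample = [
        (now - timedelta(minutes=2), "Earthquake just now! http://example.com/quake"),
        (now - timedelta(minutes=3), "whoa, earthquake http://example.com/quake"),
        (now - timedelta(minutes=4), "earthquake map http://example.org/map"),
        (now - timedelta(hours=5), "old earthquake story http://example.net/old"),
    ]
    for url, count in hot_urls(sample, "earthquake", now):
        print(count, url)
```

The resulting counts never need to be shown to a searcher as tweets. They could simply be one more signal that supplements or reorders the URLs the engine would have returned anyway.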

The spam question then rears its head. You’ve all likely taken a look at the tweets associated with Twitter’s trending topics, and know full well that spammers will latch on to anything that’s happening in an effort to sell more discount herbal web cam work-at-home cigarette franchises. So won’t the spammers just use Twitter to hit the “real time” part of search results?

I don’t think so.

See, if you’re Twitter the company, you have to walk rather softly around spammers. While spammers are certainly an irritation on Twitter the service, they’re a relatively mild one, and if an anti-spam measure taken by Twitter the company (say, something around following/follower ratio) should affect any legitimate users, uproar will inevitably follow. Twitter the company is also Twitter the service, and so if the company modifies what tweets a user sees, at all, they’re changing an element of the service itself. And then people will opine (perhaps accurately) that they’re “breaking” the service.

So Twitter the company seems to tread pretty lightly in a lot of areas, spam included. I suspect that there have been at least a few relatively simple projects tossed around internally that could make a huge impact on Twitter spam, before being shelved because of the impact (real or perceived) on non-spammy users of the service.

If you’re Bing or Google, on the other hand, Twitter is just a data stream. The search engines have absolutely no need to be “fair” to Twitter’s users, give those users the benefit of some sort of doubt, or think about how a given anti-spam measure might affect Twitter the service; the search engines’ decisions on how to slice and filter the data do not affect Twitter the service in any way.

The search engines can decide that only Twitter users with a following:follower ratio of X:Y or better will be included in their analysis and display pool, exclude forever after any user who tweets a link to a domain that the search engine considers “bad,” or take any other approach they like to filtering the data. Who’s going to complain? On what grounds?
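As a rough sketch of how blunt such filtering could be, here is one way to express those two rules in Python. The names (`Account`, `filter_stream`, `BAD_DOMAINS`), the 5:1 threshold standing in for X:Y, and the `(account, text, urls)` tweet shape are all hypothetical choices of mine; nothing here reflects how any search engine actually treats the data.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass
class Account:
    name: str
    following: int
    followers: int


# Hypothetical values: a made-up blocklist and a made-up stand-in for X:Y.
BAD_DOMAINS = {"spam.example"}
MAX_FOLLOWING_PER_FOLLOWER = 5.0


def passes_ratio(account, max_ratio=MAX_FOLLOWING_PER_FOLLOWER):
    """True if the account follows at most `max_ratio` users per follower."""
    # Treat zero followers as one so brand-new accounts are not divided by zero.
    return account.following <= max_ratio * max(account.followers, 1)


def has_bad_link(urls, bad_domains=BAD_DOMAINS):
    """True if any URL's hostname is exactly one of the blocked domains."""
    return any(urlparse(url).hostname in bad_domains for url in urls)


def filter_stream(tweets, banned):
    """Keep only tweets from accounts the engine still trusts.

    `tweets` is an iterable of (account, text, urls) tuples. `banned` is a
    set of account names that persists between calls, so one bad link
    excludes that account from then on.
    """
    kept = []
    for account, text, urls in tweets:
        if has_bad_link(urls):
            banned.add(account.name)
        if account.name in banned:
            continue
        if not passes_ratio(account):
            continue
        kept.append((account, text, urls))
    return kept
```

Note how unforgiving this is: an account that trips the domain rule once is dropped from every later batch, and there is no appeals process. That is exactly the kind of rule Twitter the service would struggle to impose on its own users, and one a search engine could apply to its private copy of the data without anyone noticing.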

I think that Pilatowski nailed the key issues in his post, but my gut says that he has taken on too much of Twitter’s perspective on these issues, and not enough of the […um, still largely hypothetical…] “real time search” perspective. Particularly if the Twitter firehose is being used primarily as a behind-the-scenes data source as described above, the search engines have far more latitude to attack problems in Twitter’s data than does Twitter itself.