Lunchtime Musings: Ed Felten on Bayesian Filtering

Okay, it’s a little sad that I’m sitting here typing between bites of my sandwich (at least it’s not a cheese sandwich), but I came across Ed Felten’s Victims of Spam Filtering post this morning and wanted to note a couple of things about it. Well, I suppose that I actually want to note one thing: that I entirely disagree with his logic.

While it’s best for you to go and read his entire post, I’ll copy the first two paragraphs here, since they’re the ones that set off alarm bells in my head:

### FELTEN QUOTE BEGINS
Anyway, this reminded me of an interesting problem with Bayesian spam filters: they’re trained by the bad guys.

[Background: A Bayesian spam filter uses human advice to learn how to recognize spam. A human classifies messages into spam and non-spam. The Bayesian filter assigns a score to each word, depending on how often that word appears in spam vs. non-spam messages. Newly arrived messages are then classified based on the scores of the words they contain. Words used mostly in spam, such as “Viagra”, get negative scores, so messages containing them tend to get classified as spam. Which is good, unless your name is Jose Viagra.]
### FELTEN QUOTE ENDS

Now let’s compare that to a snippet from Paul Graham’s A Plan for Spam, the document that introduced the word “bayesian” to so many of us:

### GRAHAM QUOTE BEGINS
Because it is measuring probabilities, the Bayesian approach considers all the evidence in the email, both good and bad. Words that occur disproportionately rarely in spam (like “though” or “tonight” or “apparently”) contribute as much to decreasing the probability as bad words like “unsubscribe” and “opt-in” do to increasing it. So an otherwise innocent email that happens to include the word “sex” is not going to get tagged as spam.
### GRAHAM QUOTE ENDS

It appears that Felten, like many other people (me, for example), has made the mistake of viewing a Bayesian filter as something like a keyword filter on steroids. A story:

At one point I was using one of the popular open source Bayesian filters. I had it set up so that it wasn’t just marking “spam” and “ham,” but rather was categorizing all of my mail for me: tech/programming mail into one bucket, personal mail another, mailing lists a third, and so on. This worked well, and led to the brilliant idea of training my system to recognize a special “password” that people could include in their emails if they wanted to end up in my “priority” bucket.

This didn’t work well.

Why not? Well, a couple of reasons. The first reason is that I have a bunch of immature programmers as friends, so within fifteen minutes of sending out a note asking for email containing the word “avocado” so that I could train my system, there were no fewer than three scripts written that did nothing but send me email after email, all containing nothing but the word “avocado” repeated over and over. Ha, ha, you bastards.

The second reason that this didn’t work is that the whole point of a Bayesian-style filter is that it’s looking at texts as a whole, not just individual words. Bayesian filters aren’t “trained by the bad guys,” they’re trained by the bad guys, your mother-in-law, your co-workers, your friends…they’re trained by everyone who sends you email. My avocado password never did work very well, because a single word (“avocado”) was rarely, if ever, enough to change the overall character of a message. If the rest of the message content (good and bad) looked like my other “personal” messages, it ended up in the personal bucket; if it looked like my “social software” messages, it ended up in that bucket.

The kind of attack that Ed Felten is imagining would be crippling if Bayesian filtering worked on a sort of “adaptive keyword” basis, picking out the messages with spam words and looking for new spam words to filter…but that’s just not the case. Let’s take Felten’s example of a spammer trying to poison the word “fahrenheit” prior to the release of the Michael Moore film:

You send me 50, 500, or 5000 spam messages containing “fahrenheit,” and that word has never before appeared in a message that I received. All of them get marked as spam due to the other spammy message content, which increases the spam potential of “fahrenheit.” Then a friend sends me a note with some thoughts on the movie Fahrenheit 9/11. Will that message go into the spam folder? It could, but that’s not really likely. Because it’s been in n spam messages and no good ones, “fahrenheit” will have a really high spam potential, but there’s other content in the message: if it’s an email from a friend about a movie that they just saw, the other message content (my friend’s email address, my name, the words used in normal conversation, etc.) probably all has very low spam potential. The odds are that the “ham” potential of the 500 other words that my friend wrote will dramatically outweigh the “spam” potential of a single word and the message will make it into my inbox, which in turn reduces the spam potential of “fahrenheit.”
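For the arithmetically inclined, here’s a rough sketch of why one poisoned word loses to the rest of the message. It is a plain naive Bayes combination with equal priors, in the spirit of the formula in A Plan for Spam, and not the exact algorithm of any particular filter. The function name and every per-word probability below are made up purely for illustration.

```python
import math

def spam_probability(token_probs):
    """Combine per-token spam probabilities into one score.

    Naive Bayes with equal priors:
        P(spam) = prod(p) / (prod(p) + prod(1 - p))
    computed in log space so that many small factors don't underflow.
    Each p must be strictly between 0 and 1.
    """
    log_spam = sum(math.log(p) for p in token_probs)
    log_ham = sum(math.log(1.0 - p) for p in token_probs)
    diff = log_ham - log_spam
    if diff > 700:  # exp() would overflow; the score is effectively zero
        return 0.0
    return 1.0 / (1.0 + math.exp(diff))

# Hypothetical numbers: "fahrenheit" has only ever been seen in spam,
# while twenty ordinary words from a friend lean mildly toward ham.
poisoned_word = 0.99
friendly_words = [0.2] * 20

alone = spam_probability([poisoned_word])
in_context = spam_probability([poisoned_word] + friendly_words)

print(f"poisoned word alone:       {alone:.4f}")
print(f"poisoned word in a letter: {in_context:.2e}")
```

Under these invented numbers the single word scores as near-certain spam on its own, and the same word buried in a friendly note scores as near-certain ham. Real filters weight and select tokens differently, so take the shape of the result rather than the digits.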

The whole idea of Bayesian filtering is to get away from this “one bad word poisons the message” sort of thinking. So forget about this one and go worry about Google bombing or something.

The Technology of Tracking

No, not the Verichip and its “the end times are here” fan club.

Just plain old tracking of who’s doing what on the internet. The Christian Science Monitor recently published a non-technical article on the difficulties of accurately tracking how many people visit their Web site. “People” is the operative word here: the beauty of the Web from a tracking perspective is that you’ve got a very precise record of how the machines involved are interacting; knowing something about the people attached to those machines is something else entirely.

With didtheyreadit’s recent, brief moment of email tracking infamy, a million and one discussions of how one might track RSS feed usage (including FeedBurner’s excellent update to their tracking reports), and — of course — MarketingSherpa’s belated realization that email open and clickthrough reporting may not be all that they’re cracked up to be, a couple of things seem to be happening.

Companies are starting to pay attention to online operations again, and asking the right sort of questions: who is coming to my site/getting my emails/reading my RSS feed? What are those people doing when they access the content I’m putting out there?

Companies are realizing that this tracking is a lot harder than it seems. While DoubleClick, 24/7, and a host of smaller companies offer tools (some better, some worse) to track and analyze Web traffic and email activity, relatively few organizations have the money to spend on those sorts of tools. Even fewer have any idea of what to do with the data once they have it.

We’ve mostly moved past using the httpd access_log for purposes that nature never intended, but even when tracking tools are using more user-focused metrics, we often don’t know what those metrics are, nor what assumptions they’re making. Because there are machines involved in every step of online business, we often opt for the comforting illusion that we therefore have volumes of bulletproof data about users and their actions, when that’s just not the case.

Users are (for the moment) not hardwired into their computers, and it’s the computers that we have data on, not the users. We can extrapolate from machine to user pretty well, but it’s essential that we understand the assumptions that we’re making and the attendant limitations.
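To make the machine-versus-person gap concrete, here’s a small sketch that counts “visitors” in an access log three different ways. It assumes Apache-style combined log lines; the regular expression, the function name, and every sample line (addresses, timestamps, user agents) are invented for illustration and aren’t taken from any real log or tracking product.

```python
import re

# Assumed format: Apache "combined" log lines, i.e. Common Log Format
# followed by quoted referer and user-agent fields.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "[^"]*" \d{3} \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"$'
)

def summarize(lines):
    """Count hits, distinct IPs, and distinct (IP, user agent) pairs."""
    hits = 0
    ips = set()
    ip_agent_pairs = set()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match is None:
            continue  # skip anything that isn't in the assumed format
        hits += 1
        ips.add(match.group("ip"))
        ip_agent_pairs.add((match.group("ip"), match.group("agent")))
    return {
        "hits": hits,
        "unique_ips": len(ips),
        "unique_ip_agent_pairs": len(ip_agent_pairs),
    }

# Made-up sample: three different clients share one office address,
# and a fourth request comes from somewhere else.
SAMPLE = [
    '192.0.2.10 - - [12/Jul/2004:12:01:00 -0500] "GET / HTTP/1.1" 200 5120 "-" "ExampleBrowser/6.0 (Windows)"',
    '192.0.2.10 - - [12/Jul/2004:12:03:00 -0500] "GET /feed HTTP/1.1" 200 2048 "-" "ExampleFeedReader/1.0"',
    '192.0.2.10 - - [12/Jul/2004:12:05:00 -0500] "GET / HTTP/1.1" 200 5120 "-" "ExampleBrowser/5.0 (Macintosh)"',
    '198.51.100.7 - - [12/Jul/2004:12:06:00 -0500] "GET / HTTP/1.1" 200 5120 "-" "ExampleBrowser/6.0 (Windows)"',
]

summary = summarize(SAMPLE)
print(summary)
```

The same four requests give three different answers depending on what you decide a “visitor” is, and none of those answers is a count of people. Which number is the right one depends entirely on the assumption you’re willing to make about how machines map to humans.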

Anti-Spam Technical Alliance Recommends Not Doing Stupid Things

It never ceases to amaze me that it is necessary to make public statements like “don’t do stupid things,” and “don’t be an asshole,” but time and time again such statements prove to be absolutely necessary.

The Anti-Spam Technical Alliance (ASTA), whose big-ticket participants include Yahoo, Microsoft, Earthlink, and AOL, today published a report containing best practices and technical recommendations (article has links to actual documents) for ISPs, Email Service Providers, and high volume email senders.

First and foremost I have to say that I absolutely agree with their recommendations, but then I’m neither an idiot nor an asshole (I hope). What sort of thing appears on their list of recommendations for high volume email senders?

  • Do not harvest e-mail addresses through SMTP or other means (defined as collecting e-mail addresses, usually by automated means) without the owners’ affirmative consent.
  • Do not employ any technique to hide or obscure any information that identifies the true origin or the transmission path of bulk e-mail.

It’s absolutely incredible to me that in the year 2004, as we are buried beneath ever-growing piles of spam, it is necessary to tell ostensibly legitimate companies that harvesting email is a bad idea from both ethical and business perspectives, or that trying to hide the fact that you’re sending email is unacceptable behavior.

I suppose that this is really more of a warning shot: whatever else it may accomplish, it lays the groundwork for the Gang of Four to implement the technical solutions that they see fit while chanting “you can’t say we didn’t warn you” over and over again.

Honestly, while this will necessarily cause a bunch of problems — some of them probably big and affecting people who are doing everything right — it’s an action that is overdue. I have to support this, for the same reason that I was overjoyed to see MS’ “caller ID for email” merge with SPF — once the big ISPs agree on the standards that they’re going to use, you’ve got a known quantity. Whether or not you agree with those de facto standards, everyone is clear on what they are, not just muddling through with best guesses and sympathetic magic.

FeedBurner Stats Updated: EXCELLENT work!

I use FeedBurner to handle the syndication of this little experiment; that they also report usage statistics was an “oh, that’s nice” feature to me, so the fact that their reporting was a tad on the byzantine side wasn’t a big deal. Today I went to their site for the first time in a while, though, and saw their updated reporting.

Their reporting is now fucking excellent, and there’s just no other way to say it. Nice work, FeedBurner folk!

The Batter Coating Rule

It falls a little outside my normal range of topics, but I just received an email about this and must pass the word along…

Apparently the USDA now classifies frozen french fries as “fresh vegetables.” I can’t help wondering whether this is some sort of warped “ketchup is a vegetable redux” Reagan tribute.

More important, however, is the fact that this update is apparently known as the “Batter Coating Rule.” Had anyone asked, I would have predicted that the United States’ Batter Coating Rule would be something more along the lines of “things that are coated in batter are gooooooooood.” Oh, well. This one’s okay, too.

Roll your own Real Simple Shopping feed

About a week ago I noticed Real Simple Shopping, a service that takes the spam risk out of subscribing to product offer email lists by subscribing to those lists for you and passing the content along as a customized RSS feed.

A couple of days later I noticed that dodgeit.com — a service that offers free, public “maildrops” — offers the ability to read @dodgeit.com mailboxes via RSS feeds.

Because it was a slow Sunday yesterday, I found myself sitting around and thinking “you know, dodgeit.com would allow me to build a better customized RSS feed right now.” I just made up a @dodgeit.com email address, added the RSS feed for the address to FeedDemon, and started subscribing.

Now I’ve got the offer/event emails from Powell’s Books (the best bookstore in the world, bar none), REI (excellent outdoor equipment), and the Self Starter Foundation (good independent records) coming to me in a nice, neat feed…and if Huy Fong Foods offered a mailing list, you know I’d be signed up for that right quick.

Funny thing, really: I genuinely want to hear from all of those places about offers that they might have for me, but I would never have actually signed up for their lists via email. I just have too much unread email for me to voluntarily add to the pile. I’m not sure that I’d even have signed up for RSS feeds from each individual source — but with the ability to create one completely customized commercial feed of my own? Hell, yes, I’m there!

Commercial RSS F@#$ing Everywhere

Holy jeez. Whether there’s actually any real interest or not on the subscriber side I don’t know yet, but it seems that you can’t throw a rock these days without hitting somebody who’s offering purely commercial RSS feeds.

Tuesday it was RealSimpleShopping (which, I might point out, I blogged before the Scobleizer got around to it), and today I came across Coupon Clock, which offers dozens of coupon/special offer feeds, grouped by topic. They don’t (yet) appear to offer per-subscriber customized feeds, but since they’ve already got an email notification system set up I have to imagine that custom feeds aren’t too far away.

Now to find out whether any of these guys are actually making money…