More on comment spam (and more parenthetical remarks)

Comment spam is back in the spotlight after Six Apart fixed a big Movable Type bug that was causing the load of comment spam to DoS servers. This eWeek story on the subject led me to a post where Sean Gallagher agrees with my inflammatory post from earlier this year (I got an email from a Yahoo! search engineer asking why I wasn’t dumping on them too).
This is a little long so I’ll warn you that I haven’t come up with any solutions that I think would work. Still, I think they’re entertaining failures (which are the best kind).
One of the solutions I’ve been thinking of (and please spare me the form letter, it’s no longer funny) involved adding an optional argument to mt-comments.cgi, say require_pagerank.
The idea would be that the comment form would not have require_pagerank at all, but comment spammers (sorry, “search engine optimizers”) would be told very loudly that they need to include require_pagerank=1 in their comment-submitting programs. (Most search engine optimizers are not comment spammers, but comment spammers always refer to what they’re doing as SEO.)
What happens to comments posted without require_pagerank? They’re rewritten to use mt-redirect.cgi, or, both of which (should) remove PageRank. A comment posted with require_pagerank=1 will be unceremoniously dumped to /dev/null because it requires PageRank and we wouldn’t give it PageRank anyways.
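Here is a rough sketch of that logic in Python rather than MT’s own Perl. The redirect path, the `url` query parameter, and the function names are all assumptions made up for illustration; I haven’t checked how mt-redirect.cgi actually takes its target.

```python
import re
from typing import Optional
from urllib.parse import quote

# Hypothetical location and parameter name for the redirect script.
REDIRECT_SCRIPT = "/cgi-bin/mt-redirect.cgi"


def rewrite_links(body: str) -> str:
    """Point every href at the redirect script so the link passes no PageRank."""
    def replace(match: "re.Match[str]") -> str:
        target = quote(match.group(1), safe="")
        return 'href="' + REDIRECT_SCRIPT + "?url=" + target + '"'

    return re.sub(r'href="([^"]+)"', replace, body)


def handle_comment(params: dict, body: str) -> Optional[str]:
    """Return the comment text to publish, or None to drop it silently."""
    if params.get("require_pagerank") == "1":
        # The submitter insists on PageRank, which we never give: discard.
        return None
    return rewrite_links(body)
```

A submission that sets require_pagerank=1 comes back as None, and everything else is published with its links pointed at the redirect script.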
This sounds a little bit like the evil bit right now, I’m sure. OK, a lot bit.
So how do we actually get the comment spammers to send require_pagerank? We punish them for not sending it.
My first thought was to build a delay into comments that don’t include require_pagerank. Ten seconds wouldn’t screw up the average commenter, but multiplied over 1000 blogs, including require_pagerank would save the comment spammer nearly 3 hours. Since they wouldn’t miss out on any PageRank by including it (we remove PageRank from our comments anyway) there’s only upside.
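The arithmetic behind “nearly 3 hours” is simple enough to check. This assumes the bot posts to one blog at a time and waits out the full delay on each; the function name is just for illustration.

```python
def delay_cost_hours(delay_seconds: float, blog_count: int) -> float:
    """Total time a strictly sequential bot spends waiting, in hours."""
    return delay_seconds * blog_count / 3600


total_seconds = 10 * 1000
total_hours = delay_cost_hours(10, 1000)
print(f"{total_seconds} seconds = {total_hours:.2f} hours")
# prints "10000 seconds = 2.78 hours"
```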
The problem with this solution is that comment spammers are probably using threaded submitting programs, which means all that time waiting is time spent commenting on other blogs. Until someone can come up with something like Hashcash for comment spam, they wouldn’t start sending require_pagerank.
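To give a feel for what “something like Hashcash” could mean here, this is a minimal proof-of-work sketch. It is not the real Hashcash stamp format (which is built on SHA-1); I’ve swapped in SHA-256 and a made-up challenge string, and the difficulty setting is an arbitrary assumption. The point is that the submitter has to burn CPU time, which extra threads can’t hide the way they hide a sleep.

```python
import hashlib
from itertools import count


def verify(challenge: str, nonce: int, bits: int = 16) -> bool:
    """Check that SHA-256(challenge:nonce) starts with `bits` zero bits."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    value = int.from_bytes(digest, "big")
    return value >> (256 - bits) == 0


def mint(challenge: str, bits: int = 16) -> int:
    """Brute-force a nonce that satisfies verify(); cost doubles per extra bit."""
    for nonce in count():
        if verify(challenge, nonce, bits):
            return nonce
    raise RuntimeError("unreachable: count() never ends")


# The blog would hand out the challenge with the comment form (hypothetical
# value here) and reject any comment whose nonce fails verify().
nonce = mint("entry-123:2004", bits=12)
print(verify("entry-123:2004", nonce, bits=12))
```

Verifying costs the blog one hash, while minting costs the submitter a few thousand hashes on average at this difficulty.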
Right now I’m thinking about something a little more vicious.
A decompression bomb is a compressed file with an enormous compression ratio, such that its decompressed size is many, many orders of magnitude larger than the compressed file. In the linked advisory, they claim to have a 100GB file compressed to under 6K.
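A harmless, scaled-down way to see the effect is to compress a block of zeros with Python’s zlib and compare sizes. This is a toy, nowhere near the figures in the advisory, and the function names are my own. The second function shows how a client could cap what it is willing to inflate.

```python
import zlib


def compression_ratio(size_bytes: int) -> float:
    """Compress `size_bytes` of zeros and return original/compressed size."""
    raw = bytes(size_bytes)  # all zero bytes, about as compressible as it gets
    compressed = zlib.compress(raw, 9)
    return len(raw) / len(compressed)


def bounded_decompress(data: bytes, limit: int) -> bytes:
    """Inflate `data`, refusing to produce more than `limit` bytes."""
    inflater = zlib.decompressobj()
    # Ask for one byte more than the limit so an overrun is detectable.
    out = inflater.decompress(data, limit + 1)
    if len(out) > limit:
        raise ValueError("decompressed size exceeds limit")
    return out


print(f"ratio for 10 MB of zeros: {compression_ratio(10 * 1024 * 1024):.0f}:1")
```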
Imagine what would happen to a comment spammer’s computer if you used HTTP gzip compression to send that 6K file as a response to a comment that your anti-spam measures determined was comment spam.
This is fun to think about, but this is not a very good thing to do, even to a comment spammer. More importantly, some innocent users are likely to be caught in the crossfire. I have no way of telling if comment spamming bots bother to get the response to their postings, let alone try to decompress it. Even if they do, they’ll quickly evolve to ignore gzipped responses.
For now it looks like comment authentication is the closest thing to an answer, using either a central service like TypeKey or the kind of distributed system that FOAF/GPG was supposed to create.

4 thoughts on “More on comment spam (and more parenthetical remarks)”

  1. I haven’t gotten a single spam comment to my weblog ever since I added in a simple one-way hash to the comment engine, which does the following:
    – It requires that the comment is actually submitted from the page it’s supposed to be submitted from, using the same IP address and user-agent which the page was loaded from (and that there is actually a user-agent set)
    – It requires that the comment be submitted at least 5 seconds after said page is loaded
    – If these conditions are not met, the resulting page is turned into a preview, rather than an error message, so spambots don’t realize that there was an error in submission and users are given a transparent second chance (in a nice passive way which just makes it seem like they accidentally clicked “preview” instead)
    I had the first two parts working fine on MT’s built-in comment engine, though the third required a bit more hacking on the MT source than I was comfortable with. All three parts are really easy to do in phpBB.
    Sure, it’s possible for the spammers to start actually loading the comment submission form, figuring out how long they have to wait, and then submitting the form from the same IP address, but that’s starting to look like a pretty unlikely series of events.

  2. That’s a good customized solution but I’m sure you can see what would happen if every Movable Type and WordPress blog started using it. Comment spammers would simply code their bots to load the page and grab that hash, wait the requisite time, and post.

  3. “It requires that the comment is actually submitted from the page it’s supposed to be submitted from, using the same IP address and user-agent which the page was loaded from (and that there is actually a user-agent set)”
    And what would you tell people like AOL’ers who may end up using a different IP address for every page request due to all of the different proxies? Not that AOL’ers have anything intelligent to say anyway (joke), but with your system, they’d have to get lucky and go through the same proxy twice in a row to make a comment.
    This whole optional flag idea just sounds pointless. You think the people who write the spam bots don’t read the source code? You seem to think they’re both incredibly dumb (don’t read the source to find out how it works) and incredibly intelligent (capable of writing a bot smart enough to grab required form field information) at the same time. I assure you, once MT and WP stop dicking around and start implementing real spam prevention techniques, spammers will hire a bigger dork to write a better program to bypass said techniques. It just becomes a question of how costly the spamming workarounds must be to make them not worth the effort to implement.
    Me? I’m just having fun staying one step ahead of the “bigger dork” while laughing at the “fat cat” weblog design teams’ failure to keep up. As long as the “fat cats” continue doing the bare minimum, I’ll never have to worry about the “bigger dork” getting ahead.

  4. Yeah, the optional flag thing is pointless because there isn’t a good way to punish a comment spammer for not setting it. A delay won’t stop them because they’ll write multi-threaded bots that don’t get stuck waiting.
    Custom solutions to the problem are effective, but by definition they aren’t something that can stop comment spam on the macro level.
    Custom solutions remind me of the old joke about two backpackers who come across a tiger while hiking. The tiger starts coming after them and one of the hikers throws down his backpack and starts running. The other one says “Are you crazy? You can’t outrun a tiger!”
    The first says “I don’t have to outrun the tiger, I just need to outrun you.”
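
For anyone curious, the checks described in the first comment (same IP address and user-agent, a non-empty user-agent, at least 5 seconds between page load and submit, and a quiet fall back to preview) could be sketched roughly like this. The commenter only says “a simple one-way hash”; using HMAC-SHA256 with a server-side secret is my own assumption, as are all the names below.

```python
import hashlib
import hmac
import time
from typing import Optional

SECRET_KEY = b"change-me"  # hypothetical server-side secret
MIN_DELAY_SECONDS = 5


def make_token(ip: str, user_agent: str, issued_at: int) -> str:
    """Token embedded in the form when the page is served."""
    message = f"{ip}|{user_agent}|{issued_at}".encode()
    return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()


def check_submission(ip: str, user_agent: str, issued_at: int,
                     token: str, now: Optional[float] = None) -> str:
    """Return "accept", or "preview" when any check fails (never an error)."""
    now = time.time() if now is None else now
    if not user_agent:
        return "preview"
    expected = make_token(ip, user_agent, issued_at)
    if not hmac.compare_digest(expected, token):
        # Different IP, user-agent, or timestamp than the page load.
        return "preview"
    if now - issued_at < MIN_DELAY_SECONDS:
        return "preview"
    return "accept"
```

The form would carry issued_at and the token as hidden fields. A real version would also want to expire old tokens, which this sketch leaves out.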
