This page documents an experiment designed to investigate the best way to hide email addresses from web-crawling address harvesters, and to track how sophisticated they are at extracting “hidden” addresses.

The folks over at Project Honey Pot and their volunteer members are working on ways to identify and block spam harvesting bots — those automated programs that spider the web looking for email addresses to add to their databases of spam victims. Their documentation and help pages contain a very good discussion of how to “munge” an email address on a web page so that it is not recognized as an address by a harvest bot, but can still be used by human readers. A variety of ways are discussed, from very simple to very complex, and they include some intriguing speculation about the bots becoming sophisticated enough to decode munged addresses, too.

So I created a web page (at a different domain) that is hidden from humans and carries 11 different email addresses, each encoded a different way. The encodings range from totally unobfuscated to compound JavaScript statements to image-only presentation. I am watching the mailboxes for these addresses to see which ones get spammed and which don’t. That will tell us which ways of munging addresses on web pages to avoid.

These addresses have not been publicized anywhere in any manner, and they have never been used to send email. The user names are unlikely to be generated algorithmically. The web page they are on is unlikely to ever be stumbled upon by a human clicking on links, and a human viewing the page is very unlikely to find the addresses. In short, these addresses are well hidden.

Three of them were spammed nine days after the page was created.

Not just that, but two of the spammed addresses were presented only in a munged form. That means the spambot that found them already has some degree of smarts at ferreting out what you might have thought was a safe way of sharing your email address on the web. Here is a count of spam messages received for each type of munging. (These are not the actual addresses, of course. That domain name does not even exist, so harvesters who visit this page won’t find anything useful.)

Address | HTML presentation | Appearance | First Spam | Count
--- | --- | --- | --- | ---
victim0@podunk.edu | <a href="mailto:victim0@podunk.edu">victim0@podunk.edu</a> | victim0@podunk.edu | 2009-10-24 | 7
victim1@podunk.edu | victim1@REMOVETHISpodunk.edu | victim1@REMOVETHISpodunk.edu | 2009-10-24 | 1
victim2@podunk.edu | victim2REMOVETHIS@podunk.edu | victim2REMOVETHIS@podunk.edu | 2009-10-24 | 1

As of 2009-11-12

Obviously, some address-harvesting bots are more sophisticated than others. I was surprised that the first spam I received went to all three of these addresses. The first address is a gimme to the harvesters, just to serve as a baseline. Any spammer who can’t find that should look for another line of work. But you would expect the other two to be more resistant to automated harvesting. So they are, but they’re not bullet-proof, and it’s obvious that those simple-minded ways of munging should be avoided.
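How the harvester that found these addresses actually works is unknown. As a rough illustration of why filler-word munging offers so little protection, here is a minimal Python sketch of what such a bot could do. The token list, the loose address pattern, and the function name harvest are all assumptions made up for this example, not anything observed in a real harvester.

```python
import re

# Hypothetical filler words a harvester might try stripping (an assumption).
FILLER_TOKENS = ("REMOVETHIS", "NOSPAM")

# Deliberately loose pattern for address-like strings; not a full RFC 5322 parser.
ADDRESS_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def harvest(text):
    """Return address-like strings found in text, plus de-munged variants."""
    found = set()
    for match in ADDRESS_RE.findall(text):
        found.add(match)  # keep the address exactly as it appeared
        cleaned = match
        for token in FILLER_TOKENS:
            cleaned = cleaned.replace(token, "")
        # Keep the stripped variant only if it still looks like an address.
        if ADDRESS_RE.fullmatch(cleaned):
            found.add(cleaned)
    return found


page = "Write to victim1@REMOVETHISpodunk.edu or victim2REMOVETHIS@podunk.edu"
print(sorted(harvest(page)))
```

A few lines of pattern matching are enough to recover both victim1@podunk.edu and victim2@podunk.edu from the munged text, which fits the result in the table: the filler word stops only the laziest bots.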

This is intended to be a long-running experiment. I’ll only present here those ways of munging addresses that have been proven ineffective against at least one spam harvester; I don’t want to give away all of my tricks to any spammers who might be reading this. But you can read the how-to at Project Honey Pot for ideas on other ways that may still be effective.

Update 2009-11-16: I’ve created a separate page with a plain email address on it that is not linked to at all. It is only listed in a robots.txt file with an entry that tells bots not to go there. That’s the only reference to the page on the whole internet, so the only visits it gets should be from evil bots that don’t obey the robots exclusion standard. Any email sent to the address on that page confirms the nefarious nature of at least one visiting bot.
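For readers unfamiliar with the robots exclusion standard, here is a small Python sketch of how a well-behaved crawler would treat such an entry, using the standard library's urllib.robotparser. The /trap/ path, the example.com host, and the ExampleBot name are placeholders invented for illustration; they are not the real page or any real crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the disallowed path is a placeholder.
ROBOTS_TXT = """\
User-agent: *
Disallow: /trap/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_fetch(url, agent="ExampleBot"):
    """Return True if a rule-abiding bot is allowed to request this URL."""
    return parser.can_fetch(agent, url)


# A polite crawler checks before every request and skips disallowed paths.
print(may_fetch("http://example.com/index.html"))
print(may_fetch("http://example.com/trap/contact.html"))
```

A bot that performs this check never requests the disallowed page, so anything that does show up there either ignored robots.txt or read it specifically to find places it was told not to go.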