Google Web spam

Yesterday, @mims wrote this post on "content-mills," which prompted this discussion on HN about Web spam. Many of the comments are by moultano, who is on Google's search quality team. This particular comment really drew my attention:

I doubt you'll find MFA spam to be better on DDG than on Google, but please, if you see a query where they are beating us. Send it over. :) I can guarantee you that I'll get a lot of eyes looking at it.

At DDG, I mainly crawl looking for these types of spam domains. On my last crawl, I identified about 37.8M domains as spam in the com/net/org/biz/info/us TLDs. I found Web sites at another 61.3M domains; the rest timed out. So roughly 38% of the domains I visited that had sites (37.8M of 99.1M) were spam.
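As a quick check on that share, here is the arithmetic as a minimal sketch. The counts are the ones quoted above; the variable and function names are only illustrative.

```python
# Counts quoted in the post, in domains.
spam_domains = 37_800_000        # identified as spam
other_site_domains = 61_300_000  # responded with a Web site, not flagged


def spam_share(spam: int, non_spam: int) -> float:
    """Fraction of responding domains that were flagged as spam."""
    return spam / (spam + non_spam)


share = spam_share(spam_domains, other_site_domains)
print(f"{share:.1%} of domains with sites were spam")
```

Domains that timed out are left out of the denominator, matching the "with sites" qualifier.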

I just took a random sample of those spam domains and checked them against Google's index. All of this code as well as the sample and results are now on github.
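Drawing such a sample might look like the sketch below. This is not the code from the github repo; it assumes the spam list fits in memory, and the function name, seed, and toy domain names are hypothetical.

```python
import random


def sample_domains(domains, k, seed=None):
    """Return k domains chosen uniformly at random, without replacement."""
    rng = random.Random(seed)  # a fixed seed makes the sample reproducible
    return rng.sample(list(domains), k)


# Hypothetical toy list standing in for the real spam list.
spam_list = [f"spam-{i}.example" for i in range(1000)]
sample = sample_domains(spam_list, 10, seed=42)
print(sample)
```

Sampling without replacement keeps any one domain from being counted twice when the results are extrapolated.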

First I started checking against Google's Web site directly, but their bot detection quickly shut me down. I was able to check 589 domains before being shut down, using the site: syntax. The results are here. The second column is the # of results reported in the index. For example, you can verify the first one with this query.
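For reference, a site: check is just an ordinary search whose query is `site:` followed by the domain. The sketch below only builds that URL and does not fetch it, since automated fetching is what triggers the bot detection described above. The function name is illustrative and `example.com` is a placeholder.

```python
from urllib.parse import urlencode


def site_query_url(domain: str) -> str:
    """Build a Google Web search URL restricted to one domain via site:."""
    return "https://www.google.com/search?" + urlencode({"q": f"site:{domain}"})


print(site_query_url("example.com"))
# prints https://www.google.com/search?q=site%3Aexample.com
```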

Of those I checked, 302 came up with at least one result, i.e. are in their index in some form. That means (extrapolating) roughly 50% of my spam domains are in Google's index, or about 19M domains.

Once shut off, I moved to Google's search API to process the full 10K sample. Interestingly, though, it apparently returns very different results. For example, check out web vs. api. The Web shows 1 result, whereas the API shows none.

Weird. I carried it out anyway though. Of the 10K full sample, I found 719 in Google's API index, or 7%. If you extrapolate that to the full list, that would be ~3M spam domains in the index. 
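Both extrapolations follow the same recipe: take the hit rate in the sample and scale it to the 37.8M spam list. The sketch below does that for the Web and API figures. It assumes "10K" means exactly 10,000 domains, and it adds a 95% normal-approximation interval that is not part of the original analysis. The interval is only meaningful if the checked domains are a random subset of the list.

```python
import math

SPAM_LIST_SIZE = 37_800_000  # spam domains identified in the crawl


def extrapolate(hits, checked, population=SPAM_LIST_SIZE, z=1.96):
    """Scale a sample hit rate to the full list.

    Returns (rate, estimate, (low, high)), where the bounds come from a
    normal approximation to the binomial (an added assumption).
    """
    p = hits / checked
    margin = z * math.sqrt(p * (1 - p) / checked)
    low = max(0.0, p - margin) * population
    high = min(1.0, p + margin) * population
    return p, p * population, (low, high)


for label, hits, checked in [("web", 302, 589), ("api", 719, 10_000)]:
    p, estimate, (low, high) = extrapolate(hits, checked)
    print(f"{label}: {p:.1%} indexed, ~{estimate / 1e6:.1f}M domains "
          f"({low / 1e6:.1f}M to {high / 1e6:.1f}M)")
```

The two intervals do not come close to overlapping, so sampling noise alone does not explain the gap between the Web and API numbers.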

In any case, these #s are pretty conservative estimates because a) I'm only covering about half the domain space (missing all the country TLDs except .us), and b) I know I still have a lot of false negatives (please send them to me when you see them).

On the other side, the way I do the identification, there are minimal false positives at the time of identification. However, sites flip between spam and non-spam all the time, and since it takes me a while to crawl, there are certainly a few false positives in there.

There are also legitimate false positives, and if you see those, please report them as well. I did nothing to hide those from view here, so you can see for yourself in the results.

Of course, this says nothing about how often they appear in the rankings. I tried to find the modern equivalent of Metaspy to get some random queries, but I couldn't find such a service in existence. Nevertheless, roughly half of the spam domains are not in the index, which raises the question: why the difference?

If people have lots of links from Google results saved, I'd be happy to run them against my list.
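Running saved links against a domain list could look something like the sketch below. It is an assumption about how such a check might work, not a description of the actual tooling. It matches a link if its host or any parent domain is on the list, and the `.example` names are placeholders.

```python
from urllib.parse import urlsplit


def flagged_links(urls, spam_domains):
    """Return the URLs whose host, or any parent domain of it, is in spam_domains."""
    flagged = []
    for url in urls:
        host = (urlsplit(url).hostname or "").lower()
        parts = host.split(".")
        # For www.spam.example this yields www.spam.example and spam.example.
        candidates = {".".join(parts[i:]) for i in range(len(parts) - 1)}
        if candidates & spam_domains:
            flagged.append(url)
    return flagged


# Hypothetical inputs: a tiny spam set and two saved result links.
spam = {"spam.example"}
links = ["http://www.spam.example/page", "https://good.example/"]
print(flagged_links(links, spam))
```

Keeping the list in a set makes each lookup constant time, which matters with tens of millions of domains.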

I'm the Founder & CEO of DuckDuckGo, the search engine that doesn't track you. I'm also the co-author of Traction Book, the book that helps you get traction.

