Recently in Duck Duck Go Category

Google Web spam

 
Yesterday, @mims wrote this post on "content-mills," which prompted this discussion on HN about Web spam. Many of the comments are by moultano, who is on Google's search quality team. This particular comment really drew my attention:

I doubt you'll find MFA spam to be better on DDG than on Google, but please, if you see a query where they are beating us. Send it over. :) I can guarantee you that I'll get a lot of eyes looking at it.

At DDG, I mainly crawl looking for these types of spam domains. On my last crawl, I identified about 37.8M domains as spam in the com/net/org/biz/info/us TLDs. I found Web sites at another 61.3M domains; the rest timed out. So roughly 40% of the domains I visited (with sites) were spam.

I just took a random sample of those spam domains and checked them against Google's index. All of this code as well as the sample and results are now on github.

First I started checking against Google's Web site directly, but their bot detection quickly shut me down. I was able to check 589 domains before being shut down, using the site: syntax. The results are here. The second column is the # of results reported in the index. For example, you can verify the first one with this query.

Of those I checked, 302 came up with at least one result, i.e. are in their index in some form. That means (extrapolating) roughly 50% of my spam domains are in Google's index, or about 19M domains.

Once shut off, I moved to Google's search API to process the full 10K sample. Interestingly though, it apparently returns very different results. For example check out web vs api. The Web shows 1 result, whereas the API shows none. 

Weird. I carried it out anyway though. Of the 10K full sample, I found 719 in Google's API index, or 7%. If you extrapolate that to the full list, that would be ~3M spam domains in the index. 
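
The two extrapolations above are plain proportion estimates scaled up to the full spam list. Here is a minimal sketch of that arithmetic. It is illustrative only and is not the code published on github; the function name, the normal-approximation margin of error, and the assumption that each batch behaves like a simple random sample are all additions for the sake of the example.

```python
import math

SPAM_DOMAINS = 37_800_000  # spam domains identified in the last crawl


def extrapolate(hits, sample_size, population=SPAM_DOMAINS, z=1.96):
    """Scale a sample hit rate up to the full spam list.

    Returns (rate, estimate, margin). The margin is a normal-approximation
    95% margin of error and assumes a simple random sample.
    """
    rate = hits / sample_size
    std_err = math.sqrt(rate * (1 - rate) / sample_size)
    return rate, rate * population, z * std_err * population


if __name__ == "__main__":
    samples = [("site: queries", 302, 589), ("search API", 719, 10_000)]
    for label, hits, n in samples:
        rate, estimate, margin = extrapolate(hits, n)
        print(f"{label}: {rate:.1%} indexed, ~{estimate / 1e6:.1f}M domains "
              f"(+/- {margin / 1e6:.1f}M)")
```

The margin only reflects sampling noise. It says nothing about why the Web and API indexes disagree.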

In any case, these #s are pretty conservative estimates because a) I'm only covering about half the domain space (missing all the country tlds except .us), and b) I know I still have a lot of false negatives (please send me them when you see them).

On the other side, the way I do the identification, there are minimal false positives at the time of identification. However, sites flip between spam and non-spam all the time, and since it takes me a while to crawl, there are certainly a few false positives in there. 

There are also legitimate false positives, and if you see those, please report them as well. I did nothing to hide those from view here, so you can see for yourself in the results.

Of course this says nothing about how much they appear in the rankings. I tried to find the modern equivalent of Metaspy to get some random queries, but I couldn't find such a service in existence. Nevertheless, half of the spam domains are not in the index, which raises the question: why the difference? 

If people have lots of links from Google results saved, I'd be happy to run them against my list.

Announcing duck.co - The DuckDuckGo Community

 
LOGO_CoFounderWebsite.gif
Today the .co domain launches and I'm proud to be a part of the .co Founders program and have one of the founding Websites.

It's duck.co, which is the new home of the DuckDuckGo Community forum (previously ideas.duckduckgo.com).

I have high hopes for this site, and hope it becomes a real community centered around DDG. I've initially created four top-level forums:
  • DuckDuckGo Feedback - report problems and give suggestions
  • Spreading DuckDuckGo - share ideas and experiences
  • DuckDuckGo Code - discuss DDG open source code & APIs. Over the next year, I hope to put a lot more effort into this area.
  • DuckDuckGo for Educators - use of DDG in the educator/library community in particular.
I of course will be actively participating in all of these forums and would welcome and greatly appreciate your participation as well.

The forum is run by Zoho discussions. I have it set up so you can post anonymously, though if you do sign in you'll be able to more easily follow topics & forums (by email).

You might also be wondering why I switched forum providers. Long story short, Slinkset (the previous provider) was down about 50% of the time and really slow when it was up. That is, it was essentially unusable.

PickFu review

 
pickfu.png

Lately I've found myself repeatedly encouraging startups to use PickFu, so I thought it was time to share that advice with a wider audience. Here's how it works.

You give PickFu $5 and then they get 50 people via Mechanical Turk to do an A/B test for you in the form of a question, e.g. which page/image looks better and why? 

In addition to recommending them to others, I've used it twice so far myself. I made the results public so you can see them. First, I asked about two different homepage logo designs. Second, I asked about two different about page designs. (The latter test I paid a bit more, $9 for 100 responses.)

It was interesting to me that in both cases there wasn't a clear winner. My original concern was people would just pick one and you'd expect an equal split in that case. And I'm still not sure that isn't happening. 
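
One way to put a number on "no clear winner" is an exact two-sided binomial test against a 50/50 split. The sketch below is only an illustration: the 28-of-50 split is a made-up example, not my actual PickFu results, and the function name is hypothetical. Note that this test cannot tell random clicking apart from genuinely divided opinion. It only says whether a split is distinguishable from a coin flip.

```python
from math import comb


def two_sided_binomial_p(k, n, p=0.5):
    """Exact two-sided binomial test p-value for k successes in n trials."""
    probs = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    observed = probs[k]
    # Add up every outcome that is at least as unlikely as the observed one.
    # The small tolerance guards against floating-point ties.
    return min(1.0, sum(q for q in probs if q <= observed * (1 + 1e-9)))


if __name__ == "__main__":
    # Hypothetical split: 28 of 50 respondents prefer design A.
    print(f"p-value: {two_sided_binomial_p(28, 50):.3f}")
```

A large p-value means the split is consistent with an even preference, which is why the written comments end up being the more useful signal.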

On the other hand, I found the comments people left somewhat compelling. They make sense and so I'm inclined to believe most people are actually taking it seriously.

More to the point though, I found the feedback valuable enough to iterate on the About page design and ended up with more of a hybrid approach. I wouldn't have got there without reading those comments. It's really a cheap and quick form of useful feedback.

I'd of course be interested in your experiences as well. Also of note are UserTesting ($40 for videoed user test) and FeedbackArmy ($15 for 10 written up usability tests). I haven't used these sites yet, but plan to do so soon when a need arises. I'd also be interested in your feedback on them.

Free DuckDuckGo Stickers

 

stickers.jpg

A few months ago I ordered 1,000 stickers for DuckDuckGo from StickerRobot, and then gave them all away for free to DuckDuckGo newsletter subscribers. (I had a lot of takers.)

So a few weeks ago, I ordered 3,000 more and 1,000 bigger ones, which are pictured above. Now I'm giving them away (still for free) to any DuckDuckGo user who wants them.

I need help spreading the search engine, and I think this is a fun way to do so. These are high quality, weather-proof stickers, so they should be good on your car or on a high-traffic area, in addition to your laptop.

pole.jpg laptop.jpg

To get them off my desk and into your hands, simply send me your name and address and I'll put some in the mail!

Duck Duck Go Searches Are Now Externally Anonymous

 

duckduckgo.png

Duck Duck Go searches have been internally anonymous for a while. IP addresses are not stored. Neither are full user agents.[1] 

When you search at Duck Duck Go, there is no way for me to know who you are, or tie your searches together. For more info, check out the privacy policy. It's short--206 words; compare that to Google's 2,137 words.

When you search on Google, not only is your info stored, but also when you click on a link, your search terms are passed on to that site via the Referer header. A lot of sites use this information to tailor content and advertising to you specifically. Your searches also show up in analytics tools, which people use for SEO and other tracking purposes. This information leakage creates legitimate privacy concerns.

If you use the encrypted version of Duck Duck Go, the Referer header is not sent as per the HTTP standard. However, not everyone wants to use the encrypted version because it is slower to initially connect.

As of today, the Referer header is also not sent when you use the normal http version of Duck Duck Go. In other words, your search terms are not leaked to the sites that you click on, regardless of whether you use the encrypted version or not.

Referer headers are sent by your browser (client side) automatically, so I can't control it from my servers.[2] As a result, I'm currently using a meta refresh to force a client side redirect, and if meta redirects are turned off, a JS location.replace from that redirect page.[3]
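
To make the mechanism concrete, here is a minimal sketch of what such a redirect page could look like, written as a Python function that builds the HTML. This is an illustration under assumptions, not the page Duck Duck Go actually serves: the function name is hypothetical, and the real page may order or time the meta refresh and the JS fallback differently.

```python
import html
import json


def redirect_page(target_url):
    """Build an interstitial page that forwards the browser to target_url.

    The meta refresh does the redirect; the script is a fallback for
    clients that ignore meta refreshes.
    """
    # Escape for use inside an HTML attribute value.
    meta_target = html.escape(target_url, quote=True)
    # json.dumps yields a valid JS string literal; escaping "</" keeps a
    # hostile URL from closing the <script> element early.
    js_target = json.dumps(target_url).replace("</", "<\\/")
    return (
        "<!DOCTYPE html>\n"
        "<html><head>\n"
        f'<meta http-equiv="refresh" content="0; url={meta_target}">\n'
        f"<script>window.location.replace({js_target});</script>\n"
        "</head><body></body></html>\n"
    )


if __name__ == "__main__":
    print(redirect_page("http://example.com/?a=1&b=2"))
```

Using location.replace rather than assigning location.href keeps the interstitial out of the browser history, so the back button returns to the results page.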

After a lot of testing, I've determined there is negligible slowdown with this process due to the way I've implemented it. (In fact, most search engines make a background request for click tracking already--I don't.) The meta page is sent in memory via the nginx echo module. As you're already in a keepalive state, this happens essentially instantly. 

However, I realize that some people may want to turn this off, e.g. if you want your search term leaked. In conjunction with this release, I've also added a redirect setting to do just that. You can also force it one way or another using URL parameters.

A related issue was brought to my attention a few months ago on reddit, which has to do with the use of images served from Amazon's S3 service. When those images (mainly favicons, but sometimes in the 0-click box) are requested, your browser sends to Amazon the Referer header, which includes your search. In response, I made four changes to address this issue.
  1. I added a setting to use POST requests, which solves the issue completely.
  2. I added a setting to disable favicons.
  3. I added a setting to disable the 0-click box.
  4. I started making all calls to S3 over https, so the headers would not be sent in plain text.
There are two issues with these. First, they all impact usability. POST requests break the back button and URL copying; you may want to see the images; and https slows things down a bit. Second, a couple days ago it was pointed out that despite these changes, the Referer header is still sent to Amazon, albeit encrypted, and they could be storing it. 

I decided to ask Amazon about their logging policy. They have a setting called Server Access Logging, which I do not have turned on, and so their logging policy in this case was unclear. Apparently they do log even if Server Access Logging is turned off.

All of the information exposed via Server Access Logs is in our internal logs - including referrer strings.

There were a lot of good suggestions on Hacker News on how to address this issue, but they all similarly impacted usability in one way or another.

I have now solved this problem by setting up a reverse proxy between me and S3. This costs me more bandwidth and server resources, but it is worth it to solve the privacy problem for you. Additionally, it actually improves usability because a) I set up a cache on my end and b) I can now turn off https to S3.

Furthermore, it is even more private than simply dropping the Referer string. Since you are no longer making the request on your side, your IP address isn't being sent to them at all. I can also explicitly set the Referer string (using the nginx more headers module), which I set to 'http://duckduckgo.com/'.
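
The real setup is an nginx reverse proxy, but the idea fits in a few lines of Python. The sketch below is a stand-in, not the production configuration: the class name and the fetch_upstream callable are hypothetical, and the cache is a plain in-memory dict with no expiry.

```python
FIXED_REFERER = "http://duckduckgo.com/"


class ImageProxy:
    """Tiny in-memory caching proxy in front of an upstream image store."""

    def __init__(self, fetch_upstream):
        # fetch_upstream(path, headers) -> bytes is a hypothetical callable
        # standing in for the HTTP request to the upstream bucket.
        self._fetch_upstream = fetch_upstream
        self._cache = {}

    def get(self, path, client_headers):
        """Return the image at path without forwarding client details."""
        if path not in self._cache:
            # client_headers (Referer, cookies, and so on) are deliberately
            # ignored; only a constant Referer is sent upstream.
            upstream_headers = {"Referer": FIXED_REFERER}
            self._cache[path] = self._fetch_upstream(path, upstream_headers)
        return self._cache[path]


if __name__ == "__main__":
    def fake_fetch(path, headers):
        print("upstream request:", path, headers)
        return b"image-bytes"

    proxy = ImageProxy(fake_fetch)
    proxy.get("/favicon.ico", {"Referer": "http://duckduckgo.com/?q=private"})
    proxy.get("/favicon.ico", {})  # served from cache, no upstream request
```

Because the proxy makes the upstream request itself, the upstream only ever sees the proxy's address and the fixed Referer, and repeat requests never leave the cache.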

I welcome feedback on these new processes. As they are new, I'm sure there are bugs to work out and further optimizations to work in. I already have a few in mind myself.


[1] Actually, user agents currently are not stored at all. In the future, however, I may compress user agent strings to short codes, e.g. FF for Firefox. For reference, the current nginx log_format is as follows.

log_format  main  '127.0.0.2 [$time_local] "$request" $status $request_time $body_bytes_sent "$http_referer"';
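
As a rough idea of what compressing user agents to short codes could look like, here is a minimal sketch. Nothing like this is running today. Only the FF code comes from the footnote above; every other code and the function name are hypothetical.

```python
# Order matters: Chrome's user agent also contains "Safari", so it is
# checked first. Only "FF" comes from the text; the rest are made up.
UA_CODES = [
    ("Firefox", "FF"),
    ("Opera", "OP"),
    ("Chrome", "CH"),
    ("Safari", "SF"),
    ("MSIE", "IE"),
]


def short_code(user_agent):
    """Map a full user agent string to a short browser code."""
    for token, code in UA_CODES:
        if token in user_agent:
            return code
    return "??"  # unknown client


if __name__ == "__main__":
    ua = "Mozilla/5.0 (X11; U; Linux i686; rv:1.9.2) Gecko/20100115 Firefox/3.6"
    print(short_code(ua))
```

The point of the short code is that it keeps a coarse browser breakdown while discarding the version and platform details that make a full user agent string identifying.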

[2] As I noted, if you click on an http link from an https page, the Referer is not supposed to be sent. However, if you have the server redirect you from an https link to an http link, clients will pass the Referer header through. Annoying!

[3] Note that this is client-side, so if you have a client that doesn't behave, it may not work. I've tested it on most modern browsers, including Chrome, Safari (including iPhone/iPad), Konqueror, Firefox, Avant, Opera, and IE (including IE6).  

Duck Duck Go Traffic & Sponsorship

 

sponsor.png

Duck Duck Go served 1,182,204 result pages last month. You can now track the progress (or lack thereof) on the new traffic page. As you can see, this month is on track for a bit more. 

Thank you to everyone who's been spreading DDG for helping to make these traffic #s happen. And a special thanks to the reddit & Hacker News communities for being so supportive.

I'm also now accepting DDG sponsorship. I just put a square 85 pixel sponsor banner on the right bar.

For those users who hate ads, note that there is already a setting to hide the right bar, and if enough people request it, I'll create another one just to hide the sponsor banner (within the right bar). However, I plan to only accept high quality sponsors and banners.

Sponsorship is exclusive. That is, there will only be one sponsor at a time, on a weekly or monthly basis. The sponsor banner will display on all search result pages.

The new traffic page is meant to count exactly those (sponsored) page views. Pages without search results, e.g. the homepage, about page, etc., are not counted. Also, if someone clicks 'More links...' on a search results page, that is not counted as another page. For example, if someone clicked 'More links...' five times that would still just be one page.
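
The counting rule above can be sketched as a small filter. This is only an illustration: the record format (a dict with "kind" and "more_links" keys) and the function name are hypothetical, not how the traffic page is actually implemented.

```python
def count_result_pages(requests):
    """Count page views that display the sponsor banner.

    Each request is a dict with hypothetical keys:
      "kind":       "results", "home", "about", ...
      "more_links": True when the request came from clicking 'More links...'
    """
    return sum(
        1
        for r in requests
        if r["kind"] == "results" and not r.get("more_links", False)
    )


if __name__ == "__main__":
    visit = (
        [{"kind": "home"}, {"kind": "results"}]
        + [{"kind": "results", "more_links": True}] * 5
        + [{"kind": "about"}]
    )
    print(count_result_pages(visit))
```

In this example one search followed by five 'More links...' clicks still counts as a single page, and the homepage and about page are ignored.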

I have no idea what to expect so this program may fail fast. (This program is modeled after Daring Fireball's sponsorship program.) I will keep the traffic page up regardless though. 

The initial sponsorship rate is $1.5K/week or $5K/month, subject to increases based on demand. To kick things off, any sponsor who pays before the end of this month can get a discounted rate of $1K/week or $3.5K/month for up to two months (pre-paid). Of course, please let me know if you have any questions.

Twitter RT Test Results

 
Test: I asked @duckduckgo followers to RT this tweet.

tweet1.png

I also RTd it from @yegg (my personal account) with slightly different text.

tweet2.png

Hypothesis: I wasn't sure what to expect, but figured I would get a bunch of RTs because my followers seem pretty solid (not spam, auto follows or other nonsense). After that, I thought maybe I'd get some 2nd level RTs. I wasn't even holding out hope it would go viral, and of course it didn't.

Results: I tallied up the RTs using twitter search. The @duckduckgo tweet was RTd by 18 people using Twitter's RT system. The @yegg tweet was RTd by 6 people. Then there were 11 people who RTd it on their own, 4 of whom got RTd 1 time each. This totals 39 RTs.

All of these people for the most part aren't spammy either, i.e. their accounts look real, with real followers. Counting them up they had 4,406 followers (avg 126, min 6, max 408). I threw out one outlier who had 22,150 followers but was also following 24,308.
 
As you can see from the tweet, I linked to a special URL, dukgo.com, that hardly anyone uses, and so I used it to track clicks. All told, 73 clicks. So that's ~2 clicks per RT--not too good.

I also considered RTs/followers. @duckduckgo has 848 followers, so that's 3.4% who RTd it. At the second level you have the 4,406 followers tied to those RTs plus my 911 followers for 5,317 followers. At 10 RTs, that's about 0.19%. So you can see the drop off.
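
For anyone who wants to check the arithmetic, the ratios can be recomputed from the counts quoted above. The grouping is an assumption made for this sketch: the first-level rate is taken as the 18 system RTs plus the 11 manual RTs over the 848 @duckduckgo followers, and the 10 second-level RTs are taken as the 6 @yegg RTs plus the 4 re-RTs.

```python
system_rts_brand = 18     # RTs of the @duckduckgo tweet via Twitter's RT system
system_rts_personal = 6   # RTs of the @yegg tweet
manual_rts = 11           # people who RTd on their own
second_level_rts = 4      # manual RTs that were themselves RTd once

total_rts = (system_rts_brand + system_rts_personal
             + manual_rts + second_level_rts)
clicks = 73
clicks_per_rt = clicks / total_rts

# Assumed grouping: first level measured against @duckduckgo's followers.
first_level_rate = (system_rts_brand + manual_rts) / 848

# Second level: followers of the RTers plus the @yegg followers.
second_level_audience = 4_406 + 911
second_level_rate = (system_rts_personal + second_level_rts) / second_level_audience

if __name__ == "__main__":
    print(f"total RTs: {total_rts}")
    print(f"clicks per RT: {clicks_per_rt:.2f}")
    print(f"first-level RT rate: {first_level_rate:.2%}")
    print(f"second-level RT rate: {second_level_rate:.2%}")
```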

Here are my takeaways:

  • You need a lot of followers. I'd say you'd need two orders of magnitude more, i.e. 100K real followers, to make this at all worthwhile. Then we're talking on the order of 10K clicks.

  • To go viral, you'd need RTs by important people. I'm really grateful for all the RT support, but no one had a ton of meaningful followers. I think you'd need that celebrity push to get it out there, which may kick off other celebrity RTs.

  • To go viral, you'd probably need more compelling content. Of course this test was business related, but you'd probably need it tied to either more of a fad/news story or have more of a hook, e.g. a super interesting Web page on the other side.

  • Viral coefficient is not 0. There were second level RTs. If the content also lent itself to RTing, i.e. it was a game or something that involved tweeting, you might be able to bring that up and keep the chain going.

I also tried Fiverr, a service where people say what they'll do for $5.  I spent $15 on 3 people who seemed legit and said they would retweet to all of their x thousand followers. Only one has done it so far, and that yielded a RT by 1 person (none of which I counted above). So I'm guessing that is not going to be a good advertising channel.

My Duck Duck Go reddit ad by the numbers

 

reddit.ad.png

My reddit ad (above) has yielded the highest ROI (by far) compared to the other ad platforms I've tried, which includes Adwords, Yahoo, Bing, Facebook, Myspace, and StumbleUpon. Of course your mileage will vary depending on your ad, product, target market, etc., but my basic message to you is you shouldn't overlook reddit self-serve advertising.

My ad ran for 13 days, from 3/7 to 3/20. It cost $650, and I spent $50 per day. In total it had 1,288,378 impressions (282,732 uniques) and yielded 20,700 clicks (18,420 uniques).

That's a CTR of 1.61%, or about 6.5% per unique redditor. This works out to a CPC of 3.14 cents, or 3.53 cents per unique visitor, and a CPM of $0.50, or $2.30 for uniques. 
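
These rates all follow from three numbers: cost, impressions, and clicks. Here is a minimal sketch that recomputes them from the totals quoted above. The function name is made up for illustration, and values derived from the quoted totals may differ slightly from what an ad dashboard reports.

```python
def ad_metrics(cost_usd, impressions, clicks):
    """Return (CTR as a fraction, CPC in cents, CPM in dollars)."""
    ctr = clicks / impressions
    cpc_cents = 100 * cost_usd / clicks
    cpm_usd = 1000 * cost_usd / impressions
    return ctr, cpc_cents, cpm_usd


if __name__ == "__main__":
    cost = 650  # 13 days at $50 per day
    rows = [
        ("all", 1_288_378, 20_700),
        ("uniques", 282_732, 18_420),
    ]
    for label, impressions, clicks in rows:
        ctr, cpc, cpm = ad_metrics(cost, impressions, clicks)
        print(f"{label}: CTR {ctr:.2%}, CPC {cpc:.2f} cents, CPM ${cpm:.2f}")
```

The same function works for comparing platforms, as long as each one reports cost, impressions, and clicks over the same period.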

This is actually my second reddit ad. The first one looked exactly the same and was run during their beta period back in November. The first day that one came out it had a unique CTR of 13%! This rate eventually came down to about the same unique CTR% as the second ad.

Here were my takeaways from the whole experience:

  • Redditors actually try out your site. 3c per unique visitor is pretty good in and of itself, but it's all worthless unless they actually try out your site. For example, you can get 5c unique visitors from StumbleUpon (presumably in a similar demographic), but StumbleUpon visitors never would try out my sites. Reddit visitors did try out Duck Duck Go. 

  • Redditors actually comment on your site. A unique feature of reddit ads is that redditors can comment on your ad on reddit. You can turn this off if you want, but I don't recommend it. Here's why. That is my ad's comment thread, which has 656 comments! (About half are from me though.) On this thread is immensely useful feedback. I fixed bugs, got feature requests, got first impressions, etc., and perhaps most importantly, was able to engage with my users in almost real time.

  • I think it helped my other Reddit submissions. At the suggestion of one of my ad's comments, I made an Ask Me Anything submission. At another suggestion, I finally pulled the trigger on no longer storing IP addresses, which I submitted to reddit here. This last submission went to the top of the technology sub-reddit and made it to the reddit front page as well, to somewhere around #5.

    These two posts yielded more traffic than my entire ad, and I think they did so in large part because of the ad. My hunch is people recognized the site and were more likely to vote it up as a result. Two concrete examples.

    First, I doubt the privacy submission would have made it to the front page if not for the ad because doing so requires a certain upvote velocity that you don't usually get when people aren't familiar with you.

    Second, the other day @kn0thing submitted my cuil parody to reddit. On its comment thread are two comments from redditors praising the search engine. People read these types of comments and take them seriously, and they wouldn't happen without converting reddit users into Duck Duck Go users beforehand.

  • CPM and CPC vary widely by day. As you can see from these graphs, some days were very different from others. This has to do with how reddit sells advertising. They split up the pool based on how many $ people bid for that day. So if you get a bigger piece of the pie that day, you'll get better CPM. My takeaway is to run the ad over a longer period. You can always stop it if you're getting diminishing returns.

reddit CPM graph.png
reddit CPC graph.png

  • My ad did better than others. You can tell from these graphs that my CPC was well below average: my ad ran during the high-CPM March period (so my CPM was higher than average), yet my CPC was still way less than average. I think the reason for this is two-fold.

    First, a search engine ad is a good fit for reddit ads in general. It has broad market appeal and redditors in general like trying out new technology.

    Second, I think the ad is particularly well structured. The circular duck icon draws your attention, contrasts with the site colors, and sticks out because it is a circle (as most images are square). I believe the title also has appeal.

  • People are still mentioning the ad to me. I've received a bunch of regular site feedback (through my feedback system) where people mentioned they discovered Duck Duck Go via the reddit ad. Also, at least two people I've initiated conversations with told me they've been using the search engine after seeing the reddit ad. These types of conversations have not happened with any other ads I've placed.
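
The day-to-day CPM swings mentioned in the takeaways above are easier to see with numbers. The sketch below assumes a strict pro-rata split, which is my reading of how the pool is divided and not a documented reddit formula. The function names, bid amounts, and impression counts are all made up for illustration.

```python
def split_impressions(bids, daily_impressions):
    """Divide a day's ad impressions among advertisers in proportion to bids."""
    pool = sum(bids.values())
    return {name: daily_impressions * bid / pool for name, bid in bids.items()}


def cpm(bid, impressions):
    """Cost per thousand impressions for one advertiser on one day."""
    return 1000 * bid / impressions


if __name__ == "__main__":
    # Same $50 bid on two hypothetical days with different competition.
    quiet_day = split_impressions({"me": 50, "others": 450}, 1_000_000)
    busy_day = split_impressions({"me": 50, "others": 950}, 1_000_000)
    print(f"quiet day: {quiet_day['me']:,.0f} impressions, "
          f"CPM ${cpm(50, quiet_day['me']):.2f}")
    print(f"busy day: {busy_day['me']:,.0f} impressions, "
          f"CPM ${cpm(50, busy_day['me']):.2f}")
```

Under this model a fixed daily spend buys more impressions on days when fewer dollars are competing, which is one reason to spread the budget over a longer period.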

Finally, I want to be clear that even without the successes of my follow-on reddit submissions, I still believe reddit ads have had the highest ROI (and still by far) for Duck Duck Go. 

Well, that's it :). In short, consider trying reddit self-serve advertising. You can test for as little as $20.

Care about search privacy? Use Duck Duck Go!

 
duckduckgo.png

Duck Duck Go might be the most private place to search the Internet right now. Here's why:

  • No IP addresses. I no longer store IP addresses at all. Not hashes of them. Not remnants of them. Nothing. I don't even log them.

  • No cookies. I don't track you with cookies. If you just come to the site and search, no cookies are set. In fact they're only set if you use the settings page.

  • Https. You can now search on an encrypted connection (using https). And nothing in the logs tells me you are using the https site.

  • No contractors. Actually, it's just me right now, and there are no plans to change this in the foreseeable future. So you don't have to worry about anyone with access stealing the non-IP address logs we do have.

  • No third-party feedback. Our feedback page is powered by me. No Google docs or other third party site. In fact, it really just emails me so if you want you can just email me directly at yegg@alum.mit.edu. This may not seem like a big deal, but when we were using third-party feedback I had people writing in many times asking if they could trust the third-parties and to ditch it. So I did.

  • No cloud. I run my own servers and network. However, we do have the capability to run on EC2 if needed (in case of network failure). Yet under normal circumstances, your search traffic is running on servers I can actually see :).

  • No Google. I don't use any Google APIs. No Google search. No Google Analytics. No Google ads. I have no problem with Google, but if you're concerned with how much information they have on you, I offer you a viable alternative.

I've updated our Privacy Policy to reflect the above. When you don't collect any personal information, it becomes pretty simple!

If I missed something that concerns you, please tell me about it. I'm certainly open to doing whatever I can to protect your privacy. And privacy is just one of the reasons to use Duck Duck Go instead of Google.

Hack Hack Go

 

iostat.png

I want to make Duck Duck Go a better search engine for programmers like me. If you're a programmer, I'd appreciate your feedback and ideas.

Duck Duck Go is intended to be a general purpose search engine and that isn't going to change. Our user base certainly reflects this purpose, i.e. is quite varied on every metric I've tried to measure.

Yet there are certain search niches like casual research where Duck Duck Go really excels. I'd like programming to be one of those areas.

To that end, here's what I've got so far.

  • A general search engine. The good news here is I know a lot of programmers who use it as their primary search engine. It works and (at least some) people really like it. I'm always willing to add new features whose absence is preventing people from switching. Currently on that list are some maps and images.

  • Zero-click Info. There are red boxes above links on some searches with info you can get without clicking, i.e. on-site. We have a lot of info that is specific to programming topics. Of course we have Wikipedia, e.g. Dijkstra's algorithm. But I've also added software sources, i.e. github, freshmeat, download.com, versiontracker, and sourceforge.

  • Category pages. I've mined sources to create useful topic lists for browsing/learning, e.g. Search Algorithms.

  • Disambiguation pages. I've created pages to help you isolate programming topics in common query terms, e.g. cookie links to HTTP cookie, which has results more geared toward that meaning. There are also programming specific disambiguation pages, e.g. nearest neighbor.

  • Crowd-sourced links. I also mine links from crowd-sourced sites, e.g. coroutine.

  • Wikipedia paragraphs. I've deep-indexed Wikipedia at the paragraph level. You don't have to match a topic nearly exactly anymore to get some Zero-click Info, e.g. python switch statement. This is way more than a regular search index, as it is sub-section/section/title aware and uses some NLP for relevancy. I hope to make that matching algorithm even more sophisticated over time.

  • Bang. There are a few hundred !x shortcuts that can be used, e.g. !cpan Net::DNS
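
To give a feel for what "paragraph level" and "section/title aware" mean in the Wikipedia item above, here is a toy sketch. It is not the actual matching algorithm, which uses NLP I am not showing here. The class, the function names, the weights, and the sample paragraphs are all hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass
class Paragraph:
    title: str    # article title
    section: str  # enclosing section heading
    text: str     # the paragraph itself


def tokens(s):
    """Lowercased set of alphanumeric words in s."""
    return set(re.findall(r"[a-z0-9]+", s.lower()))


def score(query, para, title_weight=3.0, section_weight=2.0):
    """Term overlap, with matches in the title and section counted more."""
    q = tokens(query)
    return (
        title_weight * len(q & tokens(para.title))
        + section_weight * len(q & tokens(para.section))
        + len(q & tokens(para.text))
    )


def best_paragraph(query, paragraphs):
    """Return the highest-scoring paragraph, or None if nothing matches."""
    best = max(paragraphs, key=lambda p: score(query, p), default=None)
    if best is None or score(query, best) == 0:
        return None
    return best


if __name__ == "__main__":
    corpus = [
        Paragraph("Python (programming language)", "Syntax",
                  "Python has no switch statement; if/elif chains are used."),
        Paragraph("Switch", "Networking",
                  "A network switch forwards packets between devices."),
    ]
    hit = best_paragraph("python switch statement", corpus)
    print(hit.title if hit else "no match")
```

The query does not match any article title exactly, but the context carried with each paragraph still lets the right paragraph win.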

Here's what I'm thinking of doing.

  • O'Reilly Paragraphs. I think it would be awesome if I could index all O'Reilly books at the paragraph level, like I've done for Wikipedia. This content is well-written, encyclopedia-like, is largely in paragraph form, and has surrounding contextual information (section titles, etc.) that will make the relevance matching excellent. Problem is, I don't know anyone at O'Reilly. I think it's a win-win because it can link right to their Safari product or individual book pages. And I don't think it would cannibalize Safari because you're getting people in a very different context (when searching). Anyway, I thought I'd start by writing them an email. I did that and haven't heard back yet.

  • More topic sources. I'm going to add man/info pages, so you can type in a command and get a description. I could also do packages for distributions/languages in a similar manner if people think that would be useful to them. I've explored indexing these at the paragraph level, but the content doesn't seem to work well for that purpose. Other, more general sources, like Amazon product descriptions, may be incidentally useful to programmers. I'd love your thoughts here.

  • Bang documentation. The current bang commands aren't documented. I'll document them as well as add more that are useful to programmers. Any you want?

  • Zero-click Info by IM. I'm thinking of making a chatbot that will respond to you via IM with Zero-click Info (and links). So you send it a search query and we'll send you back a description along with a few links. Would you use that?

  • API integration. I wrote the Perl binding for Wolfram Alpha. I'm exploring ways to use it to integrate good WA content. I'm open to using other APIs, but I'd strongly prefer to get dumps instead so I can ensure speed. Another one I'd like to integrate for programmers is ErrorHelp.com (previously bug.gd).
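
Since the !x shortcuts mentioned above aren't documented yet, here is a rough sketch of how a bang query could be turned into a redirect. It is an illustration only: the table, the function name, and the URL template are placeholders, not the actual Duck Duck Go bang list or targets.

```python
from urllib.parse import quote_plus

# Hypothetical bang table; the URL template is a placeholder.
BANGS = {
    "cpan": "https://metacpan.org/search?q={query}",
}


def resolve_bang(raw_query):
    """Return a redirect URL for '!x terms' queries, or None otherwise."""
    if not raw_query.startswith("!"):
        return None
    bang, _, rest = raw_query[1:].partition(" ")
    template = BANGS.get(bang.lower())
    if template is None or not rest.strip():
        return None  # unknown bang or nothing to search for
    return template.format(query=quote_plus(rest.strip()))


if __name__ == "__main__":
    print(resolve_bang("!cpan Net::DNS"))
```

Returning None for unknown bangs lets the caller fall back to a normal search instead of failing.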

That's where I'm at right now. If you're a programmer, my questions for you are:

  1. Do you find the above compelling?

  2. Do you have any particular feedback/ideas?
Feel free to comment below, on HN, on reddit, or email me directly.
