Recently in Technology Category

How do you completely de-personalize Google results?

 
I asked this question to Twitter the other day but didn't get any answers I hadn't heard before. I'm hoping this post uncovers something new.

pws.jpg

Plugging the hole in my personal backup system

 
funny-pictures-file-transfering-birds.jpg

A few weeks ago I lost a chunk of personal photos to the digital abyss. These were mainly photos of my first child from 1mo to 6mo. I was able to recover 1600x1200 versions of many good ones because I had uploaded them to posterous albums. And I was able to recover videos from the same time period because I had given a DVD of them to my mom. But we did lose larger sizes and a number of original shots.

I do a lot of backing up, but nevertheless there was a big hole in my system that I will relate in hope that you don't make the same mistake (and I don't make it again).

Online services I pay for

 
I currently pay for the following online services.


Entertainment

Hulu.jpg
  • What: regular TV on iPad.
  • Why: for in bed & on couch - less hot/cumbersome than laptop and can put on while using laptop.
  • Love: portability; queue (though you get that with regular Hulu).
  • Hate: some shows still Web only; has ads; endless buffering.


thedaily.png

  • What: daily newspaper on the iPad.
  • Why: initially bc sister works there (graphic designer), but now bc it is fun to read before going to bed at night.
  • Love: UX, i.e. nice browsing and readability features; good graphics.
  • Hate: can't go back to older issues since I don't pick it up every day.


Productivity


skype.jpg

  • What: 3-way video calling.
  • Why: hate going to meetings in person.
  • Love: way more personal than phone.
  • Hate: often people's Internet connections aren't fast enough.


prime_landing_logo._V212353139_.gif

  • What: free, two-day shipping on many Amazon products (no minimum, overnight for $3.99).
  • Why: going to the store is a hassle.
  • Love: shipping is so fast (often just one day).
  • Hate: how much stuff on Amazon is not prime-able.

googleapps.jpg

  • What: online email.
  • Why: solved my Gmail issues (slowness, sending limits).
  • Love: 2-step authentication (though can get that on regular Gmail).
  • Hate: can't add free (lower) accounts once paid for higher ones; switching between accounts a pain.


github.png

  • What: online code repository.
  • Why: easy private collaboration; could use my servers, but it has issues built in and don't have to share credentials/resources.
  • Love: easy segmented collaboration.
  • Hate: dealing with pull requests.


Ooma.gif

  • What: Internet-based home phone.
  • Why: free after you buy device
  • Love: subscribed to premium for emailed voicemail and automatic blacklist detection.
  • Hate: faxing doesn't work well.


HelloFax.png
  • What: online fax.
  • Why: my home fax doesn't work very well.
  • Love: easy.
  • Hate: feels expensive.


Infrastructure


Amazon Web Services.gif

  • What: cloud services.
  • Why: initially wanted easy failover and scaling capacity; now run as primary host.
  • Love: spin up new nodes in minutes.
  • Hate: dealing with EBS issues; instances fail more than (I think) they should.


linode.png

  • What: virtual (shared) servers.
  • Why: shared development resource.
  • Love: cheap, root access, no bs.
  • Hate: larger plans expensive.


dns made easy.png

  • What: managed DNS.
  • Why: speed, uptime, built-in monitoring, failover and global traffic redirection.
  • Love: just works, cheap, great support.
  • Hate: web interface is a bit clunky/confusing.


pingdom.jpg

  • What: web site monitoring.
  • Why: external notifications when things go down.
  • Love: response time reports (from around the world).
  • Hate: expensive; will prob let expire since I have three other monitors now.


sdensity.jpg

  • What: server monitoring.
  • Why: gives alerts when on-server things go awry.
  • Love: Android notifications; alert types (exact process, system resources); great support.
  • Hate: new alerts don't remember your default (usual) settings. 


Jungle Disk.gif

  • What: online backup.
  • Why: peace of mind.
  • Love: can do network drives; set and forget; can use own S3/rackspace credentials if desired.
  • Hate: nothing, though thinking of moving my music files to Amazon for the streaming ability.


Full disclosure: the Amazon, DNSMadeEasy, Linode, Ooma and Hulu links above have embedded affiliate codes (and in some cases discounts). Just did that for the fun of it (don't really expect/care if I make $5 or whatever).


Update: there are some good comments on HN about other services people pay for.

Update2: there are more good comments on Lifehacker.

Toddler app user interface guidelines

 
My son Eli has been using iPad apps since he was one and we have about 50 toddler apps. With the big caveat that this post is based off essentially a sample size of one, here are some toddler app user interface guidelines.

  • Load the app as fast as possible. By this I mean two separate things: first, reduce the time between pressing the icon and being able to interact with something on the screen. If this takes too long, impatience sets in and the app is likely to face the gong (home button).

    Second, reduce the number of screens before the actual app appears to zero, i.e. go right into the app. Yes, I understand you want to upsell and do other stuff there, but just don't. When you do that, especially when you put links to the app store there, the toddler will click them, and then get confused. For example, Park Math is a great app, but I hate their home screen -- that big yellow bus is so attractive to click!

    button.png


  • Move all settings out of the app. For iOS at least, you can move settings out of the app and into the general settings window. Please do this because toddlers are drawn to your little setting icons, which a) destroy the flow of the app and b) let the toddler change all the settings and put the app into an annoying state, e.g. in another language, too hard for them, etc. Interactive Alphabet is another app we like, but it really needs to drop those 4 unnecessary buttons on the bottom of the home screen.

    abc.png


  • No pop-ups/notifications. See a theme here? If there is a non-app associated thing the toddler will click it :). Pop-ups are the worst because they are modal giving them a 50/50 chance to click on the wrong thing. They're just not appropriate in these apps. There must be a presumption in these scenarios that the parent is controlling the app at that point, but at least in our household this is usually not the case. He scrolls through the screens and picks the apps he wants to use.

    The worst offenders I've seen on this point are the Dr. Seuss apps. These apps need work in general and I don't really recommend them, but this aspect particularly annoys me.

    popup2.png


  • Make everything tappable/clickable. Toddlers love to interact with the app and point out and press everything. When things respond to those taps, it makes the experience a lot better. Itsy Bitsy Spider does this really well.

    spider.png


  • Change it up/give surprises. The best apps not only make everything clickable, but also do different stuff from screen to screen and even hide games/easter eggs. For example, a click may do something but a long-click may do something else. In a book, one page you may have to color and the next page you may have to click a series of things. Jack and the Beanstalk by Ayars (there are several) does this excellently. Here's an example page where they have this mouse dragging game action built in.

    mouse.png


  • Give multiple ways to do things. Sometimes it is unclear what to do, or the toddler hasn't mastered a way to do something, e.g. the swipe motion. So it is better to offer an alternative. In the swipe example, little arrows to turn pages work well.

  • Give hints. If there is no activity for a bit, it is great to give the toddler a hint of what to do, e.g. by highlighting something they could/should tap. A similar behavior is to offer a mechanism to get hints when something is hard to do. Animal Hide & Seek Adventure does this well. You're supposed to tap the hiding animals, and they offer a dock on the bottom to shake the animals if you can't find them.

    hideseek.png


  • Add delays. Once a toddler understands how to do something, like turn a page, they want to just keep doing it. But this leads to them doing things like rushing through the app without really getting much out of it. The way around this is to delay some of the interface elements (including hints). For example, instead of showing the arrow to the next page right away, wait 5 sec. before doing so. I actually haven't seen anyone doing this yet, but would really appreciate it. The same goes for hints, e.g. the animal game I mentioned above. It would be better if the hint dock did not show all the time.

  • Give instructions. Literally tell the toddler what to do, i.e. speak instructions to them. This can be done in conjunction with the hints and delays mentioned above, and also at the beginning of a task. Monkey Preschool Lunchbox (also on Android) does this very well.

    monkey.png


  • Update the app. It's great when you can update the app and the toddler can see the changes. It doesn't have to be new features, but could just be new themes or other look changes. Elmo's Monster Maker does this well by releasing seasonal and holiday updates.

    elmo.png


  • Highlight words and letters as you say them. When reading to the toddler, I think it will help them associate words and letters with the sounds better if you highlight what you're saying as you're saying it. The Monster at the End of This Book does this well.

    grover.png


Finally, here are some gripes with iOS:

  • Home button needs an off switch. I need some way to disable the home button or make it harder to access during app play, e.g. a triple click or some other morse code sequence.

  • Need a way to hide videos. Eli knows how to get to the videos. He can find the icon no matter where I put it. I can disable videos through restrictions, but that doesn't really solve the problem. I would really like to be able to hide this icon like you can do for system icons on Windows. Another option would be to put the restriction on the icon itself and force me to enter the password when clicking on it. Come to think of it, this would work for the home button too.

Update: there are tons of excellent comments below, including many from great app makers and other parents like myself. Here are some additional guidelines.

--Enable multi-touch
--Setting to disable photo taking. 
--No in-game purchases.
--Cheer for the toddler when they do something right.

Browser market :: Search engine market

 
Here is a chart of the last six years of relatively believable browser share #s (from Net Applications via Wikipedia). In it you can clearly see multiple alternative browsers gain a foothold in the market. 

browsers2.png

Anecdotally this seems true as well. I know plenty of people who use each of these as their primary browsers, with the exception of Netscape (not sure what is going on there--AOL?). And more often than not, it seems people actually switched to these alternative browsers (as opposed to having them forced on them by their OS, OEM, IT dept, etc.).

Is this an apt analogy for the future of the search engine market? 

I'm hesitant to quote any search engine market share #s currently because I frankly just don't believe them, or at least they don't seem to line up with anecdotal evidence at all. Google seems currently way more dominant than they sometimes get credit for.

Yet at the same time alternatives are arising. And more importantly wrt the analogy, there are alternatives that in my opinion offer a different enough overall search experience to be preferable to some people, much like Web browsers.

Search leakage is not FUD. Google et al., please fix it.

 

Lately I've been accused by some of spreading fear, uncertainty and doubt (FUD) by trying to let people know their search terms are being leaked to the sites they click on. I hope to address those concerns in this post.


For those of you who have no idea what I'm talking about: when you click on a link on the Internet, where you clicked from gets automatically sent to the site you clicked on (most of the time). 

For example, if you're on yahoo.com and you click to a story at the New York Times, your browser will send to nytimes.com some information that you came from yahoo.com -- namely, the Web address of the page you were just on. This info is called the Referrer.

At issue here is that sometimes the Referrer contains personal information. In particular, when you use most search engines, your search terms are included in the Referrer. That is, when you search on Google/Bing/etc., and you click on a link, your search terms are sent to the site you clicked on. This search leakage doesn't happen at DuckDuckGo.
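
To make the leak concrete, here is a minimal sketch of how a receiving site could pull search terms out of a Referrer value. The function name, the `q` parameter default, and the example URL are illustrative assumptions; real search engines use a variety of URL formats.

```python
from urllib.parse import urlsplit, parse_qs


def search_terms_from_referrer(referrer, param="q"):
    """Return the search terms embedded in a Referrer URL, or None.

    `param` is the query-string key assumed to hold the search terms.
    """
    query = urlsplit(referrer).query
    values = parse_qs(query).get(param)
    return values[0] if values else None


# Hypothetical Referrer value a site might receive after a search click.
example = "http://www.google.com/search?q=duck+pictures&hl=en"
print(search_terms_from_referrer(example))
```

A Referrer without a query string (for example a plain homepage URL) simply yields `None`, which is the behavior you would want from a search engine that strips the terms before the click goes out.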


Now, let's take the FUD arguments in turn.


One site having one of my search terms is irrelevant. That may generally be the case, but unfortunately, tens of millions of sites run ads from just a handful of ad networks. Those ad networks can aggregate your search terms and piece together a large percentage of your search history. 

So the question then becomes do you care if third parties (not associated with your search engine and not bound by its privacy policy) have a significant % of your search history? If you don't care about that, then you probably don't care about Referrers. 
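
As a rough illustration of the aggregation concern, the sketch below shows how a third party embedded on many sites could stitch leaked terms into a per-visitor history. The class, the cookie ids, and the assumption that the embedding page passes its own Referrer along to the network are all hypothetical; this is not a description of any particular ad network.

```python
from collections import defaultdict
from urllib.parse import urlsplit, parse_qs


class AdNetworkLog:
    """Toy model of a third party collecting leaked search terms."""

    def __init__(self):
        # cookie id -> list of search terms seen across all sites
        self._terms_by_cookie = defaultdict(list)

    def record(self, cookie_id, page_referrer):
        """Store the search terms (if any) found in a page's Referrer."""
        values = parse_qs(urlsplit(page_referrer).query).get("q")
        if values:
            self._terms_by_cookie[cookie_id].append(values[0])

    def history(self, cookie_id):
        """Return everything collected for one visitor, in arrival order."""
        return list(self._terms_by_cookie.get(cookie_id, []))


log = AdNetworkLog()
# The same visitor lands on two unrelated sites that embed the same network.
log.record("visitor-1", "http://www.google.com/search?q=back+pain")
log.record("visitor-1", "http://www.google.com/search?q=cheap+flights")
print(log.history("visitor-1"))
```

No single site in this sketch sees more than one term; only the party present on both sites ends up with the combined list.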


It's not Google's fault. Your browser sends that stuff. That's true, but Google et al. could easily fix it. It is a technically trivial fix. In fact, Google had done it for a bit when they switched to using Ajax.

So the question then becomes if you're a company that cares about user privacy and can easily stop third-parties from piecing together your users' search histories, why wouldn't you do it?

In other words, I find this FUD argument to be a straw man argument. While you can fault the browser or the Internet, that doesn't mean someone who is able shouldn't come in and fix it.


It would hurt SEO. The only reason I've heard to not prevent search leakage is that marketers use Referrer info to do better search engine optimization (SEO).

But the information doesn't have to disappear, just the current mechanism of transferring the information in a personally identifiable way. Google et al. could provide sites with the information in an anonymous fashion. At that point, I think the only thing marketers couldn't do would be to dynamically serve you different pages based on your personal search terms.

So the question then becomes is that trade-off worth it? 


Google Webmaster Tools (GWT) doesn't provide that full information. Matt Cutts wants me to stop saying GWT can solve this marketer problem because while GWT provides a lot of information, it does not currently provide all the terms people search for to get to your site. That's true; sorry Matt. 

But the key word is currently. There is no reason I can see why it couldn't provide a more comprehensive view into this data. 


Google provides ways to opt-out.  The only thing I know that somewhat protects you from Referrers is Google's encrypted version, which doesn't protect you fully (because https->https traffic still sends Referrer headers).  

Most people have no idea that the encrypted version is related to this problem, or that it even exists. Furthermore, you still can't just type in https://google.com/ to get there (you have to add the www.).

But all that is beside the point, because you shouldn't have to opt-out of this search leakage in the first place. Your search results won't suffer -- Google still has your history. 

Therefore, it should be the default. Matt says SSL can't be the default because of latency, but that is another straw man argument IMHO. You don't need SSL to solve this problem as evidenced by their Ajax incident and DuckDuckGo.


You're just attacking Google when Bing et al. do it too. I want everyone to solve this issue and I've tried to put "et al." in this post a lot. However, the reality is Google is synonymous with search. Despite what search market share #s say (I still don't grok them), pretty much everyone I talk to about search talks about Google. 

In any case -- Bing, Yahoo, etc. -- if you're listening, please solve this issue at your search engines too.


To summarize, here's my basic argument:

1) Search engines say they care about user privacy.

2) They are currently allowing third-parties to aggregate user search history by not blocking the browser from sending search terms in the Referrer header.

3) There is an easy fix.

So why isn't the fix a no-brainer?


Here is a representative example of feedback emails I get on this subject. I got this user's permission to share.

I just replaced Google with DuckDuckGo as my default search engine. I'm VERY tired of having advertisers jump all over me everytime I do a search for, well, anything.

For example: watching THE TUDORS on iTunes, one of the characters had gout. I wanted to know if gout was a recognized disease during the time of the Tudors. So I Googled "gout", and checked out the wikipedia entry on the subject. Turns out it was in fact a recognized disease at the time (although they had no idea what caused it). I don't have the disease. I don't personally know anyone who does. I certainly don't have any need for medications that treat gout. But now I'm constantly bombarded with ads for all kinds of drugs intended to treat it. 

All I did was get currious, just once, about a disease suffered by a TV character on a show I like to watch, and now every advertiser on the planet is apparently convinced that either I, or someone I know, has gout, and they're not about to pass up even the most minuscule chance of selling me something.


Here's the official response Google gave to Wired:

"It's unfortunate that DuckDuckGo is preying on people's fears and offering incomplete information in order to garner attention," a company spokeswoman said in an e-mailed statement.

"For example, it is inaccurate to say that Google uses sensitive health-related terms to target ads on affiliated web pages."

"All search engines and websites use referrer terms as part of the architecture of the web, but we recognize our responsibility to protect the data that users entrust to us and we give them meaningful choices to protect their privacy."

The meaningful choice here would be to drop the personal information from the Referrer. 


Finally, I'm not alone in this call to action. Christopher Soghoian, who previously worked at the FTC and had been a Google intern, filed an FTC complaint in October of last year on this very subject. Here's his post on it and the associated WSJ post.

Address books and social graphs

 
Lately I've been noticing a new viral strategy popping up, and I'm not quite sure what to make of it. Here's how it works. You upload your Google contacts to the site, perhaps to find other people you know or as part of some other functionality. (Google works better than other sites because of the way Gmail implicitly adds a contact for each correspondence.)

Then I come in to the site through a referral email. Without entering in anything, the landing page can be tuned to my social connections based on my email's appearance in the address books of my contacts. That is, the service has saved all the previous contact lists so they can make a behind-the-scenes social graph for use in converting me. 
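
A minimal sketch of that behind-the-scenes graph, assuming the site keeps every uploaded contact list and indexes it by contact email. The class and method names are hypothetical and the addresses are placeholders.

```python
from collections import defaultdict


class AddressBookGraph:
    """Inverted index from a contact's email to the members who uploaded it."""

    def __init__(self):
        # contact email -> set of members whose uploads contained it
        self._known_by = defaultdict(set)

    def upload(self, member, contacts):
        """Record one member's uploaded address book."""
        for email in contacts:
            self._known_by[email.strip().lower()].add(member)

    def who_knows(self, email):
        """Members who have this email in their address book (sorted)."""
        return sorted(self._known_by.get(email.strip().lower(), set()))


graph = AddressBookGraph()
graph.upload("alice", ["Bob@example.com", "carol@example.com"])
graph.upload("dave", ["bob@example.com"])

# Bob arrives via a referral email; he has never entered anything himself.
print(graph.who_knows("bob@example.com"))
```

The landing page only needs the one lookup at the end, which is why the technique works before the visitor has an account.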

Of course when the site is bigger they can use their inherent social graph for most of this logic. But when they're just getting started (or entering new networks), this address book component can really increase their viral coefficient.

This technique isn't definitively bad. You could imagine a big disclosure on the front end (when uploading). On the other side (me coming to the site), it is improving my experience by putting the site into the context of people I know.

But it can get creepy too. I doubt even with seemingly valid disclosures, people realize that their info would be used outside their own account. Also, it can lead to interesting "people you may know" suggestions, sort of the equivalent on twitter of people your friends follow that also follow you, but you otherwise have no shared connections. Those recommendations always leave me wondering, how did they make that connection?

Taking that a step further, the site could continually recommend that I invite people who aren't currently on the site, without me uploading my address book. That is, they know this person is not yet a member of the site and they also know their name and email based on previous uploads. So they could just present the person's name to me, get my authorization, and then kick off the referral email on my behalf without me entering anything.

Thoughts on Yahoo! BOSS Monetization II

 
boss.png
Last year (Feb 11, 2009), Yahoo announced upcoming usage fees for its BOSS API. At that time, I wrote up my thoughts on BOSS monetization.

That was before the Microsoft/Yahoo search deal. Yesterday, Yahoo! announced more upcoming BOSS changes in light of that deal. Or, they announced that nothing will change... 

BOSS will apparently remain in existence despite Bing's powering of Yahoo! organic results. And BOSS will get a fee structure just like was proposed last year. Actually, it might be a bit better than that because there is talk of a revenue share program that would waive fees, but we'll have to wait for the details on that to come out.

While I use BOSS myself for DuckDuckGo, and will probably continue to do so in some capacity, I find the idea of paying for it a bit curious for a few reasons.

  • Downtime. BOSS has had a lot of downtime. Recently, it was down for about 24 hours straight. You can't build a reliable search engine on that kind of downtime. Moreover, there is no API status dashboard, which I've pushed for, and there is also no reliable way to communicate with Yahoo! about BOSS downtime. They say to use the usergroup, but is anyone really checking it at 4AM? It doesn't seem like it.

    If I'm paying I would hope to have an SLA with some way to reach the ops team. Sometimes it feels like I'm the only one monitoring! I get alerted (by email and text message) when it goes down and report it after I determine it isn't my fault. And I'm pretty much the only one reporting it like this.

  • Bing. If Bing is powering Yahoo's organic results, will it also be powering BOSS? The announcement seems to indicate this will be the case. If so, why not just use Bing's API, which is currently free and would give the exact same results? If not, then why wouldn't we expect it to just degrade in quality as Yahoo spends less time and money on generating their own organic results (e.g. on crawling, indexing, etc.)?

    If I'm paying I would hope to have solid answers to these questions. I hope that they do continue crawling and indexing, because for me I'd like to use reliable independent APIs. Otherwise, the combined entity is essentially a single point of failure, or maybe single+epsilon.

  • Bugs/Roadmap. There have been a lot of outstanding bugs in the BOSS API for a while now, which I have worked around. For example, the API will change your query in some cases and won't tell you. There hasn't been great communication as to a roadmap, and I get the sense that everything was sort of on hold for a long time because of the Microsoft deal. If the commitment to BOSS is serious, please start treating it accordingly.

In any case, DuckDuckGo should be fine!

Update: in hope that this post does not come across as complaining, I want to add that I'm very grateful for the BOSS platform. I don't know where I'd be today without it. At the same time, I'm in a somewhat unique position to comment on it, and I think the above are valid concerns with regards to charging for it that could benefit from more public discussion. 

Update2: Ashim (head of BOSS) just released this update on the forum, clarifying that indeed BOSS Web results will be powered by Bing in the future.

A new approach to mobile search

 
iphone.jpg

The new DuckDuckGo App is up on iTunes (and is iPad compatible). Android and BlackBerry versions are also in the works.

The premise behind this search app is simple: going to Web sites on your phone to find information is a pain.

In response, the goal of the DDG app is to get you the information you want in zero clicks, without having to go to any sites.

To do that, we go out to the top links in real time and pull back the most relevant paragraphs for your particular search. These paragraphs go in a section labeled 'Topic Summaries.' They function like normal links, but with the readable paragraphs on top.
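
To give a feel for the idea, here is a deliberately simple sketch of picking the most relevant paragraph from a fetched page. It scores by distinct query-term overlap, which is an assumption made for illustration only and not a description of how the app actually ranks paragraphs; the function names and sample text are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase a string and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def best_paragraph(query, paragraphs):
    """Return the paragraph sharing the most distinct terms with the query.

    Returns None when no paragraph shares any term with the query.
    """
    terms = set(tokenize(query))

    def score(paragraph):
        return len(terms & set(tokenize(paragraph)))

    best = max(paragraphs, key=score, default=None)
    return best if best is not None and score(best) > 0 else None


paragraphs = [
    "Contact us for pricing.",
    "PuTTY is a free SSH and telnet client for Windows.",
    "Copyright 2010.",
]
print(best_paragraph("putty ssh client", paragraphs))
```

A real implementation would also need to fetch the pages, strip navigation and boilerplate, and handle pages with no useful paragraph at all.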

In addition, you get all the "Zero-click Info" that you get on the Web site, e.g. snippets from Wikipedia & over 25 other sources, instant answers from us & WolframAlpha, etc. You also get the content pages we make for the Web site (formatted for mobile), e.g. disambiguation & category pages, related topics, etc. And you get normal links as well, with site icons next to them.

Of course, you don't always get what you're looking for with zero clicks. It's a work in progress, and with your feedback I'm sure it will improve over time. But right now I believe it offers a unique and compelling mobile search experience. It really shines when you just want some quick information on a topic.

The DDG app also protects your privacy. All searches are performed over an encrypted connection (https/SSL), your search strings are not shared with the sites you click on, and no IP addresses are stored on our servers. In short, it follows our privacy policy.

The !bang syntax, which takes you to 100s of sites directly, and most goodies from the Web site also work. There are links explaining both within the app (on the home screen). 

You will notice there are not currently any ads. We tested some but found them to be too distracting and irrelevant. In the future, we may put up one unobtrusive, relevant context ad per search.

Thank you Chris Heimark for making this app possible and also thank you to all the people who beta tested it and gave us great feedback. I'd really appreciate it if you'd check it out now and give us your feedback as well.

My putty settings. What are yours?

 

putty.png

I use putty all the time to connect to my servers and local VMware images. I actually develop over putty via SSH (in emacs).


There are a lot of putty settings and I haven't explored them all, but there are some I make sure are always on. These are:

  • Window -> Lines of scrollback -> 10000. The 200 default is sort of ridiculous. I often have long script outputs and want to scroll back.

  • Window -> Behaviour -> Window title -> <server name>. I currently have 19 putty windows running. If they weren't individually named by server, it would be a disaster.

  • Terminal -> Bell -> None (bell disabled). If I don't turn that bell off, I get a headache after like 5 min.

  • Window -> Translation -> UTF-8. The default (Latin encoding) causes all sorts of problems if you tail a log with UTF-8 output.

  • Connection -> Data -> Auto-login username -> yegg. Saves time.

  • In the past, I've also messed with default colors (Window -> Colours), but have always gone back to the default. I've also changed the keepalive (Connection -> Seconds between keepalives), but my current servers simply don't drop ever.

What else do you set?

Google Web spam

 
Yesterday, @mims wrote this post on "content-mills," which prompted this discussion on HN about Web spam. Many of the comments are by moultano, who is on Google's search quality team. This particular comment really drew my attention:

I doubt you'll find MFA spam to be better on DDG than on Google, but please, if you see a query where they are beating us. Send it over. :) I can guarantee you that I'll get a lot of eyes looking at it.

At DDG, I mainly crawl looking for these types of spam domains. On my last crawl, I identified about 37.8M domains as spam in the com/net/org/biz/info/us TLDs. I found Web sites at another 61.3M domains; the rest timed out. So roughly 40% of the domains I visited (with sites) were spam.

I just took a random sample of those spam domains and checked them against Google's index. All of this code as well as the sample and results are now on github.

First I started checking against Google's Web site directly, but their bot detection quickly shut me down. I was able to check 589 domains before being shut down, using the site: syntax. The results are here. The second column is the # of results reported in the index. For example, you can verify the first one with this query.

Of those I checked, 302 came up with at least one result, i.e. are in their index in some form. That means (extrapolating) roughly 50% of my spam domains are in Google's index, or about 19M domains.

Once shut off, I moved to Google's search API to process the full 10K sample. Interestingly though, it apparently returns very different results. For example check out web vs api. The Web shows 1 result, whereas the API shows none. 

Weird. I carried it out anyway though. Of the 10K full sample, I found 719 in Google's API index, or 7%. If you extrapolate that to the full list, that would be ~3M spam domains in the index. 
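
For anyone who wants to check the extrapolations, the arithmetic can be reproduced with the counts quoted above. The variable names are mine; the numbers in the post are rounded versions of these results.

```python
# Counts quoted in the post.
SPAM_DOMAINS = 37.8e6   # domains identified as spam on the last crawl
OTHER_SITES = 61.3e6    # domains with Web sites that were not flagged

# Share of visited domains (with sites) that were spam: about 38%.
spam_share = SPAM_DOMAINS / (SPAM_DOMAINS + OTHER_SITES)

# Web check: 302 of 589 sampled spam domains had at least one result.
web_rate = 302 / 589                     # about 51%
web_estimate = web_rate * SPAM_DOMAINS   # about 19.4M domains

# API check: 719 of the 10K sample were found.
api_rate = 719 / 10_000                  # about 7.2%
api_estimate = api_rate * SPAM_DOMAINS   # about 2.7M domains

print(f"spam share of visited sites: {spam_share:.1%}")
print(f"web:  {web_rate:.1%} -> {web_estimate / 1e6:.1f}M domains")
print(f"api:  {api_rate:.1%} -> {api_estimate / 1e6:.1f}M domains")
```

Both extrapolations assume the sampled domains are representative of the full spam list, and the Web check covers only the 589 domains processed before the bot detection kicked in.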

In any case, these #s are pretty conservative estimates because a) I'm only covering about half the domain space (missing all the country tlds except .us), and b) I know I still have a lot of false negatives (please send me them when you see them).

On the other side, the way I do the identification, there are minimal false positives at the time of identification. However, sites turn from spam/non-spam all the time, and since it takes me a while to crawl, there are certainly a few false positives in there. 

There are also legitimate false positives, and if you see those, please report them as well. I did nothing to hide those from view here, so you can see for yourself in the results.

Of course this says nothing about how much they appear in the rankings. I tried to find the modern equivalent of Metaspy to get some random queries, but I couldn't find such a service in existence. Nevertheless, half of the spam domains are not in the index, so it raises the question: why the difference? 

If people have lots of links from Google results saved, I'd be happy to run them against my list.

Weird eHow Web spam

 
fnboelwein.com redirects to http://www.ehow.com/apply-card-credit-online/. As does bankofelgin.com.

If you actually go to the link, you get this message at the top:

Hi There! bankofelgin.com isn't available, but you're still in a good place -- ehow.com. We think we might have what you're looking for.

Doubt it.

Both domains have 64.74.223.39 as an A record, which is different than the redirect IP. And both have proxied whois records. However, they all have the same Server headers:

Server: Microsoft-IIS/6.0
X-Powered-By: ASP.NET

So I'm inclined to believe these domains are actually powered by eHow, which isn't too surprising since eHow is owned by Demand Media.
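
The kind of comparison I did by hand can be sketched as grouping domains by a shared fingerprint. This assumes the A records and response headers have already been collected; the function name is hypothetical, and the third domain and its values are made up to show a non-match.

```python
from collections import defaultdict


def group_by_fingerprint(observations):
    """Group domains sharing the same (A record, Server, X-Powered-By).

    `observations` is an iterable of (domain, ip, headers) tuples.
    Only groups with more than one domain are returned.
    """
    groups = defaultdict(list)
    for domain, ip, headers in observations:
        key = (ip, headers.get("Server"), headers.get("X-Powered-By"))
        groups[key].append(domain)
    return {key: names for key, names in groups.items() if len(names) > 1}


iis = {"Server": "Microsoft-IIS/6.0", "X-Powered-By": "ASP.NET"}
observations = [
    ("fnboelwein.com", "64.74.223.39", iis),
    ("bankofelgin.com", "64.74.223.39", iis),
    ("unrelated.example", "203.0.113.7", {"Server": "nginx"}),  # made up
]
print(group_by_fingerprint(observations))
```

Since IIS with ASP.NET is a very common pairing, matching headers alone are weak evidence; the shared A record and the identical redirect target carry most of the weight here.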

I wonder how many of these domains are out there? I found these two because they happen to both be part of my spam/parked domains training set.

I'm starting crawl #18 to detect and weed out such domains from DuckDuckGo. Each time I start a crawl I make sure I have no existing false positives or negatives in my training set.

Interestingly, every time a lot of domains flip from parked to unparked and vice versa. These two fell into the false negative category since I wasn't labeling these pages as spam. Maybe I should...

Top linked domains from Facebook pages

 
I've been messing around with Facebook pages for an upcoming DuckDuckGo integration, and I came across some data that seemed interesting enough to share. These are the top linked domains from Facebook pages.

  1. myspace.com (269588)
  2. twitter.com (97669)
  3. youtube.com (54238)
  4. facebook.com (50234)
  5. flickr.com (12541)
  6. en.wikipedia.org (10578)
  7. reverbnation.com (10144)
  8. fotolog.com (7840)
  9. imdb.com (5250)
  10. purevolume.com (5217)
  11. last.fm (3740)
  12. linkedin.com (3730)
  13. soundcloud.com (3713)
  14. ilike.com (2652)
  15. cdbaby.com (2377)
  16. apps.facebook.com (2213)
  17. it.wikipedia.org (2019)
  18. sites.google.com (1984)
  19. etsy.com (1977)
  20. vimeo.com (1941)
  21. bebo.com (1939)
  22. sonicbids.com (1923)
  23. modelmayhem.com (1780)
  24. profile.myspace.com (1692)
  25. wix.com (1633)
  26. amazon.com (1613)
  27. soundclick.com (1401)
  28. tinyurl.com (1364)
  29. fr.wikipedia.org (1348)
  30. cafepress.com (1304)
  31. bandzone.cz (1279)
  32. freewebs.com (1264)
  33. es.wikipedia.org (1236)
  34. google.com (1195)
  35. itunes.apple.com (1113)
  36. zazzle.com (1004)
  37. dailymotion.com (996)
  38. friendster.com (982)
  39. imeem.com (901)
  40. bit.ly (895)
  41. profiles.friendster.com (893)
  42. new.facebook.com (872)
  43. virb.com (728)
  44. yelp.com (707)
  45. groups.yahoo.com (686)
  46. picasaweb.google.com (673)
  47. web.me.com (669)
  48. metroflog.com (657)
  49. geocities.com (633)
  50. bbc.co.uk (584)
In particular, this list aggregates domains extracted from links within the 'Website' sections of Facebook pages. For example, on the DuckDuckGo Facebook page there is a link to the homepage (duckduckgo.com) and to the DDG Twitter stream (twitter.com) within that section. Each of those domains would get one point in the aggregated list. If duckduckgo.com had appeared twice, it would still just get one point.
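The counting rule above (one point per domain per page, however many times the page links to it) can be sketched like this. It is a minimal illustration, not the crawler's actual code: the sample pages are made up, the helper names are hypothetical, and stripping a leading `www.` while keeping other subdomains separate is an assumption based on how the lists look (en.wikipedia.org and profile.myspace.com have their own rows).

```python
from collections import Counter
from urllib.parse import urlparse


def domain_of(url):
    """Return the lowercased host of a URL, without a leading 'www.' (assumed)."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def top_linked_domains(pages):
    """pages: one list of 'Website' links per Facebook page."""
    counts = Counter()
    for links in pages:
        # A set gives each domain at most one point per page.
        counts.update({domain_of(url) for url in links})
    return counts.most_common()


# Made-up sample data: the first page links to duckduckgo.com twice.
pages = [
    ["http://duckduckgo.com/", "http://www.duckduckgo.com/about", "http://twitter.com/duckduckgo"],
    ["http://twitter.com/someband", "http://www.myspace.com/someband"],
]

for domain, points in top_linked_domains(pages):
    print(domain, points)
```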

Of course, real people took the time to link to these domains in the context of promoting their online Web presences, so it was interesting to me what they chose in the aggregate. This data confirms the anecdotal evidence I keep seeing: people promote their FB and Twitter together. I was also intrigued by how high myspace was; I suppose a lot of bands still use it and/or haven't updated their old FB pages.

There were a few sites I actually hadn't heard of, e.g. some of the music stuff, wix, modelmayhem & virb. Not that I should hear of every site, but these must have a lot of traction already to be that high in these lists. 

If you just look at "high quality" FB pages (custom urls, no default images, etc.), you get a similar but slightly different list & ordering.

  1. twitter.com (32597)
  2. myspace.com (28125)
  3. youtube.com (11511)
  4. facebook.com (11007)
  5. flickr.com (3225)
  6. reverbnation.com (1413)
  7. linkedin.com (1267)
  8. en.wikipedia.org (1016)
  9. ilike.com (787)
  10. last.fm (763)
  11. purevolume.com (685)
  12. soundcloud.com (681)
  13. vimeo.com (608)
  14. imdb.com (483)
  15. apps.facebook.com (411)
  16. bebo.com (389)
  17. cdbaby.com (376)
  18. sonicbids.com (372)
  19. bit.ly (337)
  20. tinyurl.com (279)
  21. itunes.apple.com (273)
  22. google.com (264)
  23. fotolog.com (263)
  24. friendfeed.com (259)
  25. nscs.org (242)
  26. imeem.com (241)
  27. etsy.com (212)
  28. it.wikipedia.org (209)
  29. modelmayhem.com (200)
  30. itunes.com (188)
  31. amazon.com (178)
  32. cafepress.com (171)
  33. delicious.com (154)
  34. yelp.com (153)
  35. zazzle.com (153)
  36. dailymotion.com (146)
  37. virb.com (133)
  38. ustream.tv (111)
  39. soundclick.com (100)
  40. bbc.co.uk (99)
  41. legacyrecordings.com (97)
  42. friendster.com (93)
  43. blogtalkradio.com (92)
  44. digg.com (92)
  45. formspring.me (90)
  46. picasaweb.google.com (90)
  47. lululemon.com (90)
  48. woodstock.com (86)
  49. groups.yahoo.com (86)
  50. de.wikipedia.org (86)

Here are the top types (counted for pages with at least some info on them).

  1. Musician (421154)
  2. Other Business (417494)
  3. Other Public Figure (232868)
  4. Professional Service (140365)
  5. Non-Profit (129352)
  6. Website (106490)
  7. Products (95604)
  8. Education (76957)
  9. Store (64733)
  10. Visual Artist (61575)
  11. Club (61043)
  12. Restaurant (56447)
  13. Health and Beauty (51983)
  14. Sports / Athletics (49001)
  15. Fashion (46189)
  16. Food and Beverage (43737)
  17. Communications (36180)
  18. Athlete (30520)
  19. Religious Center (29811)
  20. Technology Product / Service (28728)
  21. Actor (28101)
  22. Hotel / Lodging (27358)
  23. Sports Team (27322)
  24. Online Store (26261)
  25. Film (25190)
  26. Religious Organization (25093)
  27. Writer (25082)
  28. Bar (24957)
  29. Politician (24139)
  30. Consumer Product (22421)
  31. Comedian (22359)
  32. Real Estate (22296)
  33. Technology and Telecommunications Service (21176)
  34. Model (20944)
  35. Event Planning Service (20671)
  36. TV Show (18739)
  37. Museum / Attraction (17960)
  38. Game (16729)
  39. Travel (16097)
  40. Pets (15511)
  41. Retail (15123)
  42. Travel Service (15018)
  43. Automotive (14530)
  44. Cafe (14237)
  45. Government (13601)
  46. Medical Service (12909)
  47. Automotive Dealer / Vehicle Service (9729)
  48. Home Living (9227)
  49. Home Service (8247)
  50. Library / Public Building (7086)

Note that data from this crawl was completed before the whole open graph/like thing, so these were all "real" pages. I'm currently crawling all the new stuff and working on ways to "keep it real," so to speak.

My Gmail is fast again

 
Gmail_logo.png
After my super-slow Gmail post was picked up on HN and on NYT, Google reached out to me. I gave them my username and 39hr later my account is back to normal.

I got approval from the person who communicated with me to share the following snippet of our conversation.

"The team is still looking into your account slowness, but it initially appears that the problem is isolated to a small subset of Gmail users...They are still investigating the root cause of the slowness but in the meantime have moved your account to a different set of servers, which should help."

Gmail has become unusably slow

 
When I switched to Gmail in 2004, I believed the hype. Never delete a message again--no need. We have tons of space, and you can search it all really fast like Google.

That time has passed. Gmail has gotten slower and slower for me, and as of the last few weeks it has become unusably slow. Before you ask, yes, I've tried it across lots of browsers and computers.

It can take 20sec to switch labels, and even longer to search for something. But here's the worst part--it takes just as long to send a simple message!?! Why? What does sending have to do with anything?

It's become the bottleneck in my day, and I don't know what to do about it. And I'm not alone.

A few days ago I decided to start taking action. First I emailed support. OK, first, I tried to email support. 

Have you ever tried to email Google support? It's almost impossible to find the contact form. Here's the support home page. I dare you to find out where to report this slowness issue.

You get to this page on slowness. After going through the wizard, you click on 'report your issue' at the bottom, and it takes you here. Wait, that's not a contact form, and you can't get to one from that page! Anyway, here is a contact form; I found it going through another problem wizard.

Needless to say, I haven't heard a response :)

Next step: I disabled chat, buzz & tried the older versions of Gmail. No luck. Then I disabled all labs, after which I perceived a very modest improvement, but still unusable.

Next I removed most of my labels. I have four now (down from 32). This seemed to help a bit as well, but still not much.

So this morning I went drastic. I deleted all my contacts and started deleting mail. Ridiculous, huh? That totally breaks the original selling point of Gmail, but like I said, I'm at my wits' end here.

Deleting stuff has resulted in the biggest improvement so far, but it's still slow. Perhaps a bit better than unusable now, but still terrible.

You are currently using 4247 MB (56%) of your 7459 MB.

In a last ditch effort, I bought some extra storage from Google thinking maybe I'd get some kind of premium level service. So far, no.

Google's been recently launching lots of cloud products, most recently a storage product to compete with Amazon's S3.

In other words, they obviously have the resources to make Gmail fast. So what's the deal? They must know about the slowness. The only reasonable explanation is that they are consciously under-resourcing it. Again, why?


Update: there are also a lot of good comments on HN.

Update 2: after a bunch of testing with my account, I'm confident at least my slowness involves something around having more than 4GB of mail. I deleted a lot of messages and got down to 3.6GB. It was then relatively fast again. I then sent myself a 25MB file (the limit) repeatedly until I got back up to 4GB. Right after 4GB, it got slow. Go figure.

Update 3: Google reached out to me and "fixed" my account. Here is what they said

A FB ad targeted at one person (my wife)

 
ad.png
The other day I gave a presentation with Steve Welch on the use of social media in politics. Steve was walking through (live) the process of creating a Facebook ad. 

He started targeting the ad by location and interest, and the number of potential people he was reaching began decreasing on screen (Facebook tells you dynamically). Then I got to thinking--could you target an ad at literally one person? 

In theory, it wouldn't seem that difficult, given you can target by a lot of different things: interests, school, location, workplace, etc. If you could actually see their profile it would be super easy, but I don't think you even need that much. With just basic facts about them, e.g. their LinkedIn, you could target an ad sufficiently narrowly to reach essentially just him/her.

So of course the next step was to actually try it.

My wife goes on Facebook a lot to look at pictures and stuff. She's also mentioned many times how she actually likes the ads because they're targeted pretty well at her interests. So I thought she'd be the perfect subject.
reach.png

First I made the ad (above). Toast is a name we sometimes call our son Eli. Yes, I even messed up grammar in the title of my ad...

Then I started targeting. It proved just as easy as I thought it would be. First I targeted literally just her by using the stuff on the right plus her major and gender. Btw, I couldn't get FB to say it would target any lower a number than 20 people (although it does say fewer).

But when I logged into her account, it wasn't showing all the time on top.

So I increased my CPM bid. But it still wasn't showing all the time on top (it showed at least occasionally then though). Side note: you can see a bunch of ads by going to the ad board page.

Then I backed off a bit (to the targeting on the right) so I'd be included as well. I logged in and found the ad on my account and "Liked" it as well as clicked on it. My thought was maybe their ad system would then perceive it as a better ad and show it more. This seemed to work, but it is hard to tell whether that really had the effect or not. In any case, it started showing up a lot more. Not all the time, but when it did show up it would stay on top for a number of page views.

Then I waited until the night and subtly prodded her to check out Facebook. We went through old photos of Eli and it was just sitting there on the right on many of the pages, but she didn't notice!

I saw first hand why CTR is so low on FB. I steered us towards the album with the picture I used for the ad, and literally there was the big version of the picture and then the ad on the right (below).

And then she noticed it. 

She immediately got what was going on. She looked at me and we broke out laughing pretty hard for a while.


ad2.png
This was of course all in good fun, but I also think there could be some good business cases for this technique :)

Twitter RT Test Results

 
Test: I asked @duckduckgo followers to RT this tweet.

tweet1.png

I also RTd it from @yegg (my personal account) with slightly different text.

tweet2.png

Hypothesis: I wasn't sure what to expect, but figured I would get a bunch of RTs because my followers seem pretty solid (not spam, auto follows or other nonsense). After that, I thought maybe I'd get some 2nd level RTs. I wasn't even holding out hope it would go viral, and of course it didn't.

Results: I tallied up the RTs using twitter search. The @duckduckgo tweet was RTd by 18 people using Twitter's RT system. The @yegg tweet was RTd by 6 people. Then there were 11 people who RTd it on their own, 4 of which got RTd 1 time each. This totals 39 RTs.

All of these people for the most part aren't spammy either, i.e. their accounts look real, with real followers. Counting them up they had 4,406 followers (avg 126, min 6, max 408). I threw out one outlier who had 22,150 followers but was also following 24,308.
 
As you can see from the tweet, I linked to a special URL, dukgo.com, that hardly anyone uses, and so I used it to track clicks. All told, 73 clicks. So that's ~2 clicks per RT--not too good.

I also considered RTs/followers. @duckduckgo has 848 followers, so that's 3.4% who RTd it. At the second level you have the 4,406 followers tied to those RTs plus my 911 followers for 5,317 followers. At 10 RTs, that's about 0.19%. So you can see the drop off.
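For anyone who wants to recheck the arithmetic, here is a small sketch using only the numbers in this post. How the RTs are split between levels is my reading rather than something spelled out above: the first-level rate is assumed to count the 18 native RTs plus the 11 manual ones (29 of 848), and the second-level count of 10 is assumed to be the 6 @yegg RTs plus the 4 RTs of manual retweets. On that reading, 10 of 5,317 comes out at roughly 0.19%.

```python
# Counts taken from the post.
native_rts = 18        # @duckduckgo tweet, via Twitter's RT system
yegg_rts = 6           # @yegg tweet
manual_rts = 11        # people who RTd it on their own
second_level_rts = 4   # manual RTs that were themselves RTd once
clicks = 73

total_rts = native_rts + yegg_rts + manual_rts + second_level_rts
clicks_per_rt = clicks / total_rts

# Average followers across the 35 people counted (outlier already excluded).
retweeters = native_rts + yegg_rts + manual_rts
avg_followers = 4406 / retweeters

# Assumed grouping: native + manual RTs measured against @duckduckgo's followers.
first_level_rate = (native_rts + manual_rts) / 848

# Assumed grouping: @yegg RTs + second-level RTs against the combined audience.
second_level_followers = 4406 + 911
second_level_rate = (yegg_rts + second_level_rts) / second_level_followers

print(f"total RTs: {total_rts}")
print(f"clicks per RT: {clicks_per_rt:.2f}")
print(f"average followers: {avg_followers:.0f}")
print(f"first-level RT rate: {first_level_rate:.1%}")
print(f"second-level RT rate: {second_level_rate:.2%}")
```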

Here are my takeaways:

  • You need a lot of followers. I'd say you'd need two orders of magnitude more, i.e. 100K real followers, to make this at all worthwhile. Then we're talking on the order of 10K clicks.

  • To go viral, you'd need RTs by important people. I'm really grateful for all the RT support, but no one had a ton of meaningful followers. I think you'd need that celebrity push to get it out there, which may kick off other celebrity RTs.

  • To go viral, you'd probably need more compelling content. Of course this test was business related, but you'd probably need it tied to either more of a fad/news story or have more of a hook, e.g. a super interesting Web page on the other side.

  • Viral coefficient is not 0. There were second level RTs. If the content also lent itself to RTing, i.e. it was a game or something that involved tweeting, you might be able to bring that up and keep the chain going.

I also tried Fiverr, a service where people say what they'll do for $5.  I spent $15 on 3 people who seemed legit and said they would retweet to all of their x thousand followers. Only one has done it so far, and that yielded a RT by 1 person (none of which I counted above). So I'm guessing that is not going to be a good advertising channel.

What I installed (and uninstalled) on my new computer

 
I just bought a new desktop. Right before it arrived I was listening to Chris Dixon on Mixergy opine that Skype probably couldn't happen today because people don't trust downloads anymore. Somewhat tangentially, it got me thinking that people probably download less software now as well because of the ascendance of the cloud, and I probably don't need to install/download much software on this new computer.

I'll let you decide whether that was the case or not. I wrote down everything I installed (and uninstalled), in order.

  1. Windows updates. I had to go through a few rounds of rebooting to get them all installed. That's pretty amazing (and annoying) since Windows 7 just came out, but whatever.

  2. Google Chrome. My Web browser of choice (at the moment). I love how it syncs my bookmarks now too, which is one of the main reasons I installed it first.

  3. Adobe Acrobat Reader. First thing I did was check my email and someone sent me a PDF. Really, you can't pre-install a PDF reader? There might be a better one to install, but I just went with what I know.

  4. Adobe Flash Player. Next thing I did was go to a Web site that required flash...

  5. Skype. I use Skype all the time, especially to enable video chat between my son and my parents.

  6. Vodburner. This is a Skype add-on that I pay for to help me record Skype video chats for my traction interviews. While I was installing Skype I figured I might as well get this set up too.

  7. Nvidia GeForce GTX 260 drivers. I have two 28" HannsG monitors (HG281) at 1920x1200 resolution. Yet they were rendering at 1920x1080 (1080p) and everything was blurry. First it took me a long time to figure out the resolution was wrong. Then it was really annoying to fix because it wouldn't let me set a custom resolution. So I went to the nvidia site, which sent me to the hp site, and their download said it wasn't compatible with my computer. So I went back to the nvidia site and found the latest drivers. After install, everything worked fine. Side note--now Windows wants to install an "update" of the old drivers. I told it to "hide" that update :).

  8. Putty. Once the text was clear, I wanted to check something on my servers. I use putty for that.

  9. Sonos Desktop Controller. This whole time I was listening to Pandora over my Sonos system. A sucky song came on and it was clear it was also too loud. So I installed the controller that lets me control the music in the house from my desktop.

  10. PGP Desktop. I keep my passwords and other important docs on an encrypted virtual drive that gets mounted as a regular drive by this software. I needed my passwords for facebook, twitter, etc. (I use random passwords), so this was next.

  11. Adobe Illustrator CS2. I have a folder for software to install with this and partition magic in it as I purchased both of them via downloads a long time ago. I saw it next to my PGP folder, so I went ahead and installed it next as I know I'll need it soon enough.

  12. iTunes. Eli (my son) watches videos through here, and it syncs my iPad and iPod Touch, which I use for development.

  13. Firefox. I debug stuff in Firefox, and got an email about a bug, so I decided to download it next.

  14. ForecastFox, Web Developer, Firebug, YSlow. These are the Firefox add-ons I use regularly. While I was installing Firefox I thought I'd go ahead and add these.

  15. Safari. I use it just for testing. But while I was doing Firefox I thought it would be good to just do now.

  16. Opera. Same story.

  17. Uninstall Norton Stuff. Ugh, I hate this stuff and wish I had an option not to have it pre-installed in the first place.

  18. ClamAV. This is my replacement for the virus part of Norton. For the other parts, I'm fine with the pre-installed Windows firewall and Windows Defender. I have a smoothwall setup in my house for more firewall protection.

  19. WinSCP. This is the other piece of software I use to routinely connect with my servers (for transferring files). I needed to transfer an image, so this was next.

  20. Quickbooks 2008. I went down to the basement to get CDs. This was one of them. I use it to do company accounting.

  21. Adobe Photoshop Elements 5.0. On CD. I use it to do image manipulation. Great deal actually--I've found I've never needed more than this "elements" version.

  22. Picasa. I manage my photos in picasa. Photoshop made me think of it.

  23. Adobe Premiere Elements 2.0. On CD. I use it to edit video sometimes, though I don't recommend this program. I just don't have a good alternative at the moment.

  24. Vmware Server. I use it to develop with. I have a FreeBSD image that mimics my servers. I wanted to fix some bugs so this was next.

  25. Uninstall Microsoft Works, Office Home and Student Trial, PowerPoint Viewer, Compatibility Pack for the 2007 Office System. I wanted to install Office 2010 (from BizSpark), but it wouldn't let me install the x64 version before uninstalling all remnants of 32-bit versions (this stuff). I find this odd since I have an x64 version of Windows--so why would they pre-install 32-bit versions?

  26. CutePDF. While the uninstalling was going on, I needed to PDF something (I save receipts this way).

  27. WinRAR. Someone emailed me a gzipped file, and this is my decompressor of choice.

  28. Microsoft Office 2010. Once that other office crap finished uninstalling, I installed this.

  29. Gmail notifier. Alerts me of new emails.

  30. Gmail notifier https patch. Come on--are you ever going to update the notifier to include this natively? I use https gmail and it doesn't work with notifier without this patch.

  31. VNC. I use this to connect to my desktop from my laptop (usually to get a password). When I went back to my laptop I noticed it was missing :).

I could be an outlier, but that sure seems like a lot of software to me! All in all it was ~20 downloads and 3 CDs. If I wasn't a developer at all, I think I'd still have done ~10 of them.

My personal URL shortener, ye.gg

 
To my surprise and delight, I noticed yesterday that the domain ye.gg was available, and I quickly gobbled it up. .gg is the country code for Guernsey, one of the Channel Islands. It wasn't cheap (GBP 88.00, ~$135 USD), but it's worth it to me!

Side note: in this process I found Domainr, which helps you find short domains.

Unlike godaddy et al., it took ~24hr for the .gg domain to be set up in DNS. So while I was waiting yesterday, I searched for a provider to run my URL shortener. The two providers I found that seemed like they may work were bit.ly Pro and awe.sm, which TechCrunch apparently uses.

I was quickly accepted into the free beta of bit.ly Pro (thanks!), but it has two limitations that prevent me from using it. First, they won't redirect ye.gg/ (with no shortcode) to my Web site. Second, they share the hashspace with everyone else, meaning I can't make ye.gg/1 ye.gg/2 etc. because they're already taken by regular bit.ly users.

Awe.sm looks cool, but they're in closed beta or are charging $99/mo. I'm only going to be making a few short URLs a month (for blog posts), so that price seemed way too steep. I emailed asking if I could get in on the beta, but haven't heard back. I can't blame them for not turning around in minutes, but I'm itching to get this thing up!

So I decided to roll my own thing for now--the most basic thing I could come up with in a few minutes. Here's what I did.

  • Pointed the DNS to my server that runs this Web site (via DNS Made Easy).

  • Cooked up this small Perl package.
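package yegg;

use strict;
use warnings;
use nginx;

# Return the target URL for a short code, or 0 if there is none.
sub is_rewrite {
    my $r = shift;

    # nginx hands us the URI with its leading slash; strip it first.
    my $uri = $r->uri || '';
    $uri =~ s{^/}{};

    # Only lowercase letters and digits are valid short codes.
    return 0 if !$uri || $uri =~ /[^0-9a-z]/;

    my $rewrite = 0;
    my $file = qq(/usr/local/ye.gg/$uri);
    if (-f $file) {
        # Treat an unreadable file as "no rewrite".
        open(my $in, '<', $file) or return 0;
        $rewrite = <$in>;
        close($in);
        chomp($rewrite) if defined $rewrite;
    }

    return $rewrite || 0;
}

1;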

This is intended to run within nginx (my Web server), using the embedded Perl module. All it does is look for the existence of a file matching the URL in the /usr/local/ye.gg/ directory. If found, it opens the file and returns the URL within it. So if I want to make http://ye.gg/angel work I just create the file '/usr/local/ye.gg/angel' and put 'http://www.gabrielweinberg.com/angel.html' in it.

  • Added this code to nginx conf.
    perl_require "/usr/local/etc/nginx/yegg.pm";
    perl_set  $rewrite  '
sub {
  my $r = shift;
  return yegg::is_rewrite($r);
}
';

This just uses the above package and puts its result into the $rewrite variable. So when a request comes in, it sets that variable by running the function I defined in the package (is_rewrite).

  • Added more code to my nginx conf.
    server {
        server_name  ye.gg *.ye.gg;

        if ($rewrite) {
          rewrite ^. $rewrite permanent;
        }

        location / {
          rewrite ^(.*) http://www.gabrielweinberg.com permanent;
        }
    }

This says if $rewrite exists (there is a URL to go to), redirect to it. Otherwise, always redirect to my home page.

And that's it--it works! One issue with this setup that I couldn't immediately solve is that it checks for the file's existence on every request, regardless of whether it is a ye.gg request or a request for another domain. That is, perl_require and perl_set don't seem to operate within server blocks. Not sure why. Anyway, I'll leave that for another day unless anyone has any insight.
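With this layout, adding a short URL is just writing a one-line file named after the shortcode. Here is a minimal helper sketch for doing that, plus a lookup that mirrors the file check. It is written in Python rather than Perl purely for illustration, the function names `add_short_url` and `lookup` are hypothetical, and it assumes shortcodes are limited to lowercase letters and digits.

```python
import re
import tempfile
from pathlib import Path

# Assumption: shortcodes contain only lowercase letters and digits.
CODE_RE = re.compile(r"[0-9a-z]+")


def add_short_url(root, code, target):
    """Create root/code containing the target URL on a single line."""
    if not CODE_RE.fullmatch(code):
        raise ValueError(f"invalid shortcode: {code!r}")
    path = Path(root) / code
    path.write_text(target + "\n")
    return path


def lookup(root, code):
    """Return the stored URL for a shortcode, or None if there is none."""
    if not CODE_RE.fullmatch(code):
        return None  # also rejects things like '../etc/passwd'
    path = Path(root) / code
    if not path.is_file():
        return None
    lines = path.read_text().splitlines()
    return lines[0] if lines else None


if __name__ == "__main__":
    # Demo in a temporary directory instead of /usr/local/ye.gg/.
    with tempfile.TemporaryDirectory() as root:
        add_short_url(root, "angel", "http://www.gabrielweinberg.com/angel.html")
        print(lookup(root, "angel"))
        print(lookup(root, "missing"))
```

Validating the shortcode before touching the filesystem matters more than it looks: it is the only thing keeping a request path from pointing at files outside the directory.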

Do people subscribe to blogs less now? My blog's #s.

 
Maybe I have rose-colored glasses on, but I remember it being easier to get blog subscribers (a few years ago). Right now I'm getting ~0.5% conversion, extrapolating from these FeedBurner and Google Analytics numbers.

feedburner1.png
analytics2.png
That is, 10K visits for a blog post yields about 50 new FeedBurner subscribers. The sharp increase at the beginning of the year correlates to my increased post frequency.

My sense is that the increased posts not only draw more visitors per unit time, but also keep the blog more present in people's minds, making them more likely to subscribe. From Apr 2008 to Jan 2010 I had 48,097 new visitors and then 82,733 new visitors since Jan 1 of this year. But my FeedBurner #s have more than doubled over that period.

Here's the data from the past 30 days.

feedburner2.png
analytics1.png

What I find interesting is that the major posts did not spike FeedBurner in a similar way. It's still a steady increase. My guess is to get a major spike you need someone major recommending your blog in a post like this

Yet to get on a list like that seems sort of random. I think you have to be out there putting out good content regularly so that when someone does make a list like that, they think of you.

Over the whole period, these posts have been the biggest.

analytics6.png
The first column is unique page views. If you sum the %s (taking out the home page), these top 9 posts (out of 107) make up 52%.

Here's where all this traffic comes from.

analytics7.png

Thank you Hacker News and reddit! Without you, my blog #s would be pretty pathetic.

The Google stuff is pretty much all to one post I wrote on Skype high-quality video, which seems to capture a lot of people searching about that. I find that a bit odd in that I remember getting a lot more random organic traffic.

With all this in mind, do people subscribe to blogs less now? My hunch is yes and it is due mainly to a few factors.

  1. The rise of social link sharing has really taken the compelling reason out of subscribing to blogs, i.e. that you will miss something awesome. The argument is that if it is so awesome someone will share it with you. I don't think this is quite true, however. As someone who subscribes to a lot of blogs, at least half of the good content I see I don't see on those services. 

  2. Remember when RSS readers were hot? Well now they're not. The business models never really seemed to pan out, and I think that deflated a lot of the interest (and in turn innovation) in the ecosystem. Related to that is they never seemed to really break mainstream as a lot of people thought they would.

  3. The twitter fan relationship. A lot of people seem to opt to follow on Twitter instead of subscribing to RSS to the extent that some people completely ignore RSS in favor of twitter. On Twitter, you get more than straight links, so maybe that is part of the appeal. Again, I disagree, though. I find often I just want the posts and don't want to miss anything. It's real easy to get behind on Twitter and all the UIs really make it too easy to just give up on old Tweets.
