July 2010 Archives

Google Web spam

 
Yesterday, @mims wrote this post on "content-mills," which prompted this discussion on HN about Web spam. Many of the comments are by moultano, who is on Google's search quality team. This particular comment really drew my attention:

I doubt you'll find MFA spam to be better on DDG than on Google, but please, if you see a query where they are beating us. Send it over. :) I can guarantee you that I'll get a lot of eyes looking at it.

At DDG, I mainly crawl looking for these types of spam domains. On my last crawl, I identified about 37.8M domains as spam in the com/net/org/biz/info/us TLDs. I found Web sites at another 61.3M domains; the rest timed out. So roughly 40% of the domains I visited (with sites) were spam.

I just took a random sample of those spam domains and checked them against Google's index. All of this code as well as the sample and results are now on github.

First I started checking against Google's Web site directly, but their bot detection quickly shut me down. I was able to check 589 domains before being shut down, using the site: syntax. The results are here. The second column is the # of results reported in the index. For example, you can verify the first one with this query.

Of those I checked, 302 came up with at least one result, i.e. are in their index in some form. That means (extrapolating) roughly 50% of my spam domains are in Google's index, or about 19M domains.

Once shut off, I moved to Google's search API to process the full 10K sample. Interestingly though, it apparently returns very different results. For example check out web vs api. The Web shows 1 result, whereas the API shows none. 

Weird. I carried it out anyway though. Of the 10K full sample, I found 719 in Google's API index, or 7%. If you extrapolate that to the full list, that would be ~3M spam domains in the index. 
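The extrapolations above are simple proportions, and a minimal sketch makes them easy to re-run. The function and variable names here are made up for illustration; the only inputs are the counts quoted in this post.

```python
def extrapolate(hits, sample_size, population):
    """Scale a sample hit rate up to the full population."""
    rate = hits / sample_size
    return rate, rate * population


SPAM_DOMAINS = 37.8e6    # domains flagged as spam in the last crawl
OTHER_SITES = 61.3e6     # other domains that returned a Web site

# Share of visited domains (with sites) that were spam.
spam_share = SPAM_DOMAINS / (SPAM_DOMAINS + OTHER_SITES)

# 302 of 589 domains had results via the site: syntax.
web_rate, web_total = extrapolate(302, 589, SPAM_DOMAINS)

# 719 of the 10K sample had results via the search API.
api_rate, api_total = extrapolate(719, 10_000, SPAM_DOMAINS)

print(f"spam share of visited sites: {spam_share:.1%}")
print(f"site: check: {web_rate:.1%} -> {web_total / 1e6:.1f}M domains")
print(f"API check:   {api_rate:.1%} -> {api_total / 1e6:.1f}M domains")
```

This works out to about 38% spam overall, about 51% (19.4M domains) for the site: check, and about 7.2% (2.7M domains) for the API check, which is where the rounded figures in the text come from.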

In any case, these #s are pretty conservative estimates because a) I'm only covering about half the domain space (missing all the country TLDs except .us), and b) I know I still have a lot of false negatives (please send them to me when you see them).

On the other side, the way I do the identification, there are minimal false positives at the time of identification. However, sites flip between spam and non-spam all the time, and since it takes me a while to crawl, there are certainly a few false positives in there.

There are also legitimate false positives, and if you see those, please report them as well. I did nothing to hide those from view here, so you can see for yourself in the results.

Of course this says nothing about how much they appear in the rankings. I tried to find the modern equivalent of Metaspy to get some random queries, but I couldn't find such a service in existence. Nevertheless, half of the spam domains are not in the index, which raises the question: why the difference?

If people have lots of links from Google results saved, I'd be happy to run them against my list.

Wannabe entrepreneur symptoms and cures

 
I was once a wannabe entrepreneur. Fresh out of college and a summer internship at a VC firm, I thought I knew what I was doing. Though this was 2000, and all the startup & VC blogs we've grown to love didn't exist yet, I did have mentors available. I should have leaned on them a lot more, but I didn't, or at least not in the right ways.

But all the ways I've failed, and there are certainly many, are not the point. I just want to let you know that I've been there, and that I hope the rest of this post doesn't come off as annoyingly condescending.

Since 2000, I've been doing and thinking about startups constantly. Even though I'm an introvert, I end up meeting or otherwise crossing paths with a lot of entrepreneurs. Unfortunately, I'd classify a lot of them as wannabes.

What follows are some symptoms I've seen over and over that usually (though not always) indicate a wannabe entrepreneur. If any of these describe you (or someone you know), I'd take it as a sign to step back and think hard about what you're doing (or have that conversation with your friend).

There are cures. Usually it means what you (or they) are working on now will fail. But perhaps it is salvageable with a few tweaks or a change in direction. And if you/they are really in it for the long term (as real entrepreneurs are), then there will be other startups.


Symptom: a year has gone by and you have nothing to show for it.

Cure: get stuff done. That's what real startup founders do. Customers don't care about excuses.


Symptom: you haven't really talked to any real customers/users. 

Cure: read Steve Blank's book. Get out of the building. "No plan survives first contact with customers." A related (non-wannabe but first-timer) problem is confusing the user with the customer. I did this on my first startup, and it was one of my primary problems.


Symptom: you're going around calling yourself a CEO. 

Cure: you're a founder. You're not powerful. No one cares about what you're doing...yet.


Symptom: you aren't knowledgeable about startups, especially your own space.

Cure: read stuff & regularly talk with the smartest startup people you know. At the very least, you should know the whole history of your space--failures, acquisitions, IPOs, reasons for such, etc.


Symptom: you just need 10-25K in investment.

Cure: get your own 10-25K. Do consulting. Maybe convince friends and family. If you can't raise that much from yourself and your existing circle, you aren't going to be able to raise more from strangers. I did consulting for a few years, max 4hr a day, so I could focus the rest of my time on my startups.


Symptom: you have spent months researching the right architecture to build your site.

Cure: build it already. You seem like someone more interested in technology than startups.


Symptom: you don't understand your startup's assumptions.

Cure: make a spreadsheet and try to predict the key metrics of your business. Yes, the financial projections that come out of the spreadsheet are probably worthless (or grossly inaccurate), but not their underlying assumptions. Those are the things you need to prove and the first step is knowing what they are. As a side note, this exercise will help you understand how much money you need to raise, if any.


Symptom: you've written more than a 5pg business plan (intended for others).

Cure: spend that time talking to real customers or building your product. If you think it will help you understand your business, build a spreadsheet with assumptions instead. If you think investors will read it, know that they won't. Note: I have no problem with people analyzing their businesses internally through brief writing; I do that too.


Symptom: you now just need a programmer to code up your site.

Cure: either convince a real tech co-founder to join you, or learn how to code yourself. It's not that hard, and if you think of startups as a career, it's a great skill to have even if you just manage tech people. You don't have to major in CS in college to be a programmer, e.g. I was a Physics major.

Weird eHow Web spam

 
fnboelwein.com redirects to http://www.ehow.com/apply-card-credit-online/. As does bankofelgin.com.

If you actually go to the link, you get this message at the top:

Hi There! bankofelgin.com isn't available, but you're still in a good place -- ehow.com. We think we might have what you're looking for.

Doubt it.

Both domains have 64.74.223.39 as an A record, which is different than the redirect IP. And both have proxied whois records. However, they all have the same Server headers:

Server: Microsoft-IIS/6.0
X-Powered-By: ASP.NET

So I'm inclined to believe these domains are actually powered by eHow, which isn't too surprising since eHow is owned by Demand Media.
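For anyone who wants to repeat this kind of check, here is a rough sketch of how the A record and response headers could be collected and compared. The function names and the dictionary layout are my own assumptions, not the code I actually crawl with. It uses a HEAD request that does not follow redirects, so the headers come from the parked domain itself rather than from the page it redirects to.

```python
import http.client
import socket


def fetch_fingerprint(host, timeout=10):
    """Collect the A record and identifying headers for one host.

    Hypothetical helper: makes a live DNS lookup and a HEAD request,
    and does not follow redirects.
    """
    ip = socket.gethostbyname(host)
    conn = http.client.HTTPConnection(host, timeout=timeout)
    try:
        conn.request("HEAD", "/")
        resp = conn.getresponse()
        return {
            "ip": ip,
            "status": resp.status,
            "location": resp.getheader("Location"),
            "server": resp.getheader("Server"),
            "powered_by": resp.getheader("X-Powered-By"),
        }
    finally:
        conn.close()


def same_stack(a, b):
    """True when both fingerprints report the same, non-empty server stack."""
    if a["server"] is None or b["server"] is None:
        return False
    return (a["server"], a["powered_by"]) == (b["server"], b["powered_by"])


# Offline example using the header values quoted above.
parked = {"ip": "64.74.223.39", "server": "Microsoft-IIS/6.0", "powered_by": "ASP.NET"}
target = {"ip": None, "server": "Microsoft-IIS/6.0", "powered_by": "ASP.NET"}
print(same_stack(parked, target))
```

Matching headers are only suggestive, since IIS with ASP.NET is a common stack, so a check like this is better used to flag candidates for a closer look than to prove common ownership.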

I wonder how many of these domains are out there? I found these two because they happen to both be part of my spam/parked domains training set.

I'm starting crawl #18 to detect and weed out such domains from DuckDuckGo. Each time I start a crawl I make sure I have no existing false positives or negatives in my training set.

Interestingly, a lot of domains flip from parked to unparked and vice versa every time. These two fell into the false negative category since I wasn't labeling these pages as spam. Maybe I should...

Announcing duck.co - The DuckDuckGo Community

 
Today the .co domain launches and I'm proud to be a part of the .co Founders program and have one of the founding Websites.

It's duck.co, which is the new home of the DuckDuckGo Community forum (previously ideas.duckduckgo.com).

I have high hopes for this site, and wish it to become a real community centered around DDG. I've initially created four top-level forums:
  • DuckDuckGo Feedback - report problems and give suggestions
  • Spreading DuckDuckGo - share ideas and experiences.
  • DuckDuckGo Code - discuss DDG open source code & APIs. Over the next year, I hope to put a lot more effort into this area.
  • DuckDuckGo for Educators - use of DDG in the educator/library community in particular.
I of course will be actively participating in all of these forums and would welcome and greatly appreciate your participation as well.

The forum is run by Zoho discussions. I have it set up so you can post anonymously, though if you do sign in you'll be able to more easily follow topics & forums (by email).

You might also be wondering why I switched forum providers. Long story short, Slinkset (the previous provider) was down about 50% of the time and really slow when it was up. That is, it was essentially unusable.

A Board of Directors is not an advisory board

 
An advisory board has no decision making authority for the organization; a Board of Directors does have such authority. That's the key difference between the two bodies. 

As a result of this difference, each body should be managed (and thought of) very differently from the perspective of the startup entrepreneur. You go to advisors when you need advice or help. Your Board can do that too, but it's really there to govern the company.

It's confusing because early on the Board is usually just the founders, so governance and management are one and the same and involve the same people. Also, when startups first get going, many get "advisors" on board. These relationships come in all shapes and sizes, from regular formal advisory meetings with equity grants to little more than name dropping.

This post isn't about advisory boards per se so I'll only say that most of them seem like a waste of time. What's not a waste of time is getting experienced people involved in your business. So personally I'd focus any advisory board effort there, i.e. finding the most useful people you can get and actually getting them involved, e.g. for equity (or a small investment). 

The point is though, you don't have to listen to your advisors at all. They have no authority to make decisions on behalf of your company.

When startups take bigger investment and get investor or independent board members, I've seen a few times now that there is a status quo mindset to treat them like the other advisors, but perhaps with more formal meetings. I completely understand this tendency because that's what the founders are used to, but nevertheless it is a mistake.

The exact purpose of the Board is defined in corporate documents, but it usually consists of selecting executives, setting company objectives & strategy, deciding material financial matters, etc. That is, it has the authority to govern the organization and really should be doing that governance.

Now it may be the case (especially after an angel round) that the founders still "control" the board in the sense that if anything ever came to a vote, which it hardly ever does, then they would win. That is, they aren't in danger of getting fired. 

However, even if that is the case, founders should still be setting strategic direction from the Board, and not simply reporting their chosen direction to the Board. It's a subtle, but very important distinction.

In addition, particular Board members (as a result of investment) may have protective provisions that give them veto power over certain things. Yet even if they don't have these powers, you should still run those kinds of decisions through the board, e.g. financing, acquisition, key hires, etc., and not at the board.

So how do you interact and communicate effectively with your Board? Now that I'm on some Boards, I've been collecting good advice to share with entrepreneurs from people way more experienced than myself. Here's the best stuff I've found so far.

Top linked domains from Facebook pages

 
I've been messing around with Facebook pages for an upcoming DuckDuckGo integration, and I came across some data that seemed interesting enough to share. These are the top linked domains from Facebook pages.

  1. myspace.com (269588)
  2. twitter.com (97669)
  3. youtube.com (54238)
  4. facebook.com (50234)
  5. flickr.com (12541)
  6. en.wikipedia.org (10578)
  7. reverbnation.com (10144)
  8. fotolog.com (7840)
  9. imdb.com (5250)
  10. purevolume.com (5217)
  11. last.fm (3740)
  12. linkedin.com (3730)
  13. soundcloud.com (3713)
  14. ilike.com (2652)
  15. cdbaby.com (2377)
  16. apps.facebook.com (2213)
  17. it.wikipedia.org (2019)
  18. sites.google.com (1984)
  19. etsy.com (1977)
  20. vimeo.com (1941)
  21. bebo.com (1939)
  22. sonicbids.com (1923)
  23. modelmayhem.com (1780)
  24. profile.myspace.com (1692)
  25. wix.com (1633)
  26. amazon.com (1613)
  27. soundclick.com (1401)
  28. tinyurl.com (1364)
  29. fr.wikipedia.org (1348)
  30. cafepress.com (1304)
  31. bandzone.cz (1279)
  32. freewebs.com (1264)
  33. es.wikipedia.org (1236)
  34. google.com (1195)
  35. itunes.apple.com (1113)
  36. zazzle.com (1004)
  37. dailymotion.com (996)
  38. friendster.com (982)
  39. imeem.com (901)
  40. bit.ly (895)
  41. profiles.friendster.com (893)
  42. new.facebook.com (872)
  43. virb.com (728)
  44. yelp.com (707)
  45. groups.yahoo.com (686)
  46. picasaweb.google.com (673)
  47. web.me.com (669)
  48. metroflog.com (657)
  49. geocities.com (633)
  50. bbc.co.uk (584)
In particular, this list aggregates domains extracted from links within the 'Website' sections of Facebook pages. For example, on the DuckDuckGo Facebook page there is a link to the homepage (duckduckgo.com) and to the DDG Twitter stream (twitter.com) within that section. Each of those domains would get one point in the aggregated list. If duckduckgo.com had appeared twice, it would still just get one point.
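The counting rule can be sketched in a few lines. This is an illustration under assumptions, not the actual crawl code: the helper names are hypothetical, and stripping a leading "www." while keeping other subdomains is my guess at the normalization, based on entries like en.wikipedia.org and apps.facebook.com appearing separately in the list.

```python
from collections import Counter
from urllib.parse import urlparse


def domain_of(url):
    """Extract the host from a link, dropping a leading 'www.' (assumed rule)."""
    # Links without a scheme are given '//' so urlparse treats them as hosts.
    host = urlparse(url if "//" in url else "//" + url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def top_domains(pages, n=50):
    """Count each domain at most once per page, then rank by page count."""
    counts = Counter()
    for links in pages:
        # A set collapses repeated links to the same domain on one page.
        counts.update({domain_of(u) for u in links} - {""})
    return counts.most_common(n)


# Each inner list stands for the 'Website' links of one page.
pages = [
    ["http://duckduckgo.com/", "http://twitter.com/duckduckgo",
     "http://www.duckduckgo.com/about"],
    ["http://twitter.com/example"],
]
print(top_domains(pages))
```

In this toy input twitter.com gets two points (one per page) and duckduckgo.com gets one, even though it is linked twice on the first page.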

Of course, real people took the time to link to these domains in the context of promoting their online Web presences, so it was interesting to me what they chose in the aggregate. This data confirms the anecdotal evidence I keep seeing of people promoting their FB and Twitter together. I was also intrigued by how high myspace was; I suppose a lot of bands still use it and/or haven't updated their old FB pages.

There were a few sites I actually hadn't heard of, e.g. some of the music stuff, wix, modelmayhem & virb. Not that I should hear of every site, but these must have a lot of traction already to be that high in these lists. 

If you just look at "high quality" FB pages (custom urls, no default images, etc.), you get a similar but slightly different list & ordering.

  1. twitter.com (32597)
  2. myspace.com (28125)
  3. youtube.com (11511)
  4. facebook.com (11007)
  5. flickr.com (3225)
  6. reverbnation.com (1413)
  7. linkedin.com (1267)
  8. en.wikipedia.org (1016)
  9. ilike.com (787)
  10. last.fm (763)
  11. purevolume.com (685)
  12. soundcloud.com (681)
  13. vimeo.com (608)
  14. imdb.com (483)
  15. apps.facebook.com (411)
  16. bebo.com (389)
  17. cdbaby.com (376)
  18. sonicbids.com (372)
  19. bit.ly (337)
  20. tinyurl.com (279)
  21. itunes.apple.com (273)
  22. google.com (264)
  23. fotolog.com (263)
  24. friendfeed.com (259)
  25. nscs.org (242)
  26. imeem.com (241)
  27. etsy.com (212)
  28. it.wikipedia.org (209)
  29. modelmayhem.com (200)
  30. itunes.com (188)
  31. amazon.com (178)
  32. cafepress.com (171)
  33. delicious.com (154)
  34. yelp.com (153)
  35. zazzle.com (153)
  36. dailymotion.com (146)
  37. virb.com (133)
  38. ustream.tv (111)
  39. soundclick.com (100)
  40. bbc.co.uk (99)
  41. legacyrecordings.com (97)
  42. friendster.com (93)
  43. blogtalkradio.com (92)
  44. digg.com (92)
  45. formspring.me (90)
  46. picasaweb.google.com (90)
  47. lululemon.com (90)
  48. woodstock.com (86)
  49. groups.yahoo.com (86)
  50. de.wikipedia.org (86)

Here are the top types (counted for pages with at least some info on them).

  1. Musician (421154)
  2. Other Business (417494)
  3. Other Public Figure (232868)
  4. Professional Service (140365)
  5. Non-Profit (129352)
  6. Website (106490)
  7. Products (95604)
  8. Education (76957)
  9. Store (64733)
  10. Visual Artist (61575)
  11. Club (61043)
  12. Restaurant (56447)
  13. Health and Beauty (51983)
  14. Sports / Athletics (49001)
  15. Fashion (46189)
  16. Food and Beverage (43737)
  17. Communications (36180)
  18. Athlete (30520)
  19. Religious Center (29811)
  20. Technology Product / Service (28728)
  21. Actor (28101)
  22. Hotel / Lodging (27358)
  23. Sports Team (27322)
  24. Online Store (26261)
  25. Film (25190)
  26. Religious Organization (25093)
  27. Writer (25082)
  28. Bar (24957)
  29. Politician (24139)
  30. Consumer Product (22421)
  31. Comedian (22359)
  32. Real Estate (22296)
  33. Technology and Telecommunications Service (21176)
  34. Model (20944)
  35. Event Planning Service (20671)
  36. TV Show (18739)
  37. Museum / Attraction (17960)
  38. Game (16729)
  39. Travel (16097)
  40. Pets (15511)
  41. Retail (15123)
  42. Travel Service (15018)
  43. Automotive (14530)
  44. Cafe (14237)
  45. Government (13601)
  46. Medical Service (12909)
  47. Automotive Dealer / Vehicle Service (9729)
  48. Home Living (9227)
  49. Home Service (8247)
  50. Library / Public Building (7086)

Note that this crawl was completed before the whole open graph/like thing, so these were all "real" pages. I'm currently crawling all the new stuff and working on ways to "keep it real," so to speak.

Traction trumps everything

 
If you ask an angel investor what they look for in a company, they'll usually rattle off a list of things that describe the ideal angel investment: huge market, great team, superior product, sustainable competitive advantage, etc. Trouble is, even if your startup has those things (and most don't), you still have to convince each investor.

Good thing there is a shortcut: traction. If you show investors some traction, the rest of the conversation becomes a lot easier. They'll generally be willing to overlook some of your deficiencies, probably even more than they should. 

Traction is real customers. If you charge for your product, it's real paying customers. If your product is free, it's a real user base. In other words, traction is a signal that your team can produce real results in a real market.

You don't need much traction to entice investors. In fact, people like myself prefer just a little because when you have a lot your valuation will probably be too high. The ideal scenario from my perspective is that you've already got a bit of traction, and you know how to get more with some investment.

You played around with various traction verticals and you identified a few promising ones that brought in your early customers. Now you just need $x to experiment more heavily with those channels. That's a compelling story.

I should also point out that once you have traction, you may not need investors at all.