Google Web spam

 
Yesterday, @mims wrote this post on "content-mills," which prompted this discussion on HN about Web spam. Many of the comments are by moultano, who is on Google's search quality team. This particular comment really drew my attention:

I doubt you'll find MFA spam to be better on DDG than on Google, but please, if you see a query where they are beating us. Send it over. :) I can guarantee you that I'll get a lot of eyes looking at it.

At DDG, I mainly crawl looking for these types of spam domains. On my last crawl, I identified about 37.8M domains as spam in the com/net/org/biz/info/us TLDs. I found Web sites at another 61.3M domains; the rest timed out. So roughly 40% of the domains I visited (with sites) were spam.

I just took a random sample of those spam domains and checked them against Google's index. All of this code as well as the sample and results are now on github.

First I started checking against Google's Web site directly, but their bot detection quickly shut me down. I was able to check 589 domains before being shut down, using the site: syntax. The results are here. The second column is the # of results reported in the index. For example, you can verify the first one with this query.

Of those I checked, 302 came up with at least one result, i.e. are in their index in some form. That means (extrapolating) roughly 50% of my spam domains are in Google's index, or about 19M domains.

Once shut off, I moved to Google's search API to process the full 10K sample. Interestingly though, it apparently returns very different results. For example check out web vs api. The Web shows 1 result, whereas the API shows none. 

Weird. I carried it out anyway though. Of the 10K full sample, I found 719 in Google's API index, or 7%. If you extrapolate that to the full list, that would be ~3M spam domains in the index. 
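
The extrapolations above are simple proportions, and can be reproduced with a short sketch. It assumes the "10K" sample was exactly 10,000 domains and treats the 37.8M figure as the full spam list; the function and variable names are illustrative only.

```python
def extrapolate(hits: int, sample_size: int, population: float) -> tuple[float, float]:
    """Return (hit rate in the sample, estimated hits in the full population)."""
    rate = hits / sample_size
    return rate, rate * population


SPAM_DOMAINS = 37.8e6  # domains identified as spam on the last crawl
SITE_DOMAINS = 61.3e6  # other domains where a Web site was found

# Share of visited domains (with sites) that were spam.
spam_share = SPAM_DOMAINS / (SPAM_DOMAINS + SITE_DOMAINS)

# site: checks against Google's Web site: 302 of 589 had at least one result.
web_rate, web_est = extrapolate(302, 589, SPAM_DOMAINS)

# Search API checks: 719 of the sample (assumed to be exactly 10,000).
api_rate, api_est = extrapolate(719, 10_000, SPAM_DOMAINS)

print(f"spam share of domains with sites: {spam_share:.1%}")
print(f"site: check -> {web_rate:.1%} indexed, ~{web_est / 1e6:.1f}M domains")
print(f"search API  -> {api_rate:.1%} indexed, ~{api_est / 1e6:.1f}M domains")
```

Both estimates are point estimates from a single random sample, so they carry sampling error on top of the false positive/negative caveats below.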

In any case, these #s are pretty conservative estimates because a) I'm only covering about half the domain space (missing all the country TLDs except .us), and b) I know I still have a lot of false negatives (please send them to me when you see them).

On the other side, the way I do the identification, there are minimal false positives at the time of identification. However, sites flip between spam and non-spam all the time, and since it takes me a while to crawl, there are certainly a few false positives in there. 

There are also legitimate false positives, and if you see those, please report them as well. I did nothing to hide those from view here, so you can see for yourself in the results.

Of course this says nothing about how much they appear in the rankings. I tried to find the modern equivalent of Metaspy to get some random queries, but I couldn't find such a service in existence. Nevertheless, half of the spam domains are not in the index, which raises the question: why the difference? 

If people have lots of links from Google results saved, I'd be happy to run them against my list.

Wannabe entrepreneur symptoms and cures

 
I was once a wannabe entrepreneur. Fresh out of college and a summer internship at a VC firm, I thought I knew what I was doing. Though this was 2000, and all the startup & VC blogs we've grown to love didn't exist yet, I did have mentors available. I should have leaned on them a lot more, but I didn't, or at least not in the right ways.

But all the ways I've failed, and there are certainly many, are not the point. I just want to let you know that I've been there, and that I hope the rest of this post doesn't come off as annoyingly condescending.

Since 2000, I've been doing and thinking about startups constantly. Even though I'm an introvert, I end up meeting or otherwise crossing paths with a lot of entrepreneurs. Unfortunately, I'd classify a lot of them as wannabes.

What follows are some symptoms I've seen over and over that usually (though not always) indicate a wannabe entrepreneur. If any of these describe you (or someone you know), I'd take it as a sign to step back and think hard about what you're doing (or have that conversation with your friend).

There are cures. Usually it means what you (or they) are working on now will fail. But perhaps it is salvageable with a few tweaks or a change in direction. And if you/they are really in it for the long term (as real entrepreneurs are), then there will be other startups.


Symptom: a year has gone by and you have nothing to show for it.

Cure: get stuff done. That's what real startup founders do. Customers don't care about excuses.


Symptom: you haven't really talked to any real customers/users. 

Cure: read Steve Blank's book. Get out of the building. "No plan survives first contact with customers." A related (non-wannabe but first-timer) problem is confusing the user with the customer. I did this on my first startup, and it was one of my primary problems.


Symptom: you're going around calling yourself a CEO. 

Cure: you're a founder. You're not powerful. No one cares about what you're doing...yet.


Symptom: you aren't knowledgeable about startups, especially your own space.

Cure: read stuff & regularly talk with the smartest startup people you know. At the very least, you should know the whole history of your space--failures, acquisitions, IPOs, reasons for such, etc.


Symptom: you just need 10-25K in investment.

Cure: get your own 10-25K. Do consulting. Maybe convince friends and family. If you can't raise that much from yourself and your existing circle, you aren't going to be able to raise more from strangers. I did consulting for a few years, max 4hr a day, so I could focus the rest of my time on my startups.


Symptom: you have spent months researching the right architecture to build your site.

Cure: build it already. You seem like someone more interested in technology than startups.


Symptom: you don't understand your startup's assumptions.

Cure: make a spreadsheet and try to predict the key metrics of your business. Yes, the financial projections that come out of the spreadsheet are probably worthless (or grossly inaccurate), but not their underlying assumptions. Those are the things you need to prove and the first step is knowing what they are. As a side note, this exercise will help you understand how much money you need to raise, if any.


Symptom: you've written more than a 5pg business plan (intended for others).

Cure: spend that time talking to real customers or building your product. If you think it will help you understand your business, build a spreadsheet with assumptions instead. If you think investors will read it, know that they won't. Note: I have no problem with people analyzing their businesses internally through brief writing; I do that too.


Symptom: you now just need a programmer to code up your site.

Cure: either convince a real tech co-founder to join you, or learn how to code yourself. It's not that hard, and if you think of startups as a career, it's a great skill to have even if you just manage tech people. You don't have to major in CS in college to be a programmer, e.g. I was a Physics major.

Weird eHow Web spam

 
fnboelwein.com redirects to http://www.ehow.com/apply-card-credit-online/. As does bankofelgin.com.

If you actually go to the link, you get this message at the top:

Hi There! bankofelgin.com isn't available, but you're still in a good place -- ehow.com. We think we might have what you're looking for.

Doubt it.

Both domains have 64.74.223.39 as an A record, which is different than the redirect IP. And both have proxied whois records. However, they all have the same Server headers:

Server: Microsoft-IIS/6.0
X-Powered-By: ASP.NET

So I'm inclined to believe these domains are actually powered by eHow, which isn't too surprising since eHow is owned by Demand Media.
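
The comparison above (A records plus Server headers) can be sketched with the standard library. This is only an illustration: the function names are made up for this example, the domains may no longer resolve, and matching headers as common as these are weak evidence on their own.

```python
import http.client
import socket

# Response headers treated as a crude "server fingerprint" (an assumption).
FINGERPRINT_KEYS = ("server", "x-powered-by")


def a_records(domain: str) -> set[str]:
    """Resolve a domain to its set of IPv4 addresses."""
    infos = socket.getaddrinfo(domain, 80, family=socket.AF_INET,
                               type=socket.SOCK_STREAM)
    return {info[4][0] for info in infos}


def head_headers(host: str, timeout: float = 10.0) -> dict[str, str]:
    """Send one HEAD request and return the headers of that response.

    http.client does not follow redirects, so these are the headers of
    the redirecting server itself, not of the redirect target.
    """
    conn = http.client.HTTPConnection(host, timeout=timeout)
    try:
        conn.request("HEAD", "/")
        return dict(conn.getresponse().getheaders())
    finally:
        conn.close()


def extract_fingerprint(headers: dict[str, str]) -> dict[str, str]:
    """Keep only the fingerprint headers, with case-insensitive names."""
    lowered = {name.lower(): value.strip() for name, value in headers.items()}
    return {key: lowered[key] for key in FINGERPRINT_KEYS if key in lowered}


def same_stack(*header_sets: dict[str, str]) -> bool:
    """True if every response exposes the same non-empty fingerprint."""
    prints = [extract_fingerprint(headers) for headers in header_sets]
    return bool(prints[0]) and all(p == prints[0] for p in prints)


if __name__ == "__main__":
    hosts = ["fnboelwein.com", "bankofelgin.com", "www.ehow.com"]
    for host in hosts:
        print(host, a_records(host))
    print("same stack:", same_stack(*(head_headers(h) for h in hosts)))
```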

I wonder how many of these domains are out there? I found these two because they happen to both be part of my spam/parked domains training set.

I'm starting crawl #18 to detect and weed out such domains from DuckDuckGo. Each time I start a crawl I make sure I have no existing false positives or negatives in my training set.

Interestingly, on every crawl a lot of domains flip from parked to unparked and vice versa. These two fell into the false negative category since I wasn't labeling these pages as spam. Maybe I should...

Announcing duck.co - The DuckDuckGo Community

 
Today the .co domain launches and I'm proud to be a part of the .co Founders program and have one of the founding Websites.

It's duck.co, which is the new home of the DuckDuckGo Community forum (previously ideas.duckduckgo.com).

I have high hopes for this site, and wish it to become a real community centered around DDG. I've initially created four top-level forums:
  • DuckDuckGo Feedback - report problems and give suggestions
  • Spreading DuckDuckGo - share ideas and experiences.
  • DuckDuckGo Code - discuss DDG open source code & APIs. Over the next year, I hope to put a lot more effort into this area.
  • DuckDuckGo for Educators - use of DDG in the educator/library community in particular.
I of course will be actively participating in all of these forums and would welcome and greatly appreciate your participation as well.

The forum is run by Zoho discussions. I have it set up so you can post anonymously, though if you do sign in you'll be able to more easily follow topics & forums (by email).

You might also be wondering why I switched forum providers. Long story short, Slinkset (the previous provider) was down about 50% of the time and really slow when it was up. That is, it was essentially unusable.

A Board of Directors is not an advisory board

 
An advisory board has no decision making authority for the organization; a Board of Directors does have such authority. That's the key difference between the two bodies. 

As a result of this difference, each body should be managed (and thought of) very differently from the perspective of the startup entrepreneur. You go to advisors when you need advice or help. Your Board can do that too, but it's really there to govern the company.

It's confusing because early on the Board is usually just the founders, so governance and management are one and the same and done by the same people. Also, when startups first get going, many get "advisors" on board. These relationships come in all shapes and sizes, from regular formal advisory meetings with equity grants to little more than name dropping.

This post isn't about advisory boards per se so I'll only say that most of them seem like a waste of time. What's not a waste of time is getting experienced people involved in your business. So personally I'd focus any advisory board effort there, i.e. finding the most useful people you can get and actually getting them involved, e.g. for equity (or a small investment). 

The point is though, you don't have to listen to your advisors at all. They have no authority to make decisions on behalf of your company.

When startups take bigger investment and get investor or independent board members, I've seen a few times now that there is a status quo mindset to treat them like the other advisors, but perhaps with more formal meetings. I completely understand this tendency because that's what the founders are used to, but nevertheless it is a mistake.

The exact purpose of the Board is defined in corporate documents, but it usually consists of selecting executives, setting company objectives & strategy, deciding material financial matters, etc. That is, it has the authority to govern the organization and really should be doing that governance.

Now it may be the case (especially after an angel round) that the founders still "control" the board in the sense that if anything ever came to a vote, which it hardly ever does, then they would win. That is, they aren't in danger of getting fired. 

However, even if that is the case, founders should still be setting strategic direction from the Board, and not simply reporting their chosen direction to the Board. It's a subtle, but very important distinction.

In addition, particular Board members (as a result of investment) may have protective provisions that give them veto power over certain things. Yet even if they don't have these powers, you should still run those kinds of decisions (e.g. financing, acquisitions, key hires) through the board, rather than just announcing them at the board meeting.

So how do you interact and communicate effectively with your Board? Now that I'm on some Boards, I've been collecting good advice to share with entrepreneurs from people way more experienced than myself. Here's the best stuff I've found so far.

Michael Bodekaer on getting traction

 

Michael Bodekaer was a founder and CTO of Smartlaunch, which makes software for Internet cafes worldwide and IPO'ed in 2005. Michael explains how they initially got traction by, among other things, giving away their product for free to leading cafes when entering new regional markets. He also talks about his Top 7 Beliefs and his new passion, Project Getaway.

We start talking about Smartlaunch in min 13. Also note that I didn't plan on shamelessly plugging my startup with my shirt.

This interview is ~45min. For just audio, there is an mp3. You can also get it on your iPod via iTunes.

For more, check out the Traction Book site.

Top linked domains from Facebook pages

 
I've been messing around with Facebook pages for an upcoming DuckDuckGo integration, and I came across some data that seemed interesting enough to share. These are the top linked domains from Facebook pages.

  1. myspace.com (269588)
  2. twitter.com (97669)
  3. youtube.com (54238)
  4. facebook.com (50234)
  5. flickr.com (12541)
  6. en.wikipedia.org (10578)
  7. reverbnation.com (10144)
  8. fotolog.com (7840)
  9. imdb.com (5250)
  10. purevolume.com (5217)
  11. last.fm (3740)
  12. linkedin.com (3730)
  13. soundcloud.com (3713)
  14. ilike.com (2652)
  15. cdbaby.com (2377)
  16. apps.facebook.com (2213)
  17. it.wikipedia.org (2019)
  18. sites.google.com (1984)
  19. etsy.com (1977)
  20. vimeo.com (1941)
  21. bebo.com (1939)
  22. sonicbids.com (1923)
  23. modelmayhem.com (1780)
  24. profile.myspace.com (1692)
  25. wix.com (1633)
  26. amazon.com (1613)
  27. soundclick.com (1401)
  28. tinyurl.com (1364)
  29. fr.wikipedia.org (1348)
  30. cafepress.com (1304)
  31. bandzone.cz (1279)
  32. freewebs.com (1264)
  33. es.wikipedia.org (1236)
  34. google.com (1195)
  35. itunes.apple.com (1113)
  36. zazzle.com (1004)
  37. dailymotion.com (996)
  38. friendster.com (982)
  39. imeem.com (901)
  40. bit.ly (895)
  41. profiles.friendster.com (893)
  42. new.facebook.com (872)
  43. virb.com (728)
  44. yelp.com (707)
  45. groups.yahoo.com (686)
  46. picasaweb.google.com (673)
  47. web.me.com (669)
  48. metroflog.com (657)
  49. geocities.com (633)
  50. bbc.co.uk (584)
In particular, this list aggregates domains extracted from links within the 'Website' sections of Facebook pages. For example, on the DuckDuckGo Facebook page there is a link to the homepage (duckduckgo.com) and to the DDG Twitter stream (twitter.com) within that section. Each of those domains would get one point in the aggregated list. If duckduckgo.com had appeared twice, it would still just get one point.
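
The counting rule described above (one point per domain per page, however many times it is linked) can be sketched as follows. The input shape, the function names and the second example page are hypothetical; stripping only a leading "www." is an assumption, chosen because the lists keep other subdomains such as en.wikipedia.org separate.

```python
from collections import Counter
from typing import Optional
from urllib.parse import urlparse


def link_domain(url: str) -> Optional[str]:
    """Return the lowercased host of a link, without a leading 'www.'."""
    if "://" not in url:
        url = "http://" + url  # tolerate bare 'example.com/path' links
    host = urlparse(url).hostname
    if not host:
        return None
    return host[4:] if host.startswith("www.") else host


def top_linked_domains(pages: dict[str, list[str]]) -> Counter:
    """Count, for each domain, the number of pages linking to it."""
    counts: Counter = Counter()
    for links in pages.values():
        domains = {link_domain(url) for url in links}
        domains.discard(None)
        counts.update(domains)  # a set, so each page adds at most one point
    return counts


pages = {
    "DuckDuckGo": [
        "http://duckduckgo.com/",
        "http://www.duckduckgo.com/about",  # same domain twice: one point
        "http://twitter.com/duckduckgo",
    ],
    "Some band (hypothetical)": [
        "http://www.myspace.com/someband",
        "twitter.com/someband",
    ],
}

for domain, count in top_linked_domains(pages).most_common():
    print(f"{domain} ({count})")
```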

Of course, real people took the time to link to these domains in the context of promoting their online Web presences, so it was interesting to me what they chose in the aggregate. This data confirms the anecdotal evidence I keep seeing of people promoting their FB and Twitter together. I was also intrigued by how high myspace was; I suppose a lot of bands still use it and/or haven't updated their old FB pages.

There were a few sites I actually hadn't heard of, e.g. some of the music stuff, wix, modelmayhem & virb. Not that I should hear of every site, but these must have a lot of traction already to be that high in these lists. 

If you just look at "high quality" FB pages (custom urls, no default images, etc.), you get a similar but slightly different list & ordering.

  1. twitter.com (32597)
  2. myspace.com (28125)
  3. youtube.com (11511)
  4. facebook.com (11007)
  5. flickr.com (3225)
  6. reverbnation.com (1413)
  7. linkedin.com (1267)
  8. en.wikipedia.org (1016)
  9. ilike.com (787)
  10. last.fm (763)
  11. purevolume.com (685)
  12. soundcloud.com (681)
  13. vimeo.com (608)
  14. imdb.com (483)
  15. apps.facebook.com (411)
  16. bebo.com (389)
  17. cdbaby.com (376)
  18. sonicbids.com (372)
  19. bit.ly (337)
  20. tinyurl.com (279)
  21. itunes.apple.com (273)
  22. google.com (264)
  23. fotolog.com (263)
  24. friendfeed.com (259)
  25. nscs.org (242)
  26. imeem.com (241)
  27. etsy.com (212)
  28. it.wikipedia.org (209)
  29. modelmayhem.com (200)
  30. itunes.com (188)
  31. amazon.com (178)
  32. cafepress.com (171)
  33. delicious.com (154)
  34. yelp.com (153)
  35. zazzle.com (153)
  36. dailymotion.com (146)
  37. virb.com (133)
  38. ustream.tv (111)
  39. soundclick.com (100)
  40. bbc.co.uk (99)
  41. legacyrecordings.com (97)
  42. friendster.com (93)
  43. blogtalkradio.com (92)
  44. digg.com (92)
  45. formspring.me (90)
  46. picasaweb.google.com (90)
  47. lululemon.com (90)
  48. woodstock.com (86)
  49. groups.yahoo.com (86)
  50. de.wikipedia.org (86)

Here are the top types (counted for pages with at least some info on them).

  1. Musician (421154)
  2. Other Business (417494)
  3. Other Public Figure (232868)
  4. Professional Service (140365)
  5. Non-Profit (129352)
  6. Website (106490)
  7. Products (95604)
  8. Education (76957)
  9. Store (64733)
  10. Visual Artist (61575)
  11. Club (61043)
  12. Restaurant (56447)
  13. Health and Beauty (51983)
  14. Sports / Athletics (49001)
  15. Fashion (46189)
  16. Food and Beverage (43737)
  17. Communications (36180)
  18. Athlete (30520)
  19. Religious Center (29811)
  20. Technology Product / Service (28728)
  21. Actor (28101)
  22. Hotel / Lodging (27358)
  23. Sports Team (27322)
  24. Online Store (26261)
  25. Film (25190)
  26. Religious Organization (25093)
  27. Writer (25082)
  28. Bar (24957)
  29. Politician (24139)
  30. Consumer Product (22421)
  31. Comedian (22359)
  32. Real Estate (22296)
  33. Technology and Telecommunications Service (21176)
  34. Model (20944)
  35. Event Planning Service (20671)
  36. TV Show (18739)
  37. Museum / Attraction (17960)
  38. Game (16729)
  39. Travel (16097)
  40. Pets (15511)
  41. Retail (15123)
  42. Travel Service (15018)
  43. Automotive (14530)
  44. Cafe (14237)
  45. Government (13601)
  46. Medical Service (12909)
  47. Automotive Dealer / Vehicle Service (9729)
  48. Home Living (9227)
  49. Home Service (8247)
  50. Library / Public Building (7086)

Note that this crawl was completed before the whole open graph/like thing, so these were all "real" pages. I'm currently crawling all the new stuff and working on ways to "keep it real," so to speak.

Traction trumps everything

 
If you ask an angel investor what they look for in a company, they'll usually rattle off a list of things that describe the ideal angel investment: huge market, great team, superior product, sustainable competitive advantage, etc. Trouble is, even if your startup has those things (and most don't), you still have to convince each investor.

Good thing there is a shortcut: traction. If you show investors some traction, the rest of the conversation becomes a lot easier. They'll generally be willing to overlook some of your deficiencies, probably even more than they should. 

Traction is real customers. If you charge for your product, it's real paying customers. If your product is free, it's a real user base. In other words, traction is a signal that your team can produce real results in a real market.

You don't need much traction to entice investors. In fact, people like me prefer just a little, because when you have a lot your valuation will probably be too high. The ideal scenario from my perspective is that you already have a bit of traction and you know how to get more with some investment. 

You played around with various traction verticals and you identified a few promising ones that brought in your early customers. Now you just need $x to experiment more heavily with those channels. That's a compelling story.

I should also point out that once you have traction, you may not need investors at all. 

How-to learn about angel/vc term sheets

 
I think every startup entrepreneur (and angel investor) should have a good understanding of financing term sheets. Yes, even bootstrappers. I haven't raised any money for my companies that required a term sheet (just friends & family money in my first company), and yet I still think it is important for a number of reasons.

First, most companies will raise money at some point, and you don't want to be learning everything when you need to raise money because it will be distracting and you'll make mistakes that in hindsight seem stupid. Second, you never know exactly when you're going to be in a financing situation. Third, a lot of the same principles carry over into M&A term sheets, and even if you don't raise money I hope you may be involved in an acquisition at some point. And perhaps most importantly, fourth, it doesn't take very long.

I've written up the following directions to help you get there efficiently. Don't do it all in one sitting because you want your mind to digest the concepts over time. I suggest doing it over the course of one week, setting aside half an hour each day to go through this stuff.

Fortunately there are now a lot of great free, public resources to learn about financing term sheets. I would start by familiarizing yourself with some actual term sheets. For seed rounds, check out these (reading them slowly from top to bottom):
Now that you've seen what a term sheet looks like, go through each term and read the associated post in Brad Feld's Term Sheet series, a series of blog posts where he explains each term. Some of the terms he covers are not in those docs, because they are more for venture rounds. You can skip those for now.

Once you feel you understand what the terms in the seed docs mean, read this Startup Company Lawyer post explaining how they differ from each other. Once you understand that, then you're ready to get a bit more complicated and look at more complete venture term sheets:
Launch the Term Sheet Generator, and open the NVCA doc. Now go through each term and go to the relevant section in the generator by accessing the select box at the top. Some of the terms will be familiar from before. The generator offers a lot of additional background/insight from the context of building a term sheet. Look for the links on the right entitled 'Click here to hide/show explanatory notes.' Also look for the market data links, e.g. liquidation preference. And of course refer back to Brad's term sheet series for any posts you skipped before.

By now you should be familiar with pretty much every term and its place in the term sheet, and you're ready to digest some more advanced material. First check out this Startup Company Lawyer post on how the YC seed docs differ from traditional Series A docs and this Series Seed post on the same topic. Then check out these applicable Venture Hacks posts: Term Sheet Hacks, Option Pool Shuffle, Term sheet tune-up, & Terms that hurt. Finally, here are some other posts that I think round out term sheet knowledge: The Challenge of The Ideal First Round Term Sheet (Brad Feld), Ideal first round funding terms & Don't shop your term sheet (Chris Dixon).

If you'd really like a book, I'd suggest the brief Term Sheets & Valuations, although I want to underscore that I don't think it is necessary. I purchased this book a number of years ago before I knew about any of the above (and before most of it existed!). I found it useful then to get an intro into this stuff, and I just took it out and skimmed it and think it is still useful.

My experiences with ad.ly

 

ad.ly is a relatively new, well-funded startup that puts ads "in-stream" on twitter, and is supposedly expanding to Facebook, MySpace, etc. Such expansion may be wise in the wake of Twitter's TOS changes, but that's another story.

This story is about what happened when I tried to spend money on ad.ly. In a nutshell:


adly2.png
I tried to spend $130 three times. The first two campaigns (bottom) resulted in no spend. The third attempt resulted in $25 spent, i.e. about 20% of my intention.

So what's going on here? It's this:

adly.png
Pretty much everyone I tried to advertise with Denied me. Except, they actually didn't. It ominously says Denied, but apparently Denied can also mean expired.

I ended up getting so frustrated that I contacted everyone who supposedly Denied me and asked them why. It turns out most people never even got notified of the ad request. Each campaign has an expire date, and when it hit that date, it just said Denied for everyone that didn't respond. Not sure why they never received notification--ad.ly doesn't have emails or something? I didn't press this point. 

Those who did get notified either said a) they never intended to take ads in their stream or b) they didn't want to do my ad because they didn't know the product, which incidentally I imagine would be most ads. 

My hunch is ad.ly got a lot of people to sign up who either didn't realize what they were doing or said "sure, I'll make money off my Twitter," but when it comes time to actually do so, they're like "wait, I don't want to show ads to my followers..."

Let me back up a bit.

adly4.png
Ad.ly got a lot of press for signing up celebrity accounts. Trouble is they're not cheap.

But that's OK--as lead investor Mark Suster said somewhere, it's more about the long-tail of the twitter stream. That is, buying $3 tweets instead of $3,000 ones.

I wasn't going to spend for a huge celebrity just to test out the platform. Additionally, I wasn't going to spend for just anyone. I wanted to get people who actually had influence on Twitter--people that have followers that listen to them to the extent that my messages could possibly be retweeted.

Unfortunately, ad.ly's current UI makes it really difficult to find good people to pick. All they tell you is the above two data points (followers and price). If you click on someone you then also get their avg tweets/day, their about description, categories & a link to their profile page. 

This still isn't enough info. At the very least I want to know how many people they follow to weed out those people who have 30K followers but are also following 30K people. So now just to pre-screen people, I have to click once to open this detail view (which is a slow JS fade-in thing btw), and then again to get to their twitter profile page (and we all know twitter can be slow...).

But that wasn't enough for me anyway, because it doesn't tell me anything about influence. So initially I also searched twitter manually for RTs and divided price/follower count to get a sense for how good a deal it was. Needless to say, it was a lengthy process.

For my first campaign I ended up targeting CaliLewis. It was of course Denied. The tweet was "Check out Duck Duck Go, a cool new search engine http://duckduckgo.com/ RT! (Ad)" btw.

For the second campaign I didn't want to spend all the manual effort, so I spent (probably more time) hacking something together :). I ended up doing this:

  1. Downloading the full list of top influential twitter users from trst.me (~22K users).
  2. Hacking ad.ly's URLs and then downloading a big list of people you can advertise with.
  3. Cross checking with the trst.me list to only keep top influencers.
  4. Cross checking that output with the twitter API to only keep people recently retweeted.
  5. Taking that subset and downloading ad.ly info including price, followers, & avg tweets.
  6. Filtering out people with > 10 tweets/day and then sorting by price/follower count.
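
The filtering and ranking in steps 3 to 6 can be sketched as below. It is not the original script: the record fields, handles and numbers are hypothetical, and the two input sets stand in for the trst.me list and the retweet check against the twitter API.

```python
from dataclasses import dataclass


@dataclass
class Publisher:
    """One advertisable account (fields are hypothetical)."""
    handle: str
    price: float           # price per sponsored tweet, in dollars
    followers: int
    tweets_per_day: float


def shortlist(publishers, influencers, recently_retweeted,
              max_tweets_per_day=10.0):
    """Keep influencers who were recently retweeted and tweet at most
    max_tweets_per_day, cheapest price per follower first."""
    influencers = {h.lower() for h in influencers}
    recently_retweeted = {h.lower() for h in recently_retweeted}
    keep = [
        p for p in publishers
        if p.handle.lower() in influencers
        and p.handle.lower() in recently_retweeted
        and p.tweets_per_day <= max_tweets_per_day
        and p.followers > 0  # avoids dividing by zero below
    ]
    return sorted(keep, key=lambda p: p.price / p.followers)


# Made-up sample data, for illustration only.
publishers = [
    Publisher("alice", price=10.0, followers=20_000, tweets_per_day=4),
    Publisher("bob", price=3.0, followers=2_000, tweets_per_day=6),
    Publisher("carol", price=5.0, followers=50_000, tweets_per_day=25),
    Publisher("dave", price=2.0, followers=8_000, tweets_per_day=3),
]
picks = shortlist(publishers,
                  influencers={"alice", "bob", "carol"},
                  recently_retweeted={"Alice", "bob", "carol", "dave"})

for p in picks:
    print(f"{p.handle}: ${p.price / p.followers * 1000:.2f} per 1K followers")
```

Price per follower is only a proxy for value; it says nothing about whether the account will actually approve the ad.
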
I didn't want to put all my eggs in one basket so I did 12 people at lower cost and 3 different tweets. The tweets were:
All Denied. At this point I got pretty frustrated and sent out the emails I talked about above. Are my tweets really that onerous? No. It's just apparently hard to actually find people on ad.ly to advertise with.

So for my third campaign, I went with even lower priced people, but still influencers. I figured if I tried it with enough people I would be bound to get some hits, and perhaps smaller fish would be more likely to respond (and actually get the notification). 

I did 20 people, 5 of whom approved. The end result was what you saw at the top: 71 clicks. Was it worth all of this effort? No.

For the record, I used these three ads:
  • Duck Duck Go is the new Google http://duckduckgo.com/ (Ad)
  • Duck Duck Go is a new search engine http://duckduckgo.com/ (Ad)
  • New search engine Duck Duck Go http://duckduckgo.com/ (Ad)
I had approvals in each category (1, 2, 2) and unsurprisingly the first performed best, though the sample size is way too small to be meaningful.

One other annoyance worth mentioning is ad.ly charged my card for the full possible spend even though I only ended up spending 20% of it. That's just annoying.

In the end I got a $0.35 CPC ($25 spent for 71 clicks). I suppose that isn't bad compared to some other platforms, e.g. Facebook. If you try it, I'd be interested to know whether the traffic converts well for you.
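
For reference, the cost-per-click and spend figures follow directly from the numbers above. This short sketch assumes the $130 budget applied to the third campaign on its own; the variable names are illustrative.

```python
spent = 25.0     # dollars actually spent on the third campaign
budget = 130.0   # dollars intended for that campaign (assumed per campaign)
clicks = 71      # clicks reported for the campaign

cost_per_click = spent / clicks
share_of_budget = spent / budget

print(f"CPC: ${cost_per_click:.2f}")
print(f"share of intended budget spent: {share_of_budget:.0%}")
```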

And of course I didn't try the celebrity strategy, which may actually work a lot better. If anyone has tried that (high price) I'd also love to know the results.
