Recently in Programming Category

Hack Hack Go

 

I want to make Duck Duck Go a better search engine for programmers like me. If you're a programmer, I'd appreciate your feedback and ideas.

Duck Duck Go is intended to be a general purpose search engine and that isn't going to change. Our user base certainly reflects this purpose: it is quite varied on every metric I've tried to measure.

Yet there are certain search niches like casual research where Duck Duck Go really excels. I'd like programming to be one of those areas.

To that end, here's what I've got so far.

  • A general search engine. The good news here is I know a lot of programmers who use it as their primary search engine. It works and (at least some) people really like it. I'm always willing to add new features whose absence is preventing people from switching. Maps and images are currently on that list.

  • Zero-click Info. There are red boxes above links on some searches with info you can get without clicking, i.e. on-site. We have a lot of info that is specific to programming topics. Of course we have Wikipedia, e.g. Dijkstra's algorithm. But I've also added software sources, i.e. github, freshmeat, download.com, versiontracker, and sourceforge.

  • Category pages. I've mined sources to create useful topic lists for browsing/learning, e.g. Search Algorithms.

  • Disambiguation pages. I've created pages to help you isolate programming topics in common query terms, e.g. cookie links to HTTP cookie, which has results more geared toward that meaning. There are also programming-specific disambiguation pages, e.g. nearest neighbor.

  • Crowd-sourced links. I also mine links from crowd-sourced sites, e.g. coroutine.

  • Wikipedia paragraphs. I've deep-indexed Wikipedia at the paragraph level. You don't have to match a topic nearly exactly anymore to get some Zero-click Info, e.g. python switch statement. This is way more than a regular search index, as it is sub-section/section/title aware and uses some NLP for relevancy. I hope to make that matching algorithm even more sophisticated over time.

  • Bang. There are a few hundred !x shortcuts that can be used, e.g. !cpan Net::DNS

Here's what I'm thinking of doing.

  • O'Reilly Paragraphs. I think it would be awesome if I could index all O'Reilly books at the paragraph level, like I've done for Wikipedia. This content is well-written, encyclopedia-like, is largely in paragraph form, and has surrounding contextual information (section titles, etc.) that will make the relevance matching excellent. Problem is, I don't know anyone at O'Reilly. I think it's a win-win because it can link right to their Safari product or individual book pages. And I don't think it would cannibalize Safari because you're getting people in a very different context (when searching). Anyway, I thought I'd start by writing them an email. I did that and haven't heard back yet.

  • More topic sources. I'm going to add man/info pages, so you can type in a command and get a description. I could also do packages for distributions/languages in a similar manner if people think that would be useful to them. I've explored indexing these at the paragraph level, but the content doesn't seem to work well for that purpose. Other, more general sources, like Amazon product descriptions, may be incidentally useful to programmers. I'd love your thoughts here.

  • Bang documentation. The current bang commands aren't documented. I'll document them as well as add more that are useful to programmers. Any you want?

  • Zero-click Info by IM. I'm thinking of making a chatbot that will respond to you via IM with Zero-click Info (and links). So you send it a search query and we'll send you back a description along with a few links. Would you use that?

  • API integration. I wrote the Perl binding for Wolfram Alpha. I'm exploring ways to use it to integrate good WA content. I'm open to using other APIs, but I'd strongly prefer to get dumps instead so I can ensure speed. Another one I'd like to integrate for programmers is ErrorHelp.com (previously bug.gd).

That's where I'm at right now. If you're a programmer, my questions for you are:

  1. Do you find the above compelling?

  2. Do you have any particular feedback/ideas?
Feel free to comment below, on HN, on reddit, or email me directly.

Things about Web Images I Just Learned

 
I thought I knew everything you needed to know about Web images.  But, of course, I didn't. Here's what I just learned when launching the new icon bar on the homepage of Duck Duck Go. We wanted it to function sort of like the Apple dashboard (and on the Web like Schmedley's bottom bar).

  • img{-ms-interpolation-mode:bicubic}. Short version: if you resize images dynamically, they will look bad on IE unless you put this in your CSS.

    Longer version:  We ended up using the YUI Animation Library to do the animation.  But no matter how we did it using 1 image, it always looked terrible on IE.  Even if we used an image exactly as big as the big size, and made the small size exactly half of the big size (which should be easy to resize), it still looked bad.

    So then we tried using two images, which sort-of worked, but had its own issues.  Sometimes it would slow down the animation. It used almost double the image size and requests (a big no-no), and the actual resizing still looked bad (as opposed to the endpoints)!

    This was unacceptable, so I decided to dig deeper on the Web about this issue.  It turns out modern browsers use Bicubic interpolation to resize images and make them look good in the process. For whatever reason, IE7+ has decided to turn it off by default. I'm guessing this is because it takes some processing power, but it renders resized images looking terrible so I personally don't think this is a good trade off.  Anyway, if you add that above CSS to your page, IE7+ will use this method and your images will look good. I suppose I never hit this before because usually you shouldn't be resizing images dynamically. But there are cases where you want to do it...

    Unfortunately, it still doesn't work for IE6, on which you need to use the good ol' AlphaImageLoader (sizingMethod='scale') if you want to support that browser.

  • Photoshop/Illustrator's 'Save for Web...' does not fully optimize. Perhaps my versions of Photoshop and Illustrator are too old, but I suspect this is still the case with the newer versions. I pretty much used these blind, assuming they were optimizing correctly. And don't get me wrong, it does a decent job, but it's just not the best. Instead, run your images through Yahoo!'s smush.it site.

  • If you really do not need PNG-24, use PNG-8. PNG-8 is really a better GIF. But it is limited in color palette and transparency with respect to PNG-24. That being said, often you don't need the difference, especially for things like icons. When you can, use PNG-8 because you'll get much smaller file sizes.

    You might also think you need PNG-24 when you really don't. I did. I had these icons made that had full transparency. I knew, however, they were going to be on a white background, so I really didn't need all the transparency. Yet when I tried to save it as PNG-8, it just looked bad. The colors were all off. So it made me think that I needed PNG-24, but in reality it was Photoshop's optimization that was doing a poor job. In their defense, I wasn't helping them out by setting the white background ahead of time, which leads me to:

  • If you want to save a PNG-24 image as PNG-8, put in the background first. Once I made a white background layer, Photoshop then did a great job of saving it as PNG-8. And in fact, I could reduce the file size even more by using fewer than 256 colors. Of course, I still had to run it through smush.it.

  • CSS sprites may reduce your page load time (and image size further). CSS sprites are a way to group your images into one big file and then display each one separately by positioning that file as a background image via CSS. There is a useful Web site to help you make them at csssprites.com. I couldn't figure out how to use it with my resizing requirements, but in the general case it should at least be tried, especially for icons whose color palettes are similar. You get a win in image size. But you get a bigger win in reducing HTTP requests.

  • Custom icons are not that expensive. We got $40 custom icons and $10 recolored icons from iconshock.com. We talked to other icon designer firms as well, and prices were similar. Full disclosure: we created more than 3 icons (7), so we got a bit of a bulk discount. I did have a bad experience with iconeden.com, however. So I'd stay away from them.
For more image optimization tips, check out Yahoo!'s presentation.
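
Two of the tips above (the IE interpolation fix and CSS sprites) translate directly into a few lines of CSS. This is only a rough sketch: the file name icons.png, the class names, and the 32px icon size are made up for illustration and are not the actual Duck Duck Go stylesheet.

```css
/* IE7+ tip: ask IE to use bicubic interpolation when it scales images.
   Other browsers ignore this vendor-specific property. */
img {
  -ms-interpolation-mode: bicubic;
}

/* Sprite tip: all icons live in one file (hypothetical icons.png,
   32px icons laid out side by side), so the browser makes one request. */
.icon {
  display: block;
  width: 32px;
  height: 32px;
  background-image: url(icons.png);
  background-repeat: no-repeat;
}

/* Each class shifts the shared image so a different icon shows through
   the 32px window. The class names are assumptions. */
.icon-search { background-position: 0 0; }
.icon-maps   { background-position: -32px 0; }
.icon-images { background-position: -64px 0; }
```

An element such as <span class="icon icon-maps"></span> would then show the second icon in the file.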

Update: additional comments can be found here.

A Harsh CSS Environment for Testing Widgets

 
Embedded widgets can face harsh CSS environments, but usually not this harsh:

#harsh * {
border: thin dotted #00FF00 !important;
display: block !important;
margin: 20px !important;
outline: 1px dotted red !important;
padding: 20px !important;

background: #00ff00 !important;
cursor: move !important;

clear: both !important;
float: left !important;
height: 0 !important;
max-height: 0 !important;
max-width: 0 !important;
min-height: 100px !important;
min-width: 100px !important;
visibility: hidden !important;
width: 0 !important;

bottom: 100px !important;
clip: rect(100px, 50px, 100px, 50px) !important;
left: 100px !important;
overflow: visible !important;
position: absolute !important;
right: 100px !important;
top: 100px !important;
vertical-align: sub !important;
z-index: 100 !important;

color: red !important;
direction: rtl !important;
font: oblique small-caps 900 20px/50px arial !important;
font-size-adjust: .01 !important;
font-stretch: ultra-expanded !important;
letter-spacing: 20px !important;
list-style: decimal inside !important;
text-align: right !important;
text-decoration: blink !important;
text-indent: 100px !important;
text-shadow: #000 30px 30px !important;
text-transform: uppercase !important;
unicode-bidi: embed;
white-space: pre !important;
word-spacing: 20px !important;

border-collapse: separate !important;
border-spacing: 30px !important;
caption-side: bottom !important;
empty-cells: show !important;
table-layout: fixed !important;
}

If your widget looks OK inside <div id="harsh"></div>, then it will probably look OK anywhere.  I made this HTML example (view source) for easy testing.

Why does this matter? Suppose a site has a black background and white text, and your widget sets a white background but no text color. Your widget inherits the site's white text, so none of your text would show.
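
Here is that scenario as a minimal sketch. The id my-widget and both stylesheets are hypothetical, just to show how the inherited color bites:

```css
/* Host page stylesheet (hypothetical): a dark theme. */
body {
  background: #000;
  color: #fff;
}

/* Widget stylesheet, broken version: sets a background but no color,
   so the text inherits #fff from body and disappears on white. */
#my-widget {
  background: #fff;
}

/* Widget stylesheet, fixed version: always set both halves of the pair. */
#my-widget {
  background: #fff;
  color: #000;
}
```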

To deal with a harsh environment, you need some armor:

<style type="text/css">
#armor, #armor * {
border: none !important;
display: block !important;
margin: 0 !important;
outline: none !important;
padding: 0 !important;

background: #fff !important;
cursor: auto !important;

clear: none !important;
float: none !important;
height: auto !important;
max-height: none !important;
max-width: none !important;
min-height: 0 !important;
min-width: 0 !important;
visibility: visible !important;
width: auto !important;

bottom: auto !important;
clip: auto !important;
left: auto !important;
overflow: auto !important;
position: relative !important;
right: auto !important;
top: auto !important;
vertical-align: top !important;
z-index: 1 !important;

color: #000 !important;
direction: ltr !important;
font: normal normal normal 11px/14px tahoma,sans-serif !important;
font-size-adjust: none !important;
font-stretch: normal !important;
letter-spacing: normal !important;
list-style: none !important;
text-align: left !important;
text-decoration: none !important;
text-indent: 0 !important;
text-shadow: none !important;
text-transform: none !important;
unicode-bidi: normal !important;
white-space: normal !important;
word-spacing: normal !important;

border-collapse: collapse !important;
border-spacing: 0 !important;
caption-side: left !important;
empty-cells: hide !important;
table-layout: auto !important;
}
</style>

If you wrap your widget in <div id="armor"></div>, it should work OK. I made another HTML example (view source) for testing this armor.

I tested #armor cross browser using my test systems and browsershots.org. Of course, there are most likely still bugs, so please tell me about them!

To develop #harsh, I used the w3schools CSS Reference, which you can also use to figure out if you want to change the properties in #armor, or apply more thereafter. 

To apply additional styling after #armor, use ids instead of classes, i.e. id="" with # selectors and not class="" with . selectors, because a particularly harsh use of #id * is more specific than any class selector and will override your classes. Of course, if you aren't that paranoid, you could back off the * in #armor and use classes instead.
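
As a rough sketch of why ids hold up and classes don't, consider a widget element written as <p id="widget-title" class="widget-title"> inside #armor. Those names are made up for illustration. The comments give each selector's specificity as (ids, classes, elements):

```css
/* (0,1,0): a class selector. It loses to any "#id *" rule that is also
   marked !important, including #harsh * and #armor * themselves. */
.widget-title {
  color: #036 !important;
}

/* (1,0,0): ties with "#armor *" and "#harsh *", so it only wins if it
   appears later in the source than they do. */
#widget-title {
  color: #036 !important;
}

/* (2,0,0): two ids beat "#id *" regardless of source order. */
#armor #widget-title {
  color: #036 !important;
  font-weight: bold !important;
}
```

The !important is still needed in every case, because a declaration without it cannot override one that has it.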

You could also just use inline styling, i.e. style="". There may also be a better way to do it that I just haven't thought of yet. If you know of one, do tell...

Speeding up Perl Regular Expressions using Regexp::List

 
I spent the last 24 hours optimizing the Web crawler for the Parked Domains Project.  The previous bottleneck was obviously CPU.  After a bunch of profiling and benchmarking, I determined that a particular block of Perl regexp was causing most of the problem.

I was already compiling what I could (using /o and qr//).  I was also already trying to run things I thought would match more and faster first, as well as trying to anchor as much as possible (i.e. using /^ and $/ and just using long literal strings).  And I always use clustering (?:...) instead of capturing (...), where appropriate.

What I didn't do, however, was mess with alternations, e.g. cat|dog|bird.  Disclaimer: there isn't a be-all and end-all to regexp optimizations, and what works in one situation may not work for another--it totally depends on your regexp and what you are throwing at it.

Alternation is usually slow in Perl because the engine has to backtrack when trying each alternative.  It's much faster to give perl a character sieve up front, e.g. (?=[cdb]), so it can reject any position that doesn't start with one of those letters, and then factor out common prefixes and suffixes.  The problem is that when you have a ton of alternatives, doing all this is a pain and it decreases readability to almost zero.  Which is why I had avoided it to date...

Enter Regexp::List.  I've used this module before, but never as extensively and I never benchmarked it either.  It does all of this stuff automatically.  Not only did my regexp speed increase by about 5x, but my readability increased as well!  

I really didn't think that such a simple change would make such a difference.  The reason for the readability increase, btw, is that I now put all the alternatives in an array and then give that to the module, e.g.:

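use strict;
use warnings;
use Regexp::List;

# Keep the alternatives in a plain list so they stay readable.
my @regexp = (
  'cat',
  'dog',
  'bird',
);

# list2re builds one optimized, case-insensitive regexp from the list.
my $regexp = Regexp::List->new;
my $qr = $regexp->set(modifiers=>'i')->list2re(@regexp);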


About

   

I'm a solo founder of a new search engine and an angel investor. There is more about me on my home page.