November 2008 Archives

Turning off Logging in daemontools (djbdns' dnscache)

 
After optimizing my Web crawler for the Parked Domains Project, I started crawling so fast that the log process for my DNS server was eating up 20% of one of the CPUs (and wasting a lot of I/O as well).

I run a local DNS cache using djbdns on each crawling server, and it handles all local DNS queries. My dnscache is run by daemontools, and if you are familiar with this world, you already know I was using multilog for logging.

The logging for dnscache is basically useless when you are not debugging, and it is very extensive.  So it makes perfect sense why it was taking up so much CPU and I/O.  

I'm writing this post to correct the Internet.  When searching for how to turn off logging for dnscache or multilog, you get a lot of instructions telling you to replace your log/run file with this:

exec setuidgid daemon multilog -*

That will free up the I/O, but it only recovers about half of the CPU utilization (at least in my case).  The problem is that your system is still piping the log from the main process to multilog; multilog just isn't writing it anywhere.

What you really want to do is stop the logging at its source.  To do so, don't mess with your log/run file at all.  Instead, change the redirect line in your actual run file from

exec 2>&1

to

exec 1>/dev/null 2>&1

Now nothing will go to multilog at all.  Of course, you need to restart or HUP the log and run processes for the change to take effect.
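The difference between the two redirects can be checked with a minimal sketch. It makes a few assumptions: the `chatty` function is a hypothetical stand-in for dnscache, and `wc -c` stands in for multilog at the other end of the pipe. Neither is part of a real daemontools setup.

```shell
#!/bin/sh
# Hypothetical stand-in for a chatty daemon: writes to both stdout and stderr.
chatty() {
  echo "query example.com"
  echo "cached example.com" >&2
}

# Original run-file style: stderr joins stdout, so everything flows down the pipe.
piped_bytes=$( (exec 2>&1; chatty) | wc -c | tr -d ' ' )

# Modified style: stdout is pointed at /dev/null first, then stderr follows it,
# so nothing is ever written into the pipe.
quiet_bytes=$( (exec 1>/dev/null 2>&1; chatty) | wc -c | tr -d ' ' )

echo "bytes reaching the logger with 'exec 2>&1': $piped_bytes"
echo "bytes reaching the logger with 'exec 1>/dev/null 2>&1': $quiet_bytes"
```

The order of the redirects matters. Writing `2>&1` before `1>/dev/null` would leave stderr attached to the pipe, and the logger would still receive that output.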

Speeding up Perl Regular Expressions using Regexp::List

 
I spent the last 24 hours optimizing the Web crawler for the Parked Domains Project.  The previous bottleneck was obviously CPU.  After a bunch of profiling and benchmarking, I determined that a particular block of Perl regexp was causing most of the problem.

I was already compiling what I could (using /o and qr//).  I was also already trying to run the patterns I expected to match more often, and more quickly, first, as well as trying to anchor as much as possible (i.e. using /^ and $/ and just using long literal strings).  And I always use clustering (?: instead of capturing (, where appropriate.

What I didn't do, however, was mess with alternations, e.g. cat|dog|bird.  Disclaimer: there isn't a be-all and end-all to regexp optimizations, and what works in one situation may not work in another. It totally depends on your regexp and what you are throwing at it.

Alternation is usually slow in Perl because the engine has to backtrack when trying each alternative.  It's much faster to give Perl a character sieve up front, e.g. (?=[cdb]), and then factor out common prefixes and suffixes.  The problem is that when you have a ton of alternatives, doing all this by hand is a pain and it decreases readability to almost zero.  Which is why I had avoided it to date...

Enter Regexp::List.  I've used this module before, but never as extensively and I never benchmarked it either.  It does all of this stuff automatically.  Not only did my regexp speed increase by about 5x, but my readability increased as well!  

I really didn't think that such a simple change would make such a difference.  The reason for the readability increase, btw, is that I now put all the alternatives in an array and then give that to the module, e.g.:

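use strict;
use warnings;
use Regexp::List;

# One alternative per line keeps the list easy to read and extend.
my @regexp = (
  'cat',
  'dog',
  'bird',
);

# Build a single optimized, case-insensitive regexp from the list.
my $regexp = Regexp::List->new;
my $qr     = $regexp->set(modifiers => 'i')->list2re(@regexp);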


Compete.com #s are Way Off on Low Traffic

 
Like many entrepreneurs, I check the compete.com #s for my Web sites.  I've found that they are way off for relatively low traffic sites.  Take my recently launched startup, Duck Duck Go.

The compete stats for Duck Duck Go show about 14K unique visitors in Sep. and then 3.1K in Oct.  According to awstats, the real #s were 8.3K in Sep. and then 6.2K in Oct.  That is roughly 70% too high for Sep. and 50% too low for Oct.

To most people in the know about this stuff, I suppose this isn't a big shocker.  Me included.  However, I haven't looked at this in a while, and I was still somewhat surprised how far off it was.

Granted, it must be difficult to extrapolate on the low end.  And this case may be even more difficult.  I launched Duck Duck Go on Sep. 25 on Hacker News, and that site and Reddit generated most of the traffic in Sep. over those last few days.  

So maybe that audience has more compete toolbars installed than the Oct. cohort or something.  I don't know.  But I do know it is way off.  It will be interesting to see if it gets better at higher traffic.  Of course to test that theory, I'd have to get some higher traffic!

I guess I am annoyed about it because I think people evaluating your company do look at these things. I know I do. For instance, it's right there front and center on our crunchbase page.  And the immediate effect I think people get from looking at that is that we kind of suck.  But if the #s were accurate, perhaps they wouldn't get that immediate first impression.

About

   

I'm a solo founder of a new search engine and an angel investor. There is more about me on my home page.
I'm also doing a book on getting traction.
