I spent the last 24 hours optimizing the Web crawler for the Parked Domains Project. The previous bottleneck was obviously CPU. After a bunch of profiling and benchmarking, I determined that one particular block of Perl regexps was causing most of the problem.
I was already precompiling what I could (using /o and qr//). I was also already ordering the tests so that the ones I expected to match most often, and most quickly, ran first, and anchoring as much as possible (i.e. using /^ and $/ and long literal strings). And I always use clustering (?:...) instead of capturing (...) where appropriate.
What I didn't do, however, was mess with alternations, e.g. cat|dog|bird. Disclaimer: there isn't a be-all and end-all of regexp optimization, and what works in one situation may not work in another. It totally depends on your regexp and what you are throwing at it.
Alternation is usually slow in Perl because the engine has to backtrack as it tries each alternative in turn. It's much faster to give Perl a character sieve up front, e.g. (?=[cdb]), a lookahead on the set of possible first characters, and then factor out common prefixes and suffixes. The problem is that when you have a ton of alternatives, doing all this by hand is a pain and it decreases readability to almost zero. Which is why I had avoided it to date...
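To make that concrete, here is a rough sketch of what hand-factoring looks like on a tiny list. The extra word 'cow' and the test strings are made up for illustration (they are not from the crawler), and this only shows the shape of the technique, not a tuned pattern.

```perl
use strict;
use warnings;

# Plain alternation: each alternative is tried in turn at every position.
my $plain = qr/(?:cat|cow|dog|bird)/;

# Hand-factored equivalent (illustrative only):
#   (?=[cdb])   sieve: skip positions that cannot start any alternative
#   c(?:at|ow)  shared prefix "c" factored out of "cat" and "cow"
my $factored = qr/(?=[cdb])(?:c(?:at|ow)|dog|bird)/;

# Both patterns should accept and reject the same strings.
for my $text ('hot dog', 'a cow', 'bird', 'cab', 'fish') {
    my $p = $text =~ $plain    ? 1 : 0;
    my $f = $text =~ $factored ? 1 : 0;
    printf "%-8s plain=%d factored=%d\n", $text, $p, $f;
}
```

Even with four words the factored version is harder to read, and it only gets worse as the list grows.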
Enter Regexp::List. I've used this module before, but never this extensively, and I had never benchmarked it either. It does all of this stuff automatically. Not only did my regexp speed increase by about 5x, but my readability increased as well!
I really didn't think that such a simple change would make such a difference. The reason for the readability increase, btw, is that I now put all the alternatives in an array and then give that to the module, e.g.:
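# All the alternatives live in one easy-to-read list.
my @regexp = (
    'cat',
    'dog',
    'bird',
);
use Regexp::List;
my $regexp = Regexp::List->new;
# Build a single optimized, case-insensitive regexp from the list.
my $qr = $regexp->set(modifiers=>'i')->list2re(@regexp);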
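Since the speedup depends entirely on your patterns and your input, it is worth measuring on your own data. Below is a minimal benchmarking sketch using the core Benchmark module. The word list and sample lines are placeholders, the helper count_matches is a hypothetical name, and the Regexp::List candidate is only added if the module is installed. Newer perls (5.10 and later) also optimize literal alternations internally, so the gap you see may be much smaller than mine was.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Placeholder word list and input; substitute real patterns and crawler text.
my @words = qw(cat dog bird);
my @lines = ('the quick brown fox', 'a bird in the hand', 'DOGS bark') x 1000;

# Baseline: plain case-insensitive alternation.
my $alt = join '|', map { quotemeta } @words;
my %candidates = (plain => qr/(?:$alt)/i);

# Add the Regexp::List version only if the module is available.
if (eval { require Regexp::List; 1 }) {
    $candidates{list} =
        Regexp::List->new->set(modifiers => 'i')->list2re(@words);
}

# Count how many sample lines a pattern matches.
sub count_matches {
    my ($re) = @_;
    return scalar grep { $_ =~ $re } @lines;
}

# Print the match counts first: timings are only comparable if they agree.
for my $name (sort keys %candidates) {
    printf "%-6s matched %d lines\n", $name, count_matches($candidates{$name});
}

# Run each candidate for at least one CPU second and compare rates.
my %tests;
for my $name (keys %candidates) {
    my $re = $candidates{$name};
    $tests{$name} = sub { count_matches($re) };
}
cmpthese(-1, \%tests);
```

Run it against a representative sample of real input, since a list this short says little about how a large pattern set behaves.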
