Recently in Sysadmin Category

/usr/bin/nice is your friend

 
This is a sysadmin post. The other day I was running a background process on a production machine. I thought it wasn't going to eat up many resources, but I was wrong. It turned out it was doing a lot of random I/O and things ground to a halt.

Now often I will have top, vmstat and iostat open and renice annoying background processes to 20 when appropriate (r in top). If you don't already know, 'nix processes (I use FreeBSD) can have priorities, which the scheduler takes into account when giving out resources.

These nice values range from -20 to 20 (on FreeBSD at least), and you can see them in top under the NICE column (the PRI column next to it shows the priority the scheduler actually ends up using). 0 is neutral. If you set something to 20, it will be tied for the lowest-priority process in the system.

nice is another command that messes with process priorities. It starts them out at a particular priority, as opposed to changing a priority via renice. For example 'nice -n 20 ./process' will start process at priority 20.
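
One detail worth checking on your own system: the number passed to nice -n is an increment added to the nice value of the shell that launches the command, not an absolute setting. Here is a minimal sketch to see that, assuming a ps that supports the POSIX "nice" output keyword (FreeBSD and Linux both do). The helper name niceness_of is made up for illustration.

```shell
#!/bin/sh
# Illustrative helper: print the nice value of a PID.
niceness_of() {
    ps -o nice= -p "$1" | tr -d ' '
}

# Nice value of the current shell; normally 0 for a login shell.
base=$(niceness_of $$)

# Start a child with an increment of 5 and have it report its own nice value.
child=$(nice -n 5 sh -c 'ps -o nice= -p $$' | tr -d ' ')

echo "shell nice: $base, child nice: $child"
```

If the launching shell is already niced (for example, inside a cron job you wrapped in nice), the child ends up that much higher, up to the system maximum.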

Then I got to thinking, why don't I do more of this initial nice stuff? Maybe seasoned sysadmins all already do this, but I went back through all my scripts and crontabs and explicitly set nice values. Then I went through my daemontools run scripts and set nice values there as well.

My web server (nginx) already did this via the 'worker_priority' directive, which I had previously set to -6. I set negative values to my most important scripts and relative values between them in order of importance. I set positive values to less important scripts, and then 20 to various background processes kicked off via cron. For example, backups are now niced to 20.

The system was already running pretty smoothly, but now it is even more so. And I think, more importantly, it will react better in times of need.

Final tip. When I want to renice a bunch of stuff, I usually do so on the command line instead of top, e.g.:

ps auxww | grep -i '[c]rawl' | awk '{print $2}' | xargs renice 10

That will take all the processes that match 'crawl' and renice them to 10. The brackets in '[c]rawl' keep grep from matching its own entry in the ps output.
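
To confirm that a renice took effect, you can read the value back with ps. A minimal sketch, using a throwaway sleep process as a stand-in for a real background job (the stand-in and the variable names are just for illustration):

```shell
#!/bin/sh
# Start a disposable background process to act as the job to be reniced.
sleep 30 &
pid=$!

# Set its nice value to 10 (raising niceness needs no special privileges).
renice 10 -p "$pid" >/dev/null

# Read the nice value back to confirm the change.
after=$(ps -o nice= -p "$pid" | tr -d ' ')
echo "pid $pid now has nice value $after"

# Clean up the stand-in process.
kill "$pid"
```

Going the other way (lowering the nice value, or setting a negative one) generally requires root.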

How to stop most people from spidering your site and stealing content

 
I run a few sites with a lot of content that I don't want spidered by anyone other than the major search engines. At best, undesired spidering eats at your bandwidth and page response time. At worst, it can lead to widespread stealing and duplication of your content.

Anyway, this post details exactly what I currently do to prevent such spidering. All the code is in my new git repository I created to share code with you.

  • I created a DB named logip, owned by user logip, and then added these tables. The logip table records requests from IP addresses I am tracking. And the badips and badips2 tables hold IP addresses I am presently blocking.

  • I set up this Perl script (logip.pl) to run all the time. It tails my Web server log file and adds IP addresses to the logip table. There are variables at the top you can use to exclude certain virtual hosts, e.g. a site with few pages where you aren't concerned with crawling. You can also exempt certain IPs, e.g. your own.

    logip.pl currently only adds IPs for requests with 200 HTTP status codes (OKs). It also skips images, css, js, etc., IP addresses known to be associated with the major search engines, and repeated requests for the same page by the same IP (hard refreshes). The idea is to record unique successful requests of actual distinct pages from non-search engines.

    I run the script via daemontools, but you could run it through inetd or whatever. If daemontools interests you, the commands I ran to set up management are in logip.sh. If you use those commands, you will want to change them to point to your svscan and logip.pl script directories appropriately.

  • I set up this Perl script (badips.pl) to run periodically. In particular, I have it set up to run every minute via crontab. The interval at which it runs is the longest a new violator of my spidering policy can go before being blocked. So if you run it every minute, people will have (on average) half a minute to grab your stuff before you start blocking them. In practice I haven't felt the need to make the interval smaller, but if a lot of people want that, I could rewrite the script for that purpose.

    badips.pl works on a threshold basis. It looks at various tunable timeframes and checks whether new IP addresses have exceeded a page request threshold for those timeframes. The ones I currently use are in the script. For example, 20 page requests in the past minute, or 50 over a day. You can tune these to what is appropriate for your sites.

    The second variable is whether to log violators in either the badips or badips2 tables. The distinction is whether you think a violation is really bad or just pretty bad. For example, I currently mark passing the minute threshold as really bad and all else as just pretty bad. A pretty bad block stays around for 10 days, whereas a really bad block stays around for 180 days.

  • The output of badips.pl is a configuration file that nginx or Apache reads on the fly. It works with both Web servers, and there is a variable at the top of the file to indicate which one you are using. The resulting conf file is a bunch of Deny IP lines that list the IPs you are currently blocking.

    For Apache, there are some other lines that preemptively block suspicious user agents, e.g. curl and wget. I haven't yet ported these preemptive lines over to nginx. The intention is for Apache to see the file via an AllowOverride All directive, i.e. via a changing .htaccess file. For nginx, the configuration is reloaded on the fly.

  • If new IPs are added, you are sent an email notifying you of the new block(s). The script attempts to do reverse DNS on the IP and the forward DNS on that host to give you some context. For example, if it is a Google IP, you will want to unblock it. However, in practice, I haven't done that in a while because those IPs are well exempted in logip.pl. badips.pl also cleans up the DB at the end of the script before exiting, deleting expired records and vacuuming the table (for PostgreSQL).
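
The threshold idea behind badips.pl can be sketched in a few lines of shell and awk. This is not the actual script. It assumes the tracked requests for one timeframe are already available as one IP address per line, and it prints an nginx-style deny line for any IP at or above the threshold (for Apache the line would read "Deny from" instead). The function name and the sample addresses are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: read one IP per line (one line per page request)
# on stdin and print an nginx "deny" line for each IP whose request
# count meets or exceeds the threshold given as $1.
over_threshold() {
    awk -v limit="$1" '
        { hits[$1]++ }                      # count requests per IP
        END {
            for (ip in hits)
                if (hits[ip] >= limit)
                    printf "deny %s;\n", ip
        }
    ' | sort                                # keep the output order stable
}

# Sample data: 192.0.2.10 makes 3 requests, 192.0.2.20 makes 1.
blocked=$(printf '%s\n' 192.0.2.10 192.0.2.20 192.0.2.10 192.0.2.10 | over_threshold 3)
echo "$blocked"
```

Running the same function once per timeframe, each with its own threshold, gives the "20 in a minute or 50 in a day" behaviour described above.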

I've evolved this process over the last few years, and it works quite well for me and my sites. Your feedback is, of course, welcome. I'm always looking for improvements and am willing to make them.

I am aware that the current process has some holes. The two biggest are:

  1. You can spider successfully via a large number of IPs, most notably the Tor network. In the past, I have added those IPs dynamically, and I might do that again. Adding a ton of IPs slows down the Web server considerably, however. This is why I backed off of that approach in the past.

  2. You can grab pages really slowly. That is, if you stay under the thresholds, you won't get caught by this system.
Enjoy!

Duck Duck Go Architecture

 
I often get asked what Duck Duck Go "runs on."  This post basically answers that question by outlining the major moving parts that serve queries, i.e. its architecture.  I'll detail in another post what, in particular, makes it fast, i.e. tunables and other specifics.

Caveat: this architecture was designed for maximum query speed for our initial soft launch.  While also somewhat designed for eventual scalability, we don't have that much traffic yet (though we are growing at a nice clip).  So don't take this as advice like you might get at High Scalability.  It's really just for your amusement.  However, my last startup did have some scale (relatively speaking of course) so I know a bit about what I'm doing...

  1. DNS served by DNS Made Easy.  I used to serve it myself via djbdns, but DNS Made Easy is faster, is cheap, and makes it easier for me to deal with fail-over.

  2. All requests come into nginx. I used to use two instances of Apache, one for dynamic requests and one for static files.  But nginx is faster, uses less memory, and is more stable.

  3. If a static file, nginx serves it directly, e.g. the home page.  It's really good at that.

  4. Otherwise, nginx checks my memcached store.  I hadn't used memcached before this, and find it a big win.

  5. If not in memcached store, nginx proxies to FastCGI processes that are running in the background.  I hadn't used FastCGI before this, as I always had used mod_perl with Apache. 

  6. The FastCGI processes are managed by daemontools (as is memcached).  At first I was worried about stability in these processes, but it hasn't proved to be an issue yet.

  7. Internally, the FastCGI scripts are written in Perl and run by the FCGI::Engine Perl module.

  8. The Perl scripts access a PostgreSQL database (when needed) to retrieve our zero-click information, among other things.

  9. The whole thing runs on FreeBSD.

  10. For fail-over and scalability purposes, I have EC2 images that replicate the above except that they run on Ubuntu (since, at the time, FreeBSD wasn't available on EC2).

  11. All of our site icons and zero-click info images are hosted on S3.

  12. We also reference some external YUI JS files.
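
Steps 2 through 5 map onto nginx configuration roughly as follows. This is a sketch, not the production config: the document root, the addresses and ports, and the choice of cache key are all placeholder assumptions.

```nginx
server {
    listen 80;
    root /usr/local/www/example;        # placeholder document root

    location / {
        # Step 3: serve the file directly if it exists on disk.
        try_files $uri @cache;
    }

    location @cache {
        # Step 4: look the request up in memcached.
        set $memcached_key "$uri?$args";
        memcached_pass 127.0.0.1:11211;
        default_type text/html;
        # Step 5: on a cache miss or memcached error, hand off to FastCGI.
        error_page 404 502 504 = @app;
    }

    location @app {
        include fastcgi_params;
        fastcgi_pass 127.0.0.1:9000;    # placeholder FastCGI listener
    }
}
```

Note that nginx only reads from memcached here. Something else (in this setup, presumably the FastCGI scripts) has to write entries under the same key.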

Any questions?  

Also, I'd love any feedback on this architecture.  I'm always looking for ways to speed it up!

Update: additional comments can be found here.

FreeBSD One-liner to Group Referrers

 
A couple days ago I released two widgets, and since then I've wanted to keep an eye on installations for bug detection and vanity purposes.  Tailing the logs for this purpose was becoming cumbersome, so I whipped up this one-liner to tell me what is going on.

grep '[kp]\.js' /var/log/nginx/nginx-access.log | awk '{print $10}' | perl -pe 's/^\"http:\/\/([^\/]+).*\"$/$1/' | sort | uniq -c | sort -n

I'm posting it here to remember it and because it might be useful to you.  Here's what it does.

  1. Greps for the Web log lines desired.  In my case, I'm looking for two JS files in an nginx log.  In your case, you'll probably want to change the pattern and the log path.

  2. Awks out the referrer field.  In my nginx log this is the 10th field.  In your case, it might be a slightly different number.

  3. Perls out the domain.  You could skip this step if you want to count each different referring URL differently.  In my case, the widget is deployed on blogs, and so each post shows up as a different referring URL, and that creates noise, so I grouped them.

  4. Sorts the domains, so that step 5 works (uniq only collapses duplicate lines that are adjacent).

  5. Uniqs the domains, i.e. groups & counts them (the -c).  

  6. Sorts the grouped domains by the count, numerically (-n).
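
Here is a small, self-contained way to try the same pipeline on sample data before pointing it at a real log. It is a sketch under a few assumptions: the log lines are invented, the referrer is whitespace-separated field 10 as in the one-liner above, and sed stands in for the Perl step so that nothing beyond standard tools is needed. The function name is made up.

```shell
#!/bin/sh
# Hypothetical helper: read access-log lines on stdin, pull the referrer
# from field 10, reduce it to its host name, then group and count.
group_referrers() {
    awk '{print $10}' |
        sed -E 's#^"https?://([^/"]+).*#\1#' |
        sort | uniq -c | sort -n |
        awk '{print $1, $2}'            # normalize the padding uniq -c adds
}

# Invented sample lines; field 10 is the quoted referrer.
sample='192.0.2.1 - [01/Jan/2009:00:00:01 +0000] "GET /k.js HTTP/1.1" 200 512 "http://blog.example.com/post-1"
192.0.2.2 - [01/Jan/2009:00:00:02 +0000] "GET /k.js HTTP/1.1" 200 512 "http://blog.example.com/post-2"
192.0.2.3 - [01/Jan/2009:00:00:03 +0000] "GET /p.js HTTP/1.1" 200 512 "http://other.example.org/"'

counts=$(printf '%s\n' "$sample" | group_referrers)
echo "$counts"
```

The two blog posts collapse into a single domain with a count of 2, which is exactly the grouping that step 3 is there for.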

Enjoy! 

Turning off Logging in daemontools (djbdns' dnscache)

 
After optimizing my Web crawler for the Parked Domains Project, I started crawling so fast that the log process for my DNS server was eating up 20% of one of the CPUs (and wasting a lot of I/O as well).

I run a local DNS cache using djbdns on each crawling server, which also handles all local DNS queries. My dnscache is run by daemontools, and if you are familiar with this world, you already know I was using multilog for logging.

The logging for dnscache is basically useless when you are not debugging, and it is very extensive.  So it makes perfect sense why it was taking up so much CPU and I/O.  

I'm writing this post to correct the Internet.  When searching for turning off logging for dnscache or multilog, you get a lot of instructions telling you to replace your log/run file with this:

exec setuidgid daemon multilog -*

That will free up the I/O, but only about half of the CPU utilization (at least in my case).  The problem is your system is still piping the log from the main process to multilog--multilog just isn't writing it anywhere. 

What you really want to do is stop the logging at its source.  To do so, don't mess with your log/run file at all.  Instead, change your actual run file from

exec 2>&1

to

exec 1>/dev/null 2>&1

Now nothing will go to multilog at all.  Of course, you need to restart or HUP the log and run processes for the change to take effect.
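
The order of the two redirections matters: stdout is pointed at /dev/null first, and then stderr is pointed at wherever stdout now goes. Here is a quick sketch to convince yourself that nothing reaches the pipe, using a made-up function in place of dnscache and wc in place of multilog:

```shell
#!/bin/sh
# Stand-in for a chatty daemon: writes one line to stdout and one to stderr.
chatty() {
    echo "query log line"
    echo "stats log line" >&2
}

# Original run-file behaviour: stderr joins stdout, so both lines reach the pipe.
with_logging=$(chatty 2>&1 | wc -l | tr -d ' ')

# Modified behaviour: stdout goes to /dev/null, then stderr follows it there.
without_logging=$(chatty 1>/dev/null 2>&1 | wc -l | tr -d ' ')

echo "lines piped with logging: $with_logging, without: $without_logging"
```

The reader on the other end of the pipe is still running in the second case. It just never has anything to read, which is where the CPU savings come from.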
