Recently in Sysadmin Category

nginx JSON hacks

 

At DuckDuckGo we use a lot of nginx (an awesome Web server) and a lot of JSON, both for our own API and for processing external APIs. Here are some hacks we've been using.


Proxy external JSON calls

You can take an external API and run it through your server instead of letting the client call it directly. 

location ^~ /ext_api2/ {
  proxy_pass http://api.server.com/;
}

That means a request for


will turn into


Setting up a proxy can yield a number of benefits...


Proxy caching

proxy_cache_path  /tmp/nginx_cache levels=1:2 keys_zone=STATIC:64m inactive=60m max_size=128m;
proxy_cache STATIC;
proxy_cache_valid 200 204 302 1d;

Now it won't hit the external API if the same request is called by multiple clients. If it is a pay API this could save you money and it could also just speed up the responsiveness of your site.


Proxy timeouts

proxy_connect_timeout 5;
proxy_read_timeout 5;
proxy_send_timeout 5;

You can set the timeouts per proxy (or globally), thus controlling how long the client will wait for each request. With timeouts in place you can ensure the page doesn't hang on something, eventually loads, and gracefully degrades the way you want it to -- even for components you don't control.


Strip headers

proxy_hide_header Set-Cookie;

Some external APIs like to do things to clients (like set cookies). You can protect your users from that by stripping them (or other headers). 


Reset headers

proxy_set_header Referer http://duckduckgo.com/;

Similarly, you can reset your headers. This can protect privacy by zeroing out search terms (in the case of the Referer header), but you can also set custom headers.


Hide private API keys

Many APIs require use of a key, which you generally don't want to expose client-side. You can still allow for client-side calls by proxying them and then having nginx add the key.

location ^~ /ext_api5/ {
  rewrite ^/ext_api5/(.*) /api/check/$1/key/e95fad09aa5091b7734d1a268b53cef5  break;
}

Now a request for


will turn into



Turn JSON into JSONP

JSONP is a slight modification of JSON where the object is wrapped in a callback function usually specified by you. For example, say you grab a JSON object from somewhere that looks like this:

{"Name": "Cheeso", "Id" : 1823, "Rank": 7}

With JSONP you specify a callback function like 'parseResponse' and then it looks like this:

parseResponse({"Name": "Cheeso", "Id" : 1823, "Rank": 7})

This is useful for two reasons. First, the function will be called automatically when it is done loading. Second, it allows you to get around cross-domain errors.

If the above API was yours it's easy to call within a client-side script. But if it isn't yours, i.e. on another domain, and you try to call it you'll often get lots of cross-domain errors. The way around this is to use JSONP. Then you can do something like this:

<script type="text/javascript"
         src="http://other.server.com/api/?q=param&callback=parseResponse">
 </script>

You could also do that via JS, e.g.

function add_script(url) {
    var script,scripts;
    script = document.createElement('script');
    script.type='text/javascript';
    script.async = true;
    script.src = url;

    scripts = document.getElementsByTagName('script')[0];
    scripts.parentNode.insertBefore(script, scripts);
}


Now here's the problem. Most external APIs don't have JSONP capability. 

No bother, with nginx you can turn JSON into JSONP.

location ^~ /ext_api3/ {
  echo_before_body 'parseResponse(';
  echo_after_body ');';
}

This uses the HTTPEchoModule to wrap the JSON response in the callback for the external API.
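
To see what the client ends up receiving, here is a minimal shell sketch of the same wrapping done by hand. The wrap_jsonp function is hypothetical; it only mimics the effect of the two echo directives and is not part of nginx.

```shell
#!/bin/sh
# Hypothetical helper: mimic what echo_before_body / echo_after_body
# do to an upstream JSON body, using the callback name from above.
wrap_jsonp() {
    callback=$1
    json=$2
    # before_body + original body + after_body
    printf '%s(%s);\n' "$callback" "$json"
}

wrap_jsonp parseResponse '{"Name": "Cheeso", "Id" : 1823, "Rank": 7}'
```

Note that the callback name is fixed in the location block, so the client has to define a function with exactly that name.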


Custom logs

The HTTPLogModule allows you to specify log formats within location blocks, which means you can write your API logs to a separate file. It also means that if you proxy via a location block as in the examples above, you can give each proxy its own access and error logs with different parameters, e.g. error log level.


Update: good comments on HN.

On weird botnet traffic

 
Botnets keep sending DuckDuckGo weirder and weirder traffic, and frankly I don't get it. For a while now I've seen a lot of requests like these:


I suppose those forms make some sense. I presume they are looking for sites running exploitable software, and so they set up automated queries to search engines to find new sites. 

However, what doesn't make sense is sending the same query hundreds of thousands of times a day from each machine. Someone presumably took the time to carefully construct these queries, given that they generally appear to be in the right form. And yet the queries return the same results a tenth of a second later, so why keep repeating them? A computer will pop up and just start hammering on the query. If I unblock it even days later, it is still doing it.

But that's not the weirdest behavior. For the past several weeks I've been getting tons of the exact same request:


These requests come in slower per machine but from a much greater number of machines. I honestly don't understand the point of them at all. Does anyone out there?

As you may know, DuckDuckGo does not save IPs (here's how). So if you're wondering how we go about blocking them, it happens all at the firewall level, which is dissociated from query data. If we didn't block the most egregious botnet machines and abusers, our machines would almost instantly be under water.

This discussion now makes me wonder if other search engines include this errant traffic in their query counts. We work hard to keep it completely out because it would overwhelm our real direct query numbers and therefore distort our perception of progress. We also separate out API requests for the same reason, which makes me wonder whether everyone else does that too.

PostgreSQL tips and tricks

 

I've been using PostgreSQL (an open-source database) for many years. Here are some of the less obvious tips and tricks I've picked up. This post isn't meant to be a comprehensive tuning or scaling guide, though I link to some good documentation below.

  • When importing data, use COPY FROM and no indexes. These are two common ways to increase speed when populating a database (more listed there). The idea is to just let the DB copy the data and not do extra work (like also messing with indexes or transactional statements at the same time). The COPY command has a lot of arguments but I usually just do something like this:

    COPY table FROM '/tmp/path/to/file.txt';

    where file.txt is a tab delimited file matching the table columns exactly (the default). Another gotcha here is you often get permission errors when running that command. If you connect to the DB as the superuser all of that goes away, e.g.

    psql db postgres

    You can add indexes at the end and everything will be much faster.


  • Replace a live table with DROP TABLE/ALTER TABLE. Suppose you are re-populating a read-only table that is live, so you don't want to drop the indexes or otherwise have no availability. What I do is create a new table based on the first, populate it (with COPY and no indexes), then drop the main table and alter the new table to make it named the first. Minimal downtime, e.g.:

    CREATE TABLE table2 AS SELECT * FROM table LIMIT 1;
    DELETE FROM table2;
    COPY table2 FROM '/tmp/path/to/file.txt';
    CREATE INDEX blah2 ON table2 (column);  -- "column" is a placeholder
    VACUUM ANALYZE VERBOSE table2;
    ALTER TABLE table RENAME TO table3;
    ALTER INDEX blah RENAME TO blah3;
    ALTER TABLE table2 RENAME TO table;
    ALTER INDEX blah2 RENAME TO blah;
    --TEST
    DROP TABLE table3;

    All the ALTER commands happen pretty much instantly. This creates an identical looking table (table2); moves your current table to a backup (table3); moves the new table to be the primary; and finally removes the original (after testing).
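
If you would rather have the renames succeed or fail together, one option is to wrap them in a single transaction, since PostgreSQL DDL is transactional. The sketch below is a variation under that assumption, not the exact procedure above. swap_sql is a hypothetical helper that only prints the SQL, and mytable/myindex are made-up names following the same 2/3 suffix scheme.

```shell
#!/bin/sh
# Hypothetical helper: print the rename step of the swap as one transaction.
# $1 = live table, $2 = its index; the new copies are assumed to be named
# "<name>2" and the backups become "<name>3", as in the example above.
swap_sql() {
    tbl=$1
    idx=$2
    cat <<EOF
BEGIN;
ALTER TABLE ${tbl} RENAME TO ${tbl}3;
ALTER INDEX ${idx} RENAME TO ${idx}3;
ALTER TABLE ${tbl}2 RENAME TO ${tbl};
ALTER INDEX ${idx}2 RENAME TO ${idx};
COMMIT;
EOF
}

# Print the statements; pipe them to psql only after reviewing them, e.g.
#   swap_sql mytable myindex | psql db postgres
swap_sql mytable myindex
```

The DROP TABLE of the backup is left out on purpose so that it still happens by hand after testing.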


  • Throttle pg sysadmin tasks so they don't impact performance. On an active database, you're often running larger tasks that can impact the performance of the database, which is bad for a fast database-backed Web site. For example, pg_dump to make backups, COPY/CREATE INDEX as in the above examples, etc. Not to fear; there are some simple things you can do to lessen the impact, e.g.:

    nice -n 19 ionice -c 3 pg_dump
    nice -n 19 ionice -c 3 psql

    This wraps the command to have the lowest CPU priority (via nice) and IO priority (via ionice). Additionally I will often also use

    cpulimit -e pg_dump -l 10 -z

    which limits pg_dump (in this case) to 10% of CPU while the current one is running and then exits (the -z flag). Of course, it also usually makes sense to run these things at off times using cron, etc.
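
If you throttle several commands this way, a small wrapper saves typing. This is only a sketch: low_priority is a hypothetical name, it assumes Linux, and it uses the idle IO class (ionice -c 3), falling back to nice alone when ionice is missing or not permitted.

```shell
#!/bin/sh
# Hypothetical wrapper: run any command at the lowest CPU priority and,
# where ionice is usable, in the idle IO scheduling class.
low_priority() {
    if command -v ionice >/dev/null 2>&1 && ionice -c 3 true 2>/dev/null; then
        nice -n 19 ionice -c 3 "$@"
    else
        nice -n 19 "$@"
    fi
}

# Example with a harmless command; in practice this would be pg_dump or psql.
low_priority echo "running at low priority"
```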


  • For fast Web sites, have everything use indexes. IO is slow, so avoid it. You want your queries that users hit on every page to essentially return instantly, which means in-memory index lookups.

    The first step is to make sure you have the indexes defined right so your queries are actually using them. Use EXPLAIN for that, but don't just trust it -- always then do EXPLAIN ANALYZE to make sure it is returning instantly. Note you have to try different queries because often re-running the first query will be fast since it will be thereafter cached for a while.

    If you can't get things to use indexes, and really in any case, you want to tune pg to favor indexes more often. What I usually do is something like

    cpu_index_tuple_cost = 0.0005
    effective_cache_size = 2GB

    These two params affect the calculation pg does to determine whether to use an index or not. If you're confident that your indexes are in memory, then the cpu_index_tuple_cost should be lower than the default (I use the value above). The effective_cache_size is what kind of disk cache you can expect pg to have. If you want to get a bit crazier you can also do

    enable_seqscan = off

    which will try to avoid sequential scans at all costs, though some people think that is a bad idea. You can also see what sequential scans are going on with the following command, which indicates where you need additional indexes or need them to be tweaked:

    SELECT relname,seq_scan FROM pg_stat_all_tables ORDER BY seq_scan DESC LIMIT 20;


  • For fast Web sites, make sure your indexes are in memory. Indexes are great, but when they are not in memory, they can still use a good deal of IO and slow you down.

    First, don't set shared_buffers too high. It clearly varies by application what the right memory param values are, but you have to understand that much of pg caching is done by the OS via disk cache. This is what the aforementioned effective_cache_size is all about. When you set shared_buffers too high, it both eats into the OS disk cache and leads to redundant caching. Read up here on the in-memory values.

    Second, find out how big your indexes are. You can do that like this:

    SELECT sum(((relpages*8)/1024)) as MB FROM pg_class WHERE reltype=0;

    or by individual index like this:

    SELECT relname, ((relpages*8)/1024) as MB, reltype FROM pg_class WHERE reltype=0 ORDER BY relpages DESC LIMIT 30;

    If your indexes are way greater than your memory, figure out how to reduce them, e.g. by dropping ones you don't need, sharding, optimizing (changing what exactly is indexed), etc.

    Once you get them down to a reasonably in-memory value, ideally you'd want effective_cache_size to be above that value and mean it. Since we're talking about OS disk cache, note that other IO you do on disk really affects performance because you're flushing that cache. This is why it is often a good idea to either a) separate the DB machine or b) run all other IO tasks on different disks, e.g. have tmp and log and backup stuff on other mount points tied to different physical disks.
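
The size queries above rely on each page being 8 kB, which is the default PostgreSQL block size. As a quick sanity check of that arithmetic, here is a small shell sketch. pages_to_mb and fits_in_cache are hypothetical helpers, and the page count is a made-up input.

```shell
#!/bin/sh
# Hypothetical helper: convert a relpages count to MB, assuming 8 kB pages.
pages_to_mb() {
    echo $(( $1 * 8 / 1024 ))
}

# Hypothetical helper: does a total index size fit under a cache budget?
# Both arguments are in MB.
fits_in_cache() {
    total_mb=$1
    cache_mb=$2
    if [ "$total_mb" -le "$cache_mb" ]; then
        echo "fits"
    else
        echo "too big"
    fi
}

index_mb=$(pages_to_mb 131072)    # made-up page count
echo "indexes: ${index_mb} MB"
fits_in_cache "$index_mb" 2048    # 2048 MB matches effective_cache_size = 2GB
```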

  • Make indexes faster. There are a few things you can do to actually speed up indexes. First, you can tell pg to calculate more statistics on a column when analyzing it like this:

    ALTER TABLE table ALTER COLUMN column SET STATISTICS 1000;

    For large and irregular data sets, the default of 100 is too low. These statistics are used by the query planner.

    Second, you need to VACUUM ANALYZE when you sufficiently change a table. Later versions of pg have autovacuum and that may do it for you -- I haven't dug into it enough to know, but I still rely on manual vacuums (for updated index analysis) via cron.

    You can also issue CREATE INDEX commands that operate on a subset of a column. If that is the only subset you're querying via the index, then that can also speed it up.

    Finally I'm told the CLUSTER command can further speed things up by organizing the table (on disk) according to an index. If you use one index all the time to query a table, this could make getting the subsequent information off of the disk faster. I have not played around with this feature yet though.


  • Vacuum (at least) occasionally. If you turn off auto-vacuum and never vacuum you will eventually lose data! This query can tell you how far you are from being screwed in each database:

     SELECT datname, age(datfrozenxid) FROM pg_database;

    I hope you never run into this, but I did a few years ago and it is a pain. Just make sure you are routinely vacuuming please. It's another good reason to add a daily/weekly cron job as a backup.
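
To put the age() numbers in context: transaction IDs are 32-bit and compared modulo 2^31, so the danger zone is roughly 2 billion. The sketch below shows the arithmetic only; the helper names and the sample age are made up, and PostgreSQL actually starts refusing work somewhat before the hard limit.

```shell
#!/bin/sh
# Transaction IDs are compared modulo 2^31, so age(datfrozenxid)
# must stay well below 2^31 = 2147483648.
XID_LIMIT=2147483648

# Hypothetical helper: transactions left before the limit for a given age.
xid_headroom() {
    echo $(( XID_LIMIT - $1 ))
}

# Hypothetical helper: percent of the limit already used (integer).
xid_percent_used() {
    echo $(( $1 * 100 / XID_LIMIT ))
}

age=500000000    # made-up value for age(datfrozenxid)
echo "headroom: $(xid_headroom "$age") transactions"
echo "used: $(xid_percent_used "$age")%"
```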


  • Get faster vacuums. The amount of memory pg uses for vacuuming by default is super low. You can increase it and thus dramatically speed up vacuuming by doing something like:

    maintenance_work_mem = 256MB


  • Use Bucardo for easy master/slave stuff. Lots about that in this post.


  • Don't forget listen_addresses when trying to connect remotely. To connect remotely you need to mess with the pg_hba.conf file to authorize remote connections. But a big gotcha here (that has got me many times) is forgetting to change listen_addresses to actually listen on an interface open to the outside. The default is just localhost.

    listen_addresses = '*'

    will listen on every interface available.


  • When troubleshooting, first check you haven't run out of connections, then check the error log. max_connections is just one of the things you can tune, but it is probably one you want to tune. Once this limit is reached your clients will not be able to connect. You can't set it super high because of memory constraints, but the default is usually too low. You can also reduce the number of connections you need by doing database pooling of some kind.

    Note that in a situation where you run out of connections, you often will have some superuser connections left, since they are allocated separately. So you can do

    psql db postgres
    select * from pg_stat_activity;

    and see what is going on. If you see the same query over and over again, you probably have a bottleneck in that query :). See above for making sure everything uses indexes and returns instantly.


  • OS tuning to allow for increased shared_buffers. Often the first thing you'll try to do is increase shared_buffers and then pg won't start because it says you can't allocate enough memory. You need to tell the OS to let it use more than the default, which you can do like:

    echo 'kernel.shmmax=2147483648' >> /etc/sysctl.conf

    for Ubuntu or

    echo 'kern.ipc.shmmax=2147483648' >> /etc/sysctl.conf

    on FreeBSD. This requires a reboot, but on Ubuntu I believe you can do sysctl -w to also make it work immediately.
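
The 2147483648 in those lines is just 2 GB expressed in bytes. If you need a different size, a sketch like the following does the conversion; gb_to_bytes is a hypothetical helper, and the script only prints the line rather than touching /etc/sysctl.conf.

```shell
#!/bin/sh
# Hypothetical helper: convert whole gigabytes to bytes for shmmax.
gb_to_bytes() {
    echo $(( $1 * 1024 * 1024 * 1024 ))
}

# Print (not write) the sysctl.conf line for Linux; 2 GB matches the example.
echo "kernel.shmmax=$(gb_to_bytes 2)"
```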


  • You can HUP pg for most config changes. You generally do not need to restart pg when changing minor stuff, though some things (like annoyingly listen_addresses!) you do. Note HUPping can usually be accomplished by issuing a reload command through the start up script interface.

Update: lots of good comments on HN.

Replicating PostgreSQL with Bucardo

 
If the title makes no sense to you, PostgreSQL is an open-source database. Replication means continuously copying changes from one database to another, e.g. for backup, scalability or high availability. And Bucardo is one of several pieces of software that help you achieve various forms of Postgres replication.

Why Bucardo?

First off, why Bucardo, especially since Postgres 9.0 now has built-in hot standby/streaming replication (full docs)? Steve Signer wrote up some cases where you wouldn't want to (or be able to) use the built-in replication. A more detailed comparison of the various options is on the PG wiki.

I won't repeat everything you can find there, but at the highest level the built-in system works on the whole DB cluster, as opposed to specific databases or even tables. Also, it is designed to do so by essentially mirroring files, so it is highly recommended that the platforms (including OS and Postgres versions) match as exactly as possible. Both of these requirements don't suit me. I run FreeBSD and Ubuntu and would really like to replicate only certain tables within certain databases.

After looking at the various options in detail, I wanted to use something (at least initially) both free and open source. Bucardo is written in Perl (so theoretically I could contribute patches); it is active (I looked at the changelog and mailing list); it is simple to setup and use (I looked at the docs); and it is very flexible (offers multi-master, multi-slave, and even cascading slave configurations).

However, note that Bucardo replication usually lags a few seconds behind (depending on network config), so you will lose data if the master goes down and is not recovered properly. That is, it is not synchronous replication as available in 9.1, which confirms that changes are made in both DBs before committing (though that also has the drawback of slowing things down a bit).

My stuff is generally read-only on the slaves and data can be out of sync (or lost), so the Bucardo model (which works off of triggers) is perfect for me at this time. Nevertheless, you can do failover via Bucardo, easier with a swap sync.

Bucardo Overview

The Bucardo documentation is pretty good so I won't rehash it all. Installation is straightforward -- you install a few Perl modules and then are left with a command line program called bucardo_ctl, which controls everything. I noticed the online documentation can differ from the man page in details, so you might want to look at both. Also, on the wiki I noticed not everything is linked properly from the main pages, so you might want to use the search feature if you're looking for something in particular.

The documentation goes through an example if you want to play around with it. Here's an even simpler one:

createdb testa
createdb testb
create table test (id integer primary key);
bucardo_ctl add database testa name=testa
bucardo_ctl add database testb name=testb
bucardo_ctl add table test db=testa
bucardo_ctl add herd test_herd test
bucardo_ctl add sync test_sync source=test_herd targetdb=testb type=pushdelta

The above will create two databases. You should issue the create table command in both. Then you add the databases to the bucardo database (which is stored in a postgres DB named bucardo). A "herd" is just a set of tables to replicate, so we add the test table from testa. Then we add a sync.

bucardo_ctl stop
bucardo_ctl start
testa=# insert into test (id) values (1);
bucardo_ctl status
Ins/Upd/Del:          1 / 0 / 0
testb=# select * from test;
 id
----
  1
(1 row)

Stop and start bucardo, which is the easiest way to ensure the new sync starts. Then insert a row in testa and see it appear in testb. bucardo_ctl status will tell you that the one row was inserted.

Non-obvious notes

After messing around with this live for a week, I've noticed a bunch of things that were non-obvious (at least to me) that you might want to keep in mind.

  • On install, you may need to add Perl to Postgres or you'll get a weird error. You can do that with this command:

    createlang plperl template1 -U pgsql

  • The verbosity flags to bucardo_ctl currently don't do much, e.g. --quiet doesn't make things quiet. To turn off the debug log (which gets big fast because it is high verbosity), use debugfile=0 when starting bucardo_ctl. Bucardo is aware of this issue and will be cleaning this up in future releases.

  • The sendmail=1 flag to bucardo_ctl works, but you have to set your from and to email first by doing bucardo_ctl set default_email_from=whatever, etc. However, if you have a situation where a sync fails repeatedly you'll get a ton of emails, like several per second. So I turned this off for now until that case is fixed (I just reported it).

  • There is a debugdir flag you can pass to bucardo_ctl. If you don't set this, log files will get written to the directory where it is started, so I would cd to that directory first or use the flag.

  • If you have a sync that fails, e.g. from network error, and you keep writing stuff to the master db, when it comes back online it will try to copy everything that changed at once. This can dramatically impact performance on the target. You could do a manual sync up in that case (on an off-peak time and in a secondary table) and avoid the bucardo process doing it for you. If you do that (or decide to just ignore those changes), you need to flush the track and delta tables for that database, e.g.

    psql db bucardo
    delete from bucardo_track;
    delete from bucardo_delta;

    If you also want to delete all the old sync info (since it could have failed quickly thousands of times), you need to do the following.

    psql bucardo bucardo
    delete from q;

    Note that in the first case you are connecting to the db as the bucardo user and not your regular database owner user. In the second case you are connecting to the bucardo db and not the db being replicated.

  • To completely remove bucardo (e.g. if you want to start over), it is not enough to remove the bucardo db since extra tables were added to the other databases (being replicated) and extra triggers to the tables within them. You need to also issue this command.

    psql template1 pgsql
    drop schema bucardo cascade;

  • To completely shut down a running sync that is going awry and make sure everything is stopped:

    bucardo_ctl stop
    ps auxww | grep -i bucardo | awk {'print $2'} | xargs kill -TERM

    You want to also issue that command on remote machines to ensure those processes are killed.

  • It initially confused me how to add a remote DB. You add it to bucardo like this:

    bucardo_ctl add database waki name=test host=test.duckduckgo.com

    That is, you pass in the fully qualified domain name as the host parameter.

  • To test that a sync can work, i.e. it can reach the host in a properly authenticated manner, you can do

    bucardo_ctl validate sync_name

    You will need to do two things for this to complete successfully. First, there needs to be a bucardo user on the remote machine with access to the db. The simplest way is to make it a super user, a la 

    su -m pgsql -c 'createuser -sDRw bucardo'

    on FreeBSD or on Ubuntu like this:

    sudo su -m postgres -c 'createuser -sDRw bucardo'

    Second, you need to allow the host machine to talk to the remote machine, i.e. you need to enable remote access. This is a two step process. In postgresql.conf you need to change listen_addresses and add in that IP or use * to listen on all. Then in pg_hba.conf you need to add in a way for the remote bucardo user to authenticate. Again the easiest way is to trust that IP, though security-minded people will probably want to shoot me for suggesting it. Either way, you can test it by using psql and passing in host parameters or just issuing the validate command above.

  • The quickest way to tell if things are working is to do

    bucardo_ctl status

    which will tell you info on all your syncs or

    bucardo_ctl status sync_name

    for detailed info on one sync. Look at the Last_bad and Last_good times in particular.

  • To remove a sync, it is not enough to remove it in bucardo. You also have to remove the triggers it added on the various tables. You could remove everything as noted above and start over. But if it is just one table, you can just remove the triggers like so:

    psql db dbname
    drop trigger bucardo_add_delta ON table;
    drop trigger bucardo_triggerkick_sync_name ON table;

    Note if you have multiple syncs on the table you don't need to drop the delta table, but just the triggerkick one. If you don't do this and add another sync you'll end up with multiple triggers, which you don't want.

  • If you are replicating to multiple slaves, I found that it is better to use a sync for each one instead of one sync to all (via a db group). The reason is if one is unavailable it will bring down syncing to all, which a) stops syncing for good machines and b) results in that problem above where you do a big sync when everything is back up.
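
One way to set up a sync per slave without typing each command is to generate them. The sketch below reuses the add sync syntax from the testa/testb example and only prints the commands. per_slave_syncs, the sync naming scheme, and the slave1/slave2 database names are all made up for illustration, and each target still has to be added with bucardo_ctl add database first.

```shell
#!/bin/sh
# Hypothetical helper: print one "add sync" command per target database,
# reusing the syntax from the testa/testb example. Nothing is executed.
# $1 = herd name, remaining args = target database names.
per_slave_syncs() {
    herd=$1
    shift
    for target in "$@"; do
        echo "bucardo_ctl add sync ${herd}_to_${target} source=${herd} targetdb=${target} type=pushdelta"
    done
}

# slave1 and slave2 are made-up database names.
per_slave_syncs test_herd slave1 slave2
```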

  • To delete a db test without removing your whole install, you can do this:

    bucardo_ctl deactivate sync_name
    bucardo_ctl delete sync sync_name
    bucardo_ctl delete target_db_name
    bucardo_ctl delete herd herd_name
    bucardo_ctl delete table table_name
    bucardo_ctl delete db source_db_name

    Then drop the triggers as noted in the above procedure.

  • When upgrading bucardo, it is not enough to just install the new Perl module. You also need to:

    bucardo_ctl upgrade

Conclusion

That was a lot of notes. While it ended up being more complicated than I originally anticipated, I am still currently happy with this solution. It does work pretty well.

Additionally, the Bucardo mailing list is very responsive. I reported two bugs and they were both fixed last night.

RAID0 ephemeral storage on AWS EC2

 
If you're thinking of doing RAID0 (disk striping) on the ephemeral storage disks attached to an EC2 instance, this post is for you. First I'll go through some directions since I didn't find great ones elsewhere when trying to do it myself. Then I'll get to why I wanted to do it in the first place.

When you run your instances, you have to add the extra drives in the block device mapping or they will not be available to you. This confused me for a while. If you do nothing, you'll just get the default, which is one ephemeral drive attached and mounted at /mnt. And there is apparently nothing you can do to connect them in after the fact.

Different types of instances have different numbers and sizes of drives available. Here is the listing. Summary: you get an extra drive (/dev/sdc) when you move up to an m1.large instance. If you move up to an m1.xlarge, you get two more (/dev/sdd & /dev/sde) for a total of four.  

The more drives you have the faster your RAID0 volume will be for random reading (on average). You have to be careful though because the larger instances don't necessarily all have four drives. This table tells you outright how many each has. For example, the high memory extra large seems to only have one drive attached. Go figure.

I couldn't figure out how to make the right block device mappings from the console. It doesn't seem to be an option in the wizard. So you have to do it from the command line. For the m1.large instance it looks like this:

ec2-run-instances -b '/dev/sdb=ephemeral0' -b '/dev/sdc=ephemeral1'

and for m1.xlarge it looks like this:

ec2-run-instances ami-90fe01f9 -b '/dev/sdb=ephemeral0' -b '/dev/sdc=ephemeral1' -b '/dev/sdd=ephemeral2' -b '/dev/sde=ephemeral3'

I actually tried naming them other things and it kept throwing errors at me. I also tried baking the block device mappings into images, which you can supposedly do by passing the same -b arguments to ec2-register as per the documentation, but that didn't work for me either.

Once booted, you have to unmount the /mnt mount point.

umount /mnt

This won't work if you are using it for anything, so make sure you do all this before you do (I put it in a user data script). You can then run the following commands to create the RAID0 volume. For m1.large it looks like this:

yes | mdadm --create /dev/md0 --level=0 -c256 --raid-devices=2 /dev/sdb /dev/sdc
echo 'DEVICE /dev/sdb /dev/sdc' > /etc/mdadm.conf
mdadm --detail --scan >> /etc/mdadm.conf

and for m1.xlarge it looks like this:

yes | mdadm --create /dev/md0 --level=0 -c256 --raid-devices=4 /dev/sdb /dev/sdc /dev/sdd /dev/sde
echo 'DEVICE /dev/sdb /dev/sdc /dev/sdd /dev/sde' > /etc/mdadm.conf
mdadm --detail --scan >> /etc/mdadm.conf
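
The two variants differ only in the device list and the --raid-devices count, so they can be generated from the list itself. This is a sketch under that assumption: raid0_cmds is a hypothetical helper that prints the commands instead of running them, so you can review the output before executing it as root.

```shell
#!/bin/sh
# Hypothetical helper: print the mdadm commands for any number of devices.
# The device count ($#) fills in --raid-devices; nothing is executed.
raid0_cmds() {
    echo "yes | mdadm --create /dev/md0 --level=0 -c256 --raid-devices=$# $*"
    echo "echo 'DEVICE $*' > /etc/mdadm.conf"
    echo "mdadm --detail --scan >> /etc/mdadm.conf"
}

# Same device names as the m1.large example.
raid0_cmds /dev/sdb /dev/sdc
```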

Then you need to create the file system, e.g.: 

blockdev --setra 65536 /dev/md0
mkfs.xfs -f /dev/md0
mkdir -p /mnt/md0 && mount -t xfs -o noatime /dev/md0 /mnt/md0
cd /mnt/md0

Finally, if you want it to come up on reboot, remove /mnt from fstab and add /mnt/md0

perl -ne 'print if $_ !~ /mnt/' /etc/fstab > /etc/fstab.2
echo '#/dev/md0  /mnt  xfs    defaults 0 0' >> /etc/fstab.2
mv /etc/fstab.2 /etc/fstab

I'm using Ubuntu Lucid 10.04 LTS. You can find the base AMI and AKI ids here. In particular, I'm using ami-fa01f193 and aki-427d952b (64-bit instance-store in us-east-1 region).

Here are the best posts I found on benchmarking RAID0 ephemeral storage:

I can't find the link to a final post I wanted to link to that compared variability across instance types over a period of weeks. It found, as you might expect, significant improvements and less variability as you move up the chain and high variability in the smaller instance types. The thinking is that as you move up, it becomes more like dedicated hardware, and of course the underlying hardware itself may actually be improving.

For a while now, I've replicated DuckDuckGo in two completely different hosting providers. Up until very recently, AWS was the backup, and so I didn't spend that much time optimizing my setup (or honestly even understanding all my options) on the platform.

I had used EBS boot AMIs on a path of least resistance so I could shove more than 10GB into my images. I didn't think much of it all until moving to EC2 as my primary hosting provider a few weeks ago. I was in the midst of evaluating options primarily on a performance axis when the recent outage hit me. Then I started also considering other axes, namely durability and availability.

I had minimal downtime because I failed over to my other hosting provider, but the whole thing got me thinking hard about how I should use EC2. Ultimately, I decided to move off of EBS and onto the ephemeral storage.

The official line from Amazon is "[t]he latency and throughput of Amazon EBS volumes is designed to be significantly better than the Amazon EC2 instance stores in nearly all cases." That's pretty powerful stuff, so the decision to move off of EBS was not taken lightly.

I don't think it is the right decision for everyone, but I wanted to throw one more data point out there. On a day to day basis, I'm doing mainly random reads, as I suspect a lot of people are. My particular needs are perhaps a bit peculiar though.
  • Most of my stuff is both read-only and indexed in memory, so my actual IO use is not that high. 
  • I also use a decent amount of network.
  • I will choose availability over durability since I replicate off of EBS anyway.
  • I would like to avoid very low performance periods.

The bottom line for me is that the RAID0 on the m1.xlarge is fast enough such that IO is no longer a significant bottleneck for my usage patterns. If you read the above benchmark posts, you will see that you can get comparable speeds doing RAID0 over 8 EBS volumes, so that's really what I'm choosing between. Let's assume they are the same average speed. Here is what pushed it over the edge to ephemeral storage for me:

  • The benchmarks show that the ephemeral drives have higher consistency in throughput than EBS. In fact, EBS has historically crawled every now and then. You could potentially account for this and failover but more likely your application will just crawl for that time period as a result, which I really don't like.

  • With 8 drives, failure of the RAID device becomes significantly more probable. I could not find any real-life failure probabilities beyond the annual failure rate of 0.1-0.5% mentioned by Amazon, though. Let's suppose for the sake of argument that, because Amazon automatically re-mirrors EBS drives, the 8-volume EBS array has a failure rate equivalent to the 4-drive ephemeral array, and this one is moot. It's all speculation anyway.

  • In Amazon's outage statement they said "when data on a customer's volume is being re-mirrored, access to that data is blocked until the system has identified a new primary (or writable) replica." This reads to me as another scenario when your EBS drive can become really slow for a bit and slow down your application. Of course, this could be viewed as a good thing for durability. But since I don't care about that I'd rather choose more availability.

  • Instances have one network device and EBS runs over network. I suspect that because EBS competes with your application for network, it slows down both. The closest I got to confirming this was this HN comment.

  • Bigger players with more time to spend on benchmarking this stuff and more money to spend on premium support/advice are moving off of EBS or never used it in the first place.

  • EBS has more moving parts. If I had not been using it, I wouldn't have had any issues during the outage (I think).
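
To put rough numbers on the RAID0 failure point above, here is a small sketch. It assumes drives fail independently and that the top of Amazon's quoted 0.1-0.5% annual failure rate applies to every drive; both are simplifications, and the function name is made up for illustration.

```shell
#!/bin/sh
# raid0_afr AFR N
# Chance that at least one of N drives fails within a year, which is
# enough to lose a RAID0 array. Assumes independent failures and the
# same annual failure rate (AFR) for every drive.
raid0_afr() {
  awk -v p="$1" -v n="$2" 'BEGIN { printf "%.4f\n", 1 - (1 - p) ^ n }'
}

raid0_afr 0.005 4   # 4-drive array at 0.5% AFR per drive
raid0_afr 0.005 8   # 8-volume array at 0.5% AFR per drive
```

Under those assumptions the array failure rate is roughly 2% a year for four drives and roughly 4% for eight, so doubling the drive count about doubles the risk.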

Of course all this could change! Amazon seems long-term invested in EBS. Also, if too many people move off of it, the instance storage could get less consistent as it is also a shared resource.

Here are a few other notes for anyone else who is thinking about this stuff.

  • Amazon says there is a one-time write penalty on instance storage, which you can clear by writing out all the sectors. However, that takes time. I have not messed with this at all yet.

  • I switched off the EBS boot images as well, figuring that it has the same issues as above and also just increases my failure rates, as now both the EBS boot drive and the instance itself (underlying hardware) can fail. Instead, now I send new instances a user script that RAIDs the ephemeral storage and then pulls down data for it (both from S3 and other servers). To send a script (including the above directions), just use the -f option on ec2-run-instances.

  • RAID1 on EBS seems like a bad idea. It is unclear that the EBS volumes are statistically independent, and so you may not be getting much durability benefit. Additionally, as you can see from the benchmark posts, it really eats into performance. If your IO is low it may not matter that much, but at high IO you max out the network faster.

  • If you do choose to go with ephemeral storage, m1.xlarge seems like the sweet spot. It is the first instance with the four disks and has significantly less variability in resources. Of course you also get double memory, storage and CPU from m1.large though at double cost.

  • The Right Scale blog had a really good post on the AWS outage and links to the best other posts on it as well.

  • Other configurations are possible I suppose, e.g. RAID10 or RAID0 on three drives and then saving the other for moving large files around (sequential IO). But if you're looking for consistent speed on random reads like me, RAID0 across all the drives will be the fastest. And since you're presumably prepared if your instance goes down, a drive failure should be OK. You auto-failover and use your off-instance backups.

    That said, if the performance of RAID0 across two disks is OK, then you can theoretically use the m1.xlarge, get the mirroring and still get decent (presumably EBS level) durability without using EBS. However, I'm still unclear if this is even useful since I don't know what happens when a drive fails, i.e. whether the whole instance automatically goes down or not. If it does, the ephemeral mirror is useless, and given that the instance can go down anyway, you need to have off-instance backups regardless. And besides, an m1.xlarge is twice as expensive as an m1.large, so you can already run two m1.large instances and get instance failover for the same price. Note: it's probably clear that I haven't tried this particular case :).
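
For anyone wondering what a user script that RAIDs the ephemeral storage might look like, here is a minimal sketch. It is not the actual script: the device names, mount point and filesystem are all assumptions that vary by AMI. It only prints the commands unless DRY_RUN is set to 0.

```shell
#!/bin/sh
# Sketch of an EC2 user-data script that stripes the ephemeral disks.
# Assumptions (check your own instance): the four m1.xlarge ephemeral
# disks appear as /dev/sdb../dev/sde, none of them is already mounted,
# and mdadm is installed on the image.
DEVICES="${DEVICES:-/dev/sdb /dev/sdc /dev/sdd /dev/sde}"
MOUNT_POINT="${MOUNT_POINT:-/mnt/data}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = run them

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

setup_raid0() {
  # $DEVICES is left unquoted on purpose so it splits into one word per disk.
  count=$(echo $DEVICES | wc -w | tr -d ' ')
  run mdadm --create /dev/md0 --level=0 --raid-devices="$count" $DEVICES
  run mkfs.ext3 /dev/md0
  run mkdir -p "$MOUNT_POINT"
  run mount /dev/md0 "$MOUNT_POINT"
  # Pulling data down from S3 and other servers would go here.
}

setup_raid0
```

Some images mount the first ephemeral disk at boot, in which case it would need to be unmounted before mdadm can use it.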

Again, I'm pretty new to all of this, so as always feel free to correct/enlighten me in the comments. Some things in particular I'm still trying to find out are:

  1. What are actual failure rates for EBS volumes and ephemeral drives?

  2. Do people detect and actually recover from drive failure, or does it essentially always bring the instance down?

/usr/bin/nice is your friend

 
This is a sysadmin post. The other day I was running a background process on a production machine. I thought it wasn't going to eat up many resources, but I was wrong. It turned out it was doing a lot of random I/O and things ground to a halt.

Now often I will have top, vmstat and iostat open and renice annoying background processes to 20 when appropriate (r in top). If you don't already know, 'nix processes (I use FreeBSD) can have priorities, which the scheduler takes into account when giving out resources.

These priorities range from -20 to 20 (on FreeBSD at least), and you can see them in top under the NICE column. 0 is neutral. If you set something to 20, it will be tied for lowest priority process in the system.

nice is another command that messes with process priorities. It starts them out at a particular priority, as opposed to changing a priority via renice. For example 'nice -n 20 ./process' will start process at priority 20.

Then I got to thinking, why don't I do more of this initial nice stuff? Maybe seasoned sysadmins all already do this, but I went back through all my scripts and crontabs and explicitly set nice values. Then I went through my daemontools run scripts and set nice values there as well.

My web server (nginx) already did this via the 'worker_priority' directive, which I had previously set to -6. I gave negative values to my most important scripts, with relative values between them in order of importance. I gave positive values to less important scripts, and then 20 to various background processes kicked off via cron. For example, backups are now niced to 20.
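
A minimal sketch of what setting nice values up front can look like. The wrapper name, script names and paths are hypothetical placeholders, not the actual scripts described here.

```shell
#!/bin/sh
# run_niced LEVEL COMMAND [ARGS...]
# Start COMMAND at the given nice level instead of renicing it later.
run_niced() {
  level=$1
  shift
  nice -n "$level" "$@"
}

# Hypothetical places to apply the same idea (paths are placeholders):
#   crontab entry:    0 3 * * * /usr/bin/nice -n 20 /usr/local/bin/backup.sh
#   daemontools run:  exec nice -n 10 ./crawler
#   interactive:      run_niced 20 ./reindex.sh
```

Negative values generally require root, and some systems cap the maximum at 19 rather than 20.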

The system was already running pretty smoothly, but now it is even more so. And I think, more importantly, it will react better in times of need.

Final tip. When I want to renice a bunch of stuff, I usually do so on the command line instead of top, e.g.:

ps auxww | grep -i '[c]rawl' | awk '{print $2}' | xargs renice 10

That will take all the processes that match 'crawl' and renice them to 10. The brackets in '[c]rawl' keep grep from matching its own entry in the ps output.

How to stop most people from spidering your site and stealing content

 
I run a few sites with a lot of content that I don't want spidered by anyone other than the major search engines. At best, undesired spidering eats at your bandwidth and page response time. At worst, it can lead to widespread stealing and duplication of your content.

Anyway, this post details exactly what I currently do to prevent such spidering. All the code is in my new git repository I created to share code with you.

  • I created a DB named logip, owned by user logip, and then added these tables. The logip table records requests from IP addresses I am tracking. And the badips and badips2 tables hold IP addresses I am presently blocking.

  • I set up this Perl script (logip.pl) to run all the time. It tails my Web server log file and adds IP addresses to the logip table. There are variables at the top you can use to exclude certain virtual hosts, e.g. a site with few pages where you aren't concerned with crawling. You can also exempt certain IPs, e.g. your own.

    logip.pl currently only adds IPs for requests with 200 HTTP status codes (OKs). It also skips images, css, js, etc., IP addresses known to be associated with the major search engines, and repeated requests for the same page by the same IP (hard refreshes). The idea is to record unique successful requests of actual distinct pages from non-search engines.

    I run the script via daemontools, but you could run it through inetd or whatever. If daemontools interests you, the commands I ran to setup management are in logip.sh. If you use those commands, you will want to change them to point to your svscan and logip.pl script directories appropriately.

  • I set up this Perl script (badips.pl) to run periodically. In particular, I have it set up to run every minute via crontab. The interval between runs is the longest a new violator of my spidering policy can go unblocked. So if you run it every minute, people will have (on average) half a minute to grab your stuff before you start blocking them. I haven't found a need in practice to make the interval smaller, but if a lot of people want that, I could rewrite the script for that purpose.

    badips.pl works on a threshold basis. It looks at various tunable timeframes and checks whether new IP addresses have exceeded a page request threshold for those timeframes. The ones I currently use are in the script. For example, 20 page requests in the past minute, or 50 over a day. You can tune these to what is appropriate for your sites.

    The second variable is whether to log violators in either the badips or badips2 tables. The distinction is whether you think a violation is really bad or just pretty bad. For example, I currently mark passing the minute threshold as really bad and all else as just pretty bad. A pretty bad block stays around for 10 days, whereas a really bad block stays around for 180 days.

  • The output of badips.pl is a configuration file that nginx or Apache reads on the fly. It works with both Web servers, and there is a variable at the top of the file to indicate which one you are using. The resulting conf file is a bunch of Deny IP lines that lists out the current IPs you are blocking. 

    For Apache, there are some other lines that preemptively block suspicious user agents, e.g. curl and wget. I haven't yet ported these preemptive lines over to nginx. The intention is for Apache to see the file via an AllowOverride All directive, i.e. via a changing .htaccess file. For nginx, the configuration is reloaded on the fly.

  • If new IPs are added, you are sent an email notifying you of the new block(s). The script attempts to do reverse DNS on the IP and the forward DNS on that host to give you some context. For example, if it is a Google IP, you will want to unblock it. However, in practice, I haven't done that in a while because those IPs are well exempted in logip.pl. badips.pl also cleans up the DB at the end of the script before exiting, deleting expired records and vacuuming the table (for PostgreSQL).
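
The threshold idea can be sketched in a few lines of shell. This is not badips.pl: it swaps the PostgreSQL tables for plain "epoch_seconds ip" lines on stdin, the function name is made up, and it treats reaching the threshold as a violation.

```shell
#!/bin/sh
# find_violators NOW WINDOW_SECONDS THRESHOLD
# Reads "epoch_seconds ip" lines on stdin (a stand-in for the logip table)
# and prints one nginx "deny" line for every IP with at least THRESHOLD
# requests in the WINDOW_SECONDS leading up to NOW.
find_violators() {
  awk -v now="$1" -v window="$2" -v limit="$3" '
    $1 > now - window && $1 <= now { hits[$2]++ }
    END {
      for (ip in hits)
        if (hits[ip] >= limit)
          printf "deny %s;\n", ip
    }
  ' | sort
}

# Example: 20 page requests in the past minute, against a hypothetical
# requests.txt holding one "epoch_seconds ip" pair per line.
#   find_violators "$(date +%s)" 60 20 < requests.txt
```

Running it once per timeframe, each with its own threshold, gives the several tunable windows described above.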

I've evolved this process over the last few years, and it works quite well for me and my sites. Your feedback is of course welcome. I'm always looking for improvements and am willing to make them.

I am aware that the current process has some holes. The two biggest are:

  1. You can spider successfully via a large number of IPs, most notably the TOR network. In the past, I have added those IPs dynamically, and I might do that again. Adding a ton of IPs slows down the Web server considerably, however. This is why I backed off of that approach in the past. 

  2. You can grab pages really slowly. That is, if you stay under the thresholds, you won't get caught by this system.

Enjoy!

Duck Duck Go Architecture

 
I often get asked what Duck Duck Go "runs on."  This post basically answers that question by outlining the major moving parts that serve queries, i.e. its architecture.  I'll detail in another post what, in particular, makes it fast, i.e. tunables and other specifics.

Caveat: this architecture was designed for maximum query speed for our initial soft launch.  While also somewhat designed for eventual scalability, we don't have that much traffic yet (though we are growing at a nice clip).  So don't take this as advice like you might get at High Scalability.  It's really just for your amusement.  However, my last startup did have some scale (relatively speaking of course) so I know a bit about what I'm doing...

  1. DNS served by DNS Made Easy.  I used to serve it myself via djbdns, but DNS Made Easy is faster, cheap, and makes it easier for me to deal with fail-over.

  2. All requests come into nginx. I used to use two instances of Apache, one for dynamic requests and one for static files.  But nginx is faster, uses less memory, and is more stable.

  3. If a static file, nginx serves it directly, e.g. the home page.  It's really good at that.

  4. Otherwise, nginx checks my memcached store.  I hadn't used memcached before this, and find it a big win.

  5. If not in memcached store, nginx proxies to FastCGI processes that are running in the background.  I hadn't used FastCGI before this, as I always had used mod_perl with Apache. 

  6. The FastCGI processes are managed by daemontools (as is memcached).  At first I was worried about stability in these processes, but it hasn't proved to be an issue yet.

  7. Internally, the FastCGI scripts are written in Perl and run by the FCGI::Engine Perl module.

  8. The Perl scripts access a PostgreSQL database (when needed) to retrieve our zero-click information, among other things.

  9. The whole thing runs on FreeBSD.

  10. For fail-over and scalability purposes, I have EC2 images that replicate the above except that they run on Ubuntu (since, at the time, FreeBSD wasn't available).

  11. All of our site icons and zero-click info images are hosted on S3.

  12. We also reference some external YUI JS files.

Any questions?  

Also, I'd love any feedback on this architecture.  I'm always looking for ways to speed it up!

Update: additional comments can be found here.

FreeBSD One-liner to Group Referrers

 
A couple days ago I released two widgets, and since then I've wanted to keep an eye on installations for bug detection and vanity purposes.  Tailing the logs for this purpose was becoming cumbersome, so I whipped up this one-liner to tell me what is going on.

grep '[kp]\.js' /var/log/nginx/nginx-access.log | awk '{print $10}' | perl -pe 's/^\"http:\/\/([^\/]+).*\"$/$1/' | sort | uniq -c | sort -n

I'm posting it here to remember it and because it might be useful to you.  Here's what it does.

  1. Greps for the Web log lines desired.  In my case, I'm looking for two JS files in an nginx log.  In your case, you'll probably want to change the pattern and the log path.

  2. Awks out the referrer line.  In my nginx log this is the 10th field.  In your case, it might be a slightly different #.

  3. Perls out the domain.  You could skip this step if you want to count each different referring URL differently.  In my case, the widget is deployed on blogs, and so each post shows up as a different referring URL, and that creates noise, so I grouped them.

  4. Sorts the domains, so that step 5 works (uniq only collapses identical lines that are next to each other).

  5. Uniqs the domains, i.e. groups & counts them (the -c).  

  6. Sorts the grouped domains by the count, numerically (-n).
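
To see steps 3 through 6 on their own, here is a small sketch that runs them over made-up referrer values. It uses sed in place of perl for the domain extraction so it needs nothing beyond standard tools; the function name and the example domains are hypothetical.

```shell
#!/bin/sh
# group_referrers
# Reads quoted referrer URLs on stdin (what awk '{print $10}' would emit),
# reduces each to its domain, then groups and counts them.
group_referrers() {
  sed -E 's#^"http://([^/]+).*"$#\1#' | sort | uniq -c | sort -n
}

# Made-up referrers: two posts on one blog, one page on another site.
printf '%s\n' \
  '"http://blog.example.com/post-1"' \
  '"http://other.example.org/"' \
  '"http://blog.example.com/post-2"' | group_referrers
```

The two blog posts collapse into a single line with a count of 2, which is the noise reduction step 3 is after.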

Enjoy! 

Turning off Logging in daemontools (djbdns' dnscache)

 
After optimizing my Web crawler for the Parked Domains Project, I started crawling so fast that the log process for my DNS server was eating up 20% of one of the CPUs (and wasting a lot of I/O as well).

I run a local DNS cache using djbdns on each crawling server, which also handles all local DNS queries. My dnscache is run by daemontools, and if you are familiar with this world, you already know I was using multilog for logging.

The logging for dnscache is basically useless when you are not debugging, and it is very extensive.  So it makes perfect sense why it was taking up so much CPU and I/O.  

I'm writing this post to correct the Internet.  When searching for turning off logging for dnscache or multilog, you get a lot of instructions telling you to replace your log/run file with this:

exec setuidgid daemon multilog -*

That will free up the I/O, but only about half of the CPU utilization (at least in my case).  The problem is your system is still piping the log from the main process to multilog--multilog just isn't writing it anywhere. 

What you really want to do is stop the logging at its source.  To do so, don't mess with your log/run file at all.  Instead, change your actual run file from

exec 2>&1

to

exec 1>/dev/null 2>&1

Now nothing will go to multilog at all.  Of course, you need to restart or HUP the log and run processes 
