May 2011 Archives

PostgreSQL tips and tricks

 

I've been using PostgreSQL (an open-source database) for many years. Here are some of the less obvious tips and tricks I've picked up. This post isn't meant to be a comprehensive tuning or scaling guide, though I link to some good documentation below.

  • When importing data, use COPY FROM and no indexes. These are two common ways to increase speed when populating a database (more listed there). The idea is to just let the DB copy the data and not do extra work at the same time (like also messing with indexes or transactional statements). The COPY command has a lot of arguments, but I usually just do something like this:

    COPY table FROM '/tmp/path/to/file.txt';

    where file.txt is a tab delimited file matching the table columns exactly (the default). Another gotcha here is you often get permission errors when running that command. If you connect to the DB as the superuser all of that goes away, e.g.

    psql db postgres

    You can add indexes at the end and everything will be much faster.
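
The load-then-index order described in this tip can be sketched as a small shell helper. This is only an illustration: the function name bulk_load_sql, the generated index name, and the choice to wrap the load in one transaction are my assumptions, not something from the original workflow. It only prints SQL, so you can review it before piping it to psql.

```shell
# Print SQL that bulk-loads a tab-delimited file and only then builds an index.
# All names passed in are placeholders; nothing here touches a database.
bulk_load_sql() {
    table=$1
    file=$2
    column=$3
    printf 'BEGIN;\n'
    printf "COPY %s FROM '%s';\n" "$table" "$file"
    # The index is created after the data is in place, not before.
    printf 'CREATE INDEX %s_%s_idx ON %s (%s);\n' "$table" "$column" "$table" "$column"
    printf 'COMMIT;\n'
    printf 'ANALYZE %s;\n' "$table"
}

# Example (not run here): review the output, then pipe it to psql as the superuser:
#   bulk_load_sql mytable /tmp/path/to/file.txt id | psql db postgres
```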


  • Replace a live table with DROP TABLE/ALTER TABLE. Suppose you are re-populating a read-only table that is live, so you don't want to drop the indexes or otherwise have no availability. What I do is create a new table based on the first, populate it (with COPY and no indexes), then rename the main table out of the way and rename the new table to take its place (dropping the old one after testing). Minimal downtime, e.g.:

    CREATE TABLE table2 AS SELECT * FROM table LIMIT 1;
    DELETE FROM table2;
    COPY table2 FROM '/tmp/path/to/file.txt';
    CREATE INDEX blah2 ON table2 (col);  -- col is a placeholder for the column(s) you index
    VACUUM ANALYZE VERBOSE table2;
    ALTER TABLE table RENAME TO table3;
    ALTER INDEX blah RENAME TO blah3;
    ALTER TABLE table2 RENAME TO table;
    ALTER INDEX blah2 RENAME TO blah;
    --TEST
    DROP TABLE table3;

    All the ALTER commands happen pretty much instantly. This creates an identical looking table (table2); moves your current table to a backup (table3); moves the new table to be the primary; and finally removes the original (after testing).
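
One possible refinement, sketched under my own assumptions: since PostgreSQL DDL is transactional, the two table renames can be wrapped in a single transaction so readers never see a moment where the live name is missing. The helper name swap_table_sql is hypothetical, and the index renames from the example above are left out for brevity. It only prints SQL.

```shell
# Print SQL that swaps a freshly loaded table into place inside one transaction.
# Arguments: live table name, new table name, name to keep the old copy under.
swap_table_sql() {
    live=$1
    fresh=$2
    backup=$3
    printf 'BEGIN;\n'
    printf 'ALTER TABLE %s RENAME TO %s;\n' "$live" "$backup"
    printf 'ALTER TABLE %s RENAME TO %s;\n' "$fresh" "$live"
    printf 'COMMIT;\n'
}

# Example (not run here):
#   swap_table_sql table table2 table3 | psql db postgres
```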


  • Throttle pg sysadmin tasks so they don't impact performance. On an active database, you're often running larger tasks that can impact the performance of the database, which is bad for a fast database-backed Web site. For example, pg_dump to make backups, COPY/CREATE INDEX as in the above examples, etc. Not to fear; there are some simple things you can do to lessen the impact, e.g.:

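    nice -n 19 ionice -c 2 -n 7 pg_dump
    nice -n 19 ionice -c 2 -n 7 psql

    This wraps the command to have the lowest CPU priority (via nice, where 19 is the lowest) and the lowest best-effort IO priority (via ionice, where 7 is the lowest within the best-effort class). Additionally I will often also use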

    cpulimit -e pg_dump -l 10 -z

    which will limit pg_dump (in this case) to 10% of CPU while the current one is running and then exit (via cpulimit). Of course, it also usually makes sense to run these things at off times using cron, etc.


  • For fast Web sites, have everything use indexes. IO is slow, so avoid it. You want your queries that users hit on every page to essentially return instantly, which means in-memory index lookups.

    The first step is to make sure you have the indexes defined right so your queries are actually using them. Use EXPLAIN for that, but don't just trust it -- always then do EXPLAIN ANALYZE to make sure it is returning instantly. Note you have to try different queries because often re-running the first query will be fast since it will be thereafter cached for a while.

    If you can't get things to use indexes, and really in any case, you want to tune pg to favor indexes more often. What I usually do is something like

    cpu_index_tuple_cost = 0.0005
    effective_cache_size = 2GB

    These two params affect the calculation pg does to determine whether to use an index or not. If you're confident that your indexes are in memory, then the cpu_index_tuple_cost should be lower than the default (I use the value above). The effective_cache_size is how much disk cache you can expect pg to have. If you want to get a bit crazier you can also do

    enable_seqscan = off

    which will try to avoid sequential scans at all costs, though some people think that is a bad idea. You can also see what sequential scans are going on with the following command, which indicates where you need additional indexes or need them to be tweaked:

    SELECT relname,seq_scan FROM pg_stat_all_tables ORDER BY seq_scan DESC LIMIT 20;


  • For fast Web sites, make sure your indexes are in memory. Indexes are great, but when they are not in memory, they can still use a good deal of IO and slow you down.

    First, don't set shared_buffers too high. It clearly varies by application what the right memory param values are, but you have to understand that much of pg caching is done by the OS via disk cache. This is what the aforementioned effective_cache_size is all about. When you set shared_buffers too high, it both eats into the OS disk cache and leads to redundant caching. Read up here on the in-memory values.

    Second, find out how big your indexes are. You can do that like this:

    SELECT sum(((relpages*8)/1024)) as MB FROM pg_class WHERE reltype=0;

    or by individual index like this:

    SELECT relname, ((relpages*8)/1024) as MB, reltype FROM pg_class WHERE reltype=0 ORDER BY relpages DESC LIMIT 30;

    If your indexes are way greater than your memory, figure out how to reduce them, e.g. by dropping ones you don't need, sharding, optimizing (changing what exactly is indexed), etc.

    Once you get them down to a reasonably in-memory value, ideally you'd want effective_cache_size to be above that value and mean it. Since we're talking about OS disk cache, note that other IO you do on disk really affects performance because you're flushing that cache. This is why it is often a good idea to either a) separate the DB machine or b) run all other IO tasks on different disks, e.g. have tmp and log and backup stuff on other mount points tied to different physical disks.

  • Make indexes faster. There are a few things you can do to actually speed up indexes. First, you can tell pg to calculate more statistics on a column when analyzing it like this:

    ALTER TABLE table ALTER COLUMN column SET STATISTICS 1000;

    For large and irregular data sets, the default of 100 is too low. These statistics are used by the query planner.

    Second, you need to VACUUM ANALYZE when you sufficiently change a table. Later versions of pg have autovacuum and that may do it for you -- I haven't dug into it enough to know, but I still rely on manual vacuums (for updated index analysis) via cron.

    You can also issue CREATE INDEX commands that operate on a subset of a column (a partial index, defined with a WHERE clause). If that is the only subset you're querying via the index, then that can also speed it up.

    Finally I'm told the CLUSTER command can further speed things up by organizing the table (on disk) according to an index. If you use one index all the time to query a table, this could make getting the subsequent information off of the disk faster. I have not played around with this feature yet though.


  • Vacuum (at least) occasionally. If you turn off auto-vacuum and never vacuum you will eventually lose data! This query can tell you how far you are from being screwed in each database:

     SELECT datname, age(datfrozenxid) FROM pg_database;

    I hope you never run into this, but I did a few years ago and it is a pain. Just make sure you are routinely vacuuming please. It's another good reason to add a daily/weekly cron job as a backup.
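
A minimal sketch of how that check could be automated from cron. The function name flag_old_databases and the default threshold of one billion are my own illustrative choices (roughly half of the ~2 billion transaction ID wraparound limit), so tune them for your setup. It reads "datname|age" lines, which is the format psql produces with the -At flags.

```shell
# Read "datname|age" lines on stdin and print the names of databases whose
# transaction ID age exceeds a threshold (first argument, default 1000000000).
flag_old_databases() {
    threshold=${1:-1000000000}
    # "+ 0" forces awk to compare the values as numbers, not strings.
    awk -F'|' -v t="$threshold" '$2 + 0 > t + 0 { print $1 }'
}

# Example (not run here), using the query from this tip:
#   psql -At -c "SELECT datname, age(datfrozenxid) FROM pg_database" db postgres | flag_old_databases
```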


  • Get faster vacuums. The amount of memory pg uses for vacuuming by default is super low. You can increase it and thus dramatically speed up vacuuming by doing something like:

    maintenance_work_mem = 256MB


  • Use Bucardo for easy master/slave stuff. Lots about that in this post.


  • Don't forget listen_addresses when trying to connect remotely. To connect remotely you need to mess with the pg_hba.conf file to authorize remote connections. But a big gotcha here (that has got me many times) is forgetting to change listen_addresses to actually listen on an interface open to the outside. The default is just localhost.

    listen_addresses = '*'

    will listen on every interface available.


  • When troubleshooting, first check you haven't run out of connections, then check the error log. max_connections is just one of the things you can tune, but it is probably one you will want to. Once this limit is reached your clients will not be able to connect. You can't set it super high because of memory constraints, but the default is usually too low. You can also reduce the number of connections you need by doing database pooling of some kind.

    Note that in a situation where you run out of connections, you often will have some superuser connections left, since they are allocated separately. So you can do

    psql db postgres
    select * from pg_stat_activity;

    and see what is going on. If you see the same query over and over again, you probably have a bottleneck in that query :). See above for making sure everything uses indexes and returns instantly.
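
To spot "the same query over and over" without eyeballing the output, something like the following could help. It is a rough sketch: top_queries is a hypothetical helper, it counts identical lines, and so it assumes each query fits on one line.

```shell
# Read one query per line on stdin and print the most frequent ones,
# most common first. The optional argument is how many to show (default 5).
top_queries() {
    sort | uniq -c | sort -rn | head -n "${1:-5}"
}

# Example (not run here). The query text column is current_query on the
# PostgreSQL versions current as of this post; later versions may name it differently:
#   psql -At -c "select current_query from pg_stat_activity" db postgres | top_queries
```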


  • OS tuning to allow for increased shared_buffers. Often the first thing you'll try to do is increase shared_buffers and then pg won't start because it says you can't allocate enough memory. You need to tell the OS to let it use more than the default, which you can do like:

    echo 'kernel.shmmax=2147483648' >> /etc/sysctl.conf

    for Ubuntu or

    echo 'kern.ipc.shmmax=2147483648' >> /etc/sysctl.conf

    on FreeBSD. This requires a reboot, but on Ubuntu I believe you can do sysctl -w to also make it work immediately.
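
The 2147483648 above is just 2GB expressed in bytes. If you want a different size, a tiny helper like this (shmmax_line is a name I made up, and it assumes your shell does 64-bit arithmetic) prints the line to put in /etc/sysctl.conf:

```shell
# Print a sysctl.conf line setting the shared memory maximum to a size in GB.
# The second argument is the sysctl key; it defaults to the Linux name.
shmmax_line() {
    gb=$1
    key=${2:-kernel.shmmax}
    echo "${key}=$((gb * 1024 * 1024 * 1024))"
}

# Example (not run here):
#   shmmax_line 2 >> /etc/sysctl.conf                   # Ubuntu
#   shmmax_line 2 kern.ipc.shmmax >> /etc/sysctl.conf   # FreeBSD
```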


  • You can HUP pg for most config changes. You generally do not need to restart pg when changing minor stuff, though some things (like annoyingly listen_addresses!) you do. Note HUPping can usually be accomplished by issuing a reload command through the start up script interface.

Update: lots of good comments on HN.

Online services I pay for

 
I currently pay for the following online services.


Entertainment

Hulu
  • What: regular TV on iPad.
  • Why: for in bed & on couch - less hot/cumbersome than laptop and can put on while using laptop.
  • Love: portability; queue (though you get that with regular Hulu).
  • Hate: some shows still Web only; has ads; endless buffering.


The Daily

  • What: daily newspaper on the iPad.
  • Why: initially bc sister works there (graphic designer), but now bc it is fun to read before going to bed at night.
  • Love: UX, i.e. nice browsing and readability features; good graphics.
  • Hate: can't go back to older issues since I don't pick it up every day.


Productivity


Skype

  • What: 3-way video calling.
  • Why: hate going to meetings in person.
  • Love: way more personal than phone.
  • Hate: often peoples' Internet connections aren't fast enough.


Amazon Prime

  • What: free, two-day shipping on many Amazon products (no minimum, overnight for $3.99).
  • Why: going to the store is a hassle.
  • Love: shipping is so fast (often just one day).
  • Hate: how much stuff on Amazon is not prime-able.

Google Apps

  • What: online email.
  • Why: solved my Gmail issues (slowness, sending limits).
  • Love: 2-step authentication (though can get that on regular Gmail).
  • Hate: can't add free (lower) accounts once paid for higher ones; switching between accounts a pain.


GitHub

  • What: online code repository.
  • Why: easy private collaboration; could use my servers, but it has issues built in and don't have to share credentials/resources.
  • Love: easy segmented collaboration.
  • Hate: dealing with pull requests.


Ooma

  • What: Internet-based home phone.
  • Why: free after you buy device.
  • Love: subscribed to premium for emailed voicemail and automatic blacklist detection.
  • Hate: faxing doesn't work well.


HelloFax
  • What: online fax.
  • Why: my home fax doesn't work very well.
  • Love: easy.
  • Hate: feels expensive.


Infrastructure


Amazon Web Services

  • What: cloud services.
  • Why: initially wanted easy failover and scaling capacity; now run as primary host.
  • Love: spin up new nodes in minutes.
  • Hate: dealing with EBS issues; instances fail more than (I think) they should.


Linode

  • What: virtual (shared) servers.
  • Why: shared development resource.
  • Love: cheap, root access, no bs.
  • Hate: larger plans expensive.


DNS Made Easy

  • What: managed DNS.
  • Why: speed, uptime, built-in monitoring, failover and global traffic redirection.
  • Love: just works, cheap, great support.
  • Hate: web interface is a bit clunky/confusing.


Pingdom

  • What: web site monitoring.
  • Why: external notifications when things go down.
  • Love: response time reports (from around the world).
  • Hate: expensive; will prob let expire since I have three other monitors now.


Server Density

  • What: server monitoring.
  • Why: gives alerts when on-server things go awry.
  • Love: Android notifications; alert types (exact process, system resources); great support.
  • Hate: new alerts don't remember your default (usual) settings. 


Jungle Disk

  • What: online backup.
  • Why: peace of mind.
  • Love: can do network drives; set and forget; can use own S3/rackspace credentials if desired.
  • Hate: nothing, though thinking of moving my music files to Amazon for the streaming ability.


Full disclosure: the Amazon, DNSMadeEasy, Linode, Ooma, and Hulu links above have embedded affiliate codes (and in some cases discounts). I just did that for the fun of it (I don't really expect/care if I make $5 or whatever).


Update: there are some good comments on HN about other services people pay for.

Update2: there are more good comments on Lifehacker.

Replicating PostgreSQL with Bucardo

 
If the title makes no sense to you, PostgreSQL is an open-source database. Replication means continuously copying changes from one database to another, e.g. for backup, scalability or high availability. And Bucardo is one of several pieces of software that help you achieve various forms of Postgres replication.

Why Bucardo?

First off, why Bucardo, especially since Postgres 9.0 now has built-in hot standby/streaming replication (full docs)? Steve Singer wrote up some cases where you wouldn't want to/be able to use the built-in replication. A more detailed comparison of the various options is on the PG wiki.

I won't repeat everything you can find there, but at the highest level the built-in system works on the whole DB cluster, as opposed to specific databases or even tables. Also, it is designed to do so by essentially mirroring files, so it is highly recommended that the platforms (including OS and Postgres versions) match as exactly as possible. Both of these requirements don't suit me. I run FreeBSD and Ubuntu and would really like to replicate only certain tables within certain databases.

After looking at the various options in detail, I wanted to use something (at least initially) both free and open source. Bucardo is written in Perl (so theoretically I could contribute patches); it is active (I looked at the changelog and mailing list); it is simple to setup and use (I looked at the docs); and it is very flexible (offers multi-master, multi-slave, and even cascading slave configurations).

However, note that Bucardo replication usually lags a few sec behind (depending on network config), so you will lose data if the master goes down and is not recovered properly. That is, it is not synchronous replication as available in 9.1, which confirms that changes are made in both DBs before committing (though also has the drawback of slowing things down a bit).

My stuff is generally read-only on the slaves and data can be out of sync (or lost), so the Bucardo model (which works off of triggers) is perfect for me at this time. Nevertheless, you can do failover via Bucardo, easier with a swap sync.

Bucardo Overview

The Bucardo documentation is pretty good so I won't rehash it all. Installation is straightforward -- you install a few Perl modules and then are left with a command line program called bucardo_ctl, which controls everything. I noticed the online documentation can differ from the man page in details, so you might want to look at both. Also on the wiki, I noticed not everything is linked properly from the main pages, so you might want to use the search feature if you're looking for something in particular.

The documentation goes through an example if you want to play around with it. Here's an even simpler one:

createdb testa
createdb testb
create table test (id integer primary key);
bucardo_ctl add database testa name=testa
bucardo_ctl add database testb name=testb
bucardo_ctl add table test db=testa
bucardo_ctl add herd test_herd test
bucardo_ctl add sync test_sync source=test_herd targetdb=testb type=pushdelta

The above will create two databases. You should issue the create table command in both. Then you add the databases to the bucardo database (which is stored in a postgres DB named bucardo). A "herd" is just a set of tables to replicate, so we add the test table from testa. Then we add a sync.

bucardo_ctl stop
bucardo_ctl start
testa=# insert into test (id) values (1);
bucardo_ctl status
Ins/Upd/Del:          1 / 0 / 0
testb=# select * from test;
 id
----
  1
(1 row)

Stop and start bucardo, which is the easiest way to ensure the new sync starts. Then insert a row in testa and see it appear in testb. bucardo_ctl status will tell you that the one row was inserted.

Non-obvious notes

After messing around with this live for a week, I've noticed a bunch of things that were non-obvious (at least to me) that you might want to keep in mind.

  • On install, you may need to add Perl to Postgres or you'll get a weird error. You can do that with this command:

    createlang plperl template1 -U pgsql

  • The verbosity flags to bucardo_ctl currently don't do much, e.g. --quiet doesn't make things quiet. To turn off the debug log (which gets big fast because it is high verbosity), use debugfile=0 when starting bucardo_ctl. Bucardo is aware of this issue and will be cleaning this up in future releases.

  • The sendmail=1 flag to bucardo_ctl works, but you have to set your from and to email first by doing bucardo_ctl set default_email_from=whatever, etc. However, if you have a situation where a sync fails repeatedly you'll get a ton of emails, like multiple a sec. So I turned this off for now until that case is fixed (I just reported it).

  • There is a debugdir flag you can pass to bucardo_ctl. If you don't set this, logfiles will get printed to the directory where it is started, so I would cd to that directory first or use the flag.

  • If you have a sync that fails, e.g. from network error, and you keep writing stuff to the master db, when it comes back online it will try to copy everything that changed at once. This can dramatically impact performance on the target. You could do a manual sync up in that case (on an off-peak time and in a secondary table) and avoid the bucardo process doing it for you. If you do that (or decide to just ignore those changes), you need to flush the track and delta tables for that database, e.g.

    psql db bucardo
    delete from bucardo_track;
    delete from bucardo_delta;

    If you also want to delete all the old sync info (since it could have failed quickly thousands of times), you need to do the following.

    psql bucardo bucardo
    delete from q;

    Note that in the first case you are connecting to the db as the bucardo user and not your regular database owner user. In the second case you are connecting to the bucardo db and not the db being replicated.

  • To completely remove bucardo (e.g. if you want to start over), it is not enough to remove the bucardo db since extra tables were added to the other databases (being replicated) and extra triggers to the tables within them. You need to also issue this command.

    psql template1 pgsql
    drop schema bucardo cascade;

  • If you want to completely shutdown a running sync going awry and make sure everything is stopped.

    bucardo_ctl stop
    ps auxww | grep -i '[b]ucardo' | awk '{print $2}' | xargs kill -TERM

    You want to also issue that command on remote machines to ensure those processes are killed.

  • It initially confused me how to add a remote DB. You add it to bucardo like this:

    bucardo_ctl add database waki name=test host=test.duckduckgo.com

    That is, you pass in the fully qualified domain name as the host parameter.

  • To test that a sync can work, i.e. it can reach the host in a properly authenticated manner, you can do

    bucardo_ctl validate sync_name

    You will need to do two things for this to complete successfully. First, there needs to be a bucardo user on the remote machine with access to the db. The simplest way is to make it a super user, a la 

    su -m pgsql -c 'createuser -sDRw bucardo'

    on FreeBSD or on Ubuntu like this:

    sudo su -m postgres -c 'createuser -sDRw bucardo'

    Second, you need to allow the host machine to talk to the remote machine, i.e. you need to enable remote access. This is a two step process. In postgresql.conf you need to change listen_addresses and add in that IP or use * to listen on all. Then, in pg_hba.conf, you need to add in a way for the remote bucardo user to authenticate. Again the easiest way is to trust that IP, though security-minded people will probably want to shoot me for suggesting it. Either way, you can test it by using psql and passing in host parameters or just issuing the validate command above.

  • The quickest way to tell if things are working is to do

    bucardo_ctl status

    which will tell you info on all your syncs or

    bucardo_ctl status sync_name

    for detailed info on one sync. Look at the Last_bad and Last_good times in particular.

  • To remove a sync, it is not enough to remove it in bucardo. You also have to remove the triggers it added on the various tables. You could remove everything as noted above and start over. But if it is just one table, you can just remove the triggers like so:

    psql db dbname
    drop trigger bucardo_add_delta ON table;
    drop trigger bucardo_triggerkick_sync_name ON table;

    Note if you have multiple syncs on the table, you don't need to drop the delta trigger, just the triggerkick one. If you don't do this and add another sync you'll end up with multiple triggers, which you don't want.
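
As a sketch of the rule above, here is a hypothetical helper (drop_sync_triggers_sql is my own name) that prints the DROP TRIGGER statements for one table, using the trigger names shown in this tip. Pass "no" as the third argument when other syncs still use the table, so the delta trigger is left alone.

```shell
# Print the statements that remove one sync's triggers from one table.
# Arguments: table name, sync name, and "yes"/"no" for whether this is the
# last sync on the table (default "yes", which also drops the delta trigger).
drop_sync_triggers_sql() {
    table=$1
    sync=$2
    last_sync=${3:-yes}
    if [ "$last_sync" = "yes" ]; then
        printf 'DROP TRIGGER bucardo_add_delta ON %s;\n' "$table"
    fi
    printf 'DROP TRIGGER bucardo_triggerkick_%s ON %s;\n' "$sync" "$table"
}

# Example (not run here): review the output, then pipe it to psql
# connected to the replicated database.
#   drop_sync_triggers_sql test test_sync no
```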

  • If you are replicating to multiple slaves, I found that it is better to use a sync for each one instead of one sync to all (via a db group). The reason is if one is unavailable it will bring down syncing to all, which a) stops syncing for good machines and b) results in that problem above where you do a big sync when everything is back up.

  • To delete a db test without removing your whole install, you can do this:

    bucardo_ctl deactivate sync_name
    bucardo_ctl delete sync sync_name
    bucardo_ctl delete db target_db_name
    bucardo_ctl delete herd herd_name
    bucardo_ctl delete table table_name
    bucardo_ctl delete db source_db_name

    Then drop the triggers as noted in the above procedure.

  • When upgrading bucardo, it is not enough to just install the new Perl module. You also need to:

    bucardo_ctl upgrade

Conclusion

That was a lot of notes. While it ended up being more complicated than I originally anticipated, I am still currently happy with this solution. It does work pretty well.

Additionally, the Bucardo mailing list is very responsive. I reported two bugs and they were both fixed last night.

Riding demographic changes

 
One of the things that fascinates me most about history is that there were so many fewer people on the planet not too long ago. I'm generally good at empathizing, but I find it very hard to wrap my head around this aspect of, say, the 1700s.

This morning I asked my Dad whether he has noticed any changes on this axis in his lifetime. I haven't (born in '79). He has though. He said that up until the mid 80s when he went out (first in NYC, and then in Atlanta), even to the airport (in this case, Atlanta), he would more often than not see multiple people he knew and think nothing of it. Fast forward to today and going to the airport and even the supermarket generally results in not seeing anyone he knows, even though he still lives in the same place.

These demographic changes relate to picking markets in startups. Consider the growth in India in Internet and mobile payments that is currently happening. Thousands of online businesses will spring into being, riding this wave. Yes, Facebook et al. will be big, but opportunity abounds for smaller players too who are catering to this market.

Or consider the well-known-by-now example of the app stores. I saw on Hacker News this morning the announcement of Leafsnap, an app that lets you take a picture of a leaf and tells you what tree it comes from. I told my Dad and he was immediately interested in using it, already having trees he had been waiting to identify. Then this comment caught my eye from a potential customer saying they'd love a whole line of these, for bugs, animal footprints, etc. that they could use as a suite on hiking trips.

Would this make a $100M company? Probably not, and as an investor I'm still very wary of app companies for various reasons. But would this make a great business for a few developers? Absolutely. And I could see REI or EMS or whoever buying up the company for $20M.

I haven't checked, but I would bet that in the 60s a higher % of people read comic books, but today more people in total read them. There are just so many more people. Riding these waves of demographic changes, e.g. moving to mobile/tablets (app stores), or coming online in the first place (India), just makes success that much easier.

The opposite is true too. I cringe when an entrepreneur tells me they're going into news or music. It's a little different since people continue to consume those things, but in general those industries are in turmoil and shedding people and market cap instead of growing substantially. It feels like voluntarily running into a burning building and trying to find some valuables and get out before the whole thing collapses. Whereas India feels more like going to a huge concert and selling water. 

RAID0 ephemeral storage on AWS EC2

 
If you're thinking of doing RAID0 (disk striping) on the ephemeral storage disks attached to an EC2 instance, this post is for you. First I'll go through some directions since I didn't find great ones elsewhere when trying to do it myself. Then I'll get to why I wanted to do it in the first place.

When you run your instances, you have to add the extra drives in the block device mapping or they will not be available to you. This confused me for a while. If you do nothing, you'll just get the default, which is one ephemeral drive attached and mounted at /mnt. And there is apparently nothing you can do to connect them in after the fact.

Different types of instances have different numbers and sizes of drives available. Here is the listing. Summary: you get an extra drive (/dev/sdc) when you move up to an m1.large instance. If you move up to an m1.xlarge, you get two more (/dev/sdd & /dev/sde) for a total of four.  

The more drives you have the faster your RAID0 volume will be for random reading (on average). You have to be careful though because the larger instances don't necessarily all have four drives. This table tells you outright how many each has. For example, the high memory extra large seems to only have one drive attached. Go figure.

I couldn't figure out how to make the right block device mappings from the console. It doesn't seem to be an option in the wizard. So you have to do it from the command line. For the m1.large instance it looks like this:

ec2-run-instances -b '/dev/sdb=ephemeral0' -b '/dev/sdc=ephemeral1'

and for m1.xlarge it looks like this:

ec2-run-instances ami-90fe01f9 -b '/dev/sdb=ephemeral0' -b '/dev/sdc=ephemeral1' -b '/dev/sdd=ephemeral2' -b '/dev/sde=ephemeral3'

I actually tried naming them other things and it kept throwing errors at me. I also tried baking the block device mappings into images, which you can supposedly do by passing the same -b arguments to ec2-register as per the documentation, but that didn't work for me either.

Once booted, you have to unmount the /mnt mount point.

umount /mnt

This won't work if you are using it for anything, so make sure you do all this before you do (I put it in a user data script). You can then run the following commands to create the RAID0 volume. For m1.large it looks like this:

yes | mdadm --create /dev/md0 --level=0 -c256 --raid-devices=2 /dev/sdb /dev/sdc
echo 'DEVICE /dev/sdb /dev/sdc' > /etc/mdadm.conf
mdadm --detail --scan >> /etc/mdadm.conf

and for m1.xlarge it looks like this:

yes | mdadm --create /dev/md0 --level=0 -c256 --raid-devices=4 /dev/sdb /dev/sdc /dev/sdd /dev/sde
echo 'DEVICE /dev/sdb /dev/sdc /dev/sdd /dev/sde' > /etc/mdadm.conf
mdadm --detail --scan >> /etc/mdadm.conf
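
The two variants above differ only in the device list, so they can be generated from it. This is a sketch under my own assumptions: raid0_create_cmd and mdadm_device_line are hypothetical helpers that only print the commands shown above for whatever devices you pass, rather than running mdadm themselves.

```shell
# Print the mdadm create command for a RAID0 array over the given devices.
# "$#" is the number of devices passed and "$*" is the device list itself.
raid0_create_cmd() {
    echo "mdadm --create /dev/md0 --level=0 -c256 --raid-devices=$# $*"
}

# Print the matching DEVICE line for /etc/mdadm.conf.
mdadm_device_line() {
    echo "DEVICE $*"
}

# Example (not run here), for an m1.large:
#   raid0_create_cmd /dev/sdb /dev/sdc
#   mdadm_device_line /dev/sdb /dev/sdc
```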

Then you need to create the file system, e.g.: 

blockdev --setra 65536 /dev/md0
mkfs.xfs -f /dev/md0
mkdir -p /mnt/md0 && mount -t xfs -o noatime /dev/md0 /mnt/md0
cd /mnt/md0

Finally, if you want it to come up on reboot, remove /mnt from fstab and add /mnt/md0

perl -ne 'print if $_ !~ /mnt/' /etc/fstab > /etc/fstab.2
echo '/dev/md0  /mnt/md0  xfs    noatime 0 0' >> /etc/fstab.2
mv /etc/fstab.2 /etc/fstab

I'm using Ubuntu Lucid 10.04 LTS. You can find the base AMI and AKI ids here. In particular, I'm using ami-fa01f193 and aki-427d952b (64-bit instance-store in us-east-1 region).

Here are the best posts I found on benchmarking RAID0 ephemeral storage:
I can't find the link to a final post I wanted to link to that compared variability across instance types over a period of weeks. It found, as you might expect, significant improvements and less variability as you move up the chain and high variability in the smaller instance types. The thinking is that as you move up, it becomes more like dedicated hardware, and of course the underlying hardware itself may actually be improving.

For a while now, I've replicated DuckDuckGo in two completely different hosting providers. Up until very recently, AWS was the backup, and so I didn't spend that much time optimizing my setup (or honestly even understanding all my options) on the platform.

I had used EBS boot AMIs on a path of least resistance so I could shove more than 10GB into my images. I didn't think much of it all until moving to EC2 as my primary hosting provider a few weeks ago. I was in the midst of evaluating options primarily on a performance axis when the recent outage hit me. Then I started also considering other axes, namely durability and availability.

I had minimal downtime because I failed over to my other hosting provider, but the whole thing got me thinking hard about how I should use EC2. Ultimately, I decided to move off of EBS and onto the ephemeral storage.

The official line from Amazon is "[t]he latency and throughput of Amazon EBS volumes is designed to be significantly better than the Amazon EC2 instance stores in nearly all cases." That's pretty powerful stuff, so the decision to move off of EBS was not taken lightly.

I don't think it is the right decision for everyone, but I wanted to throw one more data point out there. On a day to day basis, I'm doing mainly random reads, as I suspect a lot of people are. My particular needs are perhaps a bit peculiar though.
  • Most of my stuff is both read-only and indexed in memory, so my actual IO use is not that high. 
  • I also use a decent amount of network.
  • I will choose availability over durability since I replicate off of EBS anyway.
  • I would like to avoid very low performance periods.

The bottom line for me is that RAID0 across the ephemeral drives on the m1.xlarge is fast enough that IO is no longer a significant bottleneck for my usage patterns. If you read the above benchmark posts, you will see that you can get comparable speeds doing RAID0 over 8 EBS volumes, so that's really what I'm choosing between. Let's assume they are the same average speed. Here is what pushed it over the edge to ephemeral storage for me:

  • The benchmarks show that the ephemeral drives have more consistent throughput than EBS. In fact, EBS has historically crawled every now and then. You could potentially account for this and fail over, but more likely your application will just crawl for that time period as a result, which I really don't like.

  • With 8 drives, failure of the RAID device becomes significantly more probable, since a RAID0 array is lost if any one drive fails. I could not find any real-life failure probabilities beyond the annual failure rate of 0.1-0.5% mentioned by Amazon, though. For the sake of argument, let's suppose that because Amazon automatically re-mirrors EBS drives, the 8-volume EBS array has a failure rate equivalent to the 4-drive ephemeral array, which makes this point moot. It's all speculation anyway.

  • In Amazon's outage statement they said "when data on a customer's volume is being re-mirrored, access to that data is blocked until the system has identified a new primary (or writable) replica." This reads to me as another scenario in which your EBS drive can become really slow for a bit and slow down your application. Of course, this could be viewed as a good thing for durability. But since I don't care about that, I'd rather choose more availability.

  • Instances have one network device and EBS runs over the network. I suspect that because EBS competes with your application for network bandwidth, it slows down both. The closest I got to confirming this was this HN comment.

  • Bigger players with more time to spend on benchmarking this stuff and more money to spend on premium support/advice are moving off of EBS or never used it in the first place.

  • EBS has more moving parts. If I had not been using it, I wouldn't have had any issues during the outage (I think).
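To put rough numbers on the failure-probability point above, here is a minimal sketch of how striping more drives raises the chance of losing a RAID0 array. It assumes drive failures are independent (which may not hold for EBS volumes) and, purely for illustration, applies Amazon's quoted 0.1-0.5% annual failure rate to both 4-drive and 8-drive arrays; the real rate for ephemeral drives is unknown. The function name is my own.

```python
def raid0_annual_failure(per_drive_afr: float, n_drives: int) -> float:
    """Chance that at least one of n drives fails within a year.

    RAID0 has no redundancy, so losing any single drive loses the array.
    Assumes failures are independent, which is an assumption, not a
    documented property of EBS or ephemeral drives.
    """
    return 1.0 - (1.0 - per_drive_afr) ** n_drives


if __name__ == "__main__":
    # 0.1% and 0.5% are the bounds of the annual failure rate Amazon quotes.
    for afr in (0.001, 0.005):
        for n in (4, 8):
            p = raid0_annual_failure(afr, n)
            print(f"per-drive AFR {afr:.1%}, {n} drives: array AFR {p:.2%}")
```

At rates this small the array failure rate is close to n times the per-drive rate, so going from 4 to 8 drives roughly doubles it.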

Of course all this could change! Amazon seems long-term invested in EBS. Also, if too many people move off of it, the instance storage could get less consistent as it is also a shared resource.

Here are a few other notes for anyone else who is thinking about this stuff.

  • Amazon says there is a one-time write penalty on instance storage, which you can clear by writing out all the sectors. However, that takes time. I have not messed with this at all yet.

  • I switched off the EBS boot images as well, figuring that they have the same issues as above and also just increase my failure rates, as now both the EBS boot drive and the instance itself (underlying hardware) can fail. Instead, I now send new instances a user script that RAIDs the ephemeral storage and then pulls down data for it (both from S3 and other servers). To send such a script, just use the -f option on ec2-run-instances.

  • RAID1 on EBS seems like a bad idea. It is unclear that the EBS volumes are statistically independent, and so you may not be getting much durability benefit. Additionally, as you can see from the benchmark posts, it really eats into performance. If your IO is low it may not matter that much, but at high IO you max out the network faster.

  • If you do choose to go with ephemeral storage, m1.xlarge seems like the sweet spot. It is the first instance type with four ephemeral disks and has significantly less variability in resources. Of course, you also get double the memory, storage and CPU of the m1.large, though at double the cost.

  • The RightScale blog had a really good post on the AWS outage, with links to the best other posts on it as well.

  • Other configurations are possible I suppose, e.g. RAID10 or RAID0 on three drives and then saving the other for moving large files around (sequential IO). But if you're looking for consistent speed on random reads like me, RAID0 across all the drives will be the fastest. And since you're presumably prepared if your instance goes down, a drive failure should be OK. You auto-failover and use your off-instance backups.

    That said, if the performance of RAID0 across two disks is OK, then you can theoretically use the m1.xlarge with RAID10, get the mirroring and still get decent (presumably EBS level) durability without using EBS. However, I'm still unclear whether this is even useful, since I don't know what happens when a drive fails, i.e. whether the whole instance automatically goes down or not. If it does, the ephemeral mirror is useless, and given that the instance can go down anyway, you need to have off-instance backups regardless. And besides, an m1.xlarge is twice as expensive as an m1.large, so you can already run two m1.large instances and get instance failover for the same price. Note: it's probably clear that I haven't tried this particular case :).
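For anyone wondering what a user script that RAIDs the ephemeral storage might look like, here is a rough sketch that generates one. This is not my actual script. The device names (/dev/sdb through /dev/sde), the array name /dev/md0, the mount point and the choice of ext4 are all assumptions for illustration, and the helper names are made up. It also skips details you would need in practice, such as unmounting any ephemeral drive the AMI has already mounted and pulling down the data afterwards.

```python
import shlex


def raid0_setup_commands(devices, md_device="/dev/md0", mount_point="/mnt/data"):
    """Build the commands to stripe the given devices into one RAID0 array.

    All device names and paths are hypothetical defaults; check what your
    AMI and instance type actually expose before using anything like this.
    """
    if len(devices) < 2:
        raise ValueError("RAID0 needs at least two devices")
    return [
        ["mdadm", "--create", md_device, "--level=0",
         f"--raid-devices={len(devices)}", *devices],
        ["mkfs.ext4", md_device],          # filesystem choice is an assumption
        ["mkdir", "-p", mount_point],
        ["mount", md_device, mount_point],
    ]


def render_user_script(commands):
    """Render the commands as a bash script suitable for a user-data file."""
    lines = ["#!/bin/bash", "set -e"]      # stop at the first failing command
    lines += [shlex.join(cmd) for cmd in commands]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Assumed ephemeral device names on an m1.xlarge.
    ephemeral = ["/dev/sdb", "/dev/sdc", "/dev/sdd", "/dev/sde"]
    print(render_user_script(raid0_setup_commands(ephemeral)))
```

You would save the printed script to a file and pass that file with the -f option on ec2-run-instances.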

Again, I'm pretty new to all of this, so as always feel free to correct/enlighten me in the comments. Some things in particular I'm still trying to find out are:

  1. What are actual failure rates for EBS volumes and ephemeral drives?

  2. Do people detect and actually recover from drive failure, or does it essentially always bring the instance down?

About

I'm the founder of DuckDuckGo and an angel investor.