RAID0 ephemeral storage on AWS EC2

 
If you're thinking of doing RAID0 (disk striping) on the ephemeral storage disks attached to an EC2 instance, this post is for you. First I'll go through some directions since I didn't find great ones elsewhere when trying to do it myself. Then I'll get to why I wanted to do it in the first place.

When you run your instances, you have to add the extra drives to the block device mapping or they will not be available to you. This confused me for a while. If you do nothing, you'll just get the default, which is one ephemeral drive attached and mounted at /mnt. And there is apparently nothing you can do to attach the others after the fact.
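
Once an instance is up, you can double-check what actually got mapped by asking the instance metadata service, e.g.:

curl http://169.254.169.254/latest/meta-data/block-device-mapping/
curl http://169.254.169.254/latest/meta-data/block-device-mapping/ephemeral0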

Different types of instances have different numbers and sizes of drives available. Here is the listing. Summary: you get an extra drive (/dev/sdc) when you move up to an m1.large instance. If you move up to an m1.xlarge, you get two more (/dev/sdd & /dev/sde) for a total of four.  

The more drives you have the faster your RAID0 volume will be for random reading (on average). You have to be careful though because the larger instances don't necessarily all have four drives. This table tells you outright how many each has. For example, the high memory extra large seems to only have one drive attached. Go figure.

I couldn't figure out how to make the right block device mappings from the console. It doesn't seem to be an option in the wizard. So you have to do it from the command line. For the m1.large instance it looks like this:

ec2-run-instances -b '/dev/sdb=ephemeral0' -b '/dev/sdc=ephemeral1'

and for m1.xlarge it looks like this:

ec2-run-instances ami-90fe01f9 -b '/dev/sdb=ephemeral0' -b '/dev/sdc=ephemeral1' -b '/dev/sdd=ephemeral2' -b '/dev/sde=ephemeral3'

I actually tried naming them other things and it kept throwing errors at me. I also tried baking the block device mappings into images, which you can supposedly do by passing the same -b arguments to ec2-register per the documentation, but that didn't work for me either.
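
For reference, the ec2-register invocation I was trying looked roughly like this (the bucket and manifest names are placeholders for an instance-store image you've already bundled and uploaded):

ec2-register mybucket/myimage.manifest.xml -b '/dev/sdb=ephemeral0' -b '/dev/sdc=ephemeral1'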

Once booted, you have to unmount the /mnt mount point.

umount /mnt

This won't work if you are using it for anything, so make sure you do all this before you start using it (I put it in a user data script). You can then run the following commands to create the RAID0 volume. For m1.large it looks like this:

yes | mdadm --create /dev/md0 --level=0 -c256 --raid-devices=2 /dev/sdb /dev/sdc
echo 'DEVICE /dev/sdb /dev/sdc' > /etc/mdadm.conf
mdadm --detail --scan >> /etc/mdadm.conf

and for m1.xlarge it looks like this:

yes | mdadm --create /dev/md0 --level=0 -c256 --raid-devices=4 /dev/sdb /dev/sdc /dev/sdd /dev/sde
echo 'DEVICE /dev/sdb /dev/sdc /dev/sdd /dev/sde' > /etc/mdadm.conf
mdadm --detail --scan >> /etc/mdadm.conf
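
Either way, it's worth a quick sanity check that the array came up with the expected number of devices before you put a filesystem on it:

cat /proc/mdstat
mdadm --detail /dev/md0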

Then you need to create the file system, e.g.: 

blockdev --setra 65536 /dev/md0
mkfs.xfs -f /dev/md0
mkdir -p /mnt/md0 && mount -t xfs -o noatime /dev/md0 /mnt/md0
cd /mnt/md0

Finally, if you want it to come up on reboot, remove /mnt from fstab and add /mnt/md0:

perl -ne 'print if $_ !~ /mnt/' /etc/fstab > /etc/fstab.2
echo '/dev/md0  /mnt/md0  xfs  noatime  0 0' >> /etc/fstab.2
mv /etc/fstab.2 /etc/fstab
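
You can make sure the new entry actually works without waiting for a reboot by remounting from fstab (do this from outside the mount point):

cd / && umount /mnt/md0
mount -a
df -h /mnt/md0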

I'm using Ubuntu Lucid 10.04 LTS. You can find the base AMI and AKI IDs here. In particular, I'm using ami-fa01f193 and aki-427d952b (64-bit instance-store in the us-east-1 region).

Here are the best posts I found on benchmarking RAID0 ephemeral storage:
I can't find a final post I wanted to link to, which compared variability across instance types over a period of weeks. It found, as you might expect, significant improvements and less variability as you move up the chain, and high variability in the smaller instance types. The thinking is that as you move up, it becomes more like dedicated hardware, and of course the underlying hardware itself may actually be improving.

For a while now, I've replicated DuckDuckGo in two completely different hosting providers. Up until very recently, AWS was the backup, and so I didn't spend that much time optimizing my setup (or honestly even understanding all my options) on the platform.

I had used EBS boot AMIs on a path of least resistance so I could shove more than 10GB into my images. I didn't think much of it all until moving to EC2 as my primary hosting provider a few weeks ago. I was in the midst of evaluating options primarily on a performance axis when the recent outage hit me. Then I started also considering other axes, namely durability and availability.

I had minimal downtime because I failed over to my other hosting provider, but the whole thing got me thinking hard about how I should use EC2. Ultimately, I decided to move off of EBS and onto the ephemeral storage.

The official line from Amazon is "[t]he latency and throughput of Amazon EBS volumes is designed to be significantly better than the Amazon EC2 instance stores in nearly all cases." That's pretty powerful stuff, so the decision to move off of EBS was not taken lightly.

I don't think it is the right decision for everyone, but I wanted to throw one more data point out there. On a day to day basis, I'm doing mainly random reads, as I suspect a lot of people are. My particular needs are perhaps a bit peculiar though.
  • Most of my stuff is both read-only and indexed in memory, so my actual IO use is not that high. 
  • I also use a decent amount of network.
  • I will choose availability over durability since I replicate off of EBS anyway.
  • I would like to avoid very low performance periods.

The bottom line for me is that the RAID0 on the m1.xlarge is fast enough such that IO is no longer a significant bottleneck for my usage patterns. If you read the above benchmark posts, you will see that you can get comparable speeds doing RAID0 over 8 EBS volumes, so that's really what I'm choosing between. Let's assume they are the same average speed. Here is what pushed it over the edge to ephemeral storage for me:

  • The benchmarks show that the ephemeral drives have higher consistency in throughput than EBS. In fact, EBS has historically crawled every now and then. You could potentially detect this and fail over, but more likely your application will just crawl for that period, which I really don't like.

  • With 8 drives, failure of the RAID device becomes significantly more probable. I could not find any real-life failure probabilities beyond the 0.1-0.5% annual failure rate mentioned by Amazon, though (some back-of-envelope numbers follow this list). Let's suppose for the sake of argument that, because Amazon auto re-mirrors EBS drives, the 8-volume EBS array has a failure rate equivalent to the 4-drive ephemeral array, and this point is moot. It's all speculation anyway.

  • In Amazon's outage statement they said "when data on a customer's volume is being re-mirrored, access to that data is blocked until the system has identified a new primary (or writable) replica." This reads to me as another scenario when your EBS drive can become really slow for a bit and slow down your application. Of course, this could be viewed as a good thing for durability. But since I don't care about that I'd rather choose more availability.

  • Instances have one network device and EBS runs over network. I suspect that because EBS competes with your application for network, it slows down both. The closest I got to confirming this was this HN comment.

  • Bigger players with more time to spend on benchmarking this stuff and more money to spend on premium support/advice are moving off of EBS or never used it in the first place.

  • EBS has more moving parts. If I had not been using it, I wouldn't have had any issues during the outage (I think).
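
As a rough sanity check on the failure point above: if each drive fails independently with annual probability p, a RAID0 stripe of n drives fails with probability 1 - (1 - p)^n, which is roughly n*p for small p. Taking Amazon's 0.5% figure (and assuming, probably wrongly, that it applies to ephemeral drives too), that's about 4% a year for an 8-volume EBS stripe versus about 2% for the 4-drive ephemeral one -- before giving EBS any credit for re-mirroring.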

Of course all this could change! Amazon seems long-term invested in EBS. Also, if too many people move off of it, the instance storage could get less consistent as it is also a shared resource.

Here are a few other notes for anyone else who is thinking about this stuff.

  • Amazon says there is a one-time write penalty on instance storage, which you can clear by writing out all the sectors. However, that takes time, and I have not messed with it at all yet (a sketch of what that initialization pass might look like follows this list).

  • I moved off the EBS boot images as well, figuring they have the same issues as above and also just increase my failure rate, since now both the EBS boot drive and the instance itself (the underlying hardware) can fail. Instead, I now send new instances a user data script that RAIDs the ephemeral storage and then pulls data down for it (both from S3 and from other servers). To send a script (including the above directions), just use the -f option on ec2-run-instances (a rough sketch of such a script also follows this list).

  • RAID1 on EBS seems like a bad idea. It is unclear that the EBS volumes are statistically independent, and so you may not be getting much durability benefit. Additionally, as you can see from the benchmark posts, it really eats into performance. If your IO is low it may not matter that much, but at high IO you max out the network faster.

  • If you do choose to go with ephemeral storage, m1.xlarge seems like the sweet spot. It is the first instance type with all four disks and has significantly less variability in resources. Of course, you also get double the memory, storage, and CPU of an m1.large, though at double the cost.

  • The RightScale blog had a really good post on the AWS outage, with links to the best other posts on it as well.

  • Other configurations are possible I suppose, e.g. RAID10 or RAID0 on three drives and then saving the other for moving large files around (sequential IO). But if you're looking for consistent speed on random reads like me, RAID0 across all the drives will be the fastest. And since you're presumably prepared if your instance goes down, a drive failure should be OK. You auto-failover and use your off-instance backups.

    That said, if the performance of RAID0 across two disks is OK, then you could theoretically use the m1.xlarge, mirror the stripe across the other two disks, and still get decent (presumably EBS-level) durability without using EBS. However, I'm still unclear whether this is even useful, since I don't know what happens when a drive fails, i.e. whether the whole instance automatically goes down or not. In that case, the ephemeral mirror is useless, and given that the instance can go down anyway, you need off-instance backups regardless. And besides, an m1.xlarge is twice as expensive as an m1.large, so you could already run two m1.large instances and get instance failover for the same price. Note: it's probably clear that I haven't tried this particular case :).
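
For completeness, the drive-initialization pass mentioned in the notes above would look something like this -- one pass per device, before building the array -- though again, I haven't actually run or timed it (the device names assume the m1.xlarge mapping from earlier, and dd will complain when it hits the end of each device, which is expected):

dd if=/dev/zero of=/dev/sdb bs=1M
dd if=/dev/zero of=/dev/sdc bs=1M
dd if=/dev/zero of=/dev/sdd bs=1M
dd if=/dev/zero of=/dev/sde bs=1M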
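
And here is roughly the shape of the user data script I pass with -f -- it just strings together the m1.xlarge commands from above (you may need to install mdadm and xfsprogs in your image first, and the data-pull step at the end is obviously specific to your setup):

#!/bin/bash
umount /mnt
yes | mdadm --create /dev/md0 --level=0 -c256 --raid-devices=4 /dev/sdb /dev/sdc /dev/sdd /dev/sde
echo 'DEVICE /dev/sdb /dev/sdc /dev/sdd /dev/sde' > /etc/mdadm.conf
mdadm --detail --scan >> /etc/mdadm.conf
blockdev --setra 65536 /dev/md0
mkfs.xfs -f /dev/md0
mkdir -p /mnt/md0 && mount -t xfs -o noatime /dev/md0 /mnt/md0
# pull data down from S3 / other servers here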

Again, I'm pretty new to all of this, so as always feel free to correct/enlighten me in the comments. Some things in particular I'm still trying to find out are:

  1. What are actual failure rates for EBS volumes and ephemeral drives?

  2. Do people detect and actually recover from drive failure, or does it essentially always bring the instance down?

If you have comments, hit me up on Twitter.