Tuesday, September 20, 2011

Over-breeding & Culling EC2 Instances for IO Performance

I've heard other people talk about this on Twitter or at conferences, but as far as I can remember, nobody has described the nuts and bolts of finding tolerable-performance ephemeral disks in EC2.

I recently spun up a 12-node Cassandra cluster in EC2 and, since it's a database, I decided to do some basic tire-kicking and learned a few things along the way.

Rule: always zero your ephemerals if you care about performance.

Why: Amazon is likely using sparse files to back ephemerals (and probably EBS too, though I have no experience there). This makes perfect sense, because:
  • you get free thin provisioning, so unused disk doesn't go to waste
  • Xen supports it well
  • it's easy to manage lots & lots of them
  • it's trivial to export over all common network block protocols (e.g. AoE, iSCSI)

Because the host has to allocate a backing block the first time each block in your VM is written, performance will be all over the map while zeroing the disks. Once every block has been touched, that allocation cost is gone, which is exactly why zeroing up front pays off.
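
You can see the underlying effect with an ordinary sparse file on any Linux box (the file name here is arbitrary): nothing is actually allocated until it's written.

# create a 1GB sparse file: apparent size is 1G, almost nothing allocated
dd if=/dev/zero of=sparse.img bs=1 count=0 seek=1G
ls -lh sparse.img    # reports 1.0G
du -h  sparse.img    # reports ~0

# write into it and real blocks get allocated on demand
dd if=/dev/zero of=sparse.img bs=1M count=100 conv=notrunc
du -h sparse.img     # now ~100M is actually backed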

Script #1: 
#!/bin/bash
# THIS WILL DESTROY ALL OF YOUR DATA
# well, almost all of it, it'll try to keep root around, but no guarantees
echo "Disabling RAID devices ..."
for raid in /dev/md*
do
    [ -e "$raid" ] || continue             # glob didn't match; no md devices
    # unmount the array if it's mounted
    grep -q "$raid" /proc/mounts && umount "$raid"
    # skip it if it's somehow still mounted, otherwise stop the array
    grep -q "$raid" /proc/mounts && continue
    mdadm --stop "$raid" || \
        { echo "could not stop raid device $raid, cannot continue"; exit 1; }
done
# if you're an LVM user, add umounts and vgchange -an here
for drive in /sys/block/sd*
do
    # root drives can have a partition without a whole device
    devname=$(basename "$drive" | sed 's/[0-9]*$//')
    [ "$devname" != "sda" ] || continue    # leave root alone
    # zero all the drives in parallel
    echo "Zeroing /dev/$devname ..."
    dd if=/dev/zero of="/dev/$devname" bs=1M &
    disown $!
done

I usually launch my zeroing script with cl-run.pl --list burnin -s zero-drives.sh. The "burnin" list is just all the ec2 hostnames, one per line, in ~/.dsh/machines.burnin.
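
For reference, here's a minimal sketch of that setup (the hostnames are placeholders for whatever instances you're burning in):

# ~/.dsh/machines.burnin -- one EC2 public hostname per line
ec2-xx-xx-xx-xx.compute-1.amazonaws.com
ec2-xx-xx-xxx-xx.compute-1.amazonaws.com
ec2-xx-xx-xxx-xxx.compute-1.amazonaws.com

# run the zeroing script across every host on the list
cl-run.pl --list burnin -s zero-drives.sh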

Culling round 1: Look at the raw throughput of all of the nodes and cull anything that looks abnormally low. For example, when building the aforementioned cluster, I kept getting obviously bad instances in one of the us-east-1 AZs. This is what I saw from my cluster netstat tool (cl-netstat.pl) for a batch of m1.xlarge instances in us-east-1c.

hostname: eth0_total eth0_recv eth0_send read_iops write_iops 1min 5min 15min
----------------------------------------------------------------------------------------------------------------
ec2-xx-xx-xx-xx: 816 124 692 0/s 13,765/s 1.01 1.02 1.00
ec2-xx-xx-xxx-xx: 786 113 673 0/s 19,116/s 1.00 1.02 1.00
ec2-xx-xx-xxx-xxx: 784 113 671 0/s 15,573/s 1.10 1.08 1.09
ec2-xx-xx-xx-xxx: 798 120 678 0/s 13,045/s 1.09 1.05 1.01
ec2-xx-xxx-xxx-xxx: 786 113 673 0/s 3,008/s 1.02 1.03 1.00
ec2-xx-xxx-xx-xx: 761 120 641 0/s 0/s 0.00 0.00 0.00
ec2-xx-xx-xxx-xxx: 800 120 680 0/s 3,100/s 1.03 1.04 1.00
ec2-xxx-xx-xx-x: 786 113 673 0/s 3,311/s 1.08 1.06 1.02
ec2-xxx-xx-xx-xxx: 781 113 668 0/s 14,375/s 1.13 1.08 1.04
ec2-xx-xxx-xx-xxx: 781 113 668 0/s 16,077/s 1.18 1.09 1.08
ec2-xx-xx-xxx-xxx: 855 149 706 0/s 16,962/s 1.06 1.12 1.07
ec2-xx-xx-xx-xx: 802 116 686 0/s 0/s 0.00 0.34 0.74
ec2-xxx-xx-xx-xxx: 802 116 686 0/s 13,649/s 1.02 1.04 1.02
ec2-xx-xx-xxx-xxx: 847 129 718 0/s 15,360/s 1.07 1.10 1.05
ec2-xxx-xx-xx-xx: 816 118 698 0/s 14,242/s 1.10 1.07 1.02
ec2-xx-xx-xx-xxx: 841 136 705 0/s 17,185/s 1.21 1.15 1.10
Total: 12,842 Recv: 1,926 Send: 10,916 (0 mbit/s) | 0 read/s 178,768 write/s
Average: 12,564 Recv: 120 Send: 665 (0 mbit/s) | 0 read/s 11,267 write/s
I immediately culled everything doing under 10k write IOPS for more than a minute. If you examine the per-disk stats with iostat -x 2, you'll usually see one disk with insanely high (>1000ms) latency all the time. There are certainly false negatives at this phase, but I don't really care, since instances are cheap and time is not. I ended up starting around 30 instances in that one troublesome AZ to find 3 with sustainable IOPS in the most trivial of tests (dd).
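
If you save a snapshot of that output to a file, even a quick awk pass can flag the losers. This is just a sketch, not part of my tools: it assumes write IOPS is the 6th whitespace-separated field, formatted like "13,765/s", and that the snapshot lives in netstat-snapshot.txt.

# print hosts sustaining fewer than 10k write IOPS in the snapshot
awk '/^ec2-/ {
    iops = $6
    gsub(/[,\/s]/, "", iops)   # "13,765/s" -> "13765"
    if (iops + 0 < 10000) print $1, iops
}' netstat-snapshot.txt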

When I think I have enough obviously tolerable nodes for a race, I kick off another zero round. Once the load levels out a little, I grab a snapshot of the cl-netstat.pl output and post-process it in a hacky way: sort by write IOPS and append each instance's EC2 zone and instance ID so I can kill the losers without digging around (a rough sketch of that munging follows the table). Here's an example from a round of testing I did for a recent MySQL cluster deployment:

hostname eth0_total eth0_recv eth0_send read_iops write_iops 1min 5min 15min instance-id zone type
m2.4xlarge ------------------------------
ec2-xx-xx-xx-xx 1960 312 1648 39 115141 2.01 2.01 1.83 i-xxxxxxxx us-east-1d m2.4xlarge
ec2-xx-xx-xx-xx 2050 417 1633 61 111533 2.00 1.99 1.82 i-xxxxxxxx us-east-1b m2.4xlarge
ec2-xx-xx-xx-xx 2022 348 1674 39 101746 1.99 1.97 1.82 i-xxxxxxxx us-east-1d m2.4xlarge
ec2-xx-xx-xx-xx 1958 307 1651 61 82827 3.00 3.00 2.75 i-xxxxxxxx us-east-1b m2.4xlarge
m1.xlarge ------------------------------
ec2-xx-xx-xx-xx 633 146 487 140 244191 4.08 4.08 3.78 i-xxxxxxxx us-east-1a m1.xlarge
ec2-xx-xx-xx-xx 816 263 553 138 240646 4.02 4.05 3.75 i-xxxxxxxx us-east-1b m1.xlarge
ec2-xx-xx-xx-xx 708 181 527 132 238866 4.15 4.08 3.75 i-xxxxxxxx us-east-1b m1.xlarge
ec2-xx-xx-xx-xx 780 150 630 155 236553 4.10 4.17 3.86 i-xxxxxxxx us-east-1d m1.xlarge
ec2-xx-xx-xx-xx 775 219 556 163 231223 4.20 4.20 3.85 i-xxxxxxxx us-east-1d m1.xlarge
ec2-xx-xx-xx-xx 848 255 593 145 228697 4.07 4.14 3.82 i-xxxxxxxx us-east-1d m1.xlarge
ec2-xx-xx-xx-xx 684 183 501 158 227458 4.33 4.21 3.86 i-xxxxxxxx us-east-1d m1.xlarge
ec2-xx-xx-xx-xx 640 182 458 134 226359 4.02 4.05 3.75 i-xxxxxxxx us-east-1b m1.xlarge
ec2-xx-xx-xx-xx 624 157 467 135 216373 4.17 4.17 3.83 i-xxxxxxxx us-east-1a m1.xlarge
ec2-xx-xx-xx-xx 808 233 575 138 214440 4.09 4.09 3.77 i-xxxxxxxx us-east-1b m1.xlarge
ec2-xx-xx-xx-xx 695 187 508 137 214412 4.06 4.08 3.75 i-xxxxxxxx us-east-1b m1.xlarge
ec2-xx-xx-xx-xx 662 157 505 144 210226 4.39 4.16 3.83 i-xxxxxxxx us-east-1a m1.xlarge
ec2-xx-xx-xx-xx 843 219 624 114 206998 4.64 4.42 4.03 i-xxxxxxxx us-east-1a m1.xlarge
ec2-xx-xx-xx-xx 667 146 521 193 200849 4.26 4.16 3.83 i-xxxxxxxx us-east-1d m1.xlarge
ec2-xx-xx-xx-xx 614 146 468 146 195275 4.02 4.06 3.76 i-xxxxxxxx us-east-1b m1.xlarge
ec2-xx-xx-xx-xx 748 149 599 143 190615 4.07 4.07 3.79 i-xxxxxxxx us-east-1a m1.xlarge
ec2-xx-xx-xx-xx 646 185 461 144 189506 4.09 4.13 3.79 i-xxxxxxxx us-east-1b m1.xlarge
ec2-xx-xx-xx-xx 684 181 503 134 188118 4.08 4.07 3.76 i-xxxxxxxx us-east-1d m1.xlarge
ec2-xx-xx-xx-xx 680 149 531 134 187769 4.22 4.12 3.78 i-xxxxxxxx us-east-1a m1.xlarge
ec2-xx-xx-xx-xx 672 147 525 145 185026 4.13 4.06 3.73 i-xxxxxxxx us-east-1b m1.xlarge
ec2-xx-xx-xx-xx 747 146 601 130 184975 4.09 4.09 3.78 i-xxxxxxxx us-east-1a m1.xlarge
ec2-xx-xx-xx-xx 638 186 452 130 184736 4.25 4.19 3.83 i-xxxxxxxx us-east-1d m1.xlarge
ec2-xx-xx-xx-xx 761 215 546 132 184571 4.29 4.20 3.86 i-xxxxxxxx us-east-1d m1.xlarge
ec2-xx-xx-xx-xx 1004 231 773 143 183929 4.22 4.16 3.82 i-xxxxxxxx us-east-1b m1.xlarge
ec2-xx-xx-xx-xx 800 219 581 143 181285 4.12 4.15 3.84 i-xxxxxxxx us-east-1a m1.xlarge
ec2-xx-xx-xx-xx 860 262 598 156 180425 4.19 4.26 3.95 i-xxxxxxxx us-east-1a m1.xlarge
ec2-xx-xx-xx-xx 620 181 439 144 164457 4.10 4.15 3.84 i-xxxxxxxx us-east-1d m1.xlarge
ec2-xx-xx-xx-xx 715 146 569 145 159775 4.10 4.10 3.82 i-xxxxxxxx us-east-1d m1.xlarge
ec2-xx-xx-xx-xx 655 150 505 101 135984 4.11 4.15 3.83 i-xxxxxxxx us-east-1c m1.xlarge
ec2-xx-xx-xx-xx 675 146 529 139 135006 4.38 4.42 4.06 i-xxxxxxxx us-east-1c m1.xlarge
ec2-xx-xx-xx-xx 735 179 556 136 132891 4.03 4.08 3.77 i-xxxxxxxx us-east-1a m1.xlarge
ec2-xx-xx-xx-xx 637 169 468 136 132663 4.25 4.28 3.96 i-xxxxxxxx us-east-1c m1.xlarge
ec2-xx-xx-xx-xx 702 150 552 55 130381 4.21 4.20 3.86 i-xxxxxxxx us-east-1c m1.xlarge
ec2-xx-xx-xx-xx 621 147 474 109 119761 4.11 4.13 3.81 i-xxxxxxxx us-east-1b m1.xlarge
ec2-xx-xx-xx-xx 804 192 612 100 69018 4.20 4.49 4.17 i-xxxxxxxx us-east-1c m1.xlarge
ec2-xx-xx-xx-xx 634 146 488 0 16120 5.82 5.77 5.72 i-xxxxxxxx us-east-1c m1.xlarge
ec2-xx-xx-xx-xx 880 181 699 0 5411 5.64 5.61 5.66 i-xxxxxxxx us-east-1c m1.xlarge
ec2-xx-xx-xx-xx 741 192 549 0 5091 5.70 5.61 5.56 i-xxxxxxxx us-east-1c m1.xlarge
ec2-xx-xx-xx-xx 570 56 514 27 4786 5.63 5.63 5.59 i-xxxxxxxx us-east-1c m1.xlarge
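
The munging itself isn't interesting, but the idea looks roughly like this. Everything here is illustrative: the file names are made up, the metadata map is whatever you can dump from ec2-describe-instances or your own inventory, and the write-IOPS column is assumed to be field 6 as in the output above.

# instances.txt maps public hostname -> instance ID, zone, type, one per line:
#   ec2-xx-xx-xx-xx  i-xxxxxxxx  us-east-1d  m1.xlarge
# netstat-snapshot.txt is a saved copy of the cl-netstat.pl output
awk 'NR == FNR { meta[$1] = $2 " " $3 " " $4; next }   # first file: load the metadata map
     /^ec2-/ {
         gsub(/[,:]/, ""); gsub(/\/s/, "")             # strip commas, colons, "/s" suffixes
         print $0, meta[$1]                            # append instance ID, zone, type
     }' instances.txt netstat-snapshot.txt | sort -k6,6nr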

I picked the top few instances from each AZ and terminated the rest. Job done.
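
Feeding the losing instance IDs back to the EC2 API tools finishes the job (the IDs below are placeholders):

# terminate everything that didn't make the cut
ec2-terminate-instances i-xxxxxxxx i-xxxxxxxx i-xxxxxxxx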

This is a pretty crude process in many ways. It's very manual, it requires a lot of human judgement, and most importantly, dd if=/dev/zero is not a good measure of real-world performance. This process is just barely good enough to cull the worst offenders in EC2, which seem to be quite common in my recent experience.

In the future, I will likely automate most of this burn-in process and add some real-world I/O generation, probably using real data.
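
For the synthetic side, a tool like fio can generate a far more realistic access pattern than a sequential dd. A minimal sketch of a random-write burn-in against one ephemeral disk (device name, run time, and job parameters are all just illustrative):

# DESTRUCTIVE: writes to the raw device, same as the zeroing pass
fio --name=burnin-randwrite --filename=/dev/sdb --direct=1 \
    --ioengine=libaio --rw=randwrite --bs=4k --iodepth=32 \
    --runtime=300 --time_based --group_reporting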