I recently spun up a 12-node Cassandra cluster in EC2 and, since it's a database, I decided to do some basic tire-kicking and learned a few things along the way.
Rule: always zero your ephemerals if you care about performance.
Why: Amazon is likely using sparse files to back ephemerals (and probably EBS too, though I have no experience there). This makes perfect sense, because:
- you get free thin provisioning, so unused disk doesn't go to waste
- Xen supports it well
- it's easy to manage lots & lots of them
- it's trivial to export over all common network block protocols (e.g. AoE, iSCSI)
Because every block your VM writes requires an extra step to allocate a backing block in the sparse file, performance will be all over the map until the disks have been zeroed end-to-end.
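The allocate-on-write effect is easy to see locally with a sparse file on any Linux box (the path and sizes below are arbitrary):

```shell
# A sparse file claims a large apparent size but allocates blocks only
# when data is written -- the same penalty a fresh ephemeral disk pays.
truncate -s 1G /tmp/sparse.img        # 1 GiB apparent size, ~0 blocks allocated
stat -c 'apparent: %s bytes, allocated: %b blocks' /tmp/sparse.img
# Writing forces allocation, which is the slow extra step
dd if=/dev/zero of=/tmp/sparse.img bs=1M count=16 conv=notrunc status=none
stat -c 'apparent: %s bytes, allocated: %b blocks' /tmp/sparse.img
rm /tmp/sparse.img
```

After the dd, the allocated block count jumps while the apparent size stays the same; zeroing the whole device up front pays this cost once instead of during your benchmark.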
Script #1:
#!/bin/bash
# THIS WILL DESTROY ALL OF YOUR DATA
# well, almost all of it, it'll try to keep root around, but no guarantees
echo "Disabling RAID devices ..."
for raid in /dev/md*
do
    [ -e "$raid" ] || continue # glob matched nothing
    grep -q $raid /proc/mounts && umount $raid
    # only stop arrays that are no longer mounted
    if ! grep -q $raid /proc/mounts
    then
        mdadm --stop $raid || \
            { echo "could not stop raid device $raid, cannot continue"; exit 1; }
    fi
done
# if you're an LVM user, add umounts and vgchange -an here
for drive in /sys/block/sd*
do
    # root drives can have a partition without a whole device
    devname=$(basename $drive | sed 's/[0-9]*$//')
    [ "$devname" != "sda" ] || continue # leave root alone
    # zero all the drives in parallel
    echo "Zeroing /dev/$devname ..."
    dd if=/dev/zero of=/dev/$devname bs=1M &
    disown $!
done
I usually launch my zeroing script with cl-run.pl --list burnin -s zero-drives.sh. The "burnin" list is just all the ec2 hostnames, one per line, in ~/.dsh/machines.burnin.
Culling round 1: Look at the raw throughput of all of the nodes and cull anything that looks abnormally low. For example, when building the aforementioned cluster, I kept getting obviously bad instances in one of the us-east-1 AZs. This is what I saw with my cluster netstat tool for a batch of m1.xlarge's in us-east-1c.
hostname: eth0_total eth0_recv eth0_send read_iops write_iops 1min 5min 15min
----------------------------------------------------------------------------------------------------------------
ec2-xx-xx-xx-xx: 816 124 692 0/s 13,765/s 1.01 1.02 1.00
ec2-xx-xx-xxx-xx: 786 113 673 0/s 19,116/s 1.00 1.02 1.00
ec2-xx-xx-xxx-xxx: 784 113 671 0/s 15,573/s 1.10 1.08 1.09
ec2-xx-xx-xx-xxx: 798 120 678 0/s 13,045/s 1.09 1.05 1.01
ec2-xx-xxx-xxx-xxx: 786 113 673 0/s 3,008/s 1.02 1.03 1.00
ec2-xx-xxx-xx-xx: 761 120 641 0/s 0/s 0.00 0.00 0.00
ec2-xx-xx-xxx-xxx: 800 120 680 0/s 3,100/s 1.03 1.04 1.00
ec2-xxx-xx-xx-x: 786 113 673 0/s 3,311/s 1.08 1.06 1.02
ec2-xxx-xx-xx-xxx: 781 113 668 0/s 14,375/s 1.13 1.08 1.04
ec2-xx-xxx-xx-xxx: 781 113 668 0/s 16,077/s 1.18 1.09 1.08
ec2-xx-xx-xxx-xxx: 855 149 706 0/s 16,962/s 1.06 1.12 1.07
ec2-xx-xx-xx-xx: 802 116 686 0/s 0/s 0.00 0.34 0.74
ec2-xxx-xx-xx-xxx: 802 116 686 0/s 13,649/s 1.02 1.04 1.02
ec2-xx-xx-xxx-xxx: 847 129 718 0/s 15,360/s 1.07 1.10 1.05
ec2-xxx-xx-xx-xx: 816 118 698 0/s 14,242/s 1.10 1.07 1.02
ec2-xx-xx-xx-xxx: 841 136 705 0/s 17,185/s 1.21 1.15 1.10
Total: 12,842 Recv: 1,926 Send: 10,916 (0 mbit/s) | 0 read/s 178,768 write/s
Average: 12,564 Recv: 120 Send: 665 (0 mbit/s) | 0 read/s 11,267 write/s
When I think I have enough obviously tolerable nodes for a race, I kick off another zero round. Once the load levels out a little, I take a snapshot of the cl-netstat.pl output and process it in a hacky way: sort by IOPS and annotate each instance with its EC2 zone and instance ID so I can kill the losers without digging around. Here's an example from a round of testing I did for a recent MySQL cluster deployment:
m2.4xlarge ------------------------------
ec2-xx-xx-xx-xx 1960 312 1648 39 115141 2.01 2.01 1.83 i-xxxxxxxx us-east-1d m2.4xlarge
ec2-xx-xx-xx-xx 2050 417 1633 61 111533 2.00 1.99 1.82 i-xxxxxxxx us-east-1b m2.4xlarge
ec2-xx-xx-xx-xx 2022 348 1674 39 101746 1.99 1.97 1.82 i-xxxxxxxx us-east-1d m2.4xlarge
ec2-xx-xx-xx-xx 1958 307 1651 61 82827 3.00 3.00 2.75 i-xxxxxxxx us-east-1b m2.4xlarge
m1.xlarge ------------------------------
ec2-xx-xx-xx-xx 633 146 487 140 244191 4.08 4.08 3.78 i-xxxxxxxx us-east-1a m1.xlarge
ec2-xx-xx-xx-xx 816 263 553 138 240646 4.02 4.05 3.75 i-xxxxxxxx us-east-1b m1.xlarge
ec2-xx-xx-xx-xx 708 181 527 132 238866 4.15 4.08 3.75 i-xxxxxxxx us-east-1b m1.xlarge
ec2-xx-xx-xx-xx 780 150 630 155 236553 4.10 4.17 3.86 i-xxxxxxxx us-east-1d m1.xlarge
ec2-xx-xx-xx-xx 775 219 556 163 231223 4.20 4.20 3.85 i-xxxxxxxx us-east-1d m1.xlarge
ec2-xx-xx-xx-xx 848 255 593 145 228697 4.07 4.14 3.82 i-xxxxxxxx us-east-1d m1.xlarge
ec2-xx-xx-xx-xx 684 183 501 158 227458 4.33 4.21 3.86 i-xxxxxxxx us-east-1d m1.xlarge
ec2-xx-xx-xx-xx 640 182 458 134 226359 4.02 4.05 3.75 i-xxxxxxxx us-east-1b m1.xlarge
ec2-xx-xx-xx-xx 624 157 467 135 216373 4.17 4.17 3.83 i-xxxxxxxx us-east-1a m1.xlarge
ec2-xx-xx-xx-xx 808 233 575 138 214440 4.09 4.09 3.77 i-xxxxxxxx us-east-1b m1.xlarge
ec2-xx-xx-xx-xx 695 187 508 137 214412 4.06 4.08 3.75 i-xxxxxxxx us-east-1b m1.xlarge
ec2-xx-xx-xx-xx 662 157 505 144 210226 4.39 4.16 3.83 i-xxxxxxxx us-east-1a m1.xlarge
ec2-xx-xx-xx-xx 843 219 624 114 206998 4.64 4.42 4.03 i-xxxxxxxx us-east-1a m1.xlarge
ec2-xx-xx-xx-xx 667 146 521 193 200849 4.26 4.16 3.83 i-xxxxxxxx us-east-1d m1.xlarge
ec2-xx-xx-xx-xx 614 146 468 146 195275 4.02 4.06 3.76 i-xxxxxxxx us-east-1b m1.xlarge
ec2-xx-xx-xx-xx 748 149 599 143 190615 4.07 4.07 3.79 i-xxxxxxxx us-east-1a m1.xlarge
ec2-xx-xx-xx-xx 646 185 461 144 189506 4.09 4.13 3.79 i-xxxxxxxx us-east-1b m1.xlarge
ec2-xx-xx-xx-xx 684 181 503 134 188118 4.08 4.07 3.76 i-xxxxxxxx us-east-1d m1.xlarge
ec2-xx-xx-xx-xx 680 149 531 134 187769 4.22 4.12 3.78 i-xxxxxxxx us-east-1a m1.xlarge
ec2-xx-xx-xx-xx 672 147 525 145 185026 4.13 4.06 3.73 i-xxxxxxxx us-east-1b m1.xlarge
ec2-xx-xx-xx-xx 747 146 601 130 184975 4.09 4.09 3.78 i-xxxxxxxx us-east-1a m1.xlarge
ec2-xx-xx-xx-xx 638 186 452 130 184736 4.25 4.19 3.83 i-xxxxxxxx us-east-1d m1.xlarge
ec2-xx-xx-xx-xx 761 215 546 132 184571 4.29 4.20 3.86 i-xxxxxxxx us-east-1d m1.xlarge
ec2-xx-xx-xx-xx 1004 231 773 143 183929 4.22 4.16 3.82 i-xxxxxxxx us-east-1b m1.xlarge
ec2-xx-xx-xx-xx 800 219 581 143 181285 4.12 4.15 3.84 i-xxxxxxxx us-east-1a m1.xlarge
ec2-xx-xx-xx-xx 860 262 598 156 180425 4.19 4.26 3.95 i-xxxxxxxx us-east-1a m1.xlarge
ec2-xx-xx-xx-xx 620 181 439 144 164457 4.10 4.15 3.84 i-xxxxxxxx us-east-1d m1.xlarge
ec2-xx-xx-xx-xx 715 146 569 145 159775 4.10 4.10 3.82 i-xxxxxxxx us-east-1d m1.xlarge
ec2-xx-xx-xx-xx 655 150 505 101 135984 4.11 4.15 3.83 i-xxxxxxxx us-east-1c m1.xlarge
ec2-xx-xx-xx-xx 675 146 529 139 135006 4.38 4.42 4.06 i-xxxxxxxx us-east-1c m1.xlarge
ec2-xx-xx-xx-xx 735 179 556 136 132891 4.03 4.08 3.77 i-xxxxxxxx us-east-1a m1.xlarge
ec2-xx-xx-xx-xx 637 169 468 136 132663 4.25 4.28 3.96 i-xxxxxxxx us-east-1c m1.xlarge
ec2-xx-xx-xx-xx 702 150 552 55 130381 4.21 4.20 3.86 i-xxxxxxxx us-east-1c m1.xlarge
ec2-xx-xx-xx-xx 621 147 474 109 119761 4.11 4.13 3.81 i-xxxxxxxx us-east-1b m1.xlarge
ec2-xx-xx-xx-xx 804 192 612 100 69018 4.20 4.49 4.17 i-xxxxxxxx us-east-1c m1.xlarge
ec2-xx-xx-xx-xx 634 146 488 0 16120 5.82 5.77 5.72 i-xxxxxxxx us-east-1c m1.xlarge
ec2-xx-xx-xx-xx 880 181 699 0 5411 5.64 5.61 5.66 i-xxxxxxxx us-east-1c m1.xlarge
ec2-xx-xx-xx-xx 741 192 549 0 5091 5.70 5.61 5.56 i-xxxxxxxx us-east-1c m1.xlarge
ec2-xx-xx-xx-xx 570 56 514 27 4786 5.63 5.63 5.59 i-xxxxxxxx us-east-1c m1.xlarge
I picked the top few instances from each AZ and terminated the rest. Job done.
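The "hacky" post-processing can be sketched with sort and awk. The host names and sample rows below are made up, and the column positions (6 = write IOPS, 11 = AZ) are assumptions based on the layout above; adjust to your actual output:

```shell
# Hypothetical sample of annotated cl-netstat.pl output, one node per line
cat > /tmp/nodes.txt <<'EOF'
host-a 633 146 487 140 244191 4.08 4.08 3.78 i-aaaa us-east-1a m1.xlarge
host-b 816 263 553 138 240646 4.02 4.05 3.75 i-bbbb us-east-1b m1.xlarge
host-c 684 183 501 158 227458 4.33 4.21 3.86 i-cccc us-east-1a m1.xlarge
host-d 880 181 699 0 5411 5.64 5.61 5.66 i-dddd us-east-1c m1.xlarge
host-e 624 157 467 135 216373 4.17 4.17 3.83 i-eeee us-east-1a m1.xlarge
EOF
# Sort by write IOPS descending, then keep the top 2 instances per AZ
sort -k6,6nr /tmp/nodes.txt |
awk '++seen[$11] <= 2 { print $1, $6, $10, $11 }'
```

Everything awk doesn't print is a termination candidate, and the instance ID in the output feeds straight into the kill step.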
This is a pretty crude process in many ways. It's very manual, it requires a lot of human judgement, and most importantly, dd if=/dev/zero is not a good measure of real-world performance. This process is just barely good enough to cull the worst offenders in EC2, which seem to be quite common in my recent experience.
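For something closer to real-world load, a tool like fio can drive random writes instead of dd's sequential zero stream. A sketch of a fio job file; the target device, block size, and queue depth are illustrative, and whatever is on the target gets destroyed:

```ini
; burnin.fio -- random-write burn-in for a scratch ephemeral disk
; WARNING: destroys all data on the target device
[burnin]
; illustrative scratch device, never your root disk
filename=/dev/sdb
rw=randwrite
bs=4k
ioengine=libaio
iodepth=32
direct=1
time_based=1
runtime=300
```

Run it with fio burnin.fio on each node and compare the reported IOPS across the fleet the same way as the dd numbers above.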
In the future, I will likely automate most of this burn-in process and add some real-world I/O generation, probably using real data.