Tag Archives: emergency

Mounting an LVM snapshot containing partitions

LVM snapshots can be really handy when you’re trying to take a backup of a running virtual machine. However, mounting the snapshot can be tricky if the logical volume is partitioned.

I have a virtual machine running zoneminder on one of my servers at home and I needed to take a backup of the instance with rdiff-backup. I made a snapshot of the logical volume and attempted to mount it:

[root@i7tiny ~]# lvcreate -s -n snap -L 5G /dev/vg_i7tiny/vm_zoneminder 
  Logical volume "snap" created
[root@i7tiny ~]# mount /dev/vg_i7tiny/snap /mnt/snap/
mount: wrong fs type, bad option, bad superblock on /dev/mapper/vg_i7tiny-snap,
       missing codepage or helper program, or other error
       In some cases useful info is found in syslog - try
       dmesg | tail or so

Oops. The logical volume has partitions. We will need to mount the volume with an offset so that we can get the right partition. Figuring out the offset can be done fairly easily with fdisk:

[root@i7tiny ~]# fdisk -l /dev/vg_i7tiny/vm_zoneminder 
 
Disk /dev/vg_i7tiny/vm_zoneminder: 53.7 GB, 53687091200 bytes
255 heads, 63 sectors/track, 6527 cylinders, total 104857600 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x0007a1d5
 
                       Device Boot      Start         End      Blocks   Id  System
/dev/vg_i7tiny/vm_zoneminder1   *        2048     1026047      512000   83  Linux
/dev/vg_i7tiny/vm_zoneminder2         1026048   102825983    50899968   83  Linux
/dev/vg_i7tiny/vm_zoneminder3       102825984   104857599     1015808   82  Linux swap / Solaris

It looks like we have a small boot partition, a big root partition and a swap partition. We want to mount the second partition to copy files from the root filesystem. There are two critical pieces of information here that we need:

  • the sector where the partition starts (the Start column from fdisk)
  • the number of bytes per sector (512 in this case — see the third line of the fdisk output)

Let’s calculate how many bytes we need to skip when we mount the partition and then mount it:

[root@i7tiny ~]# echo "512 * 1026048" | bc
525336576
[root@i7tiny ~]# mount -o offset=525336576 /dev/mapper/vg_i7tiny-snap /mnt/snap/
[root@i7tiny ~]# ls /mnt/snap/
bin  boot  dev  etc  home  lib  lib64  lost+found  media  mnt  opt  proc  root  run  sbin  srv  sys  tmp  usr  var
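
The same arithmetic works without bc. Here is a minimal sketch using shell arithmetic; partition_offset is a hypothetical helper name (not part of fdisk or mount), and the 512-byte default is an assumption that matches the fdisk output above:

```shell
# partition_offset START_SECTOR [SECTOR_SIZE]
# Prints the byte offset of a partition: start sector times bytes per sector.
# SECTOR_SIZE defaults to 512 (an assumption; check the "Units" line from fdisk).
partition_offset() {
    echo $(( $1 * ${2:-512} ))
}

# Second partition from the fdisk listing above
partition_offset 1026048    # prints 525336576
```

The result can be passed straight to mount, for example with -o offset=$(partition_offset 1026048).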

The root filesystem from the virtual machine is now mounted and we can copy some files from it. Don’t forget to clean up when you’re finished:

[root@i7tiny ~]# umount /mnt/snap/
[root@i7tiny ~]# lvremove -f /dev/vg_i7tiny/snap 
  Logical volume "snap" successfully removed

If you need to do this with file-backed virtual machine storage or with a flat file you made with dd/dd_rescue, read my post from 2010 about tackling that similar problem.


Lessons learned in the ambulance pay dividends in the datacenter

While cleaning up a room at home in preparation for some new flooring, I found my original documents from when I first became certified as an Emergency Medical Technician (EMT) in Texas. That was way back in May of 2000 and I received it just before I graduated from high school later in the month. After renewing it twice, I decided to let my certification go this year. It expires today and although I’m sad to see it go, I know that sometimes you have to let one thing go so that you can excel in something else.

I mentioned this yesterday on Twitter and Jesse Newland from GitHub came back with a good reply:

[Image: Jesse Newland’s tweet about EMT work, ops, on-call, and incident management, which inspired this post]

It began to make more sense the more I thought about it (and once Mark Imbriaco and Jerry Chen asked for it as well). Working in Operations in a large server environment has a lot of similarities to working on an ambulance:

  • both involve fixing things (whether it’s technology or an illness/injury)
  • there are plenty of highly stressful situations in both occupations
  • lots of money is riding on the decisions made at a keyboard or at a stretcher
  • if you can’t work as a team, you can’t do either job effectively
  • there is always room for improvement (and I do mean always)
  • not having all the facts can lead to perilous situations

Without further ado, here are some lessons I learned on the ambulance which have really helped me as a member of an operations team. I’ve broken them up into separate chunks (more on that lesson shortly) to make it a little easier to read:

Whatever happens, keep your cool
One of the worst situations you can have on an ambulance is when an EMT or paramedic feels overwhelmed to the point that they can’t function. Imagine rolling up with your partner on a multi-car collision with several injured drivers and passengers. It’s just the two of you at the scene and you need to start working. You’re obviously outnumbered and you won’t be able to treat everyone at once. Now, imagine that your partner hasn’t seen this type of situation and is actively buckling under the pressure. The quality of care you’re trained to deliver and the efficiency with which you can deliver it have now been cut in half. Even worse, getting your partner back on track might take some work and this may slow you down even more.

The same can be said about working on large incidents affecting your customers. You’re probably going to be outnumbered by the number of servers having a problem and you won’t get them back online any sooner if you’re beginning to freak out. Just remember, as with servers and as with people (most of the time), they were running fine at one time and they’ll be running fine again soon. Your job is to bridge the gap between those times and try to get to the end goal as soon as possible.

You might miss some things or not complete certain tasks as well as you’d like to. You might slip and make things worse than they were before. One step backward and two steps forward is painful, but it’s still progress. Keep your mind clear and focused so that you can use your knowledge, skills, and experience to pave a path out.

Flickr via instantvantage

Triage, triage, triage
Going back to the multi-car collision scenario, you’re well aware that you won’t be able to take care of everyone at once. This is where skillful triaging is key. Find the people who are in the most dire situations and treat them first. Although it seems counterproductive, you may have to pass over the people who are hurt so badly that they have little chance of survival. Spending additional time with those people may cause patients with treatable conditions to deteriorate further unnecessarily. It may sound callous, but I’d rather have a few people with serious injuries get treated than lose all of them while I’m treating someone who is essentially near death.

Lots of this can be carried over into maintaining servers. When a big problem occurs, you can spend all of your time wrestling with servers that are beyond repair only to watch the remainder of your environment crash around you. Find ways to stop the bleeding first and then figure out some solid fixes.

For example, if your database cluster gets out of sync, think of the things you can do to reduce the amount of bad data coming in. Could you have your load balancer send traffic elsewhere? Could you disable your application until the database problem is solved? If you lose sight of what’s causing you immediate pain, you may spend all day trying to fix the broken database cluster only to find that you have many times more data to sort out because your application kept running throughout the whole process.

[Image: head in hands (Flickr via jar0d)]

Learn from your mistakes and don’t dwell on them
Medical mistakes can range anywhere from unnoticeable to career-endingly serious. One missed tidbit of a patient’s medical history, one small math error when administering drugs, or one slip of the hand can make a bad situation much worse. I’ve made mistakes on the ambulance and I’ve been very fortunate that almost all of them were very small and inconsequential. If I made one that went unnoticed, I made an effort to notify my supervisor and whoever would be taking over care of my patient. For the mistakes I didn’t even notice on my own, my partners would often be quick to point out the error.

Getting called out on a mistake (even if you call yourself out on it) hurts. Funnel the frustration from it into a plan to fix it. Do some reading to understand the right solution. Learn mnemonics to remember in stressful situations. Make notes for yourself. Practice. Those small steps will reduce your mistakes through increasing your confidence.

Although most Ops engineers should survive big incidents with their lives intact, mistakes are still made and they can be costly. Mistakes can turn into a positive learning experience for everyone on the team. There’s a great post on Etsy’s “Code as Craft” blog about this topic.

John Allspaw wrote:

A funny thing happens when engineers make mistakes and feel safe when giving details about it: they are not only willing to be held accountable, they are also enthusiastic in helping the rest of the company avoid the same error in the future. They are, after all, the most expert in their own error.

The only true mistake is the one which is made but never learned from. Accept it, learn from it, teach others to avoid it and move forward.

Get all the facts to avoid assumptions
My mother (an English teacher) always told me to put the most important things at the beginning and the end when I write. If there’s anything more important than keeping your cool under duress, it’s that you should have as many facts as you can before you get started.

On the ambulance, you’re always looking for the very small clues to ensure that your patient is getting the proper treatment. You may walk up to a patient with slurred speech who can’t walk straight. You may think he’s drunk until you see a small bottle of insulin and a blood glucose meter. Wait, did his blood sugar bottom out? Did he take his insulin at the wrong time? Did he take the wrong amount? Missing that small bit of information may lead you to put your “drunk” patient onto a stretcher without the proper treatment only to find that you’re dealing with a diabetic coma as you get to the hospital. That incorrect assumption could have turned a serious situation into a possibly fatal one.

Responding to incidents with servers is much the same. Skipping over a server with data corruption or not realizing that a change was made (and documented) earlier in the day could lead to serious damage. Forgetting to check log files, streams of exceptions, or reports from customers can lead to bad assumptions which could extend your downtime or cause the loss of data.


In summary, here’s my internal runbook from when I was working full time as an EMT:

  1. Stop the bleeding
  2. Find the root cause of the problem
  3. Make a plan (or plans) to fix it
  4. Vet your best plan with your partner if it seems risky
  5. Execute the plan
  6. Monitor the results
  7. Review the plan’s success or failure with a trusted expert

When I’m fighting outages at work, I reach back into this runbook and try my best to follow the steps. It helps me keep my cool, reduce mistakes, and proceed with better plans. I’d be curious to hear your feedback about how this runbook could work for your Operations team or if you have ideas for edits.


Strategies for storing backups

Although it’s not a glamorous subject for system administrators, backups are necessary for any production environment. Those who run their systems without backups generally learn from their errors in a very painful way. However, the way you store your backups may sometimes prove to be just as vital as the methods you use to back up your data.

For my environments, I follow a strategy like this: I have some backups immediately accessible, others that are accessible very quickly (but not instantly), and others that are offsite and may take a bit more time to access.

Immediately accessible backups
One of the easiest ways to have an immediately accessible backup is to have multiple machines online running the same versions of code or databases in a high availability group. If you have a node which fails, the remaining nodes should be able to handle the requests immediately. You may not consider this to be a backup under the traditional definition of what a backup should be, but it’s functionally similar.

Backups that are accessible quickly
This second level of backups should be stored very close to your environment or within the environment itself. If you have multiple database and web server nodes, you could consider storing your web backups on the database servers and vice versa. For those who run very sensitive applications, this may violate the provisions of different certifications and regulations. A server dedicated to holding backups may be a viable alternative for additional security.

Offsite backup storage
These are the backups that need to be geographically distant from your main environment. Also, you should always consider storing these backups on more than one medium with more than one company.

For example, if your hosting provider offers a storage service, it’s fine to store one set of your backups there, but consider storing them with a competitor as well. If you store your backups with your hosting provider in multiple places, you could be caught by a provider issue and lose access to your backups entirely. Storing backups with multiple providers will allow you to access at least one copy even if there are billing or technical issues with a particular provider.

Another thing to keep in mind with offsite backup storage is how long it will take to transfer the backups to your hosting environment in case of an emergency. If your hosting environment is in Texas, but your backups are stored in Australia, you’re going to have a longer wait when you transfer your data back.
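
To put a rough number on that wait, here is a minimal sketch that estimates transfer time from backup size and sustained link speed. The helper name transfer_seconds and the figures (200 GB at 100 Mbit/s) are hypothetical, chosen only for illustration; real transfers over long distances often sustain less than the nominal link speed.

```shell
# transfer_seconds SIZE_GB RATE_MBIT
# Rough number of seconds needed to move SIZE_GB gigabytes over a link that
# sustains RATE_MBIT megabits per second.
# Uses decimal units (1 GB = 8000 megabits); integer division rounds down.
transfer_seconds() {
    echo $(( $1 * 8000 / $2 ))
}

# Hypothetical example: 200 GB of backups over a sustained 100 Mbit/s link
transfer_seconds 200 100    # prints 16000 (about four and a half hours)
```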

A specific example
My environments are all in Dallas, Texas and I have a highly available environment with multiple instances. My second layer of backups is stored within the environment as well as in Rackspace’s Cloud Files in Dallas. My third layer of backups is stored with Amazon S3 via Jungle Disk and on a RAID array at my home.

While I hope you never need to access your backups under duress, these tips should help to reduce your stress if you need to restore data in a hurry.


Mounting a raw partition file made with dd or dd_rescue in Linux

This situation might not affect everyone, but it struck me today and left me scratching my head. Consider a situation where you need to clone one drive to another with dd or when a hard drive is failing badly and you use dd_rescue to salvage whatever data you can.

Let’s say you cloned data from a drive using something like this:

# dd if=/dev/sda of=/mnt/nfs/backup/harddrive.img

Once that’s finished, you should end up with your partition table as well as the grub data from the MBR in your image file. If you run file against the image file you made, you should see something like this:

# file harddrive.img
harddrive.img: x86 boot sector; GRand Unified Bootloader, stage1 version 0x3, stage2 
address 0x2000, stage2 segment 0x200, GRUB version 0.97; partition 1: ID=0x83, 
active, starthead 1, startsector 63, 33640047 sectors, code offset 0x48

What if you want to pull some files from this image without writing it out to another disk? Mounting it like a loop file isn’t going to work:

# mount harddrive.img /mnt/temp
mount: you must specify the filesystem type

The key is to mount the file with an offset specified. In the output from file, there is a particular portion of the output that will help you:

... startsector 63 ...

This means that the filesystem itself starts on sector 63. You can also view this with fdisk -l:

# fdisk -l harddrive.img
                    Device Boot      Start         End      Blocks   Id  System
harddrive.img                *          63    33640109    16820023+  83  Linux

Since we need to scoot 63 sectors ahead, and each sector is 512 bytes long, we need to use an offset of 32,256 bytes. Fire up the mount command and you’ll be on your way:

# mount -o ro,loop,offset=32256 harddrive.img /mnt/loop
# mount | grep harddrive.img
/root/harddrive.img on /mnt/loop type ext3 (ro,loop=/dev/loop1,offset=32256)
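
If you would rather not do the multiplication by hand, the start sector can be pulled out of the file output and converted in one go. This is a minimal sketch: start_sector_from_file_output is a hypothetical helper name, the sample text is copied from the file output above, and the 512-byte sector size is the same assumption used in the paragraph above.

```shell
# start_sector_from_file_output
# Reads `file` output on stdin and prints the first "startsector N" value.
start_sector_from_file_output() {
    grep -o 'startsector [0-9]*' | head -n 1 | cut -d' ' -f2
}

# Sample text taken from the `file harddrive.img` output shown earlier
sample='active, starthead 1, startsector 63, 33640047 sectors, code offset 0x48'
sector=$(printf '%s\n' "$sample" | start_sector_from_file_output)

# Assumes 512-byte sectors
echo $(( sector * 512 ))    # prints 32256
```

Against a real image, the input would come from file harddrive.img instead of the sample string.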

If you made this image under duress (due to a failing drive or other emergency), you might have to check and repair the filesystem first. Doing that is easy if you make a loop device:

# losetup --offset 32256 /dev/loop2 harddrive.img
# fsck /dev/loop2

Once that’s complete, you can save some time and mount the loop device directly:

# mount /dev/loop2 /mnt/loop

Monitor MySQL restore progress with pv

The pv command is one that I really enjoy using but it’s also one that I often forget about. You can’t get a much more concise definition of what pv does than this one:

pv allows a user to see the progress of data through a pipeline, by giving information such as time elapsed, percentage completed (with progress bar), current throughput rate, total data transferred, and ETA.

The usage certainly isn’t complicated:

To use it, insert it in a pipeline between two processes, with the appropriate options. Its standard input will be passed through to its standard output and progress will be shown on standard error.

A great application of pv is when you’re restoring large amounts of data into MySQL, especially if you’re restoring data under duress due to an accidentally-dropped table or database. (Who hasn’t been there before?) The standard way of restoring data is something we’re all familiar with:

# mysql my_database < database_backup.sql

The downside of this method is that you have no idea how quickly your restore is working or when it might be done. You could always open another terminal to monitor the tables and databases as they’re created, but that can be hard to follow.

Toss in pv and that problem is solved:

# pv database_backup.sql | mysql my_database
96.8MB 0:00:17 [5.51MB/s] [==>                                ] 11% ETA 0:02:10

With MySQL, the restore rate changes as the import moves through different tables and statements, so treat the ETA as a rough estimate rather than a promise.
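
A related case is a gzip-compressed dump. Here is a minimal sketch, assuming a hypothetical helper named restore_with_progress: it reads the file with pv (falling back to cat, without a progress bar, when pv isn’t installed), decompresses file names ending in .gz, and pipes the result to whatever command you pass. Because pv reads the file on disk, the percentage reflects progress through the compressed file.

```shell
# restore_with_progress DUMP_FILE COMMAND [ARGS...]
# Streams DUMP_FILE into COMMAND, showing progress with pv when available.
restore_with_progress() {
    local dump="$1"
    shift

    # Fall back to cat (no progress bar) if pv is not installed
    local reader=pv
    command -v pv >/dev/null 2>&1 || reader=cat

    case "$dump" in
        *.gz) "$reader" "$dump" | gunzip -c | "$@" ;;
        *)    "$reader" "$dump" | "$@" ;;
    esac
}

# Usage (the file name here is hypothetical):
#   restore_with_progress database_backup.sql.gz mysql my_database
```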
