Dual-primary DRBD with OCFS2

As promised in one of my previous posts about dual-primary DRBD and OCFS2, I’ve compiled a step-by-step guide for Fedora. These instructions should be somewhat close to what you would use on CentOS or Red Hat Enterprise Linux. However, CentOS and Red Hat don’t provide some of the packages needed, so you will need to use other software repositories like RPMFusion or EPEL.

In this guide, I’ll be using two Fedora 14 instances in the Rackspace Cloud with separate public and private networks. The instances are called server1 and server2 to make things easier to follow.

NOTE: All of the instructions below should be done on both servers unless otherwise specified.

First, we need to set up DRBD with two primary nodes. I’ll be using loop files for this setup since I don’t have access to raw partitions.

yum -y install drbd-utils
dd if=/dev/zero of=/drbd-loop.img bs=1M count=1000

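Put a loop file initialization init script in /etc/init.d/loop-for-drbd. It needs to attach /drbd-loop.img to /dev/loop7, which is the device the DRBD resource file below refers to. Then finish setting it up: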

chmod a+x /etc/init.d/loop-for-drbd
chkconfig loop-for-drbd on
/etc/init.d/loop-for-drbd start
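
The init script itself isn't reproduced in this post, so here is a minimal sketch of what such a script might look like. It is not the original script. It assumes the image lives at /drbd-loop.img and should be bound to /dev/loop7, and the chkconfig start priority (20) is my own guess; it only needs to be lower than the priority of the drbd init script. The LOSETUP variable is a hypothetical hook so the script can be dry-run with echo.

```shell
#!/bin/sh
# loop-for-drbd: attach the DRBD backing file to a loop device at boot.
# chkconfig: 2345 20 80
# description: Binds /drbd-loop.img to /dev/loop7 before DRBD starts.
#
# Assumption: the start priority above only has to be lower than the
# one used by the drbd init script, so the loop device exists first.

LOOP_DEV="${LOOP_DEV:-/dev/loop7}"        # must match "disk" in r0.res
LOOP_FILE="${LOOP_FILE:-/drbd-loop.img}"  # file created with dd above
LOSETUP="${LOSETUP:-losetup}"             # set to "echo" for a dry run

loop_start()  { "$LOSETUP" "$LOOP_DEV" "$LOOP_FILE"; }  # attach file
loop_stop()   { "$LOSETUP" -d "$LOOP_DEV"; }            # detach device
loop_status() { "$LOSETUP" "$LOOP_DEV"; }               # show binding

case "${1:-}" in
    start)   loop_start ;;
    stop)    loop_stop ;;
    restart) loop_stop; loop_start ;;
    status)  loop_status ;;
    "")      ;;  # no argument: only define the functions
    *)       echo "Usage: $0 {start|stop|restart|status}" >&2; exit 2 ;;
esac
```

Running the script with LOSETUP=echo prints the losetup command lines instead of executing them, which is a cheap way to check the paths before rebooting.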

Place this DRBD resource file in /etc/drbd.d/r0.res. Be sure to adjust the server names and IP addresses for your servers.

resource r0 {
	meta-disk internal;
	device /dev/drbd0;
	disk /dev/loop7;

	syncer { rate 1000M; }
	net {
		allow-two-primaries;
		after-sb-0pri discard-zero-changes;
		after-sb-1pri discard-secondary;
		after-sb-2pri disconnect;
	}
	startup { become-primary-on both; }

	on server1 { address 10.181.76.0:7789; }
	on server2 { address 10.181.76.1:7789; }
}

The net section is telling DRBD to do the following:

allow-two-primaries – Generally, DRBD has a primary and a secondary node. In this case, we will allow both nodes to be primary so that both can have the filesystem mounted at the same time. Do this only with a clustered filesystem. If you do this with a non-clustered filesystem like ext2/ext3/ext4 or reiserfs, you will have data corruption. Seriously!
after-sb-0pri discard-zero-changes – DRBD detected a split-brain scenario, and neither node is a primary at that moment. If one of the nodes didn’t make any changes, DRBD will apply the other node’s modifications to it.
after-sb-1pri discard-secondary – DRBD detected a split-brain scenario, but one node is the primary and the other is the secondary. In this case, DRBD will decide that the secondary node is the victim and it will sync data from the primary to the secondary automatically.
after-sb-2pri disconnect – DRBD detected a split-brain scenario while both nodes are primaries, so it can’t figure out on its own which node has the right data. It tries to protect the consistency of both nodes by disconnecting the DRBD volume entirely. You’ll have to tell DRBD which node has the valid data in order to reconnect the volume. Use extreme caution if you find yourself in this scenario.

If you’d like to read about DRBD split-brain behavior in more detail, review the documentation.
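
For the after-sb-2pri case, telling DRBD which node has the valid data comes down to a couple of drbdadm calls. The sketch below wraps them in two helper functions using the DRBD 8.3 command syntax. The function names and the DRBDADM dry-run variable are my own inventions for illustration, and it assumes the filesystem is already unmounted on the node whose changes you are giving up.

```shell
# Manual split-brain recovery helpers (DRBD 8.3 syntax).
# Set DRBDADM=echo to print the commands instead of running them.
DRBDADM="${DRBDADM:-drbdadm}"

# Run on the node whose changes will be thrown away (the "victim").
# Unmount the OCFS2 filesystem on this node first.
discard_local_changes() {
    "$DRBDADM" secondary "$1" &&
    "$DRBDADM" -- --discard-my-data connect "$1"
}

# Run on the node whose data is kept (the "survivor"), if it has
# dropped to the StandAlone connection state.
reconnect_survivor() {
    "$DRBDADM" connect "$1"
}
```

On the victim you would call discard_local_changes r0, and on the survivor reconnect_survivor r0. Once the resync finishes, the victim can be promoted again with drbdadm primary r0.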

I generally turn off the usage reporting functionality in DRBD within /etc/drbd.d/global_common.conf:

global {
	usage-count no;
}

Now we can create the volume and start DRBD:

drbdadm create-md r0
/etc/init.d/drbd start && chkconfig drbd on

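You may see some errors about having two primaries when neither one is up to date. That can be fixed by picking one node to act as the primary for the initial sync and running the following command on that node only (the -o flag overwrites the data on the peer):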

drbdsetup /dev/drbd0 primary -o

If you run cat /proc/drbd on the secondary node, you should see the DRBD sync running:

version: 8.3.8 (api:88/proto:86-94)
srcversion: 299AFE04D7AFD98B3CA0AF9
 0: cs:SyncTarget ro:Secondary/Primary ds:Inconsistent/UpToDate C r----
    ns:0 nr:210272 dw:210272 dr:0 al:0 bm:12 lo:1 pe:2682 ua:0 ap:0 ep:1 wo:b oos:813660
        [===>................] sync'ed: 20.8% (813660/1023932)K queue_delay: 0.0 ms
        finish: 0:01:30 speed: 8,976 (6,368) want: 1024,000 K/sec

Before you go any further, wait for the DRBD sync to fully finish. When it completes, it should look like this:

version: 8.3.8 (api:88/proto:86-94)
srcversion: 299AFE04D7AFD98B3CA0AF9
 0: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r----
    ns:0 nr:1023932 dw:1023932 dr:0 al:0 bm:63 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
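
If you'd rather not stare at /proc/drbd until the sync is done, a small polling loop can do the waiting. This is a rough sketch that assumes a single DRBD device and the 8.3-style status format shown above; the function names are made up for this example.

```shell
# Succeeds when the DRBD status text on stdin shows a connected device
# with both disks UpToDate. Assumes only one DRBD device is configured.
drbd_synced() {
    grep -q 'cs:Connected .*ds:UpToDate/UpToDate'
}

# Poll the status file until the sync has finished.
#   $1: status file (default /proc/drbd)
#   $2: seconds to sleep between checks (default 5)
wait_for_drbd_sync() {
    status_file="${1:-/proc/drbd}"
    interval="${2:-5}"
    until drbd_synced < "$status_file"; do
        sleep "$interval"
    done
}
```

Calling wait_for_drbd_sync with no arguments blocks until /proc/drbd reports cs:Connected and ds:UpToDate/UpToDate, which is the state you want before promoting the second node.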

Now, on the secondary node only, promote it to a primary as well:

drbdadm primary r0

You should see this on the secondary node if you’ve done everything properly:

version: 8.3.8 (api:88/proto:86-94)
srcversion: 299AFE04D7AFD98B3CA0AF9
 0: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r----
    ns:1122 nr:1119 dw:2241 dr:4550 al:2 bm:1 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
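
The field to look at in that output is ro:, which lists the local role followed by the peer's role. As a quick sketch, the role pair can be pulled out with sed. The function name is hypothetical and the pattern assumes the 8.3-style /proc/drbd layout shown above.

```shell
# Print the "local/peer" role pair from DRBD status text on stdin,
# for example "Primary/Primary".
drbd_roles() {
    sed -n 's/.* ro:\([A-Za-z]*\/[A-Za-z]*\) .*/\1/p'
}
```

A check such as [ "$(drbd_roles < /proc/drbd)" = "Primary/Primary" ] can then guard the later OCFS2 steps in a script.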

We’re now ready to move on to configuring OCFS2. Only one package is needed:

yum -y install ocfs2-tools

Ensure that you have your servers and their private IP addresses in /etc/hosts before proceeding. Create the /etc/ocfs2 directory and place the following configuration in /etc/ocfs2/cluster.conf (adjust the server names and IP addresses):

cluster:
	node_count = 2
	name = web

node:
	ip_port = 7777
	ip_address = 10.181.76.0
	number = 1
	name = server1
	cluster = web

node:
	ip_port = 7777
	ip_address = 10.181.76.1
	number = 2
	name = server2
	cluster = web
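
Two things tend to trip people up with this file: the stanza headers (cluster: and node:) have to start in the first column with the parameters indented beneath them, and each node name needs to match that machine's hostname. Here is a small sketch for checking the second point. The helper names are made up, and the awk parsing assumes the "key = value" layout shown above.

```shell
# Print the node names defined in an OCFS2 cluster.conf, one per line.
#   $1: path to the config (default /etc/ocfs2/cluster.conf)
ocfs2_node_names() {
    awk '
        /^node:/   { in_node = 1; next }   # start of a node stanza
        /^[^ \t]/  { in_node = 0 }         # any other stanza header
        in_node && $1 == "name" { print $3 }
    ' "${1:-/etc/ocfs2/cluster.conf}"
}

# Succeed if hostname $1 is listed as a node name in config $2.
ocfs2_knows_host() {
    ocfs2_node_names "$2" | grep -Fqx "$1"
}
```

On each server, something like ocfs2_knows_host "$(hostname)" should succeed before you start o2cb. Whether your hostname is the short or fully qualified form depends on how the machine is set up, so compare it against what you put in the name fields.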

Now it’s time to configure OCFS2. Run service o2cb configure and follow the prompts. Use the defaults for all of the responses except for two questions:

Answer “y” to “Load O2CB driver on boot”
Answer “web” to “Cluster to start on boot”

Start OCFS2 and enable it at boot up:

chkconfig o2cb on && chkconfig ocfs2 on
/etc/init.d/o2cb start && /etc/init.d/ocfs2 start

Create the OCFS2 filesystem on one node only (both nodes are primaries at this point, so DRBD will replicate it to the other):

mkfs.ocfs2 -L "web" /dev/drbd0

Mount the volumes and configure them to automatically mount at boot time. You might be wondering why I do the mounting within /etc/rc.local. I chose to go that route since mounting via fstab was often unreliable for me due to the incorrect ordering of events at boot time. Using rc.local allows the mounts to work properly upon every reboot.

mkdir /mnt/storage
echo "/dev/drbd0  /mnt/storage  ocfs2  noauto,noatime  0 0" >> /etc/fstab
mount /dev/drbd0
echo "mount /dev/drbd0" >> /etc/rc.local
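
If the mount in rc.local still loses the race against DRBD or o2cb on a slow boot, one option is to retry it a few times. This isn't part of my setup above, just a sketch of a generic retry helper. The name retry and the attempt and delay values are arbitrary choices.

```shell
# Retry a command until it succeeds.
#   $1: maximum number of attempts
#   $2: seconds to sleep between attempts
#   remaining arguments: the command to run
retry() {
    attempts="$1"; delay="$2"; shift 2
    n=1
    until "$@"; do
        # Give up once the attempt limit has been reached.
        [ "$n" -ge "$attempts" ] && return 1
        n=$((n + 1))
        sleep "$delay"
    done
}

# Example line for /etc/rc.local (after defining the function there):
# retry 10 3 mount /dev/drbd0
```

With those example values, the mount is attempted up to ten times with a three second pause in between before giving up.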

At this point, you should be all done. If you want to test OCFS2, copy a file into your /mnt/storage mount on one node and check that it appears on the other node. If you remove it, it should be gone instantly on both nodes. This is a great opportunity to test reboots of both machines to ensure that everything comes up properly at boot time.