Saturday, May 21, 2016

Setting up Sheepdog v0.9 Distributed Object Storage on Fedora 22/23

While many people have heard of Ceph distributed storage (an open-source project developed by Inktank, which was purchased by Red Hat), far fewer have heard of Sheepdog distributed storage (an open-source project developed by NTT of Japan).

I first learned of Sheepdog from watching a 2012 presentation on Windows VDI (Virtual Desktop Infrastructure) given by the CTO (now CEO) of Atlantis. The 68-minute talk is up on YouTube. I was shocked to learn that Software Defined Storage (SDS) in a distributed architecture with 5+ nodes could boast higher IOPS than enterprise SAN hardware.

At work, I have tested Ceph as a storage backend for Openstack, namely as a backend for Nova ephemeral VM's, Glance images, and Cinder block storage volumes.

According to the documentation from various versions of Openstack (from Juno onwards), the Sheepdog storage driver is supported. For example, here's what the Openstack Kilo docs say about Sheepdog and Cinder:

http://docs.openstack.org/kilo/config-reference/content/sheepdog-driver.html

This driver enables use of Sheepdog through Qemu/KVM.
Set the following volume_driver in cinder.conf:
volume_driver=cinder.volume.drivers.sheepdog.SheepdogDriver
In another post, I talk about setting up Sheepdog as a backend for Openstack Kilo.

Of course, Sheepdog can be used as distributed storage on its own without Openstack. In this post I will cover setting up Sheepdog on Fedora 22/23 and mounting an LVM block device using the sheepdog daemon sheep.


Compile Sheepdog v0.9 from Github

As of May 2016, the upstream version of sheepdog from Github is v0.9.0...

By contrast, the sheepdog package provided by the RDO (Red Hat Distribution of Openstack) Kilo repos for Fedora is at version 0.3, which is incompatible with libcpg from corosync 2.3.5 in the default Fedora repos for f22/23: the sheep daemon fails to start because of a segfault in libcpg.

When trying to start the v0.3 sheep daemon I got the following error in dmesg:
...
[Apr25 14:52] sheep[11897]: segfault at 7fdb24f59a08 ip 00007fdb2ccc7cd8 sp 00007fdb24f59a10 error 6 in libcpg.so.4.1.0[7fdb2ccc6000+5000]
...

As you can see above, the sheep daemon fails to start because of a segfault in libcpg which is part of corosync.

This issue does not occur, however, when I use the v0.9 sheep daemon.

Here are the steps to compile Sheepdog from the upstream repo on Github:

(1) RENAME OLD SHEEPDOG 0.3 BINARIES

If you have RDO installed on your Fedora machine, sheepdog v0.3 binaries sheep and collie will already exist in /usr/sbin, but when you build sheepdog v0.9, it will install binaries into both /usr/sbin and /usr/bin:
  • sheep will be created in /usr/sbin
  • dog (the replacement for collie since v0.6) will be created in /usr/bin

To avoid name conflicts, it's a good idea to rename the old binaries from sheepdog v0.3. You might wonder why I bother renaming the binaries instead of doing dnf remove sheepdog. The reason is that sheepdog is one of the dependencies of RDO, so the old package cannot simply be removed. Even marking the package as "manually installed" and trying to remove it didn't work for me.

mv /usr/sbin/collie /usr/sbin/collie_0.3
mv /usr/sbin/sheep /usr/sbin/sheep_0.3
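
The two mv commands above will fail if a binary is missing, and running them again after the v0.9 install would rename the new sheep over the old copy. Here is a minimal sketch of a more cautious version. The helper name rename_old_binary is my own invention (it is not part of sheepdog), and the 0.3 suffix simply mirrors the commands above.

```shell
# Hypothetical helper: rename DIR/NAME to DIR/NAME_SUFFIX, but only if the
# original exists and the target name is still free.
rename_old_binary() (
    dir=$1; name=$2; suffix=$3
    src="$dir/$name"
    dst="$dir/${name}_$suffix"
    if [ ! -e "$src" ]; then
        echo "skip: $src not found"
    elif [ -e "$dst" ]; then
        echo "skip: $dst already exists"
    else
        mv "$src" "$dst" && echo "renamed: $src -> $dst"
    fi
)

# Usage (as root), mirroring the commands above:
#   rename_old_binary /usr/sbin collie 0.3
#   rename_old_binary /usr/sbin sheep 0.3
```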

(2) BUILD FROM UPSTREAM SOURCE

As of May 2016, the current sheepdog version is 0.9.0 ...

git clone git://github.com/collie/sheepdog.git
sudo dnf install -y autoconf automake libtool yasm userspace-rcu-devel \
corosynclib-devel
cd sheepdog
./autogen.sh
./configure

If you wish to build sheepdog with support for zookeeper as the sync agent (corosync is used by default) you must invoke the following:

./configure --enable-zookeeper

Finally, invoke:

sudo make install

Sheepdog 0.9 binaries will be installed into /usr/bin and /usr/sbin, so make sure the old sheepdog binaries in /usr/sbin have been renamed! BTW, there is no collie command in sheepdog v0.9. It has been replaced with dog.
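
If autogen.sh or configure stops with an error, the usual cause is a missing build tool. The sketch below is a small pre-flight check you could run before the build steps above. The function name missing_tools is hypothetical, and the list of tool names in the usage comment is my assumption about what the build needs (gcc and make are not in the dnf command above, so they may have to be installed separately).

```shell
# Hypothetical pre-flight helper: print each command name from the argument
# list that cannot be found in PATH. Exit status is non-zero if any is missing.
missing_tools() (
    status=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            status=1
        fi
    done
    exit "$status"
)

# Usage (the tool list is an assumption, adjust it for your system):
#   missing_tools git autoconf automake libtoolize yasm make gcc
```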


Set Up Corosync

Before starting corosync and sheep, make sure the required ports are open in the firewall on every machine you plan to use as a sheepdog storage node: TCP port 7000 for sheep, plus the UDP port set as mcastport in corosync.conf (5405 in the config below) and the port just below it (5404), which corosync also uses. In a simple lab environment, you may be able to get away with temporarily stopping your firewall with systemctl stop firewalld, but don't do this in a production environment!
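
Rather than stopping firewalld, you can open just the ports that are needed. The following is a sketch, not something I ran in this exact form: the run and open_ports helpers are hypothetical, and the UDP range assumes the mcastport of 5405 used in this post (corosync uses mcastport and mcastport - 1). By default the script only prints the firewall-cmd commands so you can review them.

```shell
# Print (DRY_RUN=1, the default) or execute (DRY_RUN=0) a command.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Open the given PORT/PROTO pairs permanently in firewalld, then reload.
open_ports() {
    for port in "$@"; do
        run firewall-cmd --permanent --add-port="$port"
    done
    run firewall-cmd --reload
}

# 7000/tcp is for sheep; 5404-5405/udp is for corosync.
open_ports 7000/tcp 5404-5405/udp
```

To apply the rules for real, run the script as root with DRY_RUN=0 set in the environment.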

(1) CREATE COROSYNC CONFIG FILES

sudo vim /etc/corosync/corosync.conf

# Please read the corosync.conf 5 manual page
compatibility: whitetank
totem {
  version: 2
  secauth: off
  threads: 0
  # Note, fail_recv_const is only needed if you're
  # having problems with corosync crashing under
  # heavy sheepdog traffic. This crash is due to
  # delayed/resent/misordered multicast packets.
  # fail_recv_const: 5000
  interface {
    ringnumber: 0
    bindnetaddr: 192.168.95.146
    mcastaddr: 226.94.1.1
    mcastport: 5405
  }
}
logging {
  fileline: off
  to_stderr: no
  to_logfile: yes
  to_syslog: yes
  # the pathname of the log file
  logfile: /var/log/cluster/corosync.log
  debug: off
  timestamp: on
  logger_subsys {
    subsys: AMF
    debug: off
  }
}
amf {
  mode: disabled
}


For bindnetaddr, use your local server's IP on a subnet which will be available to other Sheepdog storage nodes. In my lab environment, my sheepdog nodes are on ...95.{146,147,148}.

This probably isn't necessary, but if you want a regular user myuser to be able to access the corosync daemon, create the following file:

sudo vim /etc/corosync/uidgid.d/myuser

uidgid {
   uid: myuser
   gid: myuser
}


(2) START THE COROSYNC SERVICE

The corosync systemd service is not enabled by default, so enable the service and start it:

sudo systemctl enable corosync
sudo systemctl start corosync

When you check the corosync service status with systemctl status corosync you should see something like this:

corosync.service - Corosync Cluster Engine
   Loaded: loaded (/usr/lib/systemd/system/corosync.service; disabled; vendor preset: disabled)
   Active: active (running) since Mon 2016-05-09 11:45:06 KST; 1 weeks 4 days ago
 Main PID: 2248 (corosync)
   CGroup: /system.slice/corosync.service
           └─2248 corosync

May 09 11:45:06 fx8350no3 corosync[2248]:   [QB    ] server name: cpg
May 09 11:45:06 fx8350no3 corosync[2248]:   [SERV  ] Service engine loaded: corosync...4]
May 09 11:45:06 fx8350no3 corosync[2248]:   [SERV  ] Service engine loaded: corosync...3]
May 09 11:45:06 fx8350no3 corosync[2248]:   [QB    ] server name: quorum
May 09 11:45:06 fx8350no3 corosync[2248]:   [TOTEM ] A new membership (192.168.95.14...86
May 09 11:45:06 fx8350no3 corosync[2248]:   [MAIN  ] Completed service synchronizati...e.
May 09 11:45:06 fx8350no3 corosync[2236]: Starting Corosync Cluster Engine (corosync... ]
May 09 11:45:06 fx8350no3 systemd[1]: Started Corosync Cluster Engine.
May 09 11:48:46 fx8350no3 corosync[2248]:   [TOTEM ] A new membership (192.168.95.14...88
May 09 11:48:46 fx8350no3 corosync[2248]:   [MAIN  ] Completed service synchronizati...e.
Hint: Some lines were ellipsized, use -l to show in full.

(3) REPEAT STEPS 1 & 2 ON ALL MACHINES YOU WISH TO USE AS STORAGE NODES

In /etc/corosync/corosync.conf make sure to change bindnetaddr to the IP for each different machine.
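
If you copy the same corosync.conf to every node, the per-node change can be scripted. Below is a minimal sketch: set_bindnetaddr is a hypothetical helper of my own, and it assumes the file contains a single bindnetaddr line like the config shown earlier.

```shell
# Hypothetical helper: replace the value of the bindnetaddr line in a
# corosync.conf file, keeping the line's indentation.
set_bindnetaddr() (
    conf=$1; addr=$2
    tmp=$(mktemp) || exit 1
    sed "s/^\([[:space:]]*bindnetaddr:\).*/\1 $addr/" "$conf" > "$tmp" &&
        cat "$tmp" > "$conf"   # overwrite contents, keep the file's mode
    status=$?
    rm -f "$tmp"
    exit "$status"
)

# Usage (as root) on the second node of my lab setup:
#   set_bindnetaddr /etc/corosync/corosync.conf 192.168.95.147
```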


Launch Sheepdog Daemon on LVM Block Device

Sheepdog can use an entire disk as a storage node, but for testing purposes, it is easier to create an LVM logical volume, put a file system on it, and point sheep at the mount point.

(1) CREATE A MOUNTPOINT FOR SHEEP TO USE

sudo mkdir /mnt/sheep

(2) CREATE AN LVM BLOCK DEVICE FOR SHEEPDOG

sudo pvcreate /dev/sdxy
sudo vgcreate VGNAME /dev/sdxy
sudo lvcreate -L nG -n LVNAME VGNAME

where x is a drive letter (a, b, ..., z), y is a partition number (1, 2, 3, ...), and n is the size of the logical volume in GiB (a whole number).

(3) CREATE FILE SYSTEM ON LV

sudo mkfs.ext4 /dev/VGNAME/LVNAME

In this example, I created an ext4 file system, but you could use XFS or anything else.

(4) MOUNT BLOCK DEVICE ON MOUNTPOINT

sudo mount /dev/VGNAME/LVNAME /mnt/sheep

(5) RUN SHEEP DAEMON ON MOUNTPOINT

sudo sheep /mnt/sheep

To check that the daemon is running, try pidof sheep, which should return two process IDs.
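
Steps 1 through 5 can be collected into one script, which is handy when repeating them on several nodes. This is only a sketch: prepare_sheep_store and run are hypothetical helpers, and /dev/sdb1, vg_sheep, lv_sheep and 20G are placeholder values, not the ones from my lab. Note that vgcreate takes the volume group name first and the physical volume second, and lvcreate takes the new LV name via -n. By default the script only prints the commands.

```shell
# Print (DRY_RUN=1, the default) or execute (DRY_RUN=0) a command.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical wrapper around steps 1-5.
# Arguments: partition, VG name, LV name, size (e.g. 20G), mount point.
prepare_sheep_store() {
    dev=$1; vg=$2; lv=$3; size=$4; mnt=$5
    run mkdir -p "$mnt" &&
    run pvcreate "$dev" &&
    run vgcreate "$vg" "$dev" &&
    run lvcreate -L "$size" -n "$lv" "$vg" &&
    run mkfs.ext4 "/dev/$vg/$lv" &&
    run mount "/dev/$vg/$lv" "$mnt" &&
    run sheep "$mnt"
}

# Placeholder values; replace them with your own before running for real.
prepare_sheep_store /dev/sdb1 vg_sheep lv_sheep 20G /mnt/sheep
```

Review the printed commands first, then run the script as root with DRY_RUN=0 to execute them.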


(6) VERIFY DEFAULT FILES IN SHEEPDOG MOUNT

cd /mnt/sheep
ls

This should show the following files and directories:

config  epoch  lock  obj  sheep.log  sock

If you don't see anything in the mount point, the sheep daemon failed to load.
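
The same check can be scripted for use on every node. A minimal sketch, assuming the six entries listed above are the ones to look for; check_sheep_dir is a hypothetical helper name.

```shell
# Hypothetical check: report which of the expected entries are missing from
# a sheep store directory. Exit status is 0 only if all are present.
check_sheep_dir() (
    dir=$1
    status=0
    for entry in config epoch lock obj sheep.log sock; do
        if [ ! -e "$dir/$entry" ]; then
            echo "missing: $entry"
            status=1
        fi
    done
    exit "$status"
)

# Usage:
#   check_sheep_dir /mnt/sheep && echo "sheep store looks initialized"
```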

(7) REPEAT STEPS 1-6 ON ALL MACHINES TO BE USED AS STORAGE NODES

(8) CHECK SHEEPDOG NODES

Now that the sheep daemon has been launched, check whether it can see the other sheepdog nodes. Sheepdog commands can be invoked by a regular user.

dog node list
 Id   Host:Port         V-Nodes       Zone
  0   192.168.95.146:7000      128 2455742656
  1   192.168.95.147:7000      128 2472519872
  2   192.168.95.148:7000      128 2489297088

The sheepdog daemon should automatically be able to see all the other nodes on which sheep is running (if corosync is working properly, that is).
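
To check the node count from a script, you can parse the output of dog node list. The sketch below assumes the column layout shown above (a numeric Id followed by Host:Port); count_nodes is a hypothetical helper that reads that output on stdin.

```shell
# Hypothetical helper: count the node rows in `dog node list` output read
# from stdin. A row is a line whose first field is a number and whose second
# field ends in :PORT, so the header line is skipped.
count_nodes() {
    awk '$1 ~ /^[0-9]+$/ && $2 ~ /:[0-9]+$/ { n++ } END { print n + 0 }'
}

# Usage, expecting the three nodes of my lab setup:
#   [ "$(dog node list | count_nodes)" -eq 3 ] || echo "some nodes are missing"
```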

You can get a list of valid dog commands by just invoking dog without any arguments:

dog
Sheepdog administrator utility (version 0.9.0_352_g3d5438a)
Usage: dog [options]

Available commands:
  vdi check               check and repair image's consistency
  vdi create              create an image
  ...

(9) DO INITIAL FORMAT OF SHEEPDOG CLUSTER

This step only needs to be done once from any node in the cluster.

dog cluster format
using backend plain store
dog cluster info
Cluster status: running, auto-recovery enabled

Cluster created at Mon Apr 25 19:22:14 2016

Epoch Time           Version [Host:Port:V-Nodes,,,]
#2016-04-25 19:22:14      1 [192.168.95.146:7000:128, 192.168.95.147:7000:128, 192.168.95.148:7000:128]
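
Since the format only needs to happen once, a script that sets up several nodes may want to check the cluster state first. A small sketch, assuming the first line of dog cluster info looks like the output above; cluster_is_running is a hypothetical helper that reads that output on stdin.

```shell
# Hypothetical helper: succeed if the `dog cluster info` output on stdin
# reports a running cluster.
cluster_is_running() {
    grep -q '^Cluster status: running'
}

# Usage:
#   dog cluster info | cluster_is_running && echo "cluster already formatted"
```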


Convert RAW/QCOW2 Image to Sheepdog VDI Format

(1) INSTALL QEMU

sudo dnf install -y qemu qemu-kvm

(2) CONVERT A VM IMAGE TO SHEEPDOG VDI FORMAT

qemu-img convert -f qcow2 xenial-amd64.qcow2 sheepdog:xenial

In this example, I am converting an Ubuntu 16.04 64-bit cloud image to sheepdog VDI format. Note that cloud images do not contain any user:pass info, so it is impossible to log in without first injecting an ssh keypair with cloud-init. This can be achieved by first booting the image in Openstack, selecting a keypair, and then logging into the launched instance through the console in Horizon. Once you are logged in, you can create a user and password. Then take a snapshot of the instance and download it for use in qemu or virt-manager.

NOTE: The format for sheepdog images is sheepdog:imgName
Converting a RAW or QCOW2 image to sheepdog format will cause a new sheepdog VDI image to be created in the distributed storage nodes. You can verify this by navigating to the sheep mountpoint and running ls. The image file itself won't appear, just a bunch of new file chunks, as this is object storage.
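
When converting several images, the command line can be generated from the file name and the desired VDI name. This is a rough sketch: guess_format and convert_cmd are hypothetical helpers, and guessing the format from the extension is a shortcut of mine (qemu-img info FILE reports the real format if you are unsure).

```shell
# Hypothetical helper: guess the qemu-img source format from the file
# extension. Anything other than .qcow2 is treated as raw here.
guess_format() {
    case "$1" in
        *.qcow2) echo qcow2 ;;
        *)       echo raw ;;
    esac
}

# Print the qemu-img command that converts image FILE into a VDI named NAME.
convert_cmd() {
    echo "qemu-img convert -f $(guess_format "$1") $1 sheepdog:$2"
}

# Prints the same command used in the example above.
convert_cmd xenial-amd64.qcow2 xenial
```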

(3) VERIFY SHEEPDOG VDI CREATION

dog vdi list
 Name        Id    Size    Used  Shared    Creation time   VDI id  Copies  Tag   Block Size Shift
 xenial       0  2.2 GB  976 MB  0.0 MB 2016-04-25 20:22   4f6c3e      3                22

(4) LAUNCH VDI VIA QEMU-SYSTEM & X11 FORWARDING

The Sheepdog storage nodes will probably be server machines without Xorg X11 / Desktop Environment installed.

You can still launch qemu-system if you use SSH X11 forwarding.

First check that the server machine has xauth installed:

rpm -q xorg-x11-xauth

Then, from another machine that has X11 installed, invoke ssh with the -X option so that the remote program is displayed in your local X session, and run qemu-system:

ssh -X fedjun@192.168.95.148 qemu-system-x86_64 -enable-kvm -m 1024 \
-cpu host -drive file=sheepdog:xenial

The -m flag designates memory in MB; qemu's default is only 128MB so you need to specify this manually if you need more memory.
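
If you launch several VDIs with different memory sizes, a tiny helper can assemble the command line. This is a sketch under one assumption: that the drive is addressed with the same sheepdog:imgName form used for qemu-img. The name qemu_cmd is hypothetical.

```shell
# Hypothetical helper: print a qemu command line that boots VDI NAME with
# MEM megabytes of RAM (MEM defaults to 1024).
qemu_cmd() {
    name=$1; mem=${2:-1024}
    echo "qemu-system-x86_64 -enable-kvm -m $mem -cpu host -drive file=sheepdog:$name"
}

# Example: boot the xenial VDI with 2 GB of RAM.
qemu_cmd xenial 2048
```

The printed line can be appended to the ssh -X invocation in place of a hand-typed qemu command.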

Note that qemu uses NAT for VMs by default. You cannot directly communicate from host to VM, but you can go from VM to host by ssh'ing or pinging 10.0.2.2.