{"id":395,"date":"2016-05-31T13:26:23","date_gmt":"2016-05-31T03:26:23","guid":{"rendered":"https:\/\/icicimov.com\/blog\/?p=395"},"modified":"2017-02-23T17:32:59","modified_gmt":"2017-02-23T06:32:59","slug":"clustering-with-pacemaker-drbd-and-gfs2-on-bare-metal-servers-in-softlayer","status":"publish","type":"post","link":"https:\/\/icicimov.com\/blog\/?p=395","title":{"rendered":"Clustering with Pacemaker, DRBD and GFS2 on Bare-Metal servers in SoftLayer"},"content":{"rendered":"<p><a href=\"http:\/\/www.softlayer.com\/\">SoftLayer<\/a> is IBM company providing cloud and Bare-Metal hosting services. We are going to setup a cluster of Pacemaker, DRBD and GFS2 on couple of Bare-Metal servers to host our Encompass services. This will provide high availability of the shared storage for our applications.<\/p>\n<p>The services are running on two 2U Supermicro 2 x Hexa Core (6 cores per cpu = 24 cpu&#8217;s in total due to hyper threading) Intel Xeon 2650 bare-metal servers with 64GB of RAM and Ubuntu-14.04.4 server minimal install for OS and 4 x 1TB hard drives. The root file system is on one 1TB SATA drive and the other 3 x 1TB are in hardware RAID5 array via LSI controller, to be used for the shared storage.<\/p>\n<p>The shared file system resides on the 2TB RAID5 SATA array and is kept in sync via DRBD (on top of LVM for easy extension) block level replication and GFS2 clustered file system. The DRBD and GFS2 are managed as resources by Pacemaker. 
The ASCII chart below might describe this layout better:<\/p>\n<pre><code>+----------+  +----------+             +----------+  +----------+\n|  Service |  |  Service |             |  Service |  |  Service |\n+----------+  +----------+             +----------+  +----------+\n     ||            ||                       ||            ||\n+------------------------+  cluster FS +------------------------+\n|          gfs2          |&lt; ~~~~~~~~~~&gt;|          gfs2          |\n+------------------------+ replication +------------------------+\n|        drbd r0         |&lt; ~~~~~~~~~~&gt;|         drbd r0        |\n+------------------------+             +------------------------+\n|        lv_vol          |             |         lv_vol         |\n+------------------------+             +------------------------+\n|   volume group vg1     |             |    volume group vg1    |\n+------------------------+             +------------------------+\n|     physical volume    |             |     physical volume    |\n+------------------------+             +------------------------+\n|          sdb1          |             |          sdb1          |\n+------------------------+             +------------------------+\n         server01                               server02\n<\/code><\/pre>\n<p>SoftLayer gives you one public and one private VLAN to connect your server to, for each of which you can opt for 0.1, 1 or 10 Gbps throughput. Each server has a bond of 2 interfaces connected to each VLAN for HA and fail-over, plus one IPMI\/KVM BMC interface connected to the private VLAN.<\/p>\n<h1>Disk Setup<\/h1>\n<p>We have 3 x 1TB SATA3 disks in RAID5 =~ 2TB usable space. 
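<\/p>\n<p>As a quick sanity check (my arithmetic, not controller output), RAID5 gives up one disk&#8217;s worth of space to parity, so usable capacity is (number of disks - 1) x disk size:<\/p>

```shell
# RAID5 usable capacity: one of the N disks' worth of space holds parity
disks=3       # disks in the array
disk_tb=1     # size of each disk in TB
echo "$(( (disks - 1) * disk_tb ))TB usable"   # -> 2TB usable
```

<p>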
I have created the following partitions on the RAID5 block device <code>\/dev\/sdb<\/code> (using a GPT partition table since it&#8217;s a 2TB disk):<\/p>\n<pre><code>root@server01:~# gdisk -l \/dev\/sdb\nGPT fdisk (gdisk) version 0.8.8\nPartition table scan:\n  MBR: protective\n  BSD: not present\n  APM: not present\n  GPT: present\n\nFound valid GPT with protective MBR; using GPT.\nDisk \/dev\/sdb: 3904897024 sectors, 1.8 TiB\nLogical sector size: 512 bytes\nDisk identifier (GUID): 18E19822-8B06-460E-B2C4-A98E63C284FD\nPartition table holds up to 128 entries\nFirst usable sector is 34, last usable sector is 3904896990\nPartitions will be aligned on 2048-sector boundaries\nTotal free space is 2604662717 sectors (1.2 TiB)\n\nNumber  Start (sector)    End (sector)  Size       Code  Name\n   1            2048       524290047   250.0 GiB   8300  Linux filesystem\n   2       524290048      1048578047   250.0 GiB   8300  Linux filesystem\n   3      1048578048      1300236287   120.0 GiB   8300  Linux filesystem\n<\/code><\/pre>\n<p>For optimal performance we need to find the strip size of the RAID5 array holding the volume we will create the file system on:<\/p>\n<pre><code>root@server01:~# storcli \/c0\/v1 show all | grep Strip\nStrip Size = 256 KB\n<\/code><\/pre>\n<p>So the strip size is 256KB and we have 2 data disks in RAID5, so we need to take this into account when creating the LVM and the file system.<\/p>\n<p>For the shared file system I used the first partition to create an LVM volume of 200GB, leaving around 20% free for snapshots:<\/p>\n<pre><code>[ALL]:~# pvcreate --dataalignment 512K \/dev\/sdb1\n  Physical volume \"\/dev\/sdb1\" successfully created\n<\/code><\/pre>\n<p>where <code>dataalignment<\/code> is calculated as <code>Strip size * No. Data disks<\/code>. 
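<\/p>\n<p>The 512K figure above is just that formula worked out (a quick sketch, nothing cluster-specific):<\/p>

```shell
# dataalignment = RAID strip size x number of data disks;
# a 3-disk RAID5 stripes data across 2 disks, the third holds parity
strip_kb=256
data_disks=2
echo "$(( strip_kb * data_disks ))K"   # -> 512K
```

<p>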
To check the data alignment we can run:<\/p>\n<pre><code>root@server01:~# pvs -o +pe_start \/dev\/sdb1\n<\/code><\/pre>\n<p>Next we create the VG and LV:<\/p>\n<pre><code>[ALL]:~# vgcreate -A y vg_drbd0 \/dev\/sdb1\n  Volume group \"vg_drbd0\" successfully created\n\n[ALL]:~# lvcreate --name lv_drbd0 -L 200G vg_drbd0\n  Logical volume \"lv_drbd0\" created\n<\/code><\/pre>\n<p>At the end we need to tell LVM where to look for logical volumes and which devices to skip:<\/p>\n<pre><code>[ALL]:~# vi \/etc\/lvm\/lvm.conf\n...\n    filter = [ \"r|^\/dev\/drbd.*$|\", \"a|^\/dev\/sda.*$|\", \"a|^\/dev\/sdb.*$|\", \"r\/.*\/\" ]\n    write_cache_state = 0\n...\n<\/code><\/pre>\n<p>and we also turn off the LVM write cache to avoid another caching level. Then we need to update the <code>ramdisk<\/code> in order to synchronize the initramfs&#8217;s copy of <code>lvm.conf<\/code> with the main system one:<\/p>\n<pre><code>[ALL]:~# update-initramfs -u\nupdate-initramfs: Generating \/boot\/initrd.img-3.13.0-86-generic\n<\/code><\/pre>\n<p>otherwise devices might go missing upon reboot.<\/p>\n<h1>Services Setup<\/h1>\n<p>We start by updating the kernel and the packages and installing the needed software:<\/p>\n<pre><code>[ALL]:~# aptitude update &amp;&amp; aptitude safe-upgrade -y &amp;&amp; shutdown -r now\n[ALL]:~# aptitude install -y heartbeat pacemaker corosync fence-agents openais cluster-glue resource-agents xfsprogs lvm2 gfs2-utils dlm\n[ALL]:~# aptitude install -y linux-headers build-essential module-assistant flex debconf-utils docbook-xml docbook-xsl dpatch xsltproc autoconf2.13 autoconf debhelper git\n<\/code><\/pre>\n<p>I also set up DNS names for the private VLAN IPs in the <code>\/etc\/hosts<\/code> file:<\/p>\n<pre><code>...\n10.10.10.91    sl01.private\n10.10.10.26    sl02.private\n<\/code><\/pre>\n<p>Now we can go on and configure our services.<\/p>\n<h2>Clustering Components<\/h2>\n<p>For this to work properly we must set up passwordless access for the root 
user on the private VLAN. We generate SSH keys on both servers:<\/p>\n<pre><code>[ALL]:~# ssh-keygen -t rsa -b 2048 -f ~\/.ssh\/id_rsa -N ''\n<\/code><\/pre>\n<p>and copy-paste the public key into the other server&#8217;s <code>\/root\/.ssh\/authorized_keys<\/code> file, or use <code>ssh-copy-id<\/code> for that purpose.<\/p>\n<h3>Corosync<\/h3>\n<p>We start by generating a private key on one of the servers and copying it over to the other:<\/p>\n<pre><code>root@server01:~# corosync-keygen -l\nroot@server01:~# scp \/etc\/corosync\/authkey server02.private:\/etc\/corosync\/authkey\n<\/code><\/pre>\n<p>In this way, for added security, only a server that has this key can join the cluster communication. Next is the config file <code>\/etc\/corosync\/corosync.conf<\/code>:<\/p>\n<pre><code>totem {\n    version: 2\n\n    # How long before declaring a token lost (ms)\n    token: 3000\n\n    # How many token retransmits before forming a new configuration\n    token_retransmits_before_loss_const: 10\n\n    # How long to wait for join messages in the membership protocol (ms)\n    join: 60\n\n    # How long to wait for consensus to be achieved before starting a new round of membership configuration (ms)\n    consensus: 3600\n\n    # Turn off the virtual synchrony filter\n    vsftype: none\n\n    # Number of messages that may be sent by one processor on receipt of the token\n    max_messages: 20\n\n    # Limit generated nodeids to 31-bits (positive signed integers)\n    clear_node_high_bit: yes\n\n    # Disable encryption\n    secauth: off\n\n    # How many threads to use for encryption\/decryption\n    threads: 0\n\n    # Optionally assign a fixed node id (integer)\n    # nodeid: 1234\n\n    # Cluster name, needed for GFS2 and DLM (DLM won't start without it)\n    cluster_name: slcluster\n\n    # This specifies the mode of redundant ring, which may be none, active, or passive.\n    rrp_mode: none\n\n    interface {\n        # The following values need to be set based on your environment\n        ringnumber: 0\n        bindnetaddr: 10.10.10.91\n        mcastaddr: 226.94.1.1\n        mcastport: 5405\n    }\n    transport: udpu\n}\n\nnodelist {\n    node {\n        ring0_addr: 10.10.10.91\n        nodeid: 1\n    }\n    node {\n        ring0_addr: 10.10.10.26\n        nodeid: 2\n    }\n}\n\namf {\n    mode: disabled\n}\n\nquorum {\n    # Quorum for the Pacemaker Cluster Resource Manager\n    provider: corosync_votequorum\n    expected_votes: 1\n    two_node: 1\n    wait_for_all: 1\n    last_man_standing: 1\n    auto_tie_breaker: 0\n}\n\naisexec {\n        user:   root\n        group:  root\n}\n\nlogging {\n        fileline: off\n        to_stderr: yes\n        to_logfile: no\n        to_syslog: yes\n        syslog_facility: daemon\n        debug: off\n        timestamp: on\n        logger_subsys {\n                subsys: AMF\n                debug: off\n                tags: enter|leave|trace1|trace2|trace3|trace4|trace6\n        }\n}\n<\/code><\/pre>\n<p>On the other node we replace <code>bindnetaddr<\/code> to read <code>bindnetaddr: 10.10.10.26<\/code>. 
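<\/p>\n<p>Since the two files differ only in that one value, server02&#8217;s config can be derived from server01&#8217;s with a one-line <code>sed<\/code> substitution; a minimal sketch on sample data (the file path is illustrative):<\/p>

```shell
# rewrite server01's bindnetaddr for server02 in a copy of the config
printf '        bindnetaddr: 10.10.10.91\n' > /tmp/corosync.conf.sample
sed 's/bindnetaddr: 10.10.10.91/bindnetaddr: 10.10.10.26/' /tmp/corosync.conf.sample
```

<p>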
Then we enable the service on both servers in the <code>\/etc\/default\/corosync<\/code> file:<\/p>\n<pre><code># start corosync at boot [yes|no]\nSTART=yes\n<\/code><\/pre>\n<p>and start it up:<\/p>\n<pre><code>[ALL]:~# service corosync start\n<\/code><\/pre>\n<p>Confirm all is OK:<\/p>\n<pre><code>root@server02:~# corosync-cfgtool -s\nPrinting ring status.\nLocal node ID 2\nRING ID 0\n    id    = 10.10.10.26\n    status    = ring 0 active with no faults\n\nroot@server02:~# corosync-quorumtool\nQuorum information\n------------------\nDate:             Mon May 23 01:46:03 2016\nQuorum provider:  corosync_votequorum\nNodes:            2\nNode ID:          2\nRing ID:          24\nQuorate:          Yes\nVotequorum information\n----------------------\nExpected votes:   2\nHighest expected: 2\nTotal votes:      2\nQuorum:           2 \nFlags:            Quorate\nMembership information\n----------------------\n    Nodeid      Votes Name\n         2          1 10.10.10.26 (local)\n         1          1 10.10.10.91\n<\/code><\/pre>\n<p>At the end, we make sure to open <code>UDP port 5405<\/code> in the firewall on the private VLAN interface and that the service is enabled on startup:<\/p>\n<pre><code>[ALL]# update-rc.d corosync enable\n<\/code><\/pre>\n<h3>Pacemaker<\/h3>\n<p>Since we already installed it, all we need to do is start it up:<\/p>\n<pre><code>[ALL]:~# service pacemaker start\n<\/code><\/pre>\n<p>then set <code>no-quorum-policy<\/code> to <code>ignore<\/code>, since this is a 2-node cluster and we want it to keep running when one of the nodes crashes (meaning we&#8217;ve lost quorum), and disable fencing for now:<\/p>\n<pre><code>root@server01:~# crm configure property stonith-enabled=false\nroot@server01:~# crm configure property no-quorum-policy=ignore\n<\/code><\/pre>\n<p>and then we should see both nodes online if we check the status:<\/p>\n<pre><code>root@server01:~# crm status   \nLast updated: Mon May 23 01:42:02 2016\nLast change: Mon May 23 01:08:41 2016 via 
cibadmin on server02\nStack: corosync\nCurrent DC: server01 (1) - partition with quorum\nVersion: 1.1.10-42f2063\n2 Nodes configured\n2 Resources configured\n\nOnline: [ server01 server02 ]\n<\/code><\/pre>\n<p>Last, we enable the Pacemaker service on startup and make sure it starts after Corosync:<\/p>\n<pre><code>[ALL]# update-rc.d -f pacemaker remove\n[ALL]# update-rc.d pacemaker start 50 1 2 3 4 5 . stop 01 0 6 .\n[ALL]# update-rc.d pacemaker enable\n<\/code><\/pre>\n<h3>Fencing<\/h3>\n<p>To make sure the cluster functions properly we need to configure some kind of fencing. This is to prevent a <code>split-brain<\/code> situation in case of a partitioned cluster. In Pacemaker terms this is called STONITH (Shoot The Other Node In The Head) and we&#8217;ll be using the <code>IPMI-over-LAN<\/code> device we saw configured above. On one node only we do:<\/p>\n<pre><code>root@server01:~# crm configure\ncrm(live)configure# primitive p_fence_server01 stonith:fence_ipmilan \\\n   params pcmk_host_list=\"server01\" ipaddr=\"10.10.10.52\" \\\n   action=\"reboot\" login=\"&lt;my-admin-user&gt;\" passwd=\"&lt;my-admin-password&gt;\" delay=15 \\\n   op monitor interval=\"60s\"\ncrm(live)configure# primitive p_fence_server02 stonith:fence_ipmilan \\\n   params pcmk_host_list=\"server02\" ipaddr=\"10.10.10.71\" \\\n   action=\"reboot\" login=\"&lt;my-admin-user&gt;\" passwd=\"&lt;my-admin-password&gt;\" delay=5 \\\n   op monitor interval=60s\ncrm(live)configure# location l_fence_server01 p_fence_server01 -inf: server01\ncrm(live)configure# location l_fence_server02 p_fence_server02 -inf: server02\ncrm(live)configure# property stonith-enabled=\"true\"\ncrm(live)configure# commit\ncrm(live)configure# exit\nroot@server01:~#\n<\/code><\/pre>\n<p>Now if we check the cluster state we can see our new fencing resources configured:<\/p>\n<pre><code>root@server01:~# crm status   \nLast updated: Mon May 23 01:42:02 2016\nLast change: Mon May 23 01:08:41 2016 
via cibadmin on server02\nStack: corosync\nCurrent DC: server01 (1) - partition with quorum\nVersion: 1.1.10-42f2063\n2 Nodes configured\n2 Resources configured\n\nOnline: [ server01 server02 ]\n\n p_fence_server01    (stonith:fence_ipmilan):    Started server02\n p_fence_server02    (stonith:fence_ipmilan):    Started server01\n<\/code><\/pre>\n<h2>DRBD<\/h2>\n<p>I built the DRBD kernel module and the utilities for the currently running kernel <code>3.13.0-86-generic<\/code> from the upstream git repositories. For the DRBD utils:<\/p>\n<pre><code>[ALL]:~# git clone --recursive git:\/\/git.drbd.org\/drbd-utils.git\n[ALL]:~# cd drbd-utils\/\n[ALL]:~\/drbd-utils# .\/autogen.sh\n[ALL]:~\/drbd-utils# .\/configure --prefix=\/usr --localstatedir=\/var --sysconfdir=\/etc \\\n                          --with-pacemaker=yes --with-heartbeat=yes --with-rgmanager=yes \\\n                          --with-xen=yes --with-bashcompletion=yes\n[ALL]:~\/drbd-utils# make\n[ALL]:~\/drbd-utils# debuild -i -us -uc -b\n<\/code><\/pre>\n<p>And for the kernel driver:<\/p>\n<pre><code>[ALL]:~# git clone --recursive git:\/\/git.drbd.org\/drbd-8.4.git\n[ALL]:~# cd drbd-8.4\n[ALL]:~\/drbd-8.4# git checkout drbd-8.4.7\n[ALL]:~\/drbd-8.4# make &amp;&amp; make clean\n[ALL]:~\/drbd-8.4# debuild -i -us -uc -b\n<\/code><\/pre>\n<p>This has created <code>.deb<\/code> packages in the parent directory of the current working directory. 
All that is left is to install them:<\/p>\n<pre><code>[ALL]:~\/drbd-8.4# dpkg -i ..\/drbd-dkms_8.4.1-1_all.deb ..\/drbd-utils_8.9.6-1_amd64.deb\n<\/code><\/pre>\n<p>At the end we pin the kernel so we don&#8217;t accidentally upgrade it:<\/p>\n<pre><code>[ALL]:~\/drbd-8.4# vi \/etc\/apt\/preferences.d\/kernel\nPackage: linux-generic linux-headers-generic linux-image-generic linux-restricted-modules-generic\nPin: version 3.13.0-86\nPin-Priority: 1001\n<\/code><\/pre>\n<p>To confirm the installation we run:<\/p>\n<pre><code>root@server01:~# modinfo drbd\nfilename:       \/lib\/modules\/3.13.0-86-generic\/updates\/drbd.ko\nalias:          block-major-147-*\nlicense:        GPL\nversion:        8.4.7-2\ndescription:    drbd - Distributed Replicated Block Device v8.4.7-2\nauthor:         Philipp Reisner &lt;phil @linbit.com&gt;, Lars Ellenberg &lt;lars @linbit.com&gt;\nsrcversion:     74731AD693E4C2E56E1C448\ndepends:        libcrc32c\nvermagic:       3.13.0-86-generic SMP mod_unload modversions\nparm:           minor_count:Approximate number of drbd devices (1-255) (uint)\nparm:           disable_sendpage:bool\nparm:           allow_oos:DONT USE! 
(bool)\nparm:           proc_details:int\nparm:           enable_faults:int\nparm:           fault_rate:int\nparm:           fault_count:int\nparm:           fault_devs:int\nparm:           usermode_helper:string\n\nroot@server01:~# drbdadm --version\nDRBDADM_BUILDTAG=GIT-hash:\\ c6e62702d5e4fb2cf6b3fa27e67cb0d4b399a30b\\ build\\ by\\ ubuntu@server01\\,\\ 2016-05-23\\ 05:30:41\nDRBDADM_API_VERSION=1\nDRBD_KERNEL_VERSION_CODE=0x080407\nDRBDADM_VERSION_CODE=0x080906\nDRBDADM_VERSION=8.9.6\n<\/code><\/pre>\n<p>Now we can start with the configuration, first is the common config file <code>\/etc\/drbd.d\/global_common.conf<\/code> on one server only:<\/p>\n<pre><code>global {\n    usage-count no;\n    # minor-count dialog-refresh disable-ip-verification\n}\ncommon {\n    handlers {\n        # These are EXAMPLE handlers only.\n        # They may have severe implications,\n        # like hard resetting the node under certain circumstances.\n        # Be careful when chosing your poison.\n        pri-on-incon-degr \"\/usr\/lib\/drbd\/notify-pri-on-incon-degr.sh; \/usr\/lib\/drbd\/notify-emergency-reboot.sh; echo b &gt; \/proc\/sysrq-trigger ; reboot -f\";\n        pri-lost-after-sb \"\/usr\/lib\/drbd\/notify-pri-lost-after-sb.sh; \/usr\/lib\/drbd\/notify-emergency-reboot.sh; echo b &gt; \/proc\/sysrq-trigger ; reboot -f\";\n        local-io-error \"\/usr\/lib\/drbd\/notify-io-error.sh; \/usr\/lib\/drbd\/notify-emergency-shutdown.sh; echo o &gt; \/proc\/sysrq-trigger ; halt -f\";\n        #  Hook into Pacemaker's fencing\n        fence-peer \"\/usr\/lib\/drbd\/crm-fence-peer.sh\";\n        after-resync-target \"\/usr\/lib\/drbd\/crm-unfence-peer.sh\";\n        # split-brain \"\/usr\/lib\/drbd\/notify-split-brain.sh root\";\n        # out-of-sync \"\/usr\/lib\/drbd\/notify-out-of-sync.sh root\";\n        # before-resync-target \"\/usr\/lib\/drbd\/snapshot-resync-target-lvm.sh -p 15 -- -c 16k\";\n        # after-resync-target 
\/usr\/lib\/drbd\/unsnapshot-resync-target-lvm.sh;\n    }\n    startup {\n        # wfc-timeout degr-wfc-timeout outdated-wfc-timeout wait-after-sb\n        wfc-timeout 300;\n        degr-wfc-timeout 120;\n        outdated-wfc-timeout 120;\n    }\n    options {\n        # cpu-mask on-no-data-accessible\n        on-no-data-accessible io-error;\n        #on-no-data-accessible suspend-io;\n    }\n    disk {\n        # size max-bio-bvecs on-io-error fencing disk-barrier disk-flushes\n        # disk-drain md-flushes resync-rate resync-after al-extents\n        # c-plan-ahead c-delay-target c-fill-target c-max-rate\n        # c-min-rate disk-timeout\n        fencing resource-and-stonith;\n\n        # Setup syncer rate: start with 30% and let the dynamic planner do the job by\n        # letting it know our network parameters (1Gbps), and c-fill-target which is\n        # calculated as BDP x 2 (twice the Bandwidth Delay Product);\n        # used http:\/\/www.speedguide.net\/bdp.php to find the BDP\n        resync-rate 33M;\n        c-max-rate 110M;\n        c-min-rate 10M;\n        c-fill-target 16M;\n    }\n    net {\n        # protocol timeout max-epoch-size max-buffers unplug-watermark\n        # connect-int ping-int sndbuf-size rcvbuf-size ko-count\n        # allow-two-primaries cram-hmac-alg shared-secret after-sb-0pri\n        # after-sb-1pri after-sb-2pri always-asbp rr-conflict\n        # ping-timeout data-integrity-alg tcp-cork on-congestion\n        # congestion-fill congestion-extents csums-alg verify-alg\n        # use-rle\n        # Protocol \"C\" tells DRBD not to tell the operating system that\n        # the write is complete until the data has reached persistent\n        # storage on both nodes. This is the slowest option, but it is\n        # also the only one that guarantees consistency between the\n        # nodes. It is also required for dual-primary, which we will\n        # be using.\n        protocol C;\n\n        # Tell DRBD to allow dual-primary. 
This is needed to enable\n        # live-migration of our servers.\n        allow-two-primaries yes;\n\n        # This tells DRBD what to do in the case of a split-brain when\n        # neither node was primary, when one node was primary and when\n        # both nodes are primary. In our case, we'll be running\n        # dual-primary, so we can not safely recover automatically. The\n        # only safe option is for the nodes to disconnect from one\n        # another and let a human decide which node to invalidate.\n        after-sb-0pri discard-zero-changes;\n        after-sb-1pri discard-secondary;\n        after-sb-2pri disconnect;\n    }\n}\n<\/code><\/pre>\n<p>then we create a resource config file <code>\/etc\/drbd.d\/r0.res<\/code> where we utilize the previously created LVM volume:<\/p>\n<pre><code>resource r0 {\n    startup {\n        # This tells DRBD to promote both nodes to 'primary' when this\n        # resource starts. However, we will let pacemaker control this\n        # so we comment it out, which tells DRBD to leave both nodes\n        # as secondary when drbd starts.\n        #become-primary-on both;\n    }\n\n    net {\n        # This tells DRBD how to do a block-by-block verification of\n        # the data stored on the backing devices. Any verification\n        # failures will result in the affected block being marked\n        # out-of-sync.\n        verify-alg md5;\n\n        # This tells DRBD to generate a checksum for each transmitted\n        # packet. If the received data doesn't generate the same\n        # sum, a retransmit request is generated. This protects against\n        # otherwise-undetected errors in transmission, like\n        # bit-flipping. 
See:\n        # http:\/\/www.drbd.org\/users-guide\/s-integrity-check.html\n        data-integrity-alg md5;\n\n        # Increase send buffer since we are on a 1Gbps bonded network\n        sndbuf-size 512k;\n\n        # Improve write performance of the replicated data on the\n        # receiving node\n        max-buffers 8000;\n        max-epoch-size 8000;\n    }\n\n    disk {\n        # This tells DRBD not to bypass the write-back caching on the\n        # RAID controller. Normally, DRBD forces the data to be flushed\n        # to disk, rather than allowing the write-back caching to\n        # handle it. Normally this is dangerous, but with BBU-backed\n        # caching, it is safe. The first option disables disk flushing\n        # and the second disables metadata flushes.\n        disk-flushes no;\n        md-flushes no;\n        disk-barrier no;\n\n        # In case of error DRBD will operate in diskless mode, and carries\n        # out all subsequent I\/O operations, read and write, on the peer node\n        on-io-error detach;\n\n        # Increase metadata activity log to reduce disk writing and\n        # improve performance\n        al-extents 3389;\n    }\n\n    volume 0 {\n       device      \/dev\/drbd0;\n       disk        \/dev\/mapper\/vg_drbd0-lv_drbd0;\n       meta-disk   internal;\n    }\n\n    on server01 {\n       address     10.10.10.91:7788;\n    }\n\n    on server02 {\n       address     10.10.10.26:7788;\n    }\n}\n<\/code><\/pre>\n<p>Of note here is that we disable disk flushes and disk barriers to improve performance, since our disk controller has a BBU-backed volatile cache:<\/p>\n<pre><code>root@server01:~# \/opt\/MegaRAID\/storcli\/storcli64 \/c0 show all | grep BBU\nBBU Status = 0\nBBU  = Yes\nBBU = Present\nCache When BBU Bad = Off\n\nroot@server01:~# \/opt\/MegaRAID\/storcli\/storcli64 -LDInfo -L1 -aALL -NoLog | grep 'Current Cache Policy'\nCurrent Cache Policy: WriteBack, ReadAhead, Direct, No Write Cache if Bad 
BBU\n<\/code><\/pre>\n<p>Since everything needs to be identical on the second server, we simply copy over the files:<\/p>\n<pre><code>root@server01:~# rsync -r \/etc\/drbd.d\/ server02:\/etc\/drbd.d\/\n<\/code><\/pre>\n<p>Then on both servers we load the kernel module, create the resource and its meta data and bring the resource up:<\/p>\n<pre><code>[ALL]:~# modprobe drbd\n[ALL]:~# drbdadm create-md r0\n[ALL]:~# drbdadm up r0\n<\/code><\/pre>\n<p>By default both resources will come up as <code>Secondary<\/code>, so on one node only we make the resource <code>Primary<\/code>, which will trigger the initial disk synchronization:<\/p>\n<pre><code>root@server01:~# drbdadm primary --force r0\n<\/code><\/pre>\n<p>This can take a lot of time depending on the disk size, so to speed up the initial sync we run on the sync target:<\/p>\n<pre><code>root@server02:~# drbdadm disk-options --c-plan-ahead=0 --resync-rate=110M r0\n<\/code><\/pre>\n<p>to let it take as much as possible of the 1Gbps bandwidth we have. After the initial sync has completed we can make the second node <code>Primary<\/code> too:<\/p>\n<pre><code>root@server02:~# drbdadm primary r0\n<\/code><\/pre>\n<p>and check the final status of the resource:<\/p>\n<pre><code>root@server01:~# cat \/proc\/drbd\nversion: 8.4.7-2 (api:1\/proto:86-101)\nGIT-hash: e0fc2176f53dda5aa32a59e6466af9d9dc6493be build by root@server01, 2016-05-23 02:14:03\n 0: cs:Connected ro:Primary\/Primary ds:UpToDate\/UpToDate C r-----\n    ns:209989680 nr:0 dw:280916 dr:209974404 al:858 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:d oos:0\n<\/code><\/pre>\n<p>And to get back to the configured re-sync speed we run on the sync target node:<\/p>\n<pre><code>root@server02:~# drbdadm adjust r0\n<\/code><\/pre>\n<p>At the end, some settings to reduce latency. 
Enabling the deadline scheduler as recommended by LINBIT:<\/p>\n<pre><code>[ALL]:~# echo deadline &gt; \/sys\/block\/sdb\/queue\/scheduler\n<\/code><\/pre>\n<p>Reduce the read I\/O deadline to 150 milliseconds (the default is 500ms):<\/p>\n<pre><code>[ALL]:~# echo 150 &gt; \/sys\/block\/sdb\/queue\/iosched\/read_expire\n<\/code><\/pre>\n<p>Reduce the write I\/O deadline to 1500 milliseconds (the default is 3000ms):<\/p>\n<pre><code>[ALL]:~# echo 1500 &gt; \/sys\/block\/sdb\/queue\/iosched\/write_expire\n<\/code><\/pre>\n<p>and we also add these commands to <code>\/etc\/rc.local<\/code> to make them permanent (these are sysfs settings, not sysctl tunables, so <code>\/etc\/sysctl.conf<\/code> cannot apply them).<\/p>\n<h2>GFS2<\/h2>\n<p>On one node only, we create the file system:<\/p>\n<pre><code>root@server01:~# mkfs.gfs2 -p lock_dlm -j 2 -t slcluster:slgfs2 \/dev\/drbd0\nThis will destroy any data on \/dev\/drbd0\nAre you sure you want to proceed? [y\/n]y\nDevice:                    \/dev\/drbd0\nBlock size:                4096\nDevice size:               199.99 GB (52427191 blocks)\nFilesystem size:           199.99 GB (52427189 blocks)\nJournals:                  2\nResource groups:           800\nLocking protocol:          \"lock_dlm\"\nLock table:                \"slcluster:slgfs2\"\nUUID:                      701d9bfe-b220-d58a-2734-ad10efc2afdc\n<\/code><\/pre>\n<p>where <code>slcluster<\/code> is the cluster name we set up in <code>corosync<\/code> previously:<\/p>\n<pre><code>root@server02:~# grep cluster \/etc\/corosync\/corosync.conf\n    cluster_name: slcluster\n<\/code><\/pre>\n<p>and <code>slgfs2<\/code> is a unique file system name. 
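<\/p>\n<p>One operational note: <code>-j 2<\/code> created one journal per node, and a GFS2 file system can only be mounted by as many nodes as it has journals. Should a third node ever join the cluster, a journal can be added online with <code>gfs2_jadd<\/code>; a sketch (not run on this setup):<\/p>

```shell
# add one more journal to the mounted GFS2 file system so a
# third node can mount it (run on any one node)
gfs2_jadd -j 1 /data
```

<p>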
On each node, make the file system mount point and configure it in <code>\/etc\/fstab<\/code> for the GFS2 daemon to find it on startup:<\/p>\n<pre><code> ...\n# GFS2\/DRBD mount point\nUUID=701d9bfe-b220-d58a-2734-ad10efc2afdc       \/data   gfs2    defaults,noauto,noatime,nodiratime,nobootwait      0 0\n<\/code><\/pre>\n<h2>Finishing off the Cluster Configuration<\/h2>\n<p>Now that we have DRBD and DLM configured we can add them to Pacemaker for management. We also add some constraints and ordering so the resources start and stop in the proper order and with the right dependencies. Once the configuration is finished and all changes are committed, Pacemaker will automatically start the services, mount the file systems etc. The final Pacemaker config looks like this:<\/p>\n<pre><code>root@server01:~# crm configure show | cat\nnode $id=\"1\" server01\nnode $id=\"2\" server02\nprimitive p_controld ocf:pacemaker:controld \\\n    op monitor interval=\"60\" timeout=\"60\" \\\n    op start interval=\"0\" timeout=\"90\" \\\n    op stop interval=\"0\" timeout=\"100\" \\\n    params daemon=\"dlm_controld\" \\\n    meta target-role=\"Started\"\nprimitive p_drbd_r0 ocf:linbit:drbd \\\n    params drbd_resource=\"r0\" \\\n    op monitor interval=\"10\" role=\"Master\" \\\n    op monitor interval=\"20\" role=\"Slave\" \\\n    op start interval=\"0\" timeout=\"240\" \\\n    op stop interval=\"0\" timeout=\"100\"\nprimitive p_fence_server01 stonith:fence_ipmilan \\\n    params pcmk_host_list=\"server01\" ipaddr=\"10.10.10.52\" action=\"reboot\" login=\"&lt;my-admin-user&gt;\" passwd=\"&lt;my-admin-password&gt;\" delay=\"15\" \\\n    op monitor interval=\"60s\"\nprimitive p_fence_server02 stonith:fence_ipmilan \\\n    params pcmk_host_list=\"server02\" ipaddr=\"10.10.10.71\" action=\"reboot\" login=\"&lt;my-admin-user&gt;\" passwd=\"&lt;my-admin-password&gt;\" delay=\"5\" \\\n    op monitor interval=\"60s\"\nprimitive p_fs_gfs2 ocf:heartbeat:Filesystem \\\n    params 
device=\"\/dev\/drbd0\" directory=\"\/data\" fstype=\"gfs2\" options=\"_netdev,noatime,rw,acl\" \\\n    op monitor interval=\"20\" timeout=\"40\" \\\n    op start interval=\"0\" timeout=\"60\" \\\n    op stop interval=\"0\" timeout=\"60\" \\\n    meta is-managed=\"true\"\nms ms_drbd p_drbd_r0 \\\n    meta master-max=\"2\" master-node-max=\"1\" clone-max=\"2\" clone-node-max=\"1\" notify=\"true\" interleave=\"true\"\nclone cl_dlm p_controld \\\n    meta globally-unique=\"false\" interleave=\"true\" target-role=\"Started\"\nclone cl_fs_gfs2 p_fs_gfs2 \\\n    meta globally-unique=\"false\" interleave=\"true\" ordered=\"true\" target-role=\"Started\"\nlocation l_fence_server01 p_fence_server01 -inf: server01\nlocation l_fence_server02 p_fence_server02 -inf: server02\ncolocation cl_fs_gfs2_dlm inf: cl_fs_gfs2 cl_dlm\ncolocation co_drbd_dlm inf: cl_dlm ms_drbd:Master\norder o_dlm_fs_gfs2 inf: cl_dlm:start cl_fs_gfs2:start\norder o_drbd_dlm_fs_gfs2 inf: ms_drbd:promote cl_dlm:start cl_fs_gfs2:start\nproperty $id=\"cib-bootstrap-options\" \\\n    dc-version=\"1.1.10-42f2063\" \\\n    cluster-infrastructure=\"corosync\" \\\n    no-quorum-policy=\"ignore\" \\\n    stonith-enabled=\"true\" \\\n    last-lrm-refresh=\"1464141632\"\nrsc_defaults $id=\"rsc-options\" \\\n    resource-stickiness=\"100\" \\\n    migration-threshold=\"3\"\n<\/code><\/pre>\n<p>Now we can disable the drbd service from autostart since Pacemaker will take care of that for us:<\/p>\n<pre><code>[ALL]# update-rc.d drbd disable\n<\/code><\/pre>\n<p>Some useful commands we can run to check and confirm the status of all resources in Pacemaker:<\/p>\n<pre><code>root@server02:~# crm_mon -Qrf1\nStack: corosync\nCurrent DC: server01 (1) - partition with quorum\nVersion: 1.1.10-42f2063\n2 Nodes configured\n8 Resources configured\n\nOnline: [ server01 server02 ]\n\nFull list of resources:\n\n p_fence_server01    (stonith:fence_ipmilan):    Started server02\n p_fence_server02    (stonith:fence_ipmilan):    Started 
server01\n Master\/Slave Set: ms_drbd [p_drbd_r0]\n     Masters: [ server01 server02 ]\n Clone Set: cl_dlm [p_controld]\n     Started: [ server01 server02 ]\n Clone Set: cl_fs_gfs2 [p_fs_gfs2]\n     Started: [ server01 server02 ]\n\nMigration summary:\n* Node server02:\n* Node server01:\n<\/code><\/pre>\n<p>The DLM lock manager has its own tool as well:<\/p>\n<pre><code>root@server02:~# dlm_tool status\ncluster nodeid 2 quorate 1 ring seq 24 24\ndaemon now 262695 fence_pid 0\nnode 1 M add 262497 rem 0 fail 0 fence 0 at 0 0\nnode 2 M add 262497 rem 0 fail 0 fence 0 at 0 0\n\nroot@server02:~# dlm_tool ls\ndlm lockspaces\nname          slgfs2\nid            0x966db418\nflags         0x00000000\nchange        member 2 joined 1 remove 0 failed 0 seq 1,1\nmembers       1 2\n<\/code><\/pre>\n<p>A simple check that the GFS2 file system is mounted:<\/p>\n<pre><code>root@server02:~# cat \/proc\/mounts | grep \/data\n\/dev\/drbd0 \/data gfs2 rw,noatime,acl 0 0\n<\/code><\/pre>\n<p>And a GFS2 overview using one of GFS2&#8217;s own tools, <code>gfs2_edit<\/code>:<\/p>\n<pre><code>root@server01:~# gfs2_edit -p sb master \/dev\/drbd0\nBlock #16    (0x10) of 52427191 (0x31ff9b7) (superblock)\n\nSuperblock:\n  mh_magic              0x01161970(hex)\n  mh_type               1                   0x1\n  mh_format             100                 0x64\n  sb_fs_format          1801                0x709\n  sb_multihost_format   1900                0x76c\n  sb_bsize              4096                0x1000\n  sb_bsize_shift        12                  0xc\n  master dir:           2                   0x2\n        addr:           134                 0x86\n  root dir  :           1                   0x1\n        addr:           133                 0x85\n  sb_lockproto          lock_dlm\n  sb_locktable          slcluster:slgfs2\n  sb_uuid               701d9bfe-b220-d58a-2734-ad10efc2afdc\n\nThe superblock has 2 directories\n   1\/1 [00000000] 1\/133 (0x1\/0x85): Dir     root\n   2\/2 [00000000] 2\/134 
(0x2\/0x86): Dir     master\n------------------------------------------------------\nBlock #134    (0x86) of 52427191 (0x31ff9b7) (disk inode)\n-------------- Master directory -----------------\nDinode:\n  mh_magic              0x01161970(hex)\n  mh_type               4                   0x4\n  mh_format             400                 0x190\n  no_formal_ino         2                   0x2\n  no_addr               134                 0x86\n  di_mode               040755(decimal)\n  di_uid                0                   0x0\n  di_gid                0                   0x0\n  di_nlink              4                   0x4\n  di_size               3864                0xf18\n  di_blocks             1                   0x1\n  di_atime              1463999842          0x5742dd62\n  di_mtime              1463999842          0x5742dd62\n  di_ctime              1463999842          0x5742dd62\n  di_major              0                   0x0\n  di_minor              0                   0x0\n  di_goal_meta          134                 0x86\n  di_goal_data          134                 0x86\n  di_flags              0x00000201(hex)\n  di_payload_format     1200                0x4b0\n  di_height             0                   0x0\n  di_depth              0                   0x0\n  di_entries            8                   0x8\n  di_eattr              0                   0x0\n\nDirectory block: lf_depth:0, lf_entries:0,fmt:0 next=0x0 (8 dirents).\n   1\/1 [0ed4e242] 2\/134 (0x2\/0x86): Dir     .\n   2\/2 [9608161c] 2\/134 (0x2\/0x86): Dir     ..\n   3\/3 [5efc1d83] 3\/135 (0x3\/0x87): Dir     jindex\n   4\/4 [486eee32] 6\/65812 (0x6\/0x10114): Dir     per_node\n   5\/5 [446811e9] 13\/66331 (0xd\/0x1031b): File    inum\n   6\/6 [1aef248e] 14\/66332 (0xe\/0x1031c): File    statfs\n   7\/7 [b1799d75] 15\/66333 (0xf\/0x1031d): File    rindex\n   8\/8 [6c1c0fed] 16\/66353 (0x10\/0x10331): File    
quota\n------------------------------------------------------\n<\/code><\/pre>\n<h3>Cluster testing<\/h3>\n<p>Crash the kernel on the first node (the SysRq <code>c<\/code> trigger forces an immediate kernel panic) and watch the second node initiate fencing:<\/p>\n<pre><code>root@server01:~# echo c &gt; \/proc\/sysrq-trigger\n<\/code><\/pre>\n<p>Monitor the logs on the second node:<\/p>\n<pre><code>root@server02:~# tail -f \/var\/log\/syslog\n...\nMay 23 07:21:26 server02 pengine[4342]:  warning: process_pe_message: Calculated Transition 17: \/var\/lib\/pacemaker\/pengine\/pe-warn-3.bz2\nMay 23 07:21:26 server02 crmd[4343]:   notice: te_fence_node: Executing reboot fencing operation (56) on server01 (timeout=60000)\nMay 23 07:21:26 server02 crmd[4343]:   notice: te_rsc_command: Initiating action 69: notify p_drbd_r0_pre_notify_demote_0 on server02 (local)\nMay 23 07:21:26 server02 stonith-ng[4339]:   notice: handle_request: Client crmd.4343.6f0f4fdc wants to fence (reboot) 'server01' with device '(any)'\nMay 23 07:21:26 server02 stonith-ng[4339]:   notice: initiate_remote_stonith_op: Initiating remote operation reboot for server01: c2fb8a55-7d37-479b-a913-42dc30b61e70 (0)\n<\/code><\/pre>\n<p>We can see fencing in action and the crashed node being rebooted. 
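<\/p>\n<p>A kernel crash is the most brutal failover test. A gentler, non-destructive variant is to drain one node with standby mode and watch Pacemaker move everything off it; the commands below are illustrative <code>crmsh<\/code> usage matching the syntax used elsewhere in this post:<\/p>\n<pre><code>root@server02:~# crm node standby server01   # Pacemaker demotes DRBD and unmounts GFS2 on server01\nroot@server02:~# crm_mon -1                  # one-shot status: resources should run on server02 only\nroot@server02:~# crm node online server01    # bring the node back; the clones rejoin\n<\/code><\/pre>\n<p>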
We check the cluster state:<\/p>\n<pre><code>root@server02:~# crm status\nLast updated: Mon May 23 07:24:21 2016\nLast change: Mon May 23 07:21:52 2016 via cibadmin on server02\nStack: corosync\nCurrent DC: server02 (2) - partition WITHOUT quorum\nVersion: 1.1.10-42f2063\n2 Nodes configured\n8 Resources configured\n\n\nOnline: [ server02 ]\nOFFLINE: [ server01 ]\n\n p_fence_server01    (stonith:fence_ipmilan):    Started server02\n Master\/Slave Set: ms_drbd [p_drbd_r0]\n     Masters: [ server02 ]\n     Stopped: [ server01 ]\n Clone Set: cl_dlm [p_controld]\n     Started: [ server02 ]\n     Stopped: [ server01 ]\n Clone Set: cl_fs_gfs2 [p_fs_gfs2]\n     Started: [ server02 ]\n     Stopped: [ server01 ]\n<\/code><\/pre>\n<p>and can see that everything is still running on the surviving node. Note the <code>partition WITHOUT quorum<\/code> message: the resources keep running despite the lost quorum because in a two-node cluster the <code>no-quorum-policy<\/code> property has to be set to <code>ignore<\/code>.<\/p>\n<h3>Cluster Monitoring<\/h3>\n<p>For monitoring we run the <code>crm_mon<\/code> cluster tool in daemon mode on both nodes, managed by <code>Supervisord<\/code>. We create our <code>\/etc\/supervisor\/conf.d\/local.conf<\/code> file:<\/p>\n<pre><code>[program:crm_mon]\ncommand=crm_mon --daemonize --timing-details --watch-fencing --mail-to igorc@encompasscorporation.com --mail-host smtp.mydomain.com --mail-prefix \"Pacemaker cluster alert\"\nprocess_name=%(program_name)s\nautostart=true\nautorestart=true\nstartsecs=0\nstopsignal=QUIT\nuser=root\nstdout_logfile=\/var\/log\/crm_mon.log\nstdout_logfile_maxbytes=1MB\nstdout_logfile_backups=3\nstderr_logfile=\/var\/log\/crm_mon.log\nstderr_logfile_maxbytes=1MB\nstderr_logfile_backups=3\n<\/code><\/pre>\n<p>Then we reload <code>Supervisord<\/code> and start the process:<\/p>\n<pre><code>root@server02:~# supervisorctl reread\ncrm_mon: available\nhttp-server: changed\n\nroot@server02:~# supervisorctl reload\nRestarted supervisord\n\nroot@server02:~# supervisorctl status\ncrm_mon                          RUNNING    pid 18259, uptime 0:00:00\n<\/code><\/pre>\n<p>The daemon will now send me emails every time the cluster state 
changes. With the <code>--as-html=\/path\/to\/page<\/code> parameter it can also generate an HTML page for monitoring the cluster state in a browser.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>SoftLayer is IBM company providing cloud and Bare-Metal hosting services. We are going to setup a cluster of Pacemaker, DRBD and GFS2 on couple of Bare-Metal servers to host our Encompass services. This will provide high availability of the shared&#8230;<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[17,9],"tags":[26,21,37,25,20],"class_list":["post-395","post","type-post","status-publish","format-standard","hentry","category-cluster","category-high-availability","tag-cluster","tag-drbd","tag-gfs2","tag-high-availability","tag-pacemaker"],"_links":{"self":[{"href":"https:\/\/icicimov.com\/blog\/index.php?rest_route=\/wp\/v2\/posts\/395","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/icicimov.com\/blog\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/icicimov.com\/blog\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/icicimov.com\/blog\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/icicimov.com\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=395"}],"version-history":[{"count":3,"href":"https:\/\/icicimov.com\/blog\/index.php?rest_route=\/wp\/v2\/posts\/395\/revisions"}],"predecessor-version":[{"id":398,"href":"https:\/\/icicimov.com\/blog\/index.php?rest_route=\/wp\/v2\/posts\/395\/revisions\/398"}],"wp:attachment":[{"href":"https:\/\/icicimov.com\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=395"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/icicimov.com\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=395"},{"taxonomy":"post_tag","embeddable":true,"href":"ht
tps:\/\/icicimov.com\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=395"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}