Pacemaker in Ubuntu 16.04

Prerequisite

Two Ubuntu 16.04 nodes with the following initial configuration.

Node1
  /etc/hostname : node1.hatest.com
  /etc/hosts    : 192.168.1.191 node1.hatest.com node1
  IP            : 192.168.1.191

Node2
  /etc/hostname : node2.hatest.com
  /etc/hosts    : 192.168.1.192 node2.hatest.com node2
  IP            : 192.168.1.192

Alias (virtual) IP : 192.168.1.190


Preparations

Before proceeding to the main installation phase, please go through the points below; they will help during the whole process.

You may change the APT sources from the bd (Bangladesh) mirror to the us mirror.
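A minimal sketch of that change, assuming the default entries in /etc/apt/sources.list point at bd.archive.ubuntu.com (adjust the hostnames if your mirror differs):

sed -i 's/bd\.archive\.ubuntu\.com/us.archive.ubuntu.com/g' /etc/apt/sources.list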

apt-get update

apt-get upgrade

Enable root login
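One common way to do this (a sketch, assuming the OpenSSH server is installed and root SSH access is acceptable in this lab setup) is to set a root password, permit root login in sshd, and restart the SSH service:

sudo passwd root

sudo sed -i 's/^PermitRootLogin .*/PermitRootLogin yes/' /etc/ssh/sshd_config

sudo service ssh restart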


Installation

For now, keep a continuous ping running against the alias IP (192.168.1.190) from another host on the LAN. After the installation we will check whether we are getting responses from it or not.
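For example, from any other machine on the 192.168.1.0/24 network:

ping 192.168.1.190

At this point the pings should get no reply, since nothing is holding that address yet.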

We can set the same time zone on both nodes with the steps below.


root@node1:/home/sharif# dpkg-reconfigure tzdata 

[Follow the interactive prompts to get the output below on both nodes]


Current default time zone: 'Asia/Dhaka'

Local time is now:      Tue Jun 19 15:19:53 +06 2018.

Universal Time is now:  Tue Jun 19 09:19:53 UTC 2018.
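Note that dpkg-reconfigure tzdata only sets the time zone. If the clocks themselves need to stay synchronized, one option (an extra step, not strictly required by this guide) is to install an NTP client on both nodes:

root@node1:/home/sharif# apt install ntp

root@node2:/home/sharif# apt install ntp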


Perform apt update & upgrade on both nodes.

Install pacemaker on both nodes. 


root@node1:/home/sharif# apt install pacemaker

root@node2:/home/sharif# apt install pacemaker


Corosync is installed as a dependency of the Pacemaker package. So we do not need to install it manually.
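If you want to double-check (an optional verification step), confirm that the corosync package is installed on both nodes:

root@node1:/home/sharif# dpkg -l corosync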

Follow the steps below only on Node1.


root@node1:/home/sharif# apt install haveged

root@node1:/home/sharif# corosync-keygen


The haveged package increases the amount of entropy available on the server, which the corosync-keygen script needs.


corosync-keygen generates a 128-byte cluster authorization key and writes it to /etc/corosync/authkey.
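As an optional sanity check, confirm that the key file exists on Node1 and is readable only by root:

root@node1:/home/sharif# ls -l /etc/corosync/authkey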


Now copy the authkey from Node1 to Node2 as shown below.


root@node1:/home/sharif# scp /etc/corosync/authkey root@192.168.1.192:/tmp

The authenticity of host '192.168.1.192 (192.168.1.192)' can't be established.

ECDSA key fingerprint is SHA256:fnsDOzHiX3MHlhsqr97PKZqgwBKG0gmCOnbKUDpzDYE.

Are you sure you want to continue connecting (yes/no)? yes


Warning: Permanently added '192.168.1.192' (ECDSA) to the list of known hosts.

root@192.168.1.192's password: 

authkey                                                                                100% 128     0.1KB/s   00:00    


On Node2, move the authkey file to its proper location (/etc/corosync/) and restrict its permissions to root, as in the steps below:

root@node2:/home/sharif# mv /tmp/authkey /etc/corosync

root@node2:/home/sharif# chown root: /etc/corosync/authkey

root@node2:/home/sharif# chmod 400 /etc/corosync/authkey


On both nodes, open /etc/corosync/corosync.conf and modify or add entries so that it matches the example file below.


totem {

  version: 2

  cluster_name: lbcluster

  transport: udpu

  interface {

    ringnumber: 0

    bindnetaddr: 192.168.1.190

    broadcast: yes

    mcastport: 5405

  }

}


quorum {

  provider: corosync_votequorum

  two_node: 1

}


nodelist {

  node {

    ring0_addr: 192.168.1.191

    name: node1.hatest.com

    nodeid: 1

  }

  node {

    ring0_addr: 192.168.1.192

    name: node2.hatest.com

    nodeid: 2

  }

}


logging {

  to_logfile: yes

  logfile: /var/log/corosync/corosync.log

  to_syslog: yes

  timestamp: on

}


*** Some of these lines (highlighted in the original document) may not be present in the default file; add them under the corresponding existing line/section. If anything else differs, make the file match the example above exactly.



Now, we need to configure Corosync to allow the Pacemaker service. We will do this on both nodes.

root@node1:/home/sharif# mkdir -p /etc/corosync/service.d

root@node1:/home/sharif# vim /etc/corosync/service.d/pcmk


service {

  name: pacemaker

  ver: 1

}



Open /etc/default/corosync and enable startup: if there is already a line START=no, change it to START=yes as below, or add a line START=yes. Do this on both nodes, as in the example file below.


root@node1:/home/sharif# cat /etc/default/corosync 

# Corosync runtime directory

#COROSYNC_RUN_DIR=/var/lib/corosync


# Path to corosync.conf

#COROSYNC_MAIN_CONFIG_FILE=/etc/corosync/corosync.conf


# Path to authfile

#COROSYNC_TOTEM_AUTHKEY_FILE=/etc/corosync/authkey


# Command line options

#OPTIONS=""


START=yes
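As a shortcut (a sketch that assumes the file already contains a START=no line), the same change can be made non-interactively on both nodes:

root@node1:/home/sharif# sed -i 's/^START=no$/START=yes/' /etc/default/corosync

root@node2:/home/sharif# sed -i 's/^START=no$/START=yes/' /etc/default/corosync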




Now restart the corosync service on both nodes.

root@node1:/home/sharif# service corosync stop

root@node1:/home/sharif# service corosync start

root@node2:/home/sharif# service corosync stop

root@node2:/home/sharif# service corosync start


Once Corosync is running on both nodes, they should be clustered together. We can verify this by running the following command; the output should look like this:


root@node1:/home/sharif# corosync-cmapctl | grep members

runtime.totem.pg.mrp.srp.members.1.config_version (u64) = 0

runtime.totem.pg.mrp.srp.members.1.ip (str) = r(0) ip(192.168.1.191)

runtime.totem.pg.mrp.srp.members.1.join_count (u32) = 1

runtime.totem.pg.mrp.srp.members.1.status (str) = joined

runtime.totem.pg.mrp.srp.members.2.config_version (u64) = 0

runtime.totem.pg.mrp.srp.members.2.ip (str) = r(0) ip(192.168.1.192)

runtime.totem.pg.mrp.srp.members.2.join_count (u32) = 1

runtime.totem.pg.mrp.srp.members.2.status (str) = joined


With Corosync set up properly, we can now configure Pacemaker. Pacemaker, which depends on the messaging capabilities of Corosync, is ready to be started and to have its basic properties configured.

The Pacemaker service requires Corosync to be running, so it is disabled by default.

Enable Pacemaker to start on system boot with this command (on both nodes):


root@node1:/home/sharif# update-rc.d pacemaker defaults 20 01

root@node2:/home/sharif# update-rc.d pacemaker defaults 20 01


Here we set Pacemaker's start priority to 20. It is important to specify a start priority higher than Corosync's (which is 19 by default), so that Pacemaker starts after Corosync.


Restart the pacemaker service on both nodes.

root@node1:/home/sharif# service pacemaker stop

root@node1:/home/sharif# service pacemaker start

root@node2:/home/sharif# service pacemaker stop

root@node2:/home/sharif# service pacemaker start


Check the Pacemaker status on both nodes with the crm utility. The output will look like the one below.

root@node1:/home/sharif# crm status

Last updated: Tue Jun 19 16:33:44 2018 Last change: Tue Jun 19 16:32:17 2018 by hacluster via crmd on node1.hatest.com

Stack: corosync

Current DC: node1.hatest.com (version 1.1.14-70404b0) - partition with quorum

2 nodes and 0 resources configured

Online: [ node1.hatest.com node2.hatest.com ]

Full list of resources:

Key points

Current DC = Current Designated Coordinator.

2 Nodes and 0 resources configured

Online = both nodes should be listed as online; for us they are node1 and node2.


We can also use crm_mon as an alternative to crm status; crm_mon refreshes the output continuously.
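If you only want a single snapshot rather than the continuously refreshing view, crm_mon can also be run in one-shot mode:

root@node1:/home/sharif# crm_mon -1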


We are still not getting any response from 192.168.1.190.

One last task remains. Follow the step below on Node1.


root@node1:/home/sharif# crm configure primitive virtual_public_ip \
  ocf:heartbeat:IPaddr2 params ip="192.168.1.190" \
  cidr_netmask="32" op monitor interval="10s" \
  meta migration-threshold="2" failure-timeout="60s" resource-stickiness="100"
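This defines an ocf:heartbeat:IPaddr2 resource that manages the floating IP. As an optional check, the stored cluster configuration can be printed back:

root@node1:/home/sharif# crm configure show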


Now check the ping status of 192.168.1.190


This time we should be getting replies from the alias IP.


Check crm status on either node. Compare it with previous crm status. 


root@node1:/home/sharif# crm status

Last updated: Tue Jun 19 16:58:38 2018 Last change: Tue Jun 19 16:55:07 2018 by root via cibadmin on node1.hatest.com

Stack: corosync

Current DC: node1.hatest.com (version 1.1.14-70404b0) - partition with quorum

2 nodes and 1 resource configured

Online: [ node1.hatest.com node2.hatest.com ]

Full list of resources:

virtual_public_ip (ocf::heartbeat:IPaddr2): Started node1.hatest.com


Check the IP configuration of Node1. For that, use either:

ip -4 addr ls

ip a s


root@node1:/home/sharif# ip -4 addr ls

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1

    inet 127.0.0.1/8 scope host lo

    valid_lft forever preferred_lft forever

2: ens33: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000

    inet 192.168.1.191/24 brd 192.168.1.255 scope global ens33

    valid_lft forever preferred_lft forever

    inet 192.168.1.190/32 brd 192.168.1.255 scope global ens33

    valid_lft forever preferred_lft forever


Notice that the alias IP is assigned on Node1, as it is currently the active node. If we shut down or standby Node1, the alias IP will move to Node2. Check the steps below.


Put Node1 into standby mode with the following command on Node1.

root@node1:/home/sharif# crm node standby node1.hatest.com

Check the crm status.


root@node1:/home/sharif# crm status

Last updated: Tue Jun 19 17:05:33 2018 Last change: Tue Jun 19 17:05:25 2018 by root via crm_attribute on node1.hatest.com

Stack: corosync

Current DC: node1.hatest.com (version 1.1.14-70404b0) - partition with quorum

2 nodes and 1 resource configured


Node node1.hatest.com: standby

Online: [ node2.hatest.com ]


Full list of resources:


virtual_public_ip (ocf::heartbeat:IPaddr2): Started node2.hatest.com


Check IP address of Node2.


root@node2:/home/sharif# ip a s

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1

    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00

    inet 127.0.0.1/8 scope host lo

       valid_lft forever preferred_lft forever

    inet6 ::1/128 scope host 

       valid_lft forever preferred_lft forever

2: ens33: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000

    link/ether 00:0c:29:78:28:85 brd ff:ff:ff:ff:ff:ff

    inet 192.168.1.192/24 brd 192.168.1.255 scope global ens33

    valid_lft forever preferred_lft forever

    inet 192.168.1.190/32 brd 192.168.1.255 scope global ens33

    valid_lft forever preferred_lft forever

    inet6 fe80::20c:29ff:fe78:2885/64 scope link 

    valid_lft forever preferred_lft forever



And we are still getting responses from the alias IP address.


Now make Node1 active again with the step below.


root@node1:/home/sharif# crm node online node1.hatest.com


Check the crm status now.


root@node1:/home/sharif# crm status

Last updated: Tue Jun 19 17:14:27 2018 Last change: Tue Jun 19 17:12:34 2018 by root via crm_attribute on node1.hatest.com

Stack: corosync

Current DC: node1.hatest.com (version 1.1.14-70404b0) - partition with quorum

2 nodes and 1 resource configured


Online: [ node1.hatest.com node2.hatest.com ]


Full list of resources:


 virtual_public_ip (ocf::heartbeat:IPaddr2): Started node2.hatest.com


Here is a point of confusion: we have set Node1 online again and both nodes are now online, so the resource should have moved back to Node1, yet it is still running on Node2 (Started node2.hatest.com). This is because of the resource-stickiness="100" meta attribute from one of our previous commands:


crm configure primitive virtual_public_ip \
  ocf:heartbeat:IPaddr2 params ip="192.168.1.190" \
  cidr_netmask="32" op monitor interval="10s" \
  meta migration-threshold="2" failure-timeout="60s" resource-stickiness="100"


Now, to move the resource back to Node1, we manually put Node2 into standby and then bring it online again. See the steps below.


root@node2:/home/sharif# crm node standby node2.hatest.com

root@node2:/home/sharif# crm node online node2.hatest.com
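An alternative (a sketch using standard crmsh resource commands, not part of the original steps) is to migrate the resource directly instead of cycling Node2 through standby. Note that migrate works by adding a temporary location constraint, which unmigrate removes afterwards:

root@node2:/home/sharif# crm resource migrate virtual_public_ip node1.hatest.com

root@node2:/home/sharif# crm resource unmigrate virtual_public_ip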


Now again check the crm status.


root@node2:/home/sharif# crm status

Last updated: Tue Jun 19 17:22:55 2018 Last change: Tue Jun 19 17:16:16 2018 by root via crm_attribute on node2.hatest.com

Stack: corosync

Current DC: node1.hatest.com (version 1.1.14-70404b0) - partition with quorum

2 nodes and 1 resource configured


Online: [ node1.hatest.com node2.hatest.com ]


Full list of resources:

virtual_public_ip (ocf::heartbeat:IPaddr2): Started node1.hatest.com


Now consider the following scenario:


Node1 and Node2 are both online. The crm status on Node2 is:


root@node2:/home/sharif# crm status

Last updated: Tue Jun 19 19:07:24 2018 Last change: Tue Jun 19 17:16:16 2018 by root via crm_attribute on node2.hatest.com

Stack: corosync

Current DC: node1.hatest.com (version 1.1.14-70404b0) - partition with quorum

2 nodes and 1 resource configured


Online: [ node1.hatest.com node2.hatest.com ]


Full list of resources:


virtual_public_ip (ocf::heartbeat:IPaddr2): Started node1.hatest.com


Shut down Node1 and check the crm status on Node2.
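To simulate the failure, Node1 can simply be powered off, for example:

root@node1:/home/sharif# shutdown -h now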


root@node2:/home/sharif# crm status

Last updated: Tue Jun 19 19:07:33 2018 Last change: Tue Jun 19 17:16:16 2018 by root via crm_attribute on node2.hatest.com

Stack: corosync

Current DC: node2.hatest.com (version 1.1.14-70404b0) - partition with quorum

2 nodes and 1 resource configured


Online: [ node2.hatest.com ]

OFFLINE: [ node1.hatest.com ]


Full list of resources:

virtual_public_ip (ocf::heartbeat:IPaddr2): Started node2.hatest.com


Still, we are getting uninterrupted ICMP replies from the alias IP (192.168.1.190).
