Self-Study Notes: Linux HA Cluster

[toc]

Linux HA Cluster

I. Theory

LB, HA, HP, Hadoop

1. LB (load balancing):

Transport layer: lvs
Application layer: nginx, haproxy, httpd, perlbal, ats, varnish

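2. HA (high availability):

vrrp: keepalived
HA Cluster: heartbeat, OpenAIS, corosync/pacemaker, cman, RHCS (Red Hat Cluster Suite)
corosync was split out of OpenAIS, and pacemaker was split out of the heartbeat CRM; on RHEL 6, cman runs on top of corosync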

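3. HA:

    Failure scenarios:
    Hardware failures: design defects, wear-out after long use, human-caused damage
    Software failures: design defects, bugs, operator error
    A (availability) = MTBF / (MTBF + MTTR)
    MTBF: Mean Time Between Failures
    MTTR: Mean Time To Repair
    Availability range: 0 < A < 1
    As a percentage: 90%, 95%, 99%
    Availability is usually quoted as a number of nines (99.9% is "three nines")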
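
As a quick check of the availability formula, here is a minimal shell sketch. The function name `availability` and the sample figures are made up for illustration; both arguments only need to share the same time unit.

```shell
#!/bin/sh
# availability MTBF MTTR
# Prints A = MTBF / (MTBF + MTTR) as a percentage with three decimals.
availability() {
    awk -v mtbf="$1" -v mttr="$2" \
        'BEGIN { printf "%.3f%%\n", 100 * mtbf / (mtbf + mttr) }'
}

# Hypothetical figures: 999 hours between failures, 1 hour to repair.
availability 999 1     # prints 99.900%
```

With these numbers the service reaches "three nines"; cutting MTTR (faster failover) raises A just as effectively as raising MTBF.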

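Providing redundancy:
    Messaging Layer (passes messages, such as heartbeats, between the nodes of the HA cluster)
    Cluster Resource Manager (sits on top of the Messaging Layer, manages multiple resources and decides which node each service starts on)
    Local Resource Manager (runs on every node and carries out the resource manager's decisions locally)
    Resource Agent: the scripts that actually control a resource; four basic actions (start, stop, restart, status)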

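    network partition: the cluster network has split into isolated parts
    Isolation:
        STONITH: Shoot The Other Node In The Head (node-level isolation)
        Fence: resource-level isolation
        Cut the node off at the network switch, or power the server off through a power switch

    A partition must hold more than half of the nodes to keep running; exactly half is not enough
    Alternatively, add a quorum (arbitration) device
    For a given HA service, only one node is active at a time
    Several services can still be spread across several nodes: the N-M model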

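failover domain:
the set of nodes a service is allowed to fail over to
    fda: node1, node5
    fdb: node2, node5
    fdc: node3, node5
    fdd: node4, node5

Resource constraints:
    Location constraint: a resource's preference for a node;
    Colocation constraint: whether resources prefer to run on the same node;
    Order constraint: dependencies in the start order of multiple resources;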

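vote system:
    Majority rule: quorum
            with quorum: holds the required number of votes ( > total/2 )
            without quorum: does not hold the required number of votes
    Two nodes (or any even number of nodes):
        Ping node
        qdisk
    failover (resources move away from the failed node)
    failback (resources move back once the node has recovered)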
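
The "more than half" rule can be sketched in a few lines of shell. This only illustrates the arithmetic, not how corosync implements it; the function name `has_quorum` is hypothetical.

```shell
#!/bin/sh
# has_quorum VOTES TOTAL
# Succeeds (exit status 0) only if VOTES is strictly greater than TOTAL / 2.
# Comparing 2 * VOTES with TOTAL avoids integer-division rounding.
has_quorum() {
    [ $(( $1 * 2 )) -gt "$2" ]
}

# A four-node cluster split into partitions of various sizes.
for votes in 1 2 3; do
    if has_quorum "$votes" 4; then
        echo "$votes/4: with quorum"
    else
        echo "$votes/4: without quorum"
    fi
done
```

The 2/4 case shows why an even number of nodes needs a tie-breaker such as a ping node or qdisk: an even split leaves neither half with quorum.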

4. Messaging Layer:

     heartbeat,
         v1
         v2
         v3
     corosync,cman

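5. Cluster Resource Manager (CRM):

    heartbeat v1: haresources (configuration interface: the haresources configuration file)
    heartbeat v2: crm (runs a crmd daemon on every node (5560/tcp); has a command-line interface [crmsh], but largely requires editing the XML configuration); GUI: hb_gui
    heartbeat v3: pacemaker (named after the cardiac pacemaker) (configuration interfaces: crmsh, pcs; GUI: hawk (SUSE), LCMC, pacemaker-gui)
    rgmanager (configuration: cluster.conf, system-config-cluster, conga (web GUI), cman_tool, clustat)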

6. Combinations:

    heartbeat v1 (haresources)
    heartbeat v2 (crm)
    heartbeat v3 + pacemaker 
    heartbeat + pacemaker
    corosync + pacemaker 
        corosync v1 + pacemaker (plugin)
        corosync v2 + pacemaker (standalone service)
    cman + rgmanager 
    CentOS 6 cman + pacemaker  
    CentOS 6 (corosync v1 + cman +pacemaker) 
    RHCS: Red Hat Cluster Suite
        RHEL5: cman + rgmanager + conga (ricci/luci)
        RHEL6: cman + rgmanager + conga (ricci/luci)
                corosync + pacemaker 
                corosync + cman + pacemaker 
        RHEL7: corosync + pacemaker

7. Resource agents:

Resource Agent:
    heartbeat legacy: scripts under /etc/ha.d/haresources.d/;
    LSB: scripts under /etc/rc.d/init.d/;
    OCF: Open Cluster Framework;
        provider: 
    STONITH devices:
    systemd:

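8. Resource types:

    primitive: primary (native) resource; only one instance can run in the cluster;
    clone: cloned resource; multiple instances can run in the cluster:
        anonymous clone, globally unique clone, stateful clone (active/passive)
    multi-state (master/slave): a special implementation of clone; multi-state resource;
    group: resource group:
        started or stopped together;
        resource monitoring;
        colocation;
    Resource attributes:
        priority: priority;
        target-role: started, stopped, master (the target role);
        is-managed: whether the cluster is allowed to manage this resource;
        resource-stickiness: resource stickiness;
        allow-migrate: whether the resource may be migrated;
    Constraints: score
        Location constraint: a resource's preference for a node;
            (-oo,+oo)
                any value + infinity = infinity
                any value + negative infinity = negative infinity
                infinity + negative infinity = negative infinity
        Colocation constraint: whether resources prefer to run on the same node;
            (-oo,+oo)
        Order constraint: dependencies in the start order of multiple resources;
            (-oo,+oo)
                Mandatory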
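
The three infinity rules for scores can be written out as a small shell function. This is only a sketch of the arithmetic listed above, not Pacemaker code; the name `add_scores` and the `+INF` / `-INF` spellings are assumptions made for the example.

```shell
#!/bin/sh
# add_scores A B
# Each argument is an integer, "+INF" or "-INF". Prints the combined score.
add_scores() {
    case "$1,$2" in
        *-INF*) printf '%s\n' "-INF" ;;   # negative infinity beats everything, even +INF
        *+INF*) printf '%s\n' "+INF" ;;   # otherwise positive infinity beats any finite value
        *)      printf '%s\n' "$(( $1 + $2 ))" ;;
    esac
}

add_scores 100 +INF     # prints +INF
add_scores 100 -INF     # prints -INF
add_scores +INF -INF    # prints -INF
add_scores 100 -50      # prints 50
```

The third case is the one worth remembering: a single negative-infinity constraint ("never run here") cannot be overridden by any positive preference.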

II. Installation and Configuration

1. CentOS 7: corosync v2 + pacemaker
corosync v2: vote system
pacemaker: standalone service
Tools that manage the full life cycle of the cluster:
pcs: agent (pcsd)
crmsh: agentless (pssh)

# yum info pcs

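2. Cluster configuration prerequisites

a. Time is synchronized;
b. The nodes can reach each other by the hostnames currently in use;
c. Decide whether a quorum device will be needed;
d. Mutual SSH trust between the two nodes

3.1 Install and start pcsd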

# vim /etc/hosts 
xx.xxx.xx.xxx node1.ssjinyao.com
xx.xxx.xx.xxx node2.ssjinyao.com
# scp /etc/hosts root@node2.ssjinyao.com:/etc/
# date ; ssh root@node2.ssjinyao.com 'date' # check that the clocks on the two nodes agree
# ntpdate date.xxx.com # if the clocks are out of sync, synchronize them
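
Comparing two `date` outputs by eye gets tedious; a small helper can report the skew in seconds instead. This is a minimal sketch: the function name `clock_skew` is hypothetical, and the commented usage line assumes key-based ssh to node2 is already in place.

```shell
#!/bin/sh
# clock_skew LOCAL_EPOCH REMOTE_EPOCH
# Prints the absolute difference between two Unix timestamps, in seconds.
clock_skew() {
    d=$(( $1 - $2 ))
    if [ "$d" -lt 0 ]; then
        d=$(( -d ))
    fi
    echo "$d"
}

# Against a live node (assumption: passwordless ssh already works):
#   clock_skew "$(date +%s)" "$(ssh root@node2.ssjinyao.com date +%s)"

# Self-contained example with fixed timestamps:
clock_skew 1700000005 1700000002    # prints 3
```

The ssh round trip itself adds a little delay, so a result of a second or two does not necessarily mean the clocks disagree.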

Node1 AND Node2
# yum install -y pacemaker pcs psmisc policycoreutils-python
# or simply: yum install -y pcs 
# systemctl start pcsd.service 
# systemctl enable pcsd.service

3.2 Run on all nodes

# vim /etc/ansible/hosts
[ha]
node1.ssjinyao.com
node2.ssjinyao.com

Node1 AND Node2
# ansible ha -m service -a "name=pcsd state=started enabled=yes"
# systemctl status pcsd 
# ansible ha -m shell -a 'echo "xxxxxx" | passwd --stdin hacluster'

# pcs help 
# pcs cluster --help 
# pcs cluster auth node1.ssjinyao.com node2.ssjinyao.com -u hacluster # authenticate the nodes to pcs
# pcs cluster setup --name jycluster node1.ssjinyao.com node2.ssjinyao.com
# cd /etc/corosync/
# cat corosync.conf 
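# Notes: totem carries the heartbeat messages between cluster nodes;
     nodelist: the nodes currently in the cluster;
     quorum: the voting mechanism for quorum; corosync_votequorum; 
# cd /etc/corosync/ && cat corosync.conf.example.udpu # look at the example cluster configuration
# vim corosync.conf 
Edit the logging section:
logging {
to_logfile: yes
logfile: /var/log/cluster/corosync.log
}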
# scp corosync.conf node2.ssjinyao.com:/etc/corosync/

3.3 Start the cluster

Node1 Or Node2 
# pcs cluster  start --all 
# corosync-cfgtool --help 
Check the communication status of each node ("no faults" means OK):
# corosync-cfgtool -s 
Printing ring status.
Local node ID 1 
RING ID 0 
     id = xx.xxx.xx.xxx
     status = ring 0 active with no faults
The same check can also be run on node2.ssjinyao.com
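
For scripted checks, the "no faults" test can be automated. The sketch below assumes the output format shown above (one `status = ...` line per ring), which may differ between corosync versions; the function name `rings_healthy` and the sample address are made up for the example.

```shell
#!/bin/sh
# rings_healthy
# Reads `corosync-cfgtool -s` style output on stdin. Succeeds only if at
# least one "status =" line is present and every one reports "no faults".
rings_healthy() {
    awk '/status *=/ { n++; if ($0 !~ /no faults/) bad++ }
         END { if (n > 0 && bad == 0) exit 0; exit 1 }'
}

# Self-contained example using canned output (192.0.2.10 is a placeholder).
printf '%s\n' \
    'Printing ring status.' \
    'Local node ID 1' \
    'RING ID 0' \
    '     id = 192.0.2.10' \
    '     status = ring 0 active with no faults' \
    | rings_healthy && echo "rings OK"
```

On a live node the same function would be fed from the real command: `corosync-cfgtool -s | rings_healthy`.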
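Check cluster membership and the quorum API:
# corosync-cmapctl | grep members 
# pcs status corosync 
# pcs status 
List every configurable cluster property:
# pcs property list --all
Check the configuration for problems:
# crm_verify -L -V  
# pcs property set stonith-enabled=false   # no STONITH device is defined, so crm_verify reports errors until this is disabled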
# pcs property list 
# crm_verify -L -V

3.4 Download and install crmsh

# wget crmsh-2.1.4.x86_64.rpm 
# wget pssh-2.3.1-4.2.x86_64.rpm
# wget python-pssh-2.3.1-4.2.x86_64.rpm
# yum install -y *rpm 
Note: installing crmsh on one node is enough, although it can be installed on both
    pssh runs ssh commands in parallel

# crm help 
# crm status 
# crm   # with no arguments, enters the interactive shell
# help start 
>crm(live)node# configure 
>crm(live)configure# help 
>crm(live)configure# show
>crm(live)configure# edit

3.5 Configure a web service with crmsh:
vip:10.180.xxx.xx
httpd

Node1 AND Node2
# yum -y install httpd 
Node1
# echo "<h1>node1.ssjinyao.com</h1>" > /var/www/html/index.html
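# systemctl enable httpd 
Note: on CentOS 6 the service must instead be kept from starting at boot
# systemctl stop httpd  
Node2 
# echo "<h1>node2.ssjinyao.com</h1>" > /var/www/html/index.html
# systemctl enable httpd 
# systemctl stop httpd  
After confirming that each web server can be reached, stop the httpd service again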

# crm    or    # crm ra 
>crm(live)# ra
>crm(live)ra# list systemd
>crm(live)ra# list ocf heartbeat 
>crm(live)ra# info ocf:heartbeat:IPaddr 
>crm(live)ra# cd ..
>crm(live)# configure 
>crm(live)configure# primitive webip ocf:heartbeat:IPaddr params ip=10.xxx.xx.xxx
>crm(live)# status 
# ifconfig   # the VIP should now be configured on one of the nodes
Put the current node into standby mode:
>crm(live)node# standby   # soft offline
>crm(live)node# status
>crm(live)# node online
>crm(live)# configure 
>crm(live)configure# primitive webserver systemd:httpd
>crm(live)configure# verify
>crm(live)configure# commit 
>crm(live)configure# group
>crm(live)configure# group webservice webip webserver 
>crm(live)configure# verify
>crm(live)configure# commit 
# crm node standby 
Note: request the web page and compare the returned content with where the resources are currently running;

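# crm node online 
# crm 
>crm(live)# configure
>crm(live)configure# verify
# crm status 
# systemctl stop pacemaker.service corosync.service 
# crm status 
On CentOS 6, powering one node off outright leaves the surviving node in a partition WITHOUT quorum, so its resources are stopped; in a two-node cluster, tell the cluster to ignore the loss of quorum:
# crm configure 
>crm(live)configure# property no-quorum-policy=ignore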