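When a cluster node fails, the failure is either recoverable or unrecoverable. GBase 8a provides two node states, failure and unavailable, to handle these two cases.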
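failure state
Use this state for recoverable failures, including failures that are still being investigated.
For a node marked failure, the cluster still records the events related to that node, but it no longer checks the state of the node's services and no longer dispatches tasks to it. The node can later be returned to the normal state.
The session below shows a node with a simulated power failure that cannot be recovered in the short term. The node is forcibly switched from OFFLINE to FAILURE.
Note: judging by how long gcadmin takes to run, while the node is OFFLINE the command visibly stalls on the failed node, waiting for the check to time out. Once the node is in FAILURE, the system skips the check and the command finishes instantly.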
[root@rh6-1 ~]# gcadmin
CLUSTER STATE:  ACTIVE
CLUSTER MODE:   NORMAL
=================================================================
|             GBASE COORDINATOR CLUSTER INFORMATION             |
=================================================================
|   NodeName   |     IpAddress     |gcware |gcluster |DataState |
-----------------------------------------------------------------
| coordinator1 |    10.0.2.201     | OPEN  |  OPEN   |    0     |
-----------------------------------------------------------------
================================================================
|                GBASE DATA CLUSTER INFORMATION                |
================================================================
|NodeName |     IpAddress     |  gnode  |syncserver |DataState |
----------------------------------------------------------------
|  node1  |    10.0.2.201     |  OPEN   |   OPEN    |    0     |
----------------------------------------------------------------
|  node2  |    10.0.2.202     | OFFLINE |           |          |
----------------------------------------------------------------
[root@rh6-1 ~]# gcadmin setnodestate 10.0.2.202 failure
current user is not DBA user, please switch user to [gbase]
gcadmin set node state failed
[root@rh6-1 ~]# su - gbase
[gbase@rh6-1 ~]$ gcadmin setnodestate 10.0.2.202 failure
load gbase client dll start ......
load gbase client dll end ......
[gbase@rh6-1 ~]$ gcadmin
CLUSTER STATE:  ACTIVE
CLUSTER MODE:   NORMAL
=================================================================
|             GBASE COORDINATOR CLUSTER INFORMATION             |
=================================================================
|   NodeName   |     IpAddress     |gcware |gcluster |DataState |
-----------------------------------------------------------------
| coordinator1 |    10.0.2.201     | OPEN  |  OPEN   |    0     |
-----------------------------------------------------------------
================================================================
|                GBASE DATA CLUSTER INFORMATION                |
================================================================
|NodeName |     IpAddress     |  gnode  |syncserver |DataState |
----------------------------------------------------------------
|  node1  |    10.0.2.201     |  OPEN   |   OPEN    |    0     |
----------------------------------------------------------------
|  node2  |    10.0.2.202     | FAILURE |           |          |
----------------------------------------------------------------
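Before deciding which state to set, it can help to pull the affected nodes out of the gcadmin output automatically. The function below is only a minimal sketch: `list_abnormal_nodes` is a hypothetical helper name, and it assumes the table layout printed in the session above (other GBase 8a versions may format the output differently).

```shell
#!/bin/sh
# Hypothetical helper, not part of GBase 8a.
# Reads gcadmin output on stdin and prints "<ip> <state>" for every data
# node whose gnode column is anything other than OPEN.
# Assumption: the table layout matches the session shown above.
list_abnormal_nodes() {
    awk -F'|' '
        # Only rows after the data cluster title are data nodes.
        /GBASE DATA CLUSTER INFORMATION/ { in_data = 1; next }
        in_data && NF >= 6 {
            name = $2; ip = $3; state = $4
            gsub(/[ \t]/, "", name)
            gsub(/[ \t]/, "", ip)
            gsub(/[ \t]/, "", state)
            # Skip the column header row and nodes that are healthy.
            if (name != "NodeName" && state != "" && state != "OPEN")
                print ip, state
        }
    '
}
```

A typical call would be `gcadmin | list_abnormal_nodes`. For the first table in the session above it prints the OFFLINE node, and for the second table the same node with the state FAILURE.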
unavailable state
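Set this state when a node is judged to have an unrecoverable failure, especially RAID damage, file system corruption, or data loss.
For a node set to unavailable, the cluster NO LONGER records events, no longer checks the node's state, and no longer dispatches tasks to it. The node cannot be returned to the normal state; the only option is to replace it.
gcadmin setnodestate 10.0.2.202 unavailable
For the detailed procedure, see the article "GBase 8a: Forcing a Node Offline and Node Replacement (replace)".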
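Because unavailable cannot be undone, a small guard around the command may be worth having. The wrapper below is a sketch under stated assumptions, not an official tool: `set_node_state` and the variables `DBA_USER`, `DRY_RUN`, and `CONFIRM_UNAVAILABLE` are hypothetical names introduced here. Only `gcadmin setnodestate <ip> failure` and `gcadmin setnodestate <ip> unavailable` appear in this article; accepting `normal` as a third argument is an assumption based on the statement that a failure node can return to the normal state.

```shell
#!/bin/sh
# Hypothetical wrapper around "gcadmin setnodestate <ip> <state>".
# DBA_USER, DRY_RUN and CONFIRM_UNAVAILABLE are illustrative variables,
# not GBase 8a settings.
set_node_state() {
    ip=$1
    state=$2
    case "$state" in
        failure|unavailable|normal) ;;   # "normal" is an assumed argument
        *) echo "unknown state: $state" >&2; return 2 ;;
    esac
    # unavailable is one-way: the node can only be replaced afterwards.
    if [ "$state" = unavailable ] && [ "${CONFIRM_UNAVAILABLE:-no}" != yes ]; then
        echo "refusing: set CONFIRM_UNAVAILABLE=yes to proceed" >&2
        return 3
    fi
    # gcadmin rejects the command unless it is run by the DBA user.
    if [ "$(id -un)" != "${DBA_USER:-gbase}" ]; then
        echo "switch to user ${DBA_USER:-gbase} first" >&2
        return 4
    fi
    # With DRY_RUN set, print the command instead of running it.
    ${DRY_RUN:+echo} gcadmin setnodestate "$ip" "$state"
}
```

For example, `DRY_RUN=1 set_node_state 10.0.2.202 failure` run as gbase only prints the command, which makes it easy to review before touching the cluster.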