GBASE (南大通用) GBase 8a: the difference between setting a faulty node to failure and to unavailable

When a cluster node fails, the fault is either recoverable or unrecoverable. GBase 8a provides two node states, one for each case.

failure state

Use this state for recoverable faults, including faults that are still being investigated.

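For a node marked failure, the cluster still records events related to that node, but it no longer checks the state of the node's services and no longer dispatches tasks to it. The node can later be restored to the normal state.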

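The following example simulates a power failure on one node that cannot be recovered in the short term, and shows the node being forced from OFFLINE to FAILURE. The first attempt, made as root, is rejected: the command must be run as the DBA user (gbase here).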

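Note the gcadmin execution time. While the node is OFFLINE, gcadmin visibly stalls on the faulty node, waiting for the check to time out. Once the node is FAILURE, the system skips the check and gcadmin returns immediately.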

[root@rh6-1 ~]# gcadmin
CLUSTER STATE:  ACTIVE
CLUSTER MODE:   NORMAL

=================================================================
|             GBASE COORDINATOR CLUSTER INFORMATION             |
=================================================================
|   NodeName   |     IpAddress     |gcware |gcluster |DataState |
-----------------------------------------------------------------
| coordinator1 |    10.0.2.201     | OPEN  |  OPEN   |    0     |
-----------------------------------------------------------------
================================================================
|                GBASE DATA CLUSTER INFORMATION                |
================================================================
|NodeName |     IpAddress     |  gnode  |syncserver |DataState |
----------------------------------------------------------------
|  node1  |    10.0.2.201     |  OPEN   |   OPEN    |    0     |
----------------------------------------------------------------
|  node2  |    10.0.2.202     | OFFLINE |           |          |
----------------------------------------------------------------
[root@rh6-1 ~]# gcadmin setnodestate 10.0.2.202 failure
current user is not DBA user, please switch user to [gbase]
gcadmin set node state failed
[root@rh6-1 ~]# su - gbase
[gbase@rh6-1 ~]$ gcadmin setnodestate 10.0.2.202 failure
load gbase client dll start ......
load gbase client dll end ......

[gbase@rh6-1 ~]$ gcadmin
CLUSTER STATE:  ACTIVE
CLUSTER MODE:   NORMAL

=================================================================
|             GBASE COORDINATOR CLUSTER INFORMATION             |
=================================================================
|   NodeName   |     IpAddress     |gcware |gcluster |DataState |
-----------------------------------------------------------------
| coordinator1 |    10.0.2.201     | OPEN  |  OPEN   |    0     |
-----------------------------------------------------------------
================================================================
|                GBASE DATA CLUSTER INFORMATION                |
================================================================
|NodeName |     IpAddress     |  gnode  |syncserver |DataState |
----------------------------------------------------------------
|  node1  |    10.0.2.201     |  OPEN   |   OPEN    |    0     |
----------------------------------------------------------------
|  node2  |    10.0.2.202     | FAILURE |           |          |
----------------------------------------------------------------
[gbase@rh6-1 ~]$ 
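
When a cluster has many nodes, it can help to pick out the data nodes that are not OPEN before deciding which state to set. The following is a minimal sketch, assuming the gcadmin table layout shown in the transcript above. The function name non_open_data_nodes is hypothetical and is not part of GBase 8a; it only filters text read from standard input.

```shell
# Hypothetical helper (not shipped with GBase 8a).
# Reads gcadmin output on stdin and prints "name ip state" for every data
# node whose gnode column is not OPEN.
# Assumes the column order shown above:
#   | NodeName | IpAddress | gnode | syncserver | DataState |
non_open_data_nodes() {
  awk -F'|' '
    # Only rows after the data cluster banner are data nodes.
    /GBASE DATA CLUSTER INFORMATION/ { in_data = 1; next }
    in_data && NF >= 6 {
      name = $2; ip = $3; state = $4
      gsub(/ /, "", name); gsub(/ /, "", ip); gsub(/ /, "", state)
      # Skip the header row and healthy nodes.
      if (name != "NodeName" && state != "OPEN") print name, ip, state
    }
  '
}
```

Usage, assuming gcadmin is on the PATH: `gcadmin | non_open_data_nodes`. Separator lines contain no `|` characters, so they are ignored without special handling.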

unavailable state

Set this state when a node is judged to have an unrecoverable fault, in particular RAID damage, file system corruption, or data loss.

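For a node set to unavailable, the cluster does NOT record events any more, no longer checks the node's state, and no longer dispatches tasks to it. The node cannot be restored to the normal state; the only option is node replacement.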

gcadmin setnodestate 10.0.2.202 unavailable 

For the detailed procedure, see "GBase 8a 强制节点离线和节点替换replace" (GBase 8a: forcing a node offline and node replacement).
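
Because unavailable cannot be undone while failure can, a small guard around the command may help avoid setting the wrong state by accident. The following is a sketch only: setnodestate_cmd is a hypothetical wrapper, not part of GBase 8a, and it prints the command rather than running it. It covers just the two states discussed in this article and uses the `gcadmin setnodestate <ip> <state>` form shown above.

```shell
# Hypothetical wrapper (not shipped with GBase 8a).
# Prints the gcadmin command for the requested state instead of running it.
#   $1 = node IP, $2 = state, $3 = "yes" to confirm the unavailable state
setnodestate_cmd() {
  ip="$1"; state="$2"; confirm="${3:-}"
  if [ -z "$ip" ]; then
    echo "usage: setnodestate_cmd <ip> <failure|unavailable> [yes]" >&2
    return 2
  fi
  case "$state" in
    failure)
      ;;  # recoverable, no confirmation needed
    unavailable)
      # The node cannot return to normal afterwards, so ask for confirmation.
      if [ "$confirm" != "yes" ]; then
        echo "unavailable cannot be reverted; pass 'yes' to confirm" >&2
        return 1
      fi
      ;;
    *)
      echo "unsupported state: $state" >&2
      return 2
      ;;
  esac
  echo "gcadmin setnodestate $ip $state"
}
```

For example, `setnodestate_cmd 10.0.2.202 unavailable yes` prints the command from this section, which can then be reviewed and run as the DBA user.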