南大通用 GBase 8a: An Introduction to the Active Service Status Detection Mechanism

Posted on June 2, 2022 (updated June 6, 2022) by laozizhu

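GBase 8a maintains cluster state, including each node's services and data consistency, through the gcware cluster. With active checking (supported in all versions), gcware periodically scans the service status of every node. With passive checking (supported in version 9.5.3), each node's services register with gcware and report their status to it. This article covers gcware's active checking mechanism.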

Reference

Service status registration and detection mechanism in the GBase 8a 9.5.3 multi-instance version

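Active detection methods

There are four detection methods, described below. The ssh and socket checks have been supported since the earliest versions. GBase Ping and the SQL check, which is the focus of this article, require 8.6.2Build43R33 or a later release.

All parameters are in the gcware configuration file: corosync.conf for V86 and gcware.conf for V95.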

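ssh service

Host check. This checks whether the node's ssh service is reachable. If the connection fails or times out, the node is judged to be offline.

The ssh port is set by the node_ssh_port: parameter; the default is 22.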

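Service port socket probe

Service check. This connects to the service's port over TCP; if the connection fails, the service is judged to be CLOSE. The port configuration of the three checked services is shown below.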

    gcluster_port: 5258
    gnode_port: 5050
    syncserver_port: 5288
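
As a rough illustration of the socket probe described above, the following Python sketch attempts a TCP connection to each configured port and reports OPEN or CLOSE. It is not gcware's implementation; the function names `tcp_port_open` and `probe_services` and the 3-second connect timeout are assumptions made for this example.

```python
import socket

# Ports as listed in the configuration above.
SERVICE_PORTS = {"gcluster": 5258, "gnode": 5050, "syncserver": 5288}


def tcp_port_open(host, port, timeout_s=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout_s."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False


def probe_services(host, ports=SERVICE_PORTS, timeout_s=3.0):
    """Map each service name to "OPEN" or "CLOSE" based on a TCP connect."""
    return {
        name: "OPEN" if tcp_port_open(host, port, timeout_s) else "CLOSE"
        for name, port in ports.items()
    }
```

Calling `probe_services("10.0.2.115")` would return one OPEN or CLOSE entry per service. A successful connect only shows that the port accepts connections, which is why the SQL check described later exists.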

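Executing the internal Ping command

Pre-execution check. Before each SQL statement is dispatched, a Ping (an internal command) is sent first to check the service status. If the status is abnormal, the task is not dispatched.

This is an internal command and is not intended for external use.

Officially supported from V862Build43R33 onward.

The audit log shows the following: the Ping succeeds first, and then the business statement, a count(*) query, is executed.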

# Threadid=66;
# Taskid=0;
# Time: 220602 14:49:13
# End_time: 220602 14:49:13
# User@Host: root[root] @  [10.0.2.101]
# UID: 1
# Query_time: 0.000032 Rows: 0
# SET timestamp=1654152553;
# administrator command: Ping;
# Sql_type: OTHERS;
# Sql_command: Ping;
# Status: SUCCESS;
# Connect Type: CAPI;

# Threadid=66;
# Taskid=0;
# Time: 220602 14:49:14
# End_time: 220602 14:49:14
# User@Host: root[root] @  [10.0.2.101]
# UID: 1
# Query_time: 0.004065 Rows: 1
# Tables: WRITE: ; READ: `testdb`.`tt_n2`; OTHER: ; ;
# SET timestamp=1654152554;
# Sql_text: SELECT  COUNT(1) FROM `testdb`.`tt_n2`  `vcname000001.testdb.tt`;
# Sql_type: DQL;
# Sql_command: SELECT;
# Status: SUCCESS;
# Connect Type: CAPI;

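SQL check

A SQL statement is dispatched periodically to check the service status. If it does not return before the timeout, the node's service is judged to be abnormal and set to CLOSE, which blocks subsequent SQL from being dispatched to it.

In current versions the check SQL is select 1, and it cannot be changed.

Officially supported from V862Build43R33 onward. If you are unsure about your version, contact technical support.

The parameter to change is check_tcp_only. Its default value is 1, meaning only TCP is checked, that is, the first three methods (SSH, SOCKET, and Ping).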

    check_tcp_only: 1

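Change it as shown below, where 0 switches the SQL check on. inner_connect_read_write_timeout is the time limit, in seconds, for the check SQL: if execution takes longer than this, the node's service is judged to be abnormal and set to CLOSE. The default value is 15.

Warning: be sure to adjust this parameter to your actual environment. Otherwise, where load is already extremely high and SQL is already slow to return, this check can set services to CLOSE frequently and make the workload even heavier.

This parameter suits environments where overall load is not high (peak hours included) and the goal is to guard against performance problems caused by individual nodes failing unexpectedly, for example a service that accepts connections but hangs or performs extremely poorly.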

    check_tcp_only: 0
    inner_connect_read_write_timeout: 5
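
The timeout rule can be pictured with a small Python sketch: run a probe, and treat the service as CLOSE if the probe fails or does not return within the limit. This only illustrates the idea and is not how gcware is implemented; `check_service` and the `probe` callable (standing in for executing `select 1` on a node) are hypothetical names.

```python
import concurrent.futures
import time


def check_service(probe, timeout_s):
    """Return "OPEN" if probe() finishes within timeout_s, otherwise "CLOSE".

    probe is any zero-argument callable; here it stands in for running the
    check SQL against one node (an assumption for this sketch).
    """
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    try:
        future = executor.submit(probe)
        try:
            future.result(timeout=timeout_s)
            return "OPEN"
        except concurrent.futures.TimeoutError:
            return "CLOSE"  # the probe was too slow
        except Exception:
            return "CLOSE"  # the probe itself failed
    finally:
        # Do not block on a probe that is still hanging.
        executor.shutdown(wait=False)


def slow_probe():
    """Simulates a node that answers, but only after 0.3 seconds."""
    time.sleep(0.3)
```

With these definitions, `check_service(slow_probe, timeout_s=0.05)` yields CLOSE while `check_service(slow_probe, timeout_s=1.0)` yields OPEN, which is the same trade-off the warning above describes: too small a limit turns ordinary slowness into a CLOSE.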

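In current versions, the cycle of all periodic checks (including the SQL check) is controlled by the following two parameters. check_interval (default 30 seconds) is the check interval, that is, how often an abnormal node is re-checked to see whether it has recovered. whole_check_interval_num is the number of checks after which one full check (all nodes, all services) is performed. When the whole cluster is healthy, a full check runs every 30*20=600 seconds, or 10 minutes.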

    check_interval: 30
    whole_check_interval_num: 20
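
The arithmetic behind the full-check period can be written out as a tiny Python helper. The function name is made up for illustration; its two arguments mirror the parameters above.

```python
def full_check_period_seconds(check_interval, whole_check_interval_num):
    """Seconds between two full checks when the whole cluster is healthy."""
    return check_interval * whole_check_interval_num


# With the values shown above: 30 * 20 = 600 seconds, i.e. 10 minutes.
period = full_check_period_seconds(30, 20)
print(period, "seconds =", period // 60, "minutes")
```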

SQL check example

Normal operation

A check arrives every 10 minutes, as a complete connect, select, quit sequence.

# Threadid=55;
# Taskid=0;
# Time: 700101  8:00:00
# End_time: 700101  8:00:00
# User@Host: gbase[gbase] @  [10.0.2.101]
# UID: 2
# Query_time: 0.000000 Rows: 0
# SET timestamp=0;
# administrator command: Connect;
# Sql_type: OTHERS;
# Sql_command: Connect;
# Status: SUCCESS;
# Connect Type: CAPI;

# Threadid=55;
# Taskid=0;
# End_time: 220602 14:25:24
# User@Host: gbase[gbase] @  [10.0.2.101]
# UID: 2
# Query_time: 0.000107 Rows: 1
# use gbase;
# Tables: WRITE: ; READ: ; OTHER: ; ;
# SET timestamp=1654151124;
# Sql_text: select 1;
# Sql_type: DQL;
# Sql_command: SELECT;
# Status: SUCCESS;
# Connect Type: CAPI;

# Threadid=55;
# Taskid=0;
# End_time: 220602 14:25:24
# User@Host: gbase[gbase] @  [10.0.2.101]
# UID: 2
# Query_time: 0.000005 Rows: 0
# SET timestamp=1654151124;
# administrator command: Quit;
# Sql_type: OTHERS;
# Sql_command: Quit;
# Status: SUCCESS;
# Connect Type: CAPI;

.......................


# Threadid=57;
# Taskid=0;
# Time: 700101  8:00:00
# End_time: 700101  8:00:00
# User@Host: gbase[gbase] @  [10.0.2.101]
# UID: 2
# Query_time: 0.000000 Rows: 0
# SET timestamp=0;
# administrator command: Connect;
# Sql_type: OTHERS;
# Sql_command: Connect;
# Status: SUCCESS;
# Connect Type: CAPI;

# Threadid=57;
# Taskid=0;
# End_time: 220602 14:35:24
# User@Host: gbase[gbase] @  [10.0.2.101]
# UID: 2
# Query_time: 0.000129 Rows: 1
# use gbase;
# Tables: WRITE: ; READ: ; OTHER: ; ;
# SET timestamp=1654151724;
# Sql_text: select 1;
# Sql_type: DQL;
# Sql_command: SELECT;
# Status: SUCCESS;
# Connect Type: CAPI;

# Threadid=57;
# Taskid=0;
# End_time: 220602 14:35:24
# User@Host: gbase[gbase] @  [10.0.2.101]
# UID: 2
# Query_time: 0.000005 Rows: 0
# SET timestamp=1654151724;
# administrator command: Quit;
# Sql_type: OTHERS;
# Sql_command: Quit;
# Status: SUCCESS;
# Connect Type: CAPI;

Simulating a fault

We reduce the timeout parameter to 1 second to make testing easier.

    check_tcp_only: 0
    inner_connect_read_write_timeout: 1

Use the tc tool to simulate a network fault by setting the network interface's delay to 1000ms.

 tc qdisc  add  dev  enp0s3  root  netem  delay  1000ms

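Viewing the gcware fault detection log

The following output appears: roughly one round of checks every 30 seconds, with 3 attempts per round (controlled by the parameter cfg_check_times_judge_failure).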

Jun 02 14:53:05.713318 ERROR [CLM   ] Inner Connect error ip:10.0.2.115 port:5050 inner_connect_read_write_timeout:1 cfg_check_times_judge_failure :3 Error : Lost connection to GBase server at 'waiting for initial communication packet', system error: 115
Jun 02 14:53:06.715107 ERROR [CLM   ] Inner Connect error ip:10.0.2.115 port:5050 inner_connect_read_write_timeout:1 cfg_check_times_judge_failure :3 Error : Can't connect to GBase server on '10.0.2.115' (4)
Jun 02 14:53:08.717442 ERROR [CLM   ] Inner Connect error ip:10.0.2.115 port:5050 inner_connect_read_write_timeout:1 cfg_check_times_judge_failure :3 Error : Lost connection to GBase server at 'waiting for initial communication packet', system error: 115
Jun 02 14:53:22.695797 ERROR [CLM   ] Inner Connect error ip:10.0.2.115 port:5050 inner_connect_read_write_timeout:1 cfg_check_times_judge_failure :3 Error : Can't connect to GBase server on '10.0.2.115' (4)
Jun 02 14:53:25.700085 ERROR [CLM   ] Inner Connect error ip:10.0.2.115 port:5050 inner_connect_read_write_timeout:1 cfg_check_times_judge_failure :3 Error : Lost connection to GBase server at 'reading authorization packet', system error: 11
Jun 02 14:53:27.703212 ERROR [CLM   ] Inner Connect error ip:10.0.2.115 port:5050 inner_connect_read_write_timeout:1 cfg_check_times_judge_failure :3 Error : Lost connection to GBase server at 'waiting for initial communication packet', system error: 115


Jun 02 14:54:01.714689 ERROR [CLM   ] Inner Connect error ip:10.0.2.115 port:5050 inner_connect_read_write_timeout:1 cfg_check_times_judge_failure :3 Error : Can't connect to GBase server on '10.0.2.115' (4)
Jun 02 14:54:04.718350 ERROR [CLM   ] Inner Connect error ip:10.0.2.115 port:5050 inner_connect_read_write_timeout:1 cfg_check_times_judge_failure :3 Error : Lost connection to GBase server at 'reading authorization packet', system error: 11
Jun 02 14:54:06.721413 ERROR [CLM   ] Inner Connect error ip:10.0.2.115 port:5050 inner_connect_read_write_timeout:1 cfg_check_times_judge_failure :3 Error : Lost connection to GBase server at 'waiting for initial communication packet', system error: 115
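
When reviewing a longer log, it can help to count these errors per node. The sketch below is one way to do that in Python, assuming the line layout matches the sample above; the regular expression and the function name `count_connect_failures` are assumptions for this example, not part of any GBase tool.

```python
import re
from collections import Counter

# Matches the "Inner Connect error" lines in the layout shown above.
LINE_RE = re.compile(
    r"^(?P<ts>\w{3} \d{2} \d{2}:\d{2}:\d{2}\.\d+) ERROR \[CLM\s*\] "
    r"Inner Connect error ip:(?P<ip>\S+) port:(?P<port>\d+) "
    r".*?Error : (?P<msg>.*)$"
)


def count_connect_failures(lines):
    """Count "Inner Connect error" lines per (ip, port); other lines are ignored."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            counts[(match.group("ip"), int(match.group("port")))] += 1
    return counts
```

Passing the lines of the gcware log (for example, an open file object) returns a `Counter` keyed by node address and port, so a node that fails repeatedly stands out at a glance.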

Viewing the cluster status during the fault

The gnode service on the faulty node is CLOSE.

[gbase@gbase_rh7_001 gcluster]$ gcadmin
CLUSTER STATE:         ACTIVE
VIRTUAL CLUSTER MODE:  NORMAL

=============================================================
|           GBASE COORDINATOR CLUSTER INFORMATION           |
=============================================================
|   NodeName   | IpAddress  | gcware | gcluster | DataState |
-------------------------------------------------------------
| coordinator1 | 10.0.2.101 |  OPEN  |   OPEN   |     0     |
-------------------------------------------------------------
=========================================================================================================
|                                    GBASE DATA CLUSTER INFORMATION                                     |
=========================================================================================================
| NodeName |                IpAddress                 | DistributionId | gnode | syncserver | DataState |
---------------------------------------------------------------------------------------------------------
|  node1   |                10.0.2.101                |       5        | OPEN  |    OPEN    |     0     |
---------------------------------------------------------------------------------------------------------
|  node2   |                10.0.2.102                |       5        | OPEN  |    OPEN    |     0     |
---------------------------------------------------------------------------------------------------------
|  node3   |                10.0.2.115                |       5        | CLOSE |    OPEN    |     0     |
---------------------------------------------------------------------------------------------------------

[gbase@gbase_rh7_001 gcluster]$
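
To spot CLOSE services without reading the table by eye, the gcadmin output can be scanned with a short Python sketch. It assumes the table layout shown above (pipe-separated cells with a header row starting with NodeName); `find_closed_services` is a hypothetical helper, not a GBase utility.

```python
def find_closed_services(gcadmin_output):
    """Return (node name, IP address, service column) for every CLOSE cell."""
    header = None
    closed = []
    for line in gcadmin_output.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # separators, prompts, and state lines
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if cells and cells[0] == "NodeName":
            header = cells  # a new table starts here
            continue
        if header is None or len(cells) != len(header):
            continue  # table titles and anything that is not a data row
        row = dict(zip(header, cells))
        for column, value in row.items():
            if value == "CLOSE":
                closed.append((row["NodeName"], row["IpAddress"], column))
    return closed
```

Applied to the output above, this would report node3 at 10.0.2.115 with its gnode column in the CLOSE state.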

Simulating fault recovery

tc qdisc del dev enp0s3 root

Viewing the gcware log after recovery

Jun 02 14:54:37.751090 NOTIC [CLM   ] EXEC request: invalid node del 1929510922
Jun 02 14:54:37.751146 NOTIC [CLM   ] EXEC request: invalid node del delete node: 1929510922
Jun 02 14:54:37.751165 NOTIC [CLM   ] EXEC request: notification_clusterstate_changed clusterstatechange = 1, trackflag = 8, num = 1
Jun 02 14:54:37.751181 NOTIC [CLM   ] nodeid = 1929510922

Viewing the cluster status after recovery

[gbase@gbase_rh7_001 gcluster]$ gcadmin
CLUSTER STATE:         ACTIVE
VIRTUAL CLUSTER MODE:  NORMAL

=============================================================
|           GBASE COORDINATOR CLUSTER INFORMATION           |
=============================================================
|   NodeName   | IpAddress  | gcware | gcluster | DataState |
-------------------------------------------------------------
| coordinator1 | 10.0.2.101 |  OPEN  |   OPEN   |     0     |
-------------------------------------------------------------
=========================================================================================================
|                                    GBASE DATA CLUSTER INFORMATION                                     |
=========================================================================================================
| NodeName |                IpAddress                 | DistributionId | gnode | syncserver | DataState |
---------------------------------------------------------------------------------------------------------
|  node1   |                10.0.2.101                |       5        | OPEN  |    OPEN    |     0     |
---------------------------------------------------------------------------------------------------------
|  node2   |                10.0.2.102                |       5        | OPEN  |    OPEN    |     0     |
---------------------------------------------------------------------------------------------------------
|  node3   |                10.0.2.115                |       5        | OPEN  |    OPEN    |     0     |
---------------------------------------------------------------------------------------------------------

[gbase@gbase_rh7_001 gcluster]$

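Summary

For systems whose day-to-day load is not heavy, the SQL check described in this article can reduce overall slowdowns or hangs caused by performance problems on a few nodes: nodes that exceed the timeout are taken out of service, and they resume service once their performance recovers.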

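This parameter must be set correctly for the conditions on site. In my view, a value of 5 seconds, as used in the configuration example above, is on the small side; when the impact is uncertain, I recommend setting it higher.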

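If possible, collect the actual time that select 1 takes on each gnode node, preferably covering the busiest periods such as weekends, the start of the week, month end, and the start of the month. Take the busy-time latency as the baseline and add a safety factor to arrive at the timeout.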
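
The sizing rule just described (busy-time latency as the baseline, multiplied by a safety factor) can be sketched in Python. The safety factor of 3, the rounding up to whole seconds, and the name `suggest_timeout_seconds` are all assumptions for illustration; choose values that fit your site.

```python
import math


def suggest_timeout_seconds(busy_latencies_s, safety_factor=3.0, floor_s=1):
    """Suggest a timeout from select 1 latencies measured during busy periods.

    The slowest observed latency is the baseline; it is multiplied by the
    safety factor and rounded up to a whole number of seconds.
    """
    if not busy_latencies_s:
        raise ValueError("at least one latency sample is required")
    baseline = max(busy_latencies_s)
    return max(floor_s, math.ceil(baseline * safety_factor))
```

For example, samples of 0.2, 1.4, and 2.5 seconds with a factor of 3 give `suggest_timeout_seconds([0.2, 1.4, 2.5])`, which evaluates to 8.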
