GBase 8a 针对分片表故障自动恢复能力的测试和功能验证

GBase 8a是通过副本来实现高可用的。当某个节点的分片数据出现故障时,集群将自动从副本恢复该分片,本文将测试几种常见的分片故障下的GBase 8a的恢复能力,包括数据不一致,表丢失,表结构不同等。

环境

2节点集群,手工关闭集群monit工具后,通过关停服务来模拟故障。

节点

101和115两个节点

[gbase@gbase_rh7_001 ~]$ gcadmin
CLUSTER STATE:         ACTIVE
VIRTUAL CLUSTER MODE:  NORMAL

=============================================================
|           GBASE COORDINATOR CLUSTER INFORMATION           |
=============================================================
|   NodeName   | IpAddress  | gcware | gcluster | DataState |
-------------------------------------------------------------
| coordinator1 | 10.0.2.101 |  OPEN  |   OPEN   |     0     |
-------------------------------------------------------------
| coordinator2 | 10.0.2.115 |  OPEN  |   OPEN   |     0     |
-------------------------------------------------------------
=========================================================================================================
|                                    GBASE DATA CLUSTER INFORMATION                                     |
=========================================================================================================
| NodeName |                IpAddress                 | DistributionId | gnode | syncserver | DataState |
---------------------------------------------------------------------------------------------------------
|  node1   |                10.0.2.101                |       1        | OPEN  |    OPEN    |     0     |
---------------------------------------------------------------------------------------------------------
|  node2   |                10.0.2.115                |       1        | OPEN  |    OPEN    |     0     |
---------------------------------------------------------------------------------------------------------

版本

采用了9.5.2版本。

gbase> select version();
+-------------------+
| version()         |
+-------------------+
| 9.5.2.44.283f8db1 |
+-------------------+
1 row in set (Elapsed: 00:00:00.00)

数据

创建最简单的单列的表,并插入几行数据。为了确保数据分布到不同的分片,将id做成了分布列。

[gbase@gbase_rh7_001 ~]$ gccli testdb

GBase client 9.5.2.44.283f8db1. Copyright (c) 2004-2022, GBase.  All Rights Reserved.

gbase> create table test1(id int)distributed by('id');
Query OK, 0 rows affected (Elapsed: 00:00:00.27)

gbase> insert into test1 values(1),(2),(3),(4);
Query OK, 4 rows affected (Elapsed: 00:00:00.17)
Records: 4  Duplicates: 0  Warnings: 0

gbase> select * from test1;
+------+
| id   |
+------+
|    2 |
|    4 |
|    1 |
|    3 |
+------+
4 rows in set (Elapsed: 00:00:00.02)

[gbase@gbase_rh7_001 ~]$ gncli testdb

GBase client 9.5.2.44.283f8db1. Copyright (c) 2004-2022, GBase.  All Rights Reserved.

Please modify user password

gbase> select * from test1_n1;
+------+
| id   |
+------+
|    2 |
|    4 |
+------+
2 rows in set (Elapsed: 00:00:00.00)

gbase> select * from test1_n2;
+------+
| id   |
+------+
|    1 |
|    3 |
+------+
2 rows in set (Elapsed: 00:00:00.00)

关闭集群进程监控服务(看门狗)

如不关闭,进程会被自动拉起,影响测试中的数据采集。

[gbase@gbase_rh7_001 ~]$ gcmonit.sh stop
Stopping GCMonit success!

故障1:单纯数据不一致

该情况最常见,分片节点故障,比如离线,网络不通等,导致数据和其它备份节点不同,集群将记录dmlevent。

当故障恢复后,集群通过监控进程gc_recover,发起sync_client同步进程,将数据从副本同步到故障分片。详情可以参考本站的其它文章。

GBase 8a数据库数据不一致dmlevent故障模拟和恢复过程分析

GBase 8a存储类故障DMLStorageEvent恢复日志解析

GBase 8a数据库数据不一致ddlevent故障模拟和恢复过程分析

分片表数据文件损坏

比较常见的是磁盘空间满以及没有电池RAID卡出现断电宕机等,造成数据文件没有完整写入,且当时event无法正常记录。

模拟操作

将数据文件mv走,模拟文件损坏。

[gbase@gbase_rh7_015 test1_n2]$ ll
total 8
-rw------- 1 gbase gbase 91 Feb 14 16:06 C00000.seg.1
-rw------- 1 gbase gbase 88 Feb 14 15:56 C00000.seg.BAK
[gbase@gbase_rh7_015 test1_n2]$ mv C00000.seg.1 C00000.seg.1.BAK

测试时,为了避免数据缓冲的影响,可以重启节点服务,或者集群层release cache


gbase> release cache;
Query OK, 0 rows affected (Elapsed: 00:00:00.00)

查询报错

gbase> select * from test1;
ERROR 1708 (HY000): [10.0.2.115:5050](GBA-02AD-0005)Failed to query in gnode:
DETAIL: Gbase database/table corrupted (testdb.test1_n2.1.75./opt/gbase/gnode/userdata/gbase/testdb/sys_tablespace/test1_n2/C00000.seg.1): I/O operation on /opt/gbase/gnode/userdata/gbase/testdb/sys_tablespace/test1_n2/C00000.seg.1 failed with error - No such file or directory, File name /opt/gbase/gnode/userdata/gbase/testdb/sys_tablespace/test1_n2/C00000.seg.1
SQL: SELECT /*10.0.2.101_31_9_2022-02-14_16:08:13*/ /*+ TID('3473447') */ `vcname000001.
gbase>

gcluster的express.log里报错,并设置了event

设置信息如下

2022-02-14 16:08:13.466 [DDL][WARN ][S:31][Q:57]<DealCorruptError|2687>:Set testdb.test1 StorageEvent successfully, corrupt info: Gbase database/table corrupted 

完整日志如下

2022-02-14 16:08:13.433 [SQLDISP][ERROR][S:31][Q:57]<StoreResult|2277>:ForwardResult failed. reason: HOST(10.0.2.115:5050->74), Gbase database/table corrupted (testdb.test1_n2.1.75./opt/gbase/gnode/userdata/gbase/testdb/sys_tablespace/test1_n2/C00000.seg.1): I/O operation on /opt/gbase/gnode/userdata/gbase/testdb/sys_tablespace/test1_n2/C00000.seg.1 failed with error - No such file or directory, File name /opt/gbase/gnode/userdata/gbase/testdb/sys_tablespace/test1_n2/C00000.seg.1
2022-02-14 16:08:13.436 [SQLDISP][ERROR][S:31][Q:57]<ExecSQL|2567>:StoreResult failed, THD(0x3b54000) HOST(10.0.2.115:5050->74) reason: Gbase database/table corrupted (testdb.test1_n2.1.75./opt/gbase/gnode/userdata/gbase/testdb/sys_tablespace/test1_n2/C00000.seg.1): I/O operation on /opt/gbase/gnode/userdata/gbase/testdb/sys_tablespace/test1_n2/C00000.seg.1 failed with error - No such file or directory, File name /opt/gbase/gnode/userdata/gbase/testdb/sys_tablespace/test1_n2/C00000.seg.1
2022-02-14 16:08:13.436 [DDL][WARN ][S:31][Q:57]<DealCorruptError|2506>:Catch corrupt info from gnode - Gbase database/table corrupted (testdb.test1_n2.1.75./opt/gbase/gnode/userdata/gbase/testdb/sys_tablespace/test1_n2/C00000.seg.1): I/O operation on /opt/gbase/gnode/userdata/gbase/testdb/sys_tablespace/test1_n2/C00000.seg.1 failed with error - No such file or directory, File name /opt/gbase/gnode/userdata/gbase/testdb/sys_tablespace/test1_n2/C00000.seg.1
2022-02-14 16:08:13.466 [DDL][WARN ][S:31][Q:57]<DealCorruptError|2687>:Set testdb.test1 StorageEvent successfully, corrupt info: Gbase database/table corrupted (testdb.test1_n2.1.75./opt/gbase/gnode/userdata/gbase/testdb/sys_tablespace/test1_n2/C00000.seg.1): I/O operation on /opt/gbase/gnode/userdata/gbase/testdb/sys_tablespace/test1_n2/C00000.seg.1 failed with error - No such file or directory, File name /opt/gbase/gnode/userdata/gbase/testdb/sys_tablespace/test1_n2/C00000.seg.1
2022-02-14 16:08:13.466 [EXECTR][ERROR][S:31][Q:57]<DispatchTasks|3248>:Async task error: [10.0.2.115:5050](GBA-02AD-0005)Failed to query in gnode:
DETAIL: Gbase database/table corrupted (testdb.test1_n2.1.75./opt/gbase/gnode/userdata/gbase/testdb/sys_tablespace/test1_n2/C00000.seg.1): I/O operation on /opt/gbase/gnode/userdata/gbase/testdb/sys_tablespace/test1_n2/C00000.seg.1 failed with error - No such file or directory, File name /opt/gbase/gnode/userdata/gbase/testdb/sys_tablespace/test1_n2/C00000.seg.1
SQL: SELECT /*10.0.2.101_31_9_2022-02-14_16:08:13*/ /*+ TID('3473447') */ `vcname000001.testdb.test1`.`id` AS `id` FROM `testdb`.`test1_n2` `vcname000001.testdb.test1`
2022-02-14 16:08:13.469 [EXECTR][ERROR][S:31][Q:57]<ExecutePlan|153>:ExecutePlan failed.

分片表丢失

此情况多由于环境问题导致,比如磁盘空间满,文件系统故障等,也包括人工误操作删除。此时故障节点不存在分片表了。

删掉表分片

通过gncli 登录,删掉test1_n1分片。

gbase> drop table test1_n1;
Query OK, 0 rows affected (Elapsed: 00:00:00.00)

gbase> select * from test1_n1;
ERROR 1146 (42S02): Table 'testdb.test1_n1' doesn't exist

关闭节点101, 默认系统故障

可以看到101的gnode状态为CLOSE

[gbase@gbase_rh7_001 ~]$ gcluster_services gbase stop
Stopping gbase :                                           [  OK  ]
[gbase@gbase_rh7_001 ~]$ gcadmin
CLUSTER STATE:         ACTIVE
VIRTUAL CLUSTER MODE:  NORMAL

=============================================================
|           GBASE COORDINATOR CLUSTER INFORMATION           |
=============================================================
|   NodeName   | IpAddress  | gcware | gcluster | DataState |
-------------------------------------------------------------
| coordinator1 | 10.0.2.101 |  OPEN  |   OPEN   |     0     |
-------------------------------------------------------------
| coordinator2 | 10.0.2.115 |  OPEN  |   OPEN   |     0     |
-------------------------------------------------------------
=========================================================================================================
|                                    GBASE DATA CLUSTER INFORMATION                                     |
=========================================================================================================
| NodeName |                IpAddress                 | DistributionId | gnode | syncserver | DataState |
---------------------------------------------------------------------------------------------------------
|  node1   |                10.0.2.101                |       1        | CLOSE |    OPEN    |     0     |
---------------------------------------------------------------------------------------------------------
|  node2   |                10.0.2.115                |       1        | OPEN  |    OPEN    |     0     |
---------------------------------------------------------------------------------------------------------

默认后续数据操作,触发event

再次向test1表插入数据

[gbase@gbase_rh7_001 ~]$ gccli testdb

GBase client 9.5.2.44.283f8db1. Copyright (c) 2004-2022, GBase.  All Rights Reserved.

gbase> insert into test1 values(5),(6),(7),(8);
Query OK, 4 rows affected (Elapsed: 00:00:00.15)
Records: 4  Duplicates: 0  Warnings: 0

查看event

查看集群状态gcadmin, 看到可以101的DataState为1,有不一致状态。

查看event详情,有1个dmlevent, 对应101节点的test1表的分片n1和n2, 其中n1的表已经被删掉, n2只是单纯的数据不一致。

[gbase@gbase_rh7_001 ~]$ gcadmin
CLUSTER STATE:         ACTIVE
VIRTUAL CLUSTER MODE:  NORMAL

=============================================================
|           GBASE COORDINATOR CLUSTER INFORMATION           |
=============================================================
|   NodeName   | IpAddress  | gcware | gcluster | DataState |
-------------------------------------------------------------
| coordinator1 | 10.0.2.101 |  OPEN  |   OPEN   |     0     |
-------------------------------------------------------------
| coordinator2 | 10.0.2.115 |  OPEN  |   OPEN   |     0     |
-------------------------------------------------------------
=========================================================================================================
|                                    GBASE DATA CLUSTER INFORMATION                                     |
=========================================================================================================
| NodeName |                IpAddress                 | DistributionId | gnode | syncserver | DataState |
---------------------------------------------------------------------------------------------------------
|  node1   |                10.0.2.101                |       1        | CLOSE |    OPEN    |     1     |
---------------------------------------------------------------------------------------------------------
|  node2   |                10.0.2.115                |       1        | OPEN  |    OPEN    |     0     |
---------------------------------------------------------------------------------------------------------

[gbase@gbase_rh7_001 ~]$ gcadmin showdmlevent
Vc event count:1
Event ID:    2
ObjectName: testdb.test1

Fail Data Copy:
------------------------------------------------------
SegName: n1     SCN: 19459      NodeIP: 10.0.2.101      FAILURE
SegName: n2     SCN: 19459      NodeIP: 10.0.2.101      FAILURE


[gbase@gbase_rh7_001 ~]$ gcadmin showddlevent
Vc event count:0
[gbase@gbase_rh7_001 ~]$ gcadmin showdmlstorageevent
Vc event count:0
[gbase@gbase_rh7_001 ~]$ 

启动服务,模拟故障恢复

启动服务后,101节点的gnode服务状态从CLOSE变成了OPEN

[gbase@gbase_rh7_001 ~]$ gcluster_services gbase start
Starting gbase :                                           [  OK  ]
[gbase@gbase_rh7_001 ~]$ 
[gbase@gbase_rh7_001 gcluster]$ gcadmin
CLUSTER STATE:         ACTIVE
VIRTUAL CLUSTER MODE:  NORMAL

=============================================================
|           GBASE COORDINATOR CLUSTER INFORMATION           |
=============================================================
|   NodeName   | IpAddress  | gcware | gcluster | DataState |
-------------------------------------------------------------
| coordinator1 | 10.0.2.101 |  OPEN  |   OPEN   |     0     |
-------------------------------------------------------------
| coordinator2 | 10.0.2.115 |  OPEN  |   OPEN   |     0     |
-------------------------------------------------------------
=========================================================================================================
|                                    GBASE DATA CLUSTER INFORMATION                                     |
=========================================================================================================
| NodeName |                IpAddress                 | DistributionId | gnode | syncserver | DataState |
---------------------------------------------------------------------------------------------------------
|  node1   |                10.0.2.101                |       1        | OPEN  |    OPEN    |     1     |
---------------------------------------------------------------------------------------------------------
|  node2   |                10.0.2.115                |       1        | OPEN  |    OPEN    |     0     |
---------------------------------------------------------------------------------------------------------

查看同步日志

该日志gc_recover.log可能出现在任意一台调度节点的日志目录内,一般是第一台,如发现其日志长时间没有变动,则表示同步的master迁移到其它节点了。

从日志看,期间自动生成了dmlstorageevent, 并重建建了分片表。

DoDmlRecover 609 table not in local data node,but in source gcluster, convert to dmlstorage.

后面从副本分片拿到了表结构

QueryExecute ip 10.0.2.115,port 5050, do "show full create table "testdb"."test1_n1";"

并重建了丢失的分片表

MatchJustOneResult:CREATE TABLE "test1_n1" (
  "id" int(11) DEFAULT NULL
) ENGINE=EXPRESS TID(610) UID(1) DEFAULT CHARSET=utf8 TABLESPACE='sys_tablespace' COLUMN_IDS(0)

之后调用sync_client同步了数据

 QueryExecute ip 10.0.2.101,port 5050, do "start 'gc_sync_client' '10.0.2.115 vcname000001 testdb test1_n1 10.0.2.101 5258 2 0'"

完整的同步信息如下

2022-02-14 10:14:33.240 [INFO ] <session:4, tid:609, db:testdb,tb:test1,nx:n1,et:DML,eid:2,en:node,ip:10.0.2.101>: recoverinfo gn dml: eventid:2 DB(testdb), TABLE(test1), SLICE(1), TID(609), src:1929510922(10.0.2.115) dst:1694629898(10.0.2.101)
2022-02-14 10:14:33.263 [INFO ] <session:4, tid:609, db:testdb,tb:test1,nx:n1,et:DML,eid:2,en:node,ip:10.0.2.101>: PreWriteDMLSBeforeSync eventid 2
2022-02-14 10:14:33.263 [INFO ] <session:4, tid:609, db:testdb,tb:test1,nx:n1,et:DML,eid:2,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.101,port 5258, do "select table_id from information_schema.tables where table_schema='testdb' and table_name='test1' and TABLE_VC='vcname000001'"
2022-02-14 10:14:33.263 [INFO ] <session:4, tid:609, db:testdb,tb:test1,nx:n1,et:DML,eid:2,en:node,ip:10.0.2.101>: MatchJustOneResult:609
2022-02-14 10:14:33.263 [INFO ] <session:4, tid:609, db:testdb,tb:test1,nx:n1,et:DML,eid:2,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.101,port 5050, do "select scn from information_schema.tables where table_schema='testdb' and table_name='test1_n1'"
2022-02-14 10:14:33.264 [ERROR] <session:4, tid:609, db:testdb,tb:test1,nx:n1,et:DML,eid:2,en:node,ip:10.0.2.101>: MatchJustOneResult, row is null, or row number is not one
2022-02-14 10:14:33.264 [ERROR] <session:4, tid:609, db:testdb,tb:test1,nx:n1,et:DML,eid:2,en:node,ip:10.0.2.101>: QueryTableExists null
2022-02-14 10:14:33.264 [INFO ] <session:4, tid:609, db:testdb,tb:test1,nx:n1,et:DML,eid:2,en:node,ip:10.0.2.101>: DoDmlRecover 609 table not in local data node,but in source gcluster, convert to dmlstorage.
2022-02-14 10:14:33.264 [INFO ] <session:4, tid:609, db:testdb,tb:test1,nx:n1,et:DML,eid:2,en:node,ip:10.0.2.101>: delete dml event, table testdb.test1,event id 2 suffix n1
2022-02-14 10:14:33.276 [INFO ] <session:4, tid:609, db:testdb,tb:test1,nx:n1,et:DML,eid:2,en:node,ip:10.0.2.101>: SET_ONLINE, Succeeded in setting local node table 'testdb.test1_n1' online, eventinfo(eventid:2,segId:1)
2022-02-14 10:14:33.276 [INFO ] <session:4, tid:609, db:testdb,tb:test1,nx:n1,et:DML,eid:2,en:node,ip:10.0.2.101>: CheckToClearOrKeepPreWriteDMLS keep eventid 2
2022-02-14 10:14:33.293 [INFO ] <session:4>: DealDMLRecoverUnLock vc00001.testdb.test1
2022-02-14 10:14:33.521 [INFO ] <session:4>: source node is 0x7302000a, table test1, suffix n2
2022-02-14 10:14:33.555 [INFO ] <session:4>: DealDMLRecoverLock vc00001.testdb.test1 success
2022-02-14 10:14:33.564 [INFO ] <session:4, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:2,en:node,ip:10.0.2.101>: recoverinfo gn dml: eventid:2 DB(testdb), TABLE(test1), SLICE(2), TID(609), src:1929510922(10.0.2.115) dst:1694629898(10.0.2.101)
2022-02-14 10:14:33.573 [INFO ] <session:4, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:2,en:node,ip:10.0.2.101>: PreWriteDMLSBeforeSync eventid 3
2022-02-14 10:14:33.573 [INFO ] <session:4, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:2,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.101,port 5258, do "select table_id from information_schema.tables where table_schema='testdb' and table_name='test1' and TABLE_VC='vcname000001'"
2022-02-14 10:14:33.574 [INFO ] <session:4, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:2,en:node,ip:10.0.2.101>: MatchJustOneResult:609
2022-02-14 10:14:33.574 [INFO ] <session:4, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:2,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.101,port 5050, do "select scn from information_schema.tables where table_schema='testdb' and table_name='test1_n2'"
2022-02-14 10:14:33.575 [INFO ] <session:4, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:2,en:node,ip:10.0.2.101>: MatchJustOneResult:19458
2022-02-14 10:14:33.575 [INFO ] <session:4, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:2,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.101,port 5050, do "FLUSH ROLLBACK "testdb"."test1_n2""
2022-02-14 10:14:33.576 [INFO ] <session:4, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:2,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.101,port 5050, do "revert table `testdb`.`test1_n2` scn_number 19459"
2022-02-14 10:14:33.576 [ERROR] <session:4, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:2,en:node,ip:10.0.2.101>: QueryExecute, (1753)(Can't revert table: Table testdb.test1_n2 can not do revert with scn(19459/cur:19458)), do "revert table `testdb`.`test1_n2` scn_number 19459" error
2022-02-14 10:14:33.576 [INFO ] <session:4, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:2,en:node,ip:10.0.2.101>: do revert success!
2022-02-14 10:14:33.576 [INFO ] <session:4, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:2,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.101,port 5050, do "flush transaction_log"
2022-02-14 10:14:33.576 [INFO ] <session:4, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:2,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.115,port 5050, do "flush transaction_log"
2022-02-14 10:14:33.577 [INFO ] <session:4, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:2,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.101,port 5050, do "start 'gc_sync_client' '10.0.2.115 vcname000001 testdb test1_n2 10.0.2.101 5258 2 0'"
2022-02-14 10:14:33.728 [INFO ] <session:4, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:2,en:node,ip:10.0.2.101>: ExecSyncQuery, sync table returned with (0)
2022-02-14 10:14:33.728 [INFO ] <session:4, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:2,en:node,ip:10.0.2.101>: check drop sql thread quit.
2022-02-14 10:14:33.729 [INFO ] <session:4, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:2,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.101,port 5050, do "flush transaction_log"
2022-02-14 10:14:33.729 [INFO ] <session:4, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:2,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.115,port 5050, do "flush transaction_log"
2022-02-14 10:14:33.729 [INFO ] <session:4, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:2,en:node,ip:10.0.2.101>: CALL_SYNC, Sync client executed successfully
2022-02-14 10:14:33.730 [INFO ] <session:4, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:2,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.101,port 5050, do "refresh table "testdb"."test1_n2""
2022-02-14 10:14:33.730 [INFO ] <session:4, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:2,en:node,ip:10.0.2.101>: Sync client returned errcode:[0],errmsg:[success]
2022-02-14 10:14:33.730 [INFO ] <session:4, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:2,en:node,ip:10.0.2.101>: CALL_SYNC, Sync end, return 0
2022-02-14 10:14:33.738 [INFO ] <session:4, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:2,en:node,ip:10.0.2.101>: CHECK_SYNC_RESULT,Checking whether SYNC is succeeded{DB: 'testdb', Table: 'test1', slice: 'n2'}
2022-02-14 10:14:33.738 [INFO ] <session:4, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:2,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.101,port 5050, do "select scn from information_schema.tables where table_schema='testdb' and table_name='test1_n2'"
2022-02-14 10:14:33.738 [INFO ] <session:4, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:2,en:node,ip:10.0.2.101>: MatchJustOneResult:19459
2022-02-14 10:14:33.738 [INFO ] <session:4, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:2,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.115,port 5050, do "refresh table "testdb"."test1_n2""
2022-02-14 10:14:33.739 [INFO ] <session:4, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:2,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.115,port 5050, do "select scn from information_schema.tables where table_schema='testdb' and table_name='test1_n2'"
2022-02-14 10:14:33.740 [INFO ] <session:4, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:2,en:node,ip:10.0.2.101>: MatchJustOneResult:19459
2022-02-14 10:14:33.740 [INFO ] <session:4, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:2,en:node,ip:10.0.2.101>: CHECK_SYNC_RESULT,        recover GNode: 19459
2022-02-14 10:14:33.740 [INFO ] <session:4, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:2,en:node,ip:10.0.2.101>: CHECK_SYNC_RESULT,      source GNode[1]: 19459
2022-02-14 10:14:33.740 [INFO ] <session:4, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:2,en:node,ip:10.0.2.101>: delete dml event, table testdb.test1,event id 2 suffix n2
2022-02-14 10:14:33.750 [INFO ] <session:4, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:2,en:node,ip:10.0.2.101>: SET_ONLINE, Succeeded in setting local node table 'testdb.test1_n2' online, eventinfo(eventid:2,segId:2)
2022-02-14 10:14:33.750 [INFO ] <session:4, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:2,en:node,ip:10.0.2.101>: CheckToClearOrKeepPreWriteDMLS delete eventid 3
2022-02-14 10:14:33.759 [INFO ] <session:4, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:2,en:node,ip:10.0.2.101>: delete dmlstorage event, event id: 3
2022-02-14 10:14:33.769 [INFO ] <session:4>: DealDMLRecoverUnLock vc00001.testdb.test1
2022-02-14 10:14:33.973 [INFO ] <session:4>: End dml recover .,tid 609
2022-02-14 10:14:33.982 [INFO ] <session:4>: UnLock: GCRECOVER_LOCK_TID_vc00001_609 unlock success, 8130053026070134784
2022-02-14 10:14:33.993 [INFO ] <session:4>: Lock: GCRECOVER_LOCK_TID_vc00001_609 end success, 1600173590353805312
2022-02-14 10:14:33.995 [INFO ] <session:4>: Start DMLStorge recover .,tid 609 eventnum 1
2022-02-14 10:14:34.012 [INFO ] <session:4>: QueryExecute ip 10.0.2.101,port 5258, do "select table_id from information_schema.tables where table_schema='testdb' and table_name='test1' and TABLE_VC='vcname000001'"
2022-02-14 10:14:34.013 [INFO ] <session:4>: MatchJustOneResult:609
2022-02-14 10:14:34.027 [INFO ] <session:4>: DealDMLRecoverLock vc00001.testdb.test1 success
2022-02-14 10:14:34.035 [INFO ] <session:4,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:2,en:node,ip:10.0.2.101>: recoverinfo gn(dmls): eventid=2, tablename=testdb.test1, src:0x1929510922(10.0.2.115) dst:0x1694629898(10.0.2.101)
2022-02-14 10:14:34.035 [INFO ] <session:4,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:2,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.101,port 5050, do "DROP SELF TABLE IF EXISTS "testdb"."test1_n1";"
2022-02-14 10:14:34.035 [INFO ] <session:4,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:2,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.115,port 5050, do "select scn from information_schema.tables where table_schema='testdb' and table_name='test1_n1'"
2022-02-14 10:14:34.036 [INFO ] <session:4,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:2,en:node,ip:10.0.2.101>: MatchJustOneResult:19459
2022-02-14 10:14:34.036 [INFO ] <session:4,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:2,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.115,port 5050, do "select scn from information_schema.tables where table_schema='testdb' and table_name='test1_n1'"
2022-02-14 10:14:34.037 [INFO ] <session:4,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:2,en:node,ip:10.0.2.101>: MatchJustOneResult:19459
2022-02-14 10:14:34.037 [INFO ] <session:4,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:2,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.115,port 5050, do "show full create table "testdb"."test1_n1";"
2022-02-14 10:14:34.037 [INFO ] <session:4,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:2,en:node,ip:10.0.2.101>: MatchJustOneResult:CREATE TABLE "test1_n1" (
  "id" int(11) DEFAULT NULL
) ENGINE=EXPRESS TID(610) UID(1) DEFAULT CHARSET=utf8 TABLESPACE='sys_tablespace' COLUMN_IDS(0)
2022-02-14 10:14:34.037 [INFO ] <session:4,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:2,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.101,port 5050, do "use self "testdb";"
2022-02-14 10:14:34.038 [INFO ] <session:4,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:2,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.101,port 5050, do "SET SELF SCN = 19459"
2022-02-14 10:14:34.038 [INFO ] <session:4,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:2,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.101,port 5050, do "CREATE SELF TABLE "test1_n1" (
  "id" int(11) DEFAULT NULL
) ENGINE=EXPRESS TID(610) UID(1) DEFAULT CHARSET=utf8 TABLESPACE='sys_tablespace' COLUMN_IDS(0)"
2022-02-14 10:14:34.060 [INFO ] <session:4,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:2,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.101,port 5050, do "flush transaction_log"
2022-02-14 10:14:34.060 [INFO ] <session:4,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:2,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.115,port 5050, do "flush transaction_log"
2022-02-14 10:14:34.061 [INFO ] <session:4,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:2,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.101,port 5050, do "start 'gc_sync_client' '10.0.2.115 vcname000001 testdb test1_n1 10.0.2.101 5258 2 0'"
2022-02-14 10:14:34.374 [INFO ] <session:4,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:2,en:node,ip:10.0.2.101>: ExecSyncQuery, sync table returned with (0)
2022-02-14 10:14:34.374 [INFO ] <session:4,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:2,en:node,ip:10.0.2.101>: check drop sql thread quit.
2022-02-14 10:14:34.374 [INFO ] <session:4,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:2,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.101,port 5050, do "flush transaction_log"
2022-02-14 10:14:34.375 [INFO ] <session:4,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:2,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.115,port 5050, do "flush transaction_log"
2022-02-14 10:14:34.375 [INFO ] <session:4,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:2,en:node,ip:10.0.2.101>: CALL_SYNC, Sync client executed successfully
2022-02-14 10:14:34.375 [INFO ] <session:4,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:2,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.101,port 5050, do "refresh table "testdb"."test1_n1""
2022-02-14 10:14:34.376 [INFO ] <session:4,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:2,en:node,ip:10.0.2.101>: Sync client returned errcode:[0],errmsg:[success]
2022-02-14 10:14:34.376 [INFO ] <session:4,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:2,en:node,ip:10.0.2.101>: CALL_SYNC, Sync end, return 0
2022-02-14 10:14:34.380 [INFO ] <session:4,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:2,en:node,ip:10.0.2.101>: CHECK_SYNC_RESULT,Checking whether SYNC is succeeded{DB: 'testdb', Table: 'test1', slice: 'n1'}
2022-02-14 10:14:34.380 [INFO ] <session:4,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:2,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.101,port 5050, do "select scn from information_schema.tables where table_schema='testdb' and table_name='test1_n1'"
2022-02-14 10:14:34.381 [INFO ] <session:4,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:2,en:node,ip:10.0.2.101>: MatchJustOneResult:19459
2022-02-14 10:14:34.381 [INFO ] <session:4,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:2,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.115,port 5050, do "refresh table "testdb"."test1_n1""
2022-02-14 10:14:34.382 [INFO ] <session:4,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:2,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.115,port 5050, do "select scn from information_schema.tables where table_schema='testdb' and table_name='test1_n1'"
2022-02-14 10:14:34.383 [INFO ] <session:4,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:2,en:node,ip:10.0.2.101>: MatchJustOneResult:19459
2022-02-14 10:14:34.383 [INFO ] <session:4,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:2,en:node,ip:10.0.2.101>: CHECK_SYNC_RESULT,        recover GNode: 19459
2022-02-14 10:14:34.383 [INFO ] <session:4,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:2,en:node,ip:10.0.2.101>: CHECK_SYNC_RESULT,      source GNode[1]: 19459
2022-02-14 10:14:34.383 [INFO ] <session:4,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:2,en:node,ip:10.0.2.101>: Delete DMLStorge eventId 2
2022-02-14 10:14:34.396 [INFO ] <session:4,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:2,en:node,ip:10.0.2.101>: delete dmlstorage event, event id: 2
2022-02-14 10:14:34.404 [INFO ] <session:4>: DealDMLSRecoverUnLock vc00001.testdb.test1
2022-02-14 10:14:34.817 [INFO ] <session:4,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:2,en:node,ip:10.0.2.101>: UnLock: GCRECOVER_LOCK_TID_vc00001_609 unlock success, 1600173590353805312

表丢失时自动恢复的总结

当集群同步进程发现目标表丢失时,会自动从数据不一致的dmlevent升级到存储不一致的dmlstorageevent, 然后从副本获得表结构,重建建立分片表,然后再同步数据实现整个恢复过程。

分片表的表结构不一致

此种情况,在普通场景是不会出现的,除了非常古老版本的产品BUG,在新版本里,最常见的是磁盘满了,或者异常断电,导致分片没有完成表结构变动,同时,因为磁盘满了或异常断电,连event都没有被正确记录到物理磁盘上。

构造表结构不一致

通过在分片表,增加一个字段来模拟主副分片的表结构不一致。

gncli ......

gbase> desc test1_n1;
+-------+---------+------+-----+---------+-------+
| Field | Type    | Null | Key | Default | Extra |
+-------+---------+------+-----+---------+-------+
| id    | int(11) | YES  |     | NULL    |       |
+-------+---------+------+-----+---------+-------+
1 row in set (Elapsed: 00:00:00.01)

gbase> alter table test1_n1 add column name varchar(100);\
Query OK, 5 rows affected (Elapsed: 00:00:00.07)
Records: 5  Duplicates: 0  Warnings: 0

gbase> desc test1_n1;
+-------+--------------+------+-----+---------+-------+
| Field | Type         | Null | Key | Default | Extra |
+-------+--------------+------+-----+---------+-------+
| id    | int(11)      | YES  |     | NULL    |       |
| name  | varchar(100) | YES  |     | NULL    |       |
+-------+--------------+------+-----+---------+-------+
2 rows in set (Elapsed: 00:00:00.00)

gbase> desc test1_n2;
+-------+---------+------+-----+---------+-------+
| Field | Type    | Null | Key | Default | Extra |
+-------+---------+------+-----+---------+-------+
| id    | int(11) | YES  |     | NULL    |       |
+-------+---------+------+-----+---------+-------+
1 row in set (Elapsed: 00:00:00.00)

模拟故障

停掉服务模拟故障,变动数据生成event,启动服务模拟故障恢复。

[gbase@gbase_rh7_001 gcluster]$ gcluster_services gbase stop
Stopping gbase :                                           [  OK  ]
[gbase@gbase_rh7_001 gcluster]$ gcadmin
CLUSTER STATE:         ACTIVE
VIRTUAL CLUSTER MODE:  NORMAL

=============================================================
|           GBASE COORDINATOR CLUSTER INFORMATION           |
=============================================================
|   NodeName   | IpAddress  | gcware | gcluster | DataState |
-------------------------------------------------------------
| coordinator1 | 10.0.2.101 |  OPEN  |   OPEN   |     0     |
-------------------------------------------------------------
| coordinator2 | 10.0.2.115 |  OPEN  |   OPEN   |     0     |
-------------------------------------------------------------
=========================================================================================================
|                                    GBASE DATA CLUSTER INFORMATION                                     |
=========================================================================================================
| NodeName |                IpAddress                 | DistributionId | gnode | syncserver | DataState |
---------------------------------------------------------------------------------------------------------
|  node1   |                10.0.2.101                |       1        | CLOSE |    OPEN    |     0     |
---------------------------------------------------------------------------------------------------------
|  node2   |                10.0.2.115                |       1        | OPEN  |    OPEN    |     0     |
---------------------------------------------------------------------------------------------------------

[gbase@gbase_rh7_001 gcluster]$ gccli testdb

GBase client 9.5.2.44.283f8db1. Copyright (c) 2004-2022, GBase.  All Rights Reserved.

gbase> insert into test1 values(9),(10),(11),(12);
Query OK, 4 rows affected (Elapsed: 00:00:00.15)
Records: 4  Duplicates: 0  Warnings: 0

gbase> ^CAborted
[gbase@gbase_rh7_001 gcluster]$ gcadmin
CLUSTER STATE:         ACTIVE
VIRTUAL CLUSTER MODE:  NORMAL

=============================================================
|           GBASE COORDINATOR CLUSTER INFORMATION           |
=============================================================
|   NodeName   | IpAddress  | gcware | gcluster | DataState |
-------------------------------------------------------------
| coordinator1 | 10.0.2.101 |  OPEN  |   OPEN   |     0     |
-------------------------------------------------------------
| coordinator2 | 10.0.2.115 |  OPEN  |   OPEN   |     0     |
-------------------------------------------------------------
=========================================================================================================
|                                    GBASE DATA CLUSTER INFORMATION                                     |
=========================================================================================================
| NodeName |                IpAddress                 | DistributionId | gnode | syncserver | DataState |
---------------------------------------------------------------------------------------------------------
|  node1   |                10.0.2.101                |       1        | CLOSE |    OPEN    |     1     |
---------------------------------------------------------------------------------------------------------
|  node2   |                10.0.2.115                |       1        | OPEN  |    OPEN    |     0     |
---------------------------------------------------------------------------------------------------------

[gbase@gbase_rh7_001 gcluster]$
[gbase@gbase_rh7_001 gcluster]$ gcadmin showdmlevent
Vc event count:1
Event ID:    3
ObjectName: testdb.test1

Fail Data Copy:
------------------------------------------------------
SegName: n1     SCN: 19460      NodeIP: 10.0.2.101      FAILURE
SegName: n2     SCN: 19460      NodeIP: 10.0.2.101      FAILURE


[gbase@gbase_rh7_001 gcluster]$ gcadmin showddlevent
Vc event count:0
[gbase@gbase_rh7_001 gcluster]$ gcadmin showdmlstorageevent
Vc event count:0

[gbase@gbase_rh7_001 gcluster]$ gcluster_services gbase start
Starting gbase :                                           [  OK  ]
[gbase@gbase_rh7_001 gcluster]$ 

查看同步日志

集群同步服务先调用sync_client进行数据恢复

2022-02-14 10:24:32.732 [INFO ] <session:6, tid:609, db:testdb,tb:test1,nx:n1,et:DML,eid:3,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.101,port 5050, do "start 'gc_sync_client' '10.0.2.115 vcname000001 testdb test1_n1 10.0.2.101 5258 2 0'"

恢复数报错

2022-02-14 10:24:32.784 [ERROR] <session:6, tid:609, db:testdb,tb:test1,nx:n1,et:DML,eid:3,en:node,ip:10.0.2.101>: QueryExecute, (1733)((GBA-01EX-700) Gbase general error: gc_sync_client returned error code 30.), do "start 'gc_sync_client' '10.0.2.115 vcname000001 testdb test1_n1 10.0.2.101 5258 2 0'" error

报错原因是表结构不一致

2022-02-14 10:24:32.787 [INFO ] <session:6, tid:609, db:testdb,tb:test1,nx:n1,et:DML,eid:3,en:node,ip:10.0.2.101>: Sync client returned errcode:[30],errmsg:[show full create table for sync is different.]

删掉了不一致的分片表

2022-02-14 10:24:33.674 [INFO ] <session:6,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:4,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.101,port 5050, do "DROP SELF TABLE IF EXISTS "testdb"."test1_n1";"

从副本拿到了表结构

2022-02-14 10:24:33.677 [INFO ] <session:6,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:4,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.115,port 5050, do "show full create table "testdb"."test1_n1";"

创建分片表

2022-02-14 10:24:33.678 [INFO ] <session:6,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:4,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.101,port 5050, do "CREATE SELF TABLE "test1_n1" (
  "id" int(11) DEFAULT NULL
) ENGINE=EXPRESS TID(610) UID(1) DEFAULT CHARSET=utf8 TABLESPACE='sys_tablespace' COLUMN_IDS(0)"

调用sync_client恢复数据

2022-02-14 10:24:33.708 [INFO ] <session:6,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:4,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.101,port 5050, do "start 'gc_sync_client' '10.0.2.115 vcname000001 testdb test1_n1 10.0.2.101 5258 2 0'"

完整的恢复日志如下

2022-02-14 10:24:32.659 [INFO ] <session:6>: Lock: GCRECOVER_LOCK_TID_vc00001_609 end success, 1091491459250520064
2022-02-14 10:24:32.660 [INFO ] <session:6>: Start dml recover .,tid 609, eventid 3, Num 2
2022-02-14 10:24:32.669 [INFO ] <session:6>: GetSourceNodeForDMl for nodeId(1694629898), the node is inValid, uValid=1
2022-02-14 10:24:32.669 [INFO ] <session:6>: source node is 0x7302000a, table test1, suffix n1
2022-02-14 10:24:32.705 [INFO ] <session:6>: DealDMLRecoverLock vc00001.testdb.test1 success
2022-02-14 10:24:32.707 [INFO ] <session:6, tid:609, db:testdb,tb:test1,nx:n1,et:DML,eid:3,en:node,ip:10.0.2.101>: recoverinfo gn dml: eventid:3 DB(testdb), TABLE(test1), SLICE(1), TID(609), src:1929510922(10.0.2.115) dst:1694629898(10.0.2.101)
2022-02-14 10:24:32.726 [INFO ] <session:6, tid:609, db:testdb,tb:test1,nx:n1,et:DML,eid:3,en:node,ip:10.0.2.101>: PreWriteDMLSBeforeSync eventid 4
2022-02-14 10:24:32.726 [INFO ] <session:6, tid:609, db:testdb,tb:test1,nx:n1,et:DML,eid:3,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.101,port 5258, do "select table_id from information_schema.tables where table_schema='testdb' and table_name='test1' and TABLE_VC='vcname000001'"
2022-02-14 10:24:32.728 [INFO ] <session:6, tid:609, db:testdb,tb:test1,nx:n1,et:DML,eid:3,en:node,ip:10.0.2.101>: MatchJustOneResult:609
2022-02-14 10:24:32.728 [INFO ] <session:6, tid:609, db:testdb,tb:test1,nx:n1,et:DML,eid:3,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.101,port 5050, do "select scn from information_schema.tables where table_schema='testdb' and table_name='test1_n1'"
2022-02-14 10:24:32.730 [INFO ] <session:6, tid:609, db:testdb,tb:test1,nx:n1,et:DML,eid:3,en:node,ip:10.0.2.101>: MatchJustOneResult:0
2022-02-14 10:24:32.730 [INFO ] <session:6, tid:609, db:testdb,tb:test1,nx:n1,et:DML,eid:3,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.101,port 5050, do "FLUSH ROLLBACK "testdb"."test1_n1""
2022-02-14 10:24:32.731 [INFO ] <session:6, tid:609, db:testdb,tb:test1,nx:n1,et:DML,eid:3,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.101,port 5050, do "revert table `testdb`.`test1_n1` scn_number 19460"
2022-02-14 10:24:32.732 [ERROR] <session:6, tid:609, db:testdb,tb:test1,nx:n1,et:DML,eid:3,en:node,ip:10.0.2.101>: QueryExecute, (1753)(Can't revert table: Table testdb.test1_n1 can not do revert with scn(19460/cur:0)), do "revert table `testdb`.`test1_n1` scn_number 19460" error
2022-02-14 10:24:32.732 [INFO ] <session:6, tid:609, db:testdb,tb:test1,nx:n1,et:DML,eid:3,en:node,ip:10.0.2.101>: do revert success!
2022-02-14 10:24:32.732 [INFO ] <session:6, tid:609, db:testdb,tb:test1,nx:n1,et:DML,eid:3,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.101,port 5050, do "flush transaction_log"
2022-02-14 10:24:32.732 [INFO ] <session:6, tid:609, db:testdb,tb:test1,nx:n1,et:DML,eid:3,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.115,port 5050, do "flush transaction_log"
2022-02-14 10:24:32.732 [INFO ] <session:6, tid:609, db:testdb,tb:test1,nx:n1,et:DML,eid:3,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.101,port 5050, do "start 'gc_sync_client' '10.0.2.115 vcname000001 testdb test1_n1 10.0.2.101 5258 2 0'"
2022-02-14 10:24:32.784 [ERROR] <session:6, tid:609, db:testdb,tb:test1,nx:n1,et:DML,eid:3,en:node,ip:10.0.2.101>: QueryExecute, (1733)((GBA-01EX-700) Gbase general error: gc_sync_client returned error code 30.), do "start 'gc_sync_client' '10.0.2.115 vcname000001 testdb test1_n1 10.0.2.101 5258 2 0'" error
2022-02-14 10:24:32.784 [ERROR] <session:6, tid:609, db:testdb,tb:test1,nx:n1,et:DML,eid:3,en:node,ip:10.0.2.101>: QuerySyncCommand, Sync error
2022-02-14 10:24:32.784 [INFO ] <session:6, tid:609, db:testdb,tb:test1,nx:n1,et:DML,eid:3,en:node,ip:10.0.2.101>: ExecSyncQuery, sync table returned with (30)
2022-02-14 10:24:32.785 [INFO ] <session:6, tid:609, db:testdb,tb:test1,nx:n1,et:DML,eid:3,en:node,ip:10.0.2.101>: check drop sql thread quit.
2022-02-14 10:24:32.785 [INFO ] <session:6, tid:609, db:testdb,tb:test1,nx:n1,et:DML,eid:3,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.101,port 5050, do "flush transaction_log"
2022-02-14 10:24:32.786 [INFO ] <session:6, tid:609, db:testdb,tb:test1,nx:n1,et:DML,eid:3,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.115,port 5050, do "flush transaction_log"
2022-02-14 10:24:32.786 [ERROR] <session:6, tid:609, db:testdb,tb:test1,nx:n1,et:DML,eid:3,en:node,ip:10.0.2.101>: CALL_SYNC, Sync client return ERR_SHOW_CREATE_TABLE_IS_DIFF
2022-02-14 10:24:32.787 [INFO ] <session:6, tid:609, db:testdb,tb:test1,nx:n1,et:DML,eid:3,en:node,ip:10.0.2.101>: Sync client returned errcode:[30],errmsg:[show full create table for sync is different.]
2022-02-14 10:24:32.787 [INFO ] <session:6, tid:609, db:testdb,tb:test1,nx:n1,et:DML,eid:3,en:node,ip:10.0.2.101>: CALL_SYNC, Sync end, return 26
2022-02-14 10:24:32.790 [INFO ] <session:6, tid:609, db:testdb,tb:test1,nx:n1,et:DML,eid:3,en:node,ip:10.0.2.101>: delete dml event, table testdb.test1,event id 3 suffix n1
2022-02-14 10:24:32.799 [INFO ] <session:6, tid:609, db:testdb,tb:test1,nx:n1,et:DML,eid:3,en:node,ip:10.0.2.101>: SET_ONLINE, Succeeded in setting local node table 'testdb.test1_n1' online, eventinfo(eventid:3,segId:1)
2022-02-14 10:24:32.799 [INFO ] <session:6, tid:609, db:testdb,tb:test1,nx:n1,et:DML,eid:3,en:node,ip:10.0.2.101>: CheckToClearOrKeepPreWriteDMLS keep eventid 4
2022-02-14 10:24:32.809 [INFO ] <session:6>: DealDMLRecoverUnLock vc00001.testdb.test1
2022-02-14 10:24:33.025 [INFO ] <session:6>: source node is 0x7302000a, table test1, suffix n2
2022-02-14 10:24:33.071 [INFO ] <session:6>: DealDMLRecoverLock vc00001.testdb.test1 success
2022-02-14 10:24:33.074 [INFO ] <session:6, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:3,en:node,ip:10.0.2.101>: recoverinfo gn dml: eventid:3 DB(testdb), TABLE(test1), SLICE(2), TID(609), src:1929510922(10.0.2.115) dst:1694629898(10.0.2.101)
2022-02-14 10:24:33.087 [INFO ] <session:6, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:3,en:node,ip:10.0.2.101>: PreWriteDMLSBeforeSync eventid 5
2022-02-14 10:24:33.088 [INFO ] <session:6, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:3,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.101,port 5258, do "select table_id from information_schema.tables where table_schema='testdb' and table_name='test1' and TABLE_VC='vcname000001'"
2022-02-14 10:24:33.089 [INFO ] <session:6, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:3,en:node,ip:10.0.2.101>: MatchJustOneResult:609
2022-02-14 10:24:33.089 [INFO ] <session:6, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:3,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.101,port 5050, do "select scn from information_schema.tables where table_schema='testdb' and table_name='test1_n2'"
2022-02-14 10:24:33.090 [INFO ] <session:6, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:3,en:node,ip:10.0.2.101>: MatchJustOneResult:19459
2022-02-14 10:24:33.090 [INFO ] <session:6, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:3,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.101,port 5050, do "FLUSH ROLLBACK "testdb"."test1_n2""
2022-02-14 10:24:33.091 [INFO ] <session:6, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:3,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.101,port 5050, do "revert table `testdb`.`test1_n2` scn_number 19460"
2022-02-14 10:24:33.092 [ERROR] <session:6, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:3,en:node,ip:10.0.2.101>: QueryExecute, (1753)(Can't revert table: Table testdb.test1_n2 can not do revert with scn(19460/cur:19459)), do "revert table `testdb`.`test1_n2` scn_number 19460" error
2022-02-14 10:24:33.092 [INFO ] <session:6, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:3,en:node,ip:10.0.2.101>: do revert success!
2022-02-14 10:24:33.092 [INFO ] <session:6, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:3,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.101,port 5050, do "flush transaction_log"
2022-02-14 10:24:33.093 [INFO ] <session:6, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:3,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.115,port 5050, do "flush transaction_log"
2022-02-14 10:24:33.093 [INFO ] <session:6, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:3,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.101,port 5050, do "start 'gc_sync_client' '10.0.2.115 vcname000001 testdb test1_n2 10.0.2.101 5258 2 0'"
2022-02-14 10:24:33.393 [INFO ] <session:6, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:3,en:node,ip:10.0.2.101>: ExecSyncQuery, sync table returned with (0)
2022-02-14 10:24:33.393 [INFO ] <session:6, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:3,en:node,ip:10.0.2.101>: check drop sql thread quit.
2022-02-14 10:24:33.393 [INFO ] <session:6, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:3,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.101,port 5050, do "flush transaction_log"
2022-02-14 10:24:33.394 [INFO ] <session:6, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:3,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.115,port 5050, do "flush transaction_log"
2022-02-14 10:24:33.394 [INFO ] <session:6, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:3,en:node,ip:10.0.2.101>: CALL_SYNC, Sync client executed successfully
2022-02-14 10:24:33.394 [INFO ] <session:6, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:3,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.101,port 5050, do "refresh table "testdb"."test1_n2""
2022-02-14 10:24:33.395 [INFO ] <session:6, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:3,en:node,ip:10.0.2.101>: Sync client returned errcode:[0],errmsg:[success]
2022-02-14 10:24:33.395 [INFO ] <session:6, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:3,en:node,ip:10.0.2.101>: CALL_SYNC, Sync end, return 0
2022-02-14 10:24:33.399 [INFO ] <session:6, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:3,en:node,ip:10.0.2.101>: CHECK_SYNC_RESULT,Checking whether SYNC is succeeded{DB: 'testdb', Table: 'test1', slice: 'n2'}
2022-02-14 10:24:33.399 [INFO ] <session:6, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:3,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.101,port 5050, do "select scn from information_schema.tables where table_schema='testdb' and table_name='test1_n2'"
2022-02-14 10:24:33.400 [INFO ] <session:6, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:3,en:node,ip:10.0.2.101>: MatchJustOneResult:19460
2022-02-14 10:24:33.400 [INFO ] <session:6, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:3,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.115,port 5050, do "refresh table "testdb"."test1_n2""
2022-02-14 10:24:33.401 [INFO ] <session:6, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:3,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.115,port 5050, do "select scn from information_schema.tables where table_schema='testdb' and table_name='test1_n2'"
2022-02-14 10:24:33.401 [INFO ] <session:6, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:3,en:node,ip:10.0.2.101>: MatchJustOneResult:19460
2022-02-14 10:24:33.402 [INFO ] <session:6, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:3,en:node,ip:10.0.2.101>: CHECK_SYNC_RESULT,        recover GNode: 19460
2022-02-14 10:24:33.402 [INFO ] <session:6, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:3,en:node,ip:10.0.2.101>: CHECK_SYNC_RESULT,      source GNode[1]: 19460
2022-02-14 10:24:33.402 [INFO ] <session:6, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:3,en:node,ip:10.0.2.101>: delete dml event, table testdb.test1,event id 3 suffix n2
2022-02-14 10:24:33.411 [INFO ] <session:6, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:3,en:node,ip:10.0.2.101>: SET_ONLINE, Succeeded in setting local node table 'testdb.test1_n2' online, eventinfo(eventid:3,segId:2)
2022-02-14 10:24:33.411 [INFO ] <session:6, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:3,en:node,ip:10.0.2.101>: CheckToClearOrKeepPreWriteDMLS delete eventid 5
2022-02-14 10:24:33.419 [INFO ] <session:6, tid:609, db:testdb,tb:test1,nx:n2,et:DML,eid:3,en:node,ip:10.0.2.101>: delete dmlstorage event, event id: 5
2022-02-14 10:24:33.430 [INFO ] <session:6>: DealDMLRecoverUnLock vc00001.testdb.test1
2022-02-14 10:24:33.631 [INFO ] <session:6>: End dml recover .,tid 609
2022-02-14 10:24:33.633 [INFO ] <session:6>: UnLock: GCRECOVER_LOCK_TID_vc00001_609 unlock success, 1091491459250520064
2022-02-14 10:24:33.636 [INFO ] <session:6>: Lock: GCRECOVER_LOCK_TID_vc00001_609 end success, 2113927811545694208
2022-02-14 10:24:33.636 [INFO ] <session:6>: Start DMLStorge recover .,tid 609 eventnum 1
2022-02-14 10:24:33.650 [INFO ] <session:6>: QueryExecute ip 10.0.2.101,port 5258, do "select table_id from information_schema.tables where table_schema='testdb' and table_name='test1' and TABLE_VC='vcname000001'"
2022-02-14 10:24:33.651 [INFO ] <session:6>: MatchJustOneResult:609
2022-02-14 10:24:33.671 [INFO ] <session:6>: DealDMLRecoverLock vc00001.testdb.test1 success
2022-02-14 10:24:33.674 [INFO ] <session:6,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:4,en:node,ip:10.0.2.101>: recoverinfo gn(dmls): eventid=4, tablename=testdb.test1, src:0x1929510922(10.0.2.115) dst:0x1694629898(10.0.2.101)
2022-02-14 10:24:33.674 [INFO ] <session:6,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:4,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.101,port 5050, do "DROP SELF TABLE IF EXISTS "testdb"."test1_n1";"
2022-02-14 10:24:33.675 [INFO ] <session:6,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:4,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.115,port 5050, do "select scn from information_schema.tables where table_schema='testdb' and table_name='test1_n1'"
2022-02-14 10:24:33.676 [INFO ] <session:6,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:4,en:node,ip:10.0.2.101>: MatchJustOneResult:19460
2022-02-14 10:24:33.676 [INFO ] <session:6,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:4,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.115,port 5050, do "select scn from information_schema.tables where table_schema='testdb' and table_name='test1_n1'"
2022-02-14 10:24:33.677 [INFO ] <session:6,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:4,en:node,ip:10.0.2.101>: MatchJustOneResult:19460
2022-02-14 10:24:33.677 [INFO ] <session:6,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:4,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.115,port 5050, do "show full create table "testdb"."test1_n1";"
2022-02-14 10:24:33.677 [INFO ] <session:6,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:4,en:node,ip:10.0.2.101>: MatchJustOneResult:CREATE TABLE "test1_n1" (
  "id" int(11) DEFAULT NULL
) ENGINE=EXPRESS TID(610) UID(1) DEFAULT CHARSET=utf8 TABLESPACE='sys_tablespace' COLUMN_IDS(0)
2022-02-14 10:24:33.677 [INFO ] <session:6,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:4,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.101,port 5050, do "use self "testdb";"
2022-02-14 10:24:33.678 [INFO ] <session:6,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:4,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.101,port 5050, do "SET SELF SCN = 19460"
2022-02-14 10:24:33.678 [INFO ] <session:6,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:4,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.101,port 5050, do "CREATE SELF TABLE "test1_n1" (
  "id" int(11) DEFAULT NULL
) ENGINE=EXPRESS TID(610) UID(1) DEFAULT CHARSET=utf8 TABLESPACE='sys_tablespace' COLUMN_IDS(0)"
2022-02-14 10:24:33.707 [INFO ] <session:6,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:4,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.101,port 5050, do "flush transaction_log"
2022-02-14 10:24:33.707 [INFO ] <session:6,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:4,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.115,port 5050, do "flush transaction_log"
2022-02-14 10:24:33.708 [INFO ] <session:6,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:4,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.101,port 5050, do "start 'gc_sync_client' '10.0.2.115 vcname000001 testdb test1_n1 10.0.2.101 5258 2 0'"
2022-02-14 10:24:33.879 [INFO ] <session:6,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:4,en:node,ip:10.0.2.101>: ExecSyncQuery, sync table returned with (0)
2022-02-14 10:24:33.879 [INFO ] <session:6,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:4,en:node,ip:10.0.2.101>: check drop sql thread quit.
2022-02-14 10:24:33.879 [INFO ] <session:6,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:4,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.101,port 5050, do "flush transaction_log"
2022-02-14 10:24:33.880 [INFO ] <session:6,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:4,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.115,port 5050, do "flush transaction_log"
2022-02-14 10:24:33.880 [INFO ] <session:6,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:4,en:node,ip:10.0.2.101>: CALL_SYNC, Sync client executed successfully
2022-02-14 10:24:33.880 [INFO ] <session:6,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:4,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.101,port 5050, do "refresh table "testdb"."test1_n1""
2022-02-14 10:24:33.881 [INFO ] <session:6,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:4,en:node,ip:10.0.2.101>: Sync client returned errcode:[0],errmsg:[success]
2022-02-14 10:24:33.881 [INFO ] <session:6,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:4,en:node,ip:10.0.2.101>: CALL_SYNC, Sync end, return 0
2022-02-14 10:24:33.884 [INFO ] <session:6,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:4,en:node,ip:10.0.2.101>: CHECK_SYNC_RESULT,Checking whether SYNC is succeeded{DB: 'testdb', Table: 'test1', slice: 'n1'}
2022-02-14 10:24:33.884 [INFO ] <session:6,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:4,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.101,port 5050, do "select scn from information_schema.tables where table_schema='testdb' and table_name='test1_n1'"
2022-02-14 10:24:33.885 [INFO ] <session:6,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:4,en:node,ip:10.0.2.101>: MatchJustOneResult:19460
2022-02-14 10:24:33.885 [INFO ] <session:6,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:4,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.115,port 5050, do "refresh table "testdb"."test1_n1""
2022-02-14 10:24:33.886 [INFO ] <session:6,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:4,en:node,ip:10.0.2.101>: QueryExecute ip 10.0.2.115,port 5050, do "select scn from information_schema.tables where table_schema='testdb' and table_name='test1_n1'"
2022-02-14 10:24:33.886 [INFO ] <session:6,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:4,en:node,ip:10.0.2.101>: MatchJustOneResult:19460
2022-02-14 10:24:33.887 [INFO ] <session:6,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:4,en:node,ip:10.0.2.101>: CHECK_SYNC_RESULT,        recover GNode: 19460
2022-02-14 10:24:33.887 [INFO ] <session:6,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:4,en:node,ip:10.0.2.101>: CHECK_SYNC_RESULT,      source GNode[1]: 19460
2022-02-14 10:24:33.887 [INFO ] <session:6,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:4,en:node,ip:10.0.2.101>: Delete DMLStorge eventId 4
2022-02-14 10:24:33.895 [INFO ] <session:6,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:4,en:node,ip:10.0.2.101>: delete dmlstorage event, event id: 4
2022-02-14 10:24:33.904 [INFO ] <session:6>: DealDMLSRecoverUnLock vc00001.testdb.test1
2022-02-14 10:24:34.109 [INFO ] <session:6,tid:609, db:testdb,tb:test1,nx:n1,et:DMLS,eid:4,en:node,ip:10.0.2.101>: UnLock: GCRECOVER_LOCK_TID_vc00001_609 unlock success, 2113927811545694208

分片表结构不一致的恢复总结

当同步进程发送同步命令返回失败时,根据错误原因判断是主副本表结构不同,则删除掉故障的表,从副本获得比结构后重建分片。最后同步数据完成整个恢复过程。

触发Event的场景

除了前面介绍的在服务离线时,有数据操作会出现event,在如下条件也能触发event。

INSERT操作涉及到了丢失的分片

如果Insert操作涉及到了故障分片,则集群会自动设置event。如果不涉及该分片,则不会触发。

注意,在V95里如果有可用的分片,则insert会成功,但会设置event,在V8里会报错,再设置event。

gbase> insert into test1 values(99999),(100000),(121212),(1213133131);
Query OK, 4 rows affected (Elapsed: 00:00:00.18)
Records: 4  Duplicates: 0  Warnings: 0

注意,该SQL能正常运行,也没有wanrings,只是在连接节点gcluster的express.log里能看到如下内容:

Failed to flush data, host: 10.0.2.101, suffix: n1, scn:19463, cause: Table 'testdb.test1_n1' doesn't exist。

然后设置了event:Set Node(10.0.2.101)testdb.test1[n1][scn:19463] data state

2022-02-14 14:41:20.750 [INSERT][WARN ][S:109][Q:70]:Failed to flush data, host: 10.0.2.101, suffix: n1, scn:19463, cause: Table 'testdb.test1_n1' doesn't exist
2022-02-14 14:41:20.863 [DEF  ] [S:109][Q:70][insert] Set Node(10.0.2.101)testdb.test1[n1][scn:19463] data state
2022-02-14 14:41:20.891 [DEF  ] [S:109][Q:70]Successfully set data(testdb.test1:normal) LOCKED, event id is 4

LOAD加载语句涉及了丢失的分片

与insert 不同,load语句都会报错(表不存在),然后设置event, 同步完成后,可以正常加载了。

gbase> load data infile 'sftp://gbase:gbase1234@10.0.2.101/home/gbase/test1.txt' into table test1;
ERROR 1733 (HY000): (GBA-01EX-700) Gbase general error: Task 20481 failed, [10.0.2.115:5050](GBA-02AD-0005)Failed to query in gnode:
DETAIL: (GBA-01EX-700) Gbase general error: (gns_host: 10.0.2.101) Table 'testdb.test1_n1' doesn't exist
SQL: LOAD /*+ TID('3473414') */ DATA INFILE 'sftp://gbase:*********@10.0.2.101/home/gbase/test1.txt#offset=0&length=22&firstblock&ffsize=22' INTO TABLE `testdb`.`test1_n1` DATA_FORMAT 3 FILE_FORMAT UNDEFINED  HOST '10.0.2.115' CURRENT_TIMESTAMP 1644825079 SCN_NUMBER 20481 GCLUSTER_PORT 5258 I
gbase>
gbase> load data infile 'sftp://gbase:gbase1234@10.0.2.101/home/gbase/test1.txt' into table test1;
Query OK, 11 rows affected (Elapsed: 00:00:01.59)
Task 20482 finished, Loaded 11 records, Skipped 0 records

gcluster的express.log日志

2022-02-14 15:51:19.806 [SQLDISP][ERROR][S:10][Q:8]<ExecSQL|2545>:Query failed, THD(0x6170000) HOST(10.0.2.115:5050->18) reason: (GBA-01EX-700) Gbase general error: (gns_host: 10.0.2.101) Table 'testdb.test1_n1' doesn't exist SQL(LOAD /*+ TID('3473414') */ DATA INFILE 'sftp://gbase:*********@10.0.2.101/home/gbase/test1.txt#offset=0&length=22&firstblock&ffsize=22' INTO TABLE `testdb`.`test1_n1` DATA_FORMAT 3 FILE_FORMAT UNDEFINED  HOST '10.0.2.115' CURRENT_TIMESTAMP 1644825079 SCN_NUMBER 20481 GCLUSTER_PORT 5258 INTO SERVER (HOST '10.0.2.101, 10.0.2.115', PORT '5050', USER 'root', DATABASE 'testdb', TABLE 'test1_n1, test1_n2', COMMENT 'table_host 0 0 1 # 1 1 0, scn 20481, group 0 ,distribution 1, hash_map 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0
2022-02-14 15:51:19.854 [LOAD][ERROR][S:10][Q:8]<ExecuteLoad|2055>:Task 20481 failed, [10.0.2.115:5050](GBA-02AD-0005)Failed to query in gnode:
DETAIL: (GBA-01EX-700) Gbase general error: (gns_host: 10.0.2.101) Table 'testdb.test1_n1' doesn't exist
SQL: LOAD /*+ TID('3473414') */ DATA INFILE 'sftp://gbase:*********@10.0.2.101/home/gbase/test1.txt#offset=0&length=22&firstblock&ffsize=22' INTO TABLE `testdb`.`test1_n1` DATA_FORMAT 3 FILE_FORMAT UNDEFINED  HOST '10.0.2.115' CURRENT_TIMESTAMP 1644825079 SCN_NUMBER 20481 GCLUSTER_PORT 5258 INTO SERVER (HOST '10.0.2.101, 10.0.2.115', PORT '5050', USER 'root', DATABASE 'testdb', TABLE 'test1_n1, test1_n2', COMMENT 'table_host 0 0 1 # 1 1 0, scn 20481, group 0 ,distribution 1, hash_map 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 ...  0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
2022-02-14 15:51:19.861 [SQLDISP][ERROR][S:10][Q:8]<ExecSQL|2545>:Query failed, THD(0x6170000) HOST(10.0.2.101:5050->17) reason: (GBA-01EX-700) Gbase general error: table doesn't exist SQL(/*10.0.2.101_10_2_2022-02-14_15:51:19*/ flush /*+TID('3473414')*/ rollback "testdb"."test1_n1")
2022-02-14 15:51:19.863 [COMMIT][WARN ][S:10][Q:8]<FlushRollbackAsync|4581>:Some 'Flush rollback' failed, cause: [10.0.2.101:5050](GBA-02AD-0005)Failed to query in gnode:
DETAIL: (GBA-01EX-700) Gbase general error: table doesn't exist
SQL: /*10.0.2.101_10_2_2022-02-14_15:51:19*/ flush /*+TID('3473414')*/ rollback "testdb"."test1_n1"
2022-02-14 15:51:19.863 [COMMIT][ERROR][S:10][Q:8]<FlushRollbackAsync|4620>:Failed to flush rollback table 'test1[n1]' in 10.0.2.101, cause: (GBA-01EX-700) Gbase general error: table doesn't exist
2022-02-14 15:51:19.869 [DEF  ] [S:10][Q:8][load] Set Node(10.0.2.101)testdb.test1[n1][scn:20481] data state
2022-02-14 15:51:19.908 [DEF  ] [S:10][Q:8]Successfully set data(testdb.test1:normal) LOCKED, event id is 5

当前版本未触发event的场景

截至2022-02-14 ,9.5.2.44.2版本,如下场景没有触发event.

集群正常,但主副本表结构不一致

一个分片多了一列,但此时由于集群正常,没有不一致的标记,所以集群此时无法判断哪个是正确的。SQL会报错。

gbase> insert into test1 (id) values(99999),(100000),(121212),(1213133131);
ERROR 1705 (HY000): gcluster DML error: Commit failed
ERROR NO: 1733ERROR DETAIL: (GBA-01EX-700) Gbase general error: (gns_host: 10.0.2.101) source table and destination table are not same: src:( localhost:testdb.test1_n1 ) dst:( 10.0.2.115:testdb.test1_n1 ) column count: 2 -- 1
SQL: insert into test1 (id) values(99999),(100000),(121212),(1213133131)
gbase>

日志里报错,但没有设置event

2022-02-14 15:57:25.262 [SQLDISP][ERROR][S:21][Q:32]<ExecSQL|2545>:Query failed, THD(0x3b54000) HOST(10.0.2.101:5050->62) reason: (GBA-01EX-700) Gbase general error: (gns_host: 10.0.2.101) source table and destination table are not same: src:( localhost:testdb.test1_n1 ) dst:( 10.0.2.115:testdb.test1_n1 ) column count: 2 -- 1 SQL(/*10.0.2.101_21_2_2022-02-14_15:57:25*/ SELECT /*+TID('3473427')*/ * from `testdb`.`test1_n1` limit 0 target into server (HOST '10.0.2.101,10.0.2.115',PORT '5050', USER 'root ', PASSWORD '',DATABASE 'testdb',TABLE 'test1_n1',COMMENT 'table_host 0 0 1, scn 20485, gc_check '))
2022-02-14 15:57:25.266 [SQLDISP][ERROR][S:21][Q:32]<ExecSQL|2545>:Query failed, THD(0x3b54000) HOST(10.0.2.101:5050->63) reason: (GBA-01EX-700) Gbase general error: Query execution was interrupted.  SQL(/*10.0.2.101_21_2_2022-02-14_15:57:25*/ SELECT /*+TID('3473427')*/ * from `testdb`.`test1_n2` limit 0 target into server (HOST '10.0.2.101,10.0.2.115',PORT '5050', USER 'root ', PASSWORD '',DATABASE 'testdb',TABLE 'test1_n2',COMMENT 'table_host 0 0 1, scn 20485, gc_check '))
2022-02-14 15:57:25.266 [COMMIT][WARN ][S:21][Q:32]<CommitAsync|4127>:Do rollback ....
2022-02-14 15:57:25.595 [COMMIT][WARN ][S:21][Q:32]<DoRevert|357>:Revert 'testdb.test1[n1]' success, 10.0.2.101
2022-02-14 15:57:25.595 [COMMIT][WARN ][S:21][Q:32]<DoRevert|357>:Revert 'testdb.test1[n2]' success, 10.0.2.101
2022-02-14 15:57:25.595 [COMMIT][WARN ][S:21][Q:32]<DoRevert|357>:Revert 'testdb.test1[n1]' success, 10.0.2.115
2022-02-14 15:57:25.595 [COMMIT][WARN ][S:21][Q:32]<DoRevert|357>:Revert 'testdb.test1[n2]' success, 10.0.2.115
2022-02-14 15:57:25.620 [INSERT][ERROR][S:21][Q:32]:Commit failed
ERROR NO: 1733ERROR DETAIL: (GBA-01EX-700) Gbase general error: (gns_host: 10.0.2.101) source table and destination table are not same: src:( localhost:testdb.test1_n1 ) dst:( 10.0.2.115:testdb.test1_n1 ) column count: 2 -- 1
SQL: insert into test1 (id) values(99999),(100000),(121212),(1213133131)

解决方法

可以考虑手工删掉一个,触发前面你的表删除场景。当然删除哪个,需要用户自行判断。从理论上,可以从数据库各个节点做横向对比,选择超过50%的表结构为准,做自动恢复。 后续产品会考虑完善这个问题。

集群正常时,表分片丢失的普通查询

此时查询的日志会报出表分片不存在,但在V95版本里会自动尝试查询备用分片,当前版本并没有设置event信息。

2022-02-14 14:40:07.011 [SQLDISP][ERROR][S:109][Q:69]<ExecSQL|2545>:Query failed, THD(0x6508000) HOST(10.0.2.101:5050->96) reason: Table 'testdb.test1_n1' doesn't exist SQL(SELECT /*10.0.2.101_109_1_2022-02-14_14:40:06*/ /*+ TID('3342383') */ `vcname000001.testdb.test1`.`id` AS `id` FROM `testdb`.`test1_n1` `vcname000001.testdb.test1`)

在V8版本里,查询会直接包分片不存在的错误,在V9里进行了查询完善。

该现象与表数据文件损坏不同,故是否select时设置event还在确认中。

总结

本文通过几种常见的主副本不一致的情况测试,包括数据不一致,表丢失,表结构不一致等,确认GBase 8a在这些情况下都能自动完成恢复工作,确保集群后续的数据运行安全。