GBase scale-out: dropping the old distribution after redistribution fails with "Can not drop nodedatamap EventLog is using distribution"

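During a GBase 8a scale-out, once every table has been redistributed to the new distribution, the old distribution can be removed with refreshnodedatamap drop. If some tables still have events that reference the old distribution, however, the command fails with: Can not drop nodedatamap EventLog is using distribution. The outstanding events must be resolved before the cleanup can continue.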

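Put another way, before a scale-out it is better to bring the cluster back to a fully normal state with no events, which reduces the time spent on operations.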
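
The pre-check described above can be scripted. The following is a minimal sketch, assuming the three gcadmin show...event subcommands print an "Event count:N" line as in the outputs shown later in this article. The helper names (parse_event_count, pending_events, safe_to_expand) are hypothetical and not part of GBase, and the script assumes gcadmin is on the PATH of the current user.

```python
import subprocess
import re

# Subcommands taken from the gcadmin session shown in this article.
EVENT_COMMANDS = ("showdmlevent", "showddlevent", "showdmlstorageevent")


def parse_event_count(output):
    """Return N from an 'Event count:N' line, or None if the line is missing."""
    match = re.search(r"Event count:\s*(\d+)", output)
    return int(match.group(1)) if match else None


def pending_events(run=None):
    """Map each subcommand to its event count; `run` is injectable for testing."""
    if run is None:
        # Assumption: gcadmin is on PATH for the current (gbase) user.
        def run(sub):
            result = subprocess.run(
                ["gcadmin", sub], capture_output=True, text=True, check=True
            )
            return result.stdout
    return {sub: parse_event_count(run(sub)) for sub in EVENT_COMMANDS}


def safe_to_expand(counts):
    """True only if every count was parsed and equals zero."""
    return all(count == 0 for count in counts.values())
```

Calling safe_to_expand(pending_events()) on a cluster node before starting the scale-out would return False as long as any of the three event lists is non-empty or cannot be parsed.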

Error example

gbase> refreshnodedatamap drop 1;
ERROR 1707 (HY000): gcluster command error: Can not drop nodedatamap 1. EventLog is using distribution.

Cause

Checking gcadmin shows that there is indeed an event

[gbase@rh6-1 gcinstall_43R33]$ gcadmin
CLUSTER STATE:  ACTIVE
CLUSTER MODE:   NORMAL

=================================================================
|             GBASE COORDINATOR CLUSTER INFORMATION             |
=================================================================
|   NodeName   |     IpAddress     |gcware |gcluster |DataState |
-----------------------------------------------------------------
| coordinator1 |    10.0.2.201     | OPEN  |  OPEN   |    1     |
-----------------------------------------------------------------
=============================================================
|              GBASE DATA CLUSTER INFORMATION               |
=============================================================
|NodeName |     IpAddress     |gnode |syncserver |DataState |
-------------------------------------------------------------
|  node1  |    10.0.2.201     | OPEN |   OPEN    |    0     |
-------------------------------------------------------------
|  node2  |    10.0.2.202     | OPEN |   OPEN    |    0     |
-------------------------------------------------------------
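
Reading the DataState column by eye is error-prone on larger clusters. The sketch below parses the table layout shown above and lists the nodes whose DataState is not 0. It assumes that 0 means normal, which is inferred only from the fact that the value returns to 0 after the event is removed later in this article. The function names are hypothetical.

```python
def parse_data_states(gcadmin_output):
    """Return (node_name, ip_address, data_state) for every data row."""
    rows = []
    for line in gcadmin_output.splitlines():
        line = line.strip()
        if not (line.startswith("|") and line.endswith("|")):
            continue
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        # Data rows have five cells and a numeric DataState;
        # title rows and column headers do not.
        if len(cells) == 5 and cells[-1].isdigit():
            rows.append((cells[0], cells[1], int(cells[-1])))
    return rows


def abnormal_nodes(gcadmin_output):
    """Rows whose DataState differs from 0 (assumed here to mean normal)."""
    return [row for row in parse_data_states(gcadmin_output) if row[2] != 0]
```

Applied to the output above, abnormal_nodes would report only coordinator1, since node1 and node2 both show DataState 0.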

Looking at the individual events shows that this case is unusual: the problem is with audit_log, the audit log table, which is a gssys table.

[gbase@rh6-1 gcinstall_43R33]$ gcadmin showdmlevent
Event count:0
[gbase@rh6-1 gcinstall_43R33]$ gcadmin showddlevent
Event count:0
[gbase@rh6-1 gcinstall_43R33]$ gcadmin showdmlstorageevent
Event count:1
Event ID:    2
ObjectName: gbase.audit_log
TableID: 0

Fail Data Copy:
------------------------------------------------------
NodeIP: 10.0.2.201      FAILURE
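
When several storage events are pending, it helps to turn this output into structured records. The sketch below is written against the single sample shown above, so the handling of multiple events (one record per "Event ID:" line) is an assumption about the format, and parse_storage_events is a hypothetical helper name.

```python
def parse_storage_events(output):
    """Parse `gcadmin showdmlstorageevent` text into a list of dicts."""
    events = []
    current = None
    for raw in output.splitlines():
        line = raw.strip()
        if line.startswith("Event ID:"):
            # Assumption: each event starts with its own "Event ID:" line.
            current = {
                "event_id": int(line.split(":", 1)[1]),
                "object": None,
                "table_id": None,
                "failed_nodes": [],
            }
            events.append(current)
        elif current is None:
            continue  # skip the "Event count:" header and anything before it
        elif line.startswith("ObjectName:"):
            current["object"] = line.split(":", 1)[1].strip()
        elif line.startswith("TableID:"):
            current["table_id"] = int(line.split(":", 1)[1])
        elif line.startswith("NodeIP:"):
            parts = line.split(":", 1)[1].split()
            if len(parts) == 2:
                # (ip, status), e.g. ("10.0.2.201", "FAILURE")
                current["failed_nodes"].append((parts[0], parts[1]))
    return events
```

For the output above this yields one record for gbase.audit_log with event ID 2, table ID 0, and a single failed copy on 10.0.2.201.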


Solution

Fix the event. If the system cannot complete the synchronization automatically, investigate why.

gc_recovery.log, under the gcluster log directory, shows that this event cannot be recovered automatically: a gssys table is a local table and has no replica.

2022-04-14 08:57:40.898 [ERROR] <STORAGE-Recover-0>: GetSyncDmlStorgeInfo error, eventid=2, tablename=gbase.audit_log, content=gbase.audit_log,,true
2022-04-14 08:57:40.898 [INFO ] <RECOVER-INFO-0>: Finishing Recovering gbase.audit_log,tid 0
2022-04-14 08:57:41.119 [INFO ] <RECOVER-INFO>: MasterAssignTask dmlstoragetid num 1.
2022-04-14 08:57:41.119 [INFO ] <RECOVER-INFO-0>: Start Recovering gbase.audit_log tid 0
2022-04-14 08:57:41.119 [INFO ] <STORAGE-Recover-0>: Start DMLStorge recover gbase.audit_log,tid 0 eventnum 1
2022-04-14 08:57:41.119 [INFO ] <STORAGE-Recover-0>: Start to DMLStorge recover of eventid(2)
2022-04-14 08:57:41.119 [ERROR] <GCWare>: sys gbase.audit_log nodeid: 3372351498, have dmlstorageevent,eventid: 2
2022-04-14 08:57:41.119 [ERROR] <STORAGE-Recover>: GetDataCopyMap error, can't get a source node, because of no normal
2022-04-14 08:57:41.119 [ERROR] <STORAGE-Recover-0>: GetSyncDmlStorgeInfo error, eventid=2, tablename=gbase.audit_log, content=gbase.audit_log,,true
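
Because the recovery thread retries continuously, the same error repeats many times in gc_recovery.log. The following sketch filters the ERROR lines and collapses the repeats. The regular expression is derived only from the excerpt above (timestamp, bracketed level, component in angle brackets, message), so other line layouts in the real log would simply be skipped; the function names are hypothetical.

```python
import re
from collections import Counter

# Line layout observed in the excerpt:
# "<date> <time> [LEVEL] <component>: message"
LOG_LINE = re.compile(r"^(\S+ \S+) \[(\w+)\s*\] <([^>]+)>: (.*)$")


def recovery_errors(log_text):
    """Return (timestamp, component, message) for every ERROR line."""
    errors = []
    for line in log_text.splitlines():
        match = LOG_LINE.match(line.strip())
        if match and match.group(2) == "ERROR":
            errors.append((match.group(1), match.group(3), match.group(4)))
    return errors


def summarize(errors):
    """Count repeated (component, message) pairs so retries collapse."""
    return Counter((component, message) for _, component, message in errors)
```

In the excerpt above, the entry to look for is the GetDataCopyMap error from STORAGE-Recover, which says that no normal source node could be found.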

Log in to the node and try to repair the table; the repair reports an error

gbase> repair table gbase.audit_log;
+-----------------+--------+----------+-----------------------------------+
| Table           | Op     | Msg_type | Msg_text                          |
+-----------------+--------+----------+-----------------------------------+
| gbase.audit_log | repair | Error    | Incorrect file format 'audit_log' |
| gbase.audit_log | repair | error    | Corrupt                           |
+-----------------+--------+----------+-----------------------------------+
2 rows in set (Elapsed: 00:00:00.01)

This confirms that the table's data file is damaged beyond repair, so the only option is to clear the data

gbase> repair table gbase.audit_log use_frm;
+-----------------+--------+----------+-----------------------------------+
| Table           | Op     | Msg_type | Msg_text                          |
+-----------------+--------+----------+-----------------------------------+
| gbase.audit_log | repair | Error    | Incorrect file format 'audit_log' |
| gbase.audit_log | repair | status   | OK                                |
+-----------------+--------+----------+-----------------------------------+
2 rows in set (Elapsed: 00:00:00.00)

Then remove the event

[gbase@rh6-1 gcinstall_43R33]$ gcadmin rmdmlstorageevent 0 2
[gbase@rh6-1 gcinstall_43R33]$ gcadmin
CLUSTER STATE:  ACTIVE
CLUSTER MODE:   NORMAL

=================================================================
|             GBASE COORDINATOR CLUSTER INFORMATION             |
=================================================================
|   NodeName   |     IpAddress     |gcware |gcluster |DataState |
-----------------------------------------------------------------
| coordinator1 |    10.0.2.201     | OPEN  |  OPEN   |    0     |
-----------------------------------------------------------------
=============================================================
|              GBASE DATA CLUSTER INFORMATION               |
=============================================================
|NodeName |     IpAddress     |gnode |syncserver |DataState |
-------------------------------------------------------------
|  node1  |    10.0.2.201     | OPEN |   OPEN    |    0     |
-------------------------------------------------------------
|  node2  |    10.0.2.202     | OPEN |   OPEN    |    0     |
-------------------------------------------------------------

Dropping the old distribution again now succeeds

gbase> refreshnodedatamap drop 1;
Query OK, 0 rows affected (Elapsed: 00:00:04.64)

gbase> ^CAborted
[gbase@rh6-1 gcluster]$ gcadmin rmdistribution 1
cluster distribution ID [1]
it will be removed now
please ensure this is ok, input y or n: y
gcadmin remove distribution [1] success
[gbase@rh6-1 gcluster]$

Summary

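When the database has events, deal with them promptly. If the database cannot recover them on its own, investigate the cause, starting by ruling out problems in the environment itself, such as a failed disk, a full disk, or an unstable network. Start the scale-out only after the events have been recovered.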

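If automatic recovery is logically impossible, for example when both the primary and the replica have been flagged as inconsistent, or when the table is a local gssys table as in this case, handle the event manually according to the actual situation.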

Reference

GBase 8a: several causes of both the primary and the replica being damaged (state 1)