GBase 8a集群加载的详细过程日志排查加载性能问题

GBase 8a数据库通过LOAD方式加载时,如果出现功能报错,或者性能问题,可以通过本文档介绍的方案,打开加载的详细日志,分析每一步的执行情况,包括获得锁,连接数据源获得文件大小,分配任务并开始加载,提交等等阶段。以便为问题的排查提供帮助。

1、前提参数设置
2、执行语句
3、日志

前提参数设置

// 默认值是3,这个是session级别的,无需恢复
gbase> set gcluster_log_level=7;
Query OK, 0 rows affected (Elapsed: 00:00:00.00)

// 默认值是0,建议先show一下原始值,排查完了记得改回去。
// 在一些新版本里,支持session配置, 无需在保留原始值了
gbase> set global gbase_sql_trace_level=15;
Query OK, 0 rows affected (Elapsed: 00:00:00.12)

// 默认值是0,这个是session级别,无需恢复
gbase> set gbase_sql_trace=1;
Query OK, 0 rows affected (Elapsed: 00:00:00.01)

执行语句

gbase> load data infile 'ftp://gbase:gbase1234@192.168.163.100/t1.txt' into table testdb.t1;
Query OK, 4 rows affected (Elapsed: 00:00:00.35)
Task 394392 finished, Loaded 4 records, Skipped 0 records

查看日志

/opt/gbase/gcluster/log/gcluster/express.log

# SQL处于check permission状态
# 获得表的锁
# 如果长时间无法拿到锁,通过gcadmin showlock 看看对应的表的锁在哪个节点哪个ID持有
2019-07-03 15:02:41.352 [LOCK][INFO ][S:2108][Q:13221]:acquired WRITE lock: testdb.t1580D5F90-B287-4199-B057-E6FBD44B5BFA, lwp:35693
2019-07-03 15:02:41.352 [LOCK][INFO ][S:2108][Q:13221]:acquired READ lock: testdb.t1, lwp:35693
2019-07-03 15:02:41.353 [LOCK][INFO ][S:2108][Q:13221]:acquired READ lock: testdb.t1.rsync, lwp:35693

# 读取数据源ftp文件信息
# 检查数据库和FTP之间的性能问题,包括网络和FTP服务器的设置
2019-07-03 15:02:41.354 [LOAD][INFO ][S:2108][Q:13221]:Start cluster load
2019-07-03 15:02:41.358 == Information :   Trying 192.168.163.100...
2019-07-03 15:02:41.362 == Information : Connected to 192.168.163.100 (192.168.163.100) port 21 (#0)
2019-07-03 15:02:41.370 <= Recv header : 220 (vsFTPd 3.0.2)
2019-07-03 15:02:41.371 => Send header : USER gbase
2019-07-03 15:02:41.371 <= Recv header : 331 Please specify the password.
2019-07-03 15:02:41.371 => Send header : PASS gbase1234
2019-07-03 15:02:41.419 <= Recv header : 230 Login successful.
2019-07-03 15:02:41.419 => Send header : PWD
2019-07-03 15:02:41.419 <= Recv header : 257 "/home/gbase"
2019-07-03 15:02:41.419 == Information : Entry path is '/home/gbase'
2019-07-03 15:02:41.419 => Send header : TYPE I
2019-07-03 15:02:41.419 == Information : ftp_perform ends with SECONDARY: 0
2019-07-03 15:02:41.420 <= Recv header : 200 Switching to Binary mode.
2019-07-03 15:02:41.420 => Send header : SIZE t1.txt
2019-07-03 15:02:41.420 <= Recv header : 213 8
2019-07-03 15:02:41.420 => Send header : REST 0
2019-07-03 15:02:41.420 <= Recv header : 350 Restart position accepted (0).
2019-07-03 15:02:41.421 == Information : Remembering we are in dir ""
2019-07-03 15:02:41.421 == Information : Connection #0 to host 192.168.163.100 left intact
2019-07-03 15:02:41.422 == Information :   Trying 192.168.163.100...
2019-07-03 15:02:41.422 == Information : Connected to 192.168.163.100 (192.168.163.100) port 21 (#0)
2019-07-03 15:02:41.424 <= Recv header : 220 (vsFTPd 3.0.2)
2019-07-03 15:02:41.424 => Send header : USER gbase
2019-07-03 15:02:41.425 <= Recv header : 331 Please specify the password.
2019-07-03 15:02:41.425 => Send header : PASS gbase1234
2019-07-03 15:02:41.465 <= Recv header : 230 Login successful.
2019-07-03 15:02:41.465 => Send header : PWD
2019-07-03 15:02:41.465 <= Recv header : 257 "/home/gbase"
2019-07-03 15:02:41.465 == Information : Entry path is '/home/gbase'
2019-07-03 15:02:41.465 => Send header : TYPE I
2019-07-03 15:02:41.465 == Information : ftp_perform ends with SECONDARY: 0
2019-07-03 15:02:41.465 <= Recv header : 200 Switching to Binary mode.
2019-07-03 15:02:41.465 => Send header : SIZE t1_2.txt
2019-07-03 15:02:41.466 <= Recv header : 213 8
2019-07-03 15:02:41.466 => Send header : REST 0
2019-07-03 15:02:41.466 <= Recv header : 350 Restart position accepted (0).
2019-07-03 15:02:41.466 == Information : Remembering we are in dir ""
2019-07-03 15:02:41.466 == Information : Connection #0 to host 192.168.163.100 left intact

#准备分派任务
2019-07-03 15:02:41.467 [LOAD][INFO ][S:2108][Q:13221]:AddNodeLoadFileNum host[::ffff:192.168.163.101]add num [1].
2019-07-03 15:02:41.474 [LOCK][INFO ][S:2108][Q:13221]:unlocked: testdb.t1
2019-07-03 15:02:41.474 [LOCK][INFO ][S:2108][Q:13221]:acquired WRITE lock: testdb.t16ef6f8a9-87f0-4d6c-8043-899367d02df3, lwp:35693
2019-07-03 15:02:41.475 [LOAD][INFO ][S:2108][Q:13221]:AddNodeLoadFileNum cur node state:[ip:::ffff:192.168.163.100:num:0][ip:::ffff:192.168.163.101:num:1].

# 此时SQL已经在loading状态,
# 开始下发任务。一般是某个节点性能导致,或者数据非常倾斜。
# a) 可以测试从每个数据库的数据节点,用ftp直接获取数据文件的性能
# 比如  curl ftp://XXXXX/XX.csv > 1.txt
# 多跑几次,看看哪个节点比较慢
# b) 排查机器的内存是否出现SWAP, 操作系统/var/log/messages是否有磁盘报错
# c) 查看CPUINFO是否有异常降频
# d) 用top、iotop、nmon等工具观察一段时间系统资源,是否有资源紧张
#
2019-07-03 15:02:41.476 [SQLDISP][INFO ][S:2108][Q:13221]:Target:::ffff:192.168.163.101, SQL:set scn = 395447.
2019-07-03 15:02:41.521 [SQLDISP][INFO ][S:2108][Q:13221]:Target:::ffff:192.168.163.101, SQL:LOAD DATA INFILE 'ftp://gbase:*********@192.168.163.100/t1.txt#offset=0&length=8&firstblock&ffsize=8,ftp://gbase:*********@192.168.163.100/t1_2.txt#offset=0&length=8&firstblock&ffsize=8' INTO TABLE `testdb`.`t1_n1` DATA_FORMAT 3 FILE_FORMAT UNDEFINED  HOST '::ffff:192.168.163.101' CURRENT_TIMESTAMP 1562137361 SCN_NUMBER 395447 GCLUSTER_PORT 5258 INTO SERVER (HOST '::ffff:192.168.163.100, ::ffff:192.168.163.101', PORT '5050', USER 'root', DATABASE 'testdb', TABLE 't1_n1, t1_n2', COMMENT 'table_host 0 1 0 # 1 0 1, scn 395447, group -1').
2019-07-03 15:02:42.004 [LOAD][INFO ][S:2108][Q:13221]:ReduceNodeLoadFileNum host[::ffff:192.168.163.100] reduce num [0].
2019-07-03 15:02:42.004 [LOAD][INFO ][S:2108][Q:13221]:ReduceNodeLoadFileNum host[::ffff:192.168.163.101] reduce num [1].
2019-07-03 15:02:42.004 [LOAD][INFO ][S:2108][Q:13221]:ReduceNodeLoadFileNum cur node state:[ip:::ffff:192.168.163.100:num:0][ip:::ffff:192.168.163.101:num:0].
2019-07-03 15:02:42.004 [LOAD][INFO ][S:2108][Q:13221]:Task 395447 finished, Loaded 8 records, Skipped 0 records

# 拿到排它锁,要切换版本了。
2019-07-03 15:02:42.008 [LOCK][INFO ][S:2108][Q:13221]:acquired WRITE lock: testdb.t1.09B5BEEC-1EF7-4FA6-9850-C4217A781E0F, lwp:35693

# 设置SCN
2019-07-03 15:02:42.008 [SQLDISP][INFO ][S:2108][Q:13221]:Target:::ffff:192.168.163.101, SQL:SET SELF SCN = 395447.
2019-07-03 15:02:42.009 [SQLDISP][INFO ][S:2108][Q:13221]:Target:::ffff:192.168.163.100, SQL:SET SELF SCN = 395447.
2019-07-03 15:02:42.009 [SQLDISP][INFO ][S:2108][Q:13221]:Target:::ffff:192.168.163.100, SQL:SET SELF SCN = 395447.
2019-07-03 15:02:42.009 [SQLDISP][INFO ][S:2108][Q:13221]:Target:::ffff:192.168.163.101, SQL:SET SELF SCN = 395447.

# 以指定的SCN来刷新数据
2019-07-03 15:02:42.011 [SQLDISP][INFO ][S:2108][Q:13221]:Target:::ffff:192.168.163.100, SQL:flush commit "testdb"."t1_n1" scn_number 395447.
2019-07-03 15:02:42.011 [SQLDISP][INFO ][S:2108][Q:13221]:Target:::ffff:192.168.163.100, SQL:flush commit "testdb"."t1_n2" scn_number 395447.
2019-07-03 15:02:42.019 [SQLDISP][INFO ][S:2108][Q:13221]:Target:::ffff:192.168.163.101, SQL:flush commit "testdb"."t1_n2" scn_number 395447.
2019-07-03 15:02:42.055 [SQLDISP][INFO ][S:2108][Q:13221]:Target:::ffff:192.168.163.101, SQL:flush commit "testdb"."t1_n1" scn_number 395447.

# 验证SCN
2019-07-03 15:02:42.060 [SQLDISP][INFO ][S:2108][Q:13221]:Target:::ffff:192.168.163.101, SQL:select scn from information_schema.tables where table_schema = 'testdb' and table_name = 't1_n1'.
2019-07-03 15:02:42.061 [SQLDISP][INFO ][S:2108][Q:13221]:Target:::ffff:192.168.163.100, SQL:select scn from information_schema.tables where table_schema = 'testdb' and table_name = 't1_n1'.
2019-07-03 15:02:42.061 [SQLDISP][INFO ][S:2108][Q:13221]:Target:::ffff:192.168.163.100, SQL:select scn from information_schema.tables where table_schema = 'testdb' and table_name = 't1_n2'.
2019-07-03 15:02:42.061 [SQLDISP][INFO ][S:2108][Q:13221]:Target:::ffff:192.168.163.101, SQL:select scn from information_schema.tables where table_schema = 'testdb' and table_name = 't1_n2'.
2019-07-03 15:02:42.063 [SQLDISP][INFO ][S:2108][Q:13221]:Target:::ffff:192.168.163.101, SQL:set autocommit=1.
2019-07-03 15:02:42.063 [SQLDISP][INFO ][S:2108][Q:13221]:Target:::ffff:192.168.163.100, SQL:set autocommit=1.
2019-07-03 15:02:42.063 [SQLDISP][INFO ][S:2108][Q:13221]:Target:::ffff:192.168.163.100, SQL:set autocommit=1.
2019-07-03 15:02:42.063 [SQLDISP][INFO ][S:2108][Q:13221]:Target:::ffff:192.168.163.101, SQL:set autocommit=1.

# SQL处于commit阶段 或者 rollback阶段
# 提交commit成功
2019-07-03 15:02:42.067 [COMMIT][INFO ][S:2108][Q:13221]:commit table(testdb.t1) success for scn:395447.
2019-07-03 15:02:42.068 [LOAD][INFO ][S:2108][Q:13221]:successful to cluster load

# 释放锁
2019-07-03 15:02:42.068 [LOCK][INFO ][S:2108][Q:13221]:unlocked: testdb
2019-07-03 15:02:42.069 [LOCK][INFO ][S:2108][Q:13221]:unlocked: testdb.t1.09B5BEEC-1EF7-4FA6-9850-C4217A781E0F
2019-07-03 15:02:42.070 [LOCK][INFO ][S:2108][Q:13221]:unlocked: testdb.t1.rsync
2019-07-03 15:02:42.071 [LOCK][INFO ][S:2108][Q:13221]:unlocked: testdb.t1580D5F90-B287-4199-B057-E6FBD44B5BFA
2019-07-03 15:02:42.072 [LOCK][INFO ][S:2108][Q:13221]:unlocked: testdb.t16ef6f8a9-87f0-4d6c-8043-899367d02df3

加载更多相关内容,请参考其它文档

http://www.gbase8.cn/?tag=%e5%8a%a0%e8%bd%bd

参考

GBase 8a集群加载的详细过程日志排查加载性能问题》有1条评论

评论已关闭。