GBase 8a: NIC fault causes gcware service anomaly

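A GBase 8a database cluster uses the gcware service to maintain consistency across multiple nodes on the network. When the network fails, the gcware service becomes abnormal.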

In one project, the gcware service was found in the abnormal CLOSE state; it recovered after about 4 minutes.

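1. Current conclusion of the investigation:
Between Aug 10 00:01 and Aug 10 00:04 there was probably network jitter. For periods of a few seconds, some gcware nodes in the cluster could not communicate with each other, so gcware kept re-electing a leader, which affected cluster availability.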
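2. How the existing cluster mechanism works:
The leader node sends a heartbeat message to the follower nodes every 200 milliseconds. A follower that receives a heartbeat from the leader considers the leader healthy. If a follower receives no heartbeat from the leader within 2 seconds, it considers the leader failed and starts an election to choose a new leader.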
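The timeout rule above can be illustrated with a minimal sketch. This is not gcware code: the function name and the simplification that the follower's timer simply restarts at the next heartbeat are assumptions made for illustration. Times are integer milliseconds.

```python
def election_start_times(heartbeat_arrivals_ms, election_timeout_ms=2000, start_ms=0):
    """Return the times (ms) at which a follower's election timer would expire.

    heartbeat_arrivals_ms: sorted arrival times of leader heartbeats, in ms.
    Simplification (assumption): after a timeout, the timer restarts at the
    next heartbeat that arrives; a real election would change the leader.
    """
    elections = []
    last_seen = start_ms
    for arrival in heartbeat_arrivals_ms:
        # A gap longer than the timeout means the timer expired
        # before this heartbeat arrived.
        if arrival - last_seen > election_timeout_ms:
            elections.append(last_seen + election_timeout_ms)
        last_seen = arrival
    return elections


# Heartbeats every 200 ms, then a 2.6 s stall between 1000 ms and 3600 ms.
arrivals = [200, 400, 600, 800, 1000, 3600, 3800]
print(election_start_times(arrivals))  # [3000]
```

With steady 200 ms heartbeats the list stays empty; a single stall longer than 2 seconds is enough to start an election.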
3. The election process described above appears in the log as follows:
Aug 10 00:01:11.282487 INFO [GCWARE] vote req from 3532925962, term:5069, candidate:3532925962, term:5068,index:241635506,mine term:5068 index:241635506, previous vote:0, current vote :3532925962 grant:1
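The first abnormal node was 10.44.148.210: it received no heartbeat from the leader within 2 seconds and triggered an election. About 2 seconds later, 10.44.148.209 was elected leader, but shortly afterwards its communication with the other nodes failed again, which triggered another election. This cycle repeated until 10.44.151.16 became leader at Aug 10 00:04, after which no further re-elections occurred.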
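The numeric node ID in the vote request line, 3532925962, decodes to 10.44.148.210 when read as a little-endian IPv4 address, which matches the first abnormal node. The sketch below parses that log line and performs the decoding. The regular expression, the function names, and the reading of the fields (requested term, grant flag) are assumptions based on this single log line, not on a documented gcware log format.

```python
import re
import socket
import struct

VOTE_LINE = (
    "Aug 10 00:01:11.282487 INFO [GCWARE] vote req from 3532925962, term:5069, "
    "candidate:3532925962, term:5068,index:241635506,mine term:5068 index:241635506, "
    "previous vote:0, current vote :3532925962 grant:1"
)

# Pattern inferred from the one sample line above (assumption).
VOTE_RE = re.compile(
    r"vote req from (?P<sender>\d+), term:(?P<req_term>\d+), "
    r"candidate:(?P<candidate>\d+).*?current vote\s*:(?P<vote>\d+) grant:(?P<grant>\d+)"
)


def node_id_to_ip(node_id):
    """Interpret a 32-bit node ID as a little-endian IPv4 address."""
    return socket.inet_ntoa(struct.pack("<I", node_id))


def parse_vote(line):
    """Extract candidate IP, requested term and grant flag from a vote line."""
    m = VOTE_RE.search(line)
    if m is None:
        return None
    return {
        "candidate_ip": node_id_to_ip(int(m.group("candidate"))),
        "term": int(m.group("req_term")),
        "granted": m.group("grant") == "1",
    }


print(parse_vote(VOTE_LINE))
```

Running the parser over every `vote req` line in the gcware log would give a timeline of which node started each election.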
4. The messages log shows that two network-layer kernel warnings, each with a call trace, were logged on 10.44.151.18:
Aug 9 23:58:05 dpuc4 kernel: [] __warn+0xd8/0x100
Aug 9 23:58:05 dpuc4 kernel: [] warn_slowpath_null+0x1d/0x20
Aug 9 23:58:05 dpuc4 kernel: [] tcp_fragment+0x371/0x380
Aug 9 23:58:05 dpuc4 kernel: [] tcp_match_skb_to_sack+0x73/0xd0
…………..
Nodes 10.44.148.207 through 10.44.148.210 have no logs for August 9 to August 10.
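To check other nodes for the same kind of kernel warning, the messages file can be scanned for `WARNING:` lines. The following is a minimal sketch; the regular expression is derived from the warning lines in this log, and the function name is hypothetical.

```python
import re
from collections import Counter

# Matches lines such as:
# "Aug  9 23:58:05 dpuc4 kernel: WARNING: CPU: 0 PID: 0 at net/ipv4/tcp_output.c:1134 tcp_fragment+0x371/0x380"
WARNING_RE = re.compile(
    r"^\w{3}\s+\d+ \d{2}:\d{2}:\d{2} (?P<host>\S+) kernel: "
    r"WARNING: .* at (?P<where>\S+) (?P<func>\w+)\+"
)


def count_kernel_warnings(lines):
    """Count kernel WARNING headers per (host, function)."""
    counts = Counter()
    for line in lines:
        m = WARNING_RE.match(line)
        if m:
            counts[(m.group("host"), m.group("func"))] += 1
    return counts


sample = [
    "Aug  9 23:58:05 dpuc4 kernel: WARNING: CPU: 0 PID: 0 at net/ipv4/tcp_output.c:1134 tcp_fragment+0x371/0x380",
    "Aug  9 23:58:05 dpuc4 kernel: [<ffffffff815fe131>] tcp_fragment+0x371/0x380",
    "Aug  9 23:58:05 dpuc4 kernel: WARNING: CPU: 0 PID: 0 at net/ipv4/tcp_output.c:1134 tcp_fragment+0x371/0x380",
]
print(count_kernel_warnings(sample))
```

On a real node the same function can be fed an open file, for example `count_kernel_warnings(open("/var/log/messages"))`, assuming the log lives at that path.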

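5. If the on-site network is unreliable, two parameters in gcware.conf can be increased:
leader_heartbeat: the interval at which the leader sends heartbeats, 200 milliseconds by default.
election_timeout: how long a node waits without receiving a heartbeat before it starts an election, 2 seconds by default.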
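When choosing new values, it can help to look at how many heartbeat intervals fit into one election timeout, since that is roughly how many consecutive heartbeats must be lost before an election starts. The helper below is only arithmetic on the two values; its name is hypothetical, and the units used inside gcware.conf are not stated here, so the arguments are given explicitly in milliseconds and seconds.

```python
def heartbeats_per_timeout(leader_heartbeat_ms=200, election_timeout_s=2.0):
    """Return how many heartbeat intervals fit into one election timeout."""
    if leader_heartbeat_ms <= 0 or election_timeout_s <= 0:
        raise ValueError("both values must be positive")
    return (election_timeout_s * 1000) / leader_heartbeat_ms


# Defaults described above: 200 ms heartbeat, 2 s timeout.
print(heartbeats_per_timeout())          # 10.0
# Example of a larger timeout with the same heartbeat interval.
print(heartbeats_per_timeout(200, 5.0))  # 25.0
```

Raising election_timeout alone makes the cluster tolerate longer network stalls, at the cost of slower detection when the leader really has failed.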

The complete messages output is as follows:

Aug  9 23:58:05 dpuc4 kernel: ------------[ cut here ]------------
Aug  9 23:58:05 dpuc4 kernel: WARNING: CPU: 0 PID: 0 at net/ipv4/tcp_output.c:1134 tcp_fragment+0x371/0x380
Aug  9 23:58:05 dpuc4 kernel: Modules linked in: binfmt_misc 8021q garp mrp stp llc iptable_filter ext4 mbcache jbd2 ppdev dm_mod cirrus ttm edac_core drm_kms_helper iosf_mbi crc32_pclmul syscopyarea sysfillrect sysimgblt fb_sys_fops ghash_clmulni_intel drm aesni_intel parport_pc lrw gf128mul parport glue_helper sg ablk_helper cryptd virtio_balloon joydev i2c_piix4 i6300esb i2c_core pcspkr ip_tables xfs libcrc32c sr_mod cdrom ata_generic pata_acpi virtio_console virtio_net virtio_scsi virtio_blk ata_piix libata serio_raw crct10dif_pclmul crct10dif_common virtio_pci crc32c_intel floppy virtio_ring virtio
Aug  9 23:58:05 dpuc4 kernel: CPU: 0 PID: 0 Comm: swapper/0 Tainted: G        W      ------------   3.10.0-693.21.1.el7.x86_64 #1
Aug  9 23:58:05 dpuc4 kernel: Hardware name: RDO OpenStack Compute, BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014
Aug  9 23:58:05 dpuc4 kernel: Call Trace:
Aug  9 23:58:05 dpuc4 kernel: <IRQ>  [<ffffffff816bd804>] dump_stack+0x19/0x1b
Aug  9 23:58:05 dpuc4 kernel: [<ffffffff8108df18>] __warn+0xd8/0x100
Aug  9 23:58:05 dpuc4 kernel: [<ffffffff8108e05d>] warn_slowpath_null+0x1d/0x20
Aug  9 23:58:05 dpuc4 kernel: [<ffffffff815fe131>] tcp_fragment+0x371/0x380
Aug  9 23:58:05 dpuc4 kernel: [<ffffffff815f2b93>] tcp_match_skb_to_sack+0x73/0xd0
Aug  9 23:58:05 dpuc4 kernel: [<ffffffff815f7562>] tcp_sacktag_walk+0xf2/0x550
Aug  9 23:58:05 dpuc4 kernel: [<ffffffff815f7df3>] tcp_sacktag_write_queue+0x433/0xb50
Aug  9 23:58:05 dpuc4 kernel: [<ffffffff815f9ce5>] tcp_ack+0x3d5/0x1760
Aug  9 23:58:05 dpuc4 kernel: [<ffffffff815fed4c>] ? tcp_transmit_skb+0x52c/0xa20
Aug  9 23:58:05 dpuc4 kernel: [<ffffffff815fb695>] tcp_rcv_established+0x225/0x7e0
Aug  9 23:58:05 dpuc4 kernel: [<ffffffff816065aa>] tcp_v4_do_rcv+0x10a/0x350
Aug  9 23:58:05 dpuc4 kernel: [<ffffffff81607dbc>] tcp_v4_rcv+0x7bc/0x9c0
Aug  9 23:58:05 dpuc4 kernel: [<ffffffffc0234036>] ? iptable_filter_hook+0x36/0x80 [iptable_filter]
Aug  9 23:58:05 dpuc4 kernel: [<ffffffff815e01c9>] ip_local_deliver_finish+0xb9/0x200
Aug  9 23:58:05 dpuc4 kernel: [<ffffffff815e04b9>] ip_local_deliver+0x59/0xd0
Aug  9 23:58:05 dpuc4 kernel: [<ffffffff815e0110>] ? ip_rcv_finish+0x3c0/0x3c0
Aug  9 23:58:05 dpuc4 kernel: [<ffffffff815dfe7c>] ip_rcv_finish+0x12c/0x3c0
Aug  9 23:58:05 dpuc4 kernel: [<ffffffff815e07e9>] ip_rcv+0x2b9/0x410
Aug  9 23:58:05 dpuc4 kernel: [<ffffffff81587245>] ? skb_checksum+0x35/0x50
Aug  9 23:58:05 dpuc4 kernel: [<ffffffff815872a0>] ? skb_push+0x40/0x40
Aug  9 23:58:05 dpuc4 kernel: [<ffffffff81586050>] ? reqsk_fastopen_remove+0x150/0x150
Aug  9 23:58:05 dpuc4 kernel: [<ffffffff8159c9e4>] __netif_receive_skb_core+0x594/0x7e0
Aug  9 23:58:05 dpuc4 kernel: [<ffffffff810f3c0f>] ? __getnstimeofday64+0x3f/0xd0
Aug  9 23:58:05 dpuc4 kernel: [<ffffffff8159cc48>] __netif_receive_skb+0x18/0x60
Aug  9 23:58:05 dpuc4 kernel: [<ffffffff8159ccd0>] netif_receive_skb_internal+0x40/0xc0
Aug  9 23:58:05 dpuc4 kernel: [<ffffffff8159dc08>] napi_gro_receive+0xd8/0x130
Aug  9 23:58:05 dpuc4 kernel: [<ffffffffc006c495>] virtnet_poll+0x2a5/0x7b0 [virtio_net]
Aug  9 23:58:05 dpuc4 kernel: [<ffffffff8159d1a3>] net_rx_action+0x173/0x380
Aug  9 23:58:05 dpuc4 kernel: [<ffffffff810974bd>] __do_softirq+0xfd/0x290
Aug  9 23:58:05 dpuc4 kernel: [<ffffffff816d16cc>] call_softirq+0x1c/0x30
Aug  9 23:58:05 dpuc4 kernel: [<ffffffff8102d465>] do_softirq+0x65/0xa0
Aug  9 23:58:05 dpuc4 kernel: [<ffffffff81097905>] irq_exit+0x175/0x180
Aug  9 23:58:05 dpuc4 kernel: [<ffffffff816d2846>] do_IRQ+0x56/0xf0
Aug  9 23:58:05 dpuc4 kernel: [<ffffffff816c628f>] common_interrupt+0x8f/0x8f
Aug  9 23:58:05 dpuc4 kernel: <EOI>  [<ffffffff816c4f80>] ? __cpuidle_text_start+0x8/0x8
Aug  9 23:58:05 dpuc4 kernel: [<ffffffff816c520b>] ? native_safe_halt+0xb/0x20
Aug  9 23:58:05 dpuc4 kernel: [<ffffffff816c4f9e>] default_idle+0x1e/0xc0
Aug  9 23:58:05 dpuc4 kernel: [<ffffffff810353d3>] arch_cpu_idle+0x23/0x120
Aug  9 23:58:05 dpuc4 kernel: [<ffffffff810f0eba>] cpu_startup_entry+0x14a/0x1c0
Aug  9 23:58:05 dpuc4 kernel: [<ffffffff816aca57>] rest_init+0x77/0x80
Aug  9 23:58:05 dpuc4 kernel: [<ffffffff81b6d0f2>] start_kernel+0x44c/0x46d
Aug  9 23:58:05 dpuc4 kernel: [<ffffffff81b6caac>] ? repair_env_string+0x5c/0x5c
Aug  9 23:58:05 dpuc4 kernel: [<ffffffff81b6c120>] ? early_idt_handler_array+0x120/0x120
Aug  9 23:58:05 dpuc4 kernel: [<ffffffff81b6c665>] x86_64_start_reservations+0x24/0x26
Aug  9 23:58:05 dpuc4 kernel: [<ffffffff81b6c7b6>] x86_64_start_kernel+0x14f/0x172
Aug  9 23:58:05 dpuc4 kernel: ---[ end trace d39b4c790733d599 ]---
Aug  9 23:58:05 dpuc4 kernel: ------------[ cut here ]------------
Aug  9 23:58:05 dpuc4 kernel: WARNING: CPU: 0 PID: 0 at net/ipv4/tcp_output.c:1134 tcp_fragment+0x371/0x380
Aug  9 23:58:05 dpuc4 kernel: Modules linked in: binfmt_misc 8021q garp mrp stp llc iptable_filter ext4 mbcache jbd2 ppdev dm_mod cirrus ttm edac_core drm_kms_helper iosf_mbi crc32_pclmul syscopyarea sysfillrect sysimgblt fb_sys_fops ghash_clmulni_intel drm aesni_intel parport_pc lrw gf128mul parport glue_helper sg ablk_helper cryptd virtio_balloon joydev i2c_piix4 i6300esb i2c_core pcspkr ip_tables xfs libcrc32c sr_mod cdrom ata_generic pata_acpi virtio_console virtio_net virtio_scsi virtio_blk ata_piix libata serio_raw crct10dif_pclmul crct10dif_common virtio_pci crc32c_intel floppy virtio_ring virtio
Aug  9 23:58:05 dpuc4 kernel: CPU: 0 PID: 0 Comm: swapper/0 Tainted: G        W      ------------   3.10.0-693.21.1.el7.x86_64 #1
Aug  9 23:58:05 dpuc4 kernel: Hardware name: RDO OpenStack Compute, BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014
Aug  9 23:58:05 dpuc4 kernel: Call Trace:
Aug  9 23:58:05 dpuc4 kernel: <IRQ>  [<ffffffff816bd804>] dump_stack+0x19/0x1b
Aug  9 23:58:05 dpuc4 kernel: [<ffffffff8108df18>] __warn+0xd8/0x100
Aug  9 23:58:05 dpuc4 kernel: [<ffffffff8108e05d>] warn_slowpath_null+0x1d/0x20
Aug  9 23:58:05 dpuc4 kernel: [<ffffffff815fe131>] tcp_fragment+0x371/0x380
Aug  9 23:58:05 dpuc4 kernel: [<ffffffff815f2b93>] tcp_match_skb_to_sack+0x73/0xd0
Aug  9 23:58:05 dpuc4 kernel: [<ffffffff815f7562>] tcp_sacktag_walk+0xf2/0x550
Aug  9 23:58:05 dpuc4 kernel: [<ffffffff815f7df3>] tcp_sacktag_write_queue+0x433/0xb50
Aug  9 23:58:05 dpuc4 kernel: [<ffffffff815f9ce5>] tcp_ack+0x3d5/0x1760
Aug  9 23:58:05 dpuc4 kernel: [<ffffffff815fed4c>] ? tcp_transmit_skb+0x52c/0xa20
Aug  9 23:58:05 dpuc4 kernel: [<ffffffff815fb695>] tcp_rcv_established+0x225/0x7e0
Aug  9 23:58:05 dpuc4 kernel: [<ffffffff816065aa>] tcp_v4_do_rcv+0x10a/0x350
Aug  9 23:58:05 dpuc4 kernel: [<ffffffff81607dbc>] tcp_v4_rcv+0x7bc/0x9c0
Aug  9 23:58:05 dpuc4 kernel: [<ffffffffc0234036>] ? iptable_filter_hook+0x36/0x80 [iptable_filter]
Aug  9 23:58:05 dpuc4 kernel: [<ffffffff815e01c9>] ip_local_deliver_finish+0xb9/0x200
Aug  9 23:58:05 dpuc4 kernel: [<ffffffff815e04b9>] ip_local_deliver+0x59/0xd0
Aug  9 23:58:05 dpuc4 kernel: [<ffffffff815e0110>] ? ip_rcv_finish+0x3c0/0x3c0
Aug  9 23:58:05 dpuc4 kernel: [<ffffffff815dfe7c>] ip_rcv_finish+0x12c/0x3c0
Aug  9 23:58:05 dpuc4 kernel: [<ffffffff815e07e9>] ip_rcv+0x2b9/0x410
Aug  9 23:58:05 dpuc4 kernel: [<ffffffff81587245>] ? skb_checksum+0x35/0x50
Aug  9 23:58:05 dpuc4 kernel: [<ffffffff815872a0>] ? skb_push+0x40/0x40
Aug  9 23:58:05 dpuc4 kernel: [<ffffffff81586050>] ? reqsk_fastopen_remove+0x150/0x150
Aug  9 23:58:05 dpuc4 kernel: [<ffffffff8159c9e4>] __netif_receive_skb_core+0x594/0x7e0
Aug  9 23:58:05 dpuc4 kernel: [<ffffffff810f3c0f>] ? __getnstimeofday64+0x3f/0xd0
Aug  9 23:58:05 dpuc4 kernel: [<ffffffff8159cc48>] __netif_receive_skb+0x18/0x60
Aug  9 23:58:05 dpuc4 kernel: [<ffffffff8159ccd0>] netif_receive_skb_internal+0x40/0xc0
Aug  9 23:58:05 dpuc4 kernel: [<ffffffff8159dc08>] napi_gro_receive+0xd8/0x130
Aug  9 23:58:05 dpuc4 kernel: [<ffffffffc006c495>] virtnet_poll+0x2a5/0x7b0 [virtio_net]
Aug  9 23:58:05 dpuc4 kernel: [<ffffffff8159d1a3>] net_rx_action+0x173/0x380
Aug  9 23:58:05 dpuc4 kernel: [<ffffffff810974bd>] __do_softirq+0xfd/0x290
Aug  9 23:58:05 dpuc4 kernel: [<ffffffff816d16cc>] call_softirq+0x1c/0x30
Aug  9 23:58:05 dpuc4 kernel: [<ffffffff8102d465>] do_softirq+0x65/0xa0
Aug  9 23:58:05 dpuc4 kernel: [<ffffffff81097905>] irq_exit+0x175/0x180
Aug  9 23:58:05 dpuc4 kernel: [<ffffffff816d2846>] do_IRQ+0x56/0xf0
Aug  9 23:58:05 dpuc4 kernel: [<ffffffff816c628f>] common_interrupt+0x8f/0x8f
Aug  9 23:58:05 dpuc4 kernel: <EOI>  [<ffffffff816c4f80>] ? __cpuidle_text_start+0x8/0x8
Aug  9 23:58:05 dpuc4 kernel: [<ffffffff816c520b>] ? native_safe_halt+0xb/0x20
Aug  9 23:58:05 dpuc4 kernel: [<ffffffff816c4f9e>] default_idle+0x1e/0xc0
Aug  9 23:58:05 dpuc4 kernel: [<ffffffff810353d3>] arch_cpu_idle+0x23/0x120
Aug  9 23:58:05 dpuc4 kernel: [<ffffffff810f0eba>] cpu_startup_entry+0x14a/0x1c0
Aug  9 23:58:05 dpuc4 kernel: [<ffffffff816aca57>] rest_init+0x77/0x80
Aug  9 23:58:05 dpuc4 kernel: [<ffffffff81b6d0f2>] start_kernel+0x44c/0x46d
Aug  9 23:58:05 dpuc4 kernel: [<ffffffff81b6caac>] ? repair_env_string+0x5c/0x5c
Aug  9 23:58:05 dpuc4 kernel: [<ffffffff81b6c120>] ? early_idt_handler_array+0x120/0x120
Aug  9 23:58:05 dpuc4 kernel: [<ffffffff81b6c665>] x86_64_start_reservations+0x24/0x26
Aug  9 23:58:05 dpuc4 kernel: [<ffffffff81b6c7b6>] x86_64_start_kernel+0x14f/0x172
Aug  9 23:58:05 dpuc4 kernel: ---[ end trace d39b4c790733d59a ]---