Cluster-Mode Configuration and High-Availability Testing of FoundationDB, the Metadata Service of GBase (南大通用) GCDW

Posted on July 7, 2022; updated June 1, 2023; by laozizhu

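GBase GCDW uses FoundationDB as its metadata database service by default. This article describes how to configure a FoundationDB cluster and how to test its high availability.

Reference

GBase 8a GCDW, compute-storage separation host edition: installation and usage preview

FoundationDB cluster configuration

This article configures a 3-node, 1-replica FoundationDB cluster, using the IPs 10.0.2.210, 10.0.2.211, and 10.0.2.212.

Download

For where to download the rpm packages, see the FoundationDB section of the reference article above.

Install the service

Install the server package with rpm -ivh: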

[root@localhost ~]# rpm -ivh foundationdb-server-6.3.24-1.el7.x86_64.rpm
Preparing...                          ################################# [100%]
Updating / installing...
   1:foundationdb-server-6.3.24-1     ################################# [100%]

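For testing, the client package was installed as well; those steps are omitted here.

Modify the configuration file

/etc/foundationdb/fdb.cluster

Change 127.0.0.1 in this file to the host's externally reachable service IP, for example 10.0.2.210.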
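
When several nodes need the same edit, it can be scripted. The following is a minimal sketch rather than part of the original procedure: the helper name set_cluster_ip is hypothetical, and it assumes the freshly installed cluster file still contains 127.0.0.1 as its only coordinator address. It keeps a .bak copy before rewriting the file.

```shell
# Hypothetical helper: replace the loopback address in a cluster file
# with the host's service IP, keeping a backup copy first.
# Usage: set_cluster_ip <cluster-file> <service-ip>
set_cluster_ip() {
    cp -p "$1" "$1.bak" || return 1
    # Read from the backup and write the edited content back to the original path,
    # so the original file keeps its owner and permissions.
    sed "s/127\.0\.0\.1/$2/g" "$1.bak" > "$1"
}

# Example (run as root on each node, with that node's own IP):
#   set_cluster_ip /etc/foundationdb/fdb.cluster 10.0.2.210
```

Writing back to the same path instead of moving a new file into place is a deliberate choice here, since the service needs to keep read and write access to the cluster file.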

Restart the service

systemctl restart foundationdb

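Configure multiple coordinators for the FDB cluster

Once all nodes are installed, configure the cluster from the fdbcli client on any one of them. Use the coordinators command to designate several IPs as coordinators; an odd number is recommended.

The following example sets three: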

fdb> coordinators 10.0.2.210:4500 10.0.2.211:4500

fdb> coordinators 10.0.2.210:4500 10.0.2.211:4500 10.0.2.212:4500
Coordination state changed
fdb>
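
The recommendation for an odd number follows from quorum arithmetic: as the status output later in this article puts it, the database stays operational as long as a quorum of coordination servers can be reached. A small sketch, assuming a simple majority quorum (the function names are illustrative only):

```shell
# Illustrative helpers, assuming a simple majority quorum of coordinators.

# Smallest number of coordinators that forms a majority of $1.
coordinator_quorum() {
    echo $(( $1 / 2 + 1 ))
}

# How many coordinators out of $1 can be lost while a majority remains.
coordinator_failures_tolerated() {
    echo $(( $1 - ($1 / 2 + 1) ))
}

for n in 1 2 3 4 5; do
    echo "coordinators=$n quorum=$(coordinator_quorum "$n") tolerated=$(coordinator_failures_tolerated "$n")"
done
```

Going from 3 to 4 coordinators raises the quorum from 2 to 3 without tolerating any additional failure, so an even count adds a machine to keep alive but no extra resilience.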

After the change succeeds, status shows Coordinators as 3, but FoundationDB processes is still 1 and Machines is also 1.

fdb> status

Using cluster file `fdb.cluster'.

Configuration:
  Redundancy mode        - single
  Storage engine         - memory-2
  Coordinators           - 3
  Usable Regions         - 1

Cluster:
  FoundationDB processes - 1
  Zones                  - 1
  Machines               - 1

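Check fdb.cluster on this host: its connection string has changed and now contains the 3 IPs.

Check the configuration file on the other nodes. It may already have been updated automatically to contain the 3 IPs; if you confirm that it was not, copy the file over from the first node. Then remember to restart the service: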

systemctl restart foundationdb

Check status again: FoundationDB processes is now 3, and so is Machines. Also note that Redundancy mode is single; the redundancy configuration is changed in the next step.

fdb> status

Using cluster file `/etc/foundationdb/fdb.cluster'.

Configuration:
  Redundancy mode        - single
  Storage engine         - memory-2
  Coordinators           - 3
  Usable Regions         - 1

Cluster:
  FoundationDB processes - 3 (less 0 excluded; 1 with errors)
  Zones                  - 3
  Machines               - 3
  Memory availability    - 2.9 GB per process on machine with least available
                           >>>>> (WARNING: 4.0 GB recommended) <<<<<
  Retransmissions rate   - 0 Hz
  Fault Tolerance        - 1 machines
  Server time            - 07/07/22 09:57:55

Data:
  Replication health     - (Re)initializing automatic data distribution
  Moving data            - unknown (initializing)
  Sum of key-value sizes - unknown
  Disk space used        - 325 MB

Operating space:
  Storage server         - 1.0 GB free on most full server
  Log server             - 20.8 GB free on most full server

Workload:
  Read rate              - 17 Hz
  Write rate             - 3 Hz
  Transactions started   - 9 Hz
  Transactions committed - 2 Hz
  Conflict rate          - 0 Hz

Backup and DR:
  Running backups        - 0
  Running DRs            - 0

Client time: 07/07/22 09:57:55

fdb>

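Configure the redundancy mode of the FDB cluster

The default mode, single, keeps one copy of the data; double keeps 2 copies and triple keeps 3. See the official documentation:

https://apple.github.io/foundationdb/configuration.html#configuration-choosing-redundancy-mode

Configure the mode through fdbcli, and note that Redundancy mode changes to double.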

Watch the Fault Tolerance line: it takes a little while to go from 0 machines to 1 machines, which means the cluster can tolerate the failure of one machine.

fdb> configure double
Configuration changed
fdb> status

Using cluster file `/etc/foundationdb/fdb.cluster'.

Configuration:
  Redundancy mode        - double
  Storage engine         - memory-2
  Coordinators           - 3
  Usable Regions         - 1

Cluster:
  FoundationDB processes - 3 (less 0 excluded; 1 with errors)
  Zones                  - 3
  Machines               - 3
  Memory availability    - 3.0 GB per process on machine with least available
                           >>>>> (WARNING: 4.0 GB recommended) <<<<<
  Fault Tolerance        - 1 machines
  Server time            - 07/07/22 10:04:08

Data:
  Replication health     - (Re)initializing automatic data distribution
  Moving data            - unknown (initializing)
  Sum of key-value sizes - unknown
  Disk space used        - 330 MB

Operating space:
  Storage server         - 1.0 GB free on most full server
  Log server             - 20.8 GB free on most full server

Workload:
  Read rate              - 16 Hz
  Write rate             - 0 Hz
  Transactions started   - 0 Hz
  Transactions committed - 0 Hz
  Conflict rate          - 0 Hz

Backup and DR:
  Running backups        - 0
  Running DRs            - 0

Client time: 07/07/22 10:04:08

fdb>
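
Because Fault Tolerance takes a while to settle, it can be convenient to read it from a script instead of re-running status by hand. The sketch below is an assumption-laden illustration: the function names are hypothetical, the parsing relies on the text layout shown in the output above (which may differ in other FoundationDB versions), and the polling loop assumes fdbcli is on the PATH and uses its --exec option to run a single command.

```shell
# Print the number from the "Fault Tolerance" line of fdbcli status text
# read on standard input (layout as in the status output shown above).
fault_tolerance() {
    sed -n 's/^ *Fault Tolerance *- *\([0-9][0-9]*\) machines.*$/\1/p'
}

# Hypothetical polling loop: block until at least $1 machine failures are tolerated.
# Only defined here; running it needs a reachable cluster.
wait_for_fault_tolerance() {
    until [ "$(fdbcli --exec status | fault_tolerance)" -ge "$1" ] 2>/dev/null; do
        sleep 5
    done
}

# Example:
#   wait_for_fault_tolerance 1 && echo "cluster tolerates one machine failure"
```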

FoundationDB high-availability test

Simulate a failure

Stop the service on node 210:

[root@rh7_210 ~]# systemctl stop foundationdb
[root@rh7_210 ~]# systemctl status foundationdb
● foundationdb.service - FoundationDB Key-Value Store
   Loaded: loaded (/usr/lib/systemd/system/foundationdb.service; enabled; vendor preset: disabled)
   Active: inactive (dead) since Thu 2022-07-07 09:44:12 CST; 37s ago
  Process: 3871 ExecStart=/usr/lib/foundationdb/fdbmonitor --conffile /etc/foundationdb/foundationdb.conf --lockfile /var/run/fdbmonitor.pid --daemonize (code=exited, status=0/SUCCESS)
 Main PID: 3873 (code=exited, status=0/SUCCESS)

Jul 07 09:33:08 rh7_210 fdbmonitor[3873]: LogGroup="default" Process="fdbmonitor": Watching conf dir /etc/foundationdb/ (2)
Jul 07 09:33:08 rh7_210 fdbmonitor[3873]: LogGroup="default" Process="fdbmonitor": Loading configuration /etc/foundationdb/...db.conf
Jul 07 09:33:08 rh7_210 fdbmonitor[3873]: LogGroup="default" Process="fdbmonitor": Starting backup_agent.1
Jul 07 09:33:08 rh7_210 fdbmonitor[3873]: LogGroup="default" Process="fdbmonitor": Starting fdbserver.4500
Jul 07 09:33:08 rh7_210 fdbmonitor[3873]: LogGroup="default" Process="backup_agent.1": Launching /usr/lib/foundationdb/back...agent.1
Jul 07 09:33:08 rh7_210 fdbmonitor[3873]: LogGroup="default" Process="fdbserver.4500": Launching /usr/sbin/fdbserver (3875)...er.4500
Jul 07 09:33:08 rh7_210 fdbmonitor[3873]: LogGroup="default" Process="fdbserver.4500": FDBD joined cluster.
Jul 07 09:44:12 rh7_210 systemd[1]: Stopping FoundationDB Key-Value Store...
Jul 07 09:44:12 rh7_210 fdbmonitor[3873]: LogGroup="default" Process="fdbmonitor": Received signal 15 (Terminated), shutting down
Jul 07 09:44:12 rh7_210 systemd[1]: Stopped FoundationDB Key-Value Store.
Hint: Some lines were ellipsized, use -l to show in full.

Status

status now shows the change: 10.0.2.210:4500 (unreachable) and Fault Tolerance - 0 machines. A failure has occurred, but the cluster can still serve requests.

fdb> status

Using cluster file `/etc/foundationdb/fdb.cluster'.

Could not communicate with all of the coordination servers.
  The database will remain operational as long as we
  can connect to a quorum of servers, however the fault
  tolerance of the system is reduced as long as the
  servers remain disconnected.

  10.0.2.210:4500  (unreachable)
  10.0.2.211:4500  (reachable)
  10.0.2.212:4500  (reachable)

Configuration:
  Redundancy mode        - double
  Storage engine         - memory-2
  Coordinators           - 3
  Usable Regions         - 1

Cluster:
  FoundationDB processes - 2 (less 0 excluded; 1 with errors)
  Zones                  - 2
  Machines               - 2
  Memory availability    - 2.9 GB per process on machine with least available
                           >>>>> (WARNING: 4.0 GB recommended) <<<<<
  Fault Tolerance        - 0 machines
  Server time            - 07/07/22 10:57:39

Data:
  Replication health     - Healthy
  Moving data            - 0.000 GB
  Sum of key-value sizes - 1 MB
  Disk space used        - 346 MB

Operating space:
  Storage server         - 1.0 GB free on most full server
  Log server             - 20.8 GB free on most full server

Workload:
  Read rate              - 5 Hz
  Write rate             - 0 Hz
  Transactions started   - 0 Hz
  Transactions committed - 0 Hz
  Conflict rate          - 0 Hz

Backup and DR:
  Running backups        - 0
  Running DRs            - 0

Client time: 07/07/22 10:57:35

fdb>
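
For monitoring, the unreachable coordinators can be pulled out of this output with a small filter. This is a sketch under the assumption that the reachability lines keep the "address (unreachable)" layout shown above; the function name is hypothetical.

```shell
# Print the address of every coordinator that fdbcli status text
# (read on standard input) marks as unreachable.
unreachable_coordinators() {
    sed -n 's/^ *\([0-9.][0-9.]*:[0-9][0-9]*\) *(unreachable).*$/\1/p'
}

# Example against a live cluster (assumes fdbcli is on the PATH):
#   fdbcli --exec status | unreachable_coordinators
```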

The cluster still serves reads and writes normally

fdb> writemode on
fdb> set sign 1234
Committed (290192586305)
fdb> get sign
`sign' is `1234'
fdb>

Simulate a second failure

Stop the service on 211 as well.

[root@rh7_211 ~]# systemctl stop foundationdb
[root@rh7_211 ~]#

The cluster can no longer serve requests

fdb> status

Using cluster file `/etc/foundationdb/fdb.cluster'.

Could not communicate with a quorum of coordination servers:
  10.0.2.210:4500  (unreachable)
  10.0.2.211:4500  (unreachable)
  10.0.2.212:4500  (reachable)

fdb>

Recovery

Start the services on 210 and 211 again:

[root@rh7_210 ~]# systemctl start foundationdb
[root@rh7_210 ~]#

[root@rh7_211 ~]# systemctl start foundationdb
[root@rh7_211 ~]#

Service is back to normal

fdb> status

Using cluster file `/etc/foundationdb/fdb.cluster'.

Configuration:
  Redundancy mode        - double
  Storage engine         - memory-2
  Coordinators           - 3
  Usable Regions         - 1

Cluster:
  FoundationDB processes - 3 (less 0 excluded; 1 with errors)
  Zones                  - 3
  Machines               - 3
  Memory availability    - 2.9 GB per process on machine with least available
                           >>>>> (WARNING: 4.0 GB recommended) <<<<<
  Fault Tolerance        - 1 machines
  Server time            - 07/07/22 11:03:51

Data:
  Replication health     - Healthy
  Moving data            - 0.000 GB
  Sum of key-value sizes - 1 MB
  Disk space used        - 336 MB

Operating space:
  Storage server         - 1.0 GB free on most full server
  Log server             - 20.8 GB free on most full server

Workload:
  Read rate              - 35 Hz
  Write rate             - 0 Hz
  Transactions started   - 8 Hz
  Transactions committed - 0 Hz
  Conflict rate          - 0 Hz

Backup and DR:
  Running backups        - 0
  Running DRs            - 0

Client time: 07/07/22 11:03:51

fdb> get sign
`sign' is `1234'
fdb>

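Scaling out

Install the service as described earlier, then change the IP in the cluster file and restart the service.

Scaling out coordinators and data nodes

Add the node to the cluster with the coordinators command. For the command and its output, see the cluster configuration section above.

Coordinators can be adjusted at any time, but an odd number is recommended.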

coordinators 10.0.2.210:4500 10.0.2.211:4500

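Scaling out data only

If you do not want to add a coordinator, overwrite the new node's configuration file with the cluster's existing fdb.cluster and restart the service.

Scaling in

Scaling in is done with the exclude command. If the node being removed is a coordinator, first adjust the coordinator list with the coordinators command.

The argument can be an IP, which covers all services on that host, or a single port on an IP.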

exclude 1.2.3.4 1.2.3.5 1.2.3.6
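
Before excluding a node, it helps to confirm whether it is still listed as a coordinator. The following sketch checks an IP against a cluster file; it assumes the usual description:ID@ip:port,ip:port layout of fdb.cluster, and the helper name is_coordinator is hypothetical.

```shell
# Hypothetical check: succeed if <ip> is listed as a coordinator in a cluster
# file of the form description:ID@ip:port,ip:port,...
# Usage: is_coordinator <cluster-file> <ip>
is_coordinator() {
    # Skip comment lines, drop everything up to '@', put one address per line,
    # keep only the IP part, then look for an exact whole-line match.
    grep -v '^#' "$1" | sed 's/^[^@]*@//' | tr ',' '\n' | cut -d: -f1 | grep -Fxq -- "$2"
}

# Example:
#   if is_coordinator /etc/foundationdb/fdb.cluster 10.0.2.212; then
#       echo "10.0.2.212 is a coordinator; change the coordinators first"
#   fi
```

The exact-match comparison avoids treating 10.0.2.21 as a match for 10.0.2.210.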

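To keep the removed node from affecting the cluster if its service is started again later, stop the service and uninstall the package.

systemctl stop foundationdb
yum remove XXXXX
or
rpm -e XXXX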
