GCDW cloud data warehouse: output of the status monitoring command for a FoundationDB database cluster

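GCDW, the cloud data warehouse of GBase 8a, uses FoundationDB as its metadata storage service. This article describes the output of status, the monitoring command of that database.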

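See also

GCDW cloud data warehouse: managing a FoundationDB database cluster (start and stop, configuration files, scaling out and in, and replacement)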

The status monitoring command

The status command reports the state of the cluster. It requires that fdbcli can connect to the cluster.

fdb> status

Using cluster file `/etc/foundationdb/fdb.cluster'.

2 client(s) reported: Cluster file contents do not match current cluster connection string. Verify the cluster file and its parent directory are writable and that the cluster file has not been overwritten externally.

Configuration:
  Redundancy mode        - double
  Storage engine         - memory-2
  Coordinators           - 3
  Usable Regions         - 1

Cluster:
  FoundationDB processes - 3 (less 0 excluded; 1 with errors)
  Zones                  - 3
  Machines               - 3
  Memory availability    - 1.3 GB per process on machine with least available
                           >>>>> (WARNING: 4.0 GB recommended) <<<<<
  Retransmissions rate   - 0 Hz
  Fault Tolerance        - 0 machines
  Server time            - 03/03/23 14:23:45

Data:
  Replication health     - Healthy
  Moving data            - 0.000 GB
  Sum of key-value sizes - 1 MB
  Disk space used        - 326 MB

Operating space:
  Storage server         - 1.0 GB free on most full server
  Log server             - 13.3 GB free on most full server

Workload:
  Read rate              - 14 Hz
  Write rate             - 0 Hz
  Transactions started   - 5 Hz
  Transactions committed - 0 Hz
  Conflict rate          - 0 Hz

Backup and DR:
  Running backups        - 0
  Running DRs            - 0

Client time: 03/03/23 14:23:45

fdb>
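For scripted monitoring, the text shown above can be turned into a nested dictionary. The following is a minimal sketch, not an official parser: the function name parse_status and the SAMPLE text are illustrative assumptions, and the regular expressions only assume the "Section:" and "field - value" layout visible in the output above. fdbcli also provides status json for machine-readable output, which is usually the more robust choice.

```python
import re

# Shortened copy of the output above, used only for illustration.
SAMPLE = """\
Using cluster file `/etc/foundationdb/fdb.cluster'.

Configuration:
  Redundancy mode        - double
  Storage engine         - memory-2
  Coordinators           - 3

Cluster:
  FoundationDB processes - 3 (less 0 excluded; 1 with errors)
  Fault Tolerance        - 0 machines

Data:
  Replication health     - Healthy
"""


def parse_status(text):
    """Return {section: {field: value}} from human-readable status text."""
    result = {}
    section = None
    for line in text.splitlines():
        # A section header is an unindented line that ends with a colon.
        header = re.match(r"^(\S[^:]*):\s*$", line)
        if header:
            section = header.group(1)
            result[section] = {}
            continue
        # A field is an indented "name - value" line inside a section.
        field = re.match(r"^\s+(.+?)\s+-\s+(.*)$", line)
        if field and section is not None:
            result[section][field.group(1)] = field.group(2).strip()
    return result


status = parse_status(SAMPLE)
# Hypothetical alert rule: flag the cluster when it tolerates no failures.
fault_tolerance = int(status["Cluster"]["Fault Tolerance"].split()[0])
needs_attention = fault_tolerance == 0
```

With the sample above, needs_attention is true because Fault Tolerance is reported as 0 machines.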

status output fields

Redundancy mode

The redundancy mode, that is, the replication policy.

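Redundancy mode                                     single        double        triple
Recommended number of machines                      1-2           3-4           5+
Number of replicas                                  1             2             3
Minimum live machines needed to stay operational    1             2             3
Minimum machines needed for fault tolerance         not possible  3             4
Recommended number of coordinators                  1             3             5
Simultaneous machine failures that may lose data    any one       2+            3+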
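The table can also be expressed as data so that a deployment script can check a planned configuration. This is a minimal sketch: the names REDUNDANCY_MODES, suggest_mode and is_fault_tolerant are hypothetical, and the numbers are copied from the table rather than queried from FoundationDB.

```python
# Values copied from the redundancy mode table; None means "not possible".
REDUNDANCY_MODES = {
    "single": {"replicas": 1, "min_live": 1, "min_fault_tolerant": None, "coordinators": 1},
    "double": {"replicas": 2, "min_live": 2, "min_fault_tolerant": 3, "coordinators": 3},
    "triple": {"replicas": 3, "min_live": 3, "min_fault_tolerant": 4, "coordinators": 5},
}


def suggest_mode(machines):
    """Pick the mode whose recommended machine range contains `machines`."""
    if machines < 1:
        raise ValueError("at least one machine is required")
    if machines <= 2:
        return "single"
    if machines <= 4:
        return "double"
    return "triple"


def is_fault_tolerant(mode, live_machines):
    """True if `live_machines` is enough for the mode to tolerate a failure."""
    needed = REDUNDANCY_MODES[mode]["min_fault_tolerant"]
    return needed is not None and live_machines >= needed
```

For the three-machine cluster used in this article, suggest_mode(3) returns "double", which matches the configured redundancy mode.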

Storage engine

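The storage engine. Two types are supported: memory and ssd.

With either storage engine, FoundationDB writes a transaction to disk, including its replicas, before reporting it as committed, which guarantees the durability that ACID requires. At commit time FoundationDB may have written only the transaction log and deferred updating the on-disk data representation. This deferral gives a significant advantage in burst performance, but it also means that disk usage may keep growing after the last commit.

Changing the storage engine type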

fdb> configure memory-3
Usage: configure [new] <single|double|triple|three_data_hall|three_datacenter|ssd|memory|memory-radixtree-beta|proxies=<PROXIES>|logs=<LOGS>|resolvers=<RESOLVERS>>*
fdb> configure ssd
Configuration changed
fdb> status

Using cluster file `fdb.cluster'.

Configuration:
  Redundancy mode        - double
  Storage engine         - ssd-2

ssd storage engine

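Suited to large data volumes. Data is stored in a B-tree structure.

After data is deleted from the database, the ssd engine reclaims storage space lazily. The engine slowly shuffles empty B-tree pages to the end of the database file and then truncates them. This work runs at low priority, so in some scenarios the space is returned only after a delay.

The freed space can be reused by the database, but if disk space is the main concern, you can exclude the node with exclude and then add it back with include. This generates new data files that no longer contain the deleted data, similar to replacing a machine.

On weaker hardware, such as network storage, this engine may perform poorly or have low availability.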

memory storage engine

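Suited to small data volumes and small databases.

Data is kept in memory and logged to disk. All reads are served from memory, while writes must be flushed to disk. This engine is a better fit for hardware with good sequential I/O but poor random I/O.

With large data volumes, startup takes longer because the data structures must be rebuilt in memory.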

Coordinators

The number of coordinator nodes. An odd number is recommended.

FoundationDB processes

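The number of server processes. By default one process uses one CPU core, and at least 4 GB of memory per process is recommended. To make full use of a machine's resources, you can run several processes on one machine. Each process handles its own portion (shard) of the data.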

/etc/foundationdb/foundationdb.conf

[fdbserver.4500]

[fdbserver.4501]

[fdbserver.4502]

[fdbserver.4503]

The ID in each section name is used as the port number and is also the name of the data directory.

[root@k8s-81 foundationdb]# ll /var/lib/foundationdb/data/
total 8
drwxr-xr-x. 2 foundationdb foundationdb 4096 Mar  3 15:35 4500
drwxr-xr-x. 2 foundationdb foundationdb  137 Mar  3 16:24 4501
drwxr-xr-x. 2 foundationdb foundationdb 4096 Mar  3 16:24 4502
drwxr-xr-x. 2 foundationdb foundationdb  137 Mar  3 16:24 4503
[root@k8s-81 foundationdb]#
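The mapping from section ID to port and data directory can be sketched with the standard configparser module. This is an illustration under assumptions: SAMPLE_CONF contains only the section headers shown above, DATA_ROOT is taken from the directory listing above, and list_processes is a hypothetical helper, not part of FoundationDB.

```python
import configparser
import posixpath

# Only the section headers from the article; a real file has more settings.
SAMPLE_CONF = """\
[fdbserver.4500]

[fdbserver.4501]

[fdbserver.4502]

[fdbserver.4503]
"""

# Assumption: the data root seen in the directory listing above.
DATA_ROOT = "/var/lib/foundationdb/data"


def list_processes(conf_text, data_root=DATA_ROOT):
    """Return the port and data directory derived from each [fdbserver.<ID>]."""
    # Interpolation is disabled so that values containing "%" cannot fail.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    processes = []
    for section in parser.sections():
        prefix, _, ident = section.partition(".")
        if prefix == "fdbserver" and ident.isdigit():
            processes.append({
                "id": ident,
                "port": int(ident),
                "datadir": posixpath.join(data_root, ident),
            })
    return processes


for proc in list_processes(SAMPLE_CONF):
    print(proc["port"], proc["datadir"])
```

Comparing this list with the actual contents of the data directory is one way to spot a process that is configured but has never started.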

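The details behind the error count ("with errors") can be seen with status details. Typical causes include a node that cannot be reached and a client whose cluster file differs from the server's connection string.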

fdb> status details

Using cluster file `/etc/foundationdb/fdb.cluster'.

2 client(s) reported: Cluster file contents do not match current cluster connection string. Verify the cluster file and its parent directory are writable and that the cluster file has not been overwritten externally.
  10.244.0.85:51536
  10.0.2.81:50188

Configuration:
  Redundancy mode        - double
  Storage engine         - ssd-2
  Coordinators           - 3
  Usable Regions         - 1

Cluster:
  FoundationDB processes - 3 (less 0 excluded; 2 with errors)

fdb> status details

Using cluster file `/etc/foundationdb/fdb.cluster'.

Could not communicate with all of the coordination servers.
  The database will remain operational as long as we
  can connect to a quorum of servers, however the fault
  tolerance of the system is reduced as long as the
  servers remain disconnected.

  10.0.2.81:4500  (unreachable)
  10.0.2.82:4500  (reachable)
  10.0.2.83:4500  (reachable)

Configuration:
  Redundancy mode        - double
  Storage engine         - ssd-2
  Coordinators           - 3
  Usable Regions         - 1

Cluster:
  FoundationDB processes - 2 (less 0 excluded; 1 with errors)
  Zones                  - 2

Machines

The number of physical machines that have at least one process participating in the cluster.

Memory availability

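The memory available per process on the machine with the least available memory. 4 GB or more is recommended.

The value is a conservative estimate: ((total memory - committed memory) + sum of the physical memory used by the processes themselves) / number of processes. If it falls below 4 GB, a warning is shown.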

>>>>> (WARNING: 4.0 GB recommended) <<<<<
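The estimate can be worked through in a few lines. This is a sketch of the formula as described here, not FoundationDB's own code: the function names and the sample figures are invented for illustration, and all values are in GB.

```python
WARN_BELOW_GB = 4.0  # threshold quoted in the warning message


def memory_per_process(total, committed, process_sizes):
    """((total - committed) + sum(process sizes)) / number of processes."""
    if not process_sizes:
        raise ValueError("the machine runs no fdbserver process")
    return ((total - committed) + sum(process_sizes)) / len(process_sizes)


def cluster_memory_availability(machines):
    """The reported value: the lowest per-process figure over all machines."""
    return min(memory_per_process(**machine) for machine in machines)


# Invented example: two machines, the second one is tighter on memory.
machines = [
    {"total": 16.0, "committed": 6.0, "process_sizes": [1.0, 1.0]},
    {"total": 8.0, "committed": 5.0, "process_sizes": [0.5, 0.5]},
]
available = cluster_memory_availability(machines)
show_warning = available < WARN_BELOW_GB
```

In this invented example the first machine yields (10 + 2) / 2 = 6 GB per process and the second (3 + 1) / 2 = 2 GB, so the reported value is 2 GB and the warning would be shown.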

Fault tolerance

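The maximum number of nodes that may fail without losing data or availability. A value of 0 means that the failure of any single node will cause data loss or make the cluster unusable.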

Server time

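The current timestamp on the server side.

Replication health

An assessment of the health of the data replicas. Healthy means the replicas are in good condition.

Moving data

The amount of data currently being moved between nodes.

Sum of key-value sizes

An estimate of the total size of the stored key-value pairs. It does not include replicas or other overhead.

Disk space used

The disk space occupied by the cluster.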

Storage server

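The free storage space on the storage servers, reported for the most full server. For the ssd storage engine it covers disk only; for memory it covers both disk and RAM.

Log server

The free space on the servers that hold the logs, reported for the most full server.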

Read rate

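The current number of reads per second.

Write rate

The current number of writes per second.

Transactions started

The current number of transactions started per second.

Transactions committed

The current number of transactions committed per second.

Conflict rate

The current number of conflicts per second.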

Running backups

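The number of backups currently running. Different backups can cover different prefixes and/or write to different destinations.

Running DRs

The number of DRs currently running. Different DRs can stream different prefixes and/or stream to different DR clusters.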

status details output

Process details

Running status with the details argument shows detailed information about the processes of the cluster and the database.

fdb> status details

Using cluster file `/etc/foundationdb/fdb.cluster'.

2 client(s) reported: Cluster file contents do not match current cluster connection string. Verify the cluster file and its parent directory are writable and that the cluster file has not been overwritten externally.
  10.0.2.81:47377
  10.0.2.81:8544

Configuration:
  Redundancy mode        - double
  Storage engine         - memory-2
  Coordinators           - 3
  Usable Regions         - 1

Cluster:
  FoundationDB processes - 3 (less 0 excluded; 1 with errors)
  Zones                  - 3
  Machines               - 3
  Memory availability    - 1.4 GB per process on machine with least available
                           >>>>> (WARNING: 4.0 GB recommended) <<<<<
  Retransmissions rate   - 0 Hz
  Fault Tolerance        - 1 machines
  Server time            - 03/06/23 09:15:35

Data:
  Replication health     - Healthy
  Moving data            - 0.000 GB
  Sum of key-value sizes - 1 MB
  Disk space used        - 332 MB

Operating space:
  Storage server         - 1.0 GB free on most full server
  Log server             - 12.8 GB free on most full server

Workload:
  Read rate              - 14 Hz
  Write rate             - 0 Hz
  Transactions started   - 5 Hz
  Transactions committed - 0 Hz
  Conflict rate          - 0 Hz

Backup and DR:
  Running backups        - 0
  Running DRs            - 0

Process performance details:
  10.0.2.81:4500         (  4% cpu;  5% machine; 0.000 Gbps;  1% disk IO; 0.5 GB / 2.7 GB RAM  )
  10.0.2.82:4500         (  6% cpu;  2% machine; 0.000 Gbps;  1% disk IO; 0.4 GB / 2.8 GB RAM  )
  10.0.2.83:4500         (  4% cpu;  2% machine; 0.000 Gbps;  0% disk IO; 0.5 GB / 1.4 GB RAM  )
    Cluster file contents do not match current cluster connection string. Verify the cluster file and its parent directory are writable and that the cluster file has not been overwritten externally.

Coordination servers:
  10.0.2.81:4500  (reachable)
  10.0.2.82:4500  (reachable)
  10.0.2.83:4500  (reachable)

Client time: 03/06/23 09:15:35

fdb>

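Each process is listed with its IP address and port, followed by detailed figures for that process.

cpu

The CPU utilization of the individual process. A single fdbserver process uses only one core.

machine

The CPU utilization, across all cores, of the machine running the process, that is, the overall CPU utilization of the host.

Gbps

The total network throughput, inbound plus outbound.

disk IO

The percentage of time the disk holding the data is busy.

RAM

The total physical memory used by the process, followed by the memory available to each process.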
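The per-process lines can be split into these fields with a regular expression. This is a minimal sketch that assumes the line layout shown in the "Process performance details" output above; LINE_RE and parse_process_line are hypothetical names, and other FoundationDB versions may format the line differently.

```python
import re

LINE_RE = re.compile(
    r"^\s*(?P<address>\d+\.\d+\.\d+\.\d+:\d+)\s+\(\s*"
    r"(?P<cpu>\d+)% cpu;\s*(?P<machine>\d+)% machine;\s*"
    r"(?P<gbps>[\d.]+) Gbps;\s*(?P<disk_io>\d+)% disk IO;\s*"
    r"(?P<ram_used>[\d.]+) GB / (?P<ram_avail>[\d.]+) GB RAM\s*\)\s*$"
)


def parse_process_line(line):
    """Return the fields of one process line, or None if it does not match."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    return {
        "address": fields["address"],
        "cpu": int(fields["cpu"]),            # percent of one core
        "machine": int(fields["machine"]),    # percent of the whole host
        "gbps": float(fields["gbps"]),
        "disk_io": int(fields["disk_io"]),    # percent busy
        "ram_used": float(fields["ram_used"]),    # GB
        "ram_avail": float(fields["ram_avail"]),  # GB
    }


line = "  10.0.2.81:4500         (  4% cpu;  5% machine; 0.000 Gbps;  1% disk IO; 0.5 GB / 2.7 GB RAM  )"
info = parse_process_line(line)
```

Lines that are not process lines, such as the cluster file message printed under a process, return None and can simply be skipped.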

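Coordinator node information in the status output

This section lists the IP address and port of each coordinator node and whether it can be reached. A node that can be reached is shown as reachable; one that cannot is shown as unreachable.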

Coordination servers:
  10.0.2.81:4500  (reachable)
  10.0.2.82:4500  (reachable)
  10.0.2.83:4500  (reachable)
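A monitoring script can read this list and check whether a quorum is still reachable. This is a sketch under assumptions: parse_coordinators and has_quorum are hypothetical helpers, and "quorum" is taken here to mean a strict majority of the coordinators.

```python
import re

COORD_RE = re.compile(r"^\s*(\S+:\d+)\s+\((reachable|unreachable)\)\s*$")


def parse_coordinators(text):
    """Return {address: True if reachable} for each coordinator line."""
    coordinators = {}
    for line in text.splitlines():
        match = COORD_RE.match(line)
        if match:
            coordinators[match.group(1)] = match.group(2) == "reachable"
    return coordinators


def has_quorum(coordinators):
    """Assumption: a quorum is a strict majority of the coordinators."""
    reachable = sum(coordinators.values())
    return reachable > len(coordinators) // 2


# Sample taken from the earlier output with one unreachable coordinator.
sample = """\
  10.0.2.81:4500  (unreachable)
  10.0.2.82:4500  (reachable)
  10.0.2.83:4500  (reachable)
"""
coords = parse_coordinators(sample)
quorum_ok = has_quorum(coords)
```

With two of three coordinators reachable, quorum_ok is true, which is consistent with the earlier message that the database stays operational as long as a quorum can be reached.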