Verifying XCO System Health

Use this topic to learn about the methods for verifying the health of various XCO services.

SLX Device Health

By default, health check functionality is deactivated when SLX devices are registered. You can verify the status of the functionality with the following XCO command.

$ efa inventory device setting show --ip <ip-addr>

+--------------------------------+-------+
| NAME                           | VALUE |
+--------------------------------+-------+
| Maintenance Mode Enable On     | No    |
| Reboot                         |       |
+--------------------------------+-------+
| Maintenance Mode Enable        | No    |
+--------------------------------+-------+
| Health Check Enabled           | No    |
+--------------------------------+-------+
| Health Check Interval          | 6m    |
+--------------------------------+-------+
| Health Check Heartbeat Miss    | 2     |
| Threshold                      |       |
+--------------------------------+-------+
| Periodic Backup Enabled        | Yes   |
+--------------------------------+-------+
| Config Backup Interval         | 24h   |
+--------------------------------+-------+
| Config Backup Count            | 4     |
+--------------------------------+-------+
--- Time Elapsed: 270.251797ms ---
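
If you want to check these settings from a script, the table can be parsed into name/value pairs. The following is a minimal sketch, not part of XCO: it assumes the output layout matches the sample above (border lines start with "+", and a long name wraps onto a second row inside the same cell). The function name parse_settings and the constant SAMPLE_OUTPUT are hypothetical.

```python
# Abbreviated copy of the sample output shown above.
SAMPLE_OUTPUT = """\
+--------------------------------+-------+
| NAME                           | VALUE |
+--------------------------------+-------+
| Maintenance Mode Enable On     | No    |
| Reboot                         |       |
+--------------------------------+-------+
| Health Check Enabled           | No    |
+--------------------------------+-------+
| Health Check Interval          | 6m    |
+--------------------------------+-------+
| Health Check Heartbeat Miss    | 2     |
| Threshold                      |       |
+--------------------------------+-------+
--- Time Elapsed: 270.251797ms ---
"""


def parse_settings(output: str) -> dict[str, str]:
    """Turn the NAME/VALUE table into a dict, joining wrapped names."""
    settings: dict[str, str] = {}
    name_parts: list[str] = []
    value_parts: list[str] = []

    def flush() -> None:
        # A border line closes the current cell; store what was collected.
        if name_parts:
            settings[" ".join(name_parts)] = " ".join(value_parts)
        name_parts.clear()
        value_parts.clear()

    for raw in output.splitlines():
        line = raw.strip()
        if line.startswith("+"):
            flush()
        elif line.startswith("|"):
            cells = [cell.strip() for cell in line.strip("|").split("|")]
            if len(cells) != 2:
                continue  # Not a two-column row; ignore it.
            name, value = cells
            if name:
                name_parts.append(name)
            if value:
                value_parts.append(value)
        # Any other line (such as the "Time Elapsed" footer) is ignored.
    flush()
    settings.pop("NAME", None)  # Drop the header row.
    return settings


settings = parse_settings(SAMPLE_OUTPUT)
print("Health check enabled:", settings["Health Check Enabled"])
```

Joining rows between border lines keeps wrapped names such as "Maintenance Mode Enable On Reboot" intact, so lookups can use the full setting name.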

You can enable health check functionality on the device. You can also configure XCO to regularly back up the device configuration. In the sample output above, the health check interval is 6 minutes and the configuration backup interval is 24 hours. For more information, see Configure Backup and Replay.

If the threshold for missed heartbeats is exceeded, XCO begins the drift and reconcile process after connectivity to the device is re-established. For more information, see Drift and Reconcile.
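
The threshold behavior can be illustrated with a small state tracker. This is a sketch under stated assumptions, not the XCO implementation: it assumes that misses are counted consecutively, that "exceeded" means strictly greater than the threshold, and that the counter resets when the device becomes reachable again. The class name HeartbeatTracker is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class HeartbeatTracker:
    """Counts consecutive missed heartbeats for one device (illustrative only)."""

    miss_threshold: int = 2  # Matches "Health Check Heartbeat Miss Threshold" above.
    consecutive_misses: int = 0

    def record(self, reachable: bool) -> bool:
        """Record one health check result.

        Returns True when connectivity has just been re-established after the
        miss threshold was exceeded, which is the point at which drift and
        reconcile would begin.
        """
        if not reachable:
            self.consecutive_misses += 1
            return False
        should_reconcile = self.consecutive_misses > self.miss_threshold
        self.consecutive_misses = 0  # Assumed: the counter resets on success.
        return should_reconcile


tracker = HeartbeatTracker(miss_threshold=2)
# Three consecutive misses exceed a threshold of 2; the next success triggers.
for result in (False, False, False, True):
    if tracker.record(result):
        print("Connectivity restored: start drift and reconcile")
```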

XCO Services Health

All services in XCO have internal health REST APIs that Kubernetes uses to restart pods that are deemed unhealthy. The result of a liveness probe determines whether a pod is healthy. Typical values for liveness probes are as follows:

initialDelaySeconds: 60
periodSeconds: 10
timeoutSeconds: 15
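
As a rough illustration of what these values imply, the sketch below estimates the earliest point at which Kubernetes could restart a container that is unhealthy from the start. It assumes the Kubernetes default failureThreshold of 3 (not shown in the values above, and XCO may set a different value) and that each failing probe fails immediately rather than running into the timeout. The function name is hypothetical, and real probe timing varies slightly.

```python
def earliest_restart_seconds(
    initial_delay: int, period: int, failure_threshold: int = 3
) -> int:
    """Estimate seconds from container start to the probe failure that triggers a restart.

    The first probe runs after the initial delay; each further probe follows one
    period later. A restart is triggered on the failure_threshold-th consecutive
    failure. failure_threshold=3 is the Kubernetes default and an assumption here.
    """
    return initial_delay + (failure_threshold - 1) * period


# Values from the typical liveness probe settings above.
estimate = earliest_restart_seconds(initial_delay=60, period=10)
print(f"Earliest restart after about {estimate} seconds")
```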

RabbitMQ Liveness

The XCO message bus is the workhorse for asynchronous inter-service communication. Therefore, XCO uses the RabbitMQ built-in ping functionality to determine the liveness of the RabbitMQ pod.

As part of a health check, each XCO service also validates its connection to RabbitMQ and attempts to reconnect to RabbitMQ when necessary.

XCO System Health for High-availability Deployments

During installation or upgrade of XCO, a system service called efamonitor is set up. This service runs validations every minute to check that the XCO database cluster, GlusterFS, and RabbitMQ are functioning correctly.

As needed, the efamonitor service remediates MariaDB Galera cluster and RabbitMQ connection issues and logs system statistics.

Node Health

To ensure that the active and standby nodes are operational, ping checks occur between the nodes. The pings determine whether the active node is up and running. If not, the virtual IP addresses are switched over to the other node.

To ensure that failover does not occur because of a network issue, if a ping to the peer fails, a ping to the default gateway is also attempted. If the ping to the default gateway fails, a ping is attempted to any alternative gateway that was provided during installation or upgrade.

If all of the pings fail, keepalived triggers Kubernetes to switch over to the active node and to put the other node in a Fault state.
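
The order of the ping checks described above can be sketched as follows. This is an illustration only, not the keepalived or XCO implementation: the function name, the result labels, and the example addresses are hypothetical, and the ping callable is injected so that the logic can be exercised without network access. The sketch classifies connectivity; it does not perform any failover action.

```python
from typing import Callable, Iterable


def classify_connectivity(
    ping: Callable[[str], bool],
    peer: str,
    default_gateway: str,
    alternative_gateways: Iterable[str] = (),
) -> str:
    """Apply the ping order: peer first, then default gateway, then alternatives."""
    if ping(peer):
        return "peer-reachable"
    # The peer did not answer. Check the gateways to tell a peer failure
    # apart from a local network problem.
    for gateway in (default_gateway, *alternative_gateways):
        if ping(gateway):
            return "peer-unreachable"
    return "all-pings-failed"


# Example with a fake ping function: only the alternative gateway answers.
reachable = {"192.0.2.254"}


def fake_ping(host: str) -> bool:
    return host in reachable


state = classify_connectivity(
    fake_ping,
    peer="192.0.2.2",
    default_gateway="192.0.2.1",
    alternative_gateways=["192.0.2.254"],
)
print(state)
```

Passing the ping function as a parameter keeps the decision order separate from how a ping is actually sent, which makes the sequence easy to test.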