Verifying EFA System Health

This topic describes methods for verifying the health of various EFA services.

SLX Device Health

By default, health check functionality is deactivated when SLX devices are registered. You can verify the status of the functionality with the following EFA command.
$ efa inventory device setting show --ip <ip-addr>

| NAME                                  | VALUE |
| Maintenance Mode Enable On Reboot     | No    |
| Maintenance Mode Enable               | No    |
| Health Check Enabled                  | No    |
| Health Check Interval                 | 6m    |
| Health Check Heartbeat Miss Threshold | 2     |
| Periodic Backup Enabled               | Yes   |
| Config Backup Interval                | 24h   |
| Config Backup Count                   | 4     |
--- Time Elapsed: 270.251797ms ---

You can enable health check functionality on the device, which causes EFA to check device health at a regular interval (every 6 minutes by default). You can also configure EFA to periodically back up the device configuration (every 24 hours by default). For more information, see Configure Backup and Replay.
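The health check can be turned on per device with the corresponding update command. The flags below follow the `efa inventory device setting` command family shown above, but are assumptions about the exact flag names; confirm them against your EFA release with `--help`.

```shell
$ efa inventory device setting update --ip <ip-addr> --health-check-enable yes
```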

If the threshold for missed heartbeats is exceeded, EFA begins the drift and reconcile process after connectivity to the device is re-established. For more information, see Drift and Reconcile.

EFA Services Health

All services in EFA expose internal health REST APIs that Kubernetes uses to restart pods that are deemed unhealthy. The result of a liveness probe determines whether a pod is healthy. Typical values for liveness probes are as follows:
  • initialDelaySeconds: 60
  • periodSeconds: 10
  • timeoutSeconds: 15
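In a Kubernetes pod spec, these values appear in a livenessProbe block such as the following sketch. The probe path and port are illustrative placeholders, not EFA's actual health endpoints.

```yaml
# Hypothetical EFA service pod spec fragment; path and port are
# placeholders, not the actual EFA health REST API endpoint.
livenessProbe:
  httpGet:
    path: /livez            # assumed health REST API path
    port: 8080              # assumed service port
  initialDelaySeconds: 60   # wait before the first probe
  periodSeconds: 10         # probe every 10 seconds
  timeoutSeconds: 15        # fail a probe that takes longer than 15s
```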

RabbitMQ Liveness

The EFA message bus is the workhorse for asynchronous inter-service communication. Therefore, EFA uses the RabbitMQ built-in ping functionality to determine the liveness of the RabbitMQ pod.

As part of a health check, each EFA service also validates its connection to RabbitMQ and attempts to reconnect to RabbitMQ when necessary.
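RabbitMQ exposes its built-in ping through the `rabbitmq-diagnostics` tool, so the same check can be run manually against the broker pod. The pod name and namespace below are placeholders for illustration; substitute the names in your deployment.

```shell
# Placeholder pod name and namespace; adjust for your deployment.
kubectl exec -n efa rabbitmq-0 -- rabbitmq-diagnostics ping
```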

EFA System Health for High-availability Deployments

During installation or upgrade of EFA, a systemd service called efamonitor is set up. This service runs validations every minute to verify that the EFA database cluster, GlusterFS, and RabbitMQ are functioning correctly.

As needed, the efamonitor service remediates MariaDB Galera cluster and RabbitMQ connection issues, and logs the status of the system.
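Because efamonitor runs as a systemd service, its activity can be inspected with standard systemd tooling on the EFA host:

```shell
# Confirm the monitor is running and review its recent validations.
systemctl status efamonitor
journalctl -u efamonitor --since "10 minutes ago"
```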

Node Health

To ensure that the active and standby nodes are operational, ping checks run between the nodes. These pings determine whether the active node is up and running; if it is not, the virtual IP addresses are switched over to the other node.

To ensure that failover does not occur because of a network issue, if a ping to the peer fails, a ping is also attempted to the default gateway. If the ping to the default gateway fails, a ping is attempted to any alternative gateway that was provided during installation or upgrade.

If all of the pings fail, keepalived triggers Kubernetes to switch over to the other node, which becomes active, and puts the failing node in a Fault state.
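The cascading ping decision described above can be sketched in shell. This is an illustration of the logic only, not the actual keepalived check script: `reachable` wraps a single short-timeout ping, and the peer and gateway addresses are placeholders supplied by the caller.

```shell
#!/bin/sh
# Sketch of the cascading health decision: peer first, then the
# default gateway, then an optional alternative gateway.

reachable() {
    # One ping with a short timeout; success means the host answered.
    ping -c 1 -W 2 "$1" > /dev/null 2>&1
}

decide() {
    peer="$1"; gw="$2"; alt_gw="$3"
    if reachable "$peer"; then
        # Active node answered: nothing to do.
        echo "peer ok: no failover"
    elif reachable "$gw" || { [ -n "$alt_gw" ] && reachable "$alt_gw"; }; then
        # Our own network is fine, so the peer really is down.
        echo "network ok, peer down: fail over VIP to this node"
    else
        # We cannot reach anything: assume a local network fault.
        echo "all pings failed: enter Fault state"
    fi
}
```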