This topic describes methods for verifying the health of various EFA services.
$ efa inventory device setting show --ip <ip-addr>
+--------------------------------+-------+
| NAME                           | VALUE |
+--------------------------------+-------+
| Maintenance Mode Enable On     | No    |
| Reboot                         |       |
+--------------------------------+-------+
| Maintenance Mode Enable        | No    |
+--------------------------------+-------+
| Health Check Enabled           | No    |
+--------------------------------+-------+
| Health Check Interval          | 6m    |
+--------------------------------+-------+
| Health Check Heartbeat Miss    | 2     |
| Threshold                      |       |
+--------------------------------+-------+
| Periodic Backup Enabled        | Yes   |
+--------------------------------+-------+
| Config Backup Interval         | 24h   |
+--------------------------------+-------+
| Config Backup Count            | 4     |
+--------------------------------+-------+
--- Time Elapsed: 270.251797ms ---
You can enable health checks on the device, and you can configure EFA to back up the device configuration at a regular interval. In the sample output above, the health check interval is 6 minutes and the configuration backup interval is 24 hours. For more information, see Configure Backup and Replay.
If the threshold for missed heartbeats is exceeded, EFA begins the drift and reconcile process after connectivity to the device is re-established. For more information, see Drift and Reconcile.
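The threshold behavior described above can be sketched as a small counter. This is a minimal illustration, not EFA's implementation: the `HeartbeatTracker` class and its fields are hypothetical names, and the sketch assumes that "exceeded" means strictly more consecutive misses than the configured threshold.

```python
from dataclasses import dataclass


@dataclass
class HeartbeatTracker:
    """Tracks consecutive missed heartbeats for one device (illustrative only)."""

    miss_threshold: int = 2        # "Health Check Heartbeat Miss Threshold" in the sample output
    missed: int = 0                # consecutive misses so far
    needs_reconcile: bool = False  # set once the threshold has been exceeded

    def record(self, reachable: bool) -> bool:
        """Record one health check result.

        Returns True when drift and reconcile should start, that is, on the
        first successful check after the miss threshold was exceeded.
        """
        if not reachable:
            self.missed += 1
            # Assumption: "exceeded" means strictly greater than the threshold.
            if self.missed > self.miss_threshold:
                self.needs_reconcile = True
            return False
        # Connectivity is established (or re-established): reset the counter.
        start_reconcile = self.needs_reconcile
        self.missed = 0
        self.needs_reconcile = False
        return start_reconcile


tracker = HeartbeatTracker(miss_threshold=2)
results = [tracker.record(ok) for ok in (True, False, False, False, True, True)]
print(results)  # [False, False, False, False, True, False]
```

With a threshold of 2, the third consecutive miss marks the device, and the next successful check is the point at which reconciliation would begin.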
The EFA message bus is the workhorse for asynchronous inter-service communication. Therefore, EFA uses the RabbitMQ built-in ping functionality to determine the liveness of the RabbitMQ pod.
As part of a health check, each EFA service also validates its connection to RabbitMQ and attempts to reconnect to RabbitMQ when necessary.
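The validate-and-reconnect pattern can be sketched as follows. This is an assumption-laden illustration rather than the code that EFA services run: `ensure_connection`, `is_alive`, and `connect` are hypothetical names standing in for the calls of whatever AMQP client a service uses, and the retry count and backoff are arbitrary.

```python
import time
from typing import Callable, Optional, TypeVar

C = TypeVar("C")


def ensure_connection(
    conn: Optional[C],
    is_alive: Callable[[C], bool],
    connect: Callable[[], C],
    max_attempts: int = 3,
    delay_s: float = 0.0,
) -> C:
    """Return a live connection, reconnecting if the current one is unusable.

    `is_alive` and `connect` are hypothetical hooks, not a real client API.
    """
    if conn is not None and is_alive(conn):
        return conn  # health check passed, nothing to do
    last_error: Optional[Exception] = None
    for attempt in range(1, max_attempts + 1):
        try:
            return connect()
        except ConnectionError as exc:
            last_error = exc
            if attempt < max_attempts:
                time.sleep(delay_s * attempt)  # simple linear backoff
    raise ConnectionError(
        f"could not reconnect after {max_attempts} attempts"
    ) from last_error


# Fake broker that refuses the first connection attempt.
attempts = {"n": 0}


def flaky_connect() -> str:
    attempts["n"] += 1
    if attempts["n"] < 2:
        raise ConnectionError("broker unavailable")
    return "connection-%d" % attempts["n"]


conn = ensure_connection(None, lambda c: True, flaky_connect)
print(conn)  # connection-2
```

Passing the liveness check and the connect call as parameters keeps the sketch independent of any particular client library.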
During installation or upgrade of EFA, a systemd service called efamonitor is set up. Every minute, this service validates that the EFA database cluster, Kubernetes cluster, and RabbitMQ cluster are formed and functioning correctly.
As needed, the efamonitor service remediates the MariaDB Galera and K3s clusters, and deletes pods that are not in the correct deployment state. Finally, the service re-forms the RabbitMQ cluster if a split-brain state occurs.
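One pass of a check-and-remediate monitor of this kind might look like the sketch below. It does not reproduce efamonitor: `run_monitor_cycle`, the component names, and the in-memory `state` dictionary are hypothetical stand-ins for real queries against Galera, K3s, and RabbitMQ.

```python
from typing import Callable, Dict, List, Tuple

Check = Callable[[], bool]
Remedy = Callable[[], None]


def run_monitor_cycle(components: Dict[str, Tuple[Check, Remedy]]) -> List[str]:
    """Run one pass: check every component and remediate the unhealthy ones.

    Returns the names of the components that were remediated.
    """
    remediated = []
    for name, (is_healthy, remediate) in components.items():
        if not is_healthy():
            remediate()
            remediated.append(name)
    return remediated


# Hypothetical health state; here only the message bus starts out unhealthy.
state = {"database": True, "kubernetes": True, "rabbitmq": False}


def make_hooks(name: str) -> Tuple[Check, Remedy]:
    """Build a (check, remedy) pair that reads and repairs the fake state."""
    return (lambda: state[name], lambda: state.__setitem__(name, True))


components = {name: make_hooks(name) for name in state}
first = run_monitor_cycle(components)
second = run_monitor_cycle(components)
print(first, second)  # ['rabbitmq'] []
```

A scheduler would call `run_monitor_cycle` once per interval (every minute in the description above); the second pass returning an empty list shows that remediation is only attempted for components that fail their check.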
To ensure that the active and standby nodes are operational, ping checks occur between the nodes. The pings determine whether the active node is up and running. If not, the virtual IP addresses are switched over to the other node.
To ensure that failover does not occur because of a network issue, if a ping to the peer fails, a ping is also attempted to the default gateway. If the ping to the default gateway fails, a ping is attempted to any alternative gateway that was provided during installation or upgrade.
If all of the pings fail, keepalived triggers Kubernetes to switch over to the active node and to put the other node in a Fault state.
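The ping sequence above can be summarized as a short decision function. This is a sketch under stated assumptions: `classify_reachability` and its three outcome labels are invented for illustration, the `ping` callable is a hypothetical stand-in for a real ICMP check, and the mapping from each outcome to a keepalived action is deliberately not modeled.

```python
from typing import Callable, Iterable


def classify_reachability(
    ping: Callable[[str], bool],
    peer: str,
    default_gateway: str,
    alternative_gateways: Iterable[str] = (),
) -> str:
    """Walk the ping sequence (peer, default gateway, alternatives) and label the result.

    Returns "peer_reachable", "peer_unreachable_network_ok", or "all_pings_failed".
    """
    if ping(peer):
        return "peer_reachable"
    if ping(default_gateway):
        return "peer_unreachable_network_ok"
    # any() stops at the first alternative gateway that answers.
    if any(ping(gw) for gw in alternative_gateways):
        return "peer_unreachable_network_ok"
    return "all_pings_failed"


# Example with documentation-range addresses: only the alternative gateway answers.
reachable = {"192.0.2.254"}
outcome = classify_reachability(
    ping=lambda host: host in reachable,
    peer="192.0.2.2",
    default_gateway="192.0.2.1",
    alternative_gateways=["192.0.2.254"],
)
print(outcome)  # peer_unreachable_network_ok
```

Checking the gateways only after the peer ping fails mirrors the order described above and distinguishes a peer that is down from a node whose own network connectivity is lost.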