A high-availability (HA) cluster is a group of servers that provides continuous uptime, or minimal downtime, for the applications running on those servers. If an application on one server fails, another server in the cluster maintains the availability of the application.
In an HA deployment, XCO is installed on two servers. The two XCO instances are clustered and configured with one IP address, so clients need to reach only a single endpoint. All XCO services are installed on each node. The node from which the installation is initiated becomes the active node and processes all requests. The other node is the standby node, which processes requests only when the active node fails.
All operations provided by XCO services must be idempotent, meaning they produce the same result for multiple identical requests or operations. For more information, see the "Idempotency" section in the ExtremeCloud Orchestrator CLI Administration Guide, 3.2.0.
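As a hypothetical illustration of idempotency (the endpoint and payload below are invented for this sketch and are not actual XCO API paths), repeating an identical PUT request leaves the system in the same state and returns the same result:

# First request creates or updates the resource:
curl -k -X PUT https://<virtual-ip>/v1/example/resource \
     -H "Content-Type: application/json" -d '{"name": "r1"}'
# Repeating the identical request changes nothing:
curl -k -X PUT https://<virtual-ip>/v1/example/resource \
     -H "Content-Type: application/json" -d '{"name": "r1"}'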
XCO uses the following services to implement an HA deployment:
Keepalived (VRRP) – A program that runs on both nodes. The active node periodically sends VRRP advertisements to the standby node. If the advertisements stop, keepalived on the standby node takes over the active role, and the standby node becomes the active node. Each state change runs a keepalived notify script containing logic to ensure XCO's continued operation after a failure. With a two-node cluster, a "split-brain" can occur when a network partition leads to two active nodes. When the network recovers, VRRP establishes a single active node, which determines the state of XCO. A configuration sketch follows this list.
K3s – The K3s server runs on the active node. Kubernetes state is stored in SQLite and synced in real time to the standby node by a dedicated daemon, Litestream. On failover, the keepalived notify script on the new active node reconstructs the Kubernetes SQLite database from the synced state and starts K3s. K3s runs on one node at a time, not on both nodes, so the HA cluster looks like a single-node cluster that is tied to the keepalived-managed virtual IP. A replication sketch follows this list.
MariaDB and Galera – XCO business state (device, fabric, and tenant registrations and configuration) is stored in a set of databases managed by MariaDB. Both nodes run a MariaDB server, and the Galera clustering technology keeps the business state in sync on both nodes during normal operation. A verification sketch follows this list.
GlusterFS – A clustered file system used to store XCO log files, certificates, and subinterface definitions. A daemon runs on both nodes and seamlessly syncs several directories. Inspection commands follow this list.
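The following is a minimal keepalived configuration sketch for the Keepalived (VRRP) service described above. The interface name, router ID, priority, and notify script path are illustrative assumptions rather than XCO's shipped settings; the virtual IP matches the example at the end of this section.

# Sketch of /etc/keepalived/keepalived.conf (values are assumptions)
vrrp_instance XCO_VIP {
    state BACKUP              # both nodes start as BACKUP; VRRP elects the active node
    interface eth0            # assumption: the node interconnect interface
    virtual_router_id 51      # assumption: must match on both nodes
    priority 100              # the higher priority wins the election
    advert_int 1              # send VRRP advertisements every second
    virtual_ipaddress {
        10.20.246.103         # the cluster virtual IP
    }
    # each MASTER/BACKUP/FAULT transition invokes the notify script
    notify /opt/xco/ha/notify.sh
}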
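The K3s state replication can be pictured with a Litestream configuration sketch like the one below. The database path is the default K3s SQLite location; the replica type, destination, user, and key path are assumptions, not XCO's actual configuration.

# Sketch of a litestream.yml on the active node (values are assumptions)
dbs:
  - path: /var/lib/rancher/k3s/server/db/state.db  # default K3s SQLite database
    replicas:
      - type: sftp                # stream WAL changes to the standby node
        host: 10.20.246.101:22    # standby node IP from the example below
        user: replicator          # assumption: a dedicated replication account
        key-path: /etc/xco/ha/replica_key
        path: /var/lib/xco/k3s-replica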
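For MariaDB and Galera, cluster membership and size can be verified with a standard Galera status query (the credentials are placeholders):

mysql -u root -p -e "SHOW STATUS LIKE 'wsrep_cluster_size';"
+--------------------+-------+
| Variable_name      | Value |
+--------------------+-------+
| wsrep_cluster_size | 2     |
+--------------------+-------+

A value of 2 indicates that both nodes have joined the cluster, matching the efa status expectations described later in this section.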
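For GlusterFS, the standard GlusterFS commands can be used to inspect the replicated volumes; the volume name is a placeholder because XCO's actual volume names are not listed here.

gluster volume info                     # list volumes and their replica bricks
gluster volume heal <volume-name> info  # show files still pending synchronization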
Note
Although Kubernetes runs as a single-node cluster tied to the virtual IP, XCO CLIs still operate correctly whether they run from the active or the standby node. Commands are converted to REST and sent over HTTPS to the ingress controller via the virtual IP, which is tied to the active node.
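Because the virtual IP always follows the active node, a quick way to determine which node is currently active is to check whether the virtual IP is configured locally. This is a generic sketch using the virtual IP from the example below:

ip -4 addr show | grep 10.20.246.103   # prints the address only on the active node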
The efa status command confirms the following:
For the active node:
- All enabled XCO services are Ready.
- Kubernetes state is consistent with all the enabled XCO services.
- The host is a member of the Galera or MariaDB cluster.
For the standby node:
- It is reachable via SSH from the active node.
- It is a member of the Galera or MariaDB cluster.
For both nodes:
- The Galera cluster size is 2 if both nodes are up. The cluster size is >= 1 if the standby node is down.
Example
The following example shows the active and standby TPVM node status in a multi-node TPVM deployment:
NH-1# show efa status
===================================================
EFA version details
===================================================
Version : 3.1.0
Build: 109
Time Stamp: 22-10-25:12:45:44
Mode: Secure
Deployment Type: multi-node
Deployment Platform: TPVM
Deployment Suite: Fabric Automation
Virtual IP: 10.20.246.103
Node IPs: 10.20.246.101,10.20.246.102
--- Time Elapsed: 8.512402ms ---
===================================================
EFA Status
===================================================
+-----------+---------+--------+---------------+
| Node Name | Role | Status | IP |
+-----------+---------+--------+---------------+
| tpvm2 | active | up | 10.20.246.102 |
+-----------+---------+--------+---------------+
| tpvm | standby | up | 10.20.246.101 |
+-----------+---------+--------+---------------+
--- Time Elapsed: 19.168973841s ---