Protocol Support for Hitless Failover

Table 1 summarizes the protocol support for hitless failover. Unless otherwise noted, the behavior is the same for all switches.

If a protocol supports hitless failover, additional information is available in the chapter for that protocol. For example, for information about network login support of hitless failover, see Network Login.

Table 1. Protocol Support for Hitless Failover
Protocol Behavior Hitless
Bootstrap Protocol Relay All bootprelay statistics (including option 82 statistics) are also available on the backup node. Yes
BGP

If you configure BGP graceful restart, by default the route manager does not delete BGP routes until 120 seconds after failover occurs. There is no traffic interruption. However, when BGP comes back up after the restart, it re-establishes sessions with its neighbors and relearns routes from all of them. This increases control traffic on the network.

If you do not configure graceful restart, the route manager deletes all BGP routes one second after the failover occurs, which results in a traffic interruption in addition to the increased control traffic.

Yes
Connectivity Fault Management (IEEE 802.1ag)

An ExtremeXOS process running on the master node should continuously send the MEP state changes to the backup. Replicating the protocol packets from a master node to a backup can add significant overhead if CCMs are initiated or received in the CPU and the CCM interval is on the order of milliseconds.

RMEP timeout does not occur on a remote node during the hitless failover.

RMEP expiry time on the new master node in case of double failures, when the RMEP expiry timer is already in progress, is as follows:

RMEP Expiry Time = elapsed expiry time on the master node + 3.5 * ccmIntervaltime + master node convergence time.

Yes
Dynamic Host Configuration Protocol client The IP addresses learned on all DHCP enabled VLANs are retained on the backup node after failover. Yes
Dynamic Host Configuration Protocol server A DHCP server continues to maintain the IP addresses assigned to its clients, along with their lease times, after failover. After a failover, all clients continue to work as before. Yes
EAPS

The primary node replicates all EAPS BPDUs to the backup, which allows the backup to be aware of the state of the EAPS domain. Since both primary and backup nodes receive EAPS BPDUs, each node maintains equivalent EAPS states.

By knowing the state of the EAPS domain, the EAPS process running on the backup node can quickly recover after a primary node failover. Although both primary and backup nodes receive EAPS BPDUs, only the primary transmits EAPS BPDUs to neighboring switches and actively participates in EAPS.

Yes
EDP EDP does not checkpoint protocol data units (PDUs) or states, so the backup node does not have the neighbor information. If the backup node becomes the primary node and starts receiving PDUs, the new primary learns about its neighbors. No
Extreme Loop Recovery Protocol (ELRP)

If you use ELRP as a standalone tool, hitless failover support is not needed since you initiate the loop detection. If you use ELRP in conjunction with ESRP, ELRP does not interfere with the hitless failover support provided by ESRP.

Although there is no hitless failover support in ELRP itself, ELRP does not affect the network behavior if a failover occurs.

No
Extreme Standby Router Protocol (ESRP)

If failover occurs on the ESRP MASTER switch, it sends a hello packet with the HOLD bit set. On receiving this packet, the ESRP SLAVE switch freezes all further state transitions. The MASTER switch keeps sending hellos with the HOLD bit set on every hello interval. When the MASTER is done with its failover, it sends another hello with the HOLD bit cleared, and the SLAVE switch resumes normal processing. (If no packet is received with the HOLD bit cleared, the SLAVE times out after a certain time interval and resumes normal processing.)

Failover on the ESRP SLAVE switch needs no special handling because that switch is not the active MASTER.

Yes
Intermediate System-Intermediate System (IS-IS)

If you configure IS-IS graceful restart, there is no traffic interruption. However, when IS-IS comes back up after the restart, it re-establishes sessions with its neighbors and relearns Link State Packets (LSPs) from all of them. This increases network control traffic.

If you do not configure graceful restart, the route manager deletes all IS-IS routes one second after the failover occurs, which results in a traffic interruption and increased control traffic. IS-IS for IPv6 does not support hitless restart.

IS-IS (IPv4) Yes

IS-IS (IPv6) No

Link Aggregation Control Protocol (LACP) If the backup node becomes the primary node, there is no traffic disruption. Yes
LLDP LLDP is more of a tool than a protocol, so there is no hitless failover support. LLDP is similar to EDP, but there is also a MIB interface to query the information learned. After a failover, it takes 30 seconds or more before the MIB database is fully populated again. No
MSDP If the master node fails, the backup node becomes active, but the failover causes the MSDP process to lose all state information and dynamic data, so it is not a hitless failover. No
MLAG

All MLAG user configuration is executed on both master and backup nodes. Both nodes open health-check and checkpoint listening sockets on their respective well-known ports. All FDB entries and IPMC group/cache information received through ISC checkpointing are synchronized to the backup node.

After failover, the TCP session that was handled by the failed master is torn down and a new session is established with the MLAG peer switch. On receiving the ISC up notification, the FDB and McMgr processes trigger bulk checkpointing of all their entries to the MLAG peer.

Yes
Network Login

802.1X Authentication: Authenticated clients remain authenticated after failover. However, one second after failover, all authenticated clients are forced to re-authenticate.

Information about unauthenticated clients is not checkpointed, so any clients that were in the process of being authenticated at the instant of failover must restart the authentication process from the beginning.

MAC-Based Authentication: Authenticated clients remain authenticated after failover, so the failover is transparent to them. Information about unauthenticated clients is not checkpointed, so any clients that were in the process of being authenticated at the instant of failover must restart the authentication process from the beginning.

With MAC-based authentication, the authentication process is very short, with only a single packet sent to the switch, so it is expected to be transparent to the client stations.

Web-Based Authentication: Web-based Netlogin users remain authenticated after a failover.

802.1X Authentication Yes

MAC-Based Authentication Yes

Web-Based Authentication Yes
OSPF

If you configure OSPF graceful restart, there is no traffic interruption. However, when OSPF comes back up after the restart, it re-establishes sessions with its neighbors and relearns Link State Advertisements (LSAs) from all of them. This increases control traffic on the network.

If you do not configure graceful restart, the route manager deletes all OSPF routes one second after the failover occurs, which results in a traffic interruption in addition to the increased control traffic.

Yes
OSPFv3

If you configure OSPFv3 graceful restart, there is no traffic interruption. However, when OSPFv3 comes back up after the restart, it re-establishes sessions with its neighbors and relearns Link State Advertisements (LSAs) from all of them. This increases control traffic on the network.

If you do not configure graceful restart, the route manager deletes all OSPFv3 routes after failover occurs, which results in a traffic interruption in addition to the increased control traffic.

Yes
PoE

The PoE configuration is checkpointed to the backup node. This ensures that, if the backup takes over, all ports currently powered stay powered after the failover and the configured power policies are still in place.

Yes
Protocol Independent Multicast (PIM) After a failover, all hardware and software caches are cleared and learning from the hardware is restarted. For all Layer 3 multicast traffic, this causes a traffic interruption equivalent to a switch reboot. No
RIP RIP does not support graceful restart, so the route manager deletes all RIP routes one second after the failover occurs. This results in a traffic interruption as well as an increase in control traffic as RIP re-establishes its database. No
RIPng

RIPng does not support graceful restart, so the route manager deletes all RIPng routes one second after the failover occurs. This results in a traffic interruption.

After RIPng comes up on the new primary node, it relearns the routes from its neighbors. This causes an increase in control traffic onto the network.

No
Simple Network Time Protocol Client The SNTP client keeps the backup node updated with the last server from which a valid update was received, the time at which that update was received, whether the SNTP time is currently good, and all other statistics. Yes
STP STP supports hitless failover, including catastrophic failure of the primary node, without interruption. There should be no discernible network event external to the switch. The protocol runs in lockstep on both master and backup nodes, and the backup node is a hot spare that can take over at any time with no impact on the network. Yes
VRRP VRRP supports hitless failover. The primary node replicates VRRP PDUs to the backup, which allows the primary and backup nodes to run VRRP in parallel. Although both nodes receive VRRP PDUs, only the primary transmits VRRP PDUs to neighboring switches and participates in VRRP. Yes
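
As an illustration of the RMEP expiry time formula given in the Connectivity Fault Management row of Table 1, the following minimal Python sketch works through the arithmetic. The function name, parameter names, and sample values are hypothetical and chosen only for this example; they are not ExtremeXOS commands or measured figures. All times are assumed to be in seconds.

```python
def rmep_expiry_time(elapsed_on_master, ccm_interval, convergence_time):
    """Return the RMEP expiry time on the new master node, in seconds.

    Follows the formula from the CFM row of Table 1:
    elapsed expiry time on the master node
    + 3.5 * CCM interval
    + master node convergence time.
    """
    return elapsed_on_master + 3.5 * ccm_interval + convergence_time


# Hypothetical sample values (seconds), chosen only to show the arithmetic.
expiry = rmep_expiry_time(elapsed_on_master=2.0, ccm_interval=1.0, convergence_time=0.5)
print(expiry)  # 2.0 + 3.5 + 0.5 = 6.0
```

With a millisecond-range CCM interval, the 3.5 * ccmIntervaltime term becomes small, so the elapsed time and the convergence time make up most of the result.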