Health Monitoring
EXOS can monitor processes and restart them upon failure. Additionally, it
can monitor threads within a process. A thread must first be registered with
EXOS and must then periodically send a heartbeat, indicating that it is still
functioning properly.
Thread Monitoring
The following methods have been added to the threading.Thread class to
support thread monitoring.
- class threading.Thread
- Thread.register(period)
Register this thread for monitoring by the EXOS process manager. The
keepalive() method must be called at least every period seconds to
indicate it is still alive. Otherwise, the process will be considered
failed and recovery will be attempted.
- Thread.deregister()
Deregister this thread from monitoring.
- Thread.keepalive()
Indicate that this thread is still alive. Should be called at
least as often as the registered period. It is typical to call it
at the top of a processing loop.
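The register/keepalive/deregister cycle described above might look like the following minimal sketch. The Worker class, the call_if_present() helper, and the choice to register from inside run() are illustrative assumptions, not part of the EXOS API. Because register(), keepalive(), and deregister() exist only on the EXOS-extended threading.Thread, the helper silently skips them elsewhere so the sketch also runs off-switch.

```python
import threading
import time


def call_if_present(obj, name, *args):
    """Call obj.<name>(*args) when the method exists; do nothing otherwise."""
    method = getattr(obj, name, None)
    if callable(method):
        method(*args)


class Worker(threading.Thread):
    """Hypothetical worker that heartbeats at the top of its processing loop."""

    def __init__(self, period=5, iterations=3):
        super(Worker, self).__init__()
        self.period = period          # seconds allowed between keepalives
        self.iterations = iterations  # bounded here so the sketch terminates
        self.completed = 0

    def run(self):
        # Assumption: registration happens from within the monitored thread.
        call_if_present(self, "register", self.period)
        try:
            for _ in range(self.iterations):
                # Heartbeat first, then do one unit of work.
                call_if_present(self, "keepalive")
                time.sleep(0.01)  # stand-in for work that stays under period
                self.completed += 1
        finally:
            # Stop monitoring even if the work raises.
            call_if_present(self, "deregister")
```

Each pass through the loop should take comfortably less than period seconds; otherwise the heartbeat arrives late and the process may be treated as failed.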
Pool Monitoring
The MonitoredThreadPool class inherits from futures.ThreadPoolExecutor to
implement a monitored pool. It automatically registers a pool monitoring
thread with EXOS and periodically sends heartbeats. As long as the pool is
still processing jobs, the monitoring thread continues to send heartbeats. If
the pool gets stuck, EXOS will note the failure and can recover the process.
- class exos.api.MonitoredThreadPool(*args, **kwds)
A ThreadPoolExecutor that is monitored by the EXOS process manager.
If the thread pool stops processing jobs, the process manager will
fail the process and attempt to recover it.
- __init__(self, max_workers, name=None, period=15, *args, **kwargs)
Create a new MonitoredThreadPool named name. The thread pool will
send a keepalive message to the process manager every period seconds.
If the pool is stuck for more than 3 periods, the process will be
failed and recovered.
Remaining arguments are passed to futures.ThreadPoolExecutor.
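As a rough illustration of how the pool might be used, the sketch below submits jobs through the standard executor interface that MonitoredThreadPool inherits. The make_pool() helper, the pool name "worker-pool", and the square() job are hypothetical. The fallback to a plain ThreadPoolExecutor is an assumption made only so the example runs where exos.api is unavailable; the fallback pool is not monitored.

```python
from concurrent.futures import ThreadPoolExecutor

try:
    from exos.api import MonitoredThreadPool  # available only on EXOS
except ImportError:
    MonitoredThreadPool = None


def make_pool(max_workers=4, name="worker-pool", period=15):
    """Return a monitored pool on EXOS, or an unmonitored one elsewhere."""
    if MonitoredThreadPool is not None:
        return MonitoredThreadPool(max_workers, name=name, period=period)
    return ThreadPoolExecutor(max_workers=max_workers)


def square(x):
    """Trivial stand-in for a real job."""
    return x * x


pool = make_pool()
try:
    # map() preserves input order, so results line up with range(5).
    results = list(pool.map(square, range(5)))
finally:
    pool.shutdown(wait=True)
```

With the default period of 15 seconds, the text above implies that a pool stuck for more than roughly 45 seconds would cause the process to be failed, so individual jobs should be kept short or split up.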
Process State
An EXOS process has a state that indicates whether it is READY, STOPPED, etc.
When a process is started, it transitions from BOOTING to LOADCFG and finally
to READY. A process must call ready() to enter the READY state.
- exos.api.ready()
Declare this process ready for clients. This will update the
process state to READY.
- exos.api.get_process_state(process_name)
Get the ProcessState of process_name.
Process States:
- ProcessState.FAIL
- ProcessState.STOPPED
- ProcessState.STARTED
- ProcessState.BOOTING
- ProcessState.LOADCFG
- ProcessState.READY
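A process could wrap these calls along the following lines. This is only a sketch: announce_ready() and is_process_ready() are hypothetical helpers, and it assumes that ProcessState is reachable as exos.api.ProcessState, which the reference above does not state explicitly. Off-switch, where exos cannot be imported, both helpers simply return False.

```python
try:
    from exos import api  # available only on EXOS
except ImportError:
    api = None


def announce_ready(load_config):
    """Run load_config(), then declare this process ready for clients.

    Returns True if ready() was called, False when the EXOS API is absent.
    """
    load_config()
    if api is None:
        return False
    api.ready()
    return True


def is_process_ready(process_name):
    """Return True if process_name is in the READY state.

    Assumption: ProcessState is exposed as api.ProcessState.
    """
    if api is None:
        return False
    return api.get_process_state(process_name) == api.ProcessState.READY
```

Calling ready() only after configuration has been loaded keeps the READY state meaningful to any process that checks it before connecting.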
Stacking
EXOS switches can be “stacked” for redundancy and simplified management.
Not all switches can be stacked. Use the is_stackable() function to determine
if the current switch supports stacking.
- exos.api.is_stackable()
Return True if we are running on a stackable switch. However, stacking
may not be enabled.
Slots
The slot collection provides information about the slot numbers used in
the stack.
- exos.api.slot
- class exos.api.SlotProperties
- self
The current switch’s slot number.
- first
The first valid slot number.
- last
The last valid slot number.
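The three properties above can be combined to enumerate slots, as in this sketch. It assumes that valid slot numbers are contiguous integers from first through last, which the reference does not guarantee. valid_slots() and peer_slots() are hypothetical helpers that accept any object with self, first, and last attributes, such as exos.api.slot.

```python
def valid_slots(slot):
    """Return every slot number from slot.first through slot.last, inclusive.

    Assumption: slot numbers are contiguous integers.
    """
    return list(range(slot.first, slot.last + 1))


def peer_slots(slot):
    """Return all valid slot numbers except the current switch's own."""
    own = slot.self  # 'self' here is the attribute holding our slot number
    return [number for number in valid_slots(slot) if number != own]
```

On a switch, these would presumably be called as valid_slots(api.slot) and peer_slots(api.slot) after importing api from exos.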
Primary, Backup, and Standby
Within each stack, one switch will be the primary, another may be a backup,
and the rest are standbys. The following functions allow a process to
determine its switch's current state.
- exos.api.is_primary()
Return True if this switch is the primary.
- exos.api.is_backup()
Return True if this switch is the backup.
- exos.api.is_standby()
Return True if this switch is a standby.
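A process that behaves differently per role might fold the three predicates into one value, as sketched here. switch_role() is a hypothetical helper; it takes the API module as a parameter (for example, api imported from exos) so that it can be exercised off-switch with a stand-in object. The "unknown" result is an assumption covering the case where none of the predicates returns True, such as when stacking is not enabled.

```python
def switch_role(api):
    """Return 'primary', 'backup', 'standby', or 'unknown' for this switch."""
    if api.is_primary():
        return "primary"
    if api.is_backup():
        return "backup"
    if api.is_standby():
        return "standby"
    # Assumption: no predicate is True, e.g. stacking is not enabled.
    return "unknown"
```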
Processes can “checkpoint” state so that failover is seamless. For example,
as a routing protocol learns about neighbors, it may checkpoint that list to the
backup so that the list does not need to be re-learned after a failover.
Checkpointing
Under Python, checkpointing is implemented as “call this, but over there.” For
example, if the primary learns about a new neighbor, it may call
add_neighbor() locally and then use call_on_backup() to make the same call,
but on the backup switch:

    add_neighbor(new_ip)
    api.call_on_backup(add_neighbor, new_ip)

The following checkpointing functions are available.
- exos.api.is_checkpointing()
Return True if this switch is ready to checkpoint data.
- exos.api.call_on_primary(fn, *args, **kwds)
Call fn on the primary with args and kwds. fn, args, kwds
must be pickle-able. If not, a PicklingError is raised. True
is returned if the message was sent.
- exos.api.call_on_backup(fn, *args, **kwds)
Call fn on the backup with args and kwds. fn, args, kwds
must be pickle-able. If not, a PicklingError is raised. True
is returned if the message was sent.
- exos.api.call_on_standby(slot, fn, *args, **kwds)
Call fn on the standby in the given slot with args and kwds. fn, args, kwds
must be pickle-able. If not, a PicklingError is raised. True
is returned if the message was sent.
- exos.api.call_on_standbys(fn, *args, **kwds)
Call fn on all standbys with args and kwds. fn, args, kwds
must be pickle-able. If not, a PicklingError is raised. True
is returned if the message was sent.
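The checkpointing calls above could be wrapped defensively, as in the following sketch. checkpoint_to_backup() is a hypothetical helper, and checking is_checkpointing() before sending is an assumption about intended usage rather than a documented requirement. The API module is passed in as a parameter so the helper can be tried off-switch with a stand-in object.

```python
import pickle


def checkpoint_to_backup(api, fn, *args, **kwds):
    """Try to run fn(*args, **kwds) on the backup switch.

    Returns True if the message was sent, False if the switch is not ready
    to checkpoint or the call could not be pickled.
    """
    # Assumption: skip the call when the switch is not ready to checkpoint.
    if not api.is_checkpointing():
        return False
    try:
        return bool(api.call_on_backup(fn, *args, **kwds))
    except pickle.PicklingError:
        # fn, args, or kwds could not be serialized for the remote call.
        return False
```

Because pickle serializes functions by reference to their module and name, fn generally needs to be a module-level function; lambdas and nested functions are not pickle-able.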