Missed Alert Details

EFA is composed of multiple services that communicate either synchronously or asynchronously. The following cases describe what happens to alerts when a particular service is down.

  1. Fault Management Service Restart

    Acknowledgements (acks) are set up for every topic that the Fault Management service subscribes to, so unacknowledged messages stay on the message bus. When the Fault Management service reboots or restarts, it receives the pending messages from the components and continues raising alerts. The message bus also guarantees in-order delivery of messages.

    EFA continues to increment the sequence ID after the service reboots and preserves ordered delivery of notifications.
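The acknowledgement behavior described for the Fault Management service restart can be illustrated with a toy in-memory model. This is only a sketch of the general idea (a message leaves the queue only after it is acknowledged), not EFA's implementation or the RabbitMQ API. The `AckQueue` class and the event names are hypothetical and exist purely for illustration.

```python
from collections import OrderedDict


class AckQueue:
    """Toy in-memory queue: a message is removed only once it is acknowledged."""

    def __init__(self):
        # Delivery tag -> message, kept in publish order (hypothetical model).
        self._pending = OrderedDict()
        self._next_tag = 1

    def publish(self, message):
        self._pending[self._next_tag] = message
        self._next_tag += 1

    def deliver(self):
        """Return all unacknowledged (tag, message) pairs, oldest first."""
        return list(self._pending.items())

    def ack(self, tag):
        self._pending.pop(tag, None)


queue = AckQueue()
for event in ["link-down", "disk-full", "ldap-unreachable"]:  # made-up events
    queue.publish(event)

# First consumer run: handles one message, acknowledges it, then "crashes".
raised = []
tag, message = queue.deliver()[0]
raised.append(message)
queue.ack(tag)

# After the restart: the remaining messages are still pending and arrive in order.
for tag, message in queue.deliver():
    raised.append(message)
    queue.ack(tag)
```

In this model, nothing published before the simulated crash is lost, because only acknowledged messages are removed from the queue.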

  2. Notification Service Restart

    The Fault Management service publishes alerts on the message bus, and the notification service consumes them. The notification service must acknowledge every message. If the notification service crashes or reboots, the unacknowledged messages remain on the message bus. After the notification service has restarted, it therefore continues publishing the notifications to the registered subscribers.

  3. RabbitMQ Restart

    EFA does not persist messages across a RabbitMQ reboot, so any pending alerts that have not yet been published to consumers are not published. You can query the Fault Management service for the missed messages based on their sequence IDs and retrieve them if needed.

    Depending on where a message is at the time of the reboot, the Fault Management service might also not receive the notification from the components and therefore will not raise an alert. This case is non-trivial to handle, and users must be aware of it. RabbitMQ rarely reboots on its own; a reboot is usually associated with a system issue that can also affect other services.
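As a minimal sketch of how a subscriber might notice alerts missed during a RabbitMQ restart, the function below compares the sequence IDs it has received against the last ID it saw before the outage. It assumes that sequence IDs increase by exactly one per alert, which the text above does not state explicitly. The function name `find_missing_sequence_ids` is hypothetical, and the query that retrieves the missing alerts from the fault service is not shown because its interface is not described here.

```python
def find_missing_sequence_ids(last_seen, received):
    """Return sequence IDs in (last_seen, max(received)] that never arrived.

    Assumes IDs increase by exactly one per alert (an assumption of this sketch).
    """
    if not received:
        return []
    seen = set(received)
    return [seq for seq in range(last_seen + 1, max(received) + 1) if seq not in seen]


# Example: the subscriber last saw ID 100, then received these after the outage.
missing = find_missing_sequence_ids(last_seen=100, received=[101, 102, 105, 106])
print(missing)  # [103, 104]
```

The returned IDs are the ones a subscriber would then request from the fault service. A gap can only be detected once a later alert arrives, so alerts missed at the very end of the stream stay unnoticed until the next notification.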

  4. System Restart

    EFA 3.1.0 attempts to re-notify the Fault Management service on reboot. Because the alerts are stateless, users might see more frequent updates for some use cases, namely those related to HA status, storage status, and LDAP connectivity.

    Most in-flight messages are lost. However, EFA ensures that alerts for these use cases are regenerated and published on system restart. This also applies to failovers.

    EFA ensures that the sequence ID is incremented correctly even after a system reboot and that notifications are delivered in order.

  5. Sub-System Restart

    A sub-system is a component of EFA that is responsible for publishing a message that the Fault Management service eventually converts into an alert. If a sub-system restarts, EFA might not send an alert at that time.

    For example, if the monitoring system restarts when it is supposed to publish a message about a disk space issue, EFA cannot raise an alert at that time. However, once that sub-system has restarted, it eventually publishes the message, which EFA then raises as an alert.