How to Update the AAI Status Page
Introduction
We have a Status Page for our API, which can be found here. In the event of an incident like an interruption or degradation of service, the Status Page will need to be updated to reflect the current status of the incident. This will typically be done by the Incident Manager as detailed in the SOP for Outages playbook.
This playbook details the steps necessary to update the Status Page.
Procedures
- Navigate to the Incidents tab of the Status Page in Jira.
- Click the Create Incident button.
- Enter a descriptive title in the Incident name field.
- Select the appropriate option on the Incident status bar. A new incident typically begins with a status of Investigating, but there are exceptions, for example if Engineering has already identified the root cause by the time Support is notified about the incident.
- Enter a descriptive message in the Message field. The message should be clear and concise and contain specific details when possible. It should not be overly long or get into the weeds technically. That is what post-mortems are for.
  a. Details to include in the initial incident message:
    i. A description of the issue
    ii. How the issue is affecting customers (e.g. slower than usual processing times, failed requests when using Speaker Labels, 4xx errors when making requests to the GET endpoint)
    iii. Which customers are affected (e.g. new accounts that are trying to add funds for the first time, all requests made to a certain endpoint, all requests using a certain feature)
- Check the appropriate affected services in the Components affected section and select the appropriate performance status from the dropdown to the right.
- Check the Send notifications box in the Notifications section.
- Click Create.
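As a memory aid, the form fields from the steps above can be sketched as a small pre-publish checklist. This is only an illustration: `IncidentDraft` and `missing_items` are hypothetical names that mirror the form, not part of any Status Page API, and the sample values are made up.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class IncidentDraft:
    """Hypothetical stand-in for the Create Incident form (not a real API object)."""
    name: str                      # Incident name field
    status: str                    # Incident status bar, e.g. "Investigating"
    message: str                   # Message field
    components: Dict[str, str] = field(default_factory=dict)  # component -> performance status
    send_notifications: bool = False                           # Send notifications box


def missing_items(draft: IncidentDraft) -> List[str]:
    """Return the form items that still need attention before clicking Create."""
    problems = []
    if not draft.name.strip():
        problems.append("Incident name is empty")
    if not draft.status.strip():
        problems.append("Incident status is not selected")
    if not draft.message.strip():
        problems.append("Message is empty")
    if not draft.components:
        problems.append("No affected components are checked")
    if not draft.send_notifications:
        problems.append("Send notifications is unchecked")
    return problems


# Illustrative values only; the component name "API" is an assumption.
draft = IncidentDraft(
    name="Slower than usual processing times",
    status="Investigating",
    message="We are investigating slower than usual processing times for all requests.",
    components={"API": "Degraded performance"},
    send_notifications=True,
)
print(missing_items(draft))
```

An empty list means every item from the steps above has been filled in.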
Clicking Create creates the incident and displays it on our Status Page. As with all customer communications during an incident, it is important that we regularly update the Status Page to keep customers informed of the latest information we are receiving from Engineering. At a minimum, we should update the Status Page every half hour during an incident.
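The half-hour minimum can be expressed as simple time arithmetic. The sketch below assumes you note the time of the last Status Page update yourself; the function names are hypothetical and nothing here reads from the Status Page.

```python
from datetime import datetime, timedelta

# "Every half hour" from the guidance above.
UPDATE_INTERVAL = timedelta(minutes=30)


def next_update_due(last_update: datetime) -> datetime:
    """Latest time by which the next Status Page update should be posted."""
    return last_update + UPDATE_INTERVAL


def is_update_due(last_update: datetime, now: datetime) -> bool:
    """True once the half-hour window since the last update has elapsed."""
    return now >= next_update_due(last_update)


# Illustrative timestamps only.
last_update = datetime(2024, 1, 1, 12, 0)
print(next_update_due(last_update))
print(is_update_due(last_update, datetime(2024, 1, 1, 12, 20)))
```

Treat the interval as a ceiling, not a target: post sooner whenever Engineering shares new information.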
As you make updates, be sure to also update the performance dropdown to reflect the current state of the service. For example, an incident might start with Degraded performance when initially reported but move to Operational once we release a fix and move into Monitoring.
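One way to picture this is as a pair of values that travel together on every update. The sketch below is a reminder aid under that assumption, not a rule: `StatusUpdate` and `performance_needs_review` are hypothetical names, and an unchanged dropdown can be correct if the service state really has not changed.

```python
from typing import NamedTuple


class StatusUpdate(NamedTuple):
    incident_status: str         # value on the Incident status bar
    component_performance: str   # value in the performance dropdown


def performance_needs_review(previous: StatusUpdate, new: StatusUpdate) -> bool:
    """True when the incident status moved but the dropdown was left untouched."""
    status_changed = new.incident_status != previous.incident_status
    performance_unchanged = new.component_performance == previous.component_performance
    return status_changed and performance_unchanged


# The example from the paragraph above, plus a case where the dropdown was forgotten.
reported = StatusUpdate("Investigating", "Degraded performance")
fix_released = StatusUpdate("Monitoring", "Operational")
forgot_dropdown = StatusUpdate("Monitoring", "Degraded performance")

print(performance_needs_review(reported, fix_released))
print(performance_needs_review(reported, forgot_dropdown))
```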
Because incidents can affect the uptime that we show customers, it is important that we update the Status Page as soon as new information comes in, particularly if we are showing an outage. We want to minimize the amount of time we show for an outage and for incidents in general. That typically means you should update the Status Page before doing any individual customer comms.