Root Cause Analyses
After an outage, security incident, or other service disruption, a root cause analysis (RCA) SHOULD be performed.
The root cause analysis SHOULD be done as soon as possible after the incident, as a collaboration between engineers, managers, and affected parties.
RCAs are blameless. We’re all human and make mistakes. At NYPL we want to encourage an engineering culture of experimentation, learning, continuous improvement, and making room for mistakes.
RCAs SHOULD be saved in a shared directory where they are accessible by other team members.
RCAs MUST be shared with the relevant Product Manager for approval. Once approved within the team, they MUST be sent to Garvita Kapur for review.
More learning about RCAs
- Google SRE - Postmortem Culture: Learning from Failure
- How to Write Great Outage Post-Mortems
- What is an Incident Post-Mortem?
- Blameless PostMortems and a Just Culture
Sample RCA Template
Here is the information that SHOULD be included in an RCA. The complexity and length of each answer MAY be determined by the severity of the incident. Each RCA MUST include:
- A timeline of the event
- An analysis of the causes of the event
- A proposed solution/resolution that addresses the issue in a way that should prevent recurrence in the future
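
For teams that track RCAs in tooling rather than in free-form documents, the three required sections above could be modeled as a small record with a completeness check. This is only a minimal sketch: the names `RootCauseAnalysis`, `TimelineEntry`, and `missing_sections` are hypothetical and are not part of any existing NYPL tool or template.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class TimelineEntry:
    """One timestamped event in the incident timeline (hypothetical type)."""
    at: datetime
    description: str


@dataclass
class RootCauseAnalysis:
    """Hypothetical record holding the three sections every RCA MUST include."""
    title: str
    timeline: list[TimelineEntry] = field(default_factory=list)
    cause_analysis: str = ""
    proposed_resolution: str = ""

    def missing_sections(self) -> list[str]:
        """Return the names of required sections that are still empty."""
        missing = []
        if not self.timeline:
            missing.append("timeline")
        if not self.cause_analysis.strip():
            missing.append("cause_analysis")
        if not self.proposed_resolution.strip():
            missing.append("proposed_resolution")
        return missing
```

A draft could call `missing_sections()` before the RCA is shared for approval. An empty list means all three required sections have content; it says nothing about their quality, which still needs human review.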