
Runbooks are guides or instructions written to standardize and streamline processes and procedures. They often provide references for troubleshooting or remediation as part of incident response and help teams work calmly, efficiently, and consistently in high-pressure situations. The overall goal of runbooks is to reduce reliance on individual expertise and avoid human error during critical operations.

This article uses examples to explain the best practices for designing runbooks and explores tools that make runbooks and incident response more efficient. 

Summary of runbook best practices

| Best practice | Description |
| --- | --- |
| Collaborate with subject matter experts when designing a runbook | When writing the runbook, include those with hands-on experience related to the tasks to ensure that it is accurate and reflects real-world scenarios. |
| Focus on user-friendly design | Avoid jargon and complex terminology. Use plain, actionable language with numbered or bulleted lists for clarity. Use consistent templates for all runbooks to improve usability. |
| Test the runbooks | Conduct dry runs to validate the accuracy of the steps. Gather feedback from the users performing the steps and refine them accordingly. |
| Include escalation and rollback plans | Provide rollback instructions to revert changes made if the steps don’t resolve the issue, and define escalation points for further action. |
| Integrate with existing tools and systems | Use automation tools to execute steps where possible. Embed runbooks into incident management platforms for easy access. |
| Regularly review and update | Update runbooks immediately after any change to processes, tools, or systems and schedule periodic reviews to ensure accuracy. Update after incident post-mortems if necessary. Maintain a version history to track changes over time, and clearly label the latest version to avoid confusion. |
| Train team members | Conduct workshops to practice using the runbooks during mock incidents. |

Runbook example

Runbooks can improve productivity and increase system availability by documenting best practices and defined workflows for specific situations. When well written, their clear and precise nature is instrumental in maintaining order during high-pressure scenarios like outages impacting production services, making them an essential part of any IT incident management toolbox.

The following is an example of a basic runbook to react to HTTP 500 errors in a web application:

| Section | Step | Details |
| --- | --- | --- |
| Description | | The HTTP 500 error indicates a generic server-side issue. The server encountered an unexpected condition preventing it from fulfilling a request. |
| Initial troubleshooting | Verify the error | Confirm the 500 error by reloading the webpage. Use tools like Postman or cURL to send HTTP requests (GET, POST, PUT, etc.) to the endpoint. |
| | Inspect server logs | For Apache: /var/log/apache2/error.log; for Nginx: /var/log/nginx/error.log; application logs: see application documentation for log location |
| Check application code | Identify the failing module | Review the logs to pinpoint the code causing the issue. |
| | Debug code locally | Use debuggers or additional logs to understand the problem. |
| Review recent changes | Test code deployments | Roll back recent changes and test if the issue persists. |
| | Check for server updates | Check for updates to the OS, libraries, or dependencies that may have introduced the issue. |
| Escalation | Notify the development team with detailed findings and logs | Email the devteam@example.com mailbox. |
| | File a ticket in the appropriate tracking system | Include error messages, logs, and steps to reproduce. |
| Preventive measures | Implement logging and monitoring | Add any logging and monitoring that could have helped to prevent the situation. |
| | Implement automated testing | Ensure that automated tests cover critical paths. |
| Notes | Related errors | 502 Bad Gateway; 503 Service Unavailable |
| | References | Nginx debugging documentation; Apache debugging documentation |


This example is far from comprehensive; any experienced SRE team will surely have additional steps based on their needs.
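
As a rough illustration of how the "Verify the error" and "Inspect server logs" steps might be scripted, the Python sketch below requests an endpoint and pulls the most recent error lines from a log file. The function names and the choice to filter on the word "error" are assumptions made for illustration; they are not part of the runbook above, and a real team would adapt them to its own stack.

```python
import urllib.error
import urllib.request
from pathlib import Path


def fetch_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code for a GET request to url."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        # urllib raises for 4xx/5xx responses; the code is still what we want.
        return exc.code


def is_server_error(status: int) -> bool:
    """True for any 5xx status, which covers 500, 502, and 503."""
    return 500 <= status <= 599


def recent_error_lines(log_path: str, limit: int = 20) -> list[str]:
    """Return up to `limit` of the latest log lines that mention 'error'."""
    path = Path(log_path)
    if not path.exists():
        return []
    lines = path.read_text(errors="replace").splitlines()
    matches = [line for line in lines if "error" in line.lower()]
    return matches[-limit:]
```

An engineer could call `is_server_error(fetch_status(url))` against the affected endpoint and, if it returns `True`, pass the relevant log path from the runbook to `recent_error_lines` to collect evidence for the ticket.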

Before going further, it’s worth looking at how runbooks and playbooks differ. The two incident management tools have distinct purposes but are often confused due to their similarities.

Runbooks provide tactical step-by-step instructions for handling specific scenarios. Playbooks define a more strategic approach and document a broader framework that outlines roles, communication plans, and decision-making processes to guide teams. In short, playbooks outline the “what” and “why” of a response, while runbooks focus on “how.” Runbooks and playbooks complement each other to provide a cohesive system for effectively managing IT operations and incidents.


Collaborate with subject matter experts

Collaboration with subject matter experts (SMEs) ensures that a runbook is accurate and actionable while reflecting real-world scenarios and solutions. Their insights can identify potential pitfalls and provide tested solutions, resulting in a precise and user-friendly runbook.

The SMEs chosen to assist in writing any given runbook will vary. You may engage technical engineers and architects at multiple levels to understand the original design of a system and common troubleshooting steps. You may also want input from non-technical product owners or other stakeholders who can provide essential context about Service Level Objectives (SLOs), availability requirements, and communication or reporting paths. You might choose to involve security teams if the runbook relates to particularly sensitive systems or data in order to confirm that the right access methods are recommended and only those with relevant access are asked to perform given tasks.

Initial conversations and documentation from SMEs will provide a sound technical base for any runbook. However, direct observation from shadowing these experts during their daily routines or while they respond to incidents enables you to better document their workflows, techniques, and decision-making processes in detail. This can also uncover additional nuance that may get missed if relying on explanation alone.

Surveys and questionnaires can efficiently gather input from a wider group of stakeholders. They can capture more diverse perspectives in a shorter timeframe than is possible during conversations and shadowing, helping identify common pain points. You should include questions about specific challenges, tools, and preferred solutions to create a context for your runbook.

A relatively straightforward example of a questionnaire could include the description of a recent incident, followed by questions such as:

  1. What would your first troubleshooting step be?
  2. What external documentation would you reference for assistance?
  3. What internal knowledge articles may be relevant?

Focus on user-friendly design

As we saw in the introduction, presenting runbook information in a user-friendly way really matters. Here are some more specific guidelines.

Use a consistent template

Maintaining a user-friendly design ensures that runbooks are clear and actionable by all intended users. Use a consistent template with standard sections such as the following:

  • Title: The objective of the runbook
  • Triggers: Scenarios or conditions that invoke the runbook’s use
  • Instructions: Detailed, actionable steps
  • Outcomes: Desired results or indicators of success
  • Escalation: Whom to contact if the issue cannot be resolved
  • Contact details: Key contact information for quick reference

A consistent template ensures that users know what to expect when reading a runbook and can process its content efficiently.
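
One way to keep a template consistent is to treat it as a data structure and check each runbook against it. The sketch below is a minimal Python illustration; the class name `Runbook` and the method `missing_sections` are hypothetical, and the fields simply mirror the template sections listed above.

```python
from dataclasses import dataclass, field


@dataclass
class Runbook:
    """One runbook laid out using the standard template sections."""
    title: str
    triggers: list[str] = field(default_factory=list)
    instructions: list[str] = field(default_factory=list)
    outcomes: list[str] = field(default_factory=list)
    escalation: str = ""
    contact_details: str = ""

    def missing_sections(self) -> list[str]:
        """Return the names of template sections that are still empty."""
        sections = {
            "title": self.title,
            "triggers": self.triggers,
            "instructions": self.instructions,
            "outcomes": self.outcomes,
            "escalation": self.escalation,
            "contact_details": self.contact_details,
        }
        return [name for name, value in sections.items() if not value]
```

A check like this could run whenever a runbook is saved, flagging drafts that have no escalation path or contact details before anyone needs them in an incident.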

Presentation is also important. Specifically, tables lay out runbook information in a way that is easier to digest than freeform paragraphs. Consider the following runbook example:

“The HTTP 500 error indicates a generic server-side issue. It suggests that the server encountered an unexpected condition that prevented it from fulfilling the request. To troubleshoot, you should first verify the error. Confirm the 500 error by reloading the webpage, potentially using tools like Postman or cURL to send requests to the endpoint. If the error is verified, inspect the server logs (/var/log/apache2/error.log for Apache, /var/log/nginx/error.log for Nginx) and identify the failing module.”

That is pretty hard to read and digest. Now look at the same information in tabular form:

| Section | Step | Details |
| --- | --- | --- |
| Description | | The HTTP 500 error indicates a generic server-side issue. The server encountered an unexpected condition preventing it from fulfilling a request. |
| Initial troubleshooting | Verify the error | Confirm the 500 error by reloading the webpage. Use tools like Postman or cURL to send HTTP requests (GET, POST, PUT, etc.) to the endpoint. |
| | Inspect server logs | For Apache: /var/log/apache2/error.log; for Nginx: /var/log/nginx/error.log; application logs: see application documentation for log location |


The table provides the same technical details but is much easier to read and comprehend quickly, especially in a high-pressure situation.

Use straightforward language

Using clear and simple language reduces the chances of misinterpretation. To ensure that the runbook is accessible to users of all skill levels, you should provide step-by-step instructions using numbered or bulleted lists and avoid jargon or overly complex terminology.

Provide decision trees

Include decision trees for scenarios requiring conditional actions, which will enable rapid decision-making and save time during a crisis. For example, Nobl9’s composite SLOs can guide decisions by illustrating thresholds and actions to take when metrics deviate from their expected norm.
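
A decision tree can be as simple as a nested set of yes/no questions that ends in an action. The Python sketch below shows one possible representation; the questions, the actions, and the names `Decision`, `resolve`, and `HTTP_500_TREE` are all hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Decision:
    """A yes/no question with a follow-up for each answer."""
    question: str
    if_yes: Union["Decision", str]
    if_no: Union["Decision", str]


def resolve(node: Union[Decision, str], answers: dict[str, bool]) -> str:
    """Walk the tree using recorded answers until an action (a string) is reached."""
    while isinstance(node, Decision):
        node = node.if_yes if answers[node.question] else node.if_no
    return node


# Hypothetical tree for the HTTP 500 scenario.
HTTP_500_TREE = Decision(
    question="Is the SLO below its threshold?",
    if_yes=Decision(
        question="Did the error start after a deployment?",
        if_yes="Roll back the deployment",
        if_no="Escalate to the development team",
    ),
    if_no="File a ticket and keep monitoring",
)
```

Keeping the tree as data rather than prose means the same structure could be rendered as a diagram in the runbook and walked by a script or chat bot.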

Employ visual aids

Visual aids emphasizing warnings, tips, and prerequisites improve understanding and reduce cognitive load. Diagrams and screenshots should illustrate processes, while bold or italic text can highlight essential information. Use headings to ease navigation and color to emphasize warnings or specific actions.

For example, even if you have seen a specific runbook before, it can be hard to pick out important information when very few visual aids are used:

| Section | Step | Details |
| --- | --- | --- |
| Initial troubleshooting | Verify the error | Confirm the 500 error by reloading the webpage. Use tools like Postman or cURL to send HTTP requests (GET, POST, PUT, etc.) to the endpoint. |
| | Inspect server logs | For Apache: /var/log/apache2/error.log; for Nginx: /var/log/nginx/error.log; application logs: see application documentation for log location |


Utilizing bold and italic text while adding color makes the names of tools and log locations stand out. They become easier to see should an engineer remember some of the processes behind a runbook but need to be quickly reminded of the specific tool names or log locations:

| Section | Step | Details |
| --- | --- | --- |
| Initial troubleshooting | Verify the error | Confirm the 500 error by reloading the webpage. Use tools like **Postman** or **cURL** to send HTTP requests (GET, POST, PUT, etc.) to the endpoint. |
| | Inspect server logs | For **Apache**: */var/log/apache2/error.log*; for **Nginx**: */var/log/nginx/error.log*; application logs: see application documentation for log location |


Include the right level of detail

Not all users require the same amount of detail. Even perfect formatting, simplified language, and visual aids won't help if a document contains information that's irrelevant or too advanced for its intended audience. A runbook’s content should match the skill level of its intended users, and you should consider creating separate runbooks for different roles (e.g., level 1 vs. level 3 support) if this improves usability.

For example, a runbook aimed at level 1 support may be shorter and quick to reach an instruction to escalate, such as the following, which describes little more than gathering information and passing the responsibility to the development team via a support ticket.

| Section | Step | Details |
| --- | --- | --- |
| Description | | The HTTP 500 error indicates a generic server-side issue. The server encountered an unexpected condition preventing it from fulfilling a request. |
| Initial troubleshooting | Verify the error | Confirm the 500 error by reloading the webpage. Use tools like Postman or cURL to send HTTP requests (GET, POST, PUT, etc.) to the endpoint. |
| | Inspect server logs | For Apache: /var/log/apache2/error.log; for Nginx: /var/log/nginx/error.log; application logs: see application documentation for log location |
| Review recent changes | Test code deployments | Did the error start directly after a code deployment? |
| | Check for server updates | Did the error start directly after a server update? |
| Escalation | Notify the development team with detailed findings and logs | Email the devteam@example.com mailbox. |
| | File a ticket in the appropriate tracking system | Include error messages, logs, and steps to reproduce. |


In contrast, the development team’s runbook may omit the initial troubleshooting steps, since level 1 support will already have recorded the results, but include more technical steps related to the application code:

| Section | Step | Details |
| --- | --- | --- |
| Description | | The HTTP 500 error indicates a generic server-side issue. The server encountered an unexpected condition preventing it from fulfilling a request. |
| Check application code | Identify the failing module | Review the logs to pinpoint the code causing the issue. |
| | Debug code locally | Use debuggers or additional logs to understand the problem. |
| Review recent changes | Code deployments | Roll back recent changes and test if the issue persists. |
| | Server updates | Check for updates to the OS, libraries, or dependencies that may have introduced the issue. |
| Preventive measures | Implement logging and monitoring | Add any logging and monitoring that could have helped to prevent the situation. |
| | Implement automated testing | Ensure that automated tests cover critical paths. |
| Notes | Related errors | 502 Bad Gateway; 503 Service Unavailable |
| | References | Nginx debugging documentation; Apache debugging documentation |


Store runbooks in an accessible location

Even the most well-written runbook is useless if stored in an inaccessible location where your team cannot easily find it. Storing runbooks in a centralized and easily accessible location, with accurate metadata tags for efficient search retrieval and hyperlinks for quick navigation between sections, ensures that your runbooks are user-friendly for your teams and not a source of additional stress during an already high-pressure situation.
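
To make the idea of metadata tags concrete, the following Python sketch looks up runbooks by tag in a small in-memory index. The index contents and the function name `find_runbooks` are hypothetical; a real knowledge base would provide its own search, but the matching logic is similar.

```python
def find_runbooks(index: dict[str, set[str]], *tags: str) -> list[str]:
    """Return titles of runbooks carrying every requested tag, sorted by title."""
    wanted = {tag.lower() for tag in tags}
    return sorted(
        title
        for title, runbook_tags in index.items()
        if wanted <= {t.lower() for t in runbook_tags}
    )


# Hypothetical index mapping runbook titles to their metadata tags.
RUNBOOK_INDEX = {
    "Respond to HTTP 500 errors": {"http", "nginx", "apache", "web"},
    "Respond to HTTP 502 errors": {"http", "nginx", "proxy"},
    "Restore database from backup": {"database", "backup"},
}
```

Matching is case-insensitive and requires all requested tags, so a search for "http" and "nginx" narrows the results instead of widening them.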


Test the runbooks

To effectively test a runbook, determine what you want to validate during testing, and establish clear success criteria to provide a framework for assessing the runbook’s effectiveness and usability. Once the success criteria have been determined, an environment that mirrors the real-world conditions referenced in the runbook should be set up with all necessary tools, data, and configurations available.

Choose a diverse group of testers with different levels of expertise to ensure that the runbook is understandable for all intended users, from novices to experts. This could include asking level 1 support members to test the level 1 runbook to ensure that they have the relevant permissions to complete all tasks, asking someone familiar with a particular runbook to test it and check that the steps are in a logical order, or asking someone unfamiliar with a specific runbook to test the steps to confirm that they work as expected when followed literally with no prior knowledge. Non-technical staff may also wish to test a runbook’s procedural steps, such as escalation paths or communication instructions.

You should replicate the specific scenarios or incidents the runbook is designed to address, and testers should follow the instructions step by step without any prior coaching or assistance. Monitor the testers as they execute the runbook and note whether the instructions are clear, actionable, and easy to follow. Any confusion or errors should be noted for future improvement.

Including edge case conditions or unusual configurations in your testing ensures that the runbook performs well under various scenarios. Addressing edge cases and common issues makes a runbook more versatile. This could include giving instructions for a variety of software if it is known that a mix of Apache and Nginx is used across the estate (even if one is far more common), or instructions for a variety of operating systems if log locations vary across them.

Once testing is complete, you should collect detailed feedback from the testers about their experience. Document any issues, suggestions, or improvements and use this feedback to make necessary revisions to the runbook. Update the runbook based on tester feedback and conduct additional rounds of testing once the improvements have been implemented. Repeat this cycle until the runbook meets all success criteria and user needs. 

For example, a runbook could start as this:

| Section | Step | Details |
| --- | --- | --- |
| Initial troubleshooting | Verify the error | Confirm the 500 error by reloading the webpage. Use cURL to send requests to the endpoint. |
| | Inspect server logs | /var/log/apache2/error.log |


It could become the following if a tester establishes that not all applications are hosted on Apache or that cURL isn’t always sufficient to establish whether the endpoint is responding correctly:

| Section | Step | Details |
| --- | --- | --- |
| Initial troubleshooting | Verify the error | Confirm the 500 error by reloading the webpage. Use tools like Postman or cURL to send HTTP requests (GET, POST, PUT, etc.) to the endpoint. |
| | Inspect server logs | For Apache: /var/log/apache2/error.log; for Nginx: /var/log/nginx/error.log; application logs: see application documentation for log location |

Include escalation and rollback plans

The outlined steps may not always resolve the issue; you should account for such situations by specifying when and how to escalate the problem to ensure that issues are addressed promptly and by the appropriate personnel.

Here’s an example of an escalation flow:

  1. Level 1 support: Attempt resolution using the runbook.
  2. Escalate to level 2 support: Escalate if unresolved within 15 minutes or if the issue exceeds level 1 expertise.
  3. Notify incident manager: Notify for critical issues that impact service availability, require immediate attention, or need broader coordination.
  4. Engage external vendors: If internal teams cannot resolve an issue involving third-party tools or services, involve external vendors.

You may choose to implement an escalation matrix with different escalation paths or actions that depend on the criticality of the issues or the value of a composite SLO. For example:

| Section | Step or condition | Action |
| --- | --- | --- |
| Escalation | File a ticket in the appropriate tracking system. | Include error messages, logs, and steps to reproduce. |
| | If issue is unresolved after 30 minutes | Assign ticket Medium priority. |
| | If issue is unresolved after 60 minutes | Assign ticket Critical priority. Immediately notify the development team that the ticket has been raised. |
| | Composite SLO “X” is above value “Y” | Assign ticket Medium priority. |
| | Composite SLO “X” is below value “Y” | Assign ticket Critical priority. Immediately notify the development team that the ticket has been raised. |
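
The matrix above can also be expressed as a small function, which makes the rules easy to test and automate. The Python sketch below is one possible reading of it. Three details are assumptions, because the matrix does not specify them: tickets start at "Low" before any rule fires, the most severe matching rule wins when the time and SLO rules disagree, and an SLO value exactly equal to the threshold is treated as not below it.

```python
from typing import Optional

# Ordered from least to most severe; "Low" is an assumed default.
PRIORITY_ORDER = ["Low", "Medium", "Critical"]


def ticket_priority(
    minutes_unresolved: int,
    slo_value: Optional[float] = None,
    slo_threshold: Optional[float] = None,
) -> tuple[str, bool]:
    """Return (priority, notify_dev_team) for the escalation matrix."""
    candidates = ["Low"]

    # Time-based rules: 30 minutes -> Medium, 60 minutes -> Critical.
    if minutes_unresolved >= 60:
        candidates.append("Critical")
    elif minutes_unresolved >= 30:
        candidates.append("Medium")

    # SLO-based rules apply only when both values are supplied.
    if slo_value is not None and slo_threshold is not None:
        candidates.append("Critical" if slo_value < slo_threshold else "Medium")

    # Assumption: the most severe matching rule wins.
    priority = max(candidates, key=PRIORITY_ORDER.index)
    return priority, priority == "Critical"
```

The second element of the result indicates whether the development team should be notified immediately, which the matrix requires only for Critical tickets.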


Detailed instructions to revert any changes made during the runbook process should be included to ensure that systems can be returned to a stable state if the attempted fix fails.

Here’s an example of a rollback procedure:

  1. Stop current operations: Cease any ongoing actions that could exacerbate the issue.
  2. Restore backups: Use pre-resolution backups to revert to the previous configuration.
  3. Verify system stability: Test the system to confirm that it functions as expected post-rollback.
  4. Document the rollback: Record actions taken and observations for future reference and analysis.
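
For a fix that touches a single configuration file, steps 2 to 4 of the rollback procedure above might look like the Python sketch below. It assumes a backup was taken before the change and that the caller supplies its own health check; the function names and the ".bak" suffix are hypothetical. Step 1 (stopping ongoing actions) depends on the system involved and is left out.

```python
import shutil
from pathlib import Path
from typing import Callable


def back_up(config: Path) -> Path:
    """Copy the file aside before attempting a fix and return the backup path."""
    backup = config.with_name(config.name + ".bak")
    shutil.copy2(config, backup)
    return backup


def roll_back(config: Path, backup: Path, is_healthy: Callable[[], bool]) -> list[str]:
    """Restore the backup, verify stability, and return a record of what was done."""
    record = []

    # Step 2: restore the pre-change backup over the current file.
    shutil.copy2(backup, config)
    record.append(f"restored {config} from {backup}")

    # Step 3: verify stability using the caller's own check.
    healthy = is_healthy()
    record.append("health check passed" if healthy else "health check FAILED")

    # Step 4: the returned record can be attached to the incident ticket.
    return record
```

Returning the record rather than printing it leaves the caller free to store it wherever the team documents incidents.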


Integrate with existing tools and systems

Runbooks should not be treated as standalone entities; integrating them with existing tools and systems where possible improves the workflows they support. Embedding runbooks into tools commonly used by the team, such as dashboards and portals, improves visibility and efficiency. Integrating them with incident management platforms centralizes incident tracking and resolution efforts, which can streamline workflows and open up integration with other tools.


Ensure that your runbooks integrate with monitoring and alerting systems. For example, you could include Nobl9’s error budget adjustments in a decision tree to suggest different reactions to various scenarios or automatically annotate SLO charts when runbooks are triggered to add context to performance data and help identify patterns during post-incident reviews.

Furthermore, you could integrate Nobl9’s SLOs into messaging or notification systems like Slack to prompt the team to invoke a runbook when an SLO falls below a given value, provide updates on an incident as the value changes, and even confirm resolution if the value returns above a specific level. For example, you could use integrations to post a Slack, Teams, Discord, or PagerDuty message if latency for an application is above a given threshold for 1% of queries. This notifies the team of potential service degradation immediately. A similar notification could be posted once the latency returns below the threshold to confirm that the degradation has been resolved.
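
The notification logic described here, one message when degradation starts and another when it clears, can be sketched in a few lines of Python. This is a generic illustration rather than Nobl9's integration: the function names are hypothetical, the 1% default comes from the example above, and the webhook helper relies only on the fact that Slack incoming webhooks accept a JSON body with a `text` field.

```python
import json
import urllib.request
from typing import Optional


def slow_fraction(latencies_ms: list[float], threshold_ms: float) -> float:
    """Fraction of queries whose latency exceeded the threshold."""
    if not latencies_ms:
        return 0.0
    slow = sum(1 for value in latencies_ms if value > threshold_ms)
    return slow / len(latencies_ms)


def next_message(
    was_degraded: bool,
    latencies_ms: list[float],
    threshold_ms: float,
    allowed_fraction: float = 0.01,
) -> Optional[str]:
    """Return a message only when the state changes, otherwise None."""
    fraction = slow_fraction(latencies_ms, threshold_ms)
    degraded = fraction >= allowed_fraction
    if degraded and not was_degraded:
        return (
            f"Latency above {threshold_ms} ms for {fraction:.1%} of queries. "
            "Consider invoking the runbook."
        )
    if was_degraded and not degraded:
        return "Latency is back below the threshold. Degradation resolved."
    return None


def post_to_webhook(url: str, text: str) -> None:
    """POST a JSON body with a `text` field to an incoming webhook URL."""
    body = json.dumps({"text": text}).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    urllib.request.urlopen(request, timeout=5)
```

Sending a message only on state changes keeps the channel quiet while an incident is ongoing, so the start and resolution notices stand out.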

Utilizing Nobl9’s low severity burn rate alerts, you can preemptively identify, be notified of, and even address potential service degradation before it escalates. Integration with Slack, for example, is effortlessly configurable via YAML:


apiVersion: n9/v1alpha
kind: AlertMethod
metadata:
  name: slack-notification
  displayName: slack notification
spec:
  description: Sends notification to a Slack channel
  slack:
    url: https://hooks.slack.com/services/1234567890/abcdef

Workflow automation tools can provide pre-made remediation scripts for engineers to execute or carry out tasks automatically to reduce human error and speed up incident resolution.

Integration does not necessarily need to be achieved through technical means. Including links and signposts for dashboards or other data collections within the steps of a runbook is also a valid approach. Nobl9’s dashboards can collate data from various sources to provide a single-pane-of-glass view on the status of any service and confirm whether a runbook has the desired effect. Composite SLOs can also be used to check if the service is returning to an acceptable level.

As mentioned, a runbook is only valuable if easily located and accessed. Storing runbooks in a centralized knowledge management system alongside documentation ensures that users can easily access the latest version. Utilizing tags and keywords to provide a searchable database of runbooks and including hyperlinks for faster navigation enhances accessibility.

You should integrate runbooks with ticketing systems to connect incident documentation and task management. This ensures that every action is tracked and associated with the relevant ticket. Automation to trigger escalation workflows when predefined conditions are met can reduce response times and ensure that the right people address the issues.

Furthermore, ChatOps tools enable natural language interaction with your runbooks, allowing the retrieval of information or execution of automated steps directly from a chat platform.
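
As a simplified illustration of retrieving runbook steps from chat, the Python sketch below answers messages such as "runbook http-500 step 2". It uses fixed commands rather than natural language, and the runbook name, its steps, and the function `handle_message` are all hypothetical; a real ChatOps tool would sit between the chat platform and a handler like this.

```python
# Hypothetical runbook store keyed by a short name.
RUNBOOKS = {
    "http-500": [
        "Confirm the 500 error by reloading the webpage.",
        "Inspect the server logs.",
        "File a ticket with error messages, logs, and steps to reproduce.",
    ],
}


def handle_message(text: str) -> str:
    """Answer messages of the form 'runbook <name>' or 'runbook <name> step <n>'."""
    words = text.strip().split()
    if len(words) < 2 or words[0].lower() != "runbook":
        return "Usage: runbook <name> [step <n>]"

    steps = RUNBOOKS.get(words[1].lower())
    if steps is None:
        return f"No runbook named '{words[1]}'."

    # A specific step was requested.
    if len(words) == 4 and words[2].lower() == "step" and words[3].isdecimal():
        number = int(words[3])
        if 1 <= number <= len(steps):
            return f"Step {number}: {steps[number - 1]}"
        return f"Runbook '{words[1]}' has {len(steps)} steps."

    # Otherwise list every step in order.
    return "\n".join(f"{i}. {step}" for i, step in enumerate(steps, start=1))
```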

Regularly review and update

Creating a runbook is not a one-time activity. Maintaining productive runbooks is an ongoing process that requires revisions to remain up-to-date and aligned with ever-changing business processes, tools, SLOs, and systems.

Assign runbook owners

Individuals or teams should take ownership of runbook maintenance and ensure that it remains relevant as business requirements and processes change. Roles such as primary owner, reviewers, and contributors should be assigned. Guidelines for incorporating changes should also be specified to define clear responsibilities for maintaining each runbook.

Review periodically

Review your runbooks periodically and incorporate any changes in best practices or technological progress over time. Regular reviews ensure that outdated information is not left in place for long periods and that your runbooks are up-to-date when most needed.

The owners should maintain open lines of communication with SMEs even after the runbook is first completed. They should encourage feedback and suggestions for improvement, particularly when new challenges arise or existing processes change. Continuous collaboration ensures that the runbook remains a dynamic and valuable resource for the entire team.
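
Periodic reviews are easier to keep up when overdue runbooks are flagged automatically. The Python sketch below shows one way to do that; the 90-day interval is an assumed default rather than a recommendation from this article, and the function names are hypothetical.

```python
from datetime import date, timedelta


def review_due(last_reviewed: date, today: date, interval_days: int = 90) -> bool:
    """True once at least `interval_days` have passed since the last review."""
    return today - last_reviewed >= timedelta(days=interval_days)


def overdue_runbooks(
    last_reviews: dict[str, date], today: date, interval_days: int = 90
) -> list[str]:
    """Return the names of runbooks whose review is due, sorted by name."""
    return sorted(
        name
        for name, reviewed in last_reviews.items()
        if review_due(reviewed, today, interval_days)
    )
```

A scheduled job could run `overdue_runbooks` against the knowledge base and notify each runbook's owner.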

Review during post-mortems

Conducting a post-mortem after any incident is a common approach to understanding what went wrong, preventing future incidents, and improving responses. Any post-mortem should also review the runbooks used to determine whether they helped to provide a timely resolution or contributed to delays. The results of this review should be used to improve the runbook where possible, ensuring that it evolves based on lessons from real-world use. 

Post-mortems are best conducted as soon after the incident as possible, when everyone involved still has a fresh recollection of events and any changes proposed as a result can be made in a timely manner. Of course, these should be prioritized among other work, and post-mortems for less serious incidents may be considered less important, but completing a post-mortem within 48 hours of an incident is recommended.

Review after organizational change

Runbooks should be reviewed immediately after any change to organizational processes, SLOs, tooling, or systems. This could include updates to software configurations, infrastructure changes, or internal workflows. Failure to do so could lead to confusion or incorrect execution during incidents because outdated runbooks may provide irrelevant or inaccurate instructions.

Maintain version control

Regular reviews produce new revisions, so the most recent version of a runbook must be easy to identify. Labeling the latest version with a “latest version” tag, a version number, or a timestamp ensures that everyone uses the most accurate information, reducing confusion and the errors caused by following outdated guidance.

You should also maintain a version history for each runbook. This could be as simple as a changelog or be more structured, with version numbers and dates. A version history allows for easy tracking of updates, helps teams identify which version they’re referencing, and provides insight into specific changes that were made (e.g., “updated after system migration on [date]”).
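
A structured version history can be kept as a simple list of revisions. The Python sketch below is one possible shape for it; the class and function names are hypothetical, and storing the version as a pair of integers is a design choice so that 1.10 sorts after 1.2.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Revision:
    version: tuple[int, int]  # (major, minor), compared numerically
    changed_on: date
    note: str


def latest(history: list[Revision]) -> Revision:
    """Return the revision with the highest version number."""
    return max(history, key=lambda revision: revision.version)


def changelog(history: list[Revision]) -> str:
    """Render the history newest first, labeling the latest version."""
    if not history:
        return ""
    ordered = sorted(history, key=lambda r: r.version, reverse=True)
    newest = ordered[0].version
    lines = []
    for r in ordered:
        label = " (latest version)" if r.version == newest else ""
        lines.append(
            f"v{r.version[0]}.{r.version[1]}{label} - "
            f"{r.changed_on.isoformat()} - {r.note}"
        )
    return "\n".join(lines)
```

Rendering the changelog from the same data that identifies the latest version keeps the label and the history from drifting apart.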


Train team members

A runbook is only as effective as the team members using it. All team members should receive adequate training on the structure and application of the runbooks to enable smooth execution during critical situations. This minimizes confusion, reduces response times, and builds confidence in handling incidents before that confidence is needed during a genuine service disruption.

You should ensure that incident stakeholders are familiar with a runbook's layout, contents, and purpose. This should include understanding how information is organized, the sequence of steps, and any embedded troubleshooting tips or escalation paths. A quick walkthrough of the runbook’s structure can help team members navigate it efficiently during high-pressure situations and should be part of any initial onboarding training.

Training sessions and workshops allow teams to practice using runbooks in simulated environments. Mock incidents replicate real-life scenarios and enable the team to follow procedures, identify potential gaps or ambiguities, and build confidence in their ability to respond effectively without the pressure of a live incident. This could be done in a sandbox environment, completely removed from production services, or by using chaos engineering to induce issues in a production environment and confirm that your monitoring services, such as Nobl9’s composite SLO values, detect them and that the team responds as expected.

With training and hands-on practice, you can ensure that your teams are well-equipped to use your runbooks to manage incidents and maintain operational resilience when needed.

Last thoughts

When well written, runbooks become more than just another piece of business documentation. They can act as lifelines that enable fast and effective responses to emergencies. By adopting best practices for designing and writing your runbooks, you can ensure that they are written with clarity, usability, and integration in mind. This empowers your teams to navigate incidents confidently, minimize downtime, and uphold service quality.
