• Online, Self-Paced
Course Description

To ensure and maintain a system's functional state, site reliability engineers (SREs) must learn how to identify, calculate, and manage a system's operational load, which generally falls into three categories: ongoing operational activities, tickets, and pages.

In this course, you'll explore these categories in detail. You'll start by outlining methods for managing operational load at the team level and for using support ticketing systems and service level objectives.

Next, you'll investigate 'toil,' a term used to describe the operational work associated with running and maintaining a production service. You'll outline steps for identifying, calculating, and eliminating toil and examine the adverse effects toil can have on a team.

Additionally, you'll outline how to work with interrupts and distinguish between the crucial metrics used for managing them. Lastly, you'll identify the human factors to consider when dealing with interrupts, including efficiency, distractibility, and respect.

Learning Objectives

  • Discover the key concepts covered in this course

Framework Connections

The materials within this course focus on the NICE Framework Task, Knowledge, and Skill statements identified within the indicated NICE Framework component(s):

Specialty Areas

  • Incident Response
  • Systems Analysis
  • Systems Development

Feedback

If you would like to provide feedback for this course, please e-mail the NICCS SO at NICCS@hq.dhs.gov.