Establishing a Culture of Reliability

Course

This course is about fostering a culture built on reliability. We will learn best practices for several key areas of Site Reliability Engineering (SRE) and how they contribute to that culture. We will cover how to run balanced, effective on-call rotations and how to handle incidents. Next, we will discuss how to review your system throughout its lifecycle to find and mitigate potential risk factors. Managing system capacity at every phase of a system's lifecycle is another major component of keeping everything operating at maximum reliability. We will round out the course with a thorn in every SRE's side: toil. We will discuss how to identify and reduce toil to maximize the time spent on operational work.

4 weeks

Real-world Projects

Completion Certificate

Last Updated May 15, 2023

Prerequisites:

No experience required

Course Lessons

Lesson 1

Introduction to Establishing a Culture of Reliability

In this lesson, we cover some introductory material to help you start with a solid foundation.

Lesson 2

Improving On-Call Effectiveness

A solid on-call rotation is essential to achieving peak reliability. This lesson discusses how to balance on-call shifts and pair them with a dependable incident management process that your team can follow.

Lesson 3

Reliability Reviews

In this lesson, we learn how to review your system from the start so that it is ready for release. It is important to have processes in place to find potential risks and develop mitigations for them.

Lesson 4

Managing System Capacity

System capacity is an essential part of ensuring reliability. This lesson discusses how to balance system capacity against cost so that neither resources nor money are wasted.

Lesson 5

Toil Reduction

Toil is the bane of every SRE team, and this lesson is all about how to reduce toil to allow your team to focus on operational work that improves reliability.

Lesson 6 • Project

Project: Plan, Reduce, Repeat

To wrap everything up, you will complete the final project, working through three scenarios that tie together everything you have learned.

Taught By The Best

Sonny Sevin

Site Reliability Engineer

Sonny is an SRE with a varied background. He dabbled in research at Lawrence Berkeley National Labs before moving into site reliability engineering for a more hands-on role. He has been published in several computing journals and has taught introductory programming courses.
