Lesson 1
Introduction to Establishing a Culture of Reliability
In this lesson, we cover some introductory material to help you start with a solid foundation.
Course
This course is about fostering a culture built on reliability. We will learn best practices for several key areas of a Site Reliability Engineer's (SRE's) work and how they contribute to a culture of reliability. We will cover how to run balanced and effective on-call rotations and how to handle incidents. Next, we will discuss how to review your system throughout its lifecycle to find and mitigate potential risk factors. Managing system capacity at every phase of a system's lifecycle is another major component of keeping everything operating at maximum reliability. We will round out the course by discussing a thorn in every SRE's side: toil. We will discuss how to identify and reduce toil to maximize the time spent on engineering work.
4 weeks
Real-world Projects
Completion Certificate
Last Updated May 15, 2023
No experience required
Lesson 1
Introduction to Establishing a Culture of Reliability
In this lesson, we cover some introductory material to help you start with a solid foundation.
Lesson 2
Improving On-Call Effectiveness
A solid on-call rotation is essential to achieving peak reliability. This lesson discusses how to set up balanced on-call shifts and a clear incident management process that your team can follow.
Lesson 3
Reliability Reviews
In this lesson, we learn how to review a system from the start of its lifecycle to prepare it for release. It is important to have processes in place to find potential risks and develop mitigations for them.
Lesson 4
Managing System Capacity
System capacity is an essential part of ensuring reliability. This lesson discusses how to balance system capacity with costs to ensure that resources and money are not being wasted.
Lesson 5
Toil Reduction
Toil is the bane of every SRE team. This lesson is about how to reduce toil so that your team can focus on engineering work that improves reliability.
Lesson 6 • Project
Project: Plan, Reduce, Repeat
To wrap everything up, you will complete the final project, in which you work through three scenarios that tie together everything you have learned.
Sonny Sevin
Site Reliability Engineer
Sonny is an SRE with a varied background. He dabbled in research at Lawrence Berkeley National Labs before moving into site reliability engineering for a more hands-on role. He has been published in several computing journals and has taught introductory programming courses.