Site Reliability Engineer

Name: Site Reliability Engineer Nanodegree Program
Rating: 4.5 (9 reviews)

Nanodegree Program

The goal of the Site Reliability Engineer (SRE) Nanodegree program is to equip software developers with the engineering and operational skills required to build automation tools and responses that ensure designed solutions meet non-functional requirements such as availability, performance, security, and maintainability. The content focuses on both designing systems that automate responses to site issues and responding to common on-call situations.

Intermediate

4 months

Real-world Projects

Completion Certificate

Last Updated July 19, 2024

Skills you'll learn:

Toil reduction • Data recovery • Site reliability engineering business context • Deployment automation

Prerequisites:

DevOps basics • Scripting • Amazon Elastic Kubernetes Service

Courses In This Program

Course 1 • 45 minutes

Welcome!

Welcome! We're so glad you're here. Join us in learning a bit more about what to expect in this program and ways to succeed.

Lesson 1

An Introduction to Your Nanodegree Program

Welcome! We're so glad you're here. Join us in learning a bit more about what to expect and ways to succeed.

Lesson 2

Getting Help

You are starting a challenging but rewarding journey! Take 5 minutes to read how to get help with projects and content.

Course 2 • 4 weeks

Establishing a Foundation in Observability

In this course, we will learn about the founding concepts of Observability in terms of people and tools.

Lesson 1

Introduction to Establishing a Foundation in Observability

This lesson will introduce you to the course, including what SRE is and why it matters.

Lesson 2

SRE Roles and Responsibilities in Enterprise

In this lesson, we will learn how to distinguish unique SRE roles and responsibilities within an enterprise.

Lesson 3

Improving Enterprise Workflows with SRE Best Practices

In this lesson, we will investigate enterprise workflows that can be improved with common SRE practices using cost-benefit analysis.

Lesson 4

SRE Teams

In this lesson, we will learn how to define an optimal SRE team structure and work allocation given business needs.

Lesson 5

Monitoring System Performance

By the end of this lesson, you will have a fully functional monitoring system that uses some of the most popular tools in the industry.

Lesson 6 • Project

Deploying System Observability

In this project, you will apply the skills you have acquired in the Establishing a Foundation in Observability course to configure a monitoring software stack.

Course 3 • 4 weeks

Planning for High Availability and Incident Response

In this course, we will look at how SREs view availability and reliability for their infrastructure. We'll learn how to create effective monitoring using SLOs and SLIs, and we will create dashboards in Grafana. Next, we'll identify all our IT assets and ensure they are configured for high availability. Then we will craft a disaster recovery plan to make sure failover is seamless and automated. After that, we'll deploy the infrastructure to AWS using Terraform, learning the benefits of infrastructure as code and seeing how easy it is to deploy to multiple regions. Finally, we'll learn how to make databases highly available and ready for disaster recovery. We'll look at recovery strategies and implement them in AWS via Terraform.

Lesson 1

Course Introduction

Introduction to the course. We will look at how the topics all tie into being an SRE and what skills we'll learn and apply.

Lesson 2

SLOs and SLIs

In this lesson, we will learn how SREs monitor systems using SLOs and SLIs. We will create queries in Prometheus and dashboards in Grafana.
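
As a rough illustration of the SLI and SLO ideas this lesson names, the sketch below computes an availability SLI from request counts and the share of error budget still remaining against an SLO target. The function names and sample numbers are assumptions made for illustration; they are not taken from the course material.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """Availability SLI: the fraction of requests that succeeded."""
    if total_requests == 0:
        # No traffic means nothing failed; treat the window as fully available.
        return 1.0
    return good_requests / total_requests


def error_budget_remaining(sli: float, slo_target: float) -> float:
    """Fraction of the error budget left (1.0 = untouched, below 0 = overspent)."""
    allowed_failure = 1.0 - slo_target
    if allowed_failure <= 0:
        raise ValueError("an SLO target of 100% leaves no error budget")
    observed_failure = 1.0 - sli
    return 1.0 - observed_failure / allowed_failure


# Hypothetical window: 1,000 requests, 999 successful, 99% availability SLO.
sli = availability_sli(999, 1000)
budget = error_budget_remaining(sli, slo_target=0.99)
print(f"SLI: {sli:.3%}, error budget remaining: {budget:.0%}")
```

In practice the request counts would come from a monitoring query, for example a Prometheus query feeding a Grafana dashboard, rather than from hard-coded values.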

Lesson 3

IT Assets, Availability and Disaster Recovery

In this lesson, we will identify all IT assets, make those assets highly available, and put together a disaster recovery plan for those assets.

Lesson 4

Creating and Deploying HA and DR Infrastructure Using Terraform

In this lesson, we will deploy our HA/DR infrastructure to AWS using Terraform.

Lesson 5

High Availability and DR of Databases

In this lesson, we'll learn about database reliability and availability and how we can make databases more available. We will then deploy a replicated database cluster to AWS and observe a failover.

Lesson 6 • Project

Deploying High Availability Infrastructure

In this project, you will apply the skills you've learned in this course by defining and implementing a resilient infrastructure on a cloud platform.

Course 4 • 4 weeks

Self-Healing Architectures

A self-healing architecture is resilient enough to withstand failure and uses automation to resolve issues without human intervention. In this course, you'll gain skills in self-healing architecture design strategies, deployment strategies, and cloud automation.

Lesson 1

Introduction to Self-Healing Architectures

Welcome to Self-healing Architectures! In this lesson, you'll learn more about the course and the topic.

Lesson 2

Self-healing System Design Fundamentals

In this lesson, you'll learn about self-healing system design fundamentals like single points of failure, tiered architecture, automation strategies, and microservice design.

Lesson 3

Self-healing Deployment Strategies

In this lesson, you'll learn about and implement several self-healing deployment strategies.

Lesson 4

Cloud Automation

In this lesson, you'll learn about several different self-healing cloud automation configurations for microservices and virtual machines.

Lesson 5 • Project

Deployment Roulette

In this project, you'll put everything you learned in the course into practice by playing the role of an SRE who fixes and deploys applications using self-healing strategies.

Taught By The Best

Travis Scotto

Site Reliability Engineer

Travis has been working in IT for over 10 years and has been an adjunct instructor for over 5 years. He loves technology and sharing his knowledge with students. Travis brings his industry experience as an SRE to his classes, blending that expertise with step-by-step teaching to help students excel. Seeing students succeed is what he likes best.

Emmanuel Apau

CTO of Mechanicode.io

Emmanuel is a co-founder of the Black Code Collective and a recipient of DC's Technical.ly RealLIST Engineer award. An AWS Certified DevSecOps specialist with 12 years of experience, he has spent his career developing innovative solutions using DevSecOps and site reliability best practices.

Sonny Sevin

Site Reliability Engineer

Sonny is an SRE with a varied background. He dabbled in research at Lawrence Berkeley National Labs before moving into site reliability engineering for a more hands-on role. He has been published in several computing journals and has taught introductory programming courses.

Nathan Anderson, MBA

Global Cloud Architect

Nathan is a Certified Six Sigma Black Belt and has 10+ years of IT experience across multiple industries. He is also the instructor for two other Udacity courses: Ensuring Quality Releases and Azure Performance.

Ratings & Reviews

Average Rating: 4.5 Stars

9 Reviews

David A.

June 14, 2022

Pretty fine, very demanding.

Artiom D.

May 31, 2022

Great program

Marius T.

April 29, 2022

good so far

Yi J.

March 8, 2022

The project design is decent, but course instructions can be improved. More explanation of the architecture of the project will help to understand how the application works. There are some mistakes in the instructions as well, making the course completion very confusing.

Felipe F.

March 2, 2022

The program is more challenging than I expected; however, I'm really enjoying the program.

The Udacity Difference

Combine technology training for employees with industry experts, mentors, and projects, for critical thinking that pushes innovation. Our proven upskilling system goes after success—relentlessly.

Demonstrate proficiency with practical projects

Projects are based on real-world scenarios and challenges, allowing you to apply the skills you learn to practical situations, while giving you real hands-on experience.

Gain proven experience
Retain knowledge longer
Apply new skills immediately

Top-tier services to ensure learner success

Reviewers provide timely and constructive feedback on your project submissions, highlighting areas of improvement and offering practical tips to enhance your work.