Skills you'll learn:
Site Reliability Engineer
Nanodegree Program
The goal of the Site Reliability Engineer (SRE) Nanodegree program is to equip software developers with the engineering and operational skills required to build automation tools and responses that ensure designed solutions meet non-functional requirements such as availability, performance, security, and maintainability. The content focuses both on designing systems that automate responses to issues with software sites and on how to respond to common on-call situations.
Intermediate
3 months
Last Updated December 17, 2024
Prerequisites:
Courses In This Program
Course 1 • 45 minutes
Welcome!
Welcome! We're so glad you're here. Join us in learning a bit more about what to expect in this program and ways to succeed.
Lesson 1
An Introduction to Your Nanodegree Program
Welcome! We're so glad you're here. Join us in learning a bit more about what to expect and ways to succeed.
Lesson 2
Getting Help
You are starting a challenging but rewarding journey! Take 5 minutes to read how to get help with projects and content.
Course 2 • 3 weeks
Establishing a Foundation in Observability
In this course, we will learn about the founding concepts of Observability in terms of people and tools.
Lesson 1
Introduction to Establishing a Foundation in Observability
This lesson will introduce you to the course, including what SRE is and why it matters.
Lesson 2
SRE Roles and Responsibilities in Enterprise
In this lesson, we will learn how to distinguish unique SRE roles and responsibilities within an enterprise.
Lesson 3
Improving Enterprise Workflows with SRE Best Practices
In this lesson, we will investigate enterprise workflows that can be improved with common SRE practices using cost-benefit analysis.
Lesson 4
SRE Teams
In this lesson, we will learn how to define an optimal SRE team structure and work allocation given business needs.
Lesson 5
Monitoring System Performance
By the end of this lesson, you will have a fully-functional monitoring system that uses some of the most popular tools in the industry.
Lesson 6 • Project
Deploying System Observability
In this project, you will apply the skills you have acquired in the Establishing a Foundation in Observability course to configure a monitoring software stack.
Course 3 • 3 weeks
Planning for High Availability and Incident Response
In this course, we will look at how SREs view availability and reliability for their infrastructure. We'll learn how to create effective monitoring using SLOs and SLIs, and we will create dashboards in Grafana. Next, we'll identify all our IT assets and ensure they are configured for high availability. Then we will craft a disaster recovery plan to make sure failover is seamless and automated. After that, we'll deploy the infrastructure to AWS using Terraform, learning the benefits of infrastructure as code and seeing how easy it is to deploy to multiple regions. Finally, we'll learn how to make databases highly available and ready for disaster recovery. We'll look at recovery strategies and implement them in AWS via Terraform.
Lesson 1
Course Introduction
Introduction to the course. We will look at how the topics all tie into being an SRE and what skills we'll learn and apply.
Lesson 2
SLOs and SLIs
In this lesson, we will learn how SREs monitor using SLOs and SLIs. We will create queries in Prometheus and dashboards in Grafana.
Lesson 3
IT Assets, Availability and Disaster Recovery
In this lesson, we will identify all IT assets, make those assets highly available, and put together a disaster recovery plan for those assets.
Lesson 4
Creating and Deploying HA and DR Infrastructure Using Terraform
In this lesson, we will deploy our HA/DR infrastructure using Terraform to AWS.
Lesson 5
High Availability and DR of Databases
In this lesson, we'll learn about database reliability and availability and how we can make databases more available. We will then deploy a replicated database cluster to AWS and observe a failover.
Lesson 6 • Project
Deploying High Availability Infrastructure
In this project, you will apply the skills you've learned in this course by defining and implementing a resilient infrastructure on a cloud platform.
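To make the SLO and SLI vocabulary used in this course concrete, here is a minimal sketch in Python. It is not taken from the course materials: the function names, the request counts, and the 99.9% target are illustrative assumptions. It treats the SLI as the fraction of successful requests and the error budget as the number of failures the SLO allows.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """SLI: fraction of requests served successfully (hypothetical helper)."""
    if total_requests == 0:
        # No traffic means nothing failed; treat the window as fully available.
        return 1.0
    return good_requests / total_requests


def error_budget_remaining(good_requests: int, total_requests: int,
                           slo_target: float) -> float:
    """Fraction of the error budget still unspent for this window.

    slo_target is the SLO expressed as a fraction, e.g. 0.999 (assumed value).
    A negative result means the budget is exhausted and the SLO is violated.
    """
    allowed_failures = (1.0 - slo_target) * total_requests
    actual_failures = total_requests - good_requests
    if allowed_failures == 0:
        # A 100% target (or no traffic) leaves no budget to spend.
        return 1.0 if actual_failures == 0 else 0.0
    return 1.0 - actual_failures / allowed_failures


if __name__ == "__main__":
    # Assumed example window: 10,000 requests, 5 of which failed.
    sli = availability_sli(9_995, 10_000)
    budget = error_budget_remaining(9_995, 10_000, slo_target=0.999)
    print(f"SLI: {sli:.4%}, error budget remaining: {budget:.0%}")
```

In practice the counts would come from a monitoring query (for example, one written in Prometheus) rather than hard-coded numbers.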
Course 4 • 2 weeks
Self-Healing Architectures
A self-healing architecture is resilient enough to withstand failure and resolve issues without human intervention through automation. In this course, you'll gain skills in self-healing architecture design strategies, deployment strategies, and cloud automation.
Lesson 1
Introduction to Self-Healing Architectures
Welcome to Self-healing Architectures! In this lesson, you'll learn more about the course and the topic.
Lesson 2
Self-healing System Design Fundamentals
In this lesson, you'll learn about self-healing system design fundamentals like single points of failure, tiered architecture, automation strategies, and microservice design.
Lesson 3
Self-healing Deployment Strategies
In this lesson, you'll learn about and implement several self-healing deployment strategies.
Lesson 4
Cloud Automation
In this lesson, you'll learn about several different self-healing cloud automation configurations for microservices and virtual machines.
Lesson 5 • Project
Deployment Roulette
In this project, you'll put everything you learned in the course into practice by playing the role of an SRE fixing and deploying applications using self-healing strategies.
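The idea of resolving issues without human intervention can be illustrated with a small sketch of a health-check-and-restart loop in Python. This is only one possible shape for such automation and is not drawn from the course itself: `heal`, `FakeService`, and the restart limit are hypothetical names and values chosen for illustration.

```python
from typing import Callable


def heal(check: Callable[[], bool], restart: Callable[[], None],
         max_restarts: int = 3) -> bool:
    """Restart a service until its health check passes or the limit is hit.

    Returns True if the service ends up healthy, False if it is still
    unhealthy after max_restarts restarts (time to page a human).
    """
    for attempt in range(max_restarts + 1):
        if check():
            return True
        if attempt < max_restarts:
            restart()
    return False


class FakeService:
    """Stand-in for a real service; becomes healthy after N restarts."""

    def __init__(self, restarts_needed: int) -> None:
        self.restarts_needed = restarts_needed
        self.restarts = 0

    def is_healthy(self) -> bool:
        return self.restarts_needed == 0

    def restart(self) -> None:
        self.restarts += 1
        self.restarts_needed = max(0, self.restarts_needed - 1)


if __name__ == "__main__":
    service = FakeService(restarts_needed=2)
    recovered = heal(service.is_healthy, service.restart)
    print(f"recovered={recovered} after {service.restarts} restart(s)")
```

Capping the number of restarts is a deliberate choice in this sketch: a service that never recovers should escalate to an on-call engineer rather than restart forever.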
Taught By The Best
Travis Scotto
Site Reliability Engineer
Travis has been working in IT for over 10 years and has been teaching as an adjunct for over 5 years. He loves technology and sharing his knowledge with students. Travis brings his industry experience as an SRE to the classes he teaches, blending industry expertise with step-by-step teaching to help students excel. Seeing students succeed is what he likes best.
Emmanuel Apau
CTO of Mechanicode.io
Emmanuel is a co-founder of the Black Code Collective and a recipient of DC's Technical.ly RealLIST Engineer award. An AWS Certified DevSecOps specialist with 12 years of experience, he has spent his career developing innovative solutions using DevSecOps and site reliability best practices.
Sonny Sevin
Site Reliability Engineer
Sonny is an SRE with a varied background. He dabbled in research at Lawrence Berkeley National Labs before moving into site reliability engineering to have a more hands-on role. He has been published in several computing journals and has taught introductory programming courses.
Nathan Anderson, MBA
Global Cloud Architect
Nathan is a Certified Six Sigma Black Belt and has 10+ years of IT experience across multiple industries. He is also the instructor for two other Udacity courses: Ensuring Quality Releases and Azure Performance.
Student Reviews
Average Rating: 4.5 Stars
11 Reviews
David A.
June 14, 2022
Pretty fine, very demanding.
Artiom D.
May 31, 2022
Great program
Marius T.
April 29, 2022
good so far
Yi J.
March 8, 2022
The project design is decent, but course instructions can be improved. More explanation of the architecture of the project will help to understand how the application works. There are some mistakes in the instruction as well, making the course completion very confusing.
Felipe F.
March 2, 2022
The program is more challenging than I expect, however, I'm really enjoying the program.
The Udacity Difference
Combine technology training for employees with industry experts, mentors, and projects, for critical thinking that pushes innovation. Our proven upskilling system goes after success—relentlessly.
Demonstrate proficiency with practical projects
Projects are based on real-world scenarios and challenges, allowing you to apply the skills you learn to practical situations, while giving you real hands-on experience.
Gain proven experience
Retain knowledge longer
Apply new skills immediately
Top-tier services to ensure learner success
Reviewers provide timely and constructive feedback on your project submissions, highlighting areas of improvement and offering practical tips to enhance your work.
Get help from subject matter experts
Learn industry best practices
Gain valuable insights and improve your skills
Enroll in Site Reliability Engineer. Choose the plan that works for you
All Access monthly
Unlimited access to our top-rated courses
Personalized Career Services
Cancel Anytime
Real-world projects
Personalized project reviews
Program certificates
Best Value
All Access bundle [1]
All the same great benefits as our monthly plan
The most cost-effective way to develop the skills you want
[1] Discount applies to the first 4 months of membership, after which plans are converted to month-to-month.