The goal of the Site Reliability Engineer (SRE) Nanodegree program is to equip software developers with the engineering and operational skills required to build automation tools and responses that ensure designed solutions respond to non-functional requirements such as availability, performance, security, and maintainability. The content focuses both on designing systems that automate responses to issues with software sites and on responding to common on-call situations.
What is a Site Reliability Engineer (SRE)?
According to the DevOps Institute, Site Reliability Engineers are change agents within organizations who champion reliability best practices, design resilient systems, and implement processes, methods, tools, and self-service solutions. Site Reliability Engineers work with design, build, and DevOps squads to establish elastic architecture and bridge application and platform design from an operational point of view.
About the Site Reliability Engineer (SRE) Nanodegree
The Site Reliability Engineer (SRE) Nanodegree program aims to equip software developers with the software engineering and operational skills required to build automation tools and responses that ensure designed solutions respond to non-functional requirements such as availability, performance, security, and maintainability.
A student enrolled in this course should have the following skills:
- Write basic functions in an object-oriented language (Python or Java) that use for loops, conditionals, etc.
- Write basic shell scripts in Bash or PowerShell, which could include for loops, conditionals, etc.
- Use the Linux command line (bash/shell).
- Create simple SQL queries using SELECT, JOIN, and GROUP BY clauses.
- Networking skills including knowledge of virtual networks, DNS, subnets, and basic network troubleshooting techniques.
- Perform DevOps tasks, such as setting up monitoring, rolling out features, and troubleshooting production issues, ideally for large systems.
- Experience with Kubernetes and basic kubectl, such as kubectl apply, kubectl create, kubectl config.
A graduate of this program will be able to:
- Use proactive and reactive SRE strategies (monitoring, postmortem, team building, etc.) to identify reliability risks through evaluating systems and processes.
- Develop customer-centric SLOs (such as percentile targets for availability, latency, and correctness), and set up corresponding monitoring and risk mitigation measures to ensure customer happiness.
- Create and deploy automated self-healing architectures and other technologies to make the environment more maintainable.
- Design and implement organizational processes and culture that enhance product reliability, including outage/postmortem review, quarterly state of production presentation, and production readiness review.
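To make the idea of percentile-based SLO targets more concrete, here is a minimal Python sketch of how availability, a latency percentile, and the remaining error budget might be computed from raw request data. It is not taken from the program's curriculum: the function names (`availability`, `percentile`, `error_budget_remaining`), the nearest-rank percentile method, and the sample numbers are all assumptions chosen for illustration.

```python
import math


def availability(good_requests, total_requests):
    """Fraction of requests that succeeded (hypothetical helper)."""
    if total_requests == 0:
        return 1.0  # assumption: no traffic counts as fully available
    return good_requests / total_requests


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty sequence, pct in (0, 100]."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    # Nearest-rank method: smallest rank covering pct percent of samples.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def error_budget_remaining(good_requests, total_requests, slo_target):
    """Share of the error budget still unspent; negative once exceeded."""
    allowed_failures = (1 - slo_target) * total_requests
    actual_failures = total_requests - good_requests
    if allowed_failures == 0:
        return 0.0 if actual_failures == 0 else float("-inf")
    return 1 - actual_failures / allowed_failures


if __name__ == "__main__":
    latencies_ms = list(range(1, 101))  # made-up sample: 1 ms .. 100 ms
    print("availability:", availability(999, 1000))
    print("p95 latency (ms):", percentile(latencies_ms, 95))
    print("budget left:", error_budget_remaining(999, 1000, 0.99))
```

With an assumed 99% availability target over 1,000 requests, 10 failures are allowed, so a single failed request leaves roughly 90% of the error budget. In practice these figures would usually come from a monitoring system rather than hand-built lists.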
Specific Projects include:
Project 1: Observing Cloud Resources. In this project, students will apply the skills they have acquired in the Establish a Foundation in Observability course to configure a monitoring software stack.
Project 2: Deploying HA Infrastructure. In this project, students will design highly available (HA) infrastructure with Terraform and deploy it to AWS.
Project 3: Deployment Roulette. In this project, students will identify failing applications and implement fixes to resolve the problems. Students will also create an architecture diagram that communicates the status of the cloud environment to improve the onboarding of future developers.
Project 4: Plan, Reduce, Repeat. In this project, students will participate in several mock scenarios they might encounter as an SRE. There will be three scenarios, each demonstrating different skills students have learned.
Nathan Anderson, MBA, Global Cloud Architect
Travis Scotto, Site Reliability Engineer
Emmanuel Apau, CTO of Mechanicode.io
Sonny Sevin, Site Reliability Engineer
Enroll In the New Site Reliability Engineer (SRE) Nanodegree Program Today
In the new Site Reliability Engineer (SRE) Nanodegree program, you will learn to use, design, and build system teams to establish an elastic architecture. You will also learn to design and implement organizational processes and culture that enhance product reliability, including outage/postmortem review, quarterly state of production presentation, and production readiness review. Learn more and enroll in the Site Reliability Engineer Nanodegree program!