Learn Spark & Data Lakes


Learn more about the big data ecosystem and how to use Spark to work with massive datasets.

Enroll Now
  • Estimated time
    1 month

  • Enroll by
    June 7, 2023

    Get access to classroom immediately on enrollment

  • Skills acquired
    AWS Glue, AWS Data Lakes, Apache Spark, Data Transformation

What You Will Learn

  1. Spark and Data Lakes

    1 month to complete

    Build a data lake on AWS and a data catalog following the principles of data lakehouse architecture. Learn about the big data ecosystem and the power of Apache Spark for data wrangling and transformation. Work with AWS data tools and services to extract, load, process, query, and transform semi-structured data in data lakes.

    Prerequisite knowledge

    Intermediate Python, Intermediate SQL

    1. Big Data Ecosystem, Data Lakes, & Spark

      Identify what constitutes the big data ecosystem for data engineering. Explain the purpose and evolution of data lakes in the big data ecosystem. Compare the Spark framework with the Hadoop framework. Identify when to use Spark and when not to, and describe the features of lakehouse architecture.

    2. Spark Essentials

      Wrangle data with Spark and functional programming to scale across distributed systems. Process data with Spark DataFrames and Spark SQL. Process data in common formats such as CSV and JSON. Use the Spark RDD API to wrangle, transform, and filter data.

    3. Using Spark & Data Lakes in the AWS Cloud

      Use distributed data storage with Amazon S3 and identify properties of Amazon S3 data lakes. Identify service options for using Spark in AWS and configure AWS Glue. Create and run Spark jobs with AWS Glue.

    4. Ingesting & Organizing Data in Lakehouse Architecture on AWS

      Use Spark with AWS Glue to run ELT processes on data from diverse sources, structures, and vintages in lakehouse architecture. Create a Glue Data Catalog and Glue Tables. Use Amazon Athena for ad-hoc queries in a lakehouse. Use Glue for SQL queries on Amazon S3 data and for ELT. Ingest data into lakehouse zones. Transform and filter data into curated lakehouse zones with Spark and AWS Glue. Join and process data into lakehouse zones with Spark and AWS Glue.

    5. Course Project: STEDI Human Balance Analytics

      Act as a data engineer for the STEDI team to build a data lakehouse solution for sensor data that trains a machine learning model. Build an ELT (Extract, Load, Transform) pipeline for lakehouse architecture: load data from an Amazon S3 data lake, process the data into analytics tables using Spark and AWS Glue, and load the tables back into the lakehouse.

            All Our Courses Include

            • Real-world projects from industry experts

              With real-world projects and immersive content built in partnership with top-tier companies, you’ll master the tech skills companies want.

            • Real-time support

              On-demand help. Get instant help with your learning directly in the classroom so you can stay on track and get unstuck.

            • Workspaces

              Validate your understanding of the concepts you learn by checking the output and quality of your code in real time.

            • Flexible learning program

              Tailor a learning plan that fits your busy life. Learn at your own pace and reach your personal goals on the schedule that works best for you.

            Course offerings

            • Class content

              • Real-world projects
              • Project reviews
              • Project feedback from experienced reviewers
            • Student services

              • Student community
              • Real-time support

            Succeed with personalized services.

            We provide services customized for your needs at every step of your learning journey to ensure your success.

            Get timely feedback on your projects.

            • Personalized feedback
            • Unlimited submissions and feedback loops
            • Practical tips and industry best practices
            • Additional suggested resources to improve

            • 1,400+ project reviewers
            • 2.7M projects reviewed
            • 88/100 reviewer rating
            • 1.1 hours average project review turnaround time

            Learn with the best.

            • Sean Murdock

              Professor at Brigham Young University-Idaho

              Sean currently teaches cybersecurity and DevOps courses at Brigham Young University-Idaho. He has been a software engineer for over 16 years. Some of the most exciting projects he has worked on involved data pipelines for DNA processing and vehicle telematics.

            Spark and Data Lakes

            Get started today

              • Learn

                Learn about the big data ecosystem and the power of Apache Spark for data wrangling and transformation.

              • Average Time

                On average, successful students take 1 month to complete this program.

              • Benefits include

                • Real-world projects from industry experts
                • Real-time support

              Program Details

              • Do I need to apply? What are the admission criteria?

                No. This course accepts all applicants regardless of experience or specific background.

              • What are the prerequisites for enrollment?

                A well-prepared learner has experience with relational database design, SQL, basic dimensional modeling, data modeling basics, Amazon Web Services basics, and Python.

              • How is this course structured?

                This course comprises content and curriculum to support one project. We estimate that students can complete the program in one month.

                The project will be reviewed by the Udacity reviewer network and platform. Feedback will be provided, and if you do not pass, you will be asked to resubmit the project until it passes.

              • How long is this course?

                Access to this course runs for the length of time specified in the payment card above. If you do not graduate within that time period, you will continue learning with month-to-month payments. See the Terms of Use and FAQs for other policies regarding the terms of access to our programs.

              • Can I switch my start date? Can I get a refund?

                Please see the Udacity Program Terms of Use and FAQs for policies on enrollment in our programs.

              • What software and versions will I need in this course?

                There are no software or version requirements to complete this course. All coursework and projects can be completed via Student Workspaces in the Udacity online classroom. Udacity’s full technical requirements are listed here.
