Data Engineering for Data Scientists


Learn how to wrangle data on a massive scale! By the end of this course, you’ll be able to pull data from a wide range of sources, store it in a database, and create data pipelines (ETL, NLP, machine learning) that power real-world web applications.

Enroll Now
  • Estimated time
    1 month

  • Enroll by
    June 14, 2023

    Get access to the classroom immediately upon enrollment

  • Skills acquired
    scikit-learn, Data Cleaning, Machine Learning Pipeline Creation
In collaboration with
  • Appen

What You Will Learn

  1. Data Engineering for Data Scientists

    1 month to complete

    For many companies, data scientists who can also tackle data-engineering problems are worth their weight in gold. In this course, you’ll learn how to unlock data silos, pulling data from multiple sources and pipelining it into usable forms for analysts and top-level decision makers. At the end, you’ll even build an impressive machine-learning-powered web application that has real-world, life-saving significance.

    Prerequisite knowledge

    Python, SQL, Statistics, Machine Learning.

    1. ETL Pipelines

      Understand what ETL pipelines are, then access and combine data from CSV, JSON, logs, APIs, and databases.

    2. Natural Language Processing

      Prepare text data for analysis with tokenization, lemmatization, and stop-word removal. Use scikit-learn to transform and vectorize text data and build features with bag of words and tf-idf.

    3. Machine Learning Pipelines

      Understand the advantages of using machine learning pipelines to streamline the data preparation and modeling process. Use feature unions to perform steps in parallel and create more complex workflows. Complete a case study to build a full machine learning pipeline that prepares data and creates a model for a dataset.

    4. Course Project: Build Disaster Response Pipelines

      In this project, you’ll build a data pipeline to prepare the message data from major natural disasters around the world. You’ll build a machine learning pipeline to categorize emergency text messages based on the need communicated by the sender.
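
To give a feel for the kind of work the lessons above describe, here is a minimal sketch of an ETL step followed by simple text features. It is an illustration only, not course material: the sample messages, the category labels, the table name, and every function name are hypothetical. It uses only the Python standard library, skips lemmatization, and computes a classic tf-idf weight, which differs from the default weighting scikit-learn applies.

```python
import csv
import io
import json
import math
import re
import sqlite3
from collections import Counter

# Hypothetical sample inputs standing in for real files or API responses.
MESSAGES_CSV = """id,message
1,We need water and food in the shelter
2,The bridge is flooded and the road is closed
3,We need medical help people are injured
"""
CATEGORIES_JSON = (
    '[{"id": 1, "category": "food"},'
    ' {"id": 2, "category": "infrastructure"},'
    ' {"id": 3, "category": "medical"}]'
)
# A tiny illustrative stop-word list, not a complete one.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "the", "we"}


def extract():
    """Extract: read rows from a CSV source and a JSON source."""
    messages = list(csv.DictReader(io.StringIO(MESSAGES_CSV)))
    categories = json.loads(CATEGORIES_JSON)
    return messages, categories


def transform(messages, categories):
    """Transform: join the two sources on id and tidy the text."""
    category_by_id = {row["id"]: row["category"] for row in categories}
    return [
        (int(m["id"]), m["message"].strip(), category_by_id[int(m["id"])])
        for m in messages
    ]


def load(rows, conn):
    """Load: store the combined rows in a SQLite table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS messages "
        "(id INTEGER PRIMARY KEY, message TEXT, category TEXT)"
    )
    conn.executemany("INSERT INTO messages VALUES (?, ?, ?)", rows)
    conn.commit()


def tokenize(text):
    """Lowercase, split into alphabetic tokens, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def tfidf(documents):
    """Return one {term: weight} dict per document (tf * log(N / df))."""
    tokenized = [tokenize(d) for d in documents]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


conn = sqlite3.connect(":memory:")
load(transform(*extract()), conn)
stored = [row[0] for row in conn.execute("SELECT message FROM messages ORDER BY id")]
weights = tfidf(stored)
```

In this sketch, a term that appears in only one message (such as "water") receives a higher weight than a term shared across messages (such as "need"), which is the property that makes tf-idf features useful for categorizing text.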

          All Our Courses Include

          • Real-world projects from industry experts

            With real-world projects and immersive content built in partnership with top-tier companies, you’ll master the tech skills companies want.

          • Real-time support

            On-demand help. Receive instant help with your learning directly in the classroom. Stay on track and get unstuck.

          • Workspaces

            Validate your understanding of concepts learned by checking the output and quality of your code in real time.

          • Flexible learning program

            Tailor a learning plan that fits your busy life. Learn at your own pace and reach your personal goals on the schedule that works best for you.

          Course offerings

          • Class content

            • Real-world projects
            • Project reviews
            • Project feedback from experienced reviewers
          • Student services

            • Student community
            • Real-time support

          Succeed with personalized services.

          We provide services customized for your needs at every step of your learning journey to ensure your success.

          Get timely feedback on your projects.

          • Personalized feedback
          • Unlimited submissions and feedback loops
          • Practical tips and industry best practices
          • Additional suggested resources to improve
          • 1,400+

            project reviewers

          • 2.7M

            projects reviewed

          • 88/100

            reviewer rating

          • 1.1 hours

            avg project review turnaround time

          Learn with the best.

          • Juno Lee

            Curriculum Lead at Udacity

            Juno is the curriculum lead for the School of Data Science. She has been sharing her passion for data and teaching, building several courses at Udacity. As a data scientist, she built recommendation engines, computer vision and NLP models, and tools to analyze user behavior.

          • Andrew Paster


            Andrew has an engineering degree from Yale and has used his data science skills to build a jewelry business from the ground up. He has also created courses for Udacity’s Self-Driving Car Engineer Nanodegree program.

          • Arpan Chakraborty


            Arpan is a computer scientist with a PhD from North Carolina State University. He teaches at Georgia Tech (within the Masters in Computer Science program), and is a coauthor of the book Practical Graph Mining with R.

          Data Engineering for Data Scientists

          Get started today

            • Learn

              How to pull data, store it, and build ETL, NLP and machine-learning data pipelines with Python.

            • Average Time

              On average, successful students take 1 month to complete this program.

            • Benefits include

              • Real-world projects from industry experts
              • Real-time support

            Program Details

            • Do I need to apply? What are the admission criteria?

              No. This course accepts all applicants, regardless of experience or background.

            • What are the prerequisites for enrollment?

              Machine Learning:

              • Supervised and unsupervised methods equivalent to those taught in the Intro to Machine Learning Nanodegree Program
              • Experience with Python programming, including writing functions, building basic applications, and using common libraries like NumPy and pandas
              • SQL programming, including querying databases and using joins, aggregations, and subqueries
              • Comfort using the terminal and GitHub
              Probability and Statistics:
              • Descriptive statistics, including calculating measures of center and spread
              • Inferential statistics, including sampling distributions and hypothesis testing
            • How is this course structured?

              The Data Engineering for Data Scientists course consists of content and curriculum to support one project. We estimate that students can complete the program in 1 month.

              The project will be reviewed by the Udacity reviewer network and platform, and you will receive feedback on it. If you do not pass, you will be asked to resubmit the project until it passes.

            • How long is this course?

              Access to this course runs for the length of time specified in the payment card above. If you do not graduate within that time period, you will continue learning with month-to-month payments. See the Terms of Use and FAQs for other policies regarding the terms of access to our programs.

            • Can I switch my start date? Can I get a refund?

              Please see the Udacity Program Terms of Use and FAQs for policies on enrollment in our programs.

            • What software and versions will I need in this course?

              You’ll need access to the Internet and a 64-bit computer. You will also need to be able to download and run Python 3.7.
