Cross-posted with the Google App Engine blog:
Steve Huffman blogged yesterday about how App Engine enables the project-based learning that makes his web development course so powerful. People are often surprised to learn that Udacity itself is built on App Engine.
The choice to use App Engine originally came from Mike Sokolsky, our CTO and cofounder, after his experience keeping the original version of our extremely popular AI course running on a series of virtual machines. Mike found App Engine’s operational simplicity extremely compelling after weeks of endlessly spinning up additional servers and administering MySQL replication in order to meet the crazy scale patterns we experience.
Close to a year later, with ten months of live traffic on App Engine, we continue to be satisfied customers. While there are a few things we do outside App Engine, our choice to continue using App Engine for our core application is clear: We prefer to spend our time figuring out how to scale personalized education, not memcached. App Engine’s infrastructure is better than what we could build ourselves, and it frees us to focus on behavior rather than operations.
How Udacity Uses App Engine
The App Engine features we use most include a pretty broad swath of the platform:
- High Replication Datastore with NDB
- Task Queues – Deferred execution, MapReduce, batch jobs
- App Engine Search API — Indexing both course content and student résumés
- Blobstore API — Lecture videos, résumés, data exportation
- Image API – Thumbnail generation
- MapReduce API – Daily usage analytics, data migrations, data maintenance
A high-level representation of our “stack” looks something like this:
Trails and Trove are two libraries developed in-house mainly by Piotr Kaminski. Trails supplies very clean semantics for creating families of RESTful endpoints on top of a webapp2.RequestHandler with automagic marshalling. Trove is a wrapper around NDB that adds common property types (e.g. efficient dereferencing of key properties), yet another layer of caching for entities with relations (both in-process and memcache), and an event “watcher” framework for reliably triggering out-of-band processing when data changes.
Something notable that is not represented in the drawing above is a specific set of monkey patches from Trove we apply to NDB to create better hooks similar to the existing pre/post-put/delete hooks. These custom hooks power a “watcher” abstraction that provides targeted pieces of code the opportunity to react to changes in the data layer. Execution of each watcher is deferred and runs outside the scope of the request so as to not increase response times.
During our first year of scaling on App Engine we learned its performance is a complex thing to understand. Response time is a function of several factors both inside and outside our control. App Engine’s ability to “scale-out” is undeniable, but we have observed high variance in response times for a given request, even during periods with low load on the system. As a consequence we have learned to do a number of things to minimize the impact of latency variance:
- Converting usage of the old datastore API to the new NDB API
- Using NDB.tasklet coroutines as much as possible to enable parallelism during blocking RPC operations
- Not indexing fields by default and adding an index only when we need it for a query
- Carefully avoiding index hotspots by indexing fields with predictable values only when necessary (i.e. auto-now DateTime and enumerated “choices” String properties).
- Materializing data views very aggressively so we can limit each request to the fewest datastore queries possible
This last point is obvious in the sense that naturally you get faster responses when you do less work. But we have taken pre-materializing views to an extreme level by denormalizing several aspects of our domain into read-optimized records. For example, the read-optimized version of a user’s profile record might contain standard profile information, plus privacy configuration, course enrollment information, course progress, and permissions — all things a data modeler would normally want to store separately. We pile it together into the equivalent of a materialized view so we can fetch it all in one query.
App Engine is an amazingly complete and reliable platform that works astonishingly well for a huge number of use cases. It is very apparent the services and APIs have been designed by people who know how to scale web applications, and we feel lucky to have the opportunity to ride on the shoulders of such giants. It is trivial to whip together a proof-of-concept for almost any idea, and the subsequent work to scale your app is significantly less than if you had rolled your own infrastructure.
As with any platform, there are tradeoffs. The tradeoff with App Engine is that you get an amazing suite of scale-ready services at the cost of relentlessly optimizing to minimize latency spikes. This is an easy tradeoff for us because App Engine has served us well through several exciting usage spikes and there is no question the progress we have already made towards our mission is significantly more than if we were also building our own infrastructure. Like most choices in life, this choice can be boiled down to a bumper sticker:
Editor’s note: Chris Chew and Steve Huffman will be participating in a Google Developers Live Hangout on Thursday, November 1st.
Here is the full Hangout conversation on Youtube: