Our notion of what qualifies as “up to date” has changed over time. Nowadays, a freshly printed newspaper can be full of old news. A stock quote from 15 minutes ago might be deemed irrelevant. And telegrams, once a revolutionary method for transporting information faster than humans could travel, have been made obsolete.
We have reached a point where many applications rely on sending and receiving information in real time.
What Is Real-Time Streaming?
But what exactly does “real time” mean? In the context of data transmission, it means that the generation of a new datapoint triggers its delivery to the next checkpoint. In the past, batch-processing systems aggregated data according to rules, such as “update the database once a day” or “collect 1,000 new datapoints before sending.”
Real-time streaming, in contrast, processes the data as it is generated, and this continuous propagation of information evokes the image of a stream of data.
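To make the contrast concrete, here is a minimal Python sketch of the two delivery rules. The function names (`deliver_in_batches`, `deliver_as_stream`) and the `send` callback are hypothetical and chosen only for illustration; they do not come from any particular framework.

```python
from typing import Callable, Iterable, List


def deliver_in_batches(datapoints: Iterable[int], batch_size: int,
                       send: Callable[[List[int]], None]) -> None:
    """Batch rule: hold datapoints back until `batch_size` have accumulated."""
    buffer: List[int] = []
    for point in datapoints:
        buffer.append(point)
        if len(buffer) == batch_size:
            send(buffer)
            buffer = []  # start a fresh buffer for the next batch
    if buffer:
        send(buffer)  # flush whatever is left at the end


def deliver_as_stream(datapoints: Iterable[int],
                      send: Callable[[int], None]) -> None:
    """Streaming rule: every new datapoint triggers its own delivery."""
    for point in datapoints:
        send(point)


batch_deliveries: List[List[int]] = []
stream_deliveries: List[int] = []

deliver_in_batches(range(5), batch_size=2, send=batch_deliveries.append)
deliver_as_stream(range(5), send=stream_deliveries.append)

print(batch_deliveries)   # [[0, 1], [2, 3], [4]]
print(stream_deliveries)  # [0, 1, 2, 3, 4]
```

In the batch version, the first datapoint has to wait for the second one before anything is sent. In the streaming version, each datapoint is delivered as soon as it exists.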
Why Real-Time Streaming Is Useful
The continuity of real-time streaming makes it possible to process large volumes of data while going easy on computational resources. It has also given rise to an entire industry whose selling point is the ability to update data continuously and almost instantaneously. Let’s look at a few examples of applications that rely on real-time streaming.
Oftentimes, when we hear about the link between social media and journalism, it’s in the context of fake news. But it’s equally true that a lot of real news breaks on microblogging platforms.
People have used Twitter to pool information and coordinate gatherings in many parts of the world. When it comes to demonstrations, participants often rely on the service for real-time updates.
Thanks to its ubiquity and near-instantaneous message propagation, Twitter can even be used for catastrophe detection: In 2015, the company announced in a blog post that the USGS would start using the Twitter API to detect earthquakes around the world.
Real-time streaming plays an important role in many areas of the finance sector. Stock markets are famously volatile, and stock prices rise and fall at an ever-increasing speed. In order to stay competitive, investors need to keep themselves informed about those changes in real time.
In addition, the technology helps banks react faster to fraudulent transactions. Detecting anomalies in banking behavior in real time can save financial institutions a significant amount of money.
Navigation apps like Google Maps don’t compute your route statically. Instead, they take into account changes in traffic, construction work, and events like parades or demonstrations, so they need to update their knowledge bases in real time.
The ability to quickly react to real-world changes is even more crucial in the context of autonomous driving. In order to create a safe-driving experience for everyone, algorithms for self-driving cars must combine measurements from the vehicle’s sensors with incoming information from other cars, weather stations, traffic monitoring systems, and many other sources — all in real time.
How Real-Time Streaming Works
In computer science, a stream is an open connection over which information is transmitted. The sender of that information is called the publisher and is connected to one or more subscribers at the other end of the stream. As we touched on earlier, the generation of a datapoint, known as an event, triggers its delivery through the stream.
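The publisher and subscriber roles can be sketched in a few lines of Python. This is a simplified in-process illustration, not a networked implementation. The `Stream` class, its `subscribe` and `publish` methods, and the sample stock events are all assumptions made for this example.

```python
from typing import Callable, Dict, List

Event = Dict[str, object]


class Stream:
    """A toy stream: the publisher pushes events, subscribers receive them."""

    def __init__(self) -> None:
        self._subscribers: List[Callable[[Event], None]] = []

    def subscribe(self, callback: Callable[[Event], None]) -> None:
        # Register a subscriber at the receiving end of the stream.
        self._subscribers.append(callback)

    def publish(self, event: Event) -> None:
        # Generating an event immediately triggers delivery to every subscriber.
        for callback in self._subscribers:
            callback(event)


stream = Stream()
received_a: List[Event] = []
received_b: List[Event] = []
stream.subscribe(received_a.append)
stream.subscribe(received_b.append)

stream.publish({"symbol": "XYZ", "price": 10.5})
stream.publish({"symbol": "XYZ", "price": 10.7})

print(len(received_a), len(received_b))  # 2 2
```

Note that the publisher never waits for a batch to fill up: each call to `publish` delivers the event to all subscribers right away.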
The propagation of many events through the pipeline results in a constant flow of information, with many events in transit at the same time. While this approach makes more efficient use of CPU and memory resources than bulk processing, it also requires frameworks that support parallel computing.
The Real-Time Streaming Tools You Should Know About
There’s a growing number of tools that let you generate, send, and modify your data in real time. In the next sections, we’ll introduce you to the best frameworks for getting started with real-time streaming. In addition, we’ll mention a few APIs that allow you to parallelize computations in your infrastructure.
You’ll notice that all these platforms for real-time streaming come from the Apache Software Foundation (ASF). The ASF develops open-source software, and thanks to the real-time streaming boom of recent years, several of its projects have become state-of-the-art. Let’s look at three of them more closely.
Written in Java and Scala, the stream-processing package Apache Kafka was originally developed at LinkedIn and then open-sourced in 2011. Kafka turns the data that your platform generates into a searchable stream. You can search for and find newly published content within seconds of its being generated.
What if, instead of simply transmitting your newly generated data, you also want to modify it? That’s the specialty of Apache Storm. Written in the LISP dialect Clojure, this framework conceptualizes the flow of data as a directed acyclic graph (DAG).
There are two types of nodes in this graph: spouts and bolts. Spouts read data from sources like Apache Kafka streams and other data-queuing APIs. In contrast, bolts connect to spouts or other bolts and perform simple transformations on incoming data before passing it on.
Should you want to perform a complex data transformation, you can chain several bolts together, with each bolt performing one of the operation’s subtasks. A Storm graph, also known as a topology in Apache jargon, is analogous to a Hadoop MapReduce job, the difference being that Storm topologies run indefinitely.
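The spout-and-bolt idea can be imitated with plain Python generators. Keep in mind that this is only a conceptual sketch and not Storm's actual API. The function names (`sentence_spout`, `split_bolt`, `lowercase_bolt`, `count_bolt`) are hypothetical, and the input is a short finite list, whereas a real topology would keep running.

```python
from typing import Dict, Iterable, Iterator, Tuple


def sentence_spout(sentences: Iterable[str]) -> Iterator[str]:
    """Spout: reads data from a source and emits it into the graph."""
    for sentence in sentences:
        yield sentence


def split_bolt(sentences: Iterable[str]) -> Iterator[str]:
    """Bolt: one simple transformation, splitting sentences into words."""
    for sentence in sentences:
        for word in sentence.split():
            yield word


def lowercase_bolt(words: Iterable[str]) -> Iterator[str]:
    """Bolt: normalizes each incoming word."""
    for word in words:
        yield word.lower()


def count_bolt(words: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Bolt: keeps a running count and emits it with every incoming word."""
    counts: Dict[str, int] = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
        yield word, counts[word]


source = ["Storm processes streams", "streams of events"]

# Chaining the bolts yields the more complex operation "running word count".
topology = count_bolt(lowercase_bolt(split_bolt(sentence_spout(source))))
results = list(topology)
print(results)
```

Each bolt handles exactly one subtask, and the word count as a whole emerges from the chain.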
If you need your pipeline to handle more complex operations, such as machine-learning models or big data analytics, Apache Storm might not be enough for you. In cases like these, you can use Apache Spark instead.
The framework was developed specifically for propagating large amounts of data using distributed computing techniques. Note that technically, Spark graphs don’t stream data in real time. Instead, they process the data in a series of micro-batches.
This can lead to undesirable results in some scenarios, for instance if you need your data to be processed exactly in the order in which it was generated.
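The following Python sketch shows what micro-batching means in principle. It is not Spark code. The `micro_batches` helper, the one-second interval, and the sample events are assumptions made purely for illustration.

```python
from itertools import groupby
from typing import Iterable, Iterator, List, Tuple

Event = Tuple[float, str]  # (timestamp in seconds, payload)


def micro_batches(events: Iterable[Event],
                  interval: float) -> Iterator[List[Event]]:
    """Group time-ordered events into consecutive windows of `interval` seconds."""
    for _, group in groupby(events, key=lambda e: int(e[0] // interval)):
        yield list(group)


events = [(0.1, "a"), (0.4, "b"), (1.2, "c"), (1.9, "d"), (2.5, "e")]
batches = list(micro_batches(events, interval=1.0))

for batch in batches:
    print([payload for _, payload in batch])
# ['a', 'b']
# ['c', 'd']
# ['e']
```

Event "a" is not handed to downstream processing on its own. It travels together with "b" once their shared window is complete, so any ordering guarantee within a window depends on how the batch itself is processed.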
In recent years, parallel computing has become increasingly important. We often want our modern applications to do complex mathematical computations in fractions of a second. The best way to do this is to break down each of our complex problems into several simpler ones.
That’s why graphics processing units (GPUs) have become the number-crunching resource of choice for computationally intensive workloads like deep learning.
Just the same, in real-time streaming, we want to process millions of discrete events with low latency. So let’s take a quick look at the best frameworks for parallel computing, aka parallel processing.
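The idea of breaking one large problem into several simpler ones can be sketched with Python's standard library. The task (a sum of squares), the `split` helper, and the choice of four workers are assumptions made for illustration. Threads are used only to keep the sketch self-contained: in standard CPython they demonstrate the decomposition rather than a real speedup for CPU-bound work.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def sum_of_squares(chunk: List[int]) -> int:
    """The simple subproblem: handle one slice of the data."""
    return sum(x * x for x in chunk)


def split(values: List[int], parts: int) -> List[List[int]]:
    """Cut the input into at most `parts` chunks of roughly equal size."""
    size = max(1, (len(values) + parts - 1) // parts)
    return [values[i:i + size] for i in range(0, len(values), size)]


values = list(range(1_000))
chunks = split(values, parts=4)

# Each worker solves one subproblem; the partial results are combined afterwards.
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_sums = list(pool.map(sum_of_squares, chunks))

total = sum(partial_sums)
print(total)  # 332833500
```

The frameworks below apply the same split-and-combine pattern, but on hardware that executes the subproblems at the same time.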
Nvidia developed the CUDA parallel computing platform to parallelize tasks on its GPUs. This came as a relief to programmers who lacked any knowledge of graphics programming, once a requirement for GPU programming APIs of the past.
CUDA is designed to be used with C, C++ or Fortran, and third-party wrappers are available for a number of other programming languages, such as Python, Ruby and Julia.
The Open Multi-Processing API (OpenMP) is a high-level framework for parallel computing across multiple processors that share the same memory. OpenMP creates parallel threads by forking them from a master thread. Because it uses shared memory, OpenMP is limited to programs that run on one machine only.
On the other hand, the Open Computing Language (OpenCL) allows parallel computing across different devices, which removes OpenMP’s restriction to processors that share the same memory. With OpenCL, you can write programs that execute on different types of platforms, such as GPUs, CPUs and others.
In our highly connected world, we’ve grown used to receiving updated information in real time.
If you want to learn about the technologies that underlie this trend and build your own real-time streaming application, consider taking our free course Real-Time Analytics with Apache Storm.