We generate mind-boggling amounts of data daily. Seeing as this trend is here to stay, our task as data engineers is to continually innovate ways to efficiently process, store and analyze data.
Data streaming is certainly part of the solution, and luckily we now have sophisticated tools for working with streaming data, including those offered by Amazon Web Services (AWS). Read on for our overview of both data streaming and some of the most popular AWS streaming tools.
What Is Data Streaming?
The term “data streaming” refers to the continuous collection and processing of incoming data. This incoming data can be any business-relevant event occurring at a measured point in time, such as a log that’s generated each time someone completes a purchase in an online store. A data stream, then, is a constant, uninterrupted flow of such data.
If this sounds vaguely familiar, that’s because the concept of “data streaming” has been around for a long time, though it hasn’t always been called that. Many related systems are based on the concept of taking actions in response to incoming data, such as the transaction processing systems used in banks, or even tools like the Windows Task Manager, which monitors computer performance in real time.
Types of Data Processing
There are two ways to process data: periodically in volumes, or continuously on-arrival. The former approach, also known as batch processing, delivers chunks of data according to predefined rules, such as overnight or once a certain number of data points has been reached. Continuous processing, on the other hand, focuses on instantaneous delivery: Each piece of data is sent as soon as it’s generated.
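A minimal Python sketch can make the contrast concrete. The event source and the purchase amounts below are invented purely for illustration:

```python
def event_source():
    """Simulate a stream of incoming purchase events."""
    for i in range(10):
        yield {"order_id": i, "amount": 10.0 + i}

def process_in_batches(events, batch_size=5):
    """Batch processing: wait until a chunk has accumulated, then act on it."""
    batch, totals = [], []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            totals.append(sum(e["amount"] for e in batch))
            batch = []
    return totals  # one result per completed batch

def process_continuously(events):
    """Continuous processing: act on every event the moment it arrives."""
    running_total = 0.0
    for event in events:
        running_total += event["amount"]
        yield running_total  # updated after every single event

batch_results = process_in_batches(event_source())
stream_results = list(process_continuously(event_source()))
print(batch_results)       # [60.0, 85.0]
print(stream_results[-1])  # 145.0
```

The batch version produces one result per chunk, while the continuous version has an up-to-date answer after every event; that difference in freshness is exactly what the paradigm shift described below is about.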
Batch processing used to be the dominant approach in the pre-big data times, and it’s still common to this day, especially for businesses that do not depend on time-sensitive data. But we’re now witnessing a paradigm shift with the business community’s realization that real-time insights have far greater value than delayed insights.
Benefits of Data Streaming
The benefits of real-time data streaming are most apparent when we compare it to traditional, batch-operated data warehouses. We’ve already mentioned the main advantage of data streaming, which is the low latency between updates. For a technology-driven business, being able to extract learnings and insights from production data on the same day can be much more valuable than getting those same learnings even just a few days or weeks later. Developers are able to work with analysts and product managers to create alerts based on metrics and behaviors in their applications.
Another benefit is in the increased simplicity of data architectures. If we zoom out a bit and think about data warehouses, we see that they exhibit features similar to a delayed data stream. Traditionally, businesses used to save data from disparate sources across multiple databases, with each interconnected via APIs or some other type of point-to-point connection. Batches of data from these databases would then periodically be used to populate a data warehouse using extract, transform and load (ETL) processes.
Such disjointed architectures clearly come with big scalability problems — so big, in fact, that these architectures earned themselves the name “spaghetti architectures,” evoking an image of labyrinth-like architecture with tangled components. Indeed, synchronizing many different parts comes with an enormous amount of complexity. As a solution, real-time data streaming architecture proposes a single, logically unified data source. This has a drastic effect on how data flows through businesses.
Data Streaming Use Cases
Businesses are interested in data streaming mostly because of the actions they can take based on the data they receive and process, both in real-time and post-factum. For example, data streaming enables companies to:
- build applications that depend on real-time data (e.g., fraud monitoring or fraud prevention systems);
- process data one element at a time and use it to train machine-learning models;
- perform data analysis on summaries created from original data;
- save data into databases such as Amazon Redshift or Apache Cassandra, for further analysis and dashboarding;
- store incoming data permanently into a storage service like Amazon S3.
In the next section, we’ll provide an overview of AWS streaming solutions. For a more thorough take on data streaming before moving on to these concrete tools, check out this article first.
AWS Kinesis
AWS Kinesis is a fully managed and scalable real-time data streaming service from Amazon Web Services (AWS). That’s quite a definition, so let’s break it down:
- “Fully managed” means that you don’t have to worry about any complex infrastructural issues; AWS takes care of it all.
- “Scalable” means that you can adapt Kinesis to handle any amount of data with minimal delay as your data increases. You pay only for what you use.
- “Real-time” means that Kinesis makes your data available for processing and analysis almost as soon as it’s generated.
Kinesis can be used for any type of real-time data, including audio, video, logs and clickstreams. It’s separated into multiple services; below, we’ll cover Data Streams, Data Firehose and Data Analytics.
Kinesis Data Streams
Kinesis Data Streams (KDS) is the main AWS streaming service used for capturing, processing and storing data streams. Its capacity is managed manually (by provisioning shards), so it’s aimed at developers who want to build applications that require custom processing and analysis of streaming data. KDS ingests data from one or multiple sources, such as EC2 instances and traditional servers, with almost no latency at all. We’re talking about reading gigabytes of incoming data in real time!
Data streams can be read from KDS using the AWS CLI (the AWS command-line interface), the official AWS SDKs or the Kinesis Client Library (KCL), which supports Java, Python and other languages. There’s also Kinesis support for Apache Storm and Apache Spark.
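As a sketch of the SDK route, here is what a simple producer and consumer might look like with boto3, the AWS SDK for Python. The stream name and shard ID are hypothetical placeholders, and the code assumes AWS credentials are already configured:

```python
import json

STREAM_NAME = "example-stream"  # hypothetical stream name

def encode_record(event: dict) -> bytes:
    """Kinesis records are opaque byte blobs; JSON is a common encoding."""
    return json.dumps(event).encode("utf-8")

def put_event(kinesis, event: dict, partition_key: str) -> None:
    # Records sharing a partition key land on the same shard, preserving order.
    kinesis.put_record(
        StreamName=STREAM_NAME,
        Data=encode_record(event),
        PartitionKey=partition_key,
    )

def read_events(kinesis, shard_id: str = "shardId-000000000000") -> list:
    # TRIM_HORIZON starts from the oldest record still retained in the shard.
    iterator = kinesis.get_shard_iterator(
        StreamName=STREAM_NAME,
        ShardId=shard_id,
        ShardIteratorType="TRIM_HORIZON",
    )["ShardIterator"]
    response = kinesis.get_records(ShardIterator=iterator, Limit=100)
    return [json.loads(record["Data"]) for record in response["Records"]]

if __name__ == "__main__":
    import boto3  # official AWS SDK for Python

    client = boto3.client("kinesis", region_name="us-east-1")
    put_event(client, {"order_id": 1, "amount": 9.99}, partition_key="customer-42")
    print(read_events(client))
```

In production you would typically use the KCL instead of raw shard iterators, since it handles shard discovery, load balancing and checkpointing for you.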
Kinesis Data Firehose
Kinesis Data Firehose is a fully-managed AWS streaming architecture used for storing data streams into other AWS products, such as S3 buckets or Amazon Redshift, where the data can be further assessed and analyzed. On its way to the target destination, Firehose allows for the data to be transformed with AWS Lambda.
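A minimal producer sketch with boto3 (the AWS SDK for Python) might look like the following; the delivery stream name is a hypothetical placeholder, and credentials are assumed to be configured:

```python
import json

DELIVERY_STREAM = "example-firehose"  # hypothetical delivery stream name

def to_firehose_record(event: dict) -> dict:
    # A trailing newline keeps individual records separable once Firehose
    # concatenates them into objects at the destination (e.g., an S3 bucket).
    return {"Data": (json.dumps(event) + "\n").encode("utf-8")}

def send_to_firehose(event: dict) -> None:
    import boto3  # official AWS SDK for Python

    firehose = boto3.client("firehose")
    # Firehose buffers records and delivers them to the configured
    # destination in batches; no consumer code is needed on the other end.
    firehose.put_record(
        DeliveryStreamName=DELIVERY_STREAM,
        Record=to_firehose_record(event),
    )

if __name__ == "__main__":
    send_to_firehose({"order_id": 1, "amount": 9.99})
```

Note that the destination, buffering hints and any Lambda transformation are configured on the delivery stream itself, not in the producer code.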
It’s important to note the distinction between Data Streams and Data Firehose.
Both Data Streams and Data Firehose are used for streaming data, but while the former retains records for up to a week, the latter doesn’t store data itself; it simply delivers it to its configured destination. Another important distinction is latency: KDS makes records available in near real time, whereas Data Firehose buffers incoming data before delivery, so its latency is slightly higher and it might be better suited for applications that don’t depend on real-time data ingestion.
Generally, Data Streams allows for more flexibility. But if your application doesn’t require custom processing and coding, Firehose beats Data Streams simply on ease of use. Additionally, Data Firehose is serverless and scales automatically, unlike Data Streams. Finally, the two tools may also be combined. For example, data from Data Streams can be fed into Data Firehose for permanent storage after processing.
Kinesis Data Analytics
Kinesis Data Analytics is an AWS streaming analytics tool for the real-time analysis of streaming data. It’s independent of the data streaming source and can be connected to both Data Streams and Data Firehose. Data Analytics allows you to write traditional SQL code and query the incoming streaming data just like you would query a relational database.
It’s also possible to write a series of SQL statements, such that the output stream of the previous statement is fed as an input to the next statement. The output streams can then be sent to a destination for further analysis, visualizations or dashboard updating.
In this article, we covered the fundamentals of data streaming and provided an overview of AWS streaming tools Data Streams, Data Firehose and Data Analytics. To be able to build continuous applications or visually inspect a data sink for accuracy, you’ll need practical, hands-on experience. Our highly specialized and expert-taught Data Streaming Nanodegree has got you covered.