People once believed that once a computer had beaten a human player at chess, machine intelligence would have surpassed human intelligence. But mastering chess for a computer turned out to be comparatively trivial, as demonstrated more than two decades ago when IBM’s Deep Blue beat Garry Kasparov, the former world chess champion.
And although machines have certainly surpassed human performance in some domains, teaching computers how to communicate via human language, something we do with little effort, has remained a challenging task.
However, owing in part to developments in algorithms and the democratization of natural-language processing (NLP) in the Python community, recently the field has seen rapid advances. What follows is an overview of the most popular NLP applications and techniques with practical implementations in Python.
What is Natural-Language Processing?
Natural-language processing (NLP) allows a computer to draw insights from language data. It can also be used as any application that produces an output given some language as input, but there are more comprehensive definitions of NLP as well. Most NLP applications are powered by Python and form the core of many tools we use every day, such as email spam filtering and auto-completion in instant messaging. Below we cover more common applications of NLP.
Spell checking is probably the most common NLP application, an integral part of internet search engines, email clients, and word processors. Spell checking often works in concert with other tools that increase text clarity and correctness, such as grammar correction and word completion.
Applications that perform speech recognition transcribe spoken language into text. It constitutes an essential component of Apple’s Siri and similar voice-controlled assistants. Speech recognition detects patterns in speech and transforms them into text, which a smart device then executes as a command.
Now that the internet represents the main channel for high-speed communication, the need has grown for instantaneous translation. Older machine-translation systems relied on complicated handwritten rules and templates and needed the help of human translators, linguists, and computer scientists — and provided mediocre results.
Today, machine-translation systems rely on deep-learning architectures that use no rules at all, instead capturing statistical patterns in huge bodies of parallel language data.
Sentiment analysis refers to the task of classifying opinions and emotions expressed within the text. This can be valuable for a company that wants to know how their customers feel about a product or a service they offer.
Rather than asking customers to fill out a survey, they can analyze online reviews or posts on social media, the source of much opinion data. Later in this article, we’ll take a look at an example of sentiment analysis in a Python natural-language processing library called TextBlob.
What are the Common NLP Techniques?
For a machine-learning system to learn, the training dataset requires proper preprocessing, which allows the model to use the relevant portion of the input and to capture meaningful statistical correlations, rather than keying on the noise in the data. Different applications require different preprocessing techniques, but their input will always need to be transformed in some way.
Some of the most common preprocessing techniques in NLP are tokenization (i.e., breaking up a sentence into smaller lexical units, such as words and phrases), stopword removal (removing words that don’t contribute to the meaning, like “the” and “of) and stemming (reducing words to their root forms so that words such as “go”, “going” and “went” count as a single token). More advanced processing techniques include part-of-speech tagging and named-entity recognition (extracting words and phrases referring to people, organizations, places, and other entities).
Why Python is a Popular Choice for NLP
Many practitioners of natural-language processing use Python: Its syntax is simple and it has a shallow learning curve, handling much of the low-level computational and logical complexity for the programmer. Writing and reading Python code is fairly intuitive, even if you’re just getting started. It boasts a large, collaborative community and numerous libraries with many tools usable out of the box, and these features make Python excellent for prototyping and experimenting — which is why it’s so popular among researchers and companies.
Python in Natural-Language Processing
Let’s look at an example of natural-language processing with Python. We’ll train a sentiment-analysis model on a movie-review dataset consisting of about 10,000 positive and negative movie reviews. For the sake of brevity, we’ll skip the preprocessing step, but you can learn more about preprocessing techniques in the Natural Language Processing Python course.
Below is an overview of the data we’re working with.
Machine-learning algorithms work only on numerical data, so we need to transform our text data into numbers. The sentiments of the reviews are already labeled numerically — 1 for a positive and 0 for a negative review — so we need to convert the reviews into numbers as well.
Before we begin, the reviews need to be ingested into a vocabulary — i.e., the set of all unique words that appear in the text. Then we’ll represent the reviews in a bag-of-words (BoW) model, which is one of the simplest ways to numerically represent text. In a BoW representation, each sentence is represented as a vector of numbers of the same length as the vocabulary, where each index in the vector corresponds to a different word from the vocabulary.
For words that were present in the original text sentence, the BoW representation will have a value of 1 (or the count of however many times each word appeared in the sentence) indexed in the vector according to the word’s position in the vocabulary, and 0 at indices of all other words that weren’t present. For a more comprehensive explanation, read this Medium article on text representation techniques.
We use scikit-learn to split our dataset into two parts, one of which we use for training our model, and another one for testing its performance. Then we create a vocabulary and convert the text reviews into a bag-of-words representation, and now our data is ready to train a logistic-regression classifier.
Our model correctly guessed the sentiment 80% of the time, not bad for such a simple model. Making the model more complex would increase its predictive power. For example, we could represent the reviews with a more sophisticated method than bag-of-words.
If you don’t want to build a model from scratch, you can often find a library that does the job. Let’s look at an example of sentiment analysis using TextBlob.
TextBlob‘s out-of-the-box sentiment analysis model performs worse than ours on the test set; we might be able to improve its performance with some fine tuning.
Now let’s dive into some of the libraries that every NLP practitioner should have in their toolkit.
Popular Python NLP Libraries
This essential natural-language processing Python library has the tools to accomplish the majority of NLP tasks. It’s a bit slow, though, so it’s mostly used for teaching purposes.
Whereas NLTK’s best for teaching, spaCy focuses on runtime performance. It offers state-of-the-art performance on most tasks, such as tokenizing, stemming, part-of-speech tagging, and dependency parsing. It’s fast and made for production.
TextBlob has a simple interface which makes it great for prototyping. It has many out-of-the-box tools, such as sentiment analysis, semantic-similarity calculation, and language translation.
Scikit-learn provides general machine learning tools used in NLP, such as classes for building a pipeline and parameter-tuning methods, as well as classification, regression, and clustering algorithms.
In this article, we explained how natural-language processing technologies can produce insights using language data. We briefly covered some of the most common NLP techniques and applications. By analyzing sentiment on a dataset of movie reviews, we established two simple but effective baselines and showed how Python’s rich and accessible natural-language processing ecosystem can help its users make better and more informed decisions.
To learn more about NLP, sign up for our Natural Language Processing in Python Nanodegree.