
Web scraping with Python: Requests

Web scraping is the process of extracting data from websites. It can be as simple as manually copying and pasting from a webpage. More commonly though, it refers to the use of scripts or bots to extract the data automatically.

Requests

The Requests package is one of the most downloaded packages for Python, averaging around 30 million downloads a week (Requests, 2023). Requests provides the ability to query a webpage’s HTML code via Python. While its most common use is as an API tool, it is also great for extracting information from HTML without having to rely on a browser.

Due to its popularity, Requests comes included in many popular Python distributions, including Anaconda. If your installation doesn’t have Requests, simply run a pip install from the command prompt or terminal.
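
That install is a single command:

    pip install requests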

A full tutorial on the Requests package could take up an entire Udacity course. This blog article is going to focus on interacting with HTML and HTTP status codes to provide some insight into how a webpage is responding. It will cover the .get() method for calling a URL, then move on to status_code, text, and content, along with searching the HTML with find.

requests.get()

The .get() method from Requests works by passing it a URL. It returns a response containing the HTML code that makes up the page in question. In this example, we are going to look at Udacity’s main website. Note the output: <Response [200]>. The 200 is the HTTP status code.
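
Here is a minimal sketch of that call (the exact URL is assumed here; any reachable page works the same way):

    import requests

    # Request Udacity's main page; .get() returns a Response object
    r = requests.get("https://www.udacity.com")
    print(r)  # <Response [200]>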

.status_code

The .status_code attribute returns the HTTP status code of a website’s response. This can provide vital information as to how a website is responding.

HTTP status codes work as follows:

  • 100s: the request was received
  • 200s: the request was successfully acted on
  • 300s: the request requires a redirection
  • 400s: client error
  • 500s: server error

Basically, when monitoring a website’s response, a status code starting with 1, 2, or 3 means the page is working fine, while a 4 or 5 shows the website is not responding properly. You can use .status_code to get the numeric status code and assign it to a variable.

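A minimal sketch of reading the status code, using the same assumed URL as above:

    import requests

    r = requests.get("https://www.udacity.com")
    status = r.status_code  # the numeric HTTP status code, e.g. 200
    print(status)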

Another option is .ok, which will return True for any status code under 400.

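For example:

    import requests

    r = requests.get("https://www.udacity.com")
    print(r.ok)  # True for any status code under 400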

Either .status_code or .ok is great for creating a script that monitors website responsiveness, as sketched below. If you are monitoring a set of websites, such a script can alert you to problems before the complaint emails start rolling in.
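
As a rough sketch, such a monitor might look like the following (the site list and alert message are placeholders):

    import requests

    # Placeholder list of sites to monitor
    sites = ["https://www.udacity.com"]

    for url in sites:
        r = requests.get(url)
        if not r.ok:
            print(f"Alert: {url} returned status {r.status_code}")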

Timeout

Speaking of responsiveness: sometimes problems with a webpage can cause a .get() request to hang. This can last a few seconds, a few hours, or until a break command is issued. One way to avoid this is to simply add a timeout argument. In the example below, the call will time out after 3 seconds if there is no response.

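A sketch of that 3-second timeout; a request that times out raises an exception you can catch:

    import requests

    try:
        # Give up if there is no response within 3 seconds
        r = requests.get("https://www.udacity.com", timeout=3)
    except requests.exceptions.Timeout:
        print("The request timed out")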

.text

The .text attribute returns the HTML code behind the website as a string.

The HTML for a website can sometimes be a bit overwhelming, so adding a slice (as shown below) to r.text returns only the first 150 characters.

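For example:

    import requests

    r = requests.get("https://www.udacity.com")
    print(r.text[:150])  # only the first 150 characters of the HTML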

Unfortunately, .text returns raw text and does not provide any HTML parsing. For that you will need a parsing package like Beautiful Soup. Because .text is a plain string, though, you can use the string .find() method to search the raw HTML. In the example below, find returns the starting character position of the word Udacity in the HTML code. Looking at the slice example above, you can see that the U in Udacity starts at position 122 in that text string.

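A minimal sketch of that search (live pages change, so the exact position may differ when you run it):

    import requests

    r = requests.get("https://www.udacity.com")
    print(r.text.find("Udacity"))  # 122 in the example above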

.content

The output of r.content at first looks just like r.text, but while r.text is a Unicode string, r.content is bytes, meaning it can handle things beyond text, like images. r.content can also be used to save the HTML code to a file.

This is a great approach for monitoring changes to a website or creating snapshots of what a website looked like at a specific time.

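A minimal sketch of saving such a snapshot (the filename is a placeholder):

    import requests

    r = requests.get("https://www.udacity.com")

    # Open in binary ("wb") mode because r.content is bytes, not a string
    with open("udacity_snapshot.html", "wb") as f:
        f.write(r.content)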

Conclusion

While the Python package Requests is actually geared more towards working with APIs, it can be a useful tool for basic web scraping and website monitoring. Adding a parsing tool like Beautiful Soup can greatly improve working with the HTML in Python. However, sometimes a raw text search or the ability to store the HTML code as a file is really all that is needed.

Interested in expanding your Python skills? Udacity offers a variety of Python courses for any skill level. Just starting out? Try our Introduction to Programming Nanodegree program and learn coding basics. Already have a coding foundation? Try our Intermediate Python Nanodegree program to learn more advanced Python topics and skills.