Web scraping is the process of extracting data from websites. It can be as simple as manually copying and pasting from a webpage. More commonly though, it refers to the use of scripts or bots to extract the data automatically.
Requests
The Requests package is one of the most downloaded Python packages, averaging around 30 million downloads a week (Requests, 2023). Requests provides the ability to query a webpage's HTML via Python. While its most common use is as an API tool, it is also great for extracting information from HTML without having to rely on a browser.
Due to its popularity, Requests comes included in many popular Python distributions, including Anaconda. If your installation doesn't have Requests, simply run a pip install from the command prompt or terminal.
A full tutorial on the Requests package could take up a Udacity course. This blog article focuses on interacting with HTML and HTTP status codes to provide some insight into how a webpage is responding. It covers the .get() method for calling a URL, and from there additional methods and attributes such as status_code, text, find, and content.
requests.get()
The .get() method from Requests works by passing it a URL. It returns a Response object for the URL in question. In this example, we are going to look at Udacity's main website. Note the output: <Response [200]>. This is the HTTP status code.
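A minimal call looks like the following sketch (the URL is Udacity's homepage, as in the example above):

```python
import requests

# Request the HTML for Udacity's main website
r = requests.get("https://www.udacity.com")

# Printing the Response object shows the HTTP status code
print(r)  # e.g. <Response [200]>
```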
Status_code
status_code can be used to return the HTTP status code of a website. This can provide vital information about how a website is responding.
HTTP status codes work as follows:
- 100s: the request was received
- 200s: the request was successfully acted on
- 300s: the request required a redirection
- 400s: client error
- 500s: server error
Basically, when monitoring a website's response, a status code starting with 1, 2, or 3 means the page is working fine, while one starting with 4 or 5 shows the website is not responding properly. You can use status_code to get the numeric code and assign it to a variable.
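For instance, capturing the code in a variable looks like this (a minimal sketch):

```python
import requests

r = requests.get("https://www.udacity.com")

# Assign the numeric HTTP status code to a variable
code = r.status_code
print(code)  # e.g. 200
```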
Another option is .ok, which will return True for any status code under 400.
Either status_code or .ok works well for creating a script that monitors website responsiveness. If you are monitoring a set of websites, such a script can alert you to problems before the complaint emails start rolling in.
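A simple monitor might look like the sketch below. The list of URLs is hypothetical, and requests.exceptions.RequestException catches connection failures as well as timeouts:

```python
import requests

# Hypothetical list of sites to check
sites = ["https://www.udacity.com", "https://www.example.com"]

for url in sites:
    try:
        r = requests.get(url, timeout=5)
        # .ok is True for any status code under 400
        if not r.ok:
            print(f"Problem with {url}: status {r.status_code}")
    except requests.exceptions.RequestException as err:
        print(f"Could not reach {url}: {err}")
```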
Timeout
Speaking of responsiveness: sometimes problems with a webpage can cause a .get() request to hang. This can last a few seconds, a few hours, or until a break command is issued. One way to avoid this is to add a timeout argument. In the example below, the call will time out after 3 seconds if there is no response.
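A sketch of the timeout in action, catching the exception that Requests raises when the limit is hit:

```python
import requests

try:
    # Give up if the server has not responded within 3 seconds
    r = requests.get("https://www.udacity.com", timeout=3)
except requests.exceptions.Timeout:
    print("The request timed out")
```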
.text
The .text attribute returns the HTML code behind the website.
The HTML for a website can sometimes be a bit overwhelming, so adding a slicer (as shown below) to the r.text command returns only the first 150 characters.
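The slicer example, as a minimal sketch:

```python
import requests

r = requests.get("https://www.udacity.com")

# Slice the raw HTML string to show only the first 150 characters
print(r.text[:150])
```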
Unfortunately, .text returns raw text and does not provide any HTML parsing. For that you will need to use a parsing package like Beautiful Soup. Because .text is an ordinary string, however, you can use the string's .find method to search the raw HTML. In the example below, find returns the starting position of the word Udacity in the HTML code. Looking at the slicer example above, you can see that the U in Udacity is the 122nd character in that text string.
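A sketch of the search (no exact position is asserted here, since the live HTML changes over time; str.find returns the index of the first match, or -1 if the text is not found):

```python
import requests

r = requests.get("https://www.udacity.com")

# str.find returns the starting index of the first match, or -1
position = r.text.find("Udacity")
print(position)
```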
.content
At first glance, r.content looks just like the r.text output, but while r.text is a Unicode string, r.content is bytes, meaning it can handle things beyond text, like images. r.content can also be used to save the HTML code to a file.
This is a great method for monitoring changes to a website or creating snapshots of how a website looked at a specific time.
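Because r.content is bytes, the file should be opened in binary mode. A minimal sketch (the filename is arbitrary):

```python
import requests

r = requests.get("https://www.udacity.com")

# r.content is bytes, so write the file in binary mode ("wb")
with open("udacity_snapshot.html", "wb") as f:
    f.write(r.content)
```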
Conclusion
While the Python package Requests is actually geared more towards working with APIs, it can be a useful tool for basic web scraping and website monitoring. Adding parsing tools like Beautiful Soup can greatly improve working with HTML in Python. However, sometimes a raw text search or the ability to store the HTML code as a file is really all that is needed.
Interested in expanding your Python skills? Udacity offers a variety of Python courses for any skill level. Just starting out? Try our Introduction to Programming Nanodegree program and learn coding basics. Already have a coding foundation? Try our Intermediate Python Nanodegree program to learn more advanced Python topics and skills.