Web scraping, also known as web data extraction or web harvesting, is the process of extracting structured data from websites. Python, with its robust libraries like BeautifulSoup and Requests, has become a popular choice for web scraping tasks. In this article, we’ll delve into a simple Python web scraping code example and explain its components.
Here’s a basic Python code snippet that demonstrates how to scrape a webpage using the Requests library to fetch the content and BeautifulSoup to parse and extract the desired data:
```python
import requests
from bs4 import BeautifulSoup


def scrape_website(url):
    # Send a GET request to the URL
    response = requests.get(url)

    # Check if the request was successful
    if response.status_code == 200:
        # Parse the HTML content
        soup = BeautifulSoup(response.text, 'html.parser')

        # Find the desired data in the HTML
        # For example, let's scrape all the title tags (<title>)
        titles = soup.find_all('title')

        # Process the scraped data
        for title in titles:
            print(title.text.strip())
    else:
        print(f"Error: Failed to retrieve the webpage. Status code: {response.status_code}")


# Example usage
scrape_website('https://example.com')
```
In this code:
- We import the necessary libraries: `requests` for making HTTP requests and `BeautifulSoup` for parsing HTML.
- We define a function `scrape_website` that takes a URL as input.
- Inside the function, we use `requests.get()` to send a GET request to the URL and store the response in the `response` variable.
- We check whether the request succeeded by comparing the response's status code to 200 (OK).
- If the request was successful, we use BeautifulSoup to parse the HTML content of the response.
- We then find all the title tags (`<title>`) in the HTML using `soup.find_all('title')`. This returns a list of all the title tags found in the HTML.
- We iterate over the list of title tags and print their text content using `title.text.strip()`. The `strip()` method is used to remove any leading or trailing whitespace from the text.
- If the request was not successful (i.e., the status code is not 200), we print an error message indicating the failure.
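The parsing steps described above can be tried without any network access by handing BeautifulSoup an HTML string directly. The following is a minimal sketch: the helper name `extract_titles` and the sample HTML are illustrative assumptions, not part of the original script.

```python
from bs4 import BeautifulSoup


def extract_titles(html):
    """Return the stripped text of every <title> tag in an HTML string."""
    soup = BeautifulSoup(html, "html.parser")
    # find_all() returns a list-like collection of matching tags
    return [title.text.strip() for title in soup.find_all("title")]


# Hypothetical sample document, used only for illustration
sample_html = "<html><head><title>  Example Domain  </title></head><body></body></html>"
print(extract_titles(sample_html))  # prints ['Example Domain']
```

Separating parsing from fetching in this way makes the extraction logic easy to check against saved or hand-written HTML before pointing the script at a live site.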
Remember to handle exceptions and errors gracefully in real-world applications to make your web scraping scripts more robust and reliable.
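One possible way to handle such failures is sketched below. It assumes a hypothetical helper named `fetch_html` and an arbitrary 10-second default timeout; neither appears in the original example. It relies on `raise_for_status()` to turn 4xx and 5xx responses into exceptions and catches `requests.exceptions.RequestException`, the base class for the errors Requests raises (connection failures, timeouts, invalid URLs, HTTP errors).

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page body as text, or None if the request fails.

    The function name and the default timeout are illustrative choices.
    """
    try:
        # Without a timeout, a stalled server could block the script indefinitely
        response = requests.get(url, timeout=timeout)
        # Raise an HTTPError for 4xx and 5xx status codes
        response.raise_for_status()
    except requests.exceptions.RequestException as exc:
        print(f"Error: could not retrieve {url}: {exc}")
        return None
    return response.text
```

A caller can then test the result for `None` before passing it to BeautifulSoup, which keeps the error handling in one place.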