Scraping images from websites and organizing them into structured formats like Excel spreadsheets can be a useful skill for data analysis, research, or personal projects. Python, with its vast array of libraries, provides an excellent environment for such tasks. In this article, we will discuss how to scrape images from Baidu Image Search and export the details to an Excel file. We will use requests and BeautifulSoup for scraping, and pandas for handling data and exporting to Excel.
Note: Web scraping can infringe on copyright and terms of service. Always ensure you have permission to scrape a website and comply with its robots.txt file and terms of service.
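One way to act on the robots.txt advice is Python's standard urllib.robotparser module. The sketch below parses a set of rules and checks whether a given URL may be fetched. The rules, the example.com URLs, the "my-scraper" user agent, and the helper name is_allowed are all made up for illustration; they are not Baidu's actual policy.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_lines, user_agent, url):
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_lines)  # rules are supplied as a list of text lines
    return parser.can_fetch(user_agent, url)


# Hypothetical rules, for illustration only
SAMPLE_RULES = [
    "User-agent: *",
    "Disallow: /search/",
]

if __name__ == "__main__":
    print(is_allowed(SAMPLE_RULES, "my-scraper", "https://example.com/search/index"))
    print(is_allowed(SAMPLE_RULES, "my-scraper", "https://example.com/about"))
```

For a live site, calling set_url() and then read() on the parser downloads the real robots.txt instead of using a hard-coded list.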
Step 1: Setting Up the Environment
First, ensure you have Python installed on your machine. Then, install the necessary libraries using pip:
    pip install requests beautifulsoup4 pandas openpyxl
Step 2: Scraping Baidu Images
Baidu Image Search employs JavaScript rendering, which makes it challenging to scrape with traditional methods. For simplicity, this guide will focus on scraping basic image results from search pages directly accessible via URLs. Advanced scraping, especially JavaScript-rendered content, often requires tools like Selenium.
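The result pages requested in this step are addressed by a query string that carries the search term in word and a result offset in pn. As a minimal sketch, such a URL can be assembled with the standard library so that spaces and non-ASCII search terms are percent-encoded. The helper name build_search_url and the constant RESULTS_PER_PAGE are hypothetical, and the page size of 30 is an assumption taken from the offset step used in this guide.

```python
from urllib.parse import urlencode

BASE_URL = "https://image.baidu.com/search/index"
RESULTS_PER_PAGE = 30  # assumed page size, matching the pn step used in this guide


def build_search_url(search_query, page=0):
    """Build the flip-style search URL for a zero-based page index."""
    params = {
        "tn": "baiduimage",
        "ie": "utf-8",
        "word": search_query,  # urlencode percent-encodes this value
        "ct": "201326592",
        "v": "flip",
        "pn": page * RESULTS_PER_PAGE,  # result offset for the requested page
    }
    return f"{BASE_URL}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_search_url("red panda", page=1))
```

Building the query string from a dictionary keeps each parameter visible and avoids hand-escaping the search term.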
Here’s a basic script to scrape image titles and image URLs from a Baidu Image Search result page:
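    import requests
    from bs4 import BeautifulSoup
    import pandas as pd
    from urllib.parse import quote

    # A browser-like User-Agent; some sites block the default one sent by requests
    HEADERS = {"User-Agent": "Mozilla/5.0"}

    def scrape_baidu_images(search_query, num_pages=1):
        """Collect image titles and URLs from Baidu Image Search result pages."""
        images = []
        # URL-encode the query so spaces and non-ASCII characters are safe
        base_url = (
            "https://image.baidu.com/search/index"
            f"?tn=baiduimage&ie=utf-8&word={quote(search_query)}&ct=201326592&v=flip"
        )
        for page in range(num_pages):
            # pn is the result offset; 30 results per page is assumed
            url = f"{base_url}&pn={page * 30}"
            response = requests.get(url, headers=HEADERS, timeout=10)
            response.raise_for_status()  # stop early on HTTP errors
            soup = BeautifulSoup(response.text, 'html.parser')
            # The 'img' class is an assumption about the page markup and may need adjusting
            img_elements = soup.find_all('img', class_='img')
            for img in img_elements:
                title = img.get('alt')
                img_url = img.get('src')
                images.append({'title': title, 'img_url': img_url})
        return images

    # Example usage
    search_query = "pandas"
    images_data = scrape_baidu_images(search_query, num_pages=2)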
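Scraped img tags do not always carry usable values: src may be missing, may be an inline data: placeholder, or may be protocol-relative (starting with //), and the same image can appear more than once. Whether Baidu's markup produces any of these cases is not confirmed here, so the following is only a sketch of a cleanup pass under that assumption. The helper name clean_records is hypothetical; it expects the list of dictionaries with title and img_url keys built above.

```python
def clean_records(images):
    """Drop unusable entries, normalize URLs, and remove duplicate images."""
    cleaned = []
    seen = set()
    for record in images:
        img_url = (record.get("img_url") or "").strip()
        if not img_url or img_url.startswith("data:"):
            continue  # skip missing sources and inline placeholders
        if img_url.startswith("//"):
            img_url = "https:" + img_url  # make protocol-relative URLs absolute
        if img_url in seen:
            continue  # keep only the first occurrence of each image
        seen.add(img_url)
        title = (record.get("title") or "").strip()
        cleaned.append({"title": title, "img_url": img_url})
    return cleaned
```

It could be applied as images_data = clean_records(images_data) before the data is exported, so the spreadsheet contains one row per distinct image URL.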
Step 3: Exporting to Excel
Now, let’s export the scraped data to an Excel file:
    def export_to_excel(images, filename="images.xlsx"):
        df = pd.DataFrame(images)
        df.to_excel(filename, index=False)
        print(f"Data exported to {filename}")

    # Exporting the scraped images data to Excel
    export_to_excel(images_data)
Conclusion
Scraping images from Baidu and exporting the data to Excel can be a straightforward process with Python. However, always ensure you are complying with the website’s terms of service and use scraping responsibly. For more complex websites or dynamic content, consider using tools like Selenium for browser automation.