Python: Scraping Baidu Images and Exporting to Excel

Scraping images from websites and organizing the results in a structured format such as an Excel spreadsheet is a useful skill for data analysis, research, and personal projects, and Python's library ecosystem is well suited to the task. In this article, we will scrape image details from Baidu Image Search and export them to an Excel file, using requests and BeautifulSoup for scraping and pandas (with openpyxl) for the Excel export.
Note: Web scraping can infringe on copyright and violate a site's terms of service. Always make sure you have permission to scrape a website, and comply with its robots.txt file and terms of service.
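
As a rough illustration of the robots.txt check mentioned in the note, the sketch below uses Python's standard-library `urllib.robotparser`. The rules and the example.com URLs are made up for the demonstration, and `is_allowed` is a hypothetical helper name; a real check would load the target site's own robots.txt.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_lines, url, user_agent="*"):
    """Return True if the given robots.txt rules permit fetching the URL."""
    parser = RobotFileParser()
    # parse() accepts the robots.txt content as an iterable of lines.
    # For a live site, parser.set_url(...) followed by parser.read() would
    # download the file instead.
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)


# Made-up rules, for illustration only
sample_rules = [
    "User-agent: *",
    "Disallow: /private/",
]

print(is_allowed(sample_rules, "https://example.com/public/page"))
print(is_allowed(sample_rules, "https://example.com/private/page"))
```

A robots.txt check only covers crawling rules; the site's terms of service still need to be read separately.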

Step 1: Setting Up the Environment

First, ensure you have Python installed on your machine. Then, install the necessary libraries using pip:

```bash
pip install requests beautifulsoup4 pandas openpyxl
```

Step 2: Scraping Baidu Images

Baidu Image Search relies on JavaScript rendering, so a plain HTTP request may return only part of the image markup, or none of it. For simplicity, this guide focuses on scraping basic image results from search pages that are directly accessible via URLs. Scraping JavaScript-rendered content usually requires a browser automation tool such as Selenium.
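
Since these result pages are addressed by a query string, the request URLs can be built ahead of time. The sketch below assembles one with the standard library. The parameter values are the ones used in this article's script, `build_search_url` is a hypothetical helper name, and the assumption that `pn` advances in steps of 30 comes from that script rather than from any Baidu documentation.

```python
from urllib.parse import urlencode

BASE_URL = "https://image.baidu.com/search/index"


def build_search_url(search_query, page=0, page_size=30):
    """Build the URL for one result page; page is zero-based."""
    params = {
        "tn": "baiduimage",
        "ie": "utf-8",
        "word": search_query,      # urlencode percent-encodes non-ASCII text
        "ct": "201326592",
        "v": "flip",
        "pn": page * page_size,    # offset of the first result on this page
    }
    return f"{BASE_URL}?{urlencode(params)}"


print(build_search_url("pandas", page=1))
```

Encoding the query this way matters mainly for search terms that contain spaces or non-ASCII characters, such as Chinese keywords.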

Here’s a basic script that collects the title (the alt text) and the image URL (the src attribute) of each image on a Baidu Image Search result page:

```python
import requests
from bs4 import BeautifulSoup
import pandas as pd


def scrape_baidu_images(search_query, num_pages=1):
    """Collect the alt text and src of matching <img> tags from result pages."""
    images = []
    base_url = "https://image.baidu.com/search/index"
    for page in range(num_pages):
        # Passing params lets requests URL-encode the query safely
        params = {
            "tn": "baiduimage",
            "ie": "utf-8",
            "word": search_query,
            "ct": "201326592",
            "v": "flip",
            "pn": page * 30,  # offset of the first result on this page
        }
        response = requests.get(base_url, params=params, timeout=10)
        response.raise_for_status()  # stop early on HTTP errors
        soup = BeautifulSoup(response.text, "html.parser")
        # Matches <img> tags whose class list includes "img"
        img_elements = soup.find_all("img", class_="img")
        for img in img_elements:
            title = img.get("alt")
            img_url = img.get("src")
            images.append({"title": title, "img_url": img_url})
    return images


# Example usage
search_query = "pandas"
images_data = scrape_baidu_images(search_query, num_pages=2)
```
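
Because live result pages may change or return no static image markup, it can help to try the extraction logic on a fixed HTML snippet first. The following sketch applies the same `img` tag and `img` class lookup to made-up sample HTML. The markup, the `extract_images` helper name, and the example.com URLs are illustrative assumptions, not Baidu's actual page structure.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration; real pages will differ
SAMPLE_HTML = """
<ul>
  <li><img class="img" alt="Giant panda" src="https://example.com/a.jpg"></li>
  <li><img class="img" src="https://example.com/b.jpg"></li>
  <li><img class="logo" alt="Site logo" src="https://example.com/logo.png"></li>
</ul>
"""


def extract_images(html):
    """Return title/img_url dicts for every <img> whose classes include 'img'."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for img in soup.find_all("img", class_="img"):
        # .get() returns None when the attribute is missing
        rows.append({"title": img.get("alt"), "img_url": img.get("src")})
    return rows


for row in extract_images(SAMPLE_HTML):
    print(row)
```

The second sample image has no alt attribute, so its title comes back as `None`, and the logo is skipped because its class does not match. If the same lookup returns an empty list on a real page, the images are probably being inserted by JavaScript after the page loads.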

Step 3: Exporting to Excel

Now, let’s export the scraped data to an Excel file:

```python
def export_to_excel(images, filename="images.xlsx"):
    """Write the list of image dicts to an Excel file (uses openpyxl)."""
    df = pd.DataFrame(images)
    df.to_excel(filename, index=False)
    print(f"Data exported to {filename}")


# Exporting the scraped images data to Excel
export_to_excel(images_data)
```
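
Scraped rows often include missing or duplicate URLs, so a small cleanup pass before exporting may keep the spreadsheet tidier. This is a minimal sketch in plain Python; `clean_image_rows` is a hypothetical helper, and keeping only `http://` and `https://` URLs is an assumption that would also drop inline `data:` images and protocol-relative links.

```python
def clean_image_rows(rows):
    """Drop rows without an http(s) URL and remove duplicate URLs."""
    seen = set()
    cleaned = []
    for row in rows:
        url = (row.get("img_url") or "").strip()
        # Assumption: only absolute http(s) URLs are worth exporting
        if not url.startswith(("http://", "https://")):
            continue
        if url in seen:
            continue  # keep only the first occurrence of each URL
        seen.add(url)
        cleaned.append({"title": row.get("title") or "", "img_url": url})
    return cleaned


sample_rows = [
    {"title": "Panda", "img_url": "https://example.com/a.jpg"},
    {"title": "Panda again", "img_url": "https://example.com/a.jpg"},
    {"title": None, "img_url": "https://example.com/b.jpg"},
    {"title": "Inline", "img_url": "data:image/png;base64,AAAA"},
    {"title": "Missing", "img_url": None},
]

print(clean_image_rows(sample_rows))
```

The cleaned list has the same shape as the scraper's output, so it can be passed to `export_to_excel` in place of the raw data.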

Conclusion

Scraping images from Baidu and exporting the data to Excel can be a straightforward process with Python. However, always ensure you are complying with the website’s terms of service and use scraping responsibly. For more complex websites or dynamic content, consider using tools like Selenium for browser automation.

[tags]
Python, Web Scraping, Baidu Images, Excel, Pandas, BeautifulSoup, Requests
