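Scraping IMDB Data with Python's BeautifulSoup and lxml

In this article, we will learn how to scrape movie data from the IMDB Top chart with BeautifulSoup and lxml, and how to store the results in a JSON file.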

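Environment Setup

Before starting, make sure the following Python libraries are available. requests, beautifulsoup4, lxml, and Unidecode are third-party packages that must be installed; time, random, and json are part of the standard library: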

import requests
from bs4 import BeautifulSoup
from lxml import etree as et
import time
import random
import json
from unidecode import unidecode

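These libraries are used, respectively, to send HTTP requests, parse HTML, run XPath queries, add delays between requests, serialize data as JSON, and transliterate non-ASCII text to ASCII.

Getting Movie Links from the IMDB Chart Page

First, we define the URL of the IMDB chart page and the request headers. The following code initializes these variables: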

start_url = "https://www.imdb.com/chart/top"
header = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36"}
movie_urls = []

Next, we send an HTTP request to fetch the page, parse the HTML with BeautifulSoup and lxml, and extract the movie links:

response = requests.get(start_url, headers=header)
soup = BeautifulSoup(response.content, 'html.parser')
dom = et.HTML(str(soup))
movie_urls_list = dom.xpath('//td[@class="titleColumn"]/a/@href')

We convert the extracted relative links into full URLs, strip their query strings, and store them in the movie_urls list:

for i in movie_urls_list:
    long_url = "https://www.imdb.com" + i
    short_url = long_url.split("?")[0]
    movie_urls.append(short_url)
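
The loop above joins each relative link to the site root and drops the query string. As a minimal sketch, the same step can be done with the standard library's urllib.parse. The function name normalize_movie_url and the sample href are hypothetical and exist only for illustration; they are not part of the original script.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit

BASE_URL = "https://www.imdb.com"


def normalize_movie_url(href, base=BASE_URL):
    """Turn a relative href into an absolute URL without query or fragment."""
    absolute = urljoin(base, href)
    parts = urlsplit(absolute)
    # Keep scheme, host, and path; drop the query string and fragment.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


# Made-up href, shaped like the links on the chart page.
sample_href = "/title/tt0000000/?ref_=chttp_t_1"
print(normalize_movie_url(sample_href))
```

Compared with string concatenation and split("?"), this also handles hrefs that are already absolute or that carry a fragment.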

Setting a Request Delay

To avoid putting too much load on the target site's server, we add a random delay between consecutive requests:

def time_delay():
    time.sleep(random.randint(2, 5))

This practice not only protects the target site but also lowers the risk of being blocked.
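
One possible variation, shown here only as a sketch, uses random.uniform so the pause is not limited to whole seconds and makes the bounds configurable. The name polite_delay and its parameters are assumptions introduced for this example.

```python
import random
import time


def polite_delay(min_seconds=2.0, max_seconds=5.0):
    """Sleep for a random, possibly fractional duration and return it."""
    if min_seconds < 0 or max_seconds < min_seconds:
        raise ValueError("expected 0 <= min_seconds <= max_seconds")
    duration = random.uniform(min_seconds, max_seconds)
    time.sleep(duration)
    return duration


# Short bounds are used here only so the demo finishes quickly.
slept = polite_delay(0.01, 0.05)
print(f"slept for {slept:.3f} seconds")
```

Returning the duration makes it easy to log how long the scraper actually waited.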

Storing Data in a JSON File

We store the scraped data in a JSON file. First, create a JSON file that contains an empty list:

with open("data_v1.json", "w") as f:
    json.dump([], f)

Then, define a function that appends a new record to the list in the JSON file:

def write_to_json(new_data, filename='data_v1.json'):
    with open(filename, 'r+') as file:
        file_data = json.load(file)
        file_data.append(new_data)
        file.seek(0)
        json.dump(file_data, file, indent=4)
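
The read-append-rewrite pattern above can be tried out without touching a real data file. The sketch below is a hedged variant, not the original function: append_record is a hypothetical name, and the explicit encoding, ensure_ascii=False, and truncate() call are additions made for this example.

```python
import json
import os
import tempfile


def append_record(record, filename):
    """Append one record to the JSON array stored in filename."""
    with open(filename, "r+", encoding="utf-8") as f:
        records = json.load(f)
        records.append(record)
        f.seek(0)
        json.dump(records, f, indent=4, ensure_ascii=False)
        # Cut off leftover bytes in case the new content is shorter.
        f.truncate()


# Demo on a temporary file so no real data file is touched.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.json")
with open(demo_path, "w", encoding="utf-8") as f:
    json.dump([], f)

append_record({"rank": 1, "movie_name": "Example Movie"}, demo_path)
append_record({"rank": 2, "movie_name": "Another Example"}, demo_path)

with open(demo_path, encoding="utf-8") as f:
    print(len(json.load(f)))
```

Note that this approach rewrites the whole file on every append, which is fine for a few hundred records but would become slow for much larger datasets.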

Scraping Movie Details

Next, we loop over the movie_urls list, request each movie's detail page, and extract the relevant data:

for movie_url in movie_urls:
    response = requests.get(movie_url, headers=header)
    soup = BeautifulSoup(response.content, 'html.parser')
    dom = et.HTML(str(soup))

    rank = movie_urls.index(movie_url) + 1
    movie_name = dom.xpath('//h1[@data-testid="hero-title-block__title"]/text()')[0]
    movie_year = dom.xpath('//a[@class="ipc-link ipc-link--baseAlt ipc-link--inherit-color sc-8c396aa2-1 WIUyh"]/text()')[0]
    genre = dom.xpath('//span[@class="ipc-chip__text"]/text()')
    director_name = dom.xpath('//a[@class="ipc-metadata-list-item__list-content-item ipc-metadata-list-item__list-content-item--link"]/text()')[0]
    rating = dom.xpath('//span[@class="sc-7ab21ed2-1 jGRxWM"]/text()')[0]
    actors_list = dom.xpath('//a[@data-testid="title-cast-item__actor"]/text()')
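    # Transliterate non-ASCII characters in actor names to plain ASCII.
    actors_list = [unidecode(i) for i in actors_list]

    write_to_json({
        'rank': rank,
        'movie_name': movie_name,
        'movie_url': movie_url,
        'movie_year': movie_year,
        'genre': genre,
        'director_name': unidecode(director_name),
        'rating': rating,
        'actors': actors_list
    })

    # Pause between requests to avoid overloading the server.
    time_delay()
    # Report progress as a percentage rounded to two decimal places.
    print("{}% data is written to json file".format(round((rank * 100) / len(movie_urls), 2)))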

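In the code above, we use XPath to extract each movie's name, year, genre, rating, director, and actors, while the rank is derived from the movie's position in movie_urls. Each record is then passed to write_to_json to be written to the JSON file. Class names such as "sc-7ab21ed2-1 jGRxWM" look auto-generated, so these selectors may stop matching when IMDB updates its pages.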
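
Because an XPath query returns a list, indexing it with [0] raises IndexError whenever a selector matches nothing. A small guard, sketched below, avoids a crash in the middle of a long run. The helper name first_or_default is hypothetical, and plain Python lists stand in for what dom.xpath(...) would return.

```python
def first_or_default(results, default=None):
    """Return the first item of an XPath result list, or a default if empty."""
    if not results:
        return default
    value = results[0]
    # Text results are strings; strip surrounding whitespace from them.
    return value.strip() if isinstance(value, str) else value


# Plain lists stand in for the return value of dom.xpath(...).
print(first_or_default(["  9.3 "]))
print(first_or_default([], default="N/A"))
```

In the scraping loop, this would be used as, for example, rating = first_or_default(dom.xpath(...), default=None), so that a missing field is recorded as None instead of stopping the script.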

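Summary

In this tutorial, you learned how to use BeautifulSoup and lxml to scrape data from IMDB and store it as a JSON file. The key steps are:

Fetch the page content with requests.
Parse the HTML structure with BeautifulSoup and lxml.
Extract the required data and store it in a JSON file.
Add random delays to protect the target site.

This approach is not limited to IMDB and can be adapted to scraping tasks on other, similar websites. We hope this article helps your learning!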

Original article: https://www.blog.datahut.co/post/scraping-imdb-data-using-python-beautifulsoup-and-lxml