[Study & Discussion] A simple introduction to web scraping with Python

last_idol · June 27, 2021, 11:02

First, install the requests library from the command line with pip3 or pip:

pip3 install requests

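Next, open the web page titled "Convert curl commands to code". Read the instructions in the lower half of the page first. Then, as they describe, open Chrome, copy a request with "Copy as cURL", and paste it into the text box on the left side of the page. The matching Python code is generated automatically. Change the URL passed to requests.get and the script is ready to run.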
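To give a feel for what the converter does, here is a rough sketch that pulls the URL and the headers out of a copied cURL command. This is not the site's actual implementation; `curl_to_request_parts` is a made-up helper name, the example command is invented, and only the `-H`/`--header` option is handled.

```python
import shlex


def curl_to_request_parts(command):
    """Split a 'Copy as cURL' command into a URL and a headers dict.

    Hypothetical helper for illustration only: it understands just
    -H/--header and skips every other flag without reading a value.
    """
    tokens = shlex.split(command)
    url = None
    headers = {}
    i = 1  # skip the leading "curl"
    while i < len(tokens):
        token = tokens[i]
        if token in ("-H", "--header") and i + 1 < len(tokens):
            # "Name: value" -> lower-cased name, trimmed value
            name, _, value = tokens[i + 1].partition(":")
            headers[name.strip().lower()] = value.strip()
            i += 2
        elif token.startswith("-"):
            i += 1  # e.g. --compressed; flags with values are not handled
        else:
            if url is None:
                url = token  # the first bare argument is treated as the URL
            i += 1
    return url, headers


# Invented example command, shaped like what Chrome copies.
example = (
    "curl 'https://example.com/page.html' "
    "-H 'User-Agent: Mozilla/5.0' "
    "-H 'Cookie: t=abc; r=1' --compressed"
)
url, headers = curl_to_request_parts(example)
print(url)
print(headers)
```

The resulting `headers` dict has the same shape as the one in the generated script, so it could be passed to `requests.get(url, headers=headers)`.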

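The advantage of this site is that the Python code generated from the cURL command carries your identity information (the cookie and user-agent headers, for example), which can help you get past login problems.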
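The identity information mostly lives in the cookie header. As a small sketch, assuming you want to inspect or reuse individual cookies, the standard library can split that header into name/value pairs. `cookie_header_to_dict` is a made-up helper name and the cookie values below are placeholders, not a real session.

```python
from http.cookies import SimpleCookie


def cookie_header_to_dict(cookie_header):
    """Turn a raw Cookie header string into a {name: value} dict."""
    jar = SimpleCookie()
    jar.load(cookie_header)
    return {name: morsel.value for name, morsel in jar.items()}


# Placeholder values for illustration only.
cookies = cookie_header_to_dict("t=0123456789abcdef; r=42")
print(cookies)
```

A dict like this can be handed to requests through the `cookies=` parameter of `requests.get` instead of being kept inside the headers.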

# -*- coding: utf-8 -*-
import requests
from os import path
import time

for i, line in enumerate(open("address.txt")):
    filename = str(i) + ".html"  # name of the file to save

    # Skip this URL if the file has already been saved
    if path.exists(filename):
        continue

    headers = {
        'authority': 'zidian.911cha.com',
        'pragma': 'no-cache',
        'cache-control': 'no-cache',
        'upgrade-insecure-requests': '1',
        'dnt': '1',
        'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_2) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.4 Safari/605.1.15',
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'sec-fetch-site': 'same-origin',
        'sec-fetch-mode': 'navigate',
        'sec-fetch-dest': 'document',
        'referer': 'https://zidian.911cha.com/zi7684.html',
        'accept-language': 'en,ja;q=0.9,zh-CN;q=0.8,zh;q=0.7',
        'cookie': 't=d910e897351658b915c67413eb1d4c2f; r=9132',
    }

    # Strip the trailing newline from the line before using it as the URL
    response = requests.get(line.strip(), headers=headers)

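    # Print the index, the URL (surrounding whitespace and newline stripped),
    # the HTTP status code, and the length of the response body
    print(i, line.strip(), response.status_code, len(response.text))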

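    # The server sometimes returns an empty page, so only save the file when
    # the status is 200 and the body is longer than 1000 characters
    if len(response.text) > 1000 and response.status_code == 200:
        with open(filename, "w", encoding="utf-8") as f:
            f.write(response.text)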

    # Wait 5 seconds before the next request
    time.sleep(5)
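
The skip-if-exists check is what lets the script be stopped and restarted without downloading everything again. As a sketch, that logic could be pulled into a small helper that lists only the URLs still missing. `pending_downloads` and `output_dir` are names I made up, and the demo uses a temporary directory with example.com URLs rather than the real address.txt.

```python
import os
import tempfile


def pending_downloads(address_file, output_dir="."):
    """Yield (index, url, filename) for URLs that have no saved file yet."""
    with open(address_file, encoding="utf-8") as f:
        for i, line in enumerate(f):
            url = line.strip()
            if not url:
                continue  # blank lines carry no URL
            filename = os.path.join(output_dir, str(i) + ".html")
            if os.path.exists(filename):
                continue  # already saved by an earlier run
            yield i, url, filename


# Self-contained demonstration in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    address_file = os.path.join(tmp, "address.txt")
    with open(address_file, "w", encoding="utf-8") as f:
        f.write("https://example.com/a.html\n\nhttps://example.com/b.html\n")
    # Pretend the first page was saved by a previous run.
    with open(os.path.join(tmp, "0.html"), "w", encoding="utf-8") as f:
        f.write("<html></html>")
    todo = [(i, url) for i, url, _ in pending_downloads(address_file, tmp)]
    print(todo)
```

The file names stay tied to the line number in the address file, same as in the script above, so blank lines are skipped without shifting the numbering.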