How to Scrape a Tencent Docs Sheet with Python and Automatically Update an IP Whitelist

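Hi everyone! Today I want to share a practical trick: using Python to scrape data from a Tencent Docs online document and automatically update it into an Nginx IP whitelist. The need came from a recent project of mine, which had to pull an IP list from Tencent Docs on a schedule and write it into the Nginx configuration. The official Tencent Docs API requires registering as an enterprise developer, so that route was closed to me, and I decided to write a Python scraper instead.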

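1. Data preparation

Before we start, we need two key pieces of data:

localPadId: the unique identifier of each Tencent Docs document; it has to be obtained manually.
Cookie: open the Tencent Docs document, then copy the Cookie value from the browser's developer tools.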
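
Since the Cookie is a login credential, one option is to keep it out of the script and read all of these inputs from environment variables. The snippet below is only a minimal sketch of that idea: the variable names TX_DOC_URL, TX_LOCAL_PAD_ID and TX_COOKIE, as well as the DocSettings and load_settings names, are hypothetical and not part of the original script.

```python
import os
from typing import Mapping, NamedTuple


class DocSettings(NamedTuple):
    document_url: str
    local_pad_id: str
    cookie: str


# Hypothetical variable names; pick whatever fits your deployment.
REQUIRED_KEYS = ("TX_DOC_URL", "TX_LOCAL_PAD_ID", "TX_COOKIE")


def load_settings(env: Mapping[str, str] = os.environ) -> DocSettings:
    """Read the document URL, localPadId and Cookie from environment variables."""
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        # Fail early with a clear message instead of sending empty values.
        raise KeyError("missing environment variables: " + ", ".join(missing))
    return DocSettings(env["TX_DOC_URL"], env["TX_LOCAL_PAD_ID"], env["TX_COOKIE"])
```

The resulting values could then be passed to the class constructor shown in the next section in place of the hard-coded strings.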

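2. Code implementation

Next, let's look at the complete Python code. The script does three main things:

Get the current user's information.
Create a task that exports the document as an Excel file.
Download the generated Excel file.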

# -*- coding: UTF-8 -*-
"""
@File      :ip.py
@author    :yufelix
@date      :2023/7/31 14:42
@description :
"""
import json
import os
import re
import time
import requests
from bs4 import BeautifulSoup

class TengXunDocument():

    def __init__(self, document_url, local_pad_id, cookie_value):
        self.document_url = document_url
        self.localPadId = local_pad_id
        self.headers = {
            'content-type': 'application/x-www-form-urlencoded',
            'Cookie': cookie_value
        }

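    def get_now_user_index(self):
        # Fetch the document page; the current user's index is embedded in an inline script.
        response_body = requests.get(url=self.document_url, headers=self.headers, verify=False)
        parser = BeautifulSoup(response_body.content, 'html.parser')
        global_multi_user_list = re.findall(re.compile('window.global_multi_user=(.*?);'), str(parser))
        if global_multi_user_list:
            user_dict = json.loads(global_multi_user_list[0])
            return user_dict['nowUserIndex']
        # The export URLs are built from this value, so fail loudly instead of returning a message string.
        raise RuntimeError('Cookie expired; please provide a fresh one')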

    def export_excel_task(self, export_excel_url):
        body = {
            'docId': self.localPadId, 'version': '2'
        }
        res = requests.post(url=export_excel_url, headers=self.headers, data=body, verify=False)
        operation_id = res.json()['operationId']
        return operation_id

    def download_excel(self, check_progress_url, file_name):
        start_time = time.time()
        file_url = ''
        # Poll the export task until it reports 100% or 30 seconds have passed.
        while True:
            res = requests.get(url=check_progress_url, headers=self.headers, verify=False)
            progress = res.json()['progress']
            if progress == 100:
                file_url = res.json()['file_url']
                break
            elif time.time() - start_time > 30:
                print("Timed out waiting for the export; please investigate")
                break
            time.sleep(1)  # pause between polls instead of hammering the endpoint
        if file_url:
            self.headers['content-type'] = 'application/octet-stream'
            res = requests.get(url=file_url, headers=self.headers, verify=False)
            with open(file_name, 'wb') as f:
                f.write(res.content)
            print('Download succeeded, file name: ' + file_name)
        else:
            print("Could not get the download URL; the Excel file was not downloaded")

if __name__ == '__main__':
    document_url = 'https://docs.qq.com/sheet/DSnhHWFRraGRzSXhC'
    local_pad_id = '300000000$JxGXTkhdsIxB'
    cookie_value = '****'
    tx = TengXunDocument(document_url, local_pad_id, cookie_value)
    now_user_index = tx.get_now_user_index()
    export_excel_url = f'https://docs.qq.com/v1/export/export_office?u={now_user_index}'
    operation_id = tx.export_excel_task(export_excel_url)
    check_progress_url = f'https://docs.qq.com/v1/export/query_progress?u={now_user_index}&operationId={operation_id}'
    output_path = '/home/ip/'
    file_name = 'IP.xlsx'
    output_file_path = os.path.join(output_path, file_name)
    tx.download_excel(check_progress_url, output_file_path)

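3. Code walkthrough

Initialization: we first set up the Tencent Docs URL, the localPadId and the Cookie.
Getting user information: the script parses the page content to find the current user's index.
Creating the export task: it sends a request to Tencent Docs to create a task that exports an Excel file.
Downloading the file: it checks the task's progress and, once the task is complete, downloads the generated Excel file.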
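
The user-information step relies on a non-greedy regular expression that stops at the first semicolon, which would cut the JSON short if the embedded data ever contained a ";". As a hedged alternative, the sketch below locates the same window.global_multi_user marker and lets json.JSONDecoder().raw_decode read exactly one JSON value from that position. The function name extract_now_user_index is hypothetical, and the assumption that the page embeds a plain JSON object after the marker comes from the script above, not from any Tencent documentation.

```python
import json
import re

# Same marker the script searches for, allowing optional spaces around "=".
_MARKER = re.compile(r"window\.global_multi_user\s*=\s*")


def extract_now_user_index(html):
    """Return nowUserIndex from the page source, or None if it cannot be found."""
    match = _MARKER.search(html)
    if match is None:
        return None
    try:
        # raw_decode parses one JSON value starting at the given index
        # and ignores whatever text follows it.
        user_info, _end = json.JSONDecoder().raw_decode(html, match.end())
    except json.JSONDecodeError:
        return None
    if not isinstance(user_info, dict):
        return None
    return user_info.get("nowUserIndex")
```

A None result here plays the same role as the "Cookie expired" branch in the script: the marker was not in the page, most likely because the request was not authenticated.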

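4. Scheduled task

To automate the process, we can create a scheduled task (for example, a cron job) that calls this Python script. That way the script runs on its own at a fixed interval and keeps the IP whitelist up to date.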
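
The article does not say which scheduler was used. On Linux a cron entry is the usual choice; if you would rather stay inside Python, a small loop can do the same job. The sketch below is one possible version: run_every and its parameters are hypothetical names, and the cron line in the comment uses a placeholder path.

```python
import time
import traceback

# Rough cron equivalent (placeholder path, runs every 30 minutes):
#   */30 * * * * python3 /path/to/ip.py


def run_every(interval_seconds, job, max_runs=None, sleep=time.sleep):
    """Call job() repeatedly, waiting interval_seconds between runs.

    max_runs=None loops forever; the sleep parameter exists so the
    loop can be exercised without really waiting.
    """
    runs = 0
    while max_runs is None or runs < max_runs:
        try:
            job()
        except Exception:
            # One failed run (expired Cookie, network error) should not stop the schedule.
            traceback.print_exc()
        runs += 1
        if max_runs is None or runs < max_runs:
            sleep(interval_seconds)
    return runs
```

To use it, you could wrap the body of the `if __name__ == '__main__':` block in a function and pass that function as job, for example run_every(1800, main) if the function were called main.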

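5. Summary

With this simple Python script, we can easily pull data from Tencent Docs and automatically update it into the Nginx IP whitelist. The approach saves the time spent on manual updates and makes the system more automated.

I hope this article helps! If you have any questions or suggestions, feel free to leave a comment. Let's learn and improve together!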
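
The script above stops at downloading the Excel file, so the last step, turning the IP list into Nginx rules, still has to be written. The sketch below shows one way it might look, assuming the cell values have already been read from the spreadsheet into a list of strings (for instance with a library such as openpyxl, which the original script does not use). The function name render_allow_list is hypothetical; the `allow` and `deny all;` lines follow Nginx's standard access-control directives.

```python
import ipaddress


def render_allow_list(entries):
    """Turn raw cell values into Nginx `allow` lines followed by `deny all;`."""
    lines = []
    seen = set()
    for raw in entries:
        text = str(raw).strip()
        if not text:
            continue
        try:
            # strict=False accepts host addresses written with a prefix, e.g. 192.168.1.7/24.
            network = ipaddress.ip_network(text, strict=False)
        except ValueError:
            continue  # skip anything that is not an IP address or CIDR range
        if network in seen:
            continue  # drop duplicates
        seen.add(network)
        if network.num_addresses == 1:
            lines.append(f"allow {network.network_address};")
        else:
            lines.append(f"allow {network};")
    lines.append("deny all;")
    return "\n".join(lines) + "\n"
```

The returned text could be written to a file that the Nginx configuration pulls in with an include directive, followed by `nginx -t` to check the configuration and `nginx -s reload` to apply it. Note that an empty or unreadable sheet produces only `deny all;`, so it is worth checking that at least one `allow` line was generated before reloading.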

Yufelix