Python Web Scraping Notes 1-3: Scraping AJAX-Loaded Pages

Published On 2019/08/11 Sunday, Singapore

This note uses the list of people that Joyenjoye follows as an example to show how to scrape pages whose data is loaded by AJAX or JavaScript.

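A web page's source generally consists of HTML, CSS, and JavaScript. When the data is written directly in the HTML file, the approach from Python Web Scraping Notes 1-2 works: request the page URL to get the HTML, then parse it to extract the data. In many cases, however, the data is not in the HTML itself but is loaded afterwards by JavaScript. Requesting the page URL then no longer returns the data; you first have to find the real URL that serves the data and request that instead.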

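How can you tell whether data is loaded by JavaScript? In Chrome, disable JavaScript in the site settings for that website and check whether the data still shows up on the page. If it does, the data is written directly in the HTML file. If it does not, the data is loaded by JavaScript.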
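The same check can be approximated in code: a plain HTTP request never runs JavaScript, so if text you can see in the browser is missing from the raw response body, it was probably loaded afterwards. The following is a minimal sketch in which two made-up HTML strings stand in for real responses; the function name `is_in_static_html` and the sample markup are illustrative assumptions, not part of the original notes.

```python
def is_in_static_html(html, marker):
    """Return True if the marker text occurs in the raw HTML (no JavaScript executed)."""
    return marker in html


# Hypothetical page whose data is written directly in the HTML.
static_page = "<html><body><div class='name'>Alice</div></body></html>"

# Hypothetical page that only ships an empty container plus a script.
js_page = "<html><body><div id='root'></div><script src='app.js'></script></body></html>"

print(is_in_static_html(static_page, "Alice"))  # True
print(is_in_static_html(js_page, "Alice"))      # False
```

With a real page, `html` would be the `response.text` of a request to the page URL, and `marker` a value you can see in the rendered page.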

Finding the Real Request

Once we know the data is loaded by JavaScript, we use Chrome DevTools to analyze the page's requests and find the real request that returns the data.

Right-click on the page and choose Inspect
Open the Network panel
Reload the page
Select the XHR filter

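Now let's analyze the real request behind the list of people Joyenjoye follows. After following the steps above, we see what is shown in the figure below: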

The Preview tab shows exactly the data displayed on the page. Now click the Headers tab to get the real request URL.

At this point we have found the real request URL for the people Joyenjoye follows:

https://www.zhihu.com/api/v4/members/joye-lee-29/followees?include=data%5B*%5D.answer_count%2Carticles_count%2Cgender%2Cfollower_count%2Cis_followed%2Cis_following%2Cbadge%5B%3F(type%3Dbest_answerer)%5D.topics&offset=20&limit=20
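The percent-encoded query string is hard to read as is. As a small sketch, the standard library's `urllib.parse` can split the URL above into its path and decoded query parameters; the variable names here are only illustrative.

```python
from urllib.parse import urlsplit, parse_qsl

# The real request URL found above, split over several lines for readability.
url = (
    "https://www.zhihu.com/api/v4/members/joye-lee-29/followees"
    "?include=data%5B*%5D.answer_count%2Carticles_count%2Cgender%2Cfollower_count"
    "%2Cis_followed%2Cis_following%2Cbadge%5B%3F(type%3Dbest_answerer)%5D.topics"
    "&offset=20&limit=20"
)

parts = urlsplit(url)
params = dict(parse_qsl(parts.query))  # percent-decoding happens here

print(parts.path)  # /api/v4/members/joye-lee-29/followees
for name, value in params.items():
    print(name, "=", value)
```

The decoded output makes it easier to see that the request carries three parameters: `include`, `offset`, and `limit`.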

Pagination Parameter

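Usually you need to page through the results to get the complete data, so we need to know which parameter in the real request controls pagination.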

Click through pages 2, 3, and 4, find the real request URL for each page, and compare them:

'https://www.zhihu.com/api/v4/members/joye-lee-29/followees?include=data%5B*%5D.answer_count%2Carticles_count%2Cgender%2Cfollower_count%2Cis_followed%2Cis_following%2Cbadge%5B%3F(type%3Dbest_answerer)%5D.topics&offset=20&limit=20'

'https://www.zhihu.com/api/v4/members/joye-lee-29/followees?include=data%5B*%5D.answer_count%2Carticles_count%2Cgender%2Cfollower_count%2Cis_followed%2Cis_following%2Cbadge%5B%3F(type%3Dbest_answerer)%5D.topics&offset=40&limit=20'

'https://www.zhihu.com/api/v4/members/joye-lee-29/followees?include=data%5B*%5D.answer_count%2Carticles_count%2Cgender%2Cfollower_count%2Cis_followed%2Cis_following%2Cbadge%5B%3F(type%3Dbest_answerer)%5D.topics&offset=60&limit=20'

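Comparing the requests, the only difference is the offset parameter, which takes the values 20, 40, and 60, so offset is the pagination parameter. Another way is to look at the Query String Parameters section under the Headers tab of several real requests and see what differs. The parameter that changes is usually the pagination parameter.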
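Comparing long URLs by eye is error-prone, so the comparison can also be done in code. The sketch below parses each URL's query string and reports only the parameters whose values differ. The helper name `changing_params` is hypothetical, and the sample URLs are shortened versions of the ones above with the long `include` parameter left out.

```python
from urllib.parse import urlsplit, parse_qsl


def changing_params(urls):
    """Return {parameter name: [value per URL]} for parameters that are not constant."""
    parsed = [dict(parse_qsl(urlsplit(u).query)) for u in urls]
    names = set().union(*parsed)  # every parameter name seen in any URL
    result = {}
    for name in sorted(names):
        values = [p.get(name) for p in parsed]
        if len(set(values)) > 1:
            result[name] = values
    return result


# Shortened sample URLs (assumption: the include parameter is omitted for brevity).
base = "https://www.zhihu.com/api/v4/members/joye-lee-29/followees?limit=20&offset={}"
sample_urls = [base.format(offset) for offset in (20, 40, 60)]

print(changing_params(sample_urls))  # {'offset': ['20', '40', '60']}
```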

Adding Request Headers

If you request the URL found above directly with requests, using the code below, the request fails with a "bad request 404" error.

import requests

i = 1  # page index; the offset sent to the server is i * 20
url ='https://www.zhihu.com/api/v4/members/joye-lee-29/followees?include=data\
%5B*%5D.answer_count%2Carticles_count%2Cgender%2Cfollower_count%2Cis_followed\
%2Cis_following%2Cbadge%5B%3F(type%3Dbest_answerer)%5D.topics&offset={}&limit=20'.format(i*20)
response = requests.get(url)  # no custom headers yet
print(response.text)

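This happens because of the site's anti-scraping measures. It can be solved by adding request headers to the request. Request headers carry information about the client browser, the requested page, the server, and so on, and tell the server about the client making the request.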

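The request headers can be found in Chrome under Inspect > Network > Headers > Request Headers. Different sites validate different headers as part of their anti-scraping checks; in testing, Zhihu only required user-agent.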

import requests

headers={
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'
}
i = 1  # page index; the offset sent to the server is i * 20
url ='https://www.zhihu.com/api/v4/members/joye-lee-29/followees?include=data\
%5B*%5D.answer_count%2Carticles_count%2Cgender%2Cfollower_count%2Cis_followed\
%2Cis_following%2Cbadge%5B%3F(type%3Dbest_answerer)%5D.topics&offset={}&limit=20'.format(i*20)
response = requests.get(url,headers = headers)  # send the browser-like user-agent
print(response.text)
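For sites that check more than user-agent, it can be convenient to copy the whole Request Headers section from DevTools and convert it into a dict for the `headers=` argument. Below is a rough sketch of such a converter. The function name `parse_raw_headers` is hypothetical, and the sample text is made up for illustration (only the user-agent value comes from these notes; the referer and the pseudo-headers are assumptions).

```python
def parse_raw_headers(raw):
    """Convert 'name: value' lines copied from DevTools into a dict."""
    headers = {}
    for line in raw.strip().splitlines():
        line = line.strip()
        # Skip blank lines and HTTP/2 pseudo-headers such as ":authority".
        if not line or line.startswith(":"):
            continue
        name, sep, value = line.partition(":")  # split on the first colon only
        if not sep:
            continue  # not a "name: value" line
        headers[name.strip().lower()] = value.strip()
    return headers


# Made-up sample of copied header text.
raw_headers = """
:authority: www.zhihu.com
:method: GET
accept: */*
referer: https://www.zhihu.com/people/joye-lee-29/following
user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36
"""

headers = parse_raw_headers(raw_headers)
for name, value in headers.items():
    print(name, "->", value)
```

Splitting on the first colon only keeps values that themselves contain colons, such as the URL in referer, intact.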

Defining a Function to Scrape Multiple Pages

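After successfully scraping a single page, we define the function get_user_data to scrape multiple pages, adding a delay between requests so that frequent scraping does not put a load on the server. Finally, the data is saved to a CSV file.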

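import os
import time

import pandas as pd
import requests

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'
}

# Adjacent string literals are joined without extra whitespace, so the
# continuation lines can be indented without leaking spaces into the URL.
URL_TEMPLATE = (
    'https://www.zhihu.com/api/v4/members/joye-lee-29/followees?include=data%5B*%5D.answer_count'
    '%2Carticles_count%2Cgender%2Cfollower_count%2Cis_followed%2Cis_following'
    '%2Cbadge%5B%3F(type%3Dbest_answerer)%5D.topics&offset={}&limit=20'
)

user_data = []

def get_user_data(page):
    """Fetch the first `page` pages (20 users each) and append them to user_data."""
    for i in range(page):
        print('Scraping page {}'.format(i + 1))
        url = URL_TEMPLATE.format(i * 20)
        response = requests.get(url, headers=headers)
        data = response.json()  # the API returns JSON
        user_data.extend(data['data'])
        time.sleep(1)  # pause between requests to go easy on the server

get_user_data(7)
df = pd.DataFrame(user_data)
os.makedirs('output', exist_ok=True)  # to_csv does not create missing folders
df.to_csv('output/user_data.csv')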
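The function above needs the number of pages up front. One possible variation, sketched here under the assumption that a request past the last page returns an empty list, keeps requesting pages until nothing comes back. The names `iter_pages` and `fake_fetch` are hypothetical, and the fake fetcher stands in for a real `requests.get(...)` call so that the example runs offline.

```python
import time


def iter_pages(fetch_page, limit=20, max_pages=50, delay=0.0):
    """Yield items page by page until a page comes back empty.

    fetch_page(offset, limit) must return a list of items.
    """
    for page_index in range(max_pages):
        items = fetch_page(page_index * limit, limit)
        if not items:
            break  # assumption: an empty page means there is no more data
        yield from items
        time.sleep(delay)  # pause between requests


# Stand-in for the real API: 45 made-up users served in slices.
FAKE_USERS = [{"name": "user{}".format(n)} for n in range(45)]


def fake_fetch(offset, limit):
    return FAKE_USERS[offset:offset + limit]


users = list(iter_pages(fake_fetch))
print(len(users))  # 45
```

In real use, `fetch_page` could wrap the request from the code above and return the `data` list of the JSON response, with `delay` set to a second or so.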
