9. Web Scraping in Practice Series: Scraping the Entire Zhihu Hot List and Building Word Clouds

Disclaimer: this post is only a simple scraping demonstration and is not intended for any commercial use.
1. Introduction
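Today is both National Day and the Mid-Autumn Festival, but as a tech geek I am spending it in my dorm room as usual. In the afternoon I opened Zhihu out of habit and found that "Jiang Ziya" (《姜子牙》) had climbed to first place on the Zhihu hot list, and I had been thinking about seeing this Chinese animated film anyway. Since I had no idea how it was being received, I decided to visualize the opinions on "Jiang Ziya" with a scraper plus a word cloud and then settle once and for all whether to go and watch it. While I was at it, I also scraped the other questions on the hot list together with all of their answers. What follows is a detailed record of scraping the entire hot list.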
2. The Scraping Process
2.1 Getting the answer-page link for every question
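First, open the Zhihu hot list page (shown in the screenshot below). The hot list contains 50 questions in total, and every answer to each of these questions is a target for our scraper.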
Right-click any question and choose Inspect, and you can see that all of the elements are contained in a single ... block.
Expanding one of these elements reveals the question and the link it points to, which is exactly the link we need.
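The link-collection step can be sketched roughly as follows. This is a minimal illustration rather than the code used for this post: it assumes that question links have the form `https://www.zhihu.com/question/<id>`, and the class name `QuestionLinkParser`, the helper `extract_question_links`, and the sample HTML fragment (including the second question ID) are all made up for the example. It uses only the Python standard library and does not rely on any particular container tag or class name.

```python
import re
from html.parser import HTMLParser

# Assumed shape of a question link; not taken from the page source itself.
QUESTION_LINK = re.compile(r"^https?://www\.zhihu\.com/question/(\d+)")


class QuestionLinkParser(HTMLParser):
    """Collect (question_id, url) pairs from <a href="..."> tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._seen = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        match = QUESTION_LINK.match(href)
        # Keep each question only once, in order of first appearance.
        if match and match.group(1) not in self._seen:
            self._seen.add(match.group(1))
            self.links.append((match.group(1), match.group(0)))


def extract_question_links(html):
    """Return the unique question links found in an HTML string."""
    parser = QuestionLinkParser()
    parser.feed(html)
    return parser.links


# Made-up fragment standing in for the hot list markup.
sample_html = """
<div>
  <a href="https://www.zhihu.com/question/337873977" title="Q1">Q1</a>
  <a href="https://www.zhihu.com/question/337873977">duplicate</a>
  <a href="https://www.zhihu.com/question/123456">Q2</a>
  <a href="https://www.zhihu.com/people/someone">not a question</a>
</div>
"""
hot_links = extract_question_links(sample_html)
print(hot_links)
```

Matching on the link pattern instead of on a specific element keeps the sketch independent of the page layout, at the cost of also picking up any question links that appear outside the hot list.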
2.2 Getting all the answers on a single question page
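With the links for all hot list questions in hand, the next problem is how to scrape every answer on a single page. Opening the "Jiang Ziya" link brings up the page for that question.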
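Note that the page does not display all of its answers at once. New answers appear only after the scroll bar reaches the bottom, which means the page uses Ajax dynamic loading. So how do we deal with that? In the browser's developer tools I filtered the requests by type XHR and, sure enough, the answer data showed up there in JSON format.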
I scrolled down a few more times and captured the following links:
https://www.zhihu.com/api/v4/questions/337873977/answers?include=data%5B%2A%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Cis_labeled%2Cis_recognized%2Cpaid_info%2Cpaid_info_content%3Bdata%5B%2A%5D.mark_infos%5B%2A%5D.url%3Bdata%5B%2A%5D.author.follower_count%2Cbadge%5B%2A%5D.topics&limit=5&offset=5&platform=desktop&sort_by=default
https://www.zhihu.com/api/v4/questions/337873977/answers?include=data%5B%2A%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Cis_labeled%2Cis_recognized%2Cpaid_info%2Cpaid_info_content%3Bdata%5B%2A%5D.mark_infos%5B%2A%5D.url%3Bdata%5B%2A%5D.author.follower_count%2Cbadge%5B%2A%5D.topics&limit=5&offset=10&platform=desktop&sort_by=default
https://www.zhihu.com/api/v4/questions/337873977/answers?include=data%5B%2A%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Cis_labeled%2Cis_recognized%2Cpaid_info%2Cpaid_info_content%3Bdata%5B%2A%5D.mark_infos%5B%2A%5D.url%3Bdata%5B%2A%5D.author.follower_count%2Cbadge%5B%2A%5D.topics&limit=5&offset=15&platform=desktop&sort_by=default
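The captured links differ only in their `offset` value (5, 10, 15) while `limit` stays at 5, which suggests that paging through the answers amounts to stepping `offset` by `limit`. Below is a minimal sketch of that idea, not a definitive implementation. The helper names `build_answers_url` and `fetch_all_answers` are hypothetical, the `include` value is shortened to two fields from the captured list for readability, and the assumptions that the first page sits at `offset=0`, that the JSON response holds its answers in a `data` list, and that an empty list marks the end are mine rather than something shown above.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API = "https://www.zhihu.com/api/v4/questions/{qid}/answers"


def build_answers_url(question_id, offset, limit=5,
                      include="data[*].content,voteup_count"):
    """Build one page URL; only `offset` changes between pages."""
    query = urlencode({
        "include": include,  # shortened field list (assumption)
        "limit": limit,
        "offset": offset,
        "platform": "desktop",
        "sort_by": "default",
    })
    return API.format(qid=question_id) + "?" + query


def fetch_all_answers(question_id, limit=5, max_pages=1000):
    """Request pages until one comes back empty (assumed stop condition)."""
    answers = []
    for page in range(max_pages):
        url = build_answers_url(question_id, page * limit, limit)
        request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
        with urlopen(request, timeout=10) as response:
            payload = json.load(response)
        batch = payload.get("data", [])  # "data" key is an assumption
        if not batch:
            break
        answers.extend(batch)
    return answers


# Rebuild the three captured offsets without touching the network.
urls = [build_answers_url(337873977, offset) for offset in (5, 10, 15)]
for url in urls:
    print(url)
```

Only `build_answers_url` runs in this example. `fetch_all_answers` performs real HTTP requests, so in practice it would likely need a delay between requests and possibly extra headers or cookies, depending on how the site responds.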