Crawler in Action Series 9: Scraping the Entire Zhihu Hot List and Making a Word Cloud (Part 2)


Looking at the links above, we can see that the only part that changes is the offset field, and it increases in steps of 5 (matching limit=5). So, to get all the answers to a question, we only need to vary this field in the link. In addition, I opened a few other questions and obtained the following links:
https://www.zhihu.com/api/v4/questions/337873977/answers?include=data%5B%2A%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Cis_labeled%2Cis_recognized%2Cpaid_info%2Cpaid_info_content%3Bdata%5B%2A%5D.mark_infos%5B%2A%5D.url%3Bdata%5B%2A%5D.author.follower_count%2Cbadge%5B%2A%5D.topics&limit=5&offset=5&platform=desktop&sort_by=default
https://www.zhihu.com/api/v4/questions/423719681/answers?include=data%5B%2A%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Cis_labeled%2Cis_recognized%2Cpaid_info%2Cpaid_info_content%3Bdata%5B%2A%5D.mark_infos%5B%2A%5D.url%3Bdata%5B%2A%5D.author.follower_count%2Cbadge%5B%2A%5D.topics&limit=5&offset=5&platform=desktop&sort_by=default
https://www.zhihu.com/api/v4/questions/423737325/answers?include=data%5B%2A%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Cis_labeled%2Cis_recognized%2Cpaid_info%2Cpaid_info_content%3Bdata%5B%2A%5D.mark_infos%5B%2A%5D.url%3Bdata%5B%2A%5D.author.follower_count%2Cbadge%5B%2A%5D.topics&limit=5&offset=5&platform=desktop&sort_by=default
Comparing these, the answer links of different questions differ only in the question ID and in each question's own answer count (which bounds how far the offset runs). So, when entering each question's answer page, we only need to grab the question ID and the total number of answers, and we can then construct the links to the JSON data containing all of the answers, as sketched below.
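Here is a minimal sketch of that URL construction, assuming the question ID and answer count have already been obtained; the long include parameter is abbreviated to a few fields for readability, the full value being the one shown in the links above:

```python
def build_answer_urls(question_id, answer_count, limit=5):
    """Yield one answers-API URL per page, stepping offset by `limit`."""
    # The include parameter is truncated here; the full value is the one
    # visible in the example links above.
    base = ("https://www.zhihu.com/api/v4/questions/{qid}/answers"
            "?include=data%5B%2A%5D.is_normal%2Ccontent%2Cvoteup_count"
            "&limit={limit}&offset={offset}&platform=desktop&sort_by=default")
    for offset in range(0, answer_count, limit):
        yield base.format(qid=question_id, limit=limit, offset=offset)

# Example: page URLs for question 337873977 (an ID taken from the links above),
# pretending it has 15 answers, i.e. three pages of 5.
for url in build_answer_urls(337873977, 15):
    print(url)
```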
Note: I will not go into the details of how the authors and their answers are extracted from the JSON data; a rough sketch is given below.
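This is a minimal sketch of that extraction step, assuming the response layout implied by the include parameter above (data[*].author.name and data[*].content); treat these field names as assumptions to verify against a real response:

```python
import requests

HEADERS = {"User-Agent": "Mozilla/5.0"}  # Zhihu rejects requests without a UA

def parse_answers(url):
    """Yield (author, content) pairs from one page of the answers API."""
    data = requests.get(url, headers=HEADERS).json()
    for answer in data.get("data", []):
        author = answer["author"]["name"]  # assumed field name
        content = answer["content"]        # raw HTML of the answer body
        yield author, content
```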
2.3 Saving the Crawl Results
During the crawl, since the first step is to obtain the link for each hot-list question, I saved every question together with the link of its answer page as a CSV file with the following fields (a saving sketch follows the table):
Field 1: title (the question)
Field 2: url (the question's answer page)
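A minimal sketch of writing that index file with the standard csv module; the file name hot_list.csv and the shape of `questions` are my assumptions:

```python
import csv

def save_question_index(questions, path="hot_list.csv"):
    """Save (title, url) pairs; utf-8-sig keeps Chinese readable in Excel."""
    with open(path, "w", newline="", encoding="utf-8-sig") as f:
        writer = csv.writer(f)
        writer.writerow(["title", "url"])
        writer.writerows(questions)
```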

In addition, the answers to each question are saved in a separate CSV file, and each CSV file contains the following fields (see the sketch after the table):
Field 1: (the answerer)
Field 2: (the answer content; only the Chinese text was kept)
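Here is a minimal sketch of the per-question file, including the "Chinese text only" filtering mentioned above; the column names author and content and the file-naming scheme are illustrative, and the regex keeps only CJK characters, dropping HTML tags, punctuation, and English:

```python
import csv
import re

def keep_chinese(text):
    """Keep only Chinese characters, as described for the content field."""
    return "".join(re.findall(r"[\u4e00-\u9fa5]+", text))

def save_answers(question_id, answers, path_tpl="answers_{}.csv"):
    """Save one CSV per question; `answers` is an iterable of (author, content)."""
    with open(path_tpl.format(question_id), "w", newline="",
              encoding="utf-8-sig") as f:
        writer = csv.writer(f)
        writer.writerow(["author", "content"])  # illustrative column names
        for author, content in answers:
            writer.writerow([author, keep_chinese(content)])
```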
2.4 Summary of the Full Workflow
To sum up, the whole crawling process should now be clear: first obtain the links of all hot-list questions (grabbing each question ID along the way), then enter each question's page to get its answer count, then construct the paginated links and crawl the answers, and finally save the answers in CSV format, i.e.:
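As a sketch tying the steps together with the helpers above; get_hot_list() is a hypothetical stand-in for the hot-list crawl covered in the earlier part, assumed to return (title, url, question_id, answer_count) tuples:

```python
def main():
    questions = get_hot_list()  # step 1: hot-list questions (hypothetical helper)
    save_question_index([(t, u) for t, u, _, _ in questions])
    for _, _, qid, count in questions:
        rows = []
        for page_url in build_answer_urls(qid, count):  # steps 2-3: paginate and parse
            rows.extend(parse_answers(page_url))
        save_answers(qid, rows)                         # step 4: one CSV per question

if __name__ == "__main__":
    main()
```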