Python Crawler: Sogou (WeChat, Zhihu) Official Account Content

Time: 2022-04-06 05:11:49

Sogou WeChat official account search link: /weixin?query=神州十二号&type=2&page=2&ie=utf8&p=01030402&dp=1

Login is required. Once logged in, you can view up to 100 pages of results.
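
The query string in the link above can also be assembled in code, which helps when looping over result pages. This is a minimal sketch, not the author's code: `BASE` is a hypothetical placeholder because the link is shown without its host, `build_search_url` is an illustrative name, and the `p` and `dp` parameters are left out because their role is not explained here (whether the site needs them is an open assumption).

```python
from urllib.parse import urlencode

# Hypothetical placeholder: the link in the text is shown without its host.
BASE = "https://sogou-weixin.example"

def build_search_url(keyword, page=1):
    """Build a search URL shaped like the link above for one results page."""
    params = {"query": keyword, "type": 2, "page": page, "ie": "utf8"}
    # urlencode percent-encodes the Chinese keyword as UTF-8.
    return BASE + "/weixin?" + urlencode(params)

print(build_search_url("神州十二号", page=2))
```

Calling it in a loop such as `for page in range(1, 101)` would cover the 100 pages that are visible after login.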

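Press F12 to open the developer tools, and you can see the jump URL of each article:

This part is simple: XPath is enough to pull the links out. Here is the code: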

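import re

import requests
from lxml import etree

# Suppress the warning that verify=False would otherwise print.
requests.packages.urllib3.disable_warnings()

# Placeholder: the post does not show its headers; fill in your own values.
headers = {"User-Agent": "<your User-Agent>", "Cookie": "<your Cookie>"}

# verify=False is optional. My machine runs a packet-capture tool that swaps the
# certificate, so I add it to avoid SSL errors.
# The URL is shown here without its host, exactly as published.
response1 = requests.get(
    url="/weixin?query=神州十二号&type=2&page=2&ie=utf8&p=01030402&dp=1",
    headers=headers,
    verify=False,
)
response1.encoding = "utf-8"
# print(response1.text)
ele = etree.HTML(response1.text)
href = ele.xpath('//h3/a/@href')  # href of every result title link
print(href)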

The printed href is a list of 10 items:

['/link?url=dn9a_-gY295K0Rci_xozVXfdMkSQTLW6cwJThYulHEtVjXrGTiVgS7jybhXBZgWK-TO4QeI3eJOt0w-TiaSdK1qXa8Fplpd9ZkRdRACIBsvWR3m8CGaK2hpmgNYDATnSci0zijFAMUzxHWSCGmy8LmrCwIcbY7JZ81YPLE3T9SY1XWcMM0Z-xwbUBfQW-Sko4pXr4oEISty62KNgX8FinBQfGIevAlqbPJa_2sCJUOaOtWKk74_ZW2GC0k7R3ZR0AyspM5D1JltOJBjQH7pCxQ..&type=2&query=%E7%A5%9E%E5%B7%9E%E5%8D%81%E4%BA%8C%E5%8F%B7&token=C245CF0365AEC8BBCFCB04FB0B5F9927CF6639FC61245B46', '/link?url=dn9a_-gY295K0Rci_xozVXfdMkSQTLW6cwJThYulHEtVjXrGTiVgS7jybhXBZgWK-TO4QeI3eJOt0w-TiaSdK1qXa8Fplpd9c3AZohIBuieRv56nLOUpnm7QBavRX08wm1NgguMFbDCkoZo9pGHPXkl9qVrYKSf8g2uXEb8863BT2zt33p4qT5MmHddraHQB-P_cLGIkpmoTKUIT4yG94IfadQKd10RHbCWTKWuiznjhYahYpvUls_rRasC8ihpCUo2A-DLUWPSTJmCU1UgHwQ..&type=2&query=%E7%A5%9E%E5%B7%9E%E5%8D%81%E4%BA%8C%E5%8F%B7&token=C245CF0365AEC8BBCFCB04FB0B5F9927CF6639FC61245B46', '/link?url=dn9a_-gY295K0Rci_xozVXfdMkSQTLW6cwJThYulHEtVjXrGTiVgS7jybhXBZgWK-TO4QeI3eJOt0w-TiaSdK1qXa8Fplpd95ToAckDjxv1_SOXcuyrBuK6zLBsFtXPgpxRWxf0zxeNAcuLn_J9AUVvhilXhHSJ8ip_9wYV2hoByGMY5tCQvDyA-I3fohKnFLYfw_vqoGdhyvPv4-c6BdapOhitqYdh1E1uwxYFRpLx0ZCQSaEWKqhAhLHjIjUfO_CUpFPrGAHh535pjGOOjYg..&type=2&query=%E7%A5%9E%E5%B7%9E%E5%8D%81%E4%BA%8C%E5%8F%B7&token=C245CF0365AEC8BBCFCB04FB0B5F9927CF6639FC61245B46', '/link?url=dn9a_-gY295K0Rci_xozVXfdMkSQTLW6cwJThYulHEtVjXrGTiVgS7jybhXBZgWK-TO4QeI3eJOt0w-TiaSdK1qXa8Fplpd9UutRYWkJzwRVcqxaSrQ-G8KiHHOjnOX7tL8hEEk0bzmz6jUlZnEaH0hWZ1T-w8D0o-S-mauX2kfGUH8Or7tjSgj2tibhqKkDXXkgvzoTCWgnkG40n12ORVxZsbpmD49JnZBUiXeIzFZ7GKGjAAeLp0ubOAX1Sntjx6h0tQkLf8xAe0f8bRARvQ..&type=2&query=%E7%A5%9E%E5%B7%9E%E5%8D%81%E4%BA%8C%E5%8F%B7&token=C245CF0365AEC8BBCFCB04FB0B5F9927CF6639FC61245B46', 
'/link?url=dn9a_-gY295K0Rci_xozVXfdMkSQTLW6cwJThYulHEtVjXrGTiVgS7jybhXBZgWK-TO4QeI3eJOt0w-TiaSdK1qXa8Fplpd91cAXHByGcF2kCySAnR4QNigPStA5D5mbo6wVnAT1ZpQy-7CLXJXvMPZdEF7YTDKuZSREgItNDBFe1wieC8kRokQDMpT99Bi4o209qP8hQ2GtnrSkaqOpTWztYszBGR_HVnOlH98hDHVvgDbv3xuJELIDGPd1QnJ9iR8rqiw_8Sd6VKrzu_4XKA..&type=2&query=%E7%A5%9E%E5%B7%9E%E5%8D%81%E4%BA%8C%E5%8F%B7&token=C245CF0365AEC8BBCFCB04FB0B5F9927CF6639FC61245B46', '/link?url=dn9a_-gY295K0Rci_xozVXfdMkSQTLW6cwJThYulHEtVjXrGTiVgS7jybhXBZgWK-TO4QeI3eJOt0w-TiaSdK1qXa8Fplpd9rawxFTMyfZjY6-SxRegfcpyZETPEAXqvpY9SE2T8nWpL3NvElhuhCiqmpBoWFtnosRwSbFPe_9khQ_fr_6kesNqwPUt67yncghRwNaVEs4N8q-VRFSWZTZkNMYfWskzVW3tfnFQ56H0CyA1MasOCjfTssWyXPTPdwmOjQwSvpCfvzHZXxeLdyg..&type=2&query=%E7%A5%9E%E5%B7%9E%E5%8D%81%E4%BA%8C%E5%8F%B7&token=C245CF0365AEC8BBCFCB04FB0B5F9927CF6639FC61245B46', '/link?url=dn9a_-gY295K0Rci_xozVXfdMkSQTLW6cwJThYulHEtVjXrGTiVgS7jybhXBZgWK-TO4QeI3eJOt0w-TiaSdK1qXa8Fplpd96B-zMEgp66EWlaAMLX9F9L7w4taujSdtzxjrmtTyCrd3ZgeAIGGANnCZbgTmvGjKM_9hCh9v0vyYyhSxW5z6rbr1UdrKWSEgAuytBZRM9dyXzz-YSKTES4XZ1TY0vzQki43IpNPIKL77BOkfl91ZDmrIvYoFr35AMl5N_xxQbvJQ_LeJW-Rhtg..&type=2&query=%E7%A5%9E%E5%B7%9E%E5%8D%81%E4%BA%8C%E5%8F%B7&token=C245CF0365AEC8BBCFCB04FB0B5F9927CF6639FC61245B46', '/link?url=dn9a_-gY295K0Rci_xozVXfdMkSQTLW6cwJThYulHEtVjXrGTiVgS7jybhXBZgWK-TO4QeI3eJOt0w-TiaSdK1qXa8Fplpd9PZ-WxpMbY8sRDVAef4aim2r8FD0iXQlqzywMZA_q-f1IEUE5vyP_G-9FIo3uC0iWPyC3l0e3drBpdyAbW3v6Ni9voVsY5HdaZEMn8lRXx3VrmqiHPpPuplQFnEtHzLrpVH37lwa4iIL7FVTb0psWHZ53NdZAFEERDw5TCxuncXL2CsG6-xMgzQ..&type=2&query=%E7%A5%9E%E5%B7%9E%E5%8D%81%E4%BA%8C%E5%8F%B7&token=C245CF0365AEC8BBCFCB04FB0B5F9927CF6639FC61245B46', 
'/link?url=dn9a_-gY295K0Rci_xozVXfdMkSQTLW6cwJThYulHEtVjXrGTiVgS7jybhXBZgWK-TO4QeI3eJOt0w-TiaSdK1qXa8Fplpd9ergUlF6_3mNbqgydXzx5m9KoITlcHnYDECaJoxXaBvQx13j55Li-ginBtWxYiPAVknlqPq8ICP-4muuB4bJeX9CDAW1gjnwgDHbWzI4m-22QvRsKTy3-c7FZvrAFcfYBZXo2dMdkE-_0LpRqw4MOS2BsoHf__1zbcDL7GQkGQ6tQ_LeJW-Rhtg..&type=2&query=%E7%A5%9E%E5%B7%9E%E5%8D%81%E4%BA%8C%E5%8F%B7&token=C245CF0365AEC8BBCFCB04FB0B5F9927CF6639FC61245B46', '/link?url=dn9a_-gY295K0Rci_xozVXfdMkSQTLW6cwJThYulHEtVjXrGTiVgS7jybhXBZgWK-TO4QeI3eJOt0w-TiaSdK1qXa8Fplpd98a3ms5XOO3AZ8NnRcrx0q2ApW8Ngy03ma0bVG0mEeQPEsqAQCwMKxc6QPaI67hm7ogehZGkR-lsirhJ1Dxb3ZGT90CNo3h-ej-RrBsKO-dcQBO-Yc1qRVyvmwe1QSNJt_P5awcvQS93x995PaFFlPsx-pbJ6OTJjPVzNO0WGNCyrCDCayYmfpA..&type=2&query=%E7%A5%9E%E5%B7%9E%E5%8D%81%E4%BA%8C%E5%8F%B7&token=C245CF0365AEC8BBCFCB04FB0B5F9927CF6639FC61245B46']
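
The XPath step can be tried offline on a small sample. The sketch below deliberately swaps lxml for the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath, so that it runs without third-party packages. The HTML is a made-up, well-formed stand-in and not real Sogou markup.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed stand-in for a results page (not real site markup).
SAMPLE = """
<html><body>
  <div><h3><a href="/link?url=AAA&amp;type=2">First article</a></h3></div>
  <div><h3><a href="/link?url=BBB&amp;type=2">Second article</a></h3></div>
  <div><p><a href="/other">Not a result</a></p></div>
</body></html>
""".strip()

root = ET.fromstring(SAMPLE)
# './/h3/a' selects every <a> that is a direct child of an <h3>, at any depth.
hrefs = [a.get("href") for a in root.findall(".//h3/a")]
print(hrefs)
```

With lxml, the expression `//h3/a/@href` returns the attribute strings directly, so the list comprehension is not needed. lxml's HTML parser also tolerates the malformed markup that real pages tend to contain, which ElementTree does not.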

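The URL list is retrieved, but the links are incomplete. Look at where each result actually jumps to:

The real jump link is /link… with the domain prepended.

So simply concatenate the domain with each link in the list.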
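
Plain string concatenation works here. As an alternative sketch, `urllib.parse.urljoin` does the same job and also copes with a base URL that ends in a slash. `BASE` is a hypothetical placeholder for the search domain, and the two links are shortened stand-ins for the entries in the list above.

```python
from urllib.parse import urljoin

# Hypothetical placeholder; use the domain the search page was served from.
BASE = "https://sogou-weixin.example"

# Shortened stand-ins for the relative links in the list above.
href = ["/link?url=AAA&type=2", "/link?url=BBB&type=2"]

# A path starting with "/" replaces the path of the base URL.
full_links = [urljoin(BASE, h) for h in href]
print(full_links)
```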

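Now for the key part!

After clicking through, the link changes, which means a second redirect has taken place.

I took a detour here: I assumed Sogou had encrypted the real URL (the part after /link?url= in the list certainly looks encrypted) ☺

But after requesting this link, the returned content turns out to look like this:

That makes it easy: use a regular expression to match all the URL fragments, then join them together. Here is the code: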

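# href is the URL list obtained above
for h in href:
    # The domain prefix is empty in the code as published; it goes inside the "".
    url2 = "" + h
    response2 = requests.get(url=url2, headers=headers, verify=False)
    print(response2.text)
    # The response builds the target URL piece by piece with url += '...' statements.
    r = re.findall(r"url \+= '(.*?)'", response2.text)
    true_url = "".join(r)
    print(true_url)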

The printed true_url:

"http://mp./s?src=11&timestamp=1629773949&ver=3271&signature=2NJAarqFMY0hKWeCNG*GDtNQPA*8t*A-WVC7PK0tZCZcigpZAttuPNsbGjQQe8FD5DORCy16jaiIBVRD8u1ZQYMdEaF7g5mJhC1mZS8Hwd8BH90okbIgTMycoqctIEyQ&new=1"

Done. From here, just scrape whatever you need from true_url!

Note: you need to pass different cookies~
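
The fragment-joining step for the WeChat links can be checked without touching the network. In this sketch `SAMPLE_JS` is a made-up response body, shaped only by what the `url += '...'` pattern expects. It is not a capture of the real page, and the article host in it is a placeholder.

```python
import re

# Made-up response body: the target URL is split across several
# `url += '...'` statements, which is the shape the pattern looks for.
SAMPLE_JS = """
var url = '';
url += 'http://article.exa';
url += 'mple/s?src=11&time';
url += 'stamp=1629773949';
window.location.replace(url)
"""

# The raw string keeps \+ as a literal plus sign; (.*?) captures each fragment.
fragments = re.findall(r"url \+= '(.*?)'", SAMPLE_JS)
true_url = "".join(fragments)
print(true_url)
```

The initial `var url = '';` line is not matched because it uses `=` and not `+=`, so only the three fragments are joined.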

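Sogou Zhihu works the same way, except that the redirect content returned is different. Swap the regular expression for 'window\.location\.replace\("(.*?)"\)' to get the real URL.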
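
For the Zhihu case, the parentheses of the `window.location.replace("...")` call have to be escaped in the pattern. Otherwise the regex engine treats them as a capture group and the literal call is not matched. This is a minimal sketch on a made-up response body; the target URL in it is a placeholder.

```python
import re

# Made-up response body containing a JavaScript redirect (not captured from the site).
SAMPLE_JS = '<script>window.location.replace("https://zhihu.example/question/1")</script>'

# Dots and parentheses are escaped so they match literally;
# the single capture group holds the URL between the double quotes.
match = re.search(r'window\.location\.replace\("(.*?)"\)', SAMPLE_JS)
true_url = match.group(1) if match else None
print(true_url)
```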

Sogou Zhihu link

This took some work to put together, so any encouragement is appreciated~
