[Python] Chapter 3 task from Web Scraping with Python (《Python网络爬虫权威指南》): testing the six degrees of separation theory

Date: 2018-07-14 14:19:53

Task description

Can one wiki page be reached from another by following internal links, with at most six jumps? For this book, the task is to get from /wiki/Eric_Idle to /wiki/Kevin_Bacon.
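
The task amounts to a depth-limited search over a link graph. As a minimal sketch, assuming a small made-up graph in place of real wiki pages (the `toy_links` dictionary and the `shortest_path` helper are hypothetical and not taken from the book), a breadth-first search returns one shortest chain of at most six jumps:

```python
from collections import deque
from typing import Dict, List, Optional

# Hypothetical link graph: page -> pages it links to (not real wiki data).
toy_links: Dict[str, List[str]] = {
    '/wiki/A': ['/wiki/B', '/wiki/C'],
    '/wiki/B': ['/wiki/D'],
    '/wiki/C': ['/wiki/D', '/wiki/E'],
    '/wiki/D': ['/wiki/F'],
    '/wiki/E': ['/wiki/F'],
    '/wiki/F': [],
}


def shortest_path(links: Dict[str, List[str]], start: str, target: str,
                  max_jumps: int = 6) -> Optional[List[str]]:
    """Breadth-first search; returns a shortest path or None if none fits."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        page = path[-1]
        if page == target:
            return path
        if len(path) - 1 >= max_jumps:
            continue  # jump budget used up, do not expand further
        for nxt in links.get(page, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


print(shortest_path(toy_links, '/wiki/A', '/wiki/F'))
```

Breadth-first order visits pages by increasing distance, so the first hit is a shortest chain. A depth-first search with a depth limit may instead report longer chains first.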

Approach

The book already explains it, so I won't go over it here.

Process log

Stuck at home during the pandemic with nothing better to do, I left my laptop running for three days. The final results:

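Crawled more than 80,000 pages and saved them locally, over 10 GB in total; analyzed more than 200,000 internal links; found a dozen or so workable paths; did not actually find every workable path, because in the end I did not want to keep it running.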

Code

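Fetch a wiki page and save it locally (there is the firewall to deal with, after all, so a local copy makes it easy to rerun after an error)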

from hashlib import md5
from http.client import HTTPResponse
from typing import Tuple
from urllib.error import HTTPError, URLError
from urllib.request import urlopen
import time

storage_directory = 'D:/MyResources/爬虫数据/Wiki Pages'


def process_filename(filename: str) -> str:
    """Strip characters that Windows forbids in file names; return the full storage path."""
    # The built-in hash() of a str changes between interpreter runs, so a
    # stable digest is used for the fallback name instead.
    digest = md5(filename.encode('utf-8')).hexdigest()
    for ch in '"?*<>:/\\|':
        filename = filename.replace(ch, '')
    # Fall back to the digest if nothing usable is left (empty or only dots).
    if len(filename) == 0 or len(filename) == filename.count('.'):
        filename = digest
    return storage_directory + '/' + filename


def get_and_store_page(url: str, filename: str) -> bool:
    """Download url and save it under filename; return True on success."""
    try:
        response = urlopen(url)  # type: HTTPResponse
    except HTTPError as e:
        print(f'HTTPError: {e}')
        return False
    except URLError as e:
        print(f'URLError: {e}')
        return False
    html = response.read().decode(encoding='utf-8')
    try:
        with open(file=process_filename(filename), mode='w', encoding='utf-8') as f:
            f.write(html)
    except FileNotFoundError as e:
        print(f'check your file name: {e}')
        return False
    time.sleep(1)  # pause so that at most one page is requested per second
    return True


def load_stored_html(filename: str) -> Tuple[str, bool]:
    """Return (html, True) for a stored page, or ('', False) if it is not stored."""
    try:
        with open(file=process_filename(filename), mode='r', encoding='utf-8') as f:
            return f.read(), True
    except FileNotFoundError as e:
        print(f'check your filename: {e}')
        return '', False


if __name__ == '__main__':
    # NOTE: the scheme and host of these URLs are missing from the source text.
    # Prepend the wiki's base URL before running; urlopen raises ValueError
    # for a bare path such as '/wiki/Kevin_Bacon'.
    for page in ('Kevin_Bacon', 'Eric_Idle'):
        if get_and_store_page(url=f'/wiki/{page}', filename=f'{page}.html'):
            print(f'success: /wiki/{page}')
        else:
            print(f'fail: /wiki/{page}')
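
Stripping forbidden characters can map two different page names to the same file (for instance the hypothetical names `a?b` and `ab`). One possible alternative, sketched here as an assumption rather than as the author's method, is to percent-encode the name with the standard library so that the mapping stays unique and reversible. The helpers `cache_filename` and `page_name_from` are made up for this illustration:

```python
from urllib.parse import quote, unquote


def cache_filename(page_name: str) -> str:
    """Hypothetical helper: percent-encode everything except letters, digits and _.-~"""
    return quote(page_name, safe='') + '.html'


def page_name_from(filename: str) -> str:
    """Inverse of cache_filename: drop the extension and decode."""
    return unquote(filename[:-len('.html')])


for name in ['Kevin_Bacon', 'Nome,_Alaska', 'a?b', 'ab']:
    stored = cache_filename(name)
    print(name, '->', stored, '->', page_name_from(stored))
```

Because the encoding is reversible, the original page name can be recovered from a stored file without keeping a separate index.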

Testing the six degrees of separation theory

from bs4 import BeautifulSoup
from CH3_GetWikipedia import load_stored_html, get_and_store_page
import re
import time

# NOTE: the host value is empty in the source text; set it to the wiki's
# base URL before running.
host = ''
visited_url = dict()  # path -> smallest number of jumps at which it was reached
jump_path = ['', '', '', '', '', '', '']  # jump_path[i] is the page reached after i jumps
results = []


def find_kevin_bacon(path: str, jumps: int) -> None:
    """Depth-first search, at most six jumps deep, for links that lead to Kevin_Bacon."""
    jump_path[jumps] = host + path
    if path.split('/')[-1] == 'Kevin_Bacon':
        print("!!!! it's found!")
        # Keep only the entries up to the current depth; later slots may still
        # hold pages left over from an earlier, deeper branch.
        found = jump_path[:jumps + 1]
        results.append(found)
        with open(file='./result.txt', mode='a', encoding='utf-8') as f:
            for u in found:
                print(u)
                f.write(u + '\n')
            f.write('--------------------\n')
        return
    # Skip a page already reached in the same number of jumps or fewer.
    if path in visited_url:
        if visited_url[path] > jumps:
            visited_url[path] = jumps
        else:
            return
    else:
        visited_url[path] = jumps
    now = time.strftime('%H:%M:%S')
    print(f'---> {now} jump time: {jumps}, visited: {len(visited_url)}, now visit: {path}.')
    if jumps >= 6:
        return
    # Use the local copy if there is one; otherwise download the page first.
    filename = path.split('/')[-1] + '.html'
    html, success = load_stored_html(filename=filename)
    if not success:
        if not get_and_store_page(url=host + path, filename=filename):
            return
        html, success = load_stored_html(filename=filename)
    bs = BeautifulSoup(markup=html, features='html.parser')
    body = bs.find(name='div', attrs={'id': 'bodyContent'})
    if body is None:  # page without the expected content container
        return
    # Article links start with /wiki/ and contain no colon.
    links = body.find_all(name='a', attrs={'href': re.compile('^(/wiki/)((?!:).)*$')})
    for link in links:
        find_kevin_bacon(path=link['href'], jumps=jumps + 1)


if __name__ == '__main__':
    find_kevin_bacon(path='/wiki/Eric_Idle', jumps=0)
    print(f'Found {len(results)} paths in total:')
    for res in results:
        print(' -> '.join(res))
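
The link filter in the crawler is the regular expression `^(/wiki/)((?!:).)*$`: the href must start with `/wiki/` and must not contain a colon anywhere after that. A small sketch with made-up hrefs shows what it accepts. The fragment-stripping step at the end is an assumption added for illustration and is not part of the original code:

```python
import re

# Same pattern as the crawler: starts with /wiki/ and contains no colon.
article_link = re.compile('^(/wiki/)((?!:).)*$')

# Made-up hrefs for illustration only.
hrefs = [
    '/wiki/Kevin_Bacon',
    '/wiki/Category:Actors',
    '/wiki/Eric_Idle#Career',
    '//example.org/page',
    '/w/index.php?title=X',
    '/wiki/Nome,_Alaska',
]

kept = [h for h in hrefs if article_link.match(h)]
print(kept)

# Optional normalization (an assumption, not in the original code):
# drop the #fragment so one article is not treated as several pages.
normalized = sorted({h.split('#')[0] for h in kept})
print(normalized)
```

Note that an href with a `#fragment` passes the filter, so without normalization the same article can be visited and stored under more than one name.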

The workable paths I found

/wiki/Eric_Idle -> /wiki/South_Shields -> /wiki/Tyne_and_Wear -> /wiki/Telford_and_Wrekin -> /wiki/Time_zone -> /wiki/Nome,_Alaska -> /wiki/Kevin_Bacon
/wiki/Eric_Idle -> /wiki/South_Shields -> /wiki/Tyne_and_Wear -> /wiki/Telford_and_Wrekin -> /wiki/England -> /wiki/Michael_Caine -> /wiki/Kevin_Bacon
/wiki/Eric_Idle -> /wiki/South_Shields -> /wiki/Tyne_and_Wear -> /wiki/Telford_and_Wrekin -> /wiki/England -> /wiki/Gary_Oldman -> /wiki/Kevin_Bacon
/wiki/Eric_Idle -> /wiki/South_Shields -> /wiki/Tyne_and_Wear -> /wiki/Telford_and_Wrekin -> /wiki/England -> /wiki/Daniel_Day-Lewis -> /wiki/Kevin_Bacon
/wiki/Eric_Idle -> /wiki/South_Shields -> /wiki/Tyne_and_Wear -> /wiki/Telford_and_Wrekin -> /wiki/New_town -> /wiki/Edmund_Bacon_(architect) -> /wiki/Kevin_Bacon
/wiki/Eric_Idle -> /wiki/South_Shields -> /wiki/Tyne_and_Wear -> /wiki/Telford_and_Wrekin -> /wiki/Stoke-on-Trent -> /wiki/Hugh_Dancy -> /wiki/Kevin_Bacon
/wiki/Eric_Idle -> /wiki/South_Shields -> /wiki/Tyne_and_Wear -> /wiki/Telford_and_Wrekin -> /wiki/Coventry -> /wiki/Bon_Jovi -> /wiki/Kevin_Bacon
/wiki/Eric_Idle -> /wiki/South_Shields -> /wiki/Tyne_and_Wear -> /wiki/Telford_and_Wrekin -> /wiki/Blackpool -> /wiki/Pleasure_Beach_Blackpool -> /wiki/Kevin_Bacon
/wiki/Eric_Idle -> /wiki/South_Shields -> /wiki/Tyne_and_Wear -> /wiki/Telford_and_Wrekin -> /wiki/Blackpool -> /wiki/Blackpool_Pleasure_Beach -> /wiki/Kevin_Bacon
/wiki/Eric_Idle -> /wiki/South_Shields -> /wiki/Tyne_and_Wear -> /wiki/Telford_and_Wrekin -> /wiki/Blackpool -> /wiki/Frasier -> /wiki/Kevin_Bacon
/wiki/Eric_Idle -> /wiki/South_Shields -> /wiki/Tyne_and_Wear -> /wiki/Telford_and_Wrekin -> /wiki/Brighton_and_Hove -> /wiki/Lewes -> /wiki/Kevin_Bacon
/wiki/Eric_Idle -> /wiki/South_Shields -> /wiki/Tyne_and_Wear -> /wiki/Telford_and_Wrekin -> /wiki/Isle_of_Wight -> /wiki/Jeremy_Irons -> /wiki/Kevin_Bacon
/wiki/Eric_Idle -> /wiki/South_Shields -> /wiki/Tyne_and_Wear -> /wiki/Telford_and_Wrekin -> /wiki/South_Gloucestershire -> /wiki/EE_Limited -> /wiki/Kevin_Bacon
/wiki/Eric_Idle -> /wiki/South_Shields -> /wiki/Tyne_and_Wear -> /wiki/Metropolitan_county -> /wiki/Conservative_Party_(UK) -> /wiki/Early_1990s -> /wiki/Kevin_Bacon
/wiki/Eric_Idle -> /wiki/South_Shields -> /wiki/Tyne_and_Wear -> /wiki/Metropolitan_county -> /wiki/Margaret_Thatcher -> /wiki/Meryl_Streep -> /wiki/Kevin_Bacon
/wiki/Eric_Idle -> /wiki/South_Shields -> /wiki/Tyne_and_Wear -> /wiki/Metropolitan_county -> /wiki/History_of_local_government_in_England -> /wiki/Cleveland -> /wiki/Kevin_Bacon
/wiki/Eric_Idle -> /wiki/South_Shields -> /wiki/Tyne_and_Wear -> /wiki/Metropolitan_county -> /wiki/Urban_area -> /wiki/Empire_State_Building -> /wiki/Kevin_Bacon
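
The crawler appends each hit to result.txt as one page per line, followed by a line of dashes. A sketch for reading that layout back and counting the jumps in each path, assuming the file looks as the script writes it; `parse_results` is a hypothetical helper and the sample text reuses two of the paths listed here:

```python
from typing import List


def parse_results(text: str) -> List[List[str]]:
    """Split result text into paths; a line made only of dashes ends a path."""
    paths: List[List[str]] = []
    current: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if set(line) == {'-'}:
            if current:
                paths.append(current)
                current = []
        else:
            current.append(line)
    if current:  # last path without a trailing separator
        paths.append(current)
    return paths


sample = '\n'.join([
    '--------------------',
    '/wiki/Eric_Idle', '/wiki/South_Shields', '/wiki/Tyne_and_Wear',
    '/wiki/Telford_and_Wrekin', '/wiki/England', '/wiki/Michael_Caine',
    '/wiki/Kevin_Bacon',
    '--------------------',
    '/wiki/Eric_Idle', '/wiki/South_Shields', '/wiki/Tyne_and_Wear',
    '/wiki/Metropolitan_county', '/wiki/Margaret_Thatcher',
    '/wiki/Meryl_Streep', '/wiki/Kevin_Bacon',
    '--------------------',
])

for p in parse_results(sample):
    print(len(p) - 1, 'jumps:', ' -> '.join(p))
```

Every path listed here uses exactly six jumps, which matches the depth limit in the crawler.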

Thoughts

Witnessing history!
