
Scraping the 无聊图 (boring pictures) and 妹子图 (girl pictures) boards of jandan.net with a Python crawler

Posted: 2020-03-11 18:45:48


Scrape the 无聊图 and 妹子图 boards of jandan.net with a Python crawler; readers who need something similar can use it as a reference.

I am new to Python and wrote this program as a practice exercise.

It crawls the 无聊图 and 妹子图 boards on jandan.net and stores the images on the local hard drive.

It uses the pyquery package for HTML parsing, which has to be installed separately.
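Install it with pip install pyquery. As a quick illustration of the pyquery API the script relies on (the URL and selector here are examples, not lines from the program):

from pyquery import PyQuery as pq

# pq(url=...) downloads and parses the page in one call
d = pq(url='http://jandan.net/pic/page-1')
# CSS selectors work like jQuery; .items() iterates the matched elements
for img in d('li img').items():
    print(img.attr('src'))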

By default images are downloaded under D:/Download/Python/: 无聊图 images go in the pic subdirectory and 妹子图 images in the ooxx subdirectory.

At startup the program asks for three inputs: the start page number, the end page number, and whether to fetch 无聊图 or 妹子图.
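For example, to grab pages 100 through 102 of 妹子图 (the page numbers are just an example), a session looks like this:

Input start page number: 100
Input end page number: 102
Select 0: wuliao 1: meizi 1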

The program:

# -*- coding: utf-8 -*-
"""
Created on Mon Dec 29 13:36:37

@author: Gavin
"""
import sys
reload(sys)
sys.setdefaultencoding('utf-8')

from pyquery import PyQuery as pq
from time import ctime
import time
import re
import os
import urllib


def main(page_start, page_end, flag):
    file_path_pre = 'D:/Download/Python/'
    folder_name = 'ooxx' if flag else 'pic'
    # The boards live at jandan.net/pic/page-N and jandan.net/ooxx/page-N
    page_url = 'http://jandan.net/' + folder_name + '/page-'
    folder_name = file_path_pre + folder_name + '/' + str(page_start) + '-' + str(page_end) + '/'
    for page_num in range(page_start, page_end + 1):
        crawl_page(page_url, page_num, folder_name)


def crawl_page(page_url, page_num, folder_name):
    page_url = page_url + str(page_num)
    print 'start handle', page_url
    print '', 'starting at', ctime()
    t0 = time.time()
    page_html = pq(url=page_url)  # fetch and parse the page HTML
    # Each post on the page is an <li id="comment-NNN"> element
    comment_id_patt = r'<li id="comment-(.+?)">'
    comment_ids = re.findall(comment_id_patt, page_html.html())
    name_urls = {}
    for comment_id in comment_ids:
        name_url = dispose_comment(page_html, comment_id)
        if name_url:
            name_urls.update(name_url)
    if not os.path.exists(folder_name):
        print '', 'new folder', folder_name
        os.makedirs(folder_name)
    for name_url in name_urls.items():
        file_path = folder_name + 'page-' + str(page_num) + name_url[0]
        img_url = name_url[1]
        if not os.path.exists(file_path):
            print '', 'start download', file_path
            # print '', 'img_url is', img_url
            urllib.urlretrieve(img_url, file_path)
        else:
            print '', file_path, 'is already downloaded'
    print 'finished at', ctime(), ', total time', time.time() - t0, 'seconds'


def dispose_comment(page_html, comment_id):
    """Collect the image URLs of one post, keyed by a descriptive file name."""
    name_url_dict = {}
    id = '#comment-' + comment_id
    comment_html = page_html(id)
    # OO / XX are the up-vote / down-vote counts of a post
    oo_num = int(comment_html(id + ' #cos_support-' + comment_id).text())
    xx_num = int(comment_html(id + ' #cos_unsupport-' + comment_id).text())
    # Integer division: only posts whose OO count reaches the XX count pass
    oo_to_xx = oo_num / xx_num if xx_num != 0 else oo_num
    if oo_num > 1 and oo_to_xx > 0:
        imgs = comment_html(id + ' img')
        for i in range(0, len(imgs)):
            # Full-size images carry the real URL in org_src; fall back to src
            org_src = imgs.eq(i).attr('org_src')
            src = imgs.eq(i).attr('src')
            img_url = org_src if org_src else src
            if img_url:
                img_suffix = img_url[-4:]
                if not img_suffix.startswith('.'):
                    img_suffix = '.jpg'
                img_name = id + '_oo' + str(oo_num) + '_xx' + str(xx_num) \
                    + (('_' + str(i)) if i != 0 else '') + img_suffix
                name_url_dict[img_name] = img_url
            else:
                print '***url not exist'
    return name_url_dict


if __name__ == '__main__':
    page_start = int(raw_input('Input start page number: '))
    page_end = int(raw_input('Input end page number: '))
    is_ooxx = int(raw_input('Select 0: wuliao 1: meizi '))
    main(page_start, page_end, is_ooxx)
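Note that the listing is Python 2 code (print statements, raw_input, urllib.urlretrieve, and the reload(sys) encoding hack). On Python 3 the download step would look roughly like this; a minimal sketch, and the helper name is mine, not from the original:

import os
import urllib.request

def download(img_url, file_path):
    # Mirror the original check: skip files that already exist on disk
    if os.path.exists(file_path):
        print(file_path, 'is already downloaded')
        return
    print('start download', file_path)
    urllib.request.urlretrieve(img_url, file_path)

Similarly, raw_input becomes input, and the sys.setdefaultencoding hack can simply be dropped, since Python 3 strings are Unicode by default.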

