Python Web Crawler: Scraping Sina Weibo Data with pyautogui and pytesseract (OCR)

Date: 2020-03-30 17:34:03

This article combines OCR with pyautogui and webbrowser to build a crawler that scrapes Sina Weibo data, for example the follower count of a Weibo user.

On Windows, first download the Tesseract OCR installer from:

http://digi.bib.uni-mannheim.de/tesseract

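Run the installer after downloading. Once installation finishes, the system needs to know where tesseract.exe is located. If you prefer not to set this through a system environment variable, you can instead edit the source code of the pytesseract package: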

After running pip install pytesseract, find the following line in the pytesseract.py source file:

tesseract_cmd = 'tesseract'

and change it to the path of the installed tesseract.exe, for example:

tesseract_cmd = r'D:\program\tesseract\tesseract.exe'
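As an alternative to editing the package source, pytesseract also allows the same tesseract_cmd attribute to be set at runtime. The sketch below is one possible way to do that. The helper name resolve_tesseract_cmd and its fallback list are illustrative assumptions, not part of pytesseract.

```python
import os
import shutil


def resolve_tesseract_cmd(candidates):
    """Return a path to a tesseract executable, or None if none is found.

    Hypothetical helper: checks PATH first, then the given candidate paths.
    """
    on_path = shutil.which('tesseract')
    if on_path:
        return on_path
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None


# The install path below is the one used in this article; adjust it to your setup.
cmd = resolve_tesseract_cmd([r'D:\program\tesseract\tesseract.exe'])

if cmd is not None:
    try:
        import pytesseract
        # Same attribute as the edited source line, set at runtime instead.
        pytesseract.pytesseract.tesseract_cmd = cmd
    except ImportError:
        print('pytesseract is not installed; run: pip install pytesseract')
else:
    print('tesseract executable not found')
```

Setting the attribute in your own script has the advantage that the change survives a reinstall or upgrade of pytesseract.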

With the environment configured, we can start implementing the actual logic.

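Approach: on a Sina Weibo user page, the follower count always appears at a fixed position, directly above the "粉丝" (Followers) label.

Capture a small image of that label in advance and save it as loc.png. pyautogui uses this image to find where the label sits in the full-page screenshot. Then crop a region that starts just above the label and is centered on it horizontally, so that it contains only the follower number, and pass that crop to tesseract for digit recognition.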
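The geometry of that crop can be sketched as a small pure function. This is only an illustration of the idea described above: the function name crop_box_above and the sample coordinates are made up, and the input is assumed to be in the (left, top, width, height) form that pyautogui.locateOnScreen returns.

```python
def crop_box_above(left, top, width, height, crop_width, crop_height):
    """Return (left, upper, right, lower) for a region directly above a match.

    (left, top, width, height) describe where the template was found.
    The crop is centered horizontally on the match and ends at its top edge.
    """
    center_x = left + width / 2
    box_left = max(0, round(center_x - crop_width / 2))
    box_right = round(center_x + crop_width / 2)
    box_upper = max(0, top - crop_height)
    box_lower = top
    return box_left, box_upper, box_right, box_lower


# Made-up example: the label was found at x=600, y=300 with size 60x20.
box = crop_box_above(600, 300, 60, 20, crop_width=140, crop_height=38)
print(box)  # (560, 262, 700, 300)
```

The resulting tuple has the same order that PIL's Image.crop expects, and clamping to zero keeps the box inside the screenshot when the label sits near the top or left edge.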

    import time
    import webbrowser as wb

    import pandas as pd
    import pyautogui
    import pytesseract as pt
    from tqdm import tqdm


    # Before running, open the browser and maximize its window.
    def get_city_data(city):
        # Only the path part of the URL survives in the original listing.
        url = f'/{city[1]}'

        # Commented-out selenium variant from the original code:
        # from selenium import webdriver
        # chromepath = r'D:\program\chromedriver_win32\chromedriver.exe'
        # driver = webdriver.Chrome(executable_path=chromepath)
        # driver.get(url)
        # driver.close()
        # driver.quit()

        wb.open(url=url)
        pyautogui.sleep(6)  # give the page time to load

        full_img = pyautogui.screenshot()
        print(full_img)

        # loc.png must be captured from a Weibo page beforehand;
        # it is the template image used for matching.
        locate = pyautogui.locateOnScreen('loc.png')
        print(locate)
        if locate is None:
            raise RuntimeError('loc.png was not found on the screen')

        center_x, center_y = pyautogui.center(locate)
        print(center_x, center_y)

        # Width and height of the crop; adjust for your browser and screen resolution.
        WIDTH = 140
        HEIGHT = 38

        # left, upper, right, lower
        box = (center_x - WIDTH / 2, locate.top - HEIGHT, center_x + WIDTH / 2, locate.top)
        print(box)

        crop = full_img.crop(box=box)
        crop.save(fp='num.png')

        text = pt.image_to_string(image=crop)
        print(text)

        return city[0], int(text), time.strftime('%Y-%m-%d %H:%M', time.localtime())


    def main():
        city = [('Chengdu', '2384889627'), ]

        city_data = []
        pbar = tqdm(total=len(city), leave=True)
        for c in city:
            result = get_city_data(c)
            print(result)
            city_data.append(list(result))
            pbar.update(1)
        pbar.close()

        col = ['City', 'Followers', 'Timestamp']
        df = pd.DataFrame(data=city_data, columns=col)
        df = df.sort_values(by=col[1], axis=0, ascending=False)  # descending

        # Reset the index after sorting, otherwise it is out of order.
        df = df.reset_index(drop=True)

        # pandas indexes from 0 by default; shift so that rows start at 1.
        df.index = df.index + 1

        print(df.head(10))

        # Recent pandas versions no longer accept `encoding` here or write
        # legacy .xls files, so save as .xlsx (requires openpyxl).
        df.to_excel('city.xlsx')
        df.to_csv('city.csv', encoding='utf-8')


    if __name__ == '__main__':
        main()
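Calling int() directly on raw OCR output fails as soon as the text contains a thousands separator or a stray character. A more forgiving conversion might look like the sketch below. The helper name parse_follower_count is hypothetical, and the approach assumes the count is shown as a plain digit string; it does not handle abbreviated counts with a unit suffix or a decimal point.

```python
import re


def parse_follower_count(ocr_text):
    """Extract an integer from OCR output, or return None if no digits remain."""
    digits = re.sub(r'\D', '', ocr_text)  # drop everything that is not a digit
    return int(digits) if digits else None


print(parse_follower_count(' 1,234,567\n'))  # 1234567
print(parse_follower_count('\x0c'))          # None
```

Returning None rather than raising lets the caller decide whether to retry the screenshot or skip the account. Recognition itself may also improve if Tesseract is told to expect a single line of digits through the config argument of image_to_string, for instance with page segmentation mode 7 and a digit whitelist, though how well that works depends on the Tesseract version installed.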

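Important: the parameters in this program must be adjusted to the browser zoom level and the screen size. In particular, the estimated width and height of the cropped region above the "粉丝" (Followers) label need to be tuned to the environment in which the code actually runs.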
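One rough way to derive a starting point for those values is to scale them with the browser zoom factor. The sketch below assumes that the 140 by 38 crop size was measured at 100% zoom and that the page layout scales linearly with zoom; neither is stated in the article, and the function name scale_crop_size is made up for illustration.

```python
def scale_crop_size(base_width, base_height, zoom):
    """Scale a crop size measured at 100% zoom to another zoom factor."""
    if zoom <= 0:
        raise ValueError('zoom must be positive')
    return round(base_width * zoom), round(base_height * zoom)


# 140x38 are the values used in the article; 1.5 stands for 150% zoom.
print(scale_crop_size(140, 38, 1.5))  # (210, 57)
```

The result is only an estimate to refine by inspecting the saved num.png. The template loc.png also has to be recaptured at the same zoom level, because pyautogui.locateOnScreen matches pixels at a single scale.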
