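Python crawler network request timeout_6. Web crawler tutorial 2: crawling with the urllib library, basic usage, timeout settings, automatically simulating HTTP requests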

Date: 2020-02-28 14:33:34

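Writing a simple crawler with urllib, the library that ships with Python

urlopen() fetches the HTML source of a URL

read() reads the HTML source content

decode("utf-8") converts the bytes into a string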

#!/usr/bin/env python

# -*- coding:utf-8 -*-

import urllib.request

html = urllib.request.urlopen('/course/8360.html').read().decode("utf-8")

print(html)
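The urlopen() / read() / decode() chain above can be wrapped in a small helper. The following is only a sketch: `fetch_text` and `demo_url` are hypothetical names, and a `data:` URL stands in for a real page so that the code runs without a network connection. It also reads the charset from the response headers instead of always assuming UTF-8.

```python
import urllib.request


def fetch_text(url, default_charset="utf-8"):
    """Fetch a URL and decode the body using the charset the server declares."""
    # The with-block closes the response even if read() raises.
    with urllib.request.urlopen(url) as resp:
        raw = resp.read()  # the body as bytes
        # Fall back to default_charset when the headers name no charset.
        charset = resp.headers.get_content_charset() or default_charset
    return raw.decode(charset)


# A data: URL stands in for a real page so the sketch runs offline.
demo_url = "data:text/html;charset=utf-8,%3Ch1%3Ehello%3C%2Fh1%3E"
page = fetch_text(demo_url)
print(page)
```

Passing a real http:// or https:// address to `fetch_text` works the same way.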

Extracting specific content from the page with a regular expression

#!/usr/bin/env python

# -*- coding:utf-8 -*-

import urllib.request

import re

html = urllib.request.urlopen('/course/8360.html').read().decode("utf-8")  # fetch the HTML source

pat = r"51CTO学院Python实战群\((\d*?)\)"  # regex that captures the QQ group number

rst = re.compile(pat).findall(html)

print(rst)

#['325935753']
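The regular-expression step can be tried on its own, without downloading anything. In this sketch the `html` string is a made-up fragment standing in for the downloaded page, and the group name in it is an English placeholder, not the text of the real page.

```python
import re

# Hypothetical page fragment standing in for the downloaded HTML.
html = "<li>Python study group(325935753)</li><li>Java study group(111)</li>"

# Raw string: \( and \) match literal parentheses, (\d+) captures the digits.
pat = re.compile(r"Python study group\((\d+)\)")

# findall() returns only the captured group for every match.
rst = pat.findall(html)
print(rst)
```

Because the pattern has exactly one capture group, findall() returns a list of strings rather than a list of tuples.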

urlretrieve() downloads a remote file and saves it locally; the first argument is the file's URL, the second is the path to save it to

#!/usr/bin/env python

# -*- coding:utf-8 -*-

from urllib import request

import re

import os

file_path = os.path.join(os.getcwd(), '222.html')  # build the path to save the file to

# print(file_path)

request.urlretrieve('/course/8360.html', file_path)  # download the file and save it to that path
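As an offline sketch of the same call, the example below saves a `data:` URL (a stand-in for the remote file, not part of the original article) into a temporary directory and reads the result back. urlretrieve() returns a tuple of the local path and the response headers.

```python
import os
import tempfile
from urllib import request

# A data: URL stands in for the remote file so the sketch runs offline.
source_url = "data:text/html;charset=utf-8,%3Cp%3Esaved%3C%2Fp%3E"

with tempfile.TemporaryDirectory() as folder:
    file_path = os.path.join(folder, "222.html")
    # urlretrieve() returns the local path and the response headers.
    saved_path, headers = request.urlretrieve(source_url, file_path)
    with open(saved_path, "rb") as fh:
        content = fh.read()

print(content)
```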

urlcleanup() cleans up the temporary files that earlier urlretrieve() calls may have left behind

#!/usr/bin/env python

# -*- coding:utf-8 -*-

from urllib import request

import re

import os

file_path = os.path.join(os.getcwd(), '222.html')  # build the path to save the file to

# print(file_path)

request.urlretrieve('/course/8360.html', file_path)  # download the file and save it to that path

request.urlcleanup()
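The effect of urlcleanup() is easiest to see when urlretrieve() is called without a save path, because it then writes to a temporary file of its own choosing. A minimal sketch, again using a `data:` URL as an offline stand-in for a remote file:

```python
import os
from urllib import request

# A data: URL stands in for the remote file so the sketch runs offline.
source_url = "data:text/plain,temporary"

# With no save path, urlretrieve() writes to a temporary file it picks itself.
temp_path, _ = request.urlretrieve(source_url)
existed_before = os.path.exists(temp_path)

request.urlcleanup()  # deletes the temporary files recorded by urlretrieve()
exists_after = os.path.exists(temp_path)

print(existed_before, exists_after)
```

A file saved to an explicit path, such as 222.html above, is not a temporary file and is left in place.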

info() returns the meta-information of the fetched page, that is, its response headers

#!/usr/bin/env python

# -*- coding:utf-8 -*-

import urllib.request

import re

html = urllib.request.urlopen('/course/8360.html')  # open the page; returns a response object

a = html.info()

print(a)

# C:\Users\admin\AppData\Local\Programs\Python\Python35\python.exe H:/py/15/chshi.py

# Date: Tue, 25 Jul 16:08:17 GMT

# Content-Type: text/html; charset=UTF-8

# Transfer-Encoding: chunked

# Connection: close

# Set-Cookie: aliyungf_tc=AQAAALB8CzAikwwA9aReq63oa31pNIez; Path=/; HttpOnly

# Server: Tengine

# Vary: Accept-Encoding

getcode() returns the HTTP status code

#!/usr/bin/env python

# -*- coding:utf-8 -*-

import urllib.request

import re

html = urllib.request.urlopen('/course/8360.html')  # open the page; returns a response object

a = html.getcode()  # get the status code

print(a)

#200

geturl() returns the URL of the page that was actually fetched

#!/usr/bin/env python

# -*- coding:utf-8 -*-

import urllib.request

import re

html = urllib.request.urlopen('/course/8360.html')  # open the page; returns a response object

a = html.geturl()  # get the URL of the fetched page

print(a)

#/course/8360.html
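Since Python 3.9 the standard-library documentation marks info(), getcode() and geturl() as deprecated in favor of the `headers`, `status` and `url` attributes of the response. The sketch below reads all three in one request. To stay self-contained it starts a throwaway local server: `DemoHandler`, the loopback address, and the `ProxyHandler({})` opener are assumptions of this example, not part of the original tutorial.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class DemoHandler(BaseHTTPRequestHandler):
    """Throwaway local server so the sketch needs no external site."""

    def do_GET(self):
        body = b"<html>ok</html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=UTF-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


server = HTTPServer(("127.0.0.1", 0), DemoHandler)  # port 0: pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
url = "http://127.0.0.1:%d/course/8360.html" % server.server_port

# ProxyHandler({}) ignores any system proxy so the loopback request goes direct.
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
with opener.open(url, timeout=5) as resp:
    headers = resp.headers  # what info() returns
    status = resp.status    # what getcode() returns
    final_url = resp.url    # what geturl() returns

server.shutdown()
server.server_close()

print(status, final_url)
print(headers["Content-Type"])
```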

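timeout sets the fetch timeout, in seconds

When the remote server responds too slowly, or does not respond for a long time, the timeout caps the wait: once it is exceeded, urlopen() raises an exception and the page is not fetched.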

#!/usr/bin/env python

# -*- coding:utf-8 -*-

import urllib.request

import re

html = urllib.request.urlopen('/course/8360.html', timeout=30)  # open the page, waiting at most 30 seconds

a = html.geturl()  # get the URL of the fetched page

print(a)

#/course/8360.html
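An exceeded timeout surfaces as an exception, so a crawler usually wraps the call in try/except. The sketch below is one possible shape, not the only one: `fetch_or_none` is a hypothetical helper, and the "server" is simply a local socket that accepts connections at the TCP level but never answers, which forces the timeout without any external site.

```python
import socket
import urllib.error
import urllib.request


def fetch_or_none(url, timeout):
    """Return the response body, or None if the request fails or times out."""
    # ProxyHandler({}) ignores any system proxy so the loopback request goes direct.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout) as resp:
            return resp.read()
    except (urllib.error.URLError, socket.timeout) as exc:
        # Connection failures arrive as URLError; a stalled response
        # can surface as a bare socket.timeout.
        print("gave up on", url, "->", exc)
        return None


# A listening socket that never replies: the connection opens, then stalls.
silent = socket.socket()
silent.bind(("127.0.0.1", 0))  # port 0: pick a free port
silent.listen(1)
url = "http://127.0.0.1:%d/" % silent.getsockname()[1]

result = fetch_or_none(url, timeout=0.5)
silent.close()
print(result)
```

Catching both exception types matters because HTTP errors such as 404 are also reported as URLError subclasses, so the same handler covers a slow server and a failing one.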

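Automatically simulating HTTP requests

The most commonly used HTTP requests are GET and POST

GET requests

360 Search, for example, uses a GET request to pass the user's search keyword to the server and retrieve the results

So we can simulate the search engine's HTTP request ourselves, building the URL from a keyword and sending it automatically

quote() percent-encodes the keyword into characters that are valid in a URL; by default a URL cannot contain Chinese characters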

#!/usr/bin/env python

# -*- coding: utf-8 -*-

import urllib.request

import urllib.parse

import re

gjc = "手机"  # the search keyword

gjc = urllib.parse.quote(gjc)  # percent-encode the keyword; by default a URL cannot contain Chinese characters

url = "/s?q=" + gjc  # build the URL

# print(url)

html = urllib.request.urlopen(url).read().decode("utf-8")  # fetch the HTML source

pat = r"(\w*\w*\w*)"  # regex that captures the related titles

rst = re.compile(pat).findall(html)

# print(rst)

for i in rst:

    print(i)  # print each captured title

# 官网 手机

# 官网 手机 这么低的价格

# 大牌 手机 低价抢

# 手机

# 淘宝网推荐 手机

# 手机

# 苏宁易购买 手机

# 买 手机
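The encoding step can be checked without sending a request. This sketch shows what quote() produces for the keyword and how urlencode() can build a whole query string in one call; the `pn` parameter is a hypothetical paging field used only for illustration.

```python
import urllib.parse

gjc = "手机"
encoded = urllib.parse.quote(gjc)         # UTF-8 bytes, percent-encoded
restored = urllib.parse.unquote(encoded)  # back to the original text

# urlencode() quotes every value and joins the pairs with "&".
# "pn" is a hypothetical paging parameter, used only for illustration.
query = urllib.parse.urlencode({"q": gjc, "pn": 2})
url = "/s?" + query

print(encoded)
print(url)
```

Using urlencode() for the whole query avoids quoting each value by hand once a request has more than one parameter.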

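POST requests

urlencode() encodes the form data to submit with a POST request; its argument is a dict of key/value form fields

Request() builds the POST request; the first argument is the URL, the second is the encoded form data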

#!/usr/bin/env python

# -*- coding: utf-8 -*-

import urllib.request

import urllib.parse

posturl = "/mypost/"

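shuju = urllib.parse.urlencode({  # urlencode() encodes the form data; its argument is a dict of key/value form fields

    'name': '123',

    'pass': '456'

}).encode('utf-8')  # the POST body must be bytes

req = urllib.request.Request(posturl, shuju)  # Request() builds the POST request: argument 1 is the URL, argument 2 is the encoded form data

html = urllib.request.urlopen(req).read().decode("utf-8")  # send the request and read the page it returns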

print(html)
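To watch the POST round trip end to end without an external site, the sketch below posts the same form to a throwaway local server that echoes the fields back. `EchoFormHandler`, the loopback address, and the `ProxyHandler({})` opener are assumptions of this example. Note that supplying `data` is what turns the request into a POST.

```python
import threading
import urllib.parse
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class EchoFormHandler(BaseHTTPRequestHandler):
    """Throwaway local server that echoes the posted form fields."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        form = urllib.parse.parse_qs(self.rfile.read(length).decode("utf-8"))
        text = "name=%s pass=%s" % (form["name"][0], form["pass"][0])
        body = text.encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


server = HTTPServer(("127.0.0.1", 0), EchoFormHandler)  # port 0: pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()

posturl = "http://127.0.0.1:%d/mypost/" % server.server_port
shuju = urllib.parse.urlencode({"name": "123", "pass": "456"}).encode("utf-8")
req = urllib.request.Request(posturl, data=shuju)
method = req.get_method()  # "POST", because data was supplied

# ProxyHandler({}) ignores any system proxy so the loopback request goes direct.
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
with opener.open(req, timeout=5) as resp:
    reply = resp.read().decode("utf-8")

server.shutdown()
server.server_close()
print(method, reply)
```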
