
How to modify a web page with Python. Python web scraping: modifying page content with BeautifulSoup

Time: 2018-10-13 07:18:45

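Besides searching and navigating page content, BeautifulSoup can also modify a page. Modifying means adding or removing tags, renaming tags, changing tag attributes, changing text content, and so on.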

Modifying tags with BeautifulSoup

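BeautifulSoup treats every tag as a tag object, and this object supports the following tasks:

Changing the tag name

Changing the tag attributes

Adding new tags

Deleting existing tags

Changing the text content of a tag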

Changing a tag's name

To rename a tag, simply assign a new value to its .name attribute.

producer_entries.name="div"
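The rename can be tried end to end with a minimal sketch. The markup below is a hypothetical stand-in for the page used in this article, and the built-in "html.parser" is used here only to avoid the extra lxml dependency:

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the article's original html_markup is not shown here.
html_markup = '<ul id="producers"><li class="producerlist">plants</li></ul>'

soup = BeautifulSoup(html_markup, "html.parser")
producer_entries = soup.ul

# Assigning to .name renames the tag; its attributes and children are kept.
producer_entries.name = "div"

print(soup)
# <div id="producers"><li class="producerlist">plants</li></div>
```

After the rename, soup.ul no longer finds the tag, because it is now a div.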


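Changing a tag's attributes

Tag attributes such as class, id, and style can be changed. Because attributes are stored as a dictionary, changing them is just ordinary Python dictionary handling.

Updating an attribute that already exists

See the following code: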

producer_entries['id']="producers_new_value"

Adding a new attribute to a tag

For example, if a tag has no class attribute, one can be added like this:

producer_entries['class']='newclass'

Deleting a tag attribute

Use the del operator, for example:

del producer_entries['class']
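The three attribute operations above can be combined into one short sketch. The markup is a hypothetical example, and "html.parser" is an assumption chosen to keep the snippet free of extra dependencies:

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for illustration.
soup = BeautifulSoup('<ul id="producers"><li>plants</li></ul>', "html.parser")
producer_entries = soup.ul

# Update an existing attribute.
producer_entries["id"] = "producers_new_value"

# Add an attribute the tag did not have before.
producer_entries["class"] = "newclass"
print(producer_entries.attrs)
# {'id': 'producers_new_value', 'class': 'newclass'}

# Delete the attribute again with del.
del producer_entries["class"]
print(producer_entries.has_attr("class"))
# False
```

Using producer_entries.get("class") instead of producer_entries["class"] avoids a KeyError when the attribute may be missing.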

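Adding a new tag

BeautifulSoup provides the new_tag() method to create a new tag. The new tag can then be inserted with methods such as append(), insert(), insert_after(), or insert_before().

Adding a new producer with new_tag() and append()

Continuing the earlier example, besides plants and algae, we now add phytoplankton as a producer. First, we need to create an li tag.

Creating a new tag with new_tag()

The new_tag() method can only be called on a BeautifulSoup object. Let us create an li tag object.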

soup=BeautifulSoup(html_markup,"lxml")

new_li_tag=soup.new_tag("li")

The only required argument of new_tag() is the tag name; tag attributes and other arguments are optional. For example:

new_atag=soup.new_tag("a",href="")

new_li_tag.attrs={'class':'producerlist'}
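A runnable sketch of tag creation follows. The href value is a hypothetical placeholder (the article leaves it empty), and "html.parser" is assumed instead of lxml:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<ul></ul>", "html.parser")

# Only the tag name is required.
new_li_tag = soup.new_tag("li")
new_li_tag.attrs = {"class": "producerlist"}

# Attributes can also be passed as keyword arguments.
# The URL is a made-up placeholder, not taken from the article.
new_atag = soup.new_tag("a", href="https://example.com/")

print(new_li_tag)  # <li class="producerlist"></li>
print(new_atag)    # <a href="https://example.com/"></a>

# A freshly created tag is not part of the document yet.
print(new_li_tag.parent)  # None
```

Because class is a reserved word in Python, it cannot be passed as a plain keyword argument, which is one reason to set it through .attrs or item assignment.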

Adding the new tag with append()

append() adds the new tag at the end of .contents, just like the append() method of a Python list.

producer_entries=soup.ul

producer_entries.append(new_li_tag)

The li tag is now a child of the ul tag. The output after adding the new tag:

plants

100000

algae

100000
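To see the effect of append() on the tree itself, here is a minimal sketch. The html_markup below is an assumed reconstruction of a producers list, not the article's original markup, and "html.parser" is used in place of lxml:

```python
from bs4 import BeautifulSoup

# Assumed markup: a list with two producers.
html_markup = (
    '<ul id="producers">'
    '<li class="producerlist"><div class="name">plants</div>'
    '<div class="number">100000</div></li>'
    '<li class="producerlist"><div class="name">algae</div>'
    '<div class="number">100000</div></li>'
    '</ul>'
)

soup = BeautifulSoup(html_markup, "html.parser")
new_li_tag = soup.new_tag("li")
new_li_tag.attrs = {"class": "producerlist"}

producer_entries = soup.ul
producer_entries.append(new_li_tag)

# The new tag is now the last child of the ul tag.
print(len(soup.find_all("li")))                    # 3
print(producer_entries.contents[-1] is new_li_tag)  # True
print(producer_entries.prettify())
```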

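Using insert() to add new div tags to the li tag

append() adds a new tag at the end of .contents, whereas insert() does not; we need to specify the position to insert at, just like the insert() method of a Python list.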

new_div_name_tag=soup.new_tag("div")

new_div_name_tag["class"]="name"

new_div_number_tag=soup.new_tag("div")

new_div_number_tag["class"]="number"

The code above first creates the two div tags.

new_li_tag.insert(0,new_div_name_tag)

new_li_tag.insert(1,new_div_number_tag)

print(new_li_tag.prettify())

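The two insert() calls then place them inside the li tag. Printing new_li_tag shows the two new, still empty, div tags inside it.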
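The insert() steps can be reproduced in isolation with the sketch below. It builds the li tag from scratch on an otherwise empty list and uses "html.parser", both of which are assumptions made to keep the example self-contained:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<ul></ul>", "html.parser")

new_li_tag = soup.new_tag("li")
new_li_tag.attrs = {"class": "producerlist"}

new_div_name_tag = soup.new_tag("div")
new_div_name_tag["class"] = "name"
new_div_number_tag = soup.new_tag("div")
new_div_number_tag["class"] = "number"

# insert(position, element): position is an index into .contents.
new_li_tag.insert(0, new_div_name_tag)
new_li_tag.insert(1, new_div_number_tag)

print(new_li_tag)
# <li class="producerlist"><div class="name"></div><div class="number"></div></li>
```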

Changing string content

In the example above we only added tags, but the tags have no content. BeautifulSoup can also add content.

Changing string content with .string

For example:

new_div_name_tag.string="phytoplankton"

print(producer_entries.prettify())

The output is as follows:

plants

100000

algae

100000

phytoplankton
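A small sketch of .string assignment, assuming a hypothetical one-item fragment and the built-in "html.parser":

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with an empty div.
soup = BeautifulSoup('<li><div class="name"></div></li>', "html.parser")
new_div_name_tag = soup.div

# Assigning to .string sets the text of the tag.
new_div_name_tag.string = "phytoplankton"
print(soup)
# <li><div class="name">phytoplankton</div></li>

# Assigning again replaces whatever the tag contained before.
new_div_name_tag.string = "algae"
print(soup)
# <li><div class="name">algae</div></li>
```

Note that assigning .string to a tag that contains other tags replaces those child tags as well.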

Adding strings with append(), insert(), and new_string()

append() and insert() work for strings the same way they do for tags. For example:

new_div_name_tag.append("producer")

print(soup.prettify())

Output:

plants

100000

algae

100000

phytoplankton

producer

There is also a new_string() method:

new_string_toappend=soup.new_string("producer")

new_div_name_tag.append(new_string_toappend)
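The sketch below shows the three ways of adding a string side by side. The fragment and the "new: " label are hypothetical, and "html.parser" is assumed:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div class="name">phytoplankton</div>', "html.parser")
new_div_name_tag = soup.div

# A plain str passed to append() becomes a new string node at the end.
new_div_name_tag.append("producer")
print(len(new_div_name_tag.contents))  # 2 (the strings are not merged)

# new_string() creates the string node explicitly; insert() places it by index.
label = soup.new_string("new: ")
new_div_name_tag.insert(0, label)

print(new_div_name_tag.get_text())
# new: phytoplanktonproducer
```

No whitespace is added between appended strings, so any separating space has to be part of the string itself.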

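Deleting a tag from the page

Tags can be deleted with the decompose() and extract() methods.

Removing a producer with decompose()

We now remove the div tag with class="name", using the decompose() method.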

third_producer=soup.find_all("li")[2]

div_name=third_producer.div

div_name.decompose()

print(third_producer.prettify())

Output:

10000

decompose() removes the tag together with all of its descendants.
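A self-contained sketch of decompose(), using an assumed three-producer markup (the number 10000 follows the output shown above) and "html.parser":

```python
from bs4 import BeautifulSoup

# Assumed markup with three producers.
html_markup = (
    '<ul id="producers">'
    '<li class="producerlist"><div class="name">plants</div></li>'
    '<li class="producerlist"><div class="name">algae</div></li>'
    '<li class="producerlist"><div class="name">phytoplankton</div>'
    '<div class="number">10000</div></li>'
    '</ul>'
)
soup = BeautifulSoup(html_markup, "html.parser")

third_producer = soup.find_all("li")[2]
div_name = third_producer.div  # first div inside the third li

# Remove the div and everything inside it from the tree.
div_name.decompose()

print(third_producer)
# <li class="producerlist"><div class="number">10000</div></li>
```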

Removing a producer with extract()

extract() removes a tag or string from an HTML document and returns a handle to the removed tag or string. Unlike decompose(), extract() can also be used on strings.

third_producer_removed=third_producer.extract()

print(soup.prettify())
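The following sketch shows extract() on both a tag and a string. The markup is an assumed three-producer list, and "html.parser" is used instead of lxml:

```python
from bs4 import BeautifulSoup

html_markup = (
    '<ul id="producers">'
    '<li class="producerlist"><div class="name">plants</div></li>'
    '<li class="producerlist"><div class="name">algae</div></li>'
    '<li class="producerlist"><div class="name">phytoplankton</div></li>'
    '</ul>'
)
soup = BeautifulSoup(html_markup, "html.parser")

# Extract a tag: it leaves the tree but is returned for later use.
third_producer = soup.find_all("li")[2]
third_producer_removed = third_producer.extract()
print(len(soup.find_all("li")))       # 2
print(third_producer_removed.parent)  # None

# Extract a string: the first div keeps its tag but loses its text.
removed_text = soup.div.string.extract()
print(removed_text)  # plants
print(soup.div)      # <div class="name"></div>
```

Because the removed element is returned, it can be inserted somewhere else in the same or another document.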

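Deleting a tag's contents with BeautifulSoup

A tag can have NavigableString objects or Tag objects as children. These children can be removed with clear().

For example, we can remove the div tag containing plants and the corresponding div tag with class="number".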

li_plants=soup.li

li_plants.clear()

After this call, the first li tag is still in the document, but all of its contents have been removed.
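A minimal sketch of clear(), assuming a two-producer markup and "html.parser":

```python
from bs4 import BeautifulSoup

# Assumed markup with two producers.
html_markup = (
    '<ul id="producers">'
    '<li class="producerlist"><div class="name">plants</div>'
    '<div class="number">100000</div></li>'
    '<li class="producerlist"><div class="name">algae</div>'
    '<div class="number">100000</div></li>'
    '</ul>'
)
soup = BeautifulSoup(html_markup, "html.parser")

li_plants = soup.li  # the first li tag
li_plants.clear()    # remove all children, keep the tag itself

print(li_plants)
# <li class="producerlist"></li>
print(len(soup.find_all("li")))  # 2
```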

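Special functions for modifying content

Besides the methods we have seen so far, BeautifulSoup offers other ways to modify content.

The insert_after() and insert_before() methods:

These two methods insert a tag or string immediately before or after another tag or string. The arguments they take are the NavigableString or Tag objects to insert.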

soup=BeautifulSoup(html_markup,"lxml")

div_number=soup.find("div",class_="number")

div_ecosystem=soup.new_tag("div")

div_ecosystem['class']="ecosystem"

div_ecosystem.append("soil")

div_number.insert_after(div_ecosystem)

print(soup.prettify())

Output:

plants

100000

soil

algae

100000
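Both methods can be seen together in the sketch below. The single-item fragment and the "count: " label are hypothetical, and "html.parser" is assumed:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<li><div class="name">plants</div><div class="number">100000</div></li>',
    "html.parser",
)
div_number = soup.find("div", class_="number")

div_ecosystem = soup.new_tag("div")
div_ecosystem["class"] = "ecosystem"
div_ecosystem.append("soil")

# Insert the new tag right after the number div ...
div_number.insert_after(div_ecosystem)
# ... and a plain string right before it.
div_number.insert_before("count: ")

print(soup)
# <li><div class="name">plants</div>count: <div class="number">100000</div><div class="ecosystem">soil</div></li>
```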

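The replace_with() method:

This method replaces an existing tag or string with a new tag or string. It takes a tag object or a string object as input, and returns a handle to the tag or string that was replaced.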

soup=BeautifulSoup(html_markup,"lxml")

div_name=soup.div

div_name.string.replace_with("phytoplankton")

print(soup.prettify())

replace_with() can also be used to replace an entire tag.
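Replacing a whole tag might look like the sketch below. The span replacement is a made-up illustration, and "html.parser" is assumed:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<li><div class="name">plants</div></li>', "html.parser")

old_div = soup.div
new_span = soup.new_tag("span")  # hypothetical replacement tag
new_span.string = "phytoplankton"

# The old tag is swapped out and returned.
replaced = old_div.replace_with(new_span)

print(soup)      # <li><span>phytoplankton</span></li>
print(replaced)  # <div class="name">plants</div>
```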

The wrap() and unwrap() methods:

wrap() wraps a tag or string in another tag. For example, each li tag can be wrapped in a div tag.

li_tags=soup.find_all("li")

for li in li_tags:
    new_divtag=soup.new_tag("div")
    li.wrap(new_divtag)

print(soup.prettify())

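unwrap() does the opposite of wrap(): it removes a tag but keeps its contents in place. Like replace_with(), unwrap() returns a handle to the tag that was replaced.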
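A round trip through wrap() and unwrap() can be sketched as follows, assuming a hypothetical two-item list and "html.parser":

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<ul><li>plants</li><li>algae</li></ul>", "html.parser")

# Wrap every li tag in its own new div tag.
for li in soup.find_all("li"):
    li.wrap(soup.new_tag("div"))
print(soup)
# <ul><div><li>plants</li></div><div><li>algae</li></div></ul>

# unwrap() removes each div again but leaves its contents in place.
for div in soup.find_all("div"):
    div.unwrap()
print(soup)
# <ul><li>plants</li><li>algae</li></ul>
```

A new wrapper tag is created for each li because a single tag object can only live in one place in the tree.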
