python - Extracting data from HTML files using BeautifulSoup and Python

Date: 2020-07-13 22:06:07

The data I need to extract can be found under different headings.

This is what I have so far:

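from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3 on Python 2

# Raw string, so the backslash in the Windows path is never read as an escape
ecj_data = open(r"data\ecj_1.html", 'r').read()
soup = BeautifulSoup(ecj_data)

celex = soup.find('h1')
# Position-based lookups: the 14th and 18th <ul> in the document
auth_lang = soup('ul', limit=14)[13].li
procedure = soup('ul', limit=20)[17].li

print "Celex number:", celex.renderContents(),
print "Authentic language:", auth_lang
print "Type of procedure:", procedure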

I have all the data stored locally, which is why the script opens the file ecj_1.html.
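For anyone reading along on Python 3, here is a minimal sketch of loading a local file for parsing. It assumes the newer `bs4` package rather than the BeautifulSoup 3 import used above. The helper name `load_soup`, the UTF-8 encoding, and the throwaway sample file are all assumptions for illustration, not part of the original script.

```python
import tempfile
from pathlib import Path

from bs4 import BeautifulSoup


def load_soup(path):
    """Parse a local HTML file (hypothetical helper, not from the original post)."""
    # Path objects handle separators portably, so no backslash escapes are involved.
    html = Path(path).read_text(encoding="utf-8")  # encoding is an assumption
    return BeautifulSoup(html, "html.parser")


# Demo with a throwaway file standing in for data/ecj_1.html.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "ecj_1.html"
    sample.write_text("<html><body><h1>61977J0059</h1></body></html>", encoding="utf-8")
    soup = load_soup(sample)

celex = soup.find("h1").get_text(strip=True)
print("Celex number:", celex)
```

With the real data, the call would be something like `load_soup(Path("data") / "ecj_1.html")`.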

The Celex number and the authentic language sort of work.

celex returns

"Celex number:

61977J0059"

auth_lang returns "Authentic language:

French"

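I only need the contents of the h1 tag (not the line break at the end).

[Also, I need auth_lang to return only "French", without the surrounding li tags.] This is no longer a problem. I realised I can simply add ".text" to the end of "auth_lang".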
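A small sketch of the difference between printing a tag and printing its text, written against the newer `bs4` package (an assumption; the script above uses BeautifulSoup 3). The markup below is made up to resemble the situation described, since the real page is not shown.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: an <h1> with a trailing break, and a one-item list.
html = "<h1>61977J0059<br/></h1><ul><li>French</li></ul>"
soup = BeautifulSoup(html, "html.parser")

h1 = soup.find("h1")
li = soup.find("li")

# Converting a tag to a string gives the markup, tags included.
li_markup = str(li)

# get_text() gives only the text; strip=True trims surrounding whitespace.
celex_text = h1.get_text(strip=True)
auth_lang_text = li.get_text(strip=True)

print("Celex number:", celex_text)
print("Authentic language:", auth_lang_text)
```

The `<br/>` inside the heading contributes no text, so it drops out once only the text is requested.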

procedure, on the other hand, returns:

Type of procedure:

Type of procedure:

Reference for a preliminary ruling

This is clearly wrong, as I only need it to return "Reference for a preliminary ruling".

Is there any way to achieve this?
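One possible approach, sketched under assumptions: if the list item holds the label in a nested tag and the value as plain text directly inside the `li`, then keeping only the direct text children drops the label. The markup below is a guess at the structure (the real page is not shown), `value_after_label` is a hypothetical helper, and the code uses the newer `bs4` package rather than BeautifulSoup 3.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical markup: label in <strong>, value as bare text inside the <li>.
html = """
<ul>
  <li>
    <strong>Type of procedure:</strong>
    <br/>
    Reference for a preliminary ruling
  </li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")


def value_after_label(soup, label):
    """Return the bare text of the first <li> containing `label` (hypothetical helper)."""
    for li in soup.find_all("li"):
        if label in li.get_text():
            # Keep only strings that are direct children of the <li>;
            # text inside nested tags such as <strong> is skipped.
            parts = [c.strip() for c in li.children if isinstance(c, NavigableString)]
            return " ".join(p for p in parts if p)
    return None


procedure = value_after_label(soup, "Type of procedure:")
print("Type of procedure:", procedure)
```

Looking the item up by its label rather than by position (such as the 18th `ul`) may also hold up better if the number of lists varies between files.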

Second edit: I replaced celex = soup.find('h1') with celex = soup('h1', limit=2)[0], and added .text to print celex.
