Speed comparison of methods for extracting data from HTML. Spoiler: BeautifulSoup sucks!

舒服员 · October 6, 2022, 06:34

Extracting the same data from the same page 1,000 times, total time per method:

Lxml Xpath                         : 0.77 seconds
Regex search                       : 1.25 seconds
Lxml CSS selector                  : 1.59 seconds
BeautifulSoup(lxml) find           : 4.75 seconds
BeautifulSoup(html.parser) find    : 5.52 seconds
BeautifulSoup(html5lib) find       : 9.92 seconds
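
The original benchmark script was not posted, so the following is only a rough sketch of how such a comparison could be set up with `timeit`. The sample page, the `title` class, and the function names are made up for illustration. The lxml and bs4 extractors are registered only if those packages are installed, and absolute timings will differ by machine and page.

```python
import re
import timeit

# Hypothetical sample page; the real benchmark used some other page.
HTML = """<html><body>
<div class="post"><h1 class="title">Hello</h1><a href="/a">A</a></div>
</body></html>"""

TITLE_RE = re.compile(r'<h1 class="title">(.*?)</h1>')


def by_regex(html):
    """Extract the title text with a precompiled regular expression."""
    match = TITLE_RE.search(html)
    return match.group(1) if match else None


# Map of benchmark label -> extractor function.
extractors = {"Regex search": by_regex}

try:
    from lxml import html as lxml_html

    def by_xpath(html):
        """Parse with lxml and select the title text via XPath."""
        tree = lxml_html.fromstring(html)
        found = tree.xpath('//h1[@class="title"]/text()')
        return found[0] if found else None

    extractors["Lxml Xpath"] = by_xpath
except ImportError:
    pass  # lxml not installed; skip this extractor

try:
    from bs4 import BeautifulSoup

    def by_bs4(html):
        """Parse with BeautifulSoup (stdlib html.parser) and use find()."""
        soup = BeautifulSoup(html, "html.parser")
        tag = soup.find("h1", class_="title")
        return tag.get_text() if tag else None

    extractors["BeautifulSoup(html.parser) find"] = by_bs4
except ImportError:
    pass  # bs4 not installed; skip this extractor


def run_benchmark(runs=1000):
    """Return {label: total seconds for `runs` parse-and-extract calls}."""
    results = {}
    for name, func in extractors.items():
        results[name] = timeit.timeit(lambda: func(HTML), number=runs)
    return results


if __name__ == "__main__":
    for name, seconds in sorted(run_benchmark().items(), key=lambda kv: kv[1]):
        print(f"{name:<35}: {seconds:.2f} seconds")
```

Note that each call here re-parses the page before extracting, so the numbers include parsing cost, not just the lookup.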

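surfactant · October 6, 2022, 06:44

I saw this comparison somewhere online a few days ago. It is like some scripting languages: they run slowly but are quick to write, and developer time is the most expensive resource.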

hua · October 6, 2022, 07:15

How about specifying the lxml parser in bs4?

surfactant · October 6, 2022, 08:00

A question: how does the HTML parser in JS relate to the parsers you can choose in bs4? Does it use one of them by default, or something else?

舒服员 · October 6, 2022, 12:53

Updated. XPath is still king. XD

atauzki · October 6, 2022, 13:51

By default it uses Python's built-in html.parser; it seems lxml only becomes the default once lxml is installed.
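
To avoid depending on whichever parser bs4 picks when none is given (it chooses the best one installed, preferring lxml, then html5lib, then html.parser), the parser can be named explicitly as the second argument. A minimal sketch, where `make_soup` and its `preferred` parameter are hypothetical helper names and not part of bs4:

```python
try:
    from bs4 import BeautifulSoup, FeatureNotFound
except ImportError:
    BeautifulSoup = None  # bs4 not installed; make_soup cannot be used


def make_soup(markup, preferred=("lxml", "html.parser")):
    """Parse `markup` with the first parser in `preferred` that is installed.

    Returns a (soup, parser_name) tuple so the caller knows which parser ran.
    """
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            continue  # this parser is not installed; try the next one
    raise RuntimeError("none of the requested parsers is installed")


if __name__ == "__main__" and BeautifulSoup is not None:
    soup, parser = make_soup("<p class='x'>hi</p>")
    print(parser, soup.find("p", class_="x").get_text())
```

Since html.parser ships with Python, keeping it last in `preferred` means the fallback always succeeds when bs4 itself is available.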

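surfactant · October 6, 2022, 14:35

You are talking about bs4, right? What I meant is that JS also has to parse HTML, and I don't know whether that is the same thing as the parsing here.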

atauzki · October 6, 2022, 14:36

JS is definitely not using lxml; there are far too many XML parsing library implementations out there.

surfactant · October 6, 2022, 14:37

Then do you know whether this can be done in VB?

atauzki · October 6, 2022, 14:37

Never used VB, zsbd

surfactant · October 6, 2022, 14:39

Python rules on this forum

atauzki · October 6, 2022, 14:40

Use whatever is simplest