I think someone asked this before, but I couldn't find the thread.
......<div class="thesaurus" type="tabt"><div class="beigeBox">......</div><div class="entry">......</div></div>......
How can I get everything inside the div with class="thesaurus"? Thanks a lot.
This is hard to do with regex. I'd suggest Python + bs (BeautifulSoup).
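A minimal sketch of the BeautifulSoup route, assuming the markup looks like the snippet in the question. The sample string and the helper name `thesaurus_blocks` are made up for illustration; the built-in `html.parser` is used here only so that nothing beyond `bs4` is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample shaped like the snippet in the question.
html = (
    '<p>before</p>'
    '<div class="thesaurus" type="tabt">'
    '<div class="beigeBox">the world &gt; life</div>'
    '<div class="entry">zygote</div>'
    '</div>'
    '<p>after</p>'
)

def thesaurus_blocks(markup):
    """Return (inner_html, text) for every <div class="thesaurus">."""
    soup = BeautifulSoup(markup, "html.parser")
    results = []
    for div in soup.find_all("div", class_="thesaurus"):
        inner_html = div.decode_contents()            # markup inside the div
        text = div.get_text(separator="\n", strip=True)  # visible text only
        results.append((inner_html, text))
    return results

for inner_html, text in thesaurus_blocks(html):
    print(text)
```

`decode_contents()` keeps the nested tags, while `get_text()` returns only the text, so pick whichever matches what "all the content" means for your output.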
Sending them a little…
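I see. I searched online for ages and couldn't find anything.
Find an XML parsing tool or a related library; this is easy to handle with one.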
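As one possible illustration of the parser approach, here is a sketch that uses only Python's standard-library `html.parser`. The class name `ThesaurusText` and the sample input are hypothetical; the idea is to count `div` nesting so that the matching closing tag of the thesaurus block is found correctly.

```python
from html.parser import HTMLParser

class ThesaurusText(HTMLParser):
    """Collect the text inside every <div class="thesaurus"> block."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.depth = 0    # div nesting depth inside the current block; 0 = outside
        self.blocks = []  # one list of text fragments per thesaurus block

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth:
            self.depth += 1  # nested div inside a thesaurus block
        elif "thesaurus" in (dict(attrs).get("class") or "").split():
            self.depth = 1   # entering a new thesaurus block
            self.blocks.append([])

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1  # reaches 0 when the thesaurus div closes

    def handle_data(self, data):
        if self.depth and data.strip():
            self.blocks[-1].append(data.strip())

# Hypothetical sample input.
parser = ThesaurusText()
parser.feed(
    '<div class="thesaurus" type="tabt">'
    '<div class="beigeBox">a &gt; b</div>'
    '<div class="entry">zygote</div>'
    '</div><div>skip</div>'
)
parser.close()
print(parser.blocks)
```

Because the parser is fed incrementally, the same class could in principle be fed a large file chunk by chunk instead of loading it all at once.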
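Maybe it's because I didn't paste the whole file, but the regex you gave doesn't work. I've uploaded the file; could you take another look?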
code.txt (18.1 KB)
the world > life > biology >
biological processes > procreation or reproduction > reproductive
substances or cells > [noun] > zygote or syngamete (12)
zygote1891
A body of living protoplasm, as a cell or cell-nucleus, formed by the conjugation or fusion of two
such bodies in reproduction; a zygospore, or any…
zygotoid1891
a multinucleate form of zygote in certain fungi (see quot.).
syngametea1900
The cell produced by the fusion of two gametes in reproduction.
ookinete1902
A zygote capable of autonomous movement, esp. as a stage in the life cycle of some parasitic
protozoa.
the world > life > biology >
biological processes > procreation or reproduction > reproductive
substances or cells > [adjective] > zygote or syngamete
(15)
syngamic1904
syngamous1904
zygotic1909
pertaining to or of the nature of a zygote, produced or characterized by zygosis.
the world > life > biology >
biological processes > procreation or reproduction > reproductive
substances or cells > [adverb] > zygote (1)
zygotically1915
in the zygote; in terms of the zygote.
the world > life > biology >
biological processes > procreation or reproduction > reproductive
substances or cells > [noun] > zygote or syngamete >
zygotomere, etc. (3)
zoozygosphere1880
(in algae and fungi) a motile gamete; also called planogamete, zoogamete.
zygotoblast1899
one of a number of germ-cells or sporozoites produced by budding from a zygotomere (see
below).
zygotomere1899
one of a number of cells formed by segmentation of a zygote in the malaria parasite or other
Sporozoa.
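This is the kind of result I get; I didn't see the regex.
How is the performance? The file is about 7 GB, and Python + bs took me several hours (my code is very crude).
Was this produced with regex?
No idea... I know nothing about this.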
I did it with bs. Which parser are you using? Try lxml.
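With regex I can only do it in two steps: first match the thesaurus tag, then extract the text.
import re
# thesaurus_tag matches the whole thesaurus div; unwanted matches any tag so it can be stripped from the match.
thesaurus_tag = re.compile(r'<div\s+class=\"thesaurus\".*?>\s+<div\s+class=\"entry\">.*?</div>\s+</div>', re.DOTALL)
unwanted = re.compile(r"<(?:\"[^\"]*\"['\"]*|'[^']*'['\"]*|[^'\">])+>", re.DOTALL)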
Or try matching the thesaurus tags first and then using bs to extract the text; there are actually only a few thesaurus tags.
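A rough standard-library sketch of that two-step idea. It is not the code from this thread: the names `thesaurus_inner` and `to_text` are hypothetical, and it assumes the class attribute is written exactly as `class="thesaurus"`. Since one regex cannot balance nested divs, step 1 finds the opening tag and then counts `<div>` and `</div>` tokens; step 2 strips the tags from the fragment (BeautifulSoup could be swapped in for that step).

```python
import re
from html import unescape

OPEN_THESAURUS = re.compile(r'<div\b[^>]*\bclass="thesaurus"[^>]*>')
DIV_TOKEN = re.compile(r'<div\b[^>]*>|</div\s*>')
ANY_TAG = re.compile(r'<[^>]+>')

def thesaurus_inner(html):
    """Step 1: yield the inner HTML of each <div class="thesaurus"> block."""
    pos = 0
    while True:
        start = OPEN_THESAURUS.search(html, pos)
        if start is None:
            return
        depth = 1  # we are inside the thesaurus div itself
        for tok in DIV_TOKEN.finditer(html, start.end()):
            depth += -1 if tok.group().startswith("</") else 1
            if depth == 0:  # found the matching closing tag
                yield html[start.end():tok.start()]
                pos = tok.end()
                break
        else:
            return  # unbalanced markup: stop rather than guess

def to_text(fragment):
    """Step 2: replace tags with newlines, decode entities, drop blank lines."""
    lines = unescape(ANY_TAG.sub("\n", fragment)).splitlines()
    return "\n".join(line.strip() for line in lines if line.strip())

# Hypothetical sample input.
sample = (
    '<p>x</p><div class="thesaurus" type="tabt">'
    '<div class="beigeBox">a &gt; b</div>'
    '<div class="entry">zygote</div></div><div>skip</div>'
)
for fragment in thesaurus_inner(sample):
    print(to_text(fragment))
```

This only scans for div tokens, so it should stay cheap when thesaurus blocks are rare, but it will be thrown off by `<div` text inside comments or scripts.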
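I just used html.parser without thinking. Is it really that much slower? That was careless of me: 7 GB took several hours. I'll try your approach next time.
Yes, much slower.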
pip install lxml
Then in your code:
soup = BeautifulSoup(html, 'lxml')
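For context, a small sketch of how the parser choice could be wired in, assuming `bs4` is installed. The fallback to `html.parser` when lxml is missing and the helper name `thesaurus_text` are additions for illustration, not something from this thread.

```python
from importlib.util import find_spec

from bs4 import BeautifulSoup

# Use lxml when it is installed; otherwise fall back to the built-in parser.
PARSER = "lxml" if find_spec("lxml") else "html.parser"

def thesaurus_text(html):
    """Return the text of every <div class="thesaurus">, one string per block."""
    soup = BeautifulSoup(html, PARSER)
    return [
        div.get_text("\n", strip=True)
        for div in soup.find_all("div", class_="thesaurus")
    ]

# Hypothetical sample input.
sample = (
    '<div class="thesaurus" type="tabt">'
    '<div class="beigeBox">a &gt; b</div>'
    '<div class="entry">zygote</div></div>'
)
print(thesaurus_text(sample))
```

Timing both parsers on a small slice of the real file before running all 7 GB would show how large the difference actually is for this data.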
Learned something. I was being too clever for my own good; everyone else online seems to use lxml too.