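Regex help: nested divs

I think someone asked this before, but I couldn't find the thread.
......<div class="thesaurus" type="tabt"><div class="beigeBox">......</div><div class="entry">......</div></div>......
How can I get everything inside the div with class="thesaurus"? Thanks a lot.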


This is hard to do with regex alone. I'd suggest Python + bs (BeautifulSoup).
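To show why a parser copes with nested divs more easily than a single regex, here is a minimal sketch. It swaps in Python's standard-library `html.parser` instead of bs, and the names `ThesaurusExtractor` and `extract_thesaurus` are made up for illustration. It counts div depth so that the closing `</div>` of the outer block is matched correctly.

```python
from html.parser import HTMLParser


class ThesaurusExtractor(HTMLParser):
    """Collect the inner HTML of every <div class="thesaurus"> block."""

    def __init__(self):
        # Keep entities such as &amp; unconverted so the markup is preserved.
        super().__init__(convert_charrefs=False)
        self.depth = 0    # div nesting depth inside the current target block
        self.parts = []   # raw pieces of the block being captured
        self.blocks = []  # finished blocks

    def handle_starttag(self, tag, attrs):
        if self.depth:
            if tag == "div":
                self.depth += 1
            self.parts.append(self.get_starttag_text())
        elif tag == "div" and "thesaurus" in (dict(attrs).get("class") or "").split():
            self.depth = 1  # entered the target div; its own tag is not recorded

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied through unchanged.
        if self.depth:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if not self.depth:
            return
        if tag == "div":
            self.depth -= 1
            if self.depth == 0:  # this </div> closes the target block
                self.blocks.append("".join(self.parts))
                self.parts = []
                return
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)

    def handle_entityref(self, name):
        if self.depth:
            self.parts.append(f"&{name};")

    def handle_charref(self, name):
        if self.depth:
            self.parts.append(f"&#{name};")


def extract_thesaurus(html):
    parser = ThesaurusExtractor()
    parser.feed(html)
    parser.close()
    return parser.blocks
```

Calling `extract_thesaurus(html)` on the sample from the question returns a list with one string per thesaurus block. With bs the same idea would be a lookup by class name, but the depth counter here makes the nesting logic explicit.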


Sending you a little :heart:

I see. I searched online for ages and couldn't find anything.

Find an XML parsing tool or a related library; this is easy to handle that way.


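Maybe it's because the file I posted wasn't complete, but the regex you gave doesn't work. I've uploaded the file; could you take another look?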

code.txt (18.1 KB)


the world > life > biology > biological processes > procreation or reproduction > reproductive substances or cells > [noun] > zygote or syngamete (12)
zygote1891
A body of living protoplasm, as a cell or cell-nucleus, formed by the conjugation or fusion of two such bodies in reproduction; a zygospore, or any…

zygotoid1891
a multinucleate form of zygote in certain fungi (see quot.).

syngametea1900
The cell produced by the fusion of two gametes in reproduction.

ookinete1902
A zygote capable of autonomous movement, esp. as a stage in the life cycle of some parasitic protozoa.

the world > life > biology > biological processes > procreation or reproduction > reproductive substances or cells > [adjective] > zygote or syngamete (15)
syngamic1904

syngamous1904

zygotic1909
pertaining to or of the nature of a zygote, produced or characterized by zygosis.

the world > life > biology > biological processes > procreation or reproduction > reproductive substances or cells > [adverb] > zygote (1)
zygotically1915
in the zygote; in terms of the zygote.

the world > life > biology > biological processes > procreation or reproduction > reproductive substances or cells > [noun] > zygote or syngamete > zygotomere, etc. (3)
zoozygosphere1880
(in algae and fungi) a motile gamete; also called planogamete, zoogamete.

zygotoblast1899
one of a number of germ-cells or sporozoites produced by budding from a zygotomere (see below).

zygotomere1899
one of a number of cells formed by segmentation of a zygote in the malaria parasite or other Sporozoa.



the world > life > biology > biological processes > procreation or reproduction > reproductive substances or cells > [noun] > zygote or syngamete (12)
zygote1891
A body of living protoplasm, as a cell or cell-nucleus, formed by the conjugation or fusion of two such bodies in reproduction; a zygospore, or any…

zygotoid1891
a multinucleate form of zygote in certain fungi (see quot.).

syngametea1900
The cell produced by the fusion of two gametes in reproduction.

ookinete1902
A zygote capable of autonomous movement, esp. as a stage in the life cycle of some parasitic protozoa.

the world > life > biology > biological processes > procreation or reproduction > reproductive substances or cells > [noun] > zygote or syngamete > zygotomere, etc. (3)
zoozygosphere1880
(in algae and fungi) a motile gamete; also called planogamete, zoogamete.

zygotoblast1899
one of a number of germ-cells or sporozoites produced by budding from a zygotomere (see below).

zygotomere1899
one of a number of cells formed by segmentation of a zygote in the malaria parasite or other Sporozoa.



This is the result I get. I didn't see the regex, though :grinning:

You can query it directly with XPath.
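As a rough illustration of the XPath idea, here is a sketch using the standard-library `xml.etree.ElementTree`, which supports only a limited XPath subset. It assumes the fragment is well-formed XML, which real HTML often is not. The function name `thesaurus_texts` and the synthetic `<root>` wrapper are assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def thesaurus_texts(xml_fragment):
    """Return the text content of each <div class="thesaurus"> element."""
    # Wrap the fragment in a synthetic root so several top-level divs parse.
    root = ET.fromstring(f"<root>{xml_fragment}</root>")
    # The attribute predicate matches the exact value class="thesaurus".
    divs = root.findall(".//div[@class='thesaurus']")
    # itertext() walks the nested elements and yields their text in order.
    return ["".join(div.itertext()) for div in divs]
```

The nesting needs no special treatment here, since the parser has already built the tree. For markup that is not well-formed, a more lenient HTML parser would be needed instead.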

How fast is that? The data is about 7 GB, and python+bs took me several hours (my code is pretty crude).

Was this done with regex?

No idea... I don't know the first thing about this. :smiling_face_with_tear:


I did it with bs. Which parser are you using? Try lxml.

With regex, I can only do it in two steps: first match the thesaurus tag, then extract the text.

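import re
# Step 1: match a thesaurus div whose first child is the entry div (whitespace is required between the tags)
thesaurus_tag = re.compile(r'<div\s+class=\"thesaurus\".*?>\s+<div\s+class=\"entry\">.*?</div>\s+</div>', re.DOTALL)
# Step 2: match any tag, allowing '>' inside quoted attribute values, so the tags can be stripped to leave the text
unwanted = re.compile(r"<(?:\"[^\"]*\"['\"]*|'[^']*'['\"]*|[^'\">])+>", re.DOTALL)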

Or try matching the thesaurus tags first and then extracting the text with bs. There really aren't that many thesaurus tags.
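The two-step idea can be sketched as follows. This is only one possible variant and not the patterns posted above: a regex locates each opening thesaurus tag, a depth counter over `<div>` and `</div>` finds the matching close, and a second pass strips the tags. The names `thesaurus_blocks` and `strip_tags` are hypothetical, and the sketch assumes the class attribute is written exactly as `class="thesaurus"` and that no attribute value contains `>`.

```python
import re

# Opening tag of the target div; assumes double quotes and this exact class value.
OPEN_THESAURUS = re.compile(r'<div\b[^>]*\bclass="thesaurus"[^>]*>')
# Any opening or closing div tag; group 1 is "/" for a closing tag.
DIV_TAG = re.compile(r'<(/?)div\b[^>]*>')
# Any tag at all; used only to strip markup (assumes no ">" inside attributes).
ANY_TAG = re.compile(r'<[^>]+>')


def thesaurus_blocks(html):
    """Yield the inner HTML of each <div class="thesaurus"> block."""
    pos = 0
    while True:
        start = OPEN_THESAURUS.search(html, pos)
        if start is None:
            return
        depth = 1  # we are inside the thesaurus div
        for tag in DIV_TAG.finditer(html, start.end()):
            depth += -1 if tag.group(1) else 1
            if depth == 0:  # this </div> closes the thesaurus div
                yield html[start.end():tag.start()]
                pos = tag.end()
                break
        else:
            return  # unbalanced markup: no matching </div> was found


def strip_tags(fragment):
    """Remove all tags from a fragment, leaving only the text."""
    return ANY_TAG.sub("", fragment)
```

Usage would look like `[strip_tags(b) for b in thesaurus_blocks(html)]`. The regex never has to describe the nesting itself; it only finds tag boundaries, and the counter does the balancing.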

I just casually used html.parser. Is it really that much slower? That was careless of me; 7 GB took several hours :smiling_face_with_tear: I'll try your approach next time.

Much slower.

pip install lxml

Then, in your code:

from bs4 import BeautifulSoup  # html below is the markup string to parse
soup = BeautifulSoup(html, 'lxml')

Learned something. I was being too clever for my own good; I see everyone else online uses lxml too :sweat_smile: