Help wanted: how do I scrape the sidebar table of contents on CNKI? (python, web scraping)

I (the OP) posted an earlier thread, full of ambition and ready to take on something big.

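Now I've hit a snag: for some dictionaries the table of contents is important and has to be scraped. Opening and saving the items by hand one at a time takes too much time and effort. How can I scrape the complete table of contents in batch?
Any method is fine: Python code, or any software that can capture it.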

This part of the content is publicly accessible. For example:
https://gongjushu.cnki.net/rbook/bookdetail?bookid=R201211157
I have already prepared the list of addresses like the one above.

I took a look: it is just JSON data. Split the 006001001 in the URL below into three groups of three digits each and iterate over the groups.
https://t.cnki.net/rbook-api/v1/book/R201211157/catalog?code=006001001&type=&size=50&start=1
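
To make the idea concrete, here is a minimal sketch of splitting the nine-digit `code` into three 3-digit levels and rebuilding it. The helper names `split_code`, `make_code` and `catalog_url` are made up for illustration; only the URL pattern comes from the link above.

```python
def split_code(code: str) -> tuple:
    """Split a 9-digit catalog code into three integer levels."""
    if len(code) != 9 or not code.isdigit():
        raise ValueError(f"expected 9 digits, got {code!r}")
    return int(code[0:3]), int(code[3:6]), int(code[6:9])


def make_code(i: int, j: int, k: int) -> str:
    """Join three levels back into a zero-padded 9-digit code."""
    return f"{i:03d}{j:03d}{k:03d}"


def catalog_url(book_id: str, code: str, size: int = 50) -> str:
    """Build a catalog URL following the pattern of the link above."""
    return (
        f"https://t.cnki.net/rbook-api/v1/book/{book_id}/catalog"
        f"?code={code}&type=&size={size}&start=1"
    )


i, j, k = split_code("006001001")
print(i, j, k)                 # the three levels as integers
print(make_code(i, j, k + 1))  # the next code at the third level
print(catalog_url("R201211157", make_code(i, j, k)))
```

Iterating then amounts to incrementing one level at a time and requesting the resulting URL until a level comes back empty.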

I don't follow. Could you write some code that I can run directly?

import requests,time

agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36 Edg/106.0.1370.42"
headers = {"User-Agent": agent}

i, j, k = 1, 1, 1
while True:
    ii = str(i).zfill(3)
    jj = str(j).zfill(3)
    kk = str(k).zfill(3)
    url = f'https://t.cnki.net/rbook-api/v1/book/R201211157/catalog?code={ii}{jj}{kk}&type=&size=50&start=1'
    r = requests.get(url, headers=headers)
    text = r.json()
    n = text['data']['total']
    if n != 0:
        k += 1
        data = text['data']['data']
        with open('d:/fydcd.txt','a',encoding='utf-8') as f:
            for ddd in data:
                f.write(ddd['title']+' ')
            f.write('\n')
        time.sleep(5)
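    elif k != 1:
        # No data for this k, but earlier k values had data: move on to the next j.
        k = 1
        j += 1
    elif k == 1:
        # No data even for k == 1: this j is exhausted, move on to the next i.
        i += 1
        k = 1
        j = 1
    elif k == 1 and j == 1:
        # Never reached: the `elif k == 1` branch above already matches,
        # so this loop does not terminate on its own.
        break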

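Try it and see whether it works. I ran it without a delay at first and got my IP banned, so the code now includes a delay. The output is saved to fydcd.txt on drive D.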

Thanks, I'll give it a try first.

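But isn't 《反义大词典》 (the antonym dictionary) already available online? You'd be better off concentrating on 《方言大词典》 (the dialect dictionary).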

Hold on, I overlooked something.
Change the url as follows (larger `size`), otherwise the data may be incomplete.

url = f'https://t.cnki.net/rbook-api/v1/book/R201211157/catalog?code={ii}{jj}{kk}&type=&size=200&start=1'
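
One way to tell whether a given `size` is large enough is to compare the reported `total` with the number of items actually returned. This is only a sketch: it assumes `total` counts all entries under the code rather than just those in the current response, which the thread does not confirm. `is_truncated` is a hypothetical helper and the payloads are hand-made, not real API output.

```python
def is_truncated(payload: dict) -> bool:
    """Return True if the response seems to hold fewer items than `total`."""
    body = payload.get("data") or {}
    total = body.get("total", 0)
    items = body.get("data") or []
    return total > len(items)


# Hand-made payloads for illustration only.
full = {"data": {"total": 2, "data": [{"title": "a"}, {"title": "b"}]}}
cut = {"data": {"total": 3, "data": [{"title": "a"}, {"title": "b"}]}}
print(is_truncated(full))  # False
print(is_truncated(cut))   # True
```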

I don't think this one is available; I haven't seen it anywhere.

One more revision to the code: the previous version never exits the loop at the end.

import requests,time

agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36 Edg/106.0.1370.42"
headers = {"User-Agent": agent}
i, j, k = 1, 1, 1

while i:
    while j:
        while k:
            ii = str(i).zfill(3)
            jj = str(j).zfill(3)
            kk = str(k).zfill(3)
            url = f'https://t.cnki.net/rbook-api/v1/book/R201211157/catalog?code={ii}{jj}{kk}&type=&size=150&start=1'
            r = requests.get(url, headers=headers)
            text = r.json()
            n = text['data']['total']
            print("n:", n, "i:", i, "j:", j, "k:", k)
            if n != 0:
                k += 1
                data = text['data']['data']
                with open('d:/fydcd1.txt','a',encoding='utf-8') as f:
                    for ddd in data:
                        f.write(ddd['title']+' ')
                    f.write('\n')
                time.sleep(1)
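            else:
                # Empty result: no more third-level entries under this ii/jj prefix.
                break
        if k>1:
            # At least one third-level code had data, so try the next second-level code.
            j+=1
            k=1
        else:
            # Even k == 1 was empty: this top-level section is exhausted.
            break
    if j>1:
        i+=1
        j=1
        k=1
    else:
        # Even j == 1 was empty: no more top-level sections, so stop.
        break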

In any case, I do have 《反义大词典》.

反义词大词典.7z (1.9 MB)
I took a look; it is probably something I made a long time ago.

Thanks. The Python script has been running for an hour and this table of contents still isn't finished. It's a bit slow.

Looking at the scraped data, it is all plain text:

○阿姨 ●叔叔 
○腌臜 ○肮脏 ○鏖糟 ○埋汰 ○龌龊 ○污秽 ○污浊 ●纯净 ●干净 ●光洁 ●洁净 ●明净 ●清洁 ●清爽 ●卫生 
○腌臜 ○别扭 ○不顺 ○逆心 ●合心 ●合意 ●顺心 ●顺意 ●中意 
○哀 ○悲 ○愁 ○忧 ●欢 ●快 ●乐 ●喜 ●愉 ●悦 

What I need is the original page markup, like this:

<div data-v-c0aa28b4="" class="leftSide" style="height: 2570px;"><div data-v-c0aa28b4="" class="category">目录</div><!----><div data-v-a6573e50="" data-v-c0aa28b4="" id="literatureClassify" class="mod-literatureClassify" style="height: 2530px;"><div data-v-a6573e50="" class="treeStructure"><div data-v-5da0ad47="" data-v-a6573e50="" id="TreeItem" class="mod-myTree"><ul data-v-5da0ad47=""><li data-v-5da0ad47=""><div data-v-5da0ad47="" class="lineDetail clearfix active" style="padding-left: 14px;"><span data-v-5da0ad47="" class="leftIcon" style="left: 5px;"> + </span><!----><!----><span data-v-5da0ad47=""><div role="tooltip" id="el-popover-2874" aria-hidden="true" class="el-popover el-popper popover-wrap" tabindex="0" style="display: none;"><!----><span data-v-5da0ad47="">A</span></div><span class="el-popover__reference-wrapper"><span data-v-5da0ad47="" class="el-popover__reference" aria-describedby="el-popover-2874" tabindex="0"><span data-v-5da0ad47="" class="navi-search">A</span></span></span></span></div><!----><!----></li><li data-v-5da0ad47=""><div data-v-5da0ad47="" class="lineDetail clearfix" style="padding-left: 14px;"><span data-v-5da0ad47="" class="leftIcon" style="left: 5px;"> + </span><!----><!----><span data-v-5da0ad47=""><div role="tooltip" id="el-popover-2254" aria-hidden="true" class="el-popover el-popper popover-wrap" tabindex="0" style="display: none;"><!----><span data-v-5da0ad47="">B</span></div><span class="el-popover__reference-wrapper"><span data-v-5da0ad47="" class="el-popover__reference" aria-describedby="el-popover-2254" tabindex="0"><span data-v-5da0ad47="" class="navi-search">B</span></span></span></span></div><!----><!----></li><li data-v-5da0ad47=""><div data-v-5da0ad47="" class="lineDetail clearfix" style="padding-left: 14px;"><span data-v-5da0ad47="" class="leftIcon" style="left: 5px;"> + </span><!----><!----><span data-v-5da0ad47=""><div role="tooltip" id="el-popover-4957" aria-hidden="true" class="el-popover el-popper popover-wrap" 
tabindex="0" style="display: none;"><!----><span data-v-5da0ad47="">C</span></div><span class="el-popover__reference-wrapper"><span data-v-5da0ad47="" class="el-popover__reference" aria-describedby="el-popover-4957" tabindex="0"><span data-v-5da0ad47="" class="navi-search">C</span></span></span></span></div><!----><!----></li><li data-v-5da0ad47=""><div data-v-5da0ad47="" class="lineDetail clearfix" style="padding-left: 14px;"><span data-v-5da0ad47="" class="leftIcon" style="left: 5px;"> + </span><!----><!----><span data-v-5da0ad47=""><div role="tooltip" id="el-popover-1565" aria-hidden="true" class="el-popover el-popper popover-wrap" tabindex="0" style="display: none;"><!----><span data-v-5da0ad47="">D</span></div><span class="el-popover__reference-wrapper"><span data-v-5da0ad47="" class="el-popover__reference" aria-describedby="el-popover-1565" tabindex="0"><span data-v-5da0ad47="" class="navi-search">D</span></span></span></span></div><!----><!----></li><li data-v-5da0ad47=""><div data-v-5da0ad47="" class="lineDetail clearfix" style="padding-left: 14px;"><span data-v-5da0ad47="" class="leftIcon" style="left: 5px;"> + </span><!----><!----><span data-v-5da0ad47=""><div role="tooltip" id="el-popover-9059" aria-hidden="true" class="el-popover el-popper popover-wrap" tabindex="0" style="display: none;"><!----><span data-v-5da0ad47="">E</span></div><span class="el-popover__reference-wrapper"><span data-v-5da0ad47="" class="el-popover__reference" aria-describedby="el-popover-9059" tabindex="0"><span data-v-5da0ad47="" class="navi-search">E</span></span></span></span></div><!----><!----></li><li data-v-5da0ad47=""><div data-v-5da0ad47="" class="lineDetail clearfix" style="padding-left: 14px;"><span data-v-5da0ad47="" class="leftIcon" style="left: 5px;"> + </span><!----><!----><span data-v-5da0ad47=""><div role="tooltip" id="el-popover-4372" aria-hidden="true" class="el-popover el-popper popover-wrap" tabindex="0" style="display: none;"><!----><span 
data-v-5da0ad47="">F</span></div><span class="el-popover__reference-wrapper"><span data-v-5da0ad47="" class="el-popover__reference" aria-describedby="el-popover-4372" tabindex="0"><span data-v-5da0ad47="" class="navi-search">F</span></span></span></span></div><!----><!----></li><li data-v-5da0ad47=""><div data-v-5da0ad47="" class="lineDetail clearfix" style="padding-left: 14px;"><span data-v-5da0ad47="" class="leftIcon" style="left: 5px;"> + </span><!----><!----><span data-v-5da0ad47=""><div role="tooltip" id="el-popover-492" aria-hidden="true" class="el-popover el-popper popover-wrap" tabindex="0" style="display: none;"><!----><span data-v-5da0ad47="">G</span></div><span class="el-popover__reference-wrapper"><span data-v-5da0ad47="" class="el-popover__reference" aria-describedby="el-popover-492" tabindex="0"><span data-v-5da0ad47="" class="navi-search">G</span></span></span></span></div><!----><!----></li><li data-v-5da0ad47=""><div data-v-5da0ad47="" class="lineDetail clearfix" style="padding-left: 14px;"><span data-v-5da0ad47="" class="leftIcon" style="left: 5px;"> + </span><!----><!----><span data-v-5da0ad47=""><div role="tooltip" id="el-popover-7850" aria-hidden="true" class="el-popover el-popper popover-wrap" tabindex="0" style="display: none;"><!----><span data-v-5da0ad47="">H</span></div><span class="el-popover__reference-wrapper"><span data-v-5da0ad47="" class="el-popover__reference" aria-describedby="el-popover-7850" tabindex="0"><span data-v-5da0ad47="" class="navi-search">H</span></span></span></span></div><!----><!----></li><li data-v-5da0ad47=""><div data-v-5da0ad47="" class="lineDetail clearfix" style="padding-left: 14px;"><span data-v-5da0ad47="" class="leftIcon" style="left: 5px;"> + </span><!----><!----><span data-v-5da0ad47=""><div role="tooltip" id="el-popover-4676" aria-hidden="true" class="el-popover el-popper popover-wrap" tabindex="0" style="display: none;"><!----><span data-v-5da0ad47="">J</span></div><span 
class="el-popover__reference-wrapper"><span data-v-5da0ad47="" class="el-popover__reference" aria-describedby="el-popover-4676" tabindex="0"><span data-v-5da0ad47="" class="navi-search">J</span></span></span></span></div><!----><!----></li><li data-v-5da0ad47=""><div data-v-5da0ad47="" class="lineDetail clearfix" style="padding-left: 14px;"><span data-v-5da0ad47="" class="leftIcon" style="left: 5px;"> + </span><!----><!----><span data-v-5da0ad47=""><div role="tooltip" id="el-popover-6339" aria-hidden="true" class="el-popover el-popper popover-wrap" tabindex="0" style="display: none;"><!----><span data-v-5da0ad47="">K</span></div><span class="el-popover__reference-wrapper"><span data-v-5da0ad47="" class="el-popover__reference" aria-describedby="el-popover-6339" tabindex="0"><span data-v-5da0ad47="" class="navi-search">K</span></span></span></span></div><!----><!----></li><li data-v-5da0ad47=""><div data-v-5da0ad47="" class="lineDetail clearfix" style="padding-left: 14px;"><span data-v-5da0ad47="" class="leftIcon" style="left: 5px;"> + </span><!----><!----><span data-v-5da0ad47=""><div role="tooltip" id="el-popover-6513" aria-hidden="true" class="el-popover el-popper popover-wrap" tabindex="0" style="display: none;"><!----><span data-v-5da0ad47="">L</span></div><span class="el-popover__reference-wrapper"><span data-v-5da0ad47="" class="el-popover__reference" aria-describedby="el-popover-6513" tabindex="0"><span data-v-5da0ad47="" class="navi-search">L</span></span></span></span></div><!----><!----></li><li data-v-5da0ad47=""><div data-v-5da0ad47="" class="lineDetail clearfix" style="padding-left: 14px;"><span data-v-5da0ad47="" class="leftIcon" style="left: 5px;"> + </span><!----><!----><span data-v-5da0ad47=""><div role="tooltip" id="el-popover-2420" aria-hidden="true" class="el-popover el-popper popover-wrap" tabindex="0" style="display: none;"><!----><span data-v-5da0ad47="">M</span></div><span class="el-popover__reference-wrapper"><span data-v-5da0ad47="" 
class="el-popover__reference" aria-describedby="el-popover-2420" tabindex="0"><span data-v-5da0ad47="" class="navi-search">M</span></span></span></span></div><!----><!----></li><li data-v-5da0ad47=""><div data-v-5da0ad47="" class="lineDetail clearfix" style="padding-left: 14px;"><span data-v-5da0ad47="" class="leftIcon" style="left: 5px;"> + </span><!----><!----><span data-v-5da0ad47=""><div role="tooltip" id="el-popover-9479" aria-hidden="true" class="el-popover el-popper popover-wrap" tabindex="0" style="display: none;"><!----><span data-v-5da0ad47="">N</span></div><span class="el-popover__reference-wrapper"><span data-v-5da0ad47="" class="el-popover__reference" aria-describedby="el-popover-9479" tabindex="0"><span data-v-5da0ad47="" class="navi-search">N</span></span></span></span></div><!----><!----></li><li data-v-5da0ad47=""><div data-v-5da0ad47="" class="lineDetail clearfix" style="padding-left: 14px;"><span data-v-5da0ad47="" class="leftIcon" style="left: 5px;"> + </span><!----><!----><span data-v-5da0ad47=""><div role="tooltip" id="el-popover-760" aria-hidden="true" class="el-popover el-popper popover-wrap" tabindex="0" style="display: none;"><!----><span data-v-5da0ad47="">O</span></div><span class="el-popover__reference-wrapper"><span data-v-5da0ad47="" class="el-popover__reference" aria-describedby="el-popover-760" tabindex="0"><span data-v-5da0ad47="" class="navi-search">O</span></span></span></span></div><!----><!----></li><li data-v-5da0ad47=""><div data-v-5da0ad47="" class="lineDetail clearfix" style="padding-left: 14px;"><span data-v-5da0ad47="" class="leftIcon" style="left: 5px;"> + </span><!----><!----><span data-v-5da0ad47=""><div role="tooltip" id="el-popover-3931" aria-hidden="true" class="el-popover el-popper popover-wrap" tabindex="0" style="display: none;"><!----><span data-v-5da0ad47="">P</span></div><span class="el-popover__reference-wrapper"><span data-v-5da0ad47="" class="el-popover__reference" aria-describedby="el-popover-3931" 
tabindex="0"><span data-v-5da0ad47="" class="navi-search">P</span></span></span></span></div><!----><!----></li><li data-v-5da0ad47=""><div data-v-5da0ad47="" class="lineDetail clearfix" style="padding-left: 14px;"><span data-v-5da0ad47="" class="leftIcon" style="left: 5px;"> + </span><!----><!----><span data-v-5da0ad47=""><div role="tooltip" id="el-popover-973" aria-hidden="true" class="el-popover el-popper popover-wrap" tabindex="0" style="display: none;"><!----><span data-v-5da0ad47="">Q</span></div><span class="el-popover__reference-wrapper"><span data-v-5da0ad47="" class="el-popover__reference" aria-describedby="el-popover-973" tabindex="0"><span data-v-5da0ad47="" class="navi-search">Q</span></span></span></span></div><!----><!----></li><li data-v-5da0ad47=""><div data-v-5da0ad47="" class="lineDetail clearfix" style="padding-left: 14px;"><span data-v-5da0ad47="" class="leftIcon" style="left: 5px;"> + </span><!----><!----><span data-v-5da0ad47=""><div role="tooltip" id="el-popover-9564" aria-hidden="true" class="el-popover el-popper popover-wrap" tabindex="0" style="display: none;"><!----><span data-v-5da0ad47="">R</span></div><span class="el-popover__reference-wrapper"><span data-v-5da0ad47="" class="el-popover__reference" aria-describedby="el-popover-9564" tabindex="0"><span data-v-5da0ad47="" class="navi-search">R</span></span></span></span></div><!----><!----></li><li data-v-5da0ad47=""><div data-v-5da0ad47="" class="lineDetail clearfix" style="padding-left: 14px;"><span data-v-5da0ad47="" class="leftIcon" style="left: 5px;"> + </span><!----><!----><span data-v-5da0ad47=""><div role="tooltip" id="el-popover-1503" aria-hidden="true" class="el-popover el-popper popover-wrap" tabindex="0" style="display: none;"><!----><span data-v-5da0ad47="">S</span></div><span class="el-popover__reference-wrapper"><span data-v-5da0ad47="" class="el-popover__reference" aria-describedby="el-popover-1503" tabindex="0"><span data-v-5da0ad47="" 
class="navi-search">S</span></span></span></span></div><!----><!----></li><li data-v-5da0ad47=""><div data-v-5da0ad47="" class="lineDetail clearfix" style="padding-left: 14px;"><span data-v-5da0ad47="" class="leftIcon" style="left: 5px;"> + </span><!----><!----><span data-v-5da0ad47=""><div role="tooltip" id="el-popover-8079" aria-hidden="true" class="el-popover el-popper popover-wrap" tabindex="0" style="display: none;"><!----><span data-v-5da0ad47="">T</span></div><span class="el-popover__reference-wrapper"><span data-v-5da0ad47="" class="el-popover__reference" aria-describedby="el-popover-8079" tabindex="0"><span data-v-5da0ad47="" class="navi-search">T</span></span></span></span></div><!----><!----></li><li data-v-5da0ad47=""><div data-v-5da0ad47="" class="lineDetail clearfix" style="padding-left: 14px;"><span data-v-5da0ad47="" class="leftIcon" style="left: 5px;"> + </span><!----><!----><span data-v-5da0ad47=""><div role="tooltip" id="el-popover-1138" aria-hidden="true" class="el-popover el-popper popover-wrap" tabindex="0" style="display: none;"><!----><span data-v-5da0ad47="">W</span></div><span class="el-popover__reference-wrapper"><span data-v-5da0ad47="" class="el-popover__reference" aria-describedby="el-popover-1138" tabindex="0"><span data-v-5da0ad47="" class="navi-search">W</span></span></span></span></div><!----><!----></li><li data-v-5da0ad47=""><div data-v-5da0ad47="" class="lineDetail clearfix" style="padding-left: 14px;"><span data-v-5da0ad47="" class="leftIcon" style="left: 5px;"> + </span><!----><!----><span data-v-5da0ad47=""><div role="tooltip" id="el-popover-5625" aria-hidden="true" class="el-popover el-popper popover-wrap" tabindex="0" style="display: none;"><!----><span data-v-5da0ad47="">X</span></div><span class="el-popover__reference-wrapper"><span data-v-5da0ad47="" class="el-popover__reference" aria-describedby="el-popover-5625" tabindex="0"><span data-v-5da0ad47="" 
class="navi-search">X</span></span></span></span></div><!----><!----></li><li data-v-5da0ad47=""><div data-v-5da0ad47="" class="lineDetail clearfix" style="padding-left: 14px;"><span data-v-5da0ad47="" class="leftIcon" style="left: 5px;"> + </span><!----><!----><span data-v-5da0ad47=""><div role="tooltip" id="el-popover-6971" aria-hidden="true" class="el-popover el-popper popover-wrap" tabindex="0" style="display: none;"><!----><span data-v-5da0ad47="">Y</span></div><span class="el-popover__reference-wrapper"><span data-v-5da0ad47="" class="el-popover__reference" aria-describedby="el-popover-6971" tabindex="0"><span data-v-5da0ad47="" class="navi-search">Y</span></span></span></span></div><!----><!----></li><li data-v-5da0ad47=""><div data-v-5da0ad47="" class="lineDetail clearfix" style="padding-left: 14px;"><span data-v-5da0ad47="" class="leftIcon" style="left: 5px;"> + </span><!----><!----><span data-v-5da0ad47=""><div role="tooltip" id="el-popover-6042" aria-hidden="true" class="el-popover el-popper popover-wrap" tabindex="0" style="display: none;"><!----><span data-v-5da0ad47="">Z</span></div><span class="el-popover__reference-wrapper"><span data-v-5da0ad47="" class="el-popover__reference" aria-describedby="el-popover-6042" tabindex="0"><span data-v-5da0ad47="" class="navi-search">Z</span></span></span></span></div><!----><!----></li></ul></div><!----></div></div><!----></div>

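What I have in mind:
expand everything, that is every "+" and every "查看更多" (view more), and wait until all the data has loaded;
then extract the full HTML source of the expanded table of contents;
then do the same in batch for the whole list of addresses.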
目录地址.zip (57.9 KB)

Could you help write the code for this as well? Thanks :pray:

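I thought you had already scraped the dictionary and only wanted the table of contents.
The table-of-contents data contains no links to the entries; that is not how you scrape a dictionary.
If you have an account, I can scrape it for you or dig out the entry links. As things stand, I can't.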

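The dictionary has already been scraped. I don't need the entries.
I only want the table of contents: a complete, formatted table-of-contents data file.
I need the tables of contents of many books, so what I'm after is batch scraping of tables of contents, not of entries.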
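
For the batch case, a rough sketch of turning a list of book detail URLs into catalog API URLs. It assumes that the `bookid` in the detail URL is the same identifier the catalog API expects (true for R201211157 in this thread) and that other books use the same code scheme starting at 001001001, which is not verified. `book_id_from_url` and `first_catalog_url` are hypothetical helpers.

```python
from urllib.parse import urlparse, parse_qs


def book_id_from_url(detail_url: str) -> str:
    """Pull the `bookid` query parameter out of a book detail URL."""
    query = parse_qs(urlparse(detail_url).query)
    values = query.get("bookid")
    if not values:
        raise ValueError(f"no bookid in {detail_url!r}")
    return values[0]


def first_catalog_url(book_id: str, size: int = 150) -> str:
    """Build the catalog URL for the first code (assumed to be 001001001)."""
    return (
        f"https://t.cnki.net/rbook-api/v1/book/{book_id}/catalog"
        f"?code=001001001&type=&size={size}&start=1"
    )


# In practice this list would be read from the prepared address file.
detail_urls = ["https://gongjushu.cnki.net/rbook/bookdetail?bookid=R201211157"]
for detail_url in detail_urls:
    book_id = book_id_from_url(detail_url)
    print(book_id, first_catalog_url(book_id))
```

Each book ID could then be fed into the same nested loop used for a single book, with a separate output file per book.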

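If you already have the dictionary, I don't see what you need the table of contents for.
As for a table-of-contents data file: that markup is just layout applied on top of the JSON data, which is of no use for building a dictionary.
Since you already have the dictionary, just clean it up and organize it.
Are you in the group? It's easier to talk there; I'll look for you. I can't send you a private message yet.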
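
On the point that the sidebar is only a layout over the JSON: if an HTML-style table of contents is still wanted, it could in principle be rebuilt from the scraped titles rather than copied from the page. A minimal sketch, assuming the titles have already been collected into nested dicts; the `children` key and `render_tree` are made up for illustration and are not fields or functions of the CNKI API.

```python
from html import escape


def render_tree(nodes: list) -> str:
    """Render nested {'title', 'children'} dicts as a nested <ul> list."""
    if not nodes:
        return ""
    parts = ["<ul>"]
    for node in nodes:
        children = render_tree(node.get("children", []))
        parts.append(f"<li>{escape(node['title'])}{children}</li>")
    parts.append("</ul>")
    return "".join(parts)


# Hand-made sample; the 'children' key is an assumption, not an API field.
tree = [{"title": "A", "children": [{"title": "阿姨"}]}, {"title": "B"}]
print(render_tree(tree))
# <ul><li>A<ul><li>阿姨</li></ul></li><li>B</li></ul>
```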