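I (the OP) posted a thread earlier, full of ambition and planning something big.
Now I have run into a problem: the catalogs (tables of contents) of some dictionaries are important and have to be scraped. Opening and saving them one by one by hand takes too much time and effort. How can I scrape the complete catalogs in bulk?
Any method will do: Python code, or any software that can grab them.
This part of the content is publicly accessible. For example: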
https://gongjushu.cnki.net/rbook/bookdetail?bookid=R201211157
I have already prepared a list of addresses like the one above.
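I took a look: the page is backed by JSON data. Split the 006001001 in the URL below into three groups of three digits each, then iterate over each group.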
https://t.cnki.net/rbook-api/v1/book/R201211157/catalog?code=006001001&type=&size=50&start=1
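To make the grouping concrete, here is a minimal sketch of how a 9-digit `code` splits into three 3-digit groups and how it can be rebuilt from three counters. The helper names `split_code` and `make_code` are hypothetical and only serve as illustration; they are not part of the site's API.

```python
def split_code(code: str, width: int = 3) -> list[str]:
    """Split a catalog code such as '006001001' into fixed-width groups."""
    return [code[n:n + width] for n in range(0, len(code), width)]


def make_code(i: int, j: int, k: int, width: int = 3) -> str:
    """Rebuild a catalog code from three counters, zero-padding each group."""
    return "".join(str(part).zfill(width) for part in (i, j, k))


print(split_code("006001001"))  # ['006', '001', '001']
print(make_code(6, 1, 1))       # 006001001
```

Iterating then just means counting each of the three groups upward from 1 and requesting the URL with the rebuilt code.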
I don't follow. Could you write code that I can run directly?
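import time

import requests

agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36 Edg/106.0.1370.42"
headers = {"User-Agent": agent}
i, j, k = 1, 1, 1  # counters for the three 3-digit groups of the code
while True:
    ii = str(i).zfill(3)
    jj = str(j).zfill(3)
    kk = str(k).zfill(3)
    url = f'https://t.cnki.net/rbook-api/v1/book/R201211157/catalog?code={ii}{jj}{kk}&type=&size=50&start=1'
    r = requests.get(url, headers=headers)
    text = r.json()
    n = text['data']['total']
    if n != 0:
        # This code has entries: append their titles as one line, then try the next k.
        k += 1
        data = text['data']['data']
        with open('d:/fydcd.txt', 'a', encoding='utf-8') as f:
            for ddd in data:
                f.write(ddd['title'] + ' ')
            f.write('\n')
        time.sleep(5)  # pause between requests to avoid an IP ban
    elif k != 1:
        # No more k under this j: move on to the next j.
        k = 1
        j += 1
    elif k == 1:
        # The first k was empty: move on to the next i.
        i += 1
        k = 1
        j = 1
    elif k == 1 and j == 1:
        # Unreachable: the previous elif already matches every k == 1 case,
        # so the loop never exits.
        break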
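Try it and see whether it works. I ran it without a delay at first and got my IP banned, so the code now includes a delay. The output is saved to fydcd.txt on drive D.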
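A fixed delay helps, but a request can still fail now and then. Here is a minimal sketch of retrying with a growing wait. The name `fetch_with_retry` and the `fetch` callable are hypothetical (for example a thin wrapper around `requests.get(...).json()`), and the thread does not say how long a ban lasts, so the delays are only placeholders.

```python
import time


def fetch_with_retry(fetch, url, retries=3, base_delay=5.0, sleep=time.sleep):
    """Call fetch(url); on failure wait base_delay, 2*base_delay, ... and retry.

    `fetch` is any callable that returns parsed data or raises an exception.
    `sleep` is injectable so the logic can be exercised without real waiting.
    """
    for attempt in range(retries + 1):
        try:
            return fetch(url)
        except Exception:
            if attempt == retries:
                raise  # out of retries: surface the last error
            sleep(base_delay * (2 ** attempt))  # wait doubles after each failure
```

With the defaults this waits 5, 10 and 20 seconds before giving up on a URL.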
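Thanks, I'll give it a try first.
But isn't the 《反义大词典》 (Dictionary of Antonyms) already available online? You had better concentrate on that 《方言大词典》 (Dialect Dictionary).
Hold on, I overlooked a problem.
Change the url, otherwise the data may be incomplete.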
url = f'https://t.cnki.net/rbook-api/v1/book/R201211157/catalog?code={ii}{jj}{kk}&type=&size=200&start=1'
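Raising `size` only helps while a code has fewer entries than `size`. Here is a minimal sketch of walking the pages instead, so nothing is cut off. `fetch_page` is a hypothetical callable, and the sketch assumes that `start` is a 1-based page number and that `total` counts all entries for the code. Neither is confirmed in this thread, so check against a real response first.

```python
def fetch_all_items(fetch_page, size=50):
    """Collect every item for one catalog code by walking pages.

    `fetch_page(start, size)` is a hypothetical callable returning the parsed
    JSON, shaped like {'data': {'total': int, 'data': [...]}} as in the thread.
    ASSUMPTION: `start` is a 1-based page number and `total` counts all items.
    """
    items = []
    start = 1
    while True:
        payload = fetch_page(start, size)["data"]
        page = payload["data"]
        items.extend(page)
        # Stop when the page is empty or everything reported has been collected.
        if not page or len(items) >= payload["total"]:
            return items
        start += 1
```

If `start` turns out to be an item offset rather than a page number, the increment would be `start += size` instead.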
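I don't think this one is available; I haven't seen it.
I've revised the code once more: the previous version never broke out of the loop at the end.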
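import time

import requests

agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36 Edg/106.0.1370.42"
headers = {"User-Agent": agent}
i, j, k = 1, 1, 1  # counters for the three 3-digit groups of the code
while i:
    while j:
        while k:
            ii = str(i).zfill(3)
            jj = str(j).zfill(3)
            kk = str(k).zfill(3)
            url = f'https://t.cnki.net/rbook-api/v1/book/R201211157/catalog?code={ii}{jj}{kk}&type=&size=150&start=1'
            r = requests.get(url, headers=headers)
            text = r.json()
            n = text['data']['total']
            print("n:", n, "i:", i, "j:", j, "k:", k)
            if n != 0:
                # This code has entries: append their titles as one line, then try the next k.
                k += 1
                data = text['data']['data']
                with open('d:/fydcd1.txt', 'a', encoding='utf-8') as f:
                    for ddd in data:
                        f.write(ddd['title'] + ' ')
                    f.write('\n')
                time.sleep(1)  # pause between requests
            else:
                break  # no more k under this i/j
        if k > 1:
            # At least one k was found: move on to the next j.
            j += 1
            k = 1
        else:
            break  # the first k was empty: no more j under this i
    if j > 1:
        # At least one j was found: move on to the next i.
        i += 1
        j = 1
        k = 1
    else:
        break  # the first j was empty: all codes have been visited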
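The exit conditions are the tricky part, so here is a minimal sketch of the same three-level walk as a generator, with the network call passed in so that it can be tried offline. `walk_catalog` and `fetch_titles` are hypothetical names. Like the script above, it assumes that codes are numbered consecutively from 001 at every level.

```python
def walk_catalog(fetch_titles):
    """Yield (code, titles) for every non-empty i/j/k code, depth-first.

    `fetch_titles(code)` is a hypothetical callable returning a list of titles
    (an empty list when the code does not exist).
    """
    i = 1
    while True:
        j = 1
        while True:
            k = 1
            while True:
                code = f"{i:03d}{j:03d}{k:03d}"
                titles = fetch_titles(code)
                if not titles:
                    break          # no more k under this i/j
                yield code, titles
                k += 1
            if k == 1:
                break              # first k was empty: no more j under this i
            j += 1
        if j == 1:
            return                 # first j was empty: no more i
        i += 1


# Offline usage example with made-up data instead of real requests.
fake = {"001001001": ["a"], "001001002": ["b"], "001002001": ["c"]}
for code, titles in walk_catalog(lambda c: fake.get(c, [])):
    print(code, " ".join(titles))
```

Swapping the lambda for a function that requests the catalog URL and returns the `title` fields would reproduce the script's behavior.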
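In any case, I already have the 《反义大词典》.
Thanks, the .py script is already running. It has been an hour and this catalog still isn't finished; it is a bit slow.
I looked at the scraped data, and it is all plain text: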
○阿姨 ●叔叔
○腌臜 ○肮脏 ○鏖糟 ○埋汰 ○龌龊 ○污秽 ○污浊 ●纯净 ●干净 ●光洁 ●洁净 ●明净 ●清洁 ●清爽 ●卫生
○腌臜 ○别扭 ○不顺 ○逆心 ●合心 ●合意 ●顺心 ●顺意 ●中意
○哀 ○悲 ○愁 ○忧 ●欢 ●快 ●乐 ●喜 ●愉 ●悦
What I need is the original web page data, like this:
<div data-v-c0aa28b4="" class="leftSide" style="height: 2570px;"><div data-v-c0aa28b4="" class="category">目录</div><!----><div data-v-a6573e50="" data-v-c0aa28b4="" id="literatureClassify" class="mod-literatureClassify" style="height: 2530px;"><div data-v-a6573e50="" class="treeStructure"><div data-v-5da0ad47="" data-v-a6573e50="" id="TreeItem" class="mod-myTree"><ul data-v-5da0ad47=""><li data-v-5da0ad47=""><div data-v-5da0ad47="" class="lineDetail clearfix active" style="padding-left: 14px;"><span data-v-5da0ad47="" class="leftIcon" style="left: 5px;"> + </span><!----><!----><span data-v-5da0ad47=""><div role="tooltip" id="el-popover-2874" aria-hidden="true" class="el-popover el-popper popover-wrap" tabindex="0" style="display: none;"><!----><span data-v-5da0ad47="">A</span></div><span class="el-popover__reference-wrapper"><span data-v-5da0ad47="" class="el-popover__reference" aria-describedby="el-popover-2874" tabindex="0"><span data-v-5da0ad47="" class="navi-search">A</span></span></span></span></div><!----><!----></li><li data-v-5da0ad47=""><div data-v-5da0ad47="" class="lineDetail clearfix" style="padding-left: 14px;"><span data-v-5da0ad47="" class="leftIcon" style="left: 5px;"> + </span><!----><!----><span data-v-5da0ad47=""><div role="tooltip" id="el-popover-2254" aria-hidden="true" class="el-popover el-popper popover-wrap" tabindex="0" style="display: none;"><!----><span data-v-5da0ad47="">B</span></div><span class="el-popover__reference-wrapper"><span data-v-5da0ad47="" class="el-popover__reference" aria-describedby="el-popover-2254" tabindex="0"><span data-v-5da0ad47="" class="navi-search">B</span></span></span></span></div><!----><!----></li><li data-v-5da0ad47=""><div data-v-5da0ad47="" class="lineDetail clearfix" style="padding-left: 14px;"><span data-v-5da0ad47="" class="leftIcon" style="left: 5px;"> + </span><!----><!----><span data-v-5da0ad47=""><div role="tooltip" id="el-popover-4957" aria-hidden="true" class="el-popover el-popper popover-wrap" 
tabindex="0" style="display: none;"><!----><span data-v-5da0ad47="">C</span></div><span class="el-popover__reference-wrapper"><span data-v-5da0ad47="" class="el-popover__reference" aria-describedby="el-popover-4957" tabindex="0"><span data-v-5da0ad47="" class="navi-search">C</span></span></span></span></div><!----><!----></li><li data-v-5da0ad47=""><div data-v-5da0ad47="" class="lineDetail clearfix" style="padding-left: 14px;"><span data-v-5da0ad47="" class="leftIcon" style="left: 5px;"> + </span><!----><!----><span data-v-5da0ad47=""><div role="tooltip" id="el-popover-1565" aria-hidden="true" class="el-popover el-popper popover-wrap" tabindex="0" style="display: none;"><!----><span data-v-5da0ad47="">D</span></div><span class="el-popover__reference-wrapper"><span data-v-5da0ad47="" class="el-popover__reference" aria-describedby="el-popover-1565" tabindex="0"><span data-v-5da0ad47="" class="navi-search">D</span></span></span></span></div><!----><!----></li><li data-v-5da0ad47=""><div data-v-5da0ad47="" class="lineDetail clearfix" style="padding-left: 14px;"><span data-v-5da0ad47="" class="leftIcon" style="left: 5px;"> + </span><!----><!----><span data-v-5da0ad47=""><div role="tooltip" id="el-popover-9059" aria-hidden="true" class="el-popover el-popper popover-wrap" tabindex="0" style="display: none;"><!----><span data-v-5da0ad47="">E</span></div><span class="el-popover__reference-wrapper"><span data-v-5da0ad47="" class="el-popover__reference" aria-describedby="el-popover-9059" tabindex="0"><span data-v-5da0ad47="" class="navi-search">E</span></span></span></span></div><!----><!----></li><li data-v-5da0ad47=""><div data-v-5da0ad47="" class="lineDetail clearfix" style="padding-left: 14px;"><span data-v-5da0ad47="" class="leftIcon" style="left: 5px;"> + </span><!----><!----><span data-v-5da0ad47=""><div role="tooltip" id="el-popover-4372" aria-hidden="true" class="el-popover el-popper popover-wrap" tabindex="0" style="display: none;"><!----><span 
data-v-5da0ad47="">F</span></div><span class="el-popover__reference-wrapper"><span data-v-5da0ad47="" class="el-popover__reference" aria-describedby="el-popover-4372" tabindex="0"><span data-v-5da0ad47="" class="navi-search">F</span></span></span></span></div><!----><!----></li><li data-v-5da0ad47=""><div data-v-5da0ad47="" class="lineDetail clearfix" style="padding-left: 14px;"><span data-v-5da0ad47="" class="leftIcon" style="left: 5px;"> + </span><!----><!----><span data-v-5da0ad47=""><div role="tooltip" id="el-popover-492" aria-hidden="true" class="el-popover el-popper popover-wrap" tabindex="0" style="display: none;"><!----><span data-v-5da0ad47="">G</span></div><span class="el-popover__reference-wrapper"><span data-v-5da0ad47="" class="el-popover__reference" aria-describedby="el-popover-492" tabindex="0"><span data-v-5da0ad47="" class="navi-search">G</span></span></span></span></div><!----><!----></li><li data-v-5da0ad47=""><div data-v-5da0ad47="" class="lineDetail clearfix" style="padding-left: 14px;"><span data-v-5da0ad47="" class="leftIcon" style="left: 5px;"> + </span><!----><!----><span data-v-5da0ad47=""><div role="tooltip" id="el-popover-7850" aria-hidden="true" class="el-popover el-popper popover-wrap" tabindex="0" style="display: none;"><!----><span data-v-5da0ad47="">H</span></div><span class="el-popover__reference-wrapper"><span data-v-5da0ad47="" class="el-popover__reference" aria-describedby="el-popover-7850" tabindex="0"><span data-v-5da0ad47="" class="navi-search">H</span></span></span></span></div><!----><!----></li><li data-v-5da0ad47=""><div data-v-5da0ad47="" class="lineDetail clearfix" style="padding-left: 14px;"><span data-v-5da0ad47="" class="leftIcon" style="left: 5px;"> + </span><!----><!----><span data-v-5da0ad47=""><div role="tooltip" id="el-popover-4676" aria-hidden="true" class="el-popover el-popper popover-wrap" tabindex="0" style="display: none;"><!----><span data-v-5da0ad47="">J</span></div><span 
class="el-popover__reference-wrapper"><span data-v-5da0ad47="" class="el-popover__reference" aria-describedby="el-popover-4676" tabindex="0"><span data-v-5da0ad47="" class="navi-search">J</span></span></span></span></div><!----><!----></li><li data-v-5da0ad47=""><div data-v-5da0ad47="" class="lineDetail clearfix" style="padding-left: 14px;"><span data-v-5da0ad47="" class="leftIcon" style="left: 5px;"> + </span><!----><!----><span data-v-5da0ad47=""><div role="tooltip" id="el-popover-6339" aria-hidden="true" class="el-popover el-popper popover-wrap" tabindex="0" style="display: none;"><!----><span data-v-5da0ad47="">K</span></div><span class="el-popover__reference-wrapper"><span data-v-5da0ad47="" class="el-popover__reference" aria-describedby="el-popover-6339" tabindex="0"><span data-v-5da0ad47="" class="navi-search">K</span></span></span></span></div><!----><!----></li><li data-v-5da0ad47=""><div data-v-5da0ad47="" class="lineDetail clearfix" style="padding-left: 14px;"><span data-v-5da0ad47="" class="leftIcon" style="left: 5px;"> + </span><!----><!----><span data-v-5da0ad47=""><div role="tooltip" id="el-popover-6513" aria-hidden="true" class="el-popover el-popper popover-wrap" tabindex="0" style="display: none;"><!----><span data-v-5da0ad47="">L</span></div><span class="el-popover__reference-wrapper"><span data-v-5da0ad47="" class="el-popover__reference" aria-describedby="el-popover-6513" tabindex="0"><span data-v-5da0ad47="" class="navi-search">L</span></span></span></span></div><!----><!----></li><li data-v-5da0ad47=""><div data-v-5da0ad47="" class="lineDetail clearfix" style="padding-left: 14px;"><span data-v-5da0ad47="" class="leftIcon" style="left: 5px;"> + </span><!----><!----><span data-v-5da0ad47=""><div role="tooltip" id="el-popover-2420" aria-hidden="true" class="el-popover el-popper popover-wrap" tabindex="0" style="display: none;"><!----><span data-v-5da0ad47="">M</span></div><span class="el-popover__reference-wrapper"><span data-v-5da0ad47="" 
class="el-popover__reference" aria-describedby="el-popover-2420" tabindex="0"><span data-v-5da0ad47="" class="navi-search">M</span></span></span></span></div><!----><!----></li><li data-v-5da0ad47=""><div data-v-5da0ad47="" class="lineDetail clearfix" style="padding-left: 14px;"><span data-v-5da0ad47="" class="leftIcon" style="left: 5px;"> + </span><!----><!----><span data-v-5da0ad47=""><div role="tooltip" id="el-popover-9479" aria-hidden="true" class="el-popover el-popper popover-wrap" tabindex="0" style="display: none;"><!----><span data-v-5da0ad47="">N</span></div><span class="el-popover__reference-wrapper"><span data-v-5da0ad47="" class="el-popover__reference" aria-describedby="el-popover-9479" tabindex="0"><span data-v-5da0ad47="" class="navi-search">N</span></span></span></span></div><!----><!----></li><li data-v-5da0ad47=""><div data-v-5da0ad47="" class="lineDetail clearfix" style="padding-left: 14px;"><span data-v-5da0ad47="" class="leftIcon" style="left: 5px;"> + </span><!----><!----><span data-v-5da0ad47=""><div role="tooltip" id="el-popover-760" aria-hidden="true" class="el-popover el-popper popover-wrap" tabindex="0" style="display: none;"><!----><span data-v-5da0ad47="">O</span></div><span class="el-popover__reference-wrapper"><span data-v-5da0ad47="" class="el-popover__reference" aria-describedby="el-popover-760" tabindex="0"><span data-v-5da0ad47="" class="navi-search">O</span></span></span></span></div><!----><!----></li><li data-v-5da0ad47=""><div data-v-5da0ad47="" class="lineDetail clearfix" style="padding-left: 14px;"><span data-v-5da0ad47="" class="leftIcon" style="left: 5px;"> + </span><!----><!----><span data-v-5da0ad47=""><div role="tooltip" id="el-popover-3931" aria-hidden="true" class="el-popover el-popper popover-wrap" tabindex="0" style="display: none;"><!----><span data-v-5da0ad47="">P</span></div><span class="el-popover__reference-wrapper"><span data-v-5da0ad47="" class="el-popover__reference" aria-describedby="el-popover-3931" 
tabindex="0"><span data-v-5da0ad47="" class="navi-search">P</span></span></span></span></div><!----><!----></li><li data-v-5da0ad47=""><div data-v-5da0ad47="" class="lineDetail clearfix" style="padding-left: 14px;"><span data-v-5da0ad47="" class="leftIcon" style="left: 5px;"> + </span><!----><!----><span data-v-5da0ad47=""><div role="tooltip" id="el-popover-973" aria-hidden="true" class="el-popover el-popper popover-wrap" tabindex="0" style="display: none;"><!----><span data-v-5da0ad47="">Q</span></div><span class="el-popover__reference-wrapper"><span data-v-5da0ad47="" class="el-popover__reference" aria-describedby="el-popover-973" tabindex="0"><span data-v-5da0ad47="" class="navi-search">Q</span></span></span></span></div><!----><!----></li><li data-v-5da0ad47=""><div data-v-5da0ad47="" class="lineDetail clearfix" style="padding-left: 14px;"><span data-v-5da0ad47="" class="leftIcon" style="left: 5px;"> + </span><!----><!----><span data-v-5da0ad47=""><div role="tooltip" id="el-popover-9564" aria-hidden="true" class="el-popover el-popper popover-wrap" tabindex="0" style="display: none;"><!----><span data-v-5da0ad47="">R</span></div><span class="el-popover__reference-wrapper"><span data-v-5da0ad47="" class="el-popover__reference" aria-describedby="el-popover-9564" tabindex="0"><span data-v-5da0ad47="" class="navi-search">R</span></span></span></span></div><!----><!----></li><li data-v-5da0ad47=""><div data-v-5da0ad47="" class="lineDetail clearfix" style="padding-left: 14px;"><span data-v-5da0ad47="" class="leftIcon" style="left: 5px;"> + </span><!----><!----><span data-v-5da0ad47=""><div role="tooltip" id="el-popover-1503" aria-hidden="true" class="el-popover el-popper popover-wrap" tabindex="0" style="display: none;"><!----><span data-v-5da0ad47="">S</span></div><span class="el-popover__reference-wrapper"><span data-v-5da0ad47="" class="el-popover__reference" aria-describedby="el-popover-1503" tabindex="0"><span data-v-5da0ad47="" 
class="navi-search">S</span></span></span></span></div><!----><!----></li><li data-v-5da0ad47=""><div data-v-5da0ad47="" class="lineDetail clearfix" style="padding-left: 14px;"><span data-v-5da0ad47="" class="leftIcon" style="left: 5px;"> + </span><!----><!----><span data-v-5da0ad47=""><div role="tooltip" id="el-popover-8079" aria-hidden="true" class="el-popover el-popper popover-wrap" tabindex="0" style="display: none;"><!----><span data-v-5da0ad47="">T</span></div><span class="el-popover__reference-wrapper"><span data-v-5da0ad47="" class="el-popover__reference" aria-describedby="el-popover-8079" tabindex="0"><span data-v-5da0ad47="" class="navi-search">T</span></span></span></span></div><!----><!----></li><li data-v-5da0ad47=""><div data-v-5da0ad47="" class="lineDetail clearfix" style="padding-left: 14px;"><span data-v-5da0ad47="" class="leftIcon" style="left: 5px;"> + </span><!----><!----><span data-v-5da0ad47=""><div role="tooltip" id="el-popover-1138" aria-hidden="true" class="el-popover el-popper popover-wrap" tabindex="0" style="display: none;"><!----><span data-v-5da0ad47="">W</span></div><span class="el-popover__reference-wrapper"><span data-v-5da0ad47="" class="el-popover__reference" aria-describedby="el-popover-1138" tabindex="0"><span data-v-5da0ad47="" class="navi-search">W</span></span></span></span></div><!----><!----></li><li data-v-5da0ad47=""><div data-v-5da0ad47="" class="lineDetail clearfix" style="padding-left: 14px;"><span data-v-5da0ad47="" class="leftIcon" style="left: 5px;"> + </span><!----><!----><span data-v-5da0ad47=""><div role="tooltip" id="el-popover-5625" aria-hidden="true" class="el-popover el-popper popover-wrap" tabindex="0" style="display: none;"><!----><span data-v-5da0ad47="">X</span></div><span class="el-popover__reference-wrapper"><span data-v-5da0ad47="" class="el-popover__reference" aria-describedby="el-popover-5625" tabindex="0"><span data-v-5da0ad47="" 
class="navi-search">X</span></span></span></span></div><!----><!----></li><li data-v-5da0ad47=""><div data-v-5da0ad47="" class="lineDetail clearfix" style="padding-left: 14px;"><span data-v-5da0ad47="" class="leftIcon" style="left: 5px;"> + </span><!----><!----><span data-v-5da0ad47=""><div role="tooltip" id="el-popover-6971" aria-hidden="true" class="el-popover el-popper popover-wrap" tabindex="0" style="display: none;"><!----><span data-v-5da0ad47="">Y</span></div><span class="el-popover__reference-wrapper"><span data-v-5da0ad47="" class="el-popover__reference" aria-describedby="el-popover-6971" tabindex="0"><span data-v-5da0ad47="" class="navi-search">Y</span></span></span></span></div><!----><!----></li><li data-v-5da0ad47=""><div data-v-5da0ad47="" class="lineDetail clearfix" style="padding-left: 14px;"><span data-v-5da0ad47="" class="leftIcon" style="left: 5px;"> + </span><!----><!----><span data-v-5da0ad47=""><div role="tooltip" id="el-popover-6042" aria-hidden="true" class="el-popover el-popper popover-wrap" tabindex="0" style="display: none;"><!----><span data-v-5da0ad47="">Z</span></div><span class="el-popover__reference-wrapper"><span data-v-5da0ad47="" class="el-popover__reference" aria-describedby="el-popover-6042" tabindex="0"><span data-v-5da0ad47="" class="navi-search">Z</span></span></span></span></div><!----><!----></li></ul></div><!----></div></div><!----></div>
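What I have in mind:
Expand everything, every “+” and every “查看更多” (“View more”), and wait until all the data has loaded.
Then extract the full HTML source of the expanded catalog.
Then scrape the catalogs for the whole address list in bulk.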
目录地址.zip (57.9 KB)
Could you help write code for this as well? Thanks.
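I thought you had already scraped the dictionary and only needed the catalog.
The catalog data contains no links to the entries; that is not how you scrape a dictionary.
If you have an account, I can scrape it for you or dig out the entry links. As things stand, there is nothing I can do.
The dictionary has already been scraped. The entries don't need scraping.
I only want the catalog: a complete catalog data file with its formatting.
I need to scrape the catalogs of many books, so what I need is batch scraping of catalogs, not batch scraping of entries.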
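For many books, the address list can drive the catalog requests. Here is a minimal sketch that pulls `bookid` out of a bookdetail URL and builds the catalog API URL from it. The function names are hypothetical, and it assumes every book uses the same API pattern as R201211157, which is the only one shown in this thread.

```python
from urllib.parse import urlparse, parse_qs

# URL pattern copied from the thread; only the book id and code vary.
API = ("https://t.cnki.net/rbook-api/v1/book/{bookid}/catalog"
       "?code={code}&type=&size={size}&start={start}")


def book_id(detail_url: str) -> str:
    """Extract the `bookid` query parameter from a bookdetail URL."""
    query = parse_qs(urlparse(detail_url).query)
    return query["bookid"][0]


def catalog_url(detail_url: str, code: str, size: int = 50, start: int = 1) -> str:
    """Build the catalog API URL for one book and one catalog code."""
    return API.format(bookid=book_id(detail_url), code=code, size=size, start=start)


urls = ["https://gongjushu.cnki.net/rbook/bookdetail?bookid=R201211157"]
for u in urls:
    print(catalog_url(u, "001001001"))
```

Reading `urls` from the prepared address file, one URL per line, would extend this to the whole list.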
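You already have the dictionary, so I don't know what you want the catalog for.
As for a catalog data file: that page data is just the JSON data with layout applied, which is of no use for building a dictionary.
Since you already have the dictionary, just clean it up and reorganize it.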
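If a formatted catalog is still wanted, it can be generated from the scraped titles rather than copied from the page. Here is a minimal sketch that renders a nested catalog as an HTML list. The node shape `{'title': ..., 'children': [...]}` is a hypothetical structure for illustration, not the site's actual JSON, and the output is a plain `<ul>` list rather than the site's own markup.

```python
from html import escape


def render_tree(nodes):
    """Render nested {'title': ..., 'children': [...]} nodes as an HTML list."""
    if not nodes:
        return ""
    parts = ["<ul>"]
    for node in nodes:
        children = render_tree(node.get("children", []))
        parts.append(f"<li>{escape(node['title'])}{children}</li>")
    parts.append("</ul>")
    return "".join(parts)


tree = [{"title": "A", "children": [{"title": "○阿姨 ●叔叔"}]}, {"title": "B"}]
print(render_tree(tree))
# <ul><li>A<ul><li>○阿姨 ●叔叔</li></ul></li><li>B</li></ul>
```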
Are you in the group? It is easier to talk there, so I'll look for you. I can't send you private messages yet.