Sharing my Wiktionary scraping tools

ylxdxx · March 14, 2023, 12:17

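Preface

I originally did the scraping in Python, but the Python approach ran into problems (out of roughly one million headwords, about ten thousand came out wrong, cause unknown), so the download step was later handed over to aria2.

The whole scraping and build process was done on Linux. It is cobbled together and not automated; it merely works.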

1. Getting the headwords

First download the JSON file listed under "Word lists" at https://kaikki.org/dictionary/English/index.html, then extract the headwords with jq:

jq '.word' kaikki.org-dictionary-English.json | sort | uniq > words_list.txt

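This produces the headword file words_list.txt. Because jq is run without -r, each line is still a quoted JSON string; the Perl script in the next step strips the quotes.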
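
For anyone without jq, the same extraction can be sketched in Python. This is only a sketch under one assumption: that the kaikki.org file holds one JSON object per line, each with a "word" field. The function name collect_headwords is made up for this example. Unlike the jq command, it returns bare (unquoted) words, and Python's sort order may differ from the locale-aware sort command.

```python
import json


def collect_headwords(lines):
    """Return the sorted, de-duplicated "word" values from JSON Lines input."""
    words = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        entry = json.loads(line)
        word = entry.get("word")
        if word:
            words.add(word)
    return sorted(words)
```

Passing an open file handle for kaikki.org-dictionary-English.json to collect_headwords and writing the result one word per line would give a file playing the same role as words_list.txt.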

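2. Downloading the pages

First generate the page URL for each headword. This can be done with a regex search-and-replace or with a script. I will skip the regex route; the Perl script is as follows: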

#!/usr/bin/perl
use utf8;
use strict;
use open ':std', ':encoding(UTF-8)';
use File::Slurper 'read_text';

# Read the headword list
my $contents = read_text("words_list.txt");

# Strip the JSON quotes and surrounding spaces from each line
$contents =~ s/^"//mg;
$contents =~ s/"$//mg;
$contents =~ s/^ +//mg;
$contents =~ s/ +$//mg;

# Turn each headword into its page URL
$contents =~ s/^(.+)$/https:\/\/en.wiktionary.org\/wiki\/${1}/mg;

my $out_file = "./aria2_down.txt";
open(FH, '>', $out_file) or die "Could not open file $out_file because $!";
print FH $contents."\n";
close FH;
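
The Perl script above pastes the headword into the URL unchanged. As a possible refinement (not something the original workflow did), the URL could be percent-encoded so that headwords containing characters such as "?" or "#" are not misread as a query string or fragment. A minimal Python sketch, where headword_to_url is a hypothetical helper name:

```python
from urllib.parse import quote

BASE_URL = "https://en.wiktionary.org/wiki/"


def headword_to_url(headword):
    """Build the Wiktionary page URL for one headword."""
    # MediaWiki page URLs use underscores in place of spaces
    title = headword.strip().replace(" ", "_")
    # Percent-encode the rest; quote() leaves "/" alone by default
    return BASE_URL + quote(title)
```

Whether this reduces the number of failed downloads is untested here; it is only one plausible source of bad requests.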

Once you have aria2_down.txt, you can download everything directly with aria2:

aria2c -i aria2_down.txt

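However, this needs a lot of memory. On a machine with little memory, split the list into several smaller files and download them one at a time: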

split -l 100000 aria2_down.txt

Here -l 100000 splits by line count, putting 100000 lines in each output file.
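
The same chunking can be sketched in Python, for example if the batches need custom names. This is an illustration only; split_lines is a made-up name and the workflow itself just used split.

```python
from itertools import islice


def split_lines(lines, chunk_size):
    """Yield successive lists of at most chunk_size lines."""
    iterator = iter(lines)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk
```

Each yielded chunk can then be written to its own file and passed to aria2c -i separately.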

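Note: some pages may be missed during the download. We need a list of the headwords that were actually downloaded, then take the set difference against the original headword list to find the missing ones and download those again.

The content extraction step (section 3) produces that list of downloaded headwords as a by-product.

Suppose there are two files, A.txt and B.txt. To compute the difference A-B and store it in C.txt: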

sort A.txt B.txt B.txt | uniq -u > C.txt
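
The sort/uniq trick works because every line of B.txt appears at least twice in the combined input, so uniq -u keeps only lines unique to A.txt (provided A.txt itself has no duplicates). The same difference as a Python sketch, assuming both lists hold one bare headword per line (the quoted jq output would need its quotes stripped first); missing_headwords is a hypothetical name:

```python
def missing_headwords(all_words, downloaded_words):
    """Return the headwords in all_words that are absent from downloaded_words."""
    downloaded = set(downloaded_words)
    return sorted(word for word in set(all_words) if word not in downloaded)
```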


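3. Content extraction

Once the pages for all headwords have been downloaded, content extraction can begin.

My machine has little memory, so I downloaded in several batches. The script below processes the pages in the 07 folder.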

import copy
import os
import re  # regular expressions

from lxml import etree  # HTML parsing




w_path = "./07"  # folder holding the downloaded pages
w_files= os.listdir(w_path)  # all entry names (note: this also lists subdirectories)

All_w_citou=[]
All_w_mdx=[]
i=0
for w_file in w_files:
  
    #print(w_file)
    # Read the HTML file
    with open(os.path.join(w_path, w_file), encoding="utf-8") as f:
        w_html = f.read()

    # Parse the page, dropping comments to reduce size
    html = etree.fromstring(w_html, parser=etree.HTMLParser(remove_comments=True))

    # Extract the article body
    word_mdx=html.xpath('//*[@id="mw-content-text"]/div[@class="mw-parser-output"]')[0]
    x_mdx=copy.copy(word_mdx)

    # Extract the headword from the page heading
    w_citou=html.xpath('//*[@id="firstHeading"]')[0].xpath('string(.)')
  
  
    # Collect the URLs of images and other resources
    img_url_a= x_mdx.xpath('//@src')
    img_url_b= x_mdx.xpath('//@srcset')
    img_url=img_url_a + img_url_b
    if img_url:
        img_url="\n".join(img_url)
        with open("t_url.txt","a") as file:
            file.write(img_url+"\n")

    # Serialize back to text
    txt_mdx=etree.tostring(x_mdx,encoding = "utf-8").decode("utf-8")

    # Collapse the entry onto a single line
    txt_mdx = re.sub(r'\n<', r'<', txt_mdx)
    txt_mdx = re.sub(r'\n', r' ', txt_mdx)
    txt_mdx = re.sub(r'\t', r' ', txt_mdx)
    txt_mdx = re.sub(r'<hr ?\/>.+', r'</div>', txt_mdx)  # the mdx keeps only the first section (everything before the first <hr>)
    txt_mdx = re.sub(r'  +', r' ', txt_mdx)
  
  
    All_w_citou.append(w_citou)
    All_w_mdx.append(str(w_citou)+"\t"+str(txt_mdx))
    i=i+1
    print(i)



# Write the results; each file ends with a newline so the batches can be concatenated safely
with open(w_path+"_citou.txt","a") as file:
    file.write("\n".join(All_w_citou)+"\n")
with open(w_path+"_mdx.txt","a") as file:
    file.write("\n".join(All_w_mdx)+"\n")

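This produces three files:

t_url.txt, the URLs of related images, audio, and other resources
07_citou.txt, the headwords that were downloaded
07_mdx.txt, the input for building the mdx later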
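
The <hr> substitution in the script above is what restricts each entry to its first section. Here is that step in isolation as a minimal sketch, assuming the HTML has already been collapsed onto one line and that sections are separated by a self-closing <hr/> as lxml serializes it. The name first_section_only is made up for this illustration.

```python
import re


def first_section_only(html_line):
    """Cut a one-line HTML string at the first <hr/> and close the wrapper div."""
    # ".+" is greedy and there are no newlines left, so everything from the
    # first <hr/> to the end of the string is replaced by a closing </div>
    return re.sub(r'<hr ?/>.+', '</div>', html_line)
```

A page with no <hr/> is returned unchanged.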

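4. Generating the mdx

The per-batch files from the previous step can be merged into one:

cat *_mdx.txt > all_mdx.txt

The output of the earlier Python script still has some problems, so it needs one more processing pass: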

import re  # regular expressions

from tidylib import tidy_fragment  # pytidylib, a wrapper around HTML Tidy

TOTAL_ENTRIES = 1115328  # line count of all_mdx.txt in my run; only used for the progress display


def tidy_entry(headword, html_text, out_file):

    # Clean up the HTML fragment with tidy
    txt_mdx, errors = tidy_fragment(html_text)

    # Collapse the entry onto a single line
    txt_mdx = re.sub(r'\n<', r'<', txt_mdx)
    txt_mdx = re.sub(r'\n', r' ', txt_mdx)
    txt_mdx = re.sub(r'\t', r' ', txt_mdx)
    txt_mdx = re.sub(r'  +', r' ', txt_mdx)

    # Write the record as "headword<TAB>html"
    out_file.write(headword + "\t" + txt_mdx + "\n")


i = 1
with open('all_mdx.txt', encoding="utf-8") as src, open("tidy_mdx.txt", "a", encoding="utf-8") as dst:
    for line in src:
        line = line.strip()  # drop the trailing newline and any surrounding spaces
        if "\t" not in line:
            continue  # skip blank or malformed lines

        # Each line is "headword<TAB>html"
        c_name, c_text = line.split("\t", 1)

        # Tidy the HTML and write it out
        tidy_entry(c_name, c_text, dst)

        # Print progress
        print("{:.4%} \t {}".format(i / TOTAL_ENTRIES, i))
        i = i + 1

Once the final tidy_mdx.txt is ready, the mdx source file can be generated.

The following Perl script can be used:

#!/usr/bin/perl
use utf8;
use strict;
use open ':std', ':encoding(UTF-8)';
use File::Slurper 'read_text';
use Encode qw(decode encode);
use IO::Handle;


my $file_name="03_tidy_mdx.txt";
my $contents = read_text($file_name);


# Remove any extra blank lines
$contents =~ s/\n\n+/\n/mg;

print "File read, processing...\n";


# Split into an array of lines
my @contents = split(/\n/,$contents);
for my $temp (@contents) {

    # Separate the headword from the HTML
    my $citou=$temp;
    $citou =~ s/\t.+//;
    $temp =~ s/.+?\t//;
    $temp  =~ s/id="English">English/id="English">$citou/;
    $temp  =~ s/id="Translingual">Translingual/id="Translingual">$citou/;
  
  
# Wrap the entry in the mdx page markup
$temp = <<"END_TXT";
<script type="text/javascript" src="wiktionary.js"></script>
<link rel="stylesheet" type="text/css" href="wiktionary_01.css">
<link rel="stylesheet" type="text/css" href="wiktionary_02.css">
<div class="skin-vector-legacy mediawiki ltr sitedir-ltr mw-hide-empty-elt ns-0 ns-subject mw-editable page-good rootpage-good skin-vector action-view"><div id="mw-page-base" class="noprint"></div>
<div id="content" class="mw-body" role="main">
<div id="bodyContent" class="vector-body">
<div id="mw-content-text" class="mw-body-content mw-content-ltr" lang="en" dir="ltr">
$temp
</div></div></div></div>
END_TXT

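# Remove line breaks
$temp=~ s/\n//g;

# Final mdx record: headword, HTML, then the "</>" separator
$temp=$citou."\n".$temp."\n</>";

}

# Join the array back into text
$contents=join("\n", @contents);

# After the tidy pass, strip the <body> tags
$contents  =~ s/<body>//g;
$contents  =~ s/<\/body>//g;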


#
# Hyperlink handling
#
# Remove links to appendix pages
$contents  =~ s/<a href="\/wiki\/Appendix:[^><]+?>([^><]*?)<\/a>/${1}/g;
$contents  =~ s/href="\/wiki\/Appendix:[^"><]+" //g;

# Remove links to en.wikipedia.org
$contents  =~ s/<a href="https:\/\/en.wikipedia.org[^><]+?>([^><]+?)<\/a>/${1}/g;

# Remove links to https://en.wikisource.org
$contents  =~ s/<a href="https:\/\/en.wikisource.org[^><]+?>([^><]+?)<\/a>/${1}/g;


# Remove links to commons.wikimedia.org
$contents  =~ s/<a href="https:\/\/commons.wikimedia.org[^><]+?>([^><]+?)<\/a>/${1}/g;

# Remove in-page anchor links
$contents  =~ s/<a [^><]+? href="#[a-zA-Z\_]+"[^><]*?>([^><]+?)<\/a>/${1}/g;

# Remove the "create this page" links for entries that do not exist
$contents  =~ s!<a href="/w/index\.php[^"><]+" class="new"[^><]*?>([^><]+?)</a>!${1}!g;

# Remove links to non-English entries, for example:
#/wiki/glossarium#Latin
#/wiki/%CF%83%CE%BF%CF%86%CE%AF%CE%B1#Ancient_Greek
# while keeping links such as:
#/wiki/philosophies#English
$contents  =~ s/<a href="\/wiki\/[^"><]+#(?!English)[^"><]+?"[^><]*?>([^><]+?)<\/a>/${1}/g;

# Remove links to namespaced collection pages (these were not scraped)
$contents  =~ s/<a href="\/wiki\/[a-zA-Z\_]+?:[a-zA-Z][^><]+?>([^><]+?)<\/a>/${1}/g;

# Rewrite headword links as mdx entry:// links
$contents  =~ s/href="\/wiki\/[^"><]+" title="([^"><]+)"/href="entry:\/\/${1}" title="${1}"/g;

# Rewrite resource links as local relative paths
$contents  =~ s!//upload.wikimedia.org/!!g;
$contents =~ s!="/w/([^"><]+?\.png)\?[^"><]+!="w/${1}!mg;
$contents  =~ s!https://wikimedia.org/!!g;


# Save the result
#mkdir "out";
my $out_file = "./wiktionary_mdx.txt";
open(FH, '>>', $out_file) or die "Could not open file $out_file because $!";
print FH $contents."\n";
close FH;

print("$file_name \t done!\n");

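Note that processing the whole file in one go needs a lot of memory. If your machine has little memory, split the file and process the pieces separately.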
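
An alternative to splitting the file is to process it one line at a time, so only a single entry is in memory at once. The Python sketch below shows the idea on a reduced version of the Perl script: it covers only the headword-link rewrite and the final record layout (headword, HTML, "</>"), and leaves out the page wrapper and the other link clean-ups. The names to_mdx_record and convert are made up for this example.

```python
import re

# /wiki/ links that carry a title attribute become mdx entry:// links
WIKI_LINK = re.compile(r'href="/wiki/[^"><]+" title="([^"><]+)"')


def to_mdx_record(line):
    """Turn one "headword<TAB>html" line into a three-line mdx source record."""
    headword, html = line.rstrip("\n").split("\t", 1)
    html = WIKI_LINK.sub(r'href="entry://\1" title="\1"', html)
    return headword + "\n" + html + "\n</>\n"


def convert(src_lines, write):
    """Stream records from src_lines to the write callback, one at a time."""
    for line in src_lines:
        if "\t" in line:  # skip blank or malformed lines
            write(to_mdx_record(line))
```

With two open files, convert(src, dst.write) would stream tidy_mdx.txt into the mdx source file without loading either file whole.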


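5. Downloading resources

Next, process the download links for images, audio, and other resources. This assumes the t_url.txt files generated earlier (there are several) are all under the o_down folder. The Perl script below produces the final aria2 input file: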

#!/usr/bin/perl
use utf8;
use strict; # make Perl enforce strict variable and reference checks

# Note: items from https://commons.wikimedia.org have problems downloading
# The result can be downloaded directly with: aria2c -i aria2_down.txt

use open ':std', ':encoding(UTF-8)';
use File::Slurper 'read_text';
use Encode qw(decode encode);
use IO::Handle;
use List::MoreUtils ':all'; # provides uniq for de-duplication
use File::Find;

# Read every t_url.txt under ./o_down
my $contents = "";
sub wanted {
my $file_name = decode('UTF-8', $_ );
$file_name =~ m/t_url\.txt\z/ && my_read($file_name)
}
sub my_read{
my $my_name = shift;
$contents = $contents.read_text("$my_name")."\n";
}
find(\&wanted, encode('UTF-8', "./o_down"));

# Split the URLs: put each upload.wikimedia.org URL on its own line
$contents =~ s/\/\/upload\.wikimedia\.org/\nhttps:\/\/upload.wikimedia.org/g;

# URL handling
$contents =~ s!^/w/(.+?\.png)\?.+!https://en.wiktionary.org/w/${1}!mg;
$contents =~ s!^https://commons.wikimedia.org/w/api.php\?action=.+!!mg;

# Trim spaces
$contents =~ s/ +$//mg;
$contents =~ s/^ +//mg;

# Remove leftovers: blank lines, trailing commas, srcset descriptors such as " 2x"
$contents =~ s/\n\n+/\n/mg;
$contents =~ s/\A\n+//mg;
$contents =~ s/,$//mg;
$contents =~ s/ [0-9]\.?[0-9]?x$//mg;

# De-duplicate the URLs
my @contents = split(/\n/,$contents);
@contents=sort(@contents); # sort
@contents=uniq(@contents); # de-duplicate

# Convert to the aria2 input-file format
for my $tem (@contents) {
my $txt=$tem;
$txt=~ s/^https:\/\/[a-zA-Z\.]+\///mg;
$tem=$tem."\n  out=".$txt;
}
$contents=join( "\n", @contents );

open(FH, '>', "./aria2_down.txt") or die "Could not open file ./aria2_down.txt because $!";
print FH $contents."\n";
close FH;
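
For reference, here are the two core transformations of this script as a Python sketch: splitting a srcset value into URLs, and producing an aria2 input-file entry whose out= option stores the file under its URL path, so that the relative links rewritten in the mdx step resolve. The function names srcset_urls and aria2_entry are hypothetical, and the sketch assumes protocol-relative URLs (starting with "//") should be fetched over https.

```python
import re
from urllib.parse import urlsplit


def srcset_urls(srcset):
    """Split a srcset value into URLs, dropping descriptors such as "1.5x"."""
    urls = []
    for candidate in re.split(r",\s+", srcset.strip()):
        parts = candidate.split()
        if parts:
            urls.append(parts[0])
    return urls


def aria2_entry(url):
    """Format one URL as an aria2 input-file entry with an out= option."""
    if url.startswith("//"):
        url = "https:" + url  # assumption: protocol-relative means https
    path = urlsplit(url).path.lstrip("/")
    # Option lines in an aria2 input file are indented under the URL line
    return url + "\n  out=" + path
```

Joining the entries with newlines gives a file that can be passed to aria2c -i.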

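HMPT · March 14, 2023, 18:34

Very interesting code! Thanks for the hard work!
Does this download the complete content of every page?
In principle Python shouldn't have this kind of problem. When I scraped before, I fetched the pages with requests and extracted the content with bs4, wrote one txt file per headword, and merged them at the end. Not sure whether that helps.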

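ylxdxx · March 15, 2023, 11:12

It is the same as fetching with requests.

My Python version also used requests, but parsed and extracted with lxml (a bit faster). The problem showed up during large-scale scraping. It should not happen, but it did, and I ended up scraping everything a second time.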