O X ten raw_data cleanup

Original thread: O X ten raw_data

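This thread exists to organize the data from the original thread. For that reason it has "upvote/downvote voting" selected (top-left corner of a new thread), so that the content is sorted and filtered automatically.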

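Before replying:

  1. Check the original thread and the existing posts here; if the topic is already covered, reply as a nested comment (the "Add comment" link at the bottom-left of each post).
  2. Only start a new post after confirming that none exists.

Thanks, everyone.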


Raw material: OALD_10th_data.7z (15.4 MB)

  • json.loads raises an error (I have not been able to work out why), so for now the links are extracted with a regex
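A rough sketch of that fallback, for anyone who wants to dig into the error: json.JSONDecodeError carries the line, column and message of the first character the parser rejected, which is usually enough to locate the problem. The function name extract_links, the URL pattern and the sample string are all assumptions for illustration; the real data may need a different pattern.

```python
import json
import re

# Assumed pattern: http(s) URLs ending in a media extension. Adjust to the real data.
LINK_RE = re.compile(r'https?://[^\s"\'<>\\]+?\.(?:mp3|wav|jpg|png)')

def extract_links(raw: str) -> list[str]:
    """Report why strict JSON parsing fails (if it does), then extract links by regex."""
    try:
        json.loads(raw)
    except json.JSONDecodeError as err:
        # lineno/colno point at the first character the parser rejected.
        print(f"json.loads failed at line {err.lineno}, column {err.colno}: {err.msg}")
    # The regex works on the raw text, so it does not care whether the JSON is valid.
    return LINK_RE.findall(raw)

# Made-up sample with a trailing comma, which strict JSON does not allow.
sample = '{"a": "https://example.com/x/hello_1.mp3", "b": "https://example.com/ill/cat.jpg",}'
links = extract_links(sample)
```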

Technical notes: example of assembling the mdx format

mdx

f
<a href="sound://sentence_mp3/__em%23_gbs_1.wav">ddd</a>
<img src="ill/fruit_misc.jpg">
<a href="sound://word_mp3/5p%23_gb_1.mp3">b</a>
</>

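Resource paths have to go through urllib.parse.quote(name), otherwise the file is not found. With mdx's private sound:// protocol no JS is needed. For the audio backend choose Qt Multimedia (the ffmpeg backend did not pass testing; tested on goldendict-ng).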
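A minimal sketch of how such an entry could be assembled, assuming the plain-text layout shown in the example (headword line, HTML body, then a closing </> line) and assuming the original file names contain a "#", which is what %23 decodes to. The helper names sound_link and build_entry are hypothetical.

```python
from urllib.parse import quote

def sound_link(folder: str, name: str, label: str) -> str:
    """Build an <a> tag using the sound:// scheme with a percent-encoded file name."""
    # quote() leaves letters, digits and "_.-~" untouched and encodes "#" as %23.
    return f'<a href="sound://{folder}/{quote(name)}">{label}</a>'

def build_entry(headword: str, body_lines: list[str]) -> str:
    """Serialize one entry: headword, HTML body lines, then the '</>' terminator."""
    return "\n".join([headword, *body_lines, "</>"]) + "\n"

entry = build_entry(
    "f",
    [
        sound_link("sentence_mp3", "__em#_gbs_1.wav", "ddd"),
        '<img src="ill/fruit_misc.jpg">',
        sound_link("word_mp3", "5p#_gb_1.mp3", "b"),
    ],
)
```

Printing entry should reproduce the example entry above, with the links already percent-encoded.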

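mdd layout: the default mdd holds the required resources. The optional resources are numbered: 1 is images, 2 is word pronunciations, 3 is example-sentence pronunciations (these were previously packed as 1; simply rename that file to 3).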
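The rename could be scripted roughly as follows. This is only a sketch: the file name OALD10.1.mdd is hypothetical, and it assumes the numbered mdd files are named name.1.mdd, name.2.mdd and so on.

```python
from pathlib import Path

def renumber_mdd(path: Path, new_index: int) -> Path:
    """Return the path with its numeric part swapped, e.g. name.1.mdd -> name.3.mdd."""
    stem, _old_index, ext = path.name.rsplit(".", 2)
    return path.with_name(f"{stem}.{new_index}.{ext}")

# Hypothetical file name; nothing is touched on disk here.
target = renumber_mdd(Path("OALD10.1.mdd"), 3)
# To actually rename the file: Path("OALD10.1.mdd").rename(target)
```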

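Data reconstruction: the source data is in the second post, 40970 entries in total.

  • The top-level count of 40972 lines is made up of multi-line records that all share one identical value, plus the opening and closing lines of key: {}, i.e. 40970 records: complete and identical.
  • For "id", the 122910 lines are records with different values (many distinct values), i.e. 40970 * 3: complete and unique.
  • For "word", the 122790 lines are 40930 * 3, so there are duplicates.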
      1 {
      2     "data": {
      3         "o10dict": {
      4             "id": {
      5 +---122910 lines: "u596c17338875400e.30be04e6.154e2615987.3466": {
 122915             },
 122916             "word": {
 122917 +---122790 lines: "-ie": {
 245707             },
 245708             "word_body": {
 245709 +---8973718 lines: "o10dict": {
9219427             }
9219428         }
9219429     },
9219430     "status_code": {
9219431 +--40972 lines: "0": {
9260403     },
9260404     "message": {
9260405 +--40972 lines: "\u6210\u529f": {
9301377     }
9301378 }
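The counts quoted above can be cross-checked with a few lines of arithmetic. The sketch below only uses numbers that appear in the folded view; the variable names are my own.

```python
records = 40970

# "status_code" / "message": one line per record plus the opening and closing lines.
assert records + 2 == 40972

# "id": 122910 folded lines, i.e. 40970 * 3.
assert records * 3 == 122910

# "word": 122790 folded lines is 40930 * 3, which the post attributes to duplicates.
assert 40930 * 3 == 122790
shortfall = records - 40930

# Fold start line + folded line count = line number right after the fold.
assert 5 + 122910 == 122915
assert 122917 + 122790 == 245707
assert 245709 + 8973718 == 9219427
assert 9219431 + 40972 == 9260403
assert 9260405 + 40972 == 9301377
```

All assertions hold, and the "word" block comes up 40 short of the 40970 records.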

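Anyone who wants to complete the data will need the app to check these words against.

The numbers are line numbers in the source file. The spot in question is the top bar of an entry (before the word pronunciation, where the headword and the word-list level markers sit). What is special about these 11 words?

Update: screenshot kindly provided by bud.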

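Example-sentence pronunciation MP3s

  • Count
    • 107502 occurrences
    • 106890 MP3s after deduplication: fewer than the 114178 in OALD_9_mdx 3.1.2
  • Audio quality
    • Better
  • Size
    • 20 GB: 20 times larger; packed into mdd it comes to 20 GB in total
  • Download
    • gofile upload in progress
    • Download from the official server:
    • @hua Does the forum admin have any use for this? Could a 20 GB downloads or cloud space be set up?
    • Compressed version (maybe)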
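The two counts above (occurrences versus distinct files) come from a plain deduplication. A minimal sketch, with a made-up list standing in for the extracted MP3 links:

```python
from collections import Counter

# Hypothetical sample; the real list would hold the 107502 extracted links.
mp3_links = ["a_1.mp3", "b_1.mp3", "a_1.mp3", "c_1.mp3"]

counts = Counter(mp3_links)
occurrences = sum(counts.values())  # every appearance, duplicates included
unique_files = len(counts)          # distinct file names
repeated = {name: n for name, n in counts.items() if n > 1}
```

With the real figures, 107502 - 106890 leaves 612 repeated occurrences.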

Pronunciation categories: British/American, strong/weak
+------------40388 lines: "BrE": {
+------------40383 lines: "NAmE": {
+------------ 86 lines: "NAmE also": {
+------------ 87 lines: "BrE also": {
+------------ 55 lines: "EAfrE": {
+------------ 53 lines: "SAfrE": {
+------------ 11 lines: "WAfrE": {
+------------ 3 lines: "BrE sometimes":

+------------40388 lines: "": {
+------------ 35 lines: "strong form": {
+------------ 10 lines: "weak form": {
+------------ 10 lines: "before vowels": {
+------------ 3 lines: "before names": {
+------------ 5 lines: "before vowels and finally": {