Is there a free way to download the COCA corpus N-grams data?

Buying it costs 395 USD, which is too expensive.

I searched Google but couldn't find a free source.
I only found one alternative

The sample only contains entries starting with m

250 euros

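Thank you very much. I took a look and it is sorted alphabetically. Is there a ready-made version sorted by frequency? (If not, I'll write some code to process it in a bit.)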

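Honestly, a word list like COCA isn't much use. For most people, knowing about twenty thousand words is enough, and you can't get to those twenty thousand by memorizing a dictionary. Memorizing a dictionary only gives short-term retention; in the long run it doesn't help.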

I'm not memorizing a dictionary, I'm memorizing n-gram collocations.

OK, I have finished the Python code to sort this DSL file.

skip_words = [
    "the", "be", "and", "of", "a", "in", "to", "have", "it", "i", "that", "for", "you", "he", "with", "on", "do", "say", "this"
]


with open('N-grams-2.dsl', 'r', encoding='utf-16-le') as file:
    lines = file.readlines()

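# Split the lines into chunks of 6; each chunk is one entry
n = 6
groups = [lines[i:i+n] for i in range(0, len(lines), n)]
del groups[0]  # The first chunk starts with \ufeff, the UTF-16 byte-order mark, which 'utf-16-le' does not strip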

result = []
for group in groups:
    first_line = group[0].strip()  # First line of the entry; strip the trailing newline
    if first_line == '————————':
        continue

    first_line_split = first_line.split()
    word1, word2 = first_line_split[0], first_line_split[1]

    if word1.lower() in skip_words or word2.lower() in skip_words:
        continue

    last_line_last_part = int(group[-1].split()[-1])  # Last whitespace-separated token of the last line
    result.append((first_line, last_line_last_part))

result.sort(key=lambda item: item[1], reverse=True)

with open('output.txt', 'w', encoding='utf-8') as output_file:
    for item in result:
        output_file.write(f"{item[0]}, {item[1]}\n")
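
If you want to try the grouping and sorting logic without the real file, here is a minimal sketch that runs on made-up lines. The helper name `rank_bigrams`, the shortened skip list, and the sample entries are all hypothetical; the only things taken from the script above are the assumptions that each entry is 6 lines long, the first line is the n-gram, and the last token of the last line is the frequency. It also skips any chunk that doesn't match that shape instead of raising an error.

```python
# Shortened skip list, for illustration only.
SKIP_WORDS = {"the", "be", "and", "of", "a", "in", "to"}


def rank_bigrams(lines, group_size=6):
    """Group lines into fixed-size entries and sort them by frequency, highest first."""
    lines = list(lines)
    ranked = []
    for start in range(0, len(lines), group_size):
        group = lines[start:start + group_size]
        headword = group[0].strip()
        words = headword.split()
        tail = group[-1].split()
        # Skip chunks that do not look like "word1 word2" ... "<count>".
        if len(words) < 2 or not tail or not tail[-1].isdigit():
            continue
        if words[0].lower() in SKIP_WORDS or words[1].lower() in SKIP_WORDS:
            continue
        ranked.append((headword, int(tail[-1])))
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked


# Made-up sample: two 6-line entries (not real COCA data).
sample = [
    "make sure\n", "\tline 2\n", "\tline 3\n", "\tline 4\n", "\tline 5\n", "\tfrequency 120\n",
    "many people\n", "\tline 2\n", "\tline 3\n", "\tline 4\n", "\tline 5\n", "\tfrequency 340\n",
]
print(rank_bigrams(sample))
```

With the sample above, "many people" comes out ahead of "make sure" because its count is higher. To use it on the real file, pass in the lines you read from the DSL file after dropping the first chunk.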