Dictionary.com scraping notes

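The wordlist can be obtained from two sources:

  • the XML sitemap referenced in robots.txt
  • the web listing at https://www.dictionary.com/list/0/1
    • The listing actually has 6 pages, but the XML only covers 3, so the maximum page number has to be fetched live.
The selector as of 2024: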

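        # Last link of the bottom pager; the final path segment of its href is the highest page number (a string).
        max_page = s.select_one('div[id="content"] > div[data-type="bottom-paging"]').find_all('ul')[-1].find_all('li')[-1].find('a')['href'].split('/')[-1]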
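For the first source, the robots.txt route, here is a minimal sketch of how the sitemap URLs and the highest page number per listing could be read using only the standard library. The helper names (`sitemap_urls`, `max_pages`), the `example.com` inputs, and the assumption that listing URLs follow the `/list/<bucket>/<page>` pattern are all illustrative, not taken from the real site data.

```python
import re
import urllib.robotparser
import xml.etree.ElementTree as ET

# Namespace used by standard sitemap files.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
# Assumed URL shape: /list/<bucket>/<page>
LIST_URL = re.compile(r"/list/([^/]+)/(\d+)/?$")


def sitemap_urls(robots_txt: str) -> list[str]:
    """Return the Sitemap: URLs declared in a robots.txt body."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.site_maps() or []  # site_maps() returns None when there are none


def max_pages(sitemap_xml: str) -> dict[str, int]:
    """Map each listing bucket to the highest page number found in a sitemap."""
    pages: dict[str, int] = {}
    for loc in ET.fromstring(sitemap_xml).iterfind(".//sm:loc", SITEMAP_NS):
        match = LIST_URL.search((loc.text or "").strip())
        if match:
            bucket, page = match.group(1), int(match.group(2))
            pages[bucket] = max(pages.get(bucket, 0), page)
    return pages


# Hypothetical inputs, made up for illustration.
ROBOTS = "User-agent: *\nDisallow: /private/\nSitemap: https://example.com/sitemap.xml\n"
SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/list/0/1</loc></url>
  <url><loc>https://example.com/list/0/3</loc></url>
  <url><loc>https://example.com/list/a/2</loc></url>
</urlset>"""

print(sitemap_urls(ROBOTS))
print(max_pages(SITEMAP))
```

Comparing the value from `max_pages` with the page number read from the live pager would show where the sitemap falls short, as in the 3 versus 6 case noted above.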

The rest of the code I can no longer fully follow; it will probably make sense again the next time I crawl the site.

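        # Collect headwords from a listing page.
        # Assumes `s` is the parsed page, and `hwds` (dict) and `i` (counter) are defined earlier.
        hwdsDivs = s.find_all('div', attrs={'data-type': 'results-page-navigation-group'})
        for hwdDiv in hwdsDivs:
            i = i + 1
            p = hwdDiv.find('p')
            if p is None:
                continue  # skip groups without a headword paragraph
            # Drop the last <sup> (presumably a homograph number) before reading the headword.
            sups = p.find_all('sup')
            if sups:
                sups[-1].decompose()
            hwd = p.decode_contents()
            hwds[hwd] = True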
        
        script_tag = s.find('script', id='preloaded-state')
        if script_tag and script_tag.string:
            # Get the contents and strip whitespace
            script_content = script_tag.string.strip()

            # Remove the JavaScript assignment prefix and the trailing semicolon
            json_text = script_content.replace("window.__PRELOADED_STATE__ = ", "").rstrip(";")

            # Parse the JSON (`du` is a helper module defined elsewhere that exposes `json`)
            data = du.json.loads(json_text)

            # 'data' may be missing or null; treat both as empty
            results = data['luna']['resultsData'].get('data') or {}

            # Collect every entry under 'index-rank-01' .. 'index-rank-15'
            for i in range(1, 16):
                key = f'index-rank-{i:02d}'
                for e in results.get(key, []):
                    variants[e] = True
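The preloaded-state step above can be tried in isolation with a minimal, standard-library-only sketch. The function name `extract_variants` and both sample payloads are hypothetical. The real payload is assumed to hold plain strings under each `index-rank-NN` key, which the original code implies by using the entries as dict keys.

```python
import json

PREFIX = "window.__PRELOADED_STATE__ = "


def extract_variants(script_text: str) -> list[str]:
    """Collect entries under index-rank-01 .. index-rank-15, in order, without duplicates."""
    text = script_text.strip()
    # removeprefix only strips a leading match, unlike str.replace (Python 3.9+)
    payload = text.removeprefix(PREFIX).rstrip(";")
    state = json.loads(payload)
    # Tolerate a missing or null 'data' field
    results = (state.get("luna", {}).get("resultsData") or {}).get("data") or {}
    variants: dict[str, bool] = {}
    for i in range(1, 16):
        for entry in results.get(f"index-rank-{i:02d}") or []:
            variants[entry] = True
    return list(variants)


# Hypothetical payloads, made up for illustration.
SAMPLE = (
    'window.__PRELOADED_STATE__ = {"luna": {"resultsData": {"data": '
    '{"index-rank-01": ["by hand", "by-hand"], "index-rank-02": ["by hand"]}}}};'
)
EMPTY = 'window.__PRELOADED_STATE__ = {"luna": {"resultsData": {"data": null}}};'

print(extract_variants(SAMPLE))
print(extract_variants(EMPTY))
```

Using a dict as an ordered set mirrors the `variants[e] = True` idiom in the notes while keeping first-seen order.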

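So 六哥 is using the phrase "by hand" to look into antonyms?
I have a PPT that covers this; it is filed under morphology.
The PPT excerpts part of the morphology material.
The gist is that antonyms fall into several types, for example:
husband/wife;
man/woman
Why would husband and wife count as antonyms?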

types of antonyms
relative terms

English Morphology and Lexicology - ppt downl(1).pdf (1.7 MB)