Dictionary.com scraping notes

6lj6 · December 5, 2022, 03:12

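There are two sources for the wordlist:

The XML files listed in robots.txt
The web listing at https://www.dictionary.com/list/0/1
- The web listing actually has 6 pages, but the XML only has 3, so the maximum page number has to be fetched at crawl time.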
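
A minimal sketch of the first source, assuming the XML files are declared on `Sitemap:` lines in robots.txt (the notes only say the XML is "in robots.txt", so this is a guess). The function name `sitemap_urls` and the example.com URLs are placeholders, not anything taken from the real site.

```python
def sitemap_urls(robots_txt):
    """Collect the URLs declared on `Sitemap:` lines of a robots.txt body."""
    urls = []
    for line in robots_txt.splitlines():
        # Split on the first colon only, so the "https://" in the URL survives.
        name, sep, value = line.partition(':')
        # Directive names are matched case-insensitively.
        if sep and name.strip().lower() == 'sitemap':
            urls.append(value.strip())
    return urls


# Made-up robots.txt body for illustration only.
sample = """User-agent: *
Disallow: /private/
Sitemap: https://example.com/sitemap-a.xml
sitemap: https://example.com/sitemap-b.xml
"""
print(sitemap_urls(sample))
```

Since the XML turned out to be incomplete (3 pages against 6 on the site), this would only be useful as a starting list, with the web listing as the authoritative source.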

The selector that worked at the time (2024):

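        # `s` appears to be the parsed BeautifulSoup page; renamed from `max` to avoid shadowing the built-in
        max_page = s.select_one('div[id="content"] > div[data-type="bottom-paging"]').find_all('ul')[-1].find_all('li')[-1].find('a')['href'].split('/')[-1]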
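
The long selector chain above breaks as soon as the markup shifts. One possibly more forgiving approach, sketched here under the assumption that pagination hrefs end in the page number (as in `/list/0/1`), is to collect every pagination href and take the largest trailing number. `max_page_from_hrefs` is a hypothetical helper, and it expects the hrefs to have been gathered already by whatever selector currently works.

```python
import re


def max_page_from_hrefs(hrefs):
    """Return the largest trailing page number among the hrefs, or None."""
    pages = []
    for href in hrefs:
        # Match a final numeric path segment, e.g. the "6" in "/list/0/6".
        match = re.search(r'/(\d+)/?$', href)
        if match:
            pages.append(int(match.group(1)))
    return max(pages) if pages else None


# Made-up hrefs in the same shape as the listing URL from the notes.
hrefs = ['/list/0/1', '/list/0/2', '/list/0/6', '/browse/about']
print(max_page_from_hrefs(hrefs))
```

Taking the maximum rather than the last link means the result does not depend on the order of the `<li>` items.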

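The rest of the code I can no longer make sense of; I would probably figure it out again if I crawled the site once more.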

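        # Each navigation group holds one headword inside its <p>
        hwdsDivs = s.find_all('div', attrs={'data-type': 'results-page-navigation-group'})
        for hwdDiv in hwdsDivs:
            i += 1  # running counter, initialised outside this snippet
            p = hwdDiv.find('p')
            sups = p.find_all('sup')
            if sups:
                sups[-1].decompose()  # drop the last superscript (presumably a homograph number)
            hwd = p.decode_contents()
            hwds[hwd] = True  # dict used as a set of headwords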
        
        script_tag = s.find('script', id='preloaded-state')
        if script_tag:
            # Get the contents and strip whitespace
            script_content = script_tag.string.strip()
            
            # Remove the prefix
            json_text = script_content.replace("window.__PRELOADED_STATE__ = ", "").rstrip(";")
            
            # Parse the JSON (`du` is presumably the author's own helper module that exposes json)
            data = du.json.loads(json_text)

            # Collect every entry under 'index-rank-01' .. 'index-rank-15' as a variant
            results = data['luna']['resultsData'].get('data')
            for i in range(1, 16):
                key = f'index-rank-{i:02d}'  # Format to 'index-rank-01', 'index-rank-02', etc.
                if results and key in results:
                    for e in results[key]:
                        variants[e] = True
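
The preloaded-state step can be tried on its own with just the standard library. The sketch below assumes the script body has the `window.__PRELOADED_STATE__ = {...};` shape and the `luna` / `resultsData` / `data` / `index-rank-NN` layout that the snippet above reads. The helper names `parse_preloaded_state` and `collect_variants` and the sample payload are invented for illustration.

```python
import json

PREFIX = "window.__PRELOADED_STATE__ = "


def parse_preloaded_state(script_text):
    """Strip the JS assignment wrapper and parse the remaining JSON."""
    text = script_text.strip()
    if text.startswith(PREFIX):
        text = text[len(PREFIX):]
    return json.loads(text.rstrip(";"))


def collect_variants(state, max_rank=15):
    """Gather entries under index-rank-01 .. index-rank-NN into a dict."""
    # Tolerate a missing or null 'data' field, as the original code does.
    results = (state.get("luna", {}).get("resultsData") or {}).get("data") or {}
    variants = {}
    for i in range(1, max_rank + 1):
        for entry in results.get(f"index-rank-{i:02d}", []):
            variants[entry] = True
    return variants


# Invented payload; the real page state is much larger.
sample = (
    'window.__PRELOADED_STATE__ = {"luna": {"resultsData": {"data": '
    '{"index-rank-01": ["by hand", "by-hand"], "index-rank-03": ["byhand"]}}}};'
)
state = parse_preloaded_state(sample)
print(collect_variants(state))
```

Using `.get()` with fallbacks keeps the null-`data` case from raising, which is the same case the nested `if` checks in the original were guarding against.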

mdict6 · December 5, 2022, 04:54

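So Brother Six (六哥) is using the phrase "by hand" to look into antonyms?
I have a PPT that covers this area; it is filed under morphology.
The PPT excerpts part of the morphology material.
The gist is that antonyms fall into many types, for example:
husband/wife;
man/woman
Why would husband and wife count as antonyms, though?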

type of antonyms
relative terms

English Morphology and Lexicology - ppt downl(1).pdf (1.7 MB)