There are two sources for the wordlist:
- the XML referenced in robots.txt
- the web pages
https://www.dictionary.com/list/0/1
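- There are actually 6 pages, but the XML lists only 3, so the maximum page number has to be fetched at crawl time.
The selector that worked at the time (2024):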
max = s.select_one('div[id="content"] > div[data-type="bottom-paging"]').find_all('ul')[-1].find_all('li')[-1].find('a')['href'].split('/')[-1]
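Once the href of the last pagination link is known, turning it into a page count and a list of page URLs can be sketched as follows. This is a minimal sketch that assumes the page number is always the last path segment, as in the URL above. The helper names `last_page_number` and `page_urls` are hypothetical, and the href `/list/0/6` is an illustrative value based on the 6 pages noted above, not captured output.

```python
from urllib.parse import urlsplit


def last_page_number(href: str) -> int:
    """Return the trailing path segment of a pagination link as an int."""
    path = urlsplit(href).path.rstrip('/')
    return int(path.rsplit('/', 1)[-1])


def page_urls(first_page_url: str, max_page: int) -> list[str]:
    """Build the URLs for pages 1..max_page by swapping the last path segment."""
    prefix = first_page_url.rstrip('/').rsplit('/', 1)[0]
    return [f"{prefix}/{n}" for n in range(1, max_page + 1)]


# Illustrative href (assumption): the last pagination link on a 6-page list.
urls = page_urls("https://www.dictionary.com/list/0/1", last_page_number("/list/0/6"))
```

Converting the segment with `int` early means a changed page layout fails loudly with a `ValueError` instead of silently producing a wrong page range.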
Other code that I no longer fully understand; crawling the site again would probably make it clear.
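# Assumes i (a counter) and hwds (a dict) were initialized earlier in the script.
hwdsDivs = s.find_all('div', attrs={'data-type': 'results-page-navigation-group'})
if hwdsDivs:
    for hwdDiv in hwdsDivs:
        i = i + 1
        p = hwdDiv.find('p')
        # Drop the last <sup> (if any) so it does not end up in the headword
        sups = p.find_all('sup')
        if sups:
            sups[-1].decompose()
        # The remaining inner HTML of <p> is the headword; the dict acts as a set
        hwd = p.decode_contents()
        hwds[hwd] = True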
script_tag = s.find('script', id='preloaded-state')
if script_tag:
    # Get the contents and strip whitespace
    script_content = script_tag.string.strip()
    # Remove the JavaScript assignment prefix and the trailing semicolon
    json_text = script_content.replace("window.__PRELOADED_STATE__ = ", "").rstrip(";")
    # Parse the JSON (du is presumably a local helper module that exposes json)
    data = du.json.loads(json_text)
    # Collect the entries under 'index-rank-01' ... 'index-rank-15' into variants
    results = data['luna']['resultsData'].get('data')
    if results is not None:
        for i in range(1, 16):
            key = f'index-rank-{i:02d}'  # 'index-rank-01', 'index-rank-02', etc.
            if key in results:
                for e in results[key]:
                    variants[e] = True
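The preloaded-state step above can be tried without a live page. The following is a minimal sketch using only the standard library. The function names `parse_preloaded_state` and `collect_variants` are hypothetical, and the sample string is synthetic data shaped like the keys used above (`luna`, `resultsData`, `data`, `index-rank-NN`), not a real response. It strips the prefix only at the start of the string, a small deviation from the `replace` call above, so that the same text inside the JSON payload is never touched.

```python
import json

PREFIX = "window.__PRELOADED_STATE__ = "


def parse_preloaded_state(script_text: str) -> dict:
    """Strip the JS assignment wrapper and parse the remaining JSON."""
    text = script_text.strip()
    if text.startswith(PREFIX):
        text = text[len(PREFIX):]
    return json.loads(text.rstrip(";"))


def collect_variants(state: dict, max_rank: int = 15) -> dict:
    """Gather entries from index-rank-01 .. index-rank-NN into a dict used as a set."""
    results = (state.get("luna") or {}).get("resultsData") or {}
    ranked = results.get("data") or {}  # 'data' may be missing or None
    variants = {}
    for i in range(1, max_rank + 1):
        for entry in ranked.get(f"index-rank-{i:02d}", []):
            variants[entry] = True
    return variants


# Synthetic sample (assumption): only the key layout mirrors the notes above.
sample = (
    'window.__PRELOADED_STATE__ = {"luna": {"resultsData": {"data": '
    '{"index-rank-01": ["colour", "color"], "index-rank-02": ["color"]}}}};'
)
variants = collect_variants(parse_preloaded_state(sample))
```

Using a dict rather than a set keeps the first-seen order of the entries, which matches the `variants[e] = True` pattern in the original fragment.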