I have crawled the data from thefreedictionary.com. Below are the HTML, CSS and JS files. Could you please explain why the brand_copy div is duplicated everywhere?
Here are the files: Crawl TFD.rar (94.5 KB)
Thank you so much for your help!
async.js adds a brand_copy div to every section tag that has a data-src attribute.
If you are curious, the code that adds the extra brand_copy is in async.js, lines 525 to 544:
for(var i=0;i<l.length;i++){
var e=l[i],s=e.getAttribute('data-src'),p=e.getAttribute('data-src-p');
if(!s)continue;
var copy=src(s,p,e);
if(!copy)continue;
sources.push(s);
var c=document.createElement('div');
c.className='brand_copy';
c.innerHTML=(info.IsApp?'':'<span onclick="location=\'/_/cite.aspx?url=\'+encodeURIComponent(info.canonical)+\'&word=\'+encodeURIComponent(info.word)+\'&sources=\'+sources" style="margin-right:7px;position:relative;top:-2px"><span class="i A" style="width:62px;height:13px;background-position:0 -37px"></span></span>')+copy;
if(e.lastChild&&e.lastChild.getAttribute&&e.lastChild.getAttribute('data-ad')!=null)e.insertBefore(c,e.lastChild);else e.appendChild(c);
var z=e.getAttribute('data-zip'),n=e.getAttribute('data-name');
if((z&&z.length==5)||n){
var c=document.createElement('DIV');
c.innerHTML='';
if(z&&z.length==5){var no=Math.random();c.innerHTML+='<br><div id="WTHR_'+no+'"></div><iframe src="/_/hp/Controls/AsyncWeatherControl.aspx?location='+z+'&contentId=WTHR'+no+'&NOD=5&Unit=F" id="wrifrm'+no+'" style="position:absolute;z-index:-100"></iframe><br>'};
if(n){n=encodeURIComponent(n);c.innerHTML+='<br><img style="margin:0 15px 15px 0;width:320px;height:280px" src="//maps.google.com/maps/api/staticmap?center='+n+'&zoom=5&size=320x280&maptype=map&markers=color:red|color:red|label:|'+n+'&sensor=false"><img style="margin:0 15px 15px 0;width:320px;height:280px" src="//maps.google.com/maps/api/staticmap?center='+n+'&&size=320x280&maptype=terrain&markers=color:red|color:red|label:|'+n+'&sensor=false">'};
};
}};
});
I am not sure which tool you use to crawl, but I can tell you the logic.
When the page first loads, there is no brand_copy in the markup; the browser then executes async.js, which adds it. If you save the page after async.js has already run and the saved page loads async.js again, brand_copy is added a second time.
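One way to avoid the second run is to drop the async.js script tag from the saved HTML before opening it offline. The snippet below is only a minimal sketch of that idea: stripAsyncScript is a hypothetical helper name, and it assumes the script is included through a plain <script src="...async.js"></script> tag. Note that removing the script also disables anything else async.js does on the page.

```javascript
// Hypothetical helper (not part of the site's code): remove every
// <script> tag whose src attribute points at async.js, so the saved
// page does not execute it again and append a second brand_copy.
function stripAsyncScript(html) {
  // Assumption: the tag looks like <script ... src="...async.js..."></script>
  const asyncTag =
    /<script\b[^>]*\bsrc=["'][^"']*async\.js[^"']*["'][^>]*>\s*<\/script>/gi;
  return html.replace(asyncTag, '');
}

// Example with a made-up saved page:
const saved =
  '<head><script src="js/async.js"></script>' +
  '<script src="js/other.js"></script></head>';
console.log(stripAsyncScript(saved));
```

Running the cleaned HTML offline should then show the brand_copy that was already saved, without a duplicate being appended.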
I use the code below (which uses Selenium to get the HTML after the JavaScript has loaded) to extract the content at this URL. My main problem is how to keep the click-to-choose drop-down box, which is rendered by JavaScript. I hope you have some suggestions for me on this project.
Here is the code I use: code.rar (1.8 KB)
Your offline HTML page works fine with the translation button. All you need to do is wait about 4 s until the JS finishes executing.
What are you asking exactly?
Add the CSS below:
#LangBar:nth-child(2n-1) {
display: none
}
Thank you so much for your help! I’m working on this dictionary.