How to avoid an IP block?

I’m currently crawling the word list of https://www.thefreedictionary.com/. Unfortunately, I get an IP ban after a number of iterations.

Could you please suggest how to overcome this hurdle?

https://docs.scrapy.org/en/latest/topics/practices.html#bans

Here are some tips to keep in mind when dealing with these kinds of sites:

  • rotate your user agent from a pool of well-known ones from browsers (google around to get a list of them)
  • disable cookies as some sites may use cookies to spot bot behaviour
  • use download delays (2 seconds or higher)
  • if possible, use Google cache to fetch pages, instead of hitting the sites directly
  • use a pool of rotating IPs. For example, the free Tor project or paid services like ProxyMesh. An open source alternative is scrapoxy, a super proxy that you can attach your own proxies to.
  • use a highly distributed downloader that circumvents bans internally, so you can just focus on parsing clean pages. One example of such downloaders is Crawlera.
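
Here is a minimal sketch of how a few of these tips (rotating user agents, disabling cookies, download delays, rotating proxies) might look in a Scrapy project. It is only an illustration under some assumptions: the class name RotateIdentityMiddleware, the module path myproject.middlewares, the user-agent strings, and the proxy URLs are all hypothetical placeholders, not something taken from the docs or from your project.

```python
import random

# Hypothetical pool of browser user agents; replace with current, real strings.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/115.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/16.5 Safari/605.1.15",
]

# Hypothetical proxy endpoints; leave empty to skip proxy rotation.
PROXIES = [
    "http://proxy1.example.com:8000",
    "http://proxy2.example.com:8000",
]


class RotateIdentityMiddleware:
    """Downloader middleware that picks a random user agent (and proxy) per request."""

    def __init__(self, user_agents=USER_AGENTS, proxies=PROXIES):
        self.user_agents = list(user_agents)
        self.proxies = list(proxies)

    def process_request(self, request, spider):
        # Overwrite the User-Agent header with a random entry from the pool.
        request.headers["User-Agent"] = random.choice(self.user_agents)
        # Scrapy's built-in HttpProxyMiddleware reads the proxy from request.meta.
        if self.proxies:
            request.meta["proxy"] = random.choice(self.proxies)
        return None  # returning None lets Scrapy keep processing the request


# Settings that could go in settings.py or in a spider's custom_settings.
CUSTOM_SETTINGS = {
    "COOKIES_ENABLED": False,          # disable cookies
    "DOWNLOAD_DELAY": 2,               # base delay between requests, in seconds
    "RANDOMIZE_DOWNLOAD_DELAY": True,  # vary the actual wait around the base delay
    "DOWNLOADER_MIDDLEWARES": {
        # Turn off the built-in fixed user agent middleware.
        "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": None,
        # Hypothetical module path for the class above.
        "myproject.middlewares.RotateIdentityMiddleware": 400,
    },
}
```

To try it, you would put the class in your project's middlewares module and copy the settings into settings.py (or assign the dict to custom_settings on the spider). Whether this is enough depends on how the site detects crawlers, so treat the delay and pool sizes as starting points.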

My recommended way.

I tried this one once, and it was time-consuming and almost overwhelming.

Thank you so much! I will try them.