How to scrape 1 million pages from 1 site daily?

  • nik 

What does it mean to scrape 1 million web pages (URLs) daily?
Here, a "page" means 1 HTTP request to 1 URL.

This distinction matters because opening 1 page in a browser can trigger extra requests to other URLs for images, CSS, scripts, etc.

It means:
1,000,000 pages per day
or about 41,667 pages per hour
or about 694 pages per minute
or about 11.57 pages per second.
So you need roughly 12 successful page requests per second, or about 700 per minute.
That is a very high sustained rate for a single site.
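
The arithmetic above can be reproduced with a few lines of Python. This is only a sketch: the function names and the `success_rate` parameter are illustrative assumptions, not part of the original calculation. The second function shows why the word "successful" matters: if some requests fail, the number of attempts per second has to be higher than the target rate.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds in a day


def required_rates(pages_per_day: int) -> dict:
    """Sustained rates needed to reach a daily page target."""
    return {
        "per_hour": pages_per_day / 24,
        "per_minute": pages_per_day / (24 * 60),
        "per_second": pages_per_day / SECONDS_PER_DAY,
    }


def attempts_per_second(pages_per_day: int, success_rate: float) -> float:
    """Attempts per second needed when only a fraction of requests succeed.

    success_rate is a hypothetical value between 0 (exclusive) and 1.
    """
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in the range (0, 1]")
    return pages_per_day / SECONDS_PER_DAY / success_rate


rates = required_rates(1_000_000)
for unit, value in rates.items():
    print(f"{unit}: {value:.2f}")

# Assumed example: if only 80% of requests succeed, more attempts are needed.
print(f"attempts/s at 80% success: {attempts_per_second(1_000_000, 0.8):.2f}")
```

With an assumed 80% success rate, the required attempt rate rises from about 11.6 to about 14.5 requests per second.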
Also, do not forget that the source site will start blocking your requests, so you need about 100K proxy servers to manage it.
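
To put the proxy figure in perspective, here is a rough sketch of the load per proxy. It assumes requests are spread perfectly evenly across the pool, which is a simplification. The variable names are illustrative.

```python
PAGES_PER_DAY = 1_000_000
PROXY_COUNT = 100_000  # the "about 100K" figure from the text
SECONDS_PER_DAY = 24 * 60 * 60

# Assumption: every proxy handles an equal share of the daily requests.
requests_per_proxy_per_day = PAGES_PER_DAY / PROXY_COUNT

# Average gap between two requests sent through the same proxy.
seconds_between_requests = SECONDS_PER_DAY / requests_per_proxy_per_day

print(f"requests per proxy per day: {requests_per_proxy_per_day:.0f}")
print(f"average gap per proxy: {seconds_between_requests / 3600:.1f} hours")
```

Under that assumption, each proxy would send only about 10 requests per day to the site, which is the kind of low per-IP rate that makes blocking less likely.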

