We have a Walmart scraper, and we got a task: download 20K high-quality products from Walmart.
Find more here https://mydataprovider.com/sites/walmart/
This task looks simple, but once you know the definition of high-quality products you will see that it is not.
To get 20K high-quality products from Walmart via scraping, you need to scrape about 500K or even 1,000K (1 million) products.
After web scraping, you have to filter the products by the reviews on their Walmart product pages plus the seller ratings.
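The filtering step can be sketched roughly like this. It is a minimal illustration, not our production filter: the `Product` fields and the threshold values (`min_reviews`, `min_rating`, `min_seller_rating`) are assumptions made for the example, and the real definition of "high-quality" depends on the project.

```python
from dataclasses import dataclass


@dataclass
class Product:
    # Hypothetical record layout for one scraped product.
    url: str
    review_count: int     # number of customer reviews on the product page
    avg_rating: float     # average star rating, 0.0 to 5.0
    seller_rating: float  # rating of the seller, 0.0 to 5.0


def is_high_quality(product, min_reviews=50, min_rating=4.0, min_seller_rating=4.0):
    """Return True if the product passes all three (assumed) thresholds."""
    return (
        product.review_count >= min_reviews
        and product.avg_rating >= min_rating
        and product.seller_rating >= min_seller_rating
    )


def filter_high_quality(products, **thresholds):
    """Keep only the products that pass is_high_quality."""
    return [p for p in products if is_high_quality(p, **thresholds)]
```

With 20K products wanted out of 500K to 1 million scraped, only about 2% to 4% of scraped products survive a filter like this, which is why the scraping volume has to be so large.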
How did we get product URLs?
In the first step, we started by collecting product URLs from categories.
We created a scraper that did a simple job:
input – category URLs
output – product URLs.
Plus one simple feature: you can pass in a category URL with all the filters already applied in the browser UI.
This allowed us to cut the number of requests for unwanted products (we filtered them out by rating, price, seller, etc.).
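Here is a minimal sketch of the "category URL in, product URLs out" idea. It assumes the category page HTML has already been downloaded, and it assumes product links can be recognized by an `/ip/` segment in the path; `extract_product_urls` and the `marker` parameter are names invented for this example, not part of our scraper.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit, urlunsplit


class _LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_product_urls(category_url, html, marker="/ip/"):
    """Return unique absolute product URLs found in a category page.

    category_url may carry filter parameters from the browser UI; it is
    only used here to resolve relative links.
    """
    collector = _LinkCollector()
    collector.feed(html)
    seen = set()
    result = []
    for href in collector.hrefs:
        parts = urlsplit(urljoin(category_url, href))
        if marker not in parts.path:
            continue  # not a product link under our assumption
        # Drop query string and fragment so tracking parameters
        # do not produce duplicates of the same product.
        clean = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
        if clean not in seen:
            seen.add(clean)
            result.append(clean)
    return result
```

Because the filters live in the category URL itself, the page that comes back already excludes most unwanted products, so the extractor does not need to know anything about ratings or prices.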
But! You need to collect the category URLs and apply the UI filters by hand.
It takes time, and a human has to do that work!
So we started looking for another way.
Walmart product URL scraping via sitemap.xml / robots.txt
As you probably know, robots.txt can link to a sitemap or to several sitemaps.
Let's look at Walmart's robots.txt:
You can see that it lists several sitemaps,
logically divided by topic: articles, brands, products, categories, etc.
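Pulling the sitemap links out of robots.txt takes only a few lines. The sketch below works on the text of a robots.txt file that has already been downloaded; `sitemap_urls_from_robots` is a name made up for this example, and the sample content uses example.com placeholders rather than Walmart's real file.

```python
def sitemap_urls_from_robots(robots_txt):
    """Return the URLs from all 'Sitemap:' lines of a robots.txt text."""
    urls = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and padding
        key, sep, value = line.partition(":")  # split at the first colon only
        # The field name is case-insensitive, so compare in lower case.
        if sep and key.strip().lower() == "sitemap" and value.strip():
            urls.append(value.strip())
    return urls


# Placeholder robots.txt content, for illustration only.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /account/
Sitemap: https://example.com/sitemap_products.xml
sitemap: https://example.com/sitemap_categories.xml  # lower-case field name
"""

print(sitemap_urls_from_robots(SAMPLE_ROBOTS))
```

Python's standard library can also do this: `urllib.robotparser.RobotFileParser` has a `site_maps()` method (Python 3.8+). The hand-written version is shown only because it makes the file format visible.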
So the idea is to use them to scrape all product URLs from the sitemaps!
Now we have to develop a scraper that collects product URLs from Walmart's sitemaps.
It is important to know that the Walmart sitemap has a deep hierarchy and stores part of its XML data in gzip-compressed (.gz) archives.
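A deep hierarchy plus gzip can be handled with one small recursive function. This is a sketch under assumptions, not our actual implementation: `iter_sitemap_urls` is a hypothetical name, and `fetch` is any function you supply that takes a URL and returns the response body as bytes, so that the download logic (headers, retries, throttling) stays outside the parser.

```python
import gzip
import xml.etree.ElementTree as ET

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def _local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_sitemap_urls(sitemap_url, fetch, seen=None):
    """Yield page URLs from a sitemap, following nested sitemap indexes.

    fetch(url) -> bytes is provided by the caller (an assumption of this sketch).
    """
    seen = set() if seen is None else seen
    if sitemap_url in seen:
        return  # guard against indexes that reference each other
    seen.add(sitemap_url)

    data = fetch(sitemap_url)
    if data[:2] == GZIP_MAGIC:
        data = gzip.decompress(data)  # .xml.gz archives

    root = ET.fromstring(data)
    is_index = _local_name(root.tag) == "sitemapindex"
    # Entries are <sitemap> (in an index) or <url> (in a urlset);
    # each one holds its address in a direct <loc> child.
    for entry in root:
        for child in entry:
            if _local_name(child.tag) != "loc" or not child.text:
                continue
            loc = child.text.strip()
            if is_index:
                yield from iter_sitemap_urls(loc, fetch, seen)
            else:
                yield loc
```

For a real run, `fetch` could be as simple as `lambda url: urllib.request.urlopen(url).read()`. Since the function is a generator, URLs can be written out as they arrive instead of holding a million of them in memory.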
And we implemented that!
We hope this article helps you build something similar for your project!