My{Data}Provider - your web scraping service

Sitemap web scraping. Let's try to scrape 100K or 1M sitemaps

What is a sitemap and why would I want to scrape it? A sitemap is the core information about the pages of a site. If a site is SEO-friendly, it almost certainly has one. You can find more info about sitemaps at Google: https://developers.google.com/search/docs/advanced/sitemaps/overview or Wikipedia: https://en.wikipedia.org/wiki/Site_map

For developers who want to scrape data from sitemaps, it is necessary to know the following. The maximum size of a sitemap file is 50MB uncompressed, and one sitemap file can contain up to 50K URLs. 99% of all sitemaps are XML files, and on 90% of sites the relative path is /sitemap.xml. The other 10% either have no sitemap at all or keep the file somewhere else.

If you want to know the sitemap URL, read it from the /robots.txt file. It should contain a line like Sitemap: Absolute URL to the sitemap. Keep in mind that robots.txt is written by humans 🤣 so there are many ways for mistakes to end up in this file. For example, for no obvious reason it may list 2 or 3 sitemaps, and if you want to scrape sitemaps correctly you need to be prepared for that.
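To make the robots.txt step more concrete, here is a minimal Python sketch of how sitemap discovery could look. It is not the code of our service, just an illustration under a few assumptions: the function names sitemap_urls_from_robots and locs_from_sitemap are made up for this example, example.com is a placeholder, robots.txt and the sitemap are already downloaded, relative Sitemap values are resolved against the site root, and /sitemap.xml is tried as a fallback when robots.txt lists nothing.

```python
from urllib.parse import urljoin
import xml.etree.ElementTree as ET


def sitemap_urls_from_robots(robots_txt: str, site_root: str) -> list[str]:
    """Collect every 'Sitemap:' entry from robots.txt text (hypothetical helper)."""
    found: list[str] = []
    for raw_line in robots_txt.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        key, sep, value = line.partition(":")
        # Humans write "sitemap:", "SITEMAP :" and so on, so compare loosely.
        if not sep or key.strip().lower() != "sitemap":
            continue
        value = value.strip()
        if not value:
            continue
        # Assumption: a relative value is resolved against the site root.
        url = urljoin(site_root, value)
        if url not in found:  # the same sitemap is sometimes listed twice
            found.append(url)
    # Assumption: fall back to the most common location if nothing is listed.
    return found or [urljoin(site_root, "/sitemap.xml")]


def _local_name(tag: str) -> str:
    """Strip the XML namespace: '{ns}urlset' -> 'urlset'."""
    return tag.rsplit("}", 1)[-1]


def locs_from_sitemap(xml_bytes: bytes) -> tuple[str, list[str]]:
    """Return the root element name and the <loc> values of its entries."""
    root = ET.fromstring(xml_bytes)
    kind = _local_name(root.tag)  # "urlset" or "sitemapindex"
    locs: list[str] = []
    for entry in root:  # <url> or <sitemap> elements
        for child in entry:
            if _local_name(child.tag) == "loc" and child.text:
                locs.append(child.text.strip())
    return kind, locs


if __name__ == "__main__":
    robots = (
        "User-agent: *\n"
        "Disallow: /admin\n"
        "sitemap: https://example.com/a.xml\n"
        "Sitemap: /b.xml # relative path, written by a human\n"
    )
    print(sitemap_urls_from_robots(robots, "https://example.com"))
```

When the root element is sitemapindex, the returned values are URLs of other sitemap files, so the same function has to be called again for each of them. For sitemaps from untrusted sites you may prefer a hardened XML parser over the standard library one.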