Sitemap web scraping. Let's try to scrape 100K or 1M sitemaps

What is a sitemap and why would I want to scrape it?
A sitemap is the core source of information about the pages of a site.
If a site is SEO-friendly, it almost certainly has a sitemap.
You can find more info about sitemaps
at Google: https://developers.google.com/search/docs/advanced/sitemaps/overview
or Wikipedia: https://en.wikipedia.org/wiki/Site_map

Developers who want to scrape data from sitemaps need to know the following:
The maximum size of a sitemap file is 50MB (uncompressed).
1 sitemap file can have up to 50K URLs inside.
99% of all sitemaps are XML files, and for 90% of sites the sitemap's relative path is /sitemap.xml
The other 10% either do not have a sitemap at all or keep the file in some other place.
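
To make the XML format concrete, here is a minimal sketch of pulling URLs out of one sitemap file with Python's standard library. The sample document, the example.com URLs and the function name extract_urls are made up for illustration, and raising an error above 50K URLs is just one possible policy; a real scraper might only log a warning.

```python
import xml.etree.ElementTree as ET

# XML namespace defined by the sitemap protocol (sitemaps.org).
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
MAX_URLS_PER_FILE = 50_000  # per-file URL limit mentioned above

# Hypothetical sample; a real scraper would download these bytes.
SAMPLE_SITEMAP = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc> https://example.com/blog/ </loc></url>
</urlset>
"""


def extract_urls(xml_bytes):
    """Return the text of every <url><loc> element, in document order."""
    root = ET.fromstring(xml_bytes)
    urls = []
    for loc in root.findall("sm:url/sm:loc", NS):
        # Skip empty <loc> tags and trim stray whitespace around the URL.
        if loc.text and loc.text.strip():
            urls.append(loc.text.strip())
    if len(urls) > MAX_URLS_PER_FILE:
        raise ValueError("more than 50K URLs in a single sitemap file")
    return urls


print(extract_urls(SAMPLE_SITEMAP))
```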

If you want to know the sitemap URL, read it from the site's /robots.txt file.
It should contain a line like
Sitemap: <absolute URL of the sitemap>
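
As a rough illustration, the Sitemap line can be read with the standard library's urllib.robotparser, whose site_maps() method is available from Python 3.8. The robots.txt text and the helper name sitemaps_from_robots below are invented for the example; in practice the text would be downloaded from the site first.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, normally fetched from <site>/robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Sitemap: https://example.com/sitemap.xml
"""


def sitemaps_from_robots(robots_txt):
    """Return the URLs declared in 'Sitemap:' lines, or an empty list."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when the file declares no sitemap.
    return parser.site_maps() or []


print(sitemaps_from_robots(ROBOTS_TXT))
```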

Keep in mind that robots.txt files are written by humans 🤣 so there are plenty of ways for mistakes to end up in them.
For example, for no obvious reason a file may list 2 or 3 sitemaps,
and if you want to scrape sitemaps correctly you need to be prepared for that.
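
One way to be prepared is a tolerant parser that collects every Sitemap line, ignores case and comments, removes duplicates, and falls back to /sitemap.xml when nothing is declared. This is only a sketch under assumptions: the function name collect_sitemaps, the messy sample file and the decision to resolve relative paths against the site root are illustrative choices, not a standard.

```python
from urllib.parse import urljoin, urlparse

# Hypothetical robots.txt with the kind of issues people introduce by hand.
MESSY_ROBOTS = """\
User-agent: *
sitemap: https://example.com/sitemap.xml
Sitemap:https://example.com/sitemap.xml
SITEMAP: /sitemap-news.xml   # relative path, seen in the wild
Sitemap:
"""


def collect_sitemaps(robots_txt, site_root):
    """Return unique sitemap URLs in file order, or a /sitemap.xml fallback."""
    seen = set()
    result = []
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        key, sep, value = line.partition(":")
        if not sep or key.strip().lower() != "sitemap":
            continue
        value = value.strip()
        if not value:
            continue  # "Sitemap:" with nothing after it
        url = urljoin(site_root, value)  # resolves relative paths
        if urlparse(url).scheme not in ("http", "https"):
            continue
        if url not in seen:
            seen.add(url)
            result.append(url)
    # Nothing declared: try the most common location.
    return result or [urljoin(site_root, "/sitemap.xml")]


print(collect_sitemaps(MESSY_ROBOTS, "https://example.com"))
```

Every URL returned this way still has to be fetched and checked, since a declared sitemap may be missing or may not be valid XML.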