Get all the URLs from a domain
How our URL extractor works
Getting all of the URLs for a website isn’t hard; it’s just tedious. We use a process that’s common for web scrapers.
- Fetch the website’s robots.txt file, which usually lists the domain’s sitemaps and any rules for what can and cannot be scraped.
- Iteratively search the sitemap XML files for more sitemap files. A large site may have hundreds of sitemaps, each listing hundreds or thousands of URLs.
- Show the URLs for the first found sitemap, with the option to show results from any other sitemap.
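The steps above can be sketched with Python's standard library. This is a simplified illustration, not our actual extractor: it pulls `Sitemap:` lines out of a robots.txt file and reads one sitemap document, distinguishing a sitemap index (which points at more sitemap files) from a URL set (which lists pages). The sample URLs are hypothetical.

```python
import xml.etree.ElementTree as ET

# Sitemap files use this XML namespace, per the sitemaps.org protocol.
NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def sitemaps_from_robots(robots_txt: str) -> list[str]:
    """Collect the sitemap URLs listed in a robots.txt file."""
    return [
        line.split(":", 1)[1].strip()
        for line in robots_txt.splitlines()
        if line.lower().startswith("sitemap:")
    ]

def parse_sitemap(xml_text: str) -> tuple[list[str], list[str]]:
    """Return (child sitemaps, page URLs) from one sitemap document.

    A <sitemapindex> lists further sitemap files to search; a <urlset>
    lists the actual page URLs.
    """
    root = ET.fromstring(xml_text)
    locs = [el.text.strip() for el in root.iter(f"{NS}loc")]
    if root.tag == f"{NS}sitemapindex":
        return locs, []   # more sitemaps to fetch, no page URLs yet
    return [], locs       # a leaf sitemap: these are page URLs
```

A real crawler would fetch each child sitemap in turn (repeating `parse_sitemap` until no sitemap indexes remain) and honor the robots.txt crawl rules along the way.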
Why does URL extraction fail for some sites?
The above process works for most websites built with even a pinch of SEO know-how. And we have other failsafes built in to handle robots.txt files that contain errors or are missing sitemap entries.
Some sites, though, go out of their way to guard against scrapers, either by disallowing crawling or by hiding their sitemaps. In those cases, we respect their privacy.
If the URL extractor fails on your site and you don't have such safeguards in place, let us know. Search engines rely on crawlers too, so any technical issue that blocks crawlers could also hurt your search traffic.