How Lightcrawl crawls your site

Crawl scope, request rate, sitemaps, and how we honor robots.txt.

When you add a site, Lightcrawl crawls it to discover pages, then audits each page with Lighthouse and our own SEO and quality checks. This page explains how the crawl works so you know what to expect in your logs.

What we crawl

We start at the URL you added (the seed) and discover the rest of the site two ways: by reading your sitemap.xml (and any sitemaps referenced from robots.txt), and by following internal links from the pages we fetch. A site's crawl scope is one of:

  • Domain (default): all pages on the same host, found via sitemap and link discovery.
  • Single page: only the URL you added.
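The sitemap half of discovery can be illustrated with a short Python sketch. It is not Lightcrawl's implementation: the function name `sitemap_locs` is hypothetical, and the only thing taken as given is the standard sitemaps.org XML namespace. It reads one sitemap document and returns its kind along with every `<loc>` it lists, so a caller can tell page URLs (a `urlset`) from nested sitemaps (a `sitemapindex`).

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemap_locs(xml_text: str) -> tuple[str, list[str]]:
    """Return (kind, locs) for one sitemap document.

    kind is "urlset" for a list of pages or "sitemapindex" for a list
    of further sitemaps. Hypothetical helper, for illustration only.
    """
    root = ET.fromstring(xml_text)
    kind = root.tag.removeprefix(SITEMAP_NS)
    # <loc> appears under <url> in a urlset and under <sitemap> in an index.
    locs = [el.text.strip() for el in root.iter(f"{SITEMAP_NS}loc") if el.text]
    return kind, locs


example = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>https://example.com/</loc></url>"
    "<url><loc>https://example.com/pricing</loc></url>"
    "</urlset>"
)
kind, locs = sitemap_locs(example)
```

A crawler built this way would fetch each URL from a `urlset` directly and call the same function again on every entry of a `sitemapindex`.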

By default we stay on the exact host you added and ignore URL query parameters when deciding whether two links are the same page. Subdomain and query-parameter handling are configurable per site.
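To make the default behavior concrete, here is a minimal Python sketch of the two rules above: staying on the exact host, and ignoring query parameters when comparing links. The helper names `page_key` and `in_scope` are hypothetical, and dropping the fragment and treating an empty path as "/" are assumptions of this sketch, not documented Lightcrawl behavior.

```python
from urllib.parse import urlsplit, urlunsplit


def page_key(url: str) -> str:
    """Key used to decide whether two links are the same page.

    Drops the query string (the default described above) and, as an
    assumption of this sketch, the fragment as well.
    """
    parts = urlsplit(url)
    path = parts.path or "/"  # assumption: treat "" and "/" as the same page
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, "", ""))


def in_scope(seed: str, url: str) -> bool:
    """True if url is an http(s) URL on exactly the seed's host."""
    target = urlsplit(url)
    return (
        target.scheme in ("http", "https")
        and target.hostname == urlsplit(seed).hostname
    )


seed = "https://example.com/"
same_page = page_key("https://example.com/pricing?ref=a") == page_key(
    "https://example.com/pricing?ref=b"
)
on_host = in_scope(seed, "https://example.com/blog")
subdomain = in_scope(seed, "https://blog.example.com/")
```

Here `same_page` and `on_host` are true, while `subdomain` is false because `blog.example.com` is a different host from `example.com`.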

How fast we crawl

We crawl sequentially and politely, at roughly two to three requests per second, so we don't add meaningful load to your origin. We send cache-revalidation headers so the pages we read reflect what your visitors see, not a stale cache.
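As a rough illustration of sequential, paced crawling, the Python sketch below spaces requests a minimum interval apart. Everything in it is an assumption made for the example: the `Pacer` class is hypothetical, the 0.4-second interval is simply one value that works out to 2.5 requests per second, and `Cache-Control: no-cache` is one standard request header that asks caches to revalidate with the origin. The exact headers Lightcrawl sends are not listed here.

```python
import time


class Pacer:
    """Keep sequential requests at least `min_interval` seconds apart.

    Hypothetical helper. The clock and sleep functions are injectable
    so the pacing logic can be exercised without real waiting.
    """

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._next_at = None  # earliest time the next request may start

    def wait(self) -> float:
        """Block until the next request may start; return seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._next_at is not None and now < self._next_at:
            delay = self._next_at - now
            self._sleep(delay)
            now = self._next_at
        self._next_at = now + self.min_interval
        return delay


# Assumed example of a revalidation request header (not a documented list).
REVALIDATE_HEADERS = {"Cache-Control": "no-cache"}

# 0.4 s between requests is 2.5 requests per second (illustrative value).
pacer = Pacer(min_interval=0.4)
```

A fetch loop would call `pacer.wait()` before each request and pass `REVALIDATE_HEADERS` along with it. Because requests are made one at a time, a slow origin naturally lowers the rate further.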

robots.txt

We fetch and obey robots.txt before crawling. Rules targeting User-agent: lightcrawl (or *) apply to us. If your robots.txt disallows the page you added, the crawl stops with a "blocked by robots.txt" result instead of returning an empty crawl, and we point you to ownership verification if you want to crawl it anyway.
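To preview how a given robots.txt treats the lightcrawl user agent, you can run it through Python's standard-library parser. This is a sketch under stated assumptions: the robots.txt content and the `load_rules` helper are made up for the example, and `urllib.robotparser` may differ from our parser in edge cases, so treat the result as a first check only.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: lightcrawl may crawl everything except /admin/,
# while all other crawlers are blocked from the whole site.
ROBOTS_TXT = """\
User-agent: lightcrawl
Disallow: /admin/

User-agent: *
Disallow: /

Sitemap: https://example.com/sitemap.xml
"""


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)

# The lightcrawl group applies to us, so only /admin/ is off limits.
seed_allowed = rules.can_fetch("lightcrawl", "https://example.com/pricing")
admin_allowed = rules.can_fetch("lightcrawl", "https://example.com/admin/users")

# Sitemap lines are exposed too (Python 3.8+).
sitemaps = rules.site_maps() or []
```

With this file, `seed_allowed` is true and `admin_allowed` is false. If the URL you added were under `/admin/`, that would correspond to the "blocked by robots.txt" case described above.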

If a crawl comes back empty

A crawl that finds only one page (or none) usually means one of two things: your robots.txt disallows us, or a security/CDN layer is serving our crawler a challenge instead of your real content. Lightcrawl detects both and tells you which it is, with a link to the fix. For the second case, see Allow Lightcrawl to crawl your site.