Allow Lightcrawl to crawl your site
Lightcrawl is an SEO and site-quality auditing service. When you add a site to Lightcrawl, we crawl its pages on your behalf to run Lighthouse and 30+ SEO and quality checks. We are an identifiable, well-behaved bot: we send a stable identity on every request, we respect robots.txt, and we crawl at a polite rate. We do not try to evade security measures — if your site blocks us, this page explains how to let us through.
How to identify our crawler
Every request from Lightcrawl carries two stable identifiers:
- A request header, X-Lightcrawl-Crawler: 1. This is the most reliable signal to match on: it's unambiguous and easy to allowlist.
- The substring lightcrawl/1.0 in the User-Agent. Our full User-Agent looks like a modern Chrome browser with our token appended:
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36 lightcrawl/1.0 (+https://lightcrawl.com/bot)
Match the X-Lightcrawl-Crawler header or the lightcrawl/ substring, not the full Chrome string, which changes as we keep the browser version current. User-Agent strings and headers can be spoofed by anyone, so for a stronger guarantee combine them with the published IP ranges or the shared-secret option described under "Sites still blocked" below.
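If you filter traffic in application code rather than at a WAF, the matching rule above could be sketched roughly as follows. This is a minimal illustration, not Lightcrawl's own code: the function name looks_like_lightcrawl and the plain-dict representation of request headers are assumptions made for the example.

```python
from typing import Mapping

# Identifiers taken from this page.
CRAWLER_HEADER = "x-lightcrawl-crawler"   # header names are case-insensitive
UA_TOKEN = "lightcrawl/"                  # match the token, not the Chrome string


def looks_like_lightcrawl(headers: Mapping[str, str]) -> bool:
    """Return True if the request carries either Lightcrawl identifier.

    Hypothetical helper: `headers` is assumed to be a simple mapping of
    header name to value. Both signals can be spoofed, so treat a match
    as a hint unless it is combined with a source-IP check.
    """
    # Normalise header names so "X-Lightcrawl-Crawler" and
    # "x-lightcrawl-crawler" are treated the same.
    lowered = {name.lower(): value for name, value in headers.items()}

    if lowered.get(CRAWLER_HEADER, "").strip() == "1":
        return True

    # Fall back to the User-Agent substring.
    return UA_TOKEN in lowered.get("user-agent", "")
```

Matching on the header first mirrors the recommendation above; the User-Agent check is only a fallback for setups that cannot inspect custom headers.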
How we behave
- robots.txt: we fetch and obey it. Rules targeting User-agent: lightcrawl (or *) apply to us; the browser-style User-Agent does not change which robots rules we follow.
- Crawl rate: sequential, roughly two to three requests per second, with cache-revalidation headers so we don't serve stale data.
- Scope: we only crawl sites a customer explicitly added to their Lightcrawl workspace.
How to opt out
To stop Lightcrawl from crawling, disallow it in your robots.txt:
User-agent: lightcrawl
Disallow: /
We honor this. A verified owner can choose to override robots.txt for their own site from within Lightcrawl, but that's an explicit, ownership-gated action; it never applies to sites a workspace doesn't control.
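To sanity-check a robots.txt rule before deploying it, one option is to evaluate it locally with Python's standard urllib.robotparser. The sketch below only shows how a standards-following parser reads the two-line rule; it is not a description of Lightcrawl's own parser, and example.com and the otherbot token are placeholders.

```python
from urllib.robotparser import RobotFileParser

# The opt-out rule from this page, as it would appear in robots.txt.
ROBOTS_TXT = """\
User-agent: lightcrawl
Disallow: /
"""


def is_allowed(robots_txt: str, agent_token: str, url: str) -> bool:
    """Return True if `agent_token` may fetch `url` under `robots_txt`."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network fetch is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent_token, url)


# Pass the robots token ("lightcrawl"), not the full browser-style
# User-Agent: robots rules are matched against the token.
blocked = is_allowed(ROBOTS_TXT, "lightcrawl", "https://example.com/pricing")
others = is_allowed(ROBOTS_TXT, "otherbot", "https://example.com/pricing")
print(blocked, others)
```

With the rule above, the first check reports that the lightcrawl token is disallowed, while a crawler with a different token is unaffected because no rule group applies to it.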
Allowlist recipes by platform
Cloudflare
Add a WAF custom rule with a Skip action that matches our header. In Security → WAF → Custom rules, create a rule like http.request.headers["x-lightcrawl-crawler"][0] eq "1" and set the action to Skip (remaining custom rules, managed rules, and rate limiting). Some products — for example free-plan Bot Fight Mode — can't be skipped by a custom rule; if you use those, you may also need to allow our IP ranges.
Ezoic
Ezoic is an ad/CDN proxy that can serve a challenge page to automated traffic. Add Lightcrawl to your crawler/bot allowlist in the Ezoic dashboard. Ezoic also gates on the source IP, so the header alone may not be enough — see "Sites still blocked" below.
Wordfence / Sucuri (WordPress)
In Wordfence, add the lightcrawl/ User-Agent (or our IP range) under Wordfence → Firewall → Allowlisted URLs/Agents. In Sucuri, add an allowlist rule for the X-Lightcrawl-Crawler header or the User-Agent substring.
Generic WAF, nginx, or Apache
Add an allow rule that matches the X-Lightcrawl-Crawler request header or the lightcrawl/ User-Agent substring, and place it ahead of any bot-blocking rule.
Sites still blocked
Some protections gate on the source IP, not the User-Agent — in that case allowlisting the header alone won't help. Lightcrawl crawls from a small, stable set of IP addresses you can allowlist directly.
Allowlist these Lightcrawl crawler IP addresses:
209.71.89.141
2a09:8280:e626:1:0:108:20b5:0
209.71.81.125
2a09:8280:e618:1:0:108:20b5:0
These are also available as machine-readable JSON at /bot.json (UA token, header, contact, and the current IP list), so you can automate your allowlist against a durable source.
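For protections you control in application code, a source-IP check against the addresses above might look like the following sketch. The addresses are copied from this page at the time of writing and are hardcoded only for illustration; the function name is_lightcrawl_ip is hypothetical, and a real deployment would more likely refresh the list from /bot.json than embed it.

```python
import ipaddress

# Addresses listed on this page; hardcoded here for illustration only.
LIGHTCRAWL_IPS = frozenset(
    ipaddress.ip_address(addr)
    for addr in (
        "209.71.89.141",
        "2a09:8280:e626:1:0:108:20b5:0",
        "209.71.81.125",
        "2a09:8280:e618:1:0:108:20b5:0",
    )
)


def is_lightcrawl_ip(remote_addr: str) -> bool:
    """Return True if `remote_addr` is one of the listed crawler IPs.

    Parsing through `ipaddress` means different spellings of the same
    IPv6 address (upper case, compressed zeros) still compare equal.
    """
    try:
        return ipaddress.ip_address(remote_addr) in LIGHTCRAWL_IPS
    except ValueError:
        # Not a valid IP address string at all.
        return False
```

Behind a proxy or CDN, remote_addr needs to be the real client address as reported by a proxy you trust, otherwise the check compares against the proxy's own IP.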
We're also pursuing verified-bot recognition with major providers (for example Cloudflare), which lets compatible platforms identify Lightcrawl automatically using these stable IPs and this public documentation — no manual rule required. We crawl from 2 regions (ord, iad).
Verify it worked
After allowlisting Lightcrawl, re-run the crawl from the blocked banner in your dashboard. When the crawl reads real pages again, the "blocking our crawler" badge clears automatically.