Allow Lightcrawl to crawl your site

Recognize our crawler and allowlist it in your CDN, WAF, or firewall so your site can be audited.

Lightcrawl is an SEO and site-quality auditing service. When you add a site to Lightcrawl, we crawl its pages on your behalf to run Lighthouse and 30+ SEO and quality checks. We are an identifiable, well-behaved bot: we send a stable identity on every request, we respect robots.txt, and we crawl at a polite rate. We do not try to evade security measures — if your site blocks us, this page explains how to let us through.

How to identify our crawler

Every request from Lightcrawl carries two stable identifiers:

  • A request header X-Lightcrawl-Crawler: 1. This is the most reliable signal to match on — it's unambiguous and easy to allowlist.
  • The substring lightcrawl/1.0 in the User-Agent. Our full User-Agent looks like a modern Chrome browser with our token appended:
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36 lightcrawl/1.0 (+https://lightcrawl.com/bot)

Match the X-Lightcrawl-Crawler header or the lightcrawl/ substring — not the full Chrome string, which changes as we keep the browser version current. User-Agent strings and headers can be spoofed by anyone, so for a stronger guarantee combine them with the published IP ranges or the shared-secret option described under "Sites still blocked" below.
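
The matching rule above can be sketched in a few lines of Python. This is only an illustration, not Lightcrawl's own code: the function name looks_like_lightcrawl and the plain dict of headers are assumptions made for the example, and the case-insensitive User-Agent comparison is a choice of this sketch rather than something the page specifies.

```python
def looks_like_lightcrawl(headers: dict[str, str]) -> bool:
    """Return True if the request carries either Lightcrawl identifier.

    Hypothetical helper: `headers` is assumed to be a plain mapping of
    request header names to values, as most web frameworks can provide.
    """
    # HTTP header names are case-insensitive, so normalise before lookup.
    normalised = {name.lower(): value for name, value in headers.items()}

    # Preferred signal: the dedicated request header with the value "1".
    if normalised.get("x-lightcrawl-crawler", "").strip() == "1":
        return True

    # Fallback signal: the "lightcrawl/" token anywhere in the User-Agent.
    # Only the token is matched, never the full Chrome string.
    return "lightcrawl/" in normalised.get("user-agent", "").lower()
```

A match here only says the request claims to be Lightcrawl. Both signals can be spoofed, so treat the result as a hint unless the source IP also checks out.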

How we behave

  • robots.txt: we fetch and obey it. Rules targeting User-agent: lightcrawl (or *) apply to us — the browser-style User-Agent does not change which robots rules we follow.
  • Crawl rate: sequential, roughly two to three requests per second, with cache-revalidation headers so we audit current content rather than a stale cached copy.
  • Scope: we only crawl sites a customer explicitly added to their Lightcrawl workspace.

How to opt out

To stop Lightcrawl from crawling, disallow it in your robots.txt:

User-agent: lightcrawl
Disallow: /

We honor this. A verified owner can choose to override robots.txt for their own site from within Lightcrawl, but that's an explicit, ownership-gated action — it never applies to sites a workspace doesn't control.
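
If you want to check what a robots.txt file says about the lightcrawl token before publishing it, Python's standard urllib.robotparser module can evaluate the rules locally. This is a minimal sketch: the helper name lightcrawl_allowed and the example.com URLs are made up for the example, and the standard-library parser may differ from Lightcrawl's own parser on edge cases.

```python
from urllib.robotparser import RobotFileParser


def lightcrawl_allowed(robots_txt: str, url: str) -> bool:
    """Evaluate robots.txt text for the "lightcrawl" user-agent token.

    Hypothetical helper for local testing; it parses the text you pass
    in and never fetches anything over the network.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("lightcrawl", url)


# The opt-out rule from this page blocks every path for Lightcrawl.
opt_out = "User-agent: lightcrawl\nDisallow: /"
blocked = lightcrawl_allowed(opt_out, "https://example.com/pricing")

# A wildcard group applies to Lightcrawl as well.
wildcard = "User-agent: *\nDisallow: /private"
private_ok = lightcrawl_allowed(wildcard, "https://example.com/private/report")
public_ok = lightcrawl_allowed(wildcard, "https://example.com/blog")
```

Here blocked and private_ok come out False, while public_ok comes out True.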

Allowlist recipes by platform

Cloudflare

Add a WAF custom rule with a Skip action that matches our header. In Security → WAF → Custom rules, create a rule with an expression like:
http.request.headers["x-lightcrawl-crawler"][0] eq "1"
Set the action to Skip (remaining custom rules, managed rules, and rate limiting). Some products, for example free-plan Bot Fight Mode, can't be skipped by a custom rule; if you use those, you may also need to allow our IP addresses.

Ezoic

Ezoic is an ad/CDN proxy that can serve a challenge page to automated traffic. Add Lightcrawl to your crawler/bot allowlist in the Ezoic dashboard. Ezoic also gates on the source IP, so the header alone may not be enough — see "Sites still blocked" below.

Wordfence / Sucuri (WordPress)

In Wordfence, add the lightcrawl/ User-Agent (or our IP range) under Wordfence → Firewall → Allowlisted URLs/Agents. In Sucuri, add an allowlist rule for the X-Lightcrawl-Crawler header or the User-Agent substring.

Generic WAF, nginx, or Apache

Add an allow rule that matches the X-Lightcrawl-Crawler request header or the lightcrawl/ User-Agent substring, and place it ahead of any bot-blocking rule.
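
Rule order matters because most filters stop at the first rule that matches. The toy Python model below shows this with a first-match rule chain. It is not real WAF, nginx, or Apache configuration: the rule functions, the "allow"/"block" verdict strings, and the keyword-based blocker are all invented for illustration, and header names are assumed to be lower-cased already.

```python
from typing import Callable, Mapping, Optional, Sequence

Headers = Mapping[str, str]
Rule = Callable[[Headers], Optional[str]]  # "allow", "block", or None (no match)


def allow_lightcrawl(headers: Headers) -> Optional[str]:
    """Allow rule: match the dedicated header or the User-Agent token."""
    if headers.get("x-lightcrawl-crawler") == "1":
        return "allow"
    if "lightcrawl/" in headers.get("user-agent", ""):
        return "allow"
    return None


def block_automation(headers: Headers) -> Optional[str]:
    """Hypothetical stand-in for an existing keyword-based bot blocker."""
    user_agent = headers.get("user-agent", "").lower()
    if any(word in user_agent for word in ("bot", "crawl", "spider")):
        return "block"
    return None


def evaluate(rules: Sequence[Rule], headers: Headers) -> str:
    """Return the verdict of the first matching rule; default to allow."""
    for rule in rules:
        verdict = rule(headers)
        if verdict is not None:
            return verdict
    return "allow"
```

With [allow_lightcrawl, block_automation] a Lightcrawl request is allowed. With the order reversed the same request is blocked, because the keyword rule matches "crawl" in the User-Agent before the allow rule is reached.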

Sites still blocked

Some protections gate on the source IP, not the User-Agent — in that case allowlisting the header alone won't help. Lightcrawl crawls from a small, stable set of IP addresses you can allowlist directly.

Allowlist these Lightcrawl crawler IP addresses:

209.71.89.141
2a09:8280:e626:1:0:108:20b5:0
209.71.81.125
2a09:8280:e618:1:0:108:20b5:0

These are also available as machine-readable JSON at /bot.json (UA token, header, contact, and the current IP list), so you can automate your allowlist against a durable source.
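
To check source addresses in your own code, Python's standard ipaddress module handles both IPv4 and IPv6 and treats differently written forms of the same address as equal. The sketch below hard-codes the four addresses listed on this page; the names LIGHTCRAWL_IPS and is_lightcrawl_ip are assumptions made for the example. A hard-coded copy can go stale, so a real deployment would refresh the list from /bot.json. The JSON schema isn't shown here, so the sketch doesn't try to parse it.

```python
import ipaddress

# Snapshot of the addresses published on this page (assumed current).
LIGHTCRAWL_IPS = frozenset(
    ipaddress.ip_address(address)
    for address in (
        "209.71.89.141",
        "2a09:8280:e626:1:0:108:20b5:0",
        "209.71.81.125",
        "2a09:8280:e618:1:0:108:20b5:0",
    )
)


def is_lightcrawl_ip(remote_addr: str) -> bool:
    """Return True if remote_addr is one of the published crawler IPs."""
    try:
        return ipaddress.ip_address(remote_addr) in LIGHTCRAWL_IPS
    except ValueError:
        # Not a parseable IP address (empty string, hostname, garbage).
        return False
```

Comparing parsed address objects rather than raw strings means an IPv6 address written with leading zeros or uppercase hex digits still matches. Make sure the address you pass in is the real client IP as seen by your edge, not a value taken from a forwarding header that a client can set.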

We crawl from two regions (ord, iad). We're also pursuing verified-bot recognition with major providers (for example Cloudflare), which would let compatible platforms identify Lightcrawl automatically from these stable IPs and this public documentation, with no manual rule required.

Verify it worked

After allowlisting Lightcrawl, re-run the crawl from the blocked banner in your dashboard. When the crawl reads real pages again, the "blocking our crawler" badge clears automatically.