
Verified crawlers, evidence labels, and IPv6: signals you can reason about

This release is about a single idea: signals you can reason about. Three changes shipped together — verified-crawler detection, per-signal evidence labels, and IPv6 parity — and they all serve the same goal. Not "trust our number," but "here's exactly what we saw, and how directly we saw it, so you decide."

The bot you must not block

Most "bot detection" is about catching bad bots. This is the opposite. is_verified_bot identifies a verified good crawler — Googlebot, Bingbot and similar — matched against the operator's own published IP ranges. The accompanying verified_bot_name tells you which one.

Why does that matter? Because the expensive mistake isn't letting a crawler through — it's blocking one. Serve Googlebot a 403 and you start dropping out of the index. So a verified Googlebot is not a risk to manage; it's a fact to act on:

// handleRequest is a hypothetical wrapper so the early return is valid;
// geoq is your initialised client and serveNormally() is your own handler.
async function handleRequest(ip) {
  const r = await geoq.check(ip);

  // A verified good crawler is NOT fraud, so don't run it through your risk gate.
  if (r.signals.is_verified_bot) {
    // e.g. 'googlebot', 'bingbot'
    console.log('verified crawler:', r.signals.verified_bot_name);
    return serveNormally();   // blocking Googlebot deindexes you
  }

  // is_verified_bot carries ZERO risk weight, so it never lands in reasons[].
  // r.risk.score reflects only the abuse signals that actually fired.
}

And here's the part we care most about: because a verified good crawler is not fraud, is_verified_bot carries zero risk weight. It never appears in risk.reasons and never moves risk.score. It would be easy to inflate a score by counting "it's a bot" as suspicious — we don't, because it would be wrong. This is also not behavioural bad-bot detection; that's a different problem, and it's on the roadmap, not quietly bundled in here.

Evidence: how directly we saw it, not how likely it is

Every signal now ships with an evidence label. This is the honesty primitive of the whole API, so it's worth being precise about what it means — and what it doesn't.

{
  signals: { is_datacenter: true, is_vpn: false, is_proxy: false, is_tor: false },
  evidence: {
    datacenter: 'authoritative', // matched a cloud provider's published CIDRs
    tor:        'authoritative', // matched the Tor exit list
    vpn:        'inferred',      // attributed at the ASN level, not a published list
    proxy:      'beta',          // heuristic, still maturing — weight accordingly
    verified_bot: 'authoritative'
  },
  risk: { score: 35, level: 'medium', reasons: ['is_datacenter'] }
}

The labels describe observation directness:

  • authoritative — drawn from an operator-published list: a cloud provider's CIDRs, the Tor exit list, Googlebot's published ranges. The most direct observation we have.
  • inferred — attributed at the ASN level rather than from a published list. A reasoned attribution, not a direct one.
  • beta — heuristic and partial, still maturing. Weight it accordingly.

What evidence is not: it is not a probability or an accuracy score. "authoritative" doesn't mean "definitely correct" — a published range can still be stale. It means we observed the signal as directly as it can be observed. We'd rather hand you that distinction than launder it into a single confidence percentage that pretends to know more than we do.
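
One way to act on these labels is to let your own code decide how much each one counts. The sketch below is illustrative only: the multipliers in EVIDENCE_WEIGHT, the SIGNAL_FOR mapping and the firedSignalsByEvidence helper are our assumptions, not values or functions from the API, and the sample object is made up to match the response shape shown above.

```javascript
// Hypothetical, caller-chosen multipliers. These are NOT API values and NOT
// probabilities: they only encode how much YOUR logic leans on each label.
const EVIDENCE_WEIGHT = { authoritative: 1.0, inferred: 0.6, beta: 0.3 };

// Map each evidence key to the signal it describes (shape assumed from the sample response).
const SIGNAL_FOR = {
  datacenter: 'is_datacenter',
  tor: 'is_tor',
  vpn: 'is_vpn',
  proxy: 'is_proxy',
};

// Return the signals that fired, each paired with its evidence label and your weight.
function firedSignalsByEvidence(response) {
  const fired = [];
  for (const [key, signal] of Object.entries(SIGNAL_FOR)) {
    if (response.signals[signal] !== true) continue;
    const label = response.evidence[key];
    // A missing or unknown label gets the most cautious weight rather than a guess.
    const weight = EVIDENCE_WEIGHT[label] ?? Math.min(...Object.values(EVIDENCE_WEIGHT));
    fired.push({ signal, label, weight });
  }
  return fired;
}

// Made-up sample in the documented shape.
const sample = {
  signals: { is_datacenter: true, is_vpn: false, is_proxy: true, is_tor: false },
  evidence: { datacenter: 'authoritative', tor: 'authoritative', vpn: 'inferred', proxy: 'beta' },
};

const fired = firedSignalsByEvidence(sample);
console.log(fired);
```

Keeping the weights on your side of the call is deliberate: the label says how directly a signal was observed, and how much that should matter depends on what the decision costs you when it is wrong.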

IPv6, at parity

Signals that used to be IPv4-only now cover IPv6 too: is_datacenter, is_tor and is_verified_bot all resolve against IPv6 addresses. (Geolocation and ASN lookups already did.) As more traffic and more crawlers arrive over IPv6, a check that returned nothing for a 2001:… address was a silent gap. Now it's the same call and the same fields for either family.
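
For calling code, parity should mean the address family never changes how you call the check or read the result. Here is a minimal sketch of that idea. It uses Node's built-in net.isIP to classify the input; lookupStub is a hypothetical stand-in that performs no real lookup, and the addresses come from the reserved documentation ranges.

```javascript
const net = require('net');

// Classify an address with Node's built-in parser: 4, 6, or 0 for "not an IP".
function addressFamily(ip) {
  return net.isIP(ip);
}

// Hypothetical stand-in for the real client. It performs no lookup and only
// shows that the returned field set does not depend on the address family.
async function lookupStub(ip) {
  return { ip, signals: { is_datacenter: false, is_tor: false, is_verified_bot: false } };
}

// One call path for both families; the family is recorded for logging only.
async function checkAnyFamily(ip) {
  const family = addressFamily(ip);
  if (family === 0) throw new TypeError(`not an IP address: ${ip}`);
  const result = await lookupStub(ip);
  return { family, fields: Object.keys(result.signals) };
}

// Documentation-range addresses (192.0.2.0/24 and 2001:db8::/32).
const bothFamilies = Promise.all([
  checkAnyFamily('192.0.2.10'),
  checkAnyFamily('2001:db8::1'),
]);
bothFamilies.then(([v4, v6]) => console.log(v4, v6));
```

Rejecting unparseable input up front keeps a malformed string from being mistaken for an address that simply has no signals.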

Why we built it this way

The thread tying these together: an IP signal is one input, never a verdict. The more honestly we can tell you what fired and how directly we saw it, the better your own logic gets — and the easier it is to audit a decision after the fact. Branch on reasons[], weight by evidence, and never block the crawler that indexes you.
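
Put together, that advice might look like the sketch below. It assumes the response shape shown earlier; the decide function, the action names and the CHALLENGE_ON policy are hypothetical choices of ours, not part of the API, and the sample inputs are made up.

```javascript
// Hypothetical policy: which fired reasons trigger a step-up check in YOUR app.
const CHALLENGE_ON = new Set(['is_tor', 'is_proxy']);

// Turn one response into a decision plus the record needed to audit it later.
function decide(response) {
  // Verified good crawlers skip the risk gate entirely.
  if (response.signals.is_verified_bot) {
    return { action: 'allow', why: ['verified_bot:' + response.signals.verified_bot_name] };
  }
  // Branch on the reasons that fired, not on the bare score.
  const hits = response.risk.reasons.filter((reason) => CHALLENGE_ON.has(reason));
  // An IP signal is one input: step up instead of hard-blocking on it alone.
  return hits.length > 0
    ? { action: 'challenge', why: hits }
    : { action: 'allow', why: [] };
}

const crawlerDecision = decide({
  signals: { is_verified_bot: true, verified_bot_name: 'googlebot' },
  risk: { reasons: [] },
});
const torDecision = decide({
  signals: { is_verified_bot: false, is_tor: true },
  risk: { reasons: ['is_tor'] },
});
console.log(crawlerDecision, torDecision);
```

Returning the why list alongside the action is what makes the decision auditable: a later review can see which signal drove the outcome without re-running the lookup.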

Next steps

Read the response schema for the full evidence and verified-bot fields, the risk-score methodology for the exact weights (and why the bot weight is zero), or the verified bot API overview. Want to try it? Free key, no card.

Signals are probabilistic, not facts. Don't make a sole-basis automated decision about a person — see the acceptable use policy.


Get a free key — 5,000 lookups/day, no card.

Every signal and the same risk score as every paid plan. Upgrade only when you outgrow it.