Why don't you publish accuracy numbers yet?

Because publishing a number we can't defend would be dishonest. We're assembling labelled ground-truth datasets and the scoring harness first. The methodology is below; the first numbers will follow, published alongside the method used to produce them.

Will the benchmark be reproducible?

Yes — that's the point. We'll publish the methodology, the metrics, the dataset composition and the scoring code so anyone can scrutinise or rerun it.

Will you claim to be the most accurate?

No. We'll publish our numbers and our method. 'Most accurate' is a claim we won't make — comparisons depend heavily on the test set.

Benchmarks

A benchmark we can actually defend.

Most IP-intelligence vendors quote an accuracy figure with no method behind it. We're doing the opposite: publishing the methodology first, then reproducible numbers. This page is the method. The first results will be published soon.

Status

Methodology: published (this page). First measured results: in progress — we won't post a number until the ground-truth sets are solid.

What we measure

For each signal (is_vpn, is_proxy, is_datacenter, is_tor, is_bot) we report precision, recall and F1; for geolocation we report geo accuracy:

Precision: of the IPs we flagged, the share that truly were that thing.
Recall: of the IPs that truly were that thing, the share we caught.
F1: the harmonic mean of precision and recall, so we can't game one at the expense of the other.
Geo accuracy: country-level hit rate, and median city-level distance error (km).
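The two geo metrics can be sketched as follows. This is a minimal illustration, not the released harness: the function names and the row layout are hypothetical, and the use of the haversine great-circle formula for distance is an assumption, since this page does not specify how distance error is computed.

```python
import math
from statistics import median

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def geo_accuracy(rows):
    """rows: (pred_country, true_country, pred_latlon, true_latlon) tuples.

    Returns (country hit rate, median distance error in km).
    """
    rows = list(rows)
    hits = sum(1 for pc, tc, _, _ in rows if pc == tc)
    errors = [haversine_km(*p, *t) for _, _, p, t in rows]
    return hits / len(rows), median(errors)
```

The median is used rather than the mean so that a handful of wildly misplaced addresses cannot dominate the distance figure.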

Ground-truth datasets

A benchmark is only as honest as its labels. We build labelled sets per signal from defensible sources:

Datacenter — providers' own published IP ranges (AWS, GCP, Azure and others).
Tor — the Tor Project's public exit list as ground truth.
VPN / proxy — IPs confirmed via independent commercial provider lists and active probing, with residential-proxy cases held out and reported separately given their beta status.
Bot — labelled automation traffic from honeypots and known crawler ranges (reported as beta).
Geo — addresses with independently-known locations (e.g. infrastructure with published coordinates).

We'll publish the composition and size of each set so you can judge representativeness — not just a headline percentage.
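As an illustration of how a labelled set and its composition summary might be built from published ranges, here is a minimal sketch using Python's standard ipaddress module. The provider names and prefixes are hypothetical (they are reserved documentation ranges, not real cloud ranges), and the helper names are ours for illustration only.

```python
import ipaddress
from collections import Counter

# Hypothetical stand-in for providers' published ranges.
# These are reserved documentation prefixes, not real cloud ranges.
PUBLISHED_RANGES = {
    "example-cloud-a": ["192.0.2.0/24"],
    "example-cloud-b": ["198.51.100.0/25", "2001:db8::/32"],
}

def build_index(published):
    """Parse CIDR strings into (network, provider) pairs."""
    return [(ipaddress.ip_network(cidr), provider)
            for provider, cidrs in published.items() for cidr in cidrs]

def datacenter_label(ip, index):
    """Return the provider name if ip falls in a published range, else None."""
    addr = ipaddress.ip_address(ip)
    for net, provider in index:
        if addr.version == net.version and addr in net:
            return provider
    return None

def composition(ips, index):
    """Count labelled IPs per provider, for a set-composition summary."""
    return Counter(datacenter_label(ip, index) or "unlabelled" for ip in ips)
```

An address outside every published range is only unlabelled, not a confirmed negative, so negatives for a signal need their own defensible source.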

The scoring harness

Scoring is mechanical and open. In essence:

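# scoring harness (pseudocode)
for ip, label in ground_truth:        # label = known truth per signal
    pred = geoq.check(ip).signals
    for signal in SIGNALS:
        tally(signal, pred[signal], label[signal])  # updates TP/FP/TN/FN for this signal

for signal in SIGNALS:
    TP, FP, TN, FN = counts(signal)   # per-signal totals accumulated by tally()
    precision = TP / (TP + FP)        # undefined if nothing was flagged
    recall    = TP / (TP + FN)        # undefined if the set has no positives
    f1        = 2 * precision * recall / (precision + recall)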

We'll release the harness so the numbers are reproducible end-to-end.
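Until the harness is released, the scoring logic can be tried with a self-contained sketch like the one below. It is an assumption-laden stand-in, not the released code: the real API call is replaced by a hypothetical predict function, the addresses are documentation-only, and undefined ratios are reported as None rather than as a number.

```python
from collections import Counter

SIGNALS = ["is_vpn", "is_proxy", "is_datacenter", "is_tor", "is_bot"]

def score(ground_truth, predict):
    """ground_truth: iterable of (ip, {signal: bool}); predict: ip -> {signal: bool}."""
    counts = {s: Counter() for s in SIGNALS}
    for ip, label in ground_truth:
        pred = predict(ip)
        for s in SIGNALS:
            if s not in label:  # this IP carries no label for this signal
                continue
            kind = ("T" if pred[s] == label[s] else "F") + ("P" if pred[s] else "N")
            counts[s][kind] += 1
    report = {}
    for s, c in counts.items():
        tp, fp, fn = c["TP"], c["FP"], c["FN"]  # missing keys count as 0
        report[s] = {
            "n": sum(c.values()),
            "precision": tp / (tp + fp) if tp + fp else None,
            "recall": tp / (tp + fn) if tp + fn else None,
            # 2*TP / (2*TP + FP + FN) is algebraically equal to 2PR / (P + R).
            "f1": 2 * tp / (2 * tp + fp + fn) if 2 * tp + fp + fn else None,
        }
    return report

# Toy run: four labelled IPs and a stand-in predictor that flags two of them.
truth = [
    ("192.0.2.1", {"is_tor": True}),
    ("192.0.2.2", {"is_tor": True}),
    ("192.0.2.3", {"is_tor": False}),
    ("192.0.2.4", {"is_tor": False}),
]
flagged = {"192.0.2.1", "192.0.2.3"}
result = score(truth, lambda ip: {s: s == "is_tor" and ip in flagged for s in SIGNALS})
print(result["is_tor"])
```

In the toy run there is one true positive, one false positive and one false negative, so precision, recall and F1 for is_tor all come out at 0.5. Signals with no labels in the set report n = 0 and None for every metric.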

How we'll report it

A table of precision / recall / F1 per signal, with the test-set size.
Beta signals (residential-proxy, bot) clearly separated and labelled.
Geo accuracy as country hit-rate + median km error.
The date of each run and the dataset version, refreshed over time.
No cherry-picking, no "up to" framing, no comparison claims we can't reproduce.
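A report in this shape could be rendered with something like the sketch below. The function name, field names and column layout are hypothetical, and the example inputs are placeholders chosen only to show the layout. They are not measured results.

```python
from datetime import date

def format_report(rows, run_date, dataset_version):
    """rows: dicts with keys signal, n, precision, recall, f1, beta."""
    lines = [
        f"run date: {run_date.isoformat()}  dataset: {dataset_version}",
        f"{'signal':<22}{'n':>8}{'precision':>11}{'recall':>9}{'f1':>7}",
    ]
    # Stable sort on the beta flag: stable signals first, beta signals last.
    for r in sorted(rows, key=lambda r: r["beta"]):
        name = r["signal"] + (" (beta)" if r["beta"] else "")
        lines.append(
            f"{name:<22}{r['n']:>8}{r['precision']:>11.3f}"
            f"{r['recall']:>9.3f}{r['f1']:>7.3f}"
        )
    return "\n".join(lines)

# Placeholder inputs to show the layout only; NOT measured results.
example_rows = [
    {"signal": "is_bot", "n": 10, "precision": 0.5, "recall": 0.5, "f1": 0.5, "beta": True},
    {"signal": "is_tor", "n": 10, "precision": 0.5, "recall": 0.5, "f1": 0.5, "beta": False},
]
print(format_report(example_rows, date(2000, 1, 1), "v0-placeholder"))
```

Printing the run date and dataset version in the header of every table keeps each figure tied to the data it was measured on.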

Why this is our flagship asset

If we're going to say "the IP fraud API that shows its work", the benchmark is the work. Holding ourselves to a published, reproducible standard is the whole point.

Want to be told when the first numbers land? Create a free account and we'll email them to signed-up developers first.