Category 02 / Validation & QA

Accuracy is a metric. Not a closing report.

An independent validation operation for AI training data. Multi-tier review, inter-annotator agreement scored weekly, gold-standard benchmarking, and written accuracy commitments. Runs over the data we label, or the data you bring us.

ISO 27001 Certified · Cert No. 452AGI102121
Review tiers
Four · Primary, 2nd pass, Lead, Exception
IAA reporting
Weekly · By operator, class, and batch
Scope
Any data type · Image, video, LiDAR, text, audio
Commitment
Written · SLA’d per engagement

Operating model

Quality is the deliverable. Not a closing checkbox.

01 / Independence
A QA lead separate from the annotation delivery lead
02 / Measurement
IAA scored weekly, by operator, by class, by batch
03 / Feedback
Disagreements routed back with schema-tied reasoning
04 / Commitment
Written accuracy threshold agreed before production

The review pipeline

Five checkpoints. Every batch. Every engagement.

Quality is enforced on the way through the pipeline — not retrofitted at the end of a batch. Each stage has a defined output, an accountable role, and a gate criterion.

Stage 01 / Gate

Calibration

Operators train and pass gold-set calibration before production access. No pass, no production.

Gold set benchmark
Per-operator pass/fail
Stage 02 / Work

Primary pass

A schema-trained operator labels the batch inside the agreed tool, under the agreed convention.

Schema-tied labels
Exception flags raised
Stage 03 / Review

Independent 2nd pass

A second operator reviews without seeing the first pass. Disagreements surface as measurable IAA.

Blind to pass 01
Agreement logged
Stage 04 / Gate

Lead arbitration

A QA lead resolves disagreements against the schema and a gold reference — not against whichever pass was loudest.

Schema-tied rationale
Gold-ref lookup
Stage 05 / Signal

Exception & drift

Ambiguous frames and schema gaps route to your data lead. Drift by class surfaces in the next week’s standup.

Weekly IAA report
Class-level drift
Gate stage · pass/fail against a written criterion
Work stage · produces a measured artifact
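For teams that want the shape of the pipeline in machine-readable form, a minimal sketch follows. The stage names and gate/work/signal split mirror the cards above; the field names and structure are illustrative, not our production tooling.

```python
# Illustrative only: the five stages above expressed as data, with the two
# gate stages identifiable by kind. Field names are assumptions for this
# sketch, not a published schema.
PIPELINE = [
    {"stage": "Calibration",          "kind": "gate",
     "owner": "QA lead",              "output": "per-operator pass/fail vs gold set"},
    {"stage": "Primary pass",         "kind": "work",
     "owner": "annotator",            "output": "schema-tied labels + exception flags"},
    {"stage": "Independent 2nd pass", "kind": "review",
     "owner": "second annotator",     "output": "blind labels, agreement logged"},
    {"stage": "Lead arbitration",     "kind": "gate",
     "owner": "QA lead",              "output": "schema-tied rationale vs gold reference"},
    {"stage": "Exception & drift",    "kind": "signal",
     "owner": "client data lead",     "output": "weekly IAA report, class-level drift"},
]

def gate_stages(pipeline):
    """Stages a batch must pass against a written criterion."""
    return [s["stage"] for s in pipeline if s["kind"] == "gate"]

print(gate_stages(PIPELINE))  # ['Calibration', 'Lead arbitration']
```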

The four capabilities

What we actually enforce.

The pipeline is the scaffolding. These four practices are what the scaffolding is there to hold up.

01 / Multi-tier review

Two independent passes plus a lead.

Primary annotator, independent second pass, QA lead arbitration. The lead is not the delivery lead — the same person cannot sign off on their own throughput.

  • Blind 2nd-pass review
  • Schema-tied arbitration
  • Independent QA role
02 / IAA scoring

Agreement, measured where it matters.

Inter-annotator agreement computed at the level that is actually useful — by operator, by class, by batch. Drift on a single class surfaces before it contaminates a release.

  • Cohen’s κ / F1 per class
  • Per-operator trend
  • Weekly delivery report
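As a sketch of what “per class, per batch” means in practice, the snippet below scores one toy batch with scikit-learn and flags a week-over-week drop. The class names, the one-vs-rest treatment, and the 0.05 drift tolerance are assumptions for illustration; the real thresholds live in the SOW.

```python
# Illustrative sketch: per-class Cohen's kappa and F1 between the primary and
# blind second pass, plus a simple week-over-week drift check.
# Class names and the 0.05 tolerance are assumptions, not contract values.
from sklearn.metrics import cohen_kappa_score, f1_score

def per_class_agreement(primary, second, classes):
    """Score each class one-vs-rest between the two independent passes."""
    scores = {}
    for cls in classes:
        a = [int(label == cls) for label in primary]
        b = [int(label == cls) for label in second]
        scores[cls] = {
            "kappa": cohen_kappa_score(a, b),
            "f1": f1_score(a, b, zero_division=0),  # primary pass taken as reference
        }
    return scores

def drifting_classes(this_week, last_week, tolerance=0.05):
    """Classes whose kappa dropped by more than the tolerance since last week."""
    return [
        cls for cls, s in this_week.items()
        if cls in last_week and last_week[cls]["kappa"] - s["kappa"] > tolerance
    ]

primary = ["car", "car", "pedestrian", "car", "pedestrian", "cyclist"]
second  = ["car", "pedestrian", "pedestrian", "car", "pedestrian", "cyclist"]
this_week = per_class_agreement(primary, second, ["car", "pedestrian", "cyclist"])
last_week = {cls: {"kappa": 0.92, "f1": 0.95} for cls in this_week}
print(drifting_classes(this_week, last_week))  # ['car', 'pedestrian'] in this toy batch
```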
03 / Gold-standard benchmarking

Benchmarks built with your data lead.

A gold-reference set drafted with your team becomes the benchmark that operators calibrate against and that lead arbitration resolves to. Revisited as the schema evolves.

  • Co-drafted with your team
  • Versioned with the schema
  • Calibration + arbitration use
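One way the gold set could gate calibration, sketched below. The per-operator accuracy check and the 0.9 pass mark are assumptions for illustration; the real threshold is the one written into the engagement.

```python
# Illustrative sketch: score one operator's calibration run against the
# co-drafted gold reference. The 0.9 pass mark is an assumed placeholder.
def calibration_result(operator_labels, gold_labels, pass_mark=0.9):
    """Return (passed, accuracy) for one operator against the gold set."""
    if len(operator_labels) != len(gold_labels):
        raise ValueError("calibration set and gold set must align item-for-item")
    correct = sum(o == g for o, g in zip(operator_labels, gold_labels))
    accuracy = correct / len(gold_labels)
    return accuracy >= pass_mark, accuracy

# No pass, no production access.
passed, accuracy = calibration_result(
    ["car", "car", "pedestrian", "cyclist", "car"],
    ["car", "car", "pedestrian", "pedestrian", "car"],
)
print(passed, round(accuracy, 2))  # False 0.8
```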
04 / Exception flagging

Edge cases flagged, not guessed.

Schema gaps and ambiguous frames route to your data lead with context and a proposed resolution — not silently labelled wrong, not buried in a retrospective.

  • Route-to-client channel
  • Proposed resolution attached
  • Schema amendments logged
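What a routed exception could carry, sketched as a structured record. Field names are illustrative, not a published schema; the point is that context and a proposed resolution travel with the flag rather than arriving as a loose message.

```python
# Illustrative sketch: an exception flag as a structured record routed to the
# client data lead. Field names and values are assumptions for this sketch.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class ExceptionFlag:
    item_id: str              # the ambiguous frame, span, or record
    raised_by: str            # operator who flagged it
    schema_ref: str           # the convention the item does not cleanly fit
    context: str              # what the operator actually saw
    proposed_resolution: str  # suggested label or schema amendment
    raised_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

flag = ExceptionFlag(
    item_id="batch-07/frame-1142",
    raised_by="op-23",
    schema_ref="vehicle.articulated, note 4",
    context="Trailer crosses the frame edge; unclear whether it is a second instance.",
    proposed_resolution="Single instance; add a schema note on articulated vehicles.",
)
print(flag.item_id, "->", flag.proposed_resolution)
```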

Where we apply it

Over the data we label. Over the data you bring.

The QA operation is self-contained. Most often it sits over our own annotation pipeline — but it applies just as cleanly to datasets you’ve labelled in-house, or received from another vendor.

Primary mode · QA over our work

Validation inside our annotation pipeline.

The default mode. Our QA lead is a separate role from the annotation delivery lead, so the same team that ships the labels is not the team that signs off on them. Calibration, IAA, and exception flagging all run continuously, against your written accuracy threshold.

  • Independent QA lead
  • Continuous IAA, not batch-end
  • Reported against a written threshold
  • Drift surfaced weekly, not quarterly
Standalone mode · QA over your data

Validation layered over existing datasets.

Bring us a dataset that was labelled in-house or by another vendor and we’ll run the same review pipeline over it — calibrate the reviewers to your schema, score agreement, surface drift, flag exceptions. Useful before a retraining run, an acquisition, or a vendor handoff.

  • Third-party dataset audits
  • Pre-retraining cleanse passes
  • Handoff validation between vendors
  • Written findings report per batch

Written commitments

What gets put in writing.

Specific thresholds are agreed per engagement and live in the SOW. What doesn’t vary is the shape of the commitment.

01 / Accuracy floor

An agreed minimum IAA threshold.

Before production starts, we agree a written IAA threshold for the dataset — set against a gold reference your team co-drafts. Miss it, and rework is on us, not on your next release.

02 / Cadence

A reporting rhythm you don’t have to chase.

Weekly IAA and exception reports, delivered into your channel of choice. Monthly trend packs for your data lead. No "we’ll get numbers next week" — the numbers are the deliverable.

03 / Scope

A review pipeline that doesn’t drift.

The five-stage pipeline is the same on week 1 and week 52. Headcount scales and techniques scale; the discipline does not get compressed away under delivery pressure.

Specific SLA numbers — threshold value, rework turnaround, report delivery day — are set per engagement. We’d rather hit an agreed number than publish a generic one. Treat these as shapes of commitment, not line items.

Applies across

Any data type. Any annotation technique.

The same review discipline applies whether the underlying labels are bounding boxes on a dashcam frame, point-cloud cuboids from a LiDAR rig, or entity spans in a legal document. The schema changes. The pipeline doesn’t.

Bounding Box
Semantic Segmentation
Polygonal / Instance
Video / Tracking
LiDAR 3D
NLP & Text
Audio / Transcription
Moderation / Policy
Reviewers specialised by data type — a LiDAR cuboid reviewer is not arbitrating an NER span, and vice versa.
[Image: Impact Outsourcing QA reviewers working at a shared table, laptops open alongside a wall-mounted review monitor. Review room, Nairobi, Kenya.]

What the record looks like

Accuracy, sustained. Not staged for a quarterly deck.

0.94
Average IAA

Inter-annotator agreement averaged across production accounts. Reported weekly, never retrofitted.

2M+
Records validated

Passed through the review pipeline for our longest-running annotation client. Multiple pipeline generations, one discipline.

6
Production AI clients

Across computer vision, NLP, and multimodal pipelines. Repeat engagements, not one-off audits.

5
Pipeline stages

Calibration, primary pass, independent 2nd pass, lead arbitration, exception & drift — the same five, every batch.

ISO 27001 Certified / Cert No. 452AGI102121 / 500+ operators · Nairobi

Scope with us

Send us your dataset. We’ll send back a findings report.

For new annotation work, we scope a calibrated team and a written accuracy threshold together. For existing datasets, we run the review pipeline over a sample and return a findings report before you commit to a full audit.

Already labelling in-house? Run a validation pass over a slice before your next retraining run.