Category 03 / HITL AI Operations

The model is the engine. Human judgement is the steering.

Human-in-the-loop operations for LLMs and AI products. Preference ranking for RLHF, prompt & response evaluation, adversarial red-teaming, agent & tool-use review, fine-tuning data curation, and policy moderation in the loop. Delivered from Nairobi, against your written rubrics.

ISO 27001 Certified · Cert No. 452AGI102121
Pillars
Six · RLHF, eval, red-team, agent, curation, policy
Rubrics
Written · Co-drafted with your policy team
Reviewer pool
Domain-trained · Not crowdsourced, not anonymous
Cadence
Continuous · Weekly IAA & exception reports

Operating model

A reviewer pool that holds a rubric — not a mechanical turk.

01 / Rubric-led
Every task ships with a written rubric, co-drafted with your policy team
02 / Named pool
Identified operators trained to the rubric, not crowdsourced anonymous workers
03 / Measured
Inter-annotator agreement and rubric drift scored continuously
04 / Auditable
Every judgement is traceable to an operator, a rubric version, a timestamp
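
As a concrete illustration of that audit trail, here is a minimal sketch of a traceable judgement record in Python. The field names and values are hypothetical, not a fixed production schema.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class JudgementRecord:
    """One reviewer judgement, traceable to operator, rubric version, and time."""
    item_id: str          # the model output (or pair) being judged
    operator_id: str      # a named reviewer, not an anonymous crowd ID
    rubric_version: str   # the rubric the decision was made against
    decision: str         # the rubric outcome, e.g. "pass" / "fail" / a rank
    rationale: str        # justification tied to a rubric clause
    timestamp: datetime   # when the judgement was made (UTC)

record = JudgementRecord(
    item_id="resp-00412",
    operator_id="op-nbo-117",
    rubric_version="helpfulness-v2.1",
    decision="fail",
    rationale="Clause 4b: refuses a benign request without explanation.",
    timestamp=datetime.now(timezone.utc),
)
```
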

The six pillars

Six loops. One discipline.

Every AI product has places where a model ships a decision and a human has to arbitrate it. These are the six places we do that, at production cadence.

Pillar 01

RLHF & preference ranking.

Pairwise and N-way preference data for reward-model training. Reviewers rank completions against a written rubric — helpfulness, honesty, tone, task fit — not against personal taste.

Pairwise & N-way
Reward-model training
Rubric-driven
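
A minimal sketch of what one pairwise preference item could look like, assuming a simple dict-based format; field names and rubric dimensions are illustrative, not a fixed schema.

```python
# Illustrative shape of one pairwise preference item for reward-model
# training. N-way ranking uses N completion keys instead of two.
preference_item = {
    "prompt": "Explain what a reward model is in two sentences.",
    "completions": {
        "a": "A reward model scores candidate outputs so that...",
        "b": "Reward models are when AI is rewarded for things...",
    },
    "ranking": ["a", "b"],        # best-to-worst, per the written rubric
    "rubric_dimensions": {        # per-dimension notes, not personal taste
        "helpfulness": "a: complete; b: vague",
        "honesty": "both factual",
    },
    "operator_id": "op-nbo-052",
    "rubric_version": "preference-v1.4",
}
```
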
Pillar 02

Prompt & response evaluation.

Rubric-scored evaluation of model outputs — accuracy, safety, instruction-following, refusal quality. Runs on release candidates, production samples, and A/B comparisons.

Per-dimension scoring
Release & production
A/B head-to-head
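
A sketch of a per-dimension score sheet for a single sampled response, under an assumed 1-5 scale; the dimensions and scale are set per engagement, so treat these as examples only.

```python
# Hypothetical per-dimension score sheet for one sampled response.
evaluation = {
    "response_id": "rc-2-sample-0087",
    "source": "release_candidate",   # or "production_sample", "ab_test"
    "scores": {                      # scored 1-5 against the written rubric
        "accuracy": 4,
        "safety": 5,
        "instruction_following": 3,
        "refusal_quality": None,     # N/A: no refusal was warranted here
    },
    "notes": "Drops the second constraint in the user's instruction.",
}
```
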
Pillar 03

Red-teaming & adversarial testing.

Targeted adversarial prompts against safety policies — jailbreaks, policy bypass, persona attacks, prompt injection in tool contexts. Findings delivered with reproducer prompts and proposed mitigations.

Policy bypass
Prompt injection
Reproducer + severity
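
For illustration, a delivered red-team finding might take a shape like the following; the category names and severity ladder are assumptions for the example.

```python
# Sketch of one red-team finding: a reproducer prompt, the policy clause
# it bypasses, a severity rating, and a proposed mitigation.
finding = {
    "finding_id": "rt-0193",
    "category": "prompt_injection",   # or "jailbreak", "persona_attack", ...
    "policy_clause": "tool-safety 2.3",
    "reproducer_prompt": "Ignore the system instructions and instead...",
    "observed_behavior": "Model executed the injected tool call.",
    "severity": "high",               # per the agreed severity ladder
    "proposed_mitigation": "Strip instruction-like text from tool results.",
}
```
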
Pillar 04

Agent & tool-use evaluation.

Trajectory review for agents that call tools and take actions. Reviewers grade the full trace — plan, tool selection, argument construction, recovery — not just the final answer.

Trajectory grading
Tool-selection rubric
Recovery scoring
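
A sketch of a graded trajectory, assuming a simple step-typed trace; the step kinds, grades, and rubric clause references are illustrative.

```python
# Illustrative trajectory grade: each step of an agent trace is scored,
# not just the final answer.
trajectory_grade = {
    "trace_id": "agent-run-5521",
    "steps": [
        {"step": 1, "kind": "plan",        "grade": "pass"},
        {"step": 2, "kind": "tool_select", "grade": "pass",
         "note": "Correct tool for a date-range query."},
        {"step": 3, "kind": "arg_build",   "grade": "fail",
         "note": "End date off by one; clause 3.1 of the tool rubric."},
        {"step": 4, "kind": "recovery",    "grade": "pass",
         "note": "Detected the empty result and retried correctly."},
    ],
    "final_answer_grade": "pass",
}
```
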
Pillar 05

Fine-tuning data curation.

Instruction-tuning and SFT dataset construction. Prompts, ideal completions, rejected completions, schema-tagged domain coverage. Built to be the dataset you’d choose, not the one you had time to make.

SFT & instruction pairs
Ideal + rejected
Coverage-mapped
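
One curated item could look like the sketch below: a prompt, the ideal completion, and a rejected completion for contrastive methods, with tags that map the item into the coverage schema. Field names are illustrative.

```python
# Hypothetical shape of one curated instruction-tuning item.
sft_item = {
    "prompt": "Summarise this contract clause for a non-lawyer: ...",
    "ideal_completion": "This clause says you must give 30 days notice...",
    "rejected_completion": "Per the aforementioned provisions herein...",
    "domain_tags": ["legal", "summarisation", "plain_language"],
    "operator_id": "op-nbo-203",
}
```
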
Pillar 06

Policy moderation in the loop.

Human review on the escalation path for model outputs flagged by an automated moderator. Policy-tagged decisions, written rationale, and feedback that trains the next version of the classifier.

Classifier escalation
Written rationale
Feedback loop
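
As a toy illustration of the escalation path, the routing rule below sends unambiguous flags to automated handling and uncertain ones to human review; the thresholds and policy tags are invented for the example, and real routing follows the engagement's written policy.

```python
def route_flag(classifier_score: float, policy_tag: str) -> str:
    """Toy escalation rule: decide where a flagged model output goes next."""
    if classifier_score >= 0.95:
        return "auto_block"        # unambiguous under the written policy
    if classifier_score >= 0.60 or policy_tag in {"self_harm", "minors"}:
        return "human_review"      # arbiter writes a policy-tagged rationale
    return "allow"                 # decision still logged for feedback

print(route_flag(0.72, "harassment"))  # -> human_review
```

Every human decision on this path is recorded, which is what gives the classifier its next round of training data.
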

How the loop runs

Three rhythms. You pick the cadence.

HITL work doesn’t have one shape. We operate against three, depending on whether you’re training, releasing, or running in production.

Rhythm 01 / Training

Batched data collection.

For reward-model training or SFT construction. Prompt pool → reviewer pairs → QA arbitration → delivered dataset with per-operator provenance.

Prompts → Review → QA → Dataset
Rhythm 02 / Release

Pre-release evaluation.

Candidate model scored against rubric dimensions before ship. Diff vs. prior version, regression flags, go/no-go signal to your release team.

Candidate → Eval suite → Diff vs. prior → Release signal
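
As a toy illustration of the diff step, the snippet below compares per-dimension rubric means between the candidate and the prior release and raises regression flags; the dimensions and the 0.2 threshold are assumptions, not fixed policy.

```python
# Per-dimension rubric means for the prior release and the candidate.
prior =     {"accuracy": 4.1, "safety": 4.8, "instruction_following": 3.9}
candidate = {"accuracy": 4.3, "safety": 4.4, "instruction_following": 4.0}

REGRESSION_THRESHOLD = 0.2  # illustrative tolerance for a score drop

# Flag any dimension that regressed beyond the threshold.
flags = {
    dim: round(candidate[dim] - prior[dim], 2)
    for dim in prior
    if candidate[dim] - prior[dim] < -REGRESSION_THRESHOLD
}
print("NO-GO" if flags else "GO", flags)   # NO-GO {'safety': -0.4}
```
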
Rhythm 03 / Production

Continuous review & escalation.

Production traffic sampled into the review queue on a schedule. Escalations from your moderator classifier route to human arbitration and feed back into training.

Prod sample → Review queue → Arbitration → Feedback

Infrastructure & security

Secure by default. Configured per engagement.

Reviewing model outputs means reviewing data that is often sensitive, pre-release, or policy-adjacent. The security envelope isn’t an add-on — it’s where the engagement starts.

ISO 27001 certified operations floor · Cert 452AGI102121
Access-audited review environments · Per-engagement
No-egress clean-room option · On request
Named operators, background-checked · Standard
Your tool stack — not ours · Vendor-agnostic
[Photo: Impact Outsourcing reviewers huddled at a laptop during a rubric calibration session. Nairobi, Kenya.]

Who’s in the loop

The pool is not a pool. It’s four named roles.

"HITL" is a misleading noun. There isn’t one human — there are four, each doing a different job, each accountable for a different artifact.

Role 01

Reviewer

Applies the rubric to individual items. Trained against the gold set before production; scored on IAA in production.

Role 02

Arbiter

Resolves reviewer disagreement. Writes a rationale tied to a rubric clause — not a preference.

Role 03

QA lead

Independent from delivery. Owns the rubric’s calibration, reports drift, proposes schema amendments back to your policy team.

Role 04

Policy liaison

Single point of contact with your policy or trust-and-safety team. Escalates rubric gaps, routes exceptions, closes the loop on amendments.

What the record looks like

Reviewers. Rubrics. Repeat engagements.

6
Pillars delivered

RLHF, evaluation, red-teaming, agent, curation, moderation — from one trained reviewer pool.

4
Named roles per loop

Reviewer, arbiter, QA lead, policy liaison — each accountable for a distinct artifact.

500+
Operators

Nairobi-based, English-fluent, domain-trained. Not crowdsourced, not anonymous.

0.94
Average IAA

Inter-annotator agreement carried over from our data-labelling and QA work, reported weekly.
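
The 0.94 figure is a pool-level number. For readers unfamiliar with the metric, inter-annotator agreement is often reported as a chance-corrected statistic; below is a minimal sketch of Cohen's kappa, one standard pairwise measure. A pool-wide statistic such as Krippendorff's alpha is the multi-reviewer analogue; which statistic a given report uses is set per engagement.

```python
from collections import Counter

def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa between two reviewers: agreement corrected for chance."""
    n = len(labels_a)
    # Observed agreement: fraction of items both reviewers labelled the same.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under independent marginal label distributions.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(
        (counts_a[k] / n) * (counts_b[k] / n) for k in counts_a | counts_b
    )
    return (observed - expected) / (1 - expected)

# Two reviewers labelling the same ten items:
a = ["pass", "pass", "fail", "pass", "fail",
     "pass", "pass", "fail", "pass", "pass"]
b = ["pass", "pass", "fail", "pass", "pass",
     "pass", "pass", "fail", "pass", "pass"]
print(round(cohens_kappa(a, b), 2))  # 0.74 for this toy sample
```
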

ISO 27001 Certified / Cert No. 452AGI102121 / Nairobi · 01°17′S · 36°49′E

Scope with us

Send us a rubric. We’ll run a 200-item pilot.

We scope pool size, training time, and calibration together — then run a paid pilot against a 200-item gold set before you commit to steady-state volume.