Annexo is an independent trust layer for AI agents. It verifies how a third party’s AI agent actually behaves with live behavioural probes, watches it for drift over time, and produces audit-ready assurance evidence a buyer, regulator or insurer can rely on. The thesis is simple: a builder cannot credibly grade its own homework, so verification has to be independent.

EU and DACH enterprises deploying AI agents in regulated settings — insurance, banking, industrial — and the consultancies that build agents for them. Later, insurers underwriting agent risk.

How does Annexo verify an AI agent?

Point the verify console at your own AI agent endpoint or run a built-in sample agent. A live probe battery runs against it — prompt injection, tool poisoning, guardrails under pressure, AI disclosure, PII handling, request logging — and resolves into an evidence dashboard. Your agent’s API key is held in memory for that one request only and is never stored.

Does Annexo certify or guarantee that an AI agent is compliant?

No. Annexo is not a notified body and does not certify, guarantee, or give legal advice. Every result is observed behaviour at the time of testing, reported as a status — holding, watch, or surfaced — never a pass/fail verdict or a conformity assessment.

What about EU regulations like the EU AI Act, GDPR, DORA and NIS2?

Annexo also produces done-for-you EU conformity dossiers — the evidence and technical documentation mapped to the EU AI Act, GDPR, DORA and NIS2, produced from your system and audit-ready. It is the deliverable, not a substitute for your own counsel or a conformity assessment body.

Where is Annexo’s data processed?

In the EU. Compute runs in the Frankfurt (fra1) region and persisted data uses an EU-region store, in line with EU data-residency expectations.

Sample deliverable · EU AI Act Annex IV

A real dossier.
Redacted.

This is a redacted example of what Annexo delivers — the audit-ready Annex IV technical documentation a notified body, regulator, or auditor would read. The company and figures below are fictional; the structure and provenance are exactly what you receive.

EU Conformity Dossier

Sample

CreditGuard AI

high-risk · Annex III §5(b) creditworthiness evaluation

Provider

Aurora Banking N.V.

Model

XGBoost

247 features · binary classifier

In production

Oct 2024

~14,200 decisions/day

Dossier date

22 May 2026

PDF · 78 pages

What's in the dossier

12Annex IV sections

10documented or evidenced

2findings with remediation

§1
General description of the AI system
Documented
§2(a)
Development methods & process steps
Documented
§2(c)
System architecture & compute
Documented
§2(d)
Data & data governanceArt. 10
Evidenced
§2(e)
Human oversight measuresArt. 14
Documented
§2(g)
Accuracy, robustness & cybersecurityArt. 15
Evidenced
§3
Risk-management systemArt. 9
Documented
§4
Performance metrics & validation
Evidenced
§5
Changes through the lifecycle
Documented
§6
Record-keeping & loggingArt. 12
Evidenced
§8
Post-market monitoring planArt. 72
Partial
§9
EU declaration of conformity
Template

Inside the document · sample pages

Annex IV · §1

General description

CreditGuard AI is an XGBoost gradient-boosted decision-tree classifier that assigns default-probability scores to consumer-loan applicants of the provider. The system has been in production since October 2024, processing approximately 14,200 decisions per day in the credit-origination workflow. It constitutes a high-risk AI system under Annex III §5(b) — creditworthiness evaluation of natural persons.

sourcerepo://creditguard/src/model.py:L42· accessed 2026-05-19

Evidence on file

Annex IV · §2(d)· Art. 10

Data & data governance

Training data is drawn from the provider's own loan-origination outcomes over a multi-year window, governed end-to-end through a feature store with documented lineage. Protected attributes are excluded from the model's input features and retained only for fairness testing. The dataset's composition, splits, label definition, and known limitations are recorded in the datasheet below and cross-referenced to the project DPIA.

Evidence · dataset datasheetdvc://datasets/origination-2019-2024

Dataset: Loan-origination outcomes · 2019–2024
Records: 1,840,512 applications
Label: 18-month default flag (binary)
Split: 70 / 15 / 15 · temporal, no leakage
Protected attrs: Excluded from features · retained for fairness testing only
Provenance: Core-banking warehouse → governed feature store
Known limitation: 2022 rate-shock regime shift — documented in §3

sourcedvc://datasets/origination-2019-2024 + DPIA· accessed 2026-05-18

Evidence on file

Annex IV · §2(e)· Art. 14

Human oversight

Every application above €5,000 is routed to a qualified underwriter who holds final authority on the decision; the model output is presented as a ranked risk indicator with the top contributing features, not as an automated approval. Underwriters can override any score, and overrides are logged with a free-text rationale. The oversight workflow, escalation thresholds, and override audit trail are documented and evidenced from the production ticketing system.

sourcewiki://risk/human-oversight-sop.md + override-log export· accessed 2026-05-20

Evidence on file

Annex IV · §2(g)· Art. 15

Accuracy, robustness & cybersecurity

Accuracy is reported against a held-out temporal validation set and recalibrated quarterly; the headline discrimination metrics and per-segment performance are disclosed below for reviewer assessment. Robustness is evidenced through adversarial-input and missing-feature stress tests, and the serving path enforces input-schema validation and rate limiting. Cybersecurity measures — secrets management, dependency scanning, and access controls on the model registry — are documented and cross-referenced to the provider's ISMS.

Evidence · accuracy & subgroup performanceregistry://mlflow · eval-suite v4.2.1

AUC-ROC

0.87

KS statistic

0.52

Precision

0.79

Recall

0.74

SegmentAUCApprovalΔ vs overall

Overall0.8741.0%—

Age 18–250.8438.2%−2.8 pts

Age 26–400.8842.6%+1.6 pts

Age 41–600.8741.3%+0.3 pts

Age 60+0.8539.1%−1.9 pts

Held-out temporal validation set (Q1 2026). Per-segment figures are disclosed for reviewer assessment; no automated parity verdict is asserted.

sourceregistry://mlflow · run v4.2.1 + eval-suite report· accessed 2026-05-21

Evidence on file

Annex IV · §6· Art. 12

Record-keeping & logging

Every decision writes one immutable, hash-chained record capturing the model version, scored inputs (hashed), output band, the top contributing features, and any human override. Records are retained for 13 months and are queryable by decision ID, giving the automatic logging and traceability Art. 12 requires over the lifetime of each decision.

Evidence · record-keeping samplelogstore://creditguard/decisions

decision record · JSON

{ "ts": "2026-05-21T09:14:22Z", "decision_id": "d_8f3a…c0",
  "model": "creditguard@v4.2.1", "score": 0.182, "band": "low-risk",
  "input_hash": "sha256:9b1c…", "features_version": "fs_2026.05",
  "outcome": "auto-eligible", "reviewer": null,
  "explain_top": ["dti_ratio", "util_12m", "inq_6m"] }

One immutable, hash-chained record per decision · 13-month retention · queryable by decision_id

sourcelogstore://creditguard/decisions · sample export· accessed 2026-05-20

Evidence on file

Annex IV · §3· Art. 9

Risk-management system

Risk management runs continuously across the lifecycle, anchored in the provider's existing operational-risk framework and adapted for the AI-specific risks in Art. 9(2). Identified and mitigated risks include: bias against protected attributes (addressed by §2(g) fairness testing), data drift on macroeconomic shifts, model staleness on product-mix change, and explainability gaps in adverse-action notifications. Each risk carries an owner, a control, and a residual-risk rating reviewed quarterly.

sourcejira://CR-217 + 46 related incident records· accessed 2026-05-19

Evidence on file

Evidence & traceability

ObligationEvidence sourceStatus

Risk-management system

Art. 9

Jira incidents + risk register

Documented

Data & data governance

Art. 10

DPIA + dataset datasheets

Evidenced

Record-keeping & logging

Art. 12

Log samples · 13-mo retention

Evidenced

Human oversight

Art. 14

Oversight SOP + override log

Documented

Accuracy & robustness

Art. 15

Registry metrics + eval suite

Evidenced

Post-market monitoring

Art. 72

Drift dashboards (partial)

Partial

Findings & remediation

The dossier is honest about what's missing. Where evidence appears incomplete, we say so and hand you a concrete step to close it — we never assert conformity on your behalf.

Annex IV · §8· Art. 72

Post-market monitoring plan

A drift dashboard exists, but a written post-market monitoring plan — owners, thresholds, escalation, and review cadence — appears incomplete in the evidence reviewed.

Remediation

Draft a one-page PMM plan binding the existing drift dashboards to named owners and action thresholds. Template supplied in the appendix.

Annex IV · §2(d)· Art. 10

Bias-testing evidence

Fairness testing is referenced in the risk register, but quarterly subgroup-performance results for the most recent two periods appear to be missing from the documentation set.

Remediation

Attach the Q4-2025 and Q1-2026 fairness-test exports; cross-cite into §2(d) and §2(g). Owner: model-risk team.

Self-certification

M. Santiago

Head of Compliance · authorised signatory

The AI Act makes the provider the signer. We produce the dossier audit-ready — structured, evidenced, every claim traced to a source you control — so your compliance officer can review it and self-certify with confidence. You sign; we make that signature defensible.

An independent regulatory legal review is available as an optional add-on.

This is a sample

Yours is built from your own stack, in about five business days.

Check your readiness — free Scope your dossier

See how we keep it current → live monitoring

A real dossier.
Redacted.

CreditGuard AI

General description

Data & data governance

Human oversight

Accuracy, robustness & cybersecurity

Record-keeping & logging

Risk-management system

Post-market monitoring plan

Bias-testing evidence

Yours is built from your own stack, in about five business days.

About Annexo

Frequently asked questions

A real dossier.Redacted.

CreditGuard AI

General description

Data & data governance

Human oversight

Accuracy, robustness & cybersecurity

Record-keeping & logging

Risk-management system

Post-market monitoring plan

Bias-testing evidence

Yours is built from your own stack, in about five business days.

About Annexo

Frequently asked questions

A real dossier.
Redacted.