Retrieval-augmented generation has become the dominant approach for building AI tools in regulated industries. The pitch is compelling: instead of asking a language model to answer from training data (which may be stale, hallucinated, or jurisdiction-confused), you retrieve relevant document chunks first, then ask the model to answer based on those chunks.
It’s better than nothing. But for regulatory compliance specifically, basic RAG fails on several problems that matter — and fails in ways that are hard to detect without understanding the architecture.
This post explains why.
What RAG Actually Does
Before explaining the failure modes, it’s worth being precise about what a standard RAG pipeline does:
- Ingestion: Regulatory documents are split into chunks (typically 500-2000 tokens), embedded using an embedding model, and stored in a vector database.
- Retrieval: When a query arrives, the query is embedded and the vector database returns the top-k most semantically similar chunks.
- Generation: The retrieved chunks are injected into a prompt along with the question, and the language model generates an answer based on the injected context.
That’s it. There’s no understanding of regulatory structure in this pipeline. No extraction of obligation relationships. No knowledge of firm types. No interpretation of deontic modality (what “must” means versus “should” versus “may”). The system is doing similarity search on text chunks, not regulatory reasoning.
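For concreteness, the pipeline above can be sketched in a few lines of Python. This is a deliberately minimal illustration, not any particular product's implementation; the `embed` function here is a stand-in for a real embedding model, using bag-of-words overlap instead of learned vectors.

```python
from collections import Counter
import math

def embed(text: str) -> Counter:
    # Stand-in for a real embedding model: a bag-of-words vector.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class BasicRag:
    def __init__(self):
        self.chunks = []  # (text, vector) pairs

    def ingest(self, documents, chunk_size=50):
        # Ingestion: split each document into fixed-size word chunks and embed them.
        for doc in documents:
            words = doc.split()
            for i in range(0, len(words), chunk_size):
                chunk = " ".join(words[i:i + chunk_size])
                self.chunks.append((chunk, embed(chunk)))

    def retrieve(self, query: str, k: int = 3):
        # Retrieval: top-k chunks by vector similarity to the query.
        qv = embed(query)
        ranked = sorted(self.chunks, key=lambda c: cosine(qv, c[1]), reverse=True)
        return [text for text, _ in ranked[:k]]

    def build_prompt(self, query: str) -> str:
        # Generation: retrieved chunks are injected into the prompt verbatim.
        context = "\n---\n".join(self.retrieve(query))
        return f"Answer using only this context:\n{context}\n\nQuestion: {query}"
```

Note what is absent from this sketch: there is no notion of firm type, no handling of cross-references, and no versioning. Every failure mode below follows from that shape.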
The Five Failure Modes
1. Chunk Boundary Breaks Obligation Context
Regulatory obligations are rarely self-contained within a 1000-token window. A typical obligation structure looks like:
Section 3.2: A Category 2 firm that provides discretionary portfolio management services must maintain separate client accounts unless the conditions in Section 3.2.1 are satisfied.
Section 3.2.1: The conditions referred to in Section 3.2 are: (a) the client has given written consent…
A RAG system retrieving Section 3.2 without Section 3.2.1 will return an incomplete answer. Whether the retrieval system captures the cross-reference depends on how the document was chunked and whether the semantic similarity to the query was high enough to surface both chunks.
In practice, chunk boundary problems cause systematic incompleteness. The model generates an answer that looks complete but omits conditions, exceptions, or carve-outs that materially affect the obligation. A compliance officer relying on that answer without checking the source text has taken on regulatory risk.
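The Section 3.2 example above can be reproduced with a naive fixed-size chunker. This is a sketch, using words as a crude proxy for tokens and an artificially small chunk size to make the split visible:

```python
def chunk_by_tokens(text: str, max_tokens: int) -> list[str]:
    # Naive fixed-size chunker (words stand in for tokens).
    words = text.split()
    return [" ".join(words[i:i + max_tokens]) for i in range(0, len(words), max_tokens)]

rulebook = (
    "Section 3.2: A Category 2 firm that provides discretionary portfolio "
    "management services must maintain separate client accounts unless the "
    "conditions in Section 3.2.1 are satisfied. "
    "Section 3.2.1: The conditions referred to in Section 3.2 are: "
    "(a) the client has given written consent."
)

chunks = chunk_by_tokens(rulebook, max_tokens=25)

# The obligation and its qualifying conditions now live in different chunks:
obligation_chunk = next(c for c in chunks if "must maintain" in c)
conditions_chunk = next(c for c in chunks if "written consent" in c)
assert obligation_chunk != conditions_chunk
```

Any retrieval that surfaces only the first chunk returns the obligation without its conditions, and nothing in the pipeline signals that the context is incomplete.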
2. Firm Type Filtering Doesn’t Exist
ADGM’s FSRA Rulebook has thousands of regulatory obligations. The vast majority of them do not apply to any given firm. A Category 3B fund manager’s obligation set is radically different from a Category 1 bank’s.
Basic RAG has no mechanism for firm type filtering. When you ask “what are our AML obligations?”, the retrieval system returns the most semantically similar chunks — which may include obligations for categories your firm doesn’t hold, activities you don’t conduct, or products you don’t offer.
The model synthesises these into an answer that looks authoritative and complete. It isn’t. It includes obligations that don’t apply to you and may omit obligations that do, because the relevant text was in a less-semantically-similar chunk.
For compliance officers using RAG-based tools without firm-type awareness, the obligation register they generate is structurally unreliable. They can’t know which items apply without checking every item against the source — which defeats the purpose of the tool.
3. Cross-References Are Invisible
Financial regulation is a network of cross-references. An obligation in the Conduct of Business Rulebook may depend on a definition in the General Rulebook. An exception may be governed by a condition in a separate module. An interpretation may be clarified in an FSRA guidance note published three years after the rule.
Vector similarity search retrieves text chunks that are semantically similar to the query. It has no way to traverse cross-references. A retrieved obligation that says “as defined in Section 1” doesn’t cause the system to retrieve Section 1. The answer is structurally incomplete in a way that the model — and the compliance officer — may not notice.
For regulatory analysis, cross-reference traversal isn’t an edge case. It’s routine. The answer to most substantive compliance questions requires following at least one cross-reference to a definition, a condition, or a related obligation.
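The frustrating part is that the cross-references themselves are mechanically detectable; what basic RAG lacks is any step that acts on them. A sketch (the `extract_cross_refs` helper and its regex are illustrative, tuned to one citation style):

```python
import re

SECTION_REF = re.compile(r"Section\s+(\d+(?:\.\d+)*)")

def extract_cross_refs(chunk: str) -> set[str]:
    # The references are right there in the text and easy to find...
    return set(SECTION_REF.findall(chunk))

chunk = ("A firm must maintain separate client accounts as defined in "
         "Section 1.4, unless the conditions in Section 3.2.1 are satisfied.")

refs = extract_cross_refs(chunk)  # {"1.4", "3.2.1"}
# ...but nothing in a vector-search pipeline resolves them: Section 1.4 is
# retrieved only if it happens to be semantically similar to the query.
```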
4. Temporal Confidence Is Broken
Regulatory text changes. Amendments are issued. Guidance notes update interpretations. FSRA circulars modify rulebook provisions between formal revision cycles.
Basic RAG systems are vulnerable to temporal confusion in two ways:
Stale content: If the ingestion pipeline doesn’t keep pace with regulatory updates, retrieved chunks may be from superseded versions of rules. The model generates answers based on old text, confidently citing rule numbers that no longer say what the answer claims.
Temporal mixing: Even if the ingestion pipeline is current, a RAG system has no way to ensure that retrieved chunks are internally consistent in time. A query might retrieve a 2024 obligation and a 2026 amendment as if they were both currently applicable, without representing the temporal relationship between them.
A compliance officer needs to know not just what the rules say, but which version of the rules applies and whether a recent amendment changed the answer to their question. Basic RAG cannot reliably provide this.
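Version-aware retrieval requires effective-date metadata that basic RAG simply does not store. A sketch of what such a filter might look like, with hypothetical metadata fields (`effective_from`, `superseded_on`):

```python
from datetime import date

# Hypothetical chunk metadata: basic RAG stores none of this, which is
# exactly why it cannot answer "which version applies today?"
chunks = [
    {"rule": "COB 3.2", "text": "original 2024 text",
     "effective_from": date(2024, 1, 1), "superseded_on": date(2026, 3, 1)},
    {"rule": "COB 3.2", "text": "2026 amended text",
     "effective_from": date(2026, 3, 1), "superseded_on": None},
]

def in_force(chunk, on: date) -> bool:
    # A chunk is applicable only if the query date falls inside its validity window.
    ends = chunk["superseded_on"]
    return chunk["effective_from"] <= on and (ends is None or on < ends)

def retrieve_as_of(chunks, on: date):
    return [c for c in chunks if in_force(c, on)]
```

Without a validity window on each chunk, the 2024 text and the 2026 amendment are just two similar vectors, and a query can surface both as if equally applicable.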
5. Confidence Is Fictional
Perhaps the most dangerous failure mode: basic RAG systems produce confident-sounding answers to questions they cannot reliably answer.
Language models are trained to generate plausible, coherent text. Given retrieved chunks, they synthesise an answer that sounds authoritative. They don’t distinguish between “I retrieved the exact relevant obligation and can answer with high confidence” and “I retrieved peripherally related text and am synthesising a plausible-sounding answer from insufficient context.”
For compliance, this is a serious problem. A compliance officer needs to know when an answer is uncertain, when it depends on an interpretation that the regulator hasn’t definitively resolved, or when the question requires legal advice rather than regulatory research. A system that generates uniformly confident answers — regardless of the quality of the retrieved context — is actively harmful: it makes uncertain answers look certain.
What a Stronger Architecture Looks Like
The answer isn’t to abandon AI for compliance analysis. It’s to use AI in the part of the pipeline where it actually works — synthesis — and to rely on more reliable, deterministic mechanisms for the parts where it fails.
Structured Obligation Extraction
Instead of storing raw regulatory text as vector-indexed chunks, process the text into structured ComplianceUnit objects: each obligation extracted, normalised, and tagged with its deontic type (must, must not, should, may), the applicable firm types, the applicable activities, and the cross-references.
This extraction step is where the language model is most valuable — it’s good at reading natural language text and producing structured output. Done once, during ingestion, with human verification for the most critical obligations.
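As a sketch, a ComplianceUnit along these lines might be modelled as follows. The field names and enum values are illustrative assumptions, not Seif's actual schema:

```python
from dataclasses import dataclass, field
from enum import Enum

class Deontic(Enum):
    MUST = "must"
    MUST_NOT = "must not"
    SHOULD = "should"
    MAY = "may"

@dataclass
class ComplianceUnit:
    unit_id: str                  # e.g. "COB-3.2" (illustrative ID scheme)
    source_ref: str               # citation back to the rulebook text
    deontic: Deontic              # obligation strength
    obligation_text: str          # normalised statement of the duty
    firm_types: set[str] = field(default_factory=set)    # applicable categories
    activities: set[str] = field(default_factory=set)    # applicable activities
    cross_refs: list[str] = field(default_factory=list)  # referenced units

unit = ComplianceUnit(
    unit_id="COB-3.2",
    source_ref="Conduct of Business Rulebook, Section 3.2",
    deontic=Deontic.MUST,
    obligation_text="Maintain separate client accounts.",
    firm_types={"Category 2"},
    activities={"discretionary portfolio management"},
    cross_refs=["COB-3.2.1"],
)
```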
Graph-Based Retrieval
Store obligations and their relationships in a knowledge graph. When a query arrives, the retrieval is a deterministic graph traversal: retrieve the obligations applicable to this firm type, these regulated activities, this query context — and follow cross-references explicitly.
This is slower than vector search but dramatically more reliable. The retrieved obligation set is structurally complete: cross-references are followed, conditions are included, exceptions are surfaced.
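A minimal sketch of the traversal, using a plain dictionary as the graph and illustrative rule IDs (a production system might use a graph database, but the logic is the same):

```python
# Obligation graph as a plain adjacency structure; each node carries its
# cross-references.
graph = {
    "COB-3.2":   {"text": "Maintain separate client accounts.", "refs": ["COB-3.2.1"]},
    "COB-3.2.1": {"text": "Conditions: (a) written client consent.", "refs": ["GEN-1.4"]},
    "GEN-1.4":   {"text": "Definition of 'client account'.", "refs": []},
}

def retrieve_with_refs(graph, start_ids):
    # Deterministic traversal: follow every cross-reference to closure,
    # so the retrieved set is structurally complete.
    seen, stack = set(), list(start_ids)
    while stack:
        node_id = stack.pop()
        if node_id in seen or node_id not in graph:
            continue
        seen.add(node_id)
        stack.extend(graph[node_id]["refs"])
    return seen
```

Starting from the obligation alone, the traversal also returns its conditions and the definition they depend on — something vector similarity cannot guarantee.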
Firm Context Filtering
Before retrieval, the system knows the firm’s regulatory context: category, regulated activities, approved person roles, jurisdiction. The graph traversal applies these filters. The resulting obligation set applies to this firm, not to regulated firms in general.
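A sketch of the filtering step, with a hypothetical firm profile and hypothetical obligation metadata:

```python
# Hypothetical firm profile; the filter runs before retrieval, so obligations
# for other categories or activities never enter the candidate set.
firm = {"category": "3B", "activities": {"fund management"}}

obligations = [
    {"id": "A", "firm_types": {"1", "2"},  "activities": {"accepting deposits"}},
    {"id": "B", "firm_types": {"3B"},      "activities": {"fund management"}},
    {"id": "C", "firm_types": {"3B", "4"}, "activities": {"advising"}},
]

def applies(ob, firm):
    # An obligation applies only if both the category and at least one
    # regulated activity match.
    return (firm["category"] in ob["firm_types"]
            and bool(firm["activities"] & ob["activities"]))

applicable = [ob["id"] for ob in obligations if applies(ob, firm)]  # ["B"]
```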
Confidence Scoring with Semantic Content
Rather than generating uniform confidence, the system assigns confidence based on the quality of the retrieval: high confidence when the exact applicable obligation was retrieved with full context, lower confidence when the answer requires interpolation, explicit uncertainty when the question requires interpretation that the regulatory text doesn’t resolve.
Confidence scores should be interpretable by compliance officers — not just numbers, but descriptions of what makes an answer certain or uncertain.
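One way to make confidence interpretable is to derive it from observable properties of the retrieval rather than from the model. A sketch; the inputs, labels, and rules here are illustrative:

```python
def confidence(exact_match: bool, refs_resolved: bool, needs_interpretation: bool) -> str:
    # Confidence is a function of the retrieval's quality, not of the
    # model's own fluency. The output is a description, not just a number.
    if needs_interpretation:
        return "uncertain: question turns on interpretation the text does not resolve"
    if exact_match and refs_resolved:
        return "high: exact applicable obligation retrieved with full context"
    if exact_match:
        return "medium: obligation retrieved but some cross-references unresolved"
    return "low: answer interpolated from related but non-exact provisions"
```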
Auditable Output
Every answer includes the reasoning chain: which obligation nodes were retrieved, which cross-references were followed, which firm-type filters were applied, which inference steps produced the conclusion. The output is a package, not a paragraph.
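A sketch of what such a package might contain (the field names are illustrative):

```python
import json

def answer_package(question, answer, nodes, refs_followed, filters, confidence):
    # "A package, not a paragraph": the answer travels with its provenance.
    return {
        "question": question,
        "answer": answer,
        "retrieved_nodes": nodes,            # which obligation nodes were used
        "cross_refs_followed": refs_followed,
        "firm_filters_applied": filters,
        "confidence": confidence,
    }

pkg = answer_package(
    question="Must we segregate client accounts?",
    answer="Yes, subject to the consent conditions in COB 3.2.1.",
    nodes=["COB-3.2", "COB-3.2.1"],
    refs_followed=["COB-3.2 -> COB-3.2.1"],
    filters={"category": "2"},
    confidence="high",
)
audit_record = json.dumps(pkg, indent=2)  # serialisable for the audit trail
```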
The Standard for Compliance AI
The test for any AI tool in a compliance context is not “does it produce plausible-sounding answers?” It’s “can a compliance officer defend the answer to a regulator?”
That requires knowing where the answer came from. Which rule. Which version. Which interpretation. What the applicable conditions are. What the exceptions are. Whether the answer depends on facts the system doesn’t know.
Basic RAG, honestly assessed, does not produce defensible answers. It produces plausible answers — which is useful for research, dangerous for compliance.
DIFC Regulation 10 on autonomous systems formalised this distinction in regulatory terms: AI systems used in regulated decision-making must produce auditable, explainable outputs. ADGM is developing equivalent requirements. The direction is clear: compliance AI will be held to the same documentation standard as human compliance analysis.
A tool built on basic RAG doesn’t meet that standard. Not because RAG is bad — retrieval is an essential component of any honest compliance AI architecture — but because retrieval alone is not enough. The obligation structure, the firm context, the cross-reference traversal, and the confidence calibration are the parts that make the difference.
How Seif Approaches This
Seif’s architecture was designed around these failure modes.
Obligations are extracted into structured ComplianceUnit objects during ingestion. Retrieval is graph-based, with deterministic traversal of cross-references and firm-type filtering applied before retrieval. Confidence scores reflect the quality of the retrieval context, not just the model’s estimated certainty. Every output includes the full reasoning chain.
The result is a system where you can trace every answer back to specific regulatory text through a documented chain of inference. That’s not a marketing claim — it’s a structural property of the architecture.
If you want to see how this compares to your current compliance tooling, book a demo. We’ll walk through a live query, show you the reasoning chain, and let you verify the answer against the source text.
This post reflects our perspective on AI architecture for compliance applications. It is not a review of any specific competitor product. It is not legal advice.