TL;DR

- Not all content should be stored in a RAG system
- Use a banlist + masking + ensemble filtering to control ingestion
- Combine lexical, fuzzy, and semantic methods (Regex, BM25, KeyBERT, etc.)
- Apply filtering at ingestion, query, and answer stages
- Expect trade-offs: better safety vs. potential recall loss
- Add a human review loop for continuous tuning

When Do You Need This?

This approach is especially useful when:

- You handle sensitive or regulated data (PII, financial, medical)
- Your domain has strict boundaries (e.g., legal, industrial, internal corp data)
- You want to prevent prompt/data leakage
- You operate a multi-tenant or customer-facing system

Introduction

In Retrieval-Augmented Generation (RAG), most discussions focus on improving recall—ensuring that relevant context is not missed.

However, in production systems, the opposite question is equally important:

What content should not be retrieved—or not even indexed in the first place?

Depending on the domain, certain information may be irrelevant, sensitive, or even harmful. A cybersecurity company expects content about malware or exploits. An ice cream manufacturer clearly does not.

> Not all extracted content belongs in a vector database.

Domain-Specific Filtering

Unwanted content is highly domain-specific and must be configured accordingly.

A common strategy is to exclude unwanted chunks during ingestion. However:

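> Removing chunks may lead to loss of relevant information.

Structure-aware chunking (splitting on headings and paragraphs instead of fixed offsets) reduces this risk, because an unwanted passage is more likely to land in its own chunk than to share one with relevant text.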

Masking Instead of Removing

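Instead of dropping a whole chunk, masking replaces only the sensitive value with a typed placeholder:

4111 1111 1111 1111 → [CREDIT_CARD]

This protects sensitive data while preserving the meaning of the surrounding text.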
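
A minimal sketch of regex masking, to make the idea concrete. The two patterns are simplified assumptions for illustration, not the ones from the repo; real patterns need tuning (for example, a checksum test for card numbers).

```python
import re

# Simplified, hypothetical patterns; label -> compiled regex.
MASKING_REGEXES = {
    "CREDIT_CARD": re.compile(r"\b\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def mask(text: str) -> str:
    """Replace every match with a typed placeholder such as [CREDIT_CARD]."""
    for label, pattern in MASKING_REGEXES.items():
        text = pattern.sub(f"[{label}]", text)
    return text


print(mask("CC: 4111 1111 1111 1111, SSN: 123-45-6789"))
```

The placeholder keeps the information that a card number was present, which is often all the retriever needs.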

Language Handling Strategy

- Maintain the banlist in English only
- Expand it with synonyms (WordNet)
- Translate it on the fly into the document language (Argos Translate) and cache the results
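
The three steps can be sketched roughly like this. The `SYNONYMS` and `TRANSLATIONS` dictionaries are hypothetical stand-ins for a thesaurus and a translation engine, so the sketch runs without any external library; the function names are made up as well.

```python
from functools import lru_cache

BANNED_EN = ("password", "credit card")

# Hypothetical stand-ins for a thesaurus lookup and a translation engine.
SYNONYMS = {"password": ("passcode", "watchword")}
TRANSLATIONS = {
    ("password", "de"): "Passwort",
    ("credit card", "de"): "Kreditkarte",
}


def expand(phrases):
    """Return the phrases plus their synonyms, in order, without duplicates."""
    out = []
    for phrase in phrases:
        for candidate in (phrase, *SYNONYMS.get(phrase, ())):
            if candidate not in out:
                out.append(candidate)
    return tuple(out)


@lru_cache(maxsize=None)
def translate(phrase: str, lang: str) -> str:
    """Translate one phrase; unknown phrases fall back to English."""
    return TRANSLATIONS.get((phrase, lang), phrase)


@lru_cache(maxsize=None)
def banlist_for(lang: str):
    """Build the expanded banlist for one language; cached per language."""
    expanded = expand(BANNED_EN)
    if lang == "en":
        return expanded
    return tuple(translate(p, lang) for p in expanded)
```

Caching at two levels (single phrase and whole list) means a second German document costs no translation work at all.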

Multi-Layer Detection

Algorithms:

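- Regex: pattern match on each banned phrase
- Levenshtein: edit distance on regex hits (catches typos and l33t-speak)
- Jaccard: character n-gram overlap against the banlist
- BM25: term-based relevance score against the banlist
- KeyBERT: keyword extraction, then embedding comparison with banned phrases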
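
To give a feel for the two fuzzy scorers, here is a small standard-library sketch of Levenshtein similarity and character n-gram Jaccard. The function names and the choice of `n=4` are assumptions for illustration; BM25 and KeyBERT need more machinery and are left out.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via the classic two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = curr
    return prev[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Normalize the distance to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def char_ngrams(text: str, n: int = 4) -> set:
    """All lowercase character n-grams of the text."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a: str, b: str, n: int = 4) -> float:
    """Share of character n-grams the two strings have in common."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    union = ga | gb
    return len(ga & gb) / len(union) if union else 0.0
```

For example, `levenshtein_similarity("password", "passw0rd")` is 0.875, which is why an edit-distance check catches simple character swaps that an exact match misses.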

Aggregation:

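- **Depth** (consensus strength): how many algorithms score above their threshold
- **Breadth** (algorithm diversity): how many different algorithms produce a score at all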
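
One possible reading of depth and breadth as code. The exact semantics are my assumption here, and `AlgoResult` and `decide` are hypothetical names: a chunk is flagged only when enough algorithms reach their threshold and enough different algorithms report any score.

```python
from dataclasses import dataclass


@dataclass
class AlgoResult:
    name: str         # e.g. "regex", "jaccard", "bm25"
    score: float      # score produced by this algorithm
    threshold: float  # per-algorithm threshold


def decide(results, required_above=2, required_scoring=3):
    """Return "FLAGGED" or "PASS" from per-algorithm results.

    Assumed semantics (not confirmed by the source):
      depth   = algorithms whose score reaches their threshold
      breadth = algorithms that produced any non-zero score
    """
    depth = sum(r.score >= r.threshold for r in results)
    breadth = sum(r.score > 0 for r in results)
    flagged = depth >= required_above and breadth >= required_scoring
    return "FLAGGED" if flagged else "PASS"
```

Requiring both conditions makes a single noisy algorithm unable to flag a chunk on its own.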

🗺️ System Overview

If you only look at one diagram, make it this one:

The diagram below shows the full filtering pipeline, including:

- banlist preparation (synonyms + translation)
- masking of sensitive data
- the ensemble detection logic
- the final decision (pass vs. flagged)

The key idea: filtering is not a single step, but a coordinated set of checks across text, embeddings, and multiple algorithms.

╔════════════════════════════════════════════════════════════════════════════╗
║                    BANLIST FILTERING — SYSTEM OVERVIEW                     ║
╚════════════════════════════════════════════════════════════════════════════╝

  Config_Banned.py
  ┌──────────────────────────────────────────────────────────────────────────┐
  │  BANNED = ["password", "credit card", "iban", ...]   (English)           │
  │  MASKING_REGEXES = { credit_card: r"\d{4}[- ]\d{4}...", ssn: r"...", }   │
  │  Per-app thresholds: RAGLoad / RAGChat / DocClassify                     │
  └──────────────────────────────────────────────────────────────────────────┘
         │                           │
         ▼                           ▼
  ┌─────────────┐           ┌──────────────────┐
  │  Synonyms   │           │    Masker        │
  │ (WordNet)   │           │ (regex redact)   │
  │             │           │                  │
  │  "password" │           │ 4111 1111 1111   │
  │  → watchword│           │ → [CREDIT_CARD]  │
  │  → passcode │           │                  │
  │  → ...      │           │ 123-45-6789      │
  │             │           │ → [SSN]          │
  │  NOTE: NOT  │           │                  │
  │  used by    │           │ applied BEFORE   │
  │  Cosine /   │           │ storage (Load)   │
  │  KeyBERT    │           │ and AFTER LLM    │
  │  (embeddings│           │ answer (Chat)    │
  │  handle it) │           └────────┬─────────┘
  └──────┬──────┘                    │
         │  Expanded banlist         │ Redacted text
         ▼                           │
  ┌──────────────────────────┐       │
  │   Argos Translate        │       │
  │   (banlist translation)  │       │
  │                          │       │
  │  EN → DE, FR, ES, ...    │       │
  │  "password" → "Passwort" │       │
  │                          │       │
  │  Caches:                 │       │
  │  • translation_cache     │       │
  │  • translated_list_cache │       │
  └──────────┬───────────────┘       │
             │ Native-language       │
             │ banlist               │
             ▼                       │
  ┌──────────────────────────────────────────────────────────────────────────┐
  │                         ENSEMBLE CHECKS                                  │
  │                   (run_ensemble_checks)                                  │
  │                                                                          │
  │   Text ──────────────────────────────────────────────────────────────►   │
  │   Embedding ─────────────────────────────────────────────────────────►   │
  │                                                                          │
  │  ┌────────────────────────────────────────────────────────────────────┐  │
  │  │  ① Regex         exact/fuzzy pattern anchors on each banned phrase │  │
  │  │        │                                                           │  │
  │  │        └──► ② Levenshtein  edit-distance on regex hits             │  │
  │  │                            (catches typos & l33t-speak)            │  │
  │  │                                                                    │  │
  │  │  ③ Jaccard       char n-gram overlap (n=4–6) vs banlist            │  │
  │  │                  cache: per-language tokenized banlist             │  │
  │  │                                                                    │  │
  │  │  ④ BM25          TF-IDF term match, k1/b tunable                   │  │
  │  │                  cache: banlist_cache, idf_cache, avg_len_cache    │  │
  │  │                                                                    │  │
  │  │  ⑤ KeyBERT       double-pass keyword extraction → embedding        │  │
  │  │                  compare keyword vectors to banned phrase vectors  │  │
  │  │                                                                    │  │
  │  │  ⑥ Cosine        document embedding vs banned phrase embeddings    │  │
  │  │                  cache: pharase_embedding_cache_tensor             │  │
  │  │                  (optional, disabled by default)                   │  │
  │  └────────────────────────────────────────────────────────────────────┘  │
  │                                                                          │
  │   Each algo produces a score.  Scores go to the Accumulator.             │
  │                                                                          │
  │   ┌────────────────────────────────────────────────────────┐             │
  │   │  Accumulator                                           │             │
  │   │                                                        │             │
  │   │  Depth:  REQUIRED_ALGOS_ABOVE_THRESHOLD = N            │             │
  │   │  Breadth: REQUIRED_DIFFERENT_ALGOS_HAVE_A_SCORE = M    │             │
  │   │                                                        │             │
  │   │  pass: all scores below threshold                      │             │
  │   │  flag: ≥ N algos exceed their threshold                │             │
  │   └────────────────┬───────────────────────────────────────┘             │
  └────────────────────┼─────────────────────────────────────────────────────┘
                       │
            ┌──────────┴──────────┐
            ▼                     ▼
         PASS                  FLAGGED
            │                     │
            │               HUMAN_REVIEW CSV
            │               (phrase, algo, score,
            │                threshold, chunk)
            │                     │
            │               USE_EXCLUSIONS=True?
            │                     │
            │               Exclusions file
            │               (skip on next run)
            ▼
       continue pipeline

In practice, this structure allows you to tune filtering behavior per stage without changing the overall pipeline.
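
As a rough illustration of per-stage tuning, the thresholds can live in one config object per application while the checking code stays the same. `StageConfig`, the field names, and all numbers below are invented for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageConfig:
    thresholds: dict     # algorithm name -> minimum score that counts
    required_above: int  # how many algorithms must reach their threshold
    top_k: int           # how many banlist candidates to score


# All values are made up for illustration.
STAGES = {
    "RAGLoad": StageConfig({"regex": 1.0, "jaccard": 0.6, "bm25": 4.0}, 2, 50),
    "RAGChat": StageConfig({"regex": 1.0, "jaccard": 0.7, "bm25": 5.0}, 2, 10),
    "DocClassify": StageConfig({"regex": 1.0, "jaccard": 0.6, "bm25": 4.0}, 3, 50),
}


def exceeded(stage: str, scores: dict) -> list:
    """Names of the algorithms whose score reaches the stage threshold."""
    cfg = STAGES[stage]
    return sorted(
        name for name, score in scores.items()
        if score >= cfg.thresholds.get(name, float("inf"))
    )


def is_flagged(stage: str, scores: dict) -> bool:
    """Same decision logic for every stage; only the config differs."""
    return len(exceeded(stage, scores)) >= STAGES[stage].required_above
```

With these numbers, the same scores can flag a chunk at ingestion but let a query through at chat time, which is the kind of per-stage behavior the diagram describes.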

📥 RAGLoad — Ingestion Path

This is where most filtering happens.

Before any content is stored, it is:

- cleaned
- masked (PII redaction)
- chunked
- checked using the ensemble pipeline

Only chunks that pass these checks are embedded and stored.

╔════════════════════════════════════════════════════════════════════════════╗
║                    RAGLoad  —  DOCUMENT INGESTION PATH                     ║
╚════════════════════════════════════════════════════════════════════════════╝

  Document file (PDF / DOCX / PPTX / image / ...)
      │
      ▼
  [Text Extraction]  (pdfminer, python-docx, tesseract OCR, ...)
      │
      ▼
  [Unicode Normalizer]
      │
      ▼
  [Masker]  ◄── MASKING_REGEXES from Config_Banned.py
      │           redacts PII before it ever reaches the store
      │           e.g.  "CC: 4111 1111 1111 1111"
      │               → "CC: [CREDIT_CARD]"
      ▼
  [Language Detection]  (langdetect)
      │
      ├── unsupported language ──► reject / FALLBACK_EN
      │
      ▼
  [Chunker]  (SEMANTIC / SLIDING_WINDOW / FIXED_SIZE / HEADING / ...)
      │
      ▼  (per chunk)
  ┌─────────────────────────────────────────────────────────┐
  │  ENSEMBLE CHECKS  (PIPELINE_CHECK, accumulate=True)     │
  │  Regex + Levenshtein + Jaccard + BM25 + KeyBERT         │
  │  Banlist translated to document language via Argos      │
  └──────────────────────┬──────────────────────────────────┘
                         │
            ┌────────────┴────────────┐
            ▼                         ▼
          PASS                    FLAGGED
            │                         │
            ▼                    HUMAN_REVIEW CSV
  [Embed + store in ChromaDB]    + Exclusions file
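
The ingestion order above can be sketched in a few lines. Everything here is a simplified stand-in: `check` is a plain substring match instead of the ensemble, `chunk` splits on blank lines, and the returned lists take the place of the vector store and the review CSV.

```python
import re
import unicodedata

CARD = re.compile(r"\b\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}\b")
BANNED = ("password", "exploit")  # placeholder banlist


def normalize(text: str) -> str:
    """Fold compatibility characters (e.g. full-width digits) to plain forms."""
    return unicodedata.normalize("NFKC", text)


def mask(text: str) -> str:
    return CARD.sub("[CREDIT_CARD]", text)


def chunk(text: str) -> list:
    """Naive paragraph chunker: split on blank lines."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def check(chunk_text: str) -> list:
    """Stand-in for the ensemble: case-insensitive substring match."""
    lowered = chunk_text.lower()
    return [phrase for phrase in BANNED if phrase in lowered]


def ingest(document: str):
    """Return (chunks to store, flagged chunks for human review)."""
    stored, review = [], []
    for piece in chunk(mask(normalize(document))):
        hits = check(piece)
        if hits:
            review.append({"chunk": piece, "phrases": hits})
        else:
            stored.append(piece)
    return stored, review
```

Masking runs before chunking and checking, so the card number never reaches either the store or the review file.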

💬 RAGChat — Query & Answer Path

Filtering is also applied at runtime.

Both the user query and the generated answer are validated:

- the query is checked before retrieval
- the answer is checked after generation

The query check keeps unwanted content from entering the system; the answer check keeps it from leaving.

╔════════════════════════════════════════════════════════════════════════════╗
║                      RAGChat  —  QUERY & ANSWER PATH                       ║
╚════════════════════════════════════════════════════════════════════════════╝

  User query  (any language)
      │
      ▼
  [Language Detection]
      │
      ├── English ──────────────────────────────────────────────────────────┐
      │                                                                     │
      └── non-English                                                       │
              │                                                             │
              ▼                                                             │
       [HfTranslator]  (M2M-100 / Argos Translate)                          │
       query → English                                                      │
       session.response_language = detected_lang                            │
              │                                                             │
              ▼  (rewriter may mix languages again)                         │
       [Language Detection — 2nd pass]                                      │
              │ still non-English?                                          │
              └──► [HfTranslator — 2nd pass] ───────────────────────────────┤
                                                                            │
                                                                            ▼
                                                                English query
                                                                            │
                                                                            ▼
  ┌─────────────────────────────────────────────────────────────────────────┐
  │  PROMPT CHECK  (filter chain)                                           │
  │                                                                         │
  │  ① Ensemble Checks on query text (PROMPT_CHECK stage)                   │
  │     Regex + Levenshtein + Jaccard + BM25 + KeyBERT                      │
  │     (smaller TOP_K for performance)                                     │
  │                                                                         │
  │  ② LLM Guard  (check_prompt_with_llm_guard)                             │
  │     dedicated safety LLM (Llama-Guard / Mistral-based)                  │
  │     prompt: banlist + user classification keys injected                 │
  └────────────────────────────┬────────────────────────────────────────────┘
                               │
              ┌────────────────┴────────────────┐
              ▼                                 ▼
           PASS                              REJECTED
              │                              (block / log)
              ▼
  [PromptRewrite]  (coreference resolution via spaCy + LLM)
              │
              ▼
  [Vector Retrieval + BM25 Retrieval + RRF fusion]
              │
              ▼
  [LLM generation]  (Ollama)
              │
              ▼
  ┌───────────────────────────────────────────────────┐
  │  ANSWER COMPLIANCE CHECK  (PIPELINE_CHECK)        │
  │  Ensemble Checks on LLM answer text               │
  │  Regex + Levenshtein + Jaccard + BM25 + KeyBERT   │
  └───────────────────┬───────────────────────────────┘
                      │
         ┌────────────┴────────────┐
         ▼                         ▼
       PASS                    FLAGGED
         │                    answer suppressed
         ▼                    HUMAN_REVIEW CSV
  [Masker]
  redact PII from answer
  (credit cards, SSN, IBAN, ...)
         │
         ▼
  Answer shown to user
  (in session.response_language)
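
Stripped of translation and rewriting, the control flow of the chat path looks roughly like this. `guarded_answer` and its four hooks are hypothetical names used only for this sketch.

```python
def guarded_answer(query, generate, is_flagged, mask, review_log):
    """Run one chat turn with a check before retrieval and after generation.

    All hooks are hypothetical:
      generate(query)  -> answer text (retrieval + LLM call)
      is_flagged(text) -> True if the filter chain flags the text
      mask(text)       -> text with PII replaced by placeholders
      review_log       -> list collecting (stage, text) for human review
    """
    if is_flagged(query):
        review_log.append(("query", query))
        return None  # rejected before retrieval
    answer = generate(query)
    if is_flagged(answer):
        review_log.append(("answer", answer))
        return None  # answer suppressed
    return mask(answer)


# Toy wiring to show the control flow.
log = []
reply = guarded_answer(
    "What flavours sell best?",
    generate=lambda q: "Vanilla. Contact: 123-45-6789",
    is_flagged=lambda t: "password" in t.lower(),
    mask=lambda t: t.replace("123-45-6789", "[SSN]"),
    review_log=log,
)
print(reply)
```

Note that masking is the last step: even an answer that passes the compliance check is redacted before the user sees it.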

🏷️ DocClassify — Classification Path

The classification pipeline extends the same filtering approach.

Here, filtering ensures that:

- classification prompts are safe
- documents are validated before classification
- results can be reviewed and curated for targeted collections

╔════════════════════════════════════════════════════════════════════════════╗
║                    DocClassify  —  CLASSIFICATION PATH                     ║
╚════════════════════════════════════════════════════════════════════════════╝

  [STARTUP — once per process]
  ┌──────────────────────────────────────────────────────────────────┐
  │  Prompt Compliance Check  (_ensure_compliance_checked)           │
  │                                                                  │
  │  User-supplied classification prompt fed to LLM guard            │
  │  + filter chain (Ensemble Checks on prompt text)                 │
  │                                                                  │
  │  FAIL → PromptComplianceError (abort)                            │
  │  PASS → continue                                                 │
  └──────────────────────────────────────────────────────────────────┘
      │
      │  per document:
      ▼
  Document
      │
      ▼
  [Text Cleaning]  (punctuation, unwanted chars)
      │
      ▼
  [Language Detection]
      ├── unsupported → reject (NOT_OK CSV)
      ▼
  [Embedding]  (HuggingFace SBERT, cached via ModelsCache)
      │
      ▼
  ┌─────────────────────────────────────────────────────────┐
  │  ENSEMBLE CHECKS  (PIPELINE_CHECK, accumulate=False)    │
  │  Regex + Levenshtein + Jaccard + BM25 + KeyBERT         │
  └──────────────────────┬──────────────────────────────────┘
                         │
      ┌──────────────────┴──────────────────┐
      │   result stored; pipeline continues │
      ▼                                     ▼
  [KeyBERT double-pass]           (flag stored for later)
  Pass 1: extract top-N phrases
  Pass 2: refine to top-M n-grams
      │
      ▼
  [Cosine similarity]
  keyword embeddings vs document vector
      │
      ▼
  [Merge weights]  (KeyBERT × Cosine)
      │
      ▼
  [Snowball Stemmer]  (language-aware)
  + ReverseStemmer (restores surface forms after LLM)
      │
      ▼
  [LLM Classification prompt]
  formatted keyword/weight JSON → Ollama LLM
      │
      ▼
  [ModelOutputAdapter]  (parse JSON answer)
      │
      ▼
  [ReverseStemmer.apply_to_meta]  (restore best surface form)
      │
      ▼
  OK CSV  (classification result)
      │
      └── ensemble flagged earlier?
               │
               ▼
          HUMAN_REVIEW CSV
          + Exclusions file  (if USE_EXCLUSIONS=True)
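
The "Merge weights" step is the least obvious box in the diagram. One plausible reading, and this is my assumption, is a plain product of the KeyBERT score and the cosine similarity between each keyword vector and the document vector. The function names and toy vectors below are invented for the sketch.

```python
import json
import math


def cosine(u, v) -> float:
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def merge_weights(keybert_scores, keyword_vectors, doc_vector) -> dict:
    """Weight = KeyBERT score x cosine(keyword, document), sorted descending."""
    merged = {
        kw: round(score * cosine(keyword_vectors[kw], doc_vector), 4)
        for kw, score in keybert_scores.items()
    }
    return dict(sorted(merged.items(), key=lambda kv: kv[1], reverse=True))


def to_prompt_payload(weights: dict) -> str:
    """Format keyword/weight pairs as JSON for the classification prompt."""
    return json.dumps([{"keyword": k, "weight": w} for k, w in weights.items()])
```

A keyword that KeyBERT rates highly but that points away from the document vector ends up with a low merged weight, so it carries little influence in the prompt.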

Pros & Cons

✅ Pros

- Strong control over indexed content
- Domain adaptability
- Defense-in-depth
- PII protection
- Multilingual support
- Auditability

⚠️ Cons

- Information loss
- Configuration complexity
- False positives
- Performance overhead
- Translation gaps

Design Notes

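- Embeddings already capture semantic similarity, so the embedding-based checks (KeyBERT, Cosine) do not use the synonym-expanded banlist
- Synonym expansion mainly helps the lexical methods (Regex, Levenshtein, Jaccard, BM25)
- Downranking flagged chunks is an alternative to excluding them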
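
A minimal sketch of downranking, assuming flagged chunks are stored with a marker and that retrieval returns `(chunk_id, score)` pairs. The `penalty` factor and the function name are made up for illustration.

```python
def downrank(hits, flagged_ids, penalty=0.5):
    """Multiply the score of flagged chunks by `penalty` and re-sort.

    hits: list of (chunk_id, score) pairs from the retriever.
    flagged_ids: set of chunk ids that the filter flagged.
    """
    adjusted = [
        (cid, score * penalty if cid in flagged_ids else score)
        for cid, score in hits
    ]
    return sorted(adjusted, key=lambda hit: hit[1], reverse=True)
```

Flagged chunks stay retrievable when nothing better exists, which trades some safety for recall.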

Alternatives

- LLM-only filtering
  - simpler
  - but slower and less deterministic
- Post-retrieval filtering
  - preserves recall
  - but unsafe content may still enter the system
- No filtering
  - higher recall
  - but higher risk (hallucinated or unsafe outputs)

This pipeline combines deterministic and semantic methods across multiple stages.

How to Evaluate This

Typical metrics:

- False positive rate (good chunks removed)
- False negative rate (bad chunks still included)
- Recall impact
- Latency overhead

In practice, tuning thresholds and reviewing flagged samples is essential.
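
Given a hand-labeled sample of chunks, the first two metrics take only a few lines to compute. The function name and the input format (two parallel lists of booleans) are assumptions for this sketch; recall impact and latency need measurements from the running system and are not covered.

```python
def filter_metrics(unwanted, flagged):
    """Compute error rates from parallel lists of booleans.

    unwanted[i]: True if a human labeled chunk i as unwanted.
    flagged[i]:  True if the filter flagged chunk i.
    """
    pairs = list(zip(unwanted, flagged))
    tp = sum(u and f for u, f in pairs)
    fp = sum((not u) and f for u, f in pairs)
    fn = sum(u and (not f) for u, f in pairs)
    tn = sum((not u) and (not f) for u, f in pairs)
    return {
        # Share of good chunks that were wrongly removed.
        "false_positive_rate": fp / (fp + tn) if fp + tn else 0.0,
        # Share of unwanted chunks that slipped through.
        "false_negative_rate": fn / (fn + tp) if fn + tp else 0.0,
    }
```

Re-running this on the same labeled sample after each threshold change gives a quick before/after comparison.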

Summary

Filtering is applied at:

- Ingestion
- Query validation
- Answer validation

The goal is to balance recall, safety, and relevance.

Final Thought

Filtering in RAG is not just a safety feature—it’s a retrieval quality control mechanism.

Deciding what not to remember is as important as deciding what to retrieve.

Implementation

This setup is part of the framework I’ve been experimenting with. If you’re curious about the implementation details or want to explore the components themselves, you can find it here:

https://github.com/HarinezumIgel/RAG-LCC

u/HarinezumIgel

Filtering the Noise: A Practical Multi-Layer Banlist Pipeline for RAG Systems