u/Ill-Structure4482

Title: Follow-up experiment: structured evidence boundaries before RAG retrieval

I posted an earlier discussion about adding a structured evidence boundary before RAG retrieval. This is a small follow-up experiment on that idea.

Most RAG pipelines start with something like:

“Which chunk is semantically closest to the query?”

That works well for many use cases.

But I’m exploring whether, in higher-risk domains like legal, compliance, finance, or API specs, the system should first ask a different question:

“Which evidence boundary is this query allowed to search inside?”

For example, before retrieving any text, the system could first narrow the search space by jurisdiction, corpus, source identity, document version, legal unit, or a deterministic address such as a content hash.

The goal is not to replace vector search or BM25.

It is more like adding a routing / evidence-selection layer before retrieval, so downstream search happens inside a smaller and more auditable space.
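To make the routing step concrete, here is a minimal sketch of a deterministic boundary router. It is not the code from the repo. The `Boundary` type, the `ALIASES` table, the corpus identifiers, and the `ALLOW` label are all hypothetical names chosen for illustration, and simple phrase matching stands in for whatever classifier a real system would use.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Boundary:
    """A hypothetical evidence boundary: where a query may search."""
    jurisdiction: str
    corpus: str


# Hypothetical alias table mapping surface phrases to a boundary.
ALIASES = {
    "eu ai act": Boundary("EU", "eu_ai_act"),
    "ftc act": Boundary("US", "ftc_act"),
    "section 230": Boundary("US", "cda_230"),
    "digital markets act": Boundary("EU", "eu_dma"),
}


def route(query: str) -> Tuple[str, Optional[Boundary]]:
    """Return (decision, boundary) before any text retrieval happens."""
    q = query.lower()
    hits = {boundary for phrase, boundary in ALIASES.items() if phrase in q}
    if not hits:
        # Nothing in the corpus is addressed: refuse instead of guessing.
        return "NO_EVIDENCE", None
    if len(hits) > 1:
        # More than one boundary matched: do not silently pick one.
        return "AMBIGUOUS", None
    return "ALLOW", hits.pop()
```

Only an `ALLOW` decision hands a boundary to the downstream retriever. The other two outcomes stop the pipeline, so an unsupported query is not forced into the nearest chunk.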

I put together a small implementation and an RQ7 benchmark pass here:

https://github.com/Void-Ghost000/HSRAG

Important scope limitation: this benchmark does not call an LLM.

It does not test:

generated answer quality
LLM reasoning
legal reasoning
end-to-end chatbot behavior
production vector database performance

It only looks at the retrieval / routing / evidence-selection layer.

So I’m not claiming this “solves RAG” or “makes the model smarter.”

The narrower experiment is about whether structured evidence boundaries can help reduce:

wrong-corpus retrieval
cross-jurisdiction contamination
unsupported queries being forced into a nearby chunk
unnecessary candidate search
weak auditability of retrieved evidence

The repo now has a clearer RQ7 section with key findings, evidence summary, local vector/hybrid baselines, and explicit claim boundaries.

I’d also be interested in hearing from people who want to explore or extend this kind of experiment.

Some directions I’m thinking about:

trying the same retrieval-boundary idea on other corpora or domains
adding better baselines or more adversarial query sets
visualizing the retrieval / routing / evidence-selection path as a hash-audited trace
using a hash audit chain to make retrieved evidence easier to inspect, replay, or challenge
exploring whether this kind of structure could help with memory, provenance, or long-running agent workflows

To be clear, I don’t mean exposing hidden LLM reasoning. I’m more interested in making the external retrieval process, evidence path, and system decisions easier to inspect.
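As a rough illustration of the hash audit chain idea, the sketch below links each external decision (routing, retrieval, selection) to the previous one through SHA-256. This is an assumption about how such a trace could be built, not a description of the repo. The event fields and the names `append_event` and `verify` are made up for the example.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash for the first entry


def _digest(prev_hash: str, event: dict) -> str:
    # Canonical JSON so the same event always hashes the same way.
    payload = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_event(chain: list, event: dict) -> dict:
    """Append an event whose hash covers the previous entry's hash."""
    prev_hash = chain[-1]["entry_hash"] if chain else GENESIS
    entry = {
        "prev_hash": prev_hash,
        "event": event,
        "entry_hash": _digest(prev_hash, event),
    }
    chain.append(entry)
    return entry


def verify(chain: list) -> bool:
    """Recompute every hash; any edited or reordered entry breaks the chain."""
    prev_hash = GENESIS
    for entry in chain:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["entry_hash"] != _digest(prev_hash, entry["event"]):
            return False
        prev_hash = entry["entry_hash"]
    return True
```

A trace could then record something like `{"step": "route", "decision": "ALLOW"}` followed by `{"step": "retrieve", "chunk": "<evidence_hash>"}`. Replaying `verify` later shows whether the recorded path was altered after the fact.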

If anyone has related ideas, criticism, implementation suggestions, or wants to try a different experiment design, I’d be happy to discuss.

I’ve been working on an early exploration project around a common RAG retrieval problem: semantically similar text being retrieved from the wrong corpus or jurisdiction.

This is not meant to argue that RAG is obsolete or that vector search is useless. The hypothesis is narrower:

Can hash-addressed evidence chunks act as a complementary pre-retrieval boundary layer for RAG?

The idea is to take high-value or frequently reused text, split it into chunks, and attach each chunk to metadata such as corpus, jurisdiction, unit, source_hash, and evidence_hash.
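A minimal sketch of what such a chunk record could look like. The field names follow the metadata listed above, but how the two hashes are computed is my assumption here: `source_hash` is taken over the full source text, and `evidence_hash` over the chunk text plus its metadata. The helper `make_chunk` is hypothetical.

```python
import hashlib
from dataclasses import dataclass


def sha256_hex(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class EvidenceChunk:
    corpus: str
    jurisdiction: str
    unit: str            # e.g. an article or section label
    source_hash: str     # identifies the exact source document version
    evidence_hash: str   # identifies this chunk within its boundary
    text: str


def make_chunk(corpus: str, jurisdiction: str, unit: str,
               source_text: str, chunk_text: str) -> EvidenceChunk:
    source_hash = sha256_hex(source_text)
    # Assumption: bind the chunk text to its metadata, so the same words
    # filed under a different jurisdiction get a different address.
    evidence_hash = sha256_hex(
        "\n".join([corpus, jurisdiction, unit, source_hash, chunk_text])
    )
    return EvidenceChunk(corpus, jurisdiction, unit,
                         source_hash, evidence_hash, chunk_text)
```

Because the address is derived from both content and metadata, a change to either one produces a new `evidence_hash`, which is what makes the chunk usable as an audit unit.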

Traditional RAG often starts with:

“Which chunk is semantically similar?”

This exploration first asks:

“Which bounded evidence address is this query allowed to search?”

This does not mean removing BM25, vector search, or hybrid retrieval. The idea is to narrow the allowed search space first, then let BM25 / TF-IDF / vector / hybrid retrieval work inside that bounded subset.
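The "narrow first, then rank" order can be sketched as follows. A simple token-overlap score stands in for BM25 / TF-IDF / vector similarity, and the chunk texts are short placeholders rather than quotes from the laws. The names `CHUNKS`, `overlap_score`, and `bounded_search` are illustrative assumptions.

```python
import re
from collections import Counter

# Placeholder chunks; the texts are illustrative, not statutory language.
CHUNKS = [
    {"id": "eu-ai-act-art5", "jurisdiction": "EU", "corpus": "eu_ai_act",
     "text": "Article 5 placeholder: prohibited AI practices"},
    {"id": "ftc-act-sec5", "jurisdiction": "US", "corpus": "ftc_act",
     "text": "Section 5 placeholder: unfair or deceptive acts or practices"},
    {"id": "ftc-act-other", "jurisdiction": "US", "corpus": "ftc_act",
     "text": "Placeholder for an unrelated FTC Act provision"},
]


def tokenize(text: str) -> list:
    return re.findall(r"[a-z0-9]+", text.lower())


def overlap_score(query: str, text: str) -> int:
    """Stand-in lexical score: number of shared tokens (with multiplicity)."""
    q, t = Counter(tokenize(query)), Counter(tokenize(text))
    return sum(min(q[w], t[w]) for w in q)


def bounded_search(query, chunks, jurisdiction, corpus, top_k=3):
    # Step 1: hard boundary filter, applied before any scoring.
    subset = [c for c in chunks
              if c["jurisdiction"] == jurisdiction and c["corpus"] == corpus]
    # Step 2: ordinary ranking, but only inside the allowed subset.
    scored = [(overlap_score(query, c["text"]), c) for c in subset]
    scored = [(s, c) for s, c in scored if s > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:top_k]]
```

With the query "article 5 practices", an unbounded lexical search would favour the EU chunk. Inside the US / `ftc_act` boundary, the EU chunk is never a candidate, whatever its score.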

I call this experiment HSRAG: Hash-Structured Retrieval-Augmented Generation. The name may sound larger than the current implementation. A more precise description is an early exploration project in which hash-addressed chunks act as retrieval boundaries and auditable evidence units. I plan to keep testing different hypotheses, baselines, corpus settings, and failure cases.

I used legal text as the first benchmark domain because cross-domain failure cases are easy to define there: provisions from different laws and jurisdictions can look alike to a retriever. Examples:

EU AI Act Article 5
U.S. FTC Act Section 5
CDA Section 230
EU DMA gatekeeper obligations

The latest benchmark is RQ6, which tests multi-turn retrieval contamination. For example, if a user first asks about EU law and then switches to a similar U.S. law, does the retriever incorrectly carry over the previous EU context?
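To illustrate the contamination mechanism, here is a small sketch that contrasts two ways of carrying context across turns. It is a hypothetical simplification, not what the repo's memory policies do internally. The function names and the rule "an explicit boundary in the current turn always wins" are assumptions.

```python
from typing import List, Optional


def naive_memory_query(history: List[str], query: str) -> str:
    """Naive carry-over: earlier turns are concatenated into the new query,
    so jurisdiction terms from earlier turns leak into retrieval."""
    return " ".join(history + [query])


def bounded_memory_boundary(previous: Optional[str],
                            routed: Optional[str]) -> Optional[str]:
    """Bounded carry-over: memory holds only a boundary label, and it is
    used only when the current turn does not name a boundary itself."""
    return routed if routed is not None else previous
```

In the EU-then-US example, the naive version sends a query that still contains the EU wording to the retriever. The bounded version switches to the US boundary as soon as the new turn routes there.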

RQ6 stress run:

20,000 Monte Carlo trials
720,000 result rows
retrieval modes: BM25, TF-IDF, Hybrid RRF, HSRAG CTHC, HSRAG Hybrid Subset
context policies: no_memory, naive_memory, bounded_cthc_memory
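For readers unfamiliar with the Hybrid RRF baseline, reciprocal rank fusion combines several ranked lists by summing 1 / (k + rank) for each document. The sketch below uses the commonly cited constant k = 60. Whether the benchmark uses the same constant or tie-breaking is an assumption on my part.

```python
from typing import Dict, List


def rrf(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse ranked lists of document ids with reciprocal rank fusion."""
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))
```

For example, fusing a BM25 ranking `["a", "b", "c"]` with a TF-IDF ranking `["b", "c", "a"]` puts `b` first, because it is near the top of both lists.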

In this controlled benchmark, the HSRAG modes so far show:

0 wrong-corpus retrieval
0 wrong-jurisdiction retrieval
0 false allow on NO_EVIDENCE / AMBIGUOUS cases
0 cross-turn contamination
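For anyone who wants to check counts like these on their own runs, a tally could look roughly like the sketch below. The row schema (`expected_label`, `expected_corpus`, `expected_jurisdiction`, `retrieved`) is hypothetical and does not mirror the repo's result files.

```python
def tally(rows):
    """Count boundary violations over benchmark result rows."""
    counts = {"wrong_corpus": 0, "wrong_jurisdiction": 0, "false_allow": 0}
    for row in rows:
        if row["expected_label"] in ("NO_EVIDENCE", "AMBIGUOUS"):
            # Any retrieved chunk here means the system answered
            # a query it should have refused.
            if row["retrieved"]:
                counts["false_allow"] += 1
            continue
        for hit in row["retrieved"]:
            if hit["corpus"] != row["expected_corpus"]:
                counts["wrong_corpus"] += 1
            if hit["jurisdiction"] != row["expected_jurisdiction"]:
                counts["wrong_jurisdiction"] += 1
    return counts
```

Counting per retrieved chunk rather than per query is a design choice. A per-query rate would need a small change to the inner loop.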

Important caveats:

This depends on clean upfront corpus classification
This is not legal advice
This is not production-ready
This is not a RAG replacement claim
If metadata is wrong, the hash-addressed boundary can also be wrong
Multi-law comparison questions should still be decomposed into atomic retrieval tasks before synthesis
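The last caveat can be sketched as a decomposition step: a comparison question is split into one atomic retrieval task per law, and each task is retrieved inside its own boundary before anything is synthesized. The `LAWS` table and the task format are hypothetical, and phrase matching is only a stand-in for a real decomposer.

```python
# Hypothetical phrase-to-corpus table for illustration.
LAWS = {
    "eu ai act": "eu_ai_act",
    "ftc act": "ftc_act",
}


def decompose(query: str) -> list:
    """Return one atomic retrieval task per law named in the query."""
    q = query.lower()
    return [
        {"corpus": corpus, "query": query}
        for phrase, corpus in LAWS.items()
        if phrase in q
    ]
```

A query such as "Compare the EU AI Act with the FTC Act" yields two tasks, one per corpus. A single bounded search would otherwise have to either refuse the query or mix the two corpora.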

Repo: https://github.com/Void-Ghost000/HSRAG

I’m curious what people here think about this:

Are hash-addressed evidence chunks useful as a complement to RAG?
Is this better described as metadata filtering, routing, or retrieval governance?
What would be a fair vector baseline for this kind of benchmark?
In production RAG, would the biggest issue be ingestion, metadata quality, scale, or query decomposition?
