Need a help
We’re experimenting with a local document verification pipeline using OCR + a small language model (Qwen2.5 1.5B via Ollama), and we’re hitting an interesting issue around consistency validation.
Current pipeline:
PDF/Image
→ OCR extraction
→ cleaned extracted text
→ Qwen2.5 1.5B
→ verification / normalization layer
The OCR itself is working surprisingly well. We’re getting reasonably clean extracted text even from noisy multilingual scans.
The problem starts in the verification stage.
Examples of what we want the SLM to reliably do:
- normalize names
- normalize dates/currency formats
- compare entities across multiple extracted sections
- detect mismatches/inconsistencies
- avoid hallucinating missing values
- maintain deterministic output structure
Example input:
PAN:
Name: Rahul S Shah
DOB: 12/04/1996
Salary Slip:
Employee Name: Rahul Shah
Net Salary: INR 1,20,000
Bank Statement:
Account Holder: Rahul S. Shah
Salary Credits: 120000
Problems we’re seeing:
- inconsistent reasoning between runs
- occasional hallucinated fields
- weak cross-document comparison
- poor long-context consistency
- model sometimes treats semantically identical values as different
- unstable formatting/output
It feels like the model lacks “document context awareness” and structural understanding of what kind of records it is processing.
Questions:
Is this mainly a prompting/context-engineering problem?
Should we move from raw OCR dumps → structured extraction first?
Are smaller models fundamentally weak at entity consistency tasks?
Would rule-engine + SLM hybrid systems work better here?
Should we chunk documents by semantic sections before prompting?
Has anyone had success with constrained decoding / JSON schema enforcement for deterministic verification workflows?
Are there open-source models that perform better specifically for structured document validation/reconciliation tasks?
We’re intentionally keeping everything local/offline, so cloud APIs are not preferred.
Would really appreciate insights from anyone working on:
- document intelligence
- OCR pipelines
- local LLM systems
- entity resolution
- structured extraction
- verification engines
- long-context consistency
Especially interested in architectural lessons learned rather than model benchmarks.