u/CheesieApple

I've been building an open-source document→JSON extractor that runs fully local on Ollama (no API keys, $0), and I wanted to share a few things that surprised me — plus a failure mode I'm still chewing on, because this sub is the right place to get torn apart constructively.

The setup: you give it a file + a schema (just {"invoice_date": "date", "total": "number"}), and it returns JSON validated against that schema, or a structured error. The "understanding" step is swappable — stub / Ollama / (eventually) a hosted model — but the whole point was to make a small local model good enough to trust.

Thing 1: Ollama's structured outputs (format) do a lot of heavy lifting.

Passing the JSON Schema derived from the user's schema constrains a 3B model to emit matching JSON. Combined with one corrective retry that feeds validation errors back, even llama3.2 does surprisingly well on clean invoices and résumés.

Thing 2: the biggest reliability win wasn't a bigger model.

It was deterministic post-processing.

Classic example: an Indian receipt with 26-05-2025 (DD-MM-YYYY). Every model I tested — llama3.2 and qwen2.5:7b — occasionally interpreted that as the year 2605.

The fix wasn't scaling up.

It was parsing the date in code (strptime) and normalizing to ISO. Dates are a solved problem; making the model guess was the mistake.

I now do schema validation + deterministic repairs before trusting any extraction.

On my (small but honest) eval set — invoices and a résumé with nested lists — the pipeline hits 100% field accuracy on llama3.2, scored field-by-field against known answers.

Thing 3 (the failure mode I'd love feedback on):

I threw a real 15-page PDF at it and asked yes/no + list questions.

It confidently returned wrong answers:

has_burger: false even though burgers existed later in the document
Invented pizza toppings that never appeared in the source

Root causes seem to be:

Context truncation

llama3.2's default num_ctx (~2048) only covered the first few pages. The relevant information appeared later, so the model never saw it.

Hallucination on absent fields

The schema asked for pizza toppings, but the document never mentioned pizza. Instead of returning null, the model fabricated an answer with high confidence.

My current thinking is:

Retrieval/chunking so each field only sees relevant sections
Grounding checks that verify extracted values actually exist in source text
Returning null when evidence is missing instead of forcing a value

Curious how people here handle the "field requested but not present in source" problem when working with local models.

Do you use:

String grounding?
Verifier passes?
Confidence thresholds?
Something else entirely?

Not selling anything. Mostly looking for feedback from people who have pushed small local models into production-style structured extraction workflows.

I've been building an open-source document→JSON extractor that runs fully local on Ollama (no API keys, $0), and I wanted to share a few things that surprised me - plus a failure mode I'm still chewing on, because this sub is the right place to get torn apart constructively.

The setup: you give it a file + a schema (just {"invoice_date": "date", "total": "number"}), and it returns JSON validated against that schema, or a structured error. The "understanding" step is swappable - stub / Ollama / (eventually) a hosted model - but the whole point was to make a small local model good enough to trust.