Ollama (local LLM) + Paperless-GPT + Paperless-ngx: OCR issues with searchable PDFs (embedded text)
Local setup:
Paperless-ngx
Paperless-GPT
Ollama on DGX Spark
MiniCPM-V for OCR/image processing
Paperless-AI for metadata afterward
I noticed a consistent issue with searchable PDFs (PDFs with embedded text).
I tested the same document as:
Searchable PDF with embedded text
Image-only PDF version (PDF -> screenshot -> converted back to PDF with an online image-to-PDF tool)
Results:
Searchable PDF
- Can take a very long time to process
- Repeats the same paragraphs 100+ times in content
Image-only PDF
- Processes quickly
- Works correctly
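For anyone who wants to check their own documents for the same repetition symptom, here is a rough Python sketch that counts repeated paragraphs in the extracted content. It only assumes you can get the content out as a plain string with paragraphs separated by blank lines; `repeated_paragraphs` and `min_repeats` are names I made up for illustration, not part of Paperless-ngx, Paperless-GPT, or Ollama.

```python
import re
from collections import Counter


def repeated_paragraphs(content: str, min_repeats: int = 3) -> dict[str, int]:
    """Return paragraphs that occur at least `min_repeats` times, with their counts.

    Hypothetical helper: assumes paragraphs are separated by blank lines.
    """
    # Split on blank lines, then collapse internal whitespace so copies that
    # differ only in spacing or line wrapping compare equal.
    paragraphs = [" ".join(p.split()) for p in re.split(r"\n\s*\n", content)]
    # Count non-empty paragraphs only.
    counts = Counter(p for p in paragraphs if p)
    return {p: n for p, n in counts.items() if n >= min_repeats}


if __name__ == "__main__":
    # Made-up sample text that mimics the looping output described above.
    sample = "\n\n".join(["Invoice total: 42 EUR"] * 120 + ["Thank you."])
    for paragraph, n in repeated_paragraphs(sample).items():
        print(f"{n}x: {paragraph}")
```

This only detects the problem after the fact. It does not explain why the searchable PDF triggers it.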
Has anyone else seen this with MiniCPM-V or Paperless-GPT? If you're using Ollama + local vision models, what are you doing to avoid this with searchable PDFs?