r/Paperlessngx

Can't get any competent LLM model running without crashing on OCR

I've had a paperless-ngx instance up and running on my Ubuntu Server 24.04.4 LTS for a while, but it's difficult for me to put effort into using it, because in my experience it doesn't work as advertised without some serious tinkering with the settings. Scanned-in PDFs are always flipping around or upside down, despite my playing around with the autorotate settings. The ML suggestions are OK, but tedious to go in and apply. It's just generally not as much of a hands-off experience as I would like.

Then I came across this guide/video and thought, it could definitely be useful, as when he switches over to the AI OCR, it seems to classify/textualize the document content flawlessly, to then have the LLM follow up and apply the correct tags:

https://technotim.com/posts/paperless-ngx-local-ai/

In the guide, he makes no mention of the GPU specs he's using; he just mentions that the model he's using "runs great". In fact, he even specifies that an NVIDIA GPU is optional but recommended for vision OCR.

Well, I recently bought a 5060 Ti 16GB for my own desktop to play around with local LLMs, and moved my older 1660 Super 6GB to the server for Plex transcoding and hopefully running some light-duty LLMs (particularly for this use case).

The problem is, I can't get any competent model to perform the OCR without missing huge portions of text and/or straight up hallucinating stuff that isn't in there. The model will load entirely into VRAM, and then it will crash after trying to process even basic PDF files, due to running out of memory. I've had some luck with setting OCR_LIMIT_PAGES: "1", but it will still generally crash.

I've gotten it to process a few documents with moondream and some non-vision models, but it will just miss entire swaths of text or add stuff that's not even remotely related to the document. I know 6GB isn't huge, but why is one page at a time killing the entire model, especially when he's saying a GPU is optional?

This is just a personal home server, and I'm not going to be crunching through a massive workflow, basically just receipts and letters and "important stuff" here and there. Accuracy is far more important to me than speed, as long as I'm also utilizing the hardware to its fullest ability.

My problem with the built-in paperless-ngx OCR is that if the page is flipped at all (or a bit crumpled), it just goes and types a whole bunch of gibberish in the content field.

Anyone have any luck with smaller models? Anyone care to share their docker settings?

u/Auwardamn — 3 days ago

Have it leave my files where they are.

I have a folder structure with existing PDFs and pictures that I want to leave in their current location. I do not want paperless to consume them and move them. I just want it to be a search engine where I can tag files.

My folder is about 20 GB of business data with many PDFs and scanned pictures.

I have set it up with:

>PAPERLESS_CONSUMER_DELETE_ON_SUCCESS=false

>PAPERLESS_CONSUMER_SUBDIRS_AS_TAGS=true

Unfortunately, that did not work. As far as I can tell, it moved all my PDFs.

AI is hallucinating, saying the primary culprit is typically a setting called PAPERLESS_CONSUMER_RECURSIVE=true interacting with an ambiguous duplicate detection policy. In older build versions, when Paperless detects an exact hash duplicate inside a deeply nested recursive directory, it can trigger a cleanup function to purge the duplicate from the landing tree—accidentally ignoring the main global deletion override flag.

The problem with that is, I am 99% sure it's not "duplicates", because I can point to unique PDFs that were in my folder but are no longer there after paperless scanned it.

Is putting the volume in Read Only mode the only way to fix this?
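One way to sidestep this without a read-only mount is to never point the consumer at the originals at all, and instead copy files into the consume folder so that only the copies get picked up. Below is a minimal sketch of that idea, not an official paperless-ngx tool. The function name `copy_into_consume`, the extension list, and the example paths in the note below are all assumptions made up for illustration.

```python
import shutil
from pathlib import Path

# File types to hand to the consumer (assumption; adjust to taste).
EXTENSIONS = {".pdf", ".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def copy_into_consume(source_root: Path, consume_root: Path) -> list[Path]:
    """Copy matching files from source_root into consume_root.

    The originals are only read, never moved or deleted. The relative
    folder layout is recreated under consume_root, so subdirectory-based
    tagging can still see the folder names.
    """
    copied = []
    for src in sorted(source_root.rglob("*")):
        if not src.is_file() or src.suffix.lower() not in EXTENSIONS:
            continue
        dest = consume_root / src.relative_to(source_root)
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dest)  # copy2 also preserves timestamps
        copied.append(dest)
    return copied
```

It would be called as something like `copy_into_consume(Path("/data/business"), Path("/data/paperless/consume"))`, where both paths are hypothetical. The sketch assumes the consume folder is not inside the source tree, and re-running it copies everything again, so it is a one-shot import rather than a sync.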

Appreciate any help.

u/Savings_Art5944 — 3 days ago

Ollama Local LLM Paperless GPT - Paperless-ngx PDF with searchable text OCR issues.

Local setup:
Paperless-ngx

Paperless-GPT

Ollama on DGX Spark

MiniCPM-V for OCR/image processing

Paperless-AI for metadata afterward

I noticed a consistent issue with searchable PDFs (PDFs with embedded text).
I tested the same document as:

  1. Searchable PDF with embedded text

  2. Image-only PDF version (PDF -> screenshot -> converted back to PDF with an online image-to-PDF tool)

Results:

Searchable PDF

- Can take a very long time to process

- Repeats the same paragraphs 100+ times in content

Image-only PDF

- Processes quickly

- Works correctly

Has anyone else seen this with MiniCPM-V or Paperless-GPT? If you're using Ollama + local vision models, what are you doing to avoid this with searchable PDFs?
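As a stopgap while the cause is unclear, the repeated paragraphs can be stripped from the extracted content after the fact. The sketch below is a generic text cleanup, not part of Paperless-GPT or Ollama, and the function name `collapse_repeated_paragraphs` is made up for illustration. It assumes paragraphs are separated by blank lines, and it does nothing about the slow processing itself.

```python
import re


def collapse_repeated_paragraphs(text: str) -> str:
    """Drop paragraphs that have already appeared, keeping first occurrences.

    Paragraphs are blank-line-separated blocks; the comparison ignores
    differences in whitespace.
    """
    seen = set()
    kept = []
    for block in re.split(r"\n\s*\n", text):
        key = " ".join(block.split())  # normalize whitespace for comparison
        if not key or key in seen:
            continue
        seen.add(key)
        kept.append(block.strip())
    return "\n\n".join(kept)
```

Note that this also removes legitimate repeats (for example a footer printed on every page), so it is better suited to checking how much of the content is duplicated than to blind use on every document.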

u/TEEorCoffee2025 — 5 days ago

I’m building a self-hosted document app with built-in LLM OCR/Q&A, and I’d love feedback from paperless users

Hi everyone, I hope this kind of post is okay here. I’ve been building Paperwise, a self-hosted document intelligence app, and I’d really value feedback from people who already care deeply about document workflows.

To be clear: Paperless is much more mature, and I’m not trying to position Paperwise as a drop-in replacement. I built it because I wanted a document app where LLM features are native rather than bolted on afterward.

The main things I’m exploring are:

  • OCR and metadata extraction using local or remote LLMs
  • Grounded “ask your documents” answers with source-backed context
  • Per-task model configuration for OCR, metadata, and Q&A
  • Self-hosted deployment with normal document organization workflows
  • Better debugging when provider/model connections fail

Project link: https://paperwise.dev/

Github: https://github.com/zellux/paperwise

If anyone here is curious enough to try it, I’d love blunt feedback. Missing basics, rough setup, confusing UX, or “I would never use this because…” comments are all useful to me.

Thanks!

u/zellux — 6 days ago

paperlessimap: Browse your Paperless-ngx documents as emails via IMAP (Public Alpha)

Hello everyone!

For the past year, I’ve been working on a bridge to bring my Paperless-ngx library into my daily email workflow. I’m happy to announce the public alpha of paperlessimap.

What is it?

It’s an IMAP server bridge that allows you to access your Paperless-ngx documents from any mail client (Thunderbird, Outlook, etc.). It currently provides read-only access, where your documents are presented as emails with the original PDFs attached.

The Tech Stack

  • Backend: PHP (Symfony)
  • Mail Core: Dovecot
  • Deployment: Docker-ready (Compose setup included)

Why use this?

As a heavy Thunderbird user, I found that I could often find and navigate my documents faster using a mail client's native search and folder (tag) structure than through the WebUI. It’s about integrating document management into the tools I already use all day.

Current Status

  • Alpha version: Stable enough for daily private use.
  • Authentication: Currently via a fixed password in .env (direct Paperless-ngx credential login is planned).
  • Easy Setup: A pre-configured docker-compose.yaml is available in the /docker/compose directory.
  • Localization: Currently in German, but the codebase is prepared for translations.

Feedback & Ideas

I'd love to get some feedback from the community!

  • Does an IMAP interface fit your workflow?
  • What would be your priority: "Move to folder" for tagging or full write-access?
  • Any specific ideas for the development roadmap?

Repository:https://codeberg.org/lindesbs/paperlessImap

Note: Developed with the assistance of LLM (Cursor.com) for documentation, testing, and planning.

Looking forward to your thoughts!

u/lindesbs — 8 days ago

Do you think there is a market for pre-configured Paperless-NGX devices?

>I did not use AI to write this. I just happen to be an IT person who knows Markdown

I provide IT services and management of various systems, and am considering adding a product to my offerings: pre-configured plug-and-play Paperless-NGX on Carbon Systems mini PCs.

Paperless-NGX Site

Paperless-NGX:

It's a popular FOSS application that auto-organizes documents. Its overall goal is to make you "paperless". To put it lightly: "It's a damn useful piece of software."

I've been using it for about a year, and it's been lovely: 2 min vid

  • Automatically OCRs docs (PDFs, Office docs, pictures) into searchable text
  • Learns your documents and automatically assigns useful info
    • Tags for quick sorting
    • Correspondents (the name of the org the doc is associated with, e.g. Walmart for any receipt from Walmart)
    • Document Types (fully customizable, example: "Deposit Slip")
  • Ability to share documents (with optional time sensitivity) with outside users
  • User & Group rights
  • Processing of docs using file-scanning or email or the drag-n-drop web interface
  • Exposeable API for advanced customization/workflows

The Pre-Configured Device:

I am a dealer for Carbon Systems PCs, and would use these PCs to provide a dedicated Paperless install.

  • Intel based PC with a 3-year warranty.
  • Configurable storage (default of 500GB, max of 4TB)
  • Pre-configured SMB share (for scanning to the device)
  • Pre-configured local SMTP option (usable only as a local send option for scanning from a copier or for automated email)
    • I feel I may be over-explaining this part. Sending email from a copier/scanner is a PITA when people try to use their Google or M365 email. This would essentially be a local email server for the single purpose of making scanning via email simple for the customer. (This has nothing to do with receiving docs via email in Paperless; it's just that email consumption in Paperless is far more advanced than the other methods, and I'd like there to be a simple option for people to use this feature.)
  • Setup and training session included
  • 3 months of software & management support included

The Managed Services Side:

  • Backup
  • 24/7 monitoring of system health
  • Handling of updates of the OS & Program(s)
  • Program administration (ie add/remove users)
  • (optional) Assignment and management of a domain for remote access to the program

My own thoughts on the idea:

Paperless is better than SharePoint or Google Drive for managing non-editable documentation (things like receipts and bank statements). And for me, it's been a godsend for managing MAIL (I despise snail mail and paper docs; everything has been digitized and is super easy to find now).

I've not implemented this program at many businesses. The people I've set up with it are small operations. Before offering this as a service, I would implement it at a few of my preferred customers ahead of a general release.

The price point of offering a dedicated Paperless Server would likely be $1k - $2k. (because prices right now are insane).

What are your thoughts about this?

u/TxTechnician — 7 days ago

Fresh installation via script and Docker -- getting "Not found" on site

I've run the install script from this page:

https://docs.paperless-ngx.com/setup/#after-installation_1

I've run it twice now thinking I set something I shouldn't have but both times the end result is the same: I get "Not found" when accessing localhost:8000.

Not sure if it matters but I notice that after running the script, it automatically starts the services and the script never formally ends (it just shows the HTTP server running for paperless-ngx).

I've restarted the containers in case it's that but nope ... still getting the "Not found" message when accessing the URL.

Any ideas? I've followed the instructions, which are pretty simple and straightforward, and Google searches aren't turning up anything.
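For narrowing this down, it can help to see exactly which status code comes back on a few paths, since a 404 from a server that answers is a different problem from nothing listening on the port. The following is a small diagnostic sketch using only the Python standard library. The function name `probe` is made up, and the paths to check are up to whoever runs it.

```python
import urllib.error
import urllib.request


def probe(base_url: str, paths: list[str], timeout: float = 5.0) -> dict:
    """Return {path: HTTP status} for each path, or None if nothing answered."""
    # Bypass any system proxy, since the target is a local service.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    results = {}
    for path in paths:
        url = base_url.rstrip("/") + path
        try:
            # Redirects are followed, so this is the status of the final page.
            with opener.open(url, timeout=timeout) as resp:
                results[path] = resp.status
        except urllib.error.HTTPError as err:
            results[path] = err.code  # server answered, but with an error status
        except (urllib.error.URLError, OSError):
            results[path] = None  # connection refused, timeout, DNS failure, ...
    return results
```

For example, `probe("http://localhost:8000", ["/", "/accounts/login/"])` would show whether only the root path is affected; the login path here is just an example to try, not something taken from the post. A 404 on every path suggests that something is listening on port 8000 but is not serving the app at those URLs, while `None` means nothing is reachable on that port at all.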

u/PretendsHesPissed — 6 days ago

Wow. Why has it taken me so long to discover Paperless

I’ve had a Synology NAS for 10 years or more and only recently discovered paperless as a solution for documents. Previously I stored everything in folders in iCloud. I’m currently moving everything over now to paperless and also scanning all the old paperwork I have in binders with a view to eliminating those physical copies.

Any top tips on how to best make this migration?

u/mountainmaestro23 — 9 days ago

I got tired of self-hosted PDF tools requiring Docker, servers, and maintenance

Every time I needed to process a PDF I had two options:

  1. Upload it to some random website and hope they don't store it forever
  2. Self-host something like Stirling-PDF which requires Docker, a server, ongoing maintenance, and still processes files server-side

Neither felt right for sensitive documents. So I built a third option.

Mini Tool: a PDF toolkit that runs 100% in your browser. No server. No Docker. No setup. No maintenance. Just open the URL and it works.

What it does:

- Compress, Merge, Split, Rotate PDFs

- Protect and Unlock PDFs (AES-256 encryption)

- Sign and Watermark PDFs

- Organize pages (drag and drop reorder)

- Batch process multiple files at once

- Workflow Builder (chain operations together)

- Images to PDF

- Smart Print Mode + Booklet Optimizer

The privacy angle that matters:

Every operation runs locally using pdf-lib and PDF.js in Web Workers. I opened DevTools and confirmed zero outgoing file requests during processing. Your files genuinely never leave your device.

For the self-hosted crowd specifically:

I know this community values owning your stack. The irony here is that "self-hosted" still means your files hit YOUR server. With browser-based processing the files never hit any server at all - not even one you control.

It's the most private PDF processing possible short of running offline desktop software.

What I'd love feedback on:

- Are there PDF operations missing that you regularly need a self-hosted solution for?

- Any edge cases with complex PDFs you'd want to test?

- Would an offline PWA version be useful to this community?

u/Cute_Ad2883 — 9 days ago

Solid flatbed scanner for only a few documents (that won't go through a document scanner)?

Hey there, I've got an Epson WorkForce scanner, but I'm looking for a solid but cheap (maybe second-hand) flatbed scanner for the handful of documents that would be eaten by my Epson. Any recommendations? Thanks!

u/risikorolf — 12 days ago

ADF Scanner that can handle wrinkled, torn papers (crumpled/scrunched up badly) and receipts?

Hi,

I would appreciate recommendations:

I need to digitize a large archive, and many papers are folded, creased, or (worst case) very badly wrinkled up (crumpled/scrunched up).

Some of them are irregular shape and size (torn pieces, notes).

Obviously, not all of them are like this, most papers are just A4 folded in half, or letters with a letter fold (2 creases etc.), but I have several boxes of this stuff to scan, and I want my job to be as easy, pain free, and fast as possible.

Also, long, old receipts.

What would be the best, most reliable scanner with a large auto-feeder to handle this mess without choking/jamming too much?

I haven't owned an ADF scanner before, just AIO flatbeds.

Thanks!

u/Infinite100p — 11 days ago

Brother ADS-4700W

Hi, looking for some troubleshooting help. Many people in the sub recommended this scanner, so I went ahead and bought it. Very nice quality and good speed, but I am having some trouble. I can't seem to get it to scan things in the correct rotation; the page is always upside down. Also, it always scans the last page first, so I have to go and reorder the pages in a PDF editor. Am I doing something wrong? I just want to use the scanner, preferably without a PC having to be on, and have it scan to the consume folder on my server. Any help would be appreciated. Thanks.

u/Ggsam3 — 12 days ago

Review my App iPDF local for PDF - Processing - TestFlight available

I got frustrated with the existing Paperless-ngx mobile workflows, so I built my own iOS app. The main goal is to run cross-system, end-to-end PDF editing workflows (Paperless-ngx, Stirling PDF, Nextcloud) on your iPhone, iPad and Mac.

My app “iPDF Local” now supports:
- direct import/export with Paperless-ngx
- local-first PDF workflows
- sharing PDFs directly from the iOS share sheet
- integration with self-hosted setups
- optional Stirling PDF support

The main goal was:
A native Apple-style experience for self-hosters without forcing cloud workflows.

Built with SwiftUI for iPhone/iPad/Mac.

I’m especially interested in feedback from heavy Paperless users:
- What mobile workflow annoys you most today?
- What’s still missing in existing apps?
- Bulk actions?
- Better scanning/import?
- Offline handling?

App Store:
https://apps.apple.com/de/app/ipdf-local/id6742412603

Would love honest feedback from this community.
PM me for TestFlight access.

u/Mountain-Marketing55 — 14 days ago