u/Blake_Olson


Built an invoice-scanning service for our accounting team in one afternoon with Claude — sharing the architecture in case it helps someone else

Our AR team was hand-keying ~25 invoices a week into a spreadsheet. I had Claude build us a Python service that watches a network folder, extracts invoice data from any PDF dropped in (vendor, dates, totals, line items, addresses), and appends a row to a shared Excel register. Total chat-to-deployed time: about half a day, including all the deploy headaches.

The architecture, for anyone who wants to replicate this:

  • Python service on our Windows file server, registered with NSSM. Auto-starts with the host.
  • watchdog library watches the SMB share for new PDFs. Use its PollingObserver there, since native file-change events are unreliable on network shares. Each new file goes through a pipeline.
  • Two-tier extraction: per-vendor regex templates first (free, instant, deterministic), then Azure AI Document Intelligence "prebuilt-invoice" model as a universal fallback. Azure handles OCR for scanned PDFs natively, so the same flow works whether AR drops a digital PDF or our MFP scans one from paper.
  • SQLite on the local disk is the source of truth. The shared .xlsx is a curated view that gets appended to on each batch. Delete the .xlsx and it'll repopulate fresh from the next batch — handy for resetting.
  • Failed extractions go to a Failed\ folder with a sibling .error.txt explaining why.
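
To make the two-tier idea above concrete, here is a minimal sketch of how the template-first, fallback-second flow could look. It is not the service's actual code: the "Acme Supply" template, the field names, and the `extract_invoice` / `extract_with_templates` helpers are all made up for illustration, and the Azure call is passed in as a plain callable so the sketch runs without the SDK. It also assumes the PDF text has already been pulled out upstream.

```python
import re
from typing import Callable, Optional

# Hypothetical per-vendor templates: vendor name -> field name -> regex
# with exactly one capture group. Real templates would be vendor-specific.
VENDOR_TEMPLATES: dict[str, dict[str, re.Pattern[str]]] = {
    "Acme Supply": {
        "invoice_number": re.compile(r"Invoice\s*#\s*(\S+)"),
        "total": re.compile(r"Total\s+Due[:\s]+\$?([\d,]+\.\d{2})"),
    },
}


def extract_with_templates(text: str) -> Optional[dict[str, str]]:
    """Tier 1: return fields only if a vendor matches AND every field is found."""
    lowered = text.lower()
    for vendor, patterns in VENDOR_TEMPLATES.items():
        if vendor.lower() not in lowered:
            continue
        fields = {"vendor": vendor}
        for name, pattern in patterns.items():
            match = pattern.search(text)
            if match is None:
                break  # partial match: do not trust this template
            fields[name] = match.group(1)
        else:
            return fields  # loop finished without break: all fields found
    return None


def extract_invoice(
    text: str, fallback: Callable[[], dict[str, str]]
) -> tuple[str, dict[str, str]]:
    """Try the free regex tier first; call the (paid) fallback only if it fails."""
    fields = extract_with_templates(text)
    if fields is not None:
        return "template", fields
    return "fallback", fallback()
```

Returning which tier produced the result makes it easy to log how often the paid fallback actually gets hit. Treating a partial template match as a miss is a design choice in this sketch: it trades a few extra fallback calls for never writing a half-filled row.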

Cost reality check: Azure DI free tier covers 500 pages/month. At our volume (~25 invoices/week, mostly 1-2 pages, so roughly 110-220 pages/month) that's well under the cap. Paid tier is roughly $0.01–$0.05 per page. Cheap enough that I don't think about it.
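
The back-of-the-envelope math behind that, as a quick sketch. It assumes the worst case of 2 pages per invoice and that every invoice goes to Azure (no template hits); the constant names are just for illustration.

```python
INVOICES_PER_WEEK = 25
PAGES_PER_INVOICE = 2        # worst case from the post ("mostly 1-2 pages")
WEEKS_PER_MONTH = 52 / 12    # average weeks in a month
FREE_TIER_PAGES = 500        # monthly free-tier cap quoted in the post

pages_per_month = INVOICES_PER_WEEK * PAGES_PER_INVOICE * WEEKS_PER_MONTH
headroom = FREE_TIER_PAGES - pages_per_month

print(f"{pages_per_month:.0f} pages/month, {headroom:.0f} pages of headroom")
# prints "217 pages/month, 283 pages of headroom"
```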

Gotchas I ran into so others don't have to:

  • Azure returns addresses as structured objects, not strings. If you naively str() them you get the raw Python dict repr in your spreadsheet. Format them manually from street_address / city / state / postal_code.
  • On Windows Server, PowerShell 7's Restart-Service can throw "Cannot open service" against NSSM-wrapped services for no good reason. Use nssm restart <name> instead.
  • Python 3.14 is so new that some package wheels aren't published for it yet. Stick with 3.12 for production.
  • Tracking "what's new this batch" doesn't need a persistent watermark in the DB. Just snapshot MAX(invoice_id) before and after the batch, and only project that range to the spreadsheet.
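
Two of the gotchas above (address formatting and the MAX(invoice_id) snapshot) can be sketched roughly like this. Everything here is an assumption for illustration: the address is treated as a plain dict with the four keys named above, and the `invoices` table, its columns, and the `open_db` / `run_batch` helpers are hypothetical, not the service's real schema. The snapshot trick also assumes a single writer and IDs that only ever increase.

```python
import sqlite3


def format_address(addr: dict) -> str:
    """Join the structured address parts into one line, skipping missing ones."""
    keys = ("street_address", "city", "state", "postal_code")
    return ", ".join(str(addr[k]) for k in keys if addr.get(k))


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open the database and create the (hypothetical) invoices table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS invoices ("
        "invoice_id INTEGER PRIMARY KEY AUTOINCREMENT, vendor TEXT, total TEXT)"
    )
    return conn


def max_invoice_id(conn: sqlite3.Connection) -> int:
    """Highest invoice_id so far, or 0 for an empty table."""
    row = conn.execute("SELECT COALESCE(MAX(invoice_id), 0) FROM invoices").fetchone()
    return row[0]


def run_batch(conn: sqlite3.Connection, rows: list[tuple[str, str]]) -> list[tuple]:
    """Insert a batch, then return only the rows this batch added."""
    before = max_invoice_id(conn)            # snapshot before the batch
    with conn:                               # commit on success, roll back on error
        conn.executemany("INSERT INTO invoices (vendor, total) VALUES (?, ?)", rows)
    after = max_invoice_id(conn)             # snapshot after the batch
    return conn.execute(
        "SELECT invoice_id, vendor, total FROM invoices "
        "WHERE invoice_id > ? AND invoice_id <= ? ORDER BY invoice_id",
        (before, after),
    ).fetchall()
```

The list that `run_batch` returns is what would get appended to the .xlsx. If more than one process ever writes to the table, the before/after range could pick up someone else's rows, which is where a real watermark or a batch_id column would start to earn its keep.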

Things I'd add if/when I have time: vendor templates for our top 5 recurring vendors (cuts Azure cost to zero for those), a daily canary PDF for monitoring, and a dedicated low-privilege service account in place of LocalSystem.

Happy to answer questions about any specific piece. The whole thing is ~1,500 lines of Python plus a deploy script.

u/Blake_Olson — 1 day ago