u/Frosty-Layer-7192

Hi 👋 I recently tried the QWEN3-VL-30B API to test reading text and extracting specific information from old typewritten documents, as a trial run before I download the model and use it locally.

When I used it on a paragraph-only document, it was very accurate. However, when I tried a document that mixes paragraphs and tables, it hallucinated and mixed up text from different rows, which produced wrong outputs. (I attached a sample page below.)

I am deciding between three options:
1) Should I move to another version that is not a VL model? (But I need multi-modal input for this project.)
2) Should I try harness engineering? (So far I have only used prompt-based approaches.) If so, what would be the best way?
3) Or should I move to a totally different model?

My constraints are:
a) I need a FREE model that I can download to my PC and run locally.
b) I need multi-modal input (image/PDF plus a text prompt).
c) I will buy a physical GPU, probably with 24GB of VRAM or a little more, but nothing super fancy.
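
To make constraint (c) more concrete, here is a rough back-of-envelope sketch of how much memory the weights alone of a 30B-parameter model would need at different precisions. This is only arithmetic (parameters × bits per weight), not a measurement: it assumes all 30B parameters have to be resident on the GPU, and it ignores the KV cache, image tokens, activations and runtime overhead, so real usage will be higher. The function name and the 24GB figure are just for illustration.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights only, in GB (1 GB = 1e9 bytes)."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


VRAM_GB = 24  # assumed GPU size from constraint (c)

for bits in (16, 8, 4):
    need = weight_memory_gb(30, bits)
    verdict = "weights fit" if need < VRAM_GB else "weights do not fit"
    print(f"{bits}-bit: ~{need:.0f} GB for weights -> {verdict} in {VRAM_GB} GB")
```

By this estimate, only something around 4-bit quantization would leave room on a 24GB card, and even then the remaining headroom has to cover everything the sketch ignores.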

Any insight would be much appreciated! Thanks!

-----------sample page--------

https://preview.redd.it/r7d2b3i89a5h1.png?width=1221&format=png&auto=webp&s=4fb2659f6375e690422d0304289fb61252e05003

Has anyone used QWEN3-VL for OCR and information extraction on old documents?