Best local multimodal LLM for 8GB VRAM?
Hi everyone, I’m looking for recommendations for a good local multimodal model for my project: an AI-based assistant for visually impaired users that helps them operate an air conditioner remote control. The model needs strong multimodal understanding because it must read, recognize, and analyze the buttons, labels, symbols, and layout of different AC remotes from camera input.

Right now I’m using Qwen 3.5 9B quantized to 4-bit with Unsloth, and the deployment target is an RTX 4060 with 8GB VRAM. The current model still struggles to correctly interpret the remote’s display state, especially small visual elements such as logos, icons, bars, mode symbols, and fan speed indicators.

I’m trying to find the best balance between multimodal accuracy and VRAM efficiency for local inference. If anyone has experience with lightweight VLMs or local multimodal setups for assistive technology projects, I’d really appreciate your recommendations for models, quantization strategies, or inference frameworks.
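
For context on the VRAM budget, here is a rough back-of-the-envelope sketch of how much memory the weights alone would take at different precisions. It assumes exactly 9 billion parameters and a flat bits-per-weight figure. Real 4-bit formats store extra scaling data, and the vision encoder, KV cache, and activations all need additional room, so treat the numbers as a lower bound. The function name is only illustrative.

```python
def weight_memory_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory needed for the model weights alone, in GiB.

    Assumption: every parameter is stored at the same bit width, with no
    overhead for quantization scales, the vision encoder, or the KV cache.
    """
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


VRAM_GIB = 8.0  # nominal capacity of the target GPU; usable memory is lower

for bits in (16, 8, 4):
    need = weight_memory_gib(9, bits)
    print(f"{bits:>2}-bit weights: ~{need:.1f} GiB, headroom ~{VRAM_GIB - need:.1f} GiB")
```

By this estimate only the 4-bit variant leaves any headroom on an 8GB card, and that headroom has to cover image tokens and context as well, which is why I’m also open to smaller models.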