u/Juliusicon

Hey, before I start I want to say that I am German and my English is sometimes pretty bad. I have read a lot about Air LLM. The idea is to stream the LLM layers from the SSD into the GPU instead of loading the whole model, and to use QuIP# 2-bit quantization to compress the model layers further, which in theory gives 3.4 tokens/s with 3 GB of VRAM and 4 GB of system RAM. But I am not a coder. I have developed the Air LLM idea further in theory, but I lack the skills to use Linux or to code beyond vibe coding and arguing with Claude about my idea versus its hallucinations, and the only GPU I own is an AMD RX 7900 XT. Sorry if this was convoluted. I just wanted to share my idea and ask for feedback and for ideas on how to speed up the SSD-to-GPU transfer, because the time it takes to move each layer into VRAM is the main speed loss.
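To make the streaming part concrete, here is a minimal Python sketch of one way to hide some of the transfer time: read the next layer from disk in the background while the current one is being used. It is only an illustration under assumptions, not how Air LLM itself is implemented. It assumes each layer is stored as its own file, and the names `load_layer`, `stream_layers` and `run_forward` are made up for this example. The actual upload into VRAM and the 2-bit decompression are not shown; `compute` is a placeholder for that work.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor
from itertools import islice
from pathlib import Path
from typing import Callable, Iterator, Sequence


def load_layer(path: Path) -> bytes:
    """Read one layer file from disk (stand-in for the SSD -> RAM step)."""
    return path.read_bytes()


def stream_layers(paths: Sequence[Path], prefetch: int = 1) -> Iterator[bytes]:
    """Yield layers in order, reading up to `prefetch` upcoming layers in the background."""
    if prefetch < 1:
        raise ValueError("prefetch must be at least 1")
    # A single worker thread performs the disk reads one after another.
    with ThreadPoolExecutor(max_workers=1) as pool:
        remaining = iter(paths)
        pending = deque()
        # Queue the first reads before any compute happens.
        for path in islice(remaining, prefetch):
            pending.append(pool.submit(load_layer, path))
        while pending:
            # Blocks only if the background read has not finished yet.
            layer = pending.popleft().result()
            upcoming = next(remaining, None)
            if upcoming is not None:
                # Queue the next read BEFORE handing the layer to the caller,
                # so disk I/O overlaps with the caller's work on this layer.
                pending.append(pool.submit(load_layer, upcoming))
            yield layer


def run_forward(paths: Sequence[Path], compute: Callable, state):
    """Apply `compute(state, layer)` layer by layer; `compute` is a placeholder."""
    for layer in stream_layers(paths):
        state = compute(state, layer)
    return state
```

With `prefetch=1`, at most two layers are held in system RAM at once: the one being used and the one being read. Whether this helps depends on whether the work per layer takes about as long as the read. If the SSD read is much slower than the compute, the overlap only hides a small part of the wait.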

Air LLM development