Do VLMs in production still use fixed-patch ViTs for their vision capabilities? [D]
For some time now, the research community has proposed vision tokenization schemes that seem more efficient and effective than fixed patches. Is there any hint as to whether non-fixed-patch tokenization is being used in the big players' models?
I imagine not, and I'm trying to work out why:
- marginal gains?
- pipelines needing a fixed number of tokens per image up front, for efficiency reasons (or because of even harder constraints)?
- scaling laws for input-adaptive patching are not well understood, so the big players don't bet on it?
Or am I simply wrong, and under the hood all the big players are already doing dynamic tokenization for vision?
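
To make the "fixed number of tokens up front" point concrete, here is a toy sketch. It is my own illustration and not how any production model tokenizes: `fixed_patch_token_count` and `quadtree_token_count` are hypothetical names, and the variance-based quadtree split is just one simple stand-in for input-adaptive patching. With fixed patches the token count is known from the image size alone; with the adaptive scheme it is only known after looking at the pixels.

```python
import numpy as np


def fixed_patch_token_count(height, width, patch=16):
    """Token count for a fixed-patch ViT: depends only on image size."""
    return (height // patch) * (width // patch)


def quadtree_token_count(img, min_patch=16, max_patch=64, var_threshold=1e-3):
    """Toy adaptive tokenizer (assumption, for illustration only).

    Start from max_patch x max_patch blocks and split a block into four
    quadrants while its pixel variance exceeds var_threshold, stopping at
    min_patch. Each leaf block counts as one token.
    Assumes a 2D grayscale image whose sides are multiples of max_patch.
    """

    def count(block):
        h, w = block.shape
        # Leaf: block is already at the smallest size, or is nearly uniform.
        if h <= min_patch or block.var() <= var_threshold:
            return 1
        hh, hw = h // 2, w // 2
        quadrants = (
            block[:hh, :hw], block[:hh, hw:],
            block[hh:, :hw], block[hh:, hw:],
        )
        return sum(count(q) for q in quadrants)

    height, width = img.shape
    total = 0
    for y in range(0, height, max_patch):
        for x in range(0, width, max_patch):
            total += count(img[y:y + max_patch, x:x + max_patch])
    return total


if __name__ == "__main__":
    flat = np.zeros((256, 256))                         # no detail anywhere
    noisy = np.random.default_rng(0).random((256, 256))  # detail everywhere

    print("fixed:", fixed_patch_token_count(256, 256))
    print("adaptive, flat image:", quadtree_token_count(flat))
    print("adaptive, noisy image:", quadtree_token_count(noisy))
```

In this sketch the two images have the same shape, so the fixed-patch count is identical for both, while the adaptive count differs between them. That content-dependent sequence length is the kind of thing I suspect complicates batching and token budgeting, though I don't know how much it matters in practice.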