u/One-Kraken

Gemma 4 on iOS: Anyone else stuck on CPU because of the "Buffer(31)" Metal crash?

Hey everyone,

I’m hitting a massive performance wall building an on-device AI app for the iPhone 17 Pro. I’m using MediaPipeTasksGenAI via CocoaPods to run Gemma 4 E2B, but the inference is incredibly slow.

Looking at the logs, it’s constantly defaulting to CPU fallback. The GPU initialization fails every time with this specific Metal compiler error:

'buffer' attribute parameter is out of bounds: must be between 0 and 30. device half4* src_tensor_buffer[[buffer(31)]]

It seems like Gemma 4’s graph binds more buffers than the standard MediaPipe GPU delegate can handle: the generated kernel asks for buffer index 31, but Metal only accepts indices 0 through 30, i.e. 31 buffers per function. It’s frustrating because the official Google AI Edge Gallery app is blazing fast on the same hardware.
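
In case it helps anyone triage the same thing, here is a rough Swift sketch for scanning shader source text for buffer attributes that go past the limit in the error above. It is plain string matching, not anything from MediaPipe or Metal. The helper names (`bufferIndices`, `outOfRangeBufferIndices`) are made up for illustration, and it assumes the attribute is written without spaces as `[[buffer(N)]]`, like in my log.

```swift
import Foundation

// Highest buffer index the Metal compiler accepted in the error above (0...30).
let maxBufferIndex = 30

/// Hypothetical helper: collects every N found in a `[[buffer(N)]]` attribute.
/// Assumes the attribute is written without internal spaces.
func bufferIndices(in source: String) -> [Int] {
    var indices: [Int] = []
    var rest = Substring(source)
    let marker = "[[buffer("
    while let range = rest.range(of: marker) {
        let afterMarker = rest[range.upperBound...]
        // Read the run of ASCII digits that follows the marker.
        let digits = afterMarker.prefix(while: { ("0"..."9").contains($0) })
        if let index = Int(digits) {
            indices.append(index)
        }
        rest = afterMarker
    }
    return indices
}

/// Hypothetical helper: keeps only the indices the compiler would reject.
func outOfRangeBufferIndices(in source: String) -> [Int] {
    return bufferIndices(in: source).filter { $0 > maxBufferIndex }
}

let failingLine = "device half4* src_tensor_buffer[[buffer(31)]]"
print(outOfRangeBufferIndices(in: failingLine)) // prints [31]
```

This only confirms which binding overflows. It does not fix anything, since the shader is generated by the delegate and not by app code.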

Has anyone else run into this? If so, how did you fix it?

•	Did you pivot to the LiteRT-LM path (Google's newer engine) despite the lack of Swift bindings?

•	Or did you jump ship to MLX-Swift for a more native Apple Silicon approach?

Would love to hear if anyone has successfully bypassed this "Buffer 31" ceiling!

How are you currently handling on-device Gemma 4 inference in your projects?
