u/BosonCollider

Linear transformers (basically removing the softmax from the attention mechanism and possibly replacing it with a relu on Q and K) are really nice for teaching transformers, because in the causal case you can rewrite them as an RNN with a fixed-size state. They made the idea of transformers as RNNs generalized with nonlinear attention "click" for me.
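To make the RNN rewrite concrete, here's a minimal NumPy sketch of what I mean. The function names, the relu feature map, the sum normalizer and the `eps` term are my own choices for illustration, not taken from any particular paper or library. It computes causal linear attention once as a masked matmul (the "transformer" view) and once as a loop over a running state (the "RNN" view):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def linear_attention_parallel(Q, K, V, eps=1e-6):
    """Causal linear attention as one masked matmul (no softmax)."""
    Qf, Kf = relu(Q), relu(K)
    scores = np.tril(Qf @ Kf.T)                     # (T, T), causal mask
    norm = scores.sum(axis=1, keepdims=True) + eps  # replaces the softmax denominator
    return (scores @ V) / norm

def linear_attention_recurrent(Q, K, V, eps=1e-6):
    """Same computation as an RNN with a fixed-size state (S, z)."""
    T, d_k = Q.shape
    d_v = V.shape[1]
    S = np.zeros((d_k, d_v))   # running sum of outer(k_s, v_s)
    z = np.zeros(d_k)          # running sum of k_s, used for normalization
    out = np.zeros((T, d_v))
    for t in range(T):
        q, k = relu(Q[t]), relu(K[t])
        S += np.outer(k, V[t])
        z += k
        out[t] = (q @ S) / (q @ z + eps)
    return out

rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 4))
K = rng.normal(size=(5, 4))
V = rng.normal(size=(5, 3))
same = np.allclose(linear_attention_parallel(Q, K, V),
                   linear_attention_recurrent(Q, K, V))
```

Both functions produce the same outputs up to floating-point error, but the recurrent one only ever holds a `(d_k, d_v)` matrix and a `d_k` vector as state, no matter how long the sequence gets. With softmax in the way you can't factor the scores like this, which is the part that made it click for me.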

I'm kind of wondering why more courses don't cover them before the real thing. If you are just calling FlashAttention from a framework, as you would in production, attention feels like a black box. But bottom-up courses that have people implement backpropagation themselves (manually or with autodiff) could benefit quite a bit from linear attention: you only really need to implement matrix multiplication and relu to get something that performs fairly well relative to the amount of effort put in, even when run on CPU.

Probably one reason is that they are relatively new, and that they were a research trend that didn't entirely pan out due to the success of FlashAttention?

Why isn't linear attention used more in ML teaching as a pedagogical step?