Efficient image window vectorization for CNN accelerator (systolic array design)
Hi everyone,
In CNN accelerators, we often use systolic arrays to speed up matrix multiplication and reduce overall computation latency. This approach works very well for convolution once the data is already in a vector/matrix form.
However, I think another major bottleneck is sliding the filter over the image and converting each local window into a vector before it is fed into the systolic array.
I would really like to hear your ideas and approaches for efficiently vectorizing image windows in hardware. Are there any optimized architectures or scheduling techniques you use to reduce this overhead?
In my current design:
- Input: 28×28 image
- Filters: 10 kernels of size 3×3
- Stride: 1, Padding: 1
Even with the systolic array accelerating multiplication, the full convolution still takes around 8000 clock cycles, and I suspect the window extraction / data feeding (im2col-like process) is a major contributor.
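To put that number in context, here is a rough back-of-the-envelope sketch of the problem size in Python. It assumes a single-channel input (only 28×28 is given above), and the function and variable names are purely illustrative.

```python
def conv_out_size(in_size, kernel, stride, padding):
    """Standard convolution output-size formula for one spatial dimension."""
    return (in_size + 2 * padding - kernel) // stride + 1

# Parameters from the design described above.
IN_SIZE, KERNEL, STRIDE, PADDING, NUM_FILTERS = 28, 3, 1, 1, 10
IN_CHANNELS = 1  # assumption: single-channel image

out_size = conv_out_size(IN_SIZE, KERNEL, STRIDE, PADDING)  # output height/width
num_windows = out_size * out_size                # rows of the im2col matrix
window_len = KERNEL * KERNEL * IN_CHANNELS       # columns of the im2col matrix
num_outputs = num_windows * NUM_FILTERS          # output values in total
num_macs = num_outputs * window_len              # multiply-accumulates in total

print(f"output feature map: {out_size}x{out_size}")
print(f"im2col matrix: {num_windows} x {window_len}")
print(f"outputs: {num_outputs}, MACs: {num_macs}")
```

Under these assumptions there are 784 windows of 9 values each, 7840 output values, and 70560 multiply-accumulates. The 7840 outputs are close to the ~8000 cycles mentioned above, which would be consistent with roughly one output value per cycle, but that is only a guess without knowing the array dimensions and how the data is fed.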
Has anyone worked on reducing this “windowing / im2col” overhead, or implemented more efficient streaming or line-buffer-based approaches?
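To make the line-buffer idea concrete, here is a minimal behavioral model in Python. It is a sketch, not RTL and not the design described above. It assumes a single-channel image streamed in raster order, zero padding inserted on the fly, stride 1, and a kernel size of at least 2. The names `stream_windows` and `im2col_reference` are hypothetical. The model keeps k-1 line buffers plus a k×k shift-register window, so each input pixel is read once and one window comes out per step once the first rows have filled.

```python
def stream_windows(image, k=3, pad=1):
    """Yield (out_row, out_col, flat_window), consuming one padded pixel per step.

    Assumes k >= 2 and stride 1. Each loop iteration models one clock cycle.
    """
    h, w = len(image), len(image[0])
    ph, pw = h + 2 * pad, w + 2 * pad
    line_bufs = [[0] * pw for _ in range(k - 1)]   # previous k-1 rows, oldest first
    window = [[0] * k for _ in range(k)]           # k x k shift registers
    for r in range(ph):
        for c in range(pw):
            inside = pad <= r < pad + h and pad <= c < pad + w
            pix = image[r - pad][c - pad] if inside else 0  # zero padding on the fly
            # Column of k vertically adjacent pixels at this horizontal position.
            column = [line_bufs[i][c] for i in range(k - 1)] + [pix]
            # Shift the line buffers up by one row at this column.
            for i in range(k - 2):
                line_bufs[i][c] = line_bufs[i + 1][c]
            line_bufs[k - 2][c] = pix
            # Shift the window left and insert the new column on the right.
            for i in range(k):
                window[i] = window[i][1:] + [column[i]]
            # The window is valid once k rows and k columns have been seen.
            if r >= k - 1 and c >= k - 1:
                yield r - (k - 1), c - (k - 1), [v for row in window for v in row]


def im2col_reference(image, k=3, pad=1):
    """Direct (non-streaming) window extraction, used only to check the model."""
    h, w = len(image), len(image[0])
    padded = [[0] * (w + 2 * pad) for _ in range(h + 2 * pad)]
    for r in range(h):
        for c in range(w):
            padded[r + pad][c + pad] = image[r][c]
    return [[padded[r + i][c + j] for i in range(k) for j in range(k)]
            for r in range(h + 2 * pad - k + 1)
            for c in range(w + 2 * pad - k + 1)]


# Usage: a 28x28 test image with distinct pixel values.
image = [[r * 28 + c for c in range(28)] for r in range(28)]
windows = [win for _, _, win in stream_windows(image)]
print(len(windows), windows == im2col_reference(image))
```

In this model the padded 30×30 stream takes 900 steps and yields all 784 windows, each a 9-element vector that could go straight to the array inputs without materializing an im2col matrix in memory. Whether that maps well onto a particular systolic array dataflow depends on details not given here.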
I’d really appreciate any thoughts or design strategies you can share.
Thanks!