Built a Real-Time FPGA Anomaly Detection System on ZCU104 Using MobileNet + GRU: Looking for Optimization Advice
My friend and I built a real-time, hardware-accelerated video anomaly detection system on an FPGA, using a hybrid MobileNet + GRU architecture deployed on a Xilinx ZCU104 board (Zynq UltraScale+ MPSoC).
The pipeline works like this:
- MobileNet is used for spatial feature extraction from 224×224 video frames.
- A GRU processes the temporal sequence information for anomaly detection.
- The accelerator was implemented on the FPGA fabric, while the quad-core ARM processor on the Zynq handled camera integration and system-level control.
- We later integrated a 30 FPS camera feed to demonstrate real-time inference.
For testing, since the GRU was trained only on hockey-fight anomaly datasets, we pointed the camera toward a laptop playing YouTube hockey-fight videos to validate the detection pipeline in real time.
Current performance:
- Input resolution: 224×224
- Inference latency: ~620 ms per frame
- Platform: ZCU104 / PYNQ framework
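As a rough sanity check on those numbers, here is the frame-budget arithmetic for a 30 FPS camera and ~620 ms per inference. It assumes only one frame is in flight at a time (no pipelining), which is an assumption about the design and not something measured separately; the function name is just for illustration.

```python
def frame_budget(latency_ms, camera_fps):
    """Compare per-frame inference latency with the camera's frame period."""
    frame_period_ms = 1000.0 / camera_fps              # time between camera frames
    throughput_fps = 1000.0 / latency_ms               # inferences per second, one frame in flight
    frames_per_inference = latency_ms / frame_period_ms  # camera frames arriving per inference
    return frame_period_ms, throughput_fps, frames_per_inference


period_ms, fps, ratio = frame_budget(latency_ms=620.0, camera_fps=30.0)
print(f"frame period: {period_ms:.1f} ms")
print(f"inference throughput: {fps:.2f} FPS")
print(f"camera frames per inference: {ratio:.1f}")
```

Under that assumption the accelerator finishes about 1.6 inferences per second, so roughly 18 or 19 camera frames arrive while one frame is being processed.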
One optimization we already implemented was to use AXI CDMA (memory-mapped to memory-mapped DMA) instead of a stream-based AXI DMA. The goal was to avoid unnecessary data movement through BRAM/URAM and to simplify memory transfers between the PS and PL.
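For anyone unfamiliar with the CDMA approach, a single simple-mode transfer looks roughly like the sketch below. This is not our actual driver code. The register offsets and status bits are written from memory of the AXI CDMA register map and should be checked against the AMD product guide before use. `regs` is assumed to be any object with `read(offset)` and `write(offset, value)` methods (on PYNQ, the overlay's IP object for the CDMA exposes these), the addresses are physical addresses (for example the `physical_address` of a buffer from `pynq.allocate`), and `FakeCdmaRegs` is a hypothetical stand-in so the sketch can be dry-run off the board.

```python
# Register offsets for AXI CDMA in simple (non scatter-gather) mode.
# Assumption: taken from memory of the register map, verify before use.
CDMASR = 0x04                 # status register
SA, SA_MSB = 0x18, 0x1C       # source address, low and high words
DA, DA_MSB = 0x20, 0x24       # destination address, low and high words
BTT = 0x28                    # bytes to transfer

SR_IDLE = 1 << 1
SR_ERRORS = (1 << 4) | (1 << 5) | (1 << 6)  # internal, slave, decode errors


def cdma_copy(regs, src_addr, dst_addr, nbytes, max_polls=1_000_000):
    """Copy nbytes between two physical addresses and poll until the engine is idle."""
    if not regs.read(CDMASR) & SR_IDLE:
        raise RuntimeError("CDMA is busy")
    regs.write(SA, src_addr & 0xFFFFFFFF)
    regs.write(SA_MSB, src_addr >> 32)
    regs.write(DA, dst_addr & 0xFFFFFFFF)
    regs.write(DA_MSB, dst_addr >> 32)
    regs.write(BTT, nbytes)  # writing BTT starts the transfer
    for _ in range(max_polls):
        status = regs.read(CDMASR)
        if status & SR_ERRORS:
            raise RuntimeError(f"CDMA error, status=0x{status:08x}")
        if status & SR_IDLE:
            return
    raise TimeoutError("CDMA transfer did not finish")


class FakeCdmaRegs:
    """Hypothetical stand-in for off-board dry runs: every transfer completes instantly."""

    def __init__(self):
        self.values = {CDMASR: SR_IDLE}

    def read(self, offset):
        return self.values.get(offset, 0)

    def write(self, offset, value):
        self.values[offset] = value
```

One thing worth noting about this style: polling the status register from Python adds per-transfer overhead, so many small transfers can cost more than the copies themselves.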
I’d really appreciate feedback from the FPGA/embedded AI community on:
- Whether this is considered a solid FPGA project for research/industry portfolios.
- Suggestions to improve inference latency on the PYNQ/Zynq platform.
- Whether moving more preprocessing into PL would help significantly.
- Ideas like quantization, pruning, pipelining, double-buffering, AXI-Stream architectures, or using DPU/Vitis AI instead of custom logic.
- Whether the MobileNet+GRU architecture is a good fit for FPGA deployment or if there are better temporal models for low-latency anomaly detection.
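To make the double-buffering idea concrete, this is the kind of software-side structure I have in mind: a producer thread prepares frame N+1 while the main thread runs inference on frame N, with a queue of depth 2 acting like ping-pong buffers. It is a minimal sketch, not what we currently run. `preprocess` and `infer` are placeholder callables, and the overlap only pays off if waiting on the accelerator does not keep the Python interpreter busy (for example, a busy-poll loop in Python would defeat it).

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the frame stream


def run_double_buffered(frames, preprocess, infer, depth=2):
    """Overlap preprocess(frame N+1) with infer(frame N) using a bounded queue."""
    ready = queue.Queue(maxsize=depth)  # depth=2 behaves like ping-pong buffers
    results = []

    def producer():
        try:
            for frame in frames:
                ready.put(preprocess(frame))  # blocks while both buffers are full
        finally:
            ready.put(_DONE)  # always signal completion so the consumer cannot hang

    worker = threading.Thread(target=producer, daemon=True)
    worker.start()
    while True:
        item = ready.get()
        if item is _DONE:
            break
        results.append(infer(item))
    worker.join()
    return results


# Toy usage with stand-in functions in place of real preprocessing and inference.
print(run_double_buffered(range(5), lambda f: f * 2, lambda x: x + 1))
```

Results come back in frame order because there is a single producer and a single consumer.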
I’m especially interested in opinions from people who have worked with:
- AMD Zynq platforms
- Xilinx ZCU104
- PYNQ
- FPGA-based CNN acceleration
- Video analytics pipelines
- AXI DMA/CDMA optimization
Does ~620 ms latency sound reasonable for a first custom implementation, or is there likely a major bottleneck in the architecture/design flow that we should investigate?
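To narrow down where the 620 ms goes, one option is to time each stage on the PS side separately. The helper below is a minimal sketch of that idea. The stage names and the `time.sleep` calls are placeholders standing in for the real camera read and accelerator call, not measurements from our system.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StageTimer:
    """Accumulate wall-clock time per named pipeline stage."""

    def __init__(self):
        self.totals = defaultdict(float)
        self.counts = defaultdict(int)

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.totals[name] += time.perf_counter() - start
            self.counts[name] += 1

    def report(self):
        """Return (stage, mean milliseconds) pairs, slowest stage first."""
        means = {n: 1000.0 * self.totals[n] / self.counts[n] for n in self.totals}
        return sorted(means.items(), key=lambda kv: kv[1], reverse=True)


timer = StageTimer()
for _ in range(3):
    with timer.stage("capture"):
        time.sleep(0.002)   # placeholder for the camera read
    with timer.stage("inference"):
        time.sleep(0.03)    # placeholder for the accelerator call
for name, mean_ms in timer.report():
    print(f"{name}: {mean_ms:.1f} ms")
```

Wrapping capture, resize/normalize, buffer copies, the accelerator call, and the GRU step in separate `stage` blocks should show whether the time is dominated by the PL or by Python-side overhead.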
GitHub (other projects): CraftedByDavid GitHub
LinkedIn: David Paul LinkedIn