u/Haza_rd

▲ 26 r/FPGA

Built a Real-Time FPGA Anomaly Detection System on ZCU104 Using MobileNet + GRU — Looking for Optimization Advice

My friend and I built a real-time hardware anomaly detection system on an FPGA using a hybrid MobileNet + GRU architecture deployed on a Xilinx Zynq UltraScale+ ZCU104 platform.

The pipeline works like this:

  • MobileNet is used for spatial feature extraction from 224×224 video frames.
  • A GRU processes the temporal sequence information for anomaly detection.
  • The accelerator runs in the FPGA fabric (PL), while the quad-core ARM processor on the Zynq (PS) handles camera integration and system-level control.
  • We later integrated a 30 FPS camera feed to demonstrate real-time inference.

For testing, since the GRU was trained only on hockey-fight anomaly datasets, we pointed the camera toward a laptop playing YouTube hockey-fight videos to validate the detection pipeline in real time.

Current performance:

  • Input resolution: 224×224
  • Inference latency: ~620 ms per frame
  • Platform: ZCU104 / PYNQ framework
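
To put that latency next to the 30 FPS camera rate, here is a quick back-of-the-envelope check. It is only a sketch: it assumes frames are processed strictly one at a time with no overlap between stages, and the function name `realtime_gap` is just something I made up for illustration.

```python
def realtime_gap(latency_ms: float, camera_fps: float) -> dict:
    """Compare a per-frame inference latency against the camera frame period."""
    frame_budget_ms = 1000.0 / camera_fps          # time available per frame
    throughput_fps = 1000.0 / latency_ms           # frames we can actually finish per second
    speedup_needed = latency_ms / frame_budget_ms  # factor required to keep up with every frame
    return {
        "frame_budget_ms": frame_budget_ms,
        "throughput_fps": throughput_fps,
        "speedup_needed": speedup_needed,
    }


if __name__ == "__main__":
    # Numbers from the post: ~620 ms per frame, 30 FPS camera.
    for key, value in realtime_gap(620.0, 30.0).items():
        print(f"{key}: {value:.2f}")
```

With the numbers above, the budget is about 33.3 ms per frame, the sequential throughput is about 1.6 frames per second, and keeping up with every camera frame would need roughly an 18.6x reduction in latency (or an equivalent gain from pipelining).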

One optimization we have already made is using an AXI CDMA (memory-mapped DMA) instead of a stream-based DMA, to reduce unnecessary BRAM/URAM data-movement overhead and simplify memory transfers between the PS and PL.

I’d really appreciate feedback from the FPGA/embedded AI community on:

  1. Whether this is considered a solid FPGA project for research/industry portfolios.
  2. Suggestions to improve inference latency on the PYNQ/Zynq platform.
  3. Whether moving more preprocessing into PL would help significantly.
  4. Ideas like quantization, pruning, pipelining, double-buffering, AXI-Stream architectures, or using DPU/Vitis AI instead of custom logic.
  5. Whether the MobileNet+GRU architecture is a good fit for FPGA deployment or if there are better temporal models for low-latency anomaly detection.
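
To make the double-buffering idea from the list concrete, here is a minimal host-side sketch of what I have in mind. It is not our actual code: `preprocess` and `run_accelerator` are hypothetical stand-ins that just sleep, and the two-slot queue plays the role of ping-pong buffers, so that preparing frame N+1 on the ARM cores overlaps with the accelerator working on frame N.

```python
import queue
import threading
import time


def preprocess(frame_id):
    # Hypothetical stand-in for capture + resize + normalize on the PS.
    time.sleep(0.01)
    return frame_id


def run_accelerator(frame_id):
    # Hypothetical stand-in for the PS-to-PL transfer plus inference in the PL.
    time.sleep(0.02)
    return f"result-{frame_id}"


def pipelined(num_frames, depth=2):
    """Overlap preprocessing with inference using `depth` in-flight buffers."""
    buffers = queue.Queue(maxsize=depth)  # depth=2 behaves like ping-pong buffers
    results = []

    def producer():
        for i in range(num_frames):
            buffers.put(preprocess(i))    # blocks while all buffers are full
        buffers.put(None)                 # sentinel: no more frames

    worker = threading.Thread(target=producer)
    worker.start()
    while True:
        item = buffers.get()
        if item is None:
            break
        results.append(run_accelerator(item))
    worker.join()
    return results


if __name__ == "__main__":
    print(pipelined(5))
```

In this toy version the overlap works because `time.sleep` releases the GIL. On real hardware it would only help if the accelerator call does not hold the Python interpreter while it waits, which is an assumption I have not verified for our setup.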

I’m especially interested in opinions from people who have worked with:

  • AMD Zynq platforms
  • Xilinx ZCU104
  • PYNQ
  • FPGA-based CNN acceleration
  • Video analytics pipelines
  • AXI DMA/CDMA optimization

Does ~620 ms latency sound reasonable for a first custom implementation, or is there likely a major bottleneck in the architecture/design flow that we should investigate?
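
In case it helps frame answers about where the bottleneck might be, this is the kind of per-stage timing harness I could use to break the 620 ms down. It is a sketch with assumptions: the `StageTimer` class and the stage names (`preprocess`, `ps_to_pl_copy`, `accelerator`) are hypothetical, and the sleeps are placeholders for the real calls.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StageTimer:
    """Collect wall-clock samples per named pipeline stage."""

    def __init__(self):
        self.samples = defaultdict(list)  # stage name -> list of durations in ms

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000.0
            self.samples[name].append(elapsed_ms)

    def report(self):
        """Return (stage, mean ms) pairs, slowest stage first."""
        means = {name: sum(v) / len(v) for name, v in self.samples.items()}
        return sorted(means.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    timer = StageTimer()
    for _ in range(3):
        with timer.stage("preprocess"):      # placeholder for resize/normalize
            time.sleep(0.002)
        with timer.stage("ps_to_pl_copy"):   # placeholder for the CDMA transfer
            time.sleep(0.001)
        with timer.stage("accelerator"):     # placeholder for MobileNet + GRU in PL
            time.sleep(0.005)
    for name, mean_ms in timer.report():
        print(f"{name}: {mean_ms:.2f} ms")
```

Wrapping each real stage this way should show whether the time goes to the PL compute itself, to the PS/PL transfers, or to Python-side preprocessing.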

GitHub (other projects): CraftedByDavid GitHub
LinkedIn: David Paul LinkedIn

u/Haza_rd — 3 days ago