Is multi-camera person tracking + re-identification actually feasible today? How close are we to “movie-style” systems?
I come from an NLP background and only recently started digging into computer vision, so I may be missing some context here.
I’m trying to understand how realistic multi-camera person tracking systems are in practice: the kind where a person is consistently identified and followed across different cameras (like surveillance systems, or what we see in movies).
From my current understanding, such a system would typically involve:
- Person detection (YOLO / RT-DETR etc.)
- Multi-object tracking within each camera (ByteTrack / DeepSORT / BoT-SORT)
- Cross-camera re-identification using appearance embeddings (OSNet, e.g. via the Torchreid library, or ViT-based models)
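To make the third step concrete, here is roughly how I picture the cross-camera matching. This is only a minimal sketch of my own understanding, not how any particular library does it: the function names (`mean_embedding`, `match_tracks`), the 0.7 similarity threshold, and the toy 3-dimensional vectors are all made up for illustration. In a real pipeline the embeddings would come from a re-ID model and have hundreds of dimensions, and I assume production systems use something stronger than greedy matching (e.g. Hungarian assignment plus time and camera-topology constraints).

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def mean_embedding(embeddings):
    """Average the per-frame embeddings of one track into a single vector."""
    n = len(embeddings)
    return [sum(column) / n for column in zip(*embeddings)]


def match_tracks(tracks_a, tracks_b, threshold=0.7):
    """Greedy one-to-one matching of tracks between two cameras.

    tracks_a / tracks_b map a local track id to one embedding per track.
    The threshold is an arbitrary placeholder, not a recommended value.
    """
    candidates = []
    for id_a, emb_a in tracks_a.items():
        for id_b, emb_b in tracks_b.items():
            sim = cosine_similarity(emb_a, emb_b)
            if sim >= threshold:
                candidates.append((sim, id_a, id_b))

    # Take the most similar pairs first; each track is used at most once.
    candidates.sort(reverse=True)
    matches = {}
    used_b = set()
    for sim, id_a, id_b in candidates:
        if id_a in matches or id_b in used_b:
            continue
        matches[id_a] = id_b
        used_b.add(id_b)
    return matches


# Toy data: two tracks seen by camera A, three tracks seen by camera B.
cam_a = {
    "A1": mean_embedding([[0.9, 0.1, 0.0], [1.0, 0.0, 0.1]]),
    "A2": mean_embedding([[0.0, 1.0, 0.1], [0.1, 0.9, 0.0]]),
}
cam_b = {
    "B7": [0.05, 1.0, 0.0],
    "B8": [1.0, 0.05, 0.05],
    "B9": [0.0, 0.0, 1.0],
}

matches = match_tracks(cam_a, cam_b)
print(matches)
```

In this toy case A1 pairs with B8 and A2 with B7, while B9 stays unmatched because nothing in camera A looks like it. My worry is exactly what happens when the vectors are not this cleanly separated.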
My questions are:
- How mature is this field today in real-world deployments?
- Is consistent identity tracking across multiple non-overlapping cameras actually reliable, or still very brittle?
- What are the main failure points in practice (lighting, clothing similarity, occlusion, etc.)?
- Are there any solid open-source end-to-end systems worth studying?
- At what point does this stop being a “CV engineering problem” and become an open research problem again?
I’m not expecting movie-level perfect tracking. I just want to understand how close we are to a robust real-world system and what the real limitations are today.