Distributed Intelligence: Smart Surveillance System

Unsupervised Person Re-Identification in Multi-Camera Tracking

Demo: Surveillance View

Demo: Personal Summarization


What is the task?

In this thesis, we address the problem of Multi-Camera Tracking (MCT), which aims to track multiple targets across a camera network; we focus on pedestrians as our primary targets. Multi-Camera Tracking is the critical underlying technology for building large-scale intelligent surveillance systems. Building such complex systems requires solving some of the hardest tasks in computer vision, including detection, single-camera tracking (also known as visual object tracking or multi-object tracking), inter-camera tracking, and re-identification.

What is the motivation?

As a large-scale intelligent surveillance system typically connects thousands of cameras, a tremendous amount of video data is collected every second. Such a volume of data could overwhelm even the most capable machines if not processed efficiently, so it is critical to develop efficient ways to extract useful information from the videos. A typical approach for surveillance systems is to stream all videos to a central server and perform all analysis on that server. Yet such an approach not only puts extensive stress on the central server but also requires expensive bandwidth to transmit high-resolution videos from the edge to the server. We therefore aim to develop a distributed solution for processing the videos, in contrast to the traditional centralized approach.



  • Unsupervised Re-Identification: No human annotation required for training CNN models in unseen environments
  • Distributed Multi-camera Tracking: No central server required and reduction in transmission bandwidth
  • Real-time performance: Achieve real-time Multi-Camera Tracking on aggregators without discrete GPUs

Context Mining for Unsupervised Person Re-Identification

We propose a novel unsupervised Person Re-Identification method for learning a CNN on unseen environments without the need for human annotation. We observe that the spatial and temporal information captured along with pedestrian images can serve as supervisory signals for adapting the CNN to new environments. Based on this free context information, we introduce three data mining techniques to extract training data: Cross-Track Mining (CTM), Cross-Camera Mining (CCM), and Cross-Domain Mining (CDM). We name our method C3M.

Cross-Track Mining

CTM is based on: (1) A person moves continuously in time (positive pairs). (2) A person cannot appear at two places at the same time (negative pairs).
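These two rules can be sketched as a simple pair-mining routine. The `Track` fields and function names below are illustrative assumptions, not the thesis' actual implementation:

```python
from dataclasses import dataclass
from itertools import combinations

@dataclass
class Track:
    track_id: int
    camera: int
    start: float  # first timestamp the person is seen
    end: float    # last timestamp the person is seen

def ctm_pairs(tracks):
    """Mine CTM training pairs from single-camera tracks.

    Positives: detections within the same track (a person moves
    continuously in time, so one track covers one identity).
    Negatives: two tracks that overlap in time within the same camera
    (one person cannot be at two places at once).
    """
    positives = [(t.track_id, t.track_id) for t in tracks]  # within-track pairs
    negatives = [
        (a.track_id, b.track_id)
        for a, b in combinations(tracks, 2)
        if a.camera == b.camera and a.start < b.end and b.start < a.end
    ]
    return positives, negatives
```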

Cross-Camera Mining

CCM is based on: (1) Track pairs that are best buddies in feature space are likely to be the same person (positive pairs). (2) Track pairs rejected by the spatial-temporal model are unlikely to share the same identity (negative pairs).
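A minimal sketch of the best-buddy idea, assuming L2-normalised track features and a precomputed spatial-temporal compatibility matrix (the names and signature below are assumptions):

```python
import numpy as np

def ccm_pairs(feats_a, feats_b, st_ok):
    """Mine CCM pairs between two cameras.

    feats_a: (m, d) and feats_b: (n, d) L2-normalised track features.
    st_ok: boolean (m, n) matrix; False where the spatial-temporal
    model rules a pair out.

    Positives: mutual nearest neighbours ("best buddy" pairs) that the
    spatial-temporal model also permits.
    Negatives: pairs the spatial-temporal model rejects.
    """
    sim = feats_a @ feats_b.T  # cosine similarity on unit vectors
    nn_a = sim.argmax(axis=1)  # best buddy candidate for each track in A
    nn_b = sim.argmax(axis=0)  # best buddy candidate for each track in B
    positives = [(i, int(nn_a[i])) for i in range(len(feats_a))
                 if nn_b[nn_a[i]] == i and st_ok[i, nn_a[i]]]
    negatives = [(i, j) for i in range(len(feats_a))
                 for j in range(len(feats_b)) if not st_ok[i, j]]
    return positives, negatives
```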

Cross-Domain Mining

CDM is based on: (1) Images with the same ID in the source domain, or in the same track in the target domain, share the same identity (positive pairs). (2) Every identity in the source domain is disjoint from the identities in the target domain (negative pairs).
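In its minimal form, this reduces to pairing rules over image indices; the layout below (source images first, then target images) is an illustrative assumption:

```python
from itertools import combinations

def cdm_pairs(source_ids, target_tracks):
    """Mine CDM pairs across domains (a minimal illustrative form).

    source_ids: person ID for each source-domain image.
    target_tracks: track ID for each target-domain image.
    Image indices: source images come first, then target images.

    Positives: same source ID, or same target track.
    Negatives: every (source image, target image) pair, since the two
    domains contain disjoint identity sets.
    """
    n_src = len(source_ids)
    positives = [(i, j) for i, j in combinations(range(n_src), 2)
                 if source_ids[i] == source_ids[j]]
    positives += [(n_src + i, n_src + j)
                  for i, j in combinations(range(len(target_tracks)), 2)
                  if target_tracks[i] == target_tracks[j]]
    negatives = [(i, n_src + j) for i in range(n_src)
                 for j in range(len(target_tracks))]
    return positives, negatives
```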

Distributed Multi-Camera Tracking System

We introduce a general yet efficient Multi-Camera Tracking system for edge devices, focusing on improving the inter-camera tracking part of the system. Our approach extracts key information, such as visual features and spatial-temporal information, from the captured videos and decides whether to associate two tracks directly on the edge devices. Compared to the traditional centralized approach, where all data are streamed to a cloud server for processing, we exchange only the extracted information, saving precious bandwidth and reducing the stress on the central server. Moreover, our system can employ any kind of Re-ID algorithm to enhance MCT performance, allowing us to deploy more advanced Re-ID algorithms in the future.


A person appears in camera A. Extract its feature.


Broadcast its feature to adjacent cameras’ buffers once it leaves camera A.


Once a person appears in camera C: if a match is found in the buffer, associate it with the buffered person; if not, treat it as a new person.
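The three steps above can be sketched as follows; the class, method names, and the similarity threshold are assumptions for illustration, not the system's actual API:

```python
import itertools
import numpy as np

_id_counter = itertools.count()  # source of fresh global person IDs

class EdgeCamera:
    """Per-camera logic of the distributed association protocol."""

    def __init__(self, cam_id, match_threshold=0.6):
        self.cam_id = cam_id
        self.neighbours = []               # adjacent EdgeCamera objects
        self.buffer = {}                   # global person ID -> feature vector
        self.threshold = match_threshold   # assumed similarity cutoff

    def on_person_appeared(self, feature):
        # Steps 1 and 3: the feature is extracted upstream; match it
        # against features broadcast by adjacent cameras.
        best_id, best_sim = None, self.threshold
        for gid, buffered in self.buffer.items():
            sim = float(np.dot(feature, buffered))  # cosine on unit vectors
            if sim > best_sim:
                best_id, best_sim = gid, sim
        if best_id is not None:
            del self.buffer[best_id]  # associated: reuse the global ID
            return best_id
        return next(_id_counter)      # no match: a brand-new person

    def on_person_left(self, global_id, feature):
        # Step 2: broadcast only the feature to adjacent cameras' buffers;
        # no raw video ever leaves the edge device.
        for cam in self.neighbours:
            cam.buffer[global_id] = feature
```

Only the compact feature vector crosses the network, which is what saves bandwidth relative to streaming full-resolution video to a central server.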


Context Mining for Unsupervised Person Re-Identification

Distributed Multi-Camera Tracking System

All results are reported on DukeMTMC. ClustF1, ClustP, and ClustR are our proposed evaluation measures for inter-camera tracking performance. See below for more details.

Method ClustF1 ClustP ClustR IDF1 IDP IDR
Direct Transfer 0.103 0.956 0.054 0.522 0.523 0.521
C3M Visual 0.742 0.918 0.623 0.816 0.817 0.815
C3M Visualr 0.798 0.909 0.710 0.858 0.859 0.857

Please also refer to the Demo videos at the top of this page.

Advantage of Using Clust F-measure (vs. ID F-measure)

Each dot of the same color is produced by selecting a different feature distance threshold for inter-camera tracking (also known as T-MCT), given the same detection and single-camera tracking results. In other words, the curve connecting dots of the same color should characterize the precision-recall behavior of track association. The comparison shows that the proposed ClustP-ClustR plot can visualize inter-camera tracking precision and recall while the IDP-IDR plot cannot. Such an evaluation is useful when selecting system parameters, such as feature distance thresholds, for different kinds of system requirements.
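For intuition, cluster-level precision and recall over track pairs can be computed as below. This is a generic pairwise formulation used for illustration; it may differ in detail from the exact ClustP/ClustR definition in the referenced paper:

```python
from itertools import combinations

def pairwise_clust_pr(pred_clusters, gt_identity):
    """Pairwise clustering precision/recall over tracks.

    pred_clusters: list of lists of track IDs produced by inter-camera
    tracking at one feature distance threshold.
    gt_identity: dict mapping each track ID to its true person ID.
    """
    pred_pairs = {p for c in pred_clusters
                  for p in combinations(sorted(c), 2)}
    tracks = sorted(gt_identity)
    gt_pairs = {(a, b) for a, b in combinations(tracks, 2)
                if gt_identity[a] == gt_identity[b]}
    tp = len(pred_pairs & gt_pairs)  # correctly associated track pairs
    precision = tp / len(pred_pairs) if pred_pairs else 1.0
    recall = tp / len(gt_pairs) if gt_pairs else 1.0
    return precision, recall
```

Sweeping the feature distance threshold and plotting one (recall, precision) point per setting yields a curve of the kind described above.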


  • C.-W. Wu, M.-T. Zhong, Y. Tsao, S.-W. Yang, Y.-K. Chen, and S.-Y. Chien, “Track-clustering error evaluation for track-based multi-camera tracking system employing human re-identification,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2017, pp. 1416–1424.
  • E. Ristani, F. Solera, R. Zou, R. Cucchiara, and C. Tomasi, “Performance measures and a data set for multi-target, multi-camera tracking,” in European Conference on Computer Vision Workshop on Benchmarking Multi-Target Tracking, 2016.
  • L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, and Q. Tian, “Scalable person re-identification: A benchmark,” in Proceedings of IEEE International Conference on Computer Vision (ICCV), 2015, pp. 1116–1124.
  • Z. Zheng, L. Zheng, and Y. Yang, “Unlabeled samples generated by GAN improve the person re-identification baseline in vitro,” in Proceedings of IEEE International Conference on Computer Vision (ICCV), 2017.