In this thesis, we address the problem of Multi-Camera Tracking (MCT), which aims to track multiple targets across a camera network. Specifically, we focus on pedestrians as our primary target. Multi-Camera Tracking is the critical underlying technology for building large-scale intelligent surveillance systems. Building such complex systems requires solving some of the hardest tasks in computer vision, including detection, single-camera tracking (also known as visual object tracking or multi-object tracking), inter-camera tracking, and re-identification.
As a large-scale intelligent surveillance system typically connects thousands of cameras, a tremendous amount of video data is collected every second. Such a volume of data can overwhelm even the most capable machines if not processed efficiently, so it is critical to develop efficient ways to extract useful information from the videos. A typical approach is to stream all videos to a central server and perform the analysis there. However, this not only puts extensive stress on the central server but also requires expensive bandwidth to transmit high-resolution videos from the edge to the server. We therefore develop a distributed solution that processes the videos at the edge, in contrast to the traditional centralized approach.
We propose a novel unsupervised Person Re-Identification method for training a CNN on unseen environments without human annotation. We observe that the spatial and temporal information captured along with pedestrian images can serve as supervisory signals for adapting the CNN to new environments. Based on this free contextual information, we introduce three data mining techniques to extract training data: Cross-Track Mining (CTM), Cross-Camera Mining (CCM), and Cross-Domain Mining (CDM). We name our method C3M. The three rules are summarized below, followed by a code sketch.
CTM is based on: (1) a person moves continuously over time (positive pairs); (2) a person cannot appear at two places at the same time (negative pairs).
CCM is based on: (1) tracks that form a best-buddy pair (mutual nearest neighbors) in feature space are likely to be the same person (positive pairs); (2) tracks ruled out by the spatial-temporal model are unlikely to share the same identity (negative pairs).
CDM is based on: (1) images with the same ID in the source domain, or from the same track in the target domain, share the same identity (positive pairs); (2) every identity in the source domain is disjoint from the identities in the target domain (negative pairs).
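The sketch below illustrates the three rules in code. It is a minimal sketch under assumed data structures: the `Track` class, the cosine-similarity matching, and the `st_feasible` spatial-temporal test are stand-ins for illustration, not the exact implementation from the thesis.

```python
# Illustrative mining rules of C3M over assumed Track objects.
from dataclasses import dataclass
from itertools import combinations
import numpy as np

@dataclass
class Track:
    tid: int              # track id
    camera: int           # camera id
    t_start: float        # first observation time
    t_end: float          # last observation time
    images: list          # indices of cropped pedestrian images
    feat: np.ndarray      # mean L2-normalized appearance feature

def mine_ctm(tracks):
    """Cross-Track Mining within one camera."""
    # Positives: detections of the same track (continuous motion over time).
    pos = [p for t in tracks for p in combinations(t.images, 2)]
    # Negatives: tracks whose time spans overlap (one place at a time).
    neg = [(a.tid, b.tid) for a, b in combinations(tracks, 2)
           if a.t_start < b.t_end and b.t_start < a.t_end]
    return pos, neg

def mine_ccm(tracks_a, tracks_b, st_feasible):
    """Cross-Camera Mining between two cameras."""
    sim = np.array([[float(a.feat @ b.feat) for b in tracks_b]
                    for a in tracks_a])
    # Positives: best-buddy pairs, i.e. mutual nearest neighbors in feature space.
    pos = []
    for i, a in enumerate(tracks_a):
        j = int(sim[i].argmax())
        if int(sim[:, j].argmax()) == i:
            pos.append((a.tid, tracks_b[j].tid))
    # Negatives: pairs the spatial-temporal model deems infeasible.
    neg = [(a.tid, b.tid) for a in tracks_a for b in tracks_b
           if not st_feasible(a, b)]
    return pos, neg

def mine_cdm(source_ids, target_tids):
    """Cross-Domain Mining between source and target domains."""
    # Same-ID (source) and same-track (target) positives reduce to the
    # within-track pairs above; every cross-domain pair is a negative
    # because the two identity sets are disjoint.
    return [(s, t) for s in source_ids for t in target_tids]
```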
We introduce a general yet efficient Multi-Camera Tracking system for edge devices, focusing on the inter-camera tracking part of the system. Our approach extracts important information, such as visual features and spatial-temporal information, from the captured videos, and decides whether to associate two tracks directly on the edge devices. Compared to the traditional centralized approach, where all data are streamed to a cloud server for processing, we exchange only the extracted information, saving precious bandwidth and reducing stress on the central server. Moreover, our system can employ any Re-ID algorithm to enhance MCT performance, allowing us to deploy more advanced Re-ID algorithms in the future. The association protocol works as follows (a code sketch appears after the steps):
1. A person appears at camera A; extract its feature.
2. Broadcast the feature to adjacent cameras' buffers once the person leaves camera A.
3. Once a person appears at camera C: if a match is found in the buffer, associate it with the buffered person; otherwise, it is a new person.
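The following is a minimal sketch of this protocol. The camera topology, the distance threshold value, and the in-process "broadcast" are assumptions for illustration; in the real system, features travel over the network between edge devices.

```python
# Illustrative edge-side association protocol for the three steps above.
from itertools import count
import numpy as np

ADJACENT = {"A": ["C"], "C": ["A"]}      # assumed camera topology
THRESHOLD = 0.3                          # assumed feature distance threshold
buffers = {cam: {} for cam in ADJACENT}  # per-camera {global_id: feature}
new_id = count(1)

def on_exit(camera, gid, feat):
    # Step 2: once the person leaves `camera`, broadcast its feature
    # to the buffers of adjacent cameras.
    for nb in ADJACENT[camera]:
        buffers[nb][gid] = feat

def on_appear(camera, feat):
    # Step 3: match the new appearance against the local buffer; on a hit,
    # inherit the buffered identity, otherwise mint a new one.
    best, best_dist = None, THRESHOLD
    for gid, buffered in buffers[camera].items():
        dist = 1.0 - float(feat @ buffered)   # cosine distance, unit vectors
        if dist < best_dist:
            best, best_dist = gid, dist
    if best is not None:
        del buffers[camera][best]
        return best
    return next(new_id)                       # unseen person: new identity

# Step 1: a person appears at camera A; the feature would come from the
# Re-ID CNN, stubbed here with a unit vector.
f = np.ones(128) / np.sqrt(128)
pid = on_appear("A", f)
on_exit("A", pid, f)
assert on_appear("C", f) == pid               # re-identified at camera C
```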
All results are reported on DukeMTMC. ClustF1, ClustP, and ClustR are our proposed evaluation measures for inter-camera tracking performance; see below for more details.
| Method | ClustF1 | ClustP | ClustR | IDF1 | IDP | IDR |
|---|---|---|---|---|---|---|
| Direct Transfer | 0.103 | 0.956 | 0.054 | 0.522 | 0.523 | 0.521 |
| C3M Visual | 0.742 | 0.918 | 0.623 | 0.816 | 0.817 | 0.815 |
| C3M Visual_r | 0.798 | 0.909 | 0.710 | 0.858 | 0.859 | 0.857 |
Please also refer to the Demo videos at the top of this page.
Each dot of the same color is produced by selecting a different feature distance threshold for inter-camera tracking (also known as T-MCT), given the same detection and single-camera tracking results. In other words, the curve connecting the same-colored dots characterizes the precision-recall trade-off of track association. This comparison shows that the proposed ClustP-ClustR plot can visualize inter-camera tracking precision and recall, while the IDP-IDR plot cannot. Such evaluation is useful when selecting system parameters, such as the feature distance threshold, for different kinds of system requirements.
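For illustration, the sketch below shows how such a curve could be traced: sweep the threshold, associate tracks by single-link clustering, and score each clustering. The exact ClustP/ClustR definitions are given elsewhere on this page; the sketch substitutes a common pairwise surrogate (precision and recall over same-cluster track pairs) purely as a stand-in.

```python
# Illustrative threshold sweep producing one (precision, recall) dot per
# threshold, under an assumed pairwise surrogate for ClustP/ClustR.
from itertools import combinations
import numpy as np

def cluster(dist, t):
    """Single-link clustering: associate tracks i, j whenever dist[i, j] < t."""
    n = dist.shape[0]
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for i in range(n):
        for j in range(i + 1, n):
            if dist[i, j] < t:
                parent[find(i)] = find(j)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

def pairwise_pr(clusters, gt_labels):
    """Precision/recall over same-cluster track pairs (surrogate metric)."""
    pred = {p for c in clusters for p in combinations(sorted(c), 2)}
    gt = {(i, j) for i, j in combinations(range(len(gt_labels)), 2)
          if gt_labels[i] == gt_labels[j]}
    tp = len(pred & gt)
    return (tp / len(pred) if pred else 1.0,
            tp / len(gt) if gt else 1.0)

# Toy distances: lower thresholds favor precision, higher favor recall.
dist = np.array([[0.0, 0.2, 0.9],
                 [0.2, 0.0, 0.8],
                 [0.9, 0.8, 0.0]])
gt_labels = [0, 0, 1]
for t in (0.1, 0.5, 1.0):
    p, r = pairwise_pr(cluster(dist, t), gt_labels)
    print(f"t={t}: precision~{p:.2f}, recall~{r:.2f}")
```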