Distributed Intelligence:
Smart Surveillance System
Unsupervised Person Re-Identification with Hard Samples Rectification in a Real-time Multi-Camera Tracking System
Man-Yu Lee, leemanyu@media.ee.ntu.edu.tw
Introduction
Multi-camera tracking
Multi-Camera Tracking (MCT) is a crucial technology for the envisioned smart city; it aims to track multiple people through a network of cameras. MCT is composed of three parts: detection, Single-Camera Tracking (SCT), and data association. Inference with deep learning networks generally relies on a powerful server with ample computing power and memory. Yet given the tremendous amount of data collected by the cameras in a surveillance system, transmitting all the video data to a central server is impractical because transmission bandwidth is limited. A better approach is to deploy the system on edge devices and send only the high-level information of interest to the server for further analysis. In that case, the limited computational resources of the edge devices must be taken into account.
We deploy a multi-camera tracking system on real-world hardware, with an efficient framework, to demonstrate its viability on edge devices.
Person re-identification
While MCT is a notoriously difficult problem, a popular research topic has emerged from the final step of its matching scheme: person re-identification (re-ID), which addresses the problem of recognizing people across cameras by visual appearance. Although person re-ID has improved greatly with the rise of Convolutional Neural Networks (CNNs) trained with supervised learning, unsupervised cross-domain re-ID remains challenging owing to the lack of labelled data in the target domain.
We propose an unsupervised learning scheme, Hard Samples Rectification (HSR), for person re-ID. It addresses the vulnerability of existing clustering-based methods to the hard positive and hard negative samples in the dataset.
Algorithm
Highlight
We propose a two-sided learning scheme, Hard Samples Rectification (HSR), which consists of two complementary components:
1) an inter-camera mining technique (ICM), which uses the feature distribution and the camera ID information to correct the shortcomings in the original clustering results caused by hard positive pairs.
2) a part-based homogeneity technique (PBH), which splits the possible hard negative pairs within a cluster into different groups according to the features of their local parts.
Proposed Method
Initially, the feature extractor is pretrained on the source dataset. In each iteration, after clustering, we first rectify the hard negative pairs in the imperfect clusters with the part-based homogeneity technique (PBH) by splitting and regrouping the samples. The refined pseudo labels then serve as supervision to fine-tune the model with the cross-entropy loss and the triplet loss. In addition, we apply the inter-camera mining technique (ICM) as a complement to the clustering results: it pulls together the possible hard positive pairs, that is, images that are mutually among the top-K closest to the anchor image and are captured in different camera views.
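The triplet loss used during fine-tuning can be illustrated with a minimal sketch. The text does not specify the exact formulation, so this assumes the common margin-based form with Euclidean distance; the function names and the margin value of 0.3 are hypothetical choices made for illustration only.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=0.3):
    """Margin-based triplet loss (assumed form, not taken from the source).

    The loss is zero once the negative is farther from the anchor than the
    positive by at least `margin`; otherwise it grows linearly.
    """
    gap = euclidean(anchor, positive) - euclidean(anchor, negative)
    return max(0.0, gap + margin)


# A well-separated triplet gives zero loss; a violated one is penalized.
easy = triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0])
hard = triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0])
```

In this picture, the pseudo labels from clustering (after PBH) and the pairs mined by ICM decide which samples play the positive and negative roles.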
Inter-camera mining
For each anchor image during training, we mine the images that are mutually among the top-K closest to it in the feature space but come from different camera views; these form the possible hard positive pairs. To ensure the robustness and correctness of inter-camera mining, an additional technique based on K mutual best-buddies pairs is applied.
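The mining step can be sketched as follows. This is one possible reading of the description, not the authors' implementation: a pair is kept only if each image is among the other's K nearest cross-camera neighbours under Euclidean distance. The function names and the brute-force search are assumptions made for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k_cross_camera(idx, feats, cams, k):
    """Indices of the k samples nearest to `idx` that come from another camera."""
    candidates = [j for j in range(len(feats)) if cams[j] != cams[idx]]
    candidates.sort(key=lambda j: euclidean(feats[idx], feats[j]))
    return candidates[:k]


def mine_hard_positives(feats, cams, k):
    """Return sorted (i, j) pairs, i < j, that are in each other's cross-camera top-k."""
    neighbours = [set(top_k_cross_camera(i, feats, cams, k))
                  for i in range(len(feats))]
    pairs = []
    for i in range(len(feats)):
        for j in neighbours[i]:
            # Keep the pair only if the relation holds in both directions.
            if i < j and i in neighbours[j]:
                pairs.append((i, j))
    return sorted(pairs)


# Two people, each seen once by camera 0 and once by camera 1.
feats = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
cams = [0, 1, 0, 1]
pairs = mine_hard_positives(feats, cams, k=1)
```

Requiring the relation to hold in both directions is what filters out one-sided matches; samples from the same camera are never candidates, however close they are in feature space.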
Part-based homogeneity
We extract local features of the upper part and the lower part of each sample in an imperfect cluster and apply K-means clustering to each set of local features separately, obtaining two kinds of part-based labels. Using these two temporary local labels, the cluster is then split into at most four different groups according to the look-up table.
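The splitting step alone can be sketched as below. The sketch assumes that K-means has already produced one label per sample for each part, with two clusters per part (which is what the limit of four groups suggests), and that the look-up table simply places every distinct (upper, lower) label pair in its own group. Both the function name and that mapping are assumptions; the actual table is not reproduced here.

```python
def split_cluster(upper_labels, lower_labels):
    """Group sample indices by their (upper, lower) part-label pair.

    With two possible labels per part there are at most four distinct pairs,
    so the cluster is divided into at most four groups.
    """
    groups = {}
    for idx, key in enumerate(zip(upper_labels, lower_labels)):
        groups.setdefault(key, []).append(idx)
    return groups


# Five samples in one imperfect cluster, with hypothetical part labels.
upper = [0, 0, 1, 1, 0]
lower = [0, 1, 1, 1, 0]
groups = split_cluster(upper, lower)
```

Samples 0 and 4 agree on both parts and stay together, while sample 1 shares their upper-part label but differs in the lower part, so it is separated as a possible hard negative.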
Result
System
System overview
The original MCT system is implemented as a pipeline of four operators: Detector, Tracker, Extractor, and Matcher. The pipeline layout requires memory buffers to hold the frames passed between consecutive operators. Once a buffer reaches its maximum capacity, the operator discards the excess frames, which degrades MCT performance by producing a considerable number of false negatives in detection.
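The effect of a full buffer can be illustrated with a small sketch. The text does not say which frames are discarded, so this assumes the incoming frame is dropped when the buffer is full; the class name, its methods, and the drop counter are hypothetical.

```python
from collections import deque


class FrameBuffer:
    """Bounded FIFO buffer between two pipeline operators (illustrative only)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.frames = deque()
        self.dropped = 0  # number of frames discarded because the buffer was full

    def push(self, frame):
        """Store a frame; return False and count a drop if the buffer is full."""
        if len(self.frames) >= self.capacity:
            self.dropped += 1
            return False
        self.frames.append(frame)
        return True

    def pop(self):
        """Hand the oldest frame to the downstream operator, or None if empty."""
        return self.frames.popleft() if self.frames else None


# A slow consumer: five frames arrive before any is consumed.
buf = FrameBuffer(capacity=2)
accepted = [buf.push(frame_id) for frame_id in range(5)]
```

Every dropped frame is one the downstream operators never see, which is how a slow stage turns into missed detections.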
Proposed framework
We propose an effective framework that swaps the order of Extractor and Tracker in the pipeline to make better use of the limited computing power. Detector runs only at predefined intervals and passes the remaining frames unprocessed to the next operator. We place Extractor directly after Detector so that it extracts features only from the few frames on which Detector has run. Tracker then bridges the broken trajectories with a Kalman filter and, at the same time, preserves the features of each track. Because a person's appearance changes only slightly within a short time, extracting features from sampled frames of a track is an acceptable way to maintain the operating speed of the system. Matcher links the corresponding tracks and assigns identities based on the Euclidean distance between their features.
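The Matcher stage can be sketched as follows. The text states only that identities are assigned by Euclidean distance between features, so everything else is an assumption made for illustration: each track is summarized by the mean of its sampled features, tracks are processed in order, and a track joins the closest existing identity if the distance is below a hypothetical threshold, otherwise it starts a new identity.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_feature(features):
    """Element-wise mean of the features sampled along one track."""
    n = len(features)
    return [sum(column) / n for column in zip(*features)]


def assign_identities(tracks, threshold):
    """Assign an integer identity to each track (a list of feature vectors)."""
    gallery = []      # one representative feature per known identity
    identities = []
    for feats in tracks:
        query = mean_feature(feats)
        best_id, best_dist = None, threshold
        for identity, representative in enumerate(gallery):
            dist = euclidean(query, representative)
            if dist < best_dist:
                best_id, best_dist = identity, dist
        if best_id is None:
            # No known identity is close enough: register a new one.
            gallery.append(query)
            best_id = len(gallery) - 1
        identities.append(best_id)
    return identities


# Tracks 0 and 2 show the same person from two cameras; track 1 is someone else.
tracks = [[[0.0, 0.0], [0.2, 0.0]], [[5.0, 5.0]], [[0.1, 0.1]]]
ids = assign_identities(tracks, threshold=1.0)
```

Summarizing a track by a few sampled features matches the idea in the text that appearance changes little over a short time, so sparse extraction is enough for matching.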
Result
We use DukeMTMC as the dataset in our experiments. For expediency, we cut a few video sequences from the original testing set to form a new testing set with two scenarios, Easy and Hard. The Easy scene contains 13 identities across two 90-second videos captured by two cameras, whereas the Hard scene contains 28 identities across two 70-second videos.
Demo
Offline demo
Person summarization
Real-time system
To demonstrate viability on an edge device, the system is implemented on an Intel NUC mini-PC as our hardware platform. All operations are performed in real time.
Original framework
Proposed framework