Abstract:
Person re-identification (ReID) across multiple surveillance cameras with non-overlapping
fields of view is one of the most significant problems in real-world intelligent video
surveillance systems. Because the problem is unconstrained, gait-based person
recognition is likely the only viable identification method for person ReID in
this setting. Furthermore, most existing ReID algorithms were designed for
closed-world scenarios and use the same descriptors across the camera network,
regardless of the dramatic change in view angle caused by different camera positions,
which causes them to perform poorly in real-world scenarios. To overcome this problem,
this thesis presents a simple yet effective algorithm for robust gait recognition in
person ReID that handles the challenges arising from the real-world
multi-camera surveillance environment. In this approach, we first designed a novel
low-dimensional spatio-temporal feature vector extracted from the pose estimates of
raw video frames. Specifically, we developed a 50-dimensional feature descriptor by
concatenating four different types of spatio-temporal features. These features are
discriminative and, at the same time, robust to variations in different covariate
factors. Thereafter, a pose sequence with a timestep length of 28 frames was formed
and fed into an RNN-based classifier network. The network consists of two BGRU
layers, each with only 80 GRU cells. The input layer was followed by a
batch normalization layer. The output of the recurrent layers was also batch normalized
to standardize the activations and finally fed into an output softmax layer. A majority
voting scheme was then applied to the network output to predict the subject ID. For
multi-view gait recognition, we also propose a two-stage network in which we first
identify the walking direction from the gait video using a view angle identification
network. The input to this network was a clip of 16 consecutive frames, preprocessed
and resized to 112x112, which was fed into a 3D convolutional network based on
C3D. Experimental evaluation on two challenging gait datasets, CASIA A and CASIA B,
demonstrates that the proposed method achieves state-of-the-art performance on both
single-view and multi-view gait recognition. The experimental results confirm the
effectiveness of the proposed approach compared with other state-of-the-art methods.
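
As a rough illustration of the sequence formation and majority voting steps described
above, the following minimal Python sketch cuts a video's per-frame descriptors into
28-frame windows and combines per-window predictions into one subject ID. It is a sketch
under stated assumptions, not the thesis implementation: the function names
make_windows and majority_vote, the stride parameter, the tie-breaking rule, and the
sample predictions are all hypothetical, and the classifier itself is not modelled.

```python
from collections import Counter

TIMESTEPS = 28    # sequence length stated in the abstract
FEATURE_DIM = 50  # descriptor size stated in the abstract


def make_windows(frames, timesteps=TIMESTEPS, stride=1):
    """Cut per-frame feature vectors into fixed-length windows.

    `stride` is an assumption; the abstract gives only the window length.
    Returns an empty list when the video is shorter than one window.
    """
    if timesteps < 1 or stride < 1:
        raise ValueError("timesteps and stride must be positive")
    last_start = len(frames) - timesteps
    return [frames[i:i + timesteps] for i in range(0, last_start + 1, stride)]


def majority_vote(window_predictions):
    """Return the subject ID predicted for the most windows.

    Ties go to the ID seen first, which is how Counter.most_common orders
    equal counts; the abstract does not specify a tie-breaking rule.
    """
    if not window_predictions:
        raise ValueError("need at least one window prediction")
    return Counter(window_predictions).most_common(1)[0][0]


# Toy usage: 40 frames of dummy features, and made-up per-window predictions
# standing in for the argmax of the classifier's softmax output.
video = [[0.0] * FEATURE_DIM for _ in range(40)]
windows = make_windows(video)
predicted_ids = [3, 7, 3, 5, 3]  # hypothetical classifier outputs
subject_id = majority_vote(predicted_ids)
```

With a stride of 1, a 40-frame video yields 40 - 28 + 1 = 13 overlapping windows. A larger
stride would reduce the overlap and the number of votes; which setting the thesis uses is
not stated in the abstract.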