End-to-End Violence Detection Using Pedestrian Detection, Pose Estimation, and Temporal GRUs for Surveillance Applications
Abstract
In recent years, surveillance systems have played an increasingly prominent role in both public and private settings. These systems monitor activities in real time and provide information to security personnel and authorities, and their constant observation helps prevent incidents and maintain order. Traditional surveillance systems record events but do not fully exploit the valuable information they capture; newer technologies allow that information to be extracted, turning surveillance into an active tool for security. With the development of tools such as object detection, pose estimation, and neural networks, surveillance systems can now interpret the scenes they capture and, rather than simply recording footage, become active participants in security by extracting meaningful information from visual data. Despite these advances, identifying violent acts from visual information remains a challenge: the data must be analyzed in a way that identifies risks. Although cameras capture a large amount of information, traditional systems do not always use it preventively. Such systems must anticipate risky situations by detecting aggressive behavior or suspicious activities early. This work focuses on developing techniques to improve the detection of violence in surveillance videos by optimizing specific processes: pedestrian detection, human posture estimation, object tracking, and violent behavior classification. Pedestrian detection is optimized using advanced models such as YOLO, enhancing accuracy in high-density environments. Posture estimation is improved through advanced pose detection algorithms that reduce manual intervention. Object tracking is enhanced by implementing Deep SORT to maintain reliable identity tracking across video frames. Violent behavior classification is fine-tuned using a deep neural network architecture based on Gated Recurrent Units (GRU), which captures temporal movement patterns. Video footage from the KranokNV database is processed to identify joint angles of pedestrians, and the VID dataset is used to evaluate system performance. This integrated approach aims to achieve faster, more accurate, and more reliable detection of violent situations, contributing to public safety. The evaluation also considers spatial and temporal features such as velocity, acceleration, motion energy, abrupt changes, symmetry, and expansion radius. The processed data were smoothed with a Kalman filter, and the system achieved an accuracy of 99.44%. The results indicate continuous detection capability and improved generalization throughout the training process.
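As a rough illustration of the joint-angle step described above, the sketch below computes angles from 2D pose keypoints. The COCO-style keypoint indices, the `keypoints` array layout, and the chosen joint triplets are assumptions made for this example; in the actual pipeline the keypoints would come from the pose estimation stage applied to YOLO pedestrian detections.

```python
import numpy as np

# Hypothetical COCO-style keypoint indices, used only for this sketch.
LEFT_SHOULDER, LEFT_ELBOW, LEFT_WRIST = 5, 7, 9
LEFT_HIP, LEFT_KNEE, LEFT_ANKLE = 11, 13, 15

def joint_angle(a, b, c):
    """Angle (degrees) at point b formed by the segments b->a and b->c."""
    v1 = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    v2 = np.asarray(c, dtype=float) - np.asarray(b, dtype=float)
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-8)
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

def pose_angles(keypoints):
    """keypoints: (17, 2) array of (x, y) pixel coordinates for one pedestrian."""
    return {
        "left_elbow": joint_angle(keypoints[LEFT_SHOULDER],
                                  keypoints[LEFT_ELBOW],
                                  keypoints[LEFT_WRIST]),
        "left_knee": joint_angle(keypoints[LEFT_HIP],
                                 keypoints[LEFT_KNEE],
                                 keypoints[LEFT_ANKLE]),
    }
```

A per-frame feature vector for a tracked pedestrian could then concatenate such angles across all joints of interest before being passed to the temporal classifier.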
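The abstract also mentions smoothing the extracted data with a Kalman filter and deriving kinematic descriptors such as velocity, acceleration, and motion energy. The following is a minimal sketch of that idea, assuming a constant-velocity Kalman model applied to a single 2D keypoint trajectory; the state layout, noise parameters, and feature definitions are illustrative assumptions rather than the exact formulation used in the work.

```python
import numpy as np

def kalman_smooth_2d(track, dt=1.0, q=1e-2, r=1.0):
    """Constant-velocity Kalman filter over a (T, 2) trajectory of (x, y) points."""
    F = np.array([[1, 0, dt, 0],
                  [0, 1, 0, dt],
                  [0, 0, 1, 0],
                  [0, 0, 0, 1]], dtype=float)   # state transition (x, y, vx, vy)
    H = np.array([[1, 0, 0, 0],
                  [0, 1, 0, 0]], dtype=float)   # we observe positions only
    Q, R = q * np.eye(4), r * np.eye(2)         # process / measurement noise (assumed)
    x, P = np.array([track[0, 0], track[0, 1], 0.0, 0.0]), np.eye(4)
    smoothed = []
    for z in track:
        x, P = F @ x, F @ P @ F.T + Q                      # predict
        K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)       # Kalman gain
        x = x + K @ (z - H @ x)                            # update with measurement
        P = (np.eye(4) - K @ H) @ P
        smoothed.append(x[:2].copy())
    return np.array(smoothed)

def kinematic_features(smoothed, dt=1.0):
    """Illustrative velocity, acceleration, and motion-energy descriptors."""
    vel = np.diff(smoothed, axis=0) / dt
    acc = np.diff(vel, axis=0) / dt
    speed = np.linalg.norm(vel, axis=1)
    return {
        "mean_speed": float(speed.mean()),
        "mean_accel": float(np.linalg.norm(acc, axis=1).mean()) if len(acc) else 0.0,
        "motion_energy": float((speed ** 2).sum()),
    }
```

In practice such a filter would be run per tracked identity (e.g., per Deep SORT track) so that abrupt changes and expansion-radius style measures can be computed on stable trajectories.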
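Finally, a minimal sketch of a GRU-based temporal classifier of the kind described above, written in PyTorch. The layer sizes, sequence length, two-class output (violent vs. non-violent), and the idea of feeding per-frame joint-angle/kinematic feature vectors are assumptions for illustration; the architecture and training setup actually used are those reported in the work.

```python
import torch
import torch.nn as nn

class ViolenceGRU(nn.Module):
    """Sequence classifier: per-frame pose/kinematic features -> violent / non-violent."""
    def __init__(self, feature_dim=34, hidden_dim=128, num_layers=2, num_classes=2):
        super().__init__()
        self.gru = nn.GRU(feature_dim, hidden_dim, num_layers=num_layers,
                          batch_first=True, dropout=0.3)
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):
        # x: (batch, seq_len, feature_dim) sequences of per-frame feature vectors
        _, h_n = self.gru(x)           # h_n: (num_layers, batch, hidden_dim)
        return self.head(h_n[-1])      # logits from the final hidden state

# Example: a batch of 8 clips, 30 frames each, 34 features per frame (assumed sizes).
model = ViolenceGRU()
logits = model(torch.randn(8, 30, 34))
print(logits.shape)  # torch.Size([8, 2])
```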