Dense video captioning of violent behavior using bi-modal transformers and unsupervised semantic information.

Cárdenas Pimentel, Israel

Master's thesis

Abstract

This work presents the research carried out to obtain the degree of Master of Science in Computer Science. In recent decades, as rising crime in cities has overwhelmed public security forces, security and safety have increasingly been addressed with technology. People now rely on security systems such as video surveillance to protect all kinds of places, from homes to businesses. Nevertheless, the data these systems collect is voluminous and sometimes too complex for the untrained eye to interpret. A system that understands the environment and the subjects involved in it, and that explains what is happening in a video through a textual description, would therefore improve video surveillance as a tool for understanding and preventing crime. The technical challenges of dense video captioning lie in correctly detecting events and describing them in text by exploiting visual and audio features on a domain-specific dataset. Several video captioning techniques have been developed, such as bidirectional analysis, hierarchical reinforcement learning agents, and event sequence generation. The proposed bi-modal transformer differs from these dense video captioning models in that it generates event descriptions from both visual and audio inputs, showing how audio improves dense video captioning performance. However, an audio signal is not always available in the proposed application, the captioning of violent behavior in CCTV footage. The audio signal is therefore replaced in the bi-modal transformer by unsupervised semantic information, learned with a method based on the premise that complex events can be decomposed into more elementary events shared across several complex events.
Dense video captioning is relevant because it takes over a tedious and challenging task for humans, reducing the work of detecting and interpreting the events shown on screen. This work addresses that need with a dataset from the specific domain of violent behavior. Its contribution is a dense video captioning infrastructure built on such a dataset: the existing DCSASS dataset, a collection of surveillance camera videos containing both anomalies and expected behaviors, merged with the ActivityNet dataset, a collection of videos of human behavior. Both datasets contribute to describing the violent events shown in the videos. The DCSASS dataset is processed to create, from scratch, new descriptions for the events in the videos of the merged DCSASS/ActivityNet dataset. These descriptions provide the features used to train the bi-modal transformer, yielding a new model adapted to environments involving violent and expected behaviors. The model also achieves results close to those of the original implementation on the ActivityNet dataset when compared on metrics such as METEOR, SPICE, and CIDEr. This represents an opportunity to enhance the dataset and improve the training process of the new model.

Description

https://orcid.org/0000-0002-5320-0773

Collections

Ciencias Exactas y Ciencias de la Salud
