Dense video captioning of violent behavior using bi-modal transformers and unsupervised semantic information.

Cárdenas Pimentel, Israel

dc.audience.educationlevel	Empresas/Companies
dc.audience.educationlevel	Otros/Other
dc.contributor.advisor	Terashima Marín, Hugo
dc.contributor.author	Cárdenas Pimentel, Israel
dc.contributor.cataloger	emipsanchez
dc.contributor.committeemember	Conant Pablos, Santiago Enrique
dc.contributor.committeemember	Rad, Paul
dc.contributor.department	School of Engineering and Sciences
dc.contributor.institution	Campus Estado de México	es_MX
dc.date.accepted	2023-06
dc.date.accessioned	2025-03-20T02:49:50Z
dc.date.issued	2023
dc.description	https://orcid.org/0000-0002-5320-0773
dc.description.abstract	This work presents the research developed to obtain the degree of Master of Science in Computer Science. Security and safety have become pressing issues in recent decades, as the growing number of crimes in cities overwhelms public security, and technology has been used to improve them. As technology has evolved, people have adopted security systems such as video surveillance to ensure security and safety everywhere, from homes to businesses. Nevertheless, the data these systems collect is large and sometimes too complex for the untrained eye to interpret. Therefore, a system capable of understanding the environment and the subjects involved in it, and of explaining what is happening in a video with a textual description, would improve video surveillance as a tool to understand and prevent crime. The technical challenges of dense video captioning relate to correctly detecting events and describing them in text by exploiting visual and audio features on a domain-specific dataset. Several video captioning techniques have been developed, such as bidirectional analysis, hierarchical reinforcement learning agents, and event sequence generation. The proposed bi-modal transformer differs from these dense video captioning models in that it generates event descriptions from both visual and audio inputs, showing how audio improves dense video captioning performance. However, the audio signal is not always available for the proposed application, the captioning of violent behavior in CCTV footage. The audio signal is therefore replaced in the bi-modal transformer by unsupervised semantic information, learned with a method based on the premise that complex events can be decomposed into more elementary events shared across several complex events.
Dense video captioning is relevant because it takes over a task that is tedious and challenging for humans, reducing the work of detecting and interpreting the events shown on screen. In this work, a dataset specific to the domain of violent behavior is used to address this problem. The contribution of this work is to implement this dense video captioning infrastructure with a dataset with these characteristics: the existing DCSASS video dataset, a collection of surveillance camera videos that contain anomalies and expected behaviors, merged with the ActivityNet dataset, a collection of videos of human behavior. Both datasets contribute to the description of the violent events shown in the videos. The DCSASS dataset is processed to create, from scratch, new descriptions for the events in the videos of the merged DCSASS/ActivityNet dataset. These descriptions provide the features used to feed and train the bi-modal transformer, producing a new model adapted to environments involving violent and expected behaviors. The approach also yields results close to those of the original implementation on the ActivityNet dataset when comparing output metric scores such as METEOR, SPICE, and CIDEr. It represents an opportunity to enhance the dataset and improve the training process of this new model.
dc.description.degree	Master of Science in Computer Science	es_MX
dc.format.medium	Texto
dc.identificator	120314
dc.identifier.citation	Cárdenas Pimentel, I. (2023). Dense Video Captioning of Violent Behavior Using Bi-modal Transformers and Unsupervised Semantic Information. [Tesis maestría]. Instituto Tecnológico y de Estudios Superiores de Monterrey.
dc.identifier.cvu	1151184	es_MX
dc.identifier.orcid	https://orcid.org/0009-0003-8516-8725
dc.identifier.uri	https://hdl.handle.net/11285/703376
dc.language.iso	eng	es_MX
dc.publisher	Instituto Tecnológico y de Estudios Superiores de Monterrey	es_MX
dc.relation.isFormatOf	submittedVersion	es_MX
dc.rights	openAccess	es_MX
dc.rights.uri	http://creativecommons.org/licenses/by-nc/4.0	es_MX
dc.subject.classification	INGENIERÍA Y TECNOLOGÍA::CIENCIAS TECNOLÓGICAS::TECNOLOGÍA DE LOS ORDENADORES::SISTEMAS DE CONTROL DEL ENTORNO
dc.subject.keyword	Dense video captioning
dc.subject.keyword	Violent behavior
dc.subject.keyword	Transformers
dc.subject.keyword	ActivityNet
dc.subject.lcsh	Technology
dc.subject.lcsh	Science
dc.title	Dense video captioning of violent behavior using bi-modal transformers and unsupervised semantic information.
dc.type	Tesis de Maestría / master Thesis	es_MX