Dense video captioning of violent behavior using bi-modal transformers and unsupervised semantic information.

dc.audience.educationlevel: Companies
dc.audience.educationlevel: Other
dc.contributor.advisor: Terashima Marín, Hugo
dc.contributor.author: Cárdenas Pimentel, Israel
dc.contributor.cataloger: emipsanchez
dc.contributor.committeemember: Conant Pablos, Santiago Enrique
dc.contributor.committeemember: Rad, Paul
dc.contributor.department: School of Engineering and Sciences
dc.contributor.institution: Campus Estado de México [es_MX]
dc.date.accepted: 2023-06
dc.date.accessioned: 2025-03-20T02:49:50Z
dc.date.issued: 2023
dc.description: https://orcid.org/0000-0002-5320-0773
dc.description.abstract: This work presents the research developed to obtain the degree of Master of Science in Computer Science. In recent decades, as rising crime in cities has overwhelmed public security, technology has increasingly been used to improve security and safety. As technology has evolved, people have adopted security systems such as video surveillance to protect places ranging from homes to businesses. Nevertheless, the data these systems collect is large and sometimes too complex for the untrained eye to interpret. A system that can understand the environment and the subjects involved in it, and explain what is happening in a video with a textual description, would therefore improve video surveillance as a tool for understanding and preventing crime. The technical challenges of dense video captioning relate to detecting events correctly and describing them in text by exploiting visual and audio features on a domain-specific dataset. Several video captioning techniques have been developed, such as bidirectional analysis, hierarchical reinforcement learning agents, and event sequence generation. The proposed bi-modal transformer differs from these dense video captioning models in that it generates event descriptions from both visual and audio inputs, showing how audio improves dense video captioning performance. However, an audio signal is not always available in the proposed application, the captioning of violent behavior in CCTV footage. The audio signal in the bi-modal transformer is therefore replaced by unsupervised semantic information, learned with a method based on the premise that complex events can be decomposed into more elementary events shared across several complex events.
Dense video captioning is relevant because it takes over a task that is tedious and challenging for humans, reducing the work of detecting and interpreting the events shown on screen. This work addresses that task for the specific domain of violent behavior. Its contribution is an implementation of this dense video captioning infrastructure on a dataset with these characteristics: the existing DCSASS video dataset, a collection of surveillance camera videos that contain anomalies and expected behaviors, merged with the ActivityNet dataset, a collection of videos of human behavior. Both datasets contribute to the description of the violent events shown in the videos. The DCSASS dataset is processed to create, from scratch, new descriptions for the events in the videos of the merged DCSASS/ActivityNet dataset. These descriptions provide the features used to train the bi-modal transformer, producing a new model adapted to environments involving violent and expected behaviors. The model also yields results close to those of the original implementation on the ActivityNet dataset when comparing output metric scores such as METEOR, SPICE, and CIDEr. This represents an opportunity to enhance the dataset and improve the training process of this new model.
dc.description.degree: Master of Science in Computer Science [es_MX]
dc.format.medium: Text
dc.identificator: 120314
dc.identifier.citation: Cárdenas Pimentel, I. (2023). Dense Video Captioning of Violent Behavior Using Bi-modal Transformers and Unsupervised Semantic Information. [Master's thesis]. Instituto Tecnológico y de Estudios Superiores de Monterrey.
dc.identifier.cvu: 1151184 [es_MX]
dc.identifier.orcid: https://orcid.org/0009-0003-8516-8725
dc.identifier.uri: https://hdl.handle.net/11285/703376
dc.language.iso: eng [es_MX]
dc.publisher: Instituto Tecnológico y de Estudios Superiores de Monterrey [es_MX]
dc.relation.isFormatOf: submittedVersion [es_MX]
dc.rights: openAccess [es_MX]
dc.rights.uri: http://creativecommons.org/licenses/by-nc/4.0 [es_MX]
dc.subject.classification: ENGINEERING AND TECHNOLOGY::TECHNOLOGICAL SCIENCES::COMPUTER TECHNOLOGY::ENVIRONMENT CONTROL SYSTEMS
dc.subject.keyword: Dense video captioning
dc.subject.keyword: Violent behavior
dc.subject.keyword: Transformers
dc.subject.keyword: ActivityNet
dc.subject.lcsh: Technology
dc.subject.lcsh: Science
dc.title: Dense video captioning of violent behavior using bi-modal transformers and unsupervised semantic information.
dc.type: Master's Thesis [es_MX]

Files

Original bundle

Name:
CardenasPimental_TesisMaestriapdfa.pdf
Size:
13.76 MB
Format:
Adobe Portable Document Format
Description:
Master's Thesis
Name:
CardenasPimentel_ActaGradoDeclaracionAutoriapdfa.pdf
Size:
473.48 KB
Format:
Adobe Portable Document Format
Description:
Degree Certificate and Declaration of Authorship
Name:
CardenasPimentel_CartaAutorizacionpdfa.pdf
Size:
130.53 KB
Format:
Adobe Portable Document Format
Description:
Authorization Letter

License bundle

Name:
license.txt
Size:
1.3 KB
Format:
Item-specific license agreed upon to submission
Description: