Two-fold Approach for Video Retrieval: Semantic Vectors to Guide Neural Network Training and Video Representation Approximation Via Language-Image Models

dc.audience.educationlevel: Investigadores/Researchers [es_MX]
dc.contributor.advisor: Terashima Marín, Hugo
dc.contributor.author: Portillo Quintero, Jesús Andrés
dc.contributor.cataloger: tolmquevedo, emipsanchez [es_MX]
dc.contributor.committeemember: Ortiz Bayliss, José Carlos
dc.contributor.committeemember: Han, David
dc.contributor.department: Escuela de Ciencias e Ingeniería [es_MX]
dc.contributor.institution: Campus Monterrey [es_MX]
dc.creator: TERASHIMA MARIN, HUGO; 65879
dc.date.accepted: 2021-06-07
dc.date.accessioned: 2022-05-25T18:09:11Z
dc.date.available: 2022-05-25T18:09:11Z
dc.date.created: 2021-05-28
dc.date.issued: 2021-05
dc.description: http://orcid.org/0000-0002-5320-0773 [es_MX]
dc.description.abstract: Video Retrieval is a challenging task concerned with recovering relevant videos from a collection to fulfill a query. Defining relevance is an unsolved problem in the Information Retrieval literature, since it is prone to subjective considerations. Text-based Video Retrieval systems calculate relevance by measuring the relationship between a textual query and video metadata. This is a widely used approach, but it does not consider motion and video dynamics. On the other hand, content-based methods account for visuals in the retrieval process but can only operate with visual queries. This raises the question of whether it is possible to create a Video Retrieval system that retrieves videos based on visual content while working with textual queries. A method to bridge the semantic gap between video and text is presented. This approach employs a Multimodal Machine Learning model capable of mapping multiple types of information among themselves. The connection between modalities occurs in a learned video-text space, where it is possible to measure similarity between them. With such a trained system, the most similar videos to a query can be retrieved by computing the similarity between the vector representation of a text query and a collection of videos. The work presented in this thesis focuses on a Dual Encoder architecture that funnels video and text information through independent Neural Networks. These Neural Networks take advantage of pre-trained models for each modality, called backbones. Other authors have used word-level backbones to encode text; we claim this method restricts the descriptiveness of text. One contribution of this work is the implementation of a novel sentence-level embedding backbone. This method generates sentence vectors representing the holistic phrasal meaning and has the added benefit of allowing the semantic similarity among sentences to be measured.
A second research contribution is to employ similarity measurements on text to guide the Neural Network training. A proposed Proxy Mining loss finds contrary sentences and their corresponding videos to ground the video-text space training. Compared to video, the image modality has received more resources and research effort, given its ease of use. The possibility of leveraging those assets for video is considered. A third scientific contribution is to extend the image-and-text representation called CLIP. This model is pre-trained to produce a fixed-size representation for images and text that allows for similarity measurement. By trying several aggregation methods, it was possible to collapse the temporal dimension inherent in videos and hence approximate a video-text representation. This yielded state-of-the-art results on the MSR-VTT and MSVD benchmarks. This document represents a thesis project for the degree of Master in Computer Science from Instituto Tecnológico y de Estudios Superiores de Monterrey. [es_MX]
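The retrieval and aggregation ideas in the abstract can be sketched in a few lines: a video embedding is obtained by collapsing the temporal dimension (here via mean pooling, the simplest of the aggregation methods tried), and retrieval ranks videos by cosine similarity to the query vector. This is an illustrative sketch only; random stand-in vectors replace the real CLIP text and image encoders, and all function names are assumptions, not the thesis code.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Unit-normalize so that a dot product equals cosine similarity.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def video_embedding(frame_embeddings):
    # Mean-pool per-frame embeddings over time, then re-normalize:
    # the simplest way to collapse the temporal dimension.
    return l2_normalize(frame_embeddings.mean(axis=0))

def rank_videos(query_emb, video_embs):
    # Higher cosine similarity means more relevant; best-first indices.
    sims = video_embs @ query_emb
    return np.argsort(-sims), sims

rng = np.random.default_rng(0)
dim = 512  # e.g. the CLIP ViT-B/32 embedding size

# Stand-in embeddings: 3 videos with 8 frames each (not real CLIP output).
videos = [l2_normalize(rng.normal(size=(8, dim))) for _ in range(3)]
video_embs = np.stack([video_embedding(v) for v in videos])

# Make the query resemble video 1 so the ranking has a known answer.
query = l2_normalize(video_embs[1] + 0.1 * rng.normal(size=dim))

ranking, sims = rank_videos(query, video_embs)
print(ranking[0])  # video 1 ranks first
```

In the full system the query vector would come from the text encoder and the frame embeddings from the image encoder; only the pooling and ranking steps shown here are modality-agnostic.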
dc.description.degree: Maestro en Ciencias Computacionales [es_MX]
dc.format.medium: Texto [es_MX]
dc.identificator: 7||33||3304||330406 [es_MX]
dc.identificator: 7||33||3304||120304 [es_MX]
dc.identifier.citation: Portillo-Quintero, J.A. (2021). Two-fold Approach for Video Retrieval: Semantic Vectors to Guide Neural Network Training and Video Representation Approximation Via Language-Image Models. (Tesis de Maestría). Instituto Tecnológico y de Estudios Superiores de Monterrey. Recuperado de: https://hdl.handle.net/11285/648384 [es_MX]
dc.identifier.orcid: http://orcid.org/0000-0002-9856-1900 [es_MX]
dc.identifier.uri: https://hdl.handle.net/11285/648384
dc.language.iso: eng [es_MX]
dc.publisher: Instituto Tecnológico y de Estudios Superiores de Monterrey [es_MX]
dc.relation: CONACyT [es_MX]
dc.relation.isFormatOf: versión publicada [es_MX]
dc.relation.isreferencedby: REPOSITORIO NACIONAL CONACYT
dc.rights: openAccess [es_MX]
dc.rights.uri: http://creativecommons.org/licenses/by-nc-nd/4.0 [es_MX]
dc.subject.classification: INGENIERÍA Y TECNOLOGÍA::CIENCIAS TECNOLÓGICAS::TECNOLOGÍA DE LOS ORDENADORES::ARQUITECTURA DE ORDENADORES [es_MX]
dc.subject.classification: INGENIERÍA Y TECNOLOGÍA::CIENCIAS TECNOLÓGICAS::TECNOLOGÍA DE LOS ORDENADORES::INTELIGENCIA ARTIFICIAL [es_MX]
dc.subject.keyword: Video Retrieval [es_MX]
dc.subject.keyword: Neural Network [es_MX]
dc.subject.keyword: CLIP [es_MX]
dc.subject.lcsh: Technology [es_MX]
dc.title: Two-fold Approach for Video Retrieval: Semantic Vectors to Guide Neural Network Training and Video Representation Approximation Via Language-Image Models [es_MX]
dc.type: Tesis de maestría

Files

Original bundle (showing 3 of 3)

- PortilloQuintero_TesisMaestriaPDFA.pdf (3.47 MB, Adobe Portable Document Format). Description: Tesis Maestría
- PortilloQuintero_ActadeGradoyDeclaracionAutoriaPDFA.pdf (605.58 KB, Adobe Portable Document Format). Description: Acta de Grado y Declaración Autoría
- CartaAutorizacionTesis-CON firmada.pdf (110.21 KB, Adobe Portable Document Format)
License bundle (showing 1 of 1)

- license.txt (1.3 KB). Format: Item-specific license agreed upon to submission

The user is obliged to use the services and contents provided by the University, in particular its printed and electronic resources, in accordance with current legislation, the principles of good faith, and generally accepted use, without contravening public order, especially in cases where, for the proper performance of their activity, they need to reproduce, distribute, communicate, and/or make available fragments of printed works or works in analog or digital format, whether on paper or in electronic form. Ley 23/2006, de 7 de julio, por la que se modifica el texto revisado de la Ley de Propiedad Intelectual, aprobado

