Two-fold Approach for Video Retrieval: Semantic Vectors to Guide Neural Network Training and Video Representation Approximation Via Language-Image Models

dc.audience.educationlevel: Investigadores/Researchers [es_MX]
dc.contributor.advisor: Terashima Marín, Hugo
dc.contributor.author: Portillo Quintero, Jesús Andrés
dc.contributor.cataloger: tolmquevedo, emipsanchez [es_MX]
dc.contributor.committeemember: Ortiz Bayliss, José Carlos
dc.contributor.committeemember: Han, David
dc.contributor.department: Escuela de Ciencias e Ingeniería [es_MX]
dc.contributor.institution: Campus Monterrey [es_MX]
dc.creator: TERASHIMA MARIN, HUGO; 65879
dc.date.accepted: 2021-06-07
dc.date.accessioned: 2022-05-25T18:09:11Z
dc.date.available: 2022-05-25T18:09:11Z
dc.date.created: 2021-05-28
dc.date.issued: 2021-05
dc.description: http://orcid.org/0000-0002-5320-0773 [es_MX]
dc.description.abstract: Video Retrieval is a challenging task concerned with recovering relevant videos from a collection to fulfill a query. Defining relevance is an unsolved problem in the Information Retrieval literature, since it is prone to subjective considerations. Text-based Video Retrieval systems calculate relevance by measuring the relationship between a textual query and video metadata. This is a widely used approach, but it does not consider motion and video dynamics. On the other hand, content-based methods account for visuals in the retrieval process but can only operate with visual queries. This raises the question of whether it is possible to create a Video Retrieval system that retrieves videos based on visual content while working with textual queries. A method to bridge the semantic gap between video and text is presented. This approach employs a Multimodal Machine Learning model capable of mapping multiple types of information among themselves. The connection between modalities occurs in a learned video-text space, where it is possible to measure similarity between them. With such a trained system, the most similar videos to a query can be retrieved by computing the similarity between the vector representation of a text query and a collection of videos. The work presented in this thesis focuses on a Dual Encoder architecture that funnels video and text information through independent Neural Networks. These Neural Networks take advantage of pre-trained models for each modality, called backbones. Other authors have used word-level backbones to encode text; we claim this method restricts the descriptiveness of text. One contribution of this work is the implementation of a novel sentence-level embedding backbone. This method generates sentence vectors representing the holistic phrasal meaning and has the added benefit of allowing the semantic similarity among sentences to be measured.
A second research contribution is to employ similarity measurements on text to guide the Neural Network training. A proposed Proxy Mining loss finds contrary sentences and their corresponding videos to ground the video-text space training. Compared to video, the image modality has received more resources and research effort, given its ease of use. The possibility of leveraging those assets for video is considered. A third scientific contribution is to extend the image-and-text representation called CLIP. This model is pre-trained to produce a fixed-size representation for images and text that allows for similarity measurement. By trying several aggregation methods, it was possible to collapse the temporal dimension inherent in videos and hence approximate a video-text representation. This yielded state-of-the-art results on the MSR-VTT and MSVD benchmarks. This document represents a thesis project for the degree of Master in Computer Science from Instituto Tecnológico y de Estudios Superiores de Monterrey. [es_MX]
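The retrieval and aggregation ideas in the abstract can be sketched in a few lines: a video embedding is obtained by collapsing the temporal dimension (here via mean pooling, the simplest of the aggregation methods tried), and retrieval ranks videos by cosine similarity to the query vector. This is an illustrative sketch only; random stand-in vectors replace the real CLIP text and image encoders, and all function names are assumptions, not the thesis code.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Unit-normalize so that a dot product equals cosine similarity.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def video_embedding(frame_embeddings):
    # Mean-pool per-frame embeddings over time, then re-normalize:
    # the simplest way to collapse the temporal dimension.
    return l2_normalize(frame_embeddings.mean(axis=0))

def rank_videos(query_emb, video_embs):
    # Higher cosine similarity means more relevant; best-first indices.
    sims = video_embs @ query_emb
    return np.argsort(-sims), sims

rng = np.random.default_rng(0)
dim = 512  # e.g. the CLIP ViT-B/32 embedding size

# Stand-in embeddings: 3 videos with 8 frames each (not real CLIP output).
videos = [l2_normalize(rng.normal(size=(8, dim))) for _ in range(3)]
video_embs = np.stack([video_embedding(v) for v in videos])

# Make the query resemble video 1 so the ranking has a known answer.
query = l2_normalize(video_embs[1] + 0.1 * rng.normal(size=dim))

ranking, sims = rank_videos(query, video_embs)
print(ranking[0])  # video 1 ranks first
```

In the full system the query vector would come from the text encoder and the frame embeddings from the image encoder; only the pooling and ranking steps shown here are modality-agnostic.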
dc.description.degree: Maestro en Ciencias Computacionales [es_MX]
dc.format.medium: Texto [es_MX]
dc.identificator: 7||33||3304||330406 [es_MX]
dc.identificator: 7||33||3304||120304 [es_MX]
dc.identifier.citation: Portillo-Quintero, J.A. (2021). Two-fold Approach for Video Retrieval: Semantic Vectors to Guide Neural Network Training and Video Representation Approximation Via Language-Image Models. (Tesis de Maestría). Instituto Tecnológico y de Estudios Superiores de Monterrey. Recuperado de: https://hdl.handle.net/11285/648384 [es_MX]
dc.identifier.orcid: http://orcid.org/0000-0002-9856-1900 [es_MX]
dc.identifier.uri: https://hdl.handle.net/11285/648384
dc.language.iso: eng [es_MX]
dc.publisher: Instituto Tecnológico y de Estudios Superiores de Monterrey [es_MX]
dc.relation: CONACyT [es_MX]
dc.relation.isFormatOf: versión publicada [es_MX]
dc.relation.isreferencedby: REPOSITORIO NACIONAL CONACYT
dc.rights: openAccess [es_MX]
dc.rights.uri: http://creativecommons.org/licenses/by-nc-nd/4.0 [es_MX]
dc.subject.classification: INGENIERÍA Y TECNOLOGÍA::CIENCIAS TECNOLÓGICAS::TECNOLOGÍA DE LOS ORDENADORES::ARQUITECTURA DE ORDENADORES [es_MX]
dc.subject.classification: INGENIERÍA Y TECNOLOGÍA::CIENCIAS TECNOLÓGICAS::TECNOLOGÍA DE LOS ORDENADORES::INTELIGENCIA ARTIFICIAL [es_MX]
dc.subject.keyword: Video Retrieval [es_MX]
dc.subject.keyword: Neural Network [es_MX]
dc.subject.keyword: CLIP [es_MX]
dc.subject.lcsh: Technology [es_MX]
dc.title: Two-fold Approach for Video Retrieval: Semantic Vectors to Guide Neural Network Training and Video Representation Approximation Via Language-Image Models [es_MX]
dc.type: Tesis de maestría

Files

Original bundle (showing 3 of 3)

- PortilloQuintero_TesisMaestriaPDFA.pdf (3.47 MB, Adobe Portable Document Format). Description: Tesis Maestría
- PortilloQuintero_ActadeGradoyDeclaracionAutoriaPDFA.pdf (605.58 KB, Adobe Portable Document Format). Description: Acta de Grado y Declaración Autoría
- CartaAutorizacionTesis-CON firmada.pdf (110.21 KB, Adobe Portable Document Format)
License bundle (showing 1 of 1)

- license.txt (1.3 KB). Format: Item-specific license agreed upon to submission

The user is obliged to use the services and contents provided by the University, in particular its printed and electronic resources, in accordance with current legislation, the principles of good faith, and generally accepted use, without contravening public order, especially in cases where, for the proper performance of their activity, they need to reproduce, distribute, communicate, and/or make available fragments of printed works or works in analog or digital format, whether on paper or in electronic form. Ley 23/2006, de 7 de julio, por la que se modifica el texto revisado de la Ley de Propiedad Intelectual, aprobado

