Two-fold Approach for Video Retrieval: Semantic Vectors to Guide Neural Network Training and Video Representation Approximation Via Language-Image Models
| dc.audience.educationlevel | Investigadores/Researchers | es_MX |
| dc.contributor.advisor | Terashima Marín, Hugo | |
| dc.contributor.author | Portillo Quintero, Jesús Andrés | |
| dc.contributor.cataloger | tolmquevedo, emipsanchez | es_MX |
| dc.contributor.committeemember | Ortiz Bayliss, José Carlos | |
| dc.contributor.committeemember | Han, David | |
| dc.contributor.department | Escuela de Ciencias e Ingeniería | es_MX |
| dc.contributor.institution | Campus Monterrey | es_MX |
| dc.creator | TERASHIMA MARIN, HUGO; 65879 | |
| dc.date.accepted | 2021-06-07 | |
| dc.date.accessioned | 2022-05-25T18:09:11Z | |
| dc.date.available | 2022-05-25T18:09:11Z | |
| dc.date.created | 2021-05-28 | |
| dc.date.issued | 2021-05 | |
| dc.description | http://orcid.org/0000-0002-5320-0773 | es_MX |
| dc.description.abstract | Video Retrieval is a challenging task concerned with recovering relevant videos from a collection to fulfill a query. Defining relevance is an unsolved problem in the Information Retrieval literature, since it is prone to subjective considerations. Text-based Video Retrieval systems calculate relevance by measuring the relationship between a textual query and video metadata. This approach is widely used, but it does not consider motion and video dynamics. Content-based methods, on the other hand, account for visuals in the retrieval process but can only operate with visual queries. This limitation raises the question of whether it is possible to create a Video Retrieval system that retrieves videos based on visual content and works with textual queries. A method to bridge the semantic gap between video and text is presented. This approach employs a Multimodal Machine Learning model capable of mapping multiple types of information to one another. The connection between modalities occurs in a learned video-text space, where it is possible to measure the similarity between them. With a system trained in this way, the videos most similar to a query can be retrieved by computing the similarity between the vector representation of a text query and those of a collection of videos. The work presented in this thesis focuses on a Dual Encoder architecture that funnels video and text information through independent Neural Networks. These Neural Networks take advantage of pre-trained models for each modality, called backbones. Other authors have used word-level backbones to encode text; we claim this method restricts the descriptiveness of text. One contribution of this work is the implementation of a novel sentence-level embedding backbone. This method generates sentence vectors that represent the holistic phrasal meaning and has the added benefit of allowing the semantic similarity among sentences to be measured.
A second research contribution is to employ similarity measurements on text to guide the Neural Network training. A proposed Proxy Mining loss finds sentences of contrary meaning and their corresponding videos to ground the training of the video-text space. Compared to video, the image modality has received more resources and research effort, given its ease of use. The possibility of leveraging those resources for video is considered. A third scientific contribution is to extend the image and text representation called CLIP. This model is pre-trained to produce a fixed-size representation for images and text that allows for similarity measurement. By trying several aggregation methods, it was possible to collapse the temporal dimension inherent in videos and hence approximate a video-text representation. This approach achieved state-of-the-art results on the MSR-VTT and MSVD benchmarks. This document represents a thesis project for the degree of Master in Computer Science from Instituto Tecnológico y de Estudios Superiores de Monterrey. | es_MX |
| dc.description.degree | Maestro en Ciencias Computacionales | es_MX |
| dc.format.medium | Text | es_MX |
| dc.identificator | 7||33||3304||330406 | es_MX |
| dc.identificator | 7||33||3304||120304 | es_MX |
| dc.identifier.citation | Portillo-Quintero, J.A. (2021). Two-fold Approach for Video Retrieval: Semantic Vectors to Guide Neural Network Training and Video Representation Approximation Via Language-Image Models. (Tesis de Maestría). Instituto Tecnológico y de Estudios Superiores de Monterrey. Recuperado de: https://hdl.handle.net/11285/648384 | es_MX |
| dc.identifier.orcid | http://orcid.org/0000-0002-9856-1900 | es_MX |
| dc.identifier.uri | https://hdl.handle.net/11285/648384 | |
| dc.language.iso | eng | es_MX |
| dc.publisher | Instituto Tecnológico y de Estudios Superiores de Monterrey | es_MX |
| dc.relation | CONACyT | es_MX |
| dc.relation.isFormatOf | published version | es_MX |
| dc.relation.isreferencedby | REPOSITORIO NACIONAL CONACYT | |
| dc.rights | openAccess | es_MX |
| dc.rights.uri | http://creativecommons.org/licenses/by-nc-nd/4.0 | es_MX |
| dc.subject.classification | INGENIERÍA Y TECNOLOGÍA::CIENCIAS TECNOLÓGICAS::TECNOLOGÍA DE LOS ORDENADORES::ARQUITECTURA DE ORDENADORES | es_MX |
| dc.subject.classification | INGENIERÍA Y TECNOLOGÍA::CIENCIAS TECNOLÓGICAS::TECNOLOGÍA DE LOS ORDENADORES::INTELIGENCIA ARTIFICIAL | es_MX |
| dc.subject.keyword | Video Retrieval | es_MX |
| dc.subject.keyword | Neural Network | es_MX |
| dc.subject.keyword | CLIP | es_MX |
| dc.subject.lcsh | Technology | es_MX |
| dc.title | Two-fold Approach for Video Retrieval: Semantic Vectors to Guide Neural Network Training and Video Representation Approximation Via Language-Image Models | es_MX |
| dc.type | Master's thesis |
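
The retrieval step described in the abstract (collapsing per-frame vectors into a single video vector, then ranking videos by similarity to a text query vector) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the toy 3-dimensional vectors stand in for real frame and text embeddings, mean pooling is only one possible aggregation method (the abstract does not say which ones were tried), and the function names `mean_pool` and `rank_videos` are hypothetical rather than taken from the thesis.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def mean_pool(frame_embeddings):
    """Collapse the temporal dimension by averaging per-frame vectors."""
    n = len(frame_embeddings)
    dim = len(frame_embeddings[0])
    return [sum(frame[i] for frame in frame_embeddings) / n for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def rank_videos(query_vec, videos):
    """Return video ids sorted by decreasing similarity to the query vector."""
    scored = [
        (cosine(query_vec, mean_pool(frames)), video_id)
        for video_id, frames in videos.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [video_id for _, video_id in scored]


# Toy vectors standing in for real embeddings (assumed values, not real data).
videos = {
    "cooking": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "soccer": [[0.0, 0.9, 0.2], [0.1, 0.8, 0.1]],
    "concert": [[0.1, 0.1, 0.9], [0.0, 0.2, 0.8]],
}
query = [0.2, 0.9, 0.1]  # stand-in for an encoded text query

ranking = rank_videos(query, videos)
print(ranking)
```

In a real system the frame and query vectors would come from a pre-trained image-text model; the ranking logic itself would stay the same regardless of which aggregation function replaces `mean_pool`.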
Files
Original bundle
- Name:
- PortilloQuintero_TesisMaestriaPDFA.pdf
- Size:
- 3.47 MB
- Format:
- Adobe Portable Document Format
- Description:
- Master's thesis
- Name:
- PortilloQuintero_ActadeGradoyDeclaracionAutoriaPDFA.pdf
- Size:
- 605.58 KB
- Format:
- Adobe Portable Document Format
- Description:
- Degree certificate and declaration of authorship
- Name:
- CartaAutorizacionTesis-CON firmada.pdf
- Size:
- 110.21 KB
- Format:
- Adobe Portable Document Format
- Description:
License bundle
- Name:
- license.txt
- Size:
- 1.3 KB
- Format:
- Item-specific license agreed upon to submission
- Description:

