Master's thesis

Two-fold Approach for Video Retrieval: Semantic Vectors to Guide Neural Network Training and Video Representation Approximation Via Language-Image Models


Abstract

Video Retrieval is a challenging task concerned with recovering relevant videos from a collection to fulfill a query. Defining relevance is an open problem in the Information Retrieval literature, since it is prone to subjective considerations. Text-based Video Retrieval systems calculate relevance by measuring the relationship between a textual query and video metadata. This approach is widely used, but it does not consider motion and video dynamics. Content-based methods, on the other hand, account for visual content in the retrieval process but can only operate with visual queries. This contrast raises the question of whether it is possible to build a Video Retrieval system that retrieves videos based on visual content while accepting textual queries. This thesis presents a method to bridge the semantic gap between video and text. The approach employs a Multimodal Machine Learning model capable of mapping multiple types of information onto one another. The connection between modalities occurs in a learned video-text space, where similarity between them can be measured. With such a trained system, the most similar videos to a query can be retrieved by computing the similarity between the vector representation of a text query and a collection of videos. The work presented in this thesis focuses on a Dual Encoder architecture that funnels video and text information through independent Neural Networks. These networks take advantage of pre-trained models for each modality, called backbones. Other authors have used word-level backbones to encode text; we claim this method restricts the descriptiveness of text. One contribution of this work is the implementation of a novel sentence-level embedding backbone. This method generates sentence vectors that represent the holistic phrasal meaning and has the added benefit of allowing the semantic similarity among sentences to be measured.
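As an illustrative sketch (not the thesis implementation), retrieval in a learned joint video-text space reduces to a nearest-neighbor search by similarity; the embeddings and dimensionality below are toy values, and cosine similarity is one common choice of measure:

```python
import numpy as np

def cosine_similarity(a, b):
    # Normalize rows to unit length, then take dot products: cos(a_i, b_j).
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

def retrieve(query_vec, video_vecs, k=2):
    # Rank videos by similarity to the text query in the shared space
    # and return the indices of the top-k matches.
    sims = cosine_similarity(query_vec[None, :], video_vecs)[0]
    return np.argsort(-sims)[:k]

# Toy joint space: 3 video embeddings and 1 text-query embedding, 4-dimensional.
videos = np.array([[1.0, 0.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0, 0.0],
                   [0.9, 0.1, 0.0, 0.0]])
query = np.array([1.0, 0.05, 0.0, 0.0])
top = retrieve(query, videos, k=2)  # most similar videos first
```

In a real system the embeddings would come from the trained video and text encoders; only the ranking step is shown here.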
A second research contribution is the use of similarity measurements in text to guide Neural Network training. A proposed Proxy Mining loss finds contrasting sentences and their corresponding videos to ground the training of the video-text space. Compared to video, the image modality has received more resources and research effort, given its ease of use. The possibility of leveraging those assets for video is considered. A third scientific contribution is the extension of the image-text representation called CLIP. This model is pre-trained to produce a fixed-size representation for images and text that allows for similarity measurement. By trying several aggregation methods, it was possible to collapse the temporal dimension inherent in videos and hence approximate a video-text representation. This yielded state-of-the-art results on the MSR-VTT and MSVD benchmarks. This document represents a thesis project for the degree of Master in Computer Science from Instituto Tecnológico y de Estudios Superiores de Monterrey.
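The temporal-aggregation idea can be sketched as follows. This is a minimal illustration assuming per-frame embeddings from a CLIP-like image encoder; mean and max pooling are shown as hypothetical examples of the "several aggregation methods" mentioned above, not necessarily the ones evaluated in the thesis:

```python
import numpy as np

def aggregate_frames(frame_embeddings, method="mean"):
    # frame_embeddings: (num_frames, dim) array of per-frame vectors
    # produced by an image encoder applied to sampled video frames.
    # Collapse the temporal axis into a single video-level vector.
    if method == "mean":
        video = frame_embeddings.mean(axis=0)
    elif method == "max":
        video = frame_embeddings.max(axis=0)
    else:
        raise ValueError(f"unknown aggregation method: {method}")
    # Re-normalize so the result lies on the unit sphere,
    # matching the normalized image/text embeddings.
    return video / np.linalg.norm(video)

frames = np.array([[0.6, 0.8],
                   [0.8, 0.6]])  # two toy frame embeddings
video_vec = aggregate_frames(frames, method="mean")
```

Because the aggregated vector lives in the same space as the text embeddings, the image-text similarity machinery carries over to video-text retrieval unchanged.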

Description

http://orcid.org/0000-0002-5320-0773


