Master's thesis

Domain-adapted pretraining and topic modeling for identifying skills categories in job postings

Abstract

Identifying and clustering related skills in job postings has become increasingly important as the labor market grows more complex, driven by the rapid growth of job market data and continuous shifts in economic conditions, technology, and skill requirements. The task is especially challenging for postings in low-resource languages such as Spanish, for which few models have been trained specifically to handle these language variations. Previous work relies on expert-built taxonomies such as ESCO, which serve as reference points against which skills are measured. These systems, however, depend on region-specific taxonomies and are too rigid to adapt to a changing market. We therefore propose a method to improve skill identification within the Mexican automotive industry by grouping equivalent skills found in Spanish job postings. The method integrates text normalization, a Spanish BERT model with Domain-Adaptive Pre-training (DAPT), BERTopic for pseudo-label extraction, VGCN embeddings for improved vocabulary representation, and similarity metrics such as keyword overlap and cosine similarity for the final refined clustering. We evaluate the approach with the Adjusted Rand Index (ARI) for skill classification on a dataset with a long-tail distribution, covering both head and tail data, and compare the results against an initial non-DAPT model because, to the best of our knowledge, no directly comparable approach exists for either our ensemble model or the distribution of our dataset.
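The similarity metrics and the evaluation score named in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the abstract does not say how keyword overlap is computed, so the Jaccard ratio used here is an assumption, and the function names `cosine_similarity`, `keyword_overlap`, and `adjusted_rand_index` are hypothetical. The ARI follows the standard pair-counting definition over the contingency table of true and predicted labels.

```python
from collections import Counter
from math import comb, sqrt


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def keyword_overlap(keywords_a, keywords_b):
    """Jaccard overlap between two keyword collections (assumed definition)."""
    a, b = set(keywords_a), set(keywords_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index computed from pair counts."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label sequences must have the same length")
    n = len(labels_true)
    # Pairs of items that share a cell, a true class, or a predicted cluster.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    total_pairs = comb(n, 2)
    expected = sum_true * sum_pred / total_pairs if total_pairs else 0.0
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. every item in a single cluster
    return (sum_cells - expected) / (max_index - expected)
```

Because the ARI depends only on which items are grouped together, renaming cluster labels leaves the score unchanged: `adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])` evaluates to 1.0, while a grouping that splits every true pair scores below zero.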

Description

https://orcid.org/0000-0002-2460-3442
