Domain-adapted pretraining and topic modeling for identifying skills categories in job postings

Ceballos Cancino, Héctor Gibrán
Vázquez Lepe, Elisa Virginia
Madera Espíndola, Diana Patricia
2025-12-10
2025-12-05
https://hdl.handle.net/11285/705193
https://orcid.org/0000-0002-2460-3442

Identifying and clustering related skills in job postings has become increasingly important as the labor market grows more complex, driven by the rapid growth of job market data and continuous shifts in economic conditions, technology, and skill requirements. The task is especially challenging for postings in low-resource languages such as Spanish, because few models are specifically trained to handle these language variations. Previous work relies on expert-built taxonomies such as ESCO, intended to be used as reference points via measured skills. However, these systems depend on region-specific taxonomies and are too rigid to adapt to a changing market. We therefore propose a method to improve skill identification within the Mexican automotive industry by grouping equivalent skills found in Spanish job postings. The method integrates text normalization, a Spanish BERT model with Domain-Adaptive Pre-training (DAPT), BERTopic for pseudo-label extraction, VGCN embeddings to improve vocabulary representation, and similarity metrics such as keyword overlap and cosine similarity for the final refined clustering.
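The two similarity metrics named above can be illustrated with a minimal sketch. The abstract does not say how keyword overlap is computed or how the two scores are combined, so this sketch assumes a Jaccard overlap on whitespace tokens and plain cosine similarity on toy vectors; the function names and the example phrases are hypothetical and are not taken from the thesis.

```python
import math


def keyword_overlap(phrase_a, phrase_b):
    """Jaccard overlap between the lowercase token sets of two skill phrases.

    Assumption: the thesis may define keyword overlap differently.
    """
    tokens_a = set(phrase_a.lower().split())
    tokens_b = set(phrase_b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical skill phrases: 3 shared tokens out of 4 distinct tokens.
overlap = keyword_overlap("manejo de Excel", "manejo avanzado de Excel")

# Toy vectors standing in for model embeddings of two skill phrases.
similarity = cosine_similarity([1.0, 0.0, 1.0], [1.0, 1.0, 1.0])
```

In a real pipeline the vectors would come from the domain-adapted model rather than being written by hand, and the threshold at which two skills count as equivalent would need to be tuned on the data.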
The scope of this research is to evaluate our approach using the Adjusted Rand Index (ARI) for skill classification on a dataset with a long-tail distribution, covering both head and tail data. We compare the results against an initial non-DAPT model because, to the best of our knowledge, no existing approach is directly comparable to either our ensemble model or the distribution of our dataset.

Text
eng
openAccess
http://creativecommons.org/licenses/by-nc-nd/4.0
PHYSICAL-MATHEMATICAL SCIENCES AND EARTH SCIENCES::MATHEMATICS::COMPUTER SCIENCE::INFORMATION SYSTEMS, DESIGN AND COMPONENTS
HUMANITIES AND BEHAVIORAL SCIENCES::LINGUISTICS::APPLIED LINGUISTICS::COMPUTATIONAL LINGUISTICS
HUMANITIES AND BEHAVIORAL SCIENCES::LINGUISTICS::APPLIED LINGUISTICS::AUTOMATED DOCUMENTATION
PHYSICAL-MATHEMATICAL SCIENCES AND EARTH SCIENCES::MATHEMATICS::COMPUTER SCIENCE::ARTIFICIAL INTELLIGENCE
Technology
Master's thesis
By policy, theses in Exact Sciences and Health Sciences are under embargo for 1 year.
https://orcid.org/0009-0007-5040-6171
Skill Identification, Skill Classification, Unsupervised Clustering, Domain-Adaptive Pretraining, BERTopic
1347681