Domain-adapted pretraining and topic modeling for identifying skills categories in job postings
Abstract
The need to identify and cluster related skills in job postings has become increasingly essential as the labor market grows more complex, driven by the rapid growth of job market data and continuous shifts in economic conditions, technology, and skill requirements. This task is especially challenging for postings in low-resource languages such as Spanish, where models specifically trained to handle these language variations are scarce. Previous work relies on expert-curated taxonomies such as ESCO, intended to serve as reference points against which skills are measured. However, these systems depend on region-specific taxonomies and adapt poorly to a changing labor market. We therefore propose a method to improve skill identification within the Mexican automotive industry by grouping equivalent skills in Spanish job postings, integrating text normalization, a Domain-Adaptive Pre-training (DAPT) Spanish BERT model, BERTopic for pseudo-label extraction, VGCN embeddings for improved vocabulary representation, and similarity metrics such as keyword overlap and cosine similarity for the final refined clustering. We evaluate our approach with the Adjusted Rand Index (ARI) for skill classification on a dataset exhibiting a long-tail distribution, reporting results on both the head and tail of the distribution against an initial non-DAPT baseline, since, to the best of our knowledge, no existing approach is directly comparable to either our ensemble model or the distribution of our dataset.
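To make the clustering stage concrete, below is a minimal, illustrative Python sketch of the pseudo-label-and-refine idea. The toy postings, the gold labels, the multilingual MiniLM encoder (standing in for the domain-adapted Spanish BERT), and the 0.8 merge threshold are all assumptions for demonstration; the VGCN embedding and keyword-overlap steps of the full pipeline are omitted, with centroid cosine similarity alone serving as the refinement proxy.

```python
import numpy as np
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from sklearn.metrics import adjusted_rand_score
from sklearn.metrics.pairwise import cosine_similarity
from umap import UMAP

# Toy Spanish skill phrases (illustrative only; a real run would use
# thousands of normalized postings from the automotive corpus).
docs = [
    "manejo de torno CNC",
    "operacion de torno de control numerico",
    "soldadura MIG y TIG",
    "soldadura por arco electrico",
    "manejo de montacargas",
    "operacion de montacargas electrico",
    "lectura de planos mecanicos",
    "interpretacion de dibujo tecnico",
]
gold = [0, 0, 1, 1, 2, 2, 3, 3]  # hypothetical expert skill groups

# Stand-in encoder; the paper uses a Spanish BERT further pre-trained
# (DAPT) on domain text. Swap in that checkpoint where available.
encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

# Small-data UMAP settings so the sketch runs on a tiny corpus.
umap_model = UMAP(n_neighbors=2, n_components=2, min_dist=0.0,
                  random_state=42)

# BERTopic assigns each phrase a topic id, used here as a pseudo-label
# (topic -1 marks outliers).
topic_model = BERTopic(embedding_model=encoder, umap_model=umap_model,
                       min_topic_size=2)
pseudo_labels, _ = topic_model.fit_transform(docs)

# Refinement: merge pseudo-labels whose embedding centroids are highly
# similar -- a simplified proxy for the paper's combined keyword-overlap
# and cosine-similarity criterion.
embeddings = encoder.encode(docs)
labels = np.array(pseudo_labels)
unique = sorted(set(labels))
centroids = {t: embeddings[labels == t].mean(axis=0) for t in unique}
for i, a in enumerate(unique):
    for b in unique[i + 1:]:
        sim = cosine_similarity([centroids[a]], [centroids[b]])[0, 0]
        if sim > 0.8:  # assumed threshold, tuned in practice
            labels[labels == b] = a

# Evaluation against the (hypothetical) expert grouping.
print("ARI:", adjusted_rand_score(gold, labels))
```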