Domain-adapted pretraining and topic modeling for identifying skills categories in job postings

dc.audience.educationlevel: Investigadores/Researchers
dc.audience.educationlevel: Estudiantes/Students
dc.contributor.advisor: Ceballos Cancino, Héctor Gibrán
dc.contributor.advisor: Vázquez Lepe, Elisa Virginia
dc.contributor.author: Madera Espíndola, Diana Patricia
dc.contributor.cataloger: mtyahinojosa, emipsanchez
dc.contributor.committeemember: González Gómez, Luis José
dc.contributor.committeemember: Fahim Siddiqui, Muhammad Hammad
dc.contributor.committeemember: Cantú Ortiz, Francisco Javier
dc.contributor.department: Escuela de Ingeniería y Ciencias
dc.contributor.institution: Campus Estado de México
dc.contributor.mentor: Butt, Sabur
dc.date.accepted: 2025-11-21
dc.date.accessioned: 2025-12-10T17:25:35Z
dc.date.embargoenddate: 2026-12
dc.date.issued: 2025-12-05
dc.description: https://orcid.org/0000-0002-2460-3442
dc.description.abstract: Identifying and clustering related skills in job postings has become increasingly important as the labor market grows more complex, driven by the rapid growth of job market data and continuous shifts in economic conditions, technology, and skill requirements. The task is especially challenging for postings in low-resource languages such as Spanish, because few models are trained specifically to handle these language variations. Previous work relies on expert-built taxonomies such as ESCO, intended to serve as reference points against which skills are measured. These systems, however, depend on region-specific taxonomies and are too rigid to adapt to a changing market. We therefore propose a method to improve skill identification within the Mexican automotive industry by grouping equivalent skills found in Spanish job postings. The method combines text normalization, a Spanish BERT model with Domain-Adaptive Pre-training (DAPT), BERTopic for pseudo-label extraction, VGCN embeddings to improve vocabulary representation, and similarity metrics such as keyword overlap and cosine similarity for the final refined clustering. We evaluate the approach with the Adjusted Rand Index (ARI) for skill classification on a dataset with a long-tail distribution, reporting results on both the head and the tail of the data, and we compare it against an initial non-DAPT model because, to the best of our knowledge, no existing approach is directly comparable to either our ensemble model or the distribution of our dataset.
dc.description.degree: Master's in Computer Science (Maestría en Ciencias Computacionales)
dc.format.medium: Text
dc.identificator: 120318||120318||330405||570104||570102||120304
dc.identifier.cvu: 1347681
dc.identifier.orcid: https://orcid.org/0009-0007-5040-6171
dc.identifier.uri: https://hdl.handle.net/11285/705193
dc.language.iso: eng
dc.publisher: Instituto Tecnológico y de Estudios Superiores de Monterrey
dc.relation.isFormatOf: acceptedVersion
dc.rights: openAccess
dc.rights.embargoreason: By policy, theses in the Exact Sciences and Health Sciences are under embargo for 1 year
dc.rights.uri: http://creativecommons.org/licenses/by-nc-nd/4.0
dc.subject.classification: PHYSICAL-MATHEMATICAL AND EARTH SCIENCES::MATHEMATICS::COMPUTER SCIENCE::INFORMATION SYSTEMS, DESIGN AND COMPONENTS
dc.subject.classification: HUMANITIES AND BEHAVIORAL SCIENCES::LINGUISTICS::APPLIED LINGUISTICS::COMPUTATIONAL LINGUISTICS
dc.subject.classification: HUMANITIES AND BEHAVIORAL SCIENCES::LINGUISTICS::APPLIED LINGUISTICS::AUTOMATED DOCUMENTATION
dc.subject.classification: PHYSICAL-MATHEMATICAL AND EARTH SCIENCES::MATHEMATICS::COMPUTER SCIENCE::ARTIFICIAL INTELLIGENCE
dc.subject.keyword: Skill Identification
dc.subject.keyword: Skill Classification
dc.subject.keyword: Unsupervised Clustering
dc.subject.keyword: Domain-Adaptive Pretraining
dc.subject.keyword: BERTopic
dc.subject.lcsh: Technology
dc.title: Domain-adapted pretraining and topic modeling for identifying skills categories in job postings
dc.type: Master's thesis
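
Note on the evaluation metric: the abstract names the Adjusted Rand Index (ARI) as the score used to compare clusterings. As an illustration only, the sketch below computes ARI from its standard pair-counting definition using the Python standard library. The function name and the toy label lists are hypothetical and are not taken from the thesis; the thesis may well use a library implementation instead.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Pair-counting ARI between two flat clusterings of the same items."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("both labelings must cover the same items")

    n = len(labels_true)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: the partitions trivially agree

    # Contingency counts: items sharing (true cluster, predicted cluster).
    joint = Counter(zip(labels_true, labels_pred))
    true_sizes = Counter(labels_true)
    pred_sizes = Counter(labels_pred)

    # Number of item pairs grouped together in both, in true, and in pred.
    index = sum(comb(c, 2) for c in joint.values())
    sum_true = sum(comb(c, 2) for c in true_sizes.values())
    sum_pred = sum(comb(c, 2) for c in pred_sizes.values())

    # Expected agreement under random labeling, and the maximum attainable.
    expected = sum_true * sum_pred / total_pairs
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. every item in a single cluster
    return (index - expected) / (max_index - expected)


# Toy labels (hypothetical): cluster ids are arbitrary, only groupings matter.
reference = [0, 0, 1, 1]
same_grouping = [1, 1, 0, 0]
crossed_grouping = [0, 1, 0, 1]

score_same = adjusted_rand_index(reference, same_grouping)
score_crossed = adjusted_rand_index(reference, crossed_grouping)
```

Because ARI is corrected for chance, a relabeled but identical grouping scores 1.0, while a grouping that agrees less than chance would predict scores below zero.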

Files

Original bundle

Name: MaderaEspíndola_CartadeAutorización.pdf
Size: 157.45 KB
Format: Adobe Portable Document Format
Description: Authorization letter

Name: MaderaEspíndola_TesisMaestríaOriginal.pdf
Size: 3.45 MB
Format: Adobe Portable Document Format
Description: Original thesis

Name: MaderaEspíndola_HojadeFirmas.pdf
Size: 212.5 KB
Format: Adobe Portable Document Format
Description: Signature sheet

Name: MaderaEspíndolaDiana_TesisMaestría.pdf
Size: 3.2 MB
Format: Adobe Portable Document Format
Description: Master's thesis

License bundle

Name: license.txt
Size: 1.28 KB
Format: Item-specific license agreed upon to submission
Description: logo
