Domain-adapted pretraining and topic modeling for identifying skills categories in job postings
| dc.audience.educationlevel | Investigadores/Researchers | |
| dc.audience.educationlevel | Estudiantes/Students | |
| dc.contributor.advisor | Ceballos Cancino, Héctor Gibrán | |
| dc.contributor.advisor | Vázquez Lepe, Elisa Virginia | |
| dc.contributor.author | Madera Espíndola, Diana Patricia | |
| dc.contributor.cataloger | mtyahinojosa, emipsanchez | |
| dc.contributor.committeemember | González Gómez, Luis José | |
| dc.contributor.committeemember | Fahim Siddiqui, Muhammad Hammad | |
| dc.contributor.committeemember | Cantú Ortiz, Francisco Javier | |
| dc.contributor.department | Escuela de Ingeniería y Ciencias | |
| dc.contributor.institution | Campus Estado de México | |
| dc.contributor.mentor | Butt, Sabur | |
| dc.date.accepted | 2025-11-21 | |
| dc.date.accessioned | 2025-12-10T17:25:35Z | |
| dc.date.embargoenddate | 2026-12 | |
| dc.date.issued | 2025-12-05 | |
| dc.description | https://orcid.org/0000-0002-2460-3442 | |
| dc.description.abstract | Identifying and clustering related skills in job postings has become increasingly essential as the labor market grows more complex, driven by the rapid growth in job market data and continuous shifts in economic conditions, technology, and skill requirements. This task is especially challenging for postings in low-resource languages such as Spanish, as there is a lack of models specifically trained to handle these language variations. Previous work relies on expert-built taxonomies such as ESCO, intended to serve as reference points for measuring skills. However, these systems depend on region-specific taxonomies and are too rigid to adapt to a changing market. We therefore propose a method to improve skill identification within the Mexican automotive industry by grouping equivalent skills found in Spanish job postings. The method integrates text normalization, a Spanish BERT model with Domain-Adaptive Pre-training (DAPT), BERTopic for pseudo-label extraction, VGCN embeddings to improve vocabulary representation, and similarity metrics such as keyword overlap and cosine similarity for the final refined clustering. We evaluate the approach using the Adjusted Rand Index (ARI) for skill classification on a dataset with a long-tail distribution, covering both head and tail data, and compare the results with those of an initial non-DAPT model since, to the best of our knowledge, no directly comparable approach exists for either our ensemble model or the distribution of our dataset. | |
| dc.description.degree | Maestría en Ciencias Computacionales | |
| dc.format.medium | Text | |
| dc.identificator | 120318||120318||330405||570104||570102||120304 | |
| dc.identifier.cvu | 1347681 | |
| dc.identifier.orcid | https://orcid.org/0009-0007-5040-6171 | |
| dc.identifier.uri | https://hdl.handle.net/11285/705193 | |
| dc.language.iso | eng | |
| dc.publisher | Instituto Tecnológico y de Estudios Superiores de Monterrey | |
| dc.relation.isFormatOf | acceptedVersion | |
| dc.rights | openAccess | |
| dc.rights.embargoreason | By policy, theses in Exact Sciences and Health Sciences are under embargo for 1 year | |
| dc.rights.uri | http://creativecommons.org/licenses/by-nc-nd/4.0 | |
| dc.subject.classification | CIENCIAS FÍSICO MATEMÁTICAS Y CIENCIAS DE LA TIERRA::MATEMÁTICAS::CIENCIA DE LOS ORDENADORES::SISTEMAS DE INFORMACIÓN, DISEÑO Y COMPONENTES | |
| dc.subject.classification | HUMANIDADES Y CIENCIAS DE LA CONDUCTA::LINGÜÍSTICA::LINGÜÍSTICA APLICADA::LINGÜÍSTICA INFORMATIZADA | |
| dc.subject.classification | HUMANIDADES Y CIENCIAS DE LA CONDUCTA::LINGÜÍSTICA::LINGÜÍSTICA APLICADA::DOCUMENTACIÓN AUTOMATIZADA | |
| dc.subject.classification | CIENCIAS FÍSICO MATEMÁTICAS Y CIENCIAS DE LA TIERRA::MATEMÁTICAS::CIENCIA DE LOS ORDENADORES::INTELIGENCIA ARTIFICIAL | |
| dc.subject.keyword | Skill Identification | |
| dc.subject.keyword | Skill Classification | |
| dc.subject.keyword | Unsupervised Clustering | |
| dc.subject.keyword | Domain-Adaptive Pretraining | |
| dc.subject.keyword | BERTopic | |
| dc.subject.lcsh | Technology | |
| dc.title | Domain-adapted pretraining and topic modeling for identifying skills categories in job postings | |
| dc.type | Master's thesis |
Files
Original bundle
- Name:
- MaderaEspíndola_CartadeAutorización.pdf
- Size:
- 157.45 KB
- Format:
- Adobe Portable Document Format
- Description:
- Authorization Letter
- Name:
- MaderaEspíndola_TesisMaestríaOriginal.pdf
- Size:
- 3.45 MB
- Format:
- Adobe Portable Document Format
- Description:
- Original Thesis
- Name:
- MaderaEspíndola_HojadeFirmas.pdf
- Size:
- 212.5 KB
- Format:
- Adobe Portable Document Format
- Description:
- Signature Page
- Name:
- MaderaEspíndolaDiana_TesisMaestría.pdf
- Size:
- 3.2 MB
- Format:
- Adobe Portable Document Format
- Description:
- Master's Thesis
License bundle
- Name:
- license.txt
- Size:
- 1.28 KB
- Format:
- Item-specific license agreed upon to submission