Pipeline evaluation of clustering algorithms aimed at clinical data

dc.contributor.advisor: Temez Peña, José G. [en_US]
dc.contributor.author: Duarte Dyck, David Absalón [en_US]
dc.contributor.committeemember: Terashima Marín, Hugo [en_US]
dc.contributor.committeemember: Treviño Alvarado, Víctor M. [en_US]
dc.date.accessioned: 2018-05-28T17:23:36Z
dc.date.available: 2018-05-28T17:23:36Z
dc.date.issued: 2018-05-22
dc.description.abstract: Understanding a disease is key to designing effective treatments and diagnostic tools. An important part of this understanding is grouping patients according to their phenotypes. Phenotypes are patterns in the characteristics of certain members of a population that are correlated with a particular illness. This grouping may reveal associations between disease risk, treatment responses, and other key clinical outcomes. Once these associations are found, it is easier to design tailored diagnostic tools and effective personalized treatments. Grouping patients this way depends on data, and recent advances in digital technology have made it possible to capture hundreds or thousands of clinical measurements that may be used to group patients into different disease phenotypes. To handle hundreds of patients with hundreds of features, clinical researchers use clustering algorithms that automatically find hidden associations between subjects. These algorithms are very useful once the researcher selects the correct clustering algorithm and configures it for the specific research task. Selecting the correct clustering algorithm is time-consuming, and setting its parameters may take several trial-and-error sessions. On the other hand, computer scientists have developed several clustering metrics that evaluate how well a clustering algorithm fits the data, and computing power has increased, allowing automated testing and evaluation of clustering algorithms on a specific data set. The objective of this work was the development of an automated computer pipeline that evaluates several clustering algorithms and provides metrics on important features such as clustering stability (Jaccard index) and clustering relevance (ANOVA test). Furthermore, the pipeline returns the number of natural clusters that may be useful for the given dataset (Dunn index).
The designed pipeline was set up to evaluate the classical clustering algorithms k-means, fuzzy C-means, and hierarchical clustering, but it can also be used to test a user-provided clustering method. The evaluation consisted of bootstrapping the data and computing the Dunn and Jaccard clustering indexes from the resamples. Furthermore, the clinical relevance of the final clusters was evaluated using an ANOVA test, which provided indications of disease phenotypes. All test results are plotted so that the user can visually evaluate the performance of the different clustering methods on their data. The resulting tool was deployed in R (github.com/majordave/clustest). The utility of the pipeline was tested on synthetic data sets and on two radiomics datasets associated with the development of osteoarthritis (OA) and the presence of breast cancer in mammograms. Furthermore, we contrasted the clustering approach with supervised learning on a large dataset relating nutrition to OA symptoms. Hence, the present work established that automated, robust evaluation of the utility of clustering algorithms on clinical data is feasible, and it provided a publicly available software tool that clinical researchers can use to select the best clustering algorithm for their data.
dc.identifier.uri: http://hdl.handle.net/11285/629898
dc.language.iso: eng [en_US]
dc.publisher: Instituto Tecnológico y de Estudios Superiores de Monterrey [esp]
dc.rights: Open Access [en_US]
dc.rights.uri: http://creativecommons.org/licenses/by-nc-nd/3.0/us/ [*]
dc.subject.discipline: Ingeniería y Ciencias Aplicadas / Engineering & Applied Sciences [en_US]
dc.subject.keyword: Clustering [en_US]
dc.subject.keyword: Machine Learning [en_US]
dc.subject.keyword: algorithms [en_US]
dc.title: Pipeline evaluation of clustering algorithms aimed at clinical data [en_US]
dc.type: Master's thesis
refterms.dateFOA: 2018-05-28T17:23:36Z
thesis.degree.discipline: Escuela de Ingeniería y Ciencias [en_US]
thesis.degree.level: Master of Science with a Specialty in Intelligent Systems [en_US]
thesis.degree.name: Master's in Intelligent Systems [en_US]
thesis.degree.program: Campus Monterrey [en_US]
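
The evaluation described in the abstract (bootstrap the data, then score the clusters with the Jaccard and Dunn indexes) can be illustrated with a minimal sketch. This is not the clustest R package and makes no claim about how the thesis implements these steps. It is written in Python with only the standard library, and every function name (`single_linkage`, `jaccard`, `dunn_index`, `bootstrap_stability`) is hypothetical. It assumes single-linkage hierarchical clustering as the method under test, treats each bootstrap resample as its set of unique points, and leaves out the ANOVA relevance test.

```python
import math
import random


def single_linkage(points, k):
    """Merge the two closest clusters until k remain; returns lists of point indices."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(math.dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters.pop(b))
    return clusters


def jaccard(a, b):
    """Jaccard similarity |A n B| / |A u B| of two sets of sample indices."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def dunn_index(points, clusters):
    """Smallest between-cluster distance over largest cluster diameter.

    Needs at least two clusters; higher values mean compact, well-separated groups.
    """
    diameter = max(math.dist(points[i], points[j])
                   for c in clusters for i in c for j in c)
    separation = min(math.dist(points[i], points[j])
                     for a, ca in enumerate(clusters)
                     for cb in clusters[a + 1:]
                     for i in ca for j in cb)
    return separation / diameter if diameter > 0 else math.inf


def bootstrap_stability(points, k, n_boot=20, seed=0):
    """Mean best-match Jaccard score of each original cluster over bootstrap resamples."""
    rng = random.Random(seed)
    n = len(points)
    base = [set(c) for c in single_linkage(points, k)]
    totals = [0.0] * len(base)
    for _ in range(n_boot):
        # Assumption: duplicates drawn by the bootstrap are collapsed to unique indices.
        idx = sorted(set(rng.choices(range(n), k=n)))
        sub = [points[i] for i in idx]
        boot = [{idx[j] for j in c} for c in single_linkage(sub, k)]
        present = set(idx)
        for c, members in enumerate(base):
            # Compare only the part of the original cluster that was resampled.
            totals[c] += max(jaccard(members & present, s) for s in boot)
    return [t / n_boot for t in totals]


if __name__ == "__main__":
    # Two synthetic, well-separated groups of four points each.
    pts = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
    found = single_linkage(pts, 2)
    print("clusters:", found)
    print("Dunn index:", dunn_index(pts, found))
    print("stability per cluster:", bootstrap_stability(pts, 2))
```

In a sketch like this, comparing the Dunn index across several values of k is one way to look for a natural number of clusters, while per-cluster stability scores close to 1 suggest that a cluster survives resampling.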

Files

Original bundle

Name: Final-MIT-MasterThesis.pdf
Size: 3.7 MB
Format: Adobe Portable Document Format
Description: Master Thesis - Final Version
Name: Carta de Autorización David.pdf
Size: 73.46 KB
Format: Adobe Portable Document Format
Description: Authorization letter

License bundle

Name: license.txt
Size: 1.3 KB
Format: Item-specific license agreed upon to submission
