Doctoral thesis

Feature selection from biological bigdata: identification of significant associations applying multivariate machine learning algorithms to genome-wide association studies (GWAS)

Abstract

Crohn's Disease (CD) is a type of Inflammatory Bowel Disease (IBD) that affects the gastrointestinal tract and presents with diverse symptoms. To date, Genome-Wide Association Studies (GWAS) have discovered over 140 genetic loci associated with CD. Standard univariate GWAS methods have enabled the discovery of minor effects from common variants, but they assume independence among variants, which can lead to missing subtle combinatorial signals. Given the importance of CD, multivariate approaches can help elucidate the etiology of the disease and facilitate the identification of novel associations. However, current univariate-based and multivariate CD models span a broad range of performance and have been assessed on different datasets under diverse methodological settings. In this work, other multivariate methods and models (LASSO, XGBoost, Random Forest, BSWiMS, and LDpred) were compared under a strict sub-sampling and cross-validation approach to predict CD risk in a GWAS dataset (de Lange et al. 2017). The predictions were explored to determine whether the generated models could provide additional information about variants and genes associated with CD. Additionally, the effect of common strategies for increasing and decreasing the number of SNP markers (genotype imputation and LD-clumping, respectively) was assessed. The LDpred model without imputation appears to be the best of the tested models for predicting Crohn's disease risk (AUROC = 0.667 ± 0.024) in this dataset. The best models were validated in a second dataset (NIDDK IBD Genetics), where LDpred was again the best method, with similar performance (AUROC = 0.634 ± 0.009). Finally, based on the variant importance yielded by the multivariate models, a previously unnoticed region was identified on chromosome 6, at SNP rs4945943 near the gene MARCKS, which appears to contribute to CD risk.
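The abstract reports model performance as AUROC values with a "±" spread obtained under sub-sampling and cross-validation. As a minimal sketch of how such a figure can be computed, the Python below evaluates AUROC for one held-out fold using the rank-sum (Mann-Whitney U) identity and then summarizes several folds as mean and sample standard deviation. The function names `auroc` and `summarize` are hypothetical, the toy inputs are not study data, and reading the "±" as a standard deviation is an assumption: the abstract does not say whether it denotes a standard deviation, a standard error, or a confidence interval.

```python
from statistics import mean, stdev


def auroc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) identity.

    labels: sequence of 0/1 values (1 = case, 0 = control)
    scores: sequence of predicted risk scores (higher = more likely a case)
    """
    pairs = sorted(zip(scores, labels))
    n_pos = sum(label for _, label in pairs)
    n_neg = len(pairs) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUROC needs at least one case and one control")

    # Sum the 1-based ranks of the cases; tied scores share their average rank.
    rank_sum_pos = 0.0
    i = 0
    while i < len(pairs):
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of ranks i+1 .. j
        rank_sum_pos += avg_rank * sum(label for _, label in pairs[i:j])
        i = j

    u_stat = rank_sum_pos - n_pos * (n_pos + 1) / 2.0
    return u_stat / (n_pos * n_neg)


def summarize(fold_aurocs):
    """Return (mean, sample standard deviation) across resampling folds."""
    return mean(fold_aurocs), stdev(fold_aurocs)


if __name__ == "__main__":
    # Toy held-out fold: two controls and two cases (illustrative only).
    fold_value = auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
    # Toy per-fold values standing in for repeated cross-validation runs.
    avg, spread = summarize([fold_value, 0.70, 0.65])
    print(f"AUROC = {avg:.3f} ± {spread:.3f}")
```

The rank-based form avoids building the ROC curve explicitly and handles tied scores, which are common when a model outputs a small number of distinct risk values.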

Description

https://orcid.org/0000-0002-7472-9844

The user is obliged to use the services and content provided by the University, in particular printed materials and electronic resources, in accordance with current legislation, the principles of good faith, and generally accepted practice, without contravening public order in doing so, especially where, for the proper performance of their activity, they need to reproduce, distribute, communicate, and/or make available fragments of printed works or of works that may exist in analog or digital format, whether on paper or in electronic form. Law 23/2006, of 7 July, amending the revised text of the Intellectual Property Law, approved

License