Master's thesis

Enhancing single-cell and spatial transcriptomics analysis: the role of imputation and feature selection

Abstract

Single-cell RNA sequencing (scRNA-seq) and spatial transcriptomics have revolutionized our understanding of cellular heterogeneity and tissue organization. However, extracting biological insights from these technologies remains challenging because the data are high-dimensional, sparse, and noisy. Two critical but understudied problems hinder robust analysis: (1) the impact of feature selection strategies on cell-type identification, and (2) the role of data imputation in integrating scRNA-seq with spatial transcriptomics. While clustering and integration methods are widely benchmarked, the influence of pre-processing decisions, such as using biologically informed marker genes or imputing missing values, remains poorly understood. This thesis addresses these knowledge gaps through systematic evaluations across diverse datasets and algorithms. First, we assess how different imputation algorithms (MAGIC, DCA, scPHENIX) affect the integration of scRNA-seq with spatial transcriptomics in two tasks: cell-type deconvolution and spatial transcript prediction. Using 13 paired datasets and 10 integration tools, we find that the benefits of imputation are highly context-dependent, varying with the task and the algorithm, rather than universal. SpaGE consistently outperformed other methods for transcript prediction regardless of imputation status, while RCTD demonstrated superior performance for cell-type deconvolution. Notably, imputation primarily enhanced magnitude estimation rather than the preservation of spatial patterns. Second, we evaluate whether marker gene-based feature selection improves scRNA-seq clustering accuracy compared to standard approaches. By benchmarking seven algorithms (including Seurat, SC3, and CIDR) across five pancreatic datasets, we demonstrate that performance gains are algorithm- and dataset-dependent. SC3 and TSCAN benefited from marker gene selection across multiple datasets, while SIMLR showed dramatic dataset-dependent responses, yielding superior adjusted Rand index (ARI) scores (greater than 0.7) in some contexts but diminished performance in others. The Segerstolpe dataset showed consistent improvements across most algorithms when using marker genes, suggesting that dataset-specific characteristics strongly influence the optimal feature selection strategy. Our analysis further revealed that algorithms often identify fewer clusters than the reference annotations contain, indicating challenges in resolving fine-grained heterogeneity among pancreatic cell types. The results of this thesis emphasize that pre-processing choices must align with both analytical goals and dataset characteristics to unlock the full potential of single-cell technologies. This work provides an evidence-based framework for optimizing spatial transcriptomics and scRNA-seq analysis workflows, with implications for understanding tissue architecture and cellular dynamics across diverse biological systems.
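
The abstract reports clustering accuracy as ARI scores and notes that algorithms often find fewer clusters than the reference annotation. As a rough illustration of how such a score behaves, the sketch below computes the adjusted Rand index from its standard contingency-table definition using only the Python standard library. It is not the thesis's evaluation code; the function name `adjusted_rand_index` and the toy labels are assumptions made up for this example and do not come from any of the benchmarked datasets.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same cells."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("both labelings must cover the same cells")

    total_pairs = comb(len(labels_true), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two cells: the partitions trivially agree

    # Pairs of cells grouped together in both labelings, in the
    # reference only, and in the prediction only.
    together_both = sum(
        comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values()
    )
    together_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    together_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())

    # Agreement expected by chance, and the maximum attainable agreement.
    expected = together_true * together_pred / total_pairs
    maximum = 0.5 * (together_true + together_pred)
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both labelings have one cluster
    return (together_both - expected) / (maximum - expected)


# Hypothetical toy data: three annotated cell types, but the clustering
# merges two of them, so it reports fewer clusters than the reference.
reference = ["alpha", "alpha", "beta", "beta", "delta", "delta"]
merged = [0, 0, 1, 1, 1, 1]

ari_merged = adjusted_rand_index(reference, merged)
ari_perfect = adjusted_rand_index(reference, reference)
print(f"merged clusters: ARI = {ari_merged:.2f}")
print(f"perfect match:   ARI = {ari_perfect:.2f}")
```

In this toy case, merging two of the three reference types lowers the ARI to roughly 0.44, while a perfect match scores 1.0. Because the index is corrected for chance, a clustering that is no better than random scores near zero.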

Description

https://orcid.org/0000-0003-1303-0834
