Strategies for characterizing the regulatory code of the human genome
2023
Thesis
The regulatory code of the genome refers to the mechanisms that govern differential gene expression across diverse cell types in the human body. Epigenome profiling has nominated millions of regulatory elements (REs) within the non-coding genome as potential mediators of this context-dependent gene regulation. However, how genomic sequence encodes the function of these REs remains poorly understood. In this thesis, I describe computational and experimental strategies for characterizing these sequence determinants, with a focus on the use and development of machine learning models for regulatory genomics and on CRISPR-based perturbations. In chapter 1, we introduce a tool for detecting sample swaps within large-scale, multi-donor functional genomics databases. Such swaps can confound integrative genomics analyses and machine learning model development. The approach quantifies the sample-relatedness of diverse sequencing datasets by utilizing linkage disequilibrium between genetic variants, which allows comparison between non-overlapping reads. By combining this approach with ancestry-agnostic haplotype maps, we achieve reliable sample-swap detection across different assays and sample read-depths. Application of this method led to the identification and correction of sample swaps in nearly 1% of ENCODE datasets. The tool is scalable for the systematic validation of the rapidly expanding body of genomic data that can provide a foundation for deep learning-based and experimental investigations of the genome's regulatory code, as proposed in the subsequent chapters. In chapter 2, we describe a framework for dissecting regulatory elements at base resolution, using a combination of genetic and epigenetic CRISPR-based perturbations as well as a sequence-based deep learning model. We use these complementary strategies to characterize a differentially accessible enhancer upstream of the key T-cell activation gene, CD69. These approaches converge on a $\sim$170-base interval critical for CD69 induction in stimulated Jurkat T cells. Individual cytosine-to-thymine base edits within the interval reduce element accessibility and acetylation, with a corresponding reduction of CD69 expression. The most potent base edits likely impact regulatory interactions between the transcriptional activators GATA3 and TAL1 and the repressor BHLHE40. Systematic analysis suggests that interplay between GATA3 and BHLHE40 plays a general role in rapid T cell transcriptional responses. This chapter provides a framework for parsing regulatory elements in their endogenous chromatin contexts and identifying operative artificial variants. Furthermore, it demonstrates the utility of sequence-based deep learning models for parsing regulatory element function and provides early benchmarking against experimental data for these emerging tools. In chapter 3, we build upon the sequence-based deep learning model utilized in chapter 2. A limitation of this model is that it cannot generalize to cell types unobserved during training. To address this, we introduce a multi-modal transformer-based neural network and propose a novel masking-based pre-training objective to learn joint representations of sequence and cell-type-specific ATAC-seq signal. We demonstrate that this approach yields useful embeddings that can be used for downstream tasks, including gene expression prediction in de novo cellular contexts.
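The idea from chapter 1 of comparing datasets whose reads do not cover the same variants can be illustrated with a toy sketch. This is not the thesis's tool. It assumes a hypothetical map `snp_to_block` from variants to haplotype blocks, assumes alleles are coded so that `1` tags the same haplotype for every variant in a block (a strong simplification of linkage disequilibrium), and uses invented variant and block names.

```python
from typing import Dict, Optional


def collapse_to_blocks(obs: Dict[str, int],
                       snp_to_block: Dict[str, str]) -> Dict[str, int]:
    """Summarize per-variant allele observations (0 or 1) as one call per block.

    Assumption: within a block, allele code 1 always tags the same haplotype,
    so observations at different variants in a block are interchangeable.
    """
    per_block: Dict[str, list] = {}
    for snp, allele in obs.items():
        block = snp_to_block.get(snp)
        if block is None:
            continue  # variant not covered by the (hypothetical) haplotype map
        per_block.setdefault(block, []).append(allele)
    # Majority vote per block; ties are resolved toward 1.
    return {b: int(2 * sum(v) >= len(v)) for b, v in per_block.items()}


def block_concordance(a: Dict[str, int], b: Dict[str, int]) -> Optional[float]:
    """Fraction of shared blocks on which two datasets agree (None if no overlap)."""
    shared = a.keys() & b.keys()
    if not shared:
        return None
    return sum(a[k] == b[k] for k in shared) / len(shared)


# Hypothetical haplotype map: two variants per block.
snp_to_block = {"s1": "B1", "s2": "B1", "s3": "B2",
                "s4": "B2", "s5": "B3", "s6": "B3"}

# Two assays from the same donor that cover disjoint variants,
# plus one dataset from a different donor.
rna_seq = {"s1": 1, "s3": 0, "s5": 1}
chip_seq = {"s2": 1, "s4": 0, "s6": 1}
other_donor = {"s2": 0, "s4": 1, "s6": 1}

same = block_concordance(collapse_to_blocks(rna_seq, snp_to_block),
                         collapse_to_blocks(chip_seq, snp_to_block))
diff = block_concordance(collapse_to_blocks(rna_seq, snp_to_block),
                         collapse_to_blocks(other_donor, snp_to_block))
```

Although `rna_seq` and `chip_seq` share no variants, they can still be compared once observations are collapsed to blocks. A dataset labeled with the same donor but showing low concordance would be a candidate sample swap. A realistic implementation would need genotype likelihoods, read-depth weighting, and a statistically calibrated threshold.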
Title: | Strategies for characterizing the regulatory code of the human genome |
---|---|
Author / Contributor: | Javed, Nauman Muhammad |
Publication: | 2023 |
Media type: | Thesis |