DNA language models are powerful predictors of genome-wide variant effects.

Benegas, Gonzalo ; Batra, Sanjit Singh ; et al.

In: Proceedings of the National Academy of Sciences of the United States of America, Vol. 120 (2023-10-31), No. 44, pp. 1-27

Online academicJournal

The expanding catalog of genome-wide association studies (GWAS) provides biological insights across a variety of species, but identifying the causal variants behind these associations remains a significant challenge. Experimental validation is both labor-intensive and costly, highlighting the need for accurate, scalable computational methods to predict the effects of genetic variants across the entire genome. Inspired by recent progress in natural language processing, unsupervised pretraining on large protein sequence databases has proven successful in extracting complex information related to proteins. These models showcase their ability to learn variant effects in coding regions using an unsupervised approach. Expanding on this idea, we here introduce the Genomic Pre-trained Network (GPN), a model designed to learn genome-wide variant effects through unsupervised pretraining on genomic DNA sequences. Our model also successfully learns gene structure and DNA motifs without any supervision. To demonstrate its utility, we train GPN on unaligned reference genomes of Arabidopsis thaliana and seven related species within the Brassicales order and evaluate its ability to predict the functional impact of genetic variants in A. thaliana by utilizing allele frequencies from the 1001 Genomes Project and a comprehensive database of GWAS. Notably, GPN outperforms predictors based on popular conservation scores such as phyloP and phastCons. Our predictions for A. thaliana can be visualized as sequence logos in the UCSC Genome Browser (https://genome.ucsc.edu/s/gbenegas/gpn-arabidopsis). We provide code (https://github.com/songlab-cal/gpn) to train GPN for any given species using its DNA sequence alone, enabling unsupervised prediction of variant effects across the entire genome. [ABSTRACT FROM AUTHOR]

Title:	DNA language models are powerful predictors of genome-wide variant effects.
Author / Contributor:	Benegas, Gonzalo ; Batra, Sanjit Singh ; Song, Yun S.
Journal:	Proceedings of the National Academy of Sciences of the United States of America, Vol. 120 (2023-10-31), No. 44, pp. 1-27
Publication year:	2023
Media type:	academicJournal
ISSN:	0027-8424 (print)
DOI:	10.1073/pnas.2311219120
Keywords:	LANGUAGE models; GENETIC variation; NATURAL language processing; GENOME-wide association studies; TRANSGENIC organisms; NUCLEOTIDE sequence
Other:	Indexed in: Complementary Index; Language: English
