Since the first international effort to sequence the 3 billion DNA letters in the human genome (The Human Genome Project), the study of human genomes has relied almost exclusively on a single reference genome to which others are compared to identify genetic variations. Scientists have long recognized that a single reference genome cannot represent human diversity and that using it introduces a pervasive bias into these studies. Now, they finally have a practical alternative.
In a paper published December 16 in Science, researchers at the UC Santa Cruz Genomics Institute have introduced a new tool, called Giraffe, that can efficiently map new genome sequences to a “pangenome” representing many diverse human genome sequences. They show that this approach allows a more comprehensive characterization of genetic variations and can improve the genomic analyses used by a wide range of researchers and clinicians.
All humans have the same genes, but there are many variations in the exact sequences of the genes — meaning the sequence of DNA subunits (abbreviated A, C, T, G) that spell out the genetic code — as well as in the vast stretches of the genome outside of the protein-coding genes. A difference in a single letter of code is called a single nucleotide variant (SNV), and insertions or deletions of short sequences are known collectively as “indels.”
The most complex variants are structural variations involving rearrangements of large segments of code (50 or more letters). These are especially hard to find using a single reference genome, yet they can have significant effects and are known to play an important role in some diseases. The average person has millions of SNVs and indels and tens of thousands of larger structural variants, and collectively the structural variants actually involve more letters of code than the other types of variants do.
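The three size classes described above can be illustrated with a small sketch. It assumes a variant is given as a pair of allele strings (reference and alternate), a representation chosen here for illustration and not taken from the paper. The function name `classify_variant` is hypothetical, and the 50-letter cutoff is the one quoted in the article. Rearrangements that do not change length, such as inversions, are not captured by this simple rule.

```python
# Threshold quoted in the article: structural variants span 50 or more letters.
SV_MIN_LENGTH = 50


def classify_variant(ref: str, alt: str) -> str:
    """Classify a variant by comparing reference and alternate alleles.

    Simplifying assumption: size is measured as the difference in allele length.
    """
    if ref == alt:
        return "no variant"
    if len(ref) == 1 and len(alt) == 1:
        return "SNV"  # a single letter differs
    size = abs(len(ref) - len(alt))
    if size >= SV_MIN_LENGTH:
        return "structural variant"  # large insertion or deletion
    if size > 0:
        return "indel"  # short insertion or deletion
    return "other"  # same length, several letters changed


print(classify_variant("A", "G"))             # single-letter change
print(classify_variant("A", "ATTC"))          # short insertion
print(classify_variant("A", "A" + "T" * 60))  # long insertion
```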
“Pangenomics is making structural variants visible so we can study them the same way we do SNVs and short indels. There are a lot of structural variants and they can have a big impact, so this is critical for the future of genetic studies of disease,” said corresponding author Benedict Paten, associate professor of biomolecular engineering at UC Santa Cruz and associate director of the Genomics Institute.
A pangenome reference can be created from multiple genome sequences using a mathematical graph structure to represent the relationships between different sequences. In the new paper, the researchers built two human genome reference graphs using publicly available data. These were used to evaluate the new tool, Giraffe, which is a set of algorithms for mapping new sequence data to a pangenome reference.
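To make the idea of a graph reference concrete, here is a toy sketch of a sequence graph holding two versions of a short sequence. The node numbering, the sequences, and the names `nodes`, `edges`, `paths`, and `spell` are all invented for illustration; this is not the data format used by Giraffe or by the graphs built in the paper.

```python
# Each node holds a stretch of sequence; edges say which nodes may follow which.
nodes = {1: "GATT", 2: "A", 3: "C", 4: "CAT"}
edges = {(1, 2), (1, 3), (2, 4), (3, 4)}

# Each genome in the pangenome is a path through the graph. The two paths share
# nodes 1 and 4 and differ only at the "bubble" formed by nodes 2 and 3.
paths = {
    "haplotype_a": [1, 2, 4],
    "haplotype_b": [1, 3, 4],
}


def spell(path):
    """Return the sequence spelled by a path, checking that every step is an edge."""
    for a, b in zip(path, path[1:]):
        if (a, b) not in edges:
            raise ValueError(f"no edge from node {a} to node {b}")
    return "".join(nodes[n] for n in path)


for name, path in paths.items():
    print(name, spell(path))
```

Shared sequence is stored once, and each variation appears as an alternative route through the graph, so neither version has to be chosen as the single reference.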
Giraffe can accurately map new sequence data to thousands of genomes embedded in a pangenome reference as quickly as existing tools map to a single reference genome. The study also showed that using Giraffe reduces mapping bias, the tendency to incorrectly map sequences that differ from the reference genome.
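Mapping bias can be shown with a toy example. The sketch below is a brute-force illustration, not Giraffe's algorithm: it slides a short read along each candidate sequence and counts mismatches. The sequences and the names `min_mismatches`, `linear_reference`, and `pangenome_haplotypes` are made up for the example, and a real mapper would work on the graph directly instead of listing every genome as a string.

```python
def min_mismatches(read, reference):
    """Fewest mismatches over all ungapped placements of the read on the reference."""
    best = len(read)
    for start in range(len(reference) - len(read) + 1):
        window = reference[start:start + len(read)]
        best = min(best, sum(a != b for a, b in zip(read, window)))
    return best


# A single reference carries only one version of the variable site.
linear_reference = "GATTACAT"
# A pangenome also represents the other version.
pangenome_haplotypes = ["GATTACAT", "GATTCCAT"]

# This read comes from a person who carries the version missing from the
# single reference.
read = "TTCCA"

linear_score = min_mismatches(read, linear_reference)
pangenome_score = min(min_mismatches(read, h) for h in pangenome_haplotypes)

print("mismatches against single reference:", linear_score)
print("mismatches against pangenome:", pangenome_score)
```

The read matches the pangenome exactly but can only be placed on the single reference with a mismatch. With larger differences, such reads may be misplaced or left unmapped, which is the bias the article describes.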
“A lot of structural variants have been discovered recently using long-read sequencing,” said co-first author Jean Monlong, a postdoctoral researcher at the Genomics Institute. “With pangenomes, we can look for these structural variants in large datasets of short-read sequencing. It’s exciting because this will allow us to study those new structural variants across many people and ask questions about their functional impact, association with disease, or role in evolution.”
A single reference genome must choose one version of any variation to represent, leaving the other versions unrepresented. By making more broadly representative pangenome references practical, Giraffe can make genomics more inclusive.
Original article published on ScienceDaily on December 16, 2021.
Photo credit: Pixabay
Reference: Jouni Sirén, Jean Monlong, Xian Chang, Adam M. Novak, Jordan M. Eizenga, Charles Markello, Jonas A. Sibbesen, Glenn Hickey, Pi-Chuan Chang, Andrew Carroll, Namrata Gupta, Stacey Gabriel, Thomas W. Blackwell, Aakrosh Ratan, Kent D. Taylor, Stephen S. Rich, Jerome I. Rotter, David Haussler, Erik Garrison, Benedict Paten. Pangenomics enables genotyping of known structural variants in 5202 diverse genomes. Science, 2021; 374 (6574) DOI: 10.1126/science.abg8871