Researchers from the National Institute of Standards and Technology (NIST), Baylor College of Medicine and DNAnexus, together with other members of the Genome in a Bottle (GIAB) consortium, have published a comprehensive benchmark dataset covering 273 challenging, medically relevant autosomal genes associated with diseases such as homocystinuria and spinal muscular atrophy.
The approach
Co-corresponding authors Dr. Fritz Sedlazeck of Baylor, Dr. Justin Zook of NIST and Dr. Jason Chin of DNAnexus led the team of researchers in focusing on a set of medically relevant genes that had been excluded from previous benchmarks due to their complexity.
Using HiFi sequencing reads, the GIAB team identified thousands of single-nucleotide variants, structural variants and large insertions and deletions in those genes relative to the two most commonly used human reference genomes. They also addressed reference errors affecting several medically relevant genes, which improved variant recall in those genes from as low as 8% to 100%.
The report
The report characterises 273 of 395 challenging autosomal genes using a haplotype-resolved whole-genome assembly. For the HG002 sample, this curated benchmark reports over 17,000 single-nucleotide variants, 3,600 insertions and deletions and 200 structural variants against each of the human reference genomes GRCh37 and GRCh38.
The consortium shows that false duplications in either GRCh37 or GRCh38 cause reference-specific missed variants for both short- and long-read technologies in medically relevant genes, including CBS, CRYAA and KCNE1. When these false duplications are masked, variant recall can improve from 8% to 100%. Forming benchmarks from a haplotype-resolved whole-genome assembly may become a prototype for future benchmarks covering the whole genome.
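For readers unfamiliar with the metric, recall is the share of benchmark variants that a variant-calling pipeline actually recovers: true positives divided by the sum of true positives and false negatives. The short Python sketch below illustrates the arithmetic only. The `recall` function and the counts passed to it are hypothetical, chosen to reproduce the quoted percentages, and are not taken from the GIAB benchmark.

```python
def recall(true_positives: int, false_negatives: int) -> float:
    """Return the fraction of benchmark variants that a call set recovers."""
    total = true_positives + false_negatives
    if total == 0:
        raise ValueError("benchmark contains no variants")
    return true_positives / total


# Hypothetical counts for a region with 100 benchmark variants.
# Before masking, most variants are missed; after masking, all are found.
before_masking = recall(true_positives=8, false_negatives=92)
after_masking = recall(true_positives=100, false_negatives=0)

print(f"{before_masking:.0%} -> {after_masking:.0%}")  # prints "8% -> 100%"
```

In this illustration a missed variant counts as a false negative, so removing the cause of the misses raises recall without changing the benchmark itself.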