By Dr. Oyvind Dahle and Michael R. Pontrelli, Foley & Lardner LLP.
The first complete, gapless sequence of a human genome was published on 1 April 2022 in a special issue of the journal Science1. While the Human Genome Project mapped about 92% of the human genome two decades ago, sequencing the last 8% proved highly challenging. This remaining 8% was often referred to as the dark matter of the genome, or sometimes even ‘junk DNA’2. The Telomere-to-Telomere (T2T) Consortium successfully sequenced about 200 million remaining base pairs to approach 100% of the human genome by using newly developed long-read sequencing (LRS) technology3,4,5. This technical tour de force will have far-reaching implications for our understanding of the human genome and for the practical use of this knowledge in clinical medicine.
Completion of the human genome was made possible by LRS technology and will inspire a global change in use and development of this technology
The first ‘final’ human genome sequence was announced some 20 years ago, and since then ‘next-generation’ sequencing technology and the human genome reference sequence have made it possible to sequence a whole genome within a week, providing widespread applications in both research and clinical medicine.
However, the work was not done, and about 8% of the genome remained in the dark until recently. One of the main hurdles to sequencing the last 8% was the highly repetitive nature of some regions of the human genome. Short fragments of DNA derived from repetitive regions may be almost identical, making the fragment sequences difficult to assemble into a whole sequence – like putting together a jigsaw puzzle with very similar tiny pieces. The sequencing technology available at the time could only yield sequences of relatively short DNA fragments, so assembling sequences from highly repetitive regions was a huge challenge. The development of LRS technology that could capture longer sequences was a major milestone on the road to a complete human genome sequence. The T2T Consortium applied two LRS methods, PacBio and Nanopore sequencing, to sequence the missing repetitive regions of the human genome1,4,5.
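The jigsaw-puzzle problem described above can be illustrated with a minimal Python sketch. The sequences, the helper `count_placements`, and the exact-match placement rule are all assumptions made for illustration; real assemblers work with overlaps, sequencing errors, and far longer repeats.

```python
def count_placements(genome: str, read: str) -> int:
    """Count the positions at which `read` matches `genome` exactly."""
    count = 0
    start = genome.find(read)
    while start != -1:
        count += 1
        # Step forward by one so overlapping matches are also counted.
        start = genome.find(read, start + 1)
    return count


# Toy "genome": unique flanks around a tandem repeat (all sequences are made up).
LEFT_FLANK = "GATTACAGGC"
REPEAT_UNIT = "ACGTT"
RIGHT_FLANK = "TCCGATAGCA"
genome = LEFT_FLANK + REPEAT_UNIT * 6 + RIGHT_FLANK

# A short read that falls entirely inside the repeat array.
short_read = REPEAT_UNIT * 2

# A long read that spans the whole array and reaches into both unique flanks.
long_read = LEFT_FLANK[-4:] + REPEAT_UNIT * 6 + RIGHT_FLANK[:4]

short_placements = count_placements(genome, short_read)
long_placements = count_placements(genome, long_read)
print(short_placements, long_placements)
```

In this toy case the short read fits equally well at five positions inside the repeat, while the long read, anchored by unique sequence on both sides, fits at only one.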
The PacBio technology relies on molecular biology techniques to generate a special DNA template (called a single-molecule real-time bell, or SMRTbell) that can be replicated to generate long sequences in real time6. Replication of the SMRTbell generates a continuous long read (CLR) of the DNA template. Each template can be read multiple times, which allows for self-correction. The PacBio technology yields average read lengths of about 10,000 base pairs (bp). In contrast, the maximum read length of next-generation sequencing technology such as the Illumina HiSeq 2500 is only 250 bp.
Nanopore sequencing relies on determining the chemical composition of a nucleic acid by passing a single strand through a protein nanopore6. Once the nucleic acid is inside the nanopore, it partially restricts the flow of ions through the pore, which can be observed as an ionic current drop. The ionic current measurements depend on the composition of the nucleic acid, so its sequence can be determined in real time as it passes through the nanopore. Nanopore sequencing does not require labelling or amplification of the nucleic acid sample and, importantly, can yield ultra-long reads of 10,000 to 1,000,000 bases. As an added advantage, Nanopore technology can also detect chemical modifications of the DNA, which is important for understanding the function of the human genome, as further discussed below.
The biggest challenge of both the PacBio and Nanopore sequencing technologies is their higher error rate and lower yield compared with the previous next-generation sequencing technology. The successful application of LRS technologies to the generation of the first complete human genome is sure to spur more widespread use and improvement of these technologies.
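The idea that reading the same template several times allows self-correction can be sketched as a per-position majority vote. This is a simplified illustration and not the vendor's actual consensus algorithm: the function `consensus`, the template, and the noisy passes are hypothetical, and the sketch assumes equal-length passes with substitution errors only, whereas real reads also contain insertions and deletions that require alignment first.

```python
from collections import Counter


def consensus(passes: list[str]) -> str:
    """Majority vote per position across equal-length passes of one template."""
    if not passes:
        raise ValueError("need at least one pass")
    if len({len(p) for p in passes}) != 1:
        raise ValueError("this toy version assumes passes of equal length")
    # zip(*passes) yields one column of bases per template position.
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*passes))


# Hypothetical template and three noisy passes, each with one substitution
# error at a different position (all sequences are made up for illustration).
template = "ACGTACGTAC"
passes = [
    "ACGTACCTAC",  # error at index 6 (G -> C)
    "ACGAACGTAC",  # error at index 3 (T -> A)
    "TCGTACGTAC",  # error at index 0 (A -> T)
]

corrected = consensus(passes)
print(corrected == template)
```

Because each error occurs in only one of the three passes, the other two outvote it at every position and the original template is recovered.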
- i) Centromeres function to ensure accurate transmission of the genome during cell division.
Centromeres serve a critical function in the faithful transmission of the genome during cell division. Briefly, the centromeres form a structure that allows binding to the cell's spindle fibres, which physically pull the two chromatids of the same chromosome towards opposite poles of the cell to ensure that each divided cell contains a complete set of chromosomes7. Structurally, human centromeres are built on a series of repeated 171 base pair (bp) AT-rich DNA elements that extend for several megabases (Mb) and make up ~3% of the human genome. Breaks in the centromeric regions are common features of tumour cells. One reason for the relatively frequent failure of centromeres in dividing cells, and especially tumour cells, may be that replication of repetitive DNA is a particularly demanding job. Accordingly, centromeres are fragile and prone to rearrangements during tumorigenesis. Improved understanding of these processes and how they fail may lead to new and improved treatment or diagnosis of human diseases characterised by abnormal cell division, such as cancer.
- ii) Telomeres protect the ends of the chromosomes and prevent inappropriate cell cycle arrest and DNA repair of the chromosome ends.
Telomeres play an important role in normal genome function and in the development of human disease. Indeed, changes in telomere length are considered hallmarks of human health conditions such as cancer and ageing. Telomeres contain repetitive nucleotide sequences that form a ‘cap structure’ that protects the ends of chromosomes, which may otherwise be mistaken for damaged or broken DNA8. If the ends of the chromosomes are not distinguished from damaged or broken DNA, the cells might undergo cell cycle arrest and continuously try to repair the chromosome ends, with disruptive effects on the cells. The telomere capping of the chromosome ends prevents this from happening. However, the telomeres shorten during cell division because of the directional synthesis of DNA, and lose their protective capacity if their length is not maintained. This shortening is counteracted by specialised reverse transcriptase enzymes called telomerases, but the maintenance of telomere length may break down due to ageing and human diseases such as cancer. The study of telomere structures and how they are maintained will be enhanced by the new capability to obtain accurate sequences of these structures, as provided by the LRS technology and the new reference genome.
- iii) The role of repetitive DNA in organising the genome.
In addition to the specific functions of centromeres and telomeres, highly repetitive DNA regions also play a crucial role in organising the three-dimensional structure of the genome into so-called heterochromatin. This three-dimensional packing of DNA is crucial for gene regulation and for fitting the DNA into the cell, because the total length of DNA in a cell is up to a hundred thousand times the cell’s length. The compacted DNA is relatively inaccessible, and genes in heterochromatin regions are often silenced, which is important for protecting cells from expressing DNA elements that are harmful or incompatible with the cell’s specialised function.
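The scale of this packing problem can be checked with a rough back-of-the-envelope calculation. The figures below are assumptions chosen for illustration and are not taken from this article: about 0.34 nm of length per base pair, roughly 3 billion base pairs per haploid genome, two copies per cell, and a cell about 20 micrometres across.

```python
RISE_PER_BASE_PAIR_M = 0.34e-9   # assumed: ~0.34 nm per base pair (B-form DNA)
HAPLOID_GENOME_BP = 3.0e9        # assumed: roughly 3 billion base pairs
COPIES_PER_CELL = 2              # diploid cell carries two genome copies
CELL_LENGTH_M = 20e-6            # assumed: a cell about 20 micrometres across

# Stretched-out length of all DNA in one cell, in metres.
dna_length_m = RISE_PER_BASE_PAIR_M * HAPLOID_GENOME_BP * COPIES_PER_CELL

# How many cell lengths that corresponds to.
ratio = dna_length_m / CELL_LENGTH_M
print(f"DNA length: {dna_length_m:.2f} m, about {ratio:,.0f} times the cell length")
```

Under these assumptions the DNA in a single cell would stretch to about two metres, on the order of a hundred thousand cell lengths; a smaller or larger assumed cell size shifts the ratio accordingly.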
The new LRS technologies allow accurate sequencing of repetitive DNA associated with heterochromatin and will improve our understanding of how heterochromatin formation is organised in cells. In turn, this will aid our understanding of the cell’s capacity to silence harmful DNA elements or prevent inappropriate gene expression.
- iv) Highly repetitive regions contain some of the highest mutation rates.
The highly repetitive regions that contain centromeres and telomeres have some of the highest mutation rates in the genome, suggesting that they have important roles in the pathogenesis of human disease9. The development of the LRS technologies discussed above and the completion of the human genome provide hope that these more complex and dynamic regions of the genome can be understood, and that this knowledge can be put to practical use in clinical medicine.
Future challenges for translating genomics into genomic medicine
Although the completion of the human genome marks a great milestone in human genomics, it also highlights that the work of understanding the human genome is far from over. In fact, the human genome sequence is still not entirely complete. The Y chromosome sequence has not yet been published, though the T2T Consortium has indicated that the Y chromosome has been sequenced and will be published soon. The new reference genome also covers only the haploid genome, or single set of chromosomes, so the completion of the diploid genome remains to be done. One of the next milestones would be to characterise the DNA modifications that organise the three-dimensional structure of the human genome. Intriguingly, nanopore sequencing can also be used to determine chemical modifications of DNA, including DNA methylation6. With the advent of these new sequencing technologies, we can expect great progress in determining the DNA modification patterns of the genome and how they relate to the genome’s three-dimensional structure and function. The work will continue to understand how all the different DNA structures work together to form a functional genome, how changes in these DNA structures influence human health, and, last but not least, how this information can be used to improve diagnostics and clinical medicine.
Volume 23 – Issue 3, Summer 2022