Mining Plant Genomes – A modern approach to herbal healing

By Michelle Vierra

Plants are stationary soldiers. Rooted to one spot, they are not able to chase nutrients or flee from herbivores and pathogens. So in addition to the basic metabolites they synthesise for their survival, they produce a diverse array of organic compounds through specialised biochemical pathways to counterattack threats.

Some of these compounds have been found to combat human threats as well, and herbalists have been scouring these palettes of secondary metabolites for their health-promoting properties for centuries.

Modern medicine also incorporates plant compounds. Around 80% of the world’s population already relies on ethnobotanical remedies and plant drugs, such as the antineoplastic Taxol, the antimalarial artemisinin, the analgesic codeine, the antidiabetic allicin and the cardiac depressant quinidine. The high cost of new drugs, unpalatable side-effects and microbial resistance are driving renewed public interest in alternative and complementary medicine.

Yet only a small fraction of the vast diversity of plant metabolism has been explored.

This is quickly changing. Synthetic biologists are setting out to mine the quarry of alkaloids, terpenoids and phenolic plant compounds in order to manufacture new natural products, while molecular ‘pharmers’ try to identify ways to use the plants themselves as biopharmaceutical factories.

The increased affordability and sophistication of genetic sequencing technology are making all of this possible. But is it being used to its full potential? How can the technology be best utilised in drug discovery and development?

It is no longer enough to simply sequence bits of a genome. In order to understand the full metabolic potential of plants, comprehensive genomic information must be combined with transcriptomic, proteomic and metabolomic data. We need to be able to answer questions such as: How are the genes coded? Where are they clustered? Clustered genes in Arabidopsis, for example, are enriched in phenylpropanoid and terpenoid metabolism. Gene duplications, such as whole-genome duplications (WGDs) and local (tandem) duplications (LDs), can also play an important role in specialised metabolism, including the expression of flavonoid-related genes.

Fortunately, sequencing technology has evolved to equip researchers with the tools to tackle nearly all of these questions.

Single Molecule, Real-Time (SMRT) Sequencing, which works like a giant microscope that can literally ‘see’ DNA synthesis in real time, enables researchers to assemble highly contiguous and accurate megabase-size stretches, or contigs, of plant genomes. These ‘long reads’ capture undetected structural variations, fully intact genes and regulatory regions embedded in complex structures that fragmented draft genomes often miss.

Most genome-wide knowledge is obtained at the level of gene expression (ie, variations in mRNA quantity). It is often assumed that each individual gene is transcribed into identical RNA molecules. In reality, one gene may produce several different RNA molecules (ie, isoforms) through the use of alternative promoters, exons and terminators. These isoforms can vary in length and differ markedly in function and expression pattern. Alternatively spliced transcript isoforms can dramatically increase the protein-coding potential of the genome, and isoforms transcribed from the same gene can have significantly different and even antagonistic effects.
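To make the combinatorics concrete, here is a minimal Python sketch that enumerates the transcript structures a single gene could yield when promoters, optional exons and terminators are all free to vary. The gene model (P1, E1, T1 and so on) is entirely hypothetical, and real genes rarely use every combination, so the count is an upper bound for this toy model rather than a prediction.

```python
from itertools import product

# Hypothetical gene model: every name below is invented for illustration.
promoters = ["P1", "P2"]        # alternative transcription start sites
constitutive_exon = "E1"        # exon present in every transcript
cassette_exons = ["E2", "E3"]   # exons that may be included or skipped
terminators = ["T1", "T2"]      # alternative transcript ends


def enumerate_isoforms(promoters, constitutive_exon, cassette_exons, terminators):
    """Return every transcript structure the toy gene model allows."""
    isoforms = []
    for promoter, terminator in product(promoters, terminators):
        # Each cassette exon is independently included (True) or skipped (False).
        for mask in product([True, False], repeat=len(cassette_exons)):
            included = [exon for exon, keep in zip(cassette_exons, mask) if keep]
            isoforms.append((promoter, constitutive_exon, *included, terminator))
    return isoforms


isoforms = enumerate_isoforms(promoters, constitutive_exon, cassette_exons, terminators)
print(len(isoforms))  # 2 promoters x 2^2 exon choices x 2 terminators = 16
for isoform in isoforms[:3]:
    print("-".join(isoform))
```

Even this small model allows 16 distinct transcripts from one gene, which is why counting reads per gene can hide much of what is actually being expressed.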

As such, accurately capturing isoform activity can be crucial to understanding gene structure, regulatory elements and coding regions. And covering the entire length of cDNA sequences and transcripts can even enable the discovery of new genes.

Enter the isoform sequencing (Iso-Seq) method, which uses long-read technology and requires no assembly. This makes it an increasingly popular tool, especially in the absence of reference genomes, which is a reality for many researchers working on non-model organisms and plants with large and complex genomes.

Marijuana mysteries

Despite its growing popularity for medicinal, food, industrial and recreational use, much remains unknown about the genetics of Cannabis sativa, the plant responsible for both marijuana and hemp.

Figure 1 Cannabis sativa, the plant responsible for both marijuana and hemp

The first genome assembly was attempted in 2011 using short-read sequencing, but was highly fragmented and incomplete. Several other iterations followed over the years, and the original team recently released a much-improved genome (1) assembled using SMRT Sequencing technology. Another team from the University of Toronto and Icahn School of Medicine at Mount Sinai, New York, resequenced the drug-type strain Purple Kush and the hemp variety ‘Finola’ and created a combined physical and genetic map (2) in order to better understand the cannabinoid biosynthesis pathway.

One mystery that has stumped cannabis researchers revolves around the expression of the related enzymes THCA synthase (THCAS) and CBDA synthase (CBDAS), which synthesise the compounds THC (responsible for the well-known psychoactive effects of cannabis consumption) and CBD (responsible for therapeutic properties and investigated as a potential treatment for pain, gastrointestinal disorders, schizophrenia and Alzheimer’s Disease).

There are two competing theories. In one, CBDAS and THCAS are mutually exclusive alleles (ie, very different isoforms, as the protein sequences are only 84% identical). In the other, THCAS and CBDAS are closely linked (ie, adjacent on a chromosome), and one or the other is inactivated in drug-type or hemp strains. The draft genome and transcriptome of C. sativa described in 2011 (for a female plant of the drug-type strain Purple Kush and the hemp variety ‘Finola’) were unable to resolve these theories due to high fragmentation. Nearly 70% of the C. sativa draft genome is composed of repetitive sequences with a high rate of single-nucleotide variants (SNVs), as well as inter- and intra-cultivar karyotype polymorphisms (ie, differences in homologous chromosomes), which are not captured well in short-read sequencing.
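A figure such as "84% identical" comes from comparing two aligned protein sequences position by position. The Python sketch below shows one common way to compute it, assuming the alignment has already been made and that columns containing a gap are excluded from the count (other conventions exist and give slightly different numbers). The two peptide fragments are invented for illustration and are not THCAS or CBDAS sequences.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of gap-free alignment columns in which both residues match."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for residue_a, residue_b in zip(aligned_a, aligned_b):
        if residue_a == gap or residue_b == gap:
            continue  # skip columns with a gap in either sequence
        compared += 1
        if residue_a == residue_b:
            matches += 1
    if compared == 0:
        raise ValueError("no gap-free columns to compare")
    return 100.0 * matches / compared


# Made-up, pre-aligned peptide fragments (hypothetical, for illustration only).
seq_a = "MKTAYIAKQRQISFVKSHFS"
seq_b = "MKTAYLAKQRQ-SFVRSHFS"

# 19 gap-free columns, 17 of them matching
print(round(percent_identity(seq_a, seq_b), 1))  # 89.5
```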

To address these complications, the Toronto/NY team resequenced the two cannabis varieties using SMRT Sequencing, which provided new insights into the arrangement of the chromosomes and the cannabinoid biosynthetic genes, including the discovery of substantial rearrangements and gene duplications at the closely linked THC and CBD acid synthase gene loci.

Rather than resolve the THCAS/CBDAS mystery, however, the genetic map raised more questions.

“They are not isoforms at an otherwise equivalent locus, and no equivalent of THCAS (deactivated or not) is found in hemp,” the authors wrote.

Their observations suggested that either polymorphisms or differential regulation of aromatic prenyltransferase (AP) contributes to cannabinoid production, presumably by controlling substrate concentration for THCAS and CBDAS. Purple Kush has greater than five-fold higher transcript levels of AP than Finola, with no difference in copy number, suggesting that AP enzyme levels may be higher in drug-type plants partly due to differences in transcript levels.
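The reasoning here rests on separating two possible causes of higher expression: more gene copies, or more transcript per copy. A small Python sketch of that normalisation follows. The expression values and copy numbers are hypothetical placeholders chosen only to give a ratio above five; they are not measurements from the study.

```python
def per_copy_fold_change(expr_a, copies_a, expr_b, copies_b):
    """Fold change of sample A over sample B after dividing by gene copy number."""
    if copies_a <= 0 or copies_b <= 0:
        raise ValueError("copy numbers must be positive")
    if expr_b <= 0:
        raise ValueError("reference expression must be positive")
    return (expr_a / copies_a) / (expr_b / copies_b)


# Hypothetical transcript abundances and copy numbers (illustration only).
drug_type_expr, drug_type_copies = 260.0, 1
hemp_expr, hemp_copies = 50.0, 1

fold_change = per_copy_fold_change(
    drug_type_expr, drug_type_copies, hemp_expr, hemp_copies
)
print(fold_change)  # 5.2
```

With equal copy numbers the per-copy ratio equals the raw ratio, so the difference must come from regulation rather than gene dosage. Had the first sample carried two copies, the per-copy fold change would have halved.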

In order to truly understand these mechanisms, further analysis, ideally at the transcriptome level, is needed.

“Comparative sequence analysis of the enzymes will help ascertain which amino acids are important in catalysis, and may lead to the rational design of cannabinoid biosynthetic enzymes that produce novel cannabinoids not observed in nature,” the authors wrote.

Bittersweet success

One of herbal genomics’ biggest success stories also illustrates the limits that come with incomplete genomic and transcriptomic coverage.

Chinese scientist Youyou Tu received a Nobel Prize in Physiology or Medicine in 2015 for her discovery of the anti-malaria function of artemisinin, an endoperoxide sesquiterpene lactone isolated from sweet wormwood (Artemisia annua, or Qinghao in Chinese), an annual herb of the Asteraceae family.

Figure 2 Sweet wormwood, an annual herb of the Asteraceae family

Artemisinin-based combination therapies (ACTs), recommended by the World Health Organization for the treatment of uncomplicated malaria caused by the Plasmodium falciparum parasite, have saved millions of lives. Other therapeutic effects have also been reported for artemisinin in diseases such as cancer, tuberculosis and diabetes, so demand for the compound is high. But plant-based production is struggling to meet global demand due to the low amount of artemisinin produced in A. annua leaves (0.1%-1.0% of dry weight).

Many have used metabolic engineering in attempts to increase artemisinin content in A. annua. Their strategies included overexpression of artemisinin biosynthetic pathway genes; overexpression of transcription factors (TFs) that can enhance the expression of artemisinin biosynthetic genes; and overexpression of the ADP-FPS fusion gene to stimulate substrate channelling. However, by focusing only on modifying the upstream or downstream parts of the artemisinin biosynthetic pathway and overexpressing single genes, they were unable to effectively boost the entire metabolic flux toward artemisinin biosynthesis.

To get a more complete picture, an international research team led by Kexuan Tang of Shanghai Jiao Tong University turned to a combination of long-read genomic and transcriptomic analyses.

The result, reported in April 2018 (3), is a high-quality draft genome assembly of the 1.74-gigabase genome of A. annua, containing 63,226 protein-coding genes, one of the largest numbers among all sequenced plant species.

The researchers found that multiple enzymatic steps are involved in artemisinin biosynthesis, implying that there may be more than one enzymatic step limiting the metabolic flux into artemisinin biosynthesis.

With this in mind, the team generated transgenic A. annua lines. By simultaneously overexpressing multiple genes functioning in the upstream (HMGR), midstream (FPS) and downstream (DBR2) parts of the artemisinin biosynthetic pathway, they were able to produce artemisinin levels 51%-103% higher than those of wild-type plants.

Where none have gone before

Among the 900 species of the medicinally important Rutaceae family is the winged prickly ash (Zanthoxylum planispinum), which produces a wide variety of phytochemicals, including alkaloids, amides, lignans, essential oils and fatty acids.

Widely used as a medicinal herb for the treatment of colds, stomach aches, snakebites, toothaches and roundworm, Z. planispinum may also have anticancer, antiviral, antimicrobial, antiplatelet aggregation, antioxidant and anti-inflammatory activities, making it a focus of many pharmacology studies. However, molecular biology studies of this plant are rare, with no comprehensive genomic and transcriptomic data available in the NCBI plant database.

To rectify this, a team of Korean researchers, led by Ik-Young Choi of Kangwon National University, turned to whole transcriptome sequencing to obtain gene function data and elucidate the complex mechanisms involved in regulating gene expression.

As reported in July 2018 (4), the scientists analysed full-length cDNA sequences using the Iso-Seq method, obtaining 51,402 uniquely assembled transcripts (‘unigenes’) from leaf, early fruit and maturing fruit tissues of Z. planispinum.

Among their targets of particular interest were cytochrome P450 monooxygenases (CYP450s), which comprise a large and complex superfamily whose members are found in almost all living organisms. Plant P450s are involved in various pathways, such as the synthesis of UV protectants (flavonoids and anthocyanins); defence compounds (isoflavonoids, hydroxamic acids and terpenes); and signalling molecules (salicylic acid and jasmonic acid). Targeting plant P450s via metabolic engineering would be valuable for the large-scale production of their phytochemicals as medicine, but these enzymes are present in very low quantities and pinpointing their metabolic functions is very difficult.

Comprehensive transcriptome sequencing allowed the Choi team to overcome this barrier. They were able to identify 76 cytochrome P450s and classify them into unique families, which should enable their individual functions to be predicted with accuracy.

They were also able to piece together another crucial puzzle: antibiotic biosynthesis. This is actually where they found the highest number of isoforms (1,250) and enzymes (135).

“From this data, it can be affirmed that a number of antibiotics can be obtained from this plant,” the authors wrote.

A case for ginseng

Until recently, researchers had to rely on patchy genomic and transcriptomic data in their quest to get to the root of one of the oldest and most popular traditional medicines of East Asia: ginseng.

Figure 3 Ginseng, one of the oldest and most popular traditional medicines of East Asia

Purported to have therapeutic effects on neurodegenerative disorders, cardiovascular diseases, diabetes and cancer, Panax ginseng and Panax notoginseng contain unique saponins called ginsenosides. The study of these glycosylated triterpenes has been hampered, however, by the slow growth, long generation time (~4 years/generation), low seed production and complicated genome structure of Panax plants.

The first de novo assembly of a Panax genome, a 2.36 Gbp diploid P. notoginseng with 35,451 protein-encoding genes, was finally reported as a preprint in July 2018 (5) by a team from the University of Macau led by Simon Ming-Yuen Lee. A de novo assembly of a 2.98 Gbp genome (with 59,352 annotated genes) of the tetraploid P. ginseng cultivar Chunpoong (ChP), produced by a team from Seoul National University led by Tae Jin Yang, followed shortly thereafter (6).

Sequencing of both DNA and mRNA enabled researchers to take deep dives into not only the ginsenoside biosynthetic machinery, but also its regulation and metabolic utilisation. In the case of P. ginseng, Yang et al constructed genome-scale metabolic networks covering nearly 5,000 gene products that catalyse 2,194 reactions involving 2,003 unique metabolites.

Ginsenosides accumulate differently in roots, leaves, stems, flower buds and berries, in quantities varying with tissue, age, environment and cultivar.

Yang’s team was able to determine from whole-genome sequencing that the high ginsenoside contents in older P. ginseng roots are likely the result of transportation from shoot tissues rather than active biosynthesis. Co-expression analysis using RNA sequencing data identified important enzymes with which ginsenoside production co-evolved.
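Co-expression analysis, in its simplest form, asks which genes rise and fall together with a known pathway gene across tissues or conditions. The Python sketch below illustrates the idea with a Pearson correlation and a cut-off of 0.9. The gene names, expression values and threshold are all hypothetical, and published analyses typically use many more samples and more elaborate network methods than this.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("a constant profile has no defined correlation")
    return covariance / (spread_x * spread_y)


# Hypothetical expression values in root, leaf, stem, flower bud and berry.
bait = [10.0, 40.0, 20.0, 30.0, 5.0]  # stands in for a known pathway gene
candidates = {
    "gene_A": [12.0, 38.0, 22.0, 29.0, 6.0],   # tracks the bait closely
    "gene_B": [30.0, 5.0, 25.0, 10.0, 40.0],   # moves in the opposite direction
    "gene_C": [15.0, 15.0, 16.0, 14.0, 15.0],  # nearly flat
}

# Rank candidates by correlation with the bait, strongest first.
ranked = sorted(
    ((pearson(bait, profile), name) for name, profile in candidates.items()),
    reverse=True,
)
coexpressed = [name for r, name in ranked if r >= 0.9]  # assumed cut-off
print(coexpressed)  # ['gene_A']
```

Genes that pass the cut-off become candidates for involvement in the same pathway, to be confirmed experimentally.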

In the case of P. notoginseng, two types of ginsenosides (PPD and PPT) with opposing biological activities (pro-angiogenesis and anti-angiogenesis) can be found in the same plant. Only by thoroughly characterising the entire genome of the plant, as well as isoforms from eight of its constituent parts, was the Lee team able to determine that the aerial parts (eg leaf and flower) contain a higher abundance of PPD compared to roots. They identified several key genes, including several seen for the first time, as well as their transcription factor binding sites and other related elements involved in the ginsenoside synthesis pathway.

As Yang points out, such information will be vital to enabling in silico metabolic engineering to predict candidate genes associated with overproduction of desired metabolites and thus accelerate overall metabolic engineering processes.

“These results provide essential targets to increase the production of ginsenosides through the latest biotechnological approaches,” he wrote.


Far from being a relic of an antiquated past, medicinal plants and herbal remedies have informed much of modern medicine and could contribute a great deal to sound, science-based solutions of the future.

We still have much to learn about the genetic and epigenetic mechanisms of these potentially health-promoting plants. Luckily, modern sequencing platforms enable us to study the unique structural organisation of genes and the regulatory mechanisms underlying their expression patterns, allowing the generation of catalogues of specialised metabolism in ways unfathomable to the herbalists who first harnessed their healing properties centuries ago.

As Tessa Moses and Alain Goossens point out in the Journal of Experimental Botany (7), all living plant species in the world together contribute to a greater chemical diversity of bioactive compounds than any man-made chemical library.

By returning to our roots via roots, and combining ancient observations with modern molecular mining, we can herald a new era of healing and drug discovery.

This article originally featured in the DDW Spring 2019 Issue

Michelle Vierra is Strategic Marketing Manager of Plant and Animal Sciences at Pacific Biosciences, a Menlo Park company that offers sequencing technologies to help scientists solve genetically complex problems. Based on its novel Single Molecule, Real-Time (SMRT®) technology, PacBio products enable de novo genome assembly; full-length transcript sequencing for a complete view of isoform diversity; targeted sequencing to more comprehensively characterise genetic variations; and DNA base modification identification to help characterise epigenetic regulation and DNA damage.


1 McKernan, K, Helbert, Y, Kane, LT, Ebling, H, Zhang, L, Liu, B, Eaton, Z, Sun, L, Dimalanta, E, Kingan, S, Baybayan, P, Pres, M, Barbazuk, W and Harkins, T. Cryptocurrencies and Zero Mode Wave guides: An unclouded path to a more contiguous Cannabis sativa L. genome assembly. OSF Preprints (2018).

2 Laverty, KU, Stout, JM, Sullivan, MJ, Shah, H, Gill, N, Holbrook, L, Deikus, G, Sebra, R, Hughes, TR, Page, JE, van Bakel, H. A physical and genetic map of Cannabis sativa identifies extensive rearrangements at the THC/CBD acid synthase loci. Genome Res. (2019).

3 Shen, Q, Zhang, L, Liao, Z, Wang, S, Yan, T, Shi, P, Liu, M, Fu, X, Pan, Q, Wang, Y, Lv, Z, Lu, X, Zhang, F, Jiang, W, Ma, Y, Chen, M, Hao, X, Li, L, Tang, Y, Lv, G, Zhou, Y, Sun, X, Brodelius, PE, Rose, JKC, Tang, K. The Genome of Artemisia annua Provides Insight into the Evolution of Asteraceae Family and Artemisinin Biosynthesis. Mol Plant. (2018).

4 Kim, JA, Roy, NS, Lee, IH, Choi, AY, Choi, BS, Yu, YS, Park, NI, Park, KC, Kim, S, Yang, HS, Choi, IY. Genome-wide transcriptome profiling of the medicinal plant Zanthoxylum planispinum using a single-molecule direct RNA sequencing approach. Genomics (2018).

5 Fan, G, Fu, Y, Yang, B, Liu, M, Zhang, H, Liang, X, Shi, C, Ma, K, Wang, J, Liu, W, Shao, L, Huang, C, Guo, M, Cai, J, Wong, AKC, Li, C, Zhuang, D, Chen, K, Cong, W, Sun, X, Liu, X, Xu, X, Tsui, SK, Chen, W and Lee, SM. Sequencing of Panax notoginseng genome reveals genes involved in disease resistance and ginsenoside biosynthesis. bioRxiv 362046 (2018); doi: 10.1101/362046.

6 Kim, NH, Jayakodi, M, Lee, SC, Choi, BS, Jang, W, Lee, J, Kim, HH, Waminal, NE, Lakshmanan, M, van Nguyen, B, Lee, YS, Park, HS, Koo, HJ, Park, JY, Perumal, S, Joh, HJ, Lee, H, Kim, J, Kim, IS, Kim, K, Koduru, L, Kang, KB, Sung, SH, Yu, Y, Park, DS, Choi, D, Seo, E, Kim, S, Kim, YC, Hyun, DY, Park, YI, Kim, C, Lee, TH, Kim, HU, Soh, MS, Lee, Y, In, JG, Kim, HS, Kim, YM, Yang, DC, Wing, RA, Lee, DY, Paterson, AH, Yang, TJ. Genome and evolution of the shade-requiring medicinal herb Panax ginseng. Plant Biotechnol J. (2018).

7 Moses, T and Goossens, A. Plants for human health: greening biotechnology and synthetic biology. Journal of Experimental Botany (2017).
