Quantum Computing Meets Genomics: The Dawn of Hyper-Fast DNA Analysis


A pioneering collaboration has been established to focus on using quantum computing to enhance genomics. The team will develop algorithms to accelerate the analysis of pangenomic datasets, which could revolutionize personalized medicine and pathogen management. Credit: SciTechDaily.com

A new project unites world-leading experts in quantum computing and genomics to develop new methods and algorithms to process biological data.

Researchers aim to harness quantum computing to speed up genomics, enhancing our understanding of DNA and driving advancements in personalized medicine.

A new collaboration has formed, uniting a world-leading interdisciplinary team with skills across quantum computing, genomics, and advanced algorithms. They aim to tackle one of the most challenging computational problems in genomic science: building, augmenting, and analyzing pangenomic datasets for large population samples. Their project sits at the frontiers of research in both biomedical science and quantum computing.

The project, which involves researchers based at the University of Cambridge, the Wellcome Sanger Institute, and EMBL’s European Bioinformatics Institute (EMBL-EBI), has been awarded up to US $3.5 million to explore the potential of quantum computing for improvements in human health.

The team aims to develop quantum computing algorithms with the potential to speed up the production and analysis of pangenomes – new representations of DNA sequences that capture population diversity. Their methods will be designed to run on emerging quantum computers. The project is one of 12 selected worldwide for the Wellcome Leap Quantum for Bio (Q4Bio) Supported Challenge Program.

Advancements in Genomics

Since the initial sequencing of the human genome over two decades ago, genomics has revolutionized science and medicine. Less than one percent of the 6.4 billion letters of DNA code differs from one human to the next, but those genetic differences are what make each of us unique. Our genetic code can provide insights into our health, help to diagnose disease, or guide medical treatments.

However, the reference human genome sequence, which most subsequently sequenced human DNA is compared to, is based on data from only a few people, and doesn’t represent human diversity. Scientists have been working to address this problem for over a decade, and in 2023 the first human pangenome reference was produced. A pangenome is a collection of many different genome sequences that capture the genetic diversity in a population. Pangenomes could potentially be produced for all species, including pathogens such as SARS-CoV-2.

Quantum Computing in Genomics

Pangenomics, a new domain of science, demands high levels of computational power. While the existing human reference genome structure is linear, pangenome data can be represented and analyzed as a network, called a sequence graph, which stores the shared structure of genetic relationships between many genomes. Comparing subsequent individual genomes to the pangenome then involves mapping a route for their sequences through the graph.

In this new project, the team aims to develop quantum computing approaches with the potential to speed up both the key processes of mapping data to graph nodes, and finding good routes through the graph.
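To make the idea of a sequence graph more concrete, here is a minimal classical sketch in Python. It is not the project's method: the toy graph, the node labels, and the function name `find_route` are all hypothetical, and the search is an exhaustive depth-first walk that only suits tiny graphs. It shows two genomes that differ at one position taking different routes through shared nodes.

```python
# Toy sequence graph: each node holds a short DNA segment, and edges say
# which segments may follow one another. All data here is illustrative.
NODES = {1: "ACG", 2: "T", 3: "G", 4: "TTA"}
EDGES = {1: [2, 3], 2: [4], 3: [4], 4: []}


def find_route(query, node, prefix=()):
    """Return a list of node ids whose labels spell `query`, or None.

    Depth-first search starting at `node`; exhaustive, so only suitable
    for tiny graphs like this one.
    """
    label = NODES[node]
    if not query.startswith(label):
        return None
    rest = query[len(label):]
    route = list(prefix) + [node]
    if not rest:
        return route
    for nxt in EDGES[node]:
        found = find_route(rest, nxt, route)
        if found is not None:
            return found
    return None


# Two genomes that differ at one position take different routes.
route_a = find_route("ACGTTTA", 1)  # ACG + T + TTA
route_b = find_route("ACGGTTA", 1)  # ACG + G + TTA
print(route_a, route_b)
```

Real pangenome graphs contain millions of nodes, which is why the cost of this kind of route finding is the bottleneck the team hopes to address.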

Quantum technologies are poised to revolutionize high-performance computing. Classical computing stores information as bits, which are binary: either 0 or 1. However, a quantum computer works with particles that can be in a superposition of different states simultaneously. Rather than bits, information in a quantum computer is represented by qubits (quantum bits), which can take on the value 0, or 1, or be in a superposition state between 0 and 1. Quantum computing takes advantage of quantum mechanics to enable solutions to problems that are not practical to solve using classical computers.
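As a rough illustration of superposition, the sketch below simulates a single qubit on a classical machine in plain Python. It assumes real-valued amplitudes for simplicity (in general they are complex), and the function names are hypothetical. Applying a Hadamard gate to the state 0 yields an equal superposition, so a measurement would give 0 or 1 with probability one half each.

```python
import math

# A single qubit as two real amplitudes for the basis states |0> and |1>.
# Real numbers are enough for this illustration; in general they are complex.
zero = [1.0, 0.0]


def hadamard(state):
    """Apply the Hadamard gate, which turns |0> into an equal superposition."""
    a, b = state
    s = 1.0 / math.sqrt(2.0)
    return [s * (a + b), s * (a - b)]


def probabilities(state):
    """Measurement probabilities are the squared magnitudes of the amplitudes."""
    return [abs(x) ** 2 for x in state]


superposed = hadamard(zero)
probs = probabilities(superposed)
print(probs)
```

Simulating n qubits this way needs 2^n amplitudes, which is why classical simulation only works for small systems.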

Challenges and Future Prospects

However, current quantum computer hardware is inherently sensitive to noise and decoherence, so scaling it up presents an immense technological challenge. While there have been exciting proof-of-concept experiments and demonstrations, today’s quantum computers remain limited in size and computational power, which restricts their practical application. But significant quantum hardware advances are expected to emerge in the next three to five years.

The Wellcome Leap Q4Bio Challenge is based on the premise that the early days of any new computational method will advance and benefit most from the co-development of applications, software, and hardware – allowing optimizations with not-yet-generalizable, early systems.

Building on state-of-the-art computational genomics methods, the team will develop, simulate, and then implement new quantum algorithms, using real data. The algorithms and methods will initially be tested and refined in existing, powerful High Performance Compute (HPC) environments, which will be used to simulate the expected quantum computing hardware. The team will test algorithms first using small stretches of DNA sequence, working up to processing relatively small genome sequences like SARS-CoV-2, before moving to the much larger human genome.

Perspectives From the Team

Dr. Sergii Strelchuk, Principal Investigator of the project from the Department of Applied Mathematics and Theoretical Physics, University of Cambridge, said: “The structure of many challenging problems in computational genomics and pangenomics in particular make them suitable candidates for speedups promised by quantum computing. We are on a thrilling journey to develop and deploy quantum algorithms tailored to genomic data to gain new insights, which are unattainable using classical algorithms.”

David Holland, Principal Systems Administrator at the Wellcome Sanger Institute, who is working to create the High Performance Compute environment to simulate a quantum computer, said: “We’ve only just scratched the surface of both quantum computing and pangenomics. So to bring these two worlds together is incredibly exciting. We don’t know exactly what’s coming, but we see great opportunities for major new advances. We are doing things today that we hope will make tomorrow better.”

Dr. David Yuan, Project Lead at EMBL-EBI, said: “On the one hand, we’re starting from scratch because we don’t even know yet how to represent a pangenome in a quantum computing environment. If you compare it to the first moon landings, this project is the equivalent of designing a rocket and training the astronauts. On the other hand, we’ve got solid foundations, building on decades of systematically annotated genomic data generated by researchers worldwide and made available by EMBL-EBI. The fact that we’re using this knowledge to develop the next generation of tools for the life sciences, is a testament to the importance of open data and collaborative science.”

The potential benefits of this work are huge. Comparing a specific human genome against the human pangenome — instead of the existing human reference genome — gives better insights into its unique composition. This will be important in driving forward personalized medicine. Similar approaches for bacterial and viral genomes will underpin the tracking and management of pathogen outbreaks.

Current recommendations for cancer surveillance in Gorlin syndrome: a report from the SIOPE host genome working group (SIOPE HGWG)


Abstract

Gorlin syndrome (MIM 109400), a cancer predisposition syndrome related to a constitutional pathogenic variation (PV) of a gene in the Sonic Hedgehog pathway (PTCH1 or SUFU), is associated with a broad spectrum of benign and malignant tumors. Basal cell carcinomas (BCC), odontogenic keratocysts and medulloblastomas are the main tumor types encountered, but meningiomas, ovarian or cardiac fibromas and sarcomas have also been described. The clinical features and tumor risks are different depending on the causative gene. Due to the rarity of this condition, there is little data on phenotype-genotype correlations. This report summarizes genotype-based recommendations for screening patients with PTCH1 and SUFU-related Gorlin syndrome, discussed during a workshop of the Host Genome Working Group of the European branch of the International Society of Pediatric Oncology (SIOPE HGWG) held in January 2020. In order to allow early detection of BCC, dermatologic examination should start at age 10 in PTCH1, and at age 20 in SUFU PV carriers. Odontogenic keratocyst screening, based on odontologic examination, should begin at age 2 with annual orthopantogram beginning around age 8 for PTCH1 PV carriers only. For medulloblastomas, repeated brain MRI from birth to 5 years should be proposed for SUFU PV carriers only. Brain MRI for meningiomas and pelvic ultrasound for ovarian fibromas should be offered to both PTCH1 and SUFU PV carriers. Follow-up of patients treated with radiotherapy should be prolonged and thorough because of the risk of secondary malignancies. Prospective evaluation of evidence of the effectiveness of these surveillance recommendations is required.

Conclusion

GS was described many years ago, but the identification of different underlying molecular genetic characteristics (PTCH1 or SUFU germline variants) has changed the way in which this predisposing syndrome is identified and has modified the phenotypic characteristics of the tumor risks historically described. There are still many uncertainties about the risk of tumors associated with GS. Given the rarity of this syndrome, the estimation of the risks is still not very precise and must be improved by studies on a larger number of patients with analysis taking into account ascertainment bias. Nevertheless, due to the risk of tumor, any patient with GS predisposition syndrome should benefit from a specific genotype-based surveillance program as described in these recommendations. The evaluation of the feasibility and efficacy of these recommendations is necessary.

High diversity in Delta variant across countries revealed by genome-wide analysis of SARS-CoV-2 beyond the Spike protein


Abstract

The highly contagious Delta variant of SARS-CoV-2 has become a prevalent strain globally and poses a public health challenge around the world. While there has been extensive focus on understanding the amino acid mutations in the Delta variant’s Spike protein, the mutational landscape of the rest of the SARS-CoV-2 proteome (25 proteins) remains poorly understood. To this end, we performed a systematic analysis of mutations in all the SARS-CoV-2 proteins from nearly 2 million SARS-CoV-2 genomes from 176 countries/territories. Six highly prevalent missense mutations in the viral life cycle-associated Membrane (I82T), Nucleocapsid (R203M, D377Y), NS3 (S26L), and NS7a (V82A, T120I) proteins are almost exclusive to the Delta variant compared to other variants of concern (mean prevalence across genomes: Delta = 99.74%, Alpha = 0.06%, Beta = 0.09%, and Gamma = 0.22%). Furthermore, we find that the Delta variant harbors a more diverse repertoire of mutations across countries compared to the previously dominant Alpha variant. Overall, our study underscores the high diversity of the Delta variant between countries and identifies a list of amino acid mutations in the Delta variant’s proteome for probing the mechanistic basis of pathogenic features such as high viral loads, high transmissibility, and reduced susceptibility against neutralization by vaccines.

Synopsis

Systematic bioinformatics analysis of amino acid mutations in all 26 SARS-CoV-2 proteins from nearly 2 million SARS-CoV-2 genomes from 176 countries/territories highlights country-specific differences in the mutational profile of the Delta variant.

  • The Delta variant harbors a more diverse repertoire of mutations across countries than the previously dominant Alpha variant.
  • In addition to Spike protein mutations, six highly prevalent missense mutations in the less-studied viral proteins are almost exclusive to the Delta variant compared to the Alpha, Beta, and Gamma variants.
  • These mutations in the Delta variant’s proteome may provide a mechanistic basis of its pathogenic features such as high viral loads, high transmissibility, and reduced susceptibility against neutralization by vaccines.

Introduction

The ongoing COVID-19 pandemic has infected over 210 million people and killed nearly 4.5 million people worldwide as of August 2021 (COVID-19 map—Johns Hopkins Coronavirus Resource Center, https://coronavirus.jhu.edu/map.html). Throughout the pandemic, the SARS-CoV-2 virus has acquired novel mutations, and the US government SARS-CoV-2 Interagency Group (SIG) has classified the mutant strains as variant of concern (VOC), variant of interest (VOI), and variant of high consequence (VOHC) (CDC, 2021). The variants of concern (Alpha: PANGO lineage B.1.1.7, Beta: B.1.351, Gamma: P.1, and Delta: B.1.617.2), as of August 2021, are more transmissible, cause more severe disease, and/or reduce neutralization by vaccines and monoclonal antibodies (CDC, 2021; Tracking SARS-CoV-2 variants, https://www.who.int/en/activities/tracking-SARS-CoV-2-variants/). The Delta variant (PANGO lineage B.1.617.2), first isolated from India in October 2020 (Tracking SARS-CoV-2 variants, https://www.who.int/en/activities/tracking-SARS-CoV-2-variants/), has emerged as the dominant global variant alongside the Alpha variant (PANGO lineage B.1.1.7), with genome sequences deposited from 104 and 150 countries, respectively, in the GISAID database (Shu & McCauley, 2017) and has worsened the public health emergency [WHO press conference on coronavirus disease (COVID-19)—July 30 2021; COVID-19 Virtual Press conference transcript—July 12 2021 (https://www.who.int/publications/m/item/covid-19-virtual-press-conference-transcript—12-july-2021)].

Recent studies report nearly 1,000-fold higher viral loads in infections associated with the Delta variant (preprint: Li et al, 2021) and reduced neutralization of this variant by vaccines (Bernal et al, 2021; Liu et al, 2021a; Mallapaty, 2021; Wall et al, 2021; preprint: Tada et al). The NCBI database lists 26 proteins (structural, non-structural, and accessory proteins) in the SARS-CoV-2 proteome (SARS-CoV-2 protein datasets—NCBI Datasets, https://www.ncbi.nlm.nih.gov/datasets/coronavirus/proteins/) totaling 9,757 amino acids. These include four structural proteins (Spike, Envelope, Membrane, and Nucleocapsid), 16 non-structural proteins (NSP1–NSP16), and six accessory proteins (NS3, NS6, NS7a, NS7b, NS8, and ORF10). As of August 2021, the CDC identifies 11 amino acid mutations in the Spike protein of the Delta variant (CDC, 2021), and the functional role of the SARS-CoV-2 Spike protein mutations has been well studied (Duan et al, 2020; Huang et al, 2020; Shang et al, 2020). However, the mutational landscape of the rest of the Delta variant’s proteome remains poorly understood. Concerted global genomic data sharing efforts through the GISAID database (Shu & McCauley, 2017) have led to the availability of nearly 2 million SARS-CoV-2 genomes from over 175 countries/territories, thereby providing a timely opportunity to analyze the mutational landscape of SARS-CoV-2 variants across all the 26 proteins.

Here, we perform a systematic analysis of amino acid mutations across the SARS-CoV-2 proteome (26 proteins) for the variants of concern and identify that the Delta variant harbors the highest mutational load in this proteome. Interestingly, the Delta variant’s proteome is also highly diverse across different countries compared to the Alpha variant. Our observations suggest the need to account for country-specific mutational profiles for comprehensively understanding the biological attributes of the Delta variant such as increased viral loads and transmissibility, and reduced susceptibility against neutralization by vaccines.

Results

Delta variant has highly prevalent mutations in the viral life cycle-associated Membrane, Nucleocapsid, NS3, and NS7a proteins

Currently, only the Spike protein mutations are being used in literature to define the SARS-CoV-2 variants of concern and interest (CDC, 2021; Tracking SARS-CoV-2 variants, https://www.who.int/en/activities/tracking-SARS-CoV-2-variants/). However, the analysis of 1.99 million genome sequences of SARS-CoV-2 from 176 countries/territories in the GISAID database (Shu & McCauley, 2017) revealed mutations in 52.3% of the 9,757-amino-acid-long SARS-CoV-2 proteome. In all, there are 8,157 unique mutations in 5,107 amino acids spanning 24 of the 26 SARS-CoV-2 proteins (Fig EV1). The 1,055 unique amino acid mutations across 617 positions in the Spike protein contribute to only 6.3% of the mutated SARS-CoV-2 proteome (617 mutated positions of the total 9,757 amino acids in the SARS-CoV-2 proteome). This emphasizes the need to study the mutational profile across all the proteins of SARS-CoV-2.
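The two percentages quoted above follow directly from the position counts. The short Python check below reproduces them; the variable names are illustrative, and the numbers are the ones given in the text.

```python
# Figures quoted in the text above.
proteome_length = 9757         # amino acids in the SARS-CoV-2 proteome
mutated_positions = 5107       # positions carrying at least one retained mutation
spike_mutated_positions = 617  # mutated positions within the Spike protein

mutated_fraction = mutated_positions / proteome_length
spike_fraction = spike_mutated_positions / proteome_length

print(f"{mutated_fraction:.1%}")  # share of the proteome that is mutated
print(f"{spike_fraction:.1%}")    # share attributable to Spike positions
```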

Figure EV1. The proteome-wide mutational profile of SARS-CoV-2

Of the 1.99 million SARS-CoV-2 genomes analyzed here, there are 198,460 genomes corresponding to the Delta variant from 104 countries. We identified seven highly prevalent mutations in the following proteins of the Delta variant: Membrane (I82T: 99.9%), Nucleocapsid (R203M: 99.9%, D377Y: 99.6%), NSP12 (P323L: 99.9%), NS3 (S26L: 99.9%), and NS7a (V82A: 99.4%, T120I: 99.7%). Strikingly, all these mutations except P323L in NSP12 are nearly exclusive to the Delta variant compared to other variants of concern (Alpha, Beta, and Gamma variants of SARS-CoV-2) (mean prevalence in Delta = 99.74%, mean prevalence in other variants of concern = 0.12%) (Fig EV2, Appendix Table S1). Within the Spike protein, there are four such mutations (T19R, L452R, T478K, and P681R) as well (mean prevalence in Delta = 99.86%, mean prevalence in other variants of concern = 0.04%). In total, there are 10 mutations across the proteome that are characteristic of the Delta variant, which can serve as candidates for probing the mechanistic basis of the Delta variant’s pathogenic features.

Figure EV2. Prevalence of mutations in the Delta variant

The known functional implications of Delta variant mutations include antibody escape (Chi et al, 2020; Li et al, 2020b; Liu et al, 2021b; preprint: Venkatakrishnan et al, 2021), high viral load (Plante et al, 2021), increased transmissibility (Li et al, 2021; preprint: Cherian et al), and infectivity (Zhang et al, 2020; Table 1). We have assessed the evolutionary conservation of the 10 characteristic Delta variant mutations using Consurf (Ashkenazy et al, 2016), which grades residues on a scale of 1 (variable) to 9 (conserved) (Table 2). Protein sequence homologs were retrieved using HMMER (Eddy, 2011) against the UniRef90 database (Suzek et al, 2015), and the multiple sequence alignment was built using MAFFT (Katoh et al, 2002). We found the R203M mutation in the Nucleocapsid protein to be highly conserved across 139 homologous protein sequences from coronaviruses. This position is indeed functionally important and is involved in the increased spread of the virus (Syed et al, 2021). It might also alter the binding of the human 14-3-3 protein to the proximal phosphorylated residues, leading to changes in the subcellular localization of the viral protein (Surjit et al, 2005; Del Veliz et al, 2021). Similarly, we also found that the I82T mutation in the Membrane protein (Data ref: Membrane protein, 2020) is highly conserved across 92 homologous protein sequences from coronaviruses. This functionally important residue might lead to altered glucose binding and uptake, as predicted previously in literature (Shen et al, 2021). The functional impact of the remaining eight mutations could not be assessed due to low conservation. Further experimental validation of these functional effects is warranted for a better understanding of their physiological impact.

Table 1. Functional implications of mutations in SARS-CoV-2 Delta variant.

Mutation | Functional domain/region | Is solvent accessible? | Functional implications
Spike E156G | N-terminal domain | Yes | Antibody escape (Chi et al, 2020; preprint: Venkatakrishnan et al, 2021)
Spike ΔF157 | | |
Spike ΔR158 | | |
Spike L452R | Receptor-binding domain | Yes | Antibody escape (Li et al, 2020b; Liu et al, 2021b)
Spike T478K | | |
Spike D614G | | Yes | Increases spike density and infectivity of virion (Zhang et al, 2020), and viral replication (Plante et al, 2021)
Spike P681R | | Yes | Increased transmissibility (preprint: Cherian et al; Scudellari, 2021)
M I82T | Membrane-spanning helix (TM3) (Shen et al, 2021) | Yes | More biologically fit, with altered glucose uptake during viral replication (Shen et al, 2021)
NSP12 P323L | | Yes | Increased transmissibility (preprint: Wang et al, 2020)

  • Mutations in the SARS-CoV-2 Delta variant with known functional implications.

Table 2. Computational characterization of highly prevalent SARS-CoV-2 mutations, exclusive to the Delta variant.

Mutation | Secondary structure | Domain/Site | ConSurf grade | No. of protein homologs | Overall predicted change in protein function
Spike T19R | Loop | N-terminal domain (Data ref: Spike glycoprotein, 2020) | a | 150 (coronaviruses) | Altered antibody interactions (Data ref: Cerutti et al, 2020)
Spike L452R | Strand | Receptor-binding domain (Data ref: Spike glycoprotein, 2020) | 1 | | Potentially increases binding to the ACE2 receptor
Spike T478K | Strand | | 1 | |
Spike P681R | Loop | Proximal to furin cleavage site (Data ref: Spike glycoprotein, 2020) | 1 | | Altered cleavage by host furin (Hoffmann et al, 2020)
Nucleocapsid R203M | Loop | Proximal to phosphorylation site (SR-rich domain) (Tung & Limtung, 2020; preprint: Yaron et al, 2020) | 9 | 139 (coronaviruses) | Increased spread of the virus (Syed et al, 2021) and altered interaction with the human 14-3-3 protein (Del Veliz et al, 2021) leading to changes in subcellular localization (Surjit et al, 2005)
Nucleocapsid D377Y | Loop | | 1 | | Functional impact of the mutation is unclear
Membrane I82T | Helix | Transmembrane domain (Data ref: Membrane protein, 2020) | 7 | 92 (coronaviruses) | Altered glucose binding and uptake
NS3 S26L | Helix | Proximal to viroporin transmembrane domain (Data ref: ORF3a protein, 2020) | a | 135 (coronaviruses) | Altered ion channel activity leading to change in NLRP3 inflammasome activation (key component of host antiviral response) (Chen et al, 2019)
NS7a V82A | Loop | | a | 150 (coronaviruses) | Functional impact of the mutation is unclear
NS7a T120I | Loop | Proximal to polyubiquitination site (Li et al, 2020a) | 1 | | Altered IFN-I response (Xia et al, 2020)

  • The evolutionary conservation of the residues was analyzed using Consurf (Ashkenazy et al, 2016), and graded on a scale of 1 (variable) to 9 (conserved) by the program. Protein sequence homologs were retrieved using one iteration of HMMER (Eddy, 2011) (E-value ≤ 0.0001) against the UniRef90 database (Suzek et al, 2015), and the multiple sequence alignment was built using MAFFT (Katoh et al, 2002).
  • a Unreliable conservation score due to calculations performed on fewer than six non-gapped homologous sequences.

Delta variant is variable across countries and has country-specific core mutations

While the Alpha variant spread widely during the pre-vaccination phase of the pandemic (Tracking SARS-CoV-2 variants, https://www.who.int/en/activities/tracking-SARS-CoV-2-variants/; Ledford et al, 2020), the Delta variant emerged as a global strain during the vaccination period. Given that the extent of vaccination coverage is highly variable across countries (Holder, 2021), the selection pressure against the Delta variant is also likely to vary. To understand mutational profiles of SARS-CoV-2 variants of concern across countries, we generate “mutational prevalence vectors” for each country of occurrence and calculate their pairwise cosine similarities (Fig 1A, Materials and Methods). The cosine similarity distributions for the Alpha and Delta variants are significantly different (Jensen–Shannon divergence = 0.21, 95% confidence interval: [0.17, 0.24], P < 0.001). The mean and standard deviation (SD) of pairwise cosine similarity values for the globally dominant Alpha and Delta variants (Alpha: mean = 0.94, SD = 0.05; Delta: mean = 0.86, SD = 0.1) show a significantly higher diversity in the Delta variant as compared to Alpha (Cohen’s d = 1.17, 95% confidence interval: [1.02, 1.28], P < 0.001; Fig 1B, Appendix Fig S1).
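The pairwise comparison of mutational prevalence vectors can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' code: the country names and prevalence values are made up, and each vector has three entries rather than one per observed mutation. The last line counts the unordered country pairs for the 104 countries in which the Delta variant was reported.

```python
import math
from itertools import combinations

# Hypothetical prevalence of three mutations in three countries
# (fraction of that country's sequences carrying each mutation).
prevalence = {
    "country_A": [0.99, 0.80, 0.05],
    "country_B": [0.98, 0.75, 0.10],
    "country_C": [0.97, 0.10, 0.90],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


# One similarity value per unordered pair of countries.
pairwise = {
    (a, b): cosine_similarity(prevalence[a], prevalence[b])
    for a, b in combinations(sorted(prevalence), 2)
}

# With n countries there are n * (n - 1) / 2 unordered pairs.
n_pairs_delta = math.comb(104, 2)
print(pairwise, n_pairs_delta)
```

Lower similarity values across many country pairs indicate a more heterogeneous variant, which is how the lower Delta mean translates into higher diversity.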

Figure 1. Schematic overview of the study
A. Generation of country-specific mutation prevalence vectors and calculation of pairwise cosine similarity. The study dataset of nearly 2 million sequences, updated as of July 31 2021, was retrieved from GISAID. For a variant of concern, mutational prevalence vectors were calculated for each country of its occurrence. For example, the Delta variant has been reported in 104 countries worldwide and harbors 6,916 unique mutations. Thus, we generate 104 mutational prevalence vectors with 1 × 6,916 dimensions and calculate the pairwise cosine similarities for C(104, 2) = 5,356 combinations.
B. Comparison of probability distributions of pairwise cosine similarity values for the Alpha and Delta variants. The cosine similarity distributions for the Alpha and Delta variants are significantly different (Jensen–Shannon divergence = 0.21, 95% confidence interval: [0.17, 0.24], P < 0.001). The mean and standard deviation (SD) of pairwise cosine similarity values for the globally dominant Alpha and Delta variants show significantly lower similarity in the Delta variant than in Alpha, and thus a higher diversity (Cohen’s d = 1.17, 95% confidence interval: [1.02, 1.28], P < 0.001).
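The Jensen–Shannon divergence used to compare the two similarity distributions can be sketched as follows. This is a minimal illustration that assumes a base-2 logarithm and pre-binned histograms; the text does not state the logarithm base or the binning, and the histogram values below are invented.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical histograms of cosine similarity values over the same four bins.
alpha_hist = [0.02, 0.08, 0.30, 0.60]
delta_hist = [0.10, 0.25, 0.40, 0.25]

jsd = js_divergence(alpha_hist, delta_hist)
print(round(jsd, 3))
```

With a base-2 logarithm the divergence lies between 0 (identical distributions) and 1 (no overlap), and it is symmetric in its two arguments.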

To determine mutations that can contribute to country-specific differences in the Delta variant, we identified the highly prevalent mutations at the country level (“country-specific core mutations”) (Fig 2A; Materials and Methods). As an example, here we compare the country-specific core mutations in the United States (DeltaUnitedStates) and in India (DeltaIndia). DeltaUnitedStates has 29 country-specific core mutations compared with 19 country-specific core mutations in DeltaIndia (Fig 2B). Of these, 16 mutations are common, spanning structural proteins (Spike, Nucleocapsid, and Membrane), non-structural proteins (NSP3, NSP4, NSP6, NSP12, and NSP13), and accessory proteins (NS3 and NS7a).
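Comparing core mutation sets between two countries amounts to simple set operations. The Python sketch below assumes a hypothetical prevalence cutoff of 0.9 and placeholder mutation names; the actual criterion is the one defined in the paper's Materials and Methods and may differ.

```python
# Hypothetical cutoff; the paper's own criterion is defined in its
# Materials and Methods and may differ.
CORE_THRESHOLD = 0.9


def core_mutations(prevalence_by_mutation, threshold=CORE_THRESHOLD):
    """Return the set of mutations whose prevalence meets the threshold."""
    return {m for m, p in prevalence_by_mutation.items() if p >= threshold}


# Illustrative prevalences with placeholder names, not values from the study.
usa = {"mut_1": 0.99, "mut_2": 0.97, "mut_3": 0.93, "mut_4": 0.40}
india = {"mut_1": 0.99, "mut_2": 0.96, "mut_3": 0.20, "mut_4": 0.95}

core_usa = core_mutations(usa)
core_india = core_mutations(india)

shared = core_usa & core_india
only_usa = core_usa - core_india
only_india = core_india - core_usa
print(sorted(shared), sorted(only_usa), sorted(only_india))
```

The counts reported in the text fit the same logic: 29 core mutations in the United States and 19 in India with 16 shared leave 13 and 3 unique to each country, respectively.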

Figure 2. Identification of country-specific core mutations
A. Schematic overview of the method for defining country-specific core mutations for a lineage. See Materials and Methods for further details.
B. Comparison of prevalence of country-specific core mutations in the Delta variant in India and the United States. A total of 16 country-specific core mutations are common to both India and the United States, whereas 13 and 3 mutations are unique to the United States and India, respectively. The six mutations (in other SARS-CoV-2 proteins) marked with an asterisk are highly prevalent in all countries of occurrence of Delta variant (mean prevalence = 99.74%) but are nearly absent (mean prevalence = 0.12%) in the other variants of concern (Alpha, Beta, and Gamma variants of SARS-CoV-2). The mutations are highlighted on the structure of the Spike protein and the structural models of the other SARS-CoV-2 proteins (see Methods). Residues corresponding to Spike protein mutations T19R, T478K, and P681R are missing from the structure of the Spike protein and hence not shown here. The 43-amino-acid-long NS7b protein has no structure/model available and hence is not represented here.

There are three mutations in three proteins that are highly prevalent in DeltaIndia but not in DeltaUnitedStates. In contrast, there are 13 mutations spanning six proteins that are highly prevalent in DeltaUnitedStates but not DeltaIndia, including in the exoribonuclease NSP14, which is critical for the viral replication machinery (Ogando et al, 2020) and can inhibit the host translational machinery (Hsu et al, 2021). We have assessed the evolutionary conservation of these mutations using Consurf, as described in the previous section (Appendix Table S2). We found that the T492I mutation in the Nsp4C domain (possibly involved in protein–protein interactions; Data ref: Annotation rule, 2020) of the NSP4 protein is highly conserved across 139 homologous protein sequences from coronaviruses. This mutation can affect its interactions with proteins such as ER homeostasis factors, N-linked glycosylation machinery, unfolded protein response-associated proteins, and antiviral innate immune signaling factors (Davies et al, 2020). The mutations at functionally important positions in DeltaUnitedStates (Spike G142D, E156G, ΔF157, and ΔR158) map to the antigenic supersite (Cerutti et al, 2021), possibly leading to immune evasion and thus increased virulence of this variant. The presence of country-specific differences in the Delta variant motivates the need to understand whether these genome-level differences manifest as differences in disease phenotypes and vaccine effectiveness.

Discussion

COVID-19 is the first pandemic of the post-genomic era (van Dorp et al, 2021) that has been under intense genomic surveillance through concerted global viral sequencing efforts. This has led to the identification and tracking of emerging variants of concern, such as the highly transmissible Alpha variant and Delta variant. Through analysis of nearly 2 million genomes from 176 countries/territories, we have identified that there are mutations beyond the Spike protein that are characteristic of the Delta variant and that the Delta variant is more variable across countries than other variants of concern.

Our study has identified 10 highly prevalent mutations characteristic of the Delta variant across five proteins, which can serve as therapeutic targets and as candidates for probing the mechanistic basis of the Delta variant’s pathogenic features such as high viral loads, increased transmissibility, and reduced susceptibility against neutralization by vaccines. The country-specific differences in the Delta variant’s mutational profile identified in this study can also be used to guide the design of vaccines/boosters that can comprehensively combat COVID-19. Our study also suggests that diversity at the proteome level should be considered when designating variants of concern and interest. This study shows that the sub-variants of the Delta variant (Fig 3A) are prevalent in geographically distant countries (Fig 3B), eliminating a causal relationship of geographical proximity with Delta variant diversity. However, future studies are warranted to comprehensively examine combinations of factors such as vaccination rates, geographical proximity, and airline connectivity (Fig EV3) to dissect the differences in the epidemiology of Delta variants across countries.

Figure 3. Comparison of the Delta sub-variants
A. Hierarchical clustering of pairwise cosine similarities across countries. We identified four clusters corresponding to four sub-variants of the Delta variant. The dendrogram shows the hierarchical relationship among the Delta sub-variants.
B. Geographical locations of the countries of localization of the sub-variants. The annotations on a map of the world show that the sub-variants are prevalent in geographically distant countries.

Figure EV3. Effect of geographical separation and airline connectivity on the diversity of the Delta variant

This study has a few limitations. Since this study is based on publicly available data from the GISAID database, it may carry biases associated with sequencing disparities across countries and reporting delays. Although there is extensive genomic surveillance, there is a lack of clinical annotation of the genomes, limiting our ability to assess the clinical impact of the country-specific differences in the variants. The GISAID database does not record mutations in the recently discovered ORFs in the SARS-CoV-2 genome such as ORF10, ORF9b, and ORF9c. The assignment of the mutations in these ORFs may reveal further differences between SARS-CoV-2 variants.

Although mass vaccination efforts are underway around the world, there are huge differences in the population immunity of countries due to the differences in the vaccines approved regionally and the extent of vaccination coverage in populations. These differences contribute to the risk of emergence of new SARS-CoV-2 variants, which could pose challenges to existing therapies and vaccination (Weber et al, 2021). Continued genome surveillance is imperative for developing comprehensive global and country-specific preventive and therapeutic measures to end the ongoing pandemic.

Materials and Methods

SARS-CoV-2 genome sequences

We retrieved 1,987,504 SARS-CoV-2 high-coverage complete-genome sequences from human hosts in 176 countries/territories spanning 1,336 PANGO lineages on August 18 2021 from GISAID (Shu & McCauley, 2017) for December 2019 to July 2021, of which 816 sequences do not harbor any mutations. We removed sequences from other hosts and those with incomplete dates (YYYY-MM or YYYY) from further analyses. A total of 1,986,688 sequences harbor a total of 89,875 unique amino acid mutations. However, to account for errors arising from sequencing, we only consider the 8,157 unique mutations in 24 proteins that are present in 100 or more sequences for all our further analyses. We did not identify any mutations in NSP11 (for which no mutations are present in 100 or more sequences) or ORF10 (for which no information on mutations is available in GISAID data); hence, these proteins are not considered in further analyses.

Although 99.15% of all SARS-CoV-2 genome sequences possess one or more mutations in the Spike protein, 98.91% and 95.2% of sequences also bear mutations in the crucial NSP12 (RNA-dependent RNA polymerase, RdRp) and Nucleocapsid proteins, respectively.

We retrieved the list of proteins in the SARS-CoV-2 proteome from NCBI (SARS-CoV-2 protein datasets—NCBI Datasets, https://www.ncbi.nlm.nih.gov/datasets/coronavirus/proteins/) on August 2 2021. The structure of the Spike protein was retrieved from PDB (code: 6VSB) and that of the structural models of the other SARS-CoV-2 proteins from https://zhanglab.ccmb.med.umich.edu/COVID-19/ (on June 11 2021).

Cosine similarity across countries

To calculate the cosine similarity of a lineage L among countries, we generated a prevalence vector of constituent mutations for each country of occurrence of the lineage L. For a pair of countries, the cosine similarity of the lineage L was calculated for their mutation vectors (A, B) (Equation 1, Fig 1A).

$$\text{Cosine similarity}(A, B) = \frac{A \cdot B}{|A| \times |B|} \tag{1}$$
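Equation 1 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: it assumes each country's prevalence vector is stored as a dictionary keyed by mutation, and the mutation labels and prevalence values are placeholders rather than data from the study.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two mutation-prevalence vectors (Equation 1).

    `a` and `b` map mutation labels to prevalence values; a mutation that is
    absent from one country is treated as having prevalence 0 there.
    """
    mutations = sorted(set(a) | set(b))
    va = [a.get(m, 0.0) for m in mutations]
    vb = [b.get(m, 0.0) for m in mutations]
    dot = sum(x * y for x, y in zip(va, vb))
    norm_a = math.sqrt(sum(x * x for x in va))
    norm_b = math.sqrt(sum(y * y for y in vb))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for an all-zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical prevalence (%) of mutations of one lineage in two countries
country_a = {"mutation_1": 99.0, "mutation_2": 98.0, "mutation_3": 97.0}
country_b = {"mutation_1": 98.0, "mutation_2": 97.0, "mutation_4": 60.0}
similarity = cosine_similarity(country_a, country_b)
```

A value close to 1 indicates that the two countries share nearly the same mutational profile for the lineage; mutations found in only one of the two countries pull the value down.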

The mean and standard deviation (SD) of pairwise cosine similarity values for variants of concern (meanAlpha = 0.94, SDAlpha = 0.05; meanBeta = 0.89, SDBeta = 0.06; meanGamma = 0.95, SDGamma = 0.03; and meanDelta = 0.86, SDDelta = 0.1) show a higher diversity of the Delta variant across countries. To check the effect size, Cohen’s d was calculated (Equation 2).

$$\text{Cohen's } d = \frac{M_2 - M_1}{\sqrt{\dfrac{(n_1 - 1) \times SD_1^2 + (n_2 - 1) \times SD_2^2}{n_1 + n_2 - 2}}} \tag{2}$$

where M: mean, n: sample size, and SD: standard deviation.
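As a worked illustration of Equation 2, the sketch below plugs in the Alpha and Delta means and standard deviations reported above. The sample sizes (the number of country pairs per variant) are not given in this section, so the values used here are hypothetical.

```python
import math


def cohens_d(m1, sd1, n1, m2, sd2, n2):
    """Cohen's d with a pooled standard deviation (Equation 2)."""
    pooled_variance = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    return (m2 - m1) / math.sqrt(pooled_variance)


# Means and SDs for Alpha (group 1) and Delta (group 2) are taken from the text;
# n1 and n2 are hypothetical placeholders.
d = cohens_d(m1=0.94, sd1=0.05, n1=1000, m2=0.86, sd2=0.10, n2=1000)
```

With equal group sizes the pooled variance is simply the average of the two variances, so the result depends on the assumed sample sizes only when they differ between groups.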

Probability distributions of pairwise cosine similarities were calculated by binning frequencies (bins = 25), and their Jensen–Shannon divergence (with base 2) was calculated using the jensenshannon function available in SciPy [v1.7.0] (Virtanen et al, 2020). P was calculated using bootstrapping with 1,000 iterations.
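The binning and divergence step might look like the following sketch. Two points are assumptions rather than details from the text: the bin edges are shared and span [0, 1], and the value returned by SciPy's `jensenshannon`, which is the Jensen–Shannon distance (the square root of the divergence), is squared. The text does not say whether the authors squared it. The input samples are synthetic.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon


def js_divergence(values_a, values_b, bins=25):
    """Jensen-Shannon divergence (base 2) between two binned samples."""
    # Assumption: both samples are cosine similarities in [0, 1], so shared
    # bin edges over that range make the two histograms comparable.
    edges = np.linspace(0.0, 1.0, bins + 1)
    p, _ = np.histogram(values_a, bins=edges)
    q, _ = np.histogram(values_b, bins=edges)
    # jensenshannon normalizes the counts to probabilities and returns the
    # Jensen-Shannon distance; squaring gives the divergence.
    return float(jensenshannon(p, q, base=2) ** 2)


# Synthetic similarity values standing in for two variants
rng = np.random.default_rng(0)
sample_a = rng.beta(20, 2, size=500)
sample_b = rng.beta(8, 2, size=500)
jsd = js_divergence(sample_a, sample_b)
```

With base 2 the divergence is bounded between 0 (identical distributions) and 1 (non-overlapping distributions), which makes values comparable across pairs of variants.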

To identify countries with similar mutational profiles, we clustered the pairwise cosine similarity matrix with Ward’s variance minimization algorithm (Ward & Hook, 1963) available in SciPy [v1.7.0] (Fig 3A).
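A minimal sketch of this clustering step with SciPy is shown below. The similarity matrix is hypothetical, and converting similarity to a distance as 1 − similarity is an assumption: the text does not state how the matrix was passed to the algorithm, and Ward's method is formally defined for Euclidean distances.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Hypothetical pairwise cosine similarity matrix for four countries
similarity = np.array([
    [1.00, 0.95, 0.60, 0.55],
    [0.95, 1.00, 0.58, 0.57],
    [0.60, 0.58, 1.00, 0.93],
    [0.55, 0.57, 0.93, 1.00],
])

# Assumption: turn similarity into a dissimilarity before clustering
distance = 1.0 - similarity

# linkage expects a condensed (upper-triangle) distance vector
condensed = squareform(distance, checks=True)
tree = linkage(condensed, method="ward")

# Cut the dendrogram into two clusters (the study reports four for Delta)
labels = fcluster(tree, t=2, criterion="maxclust")
```

The `tree` array can be passed to `scipy.cluster.hierarchy.dendrogram` to draw a dendrogram of the kind described for Fig 3A.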

Bootstrapping of cosine similarities

For each country, we resampled (with replacement) all the sequences deposited in the GISAID database and generated a cosine similarity distribution for Alpha and Delta variants (Fig EV4). To calculate the 95% confidence interval, we calculated Jensen–Shannon divergence (JSD) and Cohen’s d for each bootstrap iteration. To get a null distribution for JSD and Cohen’s d, we calculated these metrics from the Alpha and Delta cosine similarity distribution generated in each bootstrap iteration (n = 1,000). The P-values were calculated based on the distribution of all bootstrapped values and original JSD/Cohen’s d values.
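The general idea of a percentile bootstrap confidence interval can be sketched as follows. This is a simplified illustration and not the procedure in Fig EV4: it resamples synthetic similarity values directly instead of resampling the underlying sequences per country, and the helper names are hypothetical.

```python
import random
import statistics


def cohens_d(x, y):
    """Cohen's d between two samples, using the pooled standard deviation."""
    n1, n2 = len(x), len(y)
    pooled = ((n1 - 1) * statistics.variance(x)
              + (n2 - 1) * statistics.variance(y)) / (n1 + n2 - 2)
    return (statistics.mean(y) - statistics.mean(x)) / pooled ** 0.5


def bootstrap_ci(x, y, stat, n_iter=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for stat(x, y)."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_iter):
        # Resample each group with replacement, keeping its original size
        xs = rng.choices(x, k=len(x))
        ys = rng.choices(y, k=len(y))
        estimates.append(stat(xs, ys))
    estimates.sort()
    lower = estimates[int((alpha / 2) * n_iter)]
    upper = estimates[int((1 - alpha / 2) * n_iter) - 1]
    return lower, upper


# Synthetic similarity values loosely shaped like the reported Alpha and Delta
# means and SDs; these are not data from the study.
data_rng = random.Random(42)
alpha_like = [data_rng.gauss(0.94, 0.05) for _ in range(200)]
delta_like = [data_rng.gauss(0.86, 0.10) for _ in range(200)]
lower, upper = bootstrap_ci(alpha_like, delta_like, cohens_d)
```

If the resulting interval excludes zero, the difference between the two groups is unlikely to be a resampling artifact.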

Figure EV4. Bootstrapping Methodology

Cosine similarity for airline connectivity

Air traffic data were accessed on June 13 2021 from The OpenSky Network 2020 (Olive et al, 2021; Strohmeier et al, 2021). Only international flights were considered in this analysis. A matrix of the number of international flights across all countries of the world was generated for the period of February 2021 to June 2021. For country A, a vector of the number of outgoing flights to all the other countries normalized with respect to the total number of outgoing flights from country A was generated. Similarly, for country B, a vector of the number of incoming flights from all the other countries normalized with respect to the total number of incoming flights to country B was generated. Cosine similarity for airline connectivity for this pair of countries was calculated as in Equation 1.
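The construction of the two flight vectors can be sketched as below. The flight counts are made up, and indexing both vectors over the full list of countries (with zero self-flights, since only international flights are counted) is an assumption about how the vectors were aligned.

```python
import math


def normalized(counts):
    """Scale a list of flight counts so that it sums to 1."""
    total = sum(counts)
    if total == 0:
        raise ValueError("no flights recorded for this country")
    return [c / total for c in counts]


def airline_cosine(flights, countries, a, b):
    """Cosine similarity between outgoing flights of `a` and incoming flights of `b`.

    `flights[origin][destination]` holds the number of international flights.
    """
    outgoing_a = normalized([flights.get(a, {}).get(c, 0) for c in countries])
    incoming_b = normalized([flights.get(c, {}).get(b, 0) for c in countries])
    dot = sum(x * y for x, y in zip(outgoing_a, incoming_b))
    norm_a = math.sqrt(sum(x * x for x in outgoing_a))
    norm_b = math.sqrt(sum(y * y for y in incoming_b))
    return dot / (norm_a * norm_b)


# Hypothetical flight counts between four countries
countries = ["A", "B", "C", "D"]
flights = {
    "A": {"B": 10, "C": 30, "D": 0},
    "B": {"A": 12, "C": 5, "D": 3},
    "C": {"A": 25, "B": 6, "D": 40},
    "D": {"A": 0, "B": 4, "C": 35},
}
connectivity = airline_cosine(flights, countries, "A", "B")
```

Because cosine similarity is invariant to rescaling either vector, the normalization by total flights does not change the value; it only makes the vectors themselves comparable across countries.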

Country-specific core mutations

Genome sequences of Alpha, Beta, Gamma, and Delta variants in GISAID data are available from 150, 95, 61, and 104 countries, respectively. For country C, we calculated the prevalence of a mutation M as in Equation 3.

$$\text{Prevalence of } M\,(L \mid C) = \frac{\text{Number of sequences of lineage } L \text{ in country } C \text{ that harbor a mutation } M}{\text{Total number of deposited sequences of lineage } L \text{ in country } C} \times 100 \tag{3}$$
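Equation 3 amounts to a filtered count. The sketch below assumes each sequence is a record with `lineage`, `country`, and `mutations` fields; this layout and the records themselves are hypothetical and do not reflect the GISAID schema.

```python
def prevalence(sequences, lineage, country, mutation):
    """Percentage of sequences of `lineage` in `country` carrying `mutation` (Equation 3)."""
    selected = [
        s for s in sequences
        if s["lineage"] == lineage and s["country"] == country
    ]
    if not selected:
        raise ValueError("no sequences deposited for this lineage and country")
    carriers = sum(1 for s in selected if mutation in s["mutations"])
    return 100.0 * carriers / len(selected)


# Hypothetical records
records = [
    {"lineage": "Delta", "country": "X", "mutations": {"m1", "m2"}},
    {"lineage": "Delta", "country": "X", "mutations": {"m1"}},
    {"lineage": "Delta", "country": "X", "mutations": {"m1", "m3"}},
    {"lineage": "Delta", "country": "X", "mutations": {"m2"}},
    {"lineage": "Alpha", "country": "X", "mutations": {"m1"}},
]
prevalence_m1 = prevalence(records, "Delta", "X", "m1")
```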

The prevalence of all mutations identified in lineage L in country C was calculated and further clustered using the K-means clustering algorithm (Lloyd, 1982) (in scikit-learn; Pedregosa et al, 2011) for unbiased identification of the highly prevalent set (core) of mutations for lineage L in country C. Based on K-means clustering sensitivity analysis, we partitioned the observations into two clusters, with initial cluster centroids at 0% and 100% (Appendix Fig S2). All mutations with labels corresponding to the higher centroid are called the core mutations of lineage L in country C (“country-specific core mutations”). A union set of country-specific core mutations from all countries in which lineage L is present was also determined. We observed that the Delta variant’s union set of country-specific core mutations is distinct from, and larger than, those of the other variants of concern (Fig EV5, Appendix Table S3).
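The two-cluster partition can be illustrated with a dependency-free version of Lloyd's iteration in one dimension. The study used the K-means implementation in scikit-learn; the function below is a stand-in written for illustration, and the prevalence values are hypothetical.

```python
def two_means_1d(values, init=(0.0, 100.0), max_iter=100):
    """Lloyd's K-means for k = 2 on one-dimensional data.

    Returns a label per value (1 = cluster of the higher centroid) and the
    final (low, high) centroids.
    """
    low, high = init

    def assign(v):
        return 1 if abs(v - high) < abs(v - low) else 0

    for _ in range(max_iter):
        lows = [v for v in values if assign(v) == 0]
        highs = [v for v in values if assign(v) == 1]
        # Keep the previous centroid if a cluster happens to be empty
        new_low = sum(lows) / len(lows) if lows else low
        new_high = sum(highs) / len(highs) if highs else high
        if (new_low, new_high) == (low, high):
            break
        low, high = new_low, new_high
    return [assign(v) for v in values], (low, high)


# Hypothetical prevalence (%) of mutations of one lineage in one country
prevalence_by_mutation = {
    "m1": 99.2, "m2": 97.5, "m3": 88.0,
    "m4": 12.0, "m5": 3.1, "m6": 0.4,
}
names = list(prevalence_by_mutation)
labels, centroids = two_means_1d([prevalence_by_mutation[m] for m in names])
core_mutations = {m for m, label in zip(names, labels) if label == 1}
```

Starting the centroids at 0% and 100% makes the result deterministic and ties the label of the "core" cluster to the higher centroid, so no post hoc relabeling is needed.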

Figure EV5. Distinct mutational profiles of the SARS-CoV-2 variants

The characteristic Spike protein mutations defined by the CDC (CDC, 2021) (as of August 2 2021) overlap with those identified in our analysis (Appendix Fig S3), thus validating our method of identifying mutations in the SARS-CoV-2 proteome.

Beyond CRISPR: A guide to the many other ways to edit a genome


The popular technique has limitations that have sparked searches for alternatives.

Argonaute proteins (model pictured) are one of many potential alternatives to the CRISPR–Cas9 gene-editing system.

The CRISPR–Cas9 tool enables scientists to alter genomes practically at will. Hailed as dramatically easier, cheaper and more versatile than previous technologies, it has blazed through labs around the world, finding new applications in medicine and basic research.

But for all the devotion, CRISPR–Cas9 has its limitations. It is excellent at going to a particular location on the genome and cutting there, says bioengineer Prashant Mali at the University of California, San Diego. “But sometimes your application of interest demands a bit more.”

The zeal with which researchers jumped on a possible new gene-editing system called NgAgo earlier this year reveals an undercurrent of frustration with CRISPR–Cas9, and a drive to find alternatives. “It’s a reminder of how fragile every new technology is,” says George Church, a geneticist at Harvard Medical School in Boston, Massachusetts.

NgAgo is just one of a growing library of gene-editing tools. Some are variations on the CRISPR theme; others offer new ways to edit genomes.

A mini-me

CRISPR–Cas9 may one day be used to rewrite the genes responsible for genetic diseases. But the components of the system — an enzyme called Cas9 and a strand of RNA to direct the enzyme to the desired sequence — are too large to stuff into the genome of the virus most commonly used in gene therapy to shuttle foreign genetic material into human cells.

A solution comes in the form of a mini-Cas9, which was plucked from the bacterium Staphylococcus aureus [1]. It’s small enough to squeeze into the virus used in one of the gene therapies currently on the market. Last December, two groups used the mini-me Cas9 in mice to correct the gene responsible for Duchenne muscular dystrophy [2, 3].

Expanded reach

Cas9 will not cut everywhere it’s directed to — a certain DNA sequence must be nearby for that to happen. This demand is easily met in many genomes, but can be a painful limitation for some experiments. Researchers are looking to microbes to supply enzymes that have different sequence requirements so that they can expand the number of sequences they can modify.

One such enzyme, called Cpf1, may become an attractive alternative. Smaller than Cas9, it has different sequence requirements and is highly specific [4, 5].

Another enzyme, called C2c2, targets RNA rather than DNA, a feature that holds potential for studying RNA and combating viruses with RNA genomes [6].

True editors

Many labs use CRISPR–Cas9 only to delete sections in a gene, thereby abolishing its function. “People want to declare victory like that’s editing,” says Church. “But burning a page of the book is not editing the book.”

Those who want to swap one sequence with another face a more difficult task. When Cas9 cuts DNA, the cell often makes mistakes as it stitches together the broken ends. This creates the deletions that many researchers desire.

But researchers who want to rewrite a DNA sequence rely on a different repair pathway that can insert a new sequence — a process that occurs at a much lower frequency than the error-prone stitching. “Everyone says the future is editing many genes at a time, and I think: ‘We can’t even do one now with reasonable efficiency’,” says plant scientist Daniel Voytas of the University of Minnesota in Saint Paul.

But developments in the past few months have given Voytas hope. In April, researchers announced that they had disabled Cas9 and tethered to it an enzyme that converts one DNA letter to another. The disabled Cas9 still targeted the sequence dictated by its guide RNA, but could not cut: instead the attached enzyme switched the DNA letters, ultimately yielding a T where once there was a C [7]. A paper published in Science last week reports similar results [8].

Voytas and others are hopeful that tethering other enzymes to the disabled Cas9 will allow different sequence changes.

Pursuing Argonautes

In May, a paper in Nature Biotechnology [9] unveiled an entirely new gene-editing system. Researchers claimed that they could use a protein called NgAgo to slice DNA at a predetermined site without needing a guide RNA or a specific neighbouring genome sequence. Instead, the protein, which is made by a bacterium, is programmed using a short DNA sequence that corresponds to the target area.

The finding kicked off a wave of excitement and speculation that CRISPR–Cas9 would be unseated, but laboratories have so far failed to reproduce the results. Even so, there is still hope that proteins from the family that NgAgo belongs to — known as Ago or Argonautes — made by other bacteria could provide a way forward, says genome engineer Jin-Soo Kim at the Institute for Basic Science in Seoul.

Programming enzymes

Other gene-editing systems are also in the pipeline, although some have lingered there for years. For an extensive project that aimed to edit genes in bacteria, Church’s lab did not reach for CRISPR at all. Instead, the team relied heavily on a system called lambda Red, which can be programmed to alter DNA sequences without the need for a guide RNA. But despite 13 years of study in Church’s lab, lambda Red works only in bacteria.

Church and Feng Zhang, a bioengineer at the Broad Institute of MIT and Harvard in Cambridge, Massachusetts, say that their labs are also working on developing enzymes called integrases and recombinases for use as gene editors. “By exploring the diversity of enzymes, we can make the genome-editing toolbox even more powerful,” says Zhang. “We have to continue to explore the unknown.”

MASTER ORCHESTRATOR OF THE GENOME IS DISCOVERED, STEM CELL SCIENTISTS REPORT



One of developmental biology’s most perplexing questions concerns what signals transform masses of undifferentiated cells into tremendously complex organisms, a process called ontogeny.

New research by University at Buffalo scientists, published last week in PLOS ONE, provides evidence that it all begins with a single “master” growth factor receptor that regulates the entire genome.

“The finding provides a new level of understanding of the fundamental aspects of how organisms develop,” says senior author Michal K. Stachowiak, PhD, professor in the Department of Pathology and Anatomical Sciences in the UB School of Medicine and Biomedical Sciences. He also directs the Stem Cell Engraftment and In Vivo Analysis Facility and the Stem Cell Culture and Training Facility at the Western New York Stem Cell Culture and Analysis Center at UB.

“Our research shows how a single growth factor receptor protein moves directly to the nucleus in order to program the entire genome,” he said.

The research challenges a long-held supposition in biology that specific types of growth factors only functioned at a cell’s surface. For two decades, Stachowiak’s team has been intrigued by the possibility that growth factors function from within the nucleus, a point, he says, this current paper finally proves.

A more advanced understanding of how organisms form, based on this work, has the potential to significantly enhance the understanding and treatment of cancers, which result from uncontrolled development, as well as congenital diseases, the researchers say. The new research will also contribute to the understanding of how stem cells work.

This work was conducted on mouse embryonic stem cells, not human cells.

Organizing ‘this cacophony of genes’

“We’ve known that the human body has almost 30,000 genes that must be controlled by thousands of transcription factors that bind to those genes,” Stachowiak said, “yet we didn’t understand how the activities of genes were coordinated so that they properly develop into an organism.

“Now we think we have discovered what may be the most important player, which organizes this cacophony of genes into a symphony of biological development with logical pathways and circuits,” he said.

At the center of the discovery is a single protein called nuclear Fibroblast Growth Factor Receptor 1 (nFGFR1). “FGFR1 occupies a position at the top of the gene hierarchy that directs the development of multicellular animals,” said Stachowiak.

The FGFR1 gene is known to govern gastrulation, occurring in early development, where the three-layered embryonic structure forms. It also plays a major role in the development of the central and peripheral nervous systems and the development of the body’s major systems, including muscles and bones.

To study how nuclear FGFR1 worked, the UB team used genome-wide sequencing of mouse embryonic stem cells programmed to develop cells of the nervous system, with additional experiments in which nuclear FGFR1 was either introduced or blocked. The researchers found that the protein was responsible, either alone or with so-called partner nuclear receptors, for ensuring that embryonic stem cells develop into differentiated cells. By targeting thousands of genes, it controls the development of the major points of growth in the body (known as axes) as well as neuronal and muscle development.

The research shows that nuclear FGFR1 binds to promoters of genes that encode transcription factors, the proteins that control which genes are turned on or off in the genome.

“We found that this protein works as a kind of ‘orchestration factor,’ preferably targeting certain gene promoters and enhancers. The idea that a single protein could bind thousands of genes and then organize them into a hierarchy, that was unknown,” Stachowiak said. “Nobody predicted it.”

Sequencing advances

The discovery that a single protein can exert such a global genomic function stems from recent advances in DNA sequencing technologies, which allow for the sequencing of a complex genome in just hours.

“NextGen DNA sequencing allows us to analyze millions of DNA sequences selected by the interacting protein,” Stachowiak said.

In the UB research, the DNA sequencing data were processed by the supercomputer at the university’s Center for Computational Research (CCR). Stachowiak and his colleagues then spent weeks aligning these data to the genome and conducting further analyses.

“We imposed nuclear FGFR1 on every little corner of genome,” he said. “The computer spit out which genes are affected by nuclear FGFR1: it was an enormously complex network of genome activity.”

They found that the protein binds to genes that make neurons and muscles as well as to an important tumor suppressor gene, TP53, which is involved in a number of common cancers.

Other studies in Stachowiak’s laboratory demonstrate that these interactions also take place in the human genome, controlling function and possibly underlying diseases like schizophrenia. Targeting of the nuclear FGFR1 allows for the reactivation of neural development in the adult brain in preclinical studies and thus, Stachowiak says, may offer unprecedented opportunity for regenerative medicine. Nuclear accumulation of nuclear FGFR1 may be altered in some cancer cells, and thus could become a focus in cancer therapy, he added.

Stachowiak concluded: “This seminal discovery lends new perspectives to the origin, nature and treatment of a variety of human diseases.”

Too Much Information? Geneticist Mark Robson Discusses Accidental Genetic Findings.


Genetic testing of tumors is becoming increasingly common in cancer care. The molecular alterations found in a tumor can provide critical information for making an accurate diagnosis and determining the best treatment.

Although current clinical testing usually focuses on a panel of specific mutations, cancer centers are developing programs to analyze entire cancer genomes routinely — an approach made possible by cheaper sequencing costs — in order to individualize care. This process raises a thorny issue: What happens when a genome analysis of a person’s tumor reveals that he or she is at risk for developing a different type of cancer or other disease?

Recently, Memorial Sloan-Kettering Clinical Genetics Service Chief Kenneth Offit, Clinical Genetics Service Clinic Director Mark E. Robson, and researcher Yvonne Bombard published a viewpoint in the Journal of the American Medical Association regarding this question of incidental genetic findings, which cancer researchers have dubbed the “incidentalome.”

We asked Dr. Robson to discuss some of the issues surrounding accidental genetic findings and what Memorial Sloan-Kettering is doing to address them.

What is an example of a genetic variation that might be discovered by accident while sequencing the genome of a patient’s tumor?

For instance, you could be sequencing a lung cancer tumor in search of an EGFR mutation to target with an anticancer drug, and find a mutation in BRCA1, which is associated with increased risk for breast and ovarian cancer. Since most of a tumor’s DNA sequence is identical to the sequence of a normal cell from that same patient, this additional variation is probably inherited — and is what is called a germline mutation.

In that situation, are you obligated to inform the patient? It’s a very complex question. There are many variables to consider, such as individual preference, whether anything can be done to control risk, and whether other people — such as close relatives — may be affected.

Has this actually become a problem for doctors and researchers, or is it still a hypothetical situation for now?

Right now, most clinical testing of tumors is for a relatively limited number of specific mutations, not the full genome. But soon we’re going to be testing for a much broader panel of genes, increasing the chances of incidental findings.

On the research side, it’s quickly becoming an issue. Many tumor samples that have been stored in tissue banks for years or decades are now being fully sequenced. If incidental discoveries are made during that process, is there an obligation to try to find those patients and inform them? This has not been established, and there are obvious practical barriers. We need to lay the intellectual groundwork now for how we’re going to respond to these questions.

What steps have been taken at Memorial Sloan-Kettering to address the issue?

This summer, our Institutional Review Board (IRB), which oversees all of our patient-related research, updated part of our patient consent policy. When patients agree to have a tissue sample taken, they are asked whether they are open to being re-contacted if an investigator finds something that might affect their health.

Under the new procedure, if a researcher finds something that might be important to communicate to the patient, the specific question will be put before the IRB and carefully considered. If there is agreement the information should be conveyed, and the patient has indicated that he or she wants to be re-contacted, we’ll reach out to that person. We think this protects the people participating in our studies without restricting important research.

With all the genetic research taking place at Memorial Sloan-Kettering, is the IRB facing a deluge of these cases?

So far, no. The way the analyses are being conducted is that the computer looks for mutations in specific spots and subtracts all other information about the inherited genetic sequence before the investigator sees it. In other words, if you have genetic variants present in the tumor that are also in the normal cells, they are being filtered out by the software. The investigator ends up seeing variants that are only in the tumor.
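The filtering step described here is, at its core, a set difference between tumor and normal variant calls. The sketch below is illustrative only: the variant tuples and coordinates are made up, and it is not a description of the software used at Memorial Sloan-Kettering.

```python
def tumor_only_variants(tumor_variants, normal_variants):
    """Return the variants seen in the tumor but not in the matched normal sample.

    Each variant is a (chromosome, position, reference, alternate) tuple.
    """
    return set(tumor_variants) - set(normal_variants)


# Made-up variant calls for one patient
tumor = {
    ("chr7", 1000, "T", "G"),   # present only in the tumor
    ("chr17", 2000, "A", "C"),  # also present in normal cells (inherited)
}
normal = {
    ("chr17", 2000, "A", "C"),
}
reported = tumor_only_variants(tumor, normal)
```

Only the tumor-specific variant reaches the investigator; the inherited one is subtracted before anyone sees it, which is why this design limits incidental findings.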

As we pointed out in the JAMA paper, this is one way of limiting potential incidentalome issues.

But some researchers don’t have the germline DNA sequence available for comparison purposes, so while sequencing the tumor they see potentially relevant variations. For example, they could be sequencing a prostate cancer genome and see a mutation in the BRCA1 gene, which increases risk of other cancers.

The question becomes, under what circumstances do you tell the patient, and what about the patient’s siblings or children who may carry the mutation as well? In addition, sometimes multiple variants associated with disease risk may be found — and how do we provide counseling for all of them at once?

Have you gotten a sense from patients about what their preference usually is regarding being informed of these incidental genetic discoveries?

Commonly, people say, “I want to know everything,” but the devil’s in the details when you start considering the risk for diseases that can’t be prevented or treated. We are setting up focus groups of patients and unaffected people to try to understand how people think when they are confronted with these situations and how they prioritize different types of genetic information. We also have an active IRB protocol in which we are giving people who had their sequence determined as part of research studies the opportunity to learn their results.

Right now, it’s not clear what the dividing lines are. We want to reach a point where mutations are sorted into different categories, where certain incidental findings are nearly always appropriate to communicate to patients, others almost never so, and some require more context to determine.

We’re moving from the traditional model of asking patients if they would like to hear the results of a specific test before that test is performed, to this brave new world where we’re trying to help people make decisions about genetic information revealed by accident that is not possible to fully anticipate. It’s a very complicated issue, but it also offers a tremendous opportunity to benefit patients.

If you are interested in participating in the focus group, call 646-888-4867. Everyone is welcome, including patients, relatives, Memorial Sloan-Kettering employees, and the general public. No sequencing is provided.

Source: MSKCC

Faces are sculpted by ‘junk DNA’


Scientists have identified thousands of regions in the genome that control the activity of genes for facial features.


‘Transcriptional enhancers’ switch genes on or off in different parts of the face. Photograph: Rex Features

Researchers have started to figure out how DNA fine-tunes faces. In experiments on mice, they have identified thousands of regions in the genome that act like dimmer switches for the many genes that code for facial features, such as the shape of the skull or size of the nose.

Specific mutations in genes are already known to cause conditions such as cleft lips or palates. But in the latest study, a team of researchers led by Axel Visel of the Lawrence Berkeley National Laboratory in Berkeley, California, wanted to find out how variations seen across the normal range of faces are controlled.

Though everybody’s face is unique, the actual differences are relatively subtle. What distinguishes us is the exact size and position of things like the nose, forehead or lips. Scientists know that our DNA contains instructions on how to build our faces, but until now they have not known exactly how it accomplishes this.

Visel’s team was particularly interested in the portion of the genome that does not code for proteins – until recently nicknamed “junk” DNA – but which comprises around 98% of our genomes. In experiments using embryonic tissue from mice, where the structures that make up the face are in active development, Visel’s team identified more than 4,300 regions of the genome that regulate the behaviour of the specific genes that code for facial features.

The results of the analysis are published on Thursday in Science.

These “transcriptional enhancers” tweak the function of hundreds of genes involved in building a face. Some of them switch genes on or off in different parts of the face, others work together to create, for example, the different proportions of a skull, the length of the nose or how much bone there is around the eyes.

“If you think about face development, a gene that is important for both development of the nose and the mouth might have two different enhancers and one of them activates the gene in the nose and the other just in the mouth,” said Visel.

“Certainly, one evolutionary advantage that is associated with this is that you can now change the sequence of the nose or mouth enhancers and, independently, affect the activity of the gene in just one structure or the other. It may be a way that nature has evolved in which you can fine-tune the expression of genes in complex ways without having to mess with the gene itself. If you destroy the protein itself, that usually has much more severe consequences.”

In further experiments to test their findings, the scientists genetically engineered mice to lack three of the enhancers they had identified. They then used CT (computed tomography) scanning to build 3D images of the resulting mouse skulls at the age of eight weeks.

Compared with normal mice, the skulls of the modified mice had microscopic, but consistent, changes in the length and width of the faces, as expected. Importantly, all of the modified mice only showed subtle changes in their faces, and there were no serious harmful results such as cleft lips or palates.

Though the work was done in mice, Visel said that the lessons transfer across to humans very well. “When you look at the anatomy and development of the mouse versus the human, we find that the faces are actually very similar. Both are mammals and they have, essentially, all the same major bones and structures in their skulls, they just have a somewhat different shape in the mouse. The same genes that are important for mouse face development are important in humans.”

Visel said that the primary use of this information, beyond basic genetic knowledge, would be as part of a diagnostic tool, for clinicians who might be able to advise parents if they are likely to pass on particular mutations to their children.

Peter Hammond, a professor of computational biology at University College London’s Institute of Child Health, who researches genetic effects on facial development, said understanding how faces develop can be important for health.

“There are many genetic conditions where the face is a first clue to diagnosis, and even though the facial differences are not necessarily severe the condition may involve significant intellectual impairment or adverse behavioural traits, as well as many other effects,” he said. “Diagnosis is important for parents as it reduces the stress of not knowing what is wrong, but also can be important for prognosis.”

The technology to go beyond diagnosis and make precise corrections of the genome does not yet exist and, even if it did, it is not clear that changing genes or enhancers to create “designer” faces would be worthwhile. “I don’t think it would be desirable to even attempt that. It’s certainly not something that motivates me to work on this,” said Visel. “And I don’t think anyone working in this field would seriously view this as a possible motivation.”

Researchers identify key proteins that help establish cell function


Researchers at the University of California, San Diego School of Medicine have developed a new way to parse and understand how special proteins called “master regulators” read the genome, and consequently turn genes on and off.

Writing in the October 13, 2013 Advance Online Publication of Nature, the scientists say their approach could make it quicker and easier to identify specific gene mutations associated with increased disease risk, an essential step toward developing future targeted treatments, preventions and cures for conditions ranging from diabetes to neurodegenerative disease.

“Given the emerging ability to sequence the genomes of individual patients, a major goal is to be able to interpret that DNA sequence with respect to disease risk. What diseases is a person genetically predisposed to?” said principal investigator Christopher Glass, MD, PhD, a professor in the departments of Medicine and Cellular and Molecular Medicine at UC San Diego.

“Mutations that occur in protein-coding regions of the genome are relatively straightforward, but most mutations associated with disease risk actually occur in regions of the genome that do not code for proteins,” said Glass. “A central challenge has been developing a strategy that assesses the potential functional impact of these non-coding mutations. This paper lays the foundation for doing so by examining how natural genetic variation alters the function of genomic regions controlling gene expression in a cell-specific manner.”

Cells use hundreds of different proteins called transcription factors to “read” the genome, employing those instructions to turn genes on and off. These factors tend to be bound close together on the genome, forming functional units called “enhancers.” Glass and colleagues hypothesized that while each cell has tens of thousands of enhancers consisting of myriad combinations of factors, most enhancers are established by just a handful of special transcription factors called “master regulators.” These master regulators play crucial, even disproportional, roles in defining each cell’s identity and function, such as whether it will be a muscle, skin or heart cell.

“Our main idea was that the binding of these master regulators is necessary for the co-binding of the other transcription factors that together enable enhancers to regulate the expression of nearby genes,” Glass said.

The scientists tested and validated their hypothesis by looking at the effects of approximately 4 million DNA sequence differences affecting master regulators in macrophage cells in two strains of mice. Macrophages are a type of immune response cell. They found that DNA sequence mutations deciphered by master regulators not only affected how they bound to the genome, but also impacted neighboring transcription factors needed to make functional enhancers.

The findings have practical importance for scientists and doctors investigating the genetic underpinnings of disease, said Glass. “Without actual knowledge of where the master regulator binds, there is relatively little predictive value of the DNA sequence for non-coding variants. Our work shows that by collecting a focused set of data for the master regulators of a particular cell type, one can greatly reduce the ‘search space’ of the genome in a particular cell type that would be susceptible to the effects of mutations. This allows prioritization of mutations for subsequent analysis, which can lead to new discoveries and real-world benefits.”
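The idea of narrowing the search space can be illustrated with a minimal sketch. It assumes a list of variant positions and a list of binding sites measured for a master regulator in one cell type, and keeps only the variants that fall inside a binding site. The function name, the data layout and the coordinates are hypothetical and chosen for illustration. This is not the study's actual analysis pipeline.

```python
from bisect import bisect_right


def prioritize_variants(variants, binding_sites):
    """Return the variants that lie inside a master-regulator binding site.

    variants: list of (chrom, pos) tuples.
    binding_sites: list of (chrom, start, end) tuples with inclusive ends.
    Assumption: sites on the same chromosome do not overlap each other.
    """
    # Group binding sites by chromosome and sort them by start position.
    by_chrom = {}
    for chrom, start, end in binding_sites:
        by_chrom.setdefault(chrom, []).append((start, end))
    for sites in by_chrom.values():
        sites.sort()
    starts = {c: [s for s, _ in sites] for c, sites in by_chrom.items()}

    kept = []
    for chrom, pos in variants:
        sites = by_chrom.get(chrom)
        if not sites:
            continue  # no binding data for this chromosome
        # Find the last site whose start is <= pos, then check its end.
        i = bisect_right(starts[chrom], pos) - 1
        if i >= 0 and sites[i][1] >= pos:
            kept.append((chrom, pos))
    return kept


# Invented example coordinates, not real data.
binding_sites = [("chr1", 100, 200), ("chr1", 500, 650), ("chr2", 50, 120)]
variants = [("chr1", 150), ("chr1", 300), ("chr1", 650),
            ("chr2", 10), ("chr2", 120), ("chr3", 5)]

kept = prioritize_variants(variants, binding_sites)
print(kept)
```

In this toy input, three of the six variants fall inside a binding site and would be carried forward for closer analysis. The rest are set aside, which is the sense in which focused binding data shrinks the list of candidates.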

Source: University of California – San Diego

Scientist who mapped human genome says we will be able to ‘print’ alien life from Mars


J. Craig Venter says the next revolution in genetics will come from synthetic biology, as we learn to design and ‘print’ organisms with computers.

Scientists will soon be able to design and print simple organisms using biological 3D printers, says J. Craig Venter, the scientist who led the private-sector mapping of the human genome.

Venter predicts that new methods of digital design and manufacture will provide the next revolution in genetics, with synthetic cells and organisms tailor-made to tackle humanity’s problems: a toolkit of sequenced genes will be used to create disease-resistant animals, higher-yielding crops, and drugs that extend human life and boost our brain power.

These ideas are outlined in Venter’s latest book, ‘Life at the Speed of Light: From the Double Helix to the Dawn of Digital Life’, in which the geneticist asks the age-old question ‘what is life?’ before detailing the history, and future, of creating the stuff from scratch.

For Venter, life can be reduced to “protein robots” and “DNA machines”, but he also believes that technology will unlock far more exotic opportunities for creating life. The title of the book refers to the idea that we may be able to transmit DNA sequences found on Mars back to Earth (at the speed of light) to be replicated at home by biological printers.

“I am confident that life once thrived on Mars and may well still exist there today,” writes Venter. “The day is not far off when we will be able to send a robotically controlled genome-sequencing unit in a probe to other planets to read the DNA sequence of any alien microbe life that may be there.”

Venter’s ideas may sound like science fiction, but he has achieved comparable feats in the past. Frustrated by what he viewed as slow government-led efforts to sequence the human genome in the 1990s, Venter raised private capital to create a rival effort under the company name Celera.

Fears that Venter and his backers would attempt to patent the genome spurred the US-led effort into action, and a global gene race was sparked, with both sides eventually agreeing to announce their results one day apart in February 2001.

Venter parted ways with Celera in 2002 and founded the J. Craig Venter Institute in 2006. In 2010 he and his colleagues at the institute announced that they had created the world’s first synthetic organism.

The team created a bacterial genome from scratch and ‘watermarked’ it with custom DNA strings (these included an encoded email address) before transplanting it into another cell. The cell then began to reproduce, making it the first living species created by humanity.

Although such pioneering work frequently raises ethical questions over the danger of humanity ‘playing God’, Venter writes that he is not troubled by such concerns. In ‘Life at the Speed of Light’ he writes: “My greatest fear is not the abuse of technology but that we will not use it at all.”

The ENCODE Project: ENCyclopedia Of DNA Elements.


ENCODE Overview

The National Human Genome Research Institute (NHGRI) launched a public research consortium named ENCODE, the Encyclopedia Of DNA Elements, in September 2003, to carry out a project to identify all functional elements in the human genome sequence. The project started with two components – a pilot phase and a technology development phase.

The pilot phase tested and compared existing methods to rigorously analyze a defined portion of the human genome sequence (See: ENCODE Pilot Project). The conclusions from this pilot project were published in June 2007 in Nature and Genome Research [genome.org]. The findings highlighted the success of the project in identifying and characterizing functional elements in the human genome. The technology development phase has also been a success, promoting several new technologies to generate high-throughput data on functional elements.

With the success of the initial phases of the ENCODE Project, NHGRI funded new awards in September 2007 to scale the ENCODE Project to a production phase on the entire genome along with additional pilot-scale studies. Like the pilot project, the ENCODE production effort is organized as an open consortium and includes investigators with diverse backgrounds and expertise in the production and analysis of data (See: ENCODE Participants and Projects). This production phase also includes a Data Coordination Center [genome.ucsc.edu] to track, store and display ENCODE data along with a Data Analysis Center to assist in integrated analyses of the data. All data generated by ENCODE participants will be rapidly released into public databases (See: Accessing ENCODE Data) and available through the project’s Data Coordination Center.

Source: genome.gov