The Next Human Genome Challenge.


Scientists sequenced the human genome back in 2003, but how this code produces the rich tapestry of human life is still a profound mystery.


One of the great achievements of modern science was the human genome project to map the sequence of genes in human DNA. The project produced unprecedented insight into the function of genes, their role in human health and the nature of life itself.

And yet the human genome project was just the beginning. Armed with the sequence of genes in DNA, life scientists now want to know how the extraordinarily rich complexity of life emerges from this code.

Closely linked and just as puzzling is how small changes in the genome can lead to the rich tapestry of human life with its infinite variety of faces, ethnicities and susceptibility to certain diseases.

Now scientists at more than 120 laboratories in the US and elsewhere have joined forces to search for an answer. The group is called the Impact of Genomic Variation on Function Consortium and its goal is to understand how variations in the genome influence its function and, consequently, the phenotype of the resulting human.

The project has the potential to revolutionize the way scientists understand life and the role that genes play in disease. “To unlock these insights, we need a systematic and comprehensive catalog of genome function and the molecular and cellular effects of genomic variants,” say the team.

Profound Problem

The scale of the task is huge. The human genome project revealed some 25,000 genes. But just a small subset of these is switched on in any particular tissue at any one time. How this genetic switching works in perfect synchrony is a question of profound importance.

Scientists know that each gene codes for a specific protein. In other words, it is a section of DNA that can be transcribed into RNA and then translated into a protein. These proteins are the building blocks of all cells and of the molecular machinery of life. But transcribing a single gene is no simple task.

Each copy of the genome — almost every cell has its own copy — consists of about 3 billion base pairs lined up in the famous double helix structure. If laid out in a straight line, this strand of DNA would stretch to about 2 meters.
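
Those two figures fit together with simple arithmetic: B-form DNA rises about 0.34 nanometres per base pair, and the roughly 2-metre figure usually quoted counts the two genome copies carried by a typical (diploid) cell. A minimal sketch of the calculation, with those two values as the only assumptions:

```python
# Back-of-envelope check: how long is the DNA in one cell?
# Assumes ~0.34 nm of helix length per base pair (B-form DNA)
# and two genome copies per diploid cell.

BASE_PAIRS_PER_GENOME = 3_000_000_000   # ~3 billion bp per genome copy
RISE_PER_BASE_PAIR_M = 0.34e-9          # metres of helix per base pair
COPIES_PER_DIPLOID_CELL = 2

one_copy_m = BASE_PAIRS_PER_GENOME * RISE_PER_BASE_PAIR_M
per_cell_m = one_copy_m * COPIES_PER_DIPLOID_CELL

print(f"One genome copy stretched out: ~{one_copy_m:.1f} m")   # ~1.0 m
print(f"DNA per diploid cell:          ~{per_cell_m:.1f} m")   # ~2.0 m
```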

But instead, it is packed tightly inside the cell and must be unpacked to access the genes it carries. This packing and unpacking is highly orchestrated. The DNA strand is first wound onto molecular “cotton reels” called histones, which then weave tightly together to form a “DNA rope” called chromatin. This itself weaves back and forth into shapes called chromosomes.

To access a gene, the chromatin must be unpacked in a way that reveals the precise location of the gene and then packed away again afterwards.

All this is managed by complex networks of molecules working together in synchrony. One of the great discoveries of the Human Genome Project was that DNA does not just code for proteins. It also contains numerous genes that produce RNA strands that do not code for proteins.

This non-coding RNA coordinates the processes of life in a complex network of operations (switching, shepherding, binding and so on) to control this enormous ballet of molecular construction.

Given this glimpse of the processes of life, scientists now want to know how it all works; that’s essentially the goal of the Impact of Genomic Variation on Function Consortium.

They already know that single changes within the genome can lead to significant differences between individuals, for example in their susceptibility to certain diseases. But it is no easy task to tease apart the role of each single nucleotide variation, not least because many phenotypic features are the result of combinations of many nucleotide variations. Even when scientists are aware of the variations between individuals, their significance is not always clear.

Rate-Limiting Step

And that makes it hard to determine the role of genes in many diseases or how to fix them. “The interpretation of the impact of genomic variation on function is currently a rate-limiting step for delivering on the promise of precision medicine,” say the group.

So the Impact of Genomic Variation on Function project aims to create a map of the predicted effects of every possible single-nucleotide variant on the key aspects of genome function. That will mean working out how coding variants change the shape and function of proteins, how non-coding variants influence gene expression, and how together these might influence molecular networks throughout a cell.

Given that the genome has 3 billion nucleotides, there will be no way to experimentally measure the effect of a variant in each position in all cells in every circumstance. The combination of possibilities is mind-bogglingly large.
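
To put a number on that, consider just the simplest layer of the problem: every one of the roughly 3 billion reference positions can be substituted by any of three alternative bases. The sketch below runs that arithmetic; the cell-type count used to illustrate the combinatorial blow-up is a made-up ballpark, not a figure from the consortium.

```python
# How many single-nucleotide variants are possible in principle?
# Each reference position can be substituted by any of the 3 other bases.

GENOME_POSITIONS = 3_000_000_000   # ~3 billion nucleotide positions
ALTERNATIVE_BASES = 3              # A/C/G/T minus the reference base

possible_snvs = GENOME_POSITIONS * ALTERNATIVE_BASES
print(f"Possible single-nucleotide variants: ~{possible_snvs:,}")   # ~9 billion

# Measuring each variant across many cell types and conditions multiplies
# that again -- hence the need for predictive models. The cell-type count
# below is a hypothetical ballpark, for illustration only.
CELL_TYPES = 400
print(f"Variant x cell-type combinations:    ~{possible_snvs * CELL_TYPES:,}")
```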

So scientists will attempt to measure the effects of many variants, but computer modelling will need to shoulder the load for predicting the effects of many others. “The amount of data needed to build accurate models of genome function is unknown, and fully realizing the goal of mapping the impact of genomic variation on function will require additional advances in both experimental and computational methods,” say the team.

That’s why the IGVF consortium is so large — the required skills range across the whole of the life sciences sector and beyond into bioinformatics and computer science.

That’s an ambitious goal with profound implications for how we understand human health and in particular the role of genetic variation in disease. The results will be worth following for years to come.

Did the Black Death shape the human genome? Study challenges bold claim


An ancient-DNA study of medieval Cambridge found no sign of genes that helped people to survive the plague, casting doubt on an earlier study.

The Black Death in 1349, as recorded by the chronicler Gilles le Muiset, abbot in Tournai, France.

The Black Death struck Cambridge, England, in 1349. The bodies piled up so fast — up to six in ten people died in Europe — that gravediggers struggled to keep up, and many remains wound up in mass burials.

Despite its heavy toll, this wave of bubonic plague doesn’t seem to have had a lasting impact on the genomes of the people of Cambridge, suggests a 17 January study in Science Advances. The findings contradict a high-profile 2022 Nature paper that identified variants in immune genes that were enriched in people who survived the Black Death, suggesting that the variants might have had a protective effect.

“Because it’s such a devastating event, people naturally expect it will leave some genetic signature,” says Ruoyun Hui, a population geneticist at the University of Cambridge, UK, who co-led the latest study. “We didn’t find much evidence supporting adaptation of immune responses.”

Scarred genomes

The human genome is littered with scars of ancient disease outbreaks, and there is good evidence that gene variants common in different populations helped their ancestors to survive infections — and pass those helpful variants on to their children.

But connecting such changes to specific disease outbreaks, such as the Black Death, has been difficult. Natural selection tends to act over many generations, and the influence of one disease outbreak can be especially hard to see in the small numbers of ancient human genomes typically available, says Luis Barreiro, a human-population geneticist at the University of Chicago in Illinois.

In the 2022 study, Barreiro’s team looked at hundreds of ancient genomes of people from London and Denmark. The researchers identified more than 200 immune gene variants that became either more or less common in people who survived the Black Death, compared with people who died before or during the plague years. In laboratory studies, variants in one gene, called ERAP2, helped immune cells to control the plague-causing bacterium Yersinia pestis.

To see how the Black Death might have shaped people’s genomes in other regions, a team led by Hui and her University of Cambridge colleague population geneticist Toomas Kivisild sequenced the genomes of 275 people from medieval and post-medieval Cambridge and surrounding villages.

However, in a subset of the 70 most-complete genomes, the researchers found few signs of natural selection after the Black Death. About 10% of the 245 immune variants that Barreiro’s team had found in Londoners changed in frequency in the Cambridge cohort. But 10 of those 22 variants shifted in opposite directions in the two studies: variants that seemed to help people to survive the Black Death in Cambridge became less common in London, or vice versa.

Hui and Kivisild’s team detected no shift in the frequency of the plague-protecting ERAP2 variants. A version of another gene that protects against leprosy became slightly more common after the Black Death, but this association did not meet a statistical threshold typically applied to genomic studies. Kivisild says that it’s still possible that the Black Death influenced the evolution of immune genes in ways his team’s study could not detect. “But it is a little bit premature to make far-reaching conclusions at this stage.”

Question of scales

Iain Mathieson, a population geneticist at the University of Pennsylvania in Philadelphia, isn’t surprised that the Black Death didn’t seem to alter the genomes of people in Cambridge. In a 2023 preprint, Mathieson and his colleagues identified what they say are analytical flaws in Barreiro’s study. When they corrected for these issues, they found that the variants identified no longer met a statistical threshold used in genome-wide studies to avoid spurious associations.
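
The threshold in question is study-specific, but the best-known example of such a cutoff is the conventional genome-wide significance level used in association studies, which is essentially a Bonferroni-style correction for the roughly one million effectively independent common variants tested. A minimal sketch of where that number comes from, assuming the usual one-million-test figure:

```python
# Why genome-wide association studies conventionally require p < 5e-8:
# a Bonferroni-style correction of a 0.05 error rate for roughly one
# million effectively independent common variants tested per genome.

FAMILY_WISE_ERROR_RATE = 0.05
INDEPENDENT_TESTS = 1_000_000   # conventional estimate for common variants

genome_wide_threshold = FAMILY_WISE_ERROR_RATE / INDEPENDENT_TESTS
print(f"Per-test significance threshold: {genome_wide_threshold:.0e}")  # 5e-08
```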

Barreiro stands by his team’s findings, particularly their interpretation of ERAP2’s role. A separate epidemiological study showed that variants in the gene protect against respiratory infections. And in another 2023 preprint, researchers trained a deep-learning model to identify natural selection in ancient-genome data and found hints that two of the ERAP2 variants had been under selection in the past 2,000 years — a period that includes not only the Black Death but also other plague epidemics. “Once you start having these pieces coming together, I think you can build a strong case for selection,” Barreiro says.

Ultimately, the debate is about timescales. “Personally, I think that a single epidemic is unlikely to make much difference,” says Mathieson, “but repeated epidemics and pandemics certainly could.”

Firm answers on the Black Death’s impact — or that of any past disease outbreak — will require thousands of ancient-human genomes and maybe more, researchers agree. “I don’t think it’s going to be resolved in a definitive manner that pleases everyone until we reach much, much larger sample sizes,” says Barreiro.

Most Complete Human Genome of All Time Revealed by Scientists


Source: https://curiosmos.com/most-complete-human-genome-of-all-time-revealed-by-scientists/

Does the data we produce serve us, or vice versa?


You’ve heard the argument before: Genes are the permanent aristocracy of evolution, looking after themselves as fleshy hosts come and go. That’s the thesis of a book that, last year, was christened the most influential science book of all time: Richard Dawkins’ The Selfish Gene.

But we humans actually generate far more actionable information than is encoded in all of our combined genetic material, and we carry much of it into the future. The data outside of our biological selves—call it the dataome—could actually represent the grander scaffolding for complex life. The dataome may provide a universally recognizable signature of the slippery characteristic we call intelligence, and it might even teach us a thing or two about ourselves.

It is also something that carries a considerable energetic burden. That burden challenges us to ask whether we are manufacturing and protecting our dataome for our benefit alone, or whether, like the selfish gene, the data makes us do it because that’s what ensures its propagation into the future.

Take, for instance, William Shakespeare.

Who’s in charge?: The bard has become a living part of the human dataome.

Shakespeare died on April 23, 1616 and his body was buried two days later in Holy Trinity Church in Stratford-Upon-Avon. His now-famous epitaph carries a curse to anyone who dares “move my bones.” And as far as we know, in the past 400 years, no one has risked incurring Will’s undead wrath.

But he has most certainly lived on beyond the grave. At the time of his death Shakespeare had written a total of 37 plays, among other works. Those 37 plays contain a total of 835,997 words. In the centuries that have come after his corporeal life an estimated 2 to 4 billion physical copies of his plays and writings have been produced. All of those copies have been composed of hundreds of billions of sheets of paper acting as vessels for more than a quadrillion ink-rich letters.


Across time these billions of volumes have been physically lifted and transported, dropped and picked up, held by hand, or hoisted onto bookshelves. Each individual motion has involved a small expenditure of energy, maybe a few Joules. But that has added up across the centuries. It’s possible that altogether the simple act of human arms raising and lowering copies of Shakespeare’s writings has expended well over 4 trillion Joules of energy. That’s equivalent to combusting several hundred thousand kilograms of coal.
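
That estimate is easy to reproduce with a few crude assumptions: a volume weighing about a kilogram, lifted through a fraction of a metre a few hundred times over its lifetime, and coal releasing roughly 24 megajoules per kilogram. None of these inputs comes from the article beyond the billions of copies printed; they are illustrative guesses that land in the same range.

```python
# Crude reproduction of the "lifting Shakespeare" estimate.
# All inputs below are illustrative assumptions, not measured values.

G = 9.81                      # gravitational acceleration, m/s^2
BOOK_MASS_KG = 1.0            # assumed mass of a typical printed volume
LIFT_HEIGHT_M = 0.3           # assumed height of a single lift onto a shelf
LIFTS_PER_COPY = 500          # assumed handlings over a copy's lifetime
COPIES_PRINTED = 3e9          # within the article's 2 to 4 billion range
COAL_ENERGY_J_PER_KG = 24e6   # rough energy content of coal

energy_per_lift_j = BOOK_MASS_KG * G * LIFT_HEIGHT_M
total_energy_j = energy_per_lift_j * LIFTS_PER_COPY * COPIES_PRINTED
coal_equivalent_kg = total_energy_j / COAL_ENERGY_J_PER_KG

print(f"Energy per lift:      ~{energy_per_lift_j:.1f} J")      # ~3 J
print(f"Total lifting energy: ~{total_energy_j:.1e} J")         # ~4e12 J
print(f"Coal equivalent:      ~{coal_equivalent_kg:,.0f} kg")   # ~2e5 kg
```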

Additional energy has been utilized every time a human has read some of those 835,997 words and had their neurons fire. Or spoken them to a rapt audience, or spent tens of millions of dollars to make a film of them, or turned on a TV to watch one of the plays performed, or driven to a Shakespeare festival. Or for that matter bought a tacky bust of “the immortal bard” and hauled it onto a mantelpiece. Add in the energy expenditure of the manufacture of paper, books, and their transport and the numbers only grow and grow.

It may be impossible to fully gauge the energetic burden that William Shakespeare unwittingly dumped on the human species, but it is substantial. Of course, we can easily forgive him. He wrote some good stuff. But there is also a sense in which the data of Shakespeare has become its own living part of the dataome, propagating itself into the future and compelling all of us to support it, just as is happening right now in this sentence.

Shakespeare, to be fair, contributed barely a drop to a vast ocean of data that is ethereal and yet extremely tangible in its effects upon us. This is both the glory and millstone of Homo sapiens.

We have been pumping out persistent data since our first oral exchange of a good story and our first experimental handprint on a cave wall. Neither of those things were explicitly encoded in our DNA, yet they could readily outlive the individual who created them. Indeed, data like these have outlived generation after generation of humans.

But as time has gone by our production of data has accelerated. Today, by some accounts, our species generates about 2.5 quintillion bytes of data a day. That’s more than a billion billion bytes for each planetary rotation. And that rate of output is still growing. While lots of that data is a mixture of fleeting records—from Google searches to air traffic control—more and more ends up persisting in the environment. Pet videos, GIFs, political diatribes, troll responses, as well as medical records, scientific data, business documents, emails, tweets, photo albums, all wind up as semi-permanent electrical blips in doped silicon or magnetic dots on hard drives.
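
Those daily figures are easier to grasp per second. The sketch below takes the 2.5-quintillion-bytes-a-day estimate at face value and assumes a standard 700 MB CD; the result is an order-of-magnitude figure, in the same range as the CD comparison quoted below, where the exact count depends on which estimates are used.

```python
# Converting "2.5 quintillion bytes a day" into per-second terms.
# CD capacity assumed at 700 MB; results are order-of-magnitude only.

BYTES_PER_DAY = 2.5e18
SECONDS_PER_DAY = 86_400
CD_CAPACITY_BYTES = 700e6

bytes_per_second = BYTES_PER_DAY / SECONDS_PER_DAY
cds_per_second = bytes_per_second / CD_CAPACITY_BYTES

print(f"Data rate:      ~{bytes_per_second:.1e} bytes per second")  # tens of TB/s
print(f"CD equivalents: ~{cds_per_second:,.0f} CDs per second")     # tens of thousands
```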

In perspective: The human genome fits on about two CDs. The human species produces about 20,000 CDs worth of data a second.

All of this data production and storage takes a lot of energy to maintain, from the moment someone’s hands scrabble for rare-earth elements in the soil, to the electricity that sustains it all. There’s a reason that a large company like Apple builds its own data server farms, and looks for ways to optimize the power generation that these air-conditioned, electron-pushing factories demand, whether it’s building massive solar farms in Nevada or utilizing hydroelectricity in Oregon.

Even Shakespeare’s medium—traditional paper—is still an energy-hungry beast. In 2006 it was estimated that United States paper production gulped down about 2,400 trillion BTUs (about 2.5 million trillion Joules) to churn out 99.5 million tons of pulp and paper products. That amounts to some 28,000 Joules of energy used per gram of final material—before any data is even printed on it. Or to put it another way, this is equivalent to roughly 5 grams of high-quality coal being burnt per page of paper.
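
Those per-gram and per-page numbers follow from the headline figures. The sketch below redoes the arithmetic, assuming US short tons, a 5-gram sheet of office paper and high-quality coal at about 28 kilojoules per gram; all three are assumptions added for illustration.

```python
# Reproducing the paper-energy arithmetic from the 2006 US figures.
# The ton definition, sheet mass and coal energy density are assumptions.

BTU_TO_J = 1055.0
SHORT_TON_G = 907_185

us_paper_energy_j = 2_400e12 * BTU_TO_J      # 2,400 trillion BTU -> ~2.5e18 J
us_paper_output_g = 99.5e6 * SHORT_TON_G     # 99.5 million short tons, in grams
SHEET_MASS_G = 5.0                           # assumed mass of one page
COAL_ENERGY_J_PER_G = 28_000                 # rough value for high-quality coal

energy_per_gram = us_paper_energy_j / us_paper_output_g   # ~28,000 J/g
energy_per_sheet = energy_per_gram * SHEET_MASS_G         # ~140,000 J
coal_per_sheet_g = energy_per_sheet / COAL_ENERGY_J_PER_G # ~5 g

print(f"Energy per gram of paper: ~{energy_per_gram:,.0f} J")
print(f"Coal equivalent per page: ~{coal_per_sheet_g:.1f} g")
```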

Why are we doing this? Why are we expending ever increasing amounts of effort to maintain the data we, and our machines, generate? This behavior may represent far more than we at first think.


On the face of things, it seems pretty obvious that our capacity to carry so much data with us through time is a critical part of our success at spreading across the planet. We can continually build on our knowledge and experience in a way that no other species seemingly does. Our dataome provides us with a massive evolutionary advantage.

But it’s clearly not free. We may be trapped in a bigger Darwinian reality where we are in effect now serving as a supporting organelle for our own dataome.

This is an unsettling framework for looking at ourselves. But it has parallels in other parts of the natural world. Our microbiome, of tens of trillions of single-celled organisms, is perpetuated not so much by us as individuals, but by generations of us carrying this biological information through time. Yet we could also flip this around and conceptualize the situation as the microbiome carrying us through time. The microbiome exists in us because we’re a good environment. But that’s a symbiotic relationship. The microbes have to do things a certain way, have to work at supporting their human carrying systems. A human represents an energetic burden as much as an evolutionary advantage to microbes. Similarly, our dataome is both an advantage to us humans, and a burden.

The question is, is our symbiosis still healthy? The present-day energetic burden of the dataome seems like it could be at a maximum level in the history of our species. It doesn’t necessarily follow that we’re experiencing a correspondingly large benefit. We might do well to examine whether there is an optimal state for the dataome, a balance between the evolutionary advantages it confers on its species and the burden it represents.

The proliferation of data of seemingly very low utility (that I might grumpily describe as cat pictures and selfies) could actually be a sign of worrying dysfunction in our dataome. In other words, undifferentiated and exponential growth of low-value data suggests that data can get cancer. In which case we’d do well to take this quite seriously as a human health issue—especially if treatment reduces our global energy burden, and therefore our impact on the planetary environment.

Improving the utility of our data, purging it of energy-wasting junk might not be popular, but could perhaps be incentivized. Either through data credit schemes akin to domestic solar power feeding back to the grid, or making the loss of data a positive feature. What you might call a Snapchat approach.

In that case, the human-dataome symbiosis might become the only example in nature of a symbiotic relationship that is consciously managed by one party. What the long-term evolutionary robustness of that would be is hard to say.

But more optimistically; if the dataome is indeed an integral and integrated part of our evolutionary path then perhaps by mining it we can learn more about not just ourselves and our health, but the nature of life and intelligence in general. Precisely how we interrogate the dataome is a wide-open question. There may be emergent structure within it that we simply haven’t recognized, and we will need to develop measures and metrics to examine it properly. Existing tools like network theory or computational genomics might help.

The potential gains of such an analysis could be enormous. If the dataome is a real thing then it represents a missing piece of our puzzle; of the function and evolution of a sentient species. We’d do well to at least take a look. As Shakespeare once said: “The web of our life is of a mingled yarn, good and ill together.”

Caleb Scharf is an astrophysicist, the Director of Astrobiology at Columbia University in New York, and a founder of yhousenyc.org, an institute that studies human and machine consciousness. His latest book is The Zoomable Universe: An Epic Tour Through Cosmic Scale, from Almost Everything to Nearly Nothing.

Our Bodies, Our Data


New technologies have launched the life sciences into the age of big data. Biologists must now make sense of their windfall.

Twenty years ago, sequencing the human genome was one of the most ambitious science projects ever attempted. Today, each human genome, which easily fits on a DVD, looks comparatively simple when set against the collection of genomes of the microorganisms living in our bodies, the ocean, the soil and elsewhere. Its 3 billion DNA base pairs and about 20,000 genes seem paltry next to the roughly 100 billion bases and millions of genes that make up the microbes found in the human body.
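
The DVD claim is easy to sanity-check: with four possible letters, a base needs only two bits, so an uncompressed human genome is well under a gigabyte, while the microbial gene content of a body runs to tens of gigabytes. A rough sketch, assuming 2 bits per base and a 4.7 GB single-layer DVD:

```python
# Does a human genome really fit on a DVD?
# Assumes 2 bits per base (A/C/G/T) and a 4.7 GB single-layer DVD.

HUMAN_GENOME_BASES = 3e9
MICROBIOME_BASES = 100e9          # ~100 billion bases, per the article
BITS_PER_BASE = 2
DVD_CAPACITY_BYTES = 4.7e9

def genome_size_gb(bases: float) -> float:
    """Uncompressed size in gigabytes at 2 bits per base."""
    return bases * BITS_PER_BASE / 8 / 1e9

human_gb = genome_size_gb(HUMAN_GENOME_BASES)
microbiome_gb = genome_size_gb(MICROBIOME_BASES)
dvd_gb = DVD_CAPACITY_BYTES / 1e9

print(f"Human genome:    ~{human_gb:.2f} GB ({human_gb / dvd_gb:.0%} of one DVD)")
print(f"Body microbiome: ~{microbiome_gb:.0f} GB (~{microbiome_gb / dvd_gb:.0f} DVDs)")
```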

And a host of other variables accompanies that microbial DNA, including the age and health status of the microbial host, when and where the sample was collected, and how it was collected and processed. Take the mouth, populated by hundreds of species of microbes, with as many as tens of thousands of organisms living on each tooth. Beyond the challenges of analyzing all of these, scientists need to figure out how to reliably and reproducibly characterize the environment where they collect the data.

“There are the clinical measurements that periodontists use to describe the gum pocket, chemical measurements, the composition of fluid in the pocket, immunological measures,” said David Relman, a physician and microbiologist at Stanford University who studies the human microbiome. “It gets complex really fast.”

Ambitious attempts to study complex systems like the human microbiome mark biology’s arrival in the world of big data. Biology has long been considered a descriptive science — 10 years ago, the field was relatively data poor, and scientists could easily keep up with the data they generated. But with advances in genomics, imaging and other technologies, biologists are now generating data at crushing speeds.

One culprit is DNA sequencing, whose costs began to plunge about five years ago, falling even more quickly than the cost of computer chips. Since then, thousands of human genomes, along with those of thousands of other organisms, including plants, animals and microbes, have been deciphered. Public genome repositories, such as the one maintained by the National Center for Biotechnology Information, or NCBI, already house petabytes — millions of gigabytes — of data, and biologists around the world are churning out 15 petabases (a base is a letter of DNA) of sequence per year. If these were stored on regular DVDs, the resulting stack would be 2.2 miles tall.
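
The stack-of-DVDs image can be reproduced in a few lines. The sketch below assumes sequence stored naively at one byte per base on 4.7 GB discs about 1.2 mm thick; the exact height depends on those storage assumptions, so treat it as an order-of-magnitude check on the 2.2-mile figure.

```python
# How tall is a year of sequence data stored on DVDs?
# Assumes 1 byte per base, 4.7 GB per disc, 1.2 mm per disc.

BASES_PER_YEAR = 15e15          # 15 petabases of sequence per year
BYTES_PER_BASE = 1              # naive plain-text storage assumption
DVD_CAPACITY_BYTES = 4.7e9
DVD_THICKNESS_M = 1.2e-3
METRES_PER_MILE = 1609.34

discs = BASES_PER_YEAR * BYTES_PER_BASE / DVD_CAPACITY_BYTES
stack_height_m = discs * DVD_THICKNESS_M

print(f"Discs needed: ~{discs:,.0f}")                                          # ~3.2 million
print(f"Stack height: ~{stack_height_m:,.0f} m "
      f"(~{stack_height_m / METRES_PER_MILE:.1f} miles)")                      # a couple of miles
```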

“The life sciences are becoming a big data enterprise,” said Eric Green, director of the National Human Genome Research Institute in Bethesda, Md. In a short period of time, he said, biologists are finding themselves unable to extract full value from the large amounts of data becoming available.

Solving that bottleneck has enormous implications for human health and the environment. A deeper understanding of the microbial menagerie inhabiting our bodies and how those populations change with disease could provide new insight into Crohn’s disease, allergies, obesity and other disorders, and suggest new avenues for treatment. Soil microbes are a rich source of natural products like antibiotics and could play a role in developing crops that are hardier and more efficient.

Life scientists are embarking on countless other big data projects, including efforts to analyze the genomes of many cancers, to map the human brain, and to develop better biofuels and other crops. (The wheat genome is more than five times larger than the human genome, and it has six copies of every chromosome, compared with our two.)

However, these efforts are encountering some of the same criticisms that surrounded the Human Genome Project. Some have questioned whether massive projects, which necessarily take some funding away from smaller, individual grants, are worth the trade-off. Big data efforts have almost invariably generated data that is more complicated than scientists had expected, leading some to question the wisdom of funding projects to create more data before the data that already exists is properly understood. “It’s easier to keep doing what we are doing on a larger and larger scale than to try and think critically and ask deeper questions,” said Kenneth Weiss, a biologist at Pennsylvania State University.

Unlike fields such as physics, astronomy and computer science, which have been dealing with the challenges of massive datasets for decades, biology has seen its big data revolution arrive quickly, leaving little time to adapt.

“The revolution that happened in next-generation sequencing and biotechnology is unprecedented,” said Jaroslaw Zola, a computer engineer at Rutgers University in New Jersey, who specializes in computational biology.

Biologists must overcome a number of hurdles, from storing and moving data to integrating and analyzing it, which will require a substantial cultural shift. “Most people who know the disciplines don’t necessarily know how to handle big data,” Green said. If they are to make efficient use of the avalanche of data, that will have to change.

Big Complexity

When scientists first set out to sequence the human genome, the bulk of the work was carried out by a handful of large-scale sequencing centers. But the plummeting cost of genome sequencing helped democratize the field. Many labs can now afford to buy a genome sequencer, adding to the mountain of genomic information available for analysis. The distributed nature of genomic data has created its own challenges, including a patchwork of data that is difficult to aggregate and analyze. “In physics, a lot of effort is organized around a few big colliders,” said Michael Schatz, a computational biologist at Cold Spring Harbor Laboratory in New York. “In biology, there are something like 1,000 sequencing centers around the world. Some have one instrument, some have hundreds.”

As an example of the scope of the problem, scientists around the world have now sequenced thousands of human genomes. But someone who wanted to analyze all of them would first have to collect and organize the data. “It’s not organized in any coherent way to compute across it, and tools aren’t available to study it,” said Green.

Big Data in Biology

A selection of big data projects in the life sciences exploring health, the environment and beyond.

Cancer Genome Atlas: This effort to map the genomes of more than 25 types of cancer has generated 1 petabyte of data to date, representing 7,000 cases of cancer. Scientists expect 2.5 petabytes by completion.

Encyclopedia of DNA Elements (ENCODE): This map of the functional elements in the human genome — regions that turn genes on and off — contains more than 15 terabytes of raw data.

Human Microbiome Project: One of a number of projects characterizing the microbiome at different parts of the body, this effort has generated 18 terabytes of data — about 5,000 times more data than the original human genome project.

Earth Microbiome Project: A plan to characterize microbial communities across the globe, which has created 340 gigabytes of sequence data to date, representing 1.7 billion sequences from more than 20,000 samples and 42 biomes. Scientists expect 15 terabytes of sequence and other data by completion.

Genome 10K: The total raw data for this effort to sequence and assemble the DNA of 10,000 vertebrate species and analyze their evolutionary relationships will exceed 1 petabyte.

Researchers need more computing power and more efficient ways to move their data around. Hard drives, often sent via postal mail, are often still the easiest way to transport data, and some argue that it’s cheaper to store biological samples than to sequence them and store the resulting data. Though the cost of sequencing technology has fallen fast enough for individual labs to own their own machines, the concomitant price of processing power and storage has not followed suit. “The cost of computing is threatening to become a limiting factor in biological research,” said Folker Meyer, a computational biologist at Argonne National Laboratory in Illinois, who estimates that computing costs ten times more than research. “That’s a complete reversal of what it used to be.”

Biologists say that the complexity of biological data sets it apart from big data in physics and other fields. “In high-energy physics, the data is well-structured and annotated, and the infrastructure has been perfected for years through well-designed and funded collaborations,” said Zola. Biological data is technically smaller, he said, but much more difficult to organize. Beyond simple genome sequencing, biologists can track a host of other cellular and molecular components, many of them poorly understood. Similar technologies are available to measure the status of genes — whether they are turned on or off, as well as what RNAs and proteins they are producing. Add in data on clinical symptoms, chemical or other exposures, and demographics, and you have a very complicated analysis problem.

“The real power in some of these studies could be integrating different data types,” said Green. But software tools capable of cutting across fields need to improve. The rise of electronic medical records, for example, means more and more patient information is available for analysis, but scientists don’t yet have an efficient way of marrying it with genomic data, he said.

To make things worse, scientists don’t have a good understanding of how many of these different variables interact. Researchers studying social media networks, by contrast, know exactly what the data they are collecting means; each node in the network represents a Facebook account, for example, with links delineating friends. A gene regulatory network, which attempts to map how different genes control the expression of other genes, is smaller than a social network, with thousands rather than millions of nodes. But the data is harder to define. “The data from which we construct networks is noisy and imprecise,” said Zola. “When we look at biological data, we don’t know exactly what we are looking at yet.”
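
One way to see the contrast Zola is drawing: a social-network edge is an observed fact (two accounts either are or are not linked), while an edge in a gene regulatory network is an inference that carries a confidence score. The toy sketch below illustrates the difference; the gene names, targets and scores are invented for illustration only.

```python
# Toy contrast between a social network and a gene regulatory network.
# The gene names, targets and confidence scores are invented purely to
# illustrate the data structures; they are not real results.

# Social network: edges are observed facts.
friendships = {("alice", "bob"), ("bob", "carol")}

# Gene regulatory network: directed edges with a confidence score,
# because each regulator -> target link is inferred from noisy data.
regulatory_edges = [
    ("GENE_A", "GENE_B", 0.92),   # strong evidence that A regulates B
    ("GENE_A", "GENE_C", 0.41),   # weak, possibly spurious, inference
    ("GENE_D", "GENE_B", 0.67),
]

def confident_edges(edges, threshold=0.5):
    """Keep only regulatory links whose inferred confidence clears a cutoff."""
    return [(src, dst) for src, dst, score in edges if score >= threshold]

print("Friendships (observed facts):", sorted(friendships))
print("Confident regulatory links  :", confident_edges(regulatory_edges))
```

Real regulatory networks add further complications, such as indirect effects and context-dependence, which is what makes the data “noisy and imprecise” in Zola’s phrase.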

Despite the need for new analytical tools, a number of biologists said that the computational infrastructure continues to be underfunded. “Often in biology, a lot of money goes into generating data but a much smaller amount goes to analyzing it,” said Nathan Price, associate director of the Institute for Systems Biology in Seattle. While physicists have free access to university-sponsored supercomputers, most biologists don’t have the right training to use them. Even if they did, the existing computers aren’t optimized for biological problems. “Very frequently, national-scale supercomputers, especially those set up for physics workflows, are not useful for life sciences,” said Rob Knight, a microbiologist at the University of Colorado Boulder and the Howard Hughes Medical Institute involved in both the Earth Microbiome Project and the Human Microbiome Project. “Increased funding for infrastructure would be a huge benefit to the field.”

In an effort to deal with some of these challenges, in 2012 the National Institutes of Health launched the Big Data to Knowledge Initiative (BD2K), which aims, in part, to create data sharing standards and develop data analysis tools that can be easily distributed. The specifics of the program are still under discussion, but one of the aims will be to train biologists in data science.

“Everyone getting a Ph.D. in America needs more competency in data than they have now,” said Green. Bioinformatics experts are currently playing a major role in the cancer genome project and other big data efforts, but Green and others want to democratize the process. “The kinds of questions to be asked and answered by super-experts today, we want a routine investigator to ask 10 years from now,” said Green. “This is not a transient issue. It’s the new reality.”

Not everyone agrees that this is the path that biology should follow. Some scientists say that focusing so much funding on big data projects at the expense of more traditional, hypothesis-driven approaches could be detrimental to science. “Massive data collection has many weaknesses,” said Weiss. “It may not be powerful in understanding causation.” Weiss points to the example of genome-wide association studies, a popular genetic approach in which scientists try to find genes responsible for different diseases, such as diabetes, by measuring the frequency of relatively common genetic variants in people with and without the disease. The variants identified by these studies so far raise the risk of disease only slightly, but larger and more expensive versions of these studies are still being proposed and funded.

“Most of the time it finds trivial effects that don’t explain disease,” said Weiss. “Shouldn’t we take what we have discovered and divert resources to understand how it works and do something about it?” Scientists have already identified a number of genes that are definitely linked to diabetes, so why not try to better understand their role in the disorder, he said, rather than spend limited funds to uncover additional genes with a murkier role?

Many scientists think that the complexities of life science research require both large and small science projects, with large-scale data efforts providing new fodder for more traditional experiments. “The role of the big data projects is to sketch the outlines of the map, which then enables researchers on smaller-scale projects to go where they need to go,” said Knight.

Small and Diverse

Efforts to characterize the microbes living on our bodies and in other habitats epitomize the promise and the challenges of big data. Because the vast majority of microbes can’t be grown in the lab, the two major microbiome projects — the Earth Microbiome and the Human Microbiome — have been greatly enabled by DNA sequencing. Scientists can study these microbes mainly through their genes, analyzing the DNA of a collection of microbes living in the soil, skin or any other environment, and start to answer basic questions, such as what types of microbes are present and how they respond to changes in their environment.

The goal of the Human Microbiome Project, one of a number of projects to map human microbes, is to characterize microbiomes from different parts of the body using samples taken from 300 healthy people. Relman likens it to understanding a forgotten organ system. “It’s a somewhat foreign organ, because it’s so distant from human biology,” he said. Scientists generate DNA sequences from thousands of species of microbes, many of which need to be painstakingly reconstructed. It’s like recreating a collection of books from fragments that are shorter than individual sentences.

“We are now faced with the daunting challenge of trying to understand the system from the perspective of all this big data, with not nearly as much biology with which to interpret it,” said Relman. “We don’t have the same physiology that goes along with understanding the heart or the kidney.”

One of the most exciting discoveries of the project to date is the highly individualized nature of the human microbiome. Indeed, one study of about 200 people showed that just by sequencing microbial residue left on a keyboard by an individual’s fingertips, scientists can match that individual with the correct keyboard with 95 percent accuracy. “Until recently, we had no idea how diverse the microbiome was, or how stable within a person,” said Knight.

Researchers now want to figure out how different environmental factors, such as diet, travel or ethnicity, influence an individual’s microbiome. Recent studies have revealed that simply transferring gut microbes from one animal to another can have a dramatic impact on health, improving infections or triggering weight loss, for example.  With more data on the microbiome, they hope to discover which microbes are responsible for the changes and perhaps design medical treatments around them.

Relman said that some of the major challenges will be determining which of the almost unmanageable number of variables involved are important, and figuring out how to define some of the microbiome’s most important functions. For example, scientists know that our microbes play an integral role in shaping the immune system, and that some people’s microbial community is more resilient than others — the same course of antibiotics can have little long-term impact on one individual’s microbial profile and throw another’s completely out of whack. “We just don’t have a big sense of how to go about measuring these services,” said Relman, referring to the microbes’ role in shaping the immune system and other functions.

The Earth Microbiome Project presents an even larger data analysis challenge. Scientists have sequenced about 50 percent of the microbial species living in our guts, which makes it much easier to interpret new data. But only about one percent of the soil microbiome has been sequenced, leaving researchers with genomic fragments that are often impossible to assemble into a whole genome.

Data in the Brain

If genomics was the early adopter of big data analysis in the life sciences, neuroscience is quickly gaining ground. New imaging methods and techniques for recording the activity and the structure of many neurons are allowing scientists to capture large volumes of data.

Jeff Lichtman, a neuroscientist at Harvard, is collaborating on a project to build neural wiring maps from an unprecedented amount of data by taking snapshots of thin slices of the brain, one after another, and then computationally stitching them together. Lichtman said his team, which uses a technique called scanning electron microscopy, is currently generating about a terabyte of image data per day from a single sample. “In a year or so, we hope to be doing multiple terabytes per hour,” he said. “That’s a lot of still raw data that has to be processed by computer algorithms.” A cubic millimeter of brain tissue generates about 2,000 terabytes of data. As in other areas of the life sciences, storing and managing the data is proving to be a problem. While cloud computing works for some aspects of genomics, it may be less useful to neuroscience. Indeed, Lichtman said they have too much data for the cloud, too much even for passing around on hard drives.
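
Those rates put the cubic-millimetre figure in perspective. The sketch below takes the article’s numbers at face value and, since “multiple terabytes per hour” is not a precise rate, assumes 4 TB per hour purely for illustration.

```python
# How long does one cubic millimetre of brain take to image?
# Uses the article's figures; the future rate of 4 TB/hour is an
# assumed stand-in for "multiple terabytes per hour".

DATA_PER_MM3_TB = 2000.0
CURRENT_RATE_TB_PER_DAY = 1.0
ASSUMED_FUTURE_RATE_TB_PER_HOUR = 4.0

days_now = DATA_PER_MM3_TB / CURRENT_RATE_TB_PER_DAY
days_future = DATA_PER_MM3_TB / (ASSUMED_FUTURE_RATE_TB_PER_HOUR * 24)

print(f"At 1 TB/day:  ~{days_now:,.0f} days (about five and a half years)")
print(f"At 4 TB/hour: ~{days_future:.0f} days (about three weeks)")
```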

Lichtman believes the challenges neuroscientists face will be even greater than those of genomics. “The nervous system is a far more complicated entity than the genome,” he said. “The whole genome can fit on a CD, but the brain is comparable to the digital content of the world.”

Lichtman’s study is just one of a growing number of efforts to chart the brain. In January, the European Union launched an effort to model the entire human brain. And the U.S. is now working on its own large-scale project — the details are still under discussion, but the focus will likely be on mapping brain activity rather than the neural wiring itself.

As in genomics, Lichtman said, neuroscientists will need to get used to the concept of sharing their data. “It’s essential that this data become freely and easily accessible to anyone, which is its own challenge. We don’t know the answer yet to problems like this.”

Questions remain about funding and necessary advances in hardware, software and analytical methods. “Ideas like this are almost certainly going to cost a lot, and they have not produced fundamental findings yet,” said Lichtman. “Will you just end up with a meaningless mass of connectional data? This is always a challenge for big data.”

Still, Lichtman is convinced that the major findings will come with time. “I feel confident that you don’t have to know beforehand what questions to ask,” he said. “Once the data is there, anyone who has an idea has a dataset they can use to mine it for an answer.”

“Big data,” he said, “is the future of neuroscience but not the present of neuroscience.”

Discovery of ‘Alien’ DNA in Human Genome Challenges Darwin’s Theory


How will the scientific community deal with these new anomalies?

Scientists are now finding clues as to the beginning of the creation of Earth. Research from the University of Cambridge has discovered what appears to be ‘foreign’ DNA – 145 genes that may threaten one of modern orthodoxy’s sacred cows: Darwin’s theory of evolution.

This latest discovery has invigorated the curiosity of those seeking answers to the question which was brought into mainstream discourse following the release of the blockbuster film Prometheus – which references Greek mythology and the epic story of how Prometheus had stolen fire from the gods in order to give it to man. The film outlines a narrative of how the human race was engineered, and our DNA was “seeded” hundreds of thousands of years ago by advanced ‘Ancient Astronaut’ extra-terrestrial visitors.

Director Ridley Scott told The Hollywood Reporter:

“NASA and the Vatican agree that it is almost mathematically impossible that we can be where we are today without there being a little help along the way … That’s what we’re looking at (in the film), at some of Eric Von Daniken’s ideas of how did we humans come about.”

The Abydos carvings showing a helicopter and other futuristic vehicles, located in the Temple of Seti I in Abydos, Egypt.

Esoteric website Vigilant Citizen explains the historic religious context of the ‘Ancient Astronaut’ theory:

Proponents of the Ancient Astronauts theory claim that many ancient religious texts contain references to visitors from outer space. Two of the main works often cited are the Book of Genesis and the Book of Enoch, which both mention the existence on Earth of enigmatic giant beings named the Nephilim.

The Book of Genesis mentions the presence on Earth of beings named Nephilim (the King James version uses the term Giants). These beings are described as hybrids that are the result of procreation between human females and “sons of Gods”.

“When human beings began to increase in number on the earth and daughters were born to them, the sons of God saw that the daughters of humans were beautiful, and they married any of them they chose. (…) The Nephilim were on the earth in those days—and also afterward—when the sons of God went to the daughters of humans and had children by them.”
– Genesis 6:1–4 (New International Version)

How little we really know about the beginnings of human life on Earth.

Is this proof of ‘The Engineer’s experiment’?

What we are seeing in the beginning is the creation of Earth. The giant ship (which is different from the ring-shaped one we see later in the film, weirdly) has landed on Earth to drop off the Engineer so that he can terraform the planet and make it sustainable for life. We think he drinks the black goo to break down his own structure and spread life on Earth through his own DNA, but that doesn’t really explain his surprise while he’s disintegrating (and if the Engineers do have the same DNA as us, it’s hard to say why the Engineers had to be broken down in order to create humanity). Source: http://www.cinemablend.com/new/Prometheus-Explained-Unraveling-Unanswered-Questions-31317.html


Mystery of our 145 ‘alien’ genes: Scientists discover some DNA is NOT from our ancestors – and say it could change how we think about evolution

• Study challenges views that evolution relies solely on genes passed down
• Instead says we acquired essential ‘foreign’ genes from microorganisms

Mark Prigg
Daily Mail

Humans contain ‘alien’ genes not passed on from our ancestors, researchers have discovered.

They say we acquired essential ‘foreign’ genes from microorganisms cohabiting our ancestors’ environment in ancient times.

The study challenges conventional views that animal evolution relies solely on genes passed down through ancestral lines – and says the process could still be going on.

The research, published in the open access journal Genome Biology, focuses on horizontal gene transfer: the transfer of genes between organisms living in the same environment.

‘This is the first study to show how widely horizontal gene transfer (HGT) occurs in animals, including humans, giving rise to tens or hundreds of active ‘foreign’ genes,’ said lead author Alastair Crisp from the University of Cambridge.

‘Surprisingly, far from being a rare occurrence, it appears that HGT has contributed to the evolution of many, perhaps all, animals and that the process is ongoing, meaning that we may need to re-evaluate how we think about evolution.’…

WHICH ‘LETTERS’ IN THE HUMAN GENOME ARE FUNCTIONALLY IMPORTANT?



In work published today in Nature Genetics, researchers at Cold Spring Harbor Laboratory (CSHL) have developed a new computational method to identify which letters in the human genome are functionally important. Their computer program, called fitCons, harnesses the power of evolution, comparing changes in DNA letters across not just related species, but also between multiple individuals in a single species. The results provide a surprising picture of just how little of our genome has been “conserved” by Nature not only across species over eons of time, but also over the more recent time period during which humans differentiated from one another.

“In model organisms, like yeast or flies, scientists often generate mutations to determine which letters in a DNA sequence are needed for a particular gene to function,” explains CSHL Professor Adam Siepel. “We can’t do that with humans. But when you think about it, Nature has been doing a similar experiment on a very large scale as species evolve. Mutations occur across the genome at random, but important letters are retained by natural selection, while the rest are free to change with no adverse consequence to the organism.”

It was this idea that became the basis of their analysis, but it alone wasn’t enough. “Massive research consortia, like the ENCODE Project, have provided the scientific community with a trove of information about genomic function over the last few years,” says Siepel. “Other groups have sequenced large numbers of humans and nonhuman primates. For the first time, these big data sets give us both a broad and exceptionally detailed picture of both biochemical activity along the genome and how DNA sequences have changed over time.”

Siepel’s team began by sorting ENCODE consortium data based on combinations of biochemical markers that indicate the type of activity at each position. “We didn’t just use sequence patterns. ENCODE provided us with information about where along the full genome DNA is read and how it is modified with biochemical tags,” says Brad Gulko, a Ph.D. student in Computer Science at Cornell University and lead author on the new paper. The combinations of these tags revealed several hundred different classes of sites within the genome each having a potentially different role in genomic activity.

The researchers then turned to their previously developed computational method, called INSIGHT, to analyze how much the sequences in these classes had varied over both short and long periods of evolutionary time. “Usually, this kind of analysis is done comparing different species – like humans, dogs, and mice – which means researchers are looking at changes that occurred over relatively long time periods,” explains Siepel. But the INSIGHT model considers the changes among dozens of human individuals and close relatives, such as the chimpanzee, which provides a picture of evolution over much shorter time frames.

The scientists found that, at most, only about 7% of the letters in the human genome are functionally important. “We were impressed with how low that number is,” says Siepel. “Some analyses of the ENCODE data alone have argued that upwards of 80% of the genome is functional, but our evolutionary analysis suggests that isn’t the case.” He added, “Other researchers have estimated that similarly small fractions of the genome have been conserved over long evolutionary time periods, but our analysis indicates that the much larger ENCODE-based estimates can’t be explained by gains of new functional sequences on the human lineage. We think most of the sequences designated as ‘biochemically active’ by ENCODE are probably not evolutionarily important in humans.”
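
The logic behind that kind of estimate can be illustrated in a few lines: group genome positions into classes by their biochemical activity, ask how depleted each class is of variation relative to putatively neutral sites, and combine the class-wise answers into a genome-wide fraction. The sketch below is a toy version of that idea with invented class sizes and rates; it is not the INSIGHT model itself, which is a full probabilistic method fitted to polymorphism and divergence data.

```python
# Toy illustration of the idea behind fitCons-style estimates: sites are
# grouped into biochemical-activity classes, each class's variant density
# is compared with a neutral baseline, and the "missing" variation is read
# as the fraction of sites under selection. All numbers are invented.

NEUTRAL_RATE = 1.0e-3   # assumed variant density at putatively neutral sites

# class name -> (number of sites, observed variant density); illustrative only
classes = {
    "coding":            (3.0e7, 2.0e-4),
    "active_regulatory": (2.0e8, 7.0e-4),
    "other_marked":      (1.5e9, 9.5e-4),
    "unmarked":          (1.3e9, 1.0e-3),
}

total_sites = sum(n for n, _ in classes.values())
functional_sites = 0.0
for name, (n_sites, observed_rate) in classes.items():
    # Fraction under selection ~ how much variation is missing vs neutral sites.
    fraction_selected = max(0.0, 1.0 - observed_rate / NEUTRAL_RATE)
    functional_sites += fraction_selected * n_sites
    print(f"{name:18s} ~{fraction_selected:.0%} of sites constrained")

print(f"Genome-wide functional fraction: ~{functional_sites / total_sites:.1%}")
```

With these toy inputs the constrained fraction lands in the single digits; the published analysis arrives at its roughly 7% figure by fitting a far richer model to real data.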

According to Siepel, this analysis will allow researchers to isolate functionally important sequences in diseases much more rapidly. Most genome-wide studies implicate massive regions, containing tens of thousands of letters, associated with disease. “Our analysis helps to pinpoint which letters in these sequences are likely to be functional because they are both biochemically active and have been preserved by evolution,” says Siepel. “This provides a powerful resource as scientists work to understand the genetic basis of disease.”

The ENCODE Project: ENCyclopedia Of DNA Elements.


ENCODE Overview

The National Human Genome Research Institute (NHGRI) launched a public research consortium named ENCODE, the Encyclopedia Of DNA Elements, in September 2003, to carry out a project to identify all functional elements in the human genome sequence. The project started with two components – a pilot phase and a technology development phase.

The pilot phase tested and compared existing methods to rigorously analyze a defined portion of the human genome sequence (See: ENCODE Pilot Project). The conclusions from this pilot project were published in June 2007 in Nature and Genome Research [genome.org]. The findings highlighted the success of the project to identify and characterize functional elements in the human genome. The technology development phase also has been a success with the promotion of several new technologies to generate high throughput data on functional elements.

With the success of the initial phases of the ENCODE Project, NHGRI funded new awards in September 2007 to scale the ENCODE Project to a production phase on the entire genome along with additional pilot-scale studies. Like the pilot project, the ENCODE production effort is organized as an open consortium and includes investigators with diverse backgrounds and expertise in the production and analysis of data (See: ENCODE Participants and Projects). This production phase also includes a Data Coordination Center [genome.ucsc.edu] to track, store and display ENCODE data along with a Data Analysis Center to assist in integrated analyses of the data. All data generated by ENCODE participants will be rapidly released into public databases (See: Accessing ENCODE Data) and available through the project’s Data Coordination Center.

Source: genome.gov