Same But Different: A Short History of the Human Genome

THURSDAY, 14 APRIL 2022

For a species less than 200,000 years old, we have a surprisingly complex history etched into our genomes. Although it emerged only around 180,000 years ago, Homo sapiens has spread across the globe over the last 100,000 years, creating pockets of isolated populations in wildly differing climates. This history can be pieced together by examining the differences in our DNA, which result from errors and adaptations passed on through generations. Our genomes tell the story of a young species with common ancestry and very little genetic variation, bound much more by our similarities than divided by our differences. However, by examining these small differences, we can begin to piece together the causes of the differences in our traits. For example, understanding how genetic differences contribute to disease susceptibility has the potential to usher healthcare into a new era of personalised medicine.

A new understanding of variation

Variation can be thought of as differences in physical features, such as flower colour, beak shape or height. But what determines these differences? Ultimately, physical features are encoded by genes — basic units of inheritance — which are sequences of DNA letters. Each of us carries a slightly different, unique set of genes that interact to give complex individual traits, often difficult to predict from DNA alone. The era of genomics has allowed us to peer into our DNA, and we now know variation is heavily based on differences in our DNA sequences. We are only just beginning to uncover the true complexity of this genetic variation.

The Human Genome Project, initially published in 2001, successfully determined the arrangement of letters that make up human DNA from a mosaic of individuals, creating a reference genome against which all human genomes can be compared. It turns out that, on average, humans share 99.9% of their genetic code, and there is more genetic variation within populations than between them. The boundaries we set between ‘races’ are artificial. However, a more detailed knowledge of genetic variation between humans, drawn from a more diverse cohort, is necessary if we are to understand genetic diseases and use this knowledge to improve healthcare.

To that end, in 2008, the 1000 Genomes Project was announced. It aimed to sequence the genomes of at least 1,000 people from across the world to produce a more detailed, diverse, and medically useful picture of human genetic variation. The global sequencing effort ultimately covered 2,504 individuals from 26 populations and found over 88 million genetic variants. A typical human genome differs from the reference at approximately 4-5 million sites; most differences are single-letter changes, known as Single Nucleotide Polymorphisms (SNPs), while some are caused by deletions or insertions of small sections of DNA. Large changes in our genomes appear to be rare, suggesting that most genetically influenced diseases involve SNPs rather than large rearrangements.
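As a rough sanity check, the 4-5 million differing sites can be set against the size of the genome to see how they relate to the 99.9% figure quoted for shared genetic code. The sketch below assumes a haploid reference genome of about 3.1 billion letters, a number the article does not give, and counts every differing site as a single letter, which understates the effect of insertions and deletions. The function name is made up for illustration.

```python
# Assumption (not stated in the article): the haploid human reference
# genome is roughly 3.1 billion base pairs long.
GENOME_LENGTH_BP = 3_100_000_000


def percent_identical(differing_sites: int,
                      genome_length: int = GENOME_LENGTH_BP) -> float:
    """Percentage of positions matching the reference.

    Each differing site is counted as one position, so insertions and
    deletions longer than one letter are undercounted.
    """
    return 100.0 * (1 - differing_sites / genome_length)


# The article's range of 4-5 million differing sites per genome.
fewest_differences = percent_identical(4_000_000)
most_differences = percent_identical(5_000_000)
print(f"{most_differences:.2f}% to {fewest_differences:.2f}% identical")
```

Under these assumptions the result lands between 99.8% and 99.9%, which is in line with the headline figure of roughly 99.9% shared code.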

Populations within Africa show the greatest number of sites of genetic change. This is consistent with the Out of Africa hypothesis that all modern non-Africans are descended from a single exodus from the African continent around 120,000 years ago. African populations are therefore significantly older, so there has been more time for genetic differences to build up within them. Most specific genetic differences are rare in the global sample, yet most of the differences found within any single genome are common ones. The 1000 Genomes Project demonstrated that we are more alike than different on a grand scale and are connected by a common origin.

Clinical consequences

Our ability to characterise genetic variation on a global, single-letter level promises a new era in medical diagnosis. If we can accurately map differences in the genomes of healthy and ill people, perhaps we will be better able to uncover the mechanisms behind many currently poorly characterised genetic diseases. The 1000 Genomes Project showed that some genetic changes, albeit rare, can cluster in certain populations — perhaps we can create more targeted preventative measures?

Genome-wide association studies (GWAS) have been designed to test hundreds of thousands of known genetic differences, or SNPs, at once, to try to find those associated with a specific trait or disease. The findings are useful but sobering. Naively, we might think that one change in our DNA will result in a dysfunctional protein which causes a disease. Unfortunately, the true picture is much more complex. Most traits are influenced by thousands of causal genetic differences: each on its own confers very little risk, but together they can greatly increase disease susceptibility. It is rare for a disease to be caused by just one SNP.
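To make the idea of testing a SNP for association concrete, here is a minimal sketch of one common approach: comparing how often a variant letter appears in people with a disease (cases) and without it (controls), using a chi-square test on allele counts. The counts, the function name, and the number of SNPs tested are all invented for illustration; real studies use dedicated software and correct for ancestry and other confounders.

```python
import math


def allelic_association(case_alt, case_ref, control_alt, control_ref):
    """Chi-square test (1 degree of freedom) on a 2x2 table of allele counts.

    Returns (odds_ratio, chi2, p_value).
    """
    table = [[case_alt, case_ref], [control_alt, control_ref]]
    total = case_alt + case_ref + control_alt + control_ref
    row_sums = [case_alt + case_ref, control_alt + control_ref]
    col_sums = [case_alt + control_alt, case_ref + control_ref]

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_sums[i] * col_sums[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected

    odds_ratio = (case_alt * control_ref) / (case_ref * control_alt)
    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return odds_ratio, chi2, p_value


# Made-up data: 1,000 cases and 1,000 controls, two alleles per person.
odds_ratio, chi2, p_value = allelic_association(460, 1540, 400, 1600)

# Hypothetical number of SNPs tested in the same study; the Bonferroni
# correction divides the usual 0.05 threshold by the number of tests.
N_SNPS_TESTED = 500_000
threshold = 0.05 / N_SNPS_TESTED

print(f"odds ratio {odds_ratio:.2f}, p-value {p_value:.3f}")
print("passes corrected threshold:", p_value < threshold)
```

In this toy example the variant raises the odds of disease only slightly, and although it would look significant on its own, it falls far short of the corrected threshold. That is one reason small individual effects are so hard to detect without very large cohorts.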

This makes hunting for a causal gene or SNP exceedingly difficult, with no obvious place to start the search. Most disease-associated SNPs are found in mysterious non-coding regions (those that do not appear to code for a protein), making it hard to assign a function to them. Moreover, SNPs thought to cause a disease are often inherited together with other SNPs that confer no extra risk, making it hard to disentangle them and pinpoint the real culprit. That being said, we are starting to gain a few insights. For example, a single-letter change has been shown to be a powerful predictor of obesity, and assigning functional roles to some SNPs has helped discover a genetically encoded pathway inside cells that is linked to Crohn’s disease.

While we are still a long way off identifying the mechanisms behind diseases, we have made progress in our ability to predict susceptibility. We can now assign a numerical risk score for a specific disease (often called a polygenic risk score) to an individual’s genome, based on our knowledge of which SNPs are associated with a given trait. Risk calculated in this way is only probabilistic, with a high degree of uncertainty: low-risk individuals may still develop the disease, whilst those identified as high-risk may never do so. Many risk-associated SNPs were discovered in European cohorts, and even small regional biases can skew results, making them less meaningful for people across the globe. Predictors are far from universal. The only solution is to increase the diversity of our sampled cohorts, making programmes like the 1000 Genomes Project key to more accurate predictors.
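One simple way such a score can be built is as a weighted sum: for each SNP, count how many copies of the risk-associated letter a person carries (0, 1 or 2) and multiply by that SNP's estimated effect. The sketch below assumes this additive form; the SNP names and effect sizes are invented, and real scores draw on thousands of SNPs with weights taken from association studies.

```python
from typing import Dict

# Hypothetical per-SNP effect sizes (for example, log odds ratios) for one
# disease. Both the SNP names and the numbers are made up for illustration.
EFFECT_SIZES: Dict[str, float] = {
    "snp_A": 0.18,
    "snp_B": 0.05,
    "snp_C": -0.10,  # a negative weight marks a protective variant
    "snp_D": 0.02,
}


def risk_score(genotype: Dict[str, int],
               effect_sizes: Dict[str, float] = EFFECT_SIZES) -> float:
    """Additive risk score: sum of (copies of risk allele x effect size).

    `genotype` maps a SNP name to the number of risk-allele copies carried.
    SNPs missing from `genotype` are treated as 0 copies, a simplification.
    """
    score = 0.0
    for snp, weight in effect_sizes.items():
        copies = genotype.get(snp, 0)
        if copies not in (0, 1, 2):
            raise ValueError(f"{snp}: expected 0, 1 or 2 copies, got {copies}")
        score += copies * weight
    return score


person = {"snp_A": 2, "snp_B": 1, "snp_C": 0, "snp_D": 1}
print(f"risk score: {risk_score(person):.2f}")
```

The resulting number only has meaning relative to the scores of other people in a comparable population, and because the weights are estimated from a particular cohort, the same score may be less informative for someone of different ancestry.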

Problems yet unanswered

We can now sequence entire human genomes in mere days at a fraction of the cost of the Human Genome Project. Yet we still do not know the function of most of the genetic differences we detect, as most lie in non-coding regions we know little about.

One of the biggest remaining problems is that the genetic variations we have discovered do not account for the full heritability of many traits, known as the missing heritability problem. Some traits, such as height, are highly heritable, meaning that a large amount of the variation we see in them is encoded in our DNA, and yet the genetic differences we have found so far are insufficient to account for all our inherited traits and diseases. Advances in longer-read sequencing techniques are slowly uncovering new types of variation, but efforts even more ambitious than the 1000 Genomes Project are needed if we are to solve this conundrum and find the missing variation.

Maybe one day in the not-too-distant future, when all our genomes are sequenced at birth and we fully understand the heritability of all our traits and gene interactions, we will be able to design a truly cradle-to-grave healthcare system, with drugs and treatments uniquely designed for each individual on the basis of their genome. Until then, we have more of a handle on our similarities than our differences, many of which remain so minutely small as to be almost undiscoverable.

Bartek Witek is a third-year undergraduate studying Natural Sciences at St Catharine’s College. Illustration by Mariadaria Ianni-Ravn.