It's not Junk
Aug. 13th, 2018 10:07 pm
Jul 2017
Junk DNA - Nessa Carey - Icon Books, 2015
* * *
This book has a number of annoying features, of which the most irritating is its downright fallacious title. This is something of a personal matter, because it relates to what is probably my most cited academic paper, a report by the DNA committee of the Human Genome Mapping 10.5 workshop held in Oxford in 1990. Needless to say, this had nothing to do with my research. As a post-graduate student, I was recruited as a runner for the committee, passing messages on good old-fashioned paper to the other committees (there was one for each chromosome). I got on so well with the chairman that he insisted on listing me as an author of the report. Anyway, the committee's job was to assess some DNA sequences that were not part of genes but which were nonetheless of scientific interest. So I like to think that in a small way, I contributed to the most surprising discovery of the Human Genome Project, which was that just 2% of the 3 billion DNA bases in the human genome are in genes (defining a gene as a sequence that codes for a protein). The remaining 98% appear to have no obvious purpose, though some sequences are highly conserved. So what are they doing there?
When I was an active academic, there were a number of hypotheses. One was that it is simply junk, an artefact of evolution that nature hasn't got around to clearing up. Another was that it is involved in the structural organisation of genes. Each human cell contains around 1800 mm of DNA packed into a nucleus just 0.006 mm in diameter, so ensuring that genes are physically accessible to the transcription enzymes that start the process of protein creation is clearly an important consideration. A third involved the interesting new area of epigenetics, a form of gene control based on chemical modification of the DNA and its scaffolding proteins. Now, some 25 years later, a great deal more research has been done and a great deal more is known. So which of these hypotheses is true? Well, this is nature, so the answer is - all of them, at least to some extent. But the one that is least true is that it is junk.
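As a back-of-envelope check on the scale of the problem, the figures quoted above can be worked through in a few lines. This is only a sketch: the numbers are the round ones from the text (3 billion bases, 2% coding, 1800 mm of DNA, a 0.006 mm nucleus), and the variable names are illustrative assumptions, not anything from the book.

```python
# Round figures quoted in the text; variable names are illustrative only.
GENOME_BASES = 3_000_000_000   # bases in the human genome
CODING_PERCENT = 2             # share of bases inside protein-coding genes
DNA_LENGTH_MM = 1800           # DNA per cell, stretched end to end
NUCLEUS_DIAMETER_MM = 0.006    # diameter of the nucleus it is packed into

# Integer arithmetic keeps the base counts exact.
coding_bases = GENOME_BASES * CODING_PERCENT // 100
noncoding_bases = GENOME_BASES - coding_bases

# How many nucleus diameters long the DNA would be if laid out straight.
length_to_diameter = round(DNA_LENGTH_MM / NUCLEUS_DIAMETER_MM)

print(f"coding bases:     {coding_bases:,}")
print(f"non-coding bases: {noncoding_bases:,}")
print(f"DNA length is about {length_to_diameter:,} nucleus diameters")
```

On these figures, roughly 60 million bases are coding and the DNA is around 300,000 times longer than the nucleus is wide, which is why physical accessibility is such a constraint.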
The structural theory was well attested and has become more so in the intervening years. Telomeres, the end caps of chromosomes that shorten as we age, are made of highly repetitive sequences of DNA which of course don't code for anything. Sadly, increasing their length won't allow us to live forever - cancer cells have already learned the trick of turning on telomerase, the enzyme that maintains their length, and it is unlikely that we can do better. Preventing the "cliff edge" where the telomeres are so short that chromosomal rearrangements become commonplace may, however, help to prevent some nasty diseases. Likewise, the mechanism of action of the centromere, a chromosomal region that ensures that the two copies of a replicated chromosome end up in separate daughter cells rather than both in the same cell during cell division, is now much better understood. Interestingly, it is not the sequence of the DNA that is important in identifying the centromere, but a structural difference in the histone scaffolding around which it is wrapped. The DNA sequence of a centromere could be anything. Failures of the centromeric process result in aneuploidy, where one daughter cell inherits no copy of a replicated chromosome and the other gets both. Intriguingly, around 90% of solid tumours contain aneuploid cells.
The process of X inactivation - the mechanism that randomly inactivates one X chromosome in females to avoid gene dosage effects - is now (partly) understood and is deeply fascinating. Two DNA sequences are responsible, Xist and Tsix, and both make RNA that does not code for a protein. As their names suggest, they are quite literally opposites: Xist and Tsix are made by reading the same double-stranded DNA in opposite directions. At an early stage in development, one X chromosome starts making Xist and becomes inactive, and the other starts making Tsix and stays active. This pattern of expression is set in a two-hour period and lasts for life. It has some bizarre consequences, such as the case of genetically identical female twins, one of whom had muscular dystrophy while the other did not. It also means that the unique patterns of tortoiseshell cats cannot be cloned.
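To make "reading the same double-stranded DNA in opposite directions" concrete, here is a minimal sketch. The nine-base sequence is an invented toy, not real Xist or Tsix sequence, and the function names are illustrative assumptions. It shows only that transcripts read from opposite strands of the same stretch of DNA are reverse complements of each other.

```python
# Toy illustration of antisense transcription; the sequence is made up.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna: str) -> str:
    """Return the opposite strand, written in its own 5'-to-3' direction."""
    return dna.translate(COMPLEMENT)[::-1]

def transcribe(coding_strand: str) -> str:
    """An RNA matches its coding strand, with U in place of T."""
    return coding_strand.replace("T", "U")

top_strand = "ATGGCATTC"                        # hypothetical toy sequence
bottom_strand = reverse_complement(top_strand)  # same DNA, other strand

sense_rna = transcribe(top_strand)         # read in one direction
antisense_rna = transcribe(bottom_strand)  # read in the opposite direction

print(sense_rna)
print(antisense_rna)
```

The two RNAs come from exactly the same piece of DNA yet share no obvious sequence in common, which is the sense in which Xist and Tsix are opposites.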
Xist and Tsix are examples of long non-coding RNAs, products of DNA transcription which are not used for protein synthesis but which may have functional effects. Their sequences are poorly conserved between species, suggesting that it is their structure, not their precise sequence, that matters, which is probably why they were overlooked back in my day. There are quite a large number of them - somewhere between 10,000 and 32,000 in the human genome, so about the same as the number of coding genes - and they have been associated with the maintenance of pluripotency in embryonic stem cells, with the aggressiveness of some types of cancer, and possibly even with the formation of the plaques characteristic of Alzheimer's disease. Long non-coding RNAs also appear to be responsible for the specificity of methylation patterns in epigenetic gene control - they interact with the methylating enzymes and guide them to a specific location, like a tugboat guiding an ocean liner into port. Much more research is clearly needed on this class of gene product.
Structural DNA is involved in another rather remarkable feature of cells, which is their implementation of the principles of supply chain management and lean manufacturing. When a cell needs to make a structure requiring multiple protein-coding genes, such as a haemoglobin molecule or an antibody, it arranges for them to be transcribed at the same time. Loops of DNA unfurl from the chromosome, exposing the coding genes at their ends, and attach to a complex known as a transcription factory. This means that the non-coding regions surrounding genes must be important to their function. It isn't entirely clear how this molecular dance is coordinated, but if I had to guess, I would bet on long non-coding RNAs being involved.
Non-coding RNAs don't have to be long to have an effect. There is a whole swarm of short (20 base pair) RNAs that are involved in fine-tuning gene expression, particularly in the immune system and in development. A particular class of them, called small interfering RNAs (siRNAs), has an interesting effect called post-transcriptional gene silencing, in which protein-coding genes are effectively switched off because their messenger RNAs degrade at a very high rate. Sadly, their potential therapeutic use - for example in cancer treatments - has not yet been realised due to the unsolved problem of delivering these notoriously fragile molecules into the appropriate cells.
As I mentioned, this book has a number of annoyances. Carey says at the start that she is not going to use the scientific names for the genes that she describes - fortunately she is inconsistent in this, but some of the explanations would have been less vague had she just used the standard acronyms. The material could also have done with being better organised, either by type (long non-coding RNAs, structural non-coding DNA, short RNAs) or by function (promoters, enhancers, riboproteins and the like). As it is, it reads like a series of reports from the research front line rather than having an overarching narrative. It was interesting to me as someone who studied the subject many years ago, but a more general reader is likely to get lost in the thickets of the disparate phenomena that she describes, particularly when the explanations are mixed in with descriptions of the complex diseases that they cause.
But the most irritating feature by far is her use of the term "junk DNA" for any sequence that doesn't code for a protein. This undermines the whole thesis of the book, which is that actually there is probably relatively little waste in the human genome. This is not entirely true - some 42% of the human genome is made up of retrotransposons, many of which appear to have been inactivated by mutation and could probably safely be removed without loss of cellular function - but the remainder probably has some use, if only to space out the genes so that they can be transcribed or to mark regions important to chromosome function such as centromeres and telomeres. For much of the genome, the precise sequence is probably not essential to its function, so it will not be heavily conserved in evolution. But that doesn't make it junk, and Carey misrepresents the very interesting science that she describes by using the term.
