Making DNA Storage a Reality
In the late 1970s, a bizarre theory began making its way around the scientific community. DNA sequencing pioneer Frederick Sanger of the Medical Research Council’s Laboratory of Molecular Biology and his colleagues had just published their landmark paper on the genome of the virus Phi X174 (or φX174), a well-studied bacteriophage that infects E. coli.1 That genome, some said in the excitement that followed, contained a message from aliens.
In what they termed a “preliminary effort. . . to investigate whether or not phage φX174 DNA carries a message from an advanced society,” Japanese researchers Hiromitsu Yokoo and Tairo Oshima explored some of the reasons extraterrestrials might choose to communicate with humans via a DNA code.2 DNA is durable, the authors noted in their 1979 article, and can be easily replicated. What’s more, it is ubiquitous on Earth, and unlikely to become obsolete as long as life continues—convenient for aliens waiting for humans to develop the sequencing technologies necessary to decode their messages.
The thesis wasn’t taken terribly seriously, and the researchers themselves admitted there was no obvious pattern to Phi X174’s genome. But for biologist George Church, then a Harvard University graduate student learning how to sequence DNA under Walter Gilbert, the speculation in the paper was intriguing. “I didn’t believe it,” he says of the alien theory, “but it planted the idea that one could encode messages into biological DNA.”
At the time, of course, there was a glaring obstacle: cost. Back then, “we synthesized 10 nucleotides for $6,000, and that was considered a pretty good deal,” says Church, now a professor of genetics at Harvard. “Obviously, you can’t encode much information in 10 nucleotides.”
A few decades later, however, things began to change. Oligonucleotide synthesis was becoming more routine, and researchers could write small amounts of arbitrary information into nucleic acids for under a dollar per base. In 2001, for example, a team at Mount Sinai School of Medicine wrote out two Dickens quotes totaling 70 bytes in DNA sequences—encoding each letter of the alphabet as combinations of the bases A, C, and T (e.g., AAA = A, AAC = B, etc.). Eight years later, researchers in Toronto created a plasmid library containing more than 200 bytes of coded text, music, and an image from the nursery rhyme “Mary Had a Little Lamb.” In 2010, Craig Venter’s group demonstrated progress in oligonucleotide synthesis by artificially synthesizing the entire genome of the bacterium Mycoplasma mycoides—about 1.1 million base pairs.
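The letter-by-letter scheme is easy to picture in code. What follows is only a minimal sketch: it assumes the triplets are handed out in simple counting order over the bases A, C, and T, which matches the two examples quoted above (AAA = A, AAC = B) but may not be the exact table the Mount Sinai team used. The names `ENCODE`, `encode_text`, and `decode_dna` are made up for illustration.

```python
from itertools import product
import string

BASES = "ACT"  # the three bases named in the text; G is unused in this sketch

# Assumption: triplets are assigned to letters in counting order
# (AAA, AAC, AAT, ACA, ...). Three bases give 3**3 = 27 triplets,
# enough for the 26 letters; the last triplet (TTT) is left unassigned.
TRIPLETS = ["".join(t) for t in product(BASES, repeat=3)]
ENCODE = dict(zip(string.ascii_uppercase, TRIPLETS))
DECODE = {triplet: letter for letter, triplet in ENCODE.items()}


def encode_text(text):
    """Turn the letters of `text` into DNA, dropping spaces and punctuation."""
    return "".join(ENCODE[ch] for ch in text.upper() if ch in ENCODE)


def decode_dna(dna):
    """Read the sequence three bases at a time and map each triplet to a letter."""
    return "".join(DECODE[dna[i:i + 3]] for i in range(0, len(dna), 3))


print(encode_text("BAD"))  # AACAAAACA
print(decode_dna(encode_text("It was the best of times")))
```

Because every letter costs three bases and spaces are thrown away in this version, a scheme like this is a demonstration rather than a practical format, which is part of why later efforts moved to encoding binary data directly.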
Around this time, Church decided to get involved. He and two Harvard colleagues translated an HTML draft of a 50,000-word book on synthetic biology, coauthored by Church, into binary code, converted it to a DNA sequence—coding 0s as A or C and 1s as G or T—and “wrote” this sequence with an ink-jet DNA printer onto a microchip as a series of DNA fragments. In total, the team made 54,898 oligonucleotides, each including 96 bases of data along with a 22-base sequence at each end to allow the fragments to be copied in parallel using the polymerase chain reaction (PCR), and a unique, 19-base “address” sequence marking the segment’s position in the original document.
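The approach described here, with one bit per base and each fragment carrying its own address, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the team's actual software: the two primer sequences are placeholders, the order of address and data inside a fragment is a guess, and the rule for choosing between the two allowed bases (pick whichever differs from the previous base, so no base repeats back to back) is one plausible use of the redundancy rather than a documented detail.

```python
ZERO_BASES = "AC"   # a 0 bit may be written as A or C
ONE_BASES = "GT"    # a 1 bit may be written as G or T

DATA_BITS = 96      # data bases per fragment (one bit per base)
ADDRESS_BITS = 19   # address bases per fragment
PRIMER_LEN = 22     # bases added at each end for PCR

# Hypothetical placeholder primers, 22 bases each; the real sequences
# are not given in the article.
FWD_PRIMER = "ACGT" * 5 + "AC"
REV_PRIMER = "TGCA" * 5 + "TG"


def bits_to_dna(bits):
    """Write each bit as a base, choosing the option that differs from the
    previous base (an assumed rule that avoids repeated bases)."""
    out, prev = [], ""
    for bit in bits:
        options = ONE_BASES if bit == "1" else ZERO_BASES
        base = options[1] if options[0] == prev else options[0]
        out.append(base)
        prev = base
    return "".join(out)


def dna_to_bits(dna):
    """Recover bits: G or T reads as 1, A or C reads as 0."""
    return "".join("1" if base in ONE_BASES else "0" for base in dna)


def build_oligos(data):
    """Split `data` (bytes) into 96-bit blocks and wrap each one as
    primer + address + data + primer."""
    bits = "".join(f"{byte:08b}" for byte in data)
    oligos = []
    for index, start in enumerate(range(0, len(bits), DATA_BITS)):
        block = bits[start:start + DATA_BITS].ljust(DATA_BITS, "0")  # pad last block
        address = f"{index:0{ADDRESS_BITS}b}"
        oligos.append(FWD_PRIMER + bits_to_dna(address + block) + REV_PRIMER)
    return oligos


def decode_oligos(oligos, n_bytes):
    """Reassemble the original bytes from fragments given in any order."""
    blocks = {}
    for oligo in oligos:
        bits = dna_to_bits(oligo[PRIMER_LEN:-PRIMER_LEN])
        blocks[int(bits[:ADDRESS_BITS], 2)] = bits[ADDRESS_BITS:]
    bits = "".join(blocks[i] for i in sorted(blocks))[:n_bytes * 8]
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


message = b"DNA can store text"
oligos = build_oligos(message)
print(len(oligos), len(oligos[0]))                      # 2 fragments, 159 bases each
print(decode_oligos(list(reversed(oligos)), len(message)))
```

The address is what makes the pool of fragments usable: the fragments carry no physical order, so each one has to say where it belongs in the file.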
The resulting blobs of DNA—which the team later copied with PCR and ran through an Illumina sequencer to retrieve the text—held around 650 kB of data in such a compact form that the team predicted a storage potential for their method of more than 700 terabytes per cubic millimeter.3 Not only did this result represent far and away the largest volume of data ever artificially encoded in DNA, it showcased a data density for DNA that was several orders of magnitude greater than that of state-of-the-art storage media, never mind the average computer hard drive. (For comparison, an 8-terabyte disk drive has the dimensions of a small book.)
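The "around 650 kB" figure can be checked against the fragment counts given earlier. The short calculation below assumes one bit per data base, as in the coding scheme described above, and that all 96 data bases in every fragment carry payload; the variable names are illustrative.

```python
N_OLIGOS = 54_898    # fragments synthesized
DATA_BASES = 96      # data bases per fragment, one bit each
ADDRESS_BASES = 19   # address bases per fragment
PRIMER_BASES = 22    # primer bases at each end

payload_bits = N_OLIGOS * DATA_BASES
payload_bytes = payload_bits // 8

oligo_length = DATA_BASES + ADDRESS_BASES + 2 * PRIMER_BASES
total_bases = N_OLIGOS * oligo_length

print(f"payload: {payload_bits:,} bits = {payload_bytes:,} bytes")
print(f"fragment length: {oligo_length} bases")
print(f"bases synthesized: {total_bases:,}")
print(f"share of bases carrying data: {DATA_BASES / oligo_length:.0%}")

# A 19-bit address can label 2**19 = 524,288 fragments, comfortably more
# than the 54,898 needed here.
print(f"addressable fragments: {2 ** ADDRESS_BASES:,}")
```

The payload works out to 658,776 bytes, consistent with the rounded figure in the text, and roughly 60 percent of the synthesized bases carry data, with the rest spent on addressing and PCR primers.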
The study’s publication in late 2012 was met with excitement, and not only among biologists. In the years since Yokoo and Oshima’s discussion on extraterrestrial communiqués, the world of computing had started to acknowledge an impending crisis: humans are running out of space to store their data. “We are approaching limits with silicon-based technology,” explains Luis Ceze, a computer architect at the University of Washington in Seattle. Church’s paper, along with a similar study published a few months later by Nick Goldman’s group at the European Bioinformatics Institute in the UK, part of the European Molecular Biology Laboratory (EMBL),4 brought the idea of using DNA for data storage squarely into the spotlight. For Ceze and his colleagues, “the closer we looked, the more it made sense that molecular storage is something that probably has a place in future computer systems.”
The idea of a nucleic acid–based archive of humanity’s burgeoning volume of information has drawn serious support in recent years, both from researchers across academic disciplines and from heavyweights in the tech industry. Last April, Microsoft made a deal with synthetic biology startup Twist Bioscience for 10 million long oligonucleotides for DNA data storage. “We see DNA being very useful for long-term archival applications,” Karin Strauss, a researcher at Microsoft and a colleague of Ceze’s at the University of Washington’s Molecular Information Systems Lab, tells The Scientist in an email. “Hospitals need to store all health information forever, research institutions have massive amounts of data from research projects, manufacturers want to store the data collected from millions of sensors in their products.”
With continued improvements in the volume of information that can be packed into DNA’s tiny structure—data can be stored at densities well into millions of gigabytes per gram—such a future doesn’t look so fanciful. As the costs of oligonucleotide synthesis and sequencing continue to fall, the challenge for researchers and companies will be to demonstrate that using DNA for storage, and maybe even other tasks currently carried out by electronic devices, is practical…