Int Conf Bioinspired Comput Theor Appl. Author manuscript; available in PMC 2010 July 10.
Published in final edited form as:
Int Conf Bioinspired Comput Theor Appl. 2007 September; 2007: 236–239.
PMCID: PMC2901528
NIHMSID: NIHMS137485

# DNA Media Storage

## Abstract

In 1994, University of Southern California computer scientist Dr. Leonard Adelman solved the Hamiltonian Path Problem using DNA as a computational mechanism. He proved the principle that DNA computing could be used to solve computationally complex problems. Because of the limitations in discovery time, resource requirements, and sequence mismatches, DNA computing has not yet become a commonly accepted practice. However, advancements are continually being discovered that are evolving the field of DNA Computing. Practical applications of DNA are not restricted to computation alone. This research presents a novel approach in which DNA could be used as a means of storing files. Through the use of Multiple Sequence Alignment combined with intelligent heuristics, the most probabilistic file contents can be determined with minimal errors.

## DNA Representation of Digital Information

How one approaches a problem is often defined in how the problem is represented. Various representations lend themselves to a set of predefined actions that can easily shape one’s perspective and approach in the quest for the solution. For example, when presented with the problem of determining the time at which a thrown ball is at a given height, represented by the following equation [1]

$h=vtsin(A)−(1/2)Gt2$
(1)

where

• h: height at time t
• t: time in air
• A: angle at which thrown
• G: Universal Gravitational Constant

it is possible to algebraically solve for the corresponding time values. However, it is easier to graph the associated equation of height based on time, then reference the graph for the corresponding solution. Additionally, from the graphical representation, it is apparent that there are two corresponding times at which the thrown ball is at a given height – one time in which the ball is traveling upward and an additional time in which the ball is traveling downward.

Computer scientists have long used the notion of a binary bit to represent information, wherein 1 indicates that the given element is present and 0 indicates that the given element is not present. [2] Combining a series of binary bits enables more states to be represented for a given element; a two-bit binary sequence can represent four possible states – 00, 01, 10, 11 – where each element represents an associated state in the problem. In this same manner, geneticists represent the four possible DNA states with a quaternary bit, using the symbols A, C, G, and T to encode for each of the four states. Understanding the relationship among various representations, such as between the digital binary bit of computer scientist and the DNA quaternary bit of the geneticists, enables one to easy translate between different representations to approach the same problem from the new perspective. For example, to translate between a digital binary state to the corresponding DNA state is a direct substitution where a given digital state is assigned to a given DNA base.

## Adelman and the Hamiltonian Path Problem

In 1994, University of Southern California computer scientist Dr. Leonard Adelman solved the Hamiltonian Path Problem using DNA as a computational mechanism. [3,4] The Hamiltonian Path Problem, also known as the Traveling Salesman Problem (TSP), states that given a finite number of cities and the cost of traveling between the various cities, find the lowest costing path, if such a path exists, that visits every city exactly once and then returns to the starting city. [5] For example, given the following graph in Fig 1 where all edges are bidirectional and have an associated cost of one unit, a Hamiltonian Path starting from city 0 would be

Trajectory of a ball thrown as a function of time

with a total cost of seven units.

To solve the Hamiltonian Path Problem with DNA, Adelman began by using 20-mer oligonucleotide sequences to unique represent each of the cities. Since only seven cities were present, sequences of 20 bases is more than sufficient to avoid potential conflicts in representations. To represent a path between each of the cities, the complementary 20-mer oligonucleotide sequences were generated by combining the last 10 bases of the starting city with the first 10 bases of the ending city. Thus, when the oligonucleotide sequences were combined, DNA’s desire to form a stable double helix structure enables the paths to be constructed through the combination of the city sequences with the complementary edge sequences. For example, as shown in Fig 2, the first three sequences represent 20-mer oligonucleotide representations of three of the cities – cities 2, 3, and 4. Since a path exists from city 2 to city 3, the last 10 bases from city 2 are combined with the first 10 bases of city and the complementary sequence of this new 20-mer sequence will enable the two cities to be combined. Since this is not a directed graph, meaning a path is bidirectional, it is also important to generate the converse path as well. In other words, the process must be repeated using the last 10 bases from city 3 with the first 10 bases of city 2, representing the directed path from city 3 to city 2.

Conversion between digital bit-based and DNA-based alphabet

Having generated all representations of the cities and corresponding paths, Adelman then created a large magnitude of copies so as to generate all possible combinations of cities and edges, thereby generating all possible paths through the graph. Since all possible paths have been generated, Adelman then systemically eliminated all paths that did not meet all of the problem rules. First, since it is known that there are seven cities in the given problem, then a valid Hamiltonian Path through the cities must have seven corresponding edges between eight cities (since the path must return to the starting city). Thus, all generated paths that were not this length, whether too short or too long, could be quickly eliminated. Second, since the starting city and ending city must be the same, all generated paths where the starting city and ending city were not equal could also be quickly eliminated. Finally, sequences with duplicated cities, excluding the starting and ending cities, could be eliminated since the problem states that the path must visit each city exactly once. Any remaining generated paths are valid Hamiltonian Paths through the graph. If no generated paths remain, then the graph has no Hamiltonian Paths through it.

Adelman’s solution to the Hamiltonian Path Problem proved that DNA could in principle be used to solve NP-complete problems. However, there are several limitations to DNA computing that prevent it from being widely accepted. One of the primary benefits of DNA computing is its ability to make computations in parallel, meaning the system can simultaneously solve problems. This benefit comes at the cost of a lengthy discovery of the DNA solution. For Adelman’s solution to the Hamiltonian Path problem, all possible solutions were able to be quickly enumerated in only a few hours. However, it took Adelman approximately seven days to eliminate all of the invalid paths to find the valid solutions. While Adelman’s methodology was slow and inefficient when compared with today’s methodology, it is still a lengthy process to biologically find the DNA solutions among a given mixture.

Because of the minute size of DNA, it has the ability to store a vast amount of information. Current methods of data storage require approximately 1012 nm3 of space to store a single bit; DNA has the ability to store a single bit in only 1 nm3. [5] But DNA representation of problems can be difficult. Adelman represented each city and edge with a 20-mer oligonucleotide sequence to ensure that there would be no errors in his calculations of the Hamiltonian Paths. If one were to scale the Hamiltonian Path problem from the original seven cities to two hundred cities, the DNA required to represent all of the cities and corresponding edges would be greater than the weight of Earth.

Finally, since Adelman’s experiment was limited to only seven cities, he could represent the cities and the corresponding edges in such a manner that mismatches in alignment did not result in improvable solutions because the DNA representations for each of the elements could dramatically different from each other such that errors were minimized. However, when performing thousands of calculations such as those performed by current silicon-based computers, the amount of errors present grows exponentially and must be resolved.

## Using Multiple Sequence Alignment in Error Reduction

DNA computing enables parallel processing, where numerous calculations can be performed simultaneously in the same time complexity as a single operation could be performed with linear processing. Determining which of the calculations are correct and which are incorrect is problematic. There are a number of heuristics and DNA characteristics that can be applied to answer this. First, given the drastic reduction in space required to store one bit of information with DNA as opposed to the current methods of storage, one could store and process multiple copies of the same problem then use multiple sequence alignment to deduce the most probable solution.

Consider the application of using DNA to store the contents of a file. Current media storage methods enable one file to be stored with 1012 nm3 per bit, but DNA can store the file in 1 nm3 per bit. Thus, one could easily store one hundred or more copies of the same file and still have a conservation of space over current media devices. Multiple Sequence Alignment could then be performed on the set of sequences to determine areas that are 100% error-free, areas of potential errors, and areas of unrecoverable errors.

### Multiple Sequence Alignment

Multiple Sequence Alignment is the process of finding a representative model of the similarities between three or more sequences. Like pairwise sequence alignment, it finds an optimal solution for the model conditions placed upon it. If the conditions are changed, then the model may or may not hold. For a set of very similar sequences, the multiple sequence alignments are easily seen, even with the naked eye. As the sequences increasingly diverge, so does the complexity of finding the best alignment.

Multiple Sequence Alignment begins by finding the pairwise sequence alignment between every pair of sequences within the given set. Once found, there are a number of approaches used to discover the underlying model, each of which has the potential to produce a different model that is still representative of the sequence set. The top three approaches are progressive modeling [7], iterative modeling [8], and statistical or probabilistic modeling [9]. Progressive modeling begins with the alignment of the two most similar sequences and iteratively adds sequences to the alignment in descending order of their similarity. Iterative modeling aligns any pair of similar sequences or set of sequences, continually clustering until only one group remains. Finally, statistical or probabilistic modeling selects the ordering of alignment based on a given statistically or probabilistic model believed to represent the given set of sequences. One commonly used method for representing these models is a Hidden Markov Model. [10]

Within Bioinformatics, Multiple Sequence Alignment has several practical applications. [7] Since Multiple Sequence Alignment is more sensitive to the similarities among a set of sequences when compared with pairwise sequence alignment, it is often used to determine if a given sequence is related to a larger group of sequences. For example, using Multiple Sequence Alignment enables one to examine the phylogenetic tree, a graphical representation of the similarities among various organisms. It can also provide insight into questions of functionality, structure, and evolution by comparing a given sequence to a set of sequences known to express the given functionality or structure being examined. Multiple Sequence Alignment has also been extensively used in the field of cloning, where it is critical to examine the differences among cloned organisms.

### Multiple Sequence Alignment for Error Reduction

Since it takes significantly less space to store a single bit in DNA compared to current media devices, one could store a multitude of copies of the same file without increasing the space complexity of current mediums. Knowing that Multiple Sequence Alignment is sensitive to sequence similarities, it can be used to combine the multiple copies of the same file to find the most probabilistic contents. There are three scenarios that can be discovered: (1) areas that are completely conserved among all of the sequences, (2) areas that are highly conserved among the sequences, and (3) areas that are not conserved among the sequences. Each of these scenarios directly corresponds with the level of error within the region.

First, consider areas that are completely conserved among all of the sequences. If the region is completely conserved, then no mutations have occurred in any of the file copies. Since the region is an exact clone of all other copies, there are no discrepancies introduced and as such, the region is completely 100% free of errors.

Next, consider areas that are highly conserved. Because there are discrepancies among the given set sequences, there is a potential that errors have been introduced into the region. Since a multitude of copies have been stored, then it is probable that the majority of sequences will be highly correlated. Thus, the emission properties of the associated Hidden Markov Model state will clearly indicate which one of the bases is most probable of being emitted as it will have a significantly higher emission over the remaining bases. It is important to note that pseudocounts should not be introduced within the Hidden Markov Model, as they will skew the emissions of the state.

Finally, consider areas that are not conserved among the sequences. It may not be possible to determine the most probabilistic emission because a great deal of discrepancies have been introduced into the region. Since there can be no determination as to what the sequence was originally, this region represents the system state of irrecoverable errors. In such circumstances, there are a number of external alternatives to be considered. An artificial intelligent agent could be introduced to make the final determination of the state. Conversely, all of the represented sequences could be presented to the end user to make the final determination as to what were the original contents of the file.

### Improving the Multiple Sequence Alignment

Multiple Sequence Alignment enables the similarities to be emphasized among the sequences, but there is still the possibility of an irrecoverable error in the event that the mutations in the sequences have resulted in an indeterminate state. However, there is the potential to improve the alignment of the regions that are not highly conserved by aligning the corresponding translated amino acid sequences, not the original DNA sequence.

A three-base nucleotide encodes for one of twenty amino acids within an organism. Since there are four possible bases (A, C, G, T) for each of the three possible bases of the amino acid, there are a total of sixty-four possible combinations [7], meaning there are multiple codon representations that encode for a single amino acid. For example, Lysine (K) is encoded by AAA and AAG. Thus, a discrepancy between A and G in the third base would not have a difference in the resulting amino acid. Likewise, Threonine (T) is encoded by ACT, ACC, ACA, and ACG, meaning the third base is obsolete in the determination of the amino acid because all four bases translate into Threonine. Consequently, alignment of the translated amino acid sequences has a greater probability of defining more highly conserved regions as previously indeterminate states are now distinguishable.

While increased accuracy is possible, it comes at a cost of a dramatic increase in the computational time required to find the alignment. To translate a DNA sequence into its corresponding amino acid sequence results in six possible translations. This is the result of an unknown open reading frame, or in other words, lack of knowledge as to which base is the correct starting location of the translation and not a carryover of the previous amino acid. As such, each of first three bases of the DNA sequence must be considered as a possible starting codon location. Additionally, since the DNA forms a double helix, one must also consider the reverse sequence’s first three bases as possible codons since there is no decisive method of determining in which direction the sequence was originally read. Thus, the result is a total of six possible translated amino acid sequences for a single DNA sequence.

Since it is the pairwise alignment between two nucleotide sequences being sought, both sequences must be translated into their corresponding six amino acid sequences. Then, when the pairwise alignment is calculated, all thirty-six combinations must be considered, aligning each of the six amino acid sequence from the first translated DNA sequence with each of the six amino acid sequence from the second translated DNA sequence. The pairwise alignment with the highest score is then deemed to be the best representation of the two sequences. Thus, the pairwise alignment between the sequences has exponentially increased in complexity from the pairwise alignment of two sequences to the pairwise alignment of thirty-six sequences.

### Heuristic Improvements of the Algorithm

Knowing that the aligned sequences are very similar, if not identical, there are number of heuristics that can be applied to reduce the computational, storage, and time complexity required for the multiple sequence alignment. Continuing with the example of the storage of a file, it is reasonable to assume that the majority of sequences being aligned will be of the same length within a bounded threshold. Since a file will not produce or reduce the amount of information contained within it without some sort of external stimuli, one can quickly eliminate sequences which disproportionately longer or shorter than majority of sequences being aligned.

Additionally, since the sequences are highly similar, the alignment will probabilistically follow the diagonal of the alignment matrix. [11,12] Thus, performing a bounded alignment in which only cells within a given threshold above and below the diagonal are calculated will reduce the computational complexity and the storage complexity required for all of the pairwise sequence alignments performed. Determining the appropriate threshold is dependent upon the application, however for any sequence set of substantial length, it is reasonable to assume that the threshold could be set between 5–10% and still produce highly accurate results.

To reduce these complexities even further, an intelligent agent could retain the probabilities of the identical alignments without requiring the actual storage of the alignments. Specifically, if two or more of the sequences are identical, it is inefficient to store the alignment, as the highest pairwise alignment is an exact copy of itself. However, the frequencies of the identical sequences must be retained in order for the Hidden Markov Model emissions to be representative of the aligned sequences. If these frequencies are not retained, then the significance of the different regions of the sequences is skewed by the emphasis of a discrepancy in an outlining sequence when compared to the majority.

## Conclusion

Adelman solved the Hamiltonian Path Problem using DNA as a computational mechanism; he proved the principle that DNA computing could be used to solve computationally complex problems. Practical applications of DNA are not restricted to computation alone. This research presents a novel approach in which DNA could be used as a means of storing files. Through the use of Multiple Sequence Alignment combined with intelligent heuristics, the most probabilistic file contents can be determined with minimal errors. Completely conserved regions have no discrepancies and as such are 100% error free. Highly conserved regions have minimal discrepancies, whose correct content can be determined based on the emission probabilities of the associated Hidden Markov Model. Finally, poorly conserved regions represent the most difficult areas because of the high discrepancies with low emission probabilities. However, using the associated translated amino acid sequences, it is possible to improve the accuracy of the region’s emission probabilities with multiple codons encoding a single amino acid.

Traveling Salesman Problem
DNA Representation of the Traveling Salesman Problem
Amino Acid Translation Table
Translation of a single DNA sequence

## References

1. Halliday David, Resnick Robert, Walker Jearl. Fundamentals of Physics. 6. New York, NY: John Wiley & Sons, Inc; 2001.
2. Brookshear J Glenn. Computer Science: An Overview. 6. Reading, MA: Addison-Wesley; 2000.
3. Adelman Leonard M. Molecular Computation of Solutions to Combinatorial Problems. Science. 1994;266:1021–1024. [PubMed]
4. Amos Martyn. Theoretical and Experimental DNA Computation. Netherlands: Springer; 2005.
5. Baase Sara, Van Gelder Allen. Computer Algorithms: Introduction to Design and Analysis. 3. Reading, MA: Addison-Wesley; 2000.
6. Parker Jack. Computing with DNA. European Molecular Biology Organization. 2003;4.1:7–10. [PubMed]
7. Pevsner Jonathan. Bioinformatics and Functional Genomics. Hoboken, NJ: John Wiley & Sons, Inc; 2003.
8. Mount David W. Bioinformatics: Sequence and Genome Analysis. 2. Cold Spring Harbor, NY: Cold Spring Harbor Laboratory Press; 2004.
9. Durbin R, et al. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. 11. Cambridge, UK: Cambridge University Press; 2006.
10. Rabiner LR, Juang BH. An Introduction to Hidden Markov Models. IEEE ASSP Magazine. 1986 Jan;:4–16.
11. Carillo H, Lipman D. The Multiple Sequence Alignment Problem in Biology. SIAM Journal on Applied Mathematics. 1988;48:1073–1082.
12. Myers Eugene W. Technical Report TR 91–29. University of Arizona, Department of Computer Science; 1991. An Overview of Sequence Comparison Algorithms in Molecular Biology.

 PubMed Central Canada is a service of the Canadian Institutes of Health Research (CIHR) working in partnership with the National Research Council's national science library in cooperation with the National Center for Biotechnology Information at the U.S. National Library of Medicine(NCBI/NLM). It includes content provided to the PubMed Central International archive by participating publishers.