DNA as the Repository of Intelligence

Here’s an article just in from PhysOrg.com. What Professor Shepherd proposes should prove to be very enlightening. He ran his algorithm on Jane Austen’s Emma, with the spaces and punctuation removed and the text fed in as one long string of letters, and it recovered 80% of the words and put them back into sentences without any knowledge of grammar. Just think of what analogies can be drawn if they end up breaking up 80% of DNA into grammatical wholes!

Here’s a quote:

Professor Shepherd originally tested his computer programme on the entire text of Emma by Jane Austen after removing all the spaces and punctuation, leaving just a long impenetrable line of letters. Despite having no knowledge of the English vocabulary or syntax, the programme managed to identify 80 per cent of the words and separate them back into sentences.

Professor Shepherd believes that this can be applied to the genetic sequence, which contains around 3 billion letters and is currently baffling scientists as to how to interpret it. Within this sequence there is information that nobody knows how to extract – codes that regulate, control or describe all kinds of cellular processes.

Professor Shepherd believes that his method of number crunching will be able to make an interpretation. He said: “We are treating DNA as we used to treat problems in intelligence. We want to break the code at the most fundamental level.”

Here’s the link.

23 Responses to DNA as the Repository of Intelligence

  1. I wish the article explained how his algorithm works…I’m interested in that aspect of it.

  2. To have further information about the project:

    http://www.simonshepherd.supanet.com/genesys.htm

  3. IBM is way ahead of him…

    http://www.hpcwire.com/hpc/643538.html

    I found Patrick blogged about IBM earlier this year and provided a link to the PNAS article which goes into more detail in the abstract.

    http://www.pnas.org/cgi/conten.....01688103v1

  4. Interesting. I’ll be interested to see where this goes. Although of course the results will support NDE because it can’t be otherwise. :P

  5. Wasn’t it formerly thought that the complexities and information density of Human language (to say nothing of the intellect necessary to invent and develop computer languages and the machines that parse them) had to be the current end product of at least hundreds of millions of years of evolution?

    Now are we saying that this supreme complexity came first?!

    Has anyone noticed this extraordinarily profound paradigm shift regarding what NDE is able to produce in a very short time span?

    I’ve quoted Eugenie Scott before: “…the universe is just more complex than we thought.”

    This “thinking” makes no sense to me at all.

  6. Determining English word boundaries without any knowledge of syntax or morphology is an impressive feat, to say the least. English, after all, has very little inflectional morphology and fairly free word structure. Speakers generally determine word boundaries through paralinguistic elements like prosody and with the help of a lexicon.

    Not to mention the fact that orthographic words (that is, the way we break things up when writing them down) don’t always match up with syntactic or phonological words.

    I just find it hard to believe.

  7. This might lead to a nice refutation of that “Tower of Babel” book. You know the one trying to prove common descent by analogy to the evolution of human language. Who wrote that?

  8. I went to the link mentioned by kairos in comment 2 above. I found this paragraph. I was amazed at the anti-ID spin:

    “We have already demonstrated the statistical results for the rate and redundancy of the base nucleotides and the codon triplets. We were the first to show the key results that the genetic code is both instantaneous and optimal and that it exactly meets the Kraft-MacMillan bound for such codes. We have also shown by an information theoretic argument why nature chooses to use the 64 possible codons to code for precisely 20 amino acids. This is a highly significant result that says a lot about the optimisation of the genetic code by evolution. We have also originated the notion of an error-correction mechanism in the DNA replication process, an argument that strongly supports one of the central theses of Richard Dawkins’ work on evolution.”
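
    For reference, the bound named in the quoted paragraph is the standard Kraft-McMillan inequality for uniquely decodable codes. A minimal worked instance follows, assuming only that each codon is a fixed-length word of 3 symbols over the 4-letter nucleotide alphabet. Whether this is the argument the quoted site has in mind is not stated there.

\[
\sum_{i=1}^{N} D^{-\ell_i} \le 1,
\qquad
\sum_{i=1}^{64} 4^{-3} = \frac{64}{64} = 1 .
\]

    Here D is the alphabet size, N the number of codewords, and \ell_i the length of the i-th codeword. Any fixed-length code that uses every possible word meets the bound with equality, so the equality on the right follows simply from using all 64 triplets.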

  9. What do you all think of the significance of the remark Dr. Shepherd makes: “We are treating DNA as we used to treat problems in intelligence. We want to break the code at the most fundamental level.” ? This is, then, essentially an ID approach to the human genome; and if it turns out positively, should be seen as confirmation of ID thinking. Conversely, should it turn up nothing, that would be refutation of ID. Hence, ID is falsifiable; and, it’s possible to use ID as a way of coming up with new experiments.

  10. So I’ve given this more thought since my last comment, and here’s what I’ve come up with:

    The algorithm is probably discovering morphemes rather than actual words. In English, there is very little bound morphology, so words and morphemes tend to coincide. I’m not sure how I would put a number on this, but 80% doesn’t sound unrealistic. I predict that the algorithm would fail quite miserably at finding words in polysynthetic languages, where almost all of the morphology is bound.

    Sadly, I doubt this algorithm will provide revolutionary results in genetics. Gene expression is radically context dependent, even more so than human language (shocking though that may be).

  11. Collin,

    “This might lead to a nice refutation of that ‘Tower of Babel’ book. You know the one trying to prove common descent by analogy to the evolution of human language. Who wrote that?”

    A bunch of monkeys.

  12. Reed Orak:
    “The algorithm is probably discovering morphemes rather than actual words. ”

    The summary article says that the program identified 80% of the “words” and broke them up into sentences. What’s the distinction between a “morpheme” and a “word” that you’re making?

  13. Morphemes are meaning-bearing or functional units in language. For example, “dog” and “-ed” are both morphemes in English, the former is a word and the latter is a suffix.

    It so happens that English doesn’t have very many affixes, and most words are composed of only one morpheme. In many other languages, there are virtually no words that are monomorphemic.

    Recognizing morphemes in an unknown language is not terribly difficult, but it takes some careful analysis. Recognizing words per se given only a string of letters seems practically insurmountable.

  14. Reed,

    “Gene expression is radically context dependent, even more so than human language (shocking though that may be).” – Reed Orak

    I agree with the basis of your thought here. While this method of analysis could prove useful, I think any sequence interpretation that leaves out the epigenome will be a woefully incomplete annotation of multicellular organisms.

  15. Douglas,

    Ha ha. But when did the monkeys know they had a book that made sense? I’m surprised they didn’t throw it away because it wasn’t Hamlet.

  16. Reed

    Not very impressive at all unless it was a small sample instead of a whole book. A whole book would be on the order of a hundred thousand words. Many of those would be repeats of common words. All one has to do is scan the text for strings that repeat very often. Common words like “and”, “but”, “said”, “she”, “he”, “they”, etc., would stick out like a sore thumb. Do a first pass breaking out the common words, which have a high confidence level, then iterate on the leftovers looking for repeats. Break your most common repeats out and do it again, and again, and again until you have no repeats. Four out of five correct delineations sounds about right. Putting it back into sentences is a no-brainer based on capitalization of the first word in a sentence.

    A similar technique is commonly used in compression algorithms, which scan long data strings and compile a “dictionary” of most to least frequently used strings. The dictionary is then analyzed and tokens are assigned to strings, with the most common strings getting the tokens with the fewest bits. The compressed data then becomes a list of tokens that point back to the dictionary. If text is being compressed, very high compression rates are achieved, as there’s a lot of redundancy in text. Data sets with little redundancy will produce a “compressed” file that is larger than the original.
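
    The last two claims are easy to check from a shell. The sketch below uses gzip as a stand-in, which is an assumption: gzip's DEFLATE format combines LZ77 back-references with Huffman codes, so it is a relative of the dictionary-and-token scheme described above, not the same algorithm. The helper name gz_size and the sample phrase are made up for illustration.

```shell
# Size in bytes of standard input after gzip compression.
# (gz_size is a hypothetical helper name for this sketch.)
gz_size() { gzip -c | wc -c | tr -d ' '; }

# Highly redundant input: one 21-byte line repeated 200 times (4200 bytes).
redundant=$(awk 'BEGIN { for (i = 0; i < 200; i++) print "she said and he said" }' | gz_size)

# Input with essentially no redundancy: 4000 random bytes.
random=$(head -c 4000 /dev/urandom | gz_size)

echo "redundant text: 4200 bytes in, $redundant bytes out"
echo "random bytes:   4000 bytes in, $random bytes out"
```

    The repeated text should shrink to a small fraction of its size, while the random bytes should come out slightly larger than they went in, because the format's header and bookkeeping cannot be paid for by any savings.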

    There’s probably a lot more tweaking of the algorithm that would occur after seeing the results of various attempts. Breaking at the wrong points would probably leave a hideous number of long strings that don’t repeat, so you might want to iterate entire passes looking for breakpoints which give the fewest unrepeating long strings. I don’t know how much knowledge of word boundaries was allowed to sneak in, but you could also throw out breakpoints which leave words behind with unpronounceable first or last letters. Say we take “last letters” and break “as” out of it as a word. That would cause a word beginning with “tl” following it. You could throw that out as not likely, but knowing which letters are unpronounceable would probably be sneaking too much information in. What would result in the end is a word that begins with “tl” that probably wouldn’t repeat anywhere, so you’d want to try something other than breaking out “as” to see if it makes more words that do indeed have some repeats. That would be fair dinkum. Processing time could get excessive, but the article didn’t say throughput was of any particular concern. Most compression algorithms are concerned with not taking a long time to do the compression.
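
    The iterate-on-repeats idea can be sketched in a few lines of shell and awk. This is a toy under stated assumptions, not Professor Shepherd's method, which the article does not describe. The function name segment, the length limits, and the rule for picking the "most common repeat" (highest count times length, ties broken alphabetically) are all choices made here for illustration. It also lowercases everything, so the capitalization trick for sentences is not covered, and it does none of the backtracking over bad breakpoints.

```shell
# Greedy repeat-based segmentation of a spaceless letter string.
# Usage: segment [minlen] [maxlen] < text     (defaults: 2 and 8)
segment() {
    LC_ALL=C awk -v minlen="${1:-2}" -v maxlen="${2:-8}" '
    { text = text tolower($0) }
    END {
        gsub(/[^a-z]/, "", text)              # keep letters only
        while (1) {
            # Count every substring of the runs that are still lowercase;
            # uppercase runs are words already broken out.
            split("", cnt)
            n = split(text, run, " ")
            for (i = 1; i <= n; i++) {
                if (run[i] !~ /^[a-z]+$/) continue
                len = length(run[i])
                for (p = 1; p <= len; p++)
                    for (k = minlen; k <= maxlen && p + k - 1 <= len; k++)
                        cnt[substr(run[i], p, k)]++
            }
            # Choose the repeated substring with the highest count*length,
            # breaking ties alphabetically so the result is deterministic.
            best = ""; score = 0
            for (s in cnt) {
                v = cnt[s] * length(s)
                if (cnt[s] > 1 && (v > score || (v == score && s < best))) {
                    best = s; score = v
                }
            }
            if (best == "") break             # nothing repeats any more
            # Break out every occurrence and mark it as done (uppercase).
            gsub(best, " " toupper(best) " ", text)
        }
        gsub(/ +/, " ", text); gsub(/^ | $/, "", text)
        print tolower(text)
    }'
}
```

    For example, `printf 'abxabyabz\n' | segment` prints `ab x ab y ab z`, since “ab” is the only string that repeats. It recounts all substrings on every pass, so it is far too slow for a whole novel as written.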

  17. Reed,

    Recall that it takes few inflected forms to account for most of a writer’s verbiage. In Thackeray’s Vanity Fair, 72 distinct inflected forms account for half the verbiage. In the text of a poorly educated writer I studied, 50 distinct inflected forms account for half the verbiage.

    I would rate Austen’s language as less complex than Thackeray’s, so I would guess that the number of distinct inflected forms accounting for half of the verbiage of Emma is less than 72, and is closer to 72 than to 50…

    Well, speculating like that bugged me, so I went ahead and downloaded the novel (plain text format) from Project Gutenberg. There are 160,990 words of text. Simply deleting apostrophes and replacing all other punctuation marks with white space, I count 7156 distinct inflected forms. The most frequent 56 cover 50% of the verbiage, and the most frequent 473 (6.6% of inflected forms in the novel) cover 80% of the verbiage. All of those 473 inflected forms occur 40 or more times in the text. While this does not account for everything Shepherd’s system did, I think it gives a good idea of why his results are possible.

    For anyone who would like to check my work, here’s the Unix shell script I used to process the Gutenberg text:

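    tr -d "'" |                       # delete apostrophes
    tr -cs "[:alpha:]" "\n" |         # every run of non-letters becomes one newline
    tr "[:upper:]" "[:lower:]" |      # fold to lower case
    sort |
    uniq -c |                         # count each distinct form
    sort -rn |                        # most frequent first
    awk '{ cum += $1; print $1, cum, $2; }'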

    It reads from the standard input and writes to the standard output. The output begins with

    5242 5242 to
    5204 10446 the
    4897 15343 and
    4293 19636 of
    3191 22827 i

    The first column is the count of instances of the inflected form and the second column is the cumulative count.
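
    The coverage figures (how many of the most frequent forms account for 50% or 80% of the text) can be read off the cumulative column by hand, or computed directly. A minimal sketch that reuses the same tr/sort/uniq steps follows. The function name forms_for_coverage and its percentage argument are made up for illustration, and no output for the novel is claimed here.

```shell
# Print how many of the most frequent forms are needed to cover at least
# PCT percent of all word tokens read from standard input.
# Usage: forms_for_coverage PCT < text
forms_for_coverage() {
    pct=$1
    tr -d "'" |
    tr -cs "[:alpha:]" "\n" |
    tr "[:upper:]" "[:lower:]" |
    sort |
    uniq -c |
    sort -rn |
    awk -v pct="$pct" '
        $2 != "" { count[++n] = $1; total += $1 }   # skip the empty form, if any
        END {
            for (i = 1; i <= n; i++) {
                cum += count[i]
                if (cum * 100 >= pct * total) { print i; exit }
            }
        }'
}
```

    Assuming the Gutenberg text is saved as emma.txt (a hypothetical file name), `forms_for_coverage 50 < emma.txt` and `forms_for_coverage 80 < emma.txt` give the two counts. The result depends on whether the Gutenberg header and license text are left in the file.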

  18. Thanks DS, I appreciate you sharing your take on how you’d do it. I think (from my own CS/algorithm background) your approach makes sense.

  19. DaveScot,

    The process you’re describing is exactly the kind of morpheme-discovery process that I had envisioned. Making use of capital letters to find sentence boundaries seems like cheating (which doesn’t mean that they didn’t do it), but using English orthography rather than a phonetic representation also seems like cheating.

    The problem is that you’ll end up with “words” like ‘lessly’ (from examples like ‘painlessly’, ‘hopelessly’) or ‘ment’ (‘pavement’, ‘bewilderment’) etc., and there’s no way for this system to distinguish between the plural ‘-s’ suffix and the possessive ‘=s’ clitic, for example.

  20. I also agree that an 80% discrimination for English doesn’t seem an impressive result. Judging by Shepherd’s background and publications, he probably did use some form of compression and cryptanalysis.
    Going back to how his work could be relevant for ID: I think it could be, from a practical point of view, but to be accepted in the genome community he has perhaps been forced to add some formal anti-ID statements, together with the classic reference to RD’s relevance (please remember that he is British).

  21. “Professor Shepherd … He said: “We are treating DNA as we used to treat problems in intelligence. We want to break the code at the most fundamental level.” ”

    What I see as important here is that his approach is not NDE based. It assumes an intelligent design – that’s what codes inherently are ... despite the fact that he admitted to approaching it as he would any problem in intelligence.

    In retrospect, especially if his algorithm is fruitful, this will be best classified as an ID approach to science.

  22. kvwells(#5),

    E. Scott is also working in concert with churches to push NDE through compatibilism, never mind that Provine says of this kind of integration, “check your brain at the church door.”

    Physicists Rob Phillips and Stephen R. Quake write in The Biological Frontier of Physics (http://www.physicstoday.org/vol-59/iss-5/p38.html): “Molecular machines are the basis of life. DNA, a long molecule that encodes the blueprints to create an organism, may be life’s information storage medium, but it needs a bevy of machines to read and translate that information into action. The cell’s nanometer-scale machines are mostly protein molecules, although a few are made from RNA, and they are capable of surprisingly complex manipulations. They perform almost all the important active tasks in the cell: metabolism, reproduction, response to changes in the environment, and so forth. They are incredibly sophisticated, and they, not their manmade counterparts, represent the pinnacle of nanotechnology. Yet scientists have no general theory for their assembly or operation. The basic physical principles are individually well understood; what is lacking is a framework that combines the elegance of abstraction with the power of prediction.”

    And: “Proteins as molecules are polymers, and can often be treated with a combination of continuum mechanics and statistical mechanics. They act, in other words, as essentially classical objects…. The theme of collective action is also revealed in the flow of information in biological systems. For example, the precise spatial and temporal orchestration of events that occurs as an egg differentiates into an embryo requires that information be managed in processes called signal transduction. Biological signal transduction is often broadly presented as a series of cartoons: Various proteins signal by interacting with each other via often poorly understood means. That leads to a very simple representation: a network of blobs sticking or pointing to other blobs. Despite limited knowledge, it should be possible to develop formal theories for understanding such processes. Indeed, the general analysis of biological networks—systems biology—is now generating great excitement in the biology community.

    Information flow in the central dogma is likewise often presented as a cartoon: a series of directed arrows showing that information moves from DNA to RNA to proteins, and from DNA to DNA. But information also flows from proteins to DNA because proteins regulate the expression of genes by binding to DNA in various ways. Though all biologists know that interesting feature of information flow, central-dogma cartoons continue to omit the arrow that closes the loop. That omission is central to the difference between a formal theory and a cartoon. A closed loop in a formal theory would admit the possibility of feedback and complicated dynamics, both of which are an essential part of the biological information management implemented by the collective action of genes, RNA, and proteins. ”

    Now physicists claim there exists a wide-open field for research on how the cell’s nanomachinery actually works.

    Eric Peterson

  23. Anti-informationists read:
    “Yet scientists have no general theory for their assembly or operation.” Physicists Rob Phillips and Stephen R. Quake in “The Biological Frontier of Physics.”
