Home » Genomics, Informatics, Intelligent Design » A simple statistical test for the alleged “99% genetic identity” between humans and chimps

A simple statistical test for the alleged “99% genetic identity” between humans and chimps


Typical figures published in the scientific literature for the percentage similarity between the genomes of human beings (Homo sapiens) and chimpanzees (Pan troglodytes) range from 95% to 99%. However, in press releases intended for popular consumption, evolutionary biologists frequently claim that human and chimpanzee genomes are 99% identical. Skeptics of neo-Darwinian evolution have repeatedly punctured this “99% myth,” but unfortunately, it seems to have gained widespread credence, thanks to its continual propagation by evolutionists. For instance, one often encounters statements like these in the literature:

“Because the chimpanzee lies at such a short evolutionary distance with respect to human, nearly all of the bases are identical by descent and sequences can be readily aligned” (The Chimpanzee Sequencing and Analysis Consortium, “Initial sequence of the chimpanzee genome and comparison with the human genome,” Nature, Vol. 437, 1 September 2005, doi:10.1038/nature04072).

“The consortium [National Human Genome Research Institute] found that the chimp and human genomes are very similar and encode very similar proteins. The DNA sequence that can be directly compared between the two genomes is almost 99 percent identical.” (here).

“The genetic codes of chimps and humans are 99 percent identical.” (here)

Supporters of the neo-Darwinian theory of evolution have a strong ideological motivation for minimizing the differences between humans and chimps, as they claim that these two species evolved from a common ancestor, as a result of random mutations filtered by natural selection. Now, I don’t personally believe that humans and chimps share a common ancestry, for a host of reasons that would take me too long to explain in this post. Nor do I attach much significance to the magnitude of the genetic differences between these two species, per se, because in my opinion, the fundamental differences between these creatures lie elsewhere. However, since the genomic data is now available for free on the Internet, I decided to perform some sleuthing of my own, and check out the wildly exaggerated claims that are often made regarding the percentage similarities between human and chimp genomes. Here is what I discovered.

Interactive functional comparison methods
Usually, molecular biologists compare genomes on a functional basis. For example, they may search for similar genes in the genomes of human beings and chimpanzees, and try to identify the bases or nucleotides where they differ or match. Many different technologies have been developed to investigate genomes. One of these is the BLAST (Basic Local Alignment Search Tool) software (see the NCBI Web site for more details). BLAST is an extremely powerful computer-aided tool, as it is able to locate regions of local similarity among sequences by searching a whole database of genomes. Alignment methods (such as those implemented by BLAST and other techniques) allow geneticists to search interactively for common local patterns at different positions. However, this interactive task has its limits, as it can compare only portions of different genomes. Additionally, some critics have pointed out that these tools are susceptible to slip-ups (see here). Given the amount of data involved (on the order of gigabytes), the global comparison of two genomes is a very demanding job, which cannot be completed interactively in a short time by human beings, even with the aid of tools such as BLAST. At present, only fully automated computer programs are capable of performing such a task on entire genomes. However, the development of an automated program capable of performing a complete functional comparison between the human and chimpanzee genomes is practically impossible, for the simple reason that the functional architecture of these genomes is not yet fully known.

Automatic statistical comparison methods
From a purely informatic and statistical point of view, DNA sequences are simply strings of symbols or characters. Thus it is also possible to develop tests that compare genomes as unstructured sequences of characters, without taking into consideration genes, pseudo-genes, coding and non-coding regions, vertical and horizontal gene transfer, open reading frames (ORFs), or any other functional concepts. The characters most commonly present in DNA sequences are A, C, G and T. There are other, less common characters, which are used mainly to indicate ambiguity regarding the identity of certain bases in the sequences. The comparison I performed was completely different from those usually performed by geneticists, because it was purely statistical in nature. In a sense, it could be described as an application of the well-known Monte Carlo method. The Monte Carlo method is frequently used when the data or processes involved are huge, and one wants to reduce computer running time. In short, it involves working with a partial random sample, instead of the whole space under investigation. Only a small portion of the data population is actually examined; nevertheless, this portion is statistically large enough to reveal the characteristics of the whole.
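To illustrate the Monte Carlo idea, here is a small sketch I have added (it is not part of the original test): estimating a property of a huge string from a small random sample gives nearly the same answer as an exhaustive count over the whole string.

```python
import random

random.seed(1)  # reproducible run

# A large "population": a 1-million-base random DNA string, with the base
# frequencies assumed later in the post (A and T at 0.3, G and C at 0.2).
population = random.choices("ATGC", weights=[3, 3, 2, 2], k=1_000_000)

# Exhaustive count: the exact G+C fraction over the whole population.
exact_gc = sum(b in "GC" for b in population) / len(population)

# Monte Carlo: estimate the same fraction from only 10,000 sampled positions.
sample = random.sample(population, 10_000)
estimated_gc = sum(b in "GC" for b in sample) / len(sample)

print(f"exact GC fraction:    {exact_gc:.4f}")
print(f"Monte Carlo estimate: {estimated_gc:.4f}")
```

The two figures agree to within a fraction of a percent, even though the sample inspects only 1% of the data; this is the statistical property a sampling test of this kind relies on.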

Metrics, distances and similarity measures
One theoretical approach to the problem is to consider the set of all strings of characters as a metric space, and then define a distance function for all pairs of strings. Many distance functions have been developed by mathematicians for studying the degree of similarity between strings (for a list of them, see here). Each metric or pseudo-metric space, with its distance function, defines its own notion of similarity, which may differ from the similarity defined by another metric space. In a pairwise identity test, we can easily calculate a simple metric distance called the “Hamming distance.” In this test, order is important: after the initial characters of strings A and B have been aligned, the n-th character of A is compared to the n-th character of B, and each mismatch increases the Hamming distance by 1. If order doesn’t matter, we can instead compare sub-strings of the parent strings A and B; since these sub-strings may occur at different positions in the two strings, many different tests are possible. We call these pattern-matching or similarity tests. While there is essentially only one way of comparing identity between strings of characters (the pairwise comparison above), there are many ways of comparing similarity. In other words, there are many measures of similarity, depending on the rules of pattern matching that we choose. In practice, calculating a given distance function between two genomes can be a demanding job, in terms of running time, even for powerful computers.
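As a minimal illustration of the pairwise identity test (my own sketch in Python, not the author’s Perl script), the Hamming distance can be computed like this:

```python
def hamming_distance(a: str, b: str) -> int:
    """Compare a and b position by position after aligning their first
    characters; each mismatch adds 1 to the distance. The Hamming
    distance is only defined for equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length strings")
    return sum(x != y for x, y in zip(a, b))

# The n-th base of A is compared with the n-th base of B.
print(hamming_distance("GATTACA", "GACTATA"))  # 2 (mismatches at positions 3 and 6)
```

A distance of zero means the two strings are identical; the maximum distance equals the string length.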

Specifications for a statistical similarity test
Any final result of a complete statistical similarity test (especially if it is a single number) is meaningful only if: 1) the distance function is mathematically defined; 2) the rules for pattern matching and the formulas for calculating the result are explained in detail; 3) it is clearly stated which parts of the input strings are being examined; and 4) in the event that computer programs were used to perform the comparison, the source code and algorithms are provided. My explanations below aim to meet the first three conditions. To satisfy the fourth, the source file of the Perl script used for the test is freely downloadable here.

How the genome data was obtained
Genome data for Homo sapiens and Pan troglodytes was freely downloaded from the public bioinformatics archives at UCSC Genome Bioinformatics. The downloaded DNA sequences were in FASTA format. Before running the test, I decided to discard all symbols in the sequences except A, C, G and T. Most of the discarded symbols were “N” symbols, which represent undefined bases (probably due to limitations of the sequencing technology). The frequency of other symbols was very low. As it turned out, the deletion of the “N” symbols didn’t affect the overall result very much. Given that the chimp genome contains two chromosomes (referred to as chr_02a and chr_02b) corresponding to chromosome #2 in human beings, I decided to concatenate them, in order to compare them with human chromosome #2 (chr_02).
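The cleanup step described above can be sketched as follows. This is my own illustration, not the author’s Perl code, and the file names in the trailing comment are hypothetical:

```python
def fasta_bases(lines) -> str:
    """Return the sequence in a FASTA record as one string, keeping only
    the unambiguous bases A, C, G and T (discarding 'N' placeholders and
    other IUPAC ambiguity codes)."""
    keep = set("ACGT")
    chunks = []
    for line in lines:
        if line.startswith(">"):          # FASTA header line: skip it
            continue
        chunks.append("".join(c for c in line.strip().upper() if c in keep))
    return "".join(chunks)

# Small demonstration: 'N' and ambiguity codes such as R/Y are dropped.
print(fasta_bases([">chr_test", "ACGTNNACGT", "acgtRYacgt"]))  # ACGTACGTACGTACGT

# Hypothetical usage on real files -- chimp chr_02a and chr_02b concatenated
# for comparison against human chromosome 2:
# chimp_chr2 = fasta_bases(open("pt_chr02a.fa")) + fasta_bases(open("pt_chr02b.fa"))
```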

30 Base Pattern Matching (30BPM) similarity test
The 30BPM similarity test is very simple: it searches for shared 30-base-long patterns on two homologous chromosomes. This method is a true pattern-matching test, because it searches for identical patterns in the chromosomes of humans and chimpanzees. The beauty of this test is that it allows patterns to match independently of their position in the chromosome: identical patterns may be found at quite different positions along the two homologous chromosomes. In fact, this test allows a total scrambling of patterns between homologous chromosomes. Of course, it is generally very difficult to know what the functional implications of such scrambling are. In particular, the positions of genes might shift, and when non-coding DNA is scrambled, it is doubtful that functionality is preserved. However, from a purely quantitative point of view, in this particular test I don’t need to worry about qualitative issues such as functionality; only statistical issues count.

The algorithm implemented
For each pair of homologous chromosomes A and B, a PRNG (pseudo-random number generator) produces 10,000 uniformly distributed pseudo-random numbers, which specify the offsets, or starting points, of 10,000 30-base patterns contained in the source chromosome A. The 30BPM test then searches for each of these 10,000 DNA sub-strings of chromosome A in the target chromosome B. Let F be the number of patterns located (at least once) in chromosome B. The 30BPM similarity is simply defined as F/100 (minimum value = 0%, maximum value = 100%). The difference 10,000 − F (minimum 0, maximum 10,000) is the 30BPM distance; thus the greater the similarity, the smaller the distance. Strictly speaking, the 30BPM distance is only a pseudo-metric, inasmuch as the axiom of identity (“the distance is zero if and only if A and B are equal”) defining a true metric space is somewhat relaxed (in some cases, the distance could be zero even if A and B were different), while the axiom of symmetry (“the distance between A and B equals the distance between B and A”) does not hold in some cases. It is easy to see that the 30BPM distance will be zero (30BPM similarity = 100%) if the two strings are identical. In an additional test, performed on two random 100 million-base DNA strings, the 30BPM distance was 10,000 (i.e. no pattern of A was located in B). Hence I shall refer to the value 10,000 as the “random 30BPM distance.” In other words, the 30BPM similarity between two artificially generated random 100 million-base DNA strings is zero. Of course, when generating these artificial DNA strings I had to take into account the fact that, on average, the true probabilities of A, T, G and C occurring in natural DNA are not exactly 0.25 each, but approximately: A = 0.3, T = 0.3, G = 0.2, C = 0.2. In that case, the following formula gives the probability of a single-base match between two unrelated DNA sequences:

0.3×0.3 + 0.3×0.3 + 0.2×0.2 + 0.2×0.2 = 0.09 + 0.09 + 0.04 + 0.04 = 0.26 = 26%
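The same 26% figure follows directly from the squared base frequencies; here is a quick check (a sketch I have added, using the frequencies assumed in the post):

```python
# Assumed base frequencies: A and T at 0.3, G and C at 0.2.
freqs = {"A": 0.3, "T": 0.3, "G": 0.2, "C": 0.2}

# For two unrelated sequences, the bases at a given position agree with
# probability equal to the sum of the squared frequencies.
p_match = sum(p * p for p in freqs.values())
print(f"{p_match:.2%}")  # 26.00%
```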

In a supplementary test, in which I performed a pure pairwise comparison between the human and chimp genomes, I obtained a global figure of 25.90%, which matches the theoretically predicted result above very closely.
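For concreteness, the 30BPM procedure described above can be sketched as follows. This is an illustrative Python re-implementation, not the author’s downloadable Perl script, and the naive substring search would be too slow for a real run over 100-million-base chromosomes:

```python
import random

PATTERN_LEN = 30

def bpm30(source: str, target: str, n_samples: int = 10_000):
    """30BPM test: draw n_samples 30-base patterns from uniformly random
    offsets in `source` and count how many occur, at any position, in
    `target`. Returns (similarity in %, 30BPM distance)."""
    found = 0
    for _ in range(n_samples):
        start = random.randrange(len(source) - PATTERN_LEN + 1)
        pattern = source[start:start + PATTERN_LEN]
        if pattern in target:         # position-independent pattern matching
            found += 1
    similarity = 100.0 * found / n_samples   # equals F/100 when n_samples = 10,000
    distance = n_samples - found             # 0 for identical strings, n_samples for unrelated ones
    return similarity, distance
```

For genuine chromosome-sized inputs, one would index all 30-mers of the target (e.g. in a hash set) instead of rescanning the whole target string for every sampled pattern.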

Results obtained
The following table and graph summarize the results of the 30BPM similarity test on the whole set of human/chimp chromosomes.

The results obtained are statistically robust. The same test was previously run on a sample of 1,000 random 30-base patterns, and the percentages obtained were almost identical to those obtained in the final test with 10,000 random 30-base patterns. When the human and chimp genomes are compared, the X chromosome shows the highest degree of 30BPM similarity (72.37%), while the Y chromosome shows the lowest (30.29%). On average, the overall 30BPM similarity, when all chromosomes are taken into consideration, is approximately 62%. Here we have the classic case of the glass which some people perceive as half-full, while others perceive it as half-empty. Compared to two random strings, which are 0% similar, 62% is a very large value, so nobody would deny that the human and chimp genomes are quite similar! On the other hand, 62% is a very low value compared to the 95%-plus similarity figures published by evolutionary bioinformatics researchers. Now, I realize that it may seem somewhat arbitrary to choose 30-base-long patterns, as I did in my test, and indeed it is arbitrary to some degree. However, if the two genomes were really 95% similar or more, as is commonly claimed, then a 30BPM statistical test should also produce results close to 95%, and it does not.

An analogy from politics: an exit poll
To help readers grasp the significance and potential implications of my test, here is a simple analogy. Consider an election in which 100 million electors are eligible to vote. An exit poll, based on a sample of 10,000 voters, calculates that party X has received 62% of the popular vote. However, at the end of the election, party X declares that it has received more than 95% of the vote! The 30BPM statistical test described above is analogous to the exit poll, while the claims made by evolutionary biologists are analogous to party X’s “95%” claim. The sample of 10,000 patterns is taken from a global population of about 100 million bases (the approximate number of bases on a typical human/chimp chromosome), so the ratio of population to sample is 100,000,000/10,000 = 10,000. The 30BPM exit poll metaphorically says that only 62% voted for Darwin’s party, whereas modern Darwinists claim that over 95% did. Something doesn’t quite add up.

I believe that the classic evolutionary comparisons between human and chimp genomes exaggerate the similarities, for at least two reasons: (1) they don’t consider whole chromosomes, but only portions of them (e.g. particular genes); (2) the rules of pattern matching are relaxed in some way (e.g. sometimes two bases are said to match, even when they don’t really match). Now, there is nothing intrinsically wrong with comparisons where (1) and (2) hold. However, any research that is truly worthy of being called “scientific” should openly acknowledge built-in limitations, such as (1) and (2) above. Sadly, this is very rarely done. It is perfectly acceptable to publish partial results that are obtained by relaxing the rules, but one should not publicize them as global and mathematically sound, when in fact, they are nothing of the sort.

Conclusion
We have seen that in a genome comparison, the only thing that matters is the degree of similarity. However, once we put the concept of similarity between two text strings on the table, we open a can of worms. Many different measures of the similarity between two strings are possible, and different methods of comparing two genomes can yield wildly different estimates of the similarity between them. The assumptions that drive the methods used also drive the results obtained, as well as their interpretation. A simple layman’s statistical test, such as the 30BPM, shows that the “95% claim” described above is a highly controversial one. It is worth noting that as more information comparing the two genomes is published, the differences between them appear more profound than they were originally thought to be. The big question that remains is: what should one conclude from the similarities and differences between the genomes of humans and chimpanzees? The commonly reported evolutionary statistics that should provide an informative answer to this question may actually obscure the true answer.


82 Responses to A simple statistical test for the alleged “99% genetic identity” between humans and chimps

  1. I am so glad you did this analysis niwrad. This 99% myth must be the one myth, on top of all other evolutionary myths, that is used to promulgate evolution to school children as an undeniable fact.

    Now maybe perhaps you can do an analysis on the ‘cartoon drawings’ showing man evolving from ape?!? 8)

  2. niwrad

    I wonder what results you would get if you did the test on two unrelated humans?

  3. Interesting. I’ve got two questions. First, to clarify the 30BPM metric. Let’s say you’ve got a string of 30 T’s on chromosome A, and you’re looking for a matching string on chromosome B. Suppose there is no string of 30 T’s on chromosome B, but there is a string of 29 T’s followed by a G. That is, a single point mutation could account for the difference between the strings. Does your metric count that as a match, or as no match? Because clearly, if it’s been a couple million years since two species diverged, there would be an accumulation of mutations in each that might break up a pattern to some extent (particularly in non-coding DNA). In other words, your pseudo-metric might be biased toward Type I errors.

    Second, you ran the comparison between two randomly generated “chromosomes” to confirm that unrelated strings generate a pseudo-metric value of 0. Good start. But did you do that for the other end of the pseudo-metric? In particular, did you generate two identical strings, then apply random mutations (base changes, insertions, deletions, relocations, etc.) to each of them, and THEN run the 30 BPM test? I’d be interested to see the results of that test case.

  4. niwrad, I would be interested to know what numbers you would get if you ran this comparison through:

    Kangaroo genes close to humans
    Excerpt: Australia’s kangaroos are genetically similar to humans,,, “There are a few differences, we have a few more of this, a few less of that, but they are the same genes and a lot of them are in the same order,” ,,,”We thought they’d be completely scrambled, but they’re not. There is great chunks of the human genome which is sitting right there in the kangaroo genome,”
    http://www.reuters.com/article.....P020081118

    If you want the sequences for the kangaroo genome I believe one of the people at the bottom of this page can help you:

    Australian First: Kangaroo Genome Mapped
    http://archive.uninews.unimelb.....55743.html

  5. Bornagain77,

    My guess is less than 10%. For one thing, Kangaroos have 12 chromosomes, and Niwrad’s algorithm compares one chromosome to another. For another, according to the article you cite, we’re supposed to have shared a common ancestor 150 million years ago with ‘Roos, whereas it’s only supposed to be 5-7 million years since our common ancestor with chimps.

    Also, I noticed that the article just says there are large chunks of DNA that we share with ‘Roos. It doesn’t say what the percent similarity is, and in particular it doesn’t say anything about the similarities in non-coding DNA. I’d bet good money we’re miles apart on that front.

  6. well AMW, I have my reservations as to your certainty, and I think niwrad’s test may just well lend itself to such a test with a few modifications. certainly he won’t have to relax constraints as much as Darwinists have in order to arrive at their biased 95% to 99% conclusion!

    Further notes:

    the chimp genome is about 12% larger than the human genome. A recent, more accurate human/chimp genome comparison study, by Richard Buggs in 2008, found that when he rigorously compared the recently completed sequences of the chimpanzee and human genomes side by side, the similarity between chimps and man fell to slightly below 70%! Why is this study ignored, since the ENCODE study has now implicated 100% high-level functionality across the entire human genome? Finding compelling evidence that implicates 100% high-level functionality across the entire genome clearly shows that the similarity comparison is not to be limited to the very biased ‘only 1.5% of the genome’ studies of evolutionists.

    Chimpanzee?
    10-10-2008 – Dr Richard Buggs – research geneticist at the University of Florida
    …Therefore the total similarity of the genomes could be below 70%.
    http://www.idnet.com.au/files/pdf/Chimpanzee.pdf

    Moreover, when scientists did an actual nucleotide-by-nucleotide sequence comparison, to find the ‘real world’ difference between the genomes of chimps and humans, they found the difference was even more profound than what Dr. Richard Buggs, or the statistical test, had estimated:

    Do Human and Chimpanzee DNA Indicate an Evolutionary Relationship?
    Excerpt: the authors found that only 48.6% of the whole human genome matched chimpanzee nucleotide sequences. [Only 4.8% of the human Y chromosome could be matched to chimpanzee sequences.]
    http://www.apologeticspress.org/articles/2070

    As well niwrad, I thought you might like to know that your ‘stunning’ y chromosome dissimilarity added weight to these studies:

    Recent Genetic Research Shows Chimps More Distant From Humans,,, – Jan. 2010
    Excerpt: “many of the stark changes between the chimp and human Y chromosomes are due to gene loss in the chimp and gene gain in the human” since “the chimp Y chromosome has only two-thirds as many distinct genes or gene families as the human Y chromosome and only 47% as many protein-coding elements as humans.”,,,, “Even more striking than the gene loss is the rearrangement of large portions of the chromosome. More than 30% of the chimp Y chromosome lacks an alignable counterpart on the human Y chromosome, and vice versa,,,”
    http://www.evolutionnews.org/2.....shows.html

    Chimp and human Y chromosomes evolving faster than expected – Jan. 2010
    Excerpt: “The results overturned the expectation that the chimp and human Y chromosomes would be highly similar. Instead, they differ remarkably in their structure and gene content.,,, The chimp Y, for example, has lost one third to one half of the human Y chromosome genes.
    http://www.physorg.com/news182605704.html

    further notes:

    When we consider the remote past, before the origin of the actual species Homo sapiens, we are faced with a fragmentary and disconnected fossil record. Despite the excited and optimistic claims that have been made by some paleontologists, no fossil hominid species can be established as our direct ancestor. Richard Lewontin – Harvard Zoologist
    http://www.discovery.org/a/9961

    Evolution of the Genus Homo – Annual Review of Earth and Planetary Sciences – Tattersall, Schwartz, May 2009
    Excerpt: “Definition of the genus Homo is almost as fraught as the definition of Homo sapiens. We look at the evidence for “early Homo,” finding little morphological basis for extending our genus to any of the 2.5–1.6-myr-old fossil forms assigned to “early Homo” or Homo habilis/rudolfensis.”
    http://arjournals.annualreview.....208.100202

    Man is indeed as unique, as different from all other animals, as had been traditionally claimed by theologians and philosophers. Evolutionist Ernst Mayr
    http://www.y-origins.com/index.php?p=home_more4

    “Something extraordinary, if totally fortuitous, happened with the birth of our species….Homo sapiens is as distinctive an entity as exists on the face of the Earth, and should be dignified as such instead of being adulterated with every reasonably large-brained hominid fossil that happened to come along.”
    Anthropologist Ian Tattersall
    (curator at the American Museum of Natural History)

  7. Bornagain77,

    For what it’s worth, the probability that I’ll read your entire comment is inversely related to the number of links that you post in it. I just haven’t the time to go through them all. What’s more, when I responded to one link, you gave me seven more to look up. It’s worse than fighting a hydra! Follow-up comments would just lead us further afield.

    In short, when I see a Gish Gallop, I disengage. If you want to discuss any one concept or article in depth, I’ll be much more likely to respond.

  8. Also, do you just have excerpts and links on file somewhere? Because I did a Google search on part of your intro to the Dr. Richard Bugg link, and the first six or seven links were all to different comments where you’d posted it.

  9. niwrad: While your code seems to be valid, I suggest that your matching scheme is naive, and will report more mis-matches than a more complicated algorithm. Hamming distance does not work very well with deletions and insertions, which are common in DNA.

    I suggest that a Needleman-Wunsch distance may be a better solution to this string-matching problem.

    Or for those visual thinkers among us, I suggest using a dot plot such as DNA dot plotter to compare sequences.

    Sources: Course I am currently taking + the textbook

  10. AMW, the Buggs’s link I listed works fine for me,, here it is again,,,

    http://www.idnet.com.au/files/pdf/Chimpanzee.pdf

    As far as the ‘hydra’ you’re fighting goes, I just want you to know that I have definite reasons for doubting the certainty with which you state your ‘conclusion’ that man evolved from apes.

  11. DCX, but your computer model for establishing similarity seems to be based, I’m pretty sure, on the assumption that evolution has occurred, so as to find similarities and ignore discrepancies, so of course it will give a different reading. Thus your model would seem to commit the same fallacy as the other models niwrad is critiquing. Namely, your computer program ends up proving evolution is true because evolution is first assumed to be true prior to the search. Just a bit biased, wouldn’t you say?

  12. To make it even easier:

    Look at this tutorial video:
    human-chimp dot plot

    It describes how a dot plot would look for humans and chimps.

    You can then look at the SynMap program yourself at:
    SynMap

  13. Ah bornagain77, what a quick response. Thank you.

    This problem is not based entirely on evolution. It is in fact a problem based entirely on string comparison. DNA, like any piece of text, can undergo this treatment. For example, you could run two students’ essays through a dot plot and find out whether significant passages have been stolen.

    A “plagiarism dot plot” Google query will bring up many pages on the subject.

    I believe the point is not to prove that evolution is true. It is to merely state that evolution would predict “closely-related” genomes to be more similar than “non-closely-related” genomes.

    From these simple comparisons, it looks pretty likely that while evolution throws the DNA out of alignment and changes it, evolution still “preserves” most of the text.

    What predictions does the design hypothesis have? Can design be seen in the mutation/insertion/deletion/chromosome adjustment/reversals?

  14. DCX, well I guess ID would predict something like this:

    Chimp chromosome creates puzzles – 2004
    Excerpt: However, the researchers were in for a surprise. Because chimps and humans appear broadly similar, some have assumed that most of the differences would occur in the large regions of DNA that do not appear to have any obvious function. But that was not the case. The researchers report in ‘Nature’ that many of the differences were within genes, the regions of DNA that code for proteins. 83% of the 231 genes compared had differences that affected the amino acid sequence of the protein they encoded. And 20% showed “significant structural changes”. In addition, there were nearly 68,000 regions that were either extra or missing between the two sequences, accounting for around 5% of the chromosome.,,, “we have seen a much higher percentage of change than people speculated.” The researchers also carried out some experiments to look at when and how strongly the genes are switched on. 20% of the genes showed significant differences in their pattern of activity.
    http://www.nature.com/news/199.....524-8.html

    Chimps are not like humans – May 2004
    Excerpt: the International Chimpanzee Chromosome 22 Consortium reports that 83% of chimpanzee chromosome 22 proteins are different from their human counterparts,,, The results reported this week showed that “83% of the genes have changed between the human and the chimpanzee—only 17% are identical—so that means that the impression that comes from the 1.2% [sequence] difference is [misleading]. In the case of protein structures, it has a big effect,” Sakaki said.
    http://cmbi.bjmu.edu.cn/news/0405/119.htm

    or maybe ID would predict something like this DCX,,,

    This following article, which has a direct bearing on the 98.8% genetic similarity myth, shows that over 1000 ‘ORFan’ genes, which are completely unique to humans and not found in any other species, and which may very well directly code for proteins, were stripped from the 20,500-gene count of humans simply because the evolutionary scientists could not find corresponding genes in primates. In other words, evolution of humans from primates was assumed to be true in the first place, and then the genetic evidence was directly molded to fit in accord with their unproven assumption. It would be hard to find a more biased and unfair example of practicing science!

    Human Gene Count Tumbles Again – 2008
    Excerpt: Scientists on the hunt for typical genes — that is, the ones that encode proteins — have traditionally set their sights on so-called open reading frames, which are long stretches of 300 or more nucleotides, or “letters” of DNA, bookended by genetic start and stop signals.,,,, The researchers considered genes to be valid if and only if similar sequences could be found in other mammals – namely, mouse and dog. Applying this technique to nearly 22,000 genes in the Ensembl gene catalog, the analysis revealed 1,177 “orphan” DNA sequences. These orphans looked like proteins because of their open reading frames, but were not found in either the mouse or dog genomes. Although this was strong evidence that the sequences were not true protein-coding genes, it was not quite convincing enough to justify their removal from the human gene catalogs. Two other scenarios could, in fact, explain their absence from other mammalian genomes. For instance, the genes could be unique among primates, new inventions that appeared after the divergence of mouse and dog ancestors from primate ancestors. Alternatively, the genes could have been more ancient creations — present in a common mammalian ancestor — that were lost in mouse and dog lineages yet retained in humans. If either of these possibilities were true, then the orphan genes should appear in other primate genomes, in addition to our own. To explore this, the researchers compared the orphan sequences to the DNA of two primate cousins, chimpanzees and macaques. After careful genomic comparisons, the orphan genes were found to be true to their name — they were absent from both primate genomes.
    http://www.sciencedaily.com/re.....161406.htm

    The sheer, and blatant, shoddiness of the science of the preceding study should give everyone who reads it severe pause whenever, in the future, someone tells them that genetic studies have proven evolution to be true.

    The following site has a brief discussion of the biased methodology of the preceding study:
    http://www.uncommondescent.com.....ent-358505

    If the authors of the preceding study had actually tried to see whether the over 1,000 unique human ORFan genes encode proteins, instead of just writing them off because they were not found in other supposedly related species, they would have found ample reason to believe that they may very well encode biologically important proteins:

    A survey of orphan enzyme activities
    Abstract: We demonstrate that for ~80% of sampled orphans, the absence of sequence data is bona fide. Our analyses further substantiate the notion that many of these (orfan) enzyme activities play biologically important roles.
    http://www.biomedcentral.com/1471-2105/8/244

    Dr. Howard Ochman – Dept. of Biochemistry at the University of Arizona
    Excerpt of Proposal: Although it has been hypothesized that ORFans might represent non-coding regions rather than actual genes, we have recently established that the vast majority of ORFans present in the E. coli genome are under selective constraints and encode functional proteins.
    http://www.uncommondescent.com.....ent-358868

    In fact, it turns out that the authors of the ‘kick the ORFans out into the street’ paper actually did know that there was unbiased evidence strongly indicating that the ORFan genes encode proteins, but chose to ignore it in favor of their preconceived evolutionary bias:
    http://www.uncommondescent.com.....ent-358547

    I would like to reiterate that evolutionists cannot even account for the origination of just one unique gene or protein, much less over one thousand completely unique ORFan genes:

    Could Chance Arrange the Code for (Just) One Gene?
    “our minds cannot grasp such an extremely small probability as that involved in the accidental arranging of even one gene (10^-236).”
    http://www.creationsafaris.com/epoi_c10.htm

    “Estimating the Prevalence of Protein Sequences Adopting Functional Enzyme Folds” – Doug Axe, 2004
    Excerpt: “,,,this implies the overall prevalence of sequences performing a specific function by any domain-sized fold may be as low as 1 in 10^77, adding to the body of evidence that functional folds require highly extraordinary sequences.”
    http://www.mendeley.com/resear.....yme-folds/

  15. Ah, another quick reply. Good show.

    I would like to comment that your first two links were discussions of only one chromosome, which one link said was 1% of the genome. Is that significant, or are those relatively minor mutations? Is that predicted by having a designer, or is it simply one chromosome that has undergone quite a few mutations?

    Does ID predict a certain percentage of DNA that we would expect to find different between the human and chimp genomes? 20%? 50%? 90%? What would the dot plot look like? How would it compare to human computer code?

    And as for your other articles, I am not surprised to hear that identifying genes is a tricky business. From the textbook (a quick glance, though), genes do not have a defined starting/ending sequence. They can overlap. They can be erroneously expressed or not expressed. It’s a very complicated issue, especially since bioinformatics is a relatively new field. I believe it is a topic that I will cover in class, so I will have to come back another time and discuss it more fully then.

    But returning to the original post… On the comparison that niwrad made, do you agree that pair-wise comparisons of DNA may not be “cutting-edge”? Would you join me in suggesting a research program of testing ID predictions against SynMap dot plots to whoever wants to pursue this study?

  16. FYI:

    “Now researchers have learned that only about two percent of human and chimp DNA encodes genetic blueprints for proteins. They also know that most of the rest — once referred to dismissively as “junk DNA” — contains sequences that affect whether, where and when proteins are made – and in what combinations, a key factor in development.

    Pollard raises a question that scientists have been debating for decades: “Do you make a human by making different proteins or do you make one by taking the same building blocks and putting them together in a different way?”

    She says most scientists now believe the greatest potential for change arises from rearranging the building blocks. Some of the DNA formerly regarded as junk plays an important role in these rearrangements…

    Pollard and her collaborators are most interested in rapidly evolving bits of DNA that may play a role in determining human attributes such as language, the complexity of the brain’s cerebral cortex, hairless skin, fine motor coordination of the thumb and fingers, and the ability to easily digest certain foods we commonly eat.

    The top-ranking piece of human DNA to emerge from Pollard’s first comprehensive round of number-crunching differed from chimp DNA in 18 of 118 base pairs. In contrast, between chimp and chicken (a vertebrate that has evolved on a separate path from our evolutionary ancestors for about 300 million years) there were only two differences along the same DNA stretch. Pollard and colleagues named the DNA segment HAR1, for “human accelerated region.” The name refers to this DNA’s relatively fast evolution in our human ancestors.

    Pollard’s colleagues subsequently showed that HAR1 encodes RNA. But it’s not like the biology-textbook messenger RNA that is translated into protein. Instead the HAR1-encoded RNA has a more direct influence. There is more to learn about HAR1 RNA, but already a Belgian colleague of Pollard’s has shown that it is made in specific nerve cells within the brain’s developing cerebral cortex.

    The second highest-ranking DNA in Pollard’s screen, dubbed HAR2, is a switch regulating the activation of specific genes. Scientists have discovered that it plays a role in limb development. Differences between human and chimp may help explain why humans can more precisely control finger and thumb movements.”

    http://www.physorg.com/news196962452.html

    DCX, I think a major prediction of ID is that the majority of Junk DNA will be found to have function, and indeed ENCODE, as well as subsequent studies on ‘non-coding’ regions, has borne this out. Thus, as I stated earlier, this will only widen the already insurmountable gap that Darwinism has yet to honestly address. The problem for evolution is a lot worse than you seem to realize. Another problem for you is that you must assume a substantial portion of beneficial mutations to account for the ‘dramatic’ changes in genes, as previously noted in chromosome 22 and the Y chromosome, and you have slim to no ‘beneficial mutations’ to the human genome to point to as evidence for Darwinism (nor do you have any anywhere else to point to).

    The evidence for the detrimental nature of mutations in humans is overwhelming, for scientists have already cited over 100,000 mutational disorders.

    Inside the Human Genome: A Case for Non-Intelligent Design – Pg. 57 By John C. Avise
    Excerpt: “Another compilation of gene lesions responsible for inherited diseases is the web-based Human Gene Mutation Database (HGMD). Recent versions of HGMD describe more than 75,000 different disease causing mutations identified to date in Homo-sapiens.”

    I went to the mutation database website cited by John Avise and found:

    HGMD®: Now celebrating our 100,000 mutation milestone!
    http://www.biobase-internation.....mddatabase

    The following study confirmed the detrimental mutation rate for humans, of 100 to 300 per generation, estimated by John Sanford in his book ‘Genetic Entropy’ in 2005:

    Human mutation rate revealed: August 2009
    Every time human DNA is passed from one generation to the next it accumulates 100–200 new mutations, according to a DNA-sequencing analysis of the Y chromosome. (Of note: this number is derived after “compensatory mutations”)
    http://www.nature.com/news/200.....9.864.html

    This ‘slightly detrimental’ mutation rate of 100 to 200 per generation is far greater than even what evolutionists agree is an acceptable mutation rate for an organism:

    Beyond A ‘Speed Limit’ On Mutations, Species Risk Extinction
    Excerpt: Shakhnovich’s group found that for most organisms, including viruses and bacteria, an organism’s rate of genome mutation must stay below 6 mutations per genome per generation to prevent the accumulation of too many potentially lethal changes in genetic material.
    http://www.sciencedaily.com/re.....172753.htm

    Contamination of the genome by very slightly deleterious mutations:
    why have we not died 100 times over? Kondrashov A.S.
    http://www.ingentaconnect.com/.....4/art00167

    Another huge problem that you don’t seem to be aware of is the fact that genomes are severely poly-constrained with respect to mutations, because they are now shown to be poly-functional:

    Scientists Map All Mammalian Gene Interactions – August 2010
    Excerpt: Mammals, including humans, have roughly 20,000 different genes.,,, They found a network of more than 7 million interactions encompassing essentially every one of the genes in the mammalian genome.
    http://www.sciencedaily.com/re.....142044.htm

    Poly-Functional Complexity equals Poly-Constrained Complexity
    http://docs.google.com/Doc?doc.....Zmd2emZncQ

    DNA – Evolution Vs. Polyfuctionality – video
    http://www.metacafe.com/watch/4614519

    I don’t know, DCX; you seem pretty certain humans evolved from apes, but I can find no compelling evidence for your certainty. In fact, I find plenty of evidence that strongly argues against it.

    Oh, another tidbit, DCX: mutations to DNA don’t even control body plan morphogenesis. Thus, whatever the sequence similarity or dissimilarity of the DNA, it doesn’t matter, for the point is moot in the first place:

    Cortical Inheritance: The Crushing Critique Against Genetic Reductionism – Arthur Jones – video
    http://www.metacafe.com/watch/4187488
    entire video:
    http://edinburghcreationgroup.org/fishfossils.xml

    The Origin of Biological Information and the Higher Taxonomic Categories – Stephen Meyer
    Excerpt: “Neo-Darwinism seeks to explain the origin of new information, form, and structure as a result of selection acting on randomly arising variation at a very low level within the biological hierarchy, mainly, within the genetic text. Yet the major morphological innovations depend on a specificity of arrangement at a much higher level of the organizational hierarchy, a level that DNA alone does not determine. Yet if DNA is not wholly responsible for body plan morphogenesis, then DNA sequences can mutate indefinitely, without regard to realistic probabilistic limits, and still not produce a new body plan. Thus, the mechanism of natural selection acting on random mutations in DNA cannot in principle generate novel body plans, including those that first arose in the Cambrian explosion.”
    http://eyedesignbook.com/ch6/eyech6-append-d.html

    Stephen Meyer – Functional Proteins And Information For Body Plans – video
    http://www.metacafe.com/watch/4050681

    ‘Hopeful monsters,’ transposons, and the Metazoan radiation:
    Excerpt: Viable mutations with major morphological or physiological effects are exceedingly rare and usually infertile; the chance of two identical rare mutant individuals arising in sufficient propinquity to produce offspring seems too small to consider as a significant evolutionary event. These problems of viable “hopeful monsters” render these explanations untenable.
    Paleobiologists Douglas Erwin and James Valentine

    “Yet by the late 1980s it was becoming obvious to most genetic researchers, including myself, since my own main research interest in the ‘80s and ‘90s was human genetics, that the heroic effort to find the information specifying life’s order in the genes had failed. There was no longer the slightest justification for believing that there exists anything in the genome remotely resembling a program capable of specifying in detail all the complex order of the phenotype (Body Plan).”
    Michael John Denton page 172 of Uncommon Dissent

    etc.. etc.. etc..

  19. I was wondering if you could clarify a point for me. When you are testing a 30-base pattern, do you only accept a pattern with a perfect match on the two chromosomes tested?

    If we say that we have a 1% difference between the human and chimp genomes, isn’t the probability of finding a mismatch in a 30 base sequence roughly 1 in 4?

  20. DCX, you may find this following video a bit more clear for explaining exactly why mutations to the DNA do not control Body Plan morphogenesis, since they are the ‘bottom rung of the ladder’ as far as the ‘layered information’ of the cell is concerned:

    Stephen Meyer on Craig Venter, Complexity Of The Cell & Layered Information
    http://www.metacafe.com/watch/4798685

  21. The discontinuity between humans and all other forms of life is so profound and so obvious that it clearly cannot be explained by differences in DNA. The appearance of humans represents an evolutionary sea change that makes all other evolutionary discontinuities seem trivial in comparison.

    Something very strange and extraordinarily marvelous took place when the first humans appeared on the scene.

  22. DCX, please stop cutting heads off the Hydra. The internet is running out of bandwidth.

  23. I know the author of the thread said it didn’t matter to him how close we are to chimps by DNA, yet it does seem to matter to ID people and evolutionists.
    To this biblical creationist, we are so alike to apes that the small differences in our bodies do not in any way suggest a different origin.
    Therefore, since the Bible says Adam and Eve were not born but instantly created, it can only be that there is a general common blueprint for life, and we simply got given the best bodies one could pick from the existing blueprint.
    Everything has eyes, ears, legs, lungs, etc. Therefore the sameness must come from a simple program in nature.
    Therefore we should have the same DNA if we have the same parts.
    I read recently that bats and whales had the same DNA score for radar. Right on.
    DNA is not a trail of heritage but merely a parts department, and the connections are a part.
    Like form equals like DNA, yet no actual biological relation.
    I see this in marsupials and placentals. They surely are the same creatures, yet they have different DNA. Therefore the marsupial change brought a score change that hides actual biological relationship.
    Creationism should welcome ape likeness as simply showing a creator with a simple program for life.
    99% is fine and makes more sense.

  24. Thanks to all commenters for the objections/suggestions, which will be useful to me if I continue my tests.

    AMW #3

    Let’s say you’ve got a string of 30 T’s on chromosome A, and you’re looking for a matching string on chromosome B. Suppose there is no string of 30 T’s on chromosome B, but there is a string of 29 T’s followed by a G. That is, a single point mutation could account for the difference between strings. Does your metric count that as a match, or as no match?

    No match; my test only accepts perfect pattern matching between two 30-base sequences. If we begin to relax the rules, then we may arrive at the 99% identity that is exactly what is controversial.

    CharlesJ #19

    If we say that we have a 1% difference between the human and chimp genomes, isn’t the probability of finding a mismatch in a 30 base sequence roughly 1 in 4?

    If we have a 1% difference between the genomes, on average we find one mismatch in a 100-base sequence, and therefore 3 mismatches in a series of ten 30-base sequences. Hence the probability of finding a mismatch in a 30-base sequence is 3/10.

  25. Robert Byers #23

    Creationism should welcome ape likeness as simply showing a creator with a simple program for life. 99% is fine and makes more sense.

    The problem is that evolutionism uses the 99% myth to counter creationism and ID. The smaller the DNA differences between humans and chimps, the more believable unguided macroevolution is, evolutionists think.

    My test doesn’t disprove creationism and ID; rather the inverse. Humans and chimps are beings with extremely different potentialities from the very beginning. At all levels, their similarities show that their designs share some common templates, while their differences show they didn’t arise by random evolution from one another.

  26. “If we have a 1% difference between the genomes in average we find a mismatch in a 100 base sequence, then 3 mismatches in a series of 10 30 base sequences. Hence the probability of finding a mismatch in a 30 base sequence is 3/10.”

    With that kind of logic, the probability of finding a mismatch in a 100-base sequence would be 10/10? It’s a bit like saying that since the probability of getting a given number on a die is 1/6, the probability of getting a specific number in six tries is 6/6. A result on a die does not get more likely every time you throw it; it’s always 1/6. So the probability of getting a specific number at least once in six throws is 1 − (5/6)^6, approximately 67%. But I did make a mistake in my previous post; it should have been: the probability of finding “at least” 1 mismatch is approximately 1 in 4.

    If the difference between the human and chimp genomes really were as large as your figures suggest, that would mean on average roughly 1 different base every 3 bases; a 30-base pattern with a perfect match would be very rare. You probably would not need to do such complex calculations to demonstrate that the 99% claim is erroneous either; a simple look at 2 aligned sequences would do the job.

    Another way of explaining what I mean is with an example. Let’s say that during your calculations, you have a 30-base pattern that gives a mismatch (which means that it is not a 100% perfect match). If you took that pattern and ran a BLAST search with it, you would probably find that only one base is a mismatch in the 30-base pattern, yet the whole pattern is counted as a mismatch. For 1 base difference, you consider that all 30 bases are a mismatch, causing an overestimation of the mismatches.

    Like I said, if we consider there is a 1% difference between the 2 genomes, then the probability of finding at least 1 mismatch in a 30-base pattern is roughly 1 in 4 (around 25%). The probability of finding at least 1 mismatch in a 30-base pattern, assuming a 2% difference between the 2 genomes, is around 45%. Your results are pretty much in between (a 66% average match rate, i.e. roughly 34% of windows mismatching), which is consistent with ~98.5% homology between our genome and the chimp genome.
    If you don’t believe me, you should try a simple test. Take a small chromosome (to save calculation time) and do a 40-BPM and a 50-BPM analysis. The probability of finding a single mismatch in a 40-base pattern is bigger than for a 30-base pattern, so you should find a lower level of similarity. If you try it on chromosome 22, my prediction is that the 40-BPM similarity will be around 55% and the 50-BPM similarity around 47%.
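    The window probabilities quoted in this exchange are easy to verify numerically. Here is a minimal Python sketch, assuming (as the comment does) that each base differs independently with a fixed per-base probability; the function name is mine, not from the thread:

```python
# Probability that a window of n bases contains at least one mismatch,
# assuming each base differs independently with probability p.
def p_mismatch(p: float, n: int) -> float:
    return 1.0 - (1.0 - p) ** n

# 1% per-base difference, 30-base window: roughly 1 in 4
print(round(p_mismatch(0.01, 30), 3))  # 0.26
# 2% per-base difference, 30-base window: about 45%
print(round(p_mismatch(0.02, 30), 3))  # 0.455
# Longer windows raise the chance of at least one mismatch
print(round(p_mismatch(0.01, 40), 3))  # 0.331
print(round(p_mismatch(0.01, 50), 3))  # 0.395
```

    These figures agree with the "1 in 4" and "around 45%" numbers above, and with the prediction that 40-base and 50-base patterns should show lower match rates than 30-base patterns.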

  27. Actually, even using your 3/10 probability, we should expect ~70% perfect matches with a 1% difference between the genomes. This is not very far from your results. Thus, based on your study, there is approximately 99% homology between the human and chimp genomes.

  28. Todd Wood has a short response on his blog.

  29. CharlesJ #26,27

    Thank you for your involvement in the probability calculations.

    Let’s look at the problem from another point of view.

    My test gives a near-62% 30BPM similarity. This means that, on average, in 10,000 searched patterns we have 6,200 matches and 3,800 mismatches. The ratio between matches and mismatches is 6200/3800 = 1.63.

    In your hypothesis of two genomes that differ by only 1%, on average there is a mismatch every 100 bases. To simplify the scenario, let’s imagine that these mismatches are uniformly distributed along the coupled genomes A and B, like the tick marks on a ruler. Now let’s consider a random 30-base pattern in A. In every range of 100 bases there are 70 successive starting positions whose windows contain no mismatch, followed by 30 starting positions whose windows contain the mismatch. Now the ratio between matches and mismatches is roughly 70/30 = 2.3.

    Since 1.63 is less than 2.3, I wouldn’t say that the 30BPM test agrees well with a 1% difference, but rather with a larger difference.

  30. #29

    Forgive me for butting in, but why did you not complete this approach for the 2% case? Imagine there was a mismatch every 50 bases. Now for a 30-base sample we get 20 matches, followed by 30 mismatches, giving 20/30 = 0.66, which is less than 1.63. So your figure suggests somewhere between 1% and 2%, which is exactly what the literature and Charles are suggesting.
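    Rather than comparing match/mismatch ratios, one can invert the observed match rate directly. A minimal Python sketch under the same simplifying assumption of independent, uniformly likely per-base differences (the function name is mine):

```python
# Invert an observed perfect-match rate for n-base windows back to the
# implied per-base difference, assuming independent per-base mismatches:
#   match_rate = (1 - p)^n   =>   p = 1 - match_rate^(1/n)
def implied_per_base_diff(match_rate: float, n: int) -> float:
    return 1.0 - match_rate ** (1.0 / n)

# A 62% 30-base perfect-match rate implies about a 1.6% per-base difference
print(round(implied_per_base_diff(0.62, 30) * 100, 2))  # 1.58
```

    On this reading, a ~62% 30BPM match rate sits between the 1% and 2% per-base figures being debated above.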

    I find it easier to understand when I work directly with the % of mismatches. In your case it would be 38% (have you taken into account the relative size of each chromosome compared to the whole genome when doing the average?). For most of those 38% mismatching patterns, the actual number of bases that do not match will be 1 (I can explain why in greater detail if you want). So, by simply saying that the percentage of patterns that do not have a perfect match can be directly translated into the percentage of mismatches between the 2 genomes, you would be overestimating the percentage of mismatches by approximately 30-fold.
    In other words, your calculations do not make the distinction between a complete mismatch (0 bases out of 30; a deletion) and a partial mismatch (1, 2, 3 or more mismatched bases; SNPs). The similarity is about 98.5% counting SNPs (based on the 2005 chimp genome paper), and, like I said previously, most of the mismatching patterns will differ at only 1 base. At some point, this has to be acknowledged in your calculations. The easiest way would be to divide the percentage of patterns that do not score a perfect match by 30, since that would be true in most cases. And that will give you a rough estimation of the percentage of mismatches between the human and chimp genomes (in this case, 38% / 30 = ~1.2%).
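    The divide-by-30 shortcut above can be written out explicitly. A sketch, assuming (as the comment argues) that almost every mismatching window differs at exactly one base, so each single-base difference spoils roughly 30 overlapping windows; the function name is illustrative:

```python
def per_base_diff_estimate(window_mismatch_rate: float, n: int = 30) -> float:
    # Each isolated single-base difference appears in about n overlapping
    # n-base windows, so the window mismatch rate overstates the per-base
    # difference by a factor of roughly n.
    return window_mismatch_rate / n

# 38% of 30-base windows failing to match -> roughly 1.3% per-base difference
print(round(per_base_diff_estimate(0.38) * 100, 1))  # 1.3
```

    This linear approximation slightly overstates the answer compared with the exact geometric relation, but for small per-base differences the two agree closely.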

  32. CharlesJ, I’m in the completely opposite camp, which believes we should throw out all the biased genetic similarity studies that have focused solely on finding similarities in the genomes while throwing out all discrepancies. Primarily I say this because looking solely for similarities presupposes that humans evolved from apes in the first place, and such a study will thus end up proving its own presupposition in its final analysis.

    For example Charles look how much of the genome was ‘thrown out’ here:

    Chimpanzee? – Richard Buggs PhD.
    Excerpt: When we do this alignment, we discover that only 2400 million of the human genome’s 3164.7 million ‘letters’ align with the chimpanzee genome – that is, 76% of the human genome. Some scientists have argued that the 24% of the human genome that does not line up with the chimpanzee genome is useless “junk DNA”.
    http://www.idnet.com.au/files/pdf/Chimpanzee.pdf

    Charles, perhaps you say that they were justified in throwing out 24% of the genome? If that is the case, please let me ask you just how much of the kangaroo genome we should be allowed to throw out to find similarity with humans:

    Kangaroo genes close to humans
    Excerpt: Australia’s kangaroos are genetically similar to humans,,, “There are a few differences, we have a few more of this, a few less of that, but they are the same genes and a lot of them are in the same order,” ,,,”We thought they’d be completely scrambled, but they’re not. There is great chunks of the human genome which is sitting right there in the kangaroo genome,”
    http://www.reuters.com/article.....P020081118

    If you say we shouldn’t be allowed to, why not? You see, Charles, you can’t build your presupposed conclusion into the way in which you gather evidence, or it will give you a false positive in your final analysis!

    But perhaps you say that the 24% is junk and ‘deserved’ to be thrown out? If you think so, that is another false assumption, for regulatory codes have now been found in the “junk DNA” that carry a higher level of information than the genetic code itself:

    The following study discovered a ‘second regulatory code’ on top of the protein-coding DNA code:

    Nature Reports Discovery of “Second Genetic Code” But Misses Intelligent Design Implications – May 2010
    Excerpt: Rebutting those who claim that much of our genome is useless, the article reports that “95% of the human genome is alternatively spliced, and that changes in this process accompany many diseases,” and that the complexity of this “splicing code” is mind-boggling. A summary of this article, also titled “Breaking the Second Genetic Code,” in the print edition of Nature summarized this research thusly: “At face value, it all sounds simple: DNA makes RNA, which then makes protein. But the reality is much more complex.” So what we’re finding in biology are:

    # “beautiful” genetic codes that use a biochemical language;
    # Deeper layers of codes within codes showing an “expanding realm of complexity”;
    # Information processing systems that are far more complex than previously thought (and we already knew they were complex), including “the appearance of features deeper into introns than previously appreciated”
    http://www.evolutionnews.org/2.....of_se.html

    This following paper highlights the regulatory role that the ‘second code’ has over the primary protein coding DNA code:

    Researchers Crack ‘Splicing Code,’ Solve a Mystery Underlying Biological Complexity
    Excerpt: “For example, three neurexin genes can generate over 3,000 genetic messages that help control the wiring of the brain,” says Frey. “Previously, researchers couldn’t predict how the genetic messages would be rearranged, or spliced, within a living cell,” Frey said. “The splicing code that we discovered has been successfully used to predict how thousands of genetic messages are rearranged differently in many different tissues.
    http://www.sciencedaily.com/re.....133252.htm

    Thus, Charles, we have high-level function arising from ‘junk DNA’ regions that have in all probability not been stringently accounted for in previous ‘similarity’ studies by evolutionists, simply because they did not match. Much like how you are trying to arrive at an artificially high percentage of similarity at the present time!

    As well, Charles, even if we assume that the rate of mutations to DNA was not overwhelmingly detrimental, the time it would take to fix a ‘coordinated’ beneficial mutation in the human lineage is 216 million years, and this number (216 m.y.) is taken directly from a paper written by a Darwinist using the equations of the ‘modern synthesis’!

    Waiting Longer for Two Mutations – Michael J. Behe
    Excerpt: Citing malaria literature sources (White 2004) I had noted that the de novo appearance of chloroquine resistance in Plasmodium falciparum was an event of probability of 1 in 10^20. I then wrote that ‘‘for humans to achieve a mutation like this by chance, we would have to wait 100 million times 10 million years’’ (Behe 2007) (because that is the extrapolated time that it would take to produce 10^20 humans). Durrett and Schmidt (2008, p. 1507) retort that my number ‘‘is 5 million times larger than the calculation we have just given’’ using their model (which nonetheless “using their model” gives a prohibitively long waiting time of 216 million years). Their criticism compares apples to oranges. My figure of 10^20 is an empirical statistic from the literature; it is not, as their calculation is, a theoretical estimate from a population genetics model.
    http://www.discovery.org/a/9461

    Thus, Charles, can you see the problem? Even if we presupposed evolution to be true, lined up the genomes as best we could, threw out all the mismatches, and arrived at the 99% number that evolutionists so desperately want us to arrive at, the fact is that 1% of roughly 3 billion is a 3 million DNA base pair difference, and yet you can’t even account for the fixation of a single coordinated mutation within the human lineage!! This is more than a slight problem for evolutionists.

    As well, Charles, DNA does not even encode body plan morphogenesis in the first place, so the point about DNA similarity or dissimilarity is, from a strict scientific perspective, moot, since mutation to DNA is not even the right tool for the job of constructing a new animal. In fact, mutations to the DNA can in all honesty be considered the bottom rung of the ladder as far as the information hierarchy of the cell is concerned:

    Stephen Meyer on Craig Venter, Complexity Of The Cell & Layered Information
    http://www.metacafe.com/watch/4798685

    Splicing Together the Case for Design, Part 2 (of 2) – Fazale Rana – June 2010
    Excerpt: Remarkably, the genetic code appears to be highly optimized, further indicating design. Equally astounding is the fact that other codes, such as the histone binding code, transcription factor binding code, the splicing code, and the RNA secondary structure code, overlap the genetic code. Each of these codes plays a special role in gene expression, but they also must work together in a coherent integrated fashion. The existence of multiple overlapping codes also implies the work of a Creator. It would take superior reasoning power to structure the system in such a way that it can simultaneously harbor codes working in conjunction instead of interfering with each other. As I have written elsewhere, the genetic code is in fact optimized to harbor overlapping codes, further evincing the work of a Mind.
    http://www.reasons.org/splicin.....n-part-2-2

    As well, Charles, I don’t know what propaganda you have been fed on the fossil record of humans and apes, but the fossil record is certainly not the neat little progression of apes evolving into man that is popularly depicted in those cartoons:

    Evolution of the Genus Homo – Annual Review of Earth and Planetary Sciences – Tattersall, Schwartz, May 2009
    Excerpt: “Definition of the genus Homo is almost as fraught as the definition of Homo sapiens. We look at the evidence for “early Homo,” finding little morphological basis for extending our genus to any of the 2.5–1.6-myr-old fossil forms assigned to “early Homo” or Homo habilis/rudolfensis.”
    http://www.annualreviews.org/d.....208.100202

    This might interest you, Charles:

    Shoddy Engineering or Intelligent Design? Case of the Mouse’s Eye – April 2009
    Excerpt: — The (entire) nuclear genome is thus transformed into an optical device that is designed to assist in the capturing of photons. This chromatin-based convex (focusing) lens is so well constructed that it still works when lattices of rod cells are made to be disordered. Normal cell nuclei actually scatter light. — So the next time someone tells you that it “strains credulity” to think that more than a few pieces of “junk DNA” could be functional in the cell – remind them of the rod cell nuclei of the humble mouse.
    http://www.evolutionnews.org/2.....20011.html

  33. CharlesJ #31 – markf #30

    I agree with your remarks and evaluations. However, as I said previously in my article, the 30BPM test, which I declared to be a similarity test, counts patterns as matching independently of their positions in the target genome. Under these conditions, whatever the quantitative values obtained in the tests, to speak of genomic “identity” is improper, since identity would imply that the matching patterns also have the same positions in the source and target genomes. In general this is not the case in human and chimp genomic comparisons. As a consequence, I think that my criticism of the “99% identity” as publicized continues to be valid. I have noted that CharlesJ aptly uses the term “homology” to describe the situation, and I agree with him about this terminology. Homology and similarity are far more appropriate terms than identity in genomics. Not only numbers but also words matter.

  34. correction:

    the fact is that 1% of roughly 3 billion is a 30 million DNA base pair difference and yet you can’t even account for the fixation of a single coordinated mutation within the human lineage!! ,,,

  35. niwrad,

    ToE says chimps are probably very close cousins to humans. If that’s true it means that at one time we had the same genome. (More correctly, our genomes were in the same pool, but thinking about a single genome is simpler.) So the ToE model says there was this basal genome that some ancient species of ape had. The breeding population with that genome split in (at least) two. The genome of one population acquired one set of mutations, and eventually became modern chimps. The genome of the other acquired an independent set of mutations, and eventually became modern humans. (Obviously, I’m leaving out a lot of subsequent splits along the way.) Since the split didn’t happen that long ago (in geologic time) the genomes of chimps and humans should be *very* similar, because they haven’t had that much time to acquire independent mutations.

    You purport to show that the human and chimp genomes aren’t all that similar, so ToE is wrong (or has a big hole in it). CharlesJ, markf and I have argued that your algorithm is actually biased toward showing low levels of similarity, even between genomes that are very similar.

    So why don’t you do the following? Create a genome; just a long series of A’s, T’s, G’s, and C’s. It doesn’t have to be meaningful, just a string that’s more or less as long as a human or chimp genome. Next, make an exact copy of that genome, so you’ve got copy A and copy B. Now, put copy A through a series of “mutations.” Randomly change some of the letters, insert new ones, delete others, take chunks of letters from one place in the string and put them somewhere else, reverse some of their ordering, etc. All the mutations should be of the type we find in nature, and in their observed proportions. And there should be about as many mutations as ToE suggests there would have been between the time the human/chimp common ancestor split and now. Call this mutated version of A copy A’. Next, do the same process on copy B, but make sure you’re mutating it in an independent process. Call the resulting copy B’. Finally, use your algorithm described above to compare A’ and B’ for similarity. Since you’re doing a Monte Carlo, you probably want to repeat the process of independent mutation and comparison 1,000+ times. Then you can report back on the results.

    Here’s the rub. If your comparisons come back that A’ and B’ are, on average, 95% or so similar, you’ve got some evidence that ToE is wrong, because you’ve done a simulation of speciation and the genomes are more similar than we find in the real world. But if they come back such that A’ and B’ are, say, ~65% similar, that’s evidence in support of ToE, because your simulation of speciation produces similarities that are comparable to those found in the real world.
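    The proposed experiment can be sketched in a few lines of code. What follows is only an illustrative toy (short genome, made-up mutation rates, a sampled rather than exhaustive pattern scan), not the realistic parameters described above:

```python
import random

BASES = "ATGC"

def random_genome(n, rng):
    """A meaningless string of bases, standing in for the basal genome."""
    return "".join(rng.choice(BASES) for _ in range(n))

def mutate(genome, rng, sub_rate=0.005, indel_rate=0.0005):
    """Apply random substitutions, deletions and insertions, base by base.
    The rates here are placeholders, not empirically calibrated values."""
    out = []
    for base in genome:
        r = rng.random()
        if r < sub_rate:                          # substitution
            out.append(rng.choice(BASES.replace(base, "")))
        elif r < sub_rate + indel_rate:           # deletion
            continue
        elif r < sub_rate + 2 * indel_rate:       # insertion after this base
            out.append(base)
            out.append(rng.choice(BASES))
        else:                                     # unchanged
            out.append(base)
    return "".join(out)

def bpm_similarity(source, target, k=30, samples=2000, rng=None):
    """Fraction of random k-base patterns from source found anywhere in
    target (position-independent, like the 30BPM test)."""
    rng = rng or random.Random(0)
    hits = sum(source[i:i + k] in target
               for i in (rng.randrange(len(source) - k) for _ in range(samples)))
    return hits / samples

rng = random.Random(42)
ancestor = random_genome(50_000, rng)             # copies A and B start identical
a_prime = mutate(ancestor, rng)                   # two independent
b_prime = mutate(ancestor, rng)                   # mutation histories
print(round(bpm_similarity(a_prime, b_prime, rng=rng), 2))
```

    Scaling the genome length, the mutation spectrum and the mutation counts to realistic values is exactly the hard part, but the shape of the experiment is no more than this.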

    In short, I like that you’re getting your hands dirty with the data. I just want a more rigorous treatment of it before I accept your conclusions.

  36. AMW, and exactly why should any similarity evidence be considered more trustworthy than the more foundational evidence I presented here?

    http://www.uncommondescent.com.....ent-364779

    Does only evidence for evolution count in your book? ,,,Even though the evidence against evolution is of a more solid basis scientifically? Does it not bother you to be so biased in your weighing of the evidence?

  37. bornagain77,

    I don’t understand how the test I proposed is biased. If A’ and B’ are more similar than the human and chimp genomes (using niwrad’s algorithm), then that counts as evidence against evolution. If they are about as similar as the human and chimp genomes, then that counts as evidence in its favor. Not proof, certainly, but definitely evidence. I’m offering a refinement to niwrad’s current research agenda. That’s not uncommon in scholarly disciplines.

    As for your foundational evidence, I quit reading your links when it became clear that you’ll only respond to criticisms of your links with yet more links, ad nauseam.

    Niwrad clearly has some intent to engage in reasoned discourse, so I’m more than happy to respond to him. Give me something roughly equivalent to the comments he has*, and I’ll be more likely to respond to you as well.

    *To wit: cordial, on point, cohesive and non-redundant.

  38. Well, that’ll teach me to try my hand at html tags!

  39. AMW, excuse me for not engaging more directly in the debate over how to get a more, or less, biased genetic similarity reading for evolution. I thought I made my case clear by showing that evolutionists had completely biased the previous test in the first place by throwing out dissimilar “Junk DNA” sequences. Those dissimilar Junk DNA sequences are now known not to be “junk” at all, but in fact to contain high-level regulatory information. That information is at a deeper, more crucial level than the genes themselves, since it regulates the genes,,,

    I’m sorry that you don’t find it interesting that evolution is shown to be impossible, even as proposed by the very foundational mechanisms and equations proposed and used by leading evolutionists themselves, as I clearly illustrated in the following post you refused to read,,,,

    http://www.uncommondescent.com.....ent-364779

    ,,,then the point of genetic similarity is moot! Moot, absurd, and completely irrelevant, since it has been shown that the evolution of humans from some hypothetical ape-like ancestor is completely impossible, even if the most generous assumptions are granted to the Darwinian methods for ascertaining genetic similarity. Though you may not find that fact interesting, I find it very interesting, since it goes to the very heart of the matter being discussed!

    ,,, That you would refuse even to acknowledge the crushing foundational evidence brought to bear against your position, by the work of evolutionists themselves, reveals that you don’t really care to be objective in this matter, or to weigh the evidence carefully so as to find the truth of whether you evolved from some ape-like creature or not. I would think such an important matter would make you a little more careful about how you look at the evidence.,,,

    Further notes:

    4 Nails in The Coffin of Darwin

    Population Genetics Vs. Whale Evolution – Part 1 – Richard Sternberg PhD in Evolutionary Biology
    http://www.metacafe.com/watch/5263733

    Neo-Darwinism Vs. Whale Evolution – Part 2 – Richard Sternberg PhD in Evolutionary Biology
    http://www.metacafe.com/watch/5263746

  40. I’m just not seeing how using niwrad’s algorithm on two mock genomes is a biased test. I’m also not sure why you keep insisting that I am biased in my approach to the data.

    Do you think I’m unwilling or incapable of changing my mind on the subject in response to evidence?

  41. #35

    Good idea. Or (if the data is available) it might be easier to do what I suggest in #2 and run the test against two unrelated human genomes, or indeed any two individuals of the same species. This would give a benchmark.

  42. I appreciate the collaborative spirit shown in this thread by all.

    AMW #35

    Your idea of simulating evolution is a good one. Unfortunately it is not at all easy to implement, because it implies knowledge of evolutionary theory and genomics far beyond mine (and maybe involves issues that are today controversial among evolutionary biologists themselves).

    markf #41

    Your idea of testing against two unrelated human genomes is also interesting, and perhaps easier to implement than AMW’s, provided we can find such genomes somewhere.

    I promise nothing, but if I do further studies along the lines of your suggestions, I will post the results here if they are noteworthy.

    The good news: it seems that statistical methods of comparison, such as the 30BPM test, can have a place beside the functional approaches and give comparable results. The former are simple and automatic; the latter are complex and interactive (and require deep knowledge of the field).

    What’s sure is that there is really a lot of work to do in bio-informatics.

  43. test

  44. check

  45. one more test

  46. gpuccio:

    How did anyone achieve this result? Through a simple mutation?

  47. I’m still not sure we understand each other very well, and rereading my older posts I think it is at least in part because my comments were not clear enough. I will try one last time with an example.
    If you were to run the same analysis, using the same dataset (human and chimp genomes) and only changing the size of the pattern analyzed (e.g. 40-BPM and 50-BPM), you would get very different results. For a 40-BPM analysis it would be around 53%, and for a 50-BPM analysis it would be around 45% (I mixed those up in one of my last posts; those are the right values).
    Yet it is still the same type of analysis on the same dataset. Why would we get such different results? Should we not expect that (30-BPM similarity ≈ 40-BPM similarity ≈ 50-BPM similarity)?
    Actually, all those values are indeed equivalent if you take into account the length of the pattern (and the average number of mismatches expected in the patterns rejected in your analysis). I can give you the calculation if you want, but for the sake of simplicity I’ll only tell you that you have to divide the % of similarity by 24 in a 30-BPM analysis, by 30 in a 40-BPM analysis and by 35 in a 50-BPM analysis.
    This will give you:
    ((30-BPM = 1.58) ≈ (40-BPM = 1.58) ≈ (50-BPM = 1.58))
    1.58 is a constant in every analysis, and I’ll let you guess what it is.
    The conclusion is: in order to compare your results with results obtained in the literature, and also to compare results within your own algorithm using different sizes of pattern, you can’t directly use the BPM % of similarity.
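    A quick numerical check of this consistency claim, assuming a constant, independent per-base mismatch rate p (the value 0.0158 is the one derived later in this exchange):

```python
p = 0.0158  # assumed per-base mismatch rate
for k in (30, 40, 50):
    # expected fraction of k-base patterns with no mismatch at all
    print(k, round((1 - p) ** k, 2))
# 30 → 0.62, 40 → 0.53, 50 → 0.45, matching the figures quoted above
```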

  48. Some characters were lost in my last post: where I compare the value of 30-BPM with 40-BPM and 50-BPM, there should be an equivalence symbol between 30-BPM/40-BPM and between 40-BPM/50-BPM.

  49. If you’re looking to repeat the test on different human genomes, look here: http://www.ncbi.nlm.nih.gov/si....._uids=9558

  50. Does anyone know why all the comments after #37 are showing up in italics?

  51. AMW@50

    Because someone opened an italics tag and never closed it. I’ll close it for you.

    Should be normal now.

  52. Looks like it did not work. Hope this works.

  53. The paper from Nature that you mention uses a per-base comparison. This means, roughly, that if you compare one base from each genome, about 98% of the comparisons will match. Now, if you compare a sequence of two bases, about 96% (98% × 98%) will match. In your test, you compare sequences of 30. So if there is a 2% difference between the genomes, you would expect your test to return 55% [(98%)^30] (your average is higher). Your simple statistical test actually shows that 98% similarity is too low an estimate. I don’t think you should cite papers without addressing what they actually say; it’s misleading.
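    The arithmetic in this comment is easy to verify; a minimal sketch:

```python
per_base = 0.98                    # per-base identity, as in the Nature comparison
print(round(per_base ** 2, 2))     # two-base sequences: 0.96
print(round(per_base ** 30, 2))    # thirty-base sequences: 0.55
```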

  54. AMW@50

    Okay. Things are showing up as italics because you opened an italics tag at “ad nauseum” @37 and then tried to close it. But instead of doing </i> you did <i />. Somehow this got through WordPress’s validation and left an open tag. Unfortunately I can’t close it because WordPress filters the unmatched close tag in my submission.

    A note to the site admin / moderator. The template the page uses includes the Google ad script in each post. So it can be included hundreds of times. Surely that is not right. Have a look at the page source to see what I mean. I think long pages would load a lot faster if it was fixed.

  55. Very odd. I’ve never heard of html tags in one comment flowing through to another.

  56. CJL2718 #51

    Thank you for the reference. It is likely I will use such data in the future. In the meantime I do two things: (1) collect the useful suggestions and ideas from the commenters in this forum; (2) look around to improve the hardware for increasing the processing power, which in this job is important.

    CharlesJ #47.

    Forgive me if I don’t understand what you mean in detail. Nevertheless I agree with you that the results of the 30BPM test are not directly comparable to those in the genomics literature. The 62% 30BPM similarity is not directly comparable with the 99% identity. We need a corrective coefficient. I agree with you also that such a corrective coefficient differs depending on whether we do a 30BPM, 40BPM or 50BPM test …

    To understand this I argue according to what I did in #29. Given two supposed genomes that match 99% a 30BPM test gives 70% matches. Since the real test gave 62% my first idea to obtain a 30BPM value comparable to 99% is to apply the simple formula: 99×62/70 = 87.7%. In other words the multiplier coefficient that we must apply to the 62% is 99/70 = 1.41.

    Of course, in a 40BPM test the coefficient is different, because in genomes that match 99% a 40BPM test gives 60% matches. In this case the coefficient is 99/60 = 1.65. In a 50BPM test the coefficient would be 2, and so on. This seems reasonable because the longer the patterns searched for, the fewer the matches. As a consequence, the coefficient values increase with the length of the patterns.

    These multipliers provide us with a way to normalize, so to speak, the xxBPM values and make them comparable to the values obtained with other methods of comparison.

    The problem that remains is that, looking at the numbers, my normalization seems to differ from yours. It would be good if we could arrive at a shared, convincing normalization.

  57. AMW@55

    The browser just sees one block of HTML, which it parses. When you submit your comment it gets checked by WordPress, and all disallowed tags are removed so you can’t do malicious things like embed script tags or load remote content via an iframe, etc. WordPress will normally discard unmatched but allowed tags (e.g. italics), which is why I can’t close the tag now; but in this case it seems you attempted to close the tag with a trailing slash instead of a leading slash, and therefore didn’t really close it, yet managed to get it past WordPress.

    I may submit the bug you discovered to them so they can patch it.

  58. A 98-percent DNA similarity between man and chimp was cited on today’s Rush Limbaugh show.

  59. The reason we have different coefficients is probably that we are using different assumptions in our calculations. I’ll start by explaining the principle of my calculation of the correction coefficient; then I’ll give a detailed example and try to point out where it differs from yours.
    As I said before, your algorithm introduces an extra variable that can affect the results of the analysis: the length of the pattern. Since the results in the literature do not depend on this variable, it is natural that you have to make an extra calculation to remove its effect before comparing your results with the literature. I know I’m repeating myself here, but it is very important to keep that fact in mind to understand the normalization process.
    The goal of my normalization is to estimate the average number of mismatches we should expect in the patterns that are not perfect matches (38% of the patterns in your study using 30-BPM). If we divide the number of bases in the pattern by the average number of base mismatches expected (inside a pattern that is not a perfect match), we get the correction coefficient.
    In other words, while your algorithm gives us the expected number of x-base patterns that would not give a perfect match in the genome, my correction gives us the expected number of bases that are mismatches inside the patterns that are not perfect matches.
    That is the principle of my correction; I’ll give the details of the calculations and explain why it differs from yours in my next post.

  60. Since the goal of your study is to make an unbiased estimate of the similarity between the human and chimp genomes, we cannot use numbers that are in the literature (at least not directly). And this is where our corrections differ. Your correction takes for granted that the number you have to use for the correction is 99%, based on the literature I guess.
    My correction only uses numbers available in your analysis. So let’s start with 38%. This is the percentage of patterns that contain at least one mismatch. From this number, we can estimate the expected number of mismatches in the genome by asking: if I know that on average 38% of my 30-base patterns contain at least one mismatch, what is the probability of a mismatch occurring at any given base in the genome?
    I did this using a binomial probability calculator with k = 1 (meaning at least 1 mismatch) and N = 30 (the size of the pattern). To save some time I started with p = 0.01 (from the literature) and checked the probability of getting at least one event. The calculator gave me many results, and one of them was the probability of having at least 1 mismatch: 0.26 (lower than your results). I changed the value of p until the probability of having at least 1 mismatch was as close to 0.38 as possible: 0.0158 seemed good enough. I know this is not very elegant, but I felt a bit lazy and was able to save some time with that strategy. (Note that while I used the number in the literature to help me guess the value, the result is independent of that value and is based only on your results.)
    At this point we do not really need the correction coefficient anymore, since we have the result we were looking for: 0.0158. Now that we have that value (p), we can predict how your algorithm would behave with different-sized patterns simply by changing the value of N.
    I calculated the coefficient for my post #47 to skip the details of the probability calculation, hoping to make that post easier to understand. The numbers I gave there are based on pretty much the same logic, and I can give you the details if you want.
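    For what it’s worth, the trial-and-error search can be avoided: since P(at least one mismatch) = 1 − (1 − p)^N, the equation inverts in closed form. A sketch with the same numbers as above:

```python
def p_at_least_one(p, n=30):
    """Binomial P(X >= 1): one minus the probability of zero mismatches."""
    return 1 - (1 - p) ** n

def invert(frac_imperfect, n=30):
    """Per-base mismatch rate p that yields the observed fraction of
    n-base patterns containing at least one mismatch."""
    return 1 - (1 - frac_imperfect) ** (1 / n)

print(round(p_at_least_one(0.01), 2))   # literature 99% identity: 0.26
print(round(invert(0.38), 4))           # observed 38% imperfect patterns: 0.0158
```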

  61. I think I should mention that while I’m being critical of your analysis, I do think that your results are valid (it’s the conclusion I disagree with). And I do think your algorithm could be useful for estimating the similarity between any two genomes whose exact similarity we do not know, using minimal computational resources. We would need to apply the correction I mentioned to get a value that could be compared with other results in the literature, though.

    The principal weakness of your algorithm is that it can only compare closely related genomes, as the number of patterns with a perfect match will drop to 0 quite fast as we analyze more distant genomes. This could in part be corrected by reducing the size of the pattern, but that could increase the number of false positives (e.g. a 2-base pattern would give many positive results, almost 100% of them useless in this kind of analysis).

    Also, it would be interesting to have more information on the patterns that are considered mismatches, especially regarding deletion/insertion events. One simple way to do this would be to remove a base at one extremity of a pattern that did not score a perfect match and rerun that pattern through the program to see if you now get a perfect match. If the answer is no, you remove another base and rerun it, until you have either removed 15 bases or found a perfect match. If you don’t have a perfect match after removing 15 bases, you take the full pattern again and start removing bases from the other extremity, running those new patterns through the program until you get a perfect match or have removed 15 bases. If you still don’t have a perfect match after these two passes, the most likely explanation is that the pattern spans a deletion/insertion.
    If you do this on every pattern that did not score a perfect match, you’ll get a pretty good estimate of the number of deletion/insertion events. The number of patterns having been reduced in the first round, it should also help reduce the calculation time. You can even make an extra correction taking into account the fact that patterns with 2 or more mismatches will score as deletions under this new algorithm. Since the probability of having 2 or more mismatches in a pattern is ~8%, you could simply multiply the % of deletion/insertion events by 0.92 (1 – 0.08) to get a better estimate.
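    The end-trimming probe described above might be sketched like this (an assumption-laden toy: `pattern` and `target` are plain base strings, and a substring search stands in for rerunning the matching program):

```python
def probe(pattern, target, max_trim=15):
    """Classify a pattern that failed to match exactly.
    First shorten it from one extremity, one base at a time; if no
    shortened version matches, start again from the full pattern and
    shorten from the other extremity."""
    for trim in range(1, max_trim + 1):
        if pattern[trim:] in target:      # mismatch sat near the left end
            return ("left", trim)
    for trim in range(1, max_trim + 1):
        if pattern[:-trim] in target:     # mismatch sat near the right end
            return ("right", trim)
    return ("likely indel", None)         # likely spans an insertion/deletion
```

    For example, a 30-base pattern whose only mismatch is its first base comes back as ("left", 1), while a pattern that matches nowhere even after trimming both ends is flagged as a likely insertion/deletion.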

  62. After reading my last post again, I realized that the correction to the % of insertions/deletions using the probability of having 2 or more mismatches in a pattern is incorrect. It is possible for the 2 mismatches to be on the same side of the pattern, in which case the pattern would not score as an insertion/deletion. That would be the case ~25% of the time.

    You should instead multiply the % of insertions/deletions by 0.94 in order to make a better correction.

  63. niwrad #56
    “look around to improve the hardware for increasing the processing power, which in this job is important.”

    Don’t waste your time on hardware improvements, because there is no information inside DNA. It seems so to your eyes, but I assure you that you are searching on the wrong side. Better to seek God directly. There, I assure you, is the correct answer, and it is so evident that nobody has been able to see it.

    God is with you,

    Obriton (Silav) CL&J A.

  64. CharlesJ #59, 60, 61, 62

    Thank you very much for the detailed explanations of your normalization method. Your idea of modeling the statistics using a binomial probability distribution is excellent and can surely be a valid method of normalization (though there may be other methods of obtaining the same result).

    For now there is a single point in your mathematical analysis I am not sure about (or that I haven’t understood, through my own ignorance). You say: “I changed the value of p until the probability of having *at least* 1 mismatch was as close to 0.38 as possible: 0.0158 seemed good enough”. Are you sure that such a probability is not that of having *exactly* 1 mismatch?

    Anyway, I will continue to study your method and will comment on it as soon as possible.

  65. You are right to mention that the probability of having at least 1 mismatch is different from the probability of having exactly 1 mismatch.

    In our case, it’s the probability of having at least 1 mismatch that should be used. The reason is that if a pattern has 1 mismatch, it won’t be considered as a perfect match. The same goes if there are 2 mismatches in the pattern, 3, 4, etc… When using the probability of having at least 1 mismatch, you can account for every number of mismatches, not just 1.

  66. By the way, if you take the probability of getting exactly 1 mismatch + the probability of getting exactly 2 mismatches … + the probability of getting exactly n mismatches, you will get the same result as the probability of getting at least 1 mismatch. For a 30-base pattern, you will notice that the probability of having 6 or more mismatches is so small that you can dismiss it without significantly influencing your results.
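    This identity is easy to confirm numerically, using the p = 0.0158 and N = 30 from this exchange:

```python
from math import comb

n, p = 30, 0.0158

def exactly(k):
    """Binomial probability of exactly k mismatches in an n-base pattern."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

at_least_one = 1 - exactly(0)
summed = sum(exactly(k) for k in range(1, n + 1))
print(round(at_least_one, 4), round(summed, 4))    # both 0.3798
print(sum(exactly(k) for k in range(6, n + 1)))    # 6+ mismatches: below 1e-4
```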

  67. CharlesJ #65, 66

    This discussion is interesting. Sorry if I insist, Charles. It is true that having at least 1 mismatch covers every number of mismatches greater than 0 and less than or equal to 30. But in my opinion the problem is that when we obtain 0.0158 by means of the binomial formula (as a function of n=30 and k=1), we are calculating the probability of having exactly 1 mismatch, when we should be dealing with the probability of having at least 1 mismatch (the binomial formula gives the probability that the event will happen exactly X times in N trials). It is true that the probability of 6 or more mismatches in a 30-base pattern is small, but what about the probabilities of 5, 4, 3, 2 mismatches? Are they really negligible?

    Moreover, there is the problem of my normalization, which gives different coefficients. So far I cannot see where it is wrong. I admit that my method of normalization is simpler than yours, but where is it wrong? Really, I would prefer two different normalizations giving the same results! Unfortunately nobody has shown me where mine is wrong, and at the same time I have the above doubt about yours. Oh my.

  68. About the first point, I think we are saying the same thing. If you use a binomial probability calculator, what result do you obtain for the probability of having at least one mismatch with N=30, k=1 and p=0.0185? (Note: the calculator I’m using gives me 3 answers: “P: exactly 1 out of 30”, “P: 1 or fewer out of 30” and “P: 1 or more out of 30”. It is the third one that I think should be used.)

    About the second point, I took the time to read your post carefully, and I think I missed some points the first time I tried to explain the differences between our normalization ratios. I also noticed I made a mistake in my formulation when I said: “you have to divide the % of similarity by 24 in a 30-BPM analysis, by 30 in a 40-BPM analysis and by 35 in a 50-BPM analysis”. What I meant to say is that you have to divide the % of patterns that do not score a perfect match by 24 in a 30-BPM analysis, by 30 in a 40-BPM analysis and by 35 in a 50-BPM analysis. This gives the 1.58 I was talking about (i.e. 38/24 = 1.58). Sorry for the confusion.

    In essence both our methods are equivalent. You are saying that you have to multiply 62 by 1.41, and I am saying that we have to divide 38 by 24. While the logic is equivalent, we are not using the same estimates for the expected number of mismatches in the genome: I’m using 98.42 (100 minus 1.58; see my post #60 for details of the calculation) and you are using 99%.

    You should also recalculate the probability of having a perfect match using the binomial probability calculator with N=30, k=1 and p=0.0158. You have to take 1 minus the probability of having at least one mismatch (1 – 0.3798 = ~0.62).

    If you recalculate your ratio with those new values you get: 98.42 / 62 = 1.58. So if you want to compare our two ratios, you can say that ((62 * 1.58) = (100 – (38 / 24)) = ~98). Of course, since we are using rounded numbers, there is a slight difference, but that should be expected. Yours is also more direct than mine, since you are working directly on the % of similarity instead of on the % of patterns without a perfect match, but it’s still essentially the same result.

    The most important point we have to be sure to agree on is the calculation of probabilities. So I’ll ask again: what result do you obtain for the probability of having at least one mismatch when you use N=30, k=1 and p=0.0185?

    My answer is: 0.3798. What is yours?

  69. CharlesJ #68

    To answer your specific question at the bottom I used the binomial calculator found at:

    http://stattrek.com/Tables/Binomial.aspx

    I would take the “cumulative probability P(X ≥ 1)”, which is 0.4289.

    Why don’t we agree here? It should be a simple matter of using the binomial formula, independently of any genomics question. It is likely you are using another calculator; however, all calculators should agree.

    You say that our methods are equivalent, and I am glad of that, but my normalizing coefficient for the 30BPM is 1.41, which multiplied by 62% gives 87.7% similarity, less than your 98.42%. It seems to me that we aren’t properly converging, unfortunately. Probably I am missing something. I will try to find out what tomorrow. Bye.

  70. Using the calculator you linked, with “Probability of success on a single trial = 0.0158”, “Number of trials = 30” and “Number of successes (x) = 1”, I get a “Cumulative Probability: P(X ≥ 1) = 0.3798”.

  71. From your post #56
    “Given two supposed genomes that match 99% a 30BPM test gives 70% matches. Since the real test gave 62% my first idea to obtain a 30BPM value comparable to 99% is to apply the simple formula: 99×62/70 = 87.7%. In other words the multiplier coefficient that we must apply to the 62% is 99/70 = 1.41.”

    The 70% comes from your post #29, right?
    “In your hypothesis of two genomes that differ only 1% in average every 100 bases there is a mismatch. To simplify the scenario let’s imagine that these mismatches are uniformly distributed along the coupled genomes A and B, as the tags in a ruler. Now let’s consider a random 30 base pattern in A. In every range of 100 bases we have 70 successive positions where there are no mismatches followed by 30 positions where there are mismatches.”

    There are 2 flaws in that line of reasoning. The first is that mismatches are not spaced evenly every 100 bases in the two genomes you are comparing. There are cases where you will have two or more mismatches within the range of a single pattern, and in those cases you will not get 70 successive positions with no mismatches followed by 30 positions with mismatches. This is why the binomial probability of a perfect match is higher than 70%. From the calculator you linked, put 0.01 in the first box, 30 in the second and 1 in the third. The Cumulative Probability P(X ≥ 1) is 0.26. So if we suppose that the two genomes match 99%, a 30BPM test will give 74% matches.

    This brings us to the second point. The 99% match value is from the literature; you have to estimate the % of matching between the two genomes from your own results (see my post #70). While a 0.58% difference may seem small, notice that the probability of having at least one mismatch changes considerably between a 99% match and a 98.42% match: 26% vs. 38% respectively. In the first case a 30BPM test will give 74% matches, and in the second 62% matches.
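    The first point can be checked by direct counting; a sketch with a toy genome length, comparing evenly spaced mismatches against randomly placed ones at the same 1% rate:

```python
import random

def match_rate(mismatch_positions, genome_len, k=30):
    """Exact fraction of k-base windows containing no mismatch."""
    hit = [0] * (genome_len + 1)
    for p in mismatch_positions:
        hit[p + 1] = 1
    for i in range(1, genome_len + 1):   # prefix sums of mismatch counts
        hit[i] += hit[i - 1]
    windows = genome_len - k + 1
    clean = sum(1 for i in range(windows) if hit[i + k] == hit[i])
    return clean / windows

N = 100_000
evenly = range(99, N, 100)                               # one mismatch per 100 bases
scattered = random.Random(1).sample(range(N), N // 100)  # 1% placed at random
print(round(match_rate(evenly, N), 2))                   # 0.70
print(round(match_rate(scattered, N), 2))                # ≈ 0.74
```

    Even spacing never lets two mismatches share a window, so it poisons the largest possible number of windows; random placement clusters occasionally and leaves more windows clean.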

  72. niwrad #25
    Yes, evolutionists use the 99% figure to make their case, but creationism(s) need only embrace this number, as I said. It’s fine from a common-blueprint concept.
    Your work is very welcome in showing that the differences couldn’t arise by selection on mutation, etc.
    This is a good angle, in any case, for demonstrating the unlikeliness of evolution by selection/mutation for physical change or growth.

  73. Charles,

    About your point #2. The 99% identity is, yes, the literature result, but at the same time it can well serve as a hypothesis or supposition in a piece of reasoning such as mine. Why not?

    About your point #1. My simplified model of evenly spaced mismatches (ESM) is generous towards the side you defend (with honor). I try to explain why.

    First some abbreviations: M=matches; U=un-matches (mismatch); min=minimum; max=maximum.

    If we want U max and M min in a xxBPM test the best strategy is indeed to uniformly space all mismatches along the entire length of the coupled genomes as I did in the ESM model. On the contrary if all mismatches were concentrated somewhere in a single string we would have U min and M max. For instance, as extreme case, if all mismatches were concatenated at the end of the genomes and the xxBPM test never hits there M might arrive even to 100%, despite the fact that the genomes are not identical!

    In a two 1% different genomes (99% equal genomes) ESM model a 30BMP has M=70%, M is min and U is max. And it is a consequence of this bias of an ESM that you rightly say that in less biased real conditions M might be higher than 70.

    The principle of my normalization consists in saying: in an ESM couple of 99% equal genomes M=70; the 30BPM test gives in a couple of X% similar genomes M=62; what is the value of X (the normalized value of M)? The proportion is 99/70=X/62, then X=99*62/70=88. But in real conditions 70 could be higher, say 80. In this case X decreases to 77.

    In this sense I claim that the ESM model is generous toward those who favor a high human/chimp similarity. Even so, my normalization, based on an abstract ESM model and applied to the real 30BPM test, gives 88% similarity.
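
    The ESM model is simple enough to verify numerically. The sketch below (my own illustration, not code from the original analysis) places one mismatch every 100 bases on a synthetic sequence, counts mismatch-free 30-base windows, and then applies the proportional normalization 99/70 = X/62 described in the comment:

```python
# ESM (evenly spaced mismatches) model: a 1% difference spread evenly
# means one mismatch every 100 bases. Count 30-base windows free of
# mismatches over a synthetic 100,000-base sequence.

SPACING, WINDOW, LENGTH = 100, 30, 100_000

mismatch = [pos % SPACING == 0 for pos in range(LENGTH)]

total = LENGTH - WINDOW + 1
clean = sum(not any(mismatch[s:s + WINDOW]) for s in range(total))
print(f"ESM model: {clean / total:.1%} of 30-base windows match")  # 70.0%

# The proportional normalization from the comment: 99/70 = X/62.
print(f"normalized similarity: {99 * 62 / 70:.0f}%")  # 88%
```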

  74. Robert Byers #72

    As I said at the beginning of my article, I will not cease to be an IDer if I discover that the human/chimp genetic similarity really is 99% (and as you can see from my previous post, I haven't yet arrived there). Analogously, you can well remain a creationist even if the similarity were, say, 88% or thereabouts.

    You are right that homologies between living forms point to a unique Designer. There is more. As it was said: “In any thing [not only life] there is a sign that He is unique.” At the same time, there are astonishing differences between living forms in particular (and between all things in the universe), but this testifies to the immensity of His creative power.

  75. “The proportion is 99/70=X/62, then X=99*62/70=88.”
    There are limitations to cross-multiplication. Let's say I have a value (u) that doubles every second.
    1s = 1u
    2s = 2u
    3s = 4u

    I cannot say that 1s/1u = 3s/4u. The reason is that cross-multiplication only works when the two variables are directly proportional.
    The % similarity from an x-BPM test is not directly proportional to the % match between two genomes. This is why you cannot estimate the % match in your analysis using cross-multiplication; you have to use binomial probability to get a much better estimate.
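
    Charles's binomial correction can be made concrete. Under the independent-mismatch model, the fraction of perfectly matching 30-base windows is p^30, so the per-base identity implied by an observed window-match rate is its 30th root rather than a cross-multiplied proportion. A minimal sketch, using the 62% figure quoted earlier in the thread:

```python
# Invert the binomial model: if a fraction `window_match` of 30-base
# windows match perfectly, the implied per-base identity p satisfies
# p**30 = window_match, i.e. p = window_match**(1/30).

window_match = 0.62
identity = window_match ** (1 / 30)
print(f"binomial estimate: {identity:.2%} per-base identity")  # 98.42%
```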

  76. Late to the party-

    Looks like someone forgot to close an italics tag-

    No one has done a complete side by side comparison of the two genomes.

  77. trying again to close the tag…

  78. It wouldn’t matter to those who wish to support evolution whether the genetic similarity were 99% or 50% or 12%, for evolution is a comparative endeavor. If the genomes show similarity, that “closeness” is used as evidence of evolution. But the distance between genomes is also used as evidence for evolution, since evolution is supposed to occur with one species moving away from another; the further a genome has moved away from another species, the more it has “evolved.” Nothing would stand to falsify this line of thinking, neither genetic similarities nor genetic differences. The evolutionist uses both as evidence of evolution; they indeed want it both ways.

  79. Charles and all,

    A linear interpolation can work as a first approximation over a short range. If in my ESM model we consider two genomes differing by 2%, we have M = 40%. This gives us two points on the interpolation line: (99, 70) and (98, 40). The corresponding equation is 30X − Y = 2900 (where X is the normalized 30BPM similarity and Y the un-normalized one, as shown in the table and graph). If Y = 62 (the median at the bottom right of the table), the equation gives X = 98.73%.

    Therefore we have two normalization methods (yours binomial and non-linear, mine linear) that agree in their final result.
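
    The interpolation can be expressed as a two-line function. This is my reading of the formula in the comment, not code from the original analysis:

```python
# Line through the two ESM points (X=99, Y=70) and (X=98, Y=40):
# slope = (70-40)/(99-98) = 30, giving Y = 30*X - 2900.
# Invert it to map an observed (un-normalized) 30BPM value Y
# back to a normalized similarity X.

def normalize(y: float) -> float:
    return (y + 2900) / 30

print(f"Y=62 -> X = {normalize(62):.2f}%")  # 98.73%
```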

    Consider that this high figure is obtained under the following conditions, both very favorable to similarity:

    (1) the ESM model helps to obtain high similarity values;
    (2) the 30BPM test is, by definition, a generous one, because it allows a total scrambling of patterns.

    If one or both of these conditions do not hold, the scenario can only get worse for similarity.

    Condition #2 implies that to speak of “identity” between genomes is nonsense, despite the high value obtained in the test. Besides, a 1.27% difference in 3-billion-base genomes amounts to 38 million point mutations, after all.

    As a consequence, the normalized result of the 30BPM test in no way supports the evolutionist claim of a common ancestor for these genomes. A blind evolution that changes and scrambles 38 million bases is unthinkable.

    I am satisfied with this work and wish to thank you for the collaboration.

  80. Here is something that directly reflects on the 99% similarity myth:

    Response to John Wise
    Excerpt: But there are solid empirical grounds for arguing that changes in DNA alone cannot produce new organs or body plans. A technique called “saturation mutagenesis”1,2 has been used to produce every possible developmental mutation in fruit flies (Drosophila melanogaster),3,4,5 roundworms (Caenorhabditis elegans),6,7 and zebrafish (Danio rerio),8,9,10 and the same technique is now being applied to mice (Mus musculus).11,12

    None of the evidence from these and numerous other studies of developmental mutations supports the neo-Darwinian dogma that DNA mutations can lead to new organs or body plans–because none of the observed developmental mutations benefit the organism.
    http://www.evolutionnews.org/2.....38811.html

    Thus we have evolutionists trying desperately to establish a point of DNA similarity, through the very questionable means of excluding that which does not match, despite the fact that it is now known that mutations to DNA do not even affect body-plan morphogenesis in the first place.

  81. I am satisfied too that we now pretty much agree about the normalization process. I agree that we have said pretty much everything there was to say about your algorithm. I'd like to thank you for this very interesting exchange; I found it very informative, and I hope I helped you with your analysis.
    I do disagree with you about the 38 million bases being too many for the time scale since the last common ancestor, as we should expect that half of those were accumulated in the chimp genome and half in the human genome. But this is outside the scope of your original post.

    If you have the time later, I do think it could be interesting to analyze the patterns without a perfect match, to see what proportion of them are deletion/insertion events (post #61). There are a few points I noticed that should be changed with the correction at #62. If someday you are interested in doing such an analysis, we could discuss it by mail.

  82. niwrad, you may find this paper very relevant to your topic:

    Pattern pluralism and the Tree of Life hypothesis – 2006
    Excerpt: Hierarchical structure can always be imposed on or extracted from such data sets by algorithms designed to do so, but at its base the universal TOL rests on an unproven assumption about pattern that, given what we know about process, is unlikely to be broadly true.
    http://www.pnas.org/content/104/7/2043.abstract
