A simple statistical test for the alleged “99% genetic identity” between humans and chimps

September 27, 2010

Genomics, Informatics, Intelligent Design


Typical figures published in the scientific literature for the percentage similarity between the genomes of human beings (Homo sapiens) and chimpanzees (Pan troglodytes) range from 95% to 99%. However, in press releases intended for popular consumption, evolutionary biologists frequently claim that human and chimpanzee genomes are 99% identical. Skeptics of neo-Darwinian evolution have repeatedly punctured this “99% myth,” but unfortunately it seems to have gained widespread credence, because evolutionists continually propagate it! For instance, one often encounters statements like these in the literature:

“Because the chimpanzee lies at such a short evolutionary distance with respect to human, nearly all of the bases are identical by descent and sequences can be readily aligned” (The Chimpanzee Sequencing and Analysis Consortium, “Initial sequence of the chimpanzee genome and comparison with the human genome,” Nature, Vol. 437, 1 September 2005, doi:10.1038/nature04072).

“The consortium [National Human Genome Research Institute] found that the chimp and human genomes are very similar and encode very similar proteins. The DNA sequence that can be directly compared between the two genomes is almost 99 percent identical.” (here.)

“The genetic codes of chimps and humans are 99 percent identical.” (here)

Supporters of the neo-Darwinian theory of evolution have a strong ideological motivation for minimizing the differences between humans and chimps, as they claim that these two species evolved from a common ancestor, as a result of random mutations filtered by natural selection. Now, I don’t personally believe that humans and chimps share a common ancestry, for a host of reasons that would take me too long to explain in this post. Nor do I attach much significance to the magnitude of the genetic differences between these two species, per se, because in my opinion, the fundamental differences between these creatures lie elsewhere. However, since the genomic data is now available for free on the Internet, I decided to perform some sleuthing of my own, and check out the wildly exaggerated claims that are often made regarding the percentage similarities between human and chimp genomes. Here is what I discovered.

Interactive functional comparison methods
Usually, molecular biologists compare genomes on a functional basis. For example, they may search for similar genes in the genomes of human beings and chimpanzees, and try to identify the bases or nucleotides where they differ or match. Many different technologies have been developed to investigate genomes. One of these is the BLAST (Basic Local Alignment Search Tool) software (see the NCBI Web site for more details). BLAST is an extremely powerful computer-aided tool, as it is able to locate regions of local similarity among sequences by searching a whole database of genomes. Alignment methods (such as those implemented by BLAST and other techniques) allow geneticists to search interactively for common local patterns in different positions. However, this interactive task has its limits, as it can compare only portions of different genomes. Additionally, some critics have pointed out that these tools are susceptible to slip-ups (see here). Given the amount of data involved (on the order of gigabytes), the global comparison of two genomes is a very demanding job, which human beings cannot complete interactively in a short time, even with the aid of tools such as BLAST. At the present time, only fully automated computer programs are capable of performing such a task on entire genomes. However, developing an automated computer program capable of performing a complete functional comparison between human and chimpanzee genomes is practically impossible, for the simple reason that the functional architecture of these genomes is not yet fully known.

Automatic statistical comparison methods
From a purely informatics and statistical point of view, DNA sequences are simply strings of symbols or characters. Thus it is also possible to develop tests comparing genomes as unstructured sequences of characters, without taking into consideration genes, pseudo-genes, coding and non-coding regions, vertical and horizontal gene transfer, open reading frames (ORFs), or any other functional concepts. The characters most commonly present in DNA sequences are A, C, G and T. There are other, less important characters which are used basically to indicate ambiguity regarding the identity of certain bases in the sequences. The comparison I performed was completely different from those usually performed by geneticists, because it was purely statistical in nature. In a sense, it could be described as an application of the well-known Monte Carlo method. The Monte Carlo method is frequently used when the data or processes involved are huge, and one wants to reduce the computer running time. In short, it involves dealing with a partial random sample, instead of the whole space which is under investigation. In the Monte Carlo method, only a small portion of the data population is actually investigated; nevertheless, this portion is statistically large enough to reveal the characteristics of the whole.

Metrics, distances and similarity measures
One theoretical approach to the problem would be to consider the set of all strings of characters as a metric space, and then define a distance function for all pairs of strings. Mathematicians have developed many distance functions for studying the degree of similarity between strings (for a list of them see here). Each metric or pseudo-metric space, with its distance function, defines its own notion of similarity, which generally differs from the similarity defined by another metric space. In a pairwise comparison identity test, we can easily calculate a simple metric distance called the “Hamming distance.” In this test, the order is important, because the n-th character of string A is compared to the n-th character of string B, after the initial characters of A and B have been aligned. After each comparison, if the two characters don’t match, then the Hamming distance increases by 1. If the order doesn’t matter, we can instead compare sub-strings of the parent strings A and B, and if those sub-strings are allowed to sit at different positions in the two strings, many different tests are possible. We call these pattern matching or similarity tests. While there is only one possible method of comparing identity between strings of characters (the above pairwise comparison), there are many methods of comparing similarity. In other words, there are many measures of similarity, depending on the rules of pattern matching that we choose. In practice, calculating a certain distance function between two genomes can be a demanding job, in terms of running time, even for powerful computers.
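To make the pairwise identity test concrete, here is a minimal Python sketch of the Hamming distance. The function names and the short example strings are made up for illustration and are not taken from the Perl script used for the test.

```python
def hamming_distance(a: str, b: str) -> int:
    """Count the aligned positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(1 for x, y in zip(a, b) if x != y)


def identity_percent(a: str, b: str) -> float:
    """Percentage of aligned positions holding the same character."""
    return 100.0 * (len(a) - hamming_distance(a, b)) / len(a)


# Two toy 7-base strings that differ at the 3rd and 6th positions.
distance = hamming_distance("GATTACA", "GACTATA")
identity = identity_percent("GATTACA", "GACTATA")
```

Note that a single inserted or deleted base shifts every later position, so this strictly positional comparison is very sensitive to alignment.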

Specifications for a statistical similarity test
Any final result for a complete statistical similarity test (especially if it is a single number) is meaningful only if: 1) the distance function is mathematically defined; 2) the rules for pattern matching and the formulas for calculating the result are explained in detail; 3) it is clearly stated which parts of the input strings are being examined; 4) in the event that computer programs were used to perform the comparison, the source code and algorithms are provided. My explanations below aim to meet the first three conditions. To satisfy the fourth condition, the source file of the Perl script used for the test is freely downloadable here.

How the genome data was obtained
Genome data for Homo sapiens and Pan troglodytes was freely downloaded from public bio-informatics archives at UCSC Genome Bioinformatics. The downloaded DNA sequences were in FASTA format. Before running the test, I decided to discard all symbols in the sequences, except for A, C, G and T. Most of the symbols I had to discard were “N” symbols, which represent rare, undefined bases (probably due to the limits of the sequencing technology). The frequency of other symbols was very low. As it turned out, the deletion of a few “N” symbols didn’t affect the overall result very much. Given that the chimp’s genome contains two chromosomes (referred to as chr_02a and chr_02b) corresponding to chromosome #2 in human beings, I decided to concatenate them, in order to compare them with human chromosome #2 (chr_02).
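The clean-up step described here could look roughly like the following Python sketch. It is an illustration only, not the Perl script used for the test: the function name is hypothetical, and folding lowercase bases to uppercase is my own assumption about how soft-masked sequence might be handled.

```python
from typing import Iterable


def clean_fasta(lines: Iterable[str]) -> str:
    """Return the sequence from FASTA lines, keeping only A, C, G and T."""
    kept = []
    for line in lines:
        if line.startswith(">"):
            # Header line (sequence name), not sequence data.
            continue
        # Assumption: lowercase bases are folded to uppercase before filtering,
        # so "N", "-", newlines and any other symbol are dropped.
        kept.append("".join(ch for ch in line.upper() if ch in "ACGT"))
    return "".join(kept)


# Toy stand-ins for the two chimp files; the real input would be read from disk.
chr_02a = clean_fasta([">chr_02a\n", "ACGTNNac\n"])
chr_02b = clean_fasta([">chr_02b\n", "gt-N\n"])
chimp_chr_02 = chr_02a + chr_02b  # concatenated for comparison with human chr_02
```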

30 Base Pattern Matching (30BPM) similarity test
The 30BPM similarity test is a very simple one: it searches for shared 30-base-long patterns on two homologous chromosomes. This method is a true pattern-matching test, because it searches for identical patterns in the chromosomes of humans and chimpanzees. The beauty of this test is that it allows patterns to match, independently of their position in the chromosome. This matters because identical patterns may be found in quite different positions along two homologous chromosomes. In fact, this test allows a total scrambling of patterns between homologous chromosomes. Of course, it is generally very difficult to know what the functional implications of this scrambling are. In particular, the positions of the genes might shift, but when non-coding DNA is scrambled, it is doubtful that functionality is preserved. However, since this particular test is purely quantitative, I don’t need to worry about qualitative issues such as functionality; only statistical issues count.

The algorithm implemented
For each pair of homologous chromosomes A and B, a PRNG (pseudo-random number generator) generates 10,000 uniformly distributed pseudo-random numbers which specify the offset, or starting point, of 10,000 30-base patterns that are contained in source chromosome A. The 30BPM test involves searching for all 10,000 of these DNA sub-strings of chromosome A in our target chromosome B. Now let F be the number of patterns located (at least once) in chromosome B. The 30BPM similarity is simply defined as F/100 (minimum value = 0%, maximum value = 100%). The absolute difference between 10,000 and F (minimum 0, maximum 10,000) is the 30BPM distance. Thus the greater the similarity is, the smaller the distance will be. Strictly speaking, this 30BPM space is only a pseudo-metric, inasmuch as the axiom of identity (“the distance is zero if and only if A and B are equal”) defining a true metric space is somewhat relaxed (in some cases, the distance could still be zero even if A and B were different), while the axiom of symmetry (“the distance between A and B is equal to the distance between B and A”) does not hold in some cases. It can easily be seen that the 30BPM distance will be zero (30BPM similarity = 100%) if the two strings are identical. In an additional test which I performed on two random 100 million-base DNA strings, the 30BPM distance was 10,000 (i.e. no patterns on A were located in B). Hence I shall refer to the value 10,000 as the “random 30BPM distance.” In other words, the 30BPM similarity between two artificially generated random 100 million-base DNA strings is zero. Of course, when generating these artificial DNA strings I had to take into consideration the fact that on average, the true probabilities of A, T, G and C occurring in natural DNA are not exactly 0.25 each, but as follows: A=0.3, T=0.3, G=0.2, and C=0.2. In such a case, the following formula accurately describes the probability of obtaining a single-base match between the two DNA sequences:

(30*30 + 30*30 + 20*20 + 20*20)/(100*100) = (900+900+400+400)/10000 = 26%
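The same arithmetic can be checked in a couple of lines of Python. This is only a sketch: it assumes the two sequences are independent and share the base frequencies quoted above, and the function name is made up for illustration.

```python
def single_base_match_probability(freqs: dict) -> float:
    """Chance that two independent random bases drawn with these
    frequencies are the same: the sum of the squared frequencies."""
    return sum(p * p for p in freqs.values())


# Frequencies quoted in the text: A=0.3, T=0.3, G=0.2, C=0.2.
skewed = single_base_match_probability({"A": 0.3, "T": 0.3, "G": 0.2, "C": 0.2})

# For comparison, perfectly uniform frequencies of 0.25 each.
uniform = single_base_match_probability({"A": 0.25, "T": 0.25, "G": 0.25, "C": 0.25})
```

With the quoted frequencies the result is 0.26, slightly above the 0.25 obtained for uniform frequencies.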

In a supplementary test in which I performed a pure pairwise comparison between human/chimp genomes, I obtained a global figure of 25.90%, which matches the theoretically predicted result above very closely.
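The sampling procedure described in this section can be sketched in Python as follows. This is not the Perl script used for the test: the function name, the fixed seed, and the use of Python's built-in substring search are assumptions made for illustration, and a plain substring search would be slow on chromosome-sized strings.

```python
import random


def bpm_similarity(source: str, target: str, pattern_len: int = 30,
                   samples: int = 10_000, seed: int = 0) -> float:
    """Percentage of randomly chosen patterns from `source` that occur
    at least once, at any position, in `target`."""
    last_start = len(source) - pattern_len
    if last_start < 0:
        raise ValueError("source is shorter than the pattern length")
    rng = random.Random(seed)  # assumption: seeded only for repeatability
    found = 0
    for _ in range(samples):
        start = rng.randint(0, last_start)  # uniform offset, both ends inclusive
        pattern = source[start:start + pattern_len]
        if pattern in target:  # exact match, position-independent
            found += 1
    return 100.0 * found / samples  # equals F/100 when samples is 10,000


# Toy strings only; real input would be two cleaned chromosome sequences.
same = bpm_similarity("ACGT" * 50, "ACGT" * 50, samples=200)
none = bpm_similarity("A" * 100, "C" * 100, samples=50)
```

Because patterns are drawn from the source and searched for in the target, swapping the two arguments can change the result, which is the asymmetry mentioned above.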

Results obtained
The following table and graph show the report of the 30BPM similarity test on the whole set of human/chimp chromosomes.

The results obtained are statistically valid. The same test was previously run on a sampling of 1,000 random 30-base patterns and the percentages obtained were almost identical to those obtained in the final test, with 10,000 random 30-base patterns. When human and chimp genomes are compared, the X chromosome is the one showing the highest degree of 30BPM similarity (72.37%), while the Y chromosome shows the lowest degree of 30BPM similarity (30.29%). On average the overall 30BPM similarity, when all chromosomes are taken into consideration, is approximately 62%. Here we have the classic case of the glass which some people perceive as being half-full, while others perceive it as being half-empty. When compared to two random strings which are 0% similar, 62% is a very large value, so nobody would deny that human and chimp genomes are quite similar! On the other hand, 62% is a very low value when compared to the similarity percentages of more than 95% which are published by evolutionary bioinformatics researchers. Now, I realize that it may seem somewhat arbitrary to choose 30-base-long patterns, as I did in my test, and indeed it is arbitrary to some degree. However, if the two genomes were really 95% similar or more, as is commonly claimed, a 30BPM statistical test should also produce results of about 95%, and it does not.

An analogy from politics: an exit poll
To help readers grasp the significance and potential implications of my test, here is a simple analogy. Consider an election, in which 100 million electors are eligible to vote. One exit poll, based on a sample of 10,000 voters, calculates that party X has received 62% of the popular vote. However, at the end of the election party X declares it has received more than 95% of the vote! The 30BPM statistical test described above is analogous to the exit poll, while the claims made by evolutionary biologists are analogous to party X’s “95%” claim. The sample of 10,000 patterns is taken from a global population of 100 million bases (the approximate number of bases on a typical human/chimp chromosome), so the ratio of population to sample is 100,000,000/10,000 = 10,000. The 30BPM exit poll metaphorically says that only 62% voted for Darwin’s party, whereas modern Darwinists claim that over 95% did. Something doesn’t quite add up.

I believe that the classic evolutionary comparisons between human and chimp genomes exaggerate the similarities, for at least two reasons: (1) they don’t consider whole chromosomes, but only portions of them (e.g. particular genes); (2) the rules of pattern matching are relaxed in some way (e.g. sometimes two bases are said to match, even when they don’t really match). Now, there is nothing intrinsically wrong with comparisons where (1) and (2) hold. However, any research that is truly worthy of being called “scientific” should openly acknowledge built-in limitations, such as (1) and (2) above. Sadly, this is very rarely done. It is perfectly acceptable to publish partial results that are obtained by relaxing the rules, but one should not publicize them as global and mathematically sound, when in fact, they are nothing of the sort.

Conclusion
We have seen that in a genome comparison, the only thing that matters is the degree of similarity. However, once we put the concept of similarity between two text strings on the table we open a can of worms. Many different measures of the similarity between two strings are possible, and different methods of comparing two genomes can result in wildly different estimates of the similarity between them. The assumptions that drive the methods used also drive the results obtained, as well as their interpretation. A simple layman’s statistical test, such as the 30BPM, shows that the “95% claim” described above is a highly controversial one. It is worth noting that as more information comparing the two genomes is published, the differences between them will appear more profound than they were originally thought to be. The big question that still remains is: what should one conclude from the similarities and differences between the genomes of humans and chimpanzees? Commonly reported evolutionary statistics that should provide an informative answer to this question may actually obscure the true answer.

Comments

niwrad, you may find this paper very interesting for your topic: Pattern pluralism and the Tree of Life hypothesis - 2006 Excerpt: Hierarchical structure can always be imposed on or extracted from such data sets by algorithms designed to do so, but at its base the universal TOL rests on an unproven assumption about pattern that, given what we know about process, is unlikely to be broadly true. http://www.pnas.org/content/104/7/2043.abstract
bornagain77, October 2, 2010, 06:26 AM PDT

I am satisfied too that we now pretty much agree about the normalization process. I agree that we have said pretty much everything there was to say about your algorithm. I’d like to thank you for this very interesting exchange; I found it very informative and I hope I helped you with your analysis. I disagree with you about the 38 million bases being too many for the time elapsed since the last common ancestor, as we should expect half of them to have accumulated in the chimp genome and half in the human genome, but this is outside the scope of your original post. If you have the time later, I do think it could be interesting to analyze the patterns without a perfect match to see what proportion of them are deletion/insertion events (post #61). There are a few points I noticed that should be changed with the correction at #62. If someday you are interested in doing such an analysis, we could discuss it by mail.
CharlesJ, October 1, 2010, 06:51 PM PDT

Here is something that directly reflects on the 99% similarity myth: Response to John Wise Excerpt: But there are solid empirical grounds for arguing that changes in DNA alone cannot produce new organs or body plans. A technique called "saturation mutagenesis"1,2 has been used to produce every possible developmental mutation in fruit flies (Drosophila melanogaster),3,4,5 roundworms (Caenorhabditis elegans),6,7 and zebrafish (Danio rerio),8,9,10 and the same technique is now being applied to mice (Mus musculus).11,12 None of the evidence from these and numerous other studies of developmental mutations supports the neo-Darwinian dogma that DNA mutations can lead to new organs or body plans--because none of the observed developmental mutations benefit the organism. http://www.evolutionnews.org/2010/10/response_to_john_wise038811.html So we have evolutionists trying desperately to establish a point of DNA similarity, through the very questionable means of excluding that which does not match, regardless of the fact that it is now known that mutations to DNA do not even affect body plan morphogenesis in the first place.
bornagain77, October 1, 2010, 06:38 PM PDT

Charles and all, A linear interpolation can work as a first approximation over a short range. If in my ESM model we consider two genomes differing by 2% we have M=40%. This way we have two points of the interpolation line: (99,70) and (98,40). The corresponding equation is 30X - Y = 2900 (where X are the normalized 30BPM similarities and Y are the un-normalized ones, those I show in the table and graph). If Y=62 (the median at the bottom right of the table) we get from the equation X=98.73%. Therefore we have two normalization methods (yours binomial and non-linear, mine linear) that agree on the final result. Consider that this high figure is obtained under the following conditions, which are very favorable to similarity: (1) the ESM model helps to obtain high similarity values; (2) the 30BPM test, by definition, is a lavish one because it allows a total scrambling of patterns. If one or both of these conditions is not applied, the scenario can only get worse for similarity. Condition #2 implies that to speak of "identity" between genomes is nonsense, despite the high value obtained in the test. Besides, a 1.27% difference in 3-billion-base genomes makes 38 million point mutations after all. As a consequence the normalized result of the 30BPM test in no way supports the evolutionist claim of a common ancestor of these genomes. A blind evolution that changes and scrambles 38 million bases is unthinkable. I am satisfied with this work and wish to thank you for the collaboration.
niwrad, October 1, 2010, 02:31 PM PDT
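The interpolation in this comment can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: the function name is hypothetical, and the two anchor points (99, 70) and (98, 40) and the 3-billion-base genome size are taken directly from the comment.

```python
def normalized_similarity(y: float,
                          p1: tuple = (99.0, 70.0),
                          p2: tuple = (98.0, 40.0)) -> float:
    """Invert the straight line through p1 and p2, where each point is
    (normalized similarity X, un-normalized 30BPM similarity Y)."""
    (x1, y1), (x2, y2) = p1, p2
    slope = (y1 - y2) / (x1 - x2)  # 30 for the two points in the comment
    return x1 + (y - y1) / slope


x = normalized_similarity(62.0)  # X for the observed 30BPM similarity of 62%
differing_bases = (100.0 - x) / 100.0 * 3_000_000_000  # bases that differ
```

For Y=62 this gives X of about 98.73, and roughly 38 million differing bases, in line with the figures quoted in the comment.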

It wouldn't matter if the genetic similarity were 99% or 50% or 12% for those who wish to support evolution. For evolution is a comparative endeavor. If the genomes show similarity, then that "closeness" is used as evidence of evolution. But so is the distance between genomes also used as evidence for evolution, for evolution is supposed to occur with one species moving away from another; the further the genome has moved away from another species, the more it has "evolved." Nothing would stand to falsify this line of thinking, neither genetic similarities nor genetic differences. The evolutionist uses both as evidence of evolution; they indeed want it both ways.
Clive Hayden, October 1, 2010, 09:42 AM PDT

trying again to close the tag...
Joseph, October 1, 2010, 05:59 AM PDT

Late to the party. Looks like someone forgot to close an italics tag. No one has done a complete side-by-side comparison of the two genomes.
Joseph, October 1, 2010, 05:59 AM PDT

“The proportion is 99/70=X/62, then X=99*62/70=88.” There are limitations to cross multiplication. Let's say I have a value (u) that doubles every second: 1s = 1u, 2s = 2u, 3s = 4u, ... I cannot say that 1s/1u = 3s/4u. The reason is that cross multiplication only works if the two variables are directly proportional. The % of similarity in an x-BPM test is not directly proportional to the % match between two genomes. This is why you cannot estimate the % match in your analysis using cross multiplication; you have to use binomial probability to get a much better estimate.
CharlesJ, October 1, 2010, 05:57 AM PDT

Robert Byers #72 As I said at the beginning of my article, I will not cease to be an IDer if I discover that the human/chimp genetic similarity is really 99% (and as you can see from my previous post I haven’t yet arrived there). Analogously, you can well remain a creationist even if the similarity were, say, 88% or something like that. You are right that homologies between living forms point to a unique Designer. There is more. As it was said: "In any thing [not only life] there is a sign that He is unique". At the same time there are astonishing differences between living forms in particular (and between all things in the universe), but this is evidence of the immensity of His creative power.
niwrad, October 1, 2010, 04:18 AM PDT

Charles, About your point #2. The 99% identity is indeed the literature result, but at the same time it can well serve as a hypothesis or supposition in a line of reasoning such as mine; why not? About your point #1. My simplified model of evenly spaced mismatches (ESM) is generous towards the side you defend (with honor). I will try to explain why. First some abbreviations: M=matches; U=un-matches (mismatches); min=minimum; max=maximum. If we want U max and M min in an xxBPM test, the best strategy is indeed to space all mismatches uniformly along the entire length of the coupled genomes, as I did in the ESM model. On the contrary, if all mismatches were concentrated somewhere in a single string we would have U min and M max. For instance, as an extreme case, if all mismatches were concatenated at the end of the genomes and the xxBPM test never hit there, M might reach even 100%, despite the fact that the genomes are not identical! In an ESM model of two genomes that differ by 1% (99% equal genomes) a 30BPM test has M=70%; M is min and U is max. And it is a consequence of this bias of an ESM that, as you rightly say, in less biased real conditions M might be higher than 70. The principle of my normalization consists in saying: in an ESM couple of 99% equal genomes M=70; the 30BPM test gives M=62 in a couple of X% similar genomes; what is the value of X (the normalized value of M)? The proportion is 99/70=X/62, then X=99*62/70=88. But in real conditions 70 could be higher, say 80. In this case X decreases to 77. In this sense I claim that the ESM model is generous towards those who like high human/chimp similarity. Despite that, my normalization, based on an abstract ESM model and applied to the real 30BPM test, gives 88% similarity.
niwrad, October 1, 2010, 03:26 AM PDT
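The 70% figure for the evenly spaced mismatches (ESM) model can be checked by brute force. The sketch below is only an illustration under the model's own assumption that a mismatch sits at every multiple of the spacing; the function name is made up.

```python
def esm_perfect_match_fraction(spacing: int = 100, pattern_len: int = 30) -> float:
    """Fraction of pattern start offsets whose window avoids every mismatch,
    when mismatches sit at positions 0, spacing, 2*spacing, ..."""
    clean = 0
    for start in range(spacing):  # one full period covers every distinct case
        # The pattern covers positions start .. start + pattern_len - 1.
        hits_mismatch = any(pos % spacing == 0
                            for pos in range(start, start + pattern_len))
        if not hits_mismatch:
            clean += 1
    return clean / spacing


m_one_percent = esm_perfect_match_fraction(spacing=100)  # genomes differing by 1%
m_two_percent = esm_perfect_match_fraction(spacing=50)   # genomes differing by 2%
```

Under this model the fraction of 30-base patterns with no mismatch is 0.70 for a spacing of 100 bases and 0.40 for a spacing of 50 bases.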

niwrad #25 Yes, evolutionists use the 99% to make their case, but creationism(s) need only embrace this number, as I said. It's fine from a common blueprint concept. Your work is very welcome in showing that the differences couldn't have come about by selection on mutation etc. This is a good angle in any case for demonstrating the unlikeliness of evolution by selection/mutation for physical change or growth.
Robert Byers, September 30, 2010, 11:38 PM PDT

From your post #56: “Given two supposed genomes that match 99% a 30BPM test gives 70% matches. Since the real test gave 62% my first idea to obtain a 30BPM value comparable to 99% is to apply the simple formula: 99×62/70 = 87.7%. In other words the multiplier coefficient that we must apply to the 62% is 99/70 = 1.41.” The 70% comes from your post #29, right? “In your hypothesis of two genomes that differ only 1% in average every 100 bases there is a mismatch. To simplify the scenario let’s imagine that these mismatches are uniformly distributed along the coupled genomes A and B, as the tags in a ruler. Now let’s consider a random 30 base pattern in A. In every range of 100 bases we have 70 successive positions where there are no mismatches followed by 30 positions where there are mismatches.” There are two flaws in that line of reasoning. The first is that mismatches are not evenly spaced every 100 bases in the two genomes you are comparing. There are cases where two or more mismatches fall within the range of a single pattern, and in those cases you will not get 70 successive positions where there are no mismatches followed by 30 positions where there are mismatches. This is why the binomial probability of a perfect match (no mismatch at all) is higher than 70%. In the calculator you linked, put 0.01 in the first box, 30 in the second and 1 in the third. The Cumulative Probability: P(X ≥ 1) is 0.26. So if we suppose that the two genomes match 99%, a 30BPM test will give 74% matches. This brings us to the second point. The 99% match value is from the literature; you have to estimate the % of matching between the two genomes from your own results (see my post #70). While a 0.58% difference may seem small, you should notice that the probability of having at least one mismatch changes considerably between a 99% match and a 98.42% match: 26% and 38% respectively. In the first case a 30BPM test will give 74% matches and in the second 62% matches.
CharlesJ, September 30, 2010, 07:48 PM PDT
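The binomial figures in this comment follow from a one-line formula: if each base mismatches independently with probability p, a 30-base pattern has at least one mismatch with probability 1 - (1 - p)^30. The Python sketch below assumes that independence; the function names are made up for illustration.

```python
def prob_at_least_one_mismatch(p: float, n: int = 30) -> float:
    """Chance that an n-base pattern contains one or more mismatches,
    assuming each base mismatches independently with probability p."""
    return 1.0 - (1.0 - p) ** n


def mismatch_rate_from_match_fraction(match_fraction: float, n: int = 30) -> float:
    """Invert the formula: per-base mismatch rate implied by the observed
    fraction of perfectly matching n-base patterns."""
    return 1.0 - match_fraction ** (1.0 / n)


at_99 = prob_at_least_one_mismatch(0.01)            # genomes matching 99%
at_9842 = prob_at_least_one_mismatch(0.0158)        # genomes matching 98.42%
implied_p = mismatch_rate_from_match_fraction(0.62)  # from the observed 62%
```

These evaluate to about 0.26, about 0.38 and about 0.0158 respectively, matching the numbers in the comment.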

Using the calculator you linked, with “Probability of success on a single trial = 0.0158”, “Number of trials = 30” and “Number of successes (x) = 1”, I get a “Cumulative Probability: P(X ≥ 1) = 0.3798”.
CharlesJ, September 30, 2010, 05:04 PM PDT

CharlesJ #68 To answer your specific question at the bottom, I used the binomial calculator found at: http://stattrek.com/Tables/Binomial.aspx I would take the "cumulative probability P(X ≥ 1)", which is 0.4289. Why don’t we agree here? It should be a simple matter of using the binomial formula, independently of any genomics question. It is likely that you use another calculator; however, all calculators should agree. You say that our methods are equivalent and I am glad of that, but my normalizing coefficient for the 30BPM is 1.41, which multiplied by 62% gives 87.7% similarity, less than your 98.42%. It seems to me that we aren’t properly converging, unfortunately. Probably I am missing something. I will try to find out what tomorrow. Bye.
niwrad, September 30, 2010, 02:25 PM PDT
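For readers following the numbers, the two cumulative probabilities quoted in this exchange (0.3798 and 0.4289) can both be reproduced by summing the binomial terms directly. The sketch below is an illustration only, with made-up function names; it suggests, without being certain, that the disagreement may come from the two values of p that appear in the thread, 0.0158 and 0.0185.

```python
from math import comb


def binom_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k mismatches among n independent bases."""
    return comb(n, k) * p ** k * (1.0 - p) ** (n - k)


def at_least_one(n: int, p: float) -> float:
    """Probability of one or more mismatches, summed term by term."""
    return sum(binom_pmf(k, n, p) for k in range(1, n + 1))


with_0158 = at_least_one(30, 0.0158)  # about 0.3798
with_0185 = at_least_one(30, 0.0185)  # about 0.4289
exactly_one = binom_pmf(1, 30, 0.0158)  # smaller than with_0158
```

The last line also shows that "exactly 1 mismatch" and "at least 1 mismatch" are different quantities, which is the distinction discussed elsewhere in the thread.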

About the first point, I think we are saying the same thing. If you use a binomial probability calculator, what is the result you obtain for the probability of having at least one mismatch with N=30, k=1 and p=0.0185? (Note: The calculator I'm using gives me 3 answers: “P: exactly 1 out of 30”, “P: 1 or fewer out of 30” and “P: 1 or more out of 30”. It is the third one that I think should be used.) About the second point, I took the time to read your post carefully and I think I missed some points the first time I tried to explain the differences between our normalization ratios. I also noticed I made a mistake in my formulation when I said: “you have to divide the % of similarity by 24 in a 30-BPM analysis, by 30 in a 40-BPM analysis and by 35 in a 50-BPM analysis”. What I meant to say is that you have to divide the % of patterns that do not score a perfect match by 24 in a 30-BPM analysis, by 30 in a 40-BPM analysis and by 35 in a 50-BPM analysis. This should give you the 1.58 I was talking about (i.e.: 38/24=1.58). Sorry for the confusion. In essence both our methods are equivalent. You are saying that you have to multiply 62 by 1.41 and I say we have to divide 38 by 24. While the logic is equivalent, we are not using the same estimate for the expected number of mismatches in the genome. I’m using 98.42 (100 minus 1.58; see my post #60 for details on the calculation) and you are using 99%. You should also recalculate the probability of having a perfect match using the binomial probability calculator with N=30, k=1 and p=0.0158. You have to take 1 minus the probability of having at least one mismatch (1 - 0.3798 = ~0.62). If you recalculate your ratio with those new values you get: 98.42 / 62 = 1.58. So if you want to compare both our ratios, you can say that (62 * 1.58) = (100 - (38 / 24)) = ~98. Of course, since we are using rounded numbers, there is a slight difference in the numbers, but that should be expected. Yours is also more direct than mine since you are working directly on the % of similarity instead of working on the % of patterns without a perfect match, but it’s still essentially the same result. The most important point we have to be sure to agree on is the calculation of probabilities. So I’ll ask again: What is the result you obtain for the probability of having at least one mismatch when you use N=30, k=1 and p=0.0185? My answer is: 0.3798. What is yours?
CharlesJ, September 30, 2010, 01:12 PM PDT

CharlesJ #65, 66 This discussion is interesting. Sorry if I insist, Charles. It is true that to have at least 1 mismatch means any number of mismatches equal to or less than 30 and greater than 0. But in my opinion the problem is that when we obtain 0.0158 by means of the binomial formula (as a function of n=30 and k=1) we are calculating the probability of having exactly 1 mismatch when we should deal with the probability of having at least 1 mismatch (the binomial formula gives the probability that the event will happen exactly X times in N trials). It is true that the probability of 6 or more mismatches in a 30 base pattern is small, but what about the probabilities of 5, 4, 3, 2 mismatches? Are they really negligible? Moreover there is the problem of my normalization, which gives different coefficients. So far I haven't understood where it is wrong. I admit that my method of normalization is simpler than yours, but where is it wrong? Really I would prefer to have two different normalizations giving the same results! Unfortunately nobody shows me where mine is wrong, and at the same time I have the above doubt about yours. Oh my.
niwrad, September 30, 2010, 12:06 PM PDT

By the way, if you take the probability of getting exactly 1 mismatch + the probability of having exactly 2 mismatches … + the probability of having exactly n mismatches, you will get the same result as the probability of getting at least 1 mismatch. For a 30-base pattern, you will notice that the probability of having 6 or more mismatches is so small that you can dismiss it without significantly influencing your results.
CharlesJ, September 30, 2010, 09:24 AM PDT
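
The identity described in this comment can be checked numerically. The following is a minimal sketch, assuming mismatches occur independently at each base (the binomial model used in this thread) and taking N = 30 and p = 0.0158 from the discussion; the helper name `binom_pmf` is illustrative and not part of anyone's program.

```python
from math import comb


def binom_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k mismatches among n independent bases."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


N = 30       # pattern length discussed in the thread
P = 0.0158   # per-base mismatch probability quoted in the thread

# "At least 1 mismatch" computed two ways: as the complement of zero
# mismatches, and as the sum of the "exactly k" terms for k = 1..N.
p_at_least_one = 1 - binom_pmf(0, N, P)
p_sum_exact = sum(binom_pmf(k, N, P) for k in range(1, N + 1))

# Tail that the comment says can be neglected.
p_six_or_more = sum(binom_pmf(k, N, P) for k in range(6, N + 1))

print(f"P(at least 1)            = {p_at_least_one:.4f}")
print(f"sum of P(exactly k >= 1) = {p_sum_exact:.4f}")
print(f"P(6 or more)             = {p_six_or_more:.2e}")
```

Under these assumptions the two ways of computing "at least 1" agree, and the tail for 6 or more mismatches is several orders of magnitude smaller than the other terms.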

You are right to mention that the probability of having at least 1 mismatch is different from the probability of having exactly 1 mismatch. In our case, it's the probability of having at least 1 mismatch that should be used. The reason is that if a pattern has 1 mismatch, it won’t be considered a perfect match. The same goes if there are 2 mismatches in the pattern, 3, 4, etc. When using the probability of having at least 1 mismatch, you account for every number of mismatches, not just 1.
CharlesJ, September 30, 2010, 09:19 AM PDT

CharlesJ #59, 60, 61, 62 Thank you very much for the detailed explanations of your normalization method. Your idea of modeling the statistics with a binomial probability distribution is excellent and can surely be a valid method of normalization (there can be other methods to obtain the same result, though). For now there is a single thing about which I am not sure in your mathematical analysis (or which I haven’t understood out of my ignorance). You say: "I changed the value of p until the probability of having *at least* 1 mismatch was as close to 0.38 as possible: 0.0158 seemed good enough". Are you sure that this probability is not that of having *exactly* 1 mismatch? Anyway, I will continue to study your method and will comment on it as soon as possible.
niwrad, September 30, 2010, 08:54 AM PDT

niwrad #56 "look around to improve the hardware for increasing the processing power, which in this job is important." Don't waste your time improving the hardware, because there is no information inside DNA. It seems so to your eyes, but I assure you that you are looking in the wrong place. Better to seek God directly. I assure you that the correct answer is there, and it is so evident that nobody has been able to see it. God is with you, Obriton (Silav) CL&J A.
Obriton, September 30, 2010, 08:14 AM PDT

After reading my last post again, I realized that the correction for the % of insertions/deletions using the probability of having 2 or more mismatches in a pattern is incorrect. It is possible for the 2 mismatches to be on the same side of the pattern, in which case they would not score as an insertion/deletion. This would be the case ~25% of the time. You should therefore multiply the % of insertions/deletions by 0.94 (1 – 0.08 × 0.75) to make a better correction.
CharlesJ, September 30, 2010, 07:22 AM PDT

I think I should mention that while I’m being critical of your analysis, I do think that your results are valid (it’s the conclusion I disagree with). And I do think your algorithm could be useful for estimating the similarity between any two genomes for which we do not know the exact value, using minimal computational resources. We would need to apply the correction I mentioned to get a value that could be compared with other results in the literature, though. The principal weakness of your algorithm is that we can only compare closely related genomes, as the number of patterns with a perfect match will drop to 0 quite fast as we analyze more distant genomes. This could be corrected in part by reducing the size of the pattern, but that could increase the number of false positives (i.e., a 2-base pattern would give many positive results, almost 100% of them being useless in this kind of analysis). Also, it would be interesting to have more information on the patterns that are considered mismatches, especially for the deletion/insertion events. One simple way to do this could be to remove a base at one extremity of a pattern that did not score a perfect match and rerun that pattern through the program to see if you get a perfect match. If the answer is no, you remove another base and rerun it, until you have removed 15 bases or you have a perfect match. If you don’t have a perfect match after removing 15 bases, you take the full pattern again, start removing bases from the other extremity, and run those new patterns through the program until you get a perfect match or you have removed 15 bases. If you still don’t have a perfect match after those two analyses, the most likely explanation is that the pattern is inside a deletion/insertion. If you do this on every pattern that did not score a perfect match, you’ll get a pretty good estimate of the number of deletion/insertion events.
The number of patterns having been reduced in the first round, this should also help you reduce the calculation time. You can even make an extra correction taking into account the fact that patterns with 2 or more mismatches will score as deletions using this new algorithm. Since the probability of having 2 or more mismatches in a pattern is ~8%, you could simply multiply the % of deletion/insertion events by 0.92 (1 – 0.08) to get a better estimate.
CharlesJ, September 30, 2010, 07:05 AM PDT
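
The trimming procedure proposed in this comment could look roughly like the sketch below. It is only an illustration: the function name `classify_pattern` and its return labels are hypothetical, and Python's substring test (`in`) stands in for whatever exact-match search the original program performs against the second genome.

```python
def classify_pattern(pattern: str, genome: str, max_trim: int = 15) -> str:
    """Classify one pattern against a target genome by progressive trimming.

    Returns "perfect", "trimmed-right", "trimmed-left" or "indel-candidate".
    """
    if pattern in genome:
        return "perfect"

    # First pass: remove bases one at a time from the right end.
    for trim in range(1, max_trim + 1):
        if pattern[: len(pattern) - trim] in genome:
            return "trimmed-right"

    # Second pass: start again from the full pattern, trim the left end.
    for trim in range(1, max_trim + 1):
        if pattern[trim:] in genome:
            return "trimmed-left"

    # Neither pass recovered a match: per the comment, the pattern most
    # likely sits inside an insertion/deletion.
    return "indel-candidate"


# Toy data: "N" is used as a stand-in for a mismatching base.
toy_genome = "ACGTTGCAAGGCTTAACCGGATATCGCGTACGATCGTAGCTAGGCTAACGTTAGC"
pattern = toy_genome[10:40]  # a 30-base pattern taken from the toy genome

print(classify_pattern(pattern, toy_genome))
print(classify_pattern(pattern[:-1] + "N", toy_genome))
print(classify_pattern("N" + pattern[1:], toy_genome))
print(classify_pattern("N" + pattern[1:-1] + "N", toy_genome))
```

As the comment notes, a pattern with mismatches near both ends is also reported as an indel candidate by this procedure, which is why a correction factor is suggested afterwards.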

Since the goal of your study is to make an unbiased estimate of the similarity between the human and chimp genomes, we cannot use numbers that are in the literature (at least not directly). And this is where our corrections differ. Your correction takes for granted that the number you have to use for the correction is 99%, based on the literature, I guess. My correction only uses numbers available in your analysis. So let’s start with 38%. This is the percentage of patterns that contain at least one mismatch. From this number, we can estimate the per-base mismatch rate in the genome by asking this question: if I know that on average 38% of my 30-base patterns contain at least one mismatch, what is the probability of a mismatch occurring at any given base? I did this using a binomial probability calculator with k = 1 (meaning at least 1 mismatch) and N = 30 (the size of the pattern). To save some time I started with p = 0.01 (from the literature) and checked what the probability of getting at least one event was. The calculator gave me several results, and one of them was the probability of having at least 1 mismatch: 0.26 (lower than the 0.38 from your results). I changed the value of p until the probability of having at least 1 mismatch was as close to 0.38 as possible: 0.0158 seemed good enough. I know this is not very elegant, but I felt a bit lazy and I was able to save some time using that strategy. (Note that while I used the number in the literature to help me guess the value, the result is independent of that value and is based only on your results.) At this point, we do not really need the correction coefficient anymore, since we have the result we were looking for: 0.0158. Now that we have that value (p), we can predict how your algorithm would behave if we used different-sized patterns, simply by changing the value of N.
I calculated the coefficient for my post #47 to skip the details of the probability calculation, hoping to make my post easier to understand. The numbers I gave in that post are based on pretty much the same logic, and I can give you the details if you want.
CharlesJ, September 30, 2010, 07:01 AM PDT
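
The trial-and-error search described in this comment can also be done in closed form. The sketch below assumes independent mismatches at each base, so that P(at least 1 mismatch in N bases) = 1 - (1 - p)^N; the function names are illustrative, and the inputs 0.38 and N = 30 are the figures quoted in the comment.

```python
def prob_any_mismatch(p: float, n: int) -> float:
    """P(at least one mismatch) in an n-base pattern, per-base rate p."""
    return 1.0 - (1.0 - p) ** n


def solve_per_base_rate(target: float, n: int) -> float:
    """Invert 1 - (1 - p)^n = target for p."""
    return 1.0 - (1.0 - target) ** (1.0 / n)


# Starting guess from the literature, as in the comment.
print(f"p = 0.01 -> P(at least 1) = {prob_any_mismatch(0.01, 30):.2f}")

# Per-base rate implied by 38% of 30-base patterns failing to match.
p_est = solve_per_base_rate(0.38, 30)
print(f"estimated p = {p_est:.4f}")

# Predicted share of non-matching patterns for other pattern lengths.
for n in (30, 40, 50):
    print(f"N = {n}: P(at least 1) = {prob_any_mismatch(p_est, n):.3f}")
```

Solving directly avoids the manual adjustment of p, and the last loop shows how the same p can be reused for other pattern sizes by changing N.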

The reason we have different coefficients is probably that we are using different assumptions in our calculations. I’ll start by explaining the principle of my calculation of the correction coefficient, then I’ll give a detailed example and try to point out where it differs from yours. Like I said before, your algorithm brings an extra variable that can affect the results of the analysis: the length of the pattern. Since the results in the literature do not depend on this variable, it is normal that you have to make an extra calculation to remove its effect in order to compare your results with the literature. I know I’m repeating myself here, but it is very important to keep that fact in mind to understand the normalization process. The goal of my normalization is to estimate the average number of mismatches we should expect in the patterns that are not perfect matches (38% of the patterns in your study using 30-BPM). If we divide the number of bases in the pattern by the average number of mismatched bases expected (inside a pattern that is not a perfect match), we get the correction coefficient. In other words, while your algorithm gives us the expected number of x-base patterns that would not give a perfect match in the genome, my correction gives us the expected number of bases that are mismatches inside the patterns that are not perfect matches. That is the principle of my correction; I’ll give the details of the calculations and explain why it differs from yours in my next post.
CharlesJ, September 30, 2010, 05:42 AM PDT
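
One way to read the principle stated in this comment is sketched below. It is an interpretation rather than the commenter's actual calculation: it assumes a binomial model with per-base mismatch rate p = 0.0158 (the value quoted elsewhere in this thread), takes the mean number of mismatches conditional on at least one mismatch, and divides the pattern length by it. The function name is hypothetical.

```python
def correction_coefficient(n: int, p: float) -> float:
    """Pattern length divided by E[mismatches | at least one mismatch]."""
    p_any = 1.0 - (1.0 - p) ** n        # P(pattern is not a perfect match)
    mean_given_any = n * p / p_any      # conditional mean number of mismatches
    return n / mean_given_any           # simplifies to p_any / p


P = 0.0158  # per-base mismatch rate quoted in the thread

for n in (30, 40, 50):
    print(f"{n}-BPM: coefficient = {correction_coefficient(n, P):.2f}")
```

Under these assumptions the coefficients round to the divisors 24, 30 and 35 quoted elsewhere in the thread for 30-, 40- and 50-base patterns. Because the expression simplifies to P(at least 1 mismatch) / p, dividing the 38% of non-matching patterns by the 30-BPM coefficient returns roughly 1.58%, the per-base mismatch rate.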

A 98-percent DNA similarity between man and chimp was cited on today's Rush Limbaugh show.
tribune7, September 29, 2010, 07:01 PM PDT

AMW@55 The browser just sees one block of HTML, which it parses. When you submit your comment it gets checked by Wordpress and all disallowed tags are removed, so you can't do malicious things like embed script tags or load remote content via an iframe, etc. Wordpress will normally discard unmatched but allowed tags, e.g. italics, which is why I can't close the tag now. In this case it seems you attempted to close the tag but used a trailing slash instead of a leading slash, so you didn't really close it but managed to get it past Wordpress. I may submit the bug you discovered to them so they can patch it.
andrewjg, September 29, 2010, 12:50 PM PDT

CJL2718 #51 Thank you for the reference. It is likely I will use such data in the future. In the meantime I am doing two things: (1) collecting the useful suggestions and ideas from the commenters in this forum; (2) looking around to improve the hardware to increase the processing power, which in this job is important. CharlesJ #47. Forgive me if I don’t understand what you mean in detail. Nevertheless, I agree with you that the results of the 30BPM test are not directly comparable to those in the genomics literature. The 62% 30BPM similarity is not directly comparable with the 99% identity. We need a corrective coefficient. I also agree with you that this corrective coefficient differs depending on whether we do a 30BPM, 40BPM or 50BPM test. To understand this I argue according to what I did in #29. Given two hypothetical genomes that match 99%, a 30BPM test gives 70% matches. Since the real test gave 62%, my first idea for obtaining a 30BPM value comparable to 99% is to apply the simple formula: 99x62/70 = 87.7%. In other words, the multiplier coefficient that we must apply to the 62% is 99/70 = 1.41. Of course, in a 40BPM test the coefficient is different, because in genomes that match 99% a 40BPM test gives 60% matches. In this case the coefficient is 99/60 = 1.65. In a 50BPM test the coefficient would be 2, and so on. This seems reasonable, because the longer the patterns searched for, the fewer the matches. As a consequence, the coefficient values increase with the length of the patterns. These multipliers provide us with a way to normalize, so to speak, the XXBPM values and make them comparable to the values obtained with other methods of comparison. The problem that remains is that, looking at the numbers, my normalization seems to differ from yours. It would be good if we could arrive at a shared, convincing normalization.
niwrad, September 29, 2010, 12:10 PM PDT
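
The multiplier arithmetic in this comment can be written out as follows. This is only a sketch of the calculation as described: the function names are illustrative, and the expected match percentages (70% for 30BPM, 60% for 40BPM at 99% identity) are the figures stated in the comment, not values derived here.

```python
def multiplier(expected_match_pct: float, reference_pct: float = 99.0) -> float:
    """Coefficient that maps an XXBPM match % onto the reference scale."""
    return reference_pct / expected_match_pct


def normalize(observed_match_pct: float, expected_match_pct: float,
              reference_pct: float = 99.0) -> float:
    """Scale an observed XXBPM match % by the multiplier above."""
    return observed_match_pct * multiplier(expected_match_pct, reference_pct)


# Expected match % at 99% identity, as quoted in the comment.
expected_at_99 = {30: 70.0, 40: 60.0}

for length, expected in expected_at_99.items():
    print(f"{length}BPM multiplier = {multiplier(expected):.2f}")

# Observed 62% in the 30BPM test, normalized with the 30BPM multiplier.
print(f"normalized 30BPM value = {normalize(62.0, expected_at_99[30]):.1f}%")
```

Note that this normalization fixes the reference identity at 99% in advance, which is the assumption the other normalization discussed in the thread tries to avoid.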

Very odd. I've never heard of html tags in one comment flowing through to another.
AMW, September 29, 2010, 12:03 PM PDT

AMW@50 Okay. Things are showing up as italics because you opened an italics tag at "ad nauseum" @37 and then tried to close it. But instead of doing </i> you did <i />. Somehow this got through Wordpress's validation and left an open tag. Unfortunately I can't close it because Wordpress filters the unmatched close tag in my submission. A note to the site admin / moderator: the template the page uses includes the Google ad script in each post, so it can be included hundreds of times. Surely that is not right. Have a look at the page source to see what I mean. I think long pages would load a lot faster if it were fixed.
andrewjg, September 29, 2010, 11:51 AM PDT

The paper from Nature that you mention uses a per-base comparison. This means, roughly, that if you compare one base from each genome, about 98% of the comparisons will be a match. Now, if you compare a sequence of two bases, about 96% (98% x 98%) will match. In the case of your test, you compare sequences of 30. So, if there is a 2% difference between the genomes, you would expect your test to return 55% [(98%)^30] (your average is higher). Your simple statistical test actually shows that 98% similarity is too low. I don't think you should cite papers without addressing what they actually say; it's misleading.
vsakko, September 29, 2010, 11:37 AM PDT
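
The arithmetic in this comment follows from treating each base as an independent comparison, so a run of N bases matches with probability (per-base identity)^N. A minimal sketch under that independence assumption, with an illustrative function name:

```python
def expected_perfect_match(identity: float, length: int) -> float:
    """Expected fraction of `length`-base patterns that match exactly,
    assuming each base matches independently with probability `identity`."""
    return identity ** length


for identity in (0.98, 0.99):
    for length in (1, 2, 30):
        share = expected_perfect_match(identity, length)
        print(f"identity {identity:.0%}, {length:>2}-base pattern: {share:.1%}")
```

Real genomes do not satisfy the independence assumption exactly (mismatches and indels cluster), so these figures are a rough expectation rather than a prediction of what the 30-base test must return.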
