New Dembski-Marks Paper

_{William Dembski

December 7, 2009

Intelligent Design}

Share: Facebook; Twitter; LinkedIn; Flipboard; Print; Email

William A. Dembski and Robert J. Marks II, “Bernoulli’s Principle of Insufficient Reason and Conservation of Information in Computer Search,” Proceedings of the 2009 IEEE International Conference on Systems, Man, and Cybernetics. San Antonio, TX, USA – October 2009, pp. 2647-2652.

Abstract: Conservation of information (COI) popularized by the no free lunch theorem is a great leveler of search algorithms, showing that on average no search outperforms any other. Yet in practice some searches appear to outperform others. In consequence, some have questioned the significance of COI to the performance of search algorithms. An underlying foundation of COI is Bernoulli’s Principle of Insufficient Reason1(PrOIR) which imposes of a uniform distribution on a search space in the absence of all prior knowledge about the search target or the search space structure. The assumption is conserved under mapping. If the probability of finding a target in a search space is p, then the problem of finding the target in any subset of the search space is p. More generally, all some-to-many mappings of a uniform search space result in a new search space where the chance of doing better than p is 50-50. Consequently the chance of doing worse is 50-50. This result can be viewed as a confirming property of COI. To properly assess the significance of the COI for search, one must completely identify the precise sources of information that affect search performance. This discussion leads to resolution of the seeming conflict between COI and the observation that some search algorithms perform well on a large class of problems.

[ IEEE | pdf ]

Comments

Dear Dr. Dembski, you didn't define what you understand by a these some-to-many mapping, and it's not a common term. Could you do so, please? For me, they seem to be just reverse images under a mapping from Ω' to Ω....DiEb_{December 11, 2009
December
12
Dec
11
11
2009
04:43 PM
4
04
43
PM
PDT}

Mystic[79], Good point. It's even an uncountable space if you choose probabilities in [0,1]. This error has also been pointed out by Tom English.Prof_P.Olofsson_{December 11, 2009
December
12
Dec
11
11
2009
07:32 AM
7
07
32
AM
PDT}

This may have been lost above due to the long moderation delay. (Why are Zachriel's comments being moderated?)
Dembski & Marks: Prior knowledge about the smoothness of a search landscape required for gradient based hill-climbing, is not only common but is also vital to the success of some search optimizations. Such procedures, however, are of little use when searching to find a sequence of, say, 7 letters from a 26-letter alphabet to form a word that will pass successfully through a spell checker …
Does that mean an evolutionary algorithm can't navigate the wordscape of the dictionary from shorter to longer words because there are no hills to climb?Zachriel_{December 10, 2009
December
12
Dec
10
10
2009
04:44 AM
4
04
44
AM
PDT}

My first application of evolutionary programming was to optimization of the connection weights of recurrent neural nets. Now, I could have exploited the mathematical analysis that yielded partial derivatives of the objective function for the weights. But I knew also that the computation of the partial derivatives was very slow, and I had to believe that "quick and dirty" EP would go faster, by the clock on the wall, than "neat and clean" gradient descent. And I was right. Dembski and Marks will say that the gradient descent algorithm did better than EP because it required fewer trials than EP to obtain an acceptable solution. But I say that EP outperformed gradient descent because it found an acceptable solution with much less computational work than gradient descent did. The "information gain" in knowing the gradient was not worth what it cost.Mystic_{December 10, 2009
December
12
Dec
10
10
2009
12:49 AM
12
12
49
AM
PDT}

The argument of Appendix B is invalid. For a fixed representation of randomized algorithms (e.g., as probabilistic Turing machines), the space of search algorithms, Omega_2, is countably infinite, not finite. Consider a search algorithm that obtains i.i.d. uniform bits from a random source, and simulates a toss of a biased coin (probability of heads is p) to decide which of two deterministic search algorithms to run on a search problem. There are at least as many randomized algorithms for simulating the coin toss as there are rational probabilities p = n / d, and the set of rational probabilities is countably infinite. Note also that the algorithms must obtain |d| random bits in the worst case. Thus there is no upper bound on running time for algorithms of this form. Vanishingly few random search processes have exact implementations as randomized search algorithms. And many randomized search algorithms require impractical time and space. When we observe a success for a search algorithm, the algorithm is much more likely to be small and fast than big and slow. It is very important to consider this observational bias of ours.Mystic_{December 10, 2009
December
12
Dec
10
10
2009
12:16 AM
12
12
16
AM
PDT}

I'll make one more comment on the paper, regarding Appendix B: The LCI actually follows immediately from the second sentence of the appendix. That sentence says that blindly finding the target has the same probability as finding the target with a blindly selected algorithm. From this it follows that blindly finding the target is at least as probable as finding the target with a blindly selected algorithm AND the selected algorithm being as good as it is. That's the LCI.R0b_{December 9, 2009
December
12
Dec
9
09
2009
03:46 PM
3
03
46
PM
PDT}

GradStudent[75], Sorry, that's just way too much for me to read right now. If they assume maximum entropy based on PrOIR alone, and draw conclusions from that assumption alone, without analyzing any data or invoking any other information, sure, it's an application. I doubt that's what is done though. I agree, we should probably quit. Let me just finish like I started and quote myself: Consider D&M’s Formula (1). On the one hand they claim that it assumes the PrOIR due to “no prior knowledge.” But before Formula (1) they state that the deck is well shuffled which is a lot of prior knowledge and precisely the prior knowledge that warrants the uniform distribution. One must not confuse prior knowledge of the distribution of cards (which we do have) with prior knowledge of the location of the ace of spades (which we don’t have).Prof_P.Olofsson_{December 9, 2009
December
12
Dec
9
09
2009
01:02 PM
1
01
02
PM
PDT}

UP[72]. No it's not the same thing and we're not nitpicking. The NFLT says that if the distribution is uniform, then certain conclusions hold. If you can verify uniformity, the NFLT applies. It's got nothing to do with the PrOIR, just like your other examples don't. Opinion polls obviously has nothing at all to do with PrOIR as they are based on collecting data from which one gets information. You are right, I'd rather be painting my nails!Prof_P.Olofsson_{December 9, 2009
December
12
Dec
9
09
2009
12:55 PM
12
12
55
PM
PDT}

Hi Prof P.Olofsson, Well, thanks for the discussion. It seems this thread has been booted from the front page and so maybe we should wrap it up. Regarding applying the PrOIR repeatedly at each level, actually, I don't think that is necessary to assume PrOIR at each level if one has a lot of levels. It seems that all you need is a prior that has positive probability over the whole space. But the key idea is that each prior adds a little more variance, and with many levels, that variance could drown out everything else and thus make the lowest prior look uniform. Regarding the "guess my prior" game, if you were absolutely *forced* to make a choice based on no knowledge, my guess is that you would guess uniform rather than placing high probability on a subset of the space. Of course I just made up these arguments offhand and there could be serious flaws in them. Regarding applications, there is a lot of work in maxent modeling as I suggested earlier. They are assuming maximum entropy and seem to be getting great results: http://homepages.inf.ed.ac.uk/lzhang10/maxent.html Is this what you mean by application of PrOIR?GradStudent_{December 9, 2009
December
12
Dec
9
09
2009
12:00 PM
12
12
00
PM
PDT}

#73 Lets go do something useful. Bake a cake. Shop. Do our nails. Make up Richard Dawkins jokes. Read more Dembski books. Is this in decreasing order of usefulness?Mark Frank_{December 9, 2009
December
12
Dec
9
09
2009
11:58 AM
11
11
58
AM
PDT}

Wow. Blogging like this is for people with lots of time. Lets go do something useful. Bake a cake. Shop. Do our nails. Make up Richard Dawkins jokes. Read more Dembski books.Uvula Presley_{December 9, 2009
December
12
Dec
9
09
2009
11:38 AM
11
11
38
AM
PDT}

"The (original) NFL Theorem does not rely on the PrOIR. It assumes a uniform distribution." This is the same thing. Would you agree if I said the NFL "assumes" the PrOIR? Are we nicking at pics? It sure seems like the Drake equation is applying (or assuming) the PrOIR. "Polls" as in Obama's approval rating. http://www.rasmussenreports.com/public_content/politics/obama_administration/daily_presidential_tracking_poll And Burg's algorithm remains solid.Uvula Presley_{December 9, 2009
December
12
Dec
9
09
2009
11:34 AM
11
11
34
AM
PDT}

GradStudent[67], In your hierachy, aren't you just applying the PrOIR repeatedly as you go upward? As for the "guess my prior," there is no way I can compute and minimize my expected loss without having a distribution for your prior q. These are nice philosophical discussions and obviously you know what you're talking about. I'm still highly skeptical when it comes to applications though.Prof_P.Olofsson_{December 9, 2009
December
12
Dec
9
09
2009
11:31 AM
11
11
31
AM
PDT}

UvulaPresley [65,66,68], The (original) NFL Theorem does not rely on the PrOIR. It assumes a uniform distribution. In any application, that assumption needs to be verified. I don't know anything about the Drake equation but from reading online, it seems that it is based on several estimated parameters, thus not relying on the PrOIR. As for the "Polling surveys where there is no prior information" I don't know what you mean. My challenge was regarding an application of the PrOIR.Prof_P.Olofsson_{December 9, 2009
December
12
Dec
9
09
2009
11:14 AM
11
11
14
AM
PDT}

JT @ 48:
It seems you could have knowledge to specify a finite space without it dictating a non-uniform distribution of that space.
My point was that even with a uniform distribution, there is still problem-specific information. Take, for instance, Dawkins' WEASEL algorithm, in which the search space consists of all sequences of 28 English uppercase letters. The baseline search is uniformly random, i.e. it has no information to guide it to a particular sequence of 28 English uppercase letters. But the baseline search does know that the target has 28 English uppercase letters. Without that knowledge, the baseline search would take much longer.R0b_{December 9, 2009
December
12
Dec
9
09
2009
10:35 AM
10
10
35
AM
PDT}

61: “In fact, I’d challenge anybody to show me an application where the PrOIR is used.” Polling surveys where there is no prior information. Want more?Uvula Presley_{December 9, 2009
December
12
Dec
9
09
2009
10:25 AM
10
10
25
AM
PDT}

Prof Olofsson, I totally agree that it is non-controversial to use a reasonably flat prior to get a posterior because hopefully with enough data the likelihood would dominate and the effect of the prior would be negligible. But it seems there is still an intuitive argument for PrOIR even without any data. One argument is the hierarchy idea which I outlined above; one can just push the uncertainty all the way back up the hierarchy (since one is maximally uncertain about all those hidden variables anyways), and as the # of levels goes to infinity, it is highly probable that the lowest prior on this hierarchy would come close to looking like a PrOIR (even before looking at any data). Another argument for PrOIR is it seems to be centered between all other possible priors. If I told you that I had a prior p over a discrete space in mind, and if I demanded that you provide a guess q of what that prior was (and you would pay me distance[p || q] for your error), it seems wise to choose a uniform p.GradStudent_{December 9, 2009
December
12
Dec
9
09
2009
10:22 AM
10
10
22
AM
PDT}

61: “In fact, I’d challenge anybody to show me an application where the PrOIR is used.” The Drake EquationUvula Presley_{December 9, 2009
December
12
Dec
9
09
2009
10:21 AM
10
10
21
AM
PDT}

61: “In fact, I’d challenge anybody to show me an application where the PrOIR is used.” Wolpert & Macready's No Free Lunch TheoremUvula Presley_{December 9, 2009
December
12
Dec
9
09
2009
10:20 AM
10
10
20
AM
PDT}

61: "In fact, I’d challenge anybody to show me an application where the PrOIR is used." Burg's algorithmUvula Presley_{December 9, 2009
December
12
Dec
9
09
2009
10:13 AM
10
10
13
AM
PDT}

GradStudent[62], No flaws, but when you talk about priors you do so in order to get to a posterior, after data collection. I have nothing against using a flat prior in a Bayeisan anlysis, but you can't draw any conclusions based on that prior. If D & M intended to do a Bayesian analysis, I'd have no problems with using PrOIR to get a prior (hey, a proir prior, that's cool!).Prof_P.Olofsson_{December 9, 2009
December
12
Dec
9
09
2009
09:43 AM
9
09
43
AM
PDT}

Hi Prof Olofsson, Thanks for the response. I guess we do know that all 52 cards are in the deck (if I am not mistaken in this example); we just don't know whether the deck has been well-shuffled or not. In that case, we can treat it as hidden and have a discrete distribution over the possible card permutations and then just put a relatively flat prior on that. And then a relatively flat prior on top of that prior, etc. I guess at some point in the hierarchy one should insert something like the PrOIR assumption. But if the hierarchy has many different levels, then it seems the influence of even a very informative prior at the top would be swamped by the addition of variance at each level, and once it gets down to the lower levels (e.g. the distribution over card permutations), it would look a lot like PrOIR. Thus it seems the idea of PrOIR is pretty reasonable. Any flaws with my thinking?GradStudent_{December 9, 2009
December
12
Dec
9
09
2009
09:24 AM
9
09
24
AM
PDT}

Mark Frank[57],
So there is no objective basis for PrOIR.
Indeed. And that's why it is never used in practice. Anytime a uniform distribution is assumed, it is because we have prior information, not because we don't. Coin flips, die rolls, and lottery drawings, warrant a uniform distribution by the laws of physics or by examination of data, not by referring to a 300-year-old "principle." In fact, I'd challenge anybody to show me an application where the PrOIR is used.Prof_P.Olofsson_{December 9, 2009
December
12
Dec
9
09
2009
09:16 AM
9
09
16
AM
PDT}

Mystic[55],
You wrote of tossing a coin to determine membership of a solution in the target
No, I wrote about the original NFL theorem, see my comment [28].Prof_P.Olofsson_{December 9, 2009
December
12
Dec
9
09
2009
09:09 AM
9
09
09
AM
PDT}

GradStudent[56], 1. Yes, they are the same. A uniform distribution on a finite set has maximum entropy over all distributions.Prof_P.Olofsson_{December 9, 2009
December
12
Dec
9
09
2009
09:04 AM
9
09
04
AM
PDT}

GradStudent[56], 3. The term "well shuffled" means that the deck has been shuffled enough so that each card is equally likely to be in any position. In other words, "well shuffled" is the colloquial term for "uniform distribution." If you shuffle well and draw the top card, shuffle well and draw the top card, and so on, in the long run you will get the ace of spades the fraction 1/52 of draws. Thus, in a well shuffled deck, you can say that the probability is 1/52 that the top card is the ace of spades. This is the frequentist interpretation of probability. If you don't know anything about the deck, what basis is there to assign a probability to the top card being the ace of spades?Prof_P.Olofsson_{December 9, 2009
December
12
Dec
9
09
2009
09:02 AM
9
09
02
AM
PDT}

#41 [R0b] To their credit, the Drs acknowledge some of the problems with the PrOIR, which they didn’t do in previous papers. Unfortunately, they do very little to mitigate the problems, leaving us to question whether their PrOIR-based measures are non-arbitrary and useful. On reflection I don't see that this paper does anything to mitigate the problems. All that equation 4 shows is that if you transform a discrete data space and apply PrOIR to the new data space you are equally likely to increase or decrease the probability of hitting a target. This doesn't change the fact that some transformations will increase the probability and others will decrease it. So there is no objective basis for PrOIR.Mark Frank_{December 9, 2009
December
12
Dec
9
09
2009
01:28 AM
1
01
28
AM
PDT}

Some questions for the more informed on this thread: 1. Is the idea of PrOIR essentially the idea of maximum entropy? Maxent modeling has been shown to be very useful in many contexts. 2. "Active Information" -- is this analogous to the well-known idea of "Information Gain"? 3. Question to Prof. Olofsson: Why should it matter that we assume a shuffled deck of cards? Isn't the result the same if we assume nothing about the "shuffleness" of the deck but just leave it hidden? 4. Comment to Allen MacNeill: Regarding your comment on the emergence of novelty just by neutral theories as opposed to some biological optimization strategy -- if one is not optimizing it seems one would just be adding unnecessary noise (variations) which wouldn't help at all other than to diversify the set of biological "particles". One could think of a strategy to interleave optimization and neutral drifting, but this is simply just another search strategy. So in short, how does optimization+drift escape the conclusions of Dembski's paper?GradStudent_{December 8, 2009
December
12
Dec
8
08
2009
11:13 PM
11
11
13
PM
PDT}

Prof. Olofsson: Bayesian is as Bayesian does. I was joking about the authors' assiduous avoidance of the term "Bayesian" in the body of the paper, as well as their failure to refer to "reference" [61], Data Analysis: A Bayesian Tutorial. (By the way, it seems that their self-reference [14] was scheduled to appear in the November 2008 issue of a journal that last released an issue in January 2008.) You wrote of tossing a coin to determine membership of a solution in the target. I had thought of rolling a fair die to determine the value of h(x), i.e., the assignment of a solution to description x. A "success" is h(x) in T. (Flipping a p-biased coin to determine success would be more direct.) The number of successes for M assignments is binomially distributed with mean pM. It bears mention that knowledge of the number of successes for a particular representation h is not exploitable in search in the space Omega-hat.Mystic_{December 8, 2009
December
12
Dec
8
08
2009
07:53 PM
7
07
53
PM
PDT}

Of course, I may have missed your point entirely.
JT: And I just wanted to point out that Prof Ollofson seems to be rejecting PrOIR outright:
According to the PrOIR you must assume that the deck is well shuffled which clearly makes no sense.
Such an assumption is useful only as a placeholder for testing. We might tentatively assume the deck is shuffled, then pull a few cards to determine whether our original hypothesis was plausible. But to *presume* the deck is shuffled is a sure way to lose your shirt.Zachriel_{December 8, 2009
December
12
Dec
8
08
2009
07:02 PM
7
07
02
PM
PDT}

1 2 3 Next

You must be logged in to post a comment.

Leave a Reply