“The phylogeny of the animals is currently incompletely resolved and has undergone major reorganisations over the past few years, mainly as a result of analyses of rRNA gene sequences”…
Large-scale sequencing and the new animal phylogeny
Although comparisons of gene sequences have revolutionised our understanding of the animal phylogenetic tree, it has become clear that, to avoid errors in tree reconstruction, a large number of genes from many species must be considered: too few genes and stochastic errors predominate, too few taxa and systematic errors appear. We argue here that, to gather many sequences from many taxa, the best use of resources is to sequence a small number of expressed sequence tags (1000–5000 per species) from as many taxa as possible. This approach counters both sources of error, gives the best hope of a well-resolved phylogeny of the animals and will act as a central resource for a carefully targeted genome sequencing programme.
12/23/06 Update: Here’s another article that describes the problem in even more stark terms:
Quotes of note (my emphasis and brackets):
“Here we discuss how and why certain critical parts of the TOL [Tree of Life] may be difficult to resolve, regardless of the quantity of conventional data available. We do not mean this essay to be a comprehensive review of molecular systematics. Rather, we have focused on the emerging evidence from genome-scale studies on several branches of the TOL that sharply contrasts with viewpoints – such as that in the opening quotation [a quote by Dawkins that implies we’ll get the TOL correct eventually] – which imply that the assembly of all branches of the TOL will simply be a matter of data collection. We view this difficulty in obtaining full resolution of particular clades – when given substantial data – as both biologically informative and a pressing methodological challenge. The recurring discovery of persistently unresolved clades (bushes) should force a re-evaluation of several widely held assumptions of molecular systematics. Now, as the field is transformed from a data-limited to an analysis-limited discipline, it is an opportune time to do so.”
Three observations generally hold true across metazoan datasets that indicate the pervasive influence of homoplasy at these evolutionary depths. First, a large fraction of single genes produce phylogenies of poor quality. For example, Wolf and colleagues [9] omitted 35% of single genes from their data matrix, because those genes produced phylogenies at odds with conventional wisdom (Figure 2D). Second, in all studies, a large fraction of characters – genes, PICs or RGCs – disagree with the optimal phylogeny, indicating the existence of serious conflict in the DNA record. For example, the majority of PICs conflict with the optimal topology in the Dopazo and Dopazo study [10]. Third, the conflict among these and other studies in metazoan phylogenetics [11,12] is occurring at very “high” taxonomic levels – above or at the phylum level.
For instance, theory [34] and simulation analyses [8] predict that a small fraction of substitutions will be homoplastic by chance (about 2–5%, depending upon model assumptions and evolutionary distances). However, analysis of the elephant/sirenian/hyrax dataset and the coelacanth/lungfish/tetrapod dataset indicates that the actual level of homoplasy is ~10% of amino acid substitutions in the first case (178 homoplastic/1,743 total substitutions) and ~15% in the second case (588 homoplastic/3,800 total substitutions), several times greater than expected [8,34]. Similar high levels of homoplasy exist in datasets from other bushy clades [35] (unpublished data) and hold irrespective of analytical methodology [8].
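The arithmetic behind those percentages is easy to check. The short Python sketch below is my own illustration, not anything from the article: it recomputes the observed homoplasy fractions from the counts quoted above and compares them with the 2–5% chance expectation. The function names and the choice to report the excess as a fold range are assumptions made for illustration.

```python
# Counts quoted in the passage above: (homoplastic, total) amino acid substitutions.
DATASETS = {
    "elephant/sirenian/hyrax": (178, 1743),
    "coelacanth/lungfish/tetrapod": (588, 3800),
}

# Chance expectation quoted in the passage (about 2-5%), as fractions.
EXPECTED_LOW, EXPECTED_HIGH = 0.02, 0.05


def homoplasy_fraction(homoplastic, total):
    """Fraction of substitutions that are homoplastic."""
    return homoplastic / total


def fold_over_expected(observed):
    """Return (min_fold, max_fold): observed relative to the 5% and 2% expectations."""
    return observed / EXPECTED_HIGH, observed / EXPECTED_LOW


if __name__ == "__main__":
    for name, (homoplastic, total) in DATASETS.items():
        observed = homoplasy_fraction(homoplastic, total)
        min_fold, max_fold = fold_over_expected(observed)
        print(f"{name}: {observed:.1%} observed, "
              f"{min_fold:.1f}x to {max_fold:.1f}x the chance expectation")
```

Both datasets come out at roughly two or more times the upper end of the expected range, which is consistent with the "several times greater than expected" wording in the quote.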
“Although it may be heresy to say so, it could be argued that knowing that strikingly different groups form a clade and that the time spans between the branching of these groups must have been very short, makes the knowledge of the branching order among groups potentially a secondary concern.”