Monday, February 1, 2010

A plea to theropod workers: Code the Taxa in Your Analyses

In the middle of writing the next installment of my alvarezsaur relationships series, I keep coming across a particular problem that's important enough to deserve its own post. I've often complained about workers leaving out conflicting characters (e.g. Sereno, 1999) or relevant taxa (e.g. Maryanska et al., 2002), but recently I've noticed a more insidious problem. Somebody will include a character and plenty of taxa, but then only code the character for a few of those taxa. Oh sure, I don't expect everyone to track down every obscure coding available, so I'd forgive you if you didn't code Avimimus as having a fused sternum or Adasaurus as having an unfused one. And there are certainly instances where a coding is controversial enough that I could agree that being conservative and coding an uncertainty is a valid choice. But not coding for the presence of a pubic symphysis in Archaeopteryx, or for fibular-tarsal contact in Allosaurus? That's just lazy. Now sure, you might say that coding Allosaurus for fibular-tarsal contact won't add anything to the analysis, since carnosaurs and taxa closely related to carnosaurs all have the contact, and hey, you coded Sinraptor for that character anyway. But that's assuming you have the phylogeny correct before you even run the analysis, which raises the question of why you're even going through the motions of testing relationships in the first place. In this particular case, I agree it's relatively harmless, but what about that same analysis (Choiniere et al.'s new Haplocheirus paper) where no birds with closed popliteal fossae were coded for that character? Instead, all pygostylians were incorrectly left unknown. This makes it so that derived alvarezsaurids are no longer even attracted to pygostylians in the matrix based on that character, which means that despite including the relevant taxa and character, you're not really testing that hypothesis.
It just breeds complacency and a false impression of consensus, leaving us unlikely to ever find unexpected relationships from our data. Recent papers which are guilty of this problem are the Smith et al. (2007) Cryolophosaurus analysis, the Xu et al. (2009) Limusaurus analysis and the Choiniere et al. (2010) Haplocheirus analysis. I know coding for so many taxa can be tedious, but these papers all gave the impression they performed impressively large analyses when they each actually took shortcuts and presented cladograms which aren't representative of the data they purport to include. As scientists we have a duty to present accurate data to the public. Virtually no one will bother looking through your matrix to make sure you did things correctly, and unfortunately that includes peer reviewers and editors. How else to explain how Choiniere et al. could code Rahonavis for the presence of a mandibular character, when anyone who studies theropods knows it doesn't preserve a skull? They just copied Makovicky et al.'s (2005) codings without thought, assuming they were correct. But the public sees your cladogram and (justifiably in my opinion) assumes you did the work to create it.
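To make the popliteal fossa point concrete, here is a minimal sketch of Fitch parsimony scoring for a single binary character on two competing topologies. Everything in it is a toy assumption for illustration: the four-taxon trees, the taxon labels, the 0/1 states and the function name `fitch_steps` are hypothetical and are not taken from any published matrix. It only shows the general mechanism: once a taxon is scored "?", the character can cost the same number of steps on both trees and so stops favoring either.

```python
def fitch_steps(tree, states):
    """Minimum number of state changes for one character on a rooted binary tree.

    tree   -- nested 2-tuples of taxon names (hypothetical toy topology)
    states -- dict mapping taxon name to a state, or "?" for unknown
    """
    steps = 0

    def visit(node):
        nonlocal steps
        if isinstance(node, str):
            state = states[node]
            # An unknown coding is compatible with any state, so it carries no signal.
            return None if state == "?" else {state}
        left, right = visit(node[0]), visit(node[1])
        if left is None:
            return right
        if right is None:
            return left
        shared = left & right
        if shared:
            return shared
        steps += 1  # the two subtrees disagree, so one change is needed here
        return left | right

    visit(tree)
    return steps


# Two toy hypotheses: alvarezsaurids closest to pygostylians, or not.
tree_with_birds = ("outgroup", ("other_coelurosaur", ("alvarezsaurid", "pygostylian")))
tree_alternative = ("outgroup", ("alvarezsaurid", ("other_coelurosaur", "pygostylian")))

# Illustrative codings only: 1 = closed popliteal fossa, 0 = open.
fully_coded = {"outgroup": "0", "other_coelurosaur": "0",
               "alvarezsaurid": "1", "pygostylian": "1"}
birds_unknown = dict(fully_coded, pygostylian="?")

for label, coding in (("fully coded", fully_coded), ("birds left unknown", birds_unknown)):
    print(label, fitch_steps(tree_with_birds, coding), fitch_steps(tree_alternative, coding))
```

With the full coding, the toy character is one step shorter on the tree that pairs alvarezsaurids with pygostylians; with pygostylians left unknown, both trees cost the same, so the character no longer tests that hypothesis at all.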

I'll repeat a point that needs to be made depressingly often: the function of cladistic analysis (and science in general) is to question, not to confirm. It does no good to only feed PAUP the data which supports your idea. You might as well not even waste the time, and just list the characters instead. The entire point of cladistic analysis is that it can somewhat objectively weigh competing ideas and tell us which is most parsimonious. So if you're going to run an analysis, take some time to track down the characters which disagree with your hypothesis and code them for every available taxon. Hell, send your data to me and I'll code the taxa for you in exchange for being slapped on as a coauthor. I have a database of theropod codings at my disposal anyway. That way you save time and your analysis will actually be a useful contribution to science. Otherwise the general populace may be impressed by your cladogram, but your peers will realize they can ignore it because it didn't include the relevant data.
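One cheap sanity check along these lines is to count how many taxa are actually coded for each character before running anything. The sketch below assumes a matrix stored as a dict of taxon name to a string of states, with "?" for unknown and "-" for inapplicable. The taxon names, the codings, the 0.5 threshold and the function names `coded_fraction` and `sparsely_coded` are all hypothetical, and this is not how any of the papers above handled their data.

```python
def coded_fraction(matrix):
    """Fraction of taxa with a real coding (not '?' or '-') for each character.

    matrix -- dict mapping taxon name to a string of character states,
              all strings the same length (hypothetical toy layout)
    """
    taxa = list(matrix)
    n_chars = len(matrix[taxa[0]])
    fractions = []
    for i in range(n_chars):
        coded = sum(1 for taxon in taxa if matrix[taxon][i] not in "?-")
        fractions.append(coded / len(taxa))
    return fractions


def sparsely_coded(matrix, threshold=0.5):
    """Zero-based indices of characters coded for fewer than `threshold` of the taxa."""
    return [i for i, frac in enumerate(coded_fraction(matrix)) if frac < threshold]


# Toy matrix: the second character is coded for only one of four taxa.
toy_matrix = {
    "taxon_A": "010",
    "taxon_B": "0?1",
    "taxon_C": "1?0",
    "taxon_D": "1?1",
}

print(coded_fraction(toy_matrix))
print(sparsely_coded(toy_matrix))
```

A flagged character isn't necessarily wrong, since some taxa really don't preserve the element, but it tells you where to go looking for codings that were skipped rather than unknown.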

Smith, Makovicky, Hammer and Currie, 2007. Osteology of Cryolophosaurus ellioti (Dinosauria: Theropoda) from the Early Jurassic of Antarctica and implications for early theropod evolution. Zoological Journal of the Linnean Society. 151, 377-421.

Xu, Clark, Mo, Choiniere, Forster, Erickson, Hone, Sullivan, Eberth, Nesbitt, Zhao, Hernandez, Jia, Han and Guo, 2009. A Jurassic ceratosaur from China helps clarify avian digital homologies. Nature. 459, 940-944.

Choiniere, Xu, Clark, Forster, Guo and Han, 2010. A basal alvarezsauroid theropod from the Early Late Jurassic of Xinjiang, China. Science. 327, 571-574.


  1. "So if you're going to run an analysis, take some time to track down the characters which disagree with your hypothesis and code them for every available taxon."


    And I add: and include all the taxa that have been considered in the alternative hypotheses, not only those agreeing with your preferred topology (for example: if Deltadromeus has been considered alternatively as a basal coelurosaur, an ornithomimosaur, a basal maniraptoran, a noasaurid or a basal ceratosaur, we have to include ALL these taxa in the phylogenetic analysis of Deltadromeus, if we want its most parsimonious position to be determined)!

  2. David Marjanović, February 3, 2010 at 2:26 AM

    "we have to include ALL these taxa in the phylogenetic analysis of Deltadromeus, if we want its most parsimonious position to be determined)!"

    Absolutely. This is why there isn't any published morphological analysis of placental (or even neornithean!) phylogeny yet that could be taken seriously – the number of taxa that need to be considered is enormous, and few people take the time to deal with them all.