Saturday, July 13, 2019

How to add your taxon to the Lori analysis

When designing the Lori analysis, I didn't want it to just be a single-use test.  With nearly every Mesozoic maniraptoromorph included, quantified characters, and a character list modified to avoid correlated or composite characters, it's the best published analysis to add your new taxon to, as long as it's not stemward of Ornitholestes or a member of crown Aves.  The corollary is that it's so detailed that the usual quick TNT run will not find the Most Parsimonious Trees.  Don't let that deter you though, as I'm writing this blog post to walk you through the steps of adding a new taxon and finding its most parsimonious position.

The first step is to score your new taxon.  You might notice I've included two NEXUS files at PeerJ.  This one is for scoring taxa in Nexus Data Editor (NDE).  It includes character and state descriptions to make this easy.  If your specimen is immature, I've added the option to score it 'N' for characters that are known to vary with ontogeny.  The character list indicates which characters qualify for this, but they're easy to notice in NDE too because they have a series of undefined states through state 9 before state N is listed (Figure 1).  Another advantage of NDE is that you can distinguish uncertainty polymorphies from variation polymorphies.  Uncertainty polymorphies, such as 'it either has six or seven sacrals, but I can't tell which', are indicated with a slash, as in '1/2'.  Variation polymorphies, such as 'some individuals have six sacrals and others have seven', are indicated with a plus sign, as in '1+2'.  If a feature is inapplicable, such as a tooth character for a toothless taxon, score it with a dash.  N, /, + and - are thus great ways to keep track of how much we know about your taxon.  Contrast this with Cau's MegaMatrix, which is entirely 0s, 1s and ?s.  It contains basically the same information TNT will use, but isn't as obvious or transparent.
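As a side benefit, these symbols make completeness easy to quantify.  Here's a minimal Python sketch (a hypothetical helper of my own, not part of the published files) that tallies them in an NDE-style score string:

```python
from collections import Counter

def score_summary(scores: str) -> Counter:
    """Tally the NDE scoring symbols in one taxon's score string:
    '?' = unknown, 'N' = varies with ontogeny (immature specimens),
    '-' = inapplicable, '/' = uncertainty polymorphy ('1/2'),
    '+' = variation polymorphy ('1+2')."""
    return Counter(c for c in scores if c in "?N-/+")

# A short made-up score string as an example:
print(score_summary("01?N-1/2 0+1??"))
```

Run on a full score row, this gives a quick sense of how much of your taxon's anatomy is actually known versus unknown, immature or inapplicable.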

Figure 1. Example of a Lori matrix in NDE.

These symbols would all work fine in PAUP, but that program is far too slow for the Lori analysis.  So instead I used TNT (Goloboff and Catalano, 2016).  The problem is that TNT recognizes a different set of symbols.  When NDE makes a NEXUS file from your matrix, uncertainty polymorphies are written with curly brackets, such as '{12}'.  Variation polymorphies are written with normal parentheses, such as '(12)'.  TNT doesn't recognize the difference and just uses curly brackets for all polymorphies.  Similarly, TNT doesn't recognize inapplicable states and doesn't allow another symbol like 'N' to count as an unknown state.  So you'll have to copy your list of scores into a word processor and 'replace all' normal parentheses with curly brackets, capital Ns with question marks, and dashes with question marks.  Now you have your entry ready for TNT.

Hesperornithoides                   ???3111??? ?????????? ?????????? ?10???01?? 0110?????? ?????????1 ?????????? ?0?{01}0?10?? ???????0(01)1 0100???1?? 1?01??000? ??1{12}1????? ?????11??1 (01)??????0?? ?????2{01}000 0??10000{01}0 00001000?? ???1?1???? ?????????? ?????????1 ?000110000 1?0000??10 0{12}0?01???? ??{01}??????? {01}1?????{12}?? ???{12}00???? ?????1???0 ???1????10 ???0?0???0 ??????1?0? 00?11???0? ??2?11???1 100001?1?? ???1??1011 00{12}?00?0?? 00???2???0 ?????1010? ?00????10{12} ??11{01}11?0? 1{23}?01?0?01 ??10???1?0 0??????1?? ?00021???? ?0??000?1? ????00???? 00?102001? ???10000?? ?????????? ??0000000? 110??????0 ?0???00??? ??0????11? ?????00?10 ?000?0?0?? ???0?????? ?????????? ??0?0????? ??0?0???10 ?????0???? ???000???? 0???01???1 -1?1000-00 ?0?1?????? ?0??00001? ?0???????? ????000{01}10 0200000?0- -00?1-?0?? ?????0??-1 01????????

becomes...

Hesperornithoides ???3111????????????????????????10???01??0110???????????????1???????????0?{01}0?10?????????0{01}10100???1??1?01??000???1{12}1??????????11??1{01}??????0???????2{01}0000??10000{01}000001000?????1?1???????????????????????1?0001100001?0000??100{12}0?01??????{01}???????{01}1?????{12}?????{12}00?????????1???0???1????10???0?0???0??????1?0?00?11???0???2?11???1100001?1?????1??101100{12}?00?0??00???2???0?????1010??00????10{12}??11{01}11?0?1{23}?01?0?01??10???1?00??????1???00021?????0??000?1?????00????00?102001????10000??????????????0000000?110??????0?0???00?????0????11??????00?10?000?0?0?????0??????????????????0?0???????0?0???10?????0???????000????0???01???1?1?1000?00?0?1???????0??00001??0????????????000{01}100200000?0??00?1??0???????0???101????????
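The 'replace all' step is simple enough to script.  Here's a minimal Python sketch (a hypothetical helper; apply it only to the score string itself, since taxon names could also contain a capital N):

```python
def nexus_to_tnt(scores: str) -> str:
    """Convert NDE/NEXUS score symbols to ones TNT accepts:
    parentheses (variation polymorphies) become curly brackets,
    and 'N' (ontogeny unknowns) and '-' (inapplicable) become '?'."""
    return (scores.replace("(", "{")
                  .replace(")", "}")
                  .replace("N", "?")
                  .replace("-", "?"))

print(nexus_to_tnt("(01)N-0"))  # -> {01}??0
```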

Now take the other NEXUS file, the one designed to run in TNT.  Change the number of taxa to add one for your new taxon under the 'ntax=' command (Figure 2), add your new taxon with its scores to the bottom of the matrix block (Figure 3), and then comes the important step.  I included one saved Most Parsimonious Tree in this NEXUS file, at the bottom after 'begin trees ; tree tnt_1 = [&U]'.  If you added your taxon to the base Lori TNT file with 501 taxa, yours is number 502.  So where it says '(1,(18,((2,3),(36,('... add your taxon as '(502,(1,(18,((2,3),(36,('..., being sure to include the comma, and then add another closing parenthesis to the end of the tree description before the semicolon where it says ',(59,60)))))))))));' .

Figure 2. Where to increase taxon number.
Figure 3. Where to insert your new taxon and scores.
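If you'd rather not edit the tree string by hand, the grafting step can be sketched in Python (a hypothetical helper mirroring the manual edit above; it simply wraps the saved tree in one more pair of parentheses with your new taxon, here number 502, as sister to everything else):

```python
def graft_at_root(tree: str, new_tip: str) -> str:
    """Add new_tip as sister to the whole saved tree description:
    '(1,(2,3));' with tip '502' becomes '(502,(1,(2,3)));'."""
    body = tree.strip().rstrip(";")
    return "(" + new_tip + "," + body + ");"

print(graft_at_root("(1,(18,((2,3),(36,37))));", "502"))
```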

Now save the NEXUS file and open it in TNT.  For our example, I've added the newly described scansoriopterygid Ambopteryx as taxon 502.  In TNT, select 'Trees' > 'View' and you'll see your taxon at the base of the tree, with the tree length shown at the bottom center as 'Len.' (Figure 4).  Here it's 12175, significantly higher than the shortest trees I found at 12123, because Ambopteryx would need a LOT of steps to place so basally.  Select 'Settings' > 'Lock trees' to unlock the cladogram, and now you can click just to the left of your new taxon's name.  When you then right click a node or just to the left of another taxon's name, your new taxon will move there.  If we move Ambopteryx to the base of Scansoriopterygidae, tree length drops to 12147.  You wouldn't expect it to get back down to 12123 unless your new taxon adds no new information.  Conversely, any information it adds has the power to change the topology of closely related taxa.

Figure 4. Your new taxon added and where to see tree length.
Now you let TNT use its power to find the best topology.  After increasing the 'Max. trees' under 'Memory' in 'Settings' to 10000, run a 'New Technology search' getting trees from 'RAM' using 'Sect. Search' (with 'CSS' unchecked), 'Ratchet', 'Drift' and 'Tree fusing'.  With Ambopteryx, this quickly finds 13 trees of length 12142.  One thing I've noticed is that a low number of trees, like 13, indicates there's more work to do.  So reset 'Max. trees' to 100 and run a 'Traditional search' of 'trees from RAM'.  This gets you 100 trees of that length to work with.  Now reset it to 10000 Max. trees and run the New Technology search from RAM again.  The new result is 100 trees of length 12142, which in my experience usually means those are the shortest trees you'll find.  You can keep alternating New Tech and Trad searches like this until you're satisfied, but end with a Trad search after increasing the Max. trees to 99999 to fully sample tree space.  In the present example, the topology within Scansoriopterygidae changed, the clade moved to the base of Paraves (1 step longer in the original matrix), and Pedopenna moved to Archaeopterygidae (1 step longer in the original matrix) (Figure 5).

Figure 5.  Taxon successfully added.
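The search protocol above boils down to a stopping rule: alternate Traditional and New Technology searches until the same best length comes back several times in a row (I use three).  Here's a Python sketch of that bookkeeping, where run_pair is a hypothetical stand-in for one Trad + New Tech pair returning the best tree length TNT reports (it is not a TNT binding):

```python
def search_until_stable(run_pair, stable_pairs=3, max_pairs=50):
    """Call run_pair() repeatedly until the best tree length it
    returns is identical stable_pairs times in a row, then stop."""
    lengths = []
    for _ in range(max_pairs):
        lengths.append(run_pair())
        tail = lengths[-stable_pairs:]
        if len(tail) == stable_pairs and len(set(tail)) == 1:
            return tail[0]
    return min(lengths)

# Toy stand-in for the lengths found by successive search pairs:
results = iter([12147, 12143, 12142, 12142, 12142])
print(search_until_stable(lambda: next(results)))  # -> 12142
```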
And that's how you add a new taxon to the Lori matrix.  Later in Lori Week, I'll show you the new and better way to run a constraint analysis in a huge matrix like this, and also how to track down where taxa with multiple equally parsimonious positions can go.  I'll also post some diagrams showing exactly what to measure for some of the potentially ambiguous quantified characters.

References- Goloboff and Catalano, 2016. TNT version 1.5, including a full implementation of phylogenetic morphometrics. Cladistics. 32(3), 221-238. DOI: 10.1111/cla.12160

Hartman, Mortimer, Wahl, Lomax, Lippincott and Lovelace, 2019. A new paravian dinosaur from the Late Jurassic of North America supports a late acquisition of avian flight. PeerJ. 7:e7247. DOI: 10.7717/peerj.7247

20 comments:

  1. That is not the correct way to re-run an analysis with a new OTU.
    The correct procedure is: score the taxon and run the new analysis following the same original protocol of the first analysis.
    You must re-run your analysis from zero, otherwise you artificially force the analysis to explore the tree islands starting from an arbitrarily assembled tree which was based on a set of hypotheses (the first matrix) which is different from the actual set of hypotheses (the matrix including Ambopteryx). You have forced TNT to explore the tree space from an arbitrary topology defined on a now-obsolete set of hypotheses.
    What if the inclusion of the new OTU has modified the tree space so radically that the previous optimum is now just a local optimum? Could the brief TNT run you performed in the second run be able to find the novel optimal island? I doubt this.
    So, the only valid way to run an analysis after a new OTU is scored is to merely re-run the whole analysis following the same original tree search strategy used for the first matrix.

    Replies
    1. Hold on now, your method is not re-running the analysis from zero either. You're forcing the analysis to explore tree islands starting from the incredibly arbitrary tree "random seed 1" which is based on no evolutionary hypotheses at all. Which is why it takes so long to stumble upon something actually supported by the data. If anything, the set of hypotheses forming the first topology are the null hypotheses for the topology after adding a new taxon. After all, we don't expect to have to restart our evolutionary hypotheses from randomly arbitrary chaos every time we learn about a new taxon. Instead we expect that most of the likely relationships will remain likely, and indeed that's what happens. In that sense, the most likely topology prior to a new taxon's addition is the most likely topology after its addition as well, and so is the best place to start the search.

      "Could the brief TNT run you performed in the second run be able to find the novel optimal island? I doubt this."

      This seems to be the unevidenced assumption on your part. If the mpts were otherwise unchanged apart from the new taxon's presence, I'd agree with you. But they do change, as noted above. Remember even a "brief" Trad search and New Tech search combo is testing several billion rearrangements. And I recommended doing this multiple times in a row. So I'd say in most cases TNT is able to find the optimal island. But that's a testable statement- anyone can manually move any taxa or clades in tree view mode, or constrain any topology they think might be missed by TNT and search that tree space. If they find shorter trees this way or starting from a random seed than I do by adding one taxon at a time as described above, that would support your idea. So far, I haven't located any shorter tree lengths this way though.

      Also remember that any run we do with over 10 taxa is only a heuristic process- one designed to save time but not necessarily find the shortest tree. And as a heuristic method, searching from a "known" shortest tree is not only logical as noted above but also extremely effective- cutting down tens of hours to ten minutes.

    2. I am perfectly aware of how phylogenetic analyses work and their epistemological basis. Your reply seems not to address the important point of why we perform phylogenetic analyses. Enforcing a previous tree based on a different matrix as starting point is not an unbiased way to explore tree space. Everybody would love to reduce computation time, but this cannot be done by violating the main assumption of phylogenetic analysis: that we explore tree space with no a priori assumptions on "preferred" topologies.

      I am also concerned that after adding just one single and not really complete taxon like Ambopteryx, the new MPT is 19 steps longer than the previous. Ambopteryx is not a weird taxon full of autapomorphies which is not comparable at all to other OTUs; it is another scansoriopterygid with a morphology very similar to the already scored Yi and Epidexipteryx. Why are the new shortest trees 19 steps longer than the previous ones? Based on my experience in adding new taxa to big matrices, I'd expect the new trees with Ambopteryx to be around 2-6 steps longer, but not 19.
      This raises the suspicion that the second analysis was not that exhaustive and thus found a local suboptimum.
      This is the reason for re-running the whole analysis each time.

    3. But isn't my assumption correct? We're not just assuming in an empirical vacuum- we know that if you have a huge extensively sampled dataset, adding a single taxon is going to leave most of the tree 'the same'. And by 'the same' I mean that the likely alternatives are going to stay likely and vice versa, not that no topological changes at all will occur. You even see this in true weirdos like Chilesaurus- topologies stay almost identical whether you add it to an ornithischian matrix, a sauropodomorph matrix, etc.. You even say yourself here that adding a new taxon is only around 2-6 steps longer, so no previously highly unparsimonious changes occur. If the assumption weren't true, there would be no point in tracking phylogenetic conclusions, because for all we knew, when the next discovered taxon was added every month the topology might be crazily different and highly unparsimonious in last month's matrix.

      But again, this issue can and will be tested. If my method's recovering local suboptima, we'll know at some point. The Ambopteryx example at least partially disproves it, because if TNT weren't capable of escaping local suboptima at all it couldn't drop five steps by changing other parts of the tree. Another thought is, what if it's more likely an analysis gets stuck in a local suboptimum starting from the random seed tree than from an MPT of the previous analysis? I've certainly had it do that before, but never found a shorter tree after my taxon adding method.

      Your concern about Ambopteryx adding that many steps would have to be purely with my scoring accuracy. Remember, just adding Ambopteryx to my 12123 tree anywhere inside Scansoriopterygidae added at least 24 steps. And that was before TNT searched any tree space. It has a branch length of 10 steps, so is not going to only add 2-6 steps even in the best of times. Then Pedopenna moving is at least another, Scansoriopterygidae moving is at least another, leaving up to 7 more steps. Some of those might be scansoriopterygids switching positions, and who knows what else moved slightly in other parts of the tree due to downstream effects of these things.

    4. Just a note that running TNT for another four hours last night when I slept checking 728 billion more trees still never found a result better than 12142 with Ambopteryx added. I'm running an even longer search while at work today, but at some point reality has to trump philosophy if it's true that my method (despite its bias) really does find the shortest trees.

    5. Well, it's clear to me what's going on: you're an optimist, Andrea is a pessimist, and we have no idea what we should realistically expect because there's no research on this. Maybe Ambopteryx is a fluke, maybe it's the normal case, who knows.

      Maybe, though, you'd find very different trees if you ran an analysis for a week. That should be worth trying as an experiment at some point.

      "Hold on now, your method is not re-running the analysis from zero either. You're forcing the analysis to explore tree islands starting from the incredibly arbitrary tree "random seed 1" which is based on no evolutionary hypotheses at all."

      So, TNT doesn't do addition-sequence replicates to generate the starting trees for the branch-swapping procedures?

      "any run we do with over 10 taxa is only a heuristic process- one designed to save time but not necessarily find the shortest tree."

      Branch-and-bound is guaranteed to find the shortest trees even though it doesn't look at every mathematically possible tree. But with each added taxon the necessary time increases so quickly that it's not considered feasible above 25 or so. (Maybe your new computer can do 30 in a week?) I'm under the impression that TNT can't even do it; it's really not used often.

    6. We should reduce the amount of a priori expectations, not encourage their increase. David is correct: I am pessimistic. And usually, being pessimistic about "naive" approaches is the winning option (and yours is naive, since it lacks a rigorous theoretical explanation).
      I am skeptical of every "naive heuristic" used in such an ad hoc way. Of course, you see the immediate positive help to your particular aims, but this is exactly the usual way of introducing a novel form of pollution: it ignores the side effects that it will produce on a larger scale.
      If you cannot generalize such an approach, for example by determining at what point taxon addition requires a full analysis and how much it diverges from the more rigorous strategy, I prefer keeping the latter and ignoring the former.

      Just because something works does not mean it is good.

    7. "Well, it's clear to me what's going on: you're an optimist, Andrea is a pessimist, and we have no idea what we should realistically expect because there's no research on this. Maybe Ambopteryx is a fluke, maybe it's the normal case, who knows."

      I'm a realist. If we get shorter trees from running the new analyses for longer times or from scratch, my method will have to be adjusted or abandoned. And it's not like you can add any amount of new data to the previous mpt or just run it once and get this to work. I recommend only one taxon at a time, or else you end up so far away from the previous length that distant previously rejected topologies have to be sorted through and rejected too. Similarly, I run each pair of analyses (trad then new tech) until I get the same length over three pairs. I've yet to see better trees after that. Which goes some way to generalizing the method, as requested by Cau. Can you do this when updating a taxon's scoring? Seems to work. Can you do this when adding characters? I don't know, I've never tried.

      "Maybe, though, you'd find very different trees if you ran an analysis for a week. That should be worth trying as an experiment at some point."

      The longest I've tried is 86 hours on various settings, so about half a week. But having tried over a hundred of the most intuitive constraint analyses and having checked countless manual rearrangements, I don't realistically see how there could be a better local optimum that I'm somehow missing. It would have to be either some unforeseen heterodox topology or somehow just the right mix of just the right normally unparsimonious changes that add together to make a significantly more parsimonious total. And that TNT missed these things in its quadrillions of checked trees.

      "So, TNT doesn't do addition-sequence replicates to generate the starting trees for the branch-swapping procedures?"

      It can, but I found a driven search works better. A driven search is the default, so I assume that's what Cau (2018) used too. What's funny is that my method is actually a kind of addition sequence, as those add the next taxon to the previously generated best tree. And while I don't manually test every alternative placement, surely those billions of trees checked by TNT do. So it's not so much a new method as a new application of an existing concept.

      "Just because something works does not mean it is good."

      Err...

      As self-defeating as that sentence was, here's an analogy. This is an epistemological shortcut I'm proposing, sure. But it's far from the only one we use. What about choosing and using outgroup taxa? If we were in a world without this convention, Alternate Cau could argue 'Hey now, you're introducing your bias into the analysis if you're assuming not only that a certain taxon is outside your ingroup but also that it's closely related enough to be relevant. If we're to be proper blind scientists here we should be including all of life's taxa in our analysis and the program will provide the best outgroup in its results.' And that's technically true and epistemologically pure. But that would take a LOT more time and effort, so we let our prior knowledge work its way into the process and choose one to a few outgroups. And it generally works. Even you use it. But for you to say 'I might use a priori outgroup expectations, AND a priori conspecific expectations, AND a priori homology expectations, AND a priori expectations for how to divide continuously variable characters, etc., but your proposal is a step too far'? It's a little contrived.

      (second run is 80% through...)

    8. Dear Mickey, you keep not providing any theoretical explanation for your procedure, and this makes your post more dangerous than useful, in particular if the reader is not sufficiently expert in phylogenetics and is thus unable to discriminate the theoretical basis of a scientific procedure from a mechanical algorithm for performing a TNT run.

      Even Dave Peters is able to produce tons of phylogenetic trees, but that is not science. What side of the barricade is yours? The scientific one, of those who follow the methodological rigor, or the pseudoscientific one, populated by the heterogeneous community which fills the Internet with any form of pseudoscience? Because your naive heuristic in this post is exactly what pseudoscience is: it apparently has the shape of a scientific protocol, but lacking a theoretical basis, it is not scientific.

      It seems that you have been unable to realize that even if your procedure "works" in this case, or even in a million cases, it does not represent a sufficient generalization which could justify its use.

      That was the meaning of my final sentence.

      I won't comment more on this post. There are more interesting elements of your published analysis to discuss (and may eventually be discussed, here or elsewhere).

    9. I respect your choice to end this topic, but will respond here to end the topic now that the second run just completed (turns out my '80%' estimate was for just one of the sect searches). That being said, I thought my statements above that "the set of hypotheses forming the first topology are the null hypotheses for the topology after adding a new taxon" and such are the theoretical explanation. If you want to know more about the variables needed for the method to work, it's not fully worked out yet, but the larger the dataset, the denser the taxon sampling, and the fewer steps added at once, the more likely the method is to work.

      If you're demanding some mathematical or logic formulation to objectively show the method will lead to good results, not even the standard methods of phylogenetic analyses do this. Just read Goloboff (1999:418-419) on sectorial searches. Like me he gives instructions on how to perform them, lays out some generalizations of what works best ("The best size seems to be 35 to 55 nodes", "For that value of S, R = 3 and r = 3 are enough to make it likely that an optimal tree for the reduced data set is found", "The number of sector selections needed to produce the best results varied with data set size", etc.), and then states and shows the results are fast and short.

      "Even Dave Peters is able to produce tons of phylogenetic trees, but that is not science. What side of the barricade is yours?"

      Peters judges his results based on subjective criteria like "a gradual accumulation of traits" and "sister taxa that look like each other", oh and his misunderstanding that fewer MPTs means a more accurate analysis. Whereas I'm just using the agreed-upon objective standard of tree length for my method.

      "It seems that you have been unable to realize that even if your procedure "works" in this case, or even in a million cases, it does not represent a sufficient generalization which could justify its use."

      You're right, I am unable to realize that. What you just described is the exact philosophy of inductive reasoning, which much if not all of science depends on. If a process works in its first million cases, of course we'd all use it. You'd have to be irrationally stubborn to not do so at that point.

      The second analysis checked 3.3 trillion (3,305,098,469,748) trees in 29 hours 4 minutes, and still never found trees shorter than 12142. Without macros I can't even run an analysis from a tree in RAM longer than that. While nothing's ever proof in induction, I'd say that's additional evidence my method works.

    10. Science, strictly speaking, has nothing to do with induction. We don't infer that the sun will rise tomorrow from the fact that it's done so throughout recorded history; that is merely enough to let us form a hypothesis, not to test it. Instead, we deduce that the sun will rise tomorrow from the theory of relativity and the conservation of energy. We'll test it tomorrow morning by observation, not by induction.

      However, you've just tested the hypothesis that there are shorter trees over 3.3 trillion times. Clearly, with this dataset, your method remains in the running at the very least.

    11. On the minimum length standard:

      Gregory S. Paul has questioned whether minimum length works with parallel evolution and/or reversals while losing flight.

      I am not a paleontologist, so I may have missed something. Are there any studies comparing osteological and molecular trees for paleognath birds? Because that could shed light on whether there's any major exception to minimum length, and if so, whether there's some way to re-weight certain changes.

    12. I guess I'm curious -- I'm really interested in the conversation that unfolded above, but disappointed Andrea stepped out of the conversation.
      I would be interested to know how we could demonstrate that this method DOESN'T work -- basically, just picking multiple data sets and letting analyses keep running over and over to see what the outcomes are?

    13. Andrea, this is how I always do it in PAUP*.

  2. Fortunately, the distinction between polymorphism (weird that you call it "variation polymorphies", I've never seen that before) and partial uncertainty ("uncertainty polymorphies") does not have any effect on tree search. It can drastically change tree length though. If there's any polymorphism in your matrix, the lengths given by TNT and by Mesquite for the same tree will never match up; those given by Mesquite will always be longer, because Mesquite always distinguishes polymorphism from partial uncertainty (the former adds steps within terminal branches, the latter does not), while TNT always treats the former as the latter. PAUP* can be set to all three options.

    I use Mesquite instead of NDE because Mesquite can edit matrices, display trees, let you mess with topologies and branch lengths, reconstruct the evolution of characters on trees, and much else.

    "These symbols would all work fine in PAUP[*]"

    Sort of. While lots of people seem to believe that "-" means "inapplicable", as far as PAUP* is concerned it means "gap in a molecular sequence". By default, PAUP* is set to treat gaps as missing data, but make sure you haven't set it to treat gaps as a 5th base/21st amino acid, because that's the other available setting.

    Replies
    1. I should perhaps make explicit that there's no software which treats inapplicable scores as anything but missing data. Using a different symbol serves only the readers, not the software.

  3. Replies
    1. Where I have the time to write it. What do you, the public, want? A look at a subgroup and the steps it takes to change topologies? How new taxa interact with the Lori matrix? Something else?
