Monday, March 30, 2015

Inconsequential splits in NeighborNet graphs


NeighborNet produces splits graphs based on distances between the taxa, rather than using the original character data. This approach can produce what we might call inconsequential splits in the graph — that is, splits that are not explicitly supported by the character data. Here, I present a simple example to illustrate the extent to which this can occur.

The data are taken from: Nanette Thomas, Jeremy J. Bruhl, Andrew Ford, Peter H. Weston (2014) Molecular dating of Winteraceae reveals a complex biogeographical history involving both ancient Gondwanan vicariance and long-distance dispersal. Journal of Biogeography 41: 894-904.

This dataset consists of a set of eight morphological features of the pollen from 31 extant plant taxa plus two fossil samples, as shown in this data matrix:

                    12345678
T_lanceolata        00111011
T_stipitata         00111011
T_purpurescens      00111011
T_xerophila_x       00111011
T_xerophila_r       00111011
T_vickeriana        00111011
T_glaucifolia       00111011
T_membranea         00111011
T_insipida          00111011
                    --------
T_perrieri          00111010
D_winteri           00111010
D_grenadensis       00111010
                    --------
B_comptonii         00011010
B_howeana           00011010
B_semicarpoides     00011010
B_whiteana          00011010
B_queenslandiana_q  00011010
B_queenslandiana_1  00011010
                    --------
P_axillaris         00011011
P_colorata          00011011
Pseudowinterapollis 00011011
                    --------
B_pancheri          01001011
                    --------
Harrisipollenites   01001100
                    --------
Z_acsmithii         01001101
E_stipitatum        01001101
Z_bicolor           01001101
                    --------
Z_balansae          11001101
                    --------
C_dinisii           1-111101
C_madagascariensis  1-111101
W_salutaris         1-111101
P_macranthum        1-111101
C_ekmanii           1-111101
C_winterana         1-111101


Note that there are only nine groups of taxa (separated by the dashed lines) — within each group the data are identical. Each character has two states: present / absent.
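The grouping can be checked directly from the matrix. The following is only a minimal sketch: the function names (group_identical, hamming) are my own, and the simple mismatch count that skips missing entries is an assumption for illustration, not necessarily the distance that SplitsTree4 computes by default.

```python
from collections import defaultdict

# Character matrix transcribed from the post ('-' marks a missing entry).
RAW = """
T_lanceolata        00111011
T_stipitata         00111011
T_purpurescens      00111011
T_xerophila_x       00111011
T_xerophila_r       00111011
T_vickeriana        00111011
T_glaucifolia       00111011
T_membranea         00111011
T_insipida          00111011
T_perrieri          00111010
D_winteri           00111010
D_grenadensis       00111010
B_comptonii         00011010
B_howeana           00011010
B_semicarpoides     00011010
B_whiteana          00011010
B_queenslandiana_q  00011010
B_queenslandiana_1  00011010
P_axillaris         00011011
P_colorata          00011011
Pseudowinterapollis 00011011
B_pancheri          01001011
Harrisipollenites   01001100
Z_acsmithii         01001101
E_stipitatum        01001101
Z_bicolor           01001101
Z_balansae          11001101
C_dinisii           1-111101
C_madagascariensis  1-111101
W_salutaris         1-111101
P_macranthum        1-111101
C_ekmanii           1-111101
C_winterana         1-111101
"""


def group_identical(raw):
    """Map each distinct character string to the list of taxa that share it."""
    groups = defaultdict(list)
    for line in raw.strip().splitlines():
        name, chars = line.split()
        groups[chars].append(name)
    return dict(groups)


def hamming(a, b, missing="-"):
    """Count mismatching characters, skipping positions where either is missing."""
    return sum(x != y for x, y in zip(a, b) if missing not in (x, y))


groups = group_identical(RAW)
for pattern, taxa in groups.items():
    print(pattern, len(taxa), ", ".join(taxa))
```

Taxa within a group are at distance zero from each other under this measure, so any edge that separates them in the graph cannot come from the characters themselves.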

The resulting NeighborNet, as produced by default using the SplitsTree4 program, is shown in the first graph.


As expected, the taxa form nine groups. There are a number of apparently well-supported splits (ie. with long edges) separating these groups. There are also a number of smaller splits, and a whole series of very tiny splits. None of these latter two groupings are explicitly present in the dataset — the only splits supported by the characters are plotted onto the graph using the character numbers. (Note that character 5 is uninformative.)
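The splits that the characters do support can be listed mechanically. This sketch uses one representative taxon per group (the first one listed in each block of the matrix) and treats each binary character as a bipartition of the groups; the function name character_splits is hypothetical, and leaving a taxon scored "-" on neither side of the split is an assumption of the sketch.

```python
# One representative per group of identical taxa, transcribed from the matrix;
# each group is labelled by its first-listed taxon.
PATTERNS = {
    "T_lanceolata":      "00111011",
    "T_perrieri":        "00111010",
    "B_comptonii":       "00011010",
    "P_axillaris":       "00011011",
    "B_pancheri":        "01001011",
    "Harrisipollenites": "01001100",
    "Z_acsmithii":       "01001101",
    "Z_balansae":        "11001101",
    "C_dinisii":         "1-111101",
}


def character_splits(patterns):
    """Return {character number: (groups with state 0, groups with state 1)}."""
    n_chars = len(next(iter(patterns.values())))
    splits = {}
    for i in range(n_chars):
        zero = frozenset(g for g, p in patterns.items() if p[i] == "0")
        one = frozenset(g for g, p in patterns.items() if p[i] == "1")
        # Groups scored '-' for this character fall on neither side.
        splits[i + 1] = (zero, one)
    return splits


splits = character_splits(PATTERNS)

# A character with only one observed state defines no split at all.
constant = [c for c, (zero, one) in splits.items() if not zero or not one]
print("constant characters:", constant)

for c, (zero, one) in splits.items():
    if zero and one:
        print(c, sorted(one), "|", sorted(zero))
```

Characters 6 and 7 define the same bipartition with the states reversed, so the eight characters yield fewer than eight distinct splits.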

The series of very tiny splits are present throughout the graph as extremely short edges. For example, a detailed view of the bottom left-hand corner of the graph is shown in the next figure.


Note that these six taxa have identical character data, and therefore their separation into four groups is entirely an artifact of the NeighborNet algorithm.

So, one needs to be careful when interpreting small splits in such a graph: they may have biological support, or they may not.

Wednesday, March 25, 2015

Network of Australian marsupials


In the literature, phylogenetic trees often appear even when the paper is discussing non-tree evolutionary histories.

A case in point is the paper by: Susanne Gallus, Axel Janke, Vikas Kumar, Maria A. Nilsson (2015) Disentangling the relationship of the Australian marsupial orders using retrotransposon and evolutionary network analyses. Genome Biology and Evolution, in press.

The authors discuss the relationship between the four Australian marsupial orders, and use data from transposable element (retrotransposon) insertions for resolving the inter- and intra-ordinal relationships of the Australian and South American orders. They plot the retrotransposon presence/absence onto a tree derived from alignments of 28 nuclear gene fragments. This is shown in the first figure, with the retrotransposons indicated as dots on the internal branches.


For comparison, the next figure is a Median-Joining network based on the presence/absence of the retrotransposons.


With the exception of the Monito del monte, Shrew opossum and Western quoll, the network matches the basic tree structure. However, it emphasizes more strongly the fact that the retrotransposons do not resolve the relationships among the marsupial orders. As the authors note:
The retrotransposon insertions support three conflicting topologies regarding Peramelemorphia, Dasyuromorphia and Notoryctemorphia, indicating that the split between the three orders may be best understood as a network ...The rapid divergences left conflicting phylogenetic information in the genome possibly generated by incomplete lineage sorting or introgressive hybridisation, leaving the relationship among Australian marsupial orders unresolvable as a bifurcating process million years later.

Monday, March 23, 2015

Phylogenetic network of pairwise alignment methods


Phylogenetic networks can be used to illustrate the history of any set of objects or concepts, provided that this history is a divergent one (ie. the history is not simply the transformation of objects through time).

Since I have recently been writing about sequence alignments, it is worthwhile to show an example of applying a network to sequence alignment programs. This comes from the paper by Chaisson MJ, Tesler G (2012) Mapping single molecule sequencing reads using basic local alignment with successive refinement (BLASR): application and theory. BMC Bioinformatics 13: 238.

The authors discuss programs that map reads from a sample genome onto a reference sequence. They note: "the relationship between many existing alignment methods is qualitatively illustrated in the figure."


Their legend reads:
The applications / corresponding computational restrictions shown are: (green) short pairwise alignment / detailed edit model; (yellow) database search / divergent homology detection; (red) whole genome alignment / alignment of long sequences with structural rearrangements; and (blue) short read mapping / rapid alignment of massive numbers of short sequences. Although solely illustrative, methods with more similar data structures or algorithmic approaches are on closer branches. The BLASR method combines data structures from short read alignment with optimization methods from whole genome alignment.
The reticulation refers to their new program, which "maps reads using coarse alignment methods developed during WGA [whole genome alignment] studies, while speeding up these methods by using the advanced data structures employed in many NGS [next generation sequencing] mapping studies."

Wednesday, March 18, 2015

The need for a new sequence alignment program


Multiple sequence alignment programs have not yet met their primary aim for evolutionary biologists: maximizing homology of characters. If our goal is to develop an automated procedure for homology assessment, then we need someone to produce a program that explicitly implements this aim.

Alignment is just as much a part of phylogenetics as is tree or network building. It is the procedure that expresses the homology relationships among the characters, rather than the historical relationships among the taxa. Therefore, we need a computer program that accurately expresses homology relationships, as well as one that accurately expresses the historical relationships. We have some programs for the latter but currently nothing for the former.

Unfortunately, homology is a rather nebulous concept. It has to do with inheriting characters from a shared ancestor, which is not something that we can directly observe. Therefore we have to infer it. Somehow.

Homology criteria

Systematists have developed criteria for making decisions about potential homologies in an objective and (hopefully) repeatable manner, and these are directly applicable to nucleotide sequences, which these days are the most common form of data used in phylogenetics. These criteria are:

• Similarity
  1. Compositional = apparent likeness or resemblance between sequences (% similarity)
  2. Topographical = apparent likeness or resemblance between sequences (second- and third-order structure of proteins or RNA)
  3. Functional = functional relationship to other characters in the same sequence (annotated function of the sequence in protein or RNA)
  4. Ontogenetic = variation arising from the same molecular mechanism between sequences (inferred molecular mechanism creating the sequence variation — tandem repeats, inverted repeats, substitutions, inversions, translocations, transpositions, deletions, insertions)
• Conjunction = possible within-genome copies of the same sequence (i.e. paralogy)

• Congruence = agreement with other postulated homologies elsewhere in the same sequences (synapomorphy).
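Of these criteria, compositional similarity is the easiest to express as a calculation. As a minimal sketch (the function name percent_identity and the choice to skip any column containing a gap are assumptions made here; other conventions for the denominator exist):

```python
def percent_identity(seq_a, seq_b, gap="-"):
    """Percentage of identical residues among columns where neither sequence has a gap."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    # Keep only the columns in which both sequences have a residue.
    compared = [(a, b) for a, b in zip(seq_a, seq_b) if a != gap and b != gap]
    if not compared:
        return 0.0
    matches = sum(a == b for a, b in compared)
    return 100.0 * matches / len(compared)


# Eight columns are compared (one contains a gap); seven of them match.
print(percent_identity("ACGT-ACGT", "ACGTTACCT"))
```

Note that the number depends entirely on how the two sequences were aligned in the first place, which is why this criterion on its own cannot establish homology.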

Traditionally, characters have been first proposed as homologous using the criteria of similarity and conjunction (together called primary homology), and then tested with the criterion of congruence (secondary homology).

It is important to note that these criteria do not necessarily always agree with each other in their inferences of homology. Changes that occur during evolutionary history can weaken the connection between these criteria so that, for example, nucleotide homology inferred from structural similarity is no longer the same as nucleotide homology inferred from compositional similarity. It is for this reason that compositional similarity of the sequences is insufficient to establish gene orthology, for example. The same limitation applies to nucleotides.

Current computer programs

It is clear that these criteria have been incorporated singly into current computerized procedures for producing multiple sequence alignments, but rarely in combination. For example, compositional similarity is the criterion used by the most popular computer programs, such as CLUSTAL, MAFFT and Muscle. Topographical similarity is being invoked whenever structure-based alignments are produced, such as for RNA-coding sequences (eg. PicXAA-R; PMFastR), or when nucleotide sequences are translated to amino acids before alignment (eg. PROMALS). Functional similarity is used for specialist studies of conserved motifs and binding sites, for instance. Ontogenetic similarity of nucleotide sequences is based on inferring the possible molecular processes that cause the observed sequence variation; the program Prank uses this criterion by distinguishing between insertions and deletions.

Congruence as a criterion involves the observation of repeated patterns of synapomorphy in a phylogeny. Among alignment algorithms, both Direct Optimization (e.g. POY; MSAM; BeeTLe) and Statistical Alignment (e.g. BAli-Phy; StatAlign) try to simultaneously produce a multiple alignment and a phylogenetic tree, thus optimizing the criterion of congruence.

The fact that essentially none of the current crop of programs applies more than one criterion is, I contend, the principal reason why so many phylogeneticists adjust their alignments manually. Personal judgment may not be perfect, but at least it can be consciously based on homology as a general character concept. Since the different criteria may conflict with each other, at the moment only human judgment is available to compare them and thus make a final decision.

Required program

To make the homology criteria fully operational, we need to compare their inferences by evaluating the comparative evidence. That is, since the different criteria may conflict with each other, we need an automated way to compare them and evaluate their relative probabilities for any alignment column. What we need is a computerized procedure that will include all of the known criteria for homology assessment. Sadly, there are currently no mathematical models for doing this.

I suspect that there are two reasons for the failure of such a program to appear by now. First, biologists have not been clear about homology as a concept, and have not been able to express it in a form that computationalists could use to develop an algorithm. That is, we have criteria but they are not really operational criteria in a computational sense. Second, it will not be easy, because there is no obvious algorithm for inferring inheritance of characters. That is, we cannot easily separate homology from analogy.

Interactive editor

Another proposal is to have an interactive alignment editor. This editor would have the ability to show the conflicting hypotheses of homology (eg. where the homology suggested by structural pairing in a stem conflicts with homology suggested by tandem repeats), and then to annotate each column in the final alignment with the reason for the researcher having chosen to align those particular nucleotides. For example, one could press a button and see the RNA stem pairs in different colors (irrespective of whether the stem nucleotides are aligned), or press again and see the tandem repeats and inversions in different colours (once again, irrespective of how the nucleotides are aligned). One could also choose to see the annotations for the columns (summarized, using some coded schema), or simply look at the unadorned alignment itself.
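To make the idea of column annotations a little more concrete, here is a minimal sketch of the kind of data structure such an editor might keep. Everything in it is hypothetical (the class names, the criterion labels, and the one-letter coding scheme); it is not how PhyDE or any other existing editor stores its data.

```python
from dataclasses import dataclass, field


@dataclass
class ColumnAnnotation:
    criterion: str      # e.g. "compositional", "topographical", "ontogenetic"
    note: str = ""      # free-text reason, e.g. "stem pair" or "tandem repeat"


@dataclass
class AnnotatedAlignment:
    rows: dict                                        # taxon name -> aligned sequence
    annotations: dict = field(default_factory=dict)   # column index -> list of annotations

    def annotate(self, column, criterion, note=""):
        """Record one reason for the homology hypothesis in a column."""
        length = len(next(iter(self.rows.values())))
        if not 0 <= column < length:
            raise IndexError("column outside the alignment")
        self.annotations.setdefault(column, []).append(ColumnAnnotation(criterion, note))

    def conflicts(self):
        """Columns for which more than one criterion has been recorded."""
        return sorted(
            col for col, notes in self.annotations.items()
            if len({n.criterion for n in notes}) > 1
        )

    def coded_summary(self):
        """One symbol per column: '.' = none, '*' = several criteria, else an initial."""
        length = len(next(iter(self.rows.values())))
        symbols = []
        for col in range(length):
            criteria = {n.criterion for n in self.annotations.get(col, [])}
            if not criteria:
                symbols.append(".")
            elif len(criteria) > 1:
                symbols.append("*")
            else:
                symbols.append(next(iter(criteria))[0].upper())
        return "".join(symbols)


aln = AnnotatedAlignment({"taxon_a": "ACGT", "taxon_b": "AC-T"})
aln.annotate(0, "compositional")
aln.annotate(2, "topographical", "stem pair")
aln.annotate(2, "ontogenetic", "tandem repeat")
print(aln.coded_summary())
print(aln.conflicts())
```

The point of keeping the annotations separate from the sequences is that the same alignment can then be displayed unadorned, with a summary line, or with the competing hypotheses highlighted.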

This seems to me to be an achievable goal in the short term, and the PhyDE editor already does some of it. Such an editor would also serve as a necessary step on the way to working out how to automate as much of the process as possible. The ultimate goal for some people may be total automation (ie. a black box), but I see no way to achieve that in the immediate term. Besides, I suspect that phylogeneticists will always want some judgemental control over the process, which would be best achieved with a semi-automated interactive editor. That is, we might ask the program to work out what the alternative alignments are for any specified subsequence (in an automated manner), and then we evaluate their relative merits for ourselves.

Note that I am treating the alignment as a set of hypotheses independent of their phylogenetic analysis. Subsequences can still be tentatively aligned even if the researcher intends to mask those subsequences out of any subsequent tree-building analysis. Also, subsets of the taxa might be aligned confidently while other subsets are left unaligned. With current editors, this involves having a separate alignment file for each subset, which is very cumbersome, as well as error-prone.

Monday, March 16, 2015

Tattoo Monday XII


Here is a new collection of tattoos based on Charles Darwin's best-known sketch from his Notebooks (the "I think" tree). For other examples, see Tattoo Monday III, Tattoo Monday VI, and Tattoo Monday IX.


Wednesday, March 11, 2015

The need for a new sequence alignment database


Multiple sequence alignment programs have not yet met their primary aim for evolutionary biologists: maximizing homology of characters. The many alignment methods now available have diverse optimization functions, along with assorted heuristics to search for the optimum alignment; and these methods produce detectably different multiple sequence alignments in almost all realistic cases (see The need for a new sequence alignment program). This leaves phylogeneticists wondering what to do. In response, the majority of phylogeneticists use manual alignment or re-alignment at some stage in their procedures.

If our goal is to develop an automated procedure for homology assessment (see Multiple sequence alignment), then we need some means of evaluating the relative success of different alignment methods.

There are four suggestions for benchmarking strategies for sequence alignment (Iantorno S, Gori K, Goldman N, Gil M, Dessimoz C 2014. Who watches the watchmen? An appraisal of benchmarks for multiple sequence alignment. Methods in Molecular Biology 1079: 59-73):
  1. Benchmarks based on simulated evolution of biological sequences, to create examples with known homology.
  2. Benchmarks based on consistency among several alignment techniques.
  3. Benchmarks based on the three-dimensional structure of the products encoded by sequence data.
  4. Benchmarks based on knowledge of, or assumption about, the phylogeny of the aligned biological sequences.
These authors list a number of pros and cons for each strategy. For our purposes here we need to consider the cons, which I discuss below (not all of these are covered by the authors).

Cons

1. Simulation-based approaches adopt a probabilistic model of sequence evolution to describe nucleotide substitution, deletion, and insertion rates, while keeping track of “true” relationships of homology between individual residue positions (see Do biologists over-interpret computer simulations?).
(a) The simulation and analysis methods are not independent. All observations drawn from simulated data depend on the assumptions and simplifications of the model used to generate the data. This means that the results are biased towards those analysis methods that most closely match the assumptions of the simulation model.
(b) Simulations cannot straightforwardly, if at all, account for all evolutionary forces. This means that the simulations are not realistic, and their relevance for the behaviour of real datasets is unknown. The biggest failing in this regard is that, at some stage in the simulation, insertions and deletions are assumed to occur at random along the sequence (IID), and nothing could be further from the truth. Sequence variation occurs as a result of tandem repeats, inverted repeats, substitutions, inversions, translocations, transpositions, deletions, and insertions; and there are strong spatial constraints on variation such as codons and stem-loops. Current simulation methods fall well short of modeling these patterns of sequence variation.

2. The key idea behind consistency-based benchmarks is that different good aligners should tend to agree on a common alignment (namely, the correct one) whereas poor aligners might make different kinds of mistakes, thus resulting in inconsistent alignments.
(a) Two wrongs don't make a right. That is, consistent methods may be collectively biased. Moreover, consistency is not independent of the set of methods used (some may be consistent with each other and not with others).
(b) Consistency scores are a feature of several methods, which means that the benchmark is not independent.

3. Structural benchmarks most commonly employ the superposition of known protein/RNA structures as an independent means of alignment, to which alignments derived from sequence analysis can then be compared (see Edgar RC 2010. Quality measures for protein alignment benchmarks. Nucleic Acids Research 38: 2145-2153). The best known of these include: BAliBASE, OXBench, PREFAB, SABmark, IRMBase, and BRAliBase.
(a) Datasets are limited to structurally conserved regions, and may not be relevant for other alignment objectives.
(b) Deriving the structure-based alignments is problematic. For example, there is inconsistency amongst different structural superpositions.

4. Given a reference tree, the more accurate is the tree resulting from a given alignment, then the more accurate the underlying alignment is assumed to be (see Dessimoz C, Gil M 2010. Phylogenetic assessment of alignments reveals neglected tree signal in gaps. Genome Biology 11: R37).
(a) False inversion of a proposition: Accurate alignments yield accurate trees, therefore accurate trees must be based on accurate alignments.
(b) Alignment is often involved in constructing the reference tree. If not, the tree may be trivial in terms of taxon relationships.
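To illustrate what "consistency" in point 2 means in practice, here is a minimal sketch of one way to measure the agreement between two alignments of the same sequences: count the pairs of residues that each alignment places in the same column, and compare the two sets. The function names and the choice of shared pairs divided by all pairs are assumptions for illustration only; published benchmarks use their own scoring schemes.

```python
from itertools import combinations


def aligned_pairs(alignment, gap="-"):
    """Set of residue pairs that an alignment (name -> gapped sequence) puts in one column.

    Each residue is identified by (sequence name, position in the ungapped sequence).
    """
    names = sorted(alignment)
    length = len(alignment[names[0]])
    seen = {name: 0 for name in names}   # residues already passed in each sequence
    pairs = set()
    for col in range(length):
        present = []
        for name in names:
            if alignment[name][col] != gap:
                present.append((name, seen[name]))
                seen[name] += 1
        # Names are visited in sorted order, so each pair has a fixed orientation.
        pairs.update(combinations(present, 2))
    return pairs


def pair_agreement(aln_x, aln_y):
    """Shared aligned pairs as a fraction of all pairs found in either alignment."""
    px, py = aligned_pairs(aln_x), aligned_pairs(aln_y)
    union = px | py
    return len(px & py) / len(union) if union else 1.0


# Two alternative alignments of the same three sequences (ACGT, AGCT, ACGCT).
x = {"s1": "ACG-T", "s2": "A-GCT", "s3": "ACGCT"}
y = {"s1": "ACG-T", "s2": "AG-CT", "s3": "ACGCT"}
print(pair_agreement(x, y))
```

A high value tells us only that the two methods agree with each other; as noted under 2(a), it says nothing about whether either of them is right.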

Discussion

This evaluation leaves us in the invidious position of not yet having any benchmarking method that is relevant to homology assessment for multiple sequence alignments. This conclusion is at variance with other previous assessments (eg. Aniba MR, Poch O, Thompson JD 2010. Issues in bioinformatics benchmarking: the case study of multiple sequence alignment. Nucleic Acids Research 38: 7353-7363).

We need to consider what such a method might look like, and how we might go about constructing it. If biologists can't give the bioinformaticians a concrete goal for homology alignment then they can expect nothing in return.

It seems clear that we need to follow the idea behind option 3, but base the alignments on homology rather than structure. I once made a start on compiling some suitable datasets (see Morrison DA 2009. A framework for phylogenetic sequence alignment. Plant Systematics and Evolution 282: 127-149); but this was a very minor effort.

As I see it, we need alignments that are explicitly annotated with the reasons for considering the columns to be homologous. One suggestion would be to have relatively short alignments with annotations for "known" features, such as tandem repeats, inverted repeats, substitutions, inversions, translocations, transpositions, deletions, insertions, or stem-loops. These all create sequence variation, and they provide evidence of the homology relations among the sequences. Presumably the alignments would vary in length and number of sequences, and in the complexity of the patterns.

Perhaps the biggest practical problem will be how to deal with alignments where the homology criteria conflict with each other. That is, there are different types of criteria used to recognize homology — ie. similarity, structure, ontogeny, congruence (see Morrison DA 2015. Is multiple sequence alignment an art or a science? Systematic Botany 40: 14-26) — and they do not necessarily agree with each other.

This would allow us to come up with a set of requirements to specify various categories of the database, based on each of the above features. We would then try to accumulate as many example datasets for each category as we can. The database will presumably have protein-coding sequences in one section and RNA-coding, introns, etc in another. This dichotomy is simplistic, but I feel that it needs to be that way in order to be of practical use. Within each of those two sections we would have subsets of varying degrees of difficulty (eg. different degrees of average sequence similarity, or distinct taxon subsets in the same alignment, or orphan sequences).

This organisational approach is similar to that originally adopted for BAliBase, but it was dropped by most of the databases developed subsequently. I believe that it is the best approach for our purposes.

There are also experimentally created datasets where the alignment is known because all of the ancestors were sequenced as well. These would be useful; but their limitation is that the sequence variation was generated more or less at random, and so it does not match normal evolutionary processes. These alignments are more likely to match the IID assumption of the current automated alignment methods.

There is one further issue with this approach. Bioinformaticians often state that a few carefully prepared datasets are of little practical use to them (as opposed to being of use to phylogeneticists). What they need is a large number of datasets, the more the better. This is because they are interested in the percent success of their algorithms, and this cannot be assessed with small sample sizes. So, each alignment probably does not need to have too many taxa or too much sequence length; it is the number of alignments that is important, not their individual sizes. This could be achieved by sub-dividing larger datasets.
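As a rough sketch of what sub-dividing might involve, the following cuts one alignment into consecutive blocks of columns. The function name column_windows and the fixed-width windows are assumptions made for illustration; in a real database the cut points would presumably be chosen to respect the annotated features (repeats, stems, and so on) rather than being placed at regular intervals.

```python
def column_windows(alignment, width, gap="-"):
    """Cut an alignment (name -> gapped sequence) into blocks of at most `width` columns."""
    length = len(next(iter(alignment.values())))
    blocks = []
    for start in range(0, length, width):
        block = {name: seq[start:start + width] for name, seq in alignment.items()}
        # Drop sequences that consist only of gaps within this window.
        block = {name: seq for name, seq in block.items() if seq.strip(gap)}
        # A block needs at least two sequences to be an alignment at all.
        if len(block) >= 2:
            blocks.append(block)
    return blocks


aln = {"a": "ACGTAC--", "b": "AC-TACGG", "c": "------GG"}
blocks = column_windows(aln, 3)
for block in blocks:
    print(block)
```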

Monday, March 9, 2015

Another early noble pedigree


In a few recent blog posts I have discussed the early history of pedigrees, noting that they were usually presented as descent trees (with an ancestor at the top and the descendants below), although some later ones reversed this arrangement. This does not match our description of them as "family trees", of course, because the root of the pedigree is at the top.

I present here another early example, if for no other reason than that I have spent the past hour trying to decipher it. It is a Genealogy of the Saxon Dynasty, particularly the Ottonians. The picture is from the Chronica Sancti Pantaleonis, produced by the Benedictine monastery of Saint Pantaleon in Cologne in 1237 CE, which was itself based on the Chronica Regia Coloniensis [Royal Chronicle of Cologne], first compiled about 1177 CE in Michaelsberg Abbey, Siegburg.


Heinricus rex and Methildis regina are the founding couple in the double circle. Henry the Fowler did not himself become Holy Roman Emperor, but he created a situation where his descendants could do so, and did. They are numbered in the next diagram in the order in which they ruled. Number 9 is missing, this being Lothair II, who was not part of the family.


There are several things to note:
  • The interesting use of illustrative medallions, which seems to have been not uncommon at the time.
  • The consequent difficulty the illustrator has had in fitting the pedigree into the page, even though most of the descendants have been left out.
  • The pedigree is explicitly designed to establish noble ancestry, but females are included even when they are not in the direct line of descent.
  • The rulers nominally change families, from the Ottonian to the Salian to the Hohenstaufen dynasties, as a result of females in the direct line of descent.
  • Number 4 is Henry II, who made an appearance in an earlier post as the husband of Cunigunde of Luxembourg (The first royal pedigree).
  • Number 11 is Frederick I Barbarossa, who also made an appearance in an earlier post (Does it matter which way up a tree is drawn?).
  • The latter two points make it clear that the earliest written pedigrees were all closely related genealogically, and involved the attempts by certain parts of the German nobility to take control of the Holy Roman Empire, consisting at that time of what is now mostly Germany and Italy. Family descent was an important part of establishing who got to rule next.

Wednesday, March 4, 2015

Multiple sequence alignment


I started actively working on phylogenetic networks more than 10 years ago, when I gave a talk at the Phylogenetic Combinatorics and Applications meeting in Uppsala in July 2004.

However, before I started working on networks I had for several years been working on multiple sequence alignment methodology, and I still do. This work is also of direct relevance to network construction, of course, since faulty alignments will generate conflicting signals that can confound the biological signals that alone should appear in the network.

This year marks the 20th anniversary of my first publication in the alignment field (see the list appended below). To celebrate this I have some review / commentary articles planned. The first of these has now appeared online, and I would like to draw it to your attention:
  • Morrison DA (2015) Is multiple sequence alignment an art or a science? Systematic Botany 40: 14-26.
This paper relates current sequence alignment procedures to homology assessments as they are practiced for other data. Most algorithms can be seen as implementing only one of the several criteria that are used to identify homologies, which is inadequate. Suggestions are made for improving this situation.

Note: the second of these papers has now also appeared.


There will also be a couple of upcoming blog posts canvassing a few issues that I see as important for the future development of alignment methods.

Previous Publications

Theory

Ellis J, Morrison DA (1995) Effects of sequence alignment on the phylogeny of Sarcocystis deduced from 18S rDNA sequences. Parasitology Research 81: 696-699.

Morrison DA, Ellis JT (1997) Effects of nucleotide sequence alignment on phylogeny estimation: a case study of 18S rDNAs of Apicomplexa. Molecular Biology and Evolution 14: 428-441. [This has been the most cited of these publications, surprising me by still getting cited about once per month]

Morrison DA (2006) Multiple sequence alignment for phylogenetic purposes. Australian Systematic Botany 19: 479-539.

Morrison DA (2009) A framework for phylogenetic sequence alignment. Plant Systematics and Evolution 282: 127-149. [This was actually accepted for publication in 2007]

Morrison DA (2009) Why would phylogeneticists ignore computerized sequence alignment? Systematic Biology 58: 150-158.

Morrison DA (2010) [Book review of] ‘Sequence Alignment: Methods, Models, Concepts, and Strategies’. Systematic Biology 59: 363-365.

Empirical examples

Mugridge NB, Morrison DA, Johnson AM, Luton K, Dubey JP, Votypka J, Tenter AM (1999) Phylogenetic relationships of the genus Frenkelia: a review of its history and new knowledge gained from comparison of large subunit ribosomal RNA gene sequences. International Journal for Parasitology 29: 957-972.

Mugridge NB, Morrison DA, Heckeroth AR, Johnson AM, Tenter AM (1999) Phylogenetic analysis based on full-length large subunit ribosomal RNA gene sequence comparison reveals that Neospora caninum is more closely related to Hammondia heydorni than to Toxoplasma gondii. International Journal for Parasitology 29: 1545-1556.

Mugridge NB, Morrison DA, Jäkel T, Heckeroth AR, Tenter AM, Johnson AM (2000) Effects of sequence alignment and structural domains of ribosomal DNA on phylogeny reconstruction for the protozoan family Sarcocystidae. Molecular Biology and Evolution 17: 1842-1853.

Beebe NW, Cooper RD, Morrison DA, Ellis JT (2000) Subset partitioning of the ribosomal DNA small subunit and its effects on the phylogeny of the Anopheles punctulatus group. Insect Molecular Biology 9: 515-520.

Beebe NW, Cooper RD, Morrison DA, Ellis JT (2000) A phylogenetic study of the Anopheles punctulatus group of malaria vectors comparing rDNA sequence alignments derived from the mitochondrial and nuclear small ribosomal subunits. Molecular Phylogenetics and Evolution 17: 430-436.

Monday, March 2, 2015

Network art


I have occasionally mentioned in this blog the fact that phylogenetic trees have made it into the world of art. However, until now I have not really been able to say the same for phylogenetic networks. I am happy to report that I can now do so.


These three watercolours are from the collection of Sandra Black Culliton, a microbial geneticist.


At the time of writing the originals are still for sale at Etsy.


Alternatively, you can apparently ask her to produce one to order.