Wednesday, June 27, 2012

Charles Darwin's unpublished tree sketches, Part 2


In Part 1 of this series I discussed those unpublished tree sketches housed at The Complete Work of Charles Darwin Online. In particular, I noted that there is only one empirical phylogenetic tree in that collection. This post updates the information to include the other empirical trees that I know of. Part 3 discusses another possible tree, in one of Darwin's books.

Note: Several months after this blog post was written the following paper was published: Archibald JD (2012) Darwin's two competing phylogenetic trees: marsupials as ancestors or sister taxa? Archives of Natural History 39: 217-233. This paper reproduces both manuscript diagrams and discusses them in detail. The post has been updated accordingly.

Unpublished sketches

In a letter to Charles Lyell written on September 23 1860, Darwin argued for the origin of mammals from a single ancestor, and illustrated this with two alternative scenarios. The letter is housed in the library of the American Philosophical Society (APS 227; see An Annotated Calendar of the Letters of Charles Darwin in the Library of the American Philosophical Society 1799-1882, p. 147), but only one of the trees has been placed online.


The first Diagram posits separate monophyletic origins of Placentals and Marsupials from a non-placental non-marsupial ancestor, whereas the second one derives Placentals from within the early Marsupial group.


This letter has been published in both (i) The Life and Letters of Charles Darwin, Vol. II (1887) edited by Francis Darwin (London: John Murray), pages 341-344, and in (ii) The Correspondence of Charles Darwin, Volume 8: 1860 (1993) edited by Frederick Burkhardt, Janet Browne, Duncan M. Porter & Marsha Richmond (Cambridge: Cambridge University Press), as Letter #2925. Neither of these reproduces the original diagrams, but instead adopts the "practice to replace any handwritten annotations to the image with typeset text."


I have included here the typeset version from Life and Letters. Note that it incorrectly replaces Darwin's original "Carnivora" with "Canidae" (Correspondence gets it right), and resolves Darwin's polychotomy.

The relevant text from the letter is (including odd spellings and grammar):
I enclose two diaggrams showing the sort of manner I conjecture mammals have been developed: I thought a little on this when writing p. 429 [of The Origin] beginning "Mr. Waterhouse".— (Please read the paragraph.) I have not knowledge enough to choose between these two diagrams; if the Brain of Marsupials in embryo closely resembles that of placentals, I shd. strongly prefer nor. 2, & this agree with antiquity of microlestes. As a general rule I shd prefer nor. I diagram. whether or not Marsupials have gone on being developed or rising in rank from a very early period would depend on circumstances too complex for even a conjecture: Lingula has not risen since Silurian epoch, whereas other Molluscs may have risen.

The text accompanying the illustrations is:
A. Unknown form, probably intermediate between reptiles Mammals, Reptiles and Birds as intermediate as Lepidosiren now is between Fish and Beatractians.— probably more This unknown form is probably more closely related to Ornithorhynchus than to any other known form.—

As noted by Archibald (2012), this letter clearly shows Darwin's understanding of homology and how it relates to monophyletic groups. Given that it was written within a year of publishing the Origin, it is his earliest use of an empirical phylogeny, as well as the only time he compared two trees for the same taxa.

These days, the consensus is that Diagram I is the correct one.

Monday, June 25, 2012

Echinoderm Monday


This week we have a phylogenetic tree literally drawn by the organisms whose relationships are represented. This depicts the evolutionary relationships between the five classes of echinoderms: Asteroidea (sea stars), Ophiuroidea (brittle stars), Echinoidea (sea urchins and sand dollars), Holothuroidea (sea cucumbers), and Crinoidea (sea lilies and feather stars).


The picture is by Daniel Brown, who has many more illustrations on his blog, most of them available for sale as posters.

Wednesday, June 20, 2012

Rooted networks for exploratory data analysis


Leo van Iersel has recently been trying to convince me that rooted networks might also be useful as exploratory data analysis (EDA), in addition to the unrooted networks that I have championed in print (Morrison 2010) and in this blog. I have tried to find a dataset that will support his case, and the one discussed here is the best that I have been able to find.

In infection biology we are interested in the transmission of pathogens from one host to another, possibly in geographically distant locations. It is usually assumed that pathogens (viruses, bacteria, protists, microfungi, helminths) with the same genotype found in different locations represent transmission from a single source location. Conversely, a mixture of genotypes at a single location is assumed to represent multiple sources of infection, possibly at different times. This type of analysis is a combination of population genetics and phylogenetics.

Such transmission studies can produce quite complex results, even to the extent of having different pathogen genotypes simultaneously in the same host. Data analysis is usually based on either a rooted tree or an unrooted haplotype network, but it can also conveniently be studied using a rooted reticulation network. I will illustrate the latter with a simple example.


The figure shows a rooted network for 1,544 aligned nucleotides from 72 samples of the nematode Dictyocaulus viviparus, which is the parasitic lungworm of domestic cattle. The data are concatenated mitochondrial protein (2 genes), rRNA and tRNA gene sequences, from Höglund et al. (2006). The analysis shows the inferred historical relationships among 64 farm samples from Sweden (8 worms from each of Farms 29, 34, 36, 38, 49, 65, 68 and 76) and 8 samples from an isolate that had been maintained in the laboratory (L, used as the outgroup to root the network).

The data have been analyzed using the reticulation network method of Huson and Klöpper (2007), based on splits generated by the Median network. Since the character data are essentially binary (with two exceptions), this produces exactly the same result as for a recombination network.

In the network, most of the samples from within each farm seem to be closely related in a simple divergent fashion through time, as would also be conveniently displayed by a standard tree-based analysis. There are apparently two major clades of genotypes, with 6-7 subclades. We can conclude from the tree-like relationships that four farms show evidence of only a single source of infection (Farms 34, 36, 38 and 76 each have a single genotype), while two farms appear to have at least two genotypes and thus probably two sources of infection (Farms 49 and 68).

However, two of the farms show more complex patterns than these, which would not be revealed by a simple tree analysis. These two farms have groups of samples that descend from reticulation nodes (indicated by the arrows), thus suggesting the pooling of two distinct sources of genetic material. Note that there is no suggestion that these reticulations represent either recombination or hybridization, given that the data are from mitochondrial genes. This analysis is best treated as exploratory (EDA), highlighting genotypic complexity that warrants further biological investigation, rather than providing an explicit hypothesis of evolutionary history.

Farm 29 is shown as having one unique genotype (5 individuals) plus another genotype (3 individuals) that has elements possibly related to both of the major clades of genotypes. Perhaps these latter 3 individuals represent an earlier infection, given their apparent association with the basal branches of the two clades.

Farm 65 appears to be even more noteworthy. There are 3 individuals that are apparently related to those on Farm 36, plus 3 individuals of somewhat uncertain relationship. Then there are 2 individuals with elements possibly related to the genotypes on Farms 76 and 49. This is clearly a very interesting farm, from the point of view of lungworm infection and transmission, with at least three possible infection sources. This is important information that needs to be taken into account for possible management strategies.

This use of a rooted network analysis for exploratory data analysis seems not to have been considered before. However, it seems to me that it adds considerably to the practical information that can be gleaned from a study of the transmission of pathogens.

References

Höglund J., Morrison D.A., Mattsson J.G., Engström A. (2006) Population genetics of the bovine/cattle lungworm (Dictyocaulus viviparus) based on mtDNA and AFLP marker techniques. Parasitology 133: 89-99.

Huson D.H., Klöpper T.H. (2007) Beyond galled trees — decomposition and computation of galled networks. Lecture Notes in Bioinformatics 4453: 211-225.

Morrison D.A. (2010) Using data-display networks for exploratory data analysis in phylogenetic studies. Molecular Biology and Evolution 27: 1044-1057.

Saturday, June 16, 2012

Charles Darwin's unpublished tree sketches


Charles Darwin is usually considered to be the person who first explicitly used a tree as a metaphor (or, in his words, a simile) for genealogical history in the modern sense. Lamarck's earlier tree-like diagrams were intended as transformation series among species (based on morphoclines), and thus showed historical descent with modification, but they were not concerned with diversifying evolution in the manner envisioned by Darwin. Previously published diagrams by other authors were concerned with showing "affinity" that was not necessarily genealogical (see previous post).

It therefore perhaps comes as something of a surprise that Darwin never published an empirical phylogenetic tree. His only published diagram (in The Origin of Species) was entirely theoretical. Indeed, it expressed his doubt that empirical trees could be constructed, by not connecting many of the branches, thus indicating knowledge that he felt was impractical to obtain due to the paucity of the fossil record. (Or, as he put it in Notebook C in 1838: "The bottom of the tree of life is utterly rotten & obliterated in the course of [the] ages.") It was thus left to Fritz Müller in 1864 and St George Mivart in 1865 to publish the first known trees, followed by Franz Hilgendorf, Albert Gaudry and Ernst Haeckel in 1866 (Bigoni & Barsanti 2011, Tassy 2011) (see also Who published the first phylogenetic tree?).

Nevertheless, trees appear as sketches in Darwin's notebooks, note portfolios and book drafts, both before and after The Origin was published (1859), and at least one is an attempt at an empirical tree. These notes and drafts can be viewed online at The Complete Work of Charles Darwin Online. However, it seems to me to be worthwhile gathering in one place all of the sketches of which I am aware. I have provided copies here, in approximate chronological order, as well as links to the original in CWCDO.

This is Part 1 of a three-part series. It covers the tree sketches housed in the Darwin Archive at the Cambridge University Library. Part 2 covers sketches in a Darwin letter housed in the library of the American Philosophical Society. Part 3 discusses a possible tree in one of Darwin's published books. See also the post on Predecessors of Charles Darwin.

Note: This post has been updated since it was first published, to include several additional sketches.

Figure 1

The first pair of sketches is from page 26 of Notebook B, on "Transmutation of species" (dated 1837-1838). They are not actually trees, but instead consider Darwin's idea that maybe a coral would be the best metaphor (due to the existence of dead "branches"). Darwin had, of course, studied corals extensively on the Beagle voyage, and in May 1837 read a paper before the Geological Society about his theory for the development of coral reefs.


The text on pages 25-26 is reported to be:
[p.25 bottom] The tree of life should perhaps be called the coral of life, base of branches dead; so that passages cannot be seen. — this again offers [p.26 top] contradiction to constant succession of germs in progress
[note at very top] no only makes it excessively complicated
[between the sketches] Is it thus fish can be traced right down to simple organization. — birds — not.

Figure 2

The next sketch is the most famous one, these days even appearing as a tattoo on many people. It is from page 36 of the same Notebook B.


The text on pages 36-37 is reported to be:
[top] I think
[first note] Case must be that one generation then should be as many living as now
[second note] To do this & to have many species in same genus (as is) requires extinction
[below figure] Thus between A & B. immense gap of relation. C & B. the finest gradation, B & D rather greater distinction. Thus genera would be formed. — bearing relation [p.37] to ancient types. — with several extinct forms for if each species an ancient (1) is capable of making 13 recent forms. [there are 13 lines in the sketch that have a perpendicular line at the end] Twelve of the contemporarys must have left no offspring at all, [there are 12 lines that are without a perpendicular line at the end] so as to keep number of species constant.

Figure 3

The next five sketches are from the Collection of notes on "Principle of divergence, transitional organs/instincts" (dated 1839-1872). The sketches are pages torn out of the earlier notebooks and collated into the portfolio.

The first sketch is on the back of a page numbered 90, which is dated July 1843. It continues Darwin's search for a good phylogenetic simile.


The text appears to be:
a tree not good simile — endless piece of sea weed dividing

Figure 4

The second sketch is on the back of a page numbered 127, which is dated December 1848.


The text appears to be:
Genera again in same family are united into little groups - so throughout animal Kingdom - so children even in same family - It is universal law.

Figure 5

The next sketch is on a page numbered 183. It and the next two sketches are undated but are considered to be "early 1850s". This one considers a general concept of mammal history but without naming most of the groups.


The text appears to be:
[top] Let dots represent Genera ???
[note at far left] If these had all given descendants then these w.[would] have been a great series.
[note at tree base] Parents of Marsupials and Placentals
[note within tree] Rodents
[notes at right] no form intermediate
Rodents
Marsupials

Figure 6

The next sketch is on a page numbered 184 ("early 1850s"). It considers the relationship between genealogical trees and geological history.


The text appears to be:
[top] Dot means new form - eg. ancestors
[within figure] Palaeoz. Second[ary] Tertiary

Figure 7

The next sketch is on the back of the page numbered 184. It looks somewhat like a draft or a repeat of Figure 5.


The text is the same as for the earlier figure:
Parents of Placentals and Marsupials

Figure 8

The next sketch is from the Collection of notes on "Embryology", dated 1852-1855. The sketch is on the back of a flyer for Edward Strong Printing Office and Stationery Warehouse.


This page has at least five trees on it, with the largest of them over-written with several pieces of text. Darwin appears to be struggling with alternative ideas about the role of embryology in the evolution of mammalian and bird groups.

I will not attempt to interpret it here. However, Archibald (2014) has a figure (Figure 4.9 on page 94) that tries to elucidate most of the text.

Figure 9

This sketch is one that will look vaguely familiar to anyone who has read The Origin of Species. It appears between pages 26R & 26S of Darwin's draft book on Natural Selection (sometimes referred to as his Big Species Book). This was the book originally intended to introduce Darwin's ideas to the world (written from 1856-1858), but which he had to abandon once he realized that Alfred Wallace had independently developed the same general ideas.


This picture is thus the draft version of the tree that appears in The Origin, as Chapter 6 of Natural Selection was summarized as Chapter 4 of The Origin. Interestingly, the initial draft of the chapter (called "On Natural Selection") was completed by the end of March 1857, but a year later (mostly between April and June 1858) Darwin revised and expanded it, particularly by interpolating a new discussion of the "Principle of Divergence" that was 40 pages long (Stauffer 1975). The figure appears as a prominent part of this later addition. Apparently, it took Darwin some time to realize the importance of divergence for his book — important enough to warrant the only illustration.

Note that the tree is upside down as compared to the final version in The Origin, presaging modern uncertainty about which way to draw a "tree". Perhaps this was a switch from the usual top-down way of drawing human pedigrees to one that more closely matched his discussion of the fossil record, which is conventionally drawn bottom upwards. Since Darwin does not call the diagram a tree, the usual orientation of botanical trees presumably played no part.

Also, in many ways this draft is more complicated than the published one, with much more in the way of annotations. The subsequent simplification is typical of the relationship between Natural Selection and The Origin, as the latter was intended to be an "abstract" of the former.

The diagram is accompanied by this text:
Compositor: To be printed on separate page to be folded out and so all exposed. Attend to distance of capital letters from each other; the letters had better be smaller: Attend to chains of dots and hyphens. The numbers to small letters to be the very smallest possible. The capital and other letters in each Diagram to match exactly in position. — I hope the 4 Diagrams will go in length of page.

Figure 10

The final two sketches appear among the material gathered for preparation of the 1st edition of The Descent of Man (1871). They represent the only explicitly empirical trees that I know of among Darwin's notes (although see the sketches in Part 2), depicting primate relationships.

This one appears to be an initial draft of the next sketch. It is undated but is considered to be 1862-1865.


The text appears to be (clockwise from the lower left):
Man
Gorilla
Orang
Semnopithecus
Macaca
Dryopithecus
New World
Old World


Figure 11

This last, much re-written, sketch is clearly dated April 21 1868.


Darwin's progressive thought processes are clearly indicated by the successive versions of the figure and the many erasures (sometimes replacing a word with itself!). He seems to be trying to work out how best to present the diagram, rather than revising its content. Interestingly, the placement of Homo near the Simians seems to owe more to St George Mivart's 1865 tree of Primates than to Mivart's more detailed 1867 tree, where Homo is placed somewhat further away (see Bigoni & Barsanti 2011).

The text at the top appears to be (from the left):
Man
Gorilla & Chimp
Orangutan
Hylobates
Cercopithecus, Macacas, Baboons
Semnopithecus

The text at the bottom appears to be (clockwise from the left):
Old World Monkeys
New World Monkeys
Lemuridae
Primates


Afterword

It is worth noting that Darwin does not explicitly call any of these diagrams a "tree". In both Natural Selection and The Origin he refers to the Tree of Life at the end of the chapters containing the figures (see Penny 2011), and later in those books he refers to relationships as being "somewhat like the branches of a tree", but neither of these is a direct reference to any diagram. Darwin's point in using the figures was to describe a continuous process of diversification and extinction, and in doing so he drew a multi-branched bush (with episodic speciation) rather than a single-stemmed tree. This idea seems to have gone astray since then, and our current focus on trees thus misses much of what Darwin was trying to tell us.

References

Archibald J.D. (2014) Aristotle's Ladder, Darwin's Tree: the Evolution of Visual Metaphors for Biological Order. Columbia University Press, New York.

Bigoni F., Barsanti G. (2011) Evolutionary trees and the rise of modern primatology: the forgotten contribution of St. George Mivart. Journal of Anthropological Sciences 89: 93-107.

Mivart St G. (1865) Contributions towards a more complete knowledge of the axial skeleton in the Primates. Proceedings of the Zoological Society of London 33: 545-592.

Mivart St G. (1867) On the appendicular skeleton of the Primates. Philosophical Transactions of the Royal Society of London 157: 299-429.

Penny D. (2011) Darwin’s theory of descent with modification, versus the biblical Tree of Life. PLoS Biology 9: e1001096.

Stauffer R.C., ed. (1975) Charles Darwin's Natural Selection; being the second part of his big species book written from 1856 to 1858. Cambridge: Cambridge University Press.

Tassy P. (2011) Trees before and after Darwin. Journal of Zoological Systematics and Evolutionary Research 49: 89-101.

Tuesday, June 12, 2012

New version of phylogenetic networks software

There's a new release of Dendroscope with several new methods for constructing phylogenetic networks. This blog post explains the new functionality added to the Cass algorithm.

Cass can be used to construct a rooted phylogenetic network from any number of rooted trees. These trees can be multifurcating (nonbinary). The output network will display all clusters of the input trees. In certain situations the algorithm has been shown to minimize the reticulation number of the network (see van Iersel et al. 2010 and Kelk et al. 2012).
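To make "display all clusters of the input trees" concrete, here is a minimal Python sketch. It is not part of Cass or Dendroscope; it assumes, purely for illustration, that each rooted tree is written as nested tuples with leaf names as strings, and it collects the cluster (the set of leaves) below every internal node. The two example trees are hypothetical.

```python
def clusters(tree):
    """Return the set of clusters (frozensets of leaf names) of a rooted tree.

    The tree is a nested tuple, e.g. (("a", "b"), "c", "d"); leaves are strings.
    Multifurcating (nonbinary) nodes are allowed.
    """
    found = set()

    def leaves_below(node):
        if isinstance(node, str):          # a leaf
            return frozenset([node])
        below = frozenset().union(*(leaves_below(child) for child in node))
        found.add(below)                   # record the cluster at this internal node
        return below

    leaves_below(tree)
    return found


# Two hypothetical input trees on the same four taxa
tree1 = ((("a", "b"), "c"), "d")
tree2 = (("a", ("b", "c")), "d")

# A network that displays all clusters must represent every set in this union
all_clusters = clusters(tree1) | clusters(tree2)
for cluster in sorted(all_clusters, key=lambda c: (len(c), sorted(c))):
    print(sorted(cluster))
```

In this toy case the clusters {a, b} and {b, c} overlap without one containing the other, so no single tree can hold both, and any network representing all of the clusters needs at least one reticulation.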

The new release of Cass can also produce networks that display the input trees (rather than only the clusters from the input trees). In other words, Cass can be used as a heuristic for the problem Hybridization Number on any number of multifurcating trees. This is a notoriously hard problem and at the moment Cass is the only implemented algorithm for this problem.

For inputs consisting of two multifurcating trees, Cass solves Hybridization Number optimally (Kelk et al. 2012) and, for such instances, Dendroscope also contains a faster optimal algorithm (go to Algorithms - Hybridization Networks) by Huson and Linz (submitted). For bifurcating trees, there is one other heuristic available that can be used for any number of trees: the program PIRN by Yufeng Wu.

But, as I said, for more than two multifurcating trees, Cass is the only method currently available.

Let's have a look at the new functionality of Cass for these input trees:



The Cass algorithm can be found here (Algorithms - Level-k Network Consensus):


You get several new options:


If we don't check the box "construct only networks that display the trees", we get the following network with one reticulation, which displays all clusters of all input trees:


If we do check the box "construct only networks that display the trees", we get the following three networks with two reticulations each. Each of the networks displays all three input trees.


You see that in this case one needs more reticulations to display the trees than to display the clusters from the trees. You also see that Cass can now produce several solutions rather than just one. Note however that Cass is not guaranteed to find all optimal solutions. As a result of a collapsing step in Cass, it might miss networks (see Kelk et al. 2012). That is also the reason why Cass is not guaranteed to find an optimal solution (unless the input consists of only two trees or the output network is at most level-2).

Finally, Cass can also be used to construct a network from a multi-labelled tree. Go to Algorithms - Multi-Labelled-Tree to Network - MUL to Network, level-k-based.

Monday, June 11, 2012

Wordle, TreeCloud and SplitsNetworkCloud


There are a number of available ways to analyze word frequency and usage in a block of text, and to display the result as a diagram. Here, I have applied three of them to my one and only published book (after deleting extraneous text such as the references and glossary), to find out what my writing style is like. The results are not as embarrassing as I feared.

Wordle

This analysis uses word size in the diagram to represent word frequency in the text.


It is good to note that most of the words refer to the topics rather than coming specifically from my writing style. Note that "data" is one of the most used words, but this actually comes from expressions like "data-display network". Sadly, "also", "although", "however", "important", "might", "much", "necessarily", "particular", "rather" and "way" seem to get a bit of a workout in the book. The only author who makes it onto the list is "Huson", not unexpectedly.
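The word counting that underlies a diagram of this kind can be sketched in a few lines of Python. This is not Wordle's own code: the stop-word list is a tiny hypothetical one, the sample sentence is invented, and the rule for what counts as a word (internal hyphens kept, so that "data-display" is one word) is an assumption made for the example.

```python
import re
from collections import Counter

# A tiny, hypothetical stop-word list; a real one would be much longer
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}


def word_frequencies(text, stop_words=STOP_WORDS):
    """Count how often each word occurs, ignoring case and stop words."""
    # Keep internal hyphens and apostrophes so "data-display" stays one word
    words = re.findall(r"[a-z]+(?:[-'][a-z]+)*", text.lower())
    return Counter(w for w in words if w not in stop_words)


# Invented sample text, used only to exercise the function
sample = ("A data-display network is a network used to display the data; "
          "evolutionary networks display the history of the data.")
for word, count in word_frequencies(sample).most_common(3):
    print(word, count)
```

Whether hyphenated expressions are split or kept whole changes the counts for words such as "data", which is the effect noted above.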


TreeCloud

The TreeCloud output helps make some of the word patterns more clear, since it uses clusters on an unrooted tree to represent words that occur near each other in the text, thus introducing word context into the analysis. Many fewer words are displayed, thus focussing on topics rather than writing style.


Proximity presumably explains why the words "network" and "networks" are at opposite ends of the tree — they are used in quite different contexts in the book. This is also why both "data" and "data-display" occur in the tree, since "data patterns" is a commonly used expression. Also, the expression "shown in the figure" arises from the large number of illustrations, thus explaining the (perhaps unexpected) appearance of the two words.


SplitsNetworkCloud

This analysis generalizes the TreeCloud output to a data-display network. This makes even clearer some of the complexity of the word associations. The number of words has been reduced, to make the diagram less complex than it would otherwise be. Also, word colour refers to the relative placement in the book, with red at the beginning and blue at the end.


Notably, "network" is not specifically associated with any particular word except "reticulations", whereas "networks" appears preferentially in the expressions "evolutionary networks" and "data-display networks". It is perhaps noteworthy that I use the expression "number of reticulations" rather than "reticulation number", thus revealing my non-mathematical background.

Thanks to Philippe Gambette for producing the SplitsNetworkCloud.

Thursday, June 7, 2012

Networks in the news


Phylogenetic networks are rather specialized things, and so we do not expect to find much mention of them outside the specialist literature. So far, we cannot compete with the "Tree of Life", which as an expression has a 2,000 year history at least.

That does not mean that the press don't, on occasion, try to explain what we are up to. This piece about Pete Lockhart provides a good example:

Karen Sieber (2010) Humboldtians in Focus: Farewell to the phylogenetic tree. Humboldt Kosmos 95:4.

Tuesday, June 5, 2012

Operads and the Tree of Life


As a biologist, much of the mathematical world passes me by. Therefore, whenever I come across anything mathematical on the web, I may or may not understand it.

I freely admit that this blog post, posted last year at a blog called Azimuth, went straight over my head:
Operads and the Tree of Life

The author's conclusion is this:
"And so, phylogenetic trees turn out to be related to coproducts of operads. Who’d have thought it? But we really don’t have as many fundamentally different ideas as you might think: it’s hard to have new ideas. So if you see biologists and algebraic topologists both drawing pictures of trees, you should expect that they’re related."

I mention it here for what it may be worth to any mathematician who reads this post.

Sunday, June 3, 2012

A network analysis of Médoc wines


In recent weeks I have analyzed the behavior of well-known wine commentators when evaluating the red wines of Bordeaux for two specific vintages (2004 and 2005). I will finish this series by expanding the analysis to include many vintages and also the "official" classification of the Bordeaux wineries (called "châteaux" even if they do not have an actual château).

The initial data that I have analyzed were taken from:
Albert Di Vittorio and Victor Ginsburgh (1996) Pricing red wines of Médoc vintages from 1949 to 1989 at Christie's auctions. Journal de la Société Statistique de Paris 137: 19-49.
I have then supplemented these data as best I could, to fill in missing values.

The first dataset consists of overall quality ratings of each of the 41 Médoc ("left bank") vintages from 1949-1989 from four sources:

  • Tastet & Lawton, the oldest wine broker of the Quai des Chartrons in Bordeaux, as originally compiled by Franck Dubourdieu (1992) Les Grands Bordeaux de 1945-1988. Bordeaux: Mollat
  • Robert M. Parker (1990) Les Vins de Bordeaux. Paris: Solar; supplemented with data from his online chart
  • James Suckling and Thomas Matthews (1994) The Bordeaux 50 up close. Wine Spectator, October 15; supplemented with data from an online chart
  • The price index constructed by Di Vittorio and Ginsburgh based on a regression analysis of Christie's auction sales, which takes into account wine age, no. bottles, bottle size, year of sale, case opened/unopened.

The scores have been converted to a 0-20 scale; and there are considerable missing data, especially before 1970. The Manhattan distance measure was calculated between each pair of evaluators, and the result displayed as a Split Decomposition network. People who are closely connected in the network are similar to each other based on their rating patterns, and those who are further apart are progressively more different from each other.


Note, first, that the split uniting the critic Robert Parker with the auction prices is slightly better supported than the one for the critic James Suckling. The local Bordeaux brokers (Tastet & Lawton) do not have a split shown that unites them with the prices. This supports the common observation that it is usually the internationally known wine critics' evaluations of the vintages that strongly influence the auction prices: a well-regarded vintage will sell for a higher amount.

Second, the terminal edges are quite long relative to the internal ones, indicating that a lot of the variation of the vintage evaluations is unique to each evaluation source. They certainly do not agree very much with each other about which vintages are better than others. In particular, note that the auction price has a very long terminal edge, indicating that much of the variation in the prices is actually independent of the opinion of the professional wine evaluators about the quality of the vintage.
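The Manhattan distance calculation behind this network can be sketched as follows. The ratings and source names are hypothetical, and the handling of missing data (skipping any vintage that either source did not rate, then averaging over the shared vintages) is an assumption, since it is not stated above how the gaps were treated.

```python
def manhattan_distance(scores_a, scores_b):
    """Mean absolute difference over the vintages that both sources rated.

    Missing ratings are given as None and skipped (an assumed treatment).
    Returns None when the two sources share no rated vintage.
    """
    shared = [(a, b) for a, b in zip(scores_a, scores_b)
              if a is not None and b is not None]
    if not shared:
        return None
    return sum(abs(a - b) for a, b in shared) / len(shared)


# Hypothetical 0-20 ratings for five vintages (None = not rated)
ratings = {
    "Broker": [16, None, 12, 18, 10],
    "Critic": [17, 14, 10, 19, None],
    "Prices": [12, 15, 11, 20, 8],
}

names = sorted(ratings)
for i, first in enumerate(names):
    for second in names[i + 1:]:
        d = manhattan_distance(ratings[first], ratings[second])
        print(first, second, round(d, 2))
```

Averaging over the shared vintages keeps pairs with many gaps comparable to pairs with few; the resulting pairwise distances are the kind of input that a split decomposition analysis works from.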

Moving on, let us now consider the wines themselves.

In 1855 the Bordeaux Chamber of Commerce requested the Bordeaux Brokers' Association to rank and classify 60 Haut-Médoc châteaux according to quality, ranking them in five categories: First to Fifth Growth. This was done for use at the Exposition Universelle de Paris, the world's fair of the day. This ranking is said to have been based on the long-term wine prices achieved by the various vineyards, plus some local politicking (the prestige of the wine and the owner, etc).

It is preposterous to imagine that this ancient ranking could still apply unchanged today, and yet it continues to play a large part in contemporary wine lore and seems to be a big determinant of wine prices (eg. the five "first growths" sell for 3-6 times the price of the best-regarded "second growths", even though it is widely acknowledged by wine critics that today they are not much better in quality).

Naturally, some people have not taken this situation lying down, and they have done something about developing a ranking and classification that has a bit more contemporary relevance:

  • Alexis Lichine (1979) Guide to the Wines and Vineyards of France. London: Weidenfeld & Nicolson.
  • Patrick Dussert-Gerber (1988) Guide des Vins de France 1989. Paris: Albin Michel.
  • Robert M. Parker (1990) Les Vins de Bordeaux. Paris: Solar.
  • The Liv-ex Fine Wine 100 Index, which tracks the prices of 100 top wines (95% of which are from Bordeaux). Data were released in 2009, based on price averages calculated from 2003-2007, and again in 2011, based on prices from 2005-2009. I have taken the average of these two data sources (and called it the 2010 price).

For the first data analysis I have restricted myself to the 5-class classification rather than the rankings, so that the scale for each château is 1-5 for each of the five classification sources, with a small amount of missing data. The Manhattan distance measure was calculated between each pair of classifications, and the result displayed as a Split Decomposition network.


Note, first, that the creation of the Liv-ex 2010 wine price was an exercise designed to emulate the procedure used to create the original 1855 classification. It is thus interesting to note that the best-supported split separates these two classifications! As anticipated, modern perception of the quality of the châteaux does not closely match that 150 years ago; and it is time that wine writers stopped mentioning 1855 as their first comment when introducing Bordeaux wine.

The best-supported split actually unites the Parker classification with the current prices, which is not unexpected given the influence that Parker is recognized to have on Bordeaux wine prices, particularly in the USA. It is for this reason that Parker-recommended wines will never be good value for money.

The long terminal edges for most of the classifications emphasize their unique nature. Indeed, the lack of uniformity between the classifications makes wine classification look distinctly more like an art than a science.

Finally, it is interesting to note that Lichine's classification appears to be a compromise between the current price and the 1855 classification — this is supported by the second-biggest split, and Lichine has the shortest terminal edge. This was, of course, Lichine's stated objective: a marriage of the old and the new.

This leads us to the châteaux rankings rather than their classification. The following graph plots the 1855 ranking of 50 châteaux (out of the original 60) against their ranking based on the 2010 Liv-ex Fine Wine price ranking.



Only the top three châteaux are still in their same ranked position, although 34 (two-thirds) of them are within a rank of 11 of where they were in 1855. However, the calculated r-square value reveals that only 32% of the variation in current ranking is associated with the 1855 ranking. Metaphorically, the glass is either one-third full or two-thirds empty, depending on which way you view your wine glass.
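For readers who want to see what an r-square value between two rankings involves, here is a minimal Python sketch. It assumes the value is the squared Pearson correlation between the two sets of ranks (which, for ranks, is the square of Spearman's correlation); the six pairs of ranks are hypothetical and are not the data plotted above.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)


# Hypothetical ranks for six châteaux (not the real data)
rank_old = [1, 2, 3, 4, 5, 6]
rank_new = [1, 2, 3, 6, 4, 5]

print(round(r_squared(rank_old, rank_new), 3))
```

A value of 1 would mean the two rankings are identical (or exactly reversed), and a value near 0 would mean that knowing one ranking tells you almost nothing about the other.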

For those of you who are interested, the top five wines are (in order):
Château Lafite Rothschild, Château Latour, Château Margaux, Château Mouton Rothschild, and Château Haut-Brion (with Mouton and Haut-Brion having swapped places since 1855). The stand-out second growths are: Château Palmer (up from rank 28) and Château Léoville-Las Cases (up from 9).