 Lecture 2.3

Phylogeny: the study of evolutionary relationships


I. Phylogenetic analysis - a powerful tool for inferring evolutionary relationships from molecular data

             A.  Many advantages to molecular phylogenies over phenotype-based phylogenies

                        1.  Distantly related organisms can have similar phenotypes because of convergent evolution

                        2.  Many organisms (e.g., bacteria) do not have easily studied phenotypic features

                        3.  With distantly related organisms, such as bacteria and mammals, it is unclear which phenotypic traits could be compared

                        4.  Analysis based on DNA and protein sequences is often free of such problems


            B.  Phylogeny is typically diagrammed as a tree (2-D graph of evolutionary relationships) - trees take many forms

                        1.  Lines with names - species or groups of species; those joined by a shared line are more closely related to each other than to other groups on the tree

                        2.  Following lines back to where they join other lines leads to the ancestral species of the named groups

                        3.  Groups that all lead back to the same line share a common ancestor; the common ancestor of every group on the tree is the root

                        4.  Root is defined by including an outgroup - a related group which branched off earlier than the groups being studied


            C.  Based on a series of assumptions

                        1.  Changes in characteristics occur in lineages over time

                        2.  Any group of organisms is related by descent from a common ancestor

                        3.  Lineages split in a bifurcating pattern (each ancestral line divides into two)

                        4.  Each sequence position changes independently of all other positions


            D.  Different genes evolve at different rates

                       1.  Use different genes to build trees depending on phylogenetic distances involved

                        2.  SSU ribosomal RNAs and proteins have functionally conserved positions that reveal their homology, so they work well for distantly related species

                        3.  DNA and LSU rRNA sequences work better for more closely related species

                        4.  Mitochondrial genes work well for comparisons of very closely related species or races, as the mitochondrial genome mutates at a high rate


            E.  Requires an optimal alignment to start

                        1.  Need to establish which characters occupy the same position in each sequence so that changes at that position can be followed through sequence evolution

                        2.  Poor alignment leads to false trees - time spent on getting a good alignment is well spent

                        3.  Dealing with gaps is a significant problem

                                    a.  Some programs treat a gap as a 5th character

                                    b.  A 50 nt gap shared by two sequences is not really a perfect match of 50 nt, as the gap may have been produced by a single event

                                    c.  Sequences sharing large gaps will group together if no compensation is made
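
To make the gap problem in 3a-3c concrete, the sketch below compares two ways of scoring a pair of aligned sequences: counting every shared gap column as a match (gap as a 5th character), and counting each contiguous shared gap as one event. The function names and the particular single-event scoring rule are illustrative assumptions, not the scheme used by any specific alignment or phylogeny program.

```python
def shared_gap_summary(seq1, seq2, gap="-"):
    """Return (shared gap columns, contiguous shared gap runs)."""
    columns = 0
    runs = 0
    in_run = False
    for a, b in zip(seq1, seq2):
        if a == gap and b == gap:
            columns += 1
            if not in_run:
                runs += 1  # a new shared gap starts here
            in_run = True
        else:
            in_run = False
    return columns, runs


def identity(seq1, seq2, gap="-", gap_as_fifth_state=True):
    """Fraction of matching positions between two aligned sequences.

    gap_as_fifth_state=True  : every shared gap column counts as a match.
    gap_as_fifth_state=False : each shared gap run counts as one matching
                               event (an assumed, simplified compensation).
    """
    columns, runs = shared_gap_summary(seq1, seq2, gap)
    matches = sum(a == b and a != gap for a, b in zip(seq1, seq2))
    total = len(seq1)
    if gap_as_fifth_state:
        return (matches + columns) / total
    return (matches + runs) / (total - columns + runs)


s1 = "ACGT-----ACGA"
s2 = "ACGT-----TCGT"
print(shared_gap_summary(s1, s2))
print(identity(s1, s2, gap_as_fifth_state=True))
print(identity(s1, s2, gap_as_fifth_state=False))
```

With the 5th-character scoring the shared gap inflates the apparent similarity of the pair, which is why sequences sharing large gaps tend to group together.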


II. Methods

                A.     Evolutionary distances - represent number of steps to go from ancestor to extant species

                        1.  Number of substitutions is typically the single most important variable in any molecular evolutionary analysis

                        2.  If sequences are closely related, the straight count of differences is probably right

                        3.  Highly divergent sequences have a greater likelihood of multiple substitutions at a given site - straight count would be an underestimate

                        4.  Models correct for this when estimating evolutionary distance

                                    a.  Jukes-Cantor model - assumes the rate of change to each of the 3 alternative nucleotides is equal - not strictly true (for instance, transitions occur about 3X as frequently as transversions) but still useful

                                    b.  Kimura’s 2-parameter model - allows different rates for transitions vs. transversions but uses a uniform rate within each type

                                    c.  Models with even more parameters - matrices for 12 different substitution types and other variables - rest on a large number of assumptions - 1- and 2-parameter models generally give more reliable results than the complex models
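
The corrections in 4a and 4b can be written as short formulas. With p the proportion of differing sites, the Jukes-Cantor distance is d = -(3/4) ln(1 - 4p/3); with P and Q the proportions of transitions and transversions, the Kimura 2-parameter distance is d = -(1/2) ln((1 - 2P - Q) sqrt(1 - 2Q)). The sketch below is a minimal illustration; the function names are made up for this example, and skipping gapped columns is an assumed simplification.

```python
import math

# Purine<->purine and pyrimidine<->pyrimidine changes are transitions.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def count_differences(seq1, seq2):
    """Return (sites compared, transitions, transversions), skipping gaps."""
    n = ts = tv = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # skip gaps and ambiguous characters
        n += 1
        if a == b:
            continue
        if (a, b) in TRANSITIONS:
            ts += 1
        else:
            tv += 1
    return n, ts, tv


def p_distance(seq1, seq2):
    """Straight count of differences as a proportion of compared sites."""
    n, ts, tv = count_differences(seq1, seq2)
    return (ts + tv) / n


def jukes_cantor(seq1, seq2):
    """Jukes-Cantor corrected distance (one rate for all substitutions)."""
    p = p_distance(seq1, seq2)
    return -0.75 * math.log(1 - 4 * p / 3)


def kimura_2p(seq1, seq2):
    """Kimura 2-parameter distance (separate transition/transversion rates)."""
    n, ts, tv = count_differences(seq1, seq2)
    P, Q = ts / n, tv / n
    return -0.5 * math.log((1 - 2 * P - Q) * math.sqrt(1 - 2 * Q))


a = "ACGTACGTAC"
b = "ACGTACGTAT"  # one transition (C -> T) in 10 sites
print(p_distance(a, b), jukes_cantor(a, b), kimura_2p(a, b))
```

Both corrected distances come out larger than the straight count, and the gap between them widens as sequences diverge. Both formulas break down when the sequences are so different that the logarithm's argument is zero or negative.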


            B.  Ancestral species are not available for comparison, so relationships must be estimated from extant sequences


            C.  Rooted (common ancestor with unique path from it to any other node) vs. unrooted (only specify relationships between the nodes - nothing about direction of evolution)

             D.  Several methods of analysis available for this estimation

                        1.  Distance methods - use pairs of sequences to get a dissimilarity measure

                                    a.  Common ones - UPGMA (unweighted-pair-group method with arithmetic mean) and NJ (neighbor joining)

                                    b.  Calculates total number of changes - scored according to type - between every pair of sequences in alignment

                                    c.  Represents minimum number of changes required to convert 1 sequence to another

                                    d.  Results are written to a distance matrix, which can be used to generate a tree in several possible ways - branch lengths visually represent amount of change

                                    e.  Removing ambiguous sections will influence branch length estimates
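
The distance-matrix workflow in 1a-1d can be sketched as follows: compute a dissimilarity for every pair of aligned sequences, then repeatedly join the two closest clusters, averaging their distances to the remaining clusters (UPGMA). This is a minimal sketch that returns only the branching order as a Newick-style string; it does not compute branch lengths, uses uncorrected distances, and its function names and toy sequences are assumptions made for illustration.

```python
from itertools import combinations


def p_distance(a, b):
    """Proportion of differing sites, ignoring columns with a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    return sum(x != y for x, y in pairs) / len(pairs)


def distance_matrix(seqs):
    """Map each unordered pair of names to its pairwise distance."""
    return {
        frozenset((a, b)): p_distance(seqs[a], seqs[b])
        for a, b in combinations(seqs, 2)
    }


def upgma(dist, names):
    """Cluster by UPGMA and return the topology as a Newick-style string."""
    sizes = {name: 1 for name in names}  # cluster label -> number of taxa
    d = dict(dist)
    while len(sizes) > 1:
        a, b = sorted(min(d, key=d.get))  # closest pair of clusters
        merged = f"({a},{b})"
        for k in sizes:
            if k in (a, b):
                continue
            # Size-weighted average = arithmetic mean over all member taxa.
            d[frozenset((merged, k))] = (
                d[frozenset((a, k))] * sizes[a] + d[frozenset((b, k))] * sizes[b]
            ) / (sizes[a] + sizes[b])
        d = {pair: v for pair, v in d.items() if a not in pair and b not in pair}
        sizes[merged] = sizes.pop(a) + sizes.pop(b)
    return next(iter(sizes)) + ";"


seqs = {
    "A": "ACGTACGT",
    "B": "ACGTACGA",
    "C": "ACGAACTA",
    "D": "TCGAACTA",
}
print(upgma(distance_matrix(seqs), list(seqs)))  # ((A,B),(C,D));
```

Neighbor joining uses the same kind of matrix but a different joining criterion, so the two methods can give different trees from the same distances.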


                        2.  Maximum parsimony method - character based analysis

                                    a.  Parsimony means thrift or stinginess

                                    b.  Based on corresponding sequence positions

                                    c.  Uses only “informative” positions - sites where at least two character states each occur in at least two sequences

                                    d.  Finds tree that requires the fewest mutational events overall - 2 underlying premises

                                                i.  Mutations are exceedingly rare events

                                                ii.  The more unlikely events a model invokes, the less likely the model is correct

                                    e.  Calculates branch order - not branch length

                                    f.  Advantage - calculations are rapid, can infer ancestral sequences

                                    g.  Disadvantage - large amount of data is discarded - a problem with short sequences or sequences without many informative sites
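
As an illustration of the parsimony items above, the sketch below first picks out the informative columns of an alignment and then counts the minimum number of changes a given tree requires, using Fitch's set-intersection rule. The nested-tuple tree representation, the function names, and the toy sequences are assumptions for this example; a real analysis would also search over many candidate trees rather than the three compared here.

```python
def informative_sites(alignment):
    """Indices of columns where >= 2 states each occur in >= 2 sequences."""
    sites = []
    for i, column in enumerate(zip(*alignment)):
        counts = {}
        for ch in column:
            counts[ch] = counts.get(ch, 0) + 1
        if sum(1 for c in counts.values() if c >= 2) >= 2:
            sites.append(i)
    return sites


def fitch_score(tree, column):
    """Return (possible ancestral states, minimum changes) for one column.

    tree   : a leaf name (str) or a 2-tuple of subtrees
    column : dict mapping leaf name -> character at this position
    """
    if isinstance(tree, str):
        return {column[tree]}, 0
    left_set, left_cost = fitch_score(tree[0], column)
    right_set, right_cost = fitch_score(tree[1], column)
    common = left_set & right_set
    if common:
        return common, left_cost + right_cost  # no change needed here
    return left_set | right_set, left_cost + right_cost + 1  # one change


def tree_length(tree, seqs):
    """Total minimum number of changes the tree requires over all columns."""
    n = len(next(iter(seqs.values())))
    return sum(
        fitch_score(tree, {name: s[i] for name, s in seqs.items()})[1]
        for i in range(n)
    )


seqs = {
    "A": "ACGTACGT",
    "B": "ACGTACGA",
    "C": "ACGAACTA",
    "D": "TCGAACTA",
}
print(informative_sites(list(seqs.values())))      # [3, 6]
print(tree_length((("A", "B"), ("C", "D")), seqs))  # 4
print(tree_length((("A", "C"), ("B", "D")), seqs))  # 6
print(tree_length((("A", "D"), ("B", "C")), seqs))  # 6
```

Only the two informative columns distinguish the trees; the other variable columns cost one change on every topology, which is why parsimony can discard them.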


                        3.  Maximum Likelihood method

                                    a.  Purely statistically based method

                                    b.  Likelihood of a particular nt being replaced by another from the pool of nt is calculated

                                    c.  Uses every site, unlike parsimony, as unchanged sites have a chance of having changed and then changed back

                                    d.  For each possible tree - likelihood of changes is calculated at each aligned position and these probabilities are multiplied to get the tree likelihood

                                    e.  Tree with the maximum likelihood is taken as the best estimate - the tree under which the observed sequences are most probable

                                    f.  Disadvantage - very slow to calculate, only as good as substitution model used
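
The per-site multiplication described above can be shown on the smallest possible case: two sequences separated by a single branch. The sketch below assumes the Jukes-Cantor substitution model, multiplies the site probabilities (as a sum of logs), and scans candidate branch lengths for the maximum. It is not a tree search; the function names and grid-scan approach are assumptions chosen for brevity.

```python
import math


def site_probability(a, b, d):
    """Probability of observing nt a and b at one site under Jukes-Cantor,
    given d expected substitutions per site between the two sequences."""
    decay = math.exp(-4.0 * d / 3.0)
    if a == b:
        return 0.25 * (0.25 + 0.75 * decay)  # unchanged, or changed back
    return 0.25 * (0.25 - 0.25 * decay)      # ended as one specific other nt


def log_likelihood(seq1, seq2, d):
    """Log of the product of site probabilities (every site contributes)."""
    return sum(math.log(site_probability(a, b, d)) for a, b in zip(seq1, seq2))


def best_distance(seq1, seq2, step=0.001, max_d=2.0):
    """Scan candidate distances and return the one with maximum likelihood."""
    candidates = [step * i for i in range(1, int(max_d / step) + 1)]
    return max(candidates, key=lambda d: log_likelihood(seq1, seq2, d))


a = "ACGTACGTAC"
b = "ACGTACGTAT"
print(best_distance(a, b))
```

For this model the maximum lands at the Jukes-Cantor corrected distance. A full analysis repeats this kind of calculation over every branch of every candidate tree, which is the source of the method's cost.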


                        4.  Masks - weighting schemes based on empirical data sets


                        5.  Tree confidence - validity tests - robustness of trees

                                    a.  Phylogenetic prediction has greater credibility if found using at least 2 fundamentally different methods

                                    b.  Use statistics to judge validity of branch points within the tree

                                    c.  Bootstrapping (Felsenstein) - allows rough quantification of tree confidence

                                    d.  Resamples original dataset to produce a series of datasets with different random selections of original data

                                    e.  Positions are sampled with replacement rather than deleted, so each new dataset remains the same size as the original

                                    f.  Tenuous groupings will be easily disrupted, unlike robust associations

                                    g.  Generally create 100 or 1000 datasets to get bootstrap values - the percentage of resampled datasets whose tree contains a given grouping

                                    h.  Values above 50% may be reliable, but generally use 70% as the cutoff

                                    i.  Not rigorous statistics - must be treated with caution - simulations have shown tendency to underestimate confidence level at high values and to overestimate at low levels
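
A minimal sketch of the bootstrap procedure: alignment columns are drawn at random with replacement until the resampled alignment is as long as the original, a tree is built from each resampled alignment, and the bootstrap value of a grouping is the percentage of replicates in which it appears. The tree builder is passed in as a function; `closest_pair_clades` below is only a toy stand-in (it reports the single most similar pair), and all names here are hypothetical.

```python
import random
from itertools import combinations


def bootstrap_alignment(seqs, rng):
    """Resample columns with replacement; length equals the original."""
    length = len(next(iter(seqs.values())))
    cols = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(s[i] for i in cols) for name, s in seqs.items()}


def bootstrap_support(seqs, build_tree, clade, replicates=100, seed=0):
    """Percentage of replicates whose tree contains `clade`.

    build_tree : function taking an alignment dict and returning a set of
                 clades, each a frozenset of sequence names
    clade      : frozenset of sequence names to look for
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(replicates):
        if clade in build_tree(bootstrap_alignment(seqs, rng)):
            hits += 1
    return 100.0 * hits / replicates


def closest_pair_clades(seqs):
    """Toy tree builder: the one pair of sequences with fewest differences."""
    def diffs(a, b):
        return sum(x != y for x, y in zip(seqs[a], seqs[b]))
    a, b = min(combinations(sorted(seqs), 2), key=lambda p: diffs(*p))
    return {frozenset((a, b))}


seqs = {
    "A": "ACGTACGT",
    "B": "ACGTACGA",
    "C": "ACGAACTA",
    "D": "TCGAACTA",
}
print(bootstrap_support(seqs, closest_pair_clades, frozenset(("A", "B"))))
```

In practice `build_tree` would be a distance, parsimony, or likelihood method, and a support value would be reported for every branch point in the tree.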


© 2002 The Board of Regents of the University of Wisconsin System.

Comments regarding this site or its links may be sent to Scott Cooper.