Phylogeny: the study of evolutionary relationships
I. Phylogenetic analysis - a powerful tool for inferring evolutionary relationships from molecular data
A. Molecular phylogenies have many advantages over phenotype-based phylogenies
1. Distantly related organisms can have similar phenotypes because of a process called convergent evolution
2. Many organisms, such as bacteria, do not have easily studied phenotypic features
3. With distantly related organisms, such as bacteria and mammals, it is unclear which phenotypic traits could be compared
4. Analysis based on DNA and protein sequences is often free of such problems
B. Phylogeny is typically diagrammed as a tree (a 2-D graph of evolutionary relationships) - many forms
1. Lines with names - species or groups of species that are more closely related to each other than to other groups on the tree
2. Following lines back to where they join other lines leads to the ancestral species of the named groups
3. Groups which all lead back to the same line share a common ancestor (root)
4. The root is defined by including an outgroup - a related group which branched off earlier than the groups being studied
C. Based on a series of assumptions
1. Changes in characteristics occur in lineages over time
2. Any group of organisms is related by descent from a common ancestor
3. Lineages split in a bifurcating pattern
4. Each position changes independently of all other positions
D. Different genes evolve at different rates
1. Use different genes to build trees depending on the phylogenetic distances involved
2. SSU ribosomal RNAs and proteins have functionally conserved positions that reveal their homology, and work well for distantly related species
3. DNA and LSU rRNA sequences work better for more closely related species
4. Mitochondrial genes work well for comparisons of very closely related species or races, as the mitochondrial genome mutates at a high rate
E. Requires an optimal alignment to start
1. Need to establish which characters represent the same position in each sequence, so that it is possible to see how that position has changed through sequence evolution
2. Poor alignment leads to false trees - time spent on getting a good alignment is well spent
3. Dealing with gaps is a significant problem
a. Some programs treat gaps as a 5th character
b. Two sequences sharing a 50 nt gap do not really have a perfect match of 50 nt, as the gap may have been produced by a single event
c. Sequences sharing large gaps will group together if no compensation is made
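The gap problem in E.3 can be illustrated with a small sketch. It assumes that gaps are written as "-" and that both sequences come from the same alignment; the function names are hypothetical and are not taken from any particular program. Counting a shared gap column by column rewards the pair 50 times, while counting runs of shared gaps treats the deletion as one event.

```python
def shared_gap_columns(seq1, seq2, gap="-"):
    """Count aligned columns where both sequences have a gap (gap as 5th character)."""
    return sum(1 for x, y in zip(seq1, seq2) if x == gap and y == gap)


def shared_gap_events(seq1, seq2, gap="-"):
    """Count runs of consecutive shared-gap columns, treating each run as one event."""
    events = 0
    in_run = False
    for x, y in zip(seq1, seq2):
        both_gapped = x == gap and y == gap
        if both_gapped and not in_run:
            events += 1  # a new run of shared gaps starts here
        in_run = both_gapped
    return events


# Two sequences that share one 50 nt deletion.
a = "ACGT" + "-" * 50 + "ACGT"
b = "ACGA" + "-" * 50 + "ACGT"
print(shared_gap_columns(a, b), shared_gap_events(a, b))
```

Scoring by events rather than by columns is one possible form of the compensation mentioned in item c; real programs may use other schemes.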
II. Methods
A. Evolutionary distances - represent the number of steps to go from ancestor to extant species
1. The number of substitutions is typically the single most important variable in any molecular evolutionary analysis
2. If the sequences are closely related, the straight count is probably right
3. Highly divergent sequences have a greater likelihood of multiple substitutions at a given site - the straight count would be an underestimate
4. Models account for this when estimating evolutionary distance
a. Jukes-Cantor model - assumes the rate of change to any one of the 3 alternative nucleotides is equal - not true, for instance, it is known that transitions occur 3X as frequently as transversions - but still useful
b. Kimura’s 2-parameter model - takes into account the different rates of transitions vs transversions but uses a uniform rate within each type
c. Models with even more parameters - matrices for 12 different substitution types and other variables - these models are based on a large number of assumptions - generally the 1- and 2-parameter models give more reliable results than the complex models
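The two simple corrections in item 4 can be sketched in code. The formulas are the standard ones: Jukes-Cantor d = -(3/4) ln(1 - 4p/3), where p is the observed proportion of differing sites, and Kimura 2-parameter d = -(1/2) ln((1 - 2P - Q) sqrt(1 - 2Q)), where P and Q are the observed proportions of transitions and transversions. The function names and the choice to skip any column that is not A, C, G or T in both sequences are assumptions made for this illustration.

```python
import math

# Transitions are purine<->purine (A<->G) or pyrimidine<->pyrimidine (C<->T).
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def proportions(seq1, seq2):
    """Return (P, Q): observed proportions of transitions and transversions."""
    ts = tv = n = 0
    for x, y in zip(seq1.upper(), seq2.upper()):
        if x not in "ACGT" or y not in "ACGT":
            continue  # assumption: columns with gaps or ambiguity codes are skipped
        n += 1
        if x == y:
            continue
        if (x, y) in TRANSITIONS:
            ts += 1
        else:
            tv += 1
    if n == 0:
        raise ValueError("no comparable sites")
    return ts / n, tv / n


def jukes_cantor(seq1, seq2):
    """Jukes-Cantor distance from the straight proportion of differences."""
    P, Q = proportions(seq1, seq2)
    p = P + Q
    return -0.75 * math.log(1 - 4 * p / 3)


def kimura_2p(seq1, seq2):
    """Kimura 2-parameter distance from separate transition/transversion counts."""
    P, Q = proportions(seq1, seq2)
    return -0.5 * math.log((1 - 2 * P - Q) * math.sqrt(1 - 2 * Q))


s1 = "ACGTACGTAC"
s2 = "ACGTACGTAT"  # one transition (C -> T) in 10 sites, straight count p = 0.1
print(jukes_cantor(s1, s2), kimura_2p(s1, s2))
```

Both corrected distances come out slightly above the straight count of 0.1, which is the direction item 3 predicts. For very divergent sequences the argument of the logarithm can reach zero or below, and `math.log` then raises `ValueError`; this sketch does not handle that case.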
B. Ancestor species are not available for comparison, so they must be estimated
C. Rooted (common ancestor with a unique path from it to any other node) vs. unrooted (only specifies relationships between the nodes - says nothing about the direction of evolution)
D. Several methods of analysis are available for this estimation
1. Distance methods - use pairs of sequences to get a dissimilarity measure
a. Common ones - UPGMA (unweighted pair group method with arithmetic mean) and NJ (neighbor joining)
b. Calculate the total number of changes - scored according to type - between every pair of sequences in the alignment
c. Represents the minimum number of changes required to convert one sequence to another
d. Results are written to a distance matrix, which can be used to generate a tree in several possible ways - branch lengths visually represent the amount of change
e. Removing ambiguous sections will influence branch length estimates
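A minimal sketch of a distance method, assuming an ungapped alignment of equal-length sequences and using the uncorrected proportion of differences as the dissimilarity measure. It builds the pairwise distance matrix and then clusters it with UPGMA, returning only the branching order as nested tuples (branch lengths are left out to keep it short). The names `p_distance`, `distance_matrix` and `upgma`, and the toy sequences, are made up for this illustration.

```python
from itertools import combinations


def p_distance(a, b):
    """Proportion of aligned positions that differ between two sequences."""
    return sum(1 for x, y in zip(a, b) if x != y) / len(a)


def distance_matrix(seqs):
    """Map each unordered pair of names to its pairwise distance."""
    return {frozenset((i, j)): p_distance(seqs[i], seqs[j])
            for i, j in combinations(seqs, 2)}


def upgma(seqs):
    """Cluster by UPGMA and return the tree topology as nested tuples."""
    dist = distance_matrix(seqs)
    size = {name: 1 for name in seqs}  # number of sequences in each cluster
    while len(size) > 1:
        # Join the two clusters with the smallest distance.
        a, b = min(combinations(size, 2), key=lambda p: dist[frozenset(p)])
        merged = (a, b)
        for c in size:
            if c in (a, b):
                continue
            # Distance to the new cluster is the size-weighted arithmetic mean.
            dist[frozenset((merged, c))] = (
                size[a] * dist[frozenset((a, c))]
                + size[b] * dist[frozenset((b, c))]
            ) / (size[a] + size[b])
        size[merged] = size[a] + size[b]
        del size[a], size[b]
    return next(iter(size))


toy = {
    "human": "ACGTACGT",
    "chimp": "ACGTACGA",
    "mouse": "ACCTTCGA",
    "fish":  "TGCATCAA",
}
print(upgma(toy))
```

UPGMA assumes a constant rate of change across lineages, so this sketch inherits that assumption; neighbor joining does not require it.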
2. Maximum parsimony method - character-based analysis
a. Parsimony means thrift or stinginess
b. Based on corresponding sequence positions
c. Uses only “informative” positions
d. Finds the tree that requires the fewest mutational events overall - 2 underlying premises
i. Mutations are exceedingly rare events
ii. The more unlikely events a model invokes, the less likely the model is correct
e. Calculates branch order - not branch length
f. Advantage - calculations are rapid, can infer ancestral sequences
g. Disadvantage - a large amount of data is discarded, which is a problem with a short sequence or one without many informative sites
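The parsimony idea can be sketched in two steps: pick out the informative positions, then count the minimum number of changes each candidate tree requires. The sketch below uses the usual definition of an informative site (at least two different characters, each present in at least two sequences) and Fitch's counting procedure, which is one common way to get the minimum number of changes on a fixed tree. Trees are written as nested tuples of sequence names; all names and the toy alignment are assumptions for illustration.

```python
from collections import Counter


def informative_sites(alignment):
    """Indices of columns with >= 2 characters that each occur in >= 2 sequences."""
    sites = []
    for i, column in enumerate(zip(*alignment.values())):
        counts = Counter(column)
        if sum(1 for n in counts.values() if n >= 2) >= 2:
            sites.append(i)
    return sites


def fitch_site(tree, column):
    """Return (possible ancestral states, changes needed) for one column."""
    if isinstance(tree, str):  # a leaf: its observed character, no changes
        return {column[tree]}, 0
    left, right = tree
    left_states, left_changes = fitch_site(left, column)
    right_states, right_changes = fitch_site(right, column)
    common = left_states & right_states
    if common:  # children agree on some state: no extra change
        return common, left_changes + right_changes
    return left_states | right_states, left_changes + right_changes + 1


def parsimony_score(tree, alignment):
    """Total minimum number of changes over all columns for one tree."""
    length = len(next(iter(alignment.values())))
    total = 0
    for i in range(length):
        column = {name: seq[i] for name, seq in alignment.items()}
        total += fitch_site(tree, column)[1]
    return total


aln = {"A": "ACGTA", "B": "ACGTC", "C": "GTGTA", "D": "GTATC"}
candidates = [(("A", "B"), ("C", "D")),
              (("A", "C"), ("B", "D")),
              (("A", "D"), ("B", "C"))]
print(informative_sites(aln))
print([parsimony_score(t, aln) for t in candidates])
```

The tree grouping A with B needs the fewest changes, so it is the most parsimonious of the three. Columns 2 and 3 add the same number of changes to every tree, which is why they are not informative.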
3. Maximum likelihood method
a. Purely statistics-based method
b. The likelihood of the replacement of a particular nt by another nt from the pool of nt is calculated
c. Uses every site, unlike parsimony, as unchanged sites have a chance of having changed and then changed back
d. For each possible tree, the likelihood of the changes is calculated, and the probabilities for each aligned position are multiplied to get the tree likelihood
e. The tree with the maximum likelihood is taken as the best estimate - it is the tree under which the observed data are most probable
f. Disadvantage - very slow to calculate, only as good as the substitution model used
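The likelihood calculation can be shown on the smallest possible case: two sequences joined by one branch. This is a sketch under the Jukes-Cantor model, not a tree search. Under that model, for a branch of length d (expected substitutions per site), the probability that a site is unchanged is 1/4 + (3/4)e^(-4d/3) and the probability of one specific different nucleotide is 1/4 - (1/4)e^(-4d/3). Per-site probabilities are multiplied, done here by summing logarithms, and the branch length with the highest likelihood is kept. The function names and the simple grid search are assumptions for illustration.

```python
import math


def jc_prob(x, y, d):
    """Jukes-Cantor probability of seeing y given x across a branch of length d."""
    e = math.exp(-4.0 * d / 3.0)
    return 0.25 + 0.75 * e if x == y else 0.25 - 0.25 * e


def log_likelihood(seq1, seq2, d):
    """Sum of per-site log probabilities (the log of the product over sites)."""
    # 0.25 is the assumed equal frequency of each starting nucleotide.
    return sum(math.log(0.25 * jc_prob(x, y, d)) for x, y in zip(seq1, seq2))


def ml_distance(seq1, seq2, step=0.001, max_d=2.0):
    """Grid search for the branch length that maximizes the likelihood."""
    candidates = [step * k for k in range(1, int(max_d / step) + 1)]
    return max(candidates, key=lambda d: log_likelihood(seq1, seq2, d))


x = "ACGTACGTACACGTACGTAC"
y = "ACGTACGTATACGTACGTAT"  # 2 differences in 20 sites
print(ml_distance(x, y))
```

Every site contributes to the total, including the 18 unchanged ones, which is the contrast with parsimony in item c. A full analysis repeats this kind of calculation for every branch of every candidate tree, which is why the method is slow.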
4. Masks - weighting schemes based on empirical data sets
5. Tree confidence - validity tests - robustness of trees
a. A phylogenetic prediction has greater credibility if it is found using at least 2 fundamentally different methods
b. Use statistics to judge the validity of branch points within the tree
c. Bootstrapping - Felsenstein - allows rough quantification of tree confidence
d. Resamples the original dataset to produce a series of datasets with different random selections of the original data
e. Sampling with replacement (some positions are repeated while others are left out) rather than deleting positions ensures each dataset remains the same size
f. Tenuous groupings will be easily disrupted, unlike robust associations
g. Generally create 100 or 1000 datasets to get bootstrap values
h. Values above 50% may be reliable, but generally use 70%
i. Not rigorous statistics - must be treated with caution - simulations have shown a tendency to underestimate the confidence level at high values and to overestimate it at low levels
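The resampling step of the bootstrap can be sketched as follows. Alignment columns are drawn at random with replacement, so every replicate has the same length as the original. To keep the example short, the "tree" built from each replicate is reduced to a single question, namely which two sequences are closest by straight count of differences; a real analysis would build a full tree from each replicate instead. All function names and the toy alignment are assumptions for illustration.

```python
import random
from itertools import combinations


def resample_columns(alignment, rng):
    """Draw columns with replacement; the replicate keeps the original length."""
    names = list(alignment)
    length = len(alignment[names[0]])
    picks = [rng.randrange(length) for _ in range(length)]
    return {n: "".join(alignment[n][i] for i in picks) for n in names}


def closest_pair(alignment):
    """Stand-in for tree building: the pair of sequences with fewest differences."""
    def diffs(a, b):
        return sum(x != y for x, y in zip(alignment[a], alignment[b]))
    return frozenset(min(combinations(alignment, 2), key=lambda p: diffs(*p)))


def bootstrap_support(alignment, group, replicates=1000, seed=0):
    """Percentage of replicates in which `group` is recovered as the closest pair."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = sum(
        closest_pair(resample_columns(alignment, rng)) == frozenset(group)
        for _ in range(replicates)
    )
    return 100.0 * hits / replicates


toy_aln = {
    "human": "ACGTACGTACGTACGTACGT",
    "chimp": "ACGTACGTACGTACGTACGA",
    "mouse": "TCGAACCTAGGTTCGAACCT",
}
print(bootstrap_support(toy_aln, ("human", "chimp")))
```

The returned percentage plays the role of the bootstrap value in items g and h: a grouping that survives in most replicates is robust, while one that depends on a few columns is easily disrupted.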
Some useful website resources:
http://aleph0.clarku.edu/~djoyce/java/Phyltree/cover.html