First we will do a couple of pencil exercises and then we will analyze
the eleven SSU rRNA sequences that everyone aligned last
time.
1. Draw all possible unrooted trees for 3 species (A, B, C). Now
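draw all possible rooted trees. Note: Branches can pivot at a node, so two
drawings that differ only in how the branches are rotated around a node show the
same tree; both still group the same species together.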
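If you want to check that you have drawn every tree, the number of distinct bifurcating trees can be counted with a double factorial: (2n - 5)!! unrooted trees and (2n - 3)!! rooted trees for n species. The sketch below is only a counting aid; the function names are made up for this handout and are not part of any program on the portal.

```python
def double_factorial(k: int) -> int:
    """Product k * (k - 2) * (k - 4) * ... down to 1 (returns 1 for k <= 1)."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def count_unrooted_trees(n_species: int) -> int:
    """Number of distinct unrooted bifurcating trees: (2n - 5)!!, n >= 3."""
    if n_species < 3:
        raise ValueError("an unrooted tree needs at least 3 species")
    return double_factorial(2 * n_species - 5)


def count_rooted_trees(n_species: int) -> int:
    """Number of distinct rooted bifurcating trees: (2n - 3)!!, n >= 2."""
    if n_species < 2:
        raise ValueError("a rooted tree needs at least 2 species")
    return double_factorial(2 * n_species - 3)


# Compare the counts with your drawings, then see how fast they grow.
for n in (3, 4, 5, 11):
    print(n, "species:", count_unrooted_trees(n), "unrooted,",
          count_rooted_trees(n), "rooted")
```

Notice how quickly the counts grow as species are added; this is worth keeping in mind when we get to the tree-building programs later in the lab.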
2. Now draw all possible unrooted trees for 4 species (1, 2, 3, 4).
Suppose the trees you have drawn are candidate maximum parsimony trees for
a given informative position where the nucleotides at that position are T, T, C,
and C (corresponding to species 1, 2, 3, and 4, respectively). Label each of the
2 internal nodes of every tree with the most likely candidate for the inferred
ancestral nucleotide (there may not be a single right answer). How many of these
trees are equally parsimonious, invoking a minimum of 1 substitution? How many
invoke 2? Do any invoke more than 2 substitutions?
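Once you have worked exercise 2 by hand, you can check your counts with a small sketch of the Fitch parsimony counting rule: at each internal node, take the intersection of the two child state sets if it is not empty, otherwise take the union and count one substitution. The function name and the way a tree is written as two pairs of species are assumptions made for this illustration, and the sets it reports are only the first-pass candidates for each internal node.

```python
def fitch_quartet(pair_a, pair_b, states):
    """Count substitutions on the unrooted tree ((a1, a2), (b1, b2)).

    Returns (substitutions, candidates at node A, candidates at node B).
    """
    def join(x, y):
        common = x & y
        # Shared state: no change needed. No shared state: one substitution.
        return (common, 0) if common else (x | y, 1)

    node_a, cost_a = join({states[pair_a[0]]}, {states[pair_a[1]]})
    node_b, cost_b = join({states[pair_b[0]]}, {states[pair_b[1]]})
    _, cost_middle = join(node_a, node_b)  # the branch joining the two nodes
    return cost_a + cost_b + cost_middle, node_a, node_b


# Nucleotides at the informative position for species 1 to 4.
states = {1: "T", 2: "T", 3: "C", 4: "C"}

# The three ways to split four species into two pairs.
topologies = [((1, 2), (3, 4)), ((1, 3), (2, 4)), ((1, 4), (2, 3))]

for pair_a, pair_b in topologies:
    cost, node_a, node_b = fitch_quartet(pair_a, pair_b, states)
    print(pair_a, pair_b, "substitutions:", cost,
          "node candidates:", sorted(node_a), sorted(node_b))
```

Where a node lists two candidates, not every combination of choices at the two nodes reaches the minimum, so check your labeled trees by counting changes along each branch.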
We will now look at some distance matrices and use three
different methods to generate phylogenetic trees using the Mobyle portal from
the Pasteur Institute that we used for Lab 2.2.
Calculating a distance matrix.
These programs calculate pairwise distances between sequences creating a matrix.
First let’s look at a
distance matrix and the effect of correcting for multiple substitutions at a
given site.
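To make "pairwise distance" concrete, here is a minimal sketch that computes an uncorrected distance (the proportion of differing sites) for every pair of sequences in a tiny made-up alignment. The sequences, names, and the choice to skip gapped columns are assumptions for illustration; the program on the portal may scale its numbers differently and offers its own gap handling.

```python
def p_distance(seq_a, seq_b, gap="-"):
    """Proportion of differing sites, skipping columns where either has a gap."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    compared = mismatches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == gap or b == gap:
            continue
        compared += 1
        if a != b:
            mismatches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return mismatches / compared


def distance_matrix(alignment):
    """Return {name: {other_name: distance}} for every pair of sequences."""
    names = list(alignment)
    return {a: {b: p_distance(alignment[a], alignment[b]) for b in names}
            for a in names}


# Hypothetical toy alignment (not the class data set).
toy_alignment = {
    "seqA": "ACGUACGUAC",
    "seqB": "ACGUACGAAC",
    "seqC": "ACG-ACCAAU",
}

matrix = distance_matrix(toy_alignment)
for name, row in matrix.items():
    print(name, " ".join(f"{row[other]:.3f}" for other in matrix))
```

The matrix is symmetric with zeros on the diagonal, which is why a program only needs to show one triangle of it.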
3. Calculating a distance matrix using the program “distmat”
a. Under “Programs” on the left-hand side, click on “phylogeny” to expand your
choices, then “distance” and finally “distmat”, which is the program for
calculating a distance matrix from a multiple sequence alignment.
(If you bookmarked your alignment file,
you can simply click on that and then select the program from the
b. Upload your alignment file by clicking on “upload” and then choosing your
downloaded MUSCLE file (alternatively, you could just rerun the alignment).
In the future you should be able to select this file from the dropdown menu
to the right of the “Choose file” button.
c. Initially use the default values for this program. Normally I would recommend
changing the default for “Use the ambiguous codes in the calculation” from no to
yes, but it can make the program run slower, so we won’t change this parameter
for the class exercise that we are all running at the same time. Since rRNA
doesn’t contain codons, we will analyze all 3 bases, but with coding DNA you may
sometimes want to exclude the third base or even look at just one of the three
bases, depending on your analysis. Once the program has run you can click on
“full screen view” in order to see all the data. Note: the smaller the number,
the smaller the evolutionary distance and so the closer together the sequences
will be on the tree.
d. Look at your top row or column 11. Going across the row or down the column,
group similar organisms together based on the evolutionary distances calculated.
On your paper, rough out a phylogenetic tree (don't worry about branch lengths
or style; simply draw something that shows the groupings and how the various
groups are related to each other). You don't need to hand in the matrix.
e. In a new tab or window, open the portal and run “distmat” on your aligned
sequences with the correction for multiple substitutions set to either Kimura or
Jukes-Cantor (if time permits, try them all to see if there is much of a
difference). What difference, if any, do you see in your distance matrix? Now
run the program again with the correction set to “uncorrected” but with the
“Weight given to gaps” set to 10. How does this affect the matrix?
f. Answer the following questions: Do the groupings change when you change the
parameters? If so, explain how or adjust the hand-drawn tree accordingly. Do any
of the matrices appear to be, or logically would seem to be, more accurate than
the others? Why or why not?
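The corrections in step 3e can be written down directly. The sketch below uses the standard textbook formulas: Jukes-Cantor, d = -(3/4) ln(1 - 4p/3), where p is the proportion of differing sites, and Kimura two-parameter, d = -(1/2) ln((1 - 2P - Q) sqrt(1 - 2Q)), where P and Q are the proportions of transitions and transversions. It is an illustration of the formulas only; the function names are made up here, and the portal program may report its values on a different scale.

```python
import math


def jukes_cantor(p):
    """Jukes-Cantor distance from the proportion p of differing sites."""
    if not 0 <= p < 0.75:
        raise ValueError("p must be in [0, 0.75) for the correction to exist")
    return -0.75 * math.log(1 - 4 * p / 3)


def kimura_2p(transitions, transversions):
    """Kimura two-parameter distance from transition/transversion proportions."""
    a = 1 - 2 * transitions - transversions
    b = 1 - 2 * transversions
    if a <= 0 or b <= 0:
        raise ValueError("sequences are too divergent for this correction")
    return -0.5 * math.log(a * math.sqrt(b))


# The correction matters little for similar sequences and a lot for distant ones.
for p in (0.01, 0.05, 0.10, 0.30, 0.50):
    print(f"uncorrected {p:.2f} -> Jukes-Cantor {jukes_cantor(p):.4f}")

# Same total difference (0.30), but split differently between the two classes.
print("Kimura, mostly transitions:  ", round(kimura_2p(0.25, 0.05), 4))
print("Kimura, mostly transversions:", round(kimura_2p(0.05, 0.25), 4))
```

The corrected distance is always at least as large as the uncorrected one, because the correction estimates substitutions that were hidden by later changes at the same site.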
Creating a distance-based tree.
4. Unfortunately, the output format of “distmat” cannot be used by the
phylogenetic tree programs available on this portal, so while it is a useful
program for looking at the distance matrix, we will need to use a different
program to get a matrix that can serve as the input file for creating a
distance-based tree.
a. Under “Programs” on the left-hand side, click on “phylogeny” to expand your
choices, then “distance” and finally “dnadist”, which is another program for
calculating a distance matrix from a multiple sequence alignment. Unfortunately
it does not display the matrix in an easily read format, which is why we ran
“distmat” above.
b. Select your alignment file again. (Click on “upload” and the selection box
with your file should appear for you to select it.) Note that one of the
options is a box for “Weight options” where you would be able to enter a
weighting mask file. Why would you want to do this? I’m not going to have you
create a weighting mask, but think about the information that you would need to
do this.
c. Use the default values and run the program. Once the program has run, below
the outfile window with the matrix results, there should be a box next to a
button for “further analysis”. The box lists a number of distance-based tree
programs that use the “dnadist” outfile as their input file. Select “neighbor”
for neighbor-joining, a common distance-based tree program.
d. From the new window you can run either the default neighbor-joining method
or switch to UPGMA. Leave the defaults and run the program, but note that the
defaults set the first sequence as the outgroup. Since we are using all three
domains, there is no true outgroup. If your first sequence is eukaryotic, DO
change this default to one of the prokaryotic sequences or the tree will look
pretty odd. Alternatively, you can delete the number so that no outgroup is
assigned. Outgroups are typically used to root a tree: the root of the tree will
be on the branch leading to the outgroup species. Technically the output tree is
an unrooted tree, but because the outgroup is connected to the bottommost node
it is easy to convert into the rooted form.
e. To get a better-looking tree for a printout, go to the “Neighbor output tree
file” box. Below it you can select either “drawgram” (rooted tree) or “drawtree”
(unrooted tree) and then click “further analysis”. If the file isn’t input
automatically, you will need to select it (njtree.data) from the box to the
right of the upload tabs. (If it isn’t there, try the result tab.) The default
output is a PostScript printer file, but there are many other options. There are
other parameters you can change if necessary to get an easily readable tree (for
example, to avoid label overlap). For most of the outputs you will need to click
on the save icon to download the file. Note that if you select the rooted
version, the root is arbitrarily chosen and is likely not the true root. That
said, it can sometimes be easier to see the labels of closely related species in
this format.
f. Return to your “dnadist” file, either through the data bookmark page if you
bookmarked it or by clicking on it in the list of jobs. This time use “kitsch”
for further analysis, select “minimum evolution” as the program method, and run
the program. Note: although this appears to be a rooted tree, it is not. Does
this tree differ from the previous distance-based tree? If so, which one seems
more logical to you?
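To see how a distance-based method turns a matrix into groupings, here is a minimal sketch of UPGMA, the alternative method mentioned in step 4d. It repeatedly joins the two closest clusters and averages their distances to everything else. This is a teaching sketch, not the code the portal runs; the function name and the toy matrix are made up, and it reports groupings only, without branch lengths.

```python
def upgma(names, matrix):
    """Cluster by average distance; returns a Newick string of the groupings."""
    # Active clusters: id -> (Newick label, number of leaves in the cluster).
    clusters = {i: (name, 1) for i, name in enumerate(names)}
    # Distances between active clusters, keyed by (smaller id, larger id).
    dist = {(i, j): matrix[i][j]
            for i in range(len(names)) for j in range(i + 1, len(names))}
    next_id = len(names)
    while len(clusters) > 1:
        a, b = min(dist, key=dist.get)  # closest pair of clusters
        label_a, size_a = clusters.pop(a)
        label_b, size_b = clusters.pop(b)
        # Distance from the merged cluster to each remaining cluster is the
        # size-weighted average of the two old distances.
        for c in clusters:
            d_ac = dist.pop((min(a, c), max(a, c)))
            d_bc = dist.pop((min(b, c), max(b, c)))
            dist[(c, next_id)] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        del dist[(a, b)]
        clusters[next_id] = (f"({label_a},{label_b})", size_a + size_b)
        next_id += 1
    newick, _ = next(iter(clusters.values()))
    return newick + ";"


# Hypothetical distances among four sequences (symmetric, zero diagonal).
toy_names = ["A", "B", "C", "D"]
toy_matrix = [
    [0, 2, 6, 10],
    [2, 0, 6, 10],
    [6, 6, 0, 10],
    [10, 10, 10, 0],
]
print(upgma(toy_names, toy_matrix))
```

UPGMA assumes that all lineages change at the same rate, whereas neighbor-joining does not make that assumption, which is one reason the two methods can give different trees from the same matrix.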
Creating a parsimony-based tree.
5. Let’s try a completely different methodology to generate a phylogenetic tree.
DNA parsimony is a character-based method rather than a distance-based one.
a. Under “Programs” on the left-hand side, click on “phylogeny” to expand your
choices, then “parsimony” and finally “dnapars”.
b. Click on “upload” and again select your alignment file. (Since this is NOT
distance-based, we will not use the matrix outfile.) Hit “Run”.
c. Technically, parsimony gives only the relative order of sequences by
evolutionary distance: it shows only the groupings, with branch lengths that do
not represent quantitative evolutionary distances. Some programs do additional
calculations to derive branch lengths, as is seen here. For programs that don’t
do this, all the species names will align. Note also that parsimony can give you
more than one output tree. Did it this time? You can also access “drawgram”
(rooted tree) or “drawtree” (unrooted tree) from the Tree file box for the
output of this tree method.
Creating a statistics-based tree.
6. Finally, let’s try another completely different methodology to generate a
phylogenetic tree. A number of statistical methods fall into the category of
likelihood-based trees.
a. Under “Programs” on the left-hand side, click on “phylogeny” to expand your
choices, then “likelihood” and finally “fastdnaml”. This will construct a tree
using maximum likelihood.
b. We will use the empirical base frequencies derived from the sequence data,
but in some cases users may want to specify these frequencies if they have
biological knowledge of the system that provides them with this information.
Leave the other defaults and run the program.
c. Note that in the output you see not only the tree but also the process. For
each branch (species) it added, the program shows the number of alternative
trees tested and the likelihood score and, as the tree becomes more complicated,
the local rearrangements and their alternative trees tested. Like parsimony,
this method is initially concerned only with the groupings of the organisms, but
some programs, such as this one, do additional calculations to provide branch
lengths. You can also access “drawgram” (rooted tree) or “drawtree” (unrooted
tree) from the Tree file box for the output of this tree method.
d. How do all of these trees (all three methodologies) compare to each other and
to the tree you drew by hand? If you do see any differences, how might you
explain them?
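To give a feel for what a "likelihood score" is, here is a minimal sketch for the simplest possible case: two aligned sequences under the Jukes-Cantor model. It scores many candidate distances and keeps the one that makes the observed data most probable. This is a much simpler model and search than a real maximum-likelihood tree program uses; the function name, the counts, and the grid of candidate distances are all assumptions made for illustration.

```python
import math


def jc_log_likelihood(n_same, n_diff, d):
    """Log-likelihood of the observed sites for a distance d (Jukes-Cantor).

    n_same and n_diff are the numbers of identical and differing sites;
    d is the expected number of substitutions per site and must be > 0.
    """
    decay = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * decay   # a site ends with the same base
    p_diff = 0.25 - 0.25 * decay   # a site ends with one particular other base
    # The factor 0.25 is the probability of the starting base.
    return (n_same * math.log(0.25 * p_same)
            + n_diff * math.log(0.25 * p_diff))


# Hypothetical pair of sequences: 90 identical sites and 10 differing sites.
n_same, n_diff = 90, 10

# Score candidate distances from 0.001 to 1.000 and keep the best one.
candidates = [step / 1000 for step in range(1, 1001)]
best_d = max(candidates, key=lambda d: jc_log_likelihood(n_same, n_diff, d))

p = n_diff / (n_same + n_diff)
analytic_d = -0.75 * math.log(1 - 4 * p / 3)
print("best distance on the grid:", best_d)
print("Jukes-Cantor formula:     ", round(analytic_d, 4))
```

For two sequences the best-scoring distance agrees with the Jukes-Cantor formula. A tree program has to do the same kind of scoring over every branch of every tree it tries, which is why likelihood methods take longer to run.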
Testing branch validity/confidence using bootstrap analysis
7. Finally, we will analyze the branches on our trees by bootstrap analysis.
Unfortunately, how to do that is slightly different for each tree, so I will
walk you through how to run a bootstrap analysis for each, and we will talk
about what to do with the results in lab.
a. Distance-based tree bootstrap: You need to run the actual resampling on the
data prior to running your tree analysis, so here this means going back to the
“dnadist” program you ran in step 4. What is different is that in the
“Bootstrap options” box you will change the default from “No” to “Yes”. There
are other ways you could choose to check your phylogenetic analysis, but we will
use the default resampling (bootstrap). Use a random number seed of 111 and 100
replicates (1000 is better but will take longer, so we won’t do this in class
when you are all going at once). Run this program; it will create 100 (or 1000)
distance matrices by resampling the dataset. Once it has run, you must then
select and run the neighbor-joining program again using the output (further
analysis under the outfile). The difference is that this time, under the
bootstrap options, you need to change the default to “Yes” for “Analyze multiple
data sets”, type in 100 (or 1000; this must match the number above) for the
number of datasets, and type in 111 for the random number seed. Also change the
“Compute a consensus tree” default to “Yes”. Run the program.
b. Character-based bootstrap: Since these programs use the alignment file for
input and not a distance matrix, the bootstrap option is part of the tree
program. Open “dnapars” as you did in step 5 above. Under “Bootstrap options”
change the default to “Yes”, set the replicates and random number seed to 100
and 111, and change the “Compute a consensus tree” default to “Yes” as you did
in 7a above. Run the program. Depending on usage, this program can take a while
to run. The tree will appear under the consense outfile.
c. Likelihood-based bootstrap: As with parsimony, bootstrapping is done as part
of the tree program. Open “fastdnaml” as you did in step 6 above. Under
“Bootstrap options” change the default to “Yes” and set the number of samples
and random number seed to 100 and 111 as you did in 7a and 7b above. Leave
“maximum attempts at replicating inferred tree” at 10 (you may find it under
“advanced options”). Run the program. This will generate a dataset with 100
trees, but unlike the distance-based bootstrap, it does not generate the
consensus tree with the values at the nodes. You can either count up the number
of times a node occurs by hand, or you can go down to further analysis under the
“Bootstrap tree file”, select “consensus” and run that program using the default
values.
d. These programs give you bootstrap values for the different nodes of the tree.
If the consensus tree is different from the one you drew by hand, try to place
these values on the nodes of your tree. (This may not be possible if the
consensus tree is considerably different from yours.) In some cases, as with
neighbor-joining, the output may give values for groupings that aren’t on the
consensus tree but did show up in some of the bootstrap trees.
Do the different bootstrapped tree methodologies agree on which branches are
least valid? Which factor would have the greater impact on the number of
computations needed to complete a bootstrap analysis: doubling the number of
sequences (for example, going from 10 sequences to 20) or doubling the length of
the sequences in the alignment (for example, all 10 sequences going from 1000 nt
to 2000 nt)? Why? (You MUST include this answer and your reasoning on the
document you hand in for this lab!)
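The resampling that every bootstrap option in step 7 performs can be sketched in a few lines: each replicate is a new alignment of the same length, built by drawing columns from the original alignment at random with replacement. The toy alignment and function names below are made up for illustration, and although the sketch uses the same seed (111) and number of replicates (100) as the lab, Python's random number generator will not reproduce the replicates the portal programs create.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Build one pseudo-alignment by sampling columns with replacement."""
    length = len(next(iter(alignment.values())))
    picked = [rng.randrange(length) for _ in range(length)]
    # The same column indices are used for every sequence, so each resampled
    # column is still a real column of the original alignment.
    return {name: "".join(seq[i] for i in picked)
            for name, seq in alignment.items()}


def bootstrap_replicates(alignment, n_replicates=100, seed=111):
    """Return a list of resampled alignments; the seed makes it repeatable."""
    rng = random.Random(seed)
    return [bootstrap_alignment(alignment, rng) for _ in range(n_replicates)]


# Hypothetical toy alignment (not the class data set).
toy_alignment = {
    "seqA": "ACGUACGUAC",
    "seqB": "ACGUACGAAC",
    "seqC": "ACGAACCAAU",
}

replicates = bootstrap_replicates(toy_alignment)
print(len(replicates), "replicates; the first one is:")
for name, seq in replicates[0].items():
    print(name, seq)
```

A tree is then built from every replicate, and the bootstrap value for a grouping is the number of replicate trees in which that grouping appears.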