Lab 2.3


Phylogenetic Trees

First we will do a couple of pencil exercises and then we will analyze the eleven SSU rRNA sequences that everyone aligned last time.

1. Draw all possible unrooted trees for 3 species (A, B, C). Now draw all possible rooted trees. Note: Branches can pivot at a node so:

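[Figure: two drawings of the same tree that differ only by a rotation of the branches around one node. The groupings are the same in both drawings.]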
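Once you have drawn your trees, you can check how many you should have found. The sketch below uses the standard counting formulas for strictly bifurcating trees with labeled tips, (2n-5)!! unrooted and (2n-3)!! rooted. The function names are illustrative and are not part of any program used in this lab.

```python
def double_factorial(k: int) -> int:
    """Return k * (k - 2) * (k - 4) * ... down to 1 (1 when k <= 1)."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def count_unrooted_trees(n_tips: int) -> int:
    """Number of unrooted bifurcating trees for n_tips >= 3 labeled tips."""
    if n_tips < 3:
        raise ValueError("need at least 3 tips for an unrooted tree")
    return double_factorial(2 * n_tips - 5)


def count_rooted_trees(n_tips: int) -> int:
    """Number of rooted bifurcating trees for n_tips >= 2 labeled tips."""
    if n_tips < 2:
        raise ValueError("need at least 2 tips for a rooted tree")
    return double_factorial(2 * n_tips - 3)


if __name__ == "__main__":
    for n in (3, 4, 5, 11):
        print(n, count_unrooted_trees(n), count_rooted_trees(n))
```

Note how quickly the counts grow: this is why the tree programs later in the lab search heuristically instead of scoring every possible tree for eleven sequences.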

2. Now draw all possible unrooted trees for 4 species (1, 2, 3, 4). Suppose these trees are candidate maximum parsimony trees for a single informative position at which the nucleotides are T, T, C, and C (for species 1, 2, 3, and 4, respectively). Label each of the 2 internal nodes on every tree with the most likely inferred ancestral nucleotide (there may be more than one right answer). How many of these trees are most parsimonious, requiring a minimum of 1 substitution? How many require 2? Do any require more than 2 substitutions?
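After you have labeled your trees by hand, you can check your substitution counts with a small script. This is a minimal sketch of Fitch-style parsimony counting for a four-species unrooted tree, assuming one site and one nucleotide per species. The function name and data layout are made up for illustration.

```python
def fitch_quartet(states, left_pair, right_pair):
    """Score one site on the unrooted tree (left_pair),(right_pair).

    states maps species name -> nucleotide at this site.
    Returns (substitutions, left_node_set, right_node_set), where the sets
    hold the candidate ancestral nucleotides from the first Fitch pass.
    """
    def combine(x, y):
        # Fitch rule: keep the intersection if there is one (no change
        # needed); otherwise take the union and count one substitution.
        common = x & y
        if common:
            return common, 0
        return x | y, 1

    left, cost_left = combine({states[left_pair[0]]}, {states[left_pair[1]]})
    right, cost_right = combine({states[right_pair[0]]}, {states[right_pair[1]]})
    # The two internal nodes are joined by the central branch.
    _, cost_middle = combine(left, right)
    return cost_left + cost_right + cost_middle, left, right


if __name__ == "__main__":
    site = {1: "T", 2: "T", 3: "C", 4: "C"}
    topologies = [((1, 2), (3, 4)), ((1, 3), (2, 4)), ((1, 4), (2, 3))]
    for left_pair, right_pair in topologies:
        print(left_pair, right_pair, fitch_quartet(site, left_pair, right_pair))
```

An internal node whose set contains two nucleotides is one where more than one ancestral assignment is equally good, which is why the exercise says there may not be a single right answer.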

We will now look at some distance matrices and use three different methods to generate phylogenetic trees using the Mobyle portal from the Pasteur Institute that we used for Lab 2.2.

Calculating a distance matrix.

These programs calculate pairwise distances between sequences creating a matrix. First let’s look at a distance matrix and the effect of correcting for multiple substitutions at a given site.
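To make the idea of a pairwise distance matrix concrete, here is a minimal sketch that computes the uncorrected distance (the proportion of differing sites) for every pair of sequences in an alignment. It assumes that columns with a gap in either sequence are skipped. The toy sequences and function names are invented for illustration, and the numbers that "distmat" reports may be scaled or gap-weighted differently.

```python
def p_distance(seq_a, seq_b, gap="-"):
    """Proportion of compared sites that differ between two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    compared = differing = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == gap or b == gap:
            continue  # assumption: ignore any column with a gap
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return differing / compared


def distance_matrix(alignment):
    """Return {(name_x, name_y): distance} for every ordered pair of names."""
    names = list(alignment)
    return {(x, y): p_distance(alignment[x], alignment[y])
            for x in names for y in names}


if __name__ == "__main__":
    # Hypothetical toy alignment, not the class data set.
    toy = {"seq1": "ACGTACGT", "seq2": "ACGTACGA", "seq3": "AC-TTCGA"}
    matrix = distance_matrix(toy)
    for x in toy:
        print(x, [round(matrix[(x, y)], 3) for y in toy])
```

The matrix is symmetric with zeros on the diagonal, so in practice only one triangle needs to be read.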

3. Calculating a distance matrix using the program “distmat”

a. Under “Programs” on the left-hand side, click on “phylogeny” to expand your choices, then “distance” and finally “distmat”, which is the program for calculating a distance matrix from a multiple sequence alignment. (If you bookmarked your alignment file, you can simply click on that and then select the program from the

b. Upload your alignment file by clicking on “upload” and then choosing your downloaded MUSCLE file (alternatively, you could just rerun the alignment). In the future you should be able to select this file from the dropdown menu to the right of the “Choose file” button.

c. Initially use the default values for this program. Normally I would recommend changing the default for “Use the ambiguous codes in the calculation” from no to yes, but it can make the program run slower and so we won’t change this parameter for the class exercise that we are all running at the same time. Since rRNA doesn’t contain codons, we will analyze all 3 bases, but with coding DNA sometimes you may want to exclude the third base or even just look at one of the three bases, depending on your analysis. Once the program has run you can click on “full screen view” in order to see all the data. Note: the smaller the number the smaller the evolutionary distance and so the closer together the sequences will be on the tree.

d. Look at your top row or column 11. Going across the row or down the column, group similar organisms together based on evolutionary distances calculated. On your paper rough out a phylogenetic tree (don't worry about branch lengths or style, simply draw something that shows the groupings and how the various groups are related to each other). You don't need to hand in the matrix.

e. In a new tab or window, open the portal and run “distmat” on your aligned sequences with the correction for multiple substitutions set to either Kimura or Jukes-Cantor (if time permits, try them all to see if there is much of a difference). What difference, if any, do you see in your distance matrix? Now try the program again with the “uncorrected” setting, but set the “Weight given to gaps” to 10. How does this affect the matrix?

f. Answer the following questions:

Do the groupings change when you change the parameters? If so, explain how or adjust the hand-drawn tree accordingly. Do any of the matrices appear to be, or logically would seem to be, more accurate than the others?  Why or why not?
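To see what the corrections in step 3e do numerically, the sketch below applies the standard Jukes-Cantor and Kimura two-parameter formulas to an uncorrected proportion of differences. The function names are illustrative, and the portal's output may be scaled differently, so treat this as a way to explore the shape of the correction and not as a reproduction of "distmat".

```python
import math


def jukes_cantor(p):
    """Jukes-Cantor distance from p, the proportion of differing sites."""
    inner = 1.0 - 4.0 * p / 3.0
    if inner <= 0:
        raise ValueError("p is too large for the Jukes-Cantor correction")
    return -0.75 * math.log(inner)


def kimura_2p(transitions, transversions):
    """Kimura two-parameter distance.

    transitions   = proportion of sites differing by a transition (P)
    transversions = proportion of sites differing by a transversion (Q)
    """
    a = 1.0 - 2.0 * transitions - transversions
    b = 1.0 - 2.0 * transversions
    if a <= 0 or b <= 0:
        raise ValueError("differences are too large for the Kimura correction")
    return -0.5 * math.log(a * math.sqrt(b))


if __name__ == "__main__":
    for p in (0.01, 0.05, 0.10, 0.30, 0.50):
        print(f"uncorrected {p:.2f} -> Jukes-Cantor {jukes_cantor(p):.4f}")
```

The corrected distance is always at least as large as the uncorrected one, and the gap widens as sequences become more divergent, because more of the substitutions are hidden by later changes at the same site.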

Creating a distance-based tree.

4. Unfortunately, the output format of “distmat” cannot be read by the phylogenetic tree programs available on this portal, so while it is a useful program for looking at the distance matrix, we will need to use a different program to get a matrix that can serve as the input file for creating a distance-based tree.

a. Under “Programs” on the left-hand side, click on “phylogeny” to expand your choices, then “distance” and finally “dnadist”, which is another program for calculating a distance matrix from a multiple sequence alignment. Unfortunately, it does not display the matrix in an easily read format, which is why we ran “distmat” above.

b. Select your alignment file again. (Click on “upload” and the selection box with your file should appear for you to select it.) Note that as one of the options there is a box for “Weight options” where you would be able to enter a weighting mask file. Why would you want to do this? I’m not going to have you create a weighting mask, but think about the information that you would need to do this.

c. Use the default values and run the program. Once the program has run, below the outfile window with the matrix results, there should be a box next to a button for “further analysis”. The box lists a number of distance-based tree programs that use the “dnadist” output file as their input file. Select “neighbor” for neighbor-joining, a common distance-based tree program.

d. From the new window you can run either the default neighbor-joining method or switch to UPGMA. Leave the defaults and run the program, but note that the defaults set the first sequence as the outgroup. Since we are using all three domains, there is no true outgroup. If your first sequence is eukaryotic, DO change this default to one of the prokaryotic sequences or the tree will look pretty odd. Alternatively, you can delete the number so that no outgroup is assigned. Outgroups are typically used to root a tree: the root of the tree will be on the line leading to the outgroup species. Technically the output tree is an unrooted tree, but because the outgroup is connected to the bottommost node, it is easy to convert into the rooted form.

e. In order to get a better looking tree for a printout, go to the “Neighbor output tree file” box. Below it you can select either “drawgram” (rooted tree) or “drawtree” (unrooted tree) and then click “further analysis”. If the file doesn’t automatically input, you will need to select it (njtree.data) from the box to the right of the upload tabs. (If it isn’t there, try the result tab.) The default is a PostScript printer file, but there are many other options. There are other parameters you can change if necessary to get an easily readable tree (for example, to avoid label overlap). For most of the outputs you will need to click on the save icon to download the file. Note that if you select the rooted version, the root is arbitrarily chosen and is likely not the true root. That said, the labels of closely related species can sometimes be easier to read in this format.

f. Return to your “dnadist” file – either through the data bookmark page if you bookmarked it or by clicking on it from the list of jobs. This time use “kitsch” for further analysis and for program method select “minimum evolution” and run the program. Note: although this appears to be a rooted tree it is not. Does this tree differ from the previous distance-based tree? If so, which one seems more logical to you?
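To give a feel for what the neighbor-joining program in steps 4c and 4d does with the distance matrix, here is a minimal sketch of its first step only: computing the standard Q criterion for every pair, joining the pair with the lowest value, and assigning the two new branch lengths. The matrix values and names are a made-up toy example. A full implementation would repeat this step on a reduced matrix until the tree is complete, and this sketch makes no claim about how the portal's "neighbor" program is coded.

```python
from itertools import combinations


def nj_first_join(names, dist):
    """Pick the first pair neighbor-joining would join.

    names: list of taxon labels
    dist:  symmetric matrix (list of lists) of pairwise distances
    Returns ((name_i, name_j), q_value, (branch_i, branch_j)).
    """
    n = len(names)
    if n < 3:
        raise ValueError("need at least 3 taxa")
    totals = [sum(row) for row in dist]  # summed distance from each taxon
    best_pair, best_q = None, None
    for i, j in combinations(range(n), 2):
        # Q criterion: small when i and j are close to each other
        # but far from everything else.
        q = (n - 2) * dist[i][j] - totals[i] - totals[j]
        if best_q is None or q < best_q:
            best_pair, best_q = (i, j), q
    i, j = best_pair
    branch_i = dist[i][j] / 2 + (totals[i] - totals[j]) / (2 * (n - 2))
    branch_j = dist[i][j] - branch_i
    return (names[i], names[j]), best_q, (branch_i, branch_j)


if __name__ == "__main__":
    # Hypothetical toy distances, not the class data.
    taxa = ["a", "b", "c", "d", "e"]
    matrix = [
        [0, 5, 9, 9, 8],
        [5, 0, 10, 10, 9],
        [9, 10, 0, 8, 7],
        [9, 10, 8, 0, 3],
        [8, 9, 7, 3, 0],
    ]
    print(nj_first_join(taxa, matrix))
```

In this toy matrix the pair with the smallest raw distance (d and e) is not the first pair joined, which illustrates how neighbor-joining differs from simply clustering the closest sequences as UPGMA does.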

Creating a parsimony-based tree.

5. Let’s try a completely different methodology to generate a phylogenetic tree. DNA parsimony is a character-based method instead of a distance-based one.

a. Under “Programs” on the left-hand side, click on “phylogeny” to expand your choices, then “parsimony” and finally “dnapars”.

b. Click on “upload” and again select your alignment file. (Since this is NOT distance-based, we will not use the matrix outfile.) Hit “Run”.

c. Technically, parsimony gives only the relative branching order of the sequences: it shows the groupings, with branch lengths that do not represent quantitative evolutionary distances. Some programs do additional calculations to derive branch lengths, as is seen here. For programs that don’t do this, all the species names will align. Note also that parsimony can give you more than one output tree. Did it this time? You can also access “drawgram” (rooted tree) or “drawtree” (unrooted tree) from the Tree file box for the output of this tree method.

Creating a statistics-based tree.

6. Finally, let’s try another completely different methodology to generate a phylogenetic tree. There are a number of statistical methods that fall into the category of likelihood-based trees.

a. Under “Programs” on the left-hand side, click on “phylogeny” to expand your choices, then “likelihood” and finally “fastdnaml”. This will construct a tree using maximum likelihood.

b. We will use the empirical base frequencies derived from the sequence data, but in some cases users may want to specify these frequencies if they have biological knowledge of the system that provides them with this information. Leave the other defaults and run the program.

c. Note that in the output you see not only the tree but also the process. For each branch (species) it added, the program shows the number of alternative trees tested, the likelihood score and, as the tree becomes more complicated, the local rearrangements and their alternative trees tested. Like parsimony, this method is initially concerned only with the groupings of the organisms, but some programs, such as this one, do additional calculations to provide branch lengths. You can also access “drawgram” (rooted tree) or “drawtree” (unrooted tree) from the Tree file box for the output of this tree method.

d. How do all of these trees (all three methodologies) compare to each other and the tree you drew out by hand?  If you do see any differences, how might you explain them?
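Step 6b mentions empirical base frequencies derived from the sequence data. As a rough illustration of what that means, the sketch below counts each base across all sequences in an alignment and divides by the total. It assumes that gaps and ambiguity codes are ignored and that U is counted as T. The function name and toy data are hypothetical, and "fastdnaml" may handle these cases differently.

```python
from collections import Counter


def empirical_base_frequencies(sequences):
    """Return {'A': f, 'C': f, 'G': f, 'T': f} pooled over all sequences."""
    counts = Counter()
    for seq in sequences:
        for base in seq.upper():
            if base == "U":
                base = "T"  # assumption: treat RNA uracil as T
            if base in "ACGT":
                counts[base] += 1  # gaps and ambiguity codes are skipped
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no unambiguous bases found")
    return {base: counts[base] / total for base in "ACGT"}


if __name__ == "__main__":
    # Hypothetical toy alignment, not the class data.
    toy = ["AACG", "A-GU"]
    for base, freq in empirical_base_frequencies(toy).items():
        print(base, round(freq, 3))
```

If the four frequencies are far from 0.25 each, a model that assumes equal base frequencies is a poorer fit to the data, which is one reason to estimate them from the alignment.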

Testing branch validity/confidence using bootstrap analysis

7. Finally, we will look at analyzing the branches on our trees by bootstrap analysis. Unfortunately, how to do that is slightly different for each tree, so I will walk you through how to run a bootstrap analysis for each and we will talk about what to do with the results in lab.

a. Distance-based tree bootstrap: You need to run the actual resampling on the data prior to running your tree analysis, and so here this means going back to the “dnadist” program you ran in step 4. What is different is that in the “Bootstrap options” box you will change the default from “No” to “Yes”. There are other ways you may select to check your phylogenetic analysis, but we will use the default resampling (bootstrap). Use a random number seed of 111 and 100 replicates (1000 is better but will take longer so we won’t do this in class when you are all going at once). Run this program – it will create 100 (or 1000) distance matrices from resampling the dataset. Once it is run, you must then select and run the neighbor-joining program again using the output (further analysis under the outfile). The difference is that this time, under the bootstrap options you need to change the default to “Yes” for “Analyze multiple data sets”, type in 100 (or 1000 – must match above) datasets and 111 for the random number seed. Also change the “Compute a consensus tree” default to “Yes” as well. Run the program.

b. Character-based bootstrap: Since these programs use the alignment file for input and not a distance matrix, the bootstrap option is part of the tree program. Open “dnapars” as you did in Step 5 above. Under “Bootstrap options” change the default to “Yes”, set the replicates and random number seed to 100 and 111, and change the “Compute a consensus tree” default to “Yes” as you did in 7a above. Run the program. Depending on usage, this program can take a while to run. The tree will appear under the consense outfile.

c. Likelihood-based bootstrap: As with parsimony, bootstrapping is done as part of the tree program. Open “fastdnaml” as you did in Step 6 above. Under “Bootstrap options” change the default to “Yes” and set the number of samples and random number seed to 100 and 111 as you did in 7a and 7b above. Leave “maximum attempts at replicating inferred tree” at 10 (it may be under “advanced options”). Run the program. This will generate a dataset with 100 trees, but unlike the distance-based bootstrap, it does not generate the consensus tree with the values at the nodes. You can either count up the number of times a node occurs by hand, or you can go down to further analysis under the “Bootstrap tree file”, select “consense” (the consensus tree program) and run that program using the default values.

d. These programs give you bootstrap values for the different nodes of the tree. If the consensus tree for these is different from the one you drew by hand, try to place these values on the nodes of your tree. (This may not be possible if the consensus tree is considerably different from yours.) In some cases, as with neighbor-joining, the output may give values for groupings that aren’t on the consensus tree but did show up in some of the bootstrap trees. Do the different bootstrapped tree methodologies agree on which branches are least valid? What factor would have the greatest impact on the number of computations needed to complete a bootstrap analysis: doubling the number of sequences (for example, going from 10 sequences to 20) or doubling the length of the sequences in the alignment (for example, all 10 sequences go from 1000 nt to 2000 nt)? Why? (You MUST include this answer and your reasoning on the document you hand in for this lab!)
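The two halves of a bootstrap analysis in step 7 can be sketched in a few lines: resampling alignment columns with replacement to make each replicate data set, and then counting how often each grouping appears among the replicate trees. This is only an illustration under stated assumptions. The function names and toy data are invented, the tree-building step between the two halves is left out, and a seed of 111 in Python's generator will not reproduce the portal's replicates.

```python
import random
from collections import Counter


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement.

    alignment: dict of name -> aligned sequence (all the same length)
    rng:       a random.Random instance, so the seed is controlled by the caller
    """
    length = len(next(iter(alignment.values())))
    # The same column picks are applied to every sequence, so each
    # resampled column is an intact column of the original alignment.
    picks = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[i] for i in picks)
            for name, seq in alignment.items()}


def clade_support(replicate_trees):
    """Percent of replicate trees in which each grouping appears.

    replicate_trees: list of trees, each given as a collection of
    groupings, where a grouping is a frozenset of tip names.
    """
    counts = Counter()
    for groupings in replicate_trees:
        counts.update(set(groupings))  # count a grouping once per tree
    return {g: 100.0 * n / len(replicate_trees) for g, n in counts.items()}


if __name__ == "__main__":
    toy = {"A": "ACGTAC", "B": "ACGAAC", "C": "TCGATC"}  # hypothetical data
    rng = random.Random(111)
    replicates = [bootstrap_alignment(toy, rng) for _ in range(100)]
    print(len(replicates), replicates[0])

    # Hypothetical groupings recovered from four replicate trees.
    trees = [[frozenset("AB")], [frozenset("AB")],
             [frozenset("AC")], [frozenset("AB")]]
    print(clade_support(trees))
```

The support values are what a consensus program writes at the nodes: a grouping found in 75 of 100 replicate trees gets a bootstrap value of 75.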

Click here to email comments to Scott Cooper regarding this site or its links.