Take-home 2


Take Home Assignment # 2

Use the hierarchy browser search tool to find your assigned genus in the RDP. Set the size to >1200 nt, since you want to use only close to full-length sequences. Please hand in a list (not a printout) of the hierarchical “phylogenetic steps” to your assigned genus, starting with the Domain, and identify what type each step is (for example, the first step is of type Domain). Retrieve all the sequences within the genus, either from the RDP or from any of the other databases you've learned to use in this course. (Retrieve them regardless of whether the individual species within the genus actually include your genus name as part of their name. Also, don’t select a sequence that the RDP places outside of the genus just because it shares the genus name. You should download between 10 and 15 sequences; if you think you have more or fewer than this, please let me know.) Note: If a sequence comes from the RDP, which is an aligned database, it is important to check "remove all gaps" when downloading it. Find your genus again, only this time set the size to both. How many total sequences are in this genus, including the short partials? Note: Do not retrieve any of these shorter sequences; use only the >1200 nt sequences for the rest of the take home. (6 pt)
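If you download sequences from more than one database, a quick check of gaps and lengths can help confirm that every sequence meets the criteria above. The following is only a minimal sketch: the function names and the choice of `-` and `.` as gap characters are assumptions for illustration, not part of the RDP tools.

```python
def strip_gaps(aligned_seq, gap_chars="-."):
    """Return the sequence with alignment gap characters removed."""
    return "".join(ch for ch in aligned_seq if ch not in gap_chars)


def keep_near_full_length(records, min_len=1200):
    """Keep (name, sequence) pairs whose ungapped length is greater than min_len.

    `records` is a list of (name, sequence) tuples; sequences may still
    contain gap characters from an aligned download.
    """
    kept = []
    for name, aligned_seq in records:
        seq = strip_gaps(aligned_seq)
        if len(seq) > min_len:
            kept.append((name, seq))
    return kept
```

A sequence whose ungapped length is 1200 nt or less is dropped by this check, which matches the ">1200 nt" size setting.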

Align the retrieved sequences that fit the above criteria using MUSCLE or CLUSTALW. Hand in a copy of the alignment (not colored). Be sure that each sequence is clearly identified (by name, NOT by number) by editing the original sequence or alignment. The computer will sometimes chop off labels after 10 characters or at the first space, which gives identical labels to clones whose unique identifiers fall only at the end of a long name. Feel free to abbreviate the genus and even the species if you have multiple strains. You will likely want to run the alignment through MView or BOXSHADE to see the consensus sequence, but don’t turn in that version; the basic, uncolored text alignment is what I want. (3 pt)
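The label-truncation problem described above can be checked before you run the alignment. This sketch assumes the program cuts a label at the first space or after 10 characters. The organism names in the usage note are hypothetical placeholders, and `abbreviate` is just one possible naming scheme, not a required one.

```python
from collections import defaultdict


def truncate_label(name, width=10):
    """Mimic a program that cuts a label at the first space or after `width` characters."""
    return name.split(" ")[0][:width]


def find_label_collisions(names, width=10):
    """Return {truncated_label: [original names]} for labels shared by more than one name."""
    groups = defaultdict(list)
    for name in names:
        groups[truncate_label(name, width)].append(name)
    return {label: members for label, members in groups.items() if len(members) > 1}


def abbreviate(name, width=10):
    """Build a short label: genus initial + first 3 letters of the species + end of the last token."""
    parts = name.split()
    if len(parts) < 3:
        return "_".join(parts)[:width]
    prefix = parts[0][0] + parts[1][:3]
    tail = parts[-1][-(width - len(prefix)):]
    return prefix + tail
```

For example, the hypothetical names "Examplea marina strain AB001" and "Examplea marina strain AB002" both truncate to "Examplea", while `abbreviate` turns them into "EmarAB001" and "EmarAB002". Always confirm that the abbreviated labels are still unique for your own sequences.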

Run a second alignment with E. coli 16S rRNA (or another bacterial 16S rRNA, as long as it is outside of your group) as one of the sequences. Don’t hand this alignment in (-2 pt if you do). Comparing the consensus sequence from the first alignment to this second alignment should help you find regions that might be unique signature sequences for your genus. Explain why this second alignment should help you to determine these regions. If you were going to choose a different bacterial 16S rRNA sequence instead of E. coli, would you choose one more closely or less closely related to your genus than E. coli? Explain your reasoning. If you had a picture of the secondary structure of the bacterial 16S rRNA showing which areas are highly conserved and which are highly variable (like the one I showed in lecture, with just dots and no actual sequence data), would the highly conserved or the highly variable areas be a better place to search for signature sequences for your genus? Select 3-4 different potential sequences (potential probe targets; see the tips below). Highlight each of your selected signature sequences on the alignment pages you are handing in and number them 1-3 or 1-4. Run each candidate through Probe Match to determine its usefulness. Hand in a copy of the Probe Match information (be sure it is set to both) for your signature sequence candidates. Be sure the printout includes all appropriate information up to the number of hits in the genus (I don't need to see which members of the genus match) and the corresponding number from the highlighted region on the alignment. (Note: if you hand in a cut-and-paste version rather than the actual printout, you will be docked if information I need to assess your results is missing, but if all the information is there this is fine.) Rerun each signature sequence through Probe Match, restricting the search to sequences with data in the region of this signature sequence. (Pick a region from 10-20 nt before the start of your signature sequence to 10-20 nt after the end. Think: what information have you generated that will make it easy to identify where this would be on the E. coli sequence?) Hand in the output from this restricted run just as you did with the initial Probe Match run; however, somewhere on each printout write down the numbers for the region you restricted the run to. Also, write up an analysis for each of your Probe Match results. Then summarize your results by concluding which one of your signature sequences you believe is the best. Explain why you consider this signature sequence to give you a better probe than each of the other options. Be sure to include the information you gained by restricting the search area in your argument. The signature sequence you selected may be the best of your 3-4 candidates; however, in the real world would you consider this probe search a success? Include an explanation as to why you do or don't believe the selected sequence will give you a useful probe for your group. (15 pt)
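When you write up each Probe Match result, it can help to reduce the hit counts to a few comparable numbers. The sketch below is one possible way to do that arithmetic. The function and the terms "coverage" and "specificity" as defined here are illustrative assumptions, not output fields of Probe Match, and any counts you feed in must come from your own runs.

```python
def probe_summary(hits_in_genus, genus_total, hits_overall):
    """Summarize one Probe Match run.

    hits_in_genus : sequences in your genus matched by the probe
    genus_total   : sequences in your genus that were searched
                    (for a restricted run, only those with data in the region)
    hits_overall  : all sequences matched by the probe, inside or outside the genus
    """
    if genus_total <= 0:
        raise ValueError("genus_total must be positive")
    if hits_in_genus > genus_total or hits_in_genus > hits_overall:
        raise ValueError("hits_in_genus cannot exceed genus_total or hits_overall")
    return {
        # fraction of the searched genus sequences that the probe matches
        "coverage": hits_in_genus / genus_total,
        # matches that fall outside the genus
        "off_target_hits": hits_overall - hits_in_genus,
        # fraction of all matches that belong to the genus
        "specificity": hits_in_genus / hits_overall if hits_overall else 0.0,
    }
```

Running the same summary on the unrestricted and the restricted counts for a candidate puts the two runs side by side, since `genus_total` changes when the search is limited to sequences with data in the region.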

It is generally necessary to manually edit your aligned sequences prior to phylogenetic analysis or to use a mask during the analysis. While I don't expect that your genus will require major editing, there may be some minor edits that would prove useful. You don't have to actually make the edits to your sequences before running your trees, but please mark the edits you would make on the alignment you generated above (with the edited areas circled) and give a brief explanation of why you would make them. If you believe no edits are needed, turn in a paragraph explaining exactly why your aligned sequences do not need any. (3 pt)

Construct phylogenetic trees for your assigned group of organisms using three completely different tree methodologies (not just appearance like with rooted and unrooted). Hand in the trees along with the name and general description of the tree methodology (not program name) used for each. Be sure it is clear how these three methodologies differ. Please evaluate your phylogenetic results, including the following in the evaluation discussion: Did any of your methods give you multiple trees? Why would this occur? Do your trees truly differ? Would you expect them to differ? Explain. Are the branch lengths valid for any of your trees? If so, which? Do you prefer one tree over the others - if so, explain? For your genus, speculate as to whether selecting a model for correcting for multiple substitutions rather than uncorrected as one of your analysis parameters would make a difference (don't run it, just think about it). Explain your answer. (17 pt)
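To think through the question about correcting for multiple substitutions, it may help to see what a correction does to a pairwise distance. The sketch below uses the Jukes-Cantor model purely as an example; the assignment does not prescribe a model. The uncorrected p-distance is the proportion of compared sites that differ, and the corrected distance is d = -(3/4) ln(1 - 4p/3). Skipping columns with gaps or N is a simplifying assumption.

```python
import math


def p_distance(seq_a, seq_b):
    """Proportion of differing sites between two aligned sequences.

    Columns where either sequence has a gap ('-' or '.') or an N are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differing = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in "-.N" or b in "-.N":
            continue
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differing / compared


def jukes_cantor(p):
    """Jukes-Cantor corrected distance for an uncorrected proportion p (requires p < 0.75)."""
    if p >= 0.75:
        raise ValueError("Jukes-Cantor distance is undefined for p >= 0.75")
    return -0.75 * math.log(1 - 4 * p / 3)
```

Try the two functions on a few pairs from your own alignment and compare the corrected and uncorrected values before you write your answer.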

You also need to run a bootstrap analysis for the phylogenetic groupings for your 3 trees. Please hand in a hard copy of these computer printouts showing the calculated bootstrap values at the appropriate nodes on the tree. Be sure to state how many random trees/datasets were tested, so we know whether the 89 is out of 100 trees or 1000 or whatever number you chose. On each tree, note whether any of the branches (groupings) are not valid according to the bootstrap analysis. What groupings (if any) are valid on all 3 trees? Which are questionable? Which are not valid? If they differ, which (if any) of your trees do you have the most confidence in? (6 pt)
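The reason the number of replicates matters can be seen from how a bootstrap value is produced: alignment columns are resampled with replacement to make each pseudo-dataset, a tree is built from each, and the support for a grouping is the share of replicates that recover it. The following is only a sketch of the resampling and percentage steps, with hypothetical function names. Tree building is left to whichever phylogeny program you use.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Return one pseudo-alignment made by sampling columns with replacement.

    `alignment` maps sequence name -> aligned sequence (all the same length).
    `rng` is a random.Random instance, so runs can be repeated with a seed.
    """
    names = list(alignment)
    length = len(alignment[names[0]])
    if any(len(alignment[name]) != length for name in names):
        raise ValueError("all sequences must have the same aligned length")
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(alignment[name][c] for c in columns) for name in names}


def support_percent(times_recovered, replicates):
    """Bootstrap support as a percentage of the replicates tested."""
    if replicates <= 0:
        raise ValueError("replicates must be positive")
    return 100.0 * times_recovered / replicates
```

A raw count of 89 corresponds to 89% support out of 100 replicates but only 8.9% out of 1000, which is why the number of replicates has to be stated.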

Probe Design tips

1. Probes, optimally, should be about 18-25 nucleotides in length, but some are as short as 15 nucleotides and others are longer than 25.

2. If most of your sequences have a particular nucleotide (say a T for example) at a site, but one or two of the sequences have an N at that site (meaning it could be any nucleotide), go ahead and design the probe with that nucleotide (the T in my example).

3. If you are having considerable trouble finding consensus sequence regions long enough for probes, expand your options by first checking to see if some base uncertainties are masking consensus. The following IUPAC abbreviations may be used within your sequence: R for A or G, W for A or T, S for G or C, M for A or C, Y for C or T, and K for G or T. Consider then that an R may be in consensus if the other sequences all have A or all have G at that position.

4. If necessary, design the probes using an ambiguous base like R or W (only 1 ambiguous base for probes shorter than 20 nt or 2 for probes over 20 nt).

5. Your phylogenetic trees may show that 1 sequence or a small group of sequences is more distantly related to the rest of the sequences. If this sequence(s) is causing problems in finding a consensus region for a probe, go ahead and design your probe for just the main group of sequences. You will need to explain this in your paragraph on the probe.
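Tips 2 through 4 can be sketched as small checks. The code below is an illustration only: it covers just the IUPAC codes listed in tip 3, treats U as T, and assumes that a probe of exactly 20 nt is allowed 2 ambiguous bases (tip 4 does not say which limit applies at exactly 20). The function names are hypothetical.

```python
# Bases allowed by each symbol; only the ambiguity codes named in tip 3 are included.
IUPAC = {
    "A": {"A"}, "C": {"C"}, "G": {"G"}, "T": {"T"},
    "R": {"A", "G"}, "W": {"A", "T"}, "S": {"G", "C"},
    "M": {"A", "C"}, "Y": {"C", "T"}, "K": {"G", "T"},
}


def column_consensus(column):
    """Return the single base consistent with every symbol in an alignment column, or None.

    N is ignored (tip 2); an ambiguity code such as R agrees with A or with G (tip 3).
    Gaps or unsupported symbols give None.
    """
    candidates = {"A", "C", "G", "T"}
    for symbol in column.upper().replace("U", "T"):
        if symbol == "N":
            continue
        if symbol not in IUPAC:
            return None
        candidates &= IUPAC[symbol]
    if len(candidates) == 1:
        return next(iter(candidates))
    return None


def ambiguity_ok(probe):
    """Check tip 4: at most 1 ambiguous base under 20 nt, otherwise at most 2."""
    probe = probe.upper()
    n_ambiguous = sum(1 for base in probe if base not in "ACGTU")
    limit = 1 if len(probe) < 20 else 2  # the limit at exactly 20 nt is an assumption
    return n_ambiguous <= limit
```

For example, a column reading "AARA" gives A, "TTNT" gives T, and "ATAA" gives no consensus.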

Genus names of organisms for Take Home Assignment # 2

1. Leisingera

2. Thermocladium

3. Phaeospirillum

4. Marisediminicola

5. Methylosarcina

6. Thalassobaculum

7. Yangia

8. Croceicoccus

9. Desulfonatronovibrio

10. Tomitella

11. Angustibacter

12. Asanoa

13. Trabulsiella

14. Catellatospora

15. Pimelobacter

16. Arsenicicoccus

17. Lapillicoccus

18. Actinoalloteichus

19. Azomonas

20. Yonghaparkia

21. Caenispirillum

22. Leclercia

23. Desulfofrigus

24. Uruburuella

25. Vitreoscilla

26. Singularimonas

27. Anaerobiospirillum

28. Cedecea

29. Herbiconiux

30. Amorphus
