Use the
hierarchy browser search tool to find your
assigned genus in the RDP. Set
the size to >1200 nt as you want to use
only near-full-length sequences. Please hand
in a list (not a printout) of the
hierarchical “phylogenetic steps” to your assigned
genus, starting with the Domain, and identify the
type of each step (for example, the first step is
of type Domain). Retrieve all the
sequences within the genus either from
the RDP or from any of the other databases
you've learned to use in this course. (Do this
regardless of whether the individual species
within the genus actually include your genus
name as part of their name. Also, don’t select a
sequence outside of the genus as shown on the
RDP just because it shares the genus name.
You should download between 10 and 15
sequences. If you think you have more or fewer
than this, please let me know.)
Note: It is important to check "remove
all gaps" in the sequence when downloading it if
it comes from RDP, which is an aligned database.
Find your genus again, only this time set the
size to “both”. How many total sequences
are in this genus, including the short partials?
Note: Do not retrieve any of
these shorter sequences. Use only the >1200
nt sequences for the rest of the take home. (6 pt)
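If you retrieve sequences in aligned form, the gap removal and length check described above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the input is assumed to be plain FASTA text, and gaps are assumed to be written as `-` or `.`.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) pairs."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def degap(seq):
    """Remove alignment gap characters ('-' and '.') from a sequence."""
    return seq.replace("-", "").replace(".", "")


def near_full_length(records, min_len=1200):
    """Keep only records whose degapped length is greater than min_len nt."""
    kept = []
    for header, seq in records:
        clean = degap(seq)
        if len(clean) > min_len:
            kept.append((header, clean))
    return kept
```

Counting the length after gap removal matters: an aligned record can be much longer than the sequence it actually contains.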
Align
the retrieved sequences that fit the above
criteria using MUSCLE or CLUSTALW. Hand in a copy of the
alignment (not colored). Be sure that each
sequence is clearly identified (by name, NOT
by number) by editing the labels in the original
sequence file or in the alignment. The software
will sometimes chop off labels after 10
characters or at the first space, which produces
identical labels for clones with longer names
whose unique identifiers fall only at
the end of the name. Feel free to abbreviate the
genus, and even the species if you have multiple strains. You
will likely want to run the alignment through MView
or BOXSHADE to see the consensus
sequence, but don’t turn in that
version; I want just the basic, uncolored text
alignment. (3 pt)
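One way to avoid the truncated-label problem is to shorten the names yourself before aligning. The sketch below is a hypothetical helper, not part of any alignment program: it assumes names of the form "Genus species strain", abbreviates the genus to one letter, removes spaces, and keeps the end of the name, where strain or clone identifiers usually sit.

```python
def short_labels(names, width=10):
    """Return unique, space-free labels of at most `width` characters."""
    labels, used = [], set()
    for name in names:
        genus, _, rest = name.partition(" ")
        # Abbreviate the genus to one letter when a species/strain part follows.
        compact = (genus[:1] + rest.replace(" ", "")) if rest else genus
        if len(compact) > width:
            # Keep the first letter plus the END of the name.
            compact = compact[0] + compact[-(width - 1):]
        base, label, n = compact, compact, 1
        # If two names still collide, replace the tail with a counter.
        while label in used:
            suffix = str(n)
            label = base[: width - len(suffix)] + suffix
            n += 1
        used.add(label)
        labels.append(label)
    return labels
```

Whatever scheme you use, keep a key that maps each short label back to the full name so the alignment and trees remain readable.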
Run a second
alignment with E. coli 16S rRNA (or
another bacterial 16S rRNA as long as it is
outside of your group) as one of the sequences –
don’t hand this alignment in (-2 pt if you do).
Comparing the consensus sequence from the first
alignment to this second alignment should help
you find regions that might be unique signature
sequences for your genus. Explain why
this second alignment should help you to
determine these regions. If you
were going to choose a different bacterial 16S rRNA
sequence instead of E. coli, would you choose one more closely or less closely
related to your genus than
E. coli? Explain your reasoning. If you
had a picture of the secondary structure of the
bacterial 16S rRNA showing which areas are
highly conserved and which are highly variable
(like the one I showed in lecture, with dots in place of
actual sequence data), would the highly conserved or
the highly variable areas of the picture be a better place to
search for a signature sequence for your genus?
Select 3-4 different potential sequences
(potential probe targets - see tips below).
Highlight each of your selected signature
sequences on the alignment pages you are
handing in and number them (1-3 or 1-4). Run
each candidate through Probe Match to determine
its usefulness. Hand in a copy of the
Probe Match output (be sure it is set to “both”)
for each of your signature sequence
candidates. Be sure the printout includes all
appropriate information up to the number of hits
in the genus (I don't need to see which
members of the genus match) and
the corresponding number from the highlighted
region on the alignment. (Note:
if you hand in a cut and paste version rather
than the actual print-out you will be docked if
information I need to assess your results is
missing, but if all the information is there
this is fine.) Rerun each signature sequence
through Probe Match restricting the search to
sequences with data in the region of this
signature sequence. (Pick a region 10-20 nt before the start of your signature sequence
to 10-20 nt after the end. Think - what
information have you generated that will make it
easy to identify where this would be on the
E. coli sequence?) Hand in the output
from this restricted run just like you did with
the initial Probe Match run, however, somewhere
on each printout write down the numbers for the
region you restricted the run to. Also,
write up an analysis for each of your Probe
Match results. Then summarize your
results by concluding which one of your signature
sequences you believe is the best. Explain why
you consider this signature sequence to give you
a better probe than each of the other options.
Be sure to include the information you gained by restricting the search area in
your argument. The signature sequence you
selected may be the best of your 3-4 candidates;
however, in the real world, would you consider
this probe search a success? Include an
explanation as to why you do or don't believe
the selected sequence will give you a useful
probe for your group. (15 pt)
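The comparison between the genus alignment and the outgroup can also be screened by script before you inspect it by eye. The following is a minimal sketch under stated assumptions: all rows come from the same alignment (the second one, which includes the outgroup), the function name and the `length` and `min_mismatch` defaults are hypothetical, and an outgroup gap counts as a mismatch.

```python
def signature_windows(genus_rows, outgroup_row, length=18, min_mismatch=3):
    """Find windows conserved within the genus but different in the outgroup.

    genus_rows: aligned genus sequences (equal-length strings).
    outgroup_row: the outgroup sequence from the SAME alignment.
    Returns (start_column, genus_window, mismatches); columns are 0-based.
    """
    ncol = len(outgroup_row)
    # A column is conserved if every genus sequence has the same non-gap base.
    conserved = [
        len({row[i] for row in genus_rows}) == 1 and genus_rows[0][i] not in "-."
        for i in range(ncol)
    ]
    hits = []
    for start in range(ncol - length + 1):
        cols = range(start, start + length)
        if not all(conserved[i] for i in cols):
            continue
        mismatches = sum(genus_rows[0][i] != outgroup_row[i] for i in cols)
        if mismatches >= min_mismatch:
            hits.append((start, genus_rows[0][start:start + length], mismatches))
    return hits
```

Windows reported this way are only candidates. Differing from one outgroup says nothing about the rest of the database, which is what the Probe Match runs are for.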
It is generally necessary to manually edit your
aligned sequences prior to phylogenetic analysis
or use a mask during the analysis. While I don't expect that your
genus will require major editing, there
may be some minor edits that could prove
useful. You don't have to actually make the
edits to your sequences before running your
trees, but please mark the edits you would make
on the alignment you generated
above (with the edited areas circled) and
give a brief explanation of why you would make
the edits. Alternatively, turn in a paragraph
explaining exactly why you would not need to make
any edits to your aligned sequences, if you
believe that to be the case. (3 pt)
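For reference, a mask is simply a per-column keep/drop decision applied to every sequence in the alignment. The sketch below uses one possible rule, dropping columns where more than half of the sequences have a gap. Both the rule and the function names are assumptions for illustration, not the editing criteria you are expected to apply.

```python
def gap_mask(rows, max_gap_fraction=0.5):
    """Return one boolean per column: True to keep it, False to mask it."""
    nseq = len(rows)
    keep = []
    for column in zip(*rows):
        gaps = sum(c in "-." for c in column)
        keep.append(gaps / nseq <= max_gap_fraction)
    return keep


def apply_mask(rows, keep):
    """Drop the masked columns from every aligned sequence."""
    return ["".join(c for c, k in zip(row, keep) if k) for row in rows]
```

Because the same columns are removed from every row, the masked sequences stay aligned with one another.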
Construct phylogenetic trees for your assigned
group of organisms using three completely
different tree methodologies (not just
different in appearance, as with rooted versus unrooted trees).
Hand in the trees along with the name and
general description of the tree methodology
(not program name) used for each. Be sure
it is clear how these three methodologies
differ. Please evaluate your phylogenetic
results, including the following in the
evaluation discussion: Did any of your
methods give you multiple trees? Why would
this occur? Do your trees truly differ?
Would you expect them to differ? Explain.
Are the branch lengths valid for any of your
trees? If so, which? Do you prefer
one tree over the others? If so, explain why.
For your genus, speculate as to whether
selecting a model that corrects for multiple
substitutions, rather than no correction, as one of
your analysis parameters would make a difference
(don't run it, just think about it).
Explain your answer. (17 pt)
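To make "correcting for multiple substitutions" concrete, the sketch below compares an uncorrected distance (the proportion of differing sites) with the Jukes-Cantor correction. Jukes-Cantor is used here only as one simple example of a correction model; it is an assumption of this sketch, not necessarily the model your tree program offers, and the function names are hypothetical.

```python
import math


def p_distance(a, b):
    """Proportion of differing sites between two aligned sequences.

    Columns with a gap in either sequence are ignored.
    """
    pairs = [(x, y) for x, y in zip(a, b) if x not in "-." and y not in "-."]
    diffs = sum(x != y for x, y in pairs)
    return diffs / len(pairs)


def jukes_cantor(p):
    """Jukes-Cantor corrected distance: d = -(3/4) * ln(1 - 4p/3)."""
    if p >= 0.75:
        raise ValueError("Jukes-Cantor distance is undefined for p >= 0.75")
    return -0.75 * math.log(1 - 4 * p / 3)
```

Under this model, a 1% observed difference is left almost unchanged, while a 20% observed difference is corrected to roughly 23%.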
You also
need to run a bootstrap analysis of the phylogenetic groupings
for your 3 trees. Please hand in a
hard copy of the computer
printouts showing the calculated bootstrap values
at the appropriate nodes on each tree. Be sure to
state how many random trees/datasets were tested
so we know whether, say, an 89 is out of 100 trees, 1000,
or whatever number you chose. On
each tree note whether or not any of the
branches (groupings) are not valid according to
the bootstrap analysis. Which groupings (if
any) are valid on all 3 trees? Which are
questionable? Which are not valid?
If they differ, which (if any) of your trees do you have the most confidence in? (6 pt)
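As a reminder of what the bootstrap numbers mean, each replicate dataset is built by resampling the alignment columns with replacement, and a node's value is how often its grouping reappears across the replicates. The sketch below shows only the resampling step and the percentage calculation; tree building is left to your phylogeny program, and the function names are hypothetical.

```python
import random


def bootstrap_alignment(rows, rng):
    """Resample alignment columns with replacement to build one pseudo-dataset."""
    ncol = len(rows[0])
    picks = [rng.randrange(ncol) for _ in range(ncol)]
    return ["".join(row[i] for i in picks) for row in rows]


def support_percent(times_recovered, replicates):
    """Express bootstrap support as a percentage of replicates."""
    return 100.0 * times_recovered / replicates


# Example: one replicate from a toy alignment, with a fixed seed for repeatability.
replicate = bootstrap_alignment(["ACGT", "AGGT", "ACTT"], random.Random(0))
```

This is why the number of replicates must be stated: 89 out of 100 and 890 out of 1000 are both 89% support, but a raw count of 89 out of 1000 would be under 9%.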
Probe Design tips
1. Probes, optimally, should be about 18-25 nucleotides
in length, but some are as short as 15
nucleotides and others are longer than 25.
2. If most of your sequences have a particular nucleotide
(say a T for example) at a site, but one or two
of the sequences have an N at that site (meaning
it could be any nucleotide), go ahead and design
the probe with that nucleotide (the T in my
example).
3. If you are having considerable trouble finding
consensus sequence regions long enough for
probes, expand your options by first checking to
see if some base uncertainties are masking
consensus. The following IUPAC
abbreviations may be used within your sequence:
R for A or G, W for A or T, S for G or C, M for
A or C, Y for C or T, and K for G or T.
Consider then that an R may be in consensus if
the other sequences all have A or all have G at
that position.
4. If necessary, design the probes using an ambiguous
base like R or W (only 1 ambiguous base for
probes shorter than 20 nt or 2 for probes over
20 nt).
5. Your phylogenetic trees may show that 1 sequence or a
small group of sequences is only distantly
related to the rest of the sequences. If
these outlying sequences are causing problems in finding
a consensus region for a probe, go ahead and
design your probe for just the main group of
sequences. You will need to explain
this in your paragraph on the probe.
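Tips 2 through 4 can be checked mechanically. The sketch below is a hypothetical helper set: `consensus_base` returns the single base compatible with every entry in an alignment column (so an N or an R does not hide a consensus), and `ambiguity_ok` applies the limit from tip 4. Tip 4 does not say what to do with a probe of exactly 20 nt, so this sketch assumes the stricter limit of 1 in that case.

```python
# Bases allowed by each code (the codes from tip 3, plus N from tip 2).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "W": "AT", "S": "CG", "M": "AC", "Y": "CT", "K": "GT",
    "N": "ACGT",
}


def consensus_base(column):
    """Return the one base compatible with every entry in the column, else None."""
    possible = set("ACGT")
    for symbol in column:
        if symbol in "-.":
            return None  # a gap means there is no consensus at this position
        possible &= set(IUPAC[symbol.upper()])
    return possible.pop() if len(possible) == 1 else None


def ambiguity_ok(probe):
    """Check tip 4: at most 1 ambiguous base, or 2 for probes over 20 nt."""
    n_ambiguous = sum(base.upper() not in "ACGT" for base in probe)
    # Assumption: a probe of exactly 20 nt gets the stricter limit of 1.
    limit = 2 if len(probe) > 20 else 1
    return n_ambiguous <= limit
```

For example, a column reading T, T, N, T has consensus T (tip 2), and a column reading A, A, R, A has consensus A (tip 3), whereas a column containing both A and G has no single-base consensus.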
Genus names of organisms for Take Home Assignment # 2
1. Leisingera
2. Thermocladium
3. Phaeospirillum
4. Marisediminicola
5. Methylosarcina
6. Thalassobaculum
7. Yangia
8. Croceicoccus
9. Desulfonatronovibrio
10. Tomitella
11. Angustibacter
12. Asanoa
13. Trabulsiella
14. Catellatospora
15. Pimelobacter
16. Arsenicicoccus
17. Lapillicoccus
18. Actinoalloteichus
19. Azomonas
20. Yonghaparkia
21. Caenispirillum
22. Leclercia
23. Desulfofrigus
24. Uruburuella
25. Vitreoscilla
26. Singularimonas
27. Anaerobiospirillum
28. Cedecea
29. Herbiconiux
30. Amorphus