Multiple Sequence Alignments
1.
Go to
http://www.ebi.ac.uk/Tools/msa/
This page, hosted by the European Bioinformatics Institute
(EMBL-EBI), offers a number of multiple sequence alignment (MSA)
tools. Some align only protein sequences, but most handle
nucleic acid sequences as well.
We are going to compare the results of several MSA tools for our
dataset.
2.
Select “Launch MUSCLE” from the list of programs.
a.
Upload your file of sequences, or alternatively paste them into
the box, separating the sequences with a line and giving each
one a header of the form: >SeqName/ID
b.
Under Step 2, you must change the output format from the
default, Pearson/FASTA, in order to see the alignment. The
Pearson/FASTA output shows each sequence individually with the
gaps introduced by the alignment (a format used as input by some
phylogenetic programs).
Change the output format to “Phylip interleaved” and hit Submit.
For a dataset this size the task will generally complete in a
couple of minutes, so the e-mail alert isn’t necessary.
Once you get the results, download them, as we will need
them for lab 2.3.
c.
Rerun this program using the “HTML” output format instead.
Although “Phylip interleaved” and the remaining formats also
provide an alignment view, HTML adds some shading for conserved
residues (strongly conserved is blue and weakly conserved is
gray) for easier viewing.
There is, however, no consensus line. Programs you can
download, such as Boxshade and Textshade, will shade the
residues (nt or amino acids) in different colors based on
consensus levels you set. They provide many different
viewing options that some people prefer.
d.
While waiting for this to run, open this site in a new window
(not a tab) for step 3 below.
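The input format described in step 2a can be checked before you upload or paste anything. The following is a minimal Python sketch, not part of the EBI tools: the function name read_fasta and the two short sequences are placeholders made up for illustration. It assumes each record starts with a header line of the form >SeqName/ID and that blank lines between records can be ignored.

```python
def read_fasta(text):
    """Parse FASTA text into a dict mapping header name -> sequence string."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # blank lines between records are ignored
        if line.startswith(">"):
            parts = line[1:].split()
            if not parts:
                raise ValueError("header line has no name after '>'")
            name = parts[0]  # ">SeqName/ID description" -> "SeqName/ID"
            if name in records:
                raise ValueError(f"duplicate header: {name}")
            records[name] = ""
        elif name is None:
            raise ValueError("sequence data found before any '>' header")
        else:
            records[name] += line.upper()
    return records


# Placeholder sequences, invented for this example only.
example = """>SeqA/1
AGAGTTTGATC
ATGGCTCAG

>SeqB/2
ATTCCGGTTGATCCTGCCGG
"""

records = read_fasta(example)
for seq_name, seq in records.items():
    print(seq_name, len(seq))
```

If the function raises an error on your file, fix the headers before submitting it to the alignment tools.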
3.
Let’s compare the
MUSCLE alignment to one we get using a different alignment tool.
Select “Launch ClustalW2”, a slightly newer version of
the popular CLUSTALW.
a.
Under Step 1, change the default from Protein to DNA and then
enter your sequence file as above in 2a.
b.
Under Step 2, leave the alignment type at the default “Slow”
unless the program bogs down in class, in which case we may
switch to “Fast”. “Slow” typically gives better results, so
“Fast” is used only for much larger datasets where the wait time
increases dramatically.
Also leave the defaults for Step 3 for now.
[Time permitting, try rerunning the alignment with a different
weight matrix and/or different gap penalties to see the effect
on the alignment.]
c.
When the results pop up, scroll through them. The numbers at the
end of each line give the nucleotide number for that sequence
(gaps are NOT counted). When I refer to E. coli numbering, this
is a good place to determine it.
The asterisks at the bottom of each group of aligned
sequences show where there is 100% consensus.
d.
Note the “Show Colors” button. The coloring is actually
designed for protein sequences, so it is more effective for
protein alignments, but it may help you see some areas of
consensus within the DNA alignment.
e.
For a better color representation, select “Download Alignment
File” on your results page (note: this doesn’t work with all
browsers). You will get a file that you can upload into MView,
described in step 4 below, which provides some color contrast.
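The two features described in step 3c, the asterisk line and the gap-free numbers at the end of each line, can be reproduced in a few lines of Python. This is only a sketch under stated assumptions: the function names are hypothetical, the three short aligned sequences are invented, and it is not the code ClustalW2 itself uses.

```python
def conservation_line(aligned):
    """Return '*' under each column where every sequence has the same residue."""
    marks = []
    for column in zip(*aligned):
        first = column[0]
        same = first != "-" and all(c == first for c in column)
        marks.append("*" if same else " ")
    return "".join(marks)


def running_counts(aligned_seq, block_width):
    """Cumulative count of non-gap residues at the end of each alignment block."""
    counts = []
    total = 0
    for start in range(0, len(aligned_seq), block_width):
        block = aligned_seq[start:start + block_width]
        total += sum(1 for c in block if c != "-")  # gaps are NOT counted
        counts.append(total)
    return counts


# Invented aligned sequences, all the same length, '-' marks a gap.
aligned = ["ACG-TTGA", "ACGATT-A", "ACG-TTCA"]

stars = conservation_line(aligned)
for seq in aligned:
    print(seq, running_counts(seq, 4))
print(stars)
```

Because gaps are skipped, two sequences can show different numbers at the end of the same line even though they are aligned column by column.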
4.
Select “Launch MView” from the list of programs. Note that this
is simply a viewing tool and not an editor. (You may want to do
this in a third window.)
a.
Under Step 1, change the default from Protein to DNA.
b.
Upload your downloaded file.
c.
Either leave the Step 2 input format at “Automatic” or change it
to the appropriate format. Leave the Step 3 output at the
default parameters for now, but feel free to experiment with the
different options.
Unfortunately, this viewer sets your first sequence as the
reference, so the shading is based on whether each residue is
the same as or different from the reference.
The alignment is numbered, however, and it has 4
different lines of consensus sequence (at 100% all the sequences
have the same residue at the site, whereas at 70% at least 7 out
of 10 sequences have that residue; R stands for purine and Y
for pyrimidine).
Why do you suppose they
provide 4 different levels of consensus?
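To see how a consensus line changes with the threshold, here is a simplified Python sketch. It is not MView's actual algorithm: the function name, the "." symbol for "no consensus", the rule of dividing by the total number of sequences, and the ten invented four-column sequences are all assumptions made for illustration. It reports a base when enough sequences share it, and otherwise falls back to R (purine) or Y (pyrimidine) when enough sequences share the class.

```python
from collections import Counter

PURINES = set("AG")
PYRIMIDINES = set("CTU")


def consensus(aligned, percent):
    """Consensus string at an integer percent threshold (for example 70 or 100)."""
    n = len(aligned)
    out = []
    for column in zip(*aligned):
        counts = Counter(c.upper() for c in column if c != "-")
        symbol = "."  # assumed placeholder for "no consensus at this level"
        if counts:
            base, top = counts.most_common(1)[0]
            purines = sum(v for k, v in counts.items() if k in PURINES)
            pyrimidines = sum(v for k, v in counts.items() if k in PYRIMIDINES)
            # Integer arithmetic avoids floating-point surprises at the cutoff.
            if top * 100 >= percent * n:
                symbol = base
            elif purines * 100 >= percent * n:
                symbol = "R"
            elif pyrimidines * 100 >= percent * n:
                symbol = "Y"
        out.append(symbol)
    return "".join(out)


# Ten invented sequences: column 1 is all A, column 2 is 7 G + 3 A,
# column 3 is 7 C + 3 G, column 4 is 5 C + 5 T.
aligned = ["AGCC"] * 5 + ["AGCT"] * 2 + ["AAGT"] * 3

for level in (100, 90, 80, 70):
    print(level, consensus(aligned, level))
```

Compare the lines printed for 100 and 70 when you think about the question above.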
5.
Analyze your
alignments.
a.
How do the two alignments (MUSCLE and
CLUSTALW2) compare?
Give the numbers
(with respect to the E. coli numbering) for 3-4
areas where they differ and briefly describe the difference.
b.
Within a given alignment, do the sequences start and end in the
same place? Why do you suppose this is? Do you think
this affects your alignments?
c.
Scanning your alignments, you should see both variable and
conserved regions. Why are both of these features
important?
d.
The region between 1300 and 1400 (E. coli numbering)
contains an area of signature sequence that is considered
universal. Find it and write down at least 10 nt from this
conserved region (assume N's are likely conserved nt).
e.
Give the numbers (from the consensus sequence) for a couple of
regions (size doesn't matter) where Eukarya and Archaea
(Methanococcus and Pyrodictium) have sequence in common but the
Bacterial sequences (E. coli and M. scandinavica)
are different. Give the numbers for a couple of regions
that the Archaea and Bacteria have in common. Do likewise for
Eukarya and Bacteria. Was the last one harder to find?
Why do you suppose that's true?
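Several of the questions above ask for positions in E. coli numbering. One way to find the matching alignment columns is to count only the non-gap residues of the E. coli row, as in this Python sketch. The function names and the tiny two-sequence alignment are invented for illustration, and the numbers only match true E. coli numbering if the E. coli sequence in your file starts at position 1 of the gene, which is an assumption you should check.

```python
def reference_numbering(aligned_ref):
    """For each alignment column, the reference residue number, or None at a gap."""
    numbers = []
    position = 0
    for c in aligned_ref:
        if c == "-":
            numbers.append(None)
        else:
            position += 1
            numbers.append(position)
    return numbers


def slice_by_reference(alignment, ref_name, start, end):
    """Cut every aligned sequence to the columns spanning reference positions start..end."""
    numbers = reference_numbering(alignment[ref_name])
    if start not in numbers or end not in numbers:
        raise ValueError("requested positions are outside the reference sequence")
    first_col = numbers.index(start)
    last_col = numbers.index(end)
    return {name: seq[first_col:last_col + 1] for name, seq in alignment.items()}


# Invented alignment: "Ref" plays the role of the E. coli row.
alignment = {"Ref": "AC-GTTA", "Other": "ACGG-TA"}

region = slice_by_reference(alignment, "Ref", 2, 4)
for name, fragment in region.items():
    print(name, fragment)
```

With a real alignment you would call slice_by_reference with 1300 and 1400 to pull out the region mentioned in 5d.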
6.
Try at least one more alignment from MAFFT or T-Coffee.
Is this alignment the same as or different from the other two?
Looking them over, do you have a preference for any of the
formats? If yes, why? What other information would you need to
determine which is the best alignment?