Multiple Alignments
I. Multiple alignment - process of aligning 3 or more sequences with each other
A. The set of sequences is arranged in a scheme where positions believed to be homologous are written in a common column
B. Missing characters (alignment gaps) are denoted with a dash
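As a small illustration of the column layout described above, the sketch below stores an alignment as equal-length strings and reads it column by column. The three sequences are invented toy data, and `columns` is a hypothetical helper name, not part of any alignment program.

```python
# Toy alignment: three made-up sequences, already padded with '-' (gaps)
# so that every row has the same length.
alignment = [
    "ACG-TAC",
    "ACGGTAC",
    "AC--TAC",
]

def columns(rows):
    """Return the alignment columns as strings, one per aligned position."""
    if len({len(r) for r in rows}) != 1:
        raise ValueError("aligned rows must all have the same length")
    return ["".join(chars) for chars in zip(*rows)]

cols = columns(alignment)
for position, col in enumerate(cols):
    print(position, col)
```

Each printed string is one column: the characters that the alignment proposes as homologous, with a dash wherever a sequence has no character at that position.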
II. Reasons for doing multiple alignments: “One sequence plays coy; a pair of homologous sequences whisper; many aligned sequences shout out loud.”
A. Pairwise comparisons do not readily show positions that are conserved among a whole set of sequences - they tend to miss subtle similarities that become visible only when many species are observed simultaneously
B. To identify regions of similar sequence that define a conserved consensus pattern or domain within a group of sequences sharing a conserved biological function or belonging to one phylogenetic group
C. For sequences of unknown function, the presence of similar domains in several similar sequences implies a similar biochemical function or structural fold - basis for further investigation
D. To try to derive possible evolutionary relationships among sequences
E. For sequences from an organism of unknown identity and function, the presence of signature sequences for a group of organisms with known functions may provide insight into its physiology or structure - basis for further investigation
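To make the point about conserved positions concrete, here is a minimal sketch that computes a per-column consensus and the fraction of sequences agreeing with it. The four protein fragments are invented, and the function names and the 100% threshold are assumptions chosen for illustration.

```python
from collections import Counter

def column_conservation(rows):
    """For each column, return (most common character, fraction of rows sharing it)."""
    result = []
    for chars in zip(*rows):
        residue, count = Counter(chars).most_common(1)[0]
        result.append((residue, count / len(chars)))
    return result

def conserved_positions(rows, threshold=1.0):
    """Indices of non-gap columns whose top character reaches the threshold."""
    return [i for i, (residue, fraction) in enumerate(column_conservation(rows))
            if residue != "-" and fraction >= threshold]

# Invented toy alignment (no gaps, for simplicity).
rows = ["MKVLHAG", "MRVLHSG", "MKILHAG", "MKVLHTG"]

consensus = "".join(residue for residue, _ in column_conservation(rows))
conserved = conserved_positions(rows)
print(consensus, conserved)
```

Any single pair of these rows differs at only one to three positions, which says little about which positions matter; the fully conserved columns only stand out once all rows are viewed together.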
III. Computer Alignments
A. Computational complexity of aligning all sequences simultaneously makes an exact solution impractical - approximation algorithms are used
B. Methods often used with > 10 sequences begin with pairwise comparison of all sequences to be aligned to get similarities
C. These similarities are then used to cluster sequences into the most related groups or to make a phylogenetic tree
D. For the grouping method (examples PIMA and MAXHOM):
1. A consensus sequence is determined for each group
2. These consensus sequences are used to align the groups
E. For the phylogenetic tree method (example CLUSTAL):
1. Similarities are converted to evolutionary distances, which are used to construct a tree
2. The closest sequences according to the tree are aligned to each other and a consensus sequence is determined for the pair
3. The consensus sequence is then aligned to the next closest sequence or cluster of sequences, and so on
4. The alignment is guided by a hierarchical tree, which can influence the final result: the tree may not be a valid phylogeny, so it could direct further evolutionary analysis in the wrong direction
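The order in which a tree-guided method joins sequences can be sketched as follows. This is a simplified illustration, not CLUSTAL's procedure: plain edit distance stands in for the evolutionary distance derived from pairwise alignment scores, average-linkage clustering stands in for real tree building, and the sequences and function names are invented.

```python
from itertools import combinations

def edit_distance(s, t):
    """Levenshtein distance by dynamic programming (stand-in for an evolutionary distance)."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        curr = [i]
        for j, b in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (a != b)))   # match or substitution
        prev = curr
    return prev[-1]

def leaves(node):
    """Flatten a nested-tuple cluster into its leaf names."""
    if isinstance(node, str):
        return [node]
    return [name for child in node for name in leaves(child)]

def guide_order(seqs):
    """Repeatedly join the two closest clusters; return the joins as nested tuples."""
    dist = {frozenset(pair): edit_distance(seqs[pair[0]], seqs[pair[1]])
            for pair in combinations(seqs, 2)}

    def average_distance(c1, c2):
        pairs = [(x, y) for x in leaves(c1) for y in leaves(c2)]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    clusters = list(seqs)
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda p: average_distance(*p))
        clusters = [c for c in clusters if c not in (c1, c2)] + [(c1, c2)]
    return clusters[0]

# Invented toy sequences: a and b are nearly identical, d is very different.
seqs = {"a": "ACGTACGT", "b": "ACGTACGA", "c": "ACCTACCA", "d": "TTTTGGGG"}
tree = guide_order(seqs)
print(tree)
```

The innermost tuple is the pair that would be aligned first, and each enclosing tuple is a later step in which a sequence or cluster is added. Because every later step builds on the earlier ones, a misleading tree carries its errors into the final alignment.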
IV. CLUSTALW example
A. Common program - used to align multiple protein and nucleic acid sequences rapidly and reliably
B. CLUSTALW has a simple text-mode interface, CLUSTALX has a graphical user interface
C. Basically 4 steps
1. Pairwise alignment of all sequences to be used in the multiple alignment
2. The similarity scores (% identity) are used to generate an unrooted tree (NJ)
3. Unrooted tree is converted to a rooted tree by the mid-point method
4. Work in from the tips of the tree toward the root to align increasingly larger groups of sequences
D. Parameters can be adjusted
1. Gap penalties
2. Delay divergent sequence option (makes optimal use of consensus alignments and gap insertions that are created earlier with less divergent sequences)
3. Weighted versus unweighted - transitions (A↔G, T↔C) weighted more strongly than transversions (A↔T, A↔C, G↔T, G↔C)
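The transition/transversion distinction in the weighting option can be sketched as a small scoring function. The scores used here (1 for a match, a fractional weight for a transition, 0 for a transversion) and the default weight of 0.5 are illustrative assumptions, not values taken from the program.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def substitution_type(x, y):
    """Classify a pair of aligned DNA bases."""
    if x == y:
        return "match"
    if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
        return "transition"      # purine<->purine or pyrimidine<->pyrimidine
    return "transversion"        # purine<->pyrimidine

def score_pair(x, y, transition_weight=0.5):
    """Hypothetical scoring: match = 1, transition = transition_weight, transversion = 0."""
    kind = substitution_type(x, y)
    if kind == "match":
        return 1.0
    return transition_weight if kind == "transition" else 0.0

def score_aligned(s, t, transition_weight=0.5):
    """Sum the pair scores over two gap-free aligned sequences of equal length."""
    return sum(score_pair(x, y, transition_weight) for x, y in zip(s, t))

weighted = score_aligned("ACGT", "GCTT")           # A/G transition, G/T transversion
unweighted = score_aligned("ACGT", "GCTT", 0.0)    # transitions scored as plain mismatches
print(weighted, unweighted)
```

With the weight set to 0 the scheme is unweighted: a transition counts as an ordinary mismatch, just like a transversion.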
V. Computer output is a prediction that must always be placed in a biological context
A. rRNA secondary structure
B. Compensatory changes
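One way to bring rRNA secondary structure into the picture is to check whether two alignment columns that are thought to base-pair stay complementary across sequences even when the bases themselves change (compensatory changes). The sketch below assumes invented toy RNA sequences, a hypothetical helper `pair_support`, and a pairing rule that accepts Watson-Crick and G-U wobble pairs.

```python
WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
WOBBLE = {("G", "U"), ("U", "G")}

def pair_support(rows, i, j):
    """Return (number of rows where columns i and j can pair,
               number of distinct pair types among those rows)."""
    pairs = [(row[i], row[j]) for row in rows]
    paired = [p for p in pairs if p in WATSON_CRICK or p in WOBBLE]
    return len(paired), len(set(paired))

# Invented toy alignment of a short hairpin: column 0 is proposed to pair
# with column 6, and column 1 with column 5.
rows = [
    "GCGAAGC",
    "GUGAAAC",
    "GGGAACC",
    "GCGAAAC",
]

outer = pair_support(rows, 0, 6)   # same G-C pair in every row
inner = pair_support(rows, 1, 5)   # pair type changes between rows
print(outer, inner)
```

In this toy case the inner pair is the more informative one: three rows keep a valid pair while using three different pair types, which is the pattern expected from compensatory changes, whereas the outer pair is simply invariant. An alignment that breaks such pairs in most sequences may deserve a second look.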
Web site used as lecture source that may prove useful to you.
http://www.dkfz-heidelberg.de/tbi/bioinfo/MSA/Intro/index.html