Lecture 2.2

Multiple Alignments

I. Multiple alignment - process of aligning 3 or more sequences with each other

A. A set of sequences is arranged in a scheme where positions believed to be homologous are written in a common column

B. Missing characters (alignment gaps) are denoted with a dash
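A minimal sketch of the layout described above, assuming a made-up three-sequence DNA alignment (the sequences and the helper names `columns` and `conserved_positions` are illustrative only). Each row is one sequence, each string position is one column, and a dash marks a gap.

```python
# Hypothetical toy alignment: three rows of equal length, '-' marks a gap.
alignment = [
    "ACG-TAC",
    "ACGGTAC",
    "AC--TTC",
]

def columns(aln):
    """Return the alignment column by column, one string per column."""
    if len({len(row) for row in aln}) != 1:
        raise ValueError("all aligned rows must have the same length")
    return ["".join(col) for col in zip(*aln)]

def conserved_positions(aln):
    """Indices of columns that hold the same residue in every row, with no gap."""
    return [i for i, col in enumerate(columns(aln))
            if "-" not in col and len(set(col)) == 1]

print(columns(alignment))              # ['AAA', 'CCC', 'GG-', '-G-', 'TTT', 'AAT', 'CCC']
print(conserved_positions(alignment))  # [0, 1, 4, 6]
```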

II. Reasons for doing multiple alignments: “One sequence plays coy; a pair of homologous sequences whisper; many aligned sequences shout out loud.”

A. Pairwise comparisons do not readily show positions that are conserved across a whole set of sequences - they tend to miss subtle similarities that become visible only when many species are compared simultaneously

B. To identify regions of similar sequence that define a conserved consensus pattern or domain within a group of sequences sharing a biological function, or within a phylogenetic group

C. For sequences of unknown function, the presence of similar domains in several similar sequences implies a similar biochemical function or structural fold - basis for further investigation

D. To try to derive possible evolutionary relationships among sequences

E. For sequences from an organism of unknown identity and function, the presence of signature sequences for a group of organisms with known functions may provide insight into its physiology or structure - basis for further investigation

III. Computer Alignments

A. The computational complexity of aligning all sequences simultaneously makes an exact solution impractical - approximation algorithms are used instead

B. Methods commonly used for more than 10 sequences begin with pairwise comparisons of all the sequences to be aligned, to obtain similarity scores

C. These similarities are then used either to cluster the sequences into the most closely related groups or to build a phylogenetic tree

D. For the grouping method (examples: PIMA and MAXHOM):

1. a consensus sequence is determined for each group

2. these consensus sequences are used to align the groups
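To illustrate what "a consensus sequence for a group" can mean, here is a minimal sketch that uses simple majority rule per column. This is an assumption made for illustration: PIMA and MAXHOM use their own pattern and profile representations, which this does not reproduce. The `group` data and the function name `consensus` are hypothetical.

```python
from collections import Counter

def consensus(aln):
    """Majority-rule consensus: the most common symbol in each column.

    Gaps are counted like any other symbol, so a mostly gapped column
    yields '-' in the consensus.
    """
    out = []
    for col in zip(*aln):
        symbol, _count = Counter(col).most_common(1)[0]
        out.append(symbol)
    return "".join(out)

# Hypothetical group of already aligned sequences
group = ["ACG-TAC", "ACGGTAC", "AC--TTC"]
print(consensus(group))  # ACG-TAC
```

The resulting string can then stand in for the whole group when groups are aligned to each other.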

E. For the phylogenetic tree method (example: CLUSTAL):

1. similarities are converted to evolutionary distances which are used to construct a tree

2. The closest sequences according to the tree are aligned to each other and a consensus sequence is determined for the pair

3. This consensus sequence is then aligned to the next closest sequence or cluster of sequences, and so on

4. The whole process is guided by a hierarchical tree, which influences the final result - if the guide tree is not a valid tree, the alignment built on it can send further evolutionary analysis in the wrong direction
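The order in which sequences get joined can be sketched as follows. This is a simplified stand-in that uses greedy average-linkage clustering on a made-up distance table; it is not the tree-building method of any particular program, and the names `merge_order`, `dist`, and `s1`..`s4` are hypothetical.

```python
from itertools import combinations

def merge_order(names, dist):
    """Greedy average-linkage clustering.

    Returns the merges in the order a progressive aligner would perform
    them: closest pair first, then the next closest sequence or cluster.
    """
    clusters = [(name,) for name in names]
    merges = []

    def avg(a, b):
        # Mean pairwise distance between members of clusters a and b
        total = sum(dist[frozenset((x, y))] for x in a for y in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda p: avg(*p))
        merges.append((a, b))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges

# Hypothetical evolutionary distances between four sequences
dist = {
    frozenset(("s1", "s2")): 0.10,
    frozenset(("s1", "s3")): 0.40,
    frozenset(("s2", "s3")): 0.35,
    frozenset(("s1", "s4")): 0.70,
    frozenset(("s2", "s4")): 0.65,
    frozenset(("s3", "s4")): 0.60,
}

for a, b in merge_order(["s1", "s2", "s3", "s4"], dist):
    print(a, "+", b)
```

With these numbers s1 and s2 are joined first, s3 is added next, and s4 last. Changing a single distance can change that order, which is one way to see how the guide tree shapes the final alignment.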

IV. CLUSTALW example

A. Common program - used to align multiple protein and nucleic acid sequences rapidly and reliably

B. CLUSTALW has a simple text-mode interface; CLUSTALX has a graphical user interface

C. Basically 4 steps

1. Pairwise alignment of all sequences to be used in the multiple alignment

2. The similarity scores (% identity) are used to generate an unrooted tree by the neighbor-joining (NJ) method

3. The unrooted tree is converted to a rooted tree by the mid-point method

4. Work in from tips of tree toward root to align increasingly larger groups of sequences
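Steps 1 and 2 depend on a similarity score for each pair. A minimal sketch of percent identity and a derived distance is given below. It assumes that columns containing a gap are ignored and that distance is simply 1 minus the fractional identity; programs differ on both choices, so treat these as illustrative assumptions. The function names are hypothetical.

```python
def percent_identity(a, b, gap="-"):
    """Percent identity over columns where neither aligned sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    pairs = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not pairs:
        return 0.0
    matches = sum(x == y for x, y in pairs)
    return 100.0 * matches / len(pairs)

def distance(a, b):
    """Simple distance: 1 minus fractional identity (an assumed conversion)."""
    return 1.0 - percent_identity(a, b) / 100.0

# Hypothetical pairwise alignment: 6 ungapped columns, 5 of them identical
print(round(percent_identity("ACGTTAC", "ACG-TTC"), 1))  # 83.3
print(round(distance("ACGTTAC", "ACG-TTC"), 3))          # 0.167
```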

D. Parameters can be adjusted

1. Gap penalties

2. Delay divergent sequence option (makes optimal use of consensus alignments and gap insertions that are created earlier with less divergent sequences)

3. Weighted versus unweighted - transitions (A↔G, T↔C) weighted more strongly than transversions (A↔T, A↔C, G↔T, G↔C)
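The transition/transversion distinction can be sketched in a few lines. The scoring scheme here (match = 1, transversion = 0, transition = an adjustable weight between 0 and 1) and the default of 0.5 are assumptions chosen for illustration; check the program documentation for the values it actually uses. The function names are hypothetical.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def substitution_type(x, y):
    """Classify a pair of DNA bases as match, transition, or transversion."""
    if x == y:
        return "match"
    if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
        return "transition"    # purine<->purine or pyrimidine<->pyrimidine
    return "transversion"      # purine<->pyrimidine

def weighted_score(a, b, transition_weight=0.5):
    """Sum of per-column scores for two aligned, ungapped DNA strings."""
    scores = {"match": 1.0, "transition": transition_weight, "transversion": 0.0}
    return sum(scores[substitution_type(x, y)] for x, y in zip(a, b))

print(substitution_type("A", "G"))                        # transition
print(substitution_type("A", "T"))                        # transversion
print(weighted_score("AGCT", "GGTA"))                     # 2.0
print(weighted_score("AGCT", "GGTA", transition_weight=0.0))  # 1.0 (unweighted)
```

Setting the weight to 0 treats transitions like any other mismatch, which corresponds to the unweighted case.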

V. Computer output is a prediction that must always be placed in a biological context

A. rRNA secondary structure

B. Compensatory changes
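One way to bring this biological context to an alignment is to check whether two columns that are thought to pair in an rRNA stem change together so that pairing is preserved. The sketch below is illustrative only: the toy RNA alignment, the choice to accept G-U wobble pairs, and the function names are all assumptions.

```python
# Watson-Crick pairs plus the G-U wobble pair (assumed acceptable in a stem)
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}

def pairing_support(aln, i, j):
    """Count (paired, total) over rows with no gap in column i or column j."""
    paired = total = 0
    for row in aln:
        x, y = row[i], row[j]
        if "-" in (x, y):
            continue
        total += 1
        paired += (x, y) in PAIRS
    return paired, total

def is_compensatory(aln, i, j):
    """True when both columns vary between rows yet every ungapped row still pairs."""
    rows = [r for r in aln if r[i] != "-" and r[j] != "-"]
    paired, total = pairing_support(aln, i, j)
    varies = len({r[i] for r in rows}) > 1 and len({r[j] for r in rows}) > 1
    return total > 0 and paired == total and varies

# Hypothetical aligned RNA fragment: columns 0 and 4 are a proposed base pair
rna = ["GAAAC", "CAAAG", "AAAAU"]
print(pairing_support(rna, 0, 4))   # (3, 3)
print(is_compensatory(rna, 0, 4))   # True
print(is_compensatory(rna, 1, 3))   # False
```

An alignment that breaks such covarying pairs is a candidate for manual adjustment, even if the program scored it well.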

Web site used as lecture source that may prove useful to you.

http://www.dkfz-heidelberg.de/tbi/bioinfo/MSA/Intro/index.html
