5 Sequence Alignment

Pairwise and Multiple Sequence Alignment Algorithms

5.1 Learning goals

By the end of this chapter you should be able to:

Explain why biologists align sequences and distinguish similarity from homology, orthologs, and paralogs.
Describe how pairwise alignment scores are computed from substitution matrices and gap penalties.
Outline and interpret the Needleman–Wunsch global alignment algorithm and contrast it with Smith–Waterman local alignment.
Explain the difference between a dynamic versus heuristic programming.
Explain what a multiple sequence alignment (MSA) is, why it is more informative than pairwise alignment, and summarize major MSA algorithm families.
Read alignment output critically, recognizing poor alignments and the influence of parameter choices.

5.2 Why align sequences?

Sequence alignment is a core operation in genomics, used whenever you want to compare DNA or protein sequences to ask biological questions¹. Typical motivations include inferring gene or protein function for a new sequence by comparing it to known sequences, identifying orthologous genes across species for phylogeny reconstruction, and detecting conserved motifs and domains that may underlie shared structure or regulation¹. This chapter covers pairwise sequence alignment (both local and global) before extending these concepts to multiple sequence alignment, where information from many sequences is integrated to reveal patterns invisible in pairwise comparisons^1,2.

An algorithm is a step-by-step set of instructions used to solve a problem—a computational protocol. Understanding the algorithms commonly used in sequence analysis helps you recognize their key assumptions and limitations, unveiling the black box mentioned in earlier chapters¹.

5.2.1 Similarity, homology, orthologs, paralogs

Alignment provides a similarity score, but what biologists usually care about is homology, meaning shared ancestry rather than just high percent identity¹. Two sequences are either homologous or not; there is no partial homology, only stronger or weaker evidence for it¹. Orthologs are homologous genes in different species that diverged via a speciation event and usually retain similar function, whereas paralogs are homologous genes within a genome that arose by duplication and often diverge in function over time¹. When you use alignment to infer orthology or paralogy, remember that homology is about evolutionary history, not just alignment score, and typically requires integrating tree information as well.

5.2.2 Amino acids versus nucleotides

With nucleotides offering only four different bases, the information content for homology is low, and codon sequences can be degenerate in how they code for amino acids¹. In contrast, protein sequences offer up to 20 different amino acid types, vastly increasing the information content¹. For identifying homologs, protein sequence comparison is therefore considered superior, providing a deeper look back in time and enabling identification of homologs across the tree of life¹. Exceptions include noncoding DNA regions such as introns and intergenic sequences, as well as studies of DNA polymorphisms that distinguish different populations¹.

5.2.3 Types of sequence changes

Single nucleotide substitutions are often referred to as either SNPs, as you saw earlier in this course. Another term you might encounter is single nucleotide variants (SNVs), a term used more recently to indicate a low frequency variant, often used more in clinical settings. Additionally, you might encounter the term multi-nucleotide polymorphisms (MNPs). In addition to substitutions, there are short insertion/deletion variants, or indels. Large scale rearrangements are much harder to study given that they violate a lot of the assumptions of most sequencing alignment algorithms. Therefore, we will focus on small changes. It is also worth noting that more algorithms only allow 2 possible variants at any one position. It is important to understand the limitations of the algorithms you use when working with datasets that are complex.

Lab Connections

Later in Lab 9, we will visualize sequence variants relative to a reference genome in IGV and describe these sequence changes using Human Genome Variation Society (HGVS) standardized nomenclature¹. Specifically, a sequence change in the DNA may or may not impact the protein sequence. Substitutions are designated by a “>” character with the original base before and the substitution after, prefixed by “g.” for genomic sequence (e.g., g.69A>T for A to T substitution at position 69) and “p.” for protein sequence (e.g., p.K23Y for lysine to tyrosine substitution at position 23).

5.3 Pairwise Sequence Alignment

Pairwise sequence alignment arranges two DNA or protein sequences to assess similarity and infer common evolutionary origin^1,3,4. Given a pair of sequences, an alignment algorithm inserts gaps as needed to maximize a numerical score defined by a scoring scheme, allowing different alignments to be compared quantitatively¹. Global alignment attempts to align sequences from end to end, most appropriate when sequences are expected to be collinear and of similar length^1,3. Local alignment searches for the highest-scoring subsequences without requiring full alignment, ideal when proteins share only conserved domains or when short motifs are embedded in otherwise unrelated context^1,4.

Figure 5.1 – Interpreting Sequence Alignments

Pairwise alignment of two PRDM9 protein sequences between human and Lacerata agilis. — Figure 5.1: Sequence Alignment of PRDM9

In the example pairwise sequence above, we see the results of a global sequence alignment between the PRDM9 protein from humans and that of Lacerata agilis which was manually annotated by Chipper Johnson, course LA. Here, vertical lines | indicate identical resides. Substitutions are indicated by one or two dots, indicating the degree of dissimilarity. Gaps can relfect either an insertion in one of the sequences or a deletion in the other one. These are indicated by the - symbol. This output is from the EMBOSS Needle tool that you will be using for your manual annotation projects this semester.

In genome alignment (more on this in the next chapter), it is common to describe a “sample sequence” being compared to a “reference genome sequence,” but in more general alignment problems the two inputs are called the query sequence and the target (or subject) sequence¹.

5.3.1 Scoring Alignments

Alignment scores combine a substitution matrix assigning scores to aligned residue pairs and a gap penalty scheme for insertions and deletions^1,5,6. For proteins, substitution matrices encode evolutionary and biochemical information: conservative substitutions between similar residues receive positive scores, while rare or structurally disruptive substitutions receive negative scores^1,5. Not all identities are scored equally; matching two tryptophans is more informative than matching two alanines because tryptophan is rare and strongly conserved^1,6.

PAM (Point Accepted Mutation) matrices, introduced by Margaret Dayhoff, are derived from curated alignments of closely related protein families and model substitution probabilities as proteins diverge over evolutionary time^1,5. PAM1 reflects frequencies after one accepted mutation per 100 residues, while higher PAM matrices (e.g., PAM250) approximate more divergent sequences^1,5. BLOSUM (Blocks Substitution Matrix) series from Steven and Jorja Henikoff focuses on conserved blocks of distantly related families, with BLOSUM62 widely used as a default in database searches^1,6. Low PAM numbers and high BLOSUM numbers emphasize close relationships, while high PAM numbers and low BLOSUM numbers better detect distant homology^1,5,6.

Margaret Dayhoff and the Foundations of Scoring Matrices

Margaret Oakley Dayhoff (1925–1983) was a physical chemist and computational biologist whose work laid the groundwork for modern sequence alignment scoring^1,5. From carefully curated multiple sequence alignments of closely related proteins, she defined accepted point mutations and quantified amino-acid substitution frequencies, leading to the PAM model and the first empirically grounded mutation probability matrices^1,5. By converting these probabilities into log-odds scores, Dayhoff introduced a rigorous framework that rewards substitutions observed more often than expected by chance and penalizes unlikely changes^1,5. This logic underlies essentially all modern substitution matrices and directly informs algorithms such as Needleman–Wunsch, Smith–Waterman, and BLAST^1,3,4.

Black and white photograph of a woman. — Figure 5.2: Margaret Dayhoff)

Nucleotide substitution matrices are usually simpler, because there are only four bases and fewer distinct biochemical classes than for amino acids¹. In many applications, DNA alignments are scored with uniform match/mismatch schemes (for example, +1 for a match and -1 for a mismatch) or with slightly more refined scoring where transitions (A↔︎G, C↔︎T) receive different penalties from transversions (purine↔︎pyrimidine), reflecting the higher frequency of transitions in real genomic data¹. More elaborate nucleotide scoring systems, such as matrices derived from observed substitution patterns in genomic regions, exist but are less commonly used than protein substitution matrices¹. Because protein sequences carry higher information content and can leverage sophisticated amino-acid substitution matrices, it is often preferable to translate coding DNA sequences and perform alignments at the protein level when the goal is to detect homology across moderate or large evolutionary distances¹.

5.3.2 Gap penalties

Insertions and deletions (indels) in evolutionary history appear as gaps in sequence alignments and must be penalized so that algorithms do not introduce them without strong justification¹. Without gap penalties, an alignment algorithm could trivially achieve perfect matches by inserting arbitrarily many gaps, so a realistic scoring system must balance the benefits of aligning additional residues against the cost of adding gaps¹. Biologically, single long indels are often more plausible than many short indels, and gap penalty schemes are designed to reflect that intuition¹.

In the simplest scheme, a constant (or linear) gap penalty subtracts the same score for each gap position, so a gap of length \((k)\) incurs a cost proportional to \((k)\)¹. In this case, the cost of opening a gap and the cost of extending it are identical, so a gap of length 5 is exactly five times more costly than a gap of length 1¹. More realistic alignments, especially for proteins, generally use an affine gap penalty, which separates the penalty for initiating a gap from the penalty for extending it¹. Under an affine model, the cost of a gap of length \((k)\) is given by \[gap cost = a + b k,\]

where \((a)\) is a gap opening penalty and \((b)\) is a gap extension penalty per residue, with \((|a|)\) typically larger than \((|b|)\)¹. This favors fewer, longer gaps over many short ones, mirroring the expectation that a single deletion of several residues is more likely than multiple independent single-residue events¹.

When you run alignment tools on the web or command line, default gap opening and extension penalties are often optimized for particular scoring matrices and use cases¹. Understanding how these parameters influence alignments helps explain phenomena such as long conserved blocks separated by large single gaps or the tendency of some parameter combinations to fragment alignments into many short local matches¹. In many practical workflows, end gaps (leading and trailing gaps in global alignment) are either not penalized or penalized differently from internal gaps, especially when one sequence is known to be truncated relative to the other¹.

5.3.3 Needleman–Wunsch: Global Alignment

The Needleman–Wunsch algorithm, introduced in 1970, is a dynamic programming method that guarantees an optimal global alignment for two sequences given a specified substitution matrix and gap penalty scheme^1,3. The core idea is to systematically evaluate all possible ways of aligning prefixes of the two sequences by filling a matrix of scores and then to recover the best full-length alignment by tracing back through that matrix^1,3. This approach avoids exhaustive enumeration of all possible alignments, which would grow combinatorially with sequence length, while still ensuring that the final alignment is optimal under the chosen scoring system^1,3. Below is a description of this algorithm, but if you are more of a visual learner, there is a great 15 minute YouTube video on the topic⁷.

Given sequences of lengths (m) and (n), Needleman–Wunsch constructs a matrix of dimensions ((m+1) (n+1)), where the cell (F_{i,j}) represents the best possible global alignment score for the first (i) residues of sequence 1 and the first (j) residues of sequence 2^1,3. The first row and first column are initialized with cumulative gap penalties, allowing either sequence to begin with a terminal gap if that yields a higher overall score¹. The matrix is then filled row by row or column by column using a recurrence relation that considers the three possible ways to reach each cell: aligning the current pair of residues (diagonal move), inserting a gap in one sequence (vertical move), or inserting a gap in the other (horizontal move)^1,3.

In its simplest form with linear gap penalties, the recurrence for \((F_{i,j})\) can be written as

\[F_{i,j} = \max \begin{cases} F_{i-1,j-1} + s(x_i, y_j) \\ F_{i-1,j} + g \\ F_{i,j-1} + g \end{cases} \]

where \((s(x_i, y_j))\) is the substitution score for aligning residues \((x_i)\) and \((y_j)\), and \((g)\) is the (negative) gap penalty^1,3. More elaborate formulations using affine gap penalties introduce additional matrices or states to distinguish gap opening from gap extension, but the underlying dynamic programming principle is the same¹. Once the matrix is filled, the optimal global alignment score is found in the bottom-right cell \((F_{m,n})\), and a traceback procedure follows the path of choices that led to this score, reconstructing the explicit alignment with matches, mismatches, and gaps^1,3.

Figure 5.2 – Pairwise Sequence Alignment

Figure 5.3: Needleman-Wunsch pairwise alignment matrix for two DNA sequences, showing the scoring grid, traceback path, and the placement of gaps to produce a global alignment.

Pairwise alignment of two nucleotide sequences using the Needleman–Wunsch dynamic programming algorithm for global alignment. The figure shows the scoring matrix used to compare two sequences, with diagonal moves representing matches or mismatches and horizontal or vertical moves representing gaps. The traceback path identifies the optimal global alignment under the chosen scoring scheme. The example illustrates the alignment matrix, scoring scheme, and the role of gaps in producing the final global alignment. Source: Needleman-Wunsch_pairwise_sequence_alignment.png

5.3.4 Smith–Waterman: Local Alignment

The Smith–Waterman algorithm, introduced in 1981, adapts the dynamic programming framework to local rather than global alignment, focusing on the highest-scoring subsequences of the two inputs^1,4. Instead of forcing an alignment from the beginning to the end of both sequences, Smith–Waterman identifies the best local region of similarity, which is especially useful for detecting conserved domains, motifs, or exons embedded in longer, less related sequence contexts^1,4. This local strategy is also central to database search tools such as BLAST and FASTA, which conceptually perform many local alignments between a query and entries in a database¹. As with global alignment, below is a description of this algorithm, but if you are more of a visual learner, there is a short 2 minute YouTube video on the topic⁸.

Figure 5.3 – Local versus global alignment

Two-panel figure illustrating the Smith–Waterman algorithm for local sequence alignment. One panel explains the scoring rule that sets negative values to zero and thereby limits alignment to a high-scoring local region; the other panel shows an example matrix and traceback path for recovering the best local alignment. Smith–Waterman example matrix showing the highest-scoring local alignment region within two sequences.

Unlike global alignment algorithms such as Needleman–Wunsch, Smith–Waterman identifies the best-matching subsequences within longer sequences. It uses a dynamic programming algorithm, meaning it finds the highest-scoring matching region between two sequences rather than aligning them end-to-end. Negative scores are reset to zero, and traceback begins at the highest-scoring cell and continues until a zero is reached, which marks the boundary of the local alignment. This makes it especially useful when only a shared region, domain, or motif is expected between the sequences. Together, these panels show both the scoring logic and an example of the matrix used to recover the best local match. Sources: SmithWaterman.svg and Smith-Waterman.jpg.

Smith–Waterman uses a similar ((m+1) (n+1)) dynamic programming matrix as Needleman–Wunsch, but modifies the recurrence relation in two key ways^1,4. First, the first row and first column are initialized to zero rather than cumulative gap penalties, reflecting the fact that local alignments may begin anywhere in the sequences without penalty^1,4. Second, at each cell the algorithm introduces a fourth option in addition to the three used for global alignment: the score zero. The recurrence becomes

\[F_{i,j} = \max \begin{cases} 0 \\ F_{i-1,j-1} + s(x_i, y_j) \\ F_{i-1,j} + g \\ F_{i,j-1} + g \end{cases}\]

ensuring that scores never become negative^1,4. Any path that would yield a negative score is instead truncated at zero, effectively “resetting” the alignment and allowing new local alignments to begin internally^1,4.

Once the matrix is filled, the starting point for traceback in Smith–Waterman is not the bottom-right cell but the cell with the highest score anywhere in the matrix, which represents the end of the best local alignment^1,4. Traceback then proceeds diagonally, vertically, or horizontally following the choices that produced that score, terminating when a cell with score zero is reached^1,4. The resulting alignment spans only the high-scoring segment pair (or pairs) and ignores lower-scoring flanking regions, often producing a shorter alignment with higher percent identity and similarity than the corresponding global alignment of the same sequences^1,4.

In protein database searching, local alignments are essential because many protein families share only one or a few conserved domains while differing elsewhere, and because databases contain partial sequences, fragments, and alternative isoforms that would not align well under a strict global scheme¹. Heuristic methods such as FASTA and BLAST approximate Smith–Waterman’s local alignment behavior while greatly reducing runtime by focusing computation on promising regions of the search space, enabling rapid comparison of a query sequence to millions of database entries^1,4.

5.3.5 Basic Local Alignment Search Tool (BLAST)

Introduced on the first day of class, BLAST is the most used bioinformatics tool. No doubt ALL of you have used this tool at some point prior to taking this class at least once if not multiple times. The basic local alignment search tool was first published in 1990 with nearly 80,000 citations to date (last updated 1/19/2026).

Getting citation metrics from Web of Science

One of the best ways to assess the importance of a bioinformatics tool is to check the citation metrics. The best to to use here is not google but instead Web of Science.

Here, you can search a paper (e.g. “Altschul 1990 basic local alignment search tool”). Once you have the paper, you can check the box next to the result that you want to analyze. Then, in the top right, click “Citation Report”. In addition to the total number of citations (e.g. 79,208), you can also analyze the citations in depth by clicking the word “Analyze” next to the number of citing articles. This new page offers a drop down menu with several ways to analyze the citations, either by time (e.g. Publication Years), journal (e.g. Publication Titles), by Web of Science Categories, Funding Agencies, etc.

I find the analysis over time to be particularly useful. For example, you might find a particular tool has a LOT of citations, but that the number have dwindled in recent years. A deeper dive might find that a newer tool has largely replaced this tool. This information together would suggest that you should try to use the newer tool.

In the bioinformatics approach for your group project, I expect you to use these types of metrics to justify your selected bioinformatics tools, including ones you might have considered and decided against.

BLAST, first published in 1990, takes a query sequence and searches a database for entries sharing significant local similarity, outputting a ranked list of hits annotated with scores and statistics such as bit score and expect value (E value)^1,9.

Connection to Lab 5

In Lab 5, we will use BLAST on the command line using the Alabama Super Computer¹⁰. The command-line version (BLAST+) allows custom database searches, bulk queries, and complex search strategies essential for large-scale genomic data¹⁰.

BLAST builds on Smith–Waterman’s local alignment framework but uses heuristic programming rather than exhaustive dynamic programming, trading small losses in theoretical optimality for massive gains in speed^1,9. The algorithm proceeds in three main phases^1,9,10:

Phase 1 (List): BLAST analyzes the query and compiles short words of fixed length \((w)\) (typically 3 for proteins)^1,9. Each possible word pair is scored using the substitution matrix, and only words scoring at or above threshold \((T)\) are retained^1,9. This filters for exact or conservative substitutions characteristic of conserved motifs¹.

Phase 2 (Scan and Extend): BLAST scans the database for retained words, using a “two-hit” strategy requiring two non-overlapping hits within window \((A)\) to initiate extension¹. Extensions create high-scoring segment pairs (HSPs), first ungapped then optionally gapped¹. HSPs exceeding reporting cutoff \((S)\) are retained¹.

Phase 3 (Traceback): BLAST reconstructs explicit local alignments for retained HSPs, assigning matches, mismatches, and gaps according to scoring parameters¹.

Word size \((w)\) and threshold \((T)\) control sensitivity–speed trade-offs^1,9. For proteins, default \((w=3)\) balances sensitivity and computational cost; lowering \((T)\) increases hits and extensions but requires more time¹. For nucleotides, BLASTN requires exact word matches with larger default word size (e.g., 11), with smaller words increasing sensitivity and larger words speeding searches¹.

BLAST employs rigorous statistical evaluation based on extreme value distribution (EVD) theory^1,9. The E value represents the expected number of distinct alignments with scores at least as high as \((S)\) occurring by chance in a database search¹:

\[E = K m n e^{-\lambda S}\]

where \((m)\) is query length, \((n)\) is effective database length, and \((K)\) and \(\lambda\) are Karlin–Altschul statistics¹. Smaller E values indicate more significant matches, and E values depend on search space size¹. The related p value is \((p = 1 - e^{-E})\); for very small E values, \((E)\) and \((p)\) are nearly identical¹.

In practice, BLAST users adopt thresholds such as \((E \le 0.05)\) as minimal significance standards, with more stringent criteria (e.g., \((10^{-3})\), \((10^{-5})\)) used in large-scale genome analyses¹. Biological interpretation always requires combining E values with domain knowledge: similar sequence lengths, conserved motifs, compatible domain structures, and congruent genomic context strengthen homology inferences¹.

Table 5.1: Relationship of E to p values in BLAST. Small E values (0.05 or less) correspond closely to p values.

E value	p value
10	0.99995460
5	0.99326205
2	0.86466472
1	0.63212056
0.1	0.09516258
0.05	0.04877058
0.001	0.00099950
0.0001	0.00010000

5.3.5.1 Interpreting BLAST in Biological Context

Although BLAST’s statistical framework is powerful, biological interpretation of BLAST results always requires combining E values and scores with domain knowledge about the sequences involved¹. A “true positive” hit—one that is homologous to the query and descended from a common ancestor—should have a convincingly low E value, but additional criteria often matter: similar sequence lengths, the presence of conserved motifs or signatures, compatible domain structures, and congruent genomic context all strengthen the inference of homology¹. Conversely, BLAST results can include false positives with modest sequence identity and acceptable E values that actually belong to unrelated protein families, particularly when alignments are short or concentrated in low-complexity segments¹.

For these reasons, it is common to follow up marginal BLAST matches with reciprocal searches (using the putative hit as a new query), multiple sequence alignment and phylogenetic analysis, structural comparisons, or domain-based methods such as profile HMMs and position-specific scoring matrices¹. In practice, BLAST serves as the first, essential step: it rapidly maps a query into the landscape of known sequences, providing a statistically grounded set of candidate relationships that subsequent analyses can refine^1,9.

5.4 From pairwise to multiple sequence alignment

Multiple sequence alignment (MSA) extends pairwise alignment to three or more sequences, allowing homologous residues to be aligned in columns across a protein family or related DNA sequences¹. A carefully constructed MSA encodes shared evolutionary history, conserved structure, and functional motifs, enabling prediction of uncharacterized protein functions, identification of catalytic residues, and phylogenetic tree reconstruction^1,2,11. Compared to pairwise alignment, MSA is more sensitive for detecting distant homologs because information from many sequences is integrated rather than relying on a single query–target comparison^1,2.

Conceptually, an MSA arranges homologous sequences so residues in the same column are presumed homologous in both evolutionary and structural senses^1,11. Residues aligned in a given column descended from a shared ancestral residue and, in proteins, occupy comparable positions in the three-dimensional fold^1,11. In practice, achieving a uniquely “correct” alignment for divergent families is rarely possible because sequence divergence often outpaces structural or functional data availability; algorithms aim to approximate optimal alignments under explicit criteria such as maximizing total alignment score or sum-of-pairs objective functions¹.

5.4.1 Criteria and Uses of MSAs

MSAs are built using different implicit criteria reflecting how residues ought to align¹. Structural similarity: residues occupying comparable three-dimensional positions (e.g., within the same helix or beta strand) should align^1,11. Evolutionary similarity: aligned residues should be phylogenetically homologous, tracing back to a single ancestral residue¹. Functional similarity: residues with the same biochemical role (e.g., catalytic residues, ligand-coordinating residues) should align even when surrounding regions have diverged¹. Sequence similarity: maximizing identical or conservatively substituted residues in each column—the criterion directly optimized by most algorithms because it is easily formalized and computed¹.

MSA algorithms largely implement the sequence-similarity criterion, but success is judged by how well alignments respect structural and functional constraints when that information is available¹. For example, in the globin family, MSAs reveal highly conserved histidine residues coordinating the heme group, even though overall pairwise identities between some globins fall near 20–30% identity^1,11.

MSAs inform protein structure and function prediction¹. Profiles and hidden Markov models (e.g., DELTA-BLAST, HMMER) are derived from MSAs and exploit conservation patterns to detect remote homologs^1,2. Conservation patterns inform variant interpretation: amino-acid changes at highly conserved positions across species are more likely deleterious, and many variant effect predictors incorporate cross-species MSAs to quantify evolutionary constraint¹. MSAs form the raw data for phylogenetic inference methods; tree-building algorithms assume each column corresponds to characters evolved from a common ancestral state, so alignment inaccuracies propagate directly into tree errors¹.

Figure 5.4 – Multiple Sequence Alignment of ZNF226 Orthologs

Figure 5.4: Multiple sequence alignment of ZNF226 orthologs showing conserved amino acid positions across several mammalian species, with variable regions highlighted by differences in residue color and shading. The alignment displays sequence conservation and divergence across homologous protein regions.

Multiple sequence alignment of human ZNF226 and orthologous sequences from rhesus macaque, Pacific walrus, southern elephant seal, and wild bactrian camel. The alignment was generated with EMBOSS-EMBL, Clustal Omega, and SAPS, and it illustrates how multiple sequence alignment can be used to compare conserved and variable regions across orthologs. Exon boundaries are highlighted in blue, non-conserved regions in orange, KRAB domains in green, and repeat regions in red, with cysteine and histidine residues in the repeat motifs marked as important structural features. This type of pattern is especially informative for zinc finger genes, where conserved domains support protein function while variation in repeat arrays may alter DNA-binding specificity. It also provides a useful connection to PRDM9, whose rapidly evolving zinc finger domain plays a key role in DNA recognition and recombination hotspot targeting in vertebrates. Source: Wikimedia Commons.

5.4.2 Conserved Residues, Motifs, and Databases of MSAs

Highly conserved residues often mark structurally or functionally critical positions such as catalytic residues, metal-binding residues, or disulfide-forming cysteines¹. In the insulin family, an MSA across mammals shows strongly conserved A and B chains, particularly cysteines forming disulfide bonds, whereas the intervening C peptide is far more variable with numerous indels¹. This pattern reflects different functional constraints: the mature hormone requires precise disulfide-bond geometry, whereas the C peptide primarily functions as a processing intermediate and tolerates more substitutions¹.

Databases such as Pfam and SMART curate large collections of MSAs representing protein families and domains, each associated with profile hidden Markov models for searching domain instances in new sequences^1,12,13. In Pfam, each family has a manually curated seed alignment capturing representative sequences and a larger full alignment extending coverage to additional homologs^1,12. Integrated resources such as InterPro and NCBI Conserved Domain Database (CDD) combine multiple alignment-based resources, allowing single query proteins to be annotated against Pfam, PROSITE, SMART, TIGRFAMs, and other domain collections^1,14,15.

Connection to Chapter 4 and Lab 3

In Chapter 4, you explored genome browsers including UCSC and Ensembl, which display whole-genome MSAs through conservation tracks such as PhastCons, phyloP, and GERP^16–18. These tracks derive from multi-species alignments (e.g., Multiz) spanning entire chromosomes and highlight conserved exons and regulatory elements^19,20. In Lab 3, you will explore these conservation tracks directly to identify functional elements within specific genomic loci, connecting protein-level MSAs covered in this chapter to genome-scale comparative genomics.

5.4.3 Algorithms for Multiple Sequence Alignment

Constructing accurate MSAs is algorithmically challenging because the search space grows combinatorially with sequence number, and exact optimization by dynamic programming becomes infeasible beyond a small number of sequences¹. Practical MSA methods use heuristic strategies trading exact optimality for computational tractability while maintaining high accuracy¹. Five main approaches exist: exact methods, progressive alignment, iterative refinement, consistency-based approaches, and structure-based methods (Table 5.2)¹.

Exact methods generalize Needleman–Wunsch to multiple dimensions, maximizing sum-of-pairs or related objective functions¹. Time and space requirements scale roughly as (O(2^N L^N)) for (N) sequences of average length (L), making them impractical for more than four or five sequences¹. Exact approaches play mainly conceptual or benchmarking roles¹.

Progressive alignment builds MSAs stepwise by computing all pairwise alignments, constructing a guide tree based on pairwise distances, and aligning sequences in tree-specified order^1,21,22. CLUSTALW exemplifies this approach: it uses global pairwise alignments (via Needleman–Wunsch) to compute percent identities, converts similarities to distances, uses clustering (UPGMA or neighbor-joining) to build a guide tree, and aligns the most closely related pair first, progressively adding more sequences until all are included^1,23,24. A key heuristic is “once a gap, always a gap,” giving highest weight to initial closest-sequence alignments when placing gaps^1,23. CLUSTALW down-weights very similar sequence clusters so they do not dominate scoring and uses position-specific gap penalties to encourage block-like gap patterns^1,23. While historically the most popular web-based progressive alignment program, experts now recommend newer programs such as MAFFT, ProbCons, and MUSCLE for improved performance¹. CLUSTALW remains valuable pedagogically for illustrating progressive alignment and is implemented directly in AliView, introduced in Chapter 4^1,21–24.

Iterative methods refine an initial MSA by repeatedly modifying and re-aligning sequence subsets or profiles to improve an objective score^1,25,26. MAFFT and MUSCLE illustrate this strategy^1,25,26. MAFFT begins with fast progressive alignment using k-mer counting (e.g., 6-tuples) to estimate pairwise distances without full alignments, then refines by recalculating distances from the current MSA and performing additional progressive passes and iterative profile–profile realignments to maximize sum-of-pairs scores^1,26,27. MAFFT’s command-line interface allows integration into reproducible pipelines and high-performance computing environments, complementing web-based tools for exploratory work^1,26,27. MUSCLE follows a three-stage protocol: building a draft progressive alignment, constructing an improved guide tree using corrected distance estimates (e.g., Kimura distances), and performing iterative refinement by partitioning the tree into subprofiles, realigning them, and accepting changes increasing sum-of-pairs scores^1,25,28. In both cases, iterative refinement can correct early errors from the first progressive pass, overcoming the main limitation of strictly progressive methods^1,25,26.

Consistency-based approaches introduce additional information by using the emerging MSA to guide pairwise alignments respecting transitive relationships among residues^1,29,30. If residue \((x_i)\) in sequence \((x)\) aligns with residue \((z_k)\) in sequence \((z)\), and \((z_k)\) aligns with \((y_j)\) in sequence \((y)\), this supports aligning \((x_i)\) with \((y_j)\)^1,29,30. ProbCons implements this via a pair-HMM framework: computing posterior probability matrices for each sequence pair, transforming these using consistency across all sequences, building a guide tree based on expected alignment accuracy, and performing progressive alignment with optional iterative refinement^1,30. T-COFFEE builds an extended library of alignments combining global and local pairwise alignments, reweights residue pairs using consistency, and applies progressive alignment with custom substitution matrices derived from the library^1,29. Benchmarks show consistency-based methods often yield more accurate alignments than purely progressive methods, especially for moderately or highly divergent protein families^1,30.

Structure-based methods incorporate three-dimensional structural information, recognizing that tertiary structures are better conserved than primary sequences^1,31. The Expresso mode in T-COFFEE begins by identifying structural templates for input sequences via BLAST searches against the Protein Data Bank, then uses pairwise structural alignments to guide multiple alignment, improving alignment of secondary structure elements and functionally important residues^1,31. PRALINE can integrate profiles from PSI-BLAST searches and predicted secondary structure, using this additional information to refine MSAs beyond what sequence alone dictates¹. Structural information is also crucial in benchmarking: curated reference alignments based on superposed protein structures serve as gold standards for evaluating algorithmic accuracy using metrics such as sum-of-pairs and column scores^1,32,33.

Table 5.2: Algorithm families with representative examples you will encounter in lab exercises.

Method type	Core idea	Examples
Exact	Multi-dimensional dynamic programming; guaranteed optimal but feasible only for very few short sequences.	Used mainly in benchmarking studies.
Progressive	Compute all pairwise global alignments (often with Needleman–Wunsch), build a guide tree from pairwise distances, then align the most similar sequences first and add others stepwise; “once a gap, always a gap.”	CLUSTALW (in AliView).
Iterative	Start with an initial MSA, then repeatedly modify and re-score it, keeping changes that improve an objective until convergence.	MAFFT, MUSCLE.
Consistency-based	Combine information from many pairwise local alignments or profiles so that if A aligns to B and B to C, the algorithm encourages A to align consistently with C.	T-COFFEE, ProbCons.
Structure-based	Incorporate 3D structural information (when available) alongside sequence when aligning, giving special weight to structurally conserved cores.	Expresso (T-COFFEE module).

NCBI COBALT Tool

NCBI’s COBALT (Constraint-Based Alignment Tool) provides a web-based MSA tool integrating pairwise sequence similarity with additional constraints such as conserved motifs and domain information¹. COBALT uses RPS-BLAST searches against the Conserved Domain Database to identify shared domains and motifs, then incorporates this domain-level information into the alignment process so residues within the same domain are preferentially aligned^1,15. For classroom use, COBALT offers an accessible way to perform MSA in a browser, visualize conserved regions, and connect sequence-level conservation to domain annotations without requiring local software installation¹.

5.4.4 Benchmarking and Databases of MSAs

Because different MSA algorithms produce noticeably different alignments, especially for divergent protein families, benchmarking is essential for assessing accuracy and guiding tool choice^1,32. Benchmarking datasets such as BAliBASE, PREFAB, and SABmark consist of curated reference alignments, many derived from three-dimensional structural superpositions rather than sequence alone^1,32. Alignment accuracy is typically quantified using sum-of-pairs scores (measuring how many residue pairs aligned in the reference are also aligned in the test) and column scores (measuring the fraction of correctly recovered alignment columns)^1,32,33. These metrics have limitations—column scores are very stringent, and sum-of-pairs scores do not constitute true distance metrics—but provide standardized ways to compare methods across test cases^1,33.

Benchmarking studies consistently show no single algorithm is best in all situations: progressive methods such as CLUSTALW perform well on closely related sequences and are easy to interpret, whereas iterative and consistency-based methods such as MAFFT, MUSCLE, ProbCons, and T-COFFEE generally achieve higher accuracy on more divergent families at greater computational cost^{1,25,26,29,30}. Structure-based methods like Expresso provide further gains when reliable templates are available, particularly for alignments used in homology modeling and structure prediction^1,31. For practical work, a good strategy is often to generate alignments with at least two different high-performing tools and compare them in regions of biological interest, possibly making manual adjustments where structural or functional information suggests better gap placement¹.

Specialized MSA databases such as Pfam, SMART, and InterPro, together with tools like CDD and DELTA-BLAST, provide a curated layer on top of raw alignment algorithms by maintaining expert-reviewed MSAs and associated models for thousands of protein families^1,12–15. These resources are evaluated and refined over time using benchmarking and structural data, so the MSAs they expose can be trusted as high-quality starting points for functional annotation and comparative analysis^1,12.

5.4.5 Whole-Genome MSAs and Genome Browsers

MSA concepts extend beyond individual proteins to entire genomic regions, where the goal is to align orthologous loci across multiple species to identify conserved exons, regulatory elements, and regions under selection^1,16,19,20. Tools such as Multiz, PhastCons, phyloP, and GERP are central to this domain, and their results are visualized in genome browsers such as UCSC and Ensembl, which you explored in Chapter 4^1,16–19.

In the UCSC Genome Browser, the Vertebrate Multiz Alignment and Conservation track for the human beta-globin (HBB) locus shows alignments of dozens of vertebrate genomes across exons and introns, with PhastCons and phyloP scores highlighting regions of high conservation or accelerated evolution^1,16,17. Exons appear as highly conserved blocks with strong conservation scores, whereas surrounding noncoding regions show a mixture of poorly conserved segments and discrete peaks corresponding to putative regulatory elements¹. GERP scores provide complementary information by estimating rejected substitutions at each alignment column, identifying constrained elements evolving more slowly than expected under neutrality^1,18. Through the UCSC Table Browser and tools such as Galaxy, these whole-genome MSAs can be exported in MAF format, converted to FASTA, and further analyzed or re-aligned with conventional MSA tools^1,34.

The Ensembl browser implements a related but distinct pipeline based on Enredo, Pecan, and Ortheus (EPO) to generate phylogeny-aware MSAs and reconstruct ancestral sequences at internal nodes of the species tree^1,20,35. Enredo partitions genomes into colinear segments accounting for rearrangements and duplications, Pecan uses a consistency-based algorithm to align these segments across species, and Ortheus infers ancestral sequences and insertion–deletion histories using a phylogenetic model^1,20. For the HBB locus, EPO alignments in Ensembl provide not only multi-species alignment of extant sequences but also reconstructed ancestral sequences clarifying whether particular gaps represent insertions or deletions and what ancestral nucleotides likely occupied a given position^1,20. These whole-genome MSAs complement protein-level MSAs by revealing conservation patterns in exons, splice sites, untranslated regions, and distant regulatory elements, connecting naturally back to earlier chapters on genome browsers^1,20.

5.5 Problem Sets

Global Protein Alignment: Implement the Needleman-Wunsch algorithm on this toy example. Below is the PAM matrix already set up based on a gap penalty of -6. First, select the optimal alignment between the pair of sequences. On a blank sheet of paper, work through the matrix to align the two sequences, remembering that vertical and horizontal moves insert gaps and reduce your overall score as in Figure 5.3. Then use the scoring matrix to compute the overall score of your resulting optimal alignment. When done, write out the resulting pairwise sequence alignment of the two sequences as in Figure 5.1.

Scoring Matrix

- S P E A R E

S 2 1 0 1 0 0

H -1 0 1 -1 2 1

A 1 1 0 2 -2 0

K 0 -1 0 -1 3 0

E 0 -1 4 0 -1 4
Multiple Sequence Alignment with COBALT: Download the FASTA file N.fa. Go to the COBALT tool at NCBI, upload the FASTA file, and click “Align”.
- How similar are these sequences? Would you say they are highly similar (likely from the same species) or dissimilar (likely from different species as in Figure 5.4)? These sequences are actually from the nucleocapsid phosphoprotein of the SARS-CoV-2 genome.
- Configure the view by changing the color scheme to “BLOSUM62”. How does this help? Configure the view again by changing to “Frequency-based difference”. How does this change your interpretation? What do some of the other settings do?
- How long is this protein?
- Click the button at the top to generate a phylogenetic tree. Examine the tree topology and branch lengths.
Comparative MSA analysis: Repeat the steps above with this FASTA file: S.fa. This second set of sequences is from the SARS-CoV-2 surface glycoprotein, also called the “spike” protein. Notice how the similarity of these sequences differs from the first set.
- Again configure the view settings and explore the length.
- How does the length of this protein compare to the first?
- How do the two trees compare to each other? Look specifically at the scale bar at the bottom. Which sequence represents MORE genetic change?
- Look at the times these sequences were collected in the sequence names. Do the sequences cluster by year collected?
- Find the node associated with the cluster of sequences from 2019/2020 and reroot the tree. Do these clusters in time make sense?
Algorithm comparison: Download the FASTA file hbb.fa and go to the multiple alignment page on the EBI platform. Select an MSA tool (e.g., CLUSTALW, MAFFT, MUSCLE) and upload the sequences to submit the job.
- Like with COBALT, explore different color schemes for viewing the MSA.
- Look at the resulting phylogenetic tree. Work with others in the class to compare your results across different alignment tools.
- Compare the results to conservation patterns you would expect for hemoglobin based on its function. Do highly conserved regions correspond to functionally important domains?
- Try the same alignment with a second tool. Where do the two alignments differ? Which differences might matter for downstream phylogenetic or structural analyses?
Margaret Dayhoff tribute: Margaret Dayhoff is credited with developing the single-letter abbreviation system for amino acids, similar to ASCII codes used for quality scores of genome sequences in FASTQ files covered in Chapter 3.
- For fun, convert your name to a “protein.” If any letters in your name do not match amino acids, either remove them or choose your middle or last name (or a nickname—this is really just for fun!). For example, I translated my name to “LAVRIE,” replacing the U for the letter V.
- Take your “name protein” and submit it to BLAST. Select the tool Protein BLAST or “blastp”. For the Search set, leave it as the “ClusteredNR” which is the non-redundant set of protein sequences in NCBI.
- Examine your top hits. Do they make biological sense, or are they spurious matches driven by low complexity? What does this exercise reveal about the importance of query length and composition in database searches?

Scoring Matrix
-	S	P	E	A	R	E
S	2	1	0	1	0	0
H	-1	0	1	-1	2	1
A	1	1	0	2	-2	0
K	0	-1	0	-1	3	0
E	0	-1	4	0	-1	4

5.6 Reflection Questions

Substitution matrices and biological interpretation: How does understanding where substitution matrices and gap penalties come from change the way you interpret BLAST or MSA output? Think about a gene or protein you care about—would you learn more from global or local alignments when comparing it across species, and why?
Conservation patterns and functional inference: In the MSA problem sets above, how does your knowledge of the N and S proteins from SARS-CoV-2 match the conservation patterns you observed in their alignments? Most PCR-based tests for SARS-CoV-2 detection target the N protein, with few targeting the S protein. Why do you think the N protein makes a better universal target for detection? In cases where the spike protein is used for detection, what information do you think it provides beyond positive or negative for infection?
Algorithm persistence and modern genomics: Many genome-alignment tools were developed decades after Needleman–Wunsch and Smith–Waterman but still rely on their core ideas. What do you think enabled these early algorithms to remain so influential in modern genomics, even as sequencing technologies and computational power have advanced dramatically?
MSA tool choice and uncertainty: When you see conflicting alignments from different tools (e.g., CLUSTALW vs. MAFFT vs. COBALT), how will you decide which to trust, and how much does that uncertainty matter for the biological inferences you intend to draw? What additional information (structural data, domain annotations, phylogenetic context) might help resolve ambiguities?
Cross-chapter connections: How do the conservation tracks you explored in UCSC and Ensembl (Chapter 4) relate to the MSA algorithms and benchmarking datasets discussed in this chapter? What are the advantages and limitations of using whole-genome MSAs displayed in browsers compared to protein-level MSAs you generate yourself with tools like MAFFT or COBALT?

5.7 References

Pevsner, J. Bioinformatics and Functional Genomics. (Wiley, 2015).

Park, J., Teichmann, S. A., Hubbard, T. & Chothia, C. Intermediate sequences increase the detection of homology between sequences. Journal of Molecular Biology 278, 847–867 (1998).

Needleman, S. B. & Wunsch, C. D. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology 48, 443–453 (1970).

Smith, T. F. & Waterman, M. S. Identification of common molecular subsequences. Journal of Molecular Biology 147, 195–197 (1981).

Dayhoff, M. O., Schwartz, R. M. & Orcutt, B. C. A model of evolutionary change in proteins. Atlas of Protein Sequence and Structure 5, 345–352 (1978).

Henikoff, S. & Henikoff, J. G. Amino acid substitution matrices from protein blocks. Proceedings of the National Academy of Sciences of the USA 89, 10915–10919 (1992).

Deborah Thurtle-Schmidt. Global sequence alignment. (2018).

Tel-Aviv Tutorials. HOW TO Sequence Alignment (Local) - SMITH WATERMAN. (2019).

Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. Basic local alignment search tool. Journal of Molecular Biology 215, 403–410 (1990).

10.

Camacho, C. et al. BLAST+: Architecture and applications. BMC Bioinformatics 10, 421 (2009).

11.

Chothia, C. & Lesk, A. M. The relation between the divergence of sequence and structure in proteins. EMBO Journal 5, 823–826 (1986).

12.

Finn, R. D., Bateman, A., Clements, J., et al. Pfam: The protein families database. Nucleic Acids Research 42, D222–D230 (2014).

13.

Letunic, I., Doerks, T. & Bork, P. SMART 7: Recent updates to the protein domain annotation resource. Nucleic Acids Research 40, D302–D305 (2012).

14.

Hunter, S., Jones, P., Mitchell, A., et al. InterPro in 2011: New developments in the family and domain prediction database. Nucleic Acids Research 40, D306–D312 (2012).

15.

Marchler-Bauer, A., Derbyshire, M. K., Gonzales, N. R., et al. CDD: NCBI’s conserved domain database. Nucleic Acids Research 43, D222–D226 (2015).

16.

Siepel, A., Bejerano, G., Pedersen, J. S., et al. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Research 15, 1034–1050 (2005).

17.

Pollard, K. S., Hubisz, M. J., Rosenbloom, K. R. & Siepel, A. Detection of nonneutral substitution rates on mammalian phylogenies. Genome Research 20, 110–121 (2010).

18.

Davydov, E. V., Goode, D. L., Sirota, M., et al. Identifying a high fraction of the human genome to be under selective constraint using GERP++. PLoS Computational Biology 6, e1001025 (2010).

19.

Blanchette, M., Kent, W. J., Riemer, C., et al. Aligning multiple genomic sequences with the threaded blockset aligner. Genome Research 14, 708–715 (2004).

20.

Paten, B., Herrero, J., Beal, M. K., Fitzgerald, S. & Birney, E. Enredo and pecan: Genome-wide mammalian consistency-based multiple alignment with paralogs. Genome Research 18, 1814–1828 (2008).

21.

Feng, D.-F. & Doolittle, R. F. Progressive sequence alignment as a prerequisite to correct phylogenetic trees. Journal of Molecular Evolution 25, 351–360 (1987).

22.

Feng, D.-F. & Doolittle, R. F. Progressive sequence alignment and phylogenetic tree construction with the CLUSTAL program. Methods in Enzymology 183, 375–387 (1990).

23.

Thompson, J. D., Higgins, D. G. & Gibson, T. J. CLUSTAL W: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Research 22, 4673–4680 (1994).

24.

Larkin, M. A., Blackshields, G., et al. Clustal W and Clustal X version 2.0. Bioinformatics 23, 2947–2948 (2007).

25.

Edgar, R. C. MUSCLE: Multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Research 32, 1792–1797 (2004).

26.

Katoh, K., Kuma, K.-I., Toh, H. & Miyata, T. MAFFT version 5: Improvement in accuracy of multiple sequence alignment. Nucleic Acids Research 33, 511–518 (2005).

27.

Katoh, K. & Toh, H. Recent developments in the MAFFT multiple sequence alignment program. Briefings in Bioinformatics 9, 286–298 (2008).

28.

Edgar, R. C. MUSCLE: A multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics 5, 113 (2004).

29.

Notredame, C., Higgins, D. G. & Heringa, J. T-Coffee: A novel method for fast and accurate multiple sequence alignment. Journal of Molecular Biology 302, 205–217 (2000).

30.

Do, C. B., Mahabhashyam, M. S. P., Brudno, M. & Batzoglou, S. ProbCons: Probabilistic consistency-based multiple sequence alignment. Genome Research 15, 330–340 (2005).

31.

Armougom, F. et al. Expresso: Automatic incorporation of structural information in multiple sequence alignments using 3D-Coffee. Nucleic Acids Research 34, W604–W608 (2006).

32.

Thompson, J. D., Plewniak, F. & Poch, O. A comprehensive comparison of multiple sequence alignment programs. Nucleic Acids Research 27, 2682–2690 (1999).

33.

Lassmann, T. & Sonnhammer, E. L. L. Quality assessment of multiple alignment programs. FEBS Letters 579, 681–685 (2005).

34.

Blankenberg, D., Taylor, J., Nekrutenko, A. & Galaxy Team, the. Making whole genome multiple alignments usable for biologists. Bioinformatics 27, 2426–2428 (2011).

35.

Flicek, P., Amode, M. R., Barrell, D., et al. Ensembl 2014. Nucleic Acids Research 42, D749–D755 (2014).