19 Lab05: BLAST on the Command Line

For this lab, you will learn to use BLAST (Basic Local Alignment Search Tool) on the command line to identify sequence homology and annotate unknown sequences. While web-based BLAST is convenient for single queries, command-line BLAST is essential for analyzing large datasets, automating repetitive tasks, and integrating BLAST into bioinformatics pipelines. This lab will walk you through creating BLAST databases, running searches, and parsing results—skills that are fundamental to genome annotation, comparative genomics, and evolutionary studies.

19.1 Background: Why Command Line BLAST?

BLAST is the most widely used bioinformatics tool, with the original publication cited nearly 80,000 times¹. While the web interface at NCBI is user-friendly, command-line BLAST² offers several advantages:

Batch processing: Query thousands of sequences automatically
Custom databases: Search against your own sequences, not just NCBI’s databases
Reproducibility: Scripts can be shared and rerun exactly
Integration: Combine BLAST with other tools in analysis pipelines
Control: Fine-tune parameters and output formats for downstream analysis

In this lab, you’ll work with a realistic scenario: you’ve assembled transcripts from a newly sequenced plant genome and need to annotate them using the well-characterized Arabidopsis thaliana genome. Your goal is to determine which transcripts correspond to mitochondrial, chloroplast, or nuclear genes.

19.2 Setup

Login to your ASC account and make a folder for Lab 5 called Lab5.
Copy all files from the shared folder ~/bioinf_shared/Lab5_BLAST_CL into your Lab5 directory. Important: Copy these files, do not move them!
List the files in your directory. You should see:
- test.fasta (your query sequences - the unknown transcripts)
- ATcp.fasta (Arabidopsis chloroplast genome)
- ATmt.fasta (Arabidopsis mitochondrial genome)
- ATchrV.fasta (Arabidopsis chromosome V - a sample nuclear chromosome)
- Example_script.sh (template for your commands)
Open Example_script.sh in your preferred text editor. This script will be your working file throughout the lab. You’ll add commands to it as you progress, building a complete, reusable BLAST workflow by the end.

Replace your username at the top:

DIR="/home/aubclsfXXXX/Lab5" # Replace with your username

19.3 Step 1: Examine Your Query Sequences

Before running BLAST, you should understand your input data.

Use less to view the test.fasta file. Recall that FASTA format consists of a header line starting with > followed by the sequence on subsequent lines.
Count how many sequences are in test.fasta.

Hint: The > symbol marks each new sequence header. Use grep with the -c flag to count matching lines. Check man grep if needed.
Record this number—you’ll need it later to determine how many sequences had no BLAST hits.

19.4 Step 2: Create BLAST Databases

When you use BLAST on the NCBI website, you’re searching against pre-built databases. To use BLAST on the command line with custom sequences, you must first build your own databases.

Load the BLAST module to access the BLAST+ tools:
```
module load blast+/2.15.0
```
Examine the database creation tool:
```
makeblastdb -help
```
Look through the documentation to identify the required parameters. Key options include:
- -in: Input FASTA file
- -dbtype: Type of sequences (nucl for nucleotides, prot for proteins)
- -out: Output database name (optional; defaults to input filename)
Documentation on this module can be found on the ASC Documentation Site. Additional documentation on this tool can be found at NCBI.
Determine the sequence type in your FASTA files. Open one of the AT*.fasta files with less. Are these nucleotide or protein sequences?
Create a BLAST database for each of the three Arabidopsis sequence files (ATcp.fasta, ATmt.fasta, and ATchrV.fasta). Do not create a database from test.fasta—this will be your query.

For example, for the chloroplast:
```
makeblastdb -in ATcp.fasta -dbtype nucl
```
Tip: To make your script more flexible, use variables as shown in the example script. For instance:
```
cp=ATcp.fasta
makeblastdb -in $cp -dbtype nucl
```
Add these three makeblastdb commands to your Example_script.sh file.
Submit your script to the ASC queue with these parameters:
- Number of CPUs = 1
- RAM/Memory = 50MB
- Queue = class
- Time = 01:00:00 (1 hour)
After the job completes, check the standard output file (ending in .o followed by the job number) for errors. Then list your directory contents.
You should now see multiple new files for each database (with extensions like .nhr, .nin, .nsq, etc.). These are the indexed database files that BLAST will use for rapid searching. Count how many new files were created per input FASTA file.
Check your job efficiency using jobinfo -j JOBNUMBER. Note the memory and CPU usage—this information helps you optimize future jobs. What was your memory efficiency? CPU efficiency?

19.5 Step 3: Run Your First BLAST Search

Now you’ll compare your unknown transcripts test.fasta against each of the three databases you created. First, examine Figure 19.1 to determine which program to use within BLAST.

Programs within BLAST

Overview of the five main BLAST Algorithms — Figure 19.1: Overview of BLAST five main Algorithms

There are five major algorithms within BLAST. The suffixes in the names are: P = protein, N = nucleotide, X = dynamically translated DNA, and T = translating database sequences. In making your database, you already know your input sequences are nucleotides. Example the input file test.fasta to determine which of the five programs you will use when running your BLAST search.

Since both your query and database sequences are nucleotides, you’ll use the blastn program. Review the available options:
```
blastn -help
```
The essential parameters are:
- -query: Your input FASTA file
- -db: The database to search against
- -out: Output file name
Run BLAST to compare test.fasta against the chloroplast database. For example:
```
blastn -query test.fasta -db ATcp.fasta -out test_vs_cp.blastn_out
```
Use a descriptive output filename that indicates the query and database (e.g., test_vs_cp.blastn_out for test query versus chloroplast database).
Repeat for the other two databases (mitochondrial and chromosome V). Add all three BLAST commands to your script.
Submit your updated script to the queue:
- Number of CPUs = 1
- RAM/Memory = 100MB
- Queue = class
- Time = 01:00:00
Tip: After successfully creating the blast databases, comment out the makeblastdb commands (add # at the start of each line) so they don’t rerun unnecessarily as we continue to add to this script.
Once the job completes, examine one of the output files using less. This is BLAST’s default “pairwise” format, similar to what you see on the NCBI website. While human-readable, it’s difficult to parse computationally.

19.6 Step 4: Optimize Output Format for Parsing

The default BLAST output is verbose and hard to parse with command-line tools. BLAST offers alternative formats that are more suitable for automated analysis.

Review the -outfmt option in the blastn -help output. You’ll see numbered format options (0-18). Formats 6 and 7 are tabular and particularly useful for parsing.

Rerun one of your BLAST searches using -outfmt 6:

blastn -query test.fasta -db ATcp.fasta -outfmt 6 -out test_vs_cp.tab

Examine the output with less. Format 6 produces tab-delimited columns:
- Column 1: Query sequence ID
- Column 2: Subject (database) sequence ID
- Column 3: Percent identity
- Column 4: Alignment length
- Column 5: Number of mismatches
- Columns 6-12: Additional alignment statistics including E-value
Compare format 6 to format 7 (which includes column headers). Which do you prefer? As a class, decide on one format to use consistently.
Update your script to use your chosen output format for all three BLAST searches. Update the queue parameters based on your previous job’s runtime and memory usage (check with jobinfo):
- Number of CPUs = 1
- RAM/Memory = ? (what did you actually use last time?)
- Queue = class
- Time = ? (how long did it actually take?)
Comment out the previous BLAST commands and add the new versions. Submit to the queue.

19.7 Step 5: Understanding Database Size Effects on E-values

The E-value (expect value) represents the number of hits you’d expect to see by chance in a database of this size. Larger databases generally produce larger E-values for the same match.

To avoid duplicating large files, I have pre-downloaded Chr1 of the Arabidopsis genome. You’ll create a symbolic link to the database folder (make sure to replace your username as you did during setup):
```
ln -s /home/aubclsfXXXX/bioinf_shared/Arabidopsis_thaliana/CHR_I
```
A symbolic link acts like a shortcut—the files appear in your directory but remain stored in the original location.
BLAST allows you to search multiple databases simultaneously by listing them in quotes:
```
blastn -db "ATcp.fasta ATchrV.fasta CHR_I/NC_003070.fasta ATmt.fasta" \
  -query test.fasta -outfmt 6 -out combined_search.txt
```
This searches all four databases (chloroplast, chrV, chrI, and mitochondrial) in one run.

Note: If you named your BLAST databases something different than the default names, then you will need to edit this to match your database naming conventions here.
Add this command to your script, comment out previous lines, and submit to the queue with updated parameters.
After completion, compare the E-values for a specific hit:
- Find a transcript that matched the chloroplast in your earlier single-database search
- Look up the same transcript in your combined search output
- How did the E-value change? Why? Consider: The combined database is much larger than the chloroplast alone.

19.8 Step 6: Parse BLAST Output to Count Hits Per Database

Now you’ll use command-line text processing tools to extract meaningful summaries from your BLAST results. The goal is to count how many of your query transcripts match each database (mitochondrial, chloroplast, or nuclear chromosomes). You will be making a long one-liner with multiple piped steps that you will build incrementally below.

Your BLAST output columns 1 and 2 contain the query name and database hit. Extract just these columns using awk:
```
blastn -db "ATcp.fasta ATchrV.fasta CHR_I/NC_003070.fasta ATmt.fasta" \
  -query test.fasta -outfmt 6 | awk '{print $1,$2}'
```
The | (pipe) sends BLAST output directly to awk without creating an intermediate file. In the next several steps, add each command via pipe to this command line argument. In the end, you will redirect the output to a file.

Keep in mind that if you run the command above by itself, the output will be LARGE! So if you want to test each step of your piped one-liner, simplify the output by piping it to head periodically. But beware! Do not accidentally leave a | head | in the middle of your one-liner which will effectively remove a vast majority of your output hits!
Each query might have multiple hits to the same database sequence. Remove duplicates with another pipe to sort and then uniq to yield unique query-database pairs (one line per transcript-database match).

Hint: Do not forget to use the man page when needed to remind yourself how to use these tools.

Now focus only on which databases were hit by adding a pipe to awk to isolate only column 2. This will remove the query names.
Next, count how many times each database appears by piping the output to uniq -c (which requires sorted input).
Sort the counts numerically and save to a file called RawCounts.txt.
Examine RawCounts.txt. You should see counts for each database. Did you get more hits to chromosome I (the full chromosome) or to chromosome V (partial)? Why might that be?
Both chrV and chrI should be “NT”, but since we used a symbolic link for chrI, it shows up differently in our output. Use sed to convert chrI hits to NT so we can combine hits for these two chromosomes.
Add this complete one-liner to your script. This is a powerful example of piping multiple tools together—a common pattern in bioinformatics.

19.9 Step 7: Identify Sequences with No BLAST Hits

Some of your query transcripts may not match any database, suggesting they could be novel genes, misassembled sequences, or contamination.

Sequences with no BLAST hits won’t appear in your output file at all. Calculate the number:
- Total sequences in test.fasta (you counted this in Step 1)
- Minus: Number of sequences with at least one hit
To count sequences with hits, extract unique query IDs from your BLAST output:
```
awk '{print $1}' combined_search.txt | sort | uniq | wc -l
```
Calculate: (total sequences) - (sequences with hits) = sequences with no hits
Alternatively, if you want to know which specific sequences had no hits, you could:
- Extract all sequence IDs from test.fasta: grep ">" test.fasta > all_queries.txt
- Extract hit IDs from BLAST output: awk '{print $1}' combined_search.txt | sort | uniq > hit_queries.txt
- Find the difference: comm -23 all_queries.txt hit_queries.txt
Add the count of sequences with no hits to the bottom of your RawCounts.txt file using echo:
```
echo "No hits: [your count]" >> RawCounts.txt
```

19.10 Step 8: Finalize Your Script

Review your Example_script.sh. It should now contain:
- Commands to create three BLAST databases
- Commands to run BLAST searches (with appropriate output formatting)
- A one-liner to parse results and count hits per database
- Calculation of sequences with no hits
Add comments throughout your script to explain what each section does.
Test that your script runs successfully from start to finish. You can do this by:
- Removing all output files
- Resubmitting the script (with all commands uncommented)
- Verifying all expected outputs are generated

This script is now a reusable tool. With minor modifications, you could use it to annotate sequences from any organism against any reference genome.

19.11 Reflection Questions

Now that you’ve completed the lab, reflect on the following:

Command-line advantages: What are some specific scenarios where command-line BLAST would be preferable to web-based BLAST? Consider factors like:
- Number of sequences
- Database selection
- Reproducibility
- Integration with other tools
Nucleotide vs. protein BLAST: For identifying orthologs across distantly related species, is it better to use nucleotide sequences (blastn) or protein sequences (blastp/blastx)? Why? Hint: Consider how DNA vs. protein sequences evolve over time.
Novel sequences: Based on your RawCounts.txt results:
- How many transcripts matched the chloroplast genome?
- How many matched the mitochondrial genome?
- How many matched nuclear chromosomes?
- How many had no hits in any database?
For the sequences with no hits, what are possible biological explanations? What would you do next to characterize these sequences?
E-value interpretation: You observed how database size affects E-values. If you’re comparing BLAST results from different studies that used different reference databases, why is it important to consider database size? How might this affect your interpretation of “significant” hits?
Applications: Beyond annotation, describe two other biological questions or research applications where command-line BLAST would be useful. Be specific about what you’d search and why.

19.12 Deliverables

Upload the following to Canvas:

Your completed Example_script.sh file (with comments explaining each section)
Your RawCounts.txt file
Answers to the five reflection questions above (in a Word document or PDF)

19.13 Acknowledgement

This lab assignment has been adapted from an activity initially composed by Dr. Nathan Hall from Auburn University. It has been further modified through multiple cohorts of students and more recently with the assistance of AI.

19.14 Additional Resources

Extending Your BLAST Skills:

Primer design: BLAST can verify primer specificity. For large-scale primer design (>5 loci), automating BLAST checks saves time. Python’s primer3 wrapper can be integrated with command-line BLAST.
Ortholog discovery: SeqAnswers Thread discusses automating BLAST for ortholog identification across multiple genomes—useful for comparative genomics projects.
Alternative alignment tools:
- LASTZ: Better for whole-genome alignments
- BWA: Optimized for mapping short reads to reference genomes
- Mauve: Multiple genome alignment and visualization
Visualization in R: The genoPlotR package creates publication-quality figures from BLAST and Mauve outputs, useful for comparing genome organization.
BLAST+ Documentation: Comprehensive guide to all BLAST+ tools and parameters: NCBI BLAST+ Command Line Applications User Manual

19.15 References

Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. Basic local alignment search tool. Journal of Molecular Biology 215, 403–410 (1990).

Camacho, C. et al. BLAST+: Architecture and applications. BMC Bioinformatics 10, 421 (2009).