
24  Lab 10: Population Genomics using the 1000 Genomes Dataset

24.1 Overview

Today you will be working with human population genomics data from the 1000 Genomes Project, which sequenced over 2,500 individuals from 26 populations worldwide to catalogue human genetic variation at minor allele frequencies as low as 1% [1]. The goal of this exercise is to give you hands-on experience examining genetic differences within and between human populations using the same statistics introduced in Chapter 12: \(F_{ST}\), \(\pi\), and Tajima’s D.

You will be assigned one of six genomic regions (~2–10 Mb each) that contain known signatures of importance in human evolution, though you will not be told which gene or phenotype is involved at the outset. Your job is to use the data as evidence to form and refine a hypothesis. In Part 1 you will examine between-population divergence (\(F_{ST}\)) to identify which populations differ most across your region. In Part 2 you will investigate within-population diversity (\(\pi\)) and the site frequency spectrum (Tajima’s D) to ask whether selection may be acting. In Part 3 you will use the UCSC Genome Browser to identify candidate genes that could explain the patterns you observed.

Tip: Before you begin

Review the Population Structure and \(F_{ST}\) and Tajima’s D sections of Chapter 12, as well as the Human Genomics callout box. The patterns you will encounter in this lab come from real data, and the signal of selection in humans is subtler than in many other species. As discussed in the chapter, classic hard selective sweeps are rare in our genome, so peaks may not be visually dramatic.

24.2 Learning Goals and Objectives

This lab has been assigned with the following learning goals in mind. As part of completing this assignment, students will:

  • Cultivate an appreciation for the role that genes play in contributing to phenotype variation.
  • Examine the use of genomics data in revealing signatures of natural selection.
  • Consider the role that genes play in shaping our identity.

By the end of this lab, you should be able to:

  • Interpret sliding-window \(F_{ST}\), \(\pi\), and Tajima’s D metrics from real population genomic data.
  • Connect the direction and magnitude of these statistics to the evolutionary processes (selection, drift, demographic history) discussed in Chapter 12.
  • Use genomic coordinates and a genome browser to connect statistical signals to candidate genes and known biology.
  • Critically evaluate how window size, sample size, and data quality affect population genomic inferences.

24.3 Part 1: Between-population divergence (\(F_{ST}\))

Recall from Chapter 12 that \(F_{ST}\) measures the proportion of total genetic variation explained by differences between populations, and ranges from 0 (no differentiation) to 1 (complete fixation of different alleles in each population). Genome-wide \(F_{ST}\) between major continental human groups is approximately 0.043 — a useful benchmark as you interpret your plots.
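As a rough illustration of this definition, the sketch below computes Wright's \(F_{ST}\) at a single biallelic site from the allele frequencies in two populations, as \((H_T - H_S)/H_T\). It assumes two equally weighted populations and is a simplification, not the Weir and Cockerham estimator that VCFtools reports; the function name and the example frequencies are made up for illustration.

```python
def fst_two_pops(p1: float, p2: float) -> float:
    """Wright's F_ST at one biallelic site for two equally weighted populations."""
    # H_S: expected heterozygosity within each population, averaged.
    h_s = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2
    # H_T: expected heterozygosity if the two populations were pooled.
    p_bar = (p1 + p2) / 2
    h_t = 2 * p_bar * (1 - p_bar)
    if h_t == 0:
        return 0.0  # monomorphic overall, so there is no variation to partition
    return (h_t - h_s) / h_t


print(fst_two_pops(0.5, 0.5))            # identical frequencies
print(fst_two_pops(0.0, 1.0))            # different alleles fixed in each population
print(round(fst_two_pops(0.4, 0.6), 3))  # modest frequency difference
```

A frequency difference of 0.4 versus 0.6 gives a value of about 0.04 under this simplified formula, which gives a feel for how small the allele frequency shift behind the genome-wide benchmark can be.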

  1. Today we will work with four of the 26 populations from the 1000 Genomes Project, each identified by a three-letter code:

    • CEU: Utah Residents (CEPH) with Northern & Western European Ancestry
    • YRI: Yoruba in Ibadan, Nigeria
    • CHB: Han Chinese in Beijing, China
    • PEL: Peruvians from Lima, Peru
  2. Open the R Shiny app I built for today’s lab. In the “Between Populations” tab, select your assigned region from the drop-down menu and click “Plot Fst”. These \(F_{ST}\) values were computed with VCFtools (--weir-fst-pop) using 100 kb sliding windows, as described in Chapter 12 (Table 12.1).

  3. Examine the initial plot. What are the X and Y axes? This plot uses 100 kb sliding windows. Based on what you read in Chapter 12, how do you expect different window sizes to affect what you see? Would a larger window make the signal stronger or weaker, and why?

  4. Use the “Select Population Comparison” drop-down to examine \(F_{ST}\) for each pairwise comparison across your region. Record the average and maximum weighted Weir \(F_{ST}\) for each pair in the table below:

    | Metric | CEU vs. YRI | YRI vs. CHB | CEU vs. CHB | YRI vs. PEL | CEU vs. PEL | CHB vs. PEL |
    | --- | --- | --- | --- | --- | --- | --- |
    | Average \(F_{ST}\) | | | | | | |
    | Max \(F_{ST}\) | | | | | | |
  5. Comparing your values to the genome-wide human average of ~0.043: are any of your pairwise comparisons substantially above this background? Which two populations show the greatest differentiation in your region, considering both the height and breadth of the peak?

If multiple comparisons have similar peak values, choose the two you will investigate further based on the highest single-window \(F_{ST}\) value.

  • Populations selected for further analysis: ______ vs. ______

In the blanks above, write the three-letter population codes from Step 1.

  6. Where is the peak of maximum \(F_{ST}\) in your region?

    • Approximate genomic position (Mb): __________
  7. Save the \(F_{ST}\) plot for your selected population pair to include in your write-up.
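If you ever want to reproduce the table values and the peak position outside the app, the steps above can be sketched in a few lines of Python. The example assumes the column layout of a VCFtools windowed \(F_{ST}\) output file (CHROM, BIN_START, BIN_END, N_VARIANTS, WEIGHTED_FST, MEAN_FST); the rows themselves are invented, and `summarize_fst` is a hypothetical helper, not part of the Shiny app.

```python
import csv
import io

# Invented rows in the column layout of a VCFtools *.windowed.weir.fst file.
EXAMPLE = """CHROM\tBIN_START\tBIN_END\tN_VARIANTS\tWEIGHTED_FST\tMEAN_FST
1\t1000001\t1100000\t412\t0.05\t0.03
1\t1100001\t1200000\t388\t0.21\t0.12
1\t1200001\t1300000\t401\t0.62\t0.35
1\t1300001\t1400000\t395\t0.12\t0.07
"""


def summarize_fst(text):
    """Return the average and maximum weighted F_ST and the peak window midpoint (Mb)."""
    rows = list(csv.DictReader(io.StringIO(text), delimiter="\t"))
    values = [float(r["WEIGHTED_FST"]) for r in rows]
    # The peak is the window with the highest weighted F_ST.
    peak = max(rows, key=lambda r: float(r["WEIGHTED_FST"]))
    midpoint_mb = (int(peak["BIN_START"]) + int(peak["BIN_END"])) / 2 / 1e6
    return {
        "average": sum(values) / len(values),
        "maximum": max(values),
        "peak_mb": midpoint_mb,
    }


print(summarize_fst(EXAMPLE))
```

Note that the average here is a simple mean across windows, which is what the table asks for; it is not the same as a single region-wide weighted estimate.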

24.4 Part 2: Within-population diversity (\(\pi\) and Tajima’s D)

Recall from Chapter 12 that \(\pi\) measures the average pairwise nucleotide difference per site within a population (\(E[\pi] = 4N_e\mu\) under the standard neutral model), while Tajima’s D compares two estimators of \(\theta\) that weight rare and common variants differently. The value of Tajima’s D in a window can therefore reflect the evolutionary processes that have shaped genetic variation there. Similarly, \(\pi\) is a direct readout of genetic variation, and the two statistics can be used together to make inferences. Remember also that in humans, selective sweep signals are subtle; don’t be surprised if the patterns are not dramatic.
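To make the two estimators concrete, here is a minimal sketch that computes \(\pi\), Watterson's \(\theta_W\), and Tajima’s D for a toy set of haplotypes using the standard formulas from Tajima (1989). The haplotypes are invented, the function name is hypothetical, and \(\pi\) and \(\theta_W\) are returned as totals for the region rather than per site, so the numbers are not directly comparable to the per-site values in the app.

```python
from itertools import combinations
from math import sqrt


def pi_and_tajimas_d(haplotypes):
    """Return (pi, theta_w, D) for a list of equal-length 0/1 haplotype strings."""
    n = len(haplotypes)
    sites = list(zip(*haplotypes))
    seg = [s for s in sites if len(set(s)) > 1]  # segregating sites only
    S = len(seg)
    # pi: mean number of differences over all pairs of haplotypes.
    pairs = list(combinations(range(n), 2))
    pi = sum(sum(s[i] != s[j] for s in seg) for i, j in pairs) / len(pairs)
    # theta_w: number of segregating sites scaled by the harmonic number a1.
    a1 = sum(1 / i for i in range(1, n))
    theta_w = S / a1
    if S == 0:
        return pi, theta_w, float("nan")  # D is undefined without variation
    # Constants for the variance of (pi - theta_w).
    a2 = sum(1 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)
    d = (pi - theta_w) / sqrt(e1 * S + e2 * S * (S - 1))
    return pi, theta_w, d


# Toy data: every variant is a singleton (an excess of rare alleles).
RARE = ["1000", "0100", "0010", "0001", "0000", "0000"]
# Toy data: every variant sits at 50% frequency (an excess of common alleles).
COMMON = ["1111", "1111", "1111", "0000", "0000", "0000"]

print(pi_and_tajimas_d(RARE))
print(pi_and_tajimas_d(COMMON))
```

Both toy samples have the same number of segregating sites, so \(\theta_W\) is identical; only \(\pi\) changes, and with it the sign of D.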

  1. Select the “Within Populations” tab. Choose your region and the two populations you identified in Part 1, Step 5.

If you selected two pairs of populations, examine each pair in turn.

  2. Click “Plot Pi and Tajima’s D”. These statistics were also computed with VCFtools (--window-pi and --TajimaD) using 100 kb windows.

  3. Scroll to the summary table at the bottom of the page. Record the number of individuals in each population:

    • Pop 1 size: __________
    • Pop 2 size: __________
  4. Define \(\pi\) in your own words. What does a higher or lower value of \(\pi\) in a window tell you about the evolutionary history of that region?

    Watch this video I recorded if you would like a refresher.

  5. Define Tajima’s D in your own words. What does a positive value indicate? A negative value?

    Watch this video by Dr. Mohamed Noor if you would like a refresher.

  6. The window size here is again 100 kb. How do you expect your estimates of \(\pi\) and Tajima’s D to change if the window were 10 kb instead? 500 kb? Think about the trade-off between resolution and noise discussed in Chapter 12.

  7. How does sample size affect the reliability of \(\pi\) and Tajima’s D estimates? Based on the sample sizes you recorded above, how confident are you in the patterns you observe?

  8. Now examine the four plots together with the \(F_{ST}\) plot from Part 1:

    1. Does the peak \(F_{ST}\) sub-region you identified in Part 1 correspond to any notable pattern in \(\pi\) or Tajima’s D? Describe what you see.

    2. Which of your two populations shows the lower Tajima’s D value in the peak region? (Check the summary table at the bottom of the page.) Based on the interpretation of Tajima’s D from Chapter 12, what does this suggest about the recent evolutionary history of that population at this locus?

    3. Is the \(\pi\) in the peak region higher or lower than the surrounding genomic background in each population? How does this compare to what you would expect under a recent selective sweep versus a demographic bottleneck? (Hint: both can produce negative Tajima’s D, but they have different signatures in \(\pi\).)

    4. Make a hypothesis: which population do you think is driving the differentiation signal, and what type of evolutionary event might explain both the \(F_{ST}\) peak and the within-population statistics?

  9. Save the \(\pi\) and Tajima’s D plots for the population you hypothesize is driving the signal to include in your write-up.
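One simple way to put a number on "higher or lower than the surrounding background" is to compare the peak window to the median of the other windows in the region. The sketch below does this for invented per-window \(\pi\) values; the function name, the window coordinates, and the choice of the median as the background summary are all assumptions made for illustration, not something the app computes.

```python
from statistics import median


def peak_vs_background(windows, peak_start):
    """windows: list of (bin_start, value) pairs.

    Returns (peak value, median of all other windows, ratio of the two).
    """
    peak = next(v for start, v in windows if start == peak_start)
    background = median(v for start, v in windows if start != peak_start)
    return peak, background, peak / background


# Invented per-window pi values for one population; the peak window starts at 1200001.
PI_WINDOWS = [
    (1000001, 0.0010),
    (1100001, 0.0011),
    (1200001, 0.0003),
    (1300001, 0.0009),
]

print(peak_vs_background(PI_WINDOWS, 1200001))
```

Running the same comparison for each population separately makes it easier to see whether a change in diversity is specific to one of them.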

24.5 Part 3: Gene identification in the UCSC Genome Browser

Identifying statistical outliers in a sliding-window scan is only the first step — the goal is ultimately to connect genomic signals to biological function. Here you will use the UCSC Genome Browser to ask which genes overlap your peak region and whether any known biology can explain the patterns you observed.

  1. Select the “Explore” tab. Choose your region and click “Load Region”.

  2. Record the chromosome and genomic coordinates displayed:

    • Chromosome: ______ Start: __________ End: __________

    Note that these data were mapped to the human genome reference hg19.

  3. Click “View in UCSC” to open the Genome Browser. Zoom into a ~500 kb window centered on your peak \(F_{ST}\) position from Part 1.

  4. List the genes present in this sub-region:

  5. Turn on the OMIM track (dark green) to display disease- and phenotype-associated variants. Do any OMIM entries overlap your peak region?

  6. Take a screenshot of the zoomed-in browser view to include in your write-up.

  7. Investigate the genes in your window using the browser’s gene pages and external sources (Google Scholar, OMIM, UniProt):

    1. Is there anything about any of the genes that could explain why these populations differ? Note any functional information, expression patterns, or known associations.

    2. Review the Human Genomics callout box in Chapter 12. Do any of the large-effect loci discussed there (pigmentation, diet adaptation, pathogen resistance, high-altitude adaptation) appear in your window? If so, does this support or challenge your hypothesis from Part 2?

    3. Based on the populations showing the highest \(F_{ST}\) and the biology of your candidate gene(s), propose an evolutionary explanation for the pattern. What selection pressure might have acted on this locus, and in which population(s)?

    4. Does the pleiotropic nature of any gene in your region (i.e., one gene affecting multiple traits, as discussed for EDAR in Chapter 12) make it harder or easier to identify the primary target of selection? Explain.

  8. What additional human genomic regions or populations would you want to examine to further test your hypothesis?
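For Step 3, the coordinates of a ~500 kb window centered on your peak can be worked out by hand or with a small helper like the one sketched below. It assumes the peak position is given in Mb and the chromosome as a bare name such as "1"; `ucsc_window` is a hypothetical function, and it does not clip the end coordinate to the chromosome length.

```python
def ucsc_window(chrom: str, peak_mb: float, width_bp: int = 500_000) -> str:
    """Build a chrN:start-end position string centered on a peak given in Mb."""
    center = round(peak_mb * 1_000_000)
    # Never let the start coordinate fall below the first base of the chromosome.
    start = max(1, center - width_bp // 2)
    end = center + width_bp // 2
    return f"chr{chrom}:{start}-{end}"


print(ucsc_window("1", 1.25))  # chr1:1000000-1500000
```

The resulting string can be pasted into the position box of the Genome Browser; remember to keep the assembly set to hg19 so the coordinates match the lab data.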

24.6 Part 4: Class Comparison and Write-Up

  1. When you have completed Parts 1–3, compare your results with at least two classmates who were assigned different regions. For each region discussed:

    • Which populations were most differentiated?
    • What is the likely candidate gene and phenotype?
    • Was the sweep signal obvious or subtle?

    Record any patterns you notice across regions in the table below:

    | Region | Populations most differentiated | Likely phenotype | Sweep obvious? (Y/N) |
    | --- | --- | --- | --- |
    | Yours | | | |
    | Peer 1 | | | |
    | Peer 2 | | | |
  2. Write a short summary (<250 words) that includes:

    • What you did in each part of the lab (1–2 sentences each)
    • Your region’s likely candidate gene and phenotype, with evidence from all three parts
    • How the patterns you observed connect to concepts from Chapter 12 (specifically: \(F_{ST}\), \(\pi\), Tajima’s D, selective sweeps in humans, window size effects)
    • One limitation of the approach used today and how it could be addressed
  3. Submit your write-up along with:

    • The \(F_{ST}\) plot for your selected population pair (Part 1, Step 7)
    • The \(\pi\) and Tajima’s D plots for your hypothesized driving population (Part 2, Step 9)
    • The UCSC Genome Browser screenshot (Part 3, Step 6)

Note: Connecting to the Chapter

This lab demonstrates several principles from Chapter 12 with real data:

  • Sliding-window statistics (\(F_{ST}\), \(\pi\), Tajima’s D) are computed with VCFtools in 100 kb windows — the same approach described in the chapter and listed in Table 12.1.
  • Selective sweeps are rare in humans [2,3], which is why your peaks may be subtle and why combining multiple statistics is more convincing than any single measure alone.
  • Recombination rate variation can create false-positive outlier windows even under neutrality [4], which is one reason we cross-check \(F_{ST}\) patterns against \(\pi\) and Tajima’s D before concluding that selection has acted.
  • \(F_{ST} \approx 0.043\) between major continental human groups [5], so the values you record in your table should be interpreted against this background.

24.7 References

1. Auton, A. et al. A global reference for human genetic variation. Nature 526, 68–74 (2015).
2. Hernandez, R. D. et al. Classic selective sweeps were rare in recent human evolution. Science 331, 920–924 (2011).
3. Gao, Z. Rethinking selective sweeps in human evolution. PLoS Biology 22, e3002469 (2024).
4. Stevison, L. S. & McGaugh, S. E. It’s time to stop sweeping recombination rate under the genome scan rug. Molecular Ecology 29, 4249–4253 (2020).
5.