Publication:

Nelson, Michael F., and Neil O. Anderson. “How Many Marker Loci Are Necessary? Analysis of Dominant Marker Data Sets Using Two Popular Population Genetic Algorithms.” Ecology and Evolution 3, no. 10 (September 1, 2013): 3455–70. https://doi.org/10.1002/ece3.725.

Background

This was my first ever publication! And it’s now my most cited.

My adviser, Neil Anderson, was a great source of encouragement for this work and I’m grateful to have been able to work with him on it.

I undertook this project while working on my dissertation research on the population genetics of the invasive wetland plant Reed Canarygrass. I was busy extracting DNA and working with Inter Simple Sequence Repeats to try to elucidate the patterns among individual genotypes and populations.

Feeling discouraged and doubtful about my methods, I decided to use simulation to examine the validity of my sampling and the ability of my methods to detect patterns.

This was my first foray into simulation modeling, and I used concepts that I first learned about in Peter Tiffin’s molecular ecology course at the University of Minnesota.

Modeling

Inter Simple Sequence Repeats (ISSRs) are a kind of ‘neutral’ genetic marker. They are neutral in the sense that they are not part of protein-coding sequence and are therefore assumed not to be subject to natural selection. This allows their sequences to vary without deleterious effects on the organism. Molecular ecologists can take advantage of such markers to examine patterns of similarity between individuals and make inferences about migration and the relationships between populations.

The analyses I wanted to undertake with my sampled plants involved using several popular algorithms to characterize how much variation there was within and between populations, and to look for patterns that might suggest clusters of related individuals.
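To make "variation within and between populations" concrete, here is a minimal Python sketch that compares the average number of band differences between pairs of individuals from the same population with pairs from different populations. It is only an illustration, not one of the algorithms used in the paper: the function names and the tiny 0/1 band profiles are made up for the example.

```python
from itertools import combinations

def hamming(a, b):
    """Number of marker loci at which two 0/1 band profiles differ."""
    return sum(x != y for x, y in zip(a, b))

def mean_pairwise_distances(populations):
    """Return (mean within-population, mean between-population) distance.

    populations maps a population name to a list of 0/1 band profiles.
    """
    labelled = [(name, ind) for name, inds in populations.items() for ind in inds]
    within, between = [], []
    for (pop_a, a), (pop_b, b) in combinations(labelled, 2):
        (within if pop_a == pop_b else between).append(hamming(a, b))
    return sum(within) / len(within), sum(between) / len(between)

# Hypothetical toy data: two populations, two individuals each, four markers.
pops = {
    "A": [[1, 1, 0, 0], [1, 1, 0, 1]],
    "B": [[0, 0, 1, 1], [0, 1, 1, 1]],
}
within_mean, between_mean = mean_pairwise_distances(pops)
print(within_mean, between_mean)
```

When the between-population average is much larger than the within-population average, as in this toy data, the populations are well differentiated and should be easier for a clustering method to separate.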

I was frustrated because I didn’t know if I had sampled enough individuals, nor whether I had enough molecular data (number of markers) to make meaningful inferences.

Simulations

For this simulation study, I decided to create simulated individual plants, each with its own genome consisting of a variable number of neutral genetic markers. In the simulations I was able to vary the level of migration between populations, the number of markers in the genomes, and the sampling scheme. I was especially interested in knowing if the numbers of individuals I sampled for the real experiment were sufficient to make valid inferences.
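As a rough illustration of what a simulated individual might look like, here is a minimal Python sketch that draws band-presence profiles for dominant markers. It is not the code from the paper. It assumes diploid individuals, Hardy-Weinberg proportions, and null-allele frequencies drawn uniformly between 0.1 and 0.9; the function names are invented for the example.

```python
import random

def sample_band_profile(null_freqs, rng):
    """Return a 0/1 band profile for one diploid individual.

    null_freqs[i] is the frequency q of the null (band-absent) allele at
    locus i. With a dominant marker the band is missing only in null
    homozygotes, so under Hardy-Weinberg P(band present) = 1 - q**2.
    """
    return [1 if rng.random() < 1 - q * q else 0 for q in null_freqs]

def simulate_sample(n_individuals, n_markers, seed=0):
    """Draw a sample of individuals from one population (assumed setup)."""
    rng = random.Random(seed)
    # Assumption: arbitrary starting frequencies, one per marker locus.
    null_freqs = [rng.uniform(0.1, 0.9) for _ in range(n_markers)]
    return [sample_band_profile(null_freqs, rng) for _ in range(n_individuals)]

sample = simulate_sample(n_individuals=20, n_markers=50)
print(len(sample), len(sample[0]))
```

Changing `n_individuals` and `n_markers` corresponds to the two knobs I was most worried about: how many plants were sampled and how many markers were scored.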

I set up the simulated populations in a hierarchy with two continents, each with three regions, and 36 patches within the regions. The amount of migration among regions, continents, and patches was allowed to vary. Simulations were run for 150 years.
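The hierarchy and migration scheme can be sketched in a few lines of Python. This is a simplified illustration rather than the model from the paper. It assumes 36 patches in total (6 per region), a single marker locus tracked as an allele frequency per patch, one generation per simulated year, and made-up migration rates and patch sizes.

```python
import random

def build_patches(n_continents=2, regions_per_continent=3, patches_per_region=6):
    """Label each patch by (continent, region, patch) indices."""
    return [(c, r, p)
            for c in range(n_continents)
            for r in range(regions_per_continent)
            for p in range(patches_per_region)]

def migration_rate(src, dst, m_patch, m_region, m_continent):
    """Choose the rate by the highest hierarchy level at which two patches differ."""
    if src[0] != dst[0]:
        return m_continent   # different continents
    if src[1] != dst[1]:
        return m_region      # same continent, different regions
    return m_patch           # same region, different patches

def step(freqs, patches, rates, pop_size, rng):
    """One generation: migration mixes allele frequencies, then drift resamples."""
    new_freqs = []
    for i, dst in enumerate(patches):
        weights = [0.0 if j == i else migration_rate(src, dst, *rates)
                   for j, src in enumerate(patches)]
        stay = 1.0 - sum(weights)
        mixed = stay * freqs[i] + sum(w * f for w, f in zip(weights, freqs))
        # Drift: draw 2N allele copies at the post-migration frequency.
        copies = sum(rng.random() < mixed for _ in range(2 * pop_size))
        new_freqs.append(copies / (2 * pop_size))
    return new_freqs

rng = random.Random(42)
patches = build_patches()
freqs = [rng.random() for _ in patches]   # assumed random starting frequencies
rates = (0.01, 0.001, 0.0001)             # assumed patch, region, continent rates
for _ in range(150):                      # assumes one generation per year
    freqs = step(freqs, patches, rates, pop_size=50, rng=rng)
```

Raising the region and continent rates toward the patch rate makes the populations more alike, which is the "highly related regions" end of the scenarios described below.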

Figure 4: Spatial hierarchy of simulated continents and populations in the model. This was the first map I ever made! I created it in R knowing virtually nothing about GIS. Note the wonky projection!

Main Findings

As expected, the ability to resolve the populations of origin was highest with equal sampling, high numbers of markers, and low relatedness and migration among populations.

Figure 5: STRUCTURE clustering results of 4 different model scenarios ranging from lots of markers and highly differentiated regions (top row) to low marker quantity and highly related regions (bottom row). Individual genotypes are columns in the figures, and the shading shows the proportion of each genome that the model attributes to each population. Model performance is approximately the degree to which you can visually resolve the 6 different populations of individuals.

Figure 6: This figure shows that the STRUCTURE model has a harder time correctly classifying individuals when sampling is unequal.

Conclusions

The clustering model performance depended (as expected) on the number of individuals sampled, the number of markers, the evenness of sampling, and the relatedness of the populations.

This study helped reassure me that my sampling scheme and number of markers were sufficient to address my research questions. Of course, the more individuals and markers the better, but I think this study provides a helpful guide for people wanting to use neutral markers. Eventually it will be cheaper and easier to use genomic sequencing rather than neutral markers, but for now I think this method is still useful.