
Resources for Undergraduate Courses in Bioinformatics: Concepts in Probability and Statistics

By Diane P. Genereux

Concepts in Probability and Statistics

In addition to raising public awareness of big data analysis, the Covid-19 pandemic has underscored how much a basic understanding of probability and statistics matters for assessing risk and making personal decisions.(5) The resource recommendations in this section proceed from the view that students of all majors routinely use probability and statistics in making decisions, both in academic courses and in their personal lives, and that the primary goal of a bioinformatics course should be to help them formalize these existing skills and extend them to data sets of increasing complexity.

As in all areas of science, significance testing is an essential skill for anyone working in bioinformatics. One example is the question of whether a given DNA site is more conserved across species than expected given its mutation rate, which would suggest that it has been under purifying selection. Nonetheless, addressing significance testing in an introductory bioinformatics course presents several challenges. As noted above, many biology majors may be accustomed to managing small data sets of the scale that can be collected during a single laboratory session and analyzed by following specific directions for running a chi-square test in Excel. These students are likely to be familiar with the concept of a null distribution against which to test data for deviations, but may not have experience in constructing such distributions themselves. Computer science students, in turn, are likely to be comfortable with much larger data sets, but may lack the basic biological understanding necessary to identify a relevant null hypothesis. For an introductory bioinformatics class, then, the most useful resources are likely to be ones that describe the biological features of a problem and then engage students in examining data to compare an observed distribution with an appropriately modeled null.
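To make the idea of constructing a null distribution concrete, the following is a minimal sketch in Python (chosen here only for illustration; the resources discussed in this section work in R or Excel). It simulates how many species would retain the ancestral base at a single site if every lineage changed independently with the same probability, then asks how often the simulated null is at least as conserved as an observed count. The function names, the `p_change` parameter, and the assumption of independent lineages (which ignores shared phylogenetic history) are all hypothetical simplifications, not a method taken from any of the texts reviewed.

```python
import random


def simulate_null_conserved_counts(n_species, p_change, n_sims, seed=0):
    """Simulate a null distribution for site conservation.

    Assumption (illustrative only): each of n_species lineages changes
    the site independently with probability p_change.
    Returns a list with the number of unchanged species per simulation.
    """
    rng = random.Random(seed)  # fixed seed so results are reproducible
    counts = []
    for _ in range(n_sims):
        unchanged = sum(1 for _ in range(n_species) if rng.random() >= p_change)
        counts.append(unchanged)
    return counts


def empirical_p_value(null_counts, observed):
    """One-sided empirical p-value: the share of null draws that are
    at least as conserved as the observed count."""
    at_least = sum(1 for c in null_counts if c >= observed)
    # Adding 1 to numerator and denominator keeps the estimate above zero.
    return (at_least + 1) / (len(null_counts) + 1)


if __name__ == "__main__":
    null = simulate_null_conserved_counts(n_species=20, p_change=0.3, n_sims=5000)
    print("p-value for a site unchanged in all 20 species:",
          empirical_p_value(null, observed=20))
```

A small p-value here would indicate that the observed conservation is unlikely under the assumed neutral model. In a classroom setting, the more instructive step is usually the discussion of whether the modeled null is biologically appropriate.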

For biology students who arrive with no coding experience, there is potential value in introducing concepts in statistics and probability in the context of Excel, which they will almost certainly have encountered in their introductory biology courses. Excel 2019 for Biological and Life Sciences Statistics by Thomas Quirk, Meghan Quirk, and Howard Horton offers an approachable bridge for such students, providing a practical guide for applying descriptive statistics to data sets of moderate size. Though it is aimed primarily at students of environmental rather than biomedical sciences, Mark Gardener’s Statistics for Ecologists Using R and Excel is also useful for its clear examples and sample problems.

While Excel-based applications may offer a useful bridge for some students, a course that uses Excel alone will severely limit students’ capacity to analyze larger data sets efficiently and to develop and implement statistical approaches of their own. Developing students’ facility with implementing statistical approaches will be essential, and this goal will be more easily reached if basic operations in R are introduced early in the course, perhaps using some of the resources suggested above, and if statistical approaches are addressed in the context of biological questions that students have previously encountered. In contrast to most texts focused on statistics for bioinformatics, Sunil Mathur’s Statistical Bioinformatics with R begins by presenting statistical approaches, along with R implementations, in the context of basic problems in classical genetics that are mainstays of introductory biology courses, such as testing for genotype distributions that deviate from Hardy-Weinberg equilibrium. This approach may help to ease the path for biology majors into bioinformatics, and it is also scalable to much larger data sets. Michael Baron’s Probability and Statistics for Computer Scientists broaches basic questions in statistics within the context of the management of computer systems rather than biological questions; this work may be helpful for introducing computer science students to null distributions, bootstrapping, and other concepts essential for analysis of biomedical data.
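As an illustration of the kind of classical-genetics problem mentioned above, the following is a minimal sketch of a chi-square test for deviation from Hardy-Weinberg equilibrium at a biallelic locus. It is written in Python with only the standard library, purely for illustration; it is not drawn from Mathur's text, which uses R, and the function name and argument names are hypothetical. The allele frequency is estimated from the genotype counts, so the test has one degree of freedom, and for one degree of freedom the chi-square tail probability can be computed as erfc(sqrt(x / 2)).

```python
import math


def hardy_weinberg_chi_square(n_AA, n_Aa, n_aa):
    """Chi-square goodness-of-fit test against Hardy-Weinberg proportions.

    Takes observed genotype counts for a biallelic locus and returns
    (chi-square statistic, p-value) with 1 degree of freedom.
    """
    n = n_AA + n_Aa + n_aa
    if n == 0:
        raise ValueError("no genotypes observed")

    # Estimate the frequency of allele A from the sample.
    p = (2 * n_AA + n_Aa) / (2 * n)
    q = 1 - p
    if p == 0.0 or p == 1.0:
        raise ValueError("locus is monomorphic; the test is undefined")

    # Expected counts under Hardy-Weinberg: p^2, 2pq, q^2.
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    observed = (n_AA, n_Aa, n_aa)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

    # Survival function of the chi-square distribution with 1 df.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


if __name__ == "__main__":
    chi2, p_value = hardy_weinberg_chi_square(n_AA=30, n_Aa=40, n_aa=30)
    print(f"chi-square = {chi2:.3f}, p-value = {p_value:.4f}")
```

The same calculation can be carried out by hand in a spreadsheet for a single locus. Wrapping it in a function is what lets it scale to many loci, which is the transition from Excel to scripted analysis that this section recommends.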

At institutions where both biology and computer science majors are required to take a course in basic statistics prior to enrolling in a bioinformatics course, instructors will be able to jump directly to statistical methods for addressing complex problems. Introduction to Bioinformatics with R: A Practical Guide for Biologists by Edward Curry offers an especially useful section focused on using R to explore distributions. It could be useful both in introducing fundamental concepts in statistics and in helping students already well versed in statistics to implement familiar methods using R. For students with a strong mathematical background, the Springer volume Statistical Methods in Bioinformatics, jointly authored by Warren Ewens and Gregory Grant, presents in more formal mathematical terms the approaches used to analyze DNA sequence data, also implemented in R. For students with advanced mathematical skills, Xuhua Xia’s A Mathematical Primer of Molecular Phylogenetics presents clearly worked figures and examples for phylogenetic inference from sequence data, and Topological Data Analysis for Genomics and Evolution by R. Rabadan and A.J. Blumberg elegantly introduces an algebraic approach and offers extensive examples for finding fundamental patterns in multidimensional genomic data.

5. Michael Eisenstein discussed apps for calculating Covid-19 risk in a Nature Technology Feature published on December 21, 2020: “What’s Your Risk of Catching COVID? These Tools Help You to Find Out,” freely accessible at https://www.nature.com/articles/d41586-020-03637-y.