Resources for Undergraduate Courses in Bioinformatics: Data Visualization

By Diane P. Genereux

Data Visualization

While coding is essential for data analysis, data visualization is essential for sharing results. Historically, finalizing data figures was often one of the last steps before publication. In the era of massive data sets, though, the data are typically too large for patterns to be found simply by looking at raw values. The ability to display graphically at least the descriptive features of a large data set is therefore now essential: as a tool to screen for coding errors, and especially for sharing initial findings with collaborators, including those who lack the computational skills required to examine the data themselves. For all these reasons, an introductory bioinformatics course should help students approach data visualization as intrinsic to data analysis.

Assembly of the set of resources suggested here was guided by two notions: that making graphics is often a highly enjoyable endeavor, especially for students with experience or interest in graphic design, and that the process of designing and refining data-display graphics can itself provide opportunities to reinforce essential skills in statistics and hypothesis testing. Colin Ware’s Information Visualization: Perception for Design uses biological features of human visual perception to help explain why some design approaches are more successful than others in capturing challenging concepts. While the book is focused entirely on the design rather than the implementation of figures, many of the concepts and examples presented will likely inspire lively classroom discussion of how to present data in ways that are at once simple and information rich.

For an initial foray into helping students make figures of their own, instructors who have structured their class around R will likely find it easiest for students to begin with R’s built-in graphics system. This system permits direct use of data in familiar structures, for example, with one row per sample and each column representing a particular data feature (e.g., sampling location). John Hilfiger’s Graphing Data with R is available only in a first edition that is already a few years old, but it nonetheless remains a practical, reliable resource, offering clearly worked examples and sample code. Getting Started with R, a volume introduced above, is principally a general introduction to all aspects of R, but it also includes several chapters focused specifically on graphing. This work’s seamless integration of basic functions and graphics guidance may be of particular use for students less comfortable with coding, as graphing functions are presented as a simple next step from introductory functions. Some freely available online resources may also be valuable for students just getting started with graphing. Although they provide just a few examples, many of these offer an approachable introduction to R’s graphics capacity. Highlights include A Comprehensive Guide to Data Visualisation in R for Beginners, published online by Towards Data Science Inc., and Understanding Data: Graphs—Data Visualization Using R. Indeed, beginning with an online guide allows students to copy sample code directly into R installed on their own computers, which encourages immediate tinkering and experimentation in applying that code to the test data sets provided with a basic R installation. Such experimentation is an essential part of the coding process even for experienced data scientists.
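As an illustration of the kind of sample code students might copy and tinker with, here is a minimal sketch that uses only R’s built-in graphics and the iris data set that ships with a basic R installation. The choice of iris, the columns plotted, and the temporary output file are assumptions made for this example; they are not drawn from any of the resources named above.

```r
# iris ships with every R installation (in the 'datasets' package).
# Each row is one flower (a sample); each column is a measured feature.
data(iris)
str(iris)

# Write the figures to a temporary PDF so the script also runs non-interactively.
# (In an interactive session, these two device lines can simply be omitted.)
out_file <- tempfile(fileext = ".pdf")
pdf(out_file)

# Scatterplot of two numeric columns, colored by the Species column.
plot(iris$Sepal.Length, iris$Petal.Length,
     col = as.integer(iris$Species),
     pch = 19,
     xlab = "Sepal length (cm)",
     ylab = "Petal length (cm)",
     main = "Iris measurements by species")
legend("topleft",
       legend = levels(iris$Species),
       col = seq_along(levels(iris$Species)),
       pch = 19)

# Boxplot using the formula interface: one box per species.
boxplot(Petal.Length ~ Species, data = iris,
        ylab = "Petal length (cm)")

# Close the PDF device so the file is written to disk.
invisible(dev.off())
```

Students can experiment by swapping in other columns, or another built-in data set such as mtcars, and observing how the figure changes.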

Even in an introductory bioinformatics class, students enthusiastic about data display and inspired by examples in the data-design volumes noted above may become frustrated with the comparatively rudimentary set of options built into R’s basic package. The natural next step is ggplot2, an expansive software library that works with data in the somewhat less intuitive “long format.” Its use is nicely documented in the online resource Data Wrangling with R (authored by Claudia Engel and published as a GitHub web page). Learning ggplot2 is an invaluable step for working data scientists, but including it as a requirement can occupy considerable course time, with a possible sacrifice of attention to other course goals.
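To make the “long format” idea concrete, the following is a rough sketch, assuming the ggplot2 package is installed. It reshapes the built-in iris data from one row per flower to one row per measurement and then draws a multipanel figure. The use of base R’s stack() for reshaping is a stand-in chosen to keep the example dependency-light; it is not the approach taken in any particular tutorial, and the column names value and measurement are hypothetical labels chosen for this example.

```r
# Assumes ggplot2 is installed: install.packages("ggplot2")
library(ggplot2)

# Wide format: one row per flower, one column per measurement.
wide <- iris

# Long format: one row per individual measurement.
# stack() is a base-R stand-in; tidyr::pivot_longer() is a common alternative.
long <- stack(wide[, 1:4])                 # yields columns 'values' and 'ind'
names(long) <- c("value", "measurement")   # illustrative column names
long$Species <- rep(wide$Species, times = 4)

# One boxplot panel per measurement type, built up in layers.
p <- ggplot(long, aes(x = Species, y = value, fill = Species)) +
  geom_boxplot() +
  facet_wrap(~ measurement, scales = "free_y") +
  labs(x = NULL, y = "Measurement (cm)") +
  theme_minimal() +
  theme(legend.position = "none")

# Save to a temporary file; in an interactive session, print(p) displays it.
out_file <- tempfile(fileext = ".pdf")
ggsave(out_file, plot = p, width = 7, height = 5)
```

Once the data are in long format, a single faceting call produces all four panels, which is part of what makes the extra reshaping step worthwhile.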

One option for instructors is to assign data-graphing projects and let students choose between R’s built-in graphics options and the much more expansive options available through ggplot2, according to their individual skills and preferences. For students who do pursue graphing in ggplot2, several basic introductions are available online in addition to Claudia Engel’s, including tutorials by Eric C. Anderson, available as part of the syllabus for his Reproducible Research Course, also archived on a GitHub site. Anderson offers a valuable overview of making multipanel plots, while Data Analysis and Visualization in R for Ecologists (available on the Data Carpentry site maintained by François Michonneau and Auriel Fournier) offers options for tuning figure aesthetics. Additionally, The Complete ggplot2 Tutorial, which presents the capacities of ggplot2 in the context of specific statistical approaches, is offered by Selva Prabhakaran at his site mainly devoted to R. Instructors with more advanced students and/or sufficient time to focus for several weeks on learning ggplot2 may prefer to use comprehensive volumes that address both data structures and display options. Hadley Wickham’s ggplot2: Elegant Graphics for Data Analysis is a classic text that offers a comprehensive reference for accessing the full power of ggplot2, and is available in both paper and electronic formats.