A review of ‘R for Data Science’ book @hadleywickham #rstats #openscience
This article was originally published at http://www.christopherlortie.info
Data science is a critical component of many domains of research, including the domain in which I primarily work: ecology. However, in teaching biostatistics within the university context, we have typically focussed on the statistics and less on the science of data (i.e. handling, understanding, and manipulating data). This is unfortunate, but the teaching landscape is now rapidly evolving to include numerous institutional Master’s of Data Science degrees.
It has taken me an embarrassingly long time to appreciate the differences between data science and statistics. My teaching has embraced open science and shared many of the skills that students need to be scientifically-literate citizens. However, data-literate citizens are important too if we want the next generation to make informed, evidence-based decisions about health, the economy, and the health of our ecosystems. Critical thinking tools for data are non-trivial concepts and statistics are absolutely needed. However, the science of data, big or little, is critical in appreciating the decisions, steps, and workflows needed to prepare, share, analyze, collaborate, and evaluate quantitative and qualitative data. I have been on a reading binge to this effect to both appreciate the value of data science thinking and improve the skill set that I can share with students and some collaborators. Last week, I completed my latest adventure – ‘R for Data Science’ by Garrett Grolemund & Hadley Wickham.
The book was written in R Markdown, compiled using bookdown, and it is free online. Appropriately, it thus embodies both open science and data science in how it is written. Bookdown is a package for R that knits a set of R Markdown files together into a book. This matters for several reasons: the book is open, and you can clone it from GitHub; it is written using one of the most powerful open science/data science tools, i.e. R (language and environment); and in reading online and seeing the code, you also appreciate the trickle-down effects of ‘open data science’ thinking on writing, collaboration, and even publishing. This is all incredible, and it is a peek into a very different future of scholarly communication. The book is nearly complete. I read what was available because I teach soon. It confirmed and advanced my understanding and skill set for data science immensely. Here is a brief summary, without spoilers, of some of the dimensions I used to conclude that this book is fantastic.
Language & clarity
In reading R statistics, statistics, or data science books, one expects (or hopes) that, like literate coding, the prose will be accessible, pleasant, and appropriately pitched. This book was ideal in this respect. It was more formal than conversational but not too technical. The structure facilitated comprehension and reading because it was clear and logical. The visuals added a dimension of attractive clarity to the writing that was not just code, prose, R, or data viz. Many of the visuals were excellent heuristics. Some were a reminder to the reader of the big picture in data science whilst others highlighted a particular workflow/approach.
Example of big picture visual.
Example of mechanistic heuristic.
These were extremely useful. I could have even used more here and there, but in digging into the examples, I recognize that they were likely not always needed (and too much can be a bad thing too if poorly executed). The clarity was very high in almost every chapter of the book. I struggled with some of the more complex chapters (for me) such as relational data or some elements of the model building, but the flow kept me rolling through these even if some of the details eluded me.
The expectation that a data science or statistics book need only be read once is a challenging one. Many of the chapters in this book certainly satisfy that criterion, but it depends on your purpose. The more challenging chapters can be re-read for better comprehension, and one could also follow along and experiment in RStudio. Sometimes, it is nonetheless good to get the message from alternate sources that describe or explain it a little differently. In my R reading bonanza, some of the R-statistics books will not be revisited. My feeling for R for Data Science is that the clean style and direct writing do not obscure the message, and re-reads would likely be beneficial when needed. The message in many chapters is also unique, and even a brief revisit would highlight some of the handling elements and assumptions associated with best practices for data science.
Welcome to the tidyverse. Enough said to all who follow and read up within the R community. This universe is logical and feels natural. The forthcoming ggvis will help further align the grammar and semantics, because its code flows with pipes rather than the ‘+’ of ggplot2. Tibbles are a pleasant surprise. The wrangle readings satisfy. Tidiness is next to high-orderedness. Subscribing to the philosophy of readable code, consistent data structures, and logical workflows will promote better open science and reproducibility. This is never really explicitly stated, or if it was, I missed it. I suspect that this is a good thing. We can approach open science, open data, and more transparency in science from top-down or bottom-up efforts. By not repeatedly banging that drum per se but directly providing and describing the tools to handle data cleanly and consistently, this book provides a solid bottom-up pillar for the open science movement. Tidy data and readable code are shareable AND useable.
Finally, and aligned with this tools-first approach, the value of models and the epistemology of hypotheses are stated later in the book (Chapter 19). This worked for me in reading this book but would likely not work in teaching students. I like the hypothesis/model philosophy of ‘knowing data’ developed here. It was big data in origin, balanced, and emphasized bias and non-independence in exploring and testing models. What you can learn from a model also depends on how it is applied. This was well described. Split. Build. Think. Test. Know.
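The contrast mentioned above between pipes and the ‘+’ of ggplot2 can be illustrated with a minimal sketch. It assumes the tibble, dplyr, and ggplot2 packages are installed and uses the built-in mtcars dataset purely as a stand-in; the object names (cars, by_cyl, p) are mine and not taken from the book.

```r
library(tibble)
library(dplyr)
library(ggplot2)

# Convert the built-in data frame to a tibble, keeping the row names as a column.
cars <- as_tibble(mtcars, rownames = "model")

# Data handling flows left to right with the pipe (%>%).
by_cyl <- cars %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg), n = n())

# Plot layers, in contrast, are combined with `+`.
p <- ggplot(by_cyl, aes(x = factor(cyl), y = mean_mpg)) +
  geom_col() +
  labs(x = "Cylinders", y = "Mean miles per gallon")

print(by_cyl)
```

Printing by_cyl shows one row per cylinder class, and calling print(p) draws the bar chart.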
Your own personal variation would likely fit within a similar framework even with little data. I did wonder a bit how I could adapt some of the model-fitting ideas to the little data common in some of the ecological inquiries (solutions: (i) pilot field experiments can provide the training data, and (ii) resampling/bootstrapping using modelr to populate larger datasets for more independent EDA). The reminder to avoid repetition is repeated. Not ironically.
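As a rough illustration of the resampling idea with little data, here is a minimal sketch using bootstrap() from modelr. It assumes the modelr and purrr packages are installed. The 12-row subset of mtcars is only a hypothetical stand-in for a small pilot dataset, and the simple linear model is my choice for illustration, not an example from the book.

```r
library(modelr)
library(purrr)

set.seed(42)  # make the resampling reproducible

# Assumption: the first 12 rows of mtcars stand in for a small pilot dataset.
pilot <- mtcars[1:12, ]

# Draw 200 bootstrap resamples (rows sampled with replacement).
boot <- bootstrap(pilot, n = 200)

# Fit the same simple model to each resample and keep the slope for wt.
slopes <- map_dbl(boot$strap, function(s) {
  fit <- lm(mpg ~ wt, data = as.data.frame(s))
  coef(fit)[["wt"]]
})

# The spread of the slopes gives a rough sense of the uncertainty.
print(quantile(slopes, probs = c(0.025, 0.975), na.rm = TRUE))
```

Each element of boot$strap is a lightweight resample object, so it is converted with as.data.frame() before fitting.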
Many books do not need to adapt. Most R statistics books likely do. Packages are often a game changer. Grammar changes. Base R is a must-know of course, but streamlining and specifics often live in the libraries the community develops. This book is available for sale on Amazon, and I assume it will adapt but more slowly than the bookdown version. The pace of change in no way precludes reading the book now or revisiting it at some later point in time. Model building chapters, the basics of wrangling, functions, and iterations are solid reading that provide a skill set needed right now. The data viz and perhaps data transformation chapters are most likely to change soon. Read now and capture those skills but expect change. There are also some nice examples of intermediate to advanced tricks in plotting that reading now will provide. Certainly, this is the case in the iteration and model chapters too: they are good intermediate skill-building blocks for advanced coding in data science. This skill set is pretty darn awesome (PDA), and the strings chapter was also very rich in new skills and a launchpad to text mining with other packages (it inspired me to try it right after I finished reading the book). Skills abound.
Bottom line (of code) review for readers
high.returns <- c("basic.R.users", "intermediate.R.users")
tidy.data.science <- philosophy of consistent structures %>% visualize with models %>% share
There are many tools for open science (data management plans, slideshare, data repositories, GitHub, preprints, sharing meta-data, social media, blogs, and data publications). However, effective data science in R can also be a powerful ally if you include the final step, communicate (Chapters 23-25).