Parallel Processing Baseball Data with R and mlbgameday
This article was originally published at https://www.datascienceriot.com/
Just In Time For Baseball
The mlbgameday package has just reached the milestone of version 0.1.0.
It is designed to facilitate extract, transform and load (ETL) for MLBAM “Gameday” data. The package is optimized for parallel processing of data that may be larger than memory. Other packages in the R universe were built to perform statistics and visualizations on these data, but mlbgameday is concerned primarily with data collection. More uses of these data can be found in the pitchRx, openWAR, and baseballr packages.
Install from CRAN
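A minimal sketch of the install step, assuming the package is still listed on CRAN under the name mlbgameday (if it has since been archived, the first call will fail):

```r
# Install the released version from CRAN.
# Assumption: the package is still available there under this name.
install.packages("mlbgameday")

# Attach the package for the current session.
library(mlbgameday)
```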
The package’s internal functions are optimized to work with the doParallel package. By default, the R language will use one core of our CPU. The doParallel package enables us to use several cores, which will execute tasks simultaneously. In a standard regular season for all teams, the package has to process more than 2,400 individual files, which, depending on your system, can take quite some time. Parallel processing speeds this process up by several times, depending on how many processor cores we choose to use.
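To make the idea concrete, here is a hedged sketch of registering a parallel backend with doParallel. The toy loop is only a stand-in so the example runs offline; in practice the call to the data-collection function would go where the loop is. Leaving two cores free is our own choice, not a package requirement.

```r
library(doParallel)  # also attaches foreach and parallel

# Leave a couple of cores free for the rest of the system,
# but never drop below one worker (detectCores() can return NA).
no_cores <- max(1L, parallel::detectCores() - 2L, na.rm = TRUE)

# Start the worker processes and register them as the foreach backend.
cl <- makeCluster(no_cores)
registerDoParallel(cl)

# Stand-in task: each iteration may run on a different worker.
squares <- foreach(i = 1:4, .combine = c) %dopar% i^2
print(squares)

# Shut the workers down and fall back to sequential execution.
stopCluster(cl)
registerDoSEQ()
```

Any work dispatched through foreach between registerDoParallel() and stopCluster() is spread across the registered workers.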
Although the package is optimized for parallel processing, it will also work without registering a parallel backend. When only querying a single day’s data, a parallel backend may not provide much additional performance. However, parallel backends are suggested for larger data sets, where the process can be several times faster.
We can download and subset a small amount of data. As an example, we’ll look for Jake Arrieta’s no-hitter in 2016.
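The query might look something like the sketch below. It rests on several assumptions that are not stated in the text above: that the package exposes a get_payload() function taking start and end dates, that it returns a list containing pitch and atbat data frames, and that those tables share num and url columns and carry a pitcher_name column. It also needs network access to the Gameday source. The date is that of Arrieta’s no-hitter against Cincinnati, April 21, 2016.

```r
library(mlbgameday)
library(dplyr)

# Assumption: get_payload() downloads every game between start and end
# and returns a list of data frames, including `pitch` and `atbat`.
innings_df <- get_payload(start = "2016-04-21", end = "2016-04-21")

# Assumption: `num` and `url` identify an at-bat in both tables, and
# `pitcher_name` comes from the atbat table.
no_hit <- inner_join(innings_df$pitch, innings_df$atbat, by = c("num", "url")) %>%
  filter(pitcher_name == "Jake Arrieta")

# One row per pitch thrown by Arrieta in that game.
head(no_hit)
```

Since this is a single day of data, a parallel backend is unlikely to make much difference here.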