Track changes in data with the lumberjack %>>%
This article is originally published at http://www.markvanderloo.eu
So you are using a %>% pipeline to have data treated by different functions in R. For example, you may be imputing some missing values using the simputation package. Let us first load the only realistic dataset in R:
> data(retailers, package="validate")
> head(retailers, 3)
  size incl.prob staff turnover other.rev total.rev staff.costs total.costs profit vat
1  sc0      0.02    75       NA        NA      1130          NA       18915  20045  NA
2  sc3      0.14     9     1607        NA      1607         131        1544     63  NA
3  sc3      0.14    NA     6886       -33      6919         324        6493    426  NA
This data is dirty: it has missing values and is full of errors. Let us do some imputations with simputation.
> library(magrittr)
> library(simputation)
> out <- retailers %>%
+   impute_lm(other.rev ~ turnover) %>%
+   impute_median(other.rev ~ size)
>
> head(out,3)
  size incl.prob staff turnover other.rev total.rev staff.costs total.costs profit vat
1  sc0      0.02    75       NA  6114.775      1130          NA       18915  20045  NA
2  sc3      0.14     9     1607  5427.113      1607         131        1544     63  NA
3  sc3      0.14    NA     6886   -33.000      6919         324        6493    426  NA
Ok, cool, we know all that. But what if you'd like to know what value was imputed with which method? That's where the lumberjack comes in.
The lumberjack operator is a 'pipe' operator that allows you to track changes in data.
> library(lumberjack)
> retailers$id <- seq_len(nrow(retailers))
> out <- retailers %>>%
+   start_log(log=cellwise$new(key="id")) %>>%
+   impute_lm(other.rev ~ turnover) %>>%
+   impute_median(other.rev ~ size) %>>%
+   dump_log(stop=TRUE)
Dumped a log at cellwise.csv
>
> read.csv("cellwise.csv") %>>% dplyr::arrange(key) %>>% head(3)
  step                     time                      expression key  variable old      new
1    2 2017-06-23 21:11:05 CEST impute_median(other.rev ~ size)   1 other.rev  NA 6114.775
2    1 2017-06-23 21:11:05 CEST impute_lm(other.rev ~ turnover)   2 other.rev  NA 5427.113
3    1 2017-06-23 21:11:05 CEST impute_lm(other.rev ~ turnover)   6 other.rev  NA 6341.683
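To make the idea of a cell-wise log concrete, here is a minimal base R sketch that compares two versions of a data frame and lists the cells that differ. The helper `cell_diff()` is hypothetical and is not how lumberjack is implemented; it assumes both versions have the same rows, columns and column types, and it uses a small hand-typed excerpt of the values shown above.

```r
# Hypothetical helper (not part of lumberjack): list the cells that differ
# between two versions of a data frame with identical rows and columns.
cell_diff <- function(old, new, key) {
  stopifnot(identical(dim(old), dim(new)), identical(names(old), names(new)))
  vars <- setdiff(names(old), key)
  changes <- lapply(vars, function(v) {
    a <- old[[v]]
    b <- new[[v]]
    # A cell changed if exactly one side is NA, or both are present and differ.
    changed <- xor(is.na(a), is.na(b)) | (!is.na(a) & !is.na(b) & a != b)
    data.frame(key = old[[key]][changed], variable = rep(v, sum(changed)),
               old = a[changed], new = b[changed], stringsAsFactors = FALSE)
  })
  do.call(rbind, changes)
}

# Excerpt of the first three records, before and after imputation.
before <- data.frame(id = 1:3,
                     turnover = c(NA, 1607, 6886),
                     other.rev = c(NA, NA, -33))
after <- before
after$other.rev[1:2] <- c(6114.775, 5427.113)

changes <- cell_diff(before, after, key = "id")
print(changes)
```

Unlike the real log, this sketch only sees the first and last version of the data, so it cannot say which step in the pipeline produced each change. Recording that is what the logger adds.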
So, to track changes we only need to switch from %>% to %>>% and add the start_log() and dump_log() function calls in the data pipeline. (To be sure: it works with any function, not only with simputation.) The package is on CRAN now; please see the introductory vignette for more examples and ways to customize it.
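Once a log has been dumped, ordinary data frame tools can answer the question of which method imputed which value. The sketch below uses only base R. To stay self-contained it retypes the three log rows printed above (without the time column) instead of reading cellwise.csv, and the name `changes_log` is only illustrative.

```r
# The three log rows shown earlier, typed in by hand;
# in practice this would be read.csv("cellwise.csv").
changes_log <- data.frame(
  step = c(2, 1, 1),
  expression = c("impute_median(other.rev ~ size)",
                 "impute_lm(other.rev ~ turnover)",
                 "impute_lm(other.rev ~ turnover)"),
  key = c(1, 2, 6),
  variable = "other.rev",
  old = NA_real_,
  new = c(6114.775, 5427.113, 6341.683),
  stringsAsFactors = FALSE
)

# Count the changed cells per expression and variable.
counts <- as.data.frame(
  table(expression = changes_log$expression, variable = changes_log$variable),
  responseName = "n_cells", stringsAsFactors = FALSE
)
print(counts)

# Look up how other.rev was imputed for record 1.
how <- changes_log$expression[changes_log$key == 1 &
                              changes_log$variable == "other.rev"]
print(how)
```

For record 1 the lookup returns the median imputation: turnover is missing for that record, so the linear model could not fill in other.rev and the second step did.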
There are many ways to track changes in data. That is why the lumberjack is completely extensible. The package comes with a few loggers, but users or package authors are invited to write their own. Please see the extending lumberjack vignette for instructions.
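As a rough sketch of what a custom logger could look like, the example below defines a reference class that counts how many cells each step changed. It assumes the logger interface as I understand it from the extending lumberjack vignette: a reference object with an `$add(meta, input, output)` method, where `meta$src` holds the expression as a string, and a `$dump()` method. Please check the vignette for the exact contract. The class name `cell_counter` and its fields are hypothetical, and the logger is exercised here by calling its methods directly rather than through lumberjack.

```r
library(methods)  # provides setRefClass()

# Hypothetical logger: counts the cells changed by each step.
cell_counter <- setRefClass("cell_counter",
  fields = list(steps = "character", n_changed = "integer"),
  methods = list(
    add = function(meta, input, output) {
      # Assumes input and output are data frames with the same shape
      # and comparable column types.
      n <- sum(mapply(function(a, b) {
        sum(xor(is.na(a), is.na(b)) | (!is.na(a) & !is.na(b) & a != b))
      }, input, output))
      steps <<- c(steps, meta$src)
      n_changed <<- c(n_changed, as.integer(n))
    },
    dump = function(...) {
      print(data.frame(expression = steps, n_changed = n_changed))
    }
  )
)

# Simulate one pipeline step by hand (assumed meta structure: list(src = ...)).
logger <- cell_counter$new()
before <- data.frame(id = 1:3, other.rev = c(NA, NA, -33))
after <- before
after$other.rev[2] <- 5427.113
logger$add(meta = list(src = "impute_lm(other.rev ~ turnover)"),
           input = before, output = after)
logger$dump()
```

If the assumed interface holds, such an object would be passed to the pipeline in the same way as the cellwise logger, that is, as start_log(log = cell_counter$new()).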
If this post got you interested, please install the package using
install.packages("lumberjack")
You can get started with the introductory vignette or even just use the lumberjack operator %>>% as a (close) replacement of the %>% operator.
As always, I am open to suggestions and comments, for example through the package's GitHub page.
And finally, here's a picture of a lumberjack smoking a pipe.
 It really should be called a function composition operator, but potetoes/potatoes.