Do basic R operations much faster in bash [Slightly off-topic]
This article was originally published at https://rcrastinate.blogspot.com/
R is great, and you can do a LOT of stuff with it.
However, sometimes you want to do really basic stuff with huge files or with a lot of files. At work, I have to do that a lot because I mostly deal with language data that often needs some pre-processing.
Most of these operations run much, much faster at the level of the operating system (preferably in Bash on Linux or another Unix-like system such as macOS). And since R tries to load everything into working memory, these commands might also help you work with files that are too big for your RAM.
This blog post is a kind of cheat sheet to help me remember some of the Bash commands that have proved very useful to me. (Most of them are quite basic for an advanced Linux or Unix user, I guess.)
Disclaimer: Most of these calls were adapted from different StackExchange questions. There are really lots of very helpful posts. Thanks to the community!
Superfast subset of a tabulated text file (it might also be gzipped!):
[z]grep -E <regex pattern> <from file> > <to file>
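<regex pattern> could include your separators. If <from file> is tab-separated, use -P instead of -E: with Perl-compatible regular expressions you can write a tab as \t. (-P is a GNU grep option, so it may not be available on every system, and I am not sure whether it works with zgrep.)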
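Here is a minimal sketch of such a subset, in case a worked example helps. The file names and the sample rows are made up for illustration. Instead of -P, it puts a literal tab into an -E pattern, which should also work where -P is not available, and it decompresses the gzipped copy with gzip -dc, which is roughly what zgrep does for you.

```shell
#!/usr/bin/env bash
# Hypothetical sample data: word <TAB> part of speech <TAB> frequency
workdir=$(mktemp -d)
printf 'house\tNOUN\t120\nrun\tVERB\t95\nwalk\tVERB\t40\ntree\tNOUN\t77\n' > "$workdir/corpus.tsv"

# A literal tab character, so the pattern can include the separator
tab=$(printf '\t')
pattern="^[^${tab}]+${tab}VERB${tab}"

# Keep only the rows whose second column is exactly VERB
grep -E "$pattern" "$workdir/corpus.tsv" > "$workdir/verbs.tsv"

# Same filter on a gzipped copy, decompressing on the fly
# (zgrep is a convenience wrapper around this kind of pipeline)
gzip -c "$workdir/corpus.tsv" > "$workdir/corpus.tsv.gz"
gzip -dc "$workdir/corpus.tsv.gz" | grep -E "$pattern" > "$workdir/verbs_from_gz.tsv"

# Show the rows that were kept
cat "$workdir/verbs.tsv"
```

Anchoring the pattern at the start of the line and spelling out the separators keeps grep from matching VERB inside some other column.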
Superfast extraction of the first column from a tab-separated file:
cut -f1 <from file> > <to file>
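Just replace <from file> with * if you want to extract the first column from every file in the current directory and write them all into the same <to file>. In that case, put <to file> in a different directory so that * does not pick it up as an input file.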
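A small sketch of the multi-file case, assuming two made-up input files in their own directory. The output goes to a separate directory so the glob cannot match it.

```shell
#!/usr/bin/env bash
# Hypothetical layout: inputs in one directory, output in another
workdir=$(mktemp -d)
mkdir "$workdir/in" "$workdir/out"
printf 'a\t1\nb\t2\n' > "$workdir/in/part1.tsv"
printf 'c\t3\nd\t4\n' > "$workdir/in/part2.tsv"

# cut splits on tabs by default; -f1 keeps the first field of every line.
# The glob expands to all input files, which are processed one after another.
cut -f1 "$workdir"/in/* > "$workdir/out/first_column.txt"

# Show the combined first column
cat "$workdir/out/first_column.txt"
```

If your separator is not a tab, cut takes it via -d, for example -d, for comma-separated files.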
Write unique rows of a file into a new file:
sort <from file> | uniq > <to file>
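Yes, there is no "e" after uniq! You have to sort <from file> first because uniq only removes duplicate lines that are directly adjacent. sort -u <from file> > <to file> gives the same result in one step.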
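To see why the sort matters, here is a tiny sketch with a made-up file in which the duplicate rows are never next to each other. Without sorting, uniq removes nothing.

```shell
#!/usr/bin/env bash
# Hypothetical input: duplicates exist, but none of them are adjacent
workdir=$(mktemp -d)
printf 'b\na\nb\na\nc\n' > "$workdir/rows.txt"

# uniq alone only collapses adjacent duplicates, so all lines survive here
unsorted_count=$(uniq "$workdir/rows.txt" | wc -l | tr -d ' ')

# Sorting first puts equal lines next to each other, then uniq can drop them
sort "$workdir/rows.txt" | uniq > "$workdir/unique_rows.txt"

# sort -u does both steps at once
sort -u "$workdir/rows.txt" > "$workdir/unique_rows_u.txt"

echo "lines after uniq without sort: $unsorted_count"
cat "$workdir/unique_rows.txt"
```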
Get a list of files from a directory really fast. This line goes into an R script:
files <- system(paste("ls -f", shQuote(source.path)), intern = TRUE)
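I used this to get a list of 1.6 million file names. It was A LOT faster than the built-in R function dir(). The -f flag tells ls not to sort, which is presumably where much of the speed comes from. It also lists hidden entries, including . and .., so you may want to drop those afterwards.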
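On the shell side, the call behaves roughly like the sketch below. The directory and file names are made up. Because ls -f does not sort and also returns . and .., the sketch filters those two entries out with grep before storing the listing.

```shell
#!/usr/bin/env bash
# Hypothetical directory with a few files
dir=$(mktemp -d)
touch "$dir/b.txt" "$dir/a.txt" "$dir/c.txt"
listing=$(mktemp)

# -f: list entries in directory order (no sorting), hidden entries included.
# grep -vxF drops lines that are exactly "." or ".." (fixed strings, whole line).
ls -f "$dir" | grep -vxF -e . -e .. > "$listing"

# The order of the names is not guaranteed, so sort only if you need to
cat "$listing"
```

In R, the same clean-up could be done after the system() call, for example with setdiff() on the returned vector.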
To be continued.