R / R News / Rstats / Statistics

Can random forest provide insights into how yeast grows?

by nsaunders · June 27, 2019

This article is originally published at https://nsaunders.wordpress.com

I’m not saying this is a good idea, but bear with me.

A recent question on Stack Overflow [r] asked why a random forest model was not working as expected. The questioner was working with data from an experiment in which yeast was grown under conditions where (a) the growth rate could be controlled and (b) one of 6 nutrients was limited. Their dataset consisted of 6 rows – one per nutrient – and several thousand columns, with values representing the activity (expression) of yeast genes. Could the expression values be used to predict the limiting nutrient?

The random forest was not working as expected: not one of the nutrients was correctly classified. I pointed out that with only one case for each outcome, this was to be expected – as the random forest algorithm samples a proportion of the rows, no correct predictions are likely in this case. As sometimes happens the question was promptly deleted, which was unfortunate as we could have further explored the problem.

A little web searching revealed that the dataset in question is quite well-known. It’s published in an article titled Coordination of Growth Rate, Cell Cycle, Stress Response, and Metabolic Activity in Yeast and has been transformed into a “tidy” format for use in tutorials, here and here.

As it turns out, there are 6 cases (rows) for each outcome (limiting nutrient), as experiments were performed at 6 different growth rates. Whilst random forests are good for “large p small n” problems, it may be that ~ 5 500 x 36 is pushing the limit somewhat. But you know – let’s just try it anyway.

As ever, code and a report for this blog post can be found at Github.

First, we obtain the tidied Brauer dataset. But in fact we want to “untidy” it again (make wide from long) because for random forest we want observations (n) as rows and variables (p – both predicted and predictors) in columns.

library(tidyverse)
library(randomForest)
library(randomForestExplainer)

brauer2007_tidy <- read_csv("https://4va.github.io/biodatasci/data/brauer2007_tidy.csv")

I’ll show the code for running a random forest without much explanation. I’ll assume you have a basic understanding of how the process works. That is: rows are sampled from the data and used to build many decision trees where variables predict either a continuous outcome variable (regression) or a categorical outcome (classification). The trees are then averaged (for regression values) or the majority vote is taken (for classification) to generate predictions. Individual predictors have an “importance” which is essentially some measure of how much worse the model would be were they not included.

Here we go then. A couple of notes. First, setting a seed for random sampling would not usually be used; it’s for reproducibility here. Second, unless you are specifically using the model to predict outcomes on unseen data, there’s no real need for splitting the data into test and training sets – the procedure is already performing a bootstrap by virtue of the out-of-bag error estimation.

brauer2007_tidy_rf1 <- brauer2007_tidy %>% 
  mutate(systematic_name = gsub("-", "minus", systematic_name), 
         nutrient = factor(nutrient)) %>% 
  select(systematic_name, nutrient, rate, expression) %>% 
  spread(systematic_name, expression, fill = 0) %>% 
  randomForest(nutrient ~ ., data = ., localImp = TRUE, importance = TRUE)

The model seems to have performed quite well:

Call:
 randomForest(formula = nutrient ~ ., data = ., localImp = TRUE,      importance = TRUE) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 74

        OOB estimate of  error rate: 5.56%
Confusion matrix:
          Ammonia Glucose Leucine Phosphate Sulfate Uracil class.error
Ammonia         6       0       0         0       0      0   0.0000000
Glucose         0       6       0         0       0      0   0.0000000
Leucine         0       1       5         0       0      0   0.1666667
Phosphate       0       0       0         6       0      0   0.0000000
Sulfate         0       0       0         0       6      0   0.0000000
Uracil          0       1       0         0       0      5   0.1666667

Let’s look at the expression of the top 20 genes (by variable importance), with respect to growth rate and limiting nutrient.

brauer2007_tidy %>% 
  filter(systematic_name %in% important_variables(brauer2007_tidy_rf1, k = 20)) %>% 
  ggplot(aes(rate, expression)) + 
  geom_line(aes(color = nutrient)) + 
  facet_wrap(~systematic_name, ncol = 5) + 
  scale_color_brewer(palette = "Set2")

Several of those look promising. Let’s select a gene where expression seems to be affected by each of the 6 limiting nutrients and search online resources such as the Saccharomyces Genome Database to see what’s known.

systematic_name	bp	nutrient	search_results
YHR208W	branched chain family amino acid biosynthesis*	leucine	Pathways – leucine biosynthesis
YKL216W	‘de novo’ pyrimidine base biosynthesis	uracil	URA1 – null mutant requires uracil
YLL055W	biological process unknown	sulfate	Cysteine transporter; null mutant absent utilization of sulfur source
YLR108C	biological process unknown	phosphate
YOR348C	proline catabolism*	ammonia	Proline permease; repressed in ammonia-grown cells
YOR374W	ethanol metabolism	glucose	Aldehyde dehydrogenase; expression is glucose-repressed

I’d say the Brauer data “makes sense” for five of those genes. Little is known about the sixth (YLR108C, affected by phosphate limitation).

In summary

Normally, a study like this would start with the genes – identify those that are differentially expressed and then think about the conditions under which differential expression was observed. Here, the process is reversed in a sense: we view the experimental condition as an outcome, rather than a parameter and ask whether it can be “predicted” by other observations.

So whilst not my first choice of method for this kind of study, and despite limited outcome data, random forest does seem to be generating some insights into which genes are affected by nutrient limitation. And at the end of the day: if a method provides insights, isn’t that what data science is for?

Thanks for visiting r-craft.org
This article is originally published at https://nsaunders.wordpress.com
Please visit source website for post related comments.

Cookie	Duration	Description
cookielawinfo-checkbox-analytics	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Analytics".
cookielawinfo-checkbox-functional	11 months	The cookie is set by GDPR cookie consent to record the user consent for the cookies in the category "Functional".
cookielawinfo-checkbox-necessary	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Necessary".
cookielawinfo-checkbox-others	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Other.
cookielawinfo-checkbox-performance	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Performance".
viewed_cookie_policy	11 months	The cookie is set by the GDPR Cookie Consent plugin and is used to store whether or not user has consented to the use of cookies. It does not store any personal data.

Can random forest provide insights into how yeast grows?

You may also like...

Categories

Can random forest provide insights into how yeast grows?

You may also like...

Using R and H2O to identify product anomalies during the manufacturing process.

RStudio Shortcuts and Tips

Plotting Maps in rbokeh

Categories