R News

qeML Example: Issues of Overfitting, Dimension Reduction Etc.

by matloff · November 22, 2023

This article is originally published at https://matloff.wordpress.com

What about variable selection? Which predictor variables/features should we use? No matter what anyone tells you, this is an unsolved problem. But there are lots of useful methods. See the qeML vignettes on feature selection and overfitting for detailed background on the issues involved.

We note at the outset what our concluding statement will be: Even a very simple, very clean-looking dataset like this one may be much more nuanced than it looks. Real life is not like those simplistic textbooks, eh?

Here I’ll discuss qeML::qeLeaveOut1Var. (I usually omit parentheses in referring to function names; see https://tinyurl.com/4hwr2vf.) The idea is simple: For each variable, find prediction accuracy with and without that variable.

Let’s try it on the famous NYC taxi trip data, included (with modification) in qeML. First, note that qeML prediction calls automatically split the data into training and test sets, and compute test accuracy (mean absolute prediction error or overall misclassification error) on the latter.

The call qeLeaveOut1Var(nyctaxi,’tripTime’,’qeLin’,10) predicts trip time using qeML‘s linear model. (The latter wraps lm, but adds some things and sets the standard qeML call form..) Since the test set is random (as is our data), we’ll do 10 repetitions and average the results. Instead of qeLin, we could have used any other qeML prediction function, e.g. qeKNN for k-Nearest Neighbors.

> qeLeaveOut1Var(nyctaxi,'tripTime','qeLin',10)
         full trip_distance  PULocationID  DOLocationID     DayOfWeek
     238.4611      353.2409      253.2761      246.3186      239.2277
There were 50 or more warnings (use warnings() to see the first 50)

We’ll discuss the warnings shortly, but not surprisingly, trip distance is the most important variable. The pickup and dropoff locations also seem to have predictive value, though day of the week may not.

But let’s take a closer look. There were 224 pickup locations. (run levels(nyctaxi$PULocationID) to see this). That’s 223 dummy (“one-hot”) variables; are some more predictive than others? To explore that in qeLeaveOut1Var, we could make the dummies explicit, so each dummy is removed one at a time:

nyct <- factorsToDummies(nyctaxi,omitLast=TRUE)

This function is actually from the regtools package, included in qeML. Then we could try, say,

nyct <- as.data.frame(nyct)
qeLeaveOut1Var(nyct,'tripTime','qeLin',10)

But with so many dummies, this would take a long time to run. We could directly look at mean trip times for each pickup location to get at least some idea of their individual predictive power,

tapply(nyctaxi$tripTime,nyctaxi$PULocationID,mean)
tapply(nyctaxi$tripTime,nyctaxi$PULocationID,length)

Many locations have very little data, so we’d have to deal with that. Note too the possibility of overfitting.

> dim(nyct)
[1] 10000  479

An old rule of thumb is to use under sqrt(n) variables, 100 here. Just a guide, but much less than 479. (Note: Even our analysis using the original factors still converts to dummies internally; nyctaxi has 4 columns, but lm will expand them as in nyct.)

We may wish to delete pickup location entirely. Or, possibly use PCA for dimension reduction,

z <- qePCA(nyctaxi,'tripTime','qeLin',pcaProp=0.75)

This qeML call says, “Compute PCA on the predictors, retaining enough of them for 0.75 of the total variance, and then run qeLin on the resulting PCs.”

But…remember those warning messages? Running warnings() we see messages like “6 rows removed from test set, due to new factor levels.” The problem is that, in dividing the data into training and test sets, some pickup or dropoff locations appeared only in the latter, thus impossible to predict. So, many of the columns in the training set are all 0s, thus 0 variance, thus problems with PCA. We then might run qeML::constCols to find out which columns have 0 variance, then delete those, and try qePCA again.

And we haven’t even mentioned using, say, qeLASSO or qeXGBoost instead of qeLin, etc. But the point is clear: Even a very simple, very clean-looking application like this one may be much more nuanced than it looks.

Thanks for visiting r-craft.org
This article is originally published at https://matloff.wordpress.com
Please visit source website for post related comments.

Cookie	Duration	Description
cookielawinfo-checkbox-analytics	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Analytics".
cookielawinfo-checkbox-functional	11 months	The cookie is set by GDPR cookie consent to record the user consent for the cookies in the category "Functional".
cookielawinfo-checkbox-necessary	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Necessary".
cookielawinfo-checkbox-others	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Other.
cookielawinfo-checkbox-performance	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Performance".
viewed_cookie_policy	11 months	The cookie is set by the GDPR Cookie Consent plugin and is used to store whether or not user has consented to the use of cookies. It does not store any personal data.

qeML Example: Issues of Overfitting, Dimension Reduction Etc.

You may also like...

Categories

qeML Example: Issues of Overfitting, Dimension Reduction Etc.

You may also like...

R Package Integration with Modern Reusable C++ Code Using Rcpp – Part 3

Where to find the worst weather in the US

rbokeh Version 0.5.0 Released

Categories