Naive Bayes Classification in R (Part 2)

by S. Walsh · February 18, 2017

This article was originally published at https://sw23993.wordpress.com

Following on from Part 1 of this two-part post, I would now like to explain how the Naive Bayes classifier works before applying it to a classification problem involving breast cancer data. The dataset is sourced from Matjaz Zwitter and Milan Soklic from the Institute of Oncology, University Medical Center in Ljubljana, Slovenia (formerly Yugoslavia) and the attributes are as follows:

age: a series of ranges from 20-29 to 70-79

menopause: whether a patient was pre- or post-menopausal upon diagnosis

tumor.size: the largest diameter (mm) of excised tumor

inv.nodes: the number of axillary lymph nodes which contained metastatic breast cancer

node.caps: whether metastatic cancer was contained by the lymph node capsule

deg.malign: the histological grade of the tumor (1-3 with 3 = highly abnormal cells)

breast: which breast the cancer occurred in

breast.quad: the region of the breast in which the cancer occurred (four quadrants, with nipple = central)

irradiat: whether the patient underwent radiation therapy

Some preprocessing of these data was required as there were some NAs (9 in total). I imputed predicted values using separate Naive Bayes classifiers. The objective here is to use these attributes to predict, with relatively high accuracy, whether breast cancer is likely to recur in patients who were previously diagnosed with and treated for the disease. We can pursue this objective using the Naive Bayes classification method.

Naive Bayes Classification

Below is Bayes’ theorem:

P(A | B) = P(A) * P(B | A) / P(B)

which can be derived from the general multiplication formula for AND events, written both ways round:

P(A and B) = P(A) * P(B | A)

P(A and B) = P(B) * P(A | B)

Setting the two right-hand sides equal and dividing by P(B) gives:

P(A | B) = P(A) * P(B | A) / P(B)

If I replace the letters with meaningful words, as I have done throughout, Bayes’ formula becomes:

P(outcome | evidence) = P(outcome) * P(evidence | outcome) / P(evidence)
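As a quick numerical check of the formula, here is a minimal sketch in R. The prior uses the class counts of the dataset (201 no-recurrence, 85 recurrence); the two likelihood values are made up purely for illustration and are not estimated from the data.

```r
# Prior probabilities from the class counts in the dataset
prior <- c(no_recurrence = 201, recurrence = 85) / 286

# HYPOTHETICAL likelihoods P(evidence | outcome) for one piece of evidence
# (illustrative values only, not estimated from the data)
likelihood <- c(no_recurrence = 0.20, recurrence = 0.50)

# P(evidence) via the law of total probability
p_evidence <- sum(prior * likelihood)

# Bayes' theorem: P(outcome | evidence)
posterior <- prior * likelihood / p_evidence
print(round(posterior, 3))
```

With these invented likelihoods, the evidence is more probable under recurrence, so the posterior for recurrence ends up higher than its prior.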

It is with this formula that the Naive Bayes classifier calculates conditional probabilities for a class outcome given prior information or evidence (our attributes in this case). It is termed “naive” because we assume that the attributes are independent of one another given the class, when in reality they may be dependent in some way. For the breast cancer dataset we will be working with, some attributes are clearly dependent, such as age and menopause status, while others may or may not be dependent, such as histological grade and tumor size.

This assumption allows us to calculate the probability of the evidence given an outcome by multiplying together the individual conditional probabilities of each piece of evidence, using the simple multiplication rule for independent AND events. Note that this naivety produces probabilities that are not strictly correct when attributes are in fact dependent, but they are usually a good enough approximation for the purposes of classification, where only the most probable class matters. Indeed, the Naive Bayes classifier has proven effective in practice and is commonly deployed in email spam filters.
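To make the independence assumption concrete, the following sketch scores a single hypothetical patient by multiplying per-attribute conditional probabilities and then normalising. All conditional probabilities here are invented for illustration; in practice they would be estimated from the data.

```r
# Prior probabilities from the class counts in the dataset
prior <- c(no_recurrence = 201, recurrence = 85) / 286

# HYPOTHETICAL values of P(attribute value | outcome) for one patient
p_node_caps_yes <- c(no_recurrence = 0.10, recurrence = 0.40)
p_deg_malign_3  <- c(no_recurrence = 0.20, recurrence = 0.55)
p_irradiat_yes  <- c(no_recurrence = 0.20, recurrence = 0.35)

# Naive assumption: P(evidence | outcome) is the product of the
# per-attribute conditional probabilities
likelihood <- p_node_caps_yes * p_deg_malign_3 * p_irradiat_yes

# Unnormalised scores; their sum plays the role of P(evidence)
score <- prior * likelihood
posterior <- score / sum(score)

# The predicted class is the one with the largest posterior
predicted <- names(which.max(posterior))
print(predicted)
```

Because the denominator is the same for every class, the predicted class can be read off the unnormalised scores alone; normalising only matters if the probabilities themselves are of interest.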

Calculating Conditional Probabilities

Conditional probabilities are fundamental to the working of the Naive Bayes formula. Tables of conditional probabilities must be created in order to obtain values to use in the Naive Bayes algorithm. The R package e1071 provides the naiveBayes() function for fitting a Naive Bayes model. Read in the dataset sourced via the hyperlink at the start of this post or see the comments below for GitHub access. Note that some cleaning was carried out in this example, but the original will work fine as long as strings are read in as factors (since R 4.0, read.csv() no longer does this by default, so pass stringsAsFactors = TRUE).

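library(e1071)

# Read strings as factors so categorical attributes are treated as such
breast_cancer <- read.csv("breast_cancer_df.csv", stringsAsFactors = TRUE)

# Fit the model: class is the outcome, all other columns are attributes
model <- naiveBayes(class ~ ., data = breast_cancer)

class(model)    # "naiveBayes"
summary(model)  # lists the components stored in the model
print(model)    # a-priori probabilities and conditional probability tables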

The model has class “naiveBayes” and the summary tells us that the model provides a-priori probabilities of no-recurrence and recurrence events as well as conditional probability tables across all attributes. To examine the conditional probability tables just print the model.
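Individual pieces of the fitted model can also be pulled out directly rather than printing everything. The sketch below uses a small made-up data frame as a stand-in for the real data, so the numbers mean nothing; the point is only to show the `apriori` and `tables` components of a naiveBayes object.

```r
library(e1071)

# Toy stand-in for the breast cancer data (values are made up)
toy <- data.frame(
  node.caps = factor(c("no", "no", "yes", "no", "yes", "yes", "yes", "no")),
  irradiat  = factor(c("no", "yes", "no", "no", "yes", "yes", "no", "yes")),
  class     = factor(c("no-recurrence", "no-recurrence", "recurrence",
                       "no-recurrence", "recurrence", "recurrence",
                       "no-recurrence", "no-recurrence"))
)

toy_model <- naiveBayes(class ~ ., data = toy)

# Class counts from which the a-priori probabilities are derived
print(toy_model$apriori)

# P(node.caps | class): one row per class, each row sums to 1
print(toy_model$tables$node.caps)
```

On the real model, the same pattern would be model$tables$menopause, model$tables$irradiat and so on, assuming the column names match the attribute list at the top of this post.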

One of our tasks for this assignment was to create code which would give us the same conditional probabilities as those output by the naiveBayes() function. I went about this in the following way:

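# Cross-tabulate each attribute (columns 1-9) against the class (column 10)
tbl_list <- sapply(breast_cancer[-10], table, breast_cancer[ , 10])
# Transpose so that rows are classes and columns are attribute levels
tbl_list <- lapply(tbl_list, t)

# Divide each row by its total to get P(attribute level | class)
cond_probs <- sapply(tbl_list, function(tbl) {
  apply(tbl, 1, function(row) {
    row / sum(row) }) })

# apply() returns the rows as columns, so transpose back
cond_probs <- lapply(cond_probs, t)

print(cond_probs)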

The first line of code uses the sapply function to loop over all attribute variables in the dataset and create tables against the class attribute. I then used the lapply function to transpose all tables in the list so the rows represented the class attribute.

To calculate conditional probabilities for each element in the tables, I used sapply, apply and anonymous functions. I had to transpose the output in order to get the same structure as the naiveBayes model output. Finally, I printed out my calculated conditional probabilities and compared them with the naiveBayes output to validate the calculations.
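For comparison, the same row-wise proportions can be obtained more compactly with base R's prop.table(), which divides each row by its total when margin = 1. This is an alternative sketch rather than the approach used above, and the toy data frame is made up so that the block runs on its own.

```r
# Toy data (made-up values); the last column plays the role of `class`
toy <- data.frame(
  node.caps = factor(c("no", "no", "yes", "no", "yes", "yes", "yes", "no")),
  irradiat  = factor(c("no", "yes", "no", "no", "yes", "yes", "no", "yes")),
  class     = factor(c("no-recurrence", "no-recurrence", "recurrence",
                       "no-recurrence", "recurrence", "recurrence",
                       "no-recurrence", "no-recurrence"))
)

# Every column except the class
predictors <- toy[setdiff(names(toy), "class")]

# One table per attribute: rows = class, columns = attribute level,
# with each row divided by its row total
cond_probs_alt <- lapply(predictors, function(x) {
  prop.table(table(toy$class, x), margin = 1)
})

print(cond_probs_alt$node.caps)
```

Putting the class first in table() makes it the row dimension straight away, so no transposing is needed.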

Applying the Naive Bayes Classifier

So I’ve explained (hopefully reasonably well) how the Naive Bayes classifier works based on the fundamental rules of probability. Now it’s time to apply the model to the data. This is easily done in R by using the predict() function.

preds <- predict(model, newdata = breast_cancer)

You will see that I have trained the model using the entire dataset and then made predictions on the same dataset. In our assignment we were asked to train the model and test it on the dataset, treating the dataset as an unlabeled test set. This is unconventional as the training set and test set are then identical but I believe the assignment was intended to just test our understanding of how the method works. In practice, one would use a training set for the model to learn from and a test set to assess model accuracy.
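For readers who want to try the more conventional workflow, here is a minimal sketch of a train/test split. The data are randomly generated stand-ins (so the accuracy is meaningless), and the 70/30 split ratio and the seed are arbitrary choices of mine, not part of the assignment.

```r
library(e1071)

set.seed(42)  # arbitrary seed so the split is reproducible

# Synthetic stand-in data (randomly generated, not the real dataset)
n <- 200
sim <- data.frame(
  node.caps = factor(sample(c("no", "yes"), n, replace = TRUE)),
  irradiat  = factor(sample(c("no", "yes"), n, replace = TRUE)),
  class     = factor(sample(c("no-recurrence", "recurrence"), n,
                            replace = TRUE, prob = c(0.7, 0.3)))
)

# Hold out 30% of the rows for testing (the ratio is an assumption)
train_idx <- sample(seq_len(n), size = round(0.7 * n))
train <- sim[train_idx, ]
test  <- sim[-train_idx, ]

# Learn from the training rows only
fit <- naiveBayes(class ~ ., data = train)

# Predicted labels and posterior probabilities for the unseen rows
test_preds <- predict(fit, newdata = test)
test_probs <- predict(fit, newdata = test, type = "raw")

# Accuracy on the held-out rows
print(mean(test_preds == test$class))
```

Passing type = "raw" to predict() returns the posterior probability of each class instead of just the winning label, which is handy for seeing how confident the model is.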

If one outcome class is more abundant in the dataset, as is the case with the breast cancer data (no-recurrence: 201, recurrence: 85), the data are imbalanced. This is acceptable for a generative model such as Naive Bayes, because the class frequencies are used to estimate the prior probabilities and you want the model to reflect how often each outcome occurs in the real world. Manipulating the data to reduce the skew would distort those priors. There is also the decision of whether to apply Laplace smoothing to the model. Laplace smoothing, in effect, adds imaginary observations to the dataset so that an attribute value never observed with a given class does not receive a probability of exactly zero, which we cannot know to be true.
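The arithmetic behind Laplace smoothing can be sketched in a few lines of base R. The counts below are made up to include a zero cell. In e1071, smoothing is switched on through the laplace argument of naiveBayes(), which defaults to 0 (no smoothing).

```r
# Made-up counts: rows = class, columns = levels of one attribute
counts <- matrix(c(4, 1,
                   0, 3),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(class = c("no-recurrence", "recurrence"),
                                 node.caps = c("no", "yes")))

# Unsmoothed estimate: P(node.caps = "no" | recurrence) is exactly 0
unsmoothed <- counts / rowSums(counts)

# Add-one (Laplace) smoothing: add 1 to every cell and the number of
# attribute levels to each row total, so each row still sums to 1
alpha <- 1
smoothed <- (counts + alpha) / (rowSums(counts) + alpha * ncol(counts))

print(unsmoothed)
print(smoothed)
```

Without smoothing, a single zero in the product of conditional probabilities would force the whole posterior for that class to zero, however strongly the other attributes point towards it.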

Cross-tabulating the predictions against the true classes gives a confusion matrix, from which a model accuracy of 75% can be calculated:

conf_matrix <- table(preds, breast_cancer$class)
print(conf_matrix)
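Accuracy is the share of observations on the diagonal of the confusion matrix. The sketch below shows the calculation on a handful of made-up labels; these are not the results from the breast cancer model.

```r
# Made-up labels for illustration (not the post's actual results)
actual    <- factor(c("no", "no", "no", "yes", "yes", "no", "yes", "no"))
predicted <- factor(c("no", "no", "yes", "yes", "no", "no", "yes", "yes"))

# Rows = predicted class, columns = actual class
cm <- table(predicted, actual)

# Correct predictions sit on the diagonal
accuracy <- sum(diag(cm)) / sum(cm)
print(accuracy)
```

This relies on both factors having the same levels in the same order, so that the diagonal really does line up predicted and actual classes. For the model above, the equivalent would be sum(diag(conf_matrix)) / sum(conf_matrix).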

This post has only scratched the surface of classification methods in machine learning, but it has been a useful revision exercise for me and may perhaps help others who are new to the Naive Bayes classifier. Please feel free to comment and correct any errors that may be present.

Featured image By Dennis Hill from The OC, So. Cal. – misc 24, CC BY 2.0, https://commons.wikimedia.org/w/index.php?curid=54704175
