Twitter sentiment analysis based on affective lexicons with R
This article is originally published at https://www.analyzecore.com
We will study another dictionary-based approach to Twitter sentiment analysis, one based on affective lexicons. Let's continue to dig into tweets. After reviewing how to count positive, negative and neutral tweets in the previous post, I came across another great idea. Suppose a positive or negative label is not enough and we want to understand the degree of positivity or negativity.
For example, the word “good” might have a rating of 4 points, while “perfect” has 6. In this way we can try to measure the degree of satisfaction or opinion in tweets and plot a chart with the trend.
We need a different dictionary for this task, namely one that assigns a rating to each word. We can create it ourselves or use the results of existing research on affective ratings (e.g. here).
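As a minimal sketch, such a dictionary can be a simple two-column table. The column names `Word` and `Rating` follow the ones assumed in the scoring code later in this post; the words and values are made up for illustration and are not taken from any published lexicon.

```r
# Toy affective dictionary; the ratings are illustrative assumptions,
# not values from a real lexicon.
valence <- data.frame(
  Word   = c("good", "perfect", "bad"),
  Rating = c(4, 6, 1.5),
  stringsAsFactors = FALSE
)

# Look up the rating of each word of a cleaned, lower-cased tweet.
words <- c("this", "phone", "is", "perfect")
ratings <- valence$Rating[match(words, valence$Word)]

# Words missing from the dictionary come back as NA.
ratings
```

Only “perfect” is found here, so the other three positions are `NA` and have to be dropped before a tweet is rated.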
And of course, our algorithm should work around Twitter’s API limitations by accumulating historical data. This approach was described in the previous post.
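The accumulation idea can be sketched with two small data frames standing in for the stored tweets and a fresh extraction. The data are invented, and only base R's `rbind` and `duplicated` are used.

```r
# Tweets already saved from earlier runs (invented sample data).
stack <- data.frame(
  text    = c("good phone", "bad battery"),
  created = c("2015-01-01", "2015-01-02"),
  stringsAsFactors = FALSE
)

# A fresh extraction that overlaps with the stored tweets.
fresh <- data.frame(
  text    = c("bad battery", "perfect screen"),
  created = c("2015-01-02", "2015-01-03"),
  stringsAsFactors = FALSE
)

# Append the new tweets and keep the first copy of each text.
stack <- rbind(stack, fresh)
stack <- stack[!duplicated(stack$text), ]

nrow(stack)
```

Repeating this on every run lets the stored set grow beyond what a single API request returns.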
Note that I will rate each tweet by the average rating of the dictionary words it contains. For example, if we find “good” (4 points) and “perfect” (6 points) in a tweet, it is evaluated as (4+6)/2=5. This avoids the effect of several negative words adding up to a higher total: a tweet with one “good” (4 points) should be rated higher than a tweet with three “bad” words (1.5 points each), even though the latter sums to 4.5.
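The difference between summing and averaging can be checked directly in R. The ratings are the illustrative values from the example above, not values from a real dictionary.

```r
# Ratings of the matched words in two hypothetical tweets.
good_tweet <- c(4)              # one "good"
bad_tweet  <- c(1.5, 1.5, 1.5)  # three "bad"

# Summing rewards the tweet with more rated words.
sum(good_tweet)   # 4
sum(bad_tweet)    # 4.5

# Averaging keeps the tweet with "good" on top.
mean(good_tweet)  # 4
mean(bad_tweet)   # 1.5

# The tweet containing "good" and "perfect":
mean(c(4, 6))     # 5
```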
Let’s start. We need to create a Twitter Application (https://apps.twitter.com/) in order to get access to Twitter’s API. This gives us a Consumer Key and a Consumer Secret. And finally, our code in R:
#connect all libraries
library(twitteR)
library(ROAuth)
library(plyr)
library(dplyr)
library(stringr)
library(ggplot2)

#connect to API
download.file(url='http://curl.haxx.se/ca/cacert.pem', destfile='cacert.pem')
reqURL <- 'https://api.twitter.com/oauth/request_token'
accessURL <- 'https://api.twitter.com/oauth/access_token'
authURL <- 'https://api.twitter.com/oauth/authorize'
consumerKey <- '____________' #put the Consumer Key from Twitter Application
consumerSecret <- '______________' #put the Consumer Secret from Twitter Application
Cred <- OAuthFactory$new(consumerKey=consumerKey,
                         consumerSecret=consumerSecret,
                         requestURL=reqURL,
                         accessURL=accessURL,
                         authURL=authURL)
Cred$handshake(cainfo = system.file('CurlSSL', 'cacert.pem', package = 'RCurl')) #a URL appears in the console: open it, get the code and enter it in the console
save(Cred, file='twitter authentication.Rdata')

load('twitter authentication.Rdata') #after the first run you can start from this line (libraries should be loaded)
registerTwitterOAuth(Cred)
#the function for extracting and analyzing tweets
search <- function(searchterm) {
  #extract tweets and create storage file
  tweets <- searchTwitter(searchterm, cainfo='cacert.pem', n=1500)
  df <- twListToDF(tweets)
  df <- df[, order(names(df))]
  df$created <- strftime(df$created, '%Y-%m-%d')
  stack_file <- paste0(searchterm, '_stack_val.csv') #paste0 avoids a space in the file name
  if (!file.exists(stack_file)) write.csv(df, file=stack_file, row.names=FALSE)

  #merge the last extraction with storage file and remove duplicates
  stack <- read.csv(file=stack_file)
  stack <- rbind(stack, df)
  stack <- subset(stack, !duplicated(stack$text))
  write.csv(stack, file=stack_file, row.names=FALSE)

  #tweets evaluation function
  score.sentiment <- function(sentences, valence, .progress='none') {
    require(plyr)
    require(stringr)
    scores <- laply(sentences, function(sentence, valence){
      sentence <- gsub('[[:punct:]]', '', sentence) #cleaning tweets
      sentence <- gsub('[[:cntrl:]]', '', sentence) #cleaning tweets
      sentence <- gsub('\\d+', '', sentence) #cleaning tweets
      sentence <- tolower(sentence) #cleaning tweets
      word.list <- str_split(sentence, '\\s+') #separating words
      words <- unlist(word.list)
      val.matches <- match(words, valence$Word) #find words from tweet in "Word" column of dictionary
      val.match <- valence$Rating[val.matches] #ratings of the words found (assuming ratings are in the "Rating" column of dictionary)
      val.match <- na.omit(val.match)
      val.match <- as.numeric(val.match)
      score <- sum(val.match)/length(val.match) #rating of tweet (average of rated words; NaN if no word was found)
      return(score)
    }, valence, .progress=.progress)
    scores.df <- data.frame(score=scores, text=sentences) #save results to the data frame
    return(scores.df)
  }

  valence <- read.csv('dictionary.csv', sep=',', header=TRUE) #load dictionary from .csv file

  Dataset <- stack
  Dataset$text <- as.factor(Dataset$text)
  scores <- score.sentiment(Dataset$text, valence, .progress='text') #start score function
  write.csv(scores, file=paste0(searchterm, '_scores_val.csv'), row.names=TRUE) #save evaluation results into the file

  #modify evaluation
  stat <- scores
  stat$created <- stack$created
  stat$created <- as.Date(stat$created)
  stat <- na.omit(stat) #delete unvalued tweets
  write.csv(stat, file=paste0(searchterm, '_opin_val.csv'), row.names=TRUE)

  #chart (mean_cl_normal requires the Hmisc package; current ggplot2 takes mult via fun.args)
  ggplot(stat, aes(created, score)) +
    geom_point(size=1) +
    stat_summary(fun.data = 'mean_cl_normal', fun.args = list(mult = 1), geom = 'smooth') +
    ggtitle(searchterm)

  ggsave(filename=paste0(searchterm, '_plot_val.jpeg'))
}
search("______") #enter keyword
Finally, we will get 4 files:
- storage file with initial data,
- file with tweet ratings,
- cleaned file (without unvalued tweets) with tweets and dates,
- the chart, which shows the density of tweet ratings with the mean as a trend line.
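The trend in the chart is essentially the mean tweet score per day. As a rough sketch, the same daily means can be computed from the cleaned data with base R's `aggregate`. The `stat` data frame below is invented sample data that only reuses the `created` and `score` column names from the code above.

```r
# Invented sample of rated tweets.
stat <- data.frame(
  created = as.Date(c("2015-01-01", "2015-01-01", "2015-01-02")),
  score   = c(4, 6, 1.5)
)

# Mean score per day.
daily <- aggregate(score ~ created, data = stat, FUN = mean)
daily
```

A table like this can be handy for checking the plotted trend against the raw numbers.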
The post Twitter sentiment analysis based on affective lexicons with R appeared first on AnalyzeCore by Sergey Bryl' - data is beautiful, data is a story.