
Topic modelling in R language using JSTOR’s data for research (dfr)

by Mubashir Qasim · August 18, 2018

This is a step-by-step guide to topic modelling in R using JSTOR’s Data for Research (DfR) and to visualizing the results with dfr-browser, developed by Andrew Goldstone.

Please note this is a tutorial I originally wrote a while ago, and some parts of the code might require an update due to library updates.

Things you need

R (version 3.1.1 or later), which can be downloaded from the CRAN website
RStudio (an IDE for R)
JSTOR DFR account http://dfr.jstor.org/
Administrator rights to install several R packages and supporting tools
WAMP/XAMPP (for Windows) or MAMP (for Mac) in order to view web pages served from your own computer (localhost)
Basic knowledge of R language

Getting started

In this tutorial we will build your own topic model in three basic steps:

Download metadata from JSTOR (for a search query)
Process the JSTOR metadata in R
Pass the processed data files to dfr-browser to visualize the results

1. Getting data from JSTOR

Create an account at JSTOR DFR http://dfr.jstor.org/
Submit search query and refine it (if necessary) with several filters listed on the left-hand side of the page
Click on Dataset Requests >> Submit new request
Select Data Type:
- Citations Only and Word Counts are the minimum requirements for the model to work
- You can add more dimensions to the data if you wish
Output format: CSV
Maximum Articles: You get a limit of 1000 articles at the time of registration. However, this limit can be increased by sending an email request to [email protected].

Note: You may not be able to download the metadata straight after submitting a data request. The JSTOR system sends an automatic email once the query has been processed, which normally takes around 24 hours.

2. Preparing data in R

This is the crucial and perhaps the most technical part of the process. I have tried to explain it as simply as possible.

First of all we need to install an R package called dfrtopics, developed by Andrew Goldstone. To do that you first need to install Rtools and the latest version of Java. On Windows, you may also need to add Rtools to the PATH environment variable.

Please see the screenshot for quick help with setting the environment path on Windows.
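Once the path is set, it can help to confirm from within R that the tools are actually visible. The following is only a minimal sketch using base R; the tool names "java" and "make" are my assumptions about what to look for (make is one of the programs that ships with Rtools), and check_tools is a hypothetical helper name, not part of any package.

```r
# Report where (if anywhere) each external tool is found on the PATH.
# Sys.which() normally returns "" for a tool it cannot find.
check_tools <- function(tools = c("java", "make")) {
  paths <- Sys.which(tools)
  data.frame(tool = tools,
             found = unname(nzchar(paths)),
             path = unname(paths),
             row.names = NULL,
             stringsAsFactors = FALSE)
}

print(check_tools())
```

If a tool shows found = FALSE, restart RStudio after changing the path, since a running session keeps the old environment.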

2.1. Installing required packages in R

Run the following code in RStudio.

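# devtools provides install_github(); install it first if it is missing
install.packages("devtools")
library(devtools)
# Current versions of devtools expect the repository as "user/repo"
install_github("agoldst/dfrtopics")
install.packages("rJava")
install.packages("mallet")
install.packages("ggplot2")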

2.2. Processing dfr-browser data files

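Now let’s assume that you have downloaded and unzipped the results of a DFR query into dfr/queryResult under your R working directory. Make sure you have the citations file and stoplist.txt (a list of English stop words) in dfr/queryResult, and the word-count files in dfr/queryResult/wordcounts. The code below refers to the citations file as citations.tsv; if your download uses a different name or extension, adjust the path accordingly.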

Note: please right click and save link as stoplist.txt to download the file
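Before running the model it is worth checking that the files are where the code expects them. This is a rough sketch in base R, assuming the dfr/queryResult layout described above; check_dfr_inputs is a hypothetical helper of mine and not a dfrtopics function.

```r
# Summarise the input files found under a DFR query directory.
check_dfr_inputs <- function(query_dir) {
  wordcount_dir <- file.path(query_dir, "wordcounts")
  list(
    # any file whose name starts with "citations." (tsv, csv, CSV, ...)
    citations = list.files(query_dir, pattern = "^citations\\.",
                           ignore.case = TRUE),
    stoplist = file.exists(file.path(query_dir, "stoplist.txt")),
    # number of files in the wordcounts folder (0 if it does not exist)
    n_wordcounts = length(list.files(wordcount_dir))
  )
}

str(check_dfr_inputs("dfr/queryResult"))
```

An empty citations entry, stoplist = FALSE, or n_wordcounts = 0 usually means the working directory is not what you think it is; getwd() shows the current one.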

Run the following code to process the data.

# dfrtopics library depends on the following packages: stringr,
# plyr, reshape2, grid, ggplot2, scales, mallet and rJava
# We need to allocate 2GB of memory to rJava before loading
# dfrtopics package

options(java.parameters="-Xmx2g")
library(dfrtopics)

# Running the model and keeping the result in m for the later steps
m <- model_documents(citations_files="./dfr/queryResult/citations.tsv",
     dirs = "./dfr/queryResult/wordcounts/",
     stoplist_file = "./dfr/queryResult/stoplist.txt", n_topics=20)

# Exporting LDA results in a new folder "data" in your working directory
output_model(m, "data")

# Synthesizing document-topic matrix joined with metadata
# for further analysis
dtw <- doc_topics_wide(m$doc_topics,m$metadata)

# Converting the above data frame of topics into a yearly time series
series <- topic_proportions_series_frame(topic_year_matrix(dtw))

# Make a faceted plot to visualize change within topics over time.
topic_yearly_lineplot(series,facet=TRUE)

# Finally, exporting your data for dfr-browser
export_browser_data("data",
 m$metadata,
 m$wkf,
 m$doc_topics,
 topic_scaled_2d(m$trainer))
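To make the yearly time series step in the script above less of a black box, here is a toy illustration of the idea in base R: sum the topic weights of all documents published in the same year, then normalise each year so that its topic shares add up to 1. The data are invented, yearly_topic_shares is a hypothetical name, and this is a conceptual sketch rather than the way dfrtopics implements it.

```r
# Toy data, invented for illustration: 4 documents, 2 topics.
doc_topics <- matrix(c(8, 2,
                       5, 5,
                       1, 9,
                       3, 7),
                     ncol = 2, byrow = TRUE,
                     dimnames = list(NULL, c("topic1", "topic2")))
years <- c(2001, 2001, 2002, 2002)  # publication year of each document

yearly_topic_shares <- function(doc_topics, years) {
  totals <- rowsum(doc_topics, group = years)  # sum topic weights per year
  totals / rowSums(totals)                     # each year's row sums to 1
}

yearly_topic_shares(doc_topics, years)
```

For the toy data, 2001 comes out as 0.65 for topic1 and 0.35 for topic2, and 2002 as 0.2 and 0.8. A line plot of such shares against the year is what the faceted plot shows for each topic.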

3. Passing data to dfr-browser for interactive visualization

Now we need to download dfr-browser from http://agoldst.github.io/dfr-browser/. Extract the zip/tar.gz file into the appropriate directory of WAMP/XAMPP/MAMP so that dfr-browser can be loaded in any web browser from localhost.

Finally, copy “data” directory created by R into the above dfr-browser’s directory.
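The copy can also be done from R instead of the file manager. This is a small sketch with base R's file.copy; copy_browser_data is a hypothetical helper, and the dfr-browser path in the commented example is only a placeholder for wherever your web server's document folder happens to be.

```r
# Copy the exported "data" folder into an existing dfr-browser folder
# and list what ended up there.
copy_browser_data <- function(data_dir, browser_dir) {
  stopifnot(dir.exists(data_dir), dir.exists(browser_dir))
  ok <- file.copy(data_dir, browser_dir, recursive = TRUE)
  list(copied = ok,
       files = list.files(file.path(browser_dir, basename(data_dir))))
}

# Example call (placeholder path, adjust to your own setup):
# copy_browser_data("data", "C:/xampp/htdocs/dfr-browser")
```

Note that file.copy does not overwrite existing files by default, so remove an old data folder first if you are re-exporting a model.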

Your topic model is ready to visualize!

My examples

1. Sustainability and well-being topic model with 1000 papers
2. Sustainability and well-being topic model with over 3500 papers

Please note that this is a very quick guide to topic modelling and it takes the explanation of many steps for granted. Please feel free to leave me a message in the feedback if you would like me to elaborate on any of the steps above.

