Topic modelling in R language using JSTOR’s data for research (dfr)
This is a step-by-step guide to performing topic modelling in R using JSTOR’s Data for Research (DFR) and visualizing the results with dfr-browser, developed by Andrew Goldstone.
Please note this is a tutorial I originally wrote a while ago, and some parts of the code might require an update due to library updates.
Things you need
- R (version 3.1.1 or later). It can be downloaded from the CRAN website
- R-Studio (an IDE for R)
- JSTOR DFR account http://dfr.jstor.org/
- Administrator rights to install several R packages and supporting tools
- WAMP/XAMPP (for Windows) or MAMP (for Mac) in order to view web pages served from your own computer (localhost)
- Basic knowledge of R language
In this tutorial you will build your own model in three basic steps:
- Download metadata from JSTOR (against a search query)
- Process JSTOR metadata in R
- Pass processed data files to dfr-browser to visualize the results
1. Getting data from JSTOR
- Create an account at JSTOR DFR http://dfr.jstor.org/
- Submit search query and refine it (if necessary) with several filters listed on the left-hand side of the page
- Click on Dataset Requests >> Submit new request
- Select Data Type:
- Citations Only and Word Counts are the minimum requirements for the model to work
- You can add more dimensions to the data if you wish
- Output format: CSV
- Maximum Articles: You get a limit of 1000 articles at the time of registration. However, this limit can be increased by sending an email request to [email protected].
Note: You may not be able to download the metadata straight after submitting a data request. The JSTOR system sends an automatic email when the query has been processed, which normally takes around 24 hours.
2. Preparing data in R
This is the crucial and perhaps the most technical part of the process. I have tried to describe it in the simplest possible way.
First of all we need to install an R package called dfrtopics, developed by Andrew Goldstone. In order to do that you first need to install Rtools and the latest version of Java. On Windows, you may also need to add Rtools to the PATH environment variable.
Please see the screenshot for quick help with setting the environment path on Windows.
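To see whether the path is set correctly, you can ask R which external tools it is able to find. The following is a minimal sketch in base R. It assumes that Java is exposed as an executable called java and that Rtools provides make; the helper name check_tools is invented for this illustration.

```r
# Report whether the external tools needed to build and run dfrtopics
# can be found on the PATH that R sees.
# Assumption: Java is available as "java" and Rtools provides "make".
check_tools <- function(tools = c("java", "make")) {
  paths <- Sys.which(tools)  # empty string for any tool that is not found
  data.frame(tool = tools,
             found = nzchar(paths),
             path = unname(paths),
             stringsAsFactors = FALSE)
}

print(check_tools())
```

If a tool shows found = FALSE, the folder containing it is probably missing from the environment path.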
2.1. Installing required packages in R
Run the following code in RStudio.
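# devtools provides install_github(), so install and load it first
install.packages("devtools")
library(devtools)
# Packages available from CRAN
install.packages("rJava")
install.packages("mallet")
install.packages("ggplot2")
# dfrtopics comes from GitHub; current devtools expects "user/repo"
install_github("agoldst/dfrtopics")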
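Once the installation has finished, a quick check can confirm that nothing was skipped. This is a sketch in base R under the assumption that the packages named in this section are the ones you need. It deliberately looks packages up without loading them, so that Java is not started before the memory option is set in the next step.

```r
# Packages this tutorial relies on
needed <- c("devtools", "rJava", "mallet", "ggplot2", "dfrtopics")

# Look the names up in the installed-package table without loading anything
installed <- needed %in% rownames(installed.packages())
names(installed) <- needed

missing_pkgs <- needed[!installed]
if (length(missing_pkgs) > 0) {
  message("Still missing: ", paste(missing_pkgs, collapse = ", "))
} else {
  message("All packages are installed.")
}
```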
2.2. Processing dfr-browser data files
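Now let’s assume that you have downloaded and unzipped the results of a DFR query into a folder inside your R working directory, i.e. ./dfr/queryResult. Make sure you have citations.csv and stoplist.txt (a list of English stop words) in ./dfr/queryResult and the word count files in ./dfr/queryResult/wordcounts. Note that the code below refers to citations.tsv; use whichever citations file name your download actually contains.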
Note: please right click and save link as stoplist.txt to download the file
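Before running the model, it may help to confirm that the files are where the code expects them. The following is a minimal sketch in base R; the folder path and the helper name check_dfr_download are assumptions made for illustration and are not part of dfrtopics.

```r
# Hypothetical location of the unzipped DFR download; adjust to your setup
query_dir <- file.path("dfr", "queryResult")

# Summarise what is present in the download folder
check_dfr_download <- function(query_dir) {
  wordcounts_dir <- file.path(query_dir, "wordcounts")
  # Accept any extension, since the citations file may be .csv or .tsv
  citations <- list.files(query_dir, pattern = "^citations\\.",
                          ignore.case = TRUE)
  list(
    has_citations = length(citations) > 0,
    has_stoplist = file.exists(file.path(query_dir, "stoplist.txt")),
    n_wordcount_files = length(list.files(wordcounts_dir))
  )
}

str(check_dfr_download(query_dir))
```

All three entries should look sensible (TRUE, TRUE and a count close to the number of articles you requested) before you continue.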
Run the following code to process the data.
# The dfrtopics library depends on the following packages: stringr,
# plyr, reshape2, grid, ggplot2, scales, mallet and rJava

# We need to allocate 2GB of memory to rJava before loading
# the dfrtopics package
options(java.parameters="-Xmx2g")
library(dfrtopics)

# Running the model; the result is stored in m for the steps below.
# Point citations_files at the citations file from your download.
m <- model_documents(citations_files="./dfr/queryResult/citations.tsv",
                     dirs="./dfr/queryResult/wordcounts/",
                     stoplist_file="./dfr/queryResult/stoplist.txt",
                     n_topics=20)

# Exporting LDA results in a new folder "data" in your working directory
output_model(m, "data")

# Synthesizing document-topic matrix joined with metadata
# for further analysis
dtw <- doc_topics_wide(m$doc_topics, m$metadata)

# Converting the above data frame of topics into a yearly time series
series <- topic_proportions_series_frame(topic_year_matrix(dtw))

# Make a faceted plot to visualize change within topics over time
topic_yearly_lineplot(series, facet=TRUE)

# Finally, exporting your data for dfr-browser
export_browser_data("data", m$metadata, m$wkf, m$doc_topics,
                    topic_scaled_2d(m$trainer))
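To make the yearly time series step less of a black box, here is a small base R sketch of the general idea: sum each topic's weight over the documents published in a year, then turn the sums into proportions. The numbers are invented toy data, and this is only one plausible way to do the calculation; it is not claimed to be what dfrtopics does internally.

```r
# Toy document-topic weights joined with publication year (invented values)
toy_dtw <- data.frame(
  year   = c(2001, 2001, 2002, 2002),
  topic1 = c(10, 30, 5, 15),
  topic2 = c(30, 10, 15, 45)
)

# Sum the topic weights of all documents published in the same year
yearly <- aggregate(cbind(topic1, topic2) ~ year, data = toy_dtw, FUN = sum)

# Convert the sums to proportions so that each year adds up to 1
topic_cols <- c("topic1", "topic2")
yearly[topic_cols] <- yearly[topic_cols] / rowSums(yearly[topic_cols])

print(yearly)
```

In this toy example both topics have an equal share in 2001, while topic2 accounts for three quarters of the weight in 2002.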
3. Passing data to dfr-browser for interactive visualization
Now we need to download dfr-browser from http://agoldst.github.io/dfr-browser/. Extract the zip/tar.gz file into the appropriate directory of WAMP/XAMPP/MAMP so that dfr-browser can be loaded in any web browser from localhost.
Finally, copy “data” directory created by R into the above dfr-browser’s directory.
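If you prefer to do the copy from R rather than by hand, something like the following base R sketch should work. The helper name copy_browser_data and the example path in the final comment are hypothetical; replace the path with the location of dfr-browser on your own machine.

```r
# Copy the "data" folder written by R into the dfr-browser folder
copy_browser_data <- function(data_dir, browser_dir) {
  # Stop early if either folder is missing
  stopifnot(dir.exists(data_dir), dir.exists(browser_dir))
  # recursive = TRUE copies the folder itself, giving <browser_dir>/data
  file.copy(data_dir, browser_dir, recursive = TRUE)
}

# Hypothetical paths; replace with your own before running
# copy_browser_data("data", "C:/wamp/www/dfr-browser")
```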
Your topic model is ready to visualize!
1. Sustainability and well-being topic model with 1000 papers
2. Sustainability and well-being topic model with over 3500 papers