JSTOR Data for Research workshop
michaelkrot, THATCamp CHNM 2013 (The Humanities and Technology Camp)
Wed, 29 May 2013 15:18:07 +0000
http://chnm2013.thatcamp.org/05/29/jstor-data-for-research-workshop/

This workshop provides a general overview of the JSTOR Data for Research (DfR) service and a “how to” for text mining large datasets with Hadoop and cloud computing. The big data mining portion uses the JSTOR Early Journal Content (EJC) collection; a bundle of metadata and full text for its approximately 460,000 articles can be downloaded from the DfR site. For this tutorial we have pre-loaded the EJC content into Amazon Web Services (AWS) data storage and will provide instructions for mining it efficiently with the AWS Elastic MapReduce (EMR) service. We’ll show how to create an AWS account, develop and submit MapReduce jobs (written in Python), and retrieve the results. The examples include generating n-grams from full text and identifying the top words in each article by calculating TF*IDF scores.
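To give a feel for the two examples mentioned above, here is a minimal sketch in Python that runs locally on a few made-up sentences. It is not the workshop's actual code: the function names (`map_ngrams`, `reduce_counts`, `tf_idf`, `top_words`), the simple tokenizer, and the sample "articles" are all assumptions made for illustration. A real EMR job would typically read records from standard input and write key/value pairs to standard output, and would run over the EJC full text rather than an in-memory dictionary.

```python
import math
import re
from collections import Counter
from itertools import groupby


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def map_ngrams(text, n=2):
    """Map step: emit an (ngram, 1) pair for every n-gram in one article."""
    tokens = tokenize(text)
    for i in range(len(tokens) - n + 1):
        yield " ".join(tokens[i:i + n]), 1


def reduce_counts(pairs):
    """Reduce step: sort the pairs by key (standing in for the shuffle)
    and sum the counts for each distinct key."""
    totals = {}
    for key, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        totals[key] = sum(count for _, count in group)
    return totals


def tf_idf(docs):
    """Score every word in every document as tf * log(N / df), where tf is
    the word's relative frequency in the document, N is the number of
    documents, and df is the number of documents containing the word."""
    tokenized = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    doc_freq = Counter()
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))
    n_docs = len(docs)
    scores = {}
    for doc_id, tokens in tokenized.items():
        counts = Counter(tokens)
        scores[doc_id] = {
            word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        }
    return scores


def top_words(scores, doc_id, k=3):
    """Return the k highest-scoring words for one document."""
    ranked = sorted(scores[doc_id].items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:k]]


# Hypothetical sample data standing in for article full text.
sample_docs = {
    "a1": "the whale hunts the whale",
    "a2": "the bird sings",
    "a3": "the bird hunts",
}

if __name__ == "__main__":
    pairs = [p for text in sample_docs.values() for p in map_ngrams(text, n=2)]
    print(reduce_counts(pairs))
    print(top_words(tf_idf(sample_docs), "a1", k=2))
```

Note that a word appearing in every document (here, "the") gets an IDF of log(1) = 0, so it drops to the bottom of the ranking no matter how often it occurs. That is the property that makes TF*IDF useful for picking out the words most characteristic of a single article.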
