
jstor_parse

This repository contains code for parsing a large corpus of JSTOR articles. The code can be adapted for use with other large, unparsed, non-standard corpora, particularly those distributed as zipped archives. Rather than unzipping the archives, the code streams data directly from the zipped files using Python's zipfile library.
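A minimal sketch of the streaming approach, assuming a hypothetical archive name; members are read one at a time with zipfile while the rest of the archive stays compressed on disk:

```python
import zipfile

archive_path = "jstor_archive_part_001.zip"  # hypothetical archive name

with zipfile.ZipFile(archive_path) as archive:
    # namelist() lists every member without decompressing anything
    xml_members = [name for name in archive.namelist() if name.endswith(".xml")]
    for name in xml_members:
        # open() streams a single member out of the archive
        with archive.open(name) as member:
            raw_xml = member.read().decode("utf-8", errors="replace")
        # ... hand raw_xml to the parsing step ...
```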

Articles are then parsed using standard NLP methods with BeautifulSoup. Because the JSTOR metadata and text files are contributed by individual publishers, the XML formatting is not standardized, and certain tags are missing from a large number of files. To accommodate this, the code first explores the corpus to estimate which tags are missing and how often, selectively unzipping articles for manual inspection.
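A sketch of the tag check, assuming JATS-style element names such as article-title and pub-date (the actual tag set should be confirmed against a sample of the corpus):

```python
from bs4 import BeautifulSoup

def missing_tags(raw_xml, expected_tags=("article-title", "journal-title", "pub-date")):
    """Return the expected tags that do not appear in one article's XML."""
    soup = BeautifulSoup(raw_xml, "xml")  # the "xml" feature requires lxml
    return [tag for tag in expected_tags if soup.find(tag) is None]
```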

Script Descriptions

percent_parser.py parses all the zipped archives and inspects the XML files to determine what percentage of files contain each kind of metadata. It returns a CSV of 1s and 0s indicating whether each metadata item is present in each file, so you can see what proportion of the corpus carries each item.
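A hedged sketch of that presence matrix (column names and the output path are illustrative, not taken from percent_parser.py itself); averaging each column then gives the percentage of files carrying that tag:

```python
import csv
from bs4 import BeautifulSoup

TAGS = ["article-title", "journal-title", "pub-date", "contrib"]

def write_presence_csv(articles, out_path="metadata_presence.csv"):
    """articles: iterable of (file_name, raw_xml) pairs."""
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["file"] + TAGS)
        for file_name, raw_xml in articles:
            soup = BeautifulSoup(raw_xml, "xml")
            # 1 if the tag appears anywhere in the file, 0 otherwise
            writer.writerow([file_name] + [int(soup.find(t) is not None) for t in TAGS])
```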

text_parser.py returns the raw, non-standardized text from each individual file.
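A small sketch of that extraction, assuming the article text sits under a body element (as in JATS full-text files), with a fallback to the whole document:

```python
from bs4 import BeautifulSoup

def extract_text(raw_xml):
    soup = BeautifulSoup(raw_xml, "xml")
    body = soup.find("body")
    node = body if body is not None else soup
    return node.get_text(separator=" ", strip=True)
```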

jstor_text_clean.py cleans, stems, lemmatizes, and normalizes all the text files and returns a single CSV containing all the text, jstor_standard_text.csv.
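A hedged sketch of one such cleaning step: lowercase, keep letters only, drop stopwords, then lemmatize and stem with NLTK. The exact pipeline in jstor_text_clean.py may differ, and NLTK's stopword and WordNet data must be downloaded first (nltk.download("stopwords"), nltk.download("wordnet")):

```python
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

STOPWORDS = set(stopwords.words("english"))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def clean_text(text):
    text = re.sub(r"[^a-z\s]", " ", text.lower())            # letters only
    tokens = [t for t in text.split() if t not in STOPWORDS]  # drop stopwords
    return " ".join(stemmer.stem(lemmatizer.lemmatize(t)) for t in tokens)
```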

metadata_parser.py parses all the zipped archives and returns the metadata for each file.
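A sketch of per-article metadata extraction; the field names are assumptions based on common JATS tags, not a listing of metadata_parser.py's actual output:

```python
from bs4 import BeautifulSoup

def extract_metadata(file_name, raw_xml):
    soup = BeautifulSoup(raw_xml, "xml")

    def text_of(tag):
        node = soup.find(tag)
        return node.get_text(strip=True) if node is not None else ""

    return {
        "file": file_name,
        "title": text_of("article-title"),
        "journal": text_of("journal-title"),
        "year": text_of("year"),
    }
```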

asa_jstor_pub_counts.py returns publication counts for the faculty members in our network data.
