jstor_parse

This repository contains code for parsing a large corpus of JSTOR articles. The code can be adapted for use with other large, unparsed, and nonstandard corpora, particularly those distributed as zipped archives. Rather than unzipping large archives to disk, the code streams data directly from the zipped files using Python's zipfile library.
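As a rough illustration of the streaming approach, the sketch below reads XML members out of an archive one at a time without extracting anything to disk. The function name `iter_xml_members` and the `.xml` suffix filter are assumptions made for this example, not names taken from the repository's scripts.

```python
import zipfile
from typing import BinaryIO, Iterator, Tuple, Union


def iter_xml_members(archive: Union[str, BinaryIO]) -> Iterator[Tuple[str, bytes]]:
    """Yield (member_name, raw_bytes) for each .xml file in a zip archive.

    `archive` may be a path or an open binary file object. Only one
    member is held in memory at a time, and nothing is written to disk.
    """
    with zipfile.ZipFile(archive) as zf:
        for info in zf.infolist():
            # Skip directory entries and anything that is not XML.
            if info.is_dir() or not info.filename.lower().endswith(".xml"):
                continue
            with zf.open(info) as fh:
                yield info.filename, fh.read()
```

A caller would loop over `iter_xml_members("some_archive.zip")` (a hypothetical path) and hand each byte string to the parser.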

Articles are then parsed with BeautifulSoup using standard NLP methods. Because the JSTOR metadata and text files are contributed by individual publishers, the XML formatting is not standardized, and certain tags are missing from a large number of files. To accommodate this, the code first explores the corpus to estimate where tags are missing, selectively unzipping articles for manual inspection.

Script Descriptions

percent_parser.py parses all the zipped archives and inspects the XML files to see which kinds of metadata each file contains. It returns a CSV with a 1 or 0 for each metadata item, depending on whether that item is present in the file. From this you can compute the percentage of files in the corpus that contain each metadata item.
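The presence check described for percent_parser.py can be sketched roughly as follows. This is not the script itself: it uses the standard library's `xml.etree.ElementTree` in place of BeautifulSoup so the example has no third-party dependencies, and the tag names in `TAGS`, along with the function names, are hypothetical placeholders. Files that fail to parse are recorded as all zeros.

```python
import csv
import xml.etree.ElementTree as ET

# Hypothetical metadata tags to look for; adjust to the corpus at hand.
TAGS = ["article-title", "abstract", "pub-date"]


def tag_presence(xml_bytes, tags=TAGS):
    """Return {tag: 1 or 0} depending on whether the tag occurs in the XML."""
    try:
        root = ET.fromstring(xml_bytes)
    except ET.ParseError:
        # Malformed file: treat every tag as missing.
        return {tag: 0 for tag in tags}
    return {tag: int(next(root.iter(tag), None) is not None) for tag in tags}


def write_presence_csv(records, out, tags=TAGS):
    """Write one row of 1s and 0s per (name, xml_bytes) record.

    Returns the percentage of records containing each tag.
    """
    writer = csv.DictWriter(out, fieldnames=["file", *tags])
    writer.writeheader()
    totals = {tag: 0 for tag in tags}
    count = 0
    for name, xml_bytes in records:
        row = tag_presence(xml_bytes, tags)
        writer.writerow({"file": name, **row})
        for tag in tags:
            totals[tag] += row[tag]
        count += 1
    return {tag: 100.0 * totals[tag] / count if count else 0.0 for tag in tags}
```

Note that this sketch matches tag names literally and does not handle XML namespaces, which real publisher files may use.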

text_parser.py returns the non-standardized text from each of the individual files.

jstor_text_clean.py cleans, stems, lemmatizes, and normalizes all the text files and returns a CSV with all the text, jstor_standard_text.csv.
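For orientation, the normalization step alone might look something like the sketch below. It covers only Unicode normalization, lowercasing, punctuation removal, and whitespace collapsing. Stemming and lemmatization are omitted because they need an NLP library, and the README does not say which one the script uses. The function name `normalize_text` is an assumption for this example.

```python
import re
import unicodedata


def normalize_text(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    # Fold compatibility characters (e.g. ligatures) into plain forms.
    text = unicodedata.normalize("NFKC", text).lower()
    # Replace anything that is not a word character or whitespace.
    text = re.sub(r"[^\w\s]", " ", text)
    # Collapse runs of whitespace, including newlines, to single spaces.
    return re.sub(r"\s+", " ", text).strip()
```

For example, `normalize_text("  The  Quick, brown\nFOX! ")` returns `"the quick brown fox"`.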

metadata_parser.py parses all the zipped archives and returns the metadata for each file.

asa_jstor_pub_counts.py returns publication counts for our faculty members in the network data.