Project Tracking #1

viswajithiii · 2016-12-04T02:46:24Z

Things to do

For the system

Improve the gender-from-name pipeline using something from Nathan Matias's links.
Heuristics to identify which mentions connect to sources.
Linking mentions to adjectives/dependency parsing

TechCrunch

Count the fractions of people mentioned, using the improved system; break this down by year/month of publication, gender and category of article.
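A rough sketch of this breakdown, assuming each person mention has already been reduced to a (publication date, category, gender) record. The record layout, the function name and the sample rows are all hypothetical, made up for illustration rather than taken from our data.

```python
from collections import Counter, defaultdict
from datetime import date


def gender_fractions_by_month(mentions):
    """Return {(year, month, category): {gender: fraction}}.

    `mentions` is an iterable of (published, category, gender) tuples,
    one per person mentioned. This layout is an assumption.
    """
    counts = defaultdict(Counter)
    for published, category, gender in mentions:
        counts[(published.year, published.month, category)][gender] += 1

    fractions = {}
    for key, counter in counts.items():
        total = sum(counter.values())
        fractions[key] = {g: n / total for g, n in counter.items()}
    return fractions


# Made-up sample rows, only to show the shape of the input.
sample = [
    (date(2016, 2, 16), "startups", "male"),
    (date(2016, 2, 20), "startups", "female"),
    (date(2016, 2, 21), "startups", "male"),
    (date(2016, 4, 5), "media", "unknown"),
]
result = gender_fractions_by_month(sample)
```

Keeping "unknown" as its own gender value makes it visible how many mentions the system could not tag, instead of silently dropping them from the denominator.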

More broadly

Try this on other data sources.
Simple visualization UI which allows people to plug in an article and see our output.

viswajithiii · 2016-12-16T21:42:16Z

State of the project, Dec 16

Goal for the Quarter:

Develop a system to identify sources in articles, and tag them by gender. This would be the first step, and a cornerstone, of scoring articles on gender diversity.

Sub-problems:

Identify mentions of people in text.
Identify the mentioned person’s gender.
Identify which mentions are sources, and which mentions are just in passing.

Updates

Data: We’ve focused mostly on tech news, specifically on a dataset with all of TechCrunch. We've also used data from The New York Times, and updated our models in areas where they didn't generalize.
Individual vs Aggregate: We've found that it’s hard to find a meaningful gender diversity score on an individual article basis. Most articles mention only organizations and no people at all. In TechCrunch, a lot of articles have only one source, who is often a CXO of the company, and very frequently, male. We've decided to, for now, work with aggregate statistics.
Overview of workflow: Scrape raw text and metadata, then annotate the text using Stanford CoreNLP. This does a bunch of useful things, like parsing, named entity recognition and coreference resolution. Then, together with the text and the annotation, we extract the information that's relevant to us.
Identifying mentions of people in text: This was reasonably straightforward -- we use Stanford CoreNLP's named entity recognizer. This tags words corresponding to named entities, and classifies them into person/organization/location. We then run a few heuristics over the person names to collect all mentions of an entity into one.
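To illustrate the kind of heuristic meant here, a minimal sketch that folds a shorter mention into a longer name when all of its tokens appear in that name. This particular rule, the function name and the example names are assumptions for illustration; the heuristics we actually run may differ.

```python
def merge_person_mentions(mentions):
    """Group mention strings that plausibly refer to the same person.

    Heuristic (an assumption): a mention joins the first longer name that
    contains all of its tokens, so "Cook" joins "Tim Cook".
    """
    # Deduplicate while keeping first-seen order, then put longer names
    # first so they become the canonical entries.
    unique = list(dict.fromkeys(mentions))
    ordered = sorted(unique, key=lambda m: -len(m.split()))

    groups = {}  # canonical name -> list of mention strings
    for mention in ordered:
        tokens = set(mention.split())
        for canonical in groups:
            if tokens <= set(canonical.split()):
                groups[canonical].append(mention)
                break
        else:
            # No existing name covers this mention: start a new entity.
            groups[mention] = [mention]
    return groups


groups = merge_person_mentions(
    ["Tim Cook", "Cook", "Sheryl Sandberg", "Sandberg", "Tim Cook"]
)
```

A rule this simple will wrongly merge two different people who share a surname, which is one reason to measure its error rate before relying on it.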
Identifying gender from a name:
- From first name: We use census baby names data (plus some common international names we manually added to the system) to go from first name to gender. This doesn't work for names that are used for more than one gender.
- From coreference: We use coreference with gendered pronouns as evidence of the gender of the person. Our coreference system is high-precision, low-recall: when it marks something as coreferent, it's usually right, but it might miss out on some coreferent mentions. Also, many names just don't have a coreferent pronoun in the article. We do NOT use the gender CoreNLP returns directly: it is unreliable, and seems to have a lot of stereotypes encoded into it.
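The two signals above could be combined roughly as follows. This is a sketch under stated assumptions: the counts in NAME_COUNTS are placeholder numbers and not real census figures, the 0.9 threshold is arbitrary, and letting pronoun evidence take precedence over the name lookup is a design choice made for this example, not necessarily what our system does.

```python
from collections import Counter

# Placeholder counts standing in for census baby-name data:
# first name -> (female count, male count). NOT real figures.
NAME_COUNTS = {
    "michelle": (9500, 30),
    "romain": (0, 400),
    "jordan": (1200, 3700),
}

PRONOUN_GENDER = {
    "she": "female", "her": "female", "hers": "female",
    "he": "male", "him": "male", "his": "male",
}


def gender_from_first_name(first_name, threshold=0.9):
    """Return 'female', 'male', or None if the name is unseen or ambiguous."""
    counts = NAME_COUNTS.get(first_name.lower())
    if counts is None or sum(counts) == 0:
        return None
    female, male = counts
    total = female + male
    if female / total >= threshold:
        return "female"
    if male / total >= threshold:
        return "male"
    return None  # seen for both genders, so the name alone is not enough


def gender_from_pronouns(coreferent_pronouns):
    """Majority vote over gendered pronouns coreferent with the person."""
    votes = Counter(
        PRONOUN_GENDER[p.lower()]
        for p in coreferent_pronouns
        if p.lower() in PRONOUN_GENDER
    )
    ranked = votes.most_common()
    if not ranked:
        return None
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # tied evidence
    return ranked[0][0]


def guess_gender(first_name, coreferent_pronouns=()):
    # Assumption: pronoun evidence wins, because the coreference system
    # is high-precision; the name lookup is the fallback.
    return (gender_from_pronouns(coreferent_pronouns)
            or gender_from_first_name(first_name))
```

With these placeholder counts, guess_gender("Jordan") returns None because neither gender reaches the threshold, while guess_gender("Jordan", ["she", "her"]) falls back on the pronouns and returns "female".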
Identifying sources: For each quote, we identify who said it, and match it to one of the people mentioned in the article. From this, we know everything each person in the article has said. We also identify which verbs each person (or a pronoun they are coreferent with) is a subject of -- if they are the subject of a verb that indicates a quote, like say/tell/ask, we guess they are a source. We also have the location in the article of each of these quotes, which can help us judge how important each one is. It is still hard to distinguish between people who are subjects of a story and people who are sources.
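The verb-based guess can be sketched as below. It assumes the parser and coreference steps have already produced (person, verb lemma) pairs and an optional map from speaker to attributed quotes; both input shapes, the function name and the example people are hypothetical. The verb list holds only the three verbs named above.

```python
# Verbs that indicate a quote; only the three named in this update.
SPEECH_VERBS = {"say", "tell", "ask"}


def likely_sources(subject_verb_pairs, quotes_by_speaker=None):
    """Return the set of people guessed to be sources.

    subject_verb_pairs: iterable of (person, verb_lemma), with pronouns
        already resolved to the person they are coreferent with.
    quotes_by_speaker: optional {person: [quote, ...]} from quote attribution.
    """
    sources = {
        person
        for person, lemma in subject_verb_pairs
        if lemma in SPEECH_VERBS
    }
    if quotes_by_speaker:
        # Anyone with at least one attributed quote also counts as a source.
        sources.update(p for p, quotes in quotes_by_speaker.items() if quotes)
    return sources


# Hypothetical inputs for illustration.
pairs = [("Jane Doe", "say"), ("John Roe", "raise"), ("Jane Doe", "found")]
from_verbs = likely_sources(pairs)
with_quotes = likely_sources(pairs, {"John Roe": ["We are thrilled."]})
```

This does not address the subject-versus-source problem: someone who is only reported to have said something elsewhere would still be picked up by the verb rule.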
Adjectives associated with a person: We use dependency parsing to extract adjectives associated with a person. This is still in the early stages, so misses out on a lot.
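As a rough illustration of the idea, the sketch below reads adjectives off two dependency patterns: an amod edge from the person to an adjective, and an nsubj edge from an adjective predicate to the person. The hand-written (governor, relation, dependent) edge format is an assumption chosen to keep the example self-contained; it is not CoreNLP's actual output format, and these two patterns are likely a subset of what is needed.

```python
def adjectives_for(person_index, tokens, edges):
    """Collect adjectives attached to the token at `person_index`.

    tokens: list of (word, pos_tag) pairs.
    edges: list of (governor_index, relation, dependent_index), 0-based.
    Both layouts are hypothetical stand-ins for parser output.
    """
    found = []
    for governor, relation, dependent in edges:
        if relation == "amod" and governor == person_index:
            # "outspoken Smith": amod(Smith, outspoken)
            found.append(tokens[dependent][0])
        elif (relation == "nsubj" and dependent == person_index
              and tokens[governor][1].startswith("JJ")):
            # "Smith is brilliant": nsubj(brilliant, Smith)
            found.append(tokens[governor][0])
    return found


# "Smith is brilliant"
tokens_a = [("Smith", "NNP"), ("is", "VBZ"), ("brilliant", "JJ")]
edges_a = [(2, "nsubj", 0), (2, "cop", 1)]
predicate_adjectives = adjectives_for(0, tokens_a, edges_a)

# "outspoken Smith left"
tokens_b = [("outspoken", "JJ"), ("Smith", "NNP"), ("left", "VBD")]
edges_b = [(1, "amod", 0), (2, "nsubj", 1)]
modifier_adjectives = adjectives_for(1, tokens_b, edges_b)
```

Adjectives attached to a pronoun or a role noun that is coreferent with the person would need the coreference chains as well, which this sketch leaves out.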
Blog: We would like to write up some of these results, and we've put up a skeleton of a blog at https://viswajithiii.github.io/gendermeme/.
Web UI: We've built a system where a user can put in an article text, and we return a breakdown of the gender-related information we've computed.

Next Steps

Improve the extraction of adjectives and other words associated with different entities in our articles.
Explore more comprehensive databases/APIs to get gender from name. (Right now, we can only do American names.)
We have systems that look fairly decent on the outside, with a few errors here and there. We need to quantify the amount of error, to really understand how well we're doing and what we're missing.
Figure out how to use our metrics to come up with a measure of gender diversity. Maybe look at how the same story is reported by different organizations. TechMeme will help.
What are things we can do with this system?

Interesting Examples

https://techcrunch.com/2016/02/16/reedsy-launches-book-editor-to-seamlessly-turn-your-draft-into-a-book/ (Romain Dillet)
CoreNLP annotates gender based on stereotypes! Even though this article is about an app that lets you interface with a copy editor, a publicist, an illustrator and a book editor, CoreNLP confidently ascribes genders to these roles, as seen below.
1 [(u'a copy editor', u'NOMINAL', u'SINGULAR', u'MALE', 16)]
1 [(u'a publicist', u'NOMINAL', u'SINGULAR', u'FEMALE', 16)]
1 [(u'an illustrator', u'NOMINAL', u'SINGULAR', u'MALE', 16)]
1 [(u'cover illustrator', u'NOMINAL', u'SINGULAR', u'MALE', 2)]
1 [(u'a book editor', u'NOMINAL', u'SINGULAR', u'MALE', 3)]
https://techcrunch.com/2016/04/05/the-awl-and-other-indie-publishers-are-moving-to-medium/ (Jacob Carlson)
Cites no sources, but mentions investments made by Shaquille O’Neal, Alex Rodriguez, Jimmy Collins and Rick Fox.
https://techcrunch.com/2016/07/19/pokemon-go-popular-places/ (Darrell Etherington)
Mentions no sources, but we mistakenly identify Fred Meyer and Victoria (from Victoria’s Secret) as people.
https://techcrunch.com/2016/01/20/glassbreakers-raises-2-million-for-its-diversity-and-inclusion-software/
Article is about diversity and inclusion, but only quotes the investor, who is male.

viswajithiii · 2016-12-18T01:54:47Z

Notes from meeting:

Run what we have on New York Times corpus.
Do lit review.

viswajithiii · 2016-12-18T02:31:57Z

Minutes of meeting December 17th:

Can we extract gender from sentences like 'Michelle Obama looks like her daughter'? Here, 'Michelle' and 'her' are not strictly coreferent, but the pronoun 'her' is associated with Michelle.
