Project Tracking #1

Open
viswajithiii opened this issue Dec 4, 2016 · 3 comments

Comments

@viswajithiii

Things to do

For the system

  • Improve the gender-from-name pipeline using resources from Nathan Matias's links.
  • Heuristics to identify which mentions connect to sources.
  • Linking mentions to adjectives/dependency parsing

TechCrunch

  • Count the fraction of people mentioned, using the improved system; break this down by year/month of publication, gender, and category of article.
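
A minimal sketch of this breakdown, assuming each mentioned person has already been reduced to one record. The field names (`month`, `category`, `gender`) and the sample records are hypothetical, not the project's actual data format.

```python
from collections import Counter, defaultdict


def gender_fractions(mentions, key):
    """Return {group: {gender: fraction}} for mentions grouped by `key`."""
    counts = defaultdict(Counter)
    for mention in mentions:
        counts[mention[key]][mention["gender"]] += 1

    fractions = {}
    for group, counter in counts.items():
        total = sum(counter.values())
        fractions[group] = {g: n / total for g, n in counter.items()}
    return fractions


# Hypothetical records, one per person mentioned in an article.
mentions = [
    {"month": "2016-01", "category": "startups", "gender": "male"},
    {"month": "2016-01", "category": "startups", "gender": "female"},
    {"month": "2016-01", "category": "apps", "gender": "male"},
    {"month": "2016-02", "category": "apps", "gender": "unknown"},
]

by_month = gender_fractions(mentions, "month")
by_category = gender_fractions(mentions, "category")
```

Keeping "unknown" as its own gender value makes it visible how many people the system could not classify in each group.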

More broadly

  • Try this on other data sources.
  • Simple visualization UI which allows people to plug in an article and see our output.
@viswajithiii

viswajithiii commented Dec 16, 2016

State of the project, Dec 16

Goal for the Quarter:

Develop a system to identify sources in articles, and tag them by gender. This would be the first step, and a cornerstone, of scoring articles on gender diversity.

Sub-problems:

  • Identify mentions of people in text.
  • Identify the mentioned person’s gender.
  • Identify which mentions are sources, and which mentions are just in passing.

Updates

  • Data: We’ve focused mostly on tech news, specifically on a dataset with all of TechCrunch. We've also used data from The New York Times, and updated our models in areas where they didn't generalize.
  • Individual vs Aggregate: We've found that it’s hard to find a meaningful gender diversity score on an individual article basis. Most articles mention only organizations and no people at all. In TechCrunch, a lot of articles have only one source, who is often a CXO of the company, and very frequently, male. We've decided to, for now, work with aggregate statistics.
  • Overview of workflow: Scrape raw text and metadata, then annotate the text using Stanford CoreNLP. CoreNLP does several useful things, like parsing, named entity recognition and coreference resolution. Then, from the text and the annotations together, we extract the information that's relevant to us.
  • Identifying mentions of people in text: This was reasonably straightforward -- we use Stanford CoreNLP's named entity recognizer. This tags words corresponding to named entities, and classifies them into person/organization/location. We then run a few heuristics over the person names to collect all mentions of an entity into one.
  • Identifying gender from a name:
    • From first name: We use census baby names data (plus some common international names that we added to the system manually) to go from first name to gender. This doesn't work for names that are used for both genders.
    • From coreference: We use coreference with gendered pronouns as evidence of the gender of the person. Our coreference system is high-precision, low-recall: when it marks something as coreferent, it's usually right, but it might miss out on some coreferent mentions. Also, many names just don't have a coreferent pronoun in the article. We do NOT use the gender CoreNLP returns directly: it is unreliable, and seems to have a lot of stereotypes encoded into it.
  • Identifying sources: For each quote, we identify who said it, and match the speaker to one of the people mentioned in the article. From this, we know everything each person in the article has said. We also identify which verbs each person (or a pronoun they are coreferent with) is the subject of -- if they are the subject of a verb that indicates a quote, like say/tell/ask, we guess they are a source. We also have the location in the article of each of these quotes, which can help us judge its importance. It is still hard to distinguish between subject and source.
  • Adjectives associated with a person: We use dependency parsing to extract adjectives associated with a person. This is still in the early stages, so it misses a lot.
  • Blog: We would like to write up some of these results, and we've put up a skeleton of a blog at https://viswajithiii.github.io/gendermeme/.
  • Web UI: We've built a system where a user can put in an article text, and we return a breakdown of the gender-related information we've computed.
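
The gender and source heuristics in the list above can be sketched roughly as follows. This is an illustration under stated assumptions, not our actual code: the lookup tables are tiny hypothetical stand-ins for the census data, the choice to prefer pronoun evidence over the first-name lookup is an assumption, and the inputs (coreferent pronouns, verb lemmas, quote counts) are assumed to have been extracted from the CoreNLP annotations already.

```python
from collections import Counter

# Hypothetical stand-in for the census baby-name table; names used for
# both genders are simply left out, so they resolve to "unknown".
NAME_GENDER = {"mary": "female", "john": "male"}

PRONOUN_GENDER = {
    "he": "male", "him": "male", "his": "male",
    "she": "female", "her": "female", "hers": "female",
}

# Verb lemmas that indicate a quote (from the examples say/tell/ask).
QUOTE_VERBS = {"say", "tell", "ask"}


def infer_gender(full_name, coref_pronouns):
    """Guess gender from coreferent pronouns, else from the first name.

    Assumption: pronoun evidence wins when it is unambiguous; on a tie
    or with no pronouns we fall back to the first-name lookup.
    """
    votes = Counter(
        PRONOUN_GENDER[p.lower()]
        for p in coref_pronouns
        if p.lower() in PRONOUN_GENDER
    )
    ranked = votes.most_common(2)
    if len(ranked) == 1 or (len(ranked) == 2 and ranked[0][1] > ranked[1][1]):
        return ranked[0][0]

    parts = full_name.split()
    first = parts[0].lower() if parts else ""
    return NAME_GENDER.get(first, "unknown")


def is_likely_source(num_quotes, subject_verb_lemmas):
    """Guess that a person is a source if a quote was attributed to them,
    or if they are the subject of a quote-indicating verb."""
    return num_quotes > 0 or any(v in QUOTE_VERBS for v in subject_verb_lemmas)
```

For example, `infer_gender("Alex Smith", ["she", "her"])` relies on the pronouns alone, while `infer_gender("John Doe", [])` falls back to the name table.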

Next Steps

  • Improve the extraction of adjectives and other words associated with different entities in our articles.
  • Explore more comprehensive databases/APIs to get gender from name. (Right now, we can only do American names.)
  • We have systems that look fairly decent on the outside, with a few errors here and there. We need to quantify the amount of error, to really understand how well we're doing and what we're missing.
  • Figure out how to use our metrics to come up with a measure of gender diversity. Maybe look at how the same story is reported by different organizations. TechMeme will help.
  • What are things we can do with this system?
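
For the error-quantification step in the list above, one common approach is to hand-label a sample of articles and compute precision and recall against those labels. A minimal sketch, assuming we have a set of predicted source names and a hand-labeled set for the same article (the hand-labeled data is hypothetical; we have not built it yet):

```python
def precision_recall(predicted, gold):
    """Compare predicted items against hand-labeled ones.

    precision = correct predictions / all predictions
    recall    = correct predictions / all hand-labeled items
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical example: sources found by the system vs. by a human reader.
predicted_sources = {"Person A", "Person B", "Person C"}
labeled_sources = {"Person A", "Person B", "Person D", "Person E"}
precision, recall = precision_recall(predicted_sources, labeled_sources)
```

The same function could be applied to person mentions or to gender labels, as long as both sides are expressed as comparable sets.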

Interesting Examples

https://techcrunch.com/2016/02/16/reedsy-launches-book-editor-to-seamlessly-turn-your-draft-into-a-book/ (Romain Dillet)
CoreNLP's gender annotations reflect stereotypes. Even though this article is about an app for working with a copy editor, a publicist, an illustrator and a book editor, CoreNLP confidently ascribes a gender to each role, as seen below.
1 [(u'a copy editor', u'NOMINAL', u'SINGULAR', u'MALE', 16)]
1 [(u'a publicist', u'NOMINAL', u'SINGULAR', u'FEMALE', 16)]
1 [(u'an illustrator', u'NOMINAL', u'SINGULAR', u'MALE', 16)]
1 [(u'cover illustrator', u'NOMINAL', u'SINGULAR', u'MALE', 2)]
1 [(u'a book editor', u'NOMINAL', u'SINGULAR', u'MALE', 3)]
https://techcrunch.com/2016/04/05/the-awl-and-other-indie-publishers-are-moving-to-medium/ (Jacob Carlson)
Mentions no sources, but mentions something about investments made by Shaquille O’Neal, Alex Rodriguez, Jimmy Collins and Rick Fox.
https://techcrunch.com/2016/07/19/pokemon-go-popular-places/ (Darrell Etherington)
Mentions no sources, but we mistakenly identify Fred Meyer and Victoria (from Victoria’s Secret) as people.
https://techcrunch.com/2016/01/20/glassbreakers-raises-2-million-for-its-diversity-and-inclusion-software/
Article is about diversity and inclusion, but only quotes the investor, who is male.

@viswajithiii

Notes from meeting:

  • Run what we have on New York Times corpus.
  • Do lit review.

@viswajithiii

Minutes of meeting December 17th:

  • Can we extract gender from sentences like 'Michelle Obama looks like her daughter'? Here, 'Michelle' and 'her' are not strictly coreferent, but the pronoun 'her' is associated with Michelle.
