Smoking-Gun-Classification/project-plan.md at master · JohnRStarr/Smoking-Gun-Classification · GitHub

Project Plan

Sean Steinle | [email protected]

cleaning/prepping data, loading into a DF, all that jazz.
- I understand this will be harder than I'm making it sound, but I'm not sure what the exact steps are.

undergo basic sentiment analysis on all Enron emails
- general plan:
  1. use pre-loaded, simple tagger (likely just Naive-Bayes)
- questions:
  1. how to partition?
  2. I need tags(i.e. "blah blah blah happy email here", "POS"), right?
    1. get tags from pre-loaded tagger
    2. manual?

customize base ML model
- general plan:
  - tweak the original sentiment analysis model with my own features
    - I'm almost certainly at the mercy of the best-fitting model out there... right?
- questions:
  1. which feats?
    1. should I be reading books about Enron for more domain specific knowledge?
    2. unsupervised ML (Jevon mentioned this)
  2. which technique of ML would be most effective (and practical??)

data analysis
- general plan:
  1. basic metrics:
    1. classifier accuracy
    2. classifier confidence accuracy (I would really like a probabilistic rating schema for each email, 1-10 or something with 10 being incriminating)
  2. effectiveness on other email corpora?
    1. this is probably very hard if at all possible, but it would be really cool.