minutes.txt

- add .conll and .meta file information to JSON files on a per review basis
	- add a field 'tokenized_text' to each review in reviews
	- add a field 'pos' to each review in reviews
	- add a field 'langid' to each review in reviews (currently derived from the CoNLL file, not from a separate langid.py analysis)
	DONE!
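
Sketch of the per-review enrichment above. Assumptions, none of them confirmed by these notes: the CoNLL file is tab-separated with the token in column 2 and the POS tag in column 4 (CoNLL-X layout), both new fields are stored as space-joined strings, and the language id is passed in by the caller. enrich_review is a hypothetical helper name.

```python
FORM_COLUMN = 1  # assumed: token form in the second column
POS_COLUMN = 3   # assumed: coarse POS tag in the fourth column


def enrich_review(review, conll_text, langid):
    """Return a copy of the review with tokenized_text, pos and langid added."""
    tokens, tags = [], []
    for line in conll_text.splitlines():
        # Skip sentence breaks and comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        columns = line.split("\t")
        if len(columns) <= POS_COLUMN:
            continue  # malformed line, ignore
        tokens.append(columns[FORM_COLUMN])
        tags.append(columns[POS_COLUMN])
    enriched = dict(review)
    enriched["tokenized_text"] = " ".join(tokens)
    enriched["pos"] = " ".join(tags)
    enriched["langid"] = langid
    return enriched
```

If the real files use another column layout, only the two constants should need changing.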

- ensure doc_ids can be generated from .meta files
- make 2 DBs: by review and by person
- as ID use userID + review#
- is the date format accepted by SOLR?
- fields that should not be sub-analyzed/tokenized: gender, company_id
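
Sketch of the ID scheme and the by-person view from the list above. The "_" separator and the 'user_id' key are assumptions; make_doc_id and group_by_person are hypothetical helper names.

```python
from collections import defaultdict


def make_doc_id(user_id, review_number):
    """Build a document ID from the user ID plus the review number."""
    # Assumed separator; any character that cannot occur in a user ID works.
    return f"{user_id}_{review_number}"


def group_by_person(reviews):
    """Collect reviews per user, as input for the by-person DB."""
    by_person = defaultdict(list)
    for review in reviews:
        by_person[review["user_id"]].append(review)
    return dict(by_person)
```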

ACTION:
- DH: add fields to JSON files
- AJ: modify import fields
==================================================================================================================

- index words manually, to allow for search over lower-cased and actual-cased strings. For now, use lower case.
- have a mode that discovers existing spelling variations of the target word, and allow the user to include them
- Note potential problem: tokenization is based on CoNLL, so it might include titles but exclude single-word files
- what age groups do we want? Binning by 10 years, plus at least one other binning
- check SOLR user guide for field options
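
Sketch of the 10-year age binning mentioned above. It assumes the age is available as a whole number of years; the label format ("30-39") and the name age_group are illustrative choices. The width parameter leaves room for the second binning that is still to be decided.

```python
def age_group(age, width=10):
    """Map an age in years to a bin label such as '30-39'."""
    if age < 0:
        raise ValueError("age must be non-negative")
    low = (age // width) * width
    return f"{low}-{low + width - 1}"
```

For example, age_group(34) falls into "30-39", and age_group(34, width=5) into "30-34".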

ACTION:
- AJ: fiddle with imports
- DH: find age ranges

==================================================================================================================
- central file: humboldt.py, copy things over incrementally
- file with definitions (SOLR location, helper scripts)

- TODO: generate empty homepage from humboldt.py (use humboldt.html): DONE
- TODO: integrate existing layout: DONE
- TODO: include queries: DONE

Steps to a query:
- select language:country pair target {from EN:GB, DA:DK}
- enter search term (later: add syntax for special searches. For now: add tick boxes)
- find out how many reviews the term matches
- collect gender and age stats
- collect regional distribution (for now: tabulate NUTS regions)
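
Sketch of how the steps above could map onto a single SOLR request. q, fq, rows, facet, facet.field and wt are standard SOLR parameters; the core URL and the field names 'country', 'agegroup' and 'NUTS_3' are assumptions, and build_query_url is a hypothetical helper. The search term is not escaped for SOLR query syntax here.

```python
from urllib.parse import urlencode

# Hypothetical core location; 8983 is the default SOLR port.
SOLR_SELECT_URL = "http://localhost:8983/solr/reviews/select"


def build_query_url(term, language, country):
    """Build a faceted select URL for one search term and language:country pair."""
    params = [
        ("q", f"tokenized_text:{term.lower()}"),  # lower case for now
        ("fq", f"langid:{language.lower()}"),
        ("fq", f"country:{country.upper()}"),      # assumed field name
        ("rows", "0"),                             # only counts are needed
        ("facet", "true"),
        ("facet.field", "gender"),
        ("facet.field", "agegroup"),               # assumed field name
        ("facet.field", "NUTS_3"),                 # assumed field name
        ("wt", "json"),
    ]
    return SOLR_SELECT_URL + "?" + urlencode(params)
```

In the JSON response, response.numFound should give the number of matching reviews, and facet_counts.facet_fields the gender, age and region counts.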

Separate tab for comparison query:
- enter two terms
- present differences rather than individual plots
- identify key differences and summarize w/ significance tests
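
Sketch of "differences rather than individual plots": turn the facet counts of each term into percentages and subtract them. The input is assumed to be a plain dict of category to count per term; shares and share_differences are hypothetical helper names.

```python
def shares(counts):
    """Convert raw counts per category into percentages of the total."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {key: 100.0 * value / total for key, value in counts.items()}


def share_differences(counts_a, counts_b):
    """Percentage-point difference (term A minus term B) per category."""
    shares_a, shares_b = shares(counts_a), shares(counts_b)
    categories = sorted(set(shares_a) | set(shares_b))
    return {c: shares_a.get(c, 0.0) - shares_b.get(c, 0.0) for c in categories}
```

Categories with the largest absolute difference are natural candidates for the "key differences" summary.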

ACTION:
- AJ: fill DBs

=================================================================================================================

How do we distinguish the countries? Rather than using location info, define on import with separate flag? DONE!
Add field for language: DONE!
Add another script to reset the DB: DONE

ACTION:
- get query from the website to the DB and display the results: DONE
- change "NUTS-X" to "NUTS_X" in jsonl files via sed, and change in enrich...py: DONE

=================================================================================================================
- make config file with country centers/bounding boxes and zoom-level: DONE
- add the possible country:language pairs into config: DONE
- include maps into results page: DONE
- make bobble heads smaller and switch order: DONE
- switch to NUTS-3: DONE
- add bibtex ref and other citation stuff: DONE
- intro page (link paper, explanation of data, and overview plus worked examples [with links]): DONE
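
Sketch of what the map config could look like as a Python structure. The layout, key names and helper are assumptions, and the coordinates and zoom levels are rough placeholder values for illustration, not the ones used by the project.

```python
# Per-country map settings: center and bbox are (lat, lon) in degrees.
COUNTRY_CONFIG = {
    "GB": {
        "languages": ["EN"],
        "center": (54.0, -2.5),               # placeholder value
        "bbox": ((49.9, -8.2), (60.9, 1.8)),  # placeholder (south-west, north-east)
        "zoom": 5,                            # placeholder value
    },
    "DK": {
        "languages": ["DA"],
        "center": (56.0, 10.0),               # placeholder value
        "bbox": ((54.5, 8.0), (57.8, 15.2)),  # placeholder (south-west, north-east)
        "zoom": 6,                            # placeholder value
    },
}


def language_country_pairs(config):
    """List the selectable language:country pairs, e.g. 'EN:GB'."""
    return sorted(
        f"{language}:{country}"
        for country, entry in config.items()
        for language in entry["languages"]
    )
```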

PLOTS/INFO:
- number of men and women, as totals and percentages
- age distro
- two plots (by gender) over age
- NUTS map

================================================================================================================

- get totals for each facet to compute percentages: DONE
- get NUTS info into javascript: DONE
- add combined fields: agegroup+gender, region+gender: DONE
- add two-term query results: DONE
- add basic stats and significance tests into output: contingency table, scipy.stats.chi2_contingency: DONE
- How do we best create age-distro subplots for each gender? DONE
- include error handling if search term is not found: DONE
- add region names instead of NUTS code in list of outliers: DONE
- go live...: DONE!
- age plots have the wrong x-label range!: DONE
- change query names to query: DONE
- add France (missing NUTS info): DONE
- add POS to Dutch TP data: DONE
- add Dutch/NL from TP: DONE
- add an import field 'source_version' (TP-2014, Twitter-DA, etc.): DONE
- add totals for each plot! => check totals are right and include language!: DONE
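
Sketch of the significance test on a 2x2 contingency table (for example two query terms by gender), written in plain Python to show what is being computed. This is not the project code: the site uses scipy.stats.chi2_contingency, which by default applies Yates' continuity correction to 2x2 tables, so its numbers only match this sketch when called with correction=False. chi_square_2x2 is a hypothetical name.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square test for a 2x2 table, without continuity correction.

    Returns (statistic, p_value). All row and column totals must be non-zero.
    """
    (a, b), (c, d) = table
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    n = a + b + c + d
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    # With one degree of freedom, the chi-square survival function
    # reduces to erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value
```

Usage: chi_square_2x2([[women_a, men_a], [women_b, men_b]]), where the four counts come from the gender facet of each term.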

==============================================================================================================

TODO:
- add Danish Twitter
- add Dutch and Belgian CSP data: produce TP-like JSON file and run scripts
- add worked examples on front page: how do we link to a results page?
- at the end of side-by-side, include links to the individual queries
- add distribution over sources
- add notice: work in progress
- add banner: donate your data!

NICE TO HAVE AT SOME POINT:
- add link to results JSON
- add tooltips to map with NUTS code, count, rel. freq. in region
- Soundex: find the "phonological signature" of a word
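
Sketch of the classic American Soundex code as one possible "phonological signature". Note that Soundex was designed for English names, so how well it fits Danish or Dutch words is an open question; letters outside a-z are simply treated like vowels here.

```python
SOUNDEX_GROUPS = (
    ("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
    ("l", "4"), ("mn", "5"), ("r", "6"),
)
SOUNDEX_CODES = {ch: digit for group, digit in SOUNDEX_GROUPS for ch in group}


def soundex(word):
    """Return the 4-character Soundex code of a word ('' if it has no letters)."""
    letters = [ch for ch in word.lower() if ch.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    previous = SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = SOUNDEX_CODES.get(ch, "")
        if code and code != previous:
            result += code
        previous = code  # vowels reset the run, so a repeated code counts again
    return (result + "000")[:4]
```

Spelling variants that share a code (for example "Robert" and "Rupert") could then be offered to the user as candidate variations of the target word.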