forked from klaesra/WordlyExplorer
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathminutes.txt
109 lines (90 loc) · 4.66 KB
/
minutes.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
- add .conll and .meta file information to JSON files on a per review basis
- add a field 'tokenized_text' to each review in reviews
- add a field 'pos' to each review in reviews
- add a field 'langid' => currently derived from CoNLL file, not separate langid.py analysis
DONE!
- ensure doc_ids can be generated from .meta files
- make 2 DBs: by review and by person
- as ID use userID + review#
- is the date format accepted by SOLR?
- fields that should not be sub-analyzed/tokenized: gender, company_id
ACTION:
- DH: add fields to JSON files
- AJ: modify import fields
==================================================================================================================
- index words manually, to allow for search over lower-cased and actual-cased strings. For now, use lower case.
- have a mode that discovers existing spelling variations of the target word, and allow the user to include them
- Note potential problem: tokenization is based on CoNLL, might include titles, but exclude single-word files
- what age groups do we want? Binning by 10 years, plus at least one other one
- check SOLR user guide for field options
ACTION:
- AJ: fiddle with imports
- DH: find age ranges
==================================================================================================================
- central file: humboldt.py, copy things over incrementally
- file with definitions (SOLR location, helper scripts)
- TODO: generate empty homepage from humboldt.py (use humboldt.html): DONE
- TODO: integrate existing layout: DONE
- TODO: include queries: DONE
Step to a query:
- select language:country pair target {from EN:GB, DA:DK}
- enter search term (later: add syntax for special searches. For now: add tick boxes)
- find out how many reviews the term matches
- collect gender and age stats
- collect regional distribution (for now: tabulate NUTS regions)
Separate tab for comparison query:
- enter two terms
- present differences rather than individual plots
- identify key differences and summarize w/ significance tests
ACTION:
- AJ: fill DBs
=================================================================================================================
How do we distinguish the countries? Rather than using location info, define on import with separate flag? DONE!
Add field for language: DONE!
Add another script to reset the DB: DONE
ACTION:
- get query from the website to the DB and display the results: DONE
- change "NUTS-X" to "NUTS_X" in jsonl files via sed, and change in enrich...py: DONE
=================================================================================================================
- make config file with country centers/bounding boxes and zoom-level: DONE
- add the possible country:language pairs into config: DONE
- include maps into results page: DONE
- make bobble heads smaller and switch order: DONE
- switch to NUTS-3: DONE
- add bibtex ref and other citation stuff: DONE
- intro page (link paper, explanation of data, and overview plus worked examples [with links]): DONE
PLOTS/INFO:
- number of men and women, as totals and percentages
- age distro
- two plots (by gender) over age
- NUTS map
================================================================================================================
- get totals for each facet to compute percentages: DONE
- get NUTS info into javascript: DONE
- add combined fields: agegroup+gender, region+gender: DONE
- add two-term query results: DONE
- add basic stats and significance tests into output: contingency table, scipy.stats.chi2_contingency: DONE
- How do we best create age-distro subplots for each gender? DONE
- include error handling if search term is not found: DONE
- add region names instead of NUTS code in list of outliers: DONE
- go live...: DONE!
- age plots have the wrong x-label range!: DONE
- change query names to query: DONE
- add France (missing NUTS-info: DONE
- add POS to Dutch TP data: DONE
- add Dutch/NL from TP: DONE
- add an import field 'source_version' (TP-2014, Twitter-DA, etc.): DONE
- add totals for each plot! => check totals are right and include language!: DONE
==============================================================================================================
TODO:
- add Danish Twitter
- add Dutch and Belgian CSP data: produce TP-like JSON file and run scripts
- add worked examples on front page: how do we link to a results page?
- at the end of side-by-side, include links to the individual queries
- add distribution over sources
- add notice: work in progress
- add banner: donate your data!
NICE TO HAVE AT SOME POINT:
- add link to results JSON
- add tooltips to map with NUTS code, count, rel. freq. in region
- Soundex: find the "phonological signature" of a word