-
Notifications
You must be signed in to change notification settings - Fork 15
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
When sanitizing data, use periods instead of semicolons #7
base: main
Are you sure you want to change the base?
Conversation
question answer options are separated using `;`, so using `;` to sanitize `,` causes a bug further down the line in analysis: [https://github.com/onnela-lab/beiwe-backend/issues/53](https://github.com/onnela-lab/beiwe-backend/issues/53)
@biblicabeebli is there any reason why this shouldn't be merged in? It definitely seems like it would fix some issues in reading survey_answers files I've been coming across. |
I was not aware of data-analysis-level issues, can you create issues where appropriate and/or direct me to them? Note: the core problem here is that naive csv parsers don't respect "commas inside quotes don't count", so that needs to be doublechecked in the analysis code. I know data processing is naive this way (but could probably be changed be changed). Work is already underway on the data processing code, which is being refactored to allow for modification and extension. HOWEVER, This work only impacts survey timings because survey answers are not actually processed. The final statement on the referenced issue about this work needing to be rolled into a v2 [of file format specifications] remains my preference, if nothing else because this is not the only file format issue. Changing data/file formatting without an announcement or convenient transition point is potentially an administrative disaster. That said we could make a varient of this change for answers files. I will try and determine if this particular fix affects timings or answers or both. @clementzach Could you make an internal tracking issue of source data issues, starting with this? A second item would be that there is no formal specification of the ios/android log file statements. There exist data issues that are resolved in data processing that should be resolved in the apps (where possible), I can populate those. |
I'll make internal tracking issues, but clarifying a couple things in this conversation:
This PR replaces semicolons with periods, so the "commas inside quotes don't count" issue won't be changed by this PR, correct?
survey_answers are processed in the most recent version of Sycamore in order to recover some of the data that has already been lost due to the data corruption issue |
And, one more thing: We should add sanitizing to replace all semicolons with periods as well (not just replacing commas with periods) to avoid problems that may happen if a semicolon is in a response option. |
How can I best help? |
As far as I can tell, this PR and the one on the android app will fix the problem. We just need to think about any downstream effects this change will have. Thanks so much! |
Question answer options are separated using semicolons, so using semicolons to sanitize commas causes a bug further down the line in analysis: onnela-lab/beiwe-backend#53