Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

When sanitizing data, use periods instead of semicolons #7

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

sguthrie
Copy link

Question answer options are separated using semicolons, so using semicolons to sanitize commas causes a bug further down the line in analysis: onnela-lab/beiwe-backend#53

question answer options are separated using `;`, so using `;` to sanitize `,` causes a bug further down the line in analysis: [https://github.com/onnela-lab/beiwe-backend/issues/53](https://github.com/onnela-lab/beiwe-backend/issues/53)
@clementzach
Copy link
Collaborator

@biblicabeebli is there any reason why this shouldn't be merged in? It definitely seems like it would fix some issues in reading survey_answers files I've been coming across.

@biblicabeebli
Copy link
Member

I was not aware of data-analysis-level issues, can you create issues where appropriate and/or direct me to them?

Note: the core problem here is that naive csv parsers don't respect "commas inside quotes don't count", so that needs to be doublechecked in the analysis code. I know data processing is naive this way (but could probably be changed be changed).

Work is already underway on the data processing code, which is being refactored to allow for modification and extension. HOWEVER, This work only impacts survey timings because survey answers are not actually processed.

The final statement on the referenced issue about this work needing to be rolled into a v2 [of file format specifications] remains my preference, if nothing else because this is not the only file format issue. Changing data/file formatting without an announcement or convenient transition point is potentially an administrative disaster. That said we could make a varient of this change for answers files. I will try and determine if this particular fix affects timings or answers or both.

@clementzach Could you make an internal tracking issue of source data issues, starting with this? A second item would be that there is no formal specification of the ios/android log file statements. There exist data issues that are resolved in data processing that should be resolved in the apps (where possible), I can populate those.

@clementzach
Copy link
Collaborator

I'll make internal tracking issues, but clarifying a couple things in this conversation:

Note: the core problem here is that naive csv parsers don't respect "commas inside quotes don't count", so that needs to be doublechecked in the analysis code. I know data processing is naive this way (but could probably be changed be changed).

This PR replaces semicolons with periods, so the "commas inside quotes don't count" issue won't be changed by this PR, correct?

HOWEVER, This work only impacts survey timings because survey answers are not actually processed.

survey_answers are processed in the most recent version of Sycamore in order to recover some of the data that has already been lost due to the data corruption issue

@clementzach
Copy link
Collaborator

And, one more thing: We should add sanitizing to replace all semicolons with periods as well (not just replacing commas with periods) to avoid problems that may happen if a semicolon is in a response option.

@sguthrie
Copy link
Author

How can I best help?

@clementzach
Copy link
Collaborator

As far as I can tell, this PR and the one on the android app will fix the problem. We just need to think about any downstream effects this change will have. Thanks so much!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants