This repository has been archived by the owner on Sep 24, 2024. It is now read-only.
Team members: Aadit, Michael, Jaqueline
Sprint 4: 6/19-6/26
Overall goal:
Create a function that scrapes all of the crosswalks out of the source metadata files. Then, for each column name in the target schema, collect the input column names that were matched to it and look for useful patterns. We want to introspect the existing crosswalks and build a lookup table between input column names and target schema column names, so that we can review the past 100-200 crosswalks and, for a new dataset, identify an existing crosswalk that might be a good starting point.
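One way to picture the lookup table is as an inverted index from each target column to the input names that have been mapped to it. The sketch below is only an illustration: the sample crosswalks and column names (`common_name`, `latitude`, `longitude`) are hypothetical, and it assumes the crosswalks have already been scraped into plain dictionaries whose values are header-name strings.

```python
from collections import Counter, defaultdict

# Hypothetical sample data: each crosswalk maps target column -> input header.
crosswalks = [
    {"common_name": "Species", "latitude": "LAT", "longitude": "LONG"},
    {"common_name": "species", "latitude": "Y", "longitude": "X"},
    {"common_name": "COMMON", "latitude": "lat"},
]


def build_lookup(crosswalks):
    """Return {target column: Counter of normalized input header names}."""
    lookup = defaultdict(Counter)
    for crosswalk in crosswalks:
        for target, source in crosswalk.items():
            # Only direct header mappings are counted here; function-valued
            # entries would need their referenced headers extracted first.
            if isinstance(source, str):
                lookup[target][source.strip().lower()] += 1
    return lookup


lookup = build_lookup(crosswalks)
for target, names in lookup.items():
    print(target, names.most_common())
```

Lower-casing the input names is a design choice made for this sketch; it merges `Species` and `species` into one entry, which may or may not be what the team wants.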
Problem Statement:
Our datasets use many different names and representations for the same information (e.g. location can be described with an address, with latitude and longitude, or as miles from a city center). Every time we add a new dataset, someone has to look through the data manually and work out how to transform it into the target schema. This has already been done for a lot of datasets, and we would like to find patterns in how datasets are structured so that we can eventually generate a draft of the correct mappings based on previous mappings. This would save a lot of time as we try to add thousands more datasets.
What does success look like?
First, get the data for 100-200 crosswalks and try to identify any patterns (list out maybe 3-5).
Then, test your ideas about these patterns on a larger set of crosswalks and see if they still hold true.
Summarize your findings for the team!
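To check whether a candidate pattern still holds on a larger set, one option is to measure how often it is true across all direct header mappings. The following is a minimal sketch under stated assumptions: `hit_rate` and `contains_target` are hypothetical helpers, the sample column names are made up, and the pattern tested ("the input name contains the target name") is just one example of an idea you might try.

```python
def hit_rate(crosswalks, pattern):
    """Fraction of direct (string-valued) mappings for which pattern holds."""
    hits = 0
    total = 0
    for crosswalk in crosswalks:
        for target, source in crosswalk.items():
            if not isinstance(source, str):
                continue  # skip function-valued entries in this sketch
            total += 1
            if pattern(target, source):
                hits += 1
    return hits / total if total else 0.0


def contains_target(target, source):
    """Candidate pattern: the input name contains the target name."""
    return target.lower() in source.lower()


# Hypothetical sample crosswalks, not taken from the real metadata files.
sample = [
    {"height": "Tree_Height", "species": "SPECIES"},
    {"height": "HT", "species": "species_name"},
]
rate = hit_rate(sample, contains_target)
print(f"pattern holds for {rate:.0%} of mappings")
```

Running the same function on the initial 100-200 crosswalks and then on the larger set gives two numbers that can be compared directly in the summary.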
Comments:
The crosswalk functions were written in similar ways, so regular expressions may be a helpful and efficient way to parse them.
Crosswalks are a dictionary (also called an Object in JS) where the key is the target data label and the value is either (1) a header name in the data or (2) a function that uses one or more header names to produce a processed value.
You'll notice that the set of keys is not always consistent. This is because there are special keys that automatically convert values given in different units into the target schema. You can find the true target schema in the readme. Datasets are often missing data, so many target schema keys do not appear in a given crosswalk.
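As a rough illustration of the regular-expression approach, the sketch below pulls a crosswalk out of JS source text and records, for each target key, whether the value is a plain header name or a function, along with the headers it references. The layout is assumed, not taken from the real files: the object is stored under a key named `crosswalk`, each entry sits on its own line, and functions are arrow functions whose parameter is named `x`. The sample keys and headers are hypothetical.

```python
import re

# Hypothetical excerpt of a source metadata file (layout is an assumption).
SAMPLE = """
crosswalk: {
    common: 'SPECIES',
    height: x => x.HEIGHT_FT * 0.3048,
    location: x => x['ADDRESS'] + ', ' + x.CITY,
},
"""

BLOCK = re.compile(r"crosswalk\s*:\s*\{(.*?)\n\s*\}", re.S)
ENTRY = re.compile(r"\s*(\w+)\s*:\s*(.+?),?\s*$")
DIRECT = re.compile(r"^(['\"])([^'\"]*)\1$")
# Header references inside a function body: x.NAME or x['NAME'].
HEADER_REF = re.compile(r"\bx(?:\.(\w+)|\[['\"]([^'\"]+)['\"]\])")


def parse_crosswalk(text):
    """Return {target key: (kind, [input headers])} for one crosswalk."""
    block = BLOCK.search(text)
    if block is None:
        return {}
    result = {}
    for line in block.group(1).splitlines():
        entry = ENTRY.match(line)
        if entry is None:
            continue  # blank line or something this sketch does not handle
        key, value = entry.groups()
        direct = DIRECT.match(value)
        if direct:
            result[key] = ("header", [direct.group(2)])
        else:
            refs = [a or b for a, b in HEADER_REF.findall(value)]
            result[key] = ("function", refs)
    return result


parsed = parse_crosswalk(SAMPLE)
for key, (kind, headers) in parsed.items():
    print(key, kind, headers)
```

A line-based regex like this will miss multi-line functions and nested objects, so for anything beyond a first pass a real JS parser is probably the safer choice.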
mjia8 changed the title from "Comparing hashes of data to prevent saving the same data" to "Mapping each dataset to a common data schema in some semi-automated way" on Jun 19, 2023
mjia8 changed the title from "Mapping each dataset to a common data schema in some semi-automated way" to "Semi-automated Matching to Common Schema(1): Scrape Data" on Jun 20, 2023