Semi-automated Matching to Common Schema(1): Scrape Data #8

mjia8 · 2023-06-19T15:55:26Z

Team members: Aadit, Michael, Jacqueline
Sprint 4: 6/19-6/26

Overall goal:
Create a function that scrapes every crosswalk from the source metadata files. Then, for each column name in the target schema, collect the input column names that were matched to it and look for useful patterns. We want to introspect the current crosswalks and build a lookup table between input column names and target-schema column names, so that for a new dataset we can search the past 100-200 crosswalks and identify one that might be a good starting point.
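As a rough illustration of the lookup table described above, here is a minimal sketch. It assumes the crosswalks have already been scraped into plain Python dicts; the sample keys and header names (`address`, `latitude`, `Site Address`, and so on) are made up for illustration and are not taken from the real metadata files.

```python
from collections import Counter, defaultdict


def normalize(name):
    """Lowercase a header name and collapse underscores and repeated spaces."""
    return " ".join(name.lower().replace("_", " ").split())


def build_lookup(crosswalks):
    """Map each target-schema key to a Counter of input column names matched to it."""
    lookup = defaultdict(Counter)
    for crosswalk in crosswalks:
        for target_key, source in crosswalk.items():
            # Only plain header names are counted; function-valued entries
            # (anything that is not a string here) are skipped in this sketch.
            if isinstance(source, str):
                lookup[target_key][normalize(source)] += 1
    return lookup


# Hypothetical crosswalks, for illustration only.
SAMPLE_CROSSWALKS = [
    {"address": "Site Address", "latitude": "LAT"},
    {"address": "site_address", "latitude": "Latitude"},
    {"address": "ADDRESS"},
]

lookup = build_lookup(SAMPLE_CROSSWALKS)
for target_key, counts in lookup.items():
    print(target_key, counts.most_common())
```

Counting normalized names per target key is only one possible design; it makes the most frequent input names for each target column easy to read off, but it ignores entries that are computed by functions.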

Problem Statement:
Our datasets name their fields in many different ways (e.g. location can be given as an address, as latitude and longitude, or as miles from a city center). Every time we add a new dataset, someone has to look through the data manually and work out how to transform it into the target schema. This has already been done for a lot of datasets, and we would like to find patterns in how datasets are structured so that we can eventually generate a draft of the correct mappings based on previous mappings. This would save a lot of time as we try to add thousands more datasets.

What does success look like?

First, get the data for 100-200 crosswalks and try to identify any patterns (list out maybe 3-5).
Then, test your ideas about these patterns on a larger set of crosswalks and see if they still hold true.
Summarize your findings for the team!
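One possible way to check whether a pattern holds on a larger set is to measure how often the input names seen in the first batch also cover the later crosswalks. The sketch below assumes crosswalks are available as Python dicts and that "known" names are stored lowercased; the function name `coverage` and all sample data are hypothetical.

```python
def coverage(known, held_out):
    """Fraction of plain-header entries in held_out whose input name was already known.

    known:    dict mapping target key -> set of lowercased input names seen before
    held_out: list of crosswalk dicts that were not used to build `known`
    """
    hits = 0
    total = 0
    for crosswalk in held_out:
        for target_key, source in crosswalk.items():
            if not isinstance(source, str):
                continue  # function-valued entries are ignored in this sketch
            total += 1
            if source.lower() in known.get(target_key, set()):
                hits += 1
    return hits / total if total else 0.0


# Hypothetical data, for illustration only.
KNOWN = {
    "address": {"address", "site address"},
    "latitude": {"lat", "latitude"},
}
HELD_OUT = [
    {"address": "Site Address", "latitude": "Y_COORD"},
    {"address": "ADDRESS", "longitude": "Long"},
]

print(f"coverage: {coverage(KNOWN, HELD_OUT):.2f}")
```

A low score would suggest that the patterns from the first 100-200 crosswalks do not generalize, or that the names need more normalization before they are compared.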

Comments:

The functions were written in similar ways, so regular expressions may be a helpful and efficient way to parse them.
A crosswalk is a dictionary (also called an Object in JS) where each key is a target data label and each value is either 1. a header name in the data, or 2. a function that produces a processed value from one or more headers.
You'll notice that the set of keys is not always consistent. This is because there are special keys that convert different units to the target schema. You can find the true target schema in the readme. Datasets are often missing data, so many target-schema keys do not appear in a given crosswalk.
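To make the regular-expression idea concrete, here is a minimal sketch that pulls a crosswalk out of source text and classifies each value as a header name or a function. The file layout in `SAMPLE` is an assumption, not the actual format of the metadata files: it assumes a `crosswalk: { ... }` object with one entry per line and functions that read headers as `x['Header']`. All names in it are hypothetical.

```python
import re

# Hypothetical excerpt of a source metadata file; the real layout may differ.
SAMPLE = """
{
  id: 'example_city',
  crosswalk: {
    city: 'City',
    address: x => x['Number'] + ' ' + x['Street'],
    latitude: 'LAT',
  },
}
"""

# Body of the crosswalk object, up to the first line holding only a closing brace.
BLOCK_RE = re.compile(r"crosswalk\s*:\s*\{(?P<body>.*?)\n\s*\}", re.DOTALL)
# One "key: value," entry per line.
ENTRY_RE = re.compile(r"^\s*(?P<key>\w+)\s*:\s*(?P<value>.+?),?\s*$", re.MULTILINE)
# A value that is nothing but a single quoted string is treated as a header name.
QUOTED_RE = re.compile(r"^(['\"])(?P<name>[^'\"]*)\1$")
# Header names referenced inside a function body, e.g. x['Street'].
HEADER_REF_RE = re.compile(r"\[\s*['\"]([^'\"]+)['\"]\s*\]")


def scrape_crosswalk(text):
    """Return {target_key: {"kind": "header" | "function", "headers": [...]}}."""
    block = BLOCK_RE.search(text)
    if block is None:
        return {}
    result = {}
    for entry in ENTRY_RE.finditer(block.group("body")):
        value = entry.group("value").strip()
        quoted = QUOTED_RE.match(value)
        if quoted:
            info = {"kind": "header", "headers": [quoted.group("name")]}
        else:
            info = {"kind": "function", "headers": HEADER_REF_RE.findall(value)}
        result[entry.group("key")] = info
    return result


for key, info in scrape_crosswalk(SAMPLE).items():
    print(key, info)
```

This approach breaks on multi-line functions or nested braces; if the real files contain those, a JavaScript parser would likely be more reliable than regular expressions.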

mjia8 changed the title ~~Comparing hashes of data to prevent saving the same data~~ Mapping each dataset to a common data schema in some semi-automated way Jun 19, 2023

mjia8 assigned AaditT, michaelpregan and jacquelinechernen Jun 19, 2023

mjia8 changed the title ~~Mapping each dataset to a common data schema in some semi-automated way~~ Semi-automated Matching to Common Schema(1): Scrape Data Jun 20, 2023


mjia8 commented Jun 19, 2023 · edited by xgz2
