-
Notifications
You must be signed in to change notification settings - Fork 108
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merge underfilled_job_title
with employee_position_title
in employee_salaries
#581
Conversation
This comment was marked as resolved.
This comment was marked as resolved.
I just had a look at the PR. I do not think that I am in favor of adding the new, cleaner, column. In terms of philosophy, I would like us to try as hard as we can to improve our tools to work on the data as it is, rather than to change the data. This means that we must stare at our examples and wonder what makes them ugly, and then see if we can provide functionality to make them less ugly. With regards to using more the upstream scikit-learn code, yes, I'm a thousands time in favor of doing that. |
underfilled_job_title
with employee_position_title
in employee_salaries
…ve_fetching # Conflicts: # skrub/datasets/_fetching.py
I agree that as much as we can, we should use appropriate tools, but in this specific instance, I think merging them in advance is the best option. Of course if we have a tool designed for this type of issue down the road, we can re-introduce it, but currently, this merge is something we do a lot in the new examples (#546), and it would simplify them quite a bit. |
I might have missed something, but why do you need to overwrite the |
To me, underfilled job title is a column that gives more specific information about the job title. Let me demonstrate: >>> from skrub.datasets import fetch_employee_salaries
>>> dataset = fetch_employee_salaries()
>>> X = dataset.X # alias
>>> # Filter, keep only the jobs that contain "Fire"
>>> X = X[X["employee_position_title"].str.contains("Fire")]
>>> X[["employee_position_title", "underfilled_job_title", "date_first_hired"]].head(10)
employee_position_title underfilled_job_title date_first_hired
8 Firefighter/Rescuer III Firefighter/Rescuer I (Recruit) 12/12/2016
42 Firefighter/Rescuer III NaN 10/09/2006
107 Firefighter/Rescuer III NaN 05/08/2011
128 Fire/Rescue Captain NaN 02/26/1990
132 Firefighter/Rescuer III Firefighter/Rescuer II 03/10/2014
142 Firefighter/Rescuer III NaN 03/17/2008
152 Master Firefighter/Rescuer NaN 01/30/2006
157 Fire/Rescue Captain NaN 09/11/2000
158 Firefighter/Rescuer III NaN 03/17/2008
167 Firefighter/Rescuer III Firefighter/Rescuer II 03/10/2014 When there is a value, Also, for reference: https://chat.openai.com/share/d4a00de6-d10b-4c5a-af19-43757fb795cf |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I see the difference though I don't feel it's crucial. But it's good to have it as an option.
I agree with merging this with False
by default (you might easily use it for examples). WDYT?
You're right, it's not crucial, but it unloads some boilerplate from the examples, which I think is a big benefit. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey! The point of our examples is to showcase our features, not to explain how to use pandas or this dataset specifically, IMHO.
I agree with @LilianBoulard that we should do this quick preprocessing by default to simplify the examples, even though having it in the example is not dramatic or ugly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I became convinced :).
Merging, thank you!
…oyee_salaries` (skrub-data#581) * Add `overload_job_titles` parameter to `fetch_employee_salaries` * Add changelog entry * Fix path
This PR adds a parameter to
fetch_employee_salaries
, so the main dirty column is overloaded with another column that adds some new information (from my understanding).