05_exercise
24.05.22
In this exercise session, we will appraise popular datasets used in research. Will we be able to understand the contained data and comprehend the steps of its creation?
- 10:15 - 10:20: Welcome and arrival
- 10:20 - 10:30: In-exercise task introduction
- 10:30 - 11:10: Task work
- 11:10 - 11:40: Group presentations
- 11:40 - 11:45: Goodbye and outlook
Slides for the fifth exercise.
Last week, we noticed some problems with the German Credit Data [1], which is hosted on the UCI Machine Learning Repository and included as an example dataset in multiple ML packages. Is this an isolated case, or can we find similar problems in other commonly used datasets?
- Decide on a dataset you want to analyze. Maybe think about a paper you recently read and a dataset they used for their research.
- Analyze the dataset with all the methods you know of. Below are some points you may consider during your analysis.
- Prepare a short presentation about your findings (~ 5 minutes).
- Upload the presentation to: ../hcds-summer-2022/tree/main/exercise/tasks/ex05_datasets
- [Optional] If you are done with your dataset (or you think it's perfect), start looking into another one.
- Does it contain metadata (e.g. a README file)?
- Is everything clearly labeled and documented?
- Where is the data coming from? Who collected it? How was it accomplished?
- Do you understand the context of the data collection?
- How is the data sampled? Does the sample make sense?
- What does its distribution look like?
- How was it curated?
- Was it tested, and if so, how?
- Have there been any assumptions made?
- Are you able to understand every single attribute?
- Have there been any changes made to the data?
- Are there any known errors or inconsistencies? Is it complete?
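Some of the points above (completeness, distribution, inconsistencies) can be checked mechanically as a first pass. The following is a minimal sketch using only the Python standard library; the function name `audit_rows`, the set of tokens treated as missing, and the tiny sample CSV are all assumptions made for illustration and are not taken from any real dataset.

```python
import csv
import io
from collections import Counter

# Tokens treated as "missing"; this set is an assumption and should be
# adapted to whatever the dataset's own documentation specifies.
MISSING_TOKENS = {"", "?", "NA", "N/A", "null", "None"}


def audit_rows(rows):
    """Summarize missing values, distinct values, and duplicate rows.

    `rows` is a list of dicts as produced by csv.DictReader.
    """
    columns = list(rows[0].keys()) if rows else []
    summary = {}
    for col in columns:
        values = [row[col] for row in rows]
        present = [
            v for v in values
            if v is not None and v.strip() not in MISSING_TOKENS
        ]
        counts = Counter(present)
        summary[col] = {
            "missing": len(values) - len(present),
            "distinct": len(counts),
            "most_common": counts.most_common(3),
        }
    # Count rows that are exact copies of an earlier row.
    row_counts = Counter(tuple(row[c] for c in columns) for row in rows)
    duplicates = sum(n - 1 for n in row_counts.values())
    return {"n_rows": len(rows), "duplicate_rows": duplicates, "columns": summary}


# Tiny made-up sample standing in for a real dataset file.
SAMPLE_CSV = """age,purpose,credit_amount
35,car,1200
?,furniture,800
35,car,1200
52,,4300
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
report = audit_rows(rows)
print(report["n_rows"], "rows,", report["duplicate_rows"], "duplicate(s)")
for name, stats in report["columns"].items():
    print(name, stats)
```

For a real dataset, the `io.StringIO(SAMPLE_CSV)` stand-in would be replaced by an opened file. A check like this only surfaces symptoms; questions about provenance, sampling, and curation still require reading the documentation and the original paper.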
Here is an incomplete list of places that are hosting or referring to popular ML datasets and academic papers (where datasets may be introduced or used).
Dataset Repositories:
- Kaggle
- UCI Machine Learning Repository
- OpenML
- Open Data on AWS
- Google’s Datasets Search Engine
- Awesome Public Datasets
List of datasets used for research: https://en.wikipedia.org/wiki/List_of_datasets_for_machine-learning_research
Finding papers:
[1] Hofmann, H. (1994). Statlog (German Credit Data) Data Set. UCI Machine Learning Repository.
Content is available under Attribution-Share Alike 3.0 Unported unless otherwise noted.