05 | Exercise - Appraising popular datasets

24.05.22

In this exercise session, we will appraise popular datasets used in research. Will we be able to understand the data they contain and reconstruct the steps of their creation?

Agenda

10:15 - 10:20: Welcome and arrival
10:20 - 10:30: In-exercise task introduction
10:30 - 11:10: Task work
11:10 - 11:40: Group presentations
11:40 - 11:45: Goodbye and outlook

Session notes

Slides for the fifth exercise.

In-exercise Task: Datasets

Last week, we noticed some problems with the German Credit Data¹, which is hosted on the UCI Machine Learning Repository and included as an example dataset in multiple ML packages. Is this an isolated case, or can we find similar problems in other commonly used datasets?

Steps

Decide on a dataset you want to analyze. You might pick one used in a paper you recently read.
Analyze the dataset with all the methods you know of. Below are some points you may consider during your analysis.
Prepare a short presentation about your findings (~ 5 minutes).
Upload the presentation to: ../hcds-summer-2022/tree/main/exercise/tasks/ex05_datasets.
[Optional] If you are done with your dataset (or you think it is perfect), start looking into another one.
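
As a possible starting point for the analysis step, here is a minimal sketch that profiles a tabular dataset using only the Python standard library. It counts exact duplicate rows and, per column, missing and distinct values. The sample data, the `MISSING_MARKERS` set, and the `profile_rows` helper are assumptions made up for illustration; they are not part of the exercise material, and the markers in particular should be adapted to the dataset you picked.

```python
import csv
import io
from collections import Counter

# Hypothetical sample data, invented for illustration only.
SAMPLE_CSV = """age,purpose,amount
35,car,1200
,car,800
35,car,1200
52,A43,NA
"""

# Assumption: strings that should count as "missing"; adjust per dataset.
MISSING_MARKERS = {"", "NA", "N/A", "?", "null"}


def is_missing(value):
    """Treat None (short rows) and known marker strings as missing."""
    return value is None or value.strip() in MISSING_MARKERS


def profile_rows(rows):
    """Return row count, exact duplicate count, and per-column statistics."""
    rows = list(rows)
    # Count how often each complete row occurs to find exact duplicates.
    row_counts = Counter(tuple(sorted(r.items())) for r in rows)
    duplicates = sum(n - 1 for n in row_counts.values())

    columns = {}
    for name in (rows[0].keys() if rows else []):
        values = [r[name] for r in rows]
        columns[name] = {
            "missing": sum(1 for v in values if is_missing(v)),
            "distinct": len({v for v in values if not is_missing(v)}),
        }
    return {"rows": len(rows), "duplicate_rows": duplicates, "columns": columns}


report = profile_rows(csv.DictReader(io.StringIO(SAMPLE_CSV)))
print(report)
```

To run this on a real file, replace `io.StringIO(SAMPLE_CSV)` with an opened file handle. Counts like these only hint at where to look: whether a duplicate row or an unexplained code is actually a problem still has to be checked against the dataset's documentation.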

What you may consider:

Resources

Here is an incomplete list of places that are hosting or referring to popular ML datasets and academic papers (where datasets may be introduced or used).

Dataset Repositories:

List of datasets used for research: https://en.wikipedia.org/wiki/List_of_datasets_for_machine-learning_research

Finding papers:

References

¹ Professor Dr. Hans Hofmann (1994). Statlog (German Credit Data) Data Set, UCI Machine Learning Repository.

Credits & Licenses

Content is available under Attribution-Share Alike 3.0 Unported unless otherwise noted.
