RB data quality - outlier data point detection #19

Open
xmcai2016 opened this issue Aug 17, 2023 · 0 comments
xmcai2016 commented Aug 17, 2023

Now that a few more Reputation Bots are on the horizon, I suggest we implement an outlier detection mechanism. The goal is to weed out untrustworthy data, whether it comes from intentional abuse or unintentional mistakes.
Outlier data points should be filtered out whenever we output collective Retrieval Bot data to dashboards, the GitHub bot, or any other downstream consumer.
As an initial definition, an outlier is a data point that deviates by more than 10% from the median of the data points for the same sp_id measured by different Retrieval Bot instances. I'm open to suggestions on alternative definitions, and the definition can be tweaked on the fly as we gather more empirical data.
We should also keep track of the source of each outlier, which will help us root-cause any skew in the collected data.
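
A minimal sketch of how the proposal above could look, not a definitive implementation. Only `sp_id` and the 10% threshold come from this issue; the `Measurement` record, its `bot_id` and `value` fields, and both function names are hypothetical. It assumes each data point is a single number and that deviation is measured relative to the per-`sp_id` median.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class Measurement:
    sp_id: str    # storage provider ID, as named in this issue
    bot_id: str   # hypothetical: which Retrieval Bot instance reported it
    value: float  # hypothetical: the measured metric as a single number


def split_outliers(measurements, threshold=0.10):
    """Split measurements into (kept, outliers) using the per-sp_id median."""
    by_sp = defaultdict(list)
    for m in measurements:
        by_sp[m.sp_id].append(m)

    kept, outliers = [], []
    for group in by_sp.values():
        med = median(m.value for m in group)
        for m in group:
            if med != 0:
                deviation = abs(m.value - med) / abs(med)
            else:
                # Assumption: with a zero median, any non-zero value is an outlier.
                deviation = 0.0 if m.value == 0 else float("inf")
            (outliers if deviation > threshold else kept).append(m)
    return kept, outliers


def outliers_by_source(outliers):
    """Count outliers per reporting bot, to help root-cause skewed data."""
    return Counter(m.bot_id for m in outliers)
```

Downstream consumers would read only the `kept` list, while the per-bot counts cover the source tracking. Note that with only two bot instances for an `sp_id`, the median is the midpoint of the two values, so both deviate equally and neither can be singled out; the definition becomes meaningful with three or more instances.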

@xmcai2016 xmcai2016 converted this from a draft issue Aug 17, 2023
@xmcai2016 xmcai2016 moved this to 🍇 Backlog in ActionArena Aug 17, 2023