IP cache #53
base: develop
Conversation
I have many questions I think :P
Good job in general, many good ideas in here 👍
try:
    if num_fails >= self.config.engine.banjax_num_fails_to_ban:
        self.ip_cache.ip_banned(ip)
        sql = f'update request_sets set banned = 1 where ' \
Have you tested the performance of the update? I think we could consider having a separate table for the banjax bans, since they will have far fewer rows than request_sets.
Also, do you use SQL strings for better performance?
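The separate-table idea could be sketched roughly like this. A minimal illustration using sqlite3 (the real pipeline targets a different database, and the `banjax_bans` table name and columns are assumptions for this sketch, not from the PR):

```python
import sqlite3

# Hypothetical schema: a dedicated, small bans table instead of
# updating rows in the much larger request_sets table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE banjax_bans (ip TEXT PRIMARY KEY, banned_at TEXT)")

def record_ban(conn, ip, banned_at):
    # Parameterized insert: no string-formatting of values into SQL,
    # and updates touch only the bans table.
    conn.execute(
        "INSERT OR REPLACE INTO banjax_bans (ip, banned_at) VALUES (?, ?)",
        (ip, banned_at),
    )

record_ban(conn, "203.0.113.7", "2021-01-01 00:00:00")
banned = [row[0] for row in conn.execute("SELECT ip FROM banjax_bans")]
print(banned)  # ['203.0.113.7']
```

Whether this actually beats the in-place update would still need measuring, but the table stays small and the write path avoids scanning request_sets.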
- No, I did not test the performance; I just monitor the performance of the postprocessing pipeline.
- Not for performance. I was a bit concerned about that mysterious 1h shift issue and thought that a SQL update with an explicit time is solid and maybe more readable.
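The "explicit time" point can be illustrated like this: building the timestamp in the application from a timezone-aware UTC datetime, rather than relying on the server's session timezone, avoids DST-style one-hour shifts. A sketch only; the `banned_at` column and the literal values are hypothetical:

```python
from datetime import datetime, timezone

# An explicit, timezone-aware UTC timestamp is unambiguous regardless of
# the database server's session timezone or DST rules.
banned_at = datetime(2021, 3, 28, 1, 30, tzinfo=timezone.utc)

# Parameterized statement with the explicit time (placeholder style varies
# by driver; '?' is used here for illustration).
sql = "update request_sets set banned = 1, banned_at = ? where ip = ?"
params = (banned_at.strftime("%Y-%m-%d %H:%M:%S"), "203.0.113.7")
print(params[0])  # 2021-03-28 01:30:00
```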
@@ -1251,13 +1255,68 @@ def __init__(self, config, steps=()):
    super().__init__(config, steps)
    self.df_chunks = []
    self.df_white_list = None
    self.ip_cache = IPCache(config, self.logger)
Will the IPCache be used in other steps? It could be a TaskWithIPCache task. Also, I think all the Banjax stuff should live in separate tasks, like many of the attack-detection methods that are steps of AttackDetection; it would be more modular. I remember you said it was a bit tricky because of the differences in DataFrame needs, but, whenever we have some time of course, we could figure it out.
Again, whenever we have some time :) We could revisit this when doing performance tuning after the cluster is set up, for example.
TaskWithIPCache looks like overkill. If you need an IPCache in other steps, you just create an instance and use it; it's a singleton. But yes, we can rethink this later. For now, at least, using the singleton does not block us from moving in any direction.
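For reference, the "just create an instance" behavior relies on the singleton pattern. A minimal sketch of how IPCache might implement it; the `__new__`-based approach and the `_ips` attribute are assumptions for illustration, the actual class may differ:

```python
class IPCache:
    """Process-wide cache of challenged/banned IPs (singleton sketch)."""
    _instance = None

    def __new__(cls, config=None, logger=None):
        # Create the shared instance on first use; every later
        # "construction" returns the same object.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance.config = config
            cls._instance.logger = logger
            cls._instance._ips = {}  # hypothetical internal store
        return cls._instance

    def ip_banned(self, ip):
        self._ips[ip] = "banned"

a = IPCache(config={}, logger=None)
b = IPCache()  # same object, no shared reference needed between steps
a.ip_banned("203.0.113.7")
print(b._ips)  # {'203.0.113.7': 'banned'}
```

This is why any step can call `IPCache(...)` locally and still see the shared state.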
num_records = len(records)
if num_records > 0:
    challenged_ips = self.spark.createDataFrame(records).withColumn('challenged', F.lit(1))
Have you tried with persist? If yes, does it help? This also applies to the whitelist df.
I don't remember, but this part is not a bottleneck. I tried persist for whitelisting; it did not help.
White list hosts added.
Local cache for the challenged IPs: every challenged IP is cached so that ip_failed_challenge and ip_passed_challenge banjax reports can be matched against it.