
IP cache #53

Open: wants to merge 66 commits into base: develop
Conversation

mazhurin (Collaborator):

Local cache for the challenged IPs. Every challenged IP is cached in order to:

  • prevent extra challenge commands if the same IP shows up again in the following batches
  • let the banjax report thread look up the IP while processing ip_failed_challenge or ip_passed_challenge banjax reports
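The two purposes above could be served by a small thread-safe cache along these lines. This is a minimal sketch, not the PR's actual implementation; the class name matches the PR, but the method names, TTL parameter, and dict-based storage are assumptions.

```python
import threading
import time


class IPCache:
    """Process-local cache of challenged IPs (illustrative sketch)."""

    def __init__(self, ttl_seconds=3600):
        self._lock = threading.Lock()
        self._ips = {}  # ip -> timestamp of the last challenge
        self.ttl = ttl_seconds

    def challenge(self, ip):
        """Return True if a challenge command should be sent,
        False if the IP was already challenged recently (skip the extra command)."""
        now = time.time()
        with self._lock:
            ts = self._ips.get(ip)
            if ts is not None and now - ts < self.ttl:
                return False
            self._ips[ip] = now
            return True

    def is_challenged(self, ip):
        """Lookup used when processing ip_failed_challenge /
        ip_passed_challenge banjax reports."""
        with self._lock:
            ts = self._ips.get(ip)
            return ts is not None and time.time() - ts < self.ttl
```

A single lock around the dict keeps the cache safe to share between the batch-processing thread and the banjax report thread.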

@mazhurin mazhurin requested a review from mkaranasou September 28, 2020 14:47
@mazhurin mazhurin changed the base branch from master to develop September 28, 2020 14:48
@mkaranasou (Collaborator) left a comment:

I have many questions I think :P
Good job in general, many good ideas in here 👍

try:
    if num_fails >= self.config.engine.banjax_num_fails_to_ban:
        self.ip_cache.ip_banned(ip)
        sql = f'update request_sets set banned = 1 where ' \
Collaborator:

Have you tested the performance of the update? I think we could consider a separate table for the banjax bans, since they will be far fewer rows than request sets.
Also, do you use raw SQL strings for better performance?
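The separate-table idea could look roughly like this, sketched with an in-memory SQLite stand-in (the real project uses a different database; the `banjax_bans` table name and columns are assumptions, not the PR's schema):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
# A dedicated bans table: one row per banned IP, far fewer rows
# than request_sets, so the update/lookup touches a small table.
conn.execute('''
    CREATE TABLE banjax_bans (
        ip        TEXT PRIMARY KEY,
        banned_at TEXT NOT NULL
    )
''')
conn.execute(
    "INSERT OR REPLACE INTO banjax_bans (ip, banned_at) "
    "VALUES (?, datetime('now'))",
    ('203.0.113.7',),
)
rows = conn.execute('SELECT ip FROM banjax_bans').fetchall()
```

Marking a ban then becomes an upsert into the small table instead of an update scan over request_sets.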

Collaborator (author):

  • No, I did not test the performance of the update; I just monitor the performance of the postprocessing pipeline.
  • Not for performance. I was a bit concerned about that mysterious 1h shift issue and thought that a SQL update with an explicit time would be solid and perhaps more readable.
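A parameterized update with an explicit timestamp, as described above, might look like this sketch (again using in-memory SQLite as a stand-in; the table and column names beyond `request_sets`/`banned` are assumptions):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute(
    'CREATE TABLE request_sets (ip TEXT, banned INTEGER, updated_at TEXT)'
)
conn.execute("INSERT INTO request_sets VALUES ('203.0.113.7', 0, NULL)")

# Passing the time explicitly from the application avoids depending on
# the DB server's clock/timezone (the "1h shift" concern) and keeps the
# statement readable.
explicit_time = '2020-09-28 14:47:00'
conn.execute(
    'UPDATE request_sets SET banned = 1, updated_at = ? WHERE ip = ?',
    (explicit_time, '203.0.113.7'),
)
row = conn.execute('SELECT banned, updated_at FROM request_sets').fetchone()
```

Note the `?` placeholders: binding parameters this way also avoids the injection risk of interpolating values into an f-string SQL statement.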

Review threads on:
  • src/baskerville/models/engine.py (resolved)
  • src/baskerville/models/ip_cache.py (resolved)
  • src/baskerville/models/ip_cache.py (outdated, resolved)
  • src/baskerville/models/ip_cache.py (resolved)
@@ -1251,13 +1255,68 @@ def __init__(self, config, steps=()):
        super().__init__(config, steps)
        self.df_chunks = []
        self.df_white_list = None
        self.ip_cache = IPCache(config, self.logger)
Collaborator:

Will the IPCache be used in other steps? It could be a TaskWithIPCache task. Also, I think all the Banjax logic should live in separate tasks, like many of the methods in attack detection, e.g. as steps of AttackDetection; it would be more modular. I remember you said it was a bit tricky because of the difference in DataFrame needs, but, whenever we have some time of course, we could figure it out.
Again, whenever we have some time :) We could rethink this when doing performance tuning after the cluster set-up, for example.

Collaborator (author):

TaskWithIPCache looks like overkill. If you need an IPCache in other steps, you just create an instance and use it; it's a singleton. But yes, we could rethink this later. For now, at least, using the singleton does not block us from moving in any direction.
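For reference, a thread-safe singleton of the kind src/baskerville/util/singleton_thread_safe.py presumably provides can be sketched with a metaclass like this. This is an assumption about the implementation, not the file's actual contents:

```python
import threading


class SingletonThreadSafe(type):
    """Metaclass: at most one instance per class, safe under concurrent first use."""
    _instances = {}
    _lock = threading.Lock()

    def __call__(cls, *args, **kwargs):
        if cls not in cls._instances:
            with cls._lock:
                # Double-checked locking: re-test after acquiring the lock,
                # in case another thread created the instance meanwhile.
                if cls not in cls._instances:
                    cls._instances[cls] = super().__call__(*args, **kwargs)
        return cls._instances[cls]


class IPCache(metaclass=SingletonThreadSafe):
    def __init__(self, config=None, logger=None):
        self.config = config
        self.logger = logger
```

With this pattern, `IPCache(config, logger)` in any step returns the same instance, which is why no dedicated TaskWithIPCache is needed to share it.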

Review threads on:
  • src/baskerville/models/pipeline_tasks/tasks.py (outdated, resolved)
  • src/baskerville/util/helpers.py (resolved)
  • src/baskerville/util/singleton_thread_safe.py (resolved)
num_records = len(records)
if num_records > 0:
    challenged_ips = self.spark.createDataFrame(records).withColumn('challenged', F.lit(1))
Collaborator:

Have you tried with persist? If yes, does it help? This also applies to the whitelist df.

Collaborator (author):

I don't remember, but this part is not a bottleneck. I tried persist for the whitelisting; it did not help.
