glue-alpha: feedback on spark jobs constructs introduced in "Refactored glue-alpha L2 CDK construct RFC 0497" #33356

Open
2 tasks
humanzz opened this issue Feb 9, 2025 · 4 comments
Labels
@aws-cdk/aws-glue Related to AWS Glue effort/medium Medium work item – several days of effort feature-request A feature should be added or improved. p2

Comments

@humanzz
Contributor

humanzz commented Feb 9, 2025

Describe the feature

Hello CDK team,

As a user of glue-alpha, and having contributed the initial job construct in #12506, I recently came across the refactor from #32521 as part of updating my CDK applications to 2.178.0.

Having refactored my code to use the new constructs (mostly ScalaSparkEtlJob), I noticed a few things and wanted to provide feedback on them.

Job Role requirements and its auto-creation

I think the previous behaviour, where the role prop was optional and a role was created when it was omitted, is a sensible default. That behaviour should be restored, and the code in the subclasses corrected.

Enums / Constants

I think the previous approach was better and in line with other constructs, e.g. Lambda. It should be restored to provide an easy way to adopt new values for GlueVersion and WorkerType without needing escape hatches or being blocked on CDK updates.
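
To illustrate the point, here is a minimal sketch of the "enum-like class" pattern. It is not the actual glue-alpha source: the member names and the `of()` factory are assumptions for illustration, and '6.0' is a made-up future version string.

```typescript
// Hypothetical sketch of an enum-like class; not the real glue-alpha implementation.
class GlueVersion {
  // Well-known values that would ship with the library.
  public static readonly V4_0 = new GlueVersion('4.0');
  public static readonly V5_0 = new GlueVersion('5.0');

  // Factory for values the library does not know about yet,
  // so users are not blocked on a CDK release.
  public static of(version: string): GlueVersion {
    return new GlueVersion(version);
  }

  private constructor(public readonly name: string) {}
}

const known = GlueVersion.V4_0;
const future = GlueVersion.of('6.0'); // hypothetical future version
console.log(known.name, future.name);
```

With a plain TypeScript enum, the second case would need an escape hatch or a library update.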

extraJars, extraFiles, extraPythonFiles, extraJarsFirst

  • the following points are for spark jobs of different languages
  • extraJars, extraJarsFirst and extraFiles are applicable to all spark jobs regardless of language (Scala/Python)
    • extraJars allows spark to load jvm-based libraries that can be used across both Scala and Python spark jobs
    • extraJarsFirst is about the order of jar loading for all spark jobs across both Scala and Python
  • extraFiles is a way to load other files e.g. binary files or text files in spark - again regardless of the spark job's language
  • extraPythonFiles is only relevant to Python spark jobs
  • the new constructs are not implementing the above behaviour
    • extraJars is not implemented for python spark jobs
    • extraJarsFirst is implemented in only ScalaSparkFlexEtlJob even though it should be available in all spark jobs wherever extraJars is available
    • extraFiles seems to be completely missing from Scala spark jobs
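
The behaviour described in the list above could be sketched roughly as follows. The props shape and the `buildSparkArguments` helper are hypothetical, not the construct's actual API; the argument keys (`--extra-jars`, `--user-jars-first`, `--extra-files`, `--extra-py-files`) are Glue's special job parameters.

```typescript
// Hypothetical props shape; field names follow the issue text.
interface SparkExtraProps {
  extraJars?: string[];        // S3 URIs of JARs, for Scala and Python jobs
  extraJarsFirst?: boolean;    // load user JARs first on the classpath
  extraFiles?: string[];       // S3 URIs of additional files, any language
  extraPythonFiles?: string[]; // S3 URIs of Python files, Python jobs only
}

type JobLanguage = 'scala' | 'python';

// Builds the Glue default arguments shared by all spark jobs.
function buildSparkArguments(language: JobLanguage, props: SparkExtraProps): Record<string, string> {
  const args: Record<string, string> = {};
  if (props.extraJars && props.extraJars.length > 0) {
    args['--extra-jars'] = props.extraJars.join(',');
  }
  if (props.extraJarsFirst) {
    args['--user-jars-first'] = 'true';
  }
  if (props.extraFiles && props.extraFiles.length > 0) {
    args['--extra-files'] = props.extraFiles.join(',');
  }
  if (props.extraPythonFiles && props.extraPythonFiles.length > 0) {
    // Assumption: reject rather than silently ignore Python files on Scala jobs.
    if (language !== 'python') {
      throw new Error('extraPythonFiles is only applicable to Python spark jobs');
    }
    args['--extra-py-files'] = props.extraPythonFiles.join(',');
  }
  return args;
}
```

Keeping this logic in one shared place would let every spark job construct expose the same options, with only extraPythonFiles gated on the language.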

Therefore

Use Case

N/A

Proposed Solution

N/A

Other Information

N/A

Acknowledgements

  • I may be able to implement this feature request
  • This feature might incur a breaking change

CDK version used

2.178.0

Environment details (OS name and version, etc.)

macOS

@humanzz humanzz added feature-request A feature should be added or improved. needs-triage This issue or PR still needs to be triaged. labels Feb 9, 2025
@github-actions github-actions bot added the @aws-cdk/aws-glue Related to AWS Glue label Feb 9, 2025
@humanzz
Contributor Author

humanzz commented Feb 9, 2025

@natalie-white-aws just wanted to make sure you're seeing this (some of it is related to the comments on #33238 (comment))

@natalie-white-aws
Contributor

Thanks for the tag, I hadn’t seen it. extraJars will be addressed in the other issue. Our thinking on making role mandatory was that a default role would be either too broad (i.e. not least-privilege) or too restrictive (since the L2 has no context for what the Glue job will need access to), and the developer would not find out until run time. The better developer experience would be to have them create a role with the least-privilege permissions they know they need. Happy to discuss here.

@humanzz
Contributor Author

humanzz commented Feb 10, 2025

Overall, I created this issue as an uber-issue, rather than creating a set of smaller issues. I hope we can get to some sort of alignment on each of the issues, so that work/PRs can happen.

As for the Job Role requirements and its auto-creation, I think the developer experience is harmed by the lack of role auto-creation.

It's always a role that needs to be assumable by the Glue service, e.g.

new iam.Role(this, 'Role', {
    assumedBy: new iam.ServicePrincipal('glue.amazonaws.com'),
    ...
});

which other L2 constructs - and I reference the main one I based my initial implementation on, which is Lambda - do provide e.g. https://github.com/aws/aws-cdk/blob/main/packages/aws-cdk-lib/aws-lambda/lib/function.ts#L946.

As for permissions, I 100% agree on granting the least privileges to that role. You can see in the lambda example above, managed policies are used to grant sensible defaults, and it's then up to the user to add more.

The question then becomes whether the AWSGlueServiceRole managed policy strikes a good balance between least privilege and sensible defaults. I think it goes slightly beyond what the Lambda managed policies grant, e.g. it gives S3 write permissions on buckets with certain name patterns, but on balance it does not look overly permissive.

I think a judgement needs to be made on whether this policy is good enough, or whether a smaller set of sensible permissions should be granted to the role by default.
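
For discussion, the defaulting logic I have in mind could look roughly like this. To keep the sketch runnable without aws-cdk-lib it uses stand-in types; `RoleLike`, `JobProps` and `resolveJobRole` are hypothetical names. In a real construct the default would be an iam.Role with a managed policy attached via iam.ManagedPolicy.fromAwsManagedPolicyName.

```typescript
// Stand-in for iam.IRole so the sketch has no CDK dependency (assumption).
interface RoleLike {
  assumedBy: string;         // service principal trusted by the role
  managedPolicies: string[]; // names of attached AWS managed policies
}

// Hypothetical job props with an optional role.
interface JobProps {
  role?: RoleLike;
}

// Use the caller's role when given; otherwise fall back to a default
// that trusts Glue and attaches the AWSGlueServiceRole managed policy.
function resolveJobRole(props: JobProps): RoleLike {
  return props.role ?? {
    assumedBy: 'glue.amazonaws.com',
    managedPolicies: ['service-role/AWSGlueServiceRole'],
  };
}

const defaultRole = resolveJobRole({});
const customRole = resolveJobRole({
  role: { assumedBy: 'glue.amazonaws.com', managedPolicies: [] },
});
console.log(defaultRole.managedPolicies, customRole.managedPolicies);
```

Users who want strict least privilege can still pass their own role; the default only applies when none is provided.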

@pahud
Contributor

pahud commented Feb 10, 2025

Thank you. I do see this in the code's doc string:

* IAM Role (required)
* IAM Role to use for Glue job execution
* Must be specified by the developer because the L2 doesn't have visibility
* into the actions the script(s) takes during the job execution
* The role must trust the Glue service principal (glue.amazonaws.com)
* and be granted sufficient permissions.

And thank you @natalie-white-aws for chiming in.

@pahud pahud added p2 effort/medium Medium work item – several days of effort and removed needs-triage This issue or PR still needs to be triaged. labels Feb 10, 2025