-
Notifications
You must be signed in to change notification settings - Fork 4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
glue-alpha: feedback on spark jobs constructs introduced in "Refactored glue-alpha L2 CDK construct RFC 0497" #33356
Comments
@natalie-white-aws just wanted to make sure you're seeing this (some related to the comments on #33238 (comment)) |
Thanks for the tag, I hadn’t seen it. extrajars will be addressed in the other issue. Our thinking on making role mandatory was that a default role would be either too broad (I.e. not least-privilege) or too restrictive (since the L2 has no context for what the glue job will need access to), and the developer would not find out until run time. The better developer experience would be to have them create a role that has the right least privilege roles they know they need. Happy to discuss here. |
Overall, I created this issue as an uber-issue, rather than creating a set of smaller issues. I hope we can get to some sort of alignment on each of the issues, so that work/PRs can happen. as for the Job Role requirements and its auto-creation, I think the developer experience is harmed by the lack of role auto-creation. It's always a role that needs to be assumable by glue service e.g.
which other L2 constructs - and I reference the main one I based my initial implementation on, which is Lambda - do provide e.g. https://github.com/aws/aws-cdk/blob/main/packages/aws-cdk-lib/aws-lambda/lib/function.ts#L946. As for permissions, I 100% agree on granting the least privileges to that role. You can see in the lambda example above, managed policies are used to grant sensible defaults, and it's then up to the user to add more. The question then becomes, does AWSGlueServiceRole managed policy strike a good balance with its defined permissions between the least privileges and sensible defaults. I think it goes slightly beyond what the Lambda managed policies grant, e.g. gives s3 bucket write permissions to buckets with certain name patterns but on balance, it does not look overly permissive. I think a judgement needs to be made, whether this policy is good enough, or if maybe a smaller set of sensible permissions should be granted to the role by default. |
Thank you. I do see this in the code doc string aws-cdk/packages/@aws-cdk/aws-glue-alpha/lib/jobs/job.ts Lines 316 to 321 in 5160796
And thank you @natalie-white-aws for the chime in. |
Describe the feature
Hello CDK team,
As a user of
glue-alpha
, and having contributed the initial job construct in #12506, I recently came across the refactor from #32521 as part of updating my CDK applications to2.178.0
.Having refactored my code to leverage the new constructs - mostly the
ScalaSparkEtlJob
- I noticed a couple of things and wanted to provide feedback on themJob Role requirements and its auto-creation
role
as part ofJobProps
was optional, and its absence led to the auto-creation of a default roleJobProperties
(sidenote: should it be calledJobProps
instead ofJobProperties
?), role is now requiredI think the previous behaviour of making
role
prop optional is a sensible default, and that behaviour should be restored, and code in subclasses corrected.Enums / Constants
GlueVersion
andWorkerType
which removed the ability to use newer values easily as was possible before e.g.GlueVersion.of(...)
.GlueVersion.of
andWorkerType.of
e.g. https://github.com/aws/aws-cdk/blob/main/packages/%40aws-cdk/aws-glue-alpha/lib/constants.ts#L5I think the previous approach, was better, and inline with other constructs e.g. Lambda, and that it should be restored to provide an easy way for adopting new values for
GlueVersion
andWorkerType
without the need to use escape hatches or be blocked on CDK updates.extraJars
,extraFiles
,extraPythonFiles
,extraJarsFirst
extraJars
,extraJarsFirst
andextraFiles
are applicable to all spark jobs regardless of language (Scala
/Python
)extraJars
allow spark to load jvm-based libraries that can be used across bothScala
andPython
spark jobsextraJarsFirst
is about the order of jar loading for all spark jobs across bothScala
andPython
extraFiles
is a way to load other files e.g. binary files or text files in spark - again regardless of the spark job's languageextraPythonFiles
is only relevant toPython
spark jobsextraJars
is not implemented for python spark jobsextraJarsFirst
is implemented in onlyScalaSparkFlexEtlJob
even though it should be available in all spark jobs whereverextraJars
is availableextraFiles
seem to be completely missing fromScala
spark jobsTherefore
extraJars
,extraJarsFirst
,extraFiles
props (I found a related feat(glue-alpha): include extra jars parameter in pyspark jobs #33238)Use Case
N/A
Proposed Solution
N/A
Other Information
N/A
Acknowledgements
CDK version used
2.178.0
Environment details (OS name and version, etc.)
macOS
The text was updated successfully, but these errors were encountered: