[#525] update libraries on Spark images
Primarily, this changeset hardens the Spark images by upgrading
libraries used by Spark to their highest compatible version. This
included upgrading Spark from 3.5.2 to 3.5.4 (the latest version), and
modifying the history and Thrift server charts to not hardcode the
version of the Ivy jar used to load libraries at deploy time.

This changeset also includes a fix to the Helm version enforcement logic
to properly detect/allow for non-standard release versions (e.g. release
candidate versions).
ewilkins-csi committed Jan 17, 2025
1 parent eb13999 commit 1d740b7
Showing 23 changed files with 297 additions and 111 deletions.
3 changes: 3 additions & 0 deletions DRAFT_RELEASE_NOTES.md
@@ -9,6 +9,9 @@ To better align development processes with processes in CI/CD and higher environ
## Data Access Upgrade
Data access through [GraphQL](https://graphql.org/) has been deprecated and replaced with [Trino](https://trino.io/). Trino is optimized for performing queries against large datasets by leveraging a distributed architecture that processes queries in parallel, enabling fast and scalable data retrieval.

## Spark Upgrade
Spark and PySpark have been upgraded from version 3.5.2 to 3.5.4.

# Breaking Changes
_Note: instructions for adapting to these changes are outlined in the upgrade instructions below._

29 changes: 1 addition & 28 deletions build-parent/pom.xml
@@ -79,7 +79,7 @@
<version.javax.servlet>3.1.0</version.javax.servlet>

<!-- Spark Default Dependencies. See `spark-*` profiles below for alternative sets -->
<version.spark>3.5.2</version.spark>
<version.spark>3.5.4</version.spark>
<version.scala>2.12.20</version.scala>
<version.scala.minor>2.12</version.scala.minor>
<version.delta>3.2.1</version.delta>
@@ -1039,33 +1039,6 @@ To suppress enforce-helm-version rule, you must add following plugin to the root
<useDevRepository>true</useDevRepository>
</configuration>
</plugin>
<plugin>
<groupId>${group.fabric8.plugin}</groupId>
<artifactId>docker-maven-plugin</artifactId>
<version>${version.fabric8.docker.maven.plugin}</version>
<executions>
<execution>
<id>default-build</id>
<phase>package</phase>
<goals>
<goal>build</goal>
</goals>
<configuration>
<!-- Deploy will build all platforms, so skip build phase in this case: -->
<buildArchiveOnly>true</buildArchiveOnly>
<images>
<image>
<build>
<buildx>
<builderName>maven</builderName>
</buildx>
</build>
</image>
</images>
</configuration>
</execution>
</executions>
</plugin>
</plugins>
</pluginManagement>
</build>
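
Downstream builds that inherit from build-parent pick up the new Spark version through normal Maven property interpolation. A quick way to confirm the resolved value in a consuming project (a sketch assuming the Maven Help plugin is available):

    mvn help:evaluate -Dexpression=version.spark -q -DforceStdout   # expect 3.5.4
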
@@ -31,7 +31,7 @@ public class HelmVersionHelper {
private static final Logger logger = LoggerFactory.getLogger(HelmVersionHelper.class);

private static final String HELM_COMMAND = "helm";
private static final String EXTRACT_VERSION_REGEX = "\"v((\\d\\.?)+)\"";
private static final String EXTRACT_VERSION_REGEX = "Version:\"v(.*?)\"";

private final File workingDirectory;

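
The loosened pattern accounts for pre-release Helm builds: `helm version` prints something like `version.BuildInfo{Version:"v3.16.0-rc.1", GitCommit:"...", ...}`, and the old pattern `"v((\d\.?)+)"` only matched purely numeric tags such as v3.16.2, so a suffix like -rc.1 left the enforcement logic with no version at all. A shell sketch of the equivalent extraction (illustrative only, not part of this changeset):

    helm version | grep -oE 'Version:"v[^"]+"'
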
@@ -11,7 +11,7 @@ packages = [
[tool.poetry.dependencies]
python = ">=3.8"
krausening = ">=20"
pyspark = "3.5.2"
pyspark = "3.5.4"
pyyaml = "^6.0"

[tool.poetry.group.dev.dependencies]
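
Consumers of the Python package can confirm the pinned upgrade after refreshing their environment; a minimal check, assuming a Poetry-managed virtualenv:

    poetry update pyspark
    poetry run python -c "import pyspark; print(pyspark.__version__)"   # expect 3.5.4
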
@@ -24,7 +24,7 @@
<groupId>com.boozallen.aissemble</groupId>
<artifactId>aissemble-spark</artifactId>
<version>${project.version}</version>
<type>pom</type>
<type>docker-build</type>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
@@ -70,6 +70,59 @@
</image>
</images>
</configuration>
<executions>
<execution>
<id>test-image</id>
<phase>integration-test</phase>
<goals>
<goal>start</goal>
</goals>
<configuration>
<showLogs>true</showLogs>
<images>
<image>
<run>
<containerNamePattern>${project.artifactId}-test</containerNamePattern>
<!-- If autoRemove is true and wait is specified, the build will fail. See https://github.com/fabric8io/docker-maven-plugin/issues/1622 -->
<autoRemove>false</autoRemove>
<entrypoint>/opt/spark/test/image-test.sh</entrypoint>
<wait>
<time>60000</time>
<http>
<!-- History server check -->
<url>http://localhost:18080</url>
</http>
<!-- Thrift server check -->
<log>.*HiveThriftServer2 started.*</log>
</wait>
</run>
</image>
</images>
</configuration>
</execution>
</executions>
</plugin>
<plugin>
<groupId>org.codehaus.mojo</groupId>
<artifactId>exec-maven-plugin</artifactId>
<executions>
<execution>
<id>test-image-cleanup</id>
<phase>post-integration-test</phase>
<goals>
<goal>exec</goal>
</goals>
<configuration>
<executable>docker</executable>
<arguments>
<argument>rm</argument>
<argument>-f</argument>
<argument>${project.artifactId}-test</argument>
<argument>hive-metastore-db</argument>
</arguments>
</configuration>
</execution>
</executions>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
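
The wait block above gates the build on the history server answering HTTP and on the Thrift server logging its startup message. A rough manual equivalent (the image name, tag, and published port are assumptions; they are not spelled out in this hunk):

    docker run -d -p 18080:18080 --name spark-infrastructure-test \
      --entrypoint /opt/spark/test/image-test.sh \
      boozallen/aissemble-spark-infrastructure:latest
    curl -sf http://localhost:18080/
    docker logs spark-infrastructure-test 2>&1 | grep "HiveThriftServer2 started"
    docker rm -f spark-infrastructure-test
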
@@ -10,6 +10,8 @@ RUN curl -L https://github.com/delta-io/connectors/releases/download/v${DELTA_HI
ARG JARS_DIR
ADD ${JARS_DIR}/* $SPARK_HOME/jars/

COPY --chmod=755 src/test/resources/image-test.sh $SPARK_HOME/test/image-test.sh

ENV SPARK_NO_DAEMONIZE=true
USER spark
CMD [ "/bin/bash", "-c", "/opt/spark/sbin/start-history-server.sh & /opt/spark/sbin/start-thriftserver.sh" ]
@@ -0,0 +1,20 @@
#!/bin/sh

###
# #%L
# aiSSEMBLE::Extensions::Docker::Spark Infrastructure
# %%
# Copyright (C) 2021 Booz Allen
# %%
# This software package is licensed under the Booz Allen Public License. All Rights Reserved.
# #L%
###

#Create events dir that's usually mounted at runtime
mkdir /tmp/spark-events
$SPARK_HOME/sbin/start-history-server.sh &

#Switch to Embedded Derby DB for Thrift Server test
sed -i 's/jdbc:mysql:\/\/hive-metastore-db:3306\/metastore?createDatabaseIfNotExist=true&amp;allowPublicKeyRetrieval=true&amp;useSSL=false/jdbc:derby:\/tmp\/metastore;create=true/' $SPARK_HOME/conf/hive-site.xml
sed -i 's/com.mysql.cj.jdbc.Driver/org.apache.derby.jdbc.EmbeddedDriver/' $SPARK_HOME/conf/hive-site.xml
$SPARK_HOME/sbin/start-thriftserver.sh
13 changes: 0 additions & 13 deletions extensions/extensions-docker/aissemble-spark-operator/README.md
@@ -1,13 +0,0 @@
# Overview
This module serves as a mechanism for building the `spark-operator` docker image for
aiSSEMBLE&trade; with tight coupling with the aiSSEMBLE supported Spark version(s). The
`spark-operator` source code, written primarily in Go, is largely created and supported
through the official repository at https://github.com/kubeflow/spark-operator/tree/master.
This repository is cloned with each build to guarantee that aiSSEMBLE is always using the
latest stable materials.

In the event that divergence is needed from the officially provided and/or maintained
source, the `spark-operator` repository should be forked, and the checkout URLs
within this module updated to the new repo. The purpose of maintaining the separation
is to better support tracking of any divergences from the base materials, as well as to
ease the process of staying up-to-date with upstream changes.
@@ -18,7 +18,7 @@ ARG VERSION_AISSEMBLE

FROM docker.io/kubeflow/spark-operator:v1beta2-1.6.2-3.5.0 AS builder

# We would be able to use the kubeflow image directly, except that it is on Spark 3.5 instead of 3.4
# Use our image as a base to ensure Spark version alignment and take advantage of image hardening
FROM ${DOCKER_BASELINE_REPO_ID}boozallen/aissemble-spark:${VERSION_AISSEMBLE}

LABEL org.opencontainers.image.source="https://github.com/boozallen/aissemble"
43 changes: 43 additions & 0 deletions extensions/extensions-docker/aissemble-spark/pom.xml
@@ -64,6 +64,49 @@
<plugin>
<groupId>${group.fabric8.plugin}</groupId>
<artifactId>docker-maven-plugin</artifactId>
<executions>
<execution>
<id>test-image</id>
<phase>integration-test</phase>
<goals>
<goal>start</goal>
</goals>
<configuration>
<showLogs>true</showLogs>
<images>
<image>
<run>
<containerNamePattern>${project.artifactId}-test</containerNamePattern>
<!-- If autoRemove is true, the build will fail. See https://github.com/fabric8io/docker-maven-plugin/issues/1622 -->
<autoRemove>false</autoRemove>
<cmd>/opt/spark/bin/spark-submit /opt/spark/examples/src/main/python/pi.py</cmd>
<wait>
<time>60000</time>
<exit>0</exit>
</wait>
</run>
</image>
</images>
</configuration>
</execution>
</executions>
</plugin>
<plugin>
<groupId>org.codehaus.mojo</groupId>
<artifactId>exec-maven-plugin</artifactId>
<executions>
<execution>
<id>test-image-cleanup</id>
<phase>post-integration-test</phase>
<goals>
<goal>exec</goal>
</goals>
<configuration>
<executable>docker</executable>
<commandlineArgs>rm ${project.artifactId}-test</commandlineArgs>
</configuration>
</execution>
</executions>
</plugin>
</plugins>
</build>
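
The start goal's wait/exit configuration amounts to running the bundled Pi example and requiring a zero exit code. Roughly equivalent by hand (the image tag is an assumption):

    docker run --rm --name aissemble-spark-test \
      boozallen/aissemble-spark:latest \
      /opt/spark/bin/spark-submit /opt/spark/examples/src/main/python/pi.py
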
@@ -4,13 +4,16 @@ ARG SPARK_VERSION
ARG SCALA_VERSION
FROM docker.io/apache/spark:${SPARK_VERSION}-scala${SCALA_VERSION}-java17-python3-ubuntu

#pull var from global scope to build stage
ARG SPARK_VERSION
ARG PYTHON_VERSION=3.11

LABEL org.opencontainers.image.source="https://github.com/boozallen/aissemble"

USER root

# Configures the desired version of Python to install
ARG PYTHON_VERSION=3.11
# Setup Spark home directory
RUN usermod -d /opt/spark spark
RUN mkdir $SPARK_HOME/checkpoint && \
mkdir $SPARK_HOME/krausening && \
mkdir $SPARK_HOME/warehouse && \
@@ -28,23 +31,31 @@ RUN apt-get update -y && apt-get install --assume-yes \
curl \
python${PYTHON_VERSION} \
python${PYTHON_VERSION}-dev \
#TODO is distutils needed?
python${PYTHON_VERSION}-distutils \
#Patch for CVE-2023-4863: upgrade libwebp7 to latest
#Patch for CVE-2023-4863: upgrade libwebp7 to 1.3.2+
&& apt-get upgrade -y libwebp7 \
&& rm -rf /var/lib/apt/lists/* \
&& apt-get clean \
&& ln -s /usr/bin/python${PYTHON_VERSION} /usr/bin/python
&& ln -s /usr/bin/python${PYTHON_VERSION} /usr/bin/python \
&& python -m pip install --upgrade cryptography

# Pyspark uses `python3` to execute pyspark pipelines. This links our latest python install to that command.
RUN ln -sf /usr/bin/python3.11 /usr/bin/python3

# Update gosu from Ubuntu 24 distribution (compatible with but not available for Ubuntu 22)
RUN curl -SsLO https://launchpad.net/ubuntu/+archive/primary/+files/gosu_1.17-1ubuntu0.24.04.2_amd64.deb \
&& apt-get install ./gosu_1.17-1ubuntu0.24.04.2_amd64.deb \
&& rm gosu_1.17-1ubuntu0.24.04.2_amd64.deb

# Update library jars used by Spark and register PySpark as an available python package
COPY --chmod=755 ./src/main/resources/scripts/setup.sh $SPARK_HOME/setup.sh
RUN $SPARK_HOME/setup.sh $SPARK_HOME $SPARK_VERSION

## Add spark configurations
COPY ./src/main/resources/conf/ $SPARK_HOME/conf/

RUN chown -R spark:spark $SPARK_HOME/conf/
RUN chown -R spark:spark $SPARK_HOME

# Fixed the Reflection API breaks java module boundary issue for Java 16+
ENV JDK_JAVA_OPTIONS='--add-opens java.base/java.lang=ALL-UNNAMED'

USER spark
USER spark
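
For reference, a hypothetical standalone build of this image; in the real build the fabric8 docker-maven-plugin supplies these args, and the values shown mirror build-parent/pom.xml:

    docker build \
      --build-arg SPARK_VERSION=3.5.4 \
      --build-arg SCALA_VERSION=2.12 \
      --build-arg PYTHON_VERSION=3.11 \
      -t boozallen/aissemble-spark:local \
      .   # run from the directory containing this Dockerfile
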