Added some documentation for the integration test docker cluster
Signed-off-by: Norman Jordan <[email protected]>
normanj-bitquill committed Jan 14, 2025
1 parent 5974486 commit 9aae9bb
Showing 8 changed files with 174 additions and 4 deletions.
1 change: 1 addition & 0 deletions docker/integ-test/.env
@@ -11,6 +11,7 @@ SQL_APP_JAR=./spark-sql-application/target/scala-2.12/sql-job-assembly-0.7.0-SNA
OPENSEARCH_NODE_MEMORY=512m
OPENSEARCH_ADMIN_PASSWORD=C0rrecthorsebatterystaple.
OPENSEARCH_PORT=9200
OPENSEARCH_PA_PORT=9600
OPENSEARCH_DASHBOARDS_PORT=5601
S3_ACCESS_KEY=Vt7jnvi5BICr1rkfsheT
S3_SECRET_KEY=5NK3StGvoGCLUWvbaGN0LBUf9N6sjE94PEzLdqwO
2 changes: 1 addition & 1 deletion docker/integ-test/docker-compose.yml
@@ -137,7 +137,7 @@ services:
target: /var/run/docker.sock
ports:
- ${OPENSEARCH_PORT:-9200}:9200
- 9600:9600
- ${OPENSEARCH_PA_PORT:-9600}:9600
expose:
- "${OPENSEARCH_PORT:-9200}"
- "9300"
3 changes: 3 additions & 0 deletions docker/integ-test/spark/spark-master-entrypoint.sh
@@ -1,5 +1,8 @@
#!/bin/bash

# Copyright OpenSearch Contributors
# SPDX-License-Identifier: Apache-2.0

function start_spark_connect() {
sc_version=$(ls -1 /opt/bitnami/spark/jars/spark-core_*.jar | sed -e 's/^.*\/spark-core_//' -e 's/\.jar$//' -e 's/-/:/')

166 changes: 166 additions & 0 deletions docs/docker/integ-test/README.md
@@ -0,0 +1,166 @@
# Docker Cluster for Integration Testing

## Introduction

The docker cluster in `docker/integ-test` is designed to be used for integration testing. It supports the following
use cases:
1. Submitting queries directly to Spark in order to test the PPL extension for Spark.
2. Submitting queries directly to Spark that use the OpenSearch datasource. Useful for testing the Flint extension
for Spark.
3. Using the Async API to submit queries to the OpenSearch server. Useful for testing the EMR workflow and querying
S3/Glue datasources. A local container is run rather than using the AWS EMR service.

The cluster consists of several containers, and the compose setup handles configuring them. No tables are created.

## Overview

![Docker Containers](images/integ-test-containers.png "Docker Containers")

All containers run in a dedicated docker network.

### OpenSearch Dashboards

An OpenSearch Dashboards server that is connected to the OpenSearch server. It is exposed to the host OS so that it
can be accessed with a browser.

### OpenSearch

An OpenSearch server running in standalone mode. It is exposed to the host OS and is configured with an S3/Glue
datasource named `mys3`. System indices and system index permissions are disabled.

This container also has a docker volume used to persist data such as local indices.
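
Once the cluster is running, the `mys3` datasource can be verified through the SQL plugin's datasources API. This is a
minimal sketch that assumes the default `OPENSEARCH_PORT` (9200) and the admin password from `.env`, with the security
plugin enabled (hence HTTPS and basic authentication):
```shell
# Check that OpenSearch is up and that the mys3 datasource has been created.
curl -k -u "admin:C0rrecthorsebatterystaple." \
  "https://localhost:9200/_plugins/_query/_datasources/mys3"
```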

### Spark

The Spark master node. It is configured to use an external Hive metastore in the container `metastore`. The
Spark master also has the Flint and PPL extensions installed. It can use locally built Jar files when building
the docker image.

Spark Connect also runs in this container and can be used to easily issue queries. The port for Spark Connect is
exposed to the host OS (see the example below).
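
For example, a client on the host OS can attach to the exposed Spark Connect port. This is a hypothetical sketch
assuming a local PySpark 3.4+ installation with the Spark Connect client and the default `SPARK_CONNECT_PORT` of 15002:
```shell
# Install a PySpark client with Spark Connect support (only needed on the host).
pip install "pyspark[connect]"

# Open an interactive session against the cluster's Spark Connect endpoint.
pyspark --remote "sc://localhost:15002"
```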

### Spark Worker

The Spark worker node. It is configured to use an external Hive metastore in the container `metastore`. The
Spark worker also has the Flint and PPL extensions installed. It can use locally built Jar files when building
the docker image.

### Spark Submit

A temporary container that runs queries for an Async API session. It is started by the OpenSearch container. It
does not connect to the Spark cluster and instead runs the queries locally. It will keep looking for more
queries to run until it reaches its timeout (3 minutes by default).

The Spark submit container is configured to use an external Hive metastore in the container `metastore`. The
Flint and PPL extensions are installed. When building the docker image, locally built Jar files can be used.

### Metastore (Hive)

A Hive server that is used as a metastore for the Spark containers. It is configured to use the `integ-test`
bucket on the Minio container.

This container also has a docker volume used to persist the metastore.

### Minio (S3)

A Minio server that acts as an S3 server. It is used as part of the workflow for executing an S3/Glue query
and contains the S3 table data.

This container also has a docker volume used to persist the S3 data.

### Configuration Updater

A temporary container that is used to configure the OpenSearch and Minio containers. It is run after both
of those have started up. For Minio, it will add the `integ-test` bucket and create an access key. For
OpenSearch, it will create the S3/Glue datasource and apply a cluster configuration.

## Running the Cluster

To start the cluster, go to the `docker/integ-test` directory and use Docker Compose. When starting the cluster,
wait for the `spark-worker` container to finish starting up; it is the last container to start (a readiness
check is shown after the commands below).

Start cluster in foreground:
```shell
docker compose up
```

Start cluster in the background:
```shell
docker compose up -d
```
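
Checking that the cluster is ready (the `spark-worker` service is the last to come up; these commands are a suggested
check, not part of the compose setup):
```shell
# Show the state of each service in the cluster.
docker compose ps

# Optionally follow the spark-worker logs until it reports it has registered with the master.
docker compose logs -f spark-worker
```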

Stopping the cluster:
```shell
docker compose down
```

## Creating Tables in S3

Tables need to be created in Spark as external tables with their location set to a path under `s3a://integ-test/`.
Use `spark-shell` on the Spark master container to do this:
```shell
docker exec -it spark spark-shell
```

Example for creating a table and adding data:
```scala
spark.sql("CREATE EXTERNAL TABLE foo (id int, name varchar(100)) location 's3a://integ-test/foo'")
spark.sql("INSERT INTO foo (id, name) VALUES(1, 'Foo')")
```
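
To confirm the new table is queryable outside the interactive shell, a one-off query can be run with `spark-sql`.
This is a sketch under the assumption that the Spark master container is named `spark` (as in the example above) and
that the image provides `spark-sql` on the PATH:
```shell
# Run a non-interactive query against the table created above.
docker exec -it spark spark-sql -e "SELECT * FROM foo"
```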

## Configuration of the Cluster

There are several settings that can be adjusted for the cluster. They are defined in `docker/integ-test/.env`, and
an example of overriding one at startup follows the list.

* SPARK_VERSION - the tag of the `bitnami/spark` docker image to use
* OPENSEARCH_VERSION - the tag of the `opensearchproject/opensearch` docker image to use
* DASHBOARDS_VERSION - the tag of the `opensearchproject/opensearch-dashboards` docker image to use
* MASTER_UI_PORT - port on the host OS to map to the master UI port (8080) of the Spark master
* MASTER_PORT - port on the host OS to map to the master port (7077) on the Spark master
* UI_PORT - port on the host OS to map to the UI port (4040) on the Spark master
* SPARK_CONNECT_PORT - port on the host OS to map to the Spark Connect port (15002) on the Spark master
* PPL_JAR - The relative path to the PPL extension Jar file. Must be within the base directory of this repository
* FLINT_JAR - The relative path to the Flint extension Jar file. Must be within the base directory of this
repository
* SQL_APP_JAR - The relative path to the SQL application Jar file. Must be within the base directory of this
repository
* OPENSEARCH_NODE_MEMORY - Amount of memory to allocate for the OpenSearch server
* OPENSEARCH_ADMIN_PASSWORD - Password for the admin user of the OpenSearch server
* OPENSEARCH_PORT - port on the host OS to map to port 9200 on the OpenSearch server
* OPENSEARCH_PA_PORT - port on the host OS to map to the performance analyzer port (9600) on the OpenSearch
server
* OPENSEARCH_DASHBOARDS_PORT - port on the host OS to map to the OpenSearch dashboards server
* S3_ACCESS_KEY - access key to create on the Minio container
* S3_SECRET_KEY - secret key to create on the Minio container
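
A setting can also be overridden for a single run by exporting it in the shell before invoking Docker Compose, since
shell environment variables take precedence over `.env`. For example, using `OPENSEARCH_PORT` as an illustration:
```shell
# Map OpenSearch to host port 9201 for this run only.
OPENSEARCH_PORT=9201 docker compose up -d
```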

## Async API Overview

[Async API Interfaces](https://github.com/opensearch-project/sql/blob/main/docs/user/interfaces/asyncqueryinterface.rst)

[Async API Documentation](https://opensearch.org/docs/latest/search-plugins/async/index/)

The Async API is able to query S3/Glue datasources. Normally this is done by calling the AWS EMR service to run
the query in a docker container. The docker container uses Spark and is able to access the Glue catalog and
retrieve data from S3.

For the docker cluster, Minio is used in place of S3. Docker itself is used in place of AWS EMR.

![OpenSearch Async API Sequence Diagram](images/OpenSearch_Async_API.png "Sequence Diagram")

1. Client submits a request to the async query API endpoint.
2. OpenSearch server creates a special index (if it doesn't exist). This index is used to store async API requests
along with some state information.
3. OpenSearch server checks if the query is for an S3/Glue datasource. If it is not, then OpenSearch can handle
the request on its own.
4. OpenSearch uses docker to start a new container to process queries for the current async API session.
5. OpenSearch returns the queryId and sessionId to the Client.
6. Spark submit docker container starts up.
7. Spark submit docker container searches the index from step 2 for a query to run in the current session.
8. Spark submit docker container creates a special OpenSearch index (if it doesn't exist). This index is used to
store the results of the async API queries.
9. Spark submit docker container looks up the table metadata from the `metastore` container.
10. Spark submit docker container retrieves the data from the Minio container.
11. Spark submit docker container writes the results to the OpenSearch index from step 8.
12. Client submits a request to the async query results API endpoint using the queryId from step 5.
13. OpenSearch returns the results to the Client.
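
As a concrete illustration of steps 1, 5, 12 and 13, the workflow can be exercised from the host with `curl`. This is
a hedged sketch: it assumes the default admin credentials from `.env`, the `mys3` datasource, the `foo` table created
earlier (its fully qualified name here is an assumption), and a `<queryId>` placeholder to be replaced with the value
returned by the first call:
```shell
# Step 1: submit a SQL query against the mys3 datasource; the response contains queryId and sessionId (step 5).
curl -k -u "admin:C0rrecthorsebatterystaple." -X POST \
  -H "Content-Type: application/json" \
  "https://localhost:9200/_plugins/_async_query" \
  -d '{"datasource": "mys3", "lang": "sql", "query": "SELECT * FROM mys3.default.foo"}'

# Steps 12 and 13: fetch the results, replacing <queryId> with the value returned above.
curl -k -u "admin:C0rrecthorsebatterystaple." \
  "https://localhost:9200/_plugins/_async_query/<queryId>"
```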
(Binary image files for the diagrams referenced above, `images/integ-test-containers.png` and `images/OpenSearch_Async_API.png`, are not rendered in this view.)
4 changes: 2 additions & 2 deletions docs/spark-docker.md → docs/docker/spark-docker.md
@@ -19,7 +19,7 @@ sbt clean
sbt assembly
```

Refer to the [Developer Guide](../DEVELOPER_GUIDE.md) for more information.
Refer to the [Developer Guide](../../DEVELOPER_GUIDE.md) for more information.

## Using Docker Compose

@@ -65,7 +65,7 @@ spark.sql("INSERT INTO test_table (id, name) VALUES(2, 'Bar')")
spark.sql("source=test_table | eval x = id + 5 | fields x, name").show()
```

For further information, see the [Spark PPL Test Instructions](ppl-lang/local-spark-ppl-test-instruction.md)
For further information, see the [Spark PPL Test Instructions](../ppl-lang/local-spark-ppl-test-instruction.md)

## Manual Setup

@@ -18,7 +18,7 @@ sbt clean
sbt assembly
```

Refer to the [Developer Guide](../DEVELOPER_GUIDE.md) for more information.
Refer to the [Developer Guide](../../DEVELOPER_GUIDE.md) for more information.

## Using Docker Compose

