Merge pull request #37 from boozallen/5-migrate-documentation
#5 📝 Tranche 7 of documentation migration
d-ryan-ashcraft authored May 3, 2024
2 parents 39252ce + f6a7ced commit cfded37
Showing 12 changed files with 990 additions and 23 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/publish.yml
@@ -6,7 +6,7 @@ on:

# Runs on pushes targeting the following branch(es)
push:
branches: [dev,5-migrate-documentation]
branches: [dev]

# Sets permissions of the GITHUB_TOKEN to allow deployment to GitHub Pages
permissions:
7 changes: 3 additions & 4 deletions README.md
@@ -33,17 +33,16 @@ In addition, the following build tools and container frameworks are an important

### Detailed Documentation

aiSSEMBLE documentation will be posted on GitHub pages along with this repository. We are [actively porting our existing
documentation](https://github.com/boozallen/aissemble/issues/5) and will update this section with a link as soon as it is up and available.
[aiSSEMBLE documentation is available on GitHub Pages](https://boozallen.github.io/aissemble).

### aiSSEMBLE Releases

aiSSEMBLE is currently released about once a month, but we intend to increase to around twice a month as we adjust
and hone our processes in the public GitHub repository.

## Configurations
## Environment Configuration

For details on using the configuration tool, please consult our [Configuring Your Environment guidance](https://boozallen.github.io/aissemble/aissemble/current/configurations.html).
Please consult our [Configuring Your Environment guidance](https://boozallen.github.io/aissemble/aissemble/current/configurations.html).

## Build

2 changes: 1 addition & 1 deletion docs/modules/ROOT/pages/databricks.adoc
@@ -23,7 +23,7 @@ you can set the following parameters:
spark.driver.extraJavaOptions
-DKRAUSENING_BASE=/dbfs/FileStore/shared_uploads/project-name/krausening/base
-DKRAUSENING_EXTENSIONS=/dbfs/FileStore/shared_uploads/project-name/krausening/databricks
-DKRAUSENING_PASSWORD=3uQ2j_=wmP5A2q8b
-DKRAUSENING_PASSWORD=<YOUR PASSWORD_HERE>
7. When you are done configuring, select `Create Cluster`
61 changes: 61 additions & 0 deletions docs/modules/ROOT/pages/guides/guides-configuration-store.adoc
@@ -0,0 +1,61 @@
= Leveraging the Configuration Store

== Overview
The Configuration Store is a tool that enables the various configurations for a project to be centrally defined and
managed, while also standardizing access to those configurations. The Configuration Store dynamically provides
environment-specific configurations based on the project's development lifecycle phase. Usage limitations and
regeneration strategies can also be set, allowing configurations to be automatically refreshed and thereby bolstering
the security of sensitive properties.

=== Setup
The Configuration Store tool is currently under development. Stay tuned for its release!

=== Usage
Via the Configuration Store's Helm chart, project teams can specify, as environment variables, the URIs that house the
project's various configurations. A base URI that houses the base/default configurations is required. Optionally,
a secondary URI can be specified that houses environment-specific configurations, which override and/or augment the
base configurations. Multiple configuration files can be stored at each URI. Configuration files are expected to be
in `YAML` format. Further guidance is covered below.

Because it is common practice to define a separate Helm chart for each environment in a project's development
lifecycle, it is encouraged to define one shared base URI plus a respective environment-specific URI per environment,
each housing the relevant overrides and augmentations.

The following example Configuration Store Helm chart demonstrates the URI specification for a CI deployment:
[source,yaml]
----
env:
  baseURI: <URI housing base configurations>
  envURI: <URI housing CI-specific overrides/augmentations>
----

Suppose we defined the following config at the `baseURI`:
[source,yaml]
----
groupName: messaging
properties:
  - name: connector
    value: smallrye-kafka
  - name: topic
    value: baseTopic
----

Next, suppose we defined the following config at the `envURI`:
[source,yaml]
----
groupName: messaging
properties:
  - name: topic
    value: ciTopic
  - name: newProperty
    value: newValue
----

The following calls to the tool would then resolve to these values:
[source,java]
----
ConfigServiceClient client = new ConfigServiceClient();
client.getProperty("messaging", "connector");   // smallrye-kafka (from the base configuration)
client.getProperty("messaging", "topic");       // ciTopic (overridden by the CI configuration)
client.getProperty("messaging", "newProperty"); // newValue (added by the CI configuration)
----
250 changes: 250 additions & 0 deletions docs/modules/ROOT/pages/guides/guides-efficient-debugging.adoc
@@ -0,0 +1,250 @@
= Leveraging Unit Testing and Live Updates: Recommendations for Efficient Development

== Introduction
During the development process, it's essential to employ effective strategies for testing and debugging to ensure the
quality and reliability of software. This page will outline how and when to leverage unit testing and local live
code deployment for efficient development and debugging while ensuring code longevity.

=== Unit Test

*What is a Unit Test*

Unit testing involves writing and running tests to verify the correctness of individual components or units of code.
It focuses on isolating and testing specific functions, methods, or classes independently of the larger system. Unit
testing makes development more efficient by identifying issues early, promoting modular and reusable code, enabling
faster refactoring, and facilitating collaboration among developers and teams. Therefore, it is advisable to
incorporate unit tests whenever possible in the development process.

More information on unit testing within aiSSEMBLE(TM) can be found in the
xref:testing.adoc#_unit_testing_the_pipeline[Testing the
Project page of the Getting Started guide].
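
As a quick, self-contained illustration (separate from the aiSSEMBLE-generated examples later on this page), the
sketch below unit tests a hypothetical `normalize` helper using plain `pytest`; the function, file, and test names are
assumptions for illustration only.

[source,python]
----
# test_normalize.py - minimal unit test sketch (hypothetical normalize() helper,
# not generated by aiSSEMBLE)
import pytest


def normalize(values: list[float]) -> list[float]:
    """Scale values into the [0, 1] range using min-max normalization."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


def test_normalize_scales_to_unit_range():
    # Typical case: values are rescaled between 0 and 1
    assert normalize([2.0, 4.0, 6.0]) == [0.0, 0.5, 1.0]


def test_normalize_handles_constant_input():
    # Edge case: identical values should not cause a divide-by-zero
    assert normalize([3.0, 3.0, 3.0]) == [0.0, 0.0, 0.0]


def test_normalize_rejects_empty_input():
    # Edge case: an empty input surfaces a clear error
    with pytest.raises(ValueError):
        normalize([])
----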


*Benefits of Unit Testing*

Unit testing offers several benefits that enhance the development process and code quality. Here, we introduce some
commonly seen examples of using unit testing to improve debugging efficiency.

1. Critical functionality and edge cases: Unit tests help verify that the code behaves as expected, which is
especially important for edge cases and corner scenarios.
2. Refactoring: During refactoring, unit tests act as a safety net, ensuring that existing functionality is not
unintentionally altered or broken.
3. Modular components: When developing modular components or libraries that will be reused across different projects,
unit tests provide confidence in their functionality and compatibility.

Please note that while unit testing should be regarded as a fundamental practice in software development and employed
whenever possible, it can be enhanced by complementary techniques like live updates. Live updates provide rapid
feedback during development, facilitating faster iterations and enabling immediate visual verification of code changes.
Developers should use unit tests and live updates when possible to enhance code quality and efficiency.

*Unit Testing Examples*

1. Data access and processing: Unit testing is particularly useful in scenarios such as data access and processing
within Python pipelines. For example, consider a data processing module that handles scaling and normalization of
input features. Unit testing for this module would involve writing tests specifically targeting the scaling and
normalization functions or classes.
* Example: To ensure the functional correctness of PySpark file ingestion for structured data, we can utilize unit
tests which read a CSV file into a file store and verify that the ingestion process was successful.
** Create a `data_ingestion.feature` file that defines the test scenario within the
`pipeline_name/pyspark_file_pipeline_name/tests/features` directory, with a path tailored to the specific pipeline
you've created.
** Create a file named `data_ingest_steps.py` in the `pipeline_name/pyspark_file_pipeline_name/tests/features/steps`
directory (the path should be tailored to the pipeline you have created). Within this file, use the `Ingest` class to
import the `data-to-ingest.csv` file into the file store. The unit test then verifies that the data file is stored in
the file store.
****
`data_ingestion.feature`
[source]
----
@data_ingest
Feature: Read a CSV file into a file store
  Scenario: Read a CSV file into a file store
    Given Ingest csv file exist
    Then the csv file data is read into the file store
----
`data_ingest_steps.py`
[source,python]
----
import os

from behave import given, then
import nose.tools as nt

from src.pyspark_file_data_ingest.step.ingest import Ingest


@given("Ingest csv file exist")
def step_impl(context):
    return


@then("the csv file data is read into the file store")
def step_impl(context):
    container_name = "movie-data"
    data_folder = os.getcwd()
    file_name = "data-to-ingest.csv"
    file_path = data_folder + "/" + file_name

    # Upload the CSV file into the local test file store
    context.ingest = Ingest()
    context.file_store = context.ingest.file_stores["LocalTest"]
    if not os.path.exists(container_name):
        os.makedirs(container_name)
    context.container = context.file_store.get_container(container_name=container_name)
    context.file_store.upload_object(file_path, context.container, file_name)

    # Run the ingest step, then verify the ingested data can be queried from the
    # test Spark session (provided by the behave test environment setup)
    context.ingest.execute_step()
    dataframe = context.test_spark_session.sql(
        """
        SELECT _c0 as title, _c1 as year, _c2 as certificate, _c3 as duration, _c4 as genre, _c5 as rating,
        _c6 as description, _c7 as stars, _c8 as votes from netflix_movies"""
    )
    dataframe.collect()
    nt.ok_(dataframe)
----
****
2. API calls and connections

If your Python pipeline interacts with external APIs to fetch data or make predictions, unit tests can be employed to
validate the API integration. This includes testing the request and response handling, verifying the correctness of
data mapping or parsing, and ensuring the pipeline behaves as expected when interacting with different API endpoints
or handling different response scenarios. It's important to note that to perform unit tests rather than integration
tests, mocks should be used. Mocking allows you to simulate the behavior of external API endpoints without making real
network requests.
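
Before the aiSSEMBLE-specific example below, here is a minimal, generic sketch of this mocking approach. It patches
`requests.get` so the test never performs a real network call; the `fetch_prediction` helper and its URL are
hypothetical and used for illustration only.

[source,python]
----
from unittest.mock import patch

import requests


def fetch_prediction(base_url: str, features: dict) -> float:
    """Hypothetical helper that calls an external inference API."""
    response = requests.get(f"{base_url}/predict", params=features)
    response.raise_for_status()
    return response.json()["prediction"]


@patch("requests.get")
def test_fetch_prediction_parses_response(mock_get):
    # Simulate the external API response without any real network traffic
    mock_get.return_value.status_code = 200
    mock_get.return_value.json.return_value = {"prediction": 0.87}

    result = fetch_prediction("http://example-api", {"feature": 1})

    # The helper parses the mocked payload and calls the endpoint exactly once
    assert result == 0.87
    mock_get.assert_called_once()
----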

* Example: In an aiSSEMBLE project, if you are interested in testing the connection of a Python pipeline to a
data store using the `create_engine` function, you can employ unit tests that use mocks to verify the successful
establishment of this connection.
** Create a scenario file named `create_engine_test.feature` in the
`pipeline_name/pipeline_name-pipelines/example-data-delivery-py-spark-pipeline/tests/features` directory. The path
should be tailored to the pipeline you've created.
** Create a Python file named `create_engine_test_steps.py` in the
`pipeline_name/pipeline_name-pipelines/example-data-delivery-py-spark-pipeline/tests/features/steps` directory. The
path should be tailored to the pipeline you've created. The unit test uses a mock to simulate the return value when
the user calls the `create_engine` function.

****
`create_engine_test.feature`
[source]
----
@create_engine_test
Feature: Connect to a data store via the create_engine function
  Scenario: Python pipelines can connect to a data store via the create_engine function
    Given Pyspark pipeline exists
    Then User can connect to data store via create_engine function
----
`create_engine_test_steps.py`
[source,python]
----
from unittest.mock import patch

from behave import given, then
import nose.tools as nt
import sqlalchemy
from sqlalchemy.pool import QueuePool


@given("Pyspark pipeline exists")
def step_impl(context):
    return


@then("User can connect to data store via create_engine function")
@patch("sqlalchemy.create_engine")
def step_impl(context, mock_create_engine):
    # Mock create_engine so no real database connection is made
    mock_create_engine.return_value = {
        "url": "postgresql://username:***@host:1001/database"
    }

    engine = sqlalchemy.create_engine(
        "postgresql://username:password@host:1001/database",
        poolclass=QueuePool,
        pool_size=5,
    )

    # The mocked engine reports the expected (masked) connection URL
    expected_url = "postgresql://username:***@host:1001/database"
    nt.eq_(engine["url"], expected_url)
----
****

=== Live Updates

*What are Live Updates*

Live updates, facilitated by tools like Tilt, allow developers to make changes to the code and see the results
immediately without the need for a full rebuild or redeployment.

*Benefits of Live Updates*

1. Rapid prototyping: When rapidly iterating on a feature or exploring different approaches, live updates enable quick
feedback by instantly reflecting code changes in a running application.
2. Debugging and small code changes: Live updates are effective for debugging scenarios where developers need to
quickly iterate on small code changes and observe the impact in real-time.

*Example of How to Implement Live Updates and How They Are Used*

An example of a live update is the automatic updating of the inference code in the local deployment, making testing
easier during the development process. The code in this example is generated as a manual action blob during the
project build to enable live updates. This code automates several tasks involved in the development and deployment
process of a machine learning component for an AI system. It enables developers to make changes to the code, sync
those changes with the running Docker container, and observe the results immediately using the live update feature.

[source]
----
# Add deployment resources here
load('ext://restart_process', 'docker_build_with_restart')

# quick-inference-compiler
local_resource(
    name='compile-quick-inference',
    cmd='cd project-name-pipelines/aissemble-machine-learning-inference/quick-inference && poetry run behave tests/features && poetry build && cd ../../.. && \
        cp -r project-name-pipelines/aissemble-machine-learning-inference/quick-inference/dist project-name-docker/project-name-quick-inference-docker/target/quick-inference',
    deps=['project-name-pipelines/aissemble-machine-learning-inference/quick-inference'],
    auto_init=False,
    ignore=['**/dist/']
)

sync_properties = sync(
    local_path='project-name-docker/project-name-quick-inference-docker/target/quick-inference/dist',
    remote_path='/modules/quick-inference'
)

# project-name-quick-inference-docker
docker_build_with_restart(
    ref='boozallen/project-name-quick-inference-docker',
    context='project-name-docker/project-name-quick-inference-docker',
    live_update=[
        sync_properties,
        run('cd /modules/quick-inference; for x in *.whl; do pip install $x --no-cache-dir --no-deps --force-reinstall; done')
    ],
    entrypoint='python -m quick_inference.inference_api_driver "fastAPI" & python -m quick_inference.inference_api_driver "grpc"',
    build_args=build_args,
    dockerfile='project-name-docker/project-name-quick-inference-docker/src/main/resources/docker/Dockerfile'
)
----

*Code Explanation*

The code loads a module called `restart_process` and a function called `docker_build_with_restart`. It then defines
a local resource named `compile-quick-inference` with specific commands and dependencies. A synchronization property
is created to sync a local path with a remote path. Finally, the code builds a Docker image with live update
capabilities using the provided parameters, including the image reference, build context, synchronization properties,
entrypoint, build arguments, and Dockerfile location.

* `load('ext://restart_process', 'docker_build_with_restart')`: Loads the external extension called
`restart_process`, specifically the `docker_build_with_restart` function, which is referenced later in the code and
enables the live update functionality for the Docker container.
* `local_resource( name='compile-quick-inference', cmd='cd project-name-pipelines/aissemble-machine-learning-inference/...)`:
Defines a local resource named `compile-quick-inference` with a set of commands to be executed locally. It builds and
tests the `quick-inference` module and copies the resulting `dist` directory to a specific location.
* `sync(
local_path='project-name-docker/project-name-quick-inference-docker/target/quick-inference/dist',
remote_path='/modules/quick-inference')`: This specifies the locations that need to be synchronized. It ensures that
the `dist` directory from the previous step is kept in sync with a specific directory on the remote target.
* `docker_build_with_restart(
ref='boozallen/project-name-quick-inference-docker',
context='project-name-docker/project-name-quick-inference-docker',...)`: This function was loaded earlier in the code
via `load('ext://restart_process', 'docker_build_with_restart')`. The configuration includes the image reference, build
context location, and additional options, and defines the setup of a Docker container with live update functionality.

*How Live Updates Enable Debugging*

Live update functionality can be used to facilitate debugging inference steps within aiSSEMBLE projects. Here's a
step-by-step guide on how live updates can help you quickly visualize changes when modifying an endpoint response:

1. Open the file you'd like to modify within your pipeline step, such as `inference/rest/inference_api_rest.py`,
which defines the REST API logic, and locate the endpoint you wish to modify.
2. Modify the return statement of the endpoint to return a different response.
** In the case of the `/healthcheck` endpoint, you can change the return statement to a custom message (see the
sketch after this list).
3. Save the changes.
** Now trigger the endpoint with the curl command:
`curl --location 'http://0.0.0.0:7080/healthcheck' --header 'Content-Type: application/json'`.
The response you receive depends on the modifications made to the `/healthcheck` endpoint. By default, the
endpoint returns the string `"Inference service for InferencePipeline is running"`. If you modify the return
statement in the script, for example to change the response message to `"Health check passed!"`, the curl command
will return the updated response `"Health check passed!"`.
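
For illustration, here is a minimal sketch of the kind of change described in step 2. It assumes a FastAPI-style route
handler; the `app` object, function name, and file contents are hypothetical and will differ from the
`inference_api_rest.py` generated in your project.

[source,python]
----
# inference/rest/inference_api_rest.py (hypothetical excerpt)
from fastapi import FastAPI

app = FastAPI()


@app.get("/healthcheck")
def healthcheck() -> str:
    # Default response:
    # return "Inference service for InferencePipeline is running"

    # Modified response - after saving, the live update syncs the change into the
    # running container, so the next curl call returns the new message
    return "Health check passed!"
----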

By following this step-by-step guide and utilizing the live update feature to modify the endpoint response, you can
quickly visualize the changes and significantly improve your debugging efficiency.