Merge pull request #37 from boozallen/5-migrate-documentation
#5 📝 Tranche 7 of documentation migration
d-ryan-ashcraft authored May 3, 2024
2 parents 39252ce + f6a7ced commit cfded37
Showing 12 changed files with 990 additions and 23 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/publish.yml
@@ -6,7 +6,7 @@ on:

# Runs on pushes targeting the following branch(es)
push:
branches: [dev,5-migrate-documentation]
branches: [dev]

# Sets permissions of the GITHUB_TOKEN to allow deployment to GitHub Pages
permissions:
7 changes: 3 additions & 4 deletions README.md
@@ -33,17 +33,16 @@ In addition, the following build tools and container frameworks are an important

### Detailed Documentation

aiSSEMBLE documentation will be posted on GitHub pages along with this repository. We are [actively porting our existing
documentation](https://github.com/boozallen/aissemble/issues/5) and will update this section with a link as soon as it is up and available.
[aiSSEMBLE documentation is available on GitHub Pages](https://boozallen.github.io/aissemble).

### aiSSEMBLE Releases

aiSSEMBLE is currently released about once a month, but we intend to increase to around twice a month as we adjust
and hone our processes in the public GitHub repository.

## Configurations
## Environment Configuration

For details on using the configuration tool, please consult our [Configuring Your Environment guidance](https://boozallen.github.io/aissemble/aissemble/current/configurations.html).
Please consult our [Configuring Your Environment guidance](https://boozallen.github.io/aissemble/aissemble/current/configurations.html).

## Build

2 changes: 1 addition & 1 deletion docs/modules/ROOT/pages/databricks.adoc
@@ -23,7 +23,7 @@ you can set the following parameters:
spark.driver.extraJavaOptions
-DKRAUSENING_BASE=/dbfs/FileStore/shared_uploads/project-name/krausening/base
-DKRAUSENING_EXTENSIONS=/dbfs/FileStore/shared_uploads/project-name/krausening/databricks
-DKRAUSENING_PASSWORD=3uQ2j_=wmP5A2q8b
-DKRAUSENING_PASSWORD=<YOUR PASSWORD_HERE>
7. When you are done configuring, select `Create Cluster`
61 changes: 61 additions & 0 deletions docs/modules/ROOT/pages/guides/guides-configuration-store.adoc
@@ -0,0 +1,61 @@
= Leveraging the Configuration Store

== Overview
The Configuration Store is a tool that enables the various configurations for a project to be centrally defined and
managed, while also standardizing access to those configurations. The Configuration Store dynamically provides
environment-specific configurations based on the project's development lifecycle phase. Usage limitations and
regeneration strategies can also be set, allowing configurations to be automatically refreshed and thereby bolstering
the security of sensitive properties.

=== Setup
The Configuration Store tool is currently under development. Stay tuned for its release!

=== Usage
Via the Configuration Store's Helm chart, project teams can specify, as environment variables, the URIs that house the
project's various configurations. A base URI that houses the base/default configurations is required. Optionally,
a secondary URI can be specified that houses environment-specific configurations, which override and/or augment the
base configurations. Multiple configuration files can be stored at each URI. Configuration files are expected to be
in `YAML` format. Further guidance is covered below.

Because it is common practice to define a separate Helm chart for each environment in a project's development
lifecycle, it is encouraged to define one shared base URI plus a respective environment-specific URI per environment,
each housing the relevant overrides and augmentations.

The following example Configuration Store Helm chart demonstrates the URI specification for a CI deployment:
[source,yaml]
----
env:
  baseURI: <URI housing base configurations>
  envURI: <URI housing CI-specific overrides/augmentations>
----

Suppose we defined the following config at the `baseURI`:
[source,yaml]
----
groupName: messaging
properties:
  - name: connector
    value: smallrye-kafka
  - name: topic
    value: baseTopic
----

Next, suppose we defined the following config at the `envURI`:
[source,yaml]
----
groupName: messaging
properties:
  - name: topic
    value: ciTopic
  - name: newProperty
    value: newValue
----

The following calls to the tool would then resolve to these values:
[source,java]
----
ConfigServiceClient client = new ConfigServiceClient();
client.getProperty("messaging", "connector");   // smallrye-kafka (from the base configuration)
client.getProperty("messaging", "topic");       // ciTopic (overridden by the CI configuration)
client.getProperty("messaging", "newProperty"); // newValue (added by the CI configuration)
----
250 changes: 250 additions & 0 deletions docs/modules/ROOT/pages/guides/guides-efficient-debugging.adoc
@@ -0,0 +1,250 @@
= Leveraging Unit Testing and Live Updates: Recommendations for Efficient Development

== Introduction
During the development process, it's essential to employ effective strategies for testing and debugging to ensure the
quality and reliability of software. This page will outline how and when to leverage unit testing and local live
code deployment for efficient development and debugging while ensuring code longevity.

=== Unit Test

*What is a Unit Test*

Unit testing involves writing and running tests to verify the correctness of individual components or units of code.
It focuses on isolating and testing specific functions, methods, or classes independently of the larger system. Unit
testing makes development more efficient by identifying issues early, promoting modular and reusable code, enabling
faster refactoring, and facilitating collaboration among developers and teams. Therefore, it is advisable to
incorporate unit tests whenever possible in the development process.

More information on unit testing within aiSSEMBLE(TM) can be found in the
xref:testing.adoc#_unit_testing_the_pipeline[Testing the
Project page of the Getting Started guide].
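
As a quick, self-contained illustration (separate from the aiSSEMBLE-generated examples later on this page), the
sketch below unit tests a hypothetical `normalize` helper using plain `pytest`; the function, file, and test names are
assumptions for illustration only.

[source,python]
----
# test_normalize.py - minimal unit test sketch (hypothetical normalize() helper,
# not generated by aiSSEMBLE)
import pytest


def normalize(values: list[float]) -> list[float]:
    """Scale values into the [0, 1] range using min-max normalization."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


def test_normalize_scales_to_unit_range():
    # Typical case: values are rescaled between 0 and 1
    assert normalize([2.0, 4.0, 6.0]) == [0.0, 0.5, 1.0]


def test_normalize_handles_constant_input():
    # Edge case: identical values should not cause a divide-by-zero
    assert normalize([3.0, 3.0, 3.0]) == [0.0, 0.0, 0.0]


def test_normalize_rejects_empty_input():
    # Edge case: an empty input surfaces a clear error
    with pytest.raises(ValueError):
        normalize([])
----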


*Benefits of Unit Testing*

Unit testing offers several benefits that enhance the development process and code quality. Here, we introduce some
commonly seen examples of using unit testing to improve debugging efficiency.

1. Critical functionality and edge cases: Unit tests help verify that the code behaves as expected, which is
especially important for edge cases and corner scenarios.
2. Refactoring: During refactoring, unit tests act as a safety net, ensuring that existing functionality is not
unintentionally altered or broken.
3. Modular components: When developing modular components or libraries that will be reused across different projects,
unit tests provide confidence in their functionality and compatibility.

Please note that while unit testing should be regarded as a fundamental practice in software development and employed
whenever possible, it can be enhanced by complementary techniques like live updates. Live updates provide rapid
feedback during development, facilitating faster iterations and enabling immediate visual verification of code changes.
Developers should use unit tests and live updates when possible to enhance code quality and efficiency.

*Unit Testing Examples*

1. Data access and processing: Unit testing is particularly useful in scenarios such as data access and processing
within Python pipelines. For example, consider a data processing module that handles scaling and normalization of
input features. Unit testing for this module would involve writing tests specifically targeting the scaling and
normalization functions or classes.
* Example: To ensure the functional correctness of PySpark file ingestion for structured data, we can utilize unit
tests which read a CSV file into a file store and verify that the ingestion process was successful.
** Create a `data_ingestion.feature` file that defines the test scenario within the
`pipeline_name/pyspark_file_pipeline_name/tests/features` directory, with a path tailored to the specific pipeline
you've created.
** Create a file named `data_ingest_steps.py` in the `pipeline_name/pyspark_file_pipeline_name/tests/features/steps`
directory (the path should be tailored to the pipeline you have created). Within this file, use the `Ingest` class to
import the `data-to-ingest.csv` file into the file store. The unit test then verifies that the data file is stored in
the file store.
****
`data_ingestion.feature`
[source]
----
@data_ingest
Feature: Read a CSV file into a file store
  Scenario: Read a CSV file into a file store
    Given Ingest csv file exist
    Then the csv file data is read into the file store
----
`data_ingest_steps.py`
[source,python]
----
import os

from behave import given, then
import nose.tools as nt

from src.pyspark_file_data_ingest.step.ingest import Ingest


@given("Ingest csv file exist")
def step_impl(context):
    return


@then("the csv file data is read into the file store")
def step_impl(context):
    container_name = "movie-data"
    data_folder = os.getcwd()
    file_name = "data-to-ingest.csv"
    file_path = data_folder + "/" + file_name

    # Upload the CSV file into the local test file store
    context.ingest = Ingest()
    context.file_store = context.ingest.file_stores["LocalTest"]
    if not os.path.exists(container_name):
        os.makedirs(container_name)
    context.container = context.file_store.get_container(container_name=container_name)
    context.file_store.upload_object(file_path, context.container, file_name)

    # Run the ingest step, then verify the ingested data can be queried from the
    # test Spark session (provided by the behave test environment setup)
    context.ingest.execute_step()
    dataframe = context.test_spark_session.sql(
        """
        SELECT _c0 as title, _c1 as year, _c2 as certificate, _c3 as duration, _c4 as genre, _c5 as rating,
        _c6 as description, _c7 as stars, _c8 as votes from netflix_movies"""
    )
    dataframe.collect()
    nt.ok_(dataframe)
----
****
2. API calls and connections

If your Python pipeline interacts with external APIs to fetch data or make predictions, unit tests can be employed to
validate the API integration. This includes testing the request and response handling, verifying the correctness of
data mapping or parsing, and ensuring the pipeline behaves as expected when interacting with different API endpoints
or handling different response scenarios. It's important to note that to perform unit tests rather than integration
tests, mocks should be used. Mocking allows you to simulate the behavior of external API endpoints without making real
network requests.
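
Before the aiSSEMBLE-specific example below, here is a minimal, generic sketch of this mocking approach. It patches
`requests.get` so the test never performs a real network call; the `fetch_prediction` helper and its URL are
hypothetical and used for illustration only.

[source,python]
----
from unittest.mock import patch

import requests


def fetch_prediction(base_url: str, features: dict) -> float:
    """Hypothetical helper that calls an external inference API."""
    response = requests.get(f"{base_url}/predict", params=features)
    response.raise_for_status()
    return response.json()["prediction"]


@patch("requests.get")
def test_fetch_prediction_parses_response(mock_get):
    # Simulate the external API response without any real network traffic
    mock_get.return_value.status_code = 200
    mock_get.return_value.json.return_value = {"prediction": 0.87}

    result = fetch_prediction("http://example-api", {"feature": 1})

    # The helper parses the mocked payload and calls the endpoint exactly once
    assert result == 0.87
    mock_get.assert_called_once()
----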

* Example: In an aiSSEMBLE project, if you are interested in testing the connection of a Python pipeline to a
data store using the `create_engine` function, you can employ unit tests that use mocks to verify the successful
establishment of this connection.
** Create a scenario file named `create_engine_test.feature` in the
`pipeline_name/pipeline_name-pipelines/example-data-delivery-py-spark-pipeline/tests/features` directory. The path
should be tailored to the pipeline you've created.
** Create a Python file named `create_engine_test_steps.py` in the
`pipeline_name/pipeline_name-pipelines/example-data-delivery-py-spark-pipeline/tests/features/steps` directory. The
path should be tailored to the pipeline you've created. The unit test uses a mock to simulate the return value when
the user calls the `create_engine` function.

****
`create_engine_test.feature`
[source]
----
@create_engine_test
Feature: Connect to a data store via the create_engine function
  Scenario: Python pipelines can connect to a data store via the create_engine function
    Given Pyspark pipeline exists
    Then User can connect to data store via create_engine function
----
`create_engine_test_steps.py`
[source,python]
----
from unittest.mock import patch

from behave import given, then
import nose.tools as nt
import sqlalchemy
from sqlalchemy.pool import QueuePool


@given("Pyspark pipeline exists")
def step_impl(context):
    return


@then("User can connect to data store via create_engine function")
@patch("sqlalchemy.create_engine")
def step_impl(context, mock_create_engine):
    # Mock create_engine so no real database connection is made
    mock_create_engine.return_value = {
        "url": "postgresql://username:***@host:1001/database"
    }

    engine = sqlalchemy.create_engine(
        "postgresql://username:password@host:1001/database",
        poolclass=QueuePool,
        pool_size=5,
    )

    # The mocked engine reports the expected (masked) connection URL
    expected_url = "postgresql://username:***@host:1001/database"
    nt.eq_(engine["url"], expected_url)
----
****

=== Live Updates

*What are Live Updates*

Live updates, facilitated by tools like Tilt, allow developers to make changes to the code and see the results
immediately without the need for a full rebuild or redeployment.

*Benefits of Live Updates*

1. Rapid prototyping: When rapidly iterating on a feature or exploring different approaches, live updates enable quick
feedback by instantly reflecting code changes in a running application.
2. Debugging and small code changes: Live updates are effective for debugging scenarios where developers need to
quickly iterate on small code changes and observe the impact in real-time.

*Example of How to Implement Live Updates and How They Are Used*

An example of a live update is the automatic updating of the inference code in the local deployment, making testing
easier during the development process. The code in this example is generated as a manual action blob during the
project build to enable live updates. This code automates several tasks involved in the development and deployment
process of a machine learning component for an AI system. It enables developers to make changes to the code, sync
those changes with the running Docker container, and observe the results immediately using the live update feature.

[source]
----
# Add deployment resources here
load('ext://restart_process', 'docker_build_with_restart')

# quick-inference-compiler
local_resource(
    name='compile-quick-inference',
    cmd='cd project-name-pipelines/aissemble-machine-learning-inference/quick-inference && poetry run behave tests/features && poetry build && cd ../../.. && \
        cp -r project-name-pipelines/aissemble-machine-learning-inference/quick-inference/dist project-name-docker/project-name-quick-inference-docker/target/quick-inference',
    deps=['project-name-pipelines/aissemble-machine-learning-inference/quick-inference'],
    auto_init=False,
    ignore=['**/dist/']
)

sync_properties = sync(
    local_path='project-name-docker/project-name-quick-inference-docker/target/quick-inference/dist',
    remote_path='/modules/quick-inference'
)

# project-name-quick-inference-docker
docker_build_with_restart(
    ref='boozallen/project-name-quick-inference-docker',
    context='project-name-docker/project-name-quick-inference-docker',
    live_update=[
        sync_properties,
        run('cd /modules/quick-inference; for x in *.whl; do pip install $x --no-cache-dir --no-deps --force-reinstall; done')
    ],
    entrypoint='python -m quick_inference.inference_api_driver "fastAPI" & python -m quick_inference.inference_api_driver "grpc"',
    build_args=build_args,
    dockerfile='project-name-docker/project-name-quick-inference-docker/src/main/resources/docker/Dockerfile'
)
----

*Code Explanation*

The code loads a module called `restart_process` and a function called `docker_build_with_restart`. It then defines
a local resource named `compile-quick-inference` with specific commands and dependencies. A synchronization property
is created to sync a local path with a remote path. Finally, the code builds a Docker image with live update
capabilities using the provided parameters, including the image reference, build context, synchronization properties,
entrypoint, build arguments, and Dockerfile location.

* `load('ext://restart_process', 'docker_build_with_restart')`: Loads the external extension called
`restart_process`, specifically the `docker_build_with_restart` function, which is referenced later in the code and
enables the live update functionality for the Docker container.
* `local_resource( name='compile-quick-inference', cmd='cd project-name-pipelines/aissemble-machine-learning-inference/...)`:
Defines a local resource named `compile-quick-inference` with a set of commands to be executed locally. It builds and
tests the `quick-inference` module and copies the resulting `dist` directory to a specific location.
* `sync(
local_path='project-name-docker/project-name-quick-inference-docker/target/quick-inference/dist',
remote_path='/modules/quick-inference')`: This specifies the locations that need to be synchronized. It ensures that
the `dist` directory from the previous step is kept in sync with a specific directory on the remote target.
* `docker_build_with_restart(
ref='boozallen/project-name-quick-inference-docker',
context='project-name-docker/project-name-quick-inference-docker',...)`: This function was loaded earlier in the code
via `load('ext://restart_process', 'docker_build_with_restart')`. The configuration includes the image reference, build
context location, and additional options, and defines the setup of a Docker container with live update functionality.

*How Live Updates Enable Debugging*

Live update functionality can be used to facilitate debugging inference steps within aiSSEMBLE projects. Here's a
step-by-step guide on how live updates can help you quickly visualize changes when modifying an endpoint response:

1. Open the file you'd like to modify within your pipeline step, such as `inference/rest/inference_api_rest.py`,
which defines the REST API logic, and locate the endpoint you wish to modify.
2. Modify the return statement of the endpoint to return a different response.
** In the case of the `/healthcheck` endpoint, you can change the return statement to a custom message (see the
sketch after this list).
3. Save the changes.
** Now trigger the endpoint with the curl command:
`curl --location 'http://0.0.0.0:7080/healthcheck' --header 'Content-Type: application/json'`.
The response you receive depends on the modifications made to the `/healthcheck` endpoint. By default, the
endpoint returns the string `"Inference service for InferencePipeline is running"`. If you modify the return
statement in the script, for example to change the response message to `"Health check passed!"`, the curl command
will return the updated response `"Health check passed!"`.
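
For illustration, here is a minimal sketch of the kind of change described in step 2. It assumes a FastAPI-style route
handler; the `app` object, function name, and file contents are hypothetical and will differ from the
`inference_api_rest.py` generated in your project.

[source,python]
----
# inference/rest/inference_api_rest.py (hypothetical excerpt)
from fastapi import FastAPI

app = FastAPI()


@app.get("/healthcheck")
def healthcheck() -> str:
    # Default response:
    # return "Inference service for InferencePipeline is running"

    # Modified response - after saving, the live update syncs the change into the
    # running container, so the next curl call returns the new message
    return "Health check passed!"
----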

By following this step-by-step guide and utilizing the live update feature to modify the endpoint response, you can
quickly visualize the changes and significantly improve your debugging efficiency.