
[system tests] Validate fields in transforms are documented based on mappings #2341

Open
mrodm opened this issue Jan 14, 2025 · 2 comments
mrodm commented Jan 14, 2025

Follows #2207

In order to be agnostic to the structure of the documents ingested when running validations in system tests, it would be helpful to run validations comparing the mapping definitions instead.

For those mappings that cannot be validated against the preview mappings in #2206, they also need to be checked against the dynamic templates found in the data stream.

Packages can also define transforms with their own fields, and those fields should be validated as well.

Check, and if possible validate, those fields based on the available mappings and dynamic templates.


Mappings and dynamic templates can be retrieved from these APIs:

  • Mappings and dynamic templates installed by Fleet before ingesting any doc.
• These preview mappings can be retrieved using the simulate index template API:
      POST /_index_template/_simulate/<index_template_name>
      
      # Example
      POST /_index_template/_simulate/logs-microsoft_dhcp.log
      
  • Mappings and dynamic templates that are present after ingesting the docs as part of the system tests.
    • These mappings can be retrieved using this API:
      GET /<data_stream_test>/_mapping/
      
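For reference, a heavily abbreviated simulate response looks like this (the ecs_ip dynamic template is just a made-up example; the actual contents depend on the package). The mappings to validate against are under template.mappings, and the dynamic templates under template.mappings.dynamic_templates:

# Abbreviated response of POST /_index_template/_simulate/<index_template_name>
{
  "template": {
    "settings": { ... },
    "mappings": {
      "dynamic_templates": [
        {
          "ecs_ip": {
            "path_match": "ip",
            "mapping": { "type": "ip" }
          }
        }
      ],
      "properties": { ... }
    },
    "aliases": { ... }
  },
  "overlapping": [ ... ]
}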

The destination index (transforms.dest.index), as returned by the get transform API (GET /_transform/<transform_id>), could be used in the above APIs:

{
  "count": 1,
  "transforms": [
    {
      "id": "logs-ti_anomali.latest_intelligence-default-0.1.0",
      "dest": {
        "index": "logs-ti_anomali_latest.intelligence-1",
        "aliases": [
          {
            "alias": "logs-ti_anomali_latest.intelligence",
            "move_on_creation": true
          }
        ]
      },
      ...
    }
  ]
}

It looks like the index template name could also be known in advance (suffix -template?):

GET /logs-ti_anomali_latest.intelligence-1/_mapping

# using the destination index
POST /_index_template/_simulate_index/logs-ti_anomali_latest.intelligence-1

# using the index template
POST /_index_template/_simulate/logs-ti_anomali.latest_intelligence-template
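If we rely on the -template suffix, that assumption can be double-checked by looking up the index template directly (template name below guessed from the transform id):

# verify the guessed index template name exists
GET /_index_template/logs-ti_anomali.latest_intelligence-template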

To be tested:

  • Run these validations in stack 7.x
  • Run these validations in stack 8.x
  • Run these validations in input and integration packages.
• Run these validations in stacks with LogsDB enabled (synthetic _source).
mrodm commented Jan 29, 2025

Looking at how tests work and how transforms are tested by elastic-package, there are scenarios where not all documents end up being processed/validated by the transform.

For context, for every system test configuration defined in the package, the process is:

  • the documents found in the data stream are validated (via fields/mappings),
• if there is any transform matching the given configuration, the documents processed by the transform are validated as well (via fields/mappings).

So for each system test, elastic-package needs to wait until some documents have been processed by the transform before running the corresponding validations.

I think there is an issue that can arise in two different scenarios:

  • there are several test configuration files defined for a specific data stream.
  • there are several data streams matching the same transform configuration (source.index).

Focusing on the first scenario to illustrate the problem: a package could contain two system test configuration files for the same data stream. When validating the first test case, the process waits until the destination index of the transform has processed the expected documents. The problem is that the second test case will probably not wait for all of its documents, since the destination index (which is the same for all the test configurations) has already processed the documents from the first test case.

Moreover, it could happen that the documents for the second test case have the same unique fields (e.g. event.dataset and package.event.id) as the first test case. If so, no new document is going to be processed by the transform. But if there were different documents with different package.event.id values, it's likely that the process would not wait for those.

For instance, if the first test produces 16 documents and the second test produces 10 documents (with different unique fields), the transform validation for the first test would wait for those 16 documents. However, for the second test, as the transform destination already contains 16 documents, the wait would finish immediately instead of waiting for the additional 10 documents.

In contrast, if those 10 documents of the second test case have the same unique keys as the documents ingested by the first test case, the transform validation should not wait for any new documents (there are not going to be any).

Maybe transforms could be validated once all tests for data streams have been validated/tested. That implies rethinking how system tests are run: for instance, first running all data stream system tests in parallel, and once they are finished, testing all transforms (in parallel too?).

Any other ideas on how to handle this? @jsoriano

As the validation based on fields is going to be kept (related: #2381), that behavior should remain the same as it is now.

jsoriano commented Jan 29, 2025

Not sure what the best approach is. With what we have now, and with the proposal of testing after all tests have run, we would not be testing in isolation; there would be mixed tests either way, we would just be choosing different ways of mixing them.

If isolation of tests is really the objective, I think the only way to do it would be, for each transform and test (see the sketch after this list):

  • Clone the transform.
  • Replace the source index in the cloned transform to refer only to the tested data stream.
  • Run the transform tests checking the cloned transform.
  • Cleanup the clones and their destination indexes.
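A rough console sketch of those steps, with hypothetical ids and placeholder index names (there is no native clone API for transforms, so cloning means fetching the config and re-creating it under a new id):

# 1. fetch the original transform config
GET /_transform/logs-ti_anomali.latest_intelligence-default-0.1.0

# 2. re-create it under a new id, replacing source.index with the tested data stream
PUT /_transform/logs-ti_anomali.latest_intelligence-default-0.1.0-clone
{
  "source": { "index": "<tested_data_stream>" },
  "dest": { "index": "<clone_destination_index>" },
  ... rest of the original config ...
}

# 3. run it and validate the documents in <clone_destination_index>
POST /_transform/logs-ti_anomali.latest_intelligence-default-0.1.0-clone/_start

# 4. cleanup
POST /_transform/logs-ti_anomali.latest_intelligence-default-0.1.0-clone/_stop
DELETE /_transform/logs-ti_anomali.latest_intelligence-default-0.1.0-clone
DELETE /<clone_destination_index>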

Though even in this case the developer may actually expect multiple data streams to be exercised at the same time, and we could not test that.

Maybe we should introduce some kind of explicit definition of transform tests, so developers can choose with what sets of data streams and configurations the transform should be tested, and these tests would be executed after the data stream tests, as you suggest.
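For illustration only, such an explicit transform test definition could look something like this (hypothetical file location and keys, nothing like this exists in elastic-package today):

# hypothetical: elasticsearch/transform/latest_intelligence/test/config.yml
data_streams:              # data streams whose system tests feed the transform
  - intelligence
configs:                   # system test configs to run before validating the transform
  - test-default-config.yml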

Just some thoughts, I see pros and cons in all of the options.
