Feature/query #44

Merged
merged 15 commits into from
Oct 22, 2024

Conversation

hohonuuli
Member

@hohonuuli hohonuuli commented Oct 16, 2024

This branch adds support for ad-hoc queries against the annotations view in the database. This feature is needed to support the upcoming vars query web ui.

The endpoints for these new features will be under v1/query

  • Add distinct as a flag. Default should be true.
  • Add strict(?) as a flag. Default is false. When false adds the observation_uuid and index_recorded_timestamp columns and sorts by those.
  • Add orderby param. This should be ignored when strict is true (verify that this is the behavior we want)
  • Add equals operator
  • Rename like to contains and retain current behavior
  • Add like that takes a sql like string (e.g. http%)
  • Add integration tests for sql server and postgres

@hohonuuli
Member Author

hohonuuli commented Oct 17, 2024

There are new query endpoints, see http://portal.shore.mbari.org:8100/docs/#/Query.

Goals

These endpoints let app developers and some users fetch annotation data in a flexible manner. The primary use case is the new VARS web query. A secondary use would be to replace the SQL used in vars-gridview. Non-goals include specialized queries for reporting.

Notes

These endpoints operate against the annotations view in VARS. That view joins tables from the M3_ANNOTATIONS and M3_VIDEO_ASSETS databases into a single unified view. When working with data from this view, it's important to be aware of how the table joins affect the data returned. Typically, you will have to do some munging of the rows in your app, depending on what you're querying for. The observation_uuid column is your friend: you can use it to combine related columns (notably anything to do with associations or images, which are essentially a one-to-many join with observations). I used this same method with all the versions of the VARS query through the years and it works well.
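
As a rough sketch of that munging, assuming the returned rows have already been parsed into dictionaries keyed by column name, rows can be collapsed on observation_uuid as below. The `group_by_observation` helper and the shape of its output are hypothetical, not part of the service.

```python
def group_by_observation(rows):
    """Collapse joined rows into one record per observation_uuid.

    The concept is taken from the first row seen for each observation;
    (link_name, link_value) pairs and image_url values, which repeat because
    of the one-to-many joins, are collected into de-duplicated lists.
    """
    grouped = {}
    for row in rows:
        key = row["observation_uuid"]
        record = grouped.setdefault(
            key,
            {
                "observation_uuid": key,
                "concept": row.get("concept"),
                "associations": [],
                "image_urls": [],
            },
        )
        link = (row.get("link_name"), row.get("link_value"))
        if link[0] is not None and link not in record["associations"]:
            record["associations"].append(link)
        url = row.get("image_url")
        if url and url not in record["image_urls"]:
            record["image_urls"].append(url)
    return list(grouped.values())
```

This only works if observation_uuid is among the returned columns, so include it in select or make sure it is added for you.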

The query endpoint is relatively dynamic, so we can add or remove columns from the annotations view as we see fit. Note that the observation_uuid, imaged_moment_uuid, and index_recorded_timestamp columns are hard-coded into the codebase at the moment. This is required to do sensible things like giving stable sort keys and allowing for returns of related concepts/associations when doing a query.

/v1/query/columns

This endpoint allows your code to know what it can query for.

This returns information about each column in the annotation view. Most users only care about the columnName, but columnType may be useful too. Example from http://portal.shore.mbari.org:8100/v1/query/columns, edited for brevity:

[
  {
    "columnName": "imaged_moment_uuid",
    "columnType": "uniqueidentifier",
    "columnSize": 36,
    "columnLabel": "imaged_moment_uuid",
    "columnClassName": "java.lang.String"
  }
]
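
One possible use of this response, sketched under the assumption that it is a JSON array shaped like the example above: check a select list against the known column names before sending a query. The `unknown_columns` helper is hypothetical.

```python
import json


def unknown_columns(columns_json, select):
    """Return the names in `select` that the columns response does not list."""
    known = {column["columnName"] for column in json.loads(columns_json)}
    return [name for name in select if name not in known]


# Abbreviated response in the same shape as the example above.
sample = '[{"columnName": "imaged_moment_uuid", "columnType": "uniqueidentifier"}]'
print(unknown_columns(sample, ["imaged_moment_uuid", "not_a_column"]))
# prints ['not_a_column']
```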

/v1/query/count

Takes the same JSON body used in /v1/query/run (except you don't need to include select) and returns a count of matching rows. Note that this is NOT QUITE RIGHT yet. run includes DISTINCT so count will likely overestimate the number of rows returned.

/v1/query/run

Runs a query using a POST / JSON request and returns the DISTINCT result ordered by time as tab-delimited data

Here's an example body below, it's a bit like if SQL and JSON had a baby. Important things ...

  1. select - specifies the columns to return
  2. where - constraints. Can be one of the operators below
    • between - can be two elements of numbers or dates (as ISO8601)
    • contains - translates to LIKE '%word%'
    • equals - Matches a string
    • in - Same as SQL IN. The value is an array of strings: ["foo", "bar", "etc"]
    • isnull - can be true or false.
    • like - User has to supply the %
    • max - Becomes column <= max
    • min - Becomes column >= min
    • minmax - A number between the provided values. Takes an array of numbers [1, 100]
  3. concurrentObservations - When true runs your query and also returns any other annotations occurring on the same frames that were returned by your query. If true it overrides strict and strict will be treated as false regardless of what you set it to.
  4. relatedAssociations - When true, and you're constraining by some association field, will run your query but also return all other associations on observations in your query. Useful for things like searching for bounding box but also getting any other associations, not just the bounding box ones. If true it overrides strict and strict will be treated as false regardless of what you set it to.
  5. limit - the max number of rows to return
  6. offset - The starting row to return. When used with limit can page through the data.
  7. distinct - Applies distinct to the query, the default is false
  8. strict - When false, queries are modified so that the observation_uuid and index_recorded_timestamp columns are included in the returns. The default is true
  9. orderby - takes an array of column names to be used for sorting. The default is by index_recorded_timestamp
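
To make the operator descriptions concrete, here is an illustrative mapping from a single where entry to a parameterized SQL fragment. This is a sketch of the semantics as described in the list, not the service's actual translation code; the `to_sql` function and the use of `?` placeholders are assumptions.

```python
def to_sql(constraint):
    """Translate one `where` entry into (sql_fragment, parameters).

    The column name is interpolated directly, so real code would first have
    to check it against the names returned by /v1/query/columns.
    """
    col = constraint["column"]
    for op, value in constraint.items():
        if op == "column":
            continue
        if op in ("between", "minmax"):
            low, high = value
            return f"{col} BETWEEN ? AND ?", [low, high]
        if op == "contains":
            return f"{col} LIKE ?", [f"%{value}%"]
        if op == "equals":
            return f"{col} = ?", [value]
        if op == "in":
            placeholders = ", ".join("?" for _ in value)
            return f"{col} IN ({placeholders})", list(value)
        if op == "isnull":
            return (f"{col} IS NULL" if value else f"{col} IS NOT NULL"), []
        if op == "like":
            # The caller supplies the % wildcards, e.g. "http%".
            return f"{col} LIKE ?", [value]
        if op == "max":
            return f"{col} <= ?", [value]
        if op == "min":
            return f"{col} >= ?", [value]
        raise ValueError(f"unknown operator: {op}")
    raise ValueError("constraint has no operator")
```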

This query returns all Nanomia (and Nanomia bijuga) annotations with a localization, with a valid recorded timestamp, and that have images and are on a video file. Since concurrentObservations is true, it will also return all other bounding box annotations on the same frames (e.g. ones that are not Nanomia).

{
  "select": [
    "concept",
    "index_recorded_timestamp",
    "video_sequence_name",
    "video_uri",
    "image_url",
    "link_value"
  ],
  "where": [
    {
      "column": "concept",
      "in":["Nanomia", "Nanomia bijuga"]
    },
    {
      "column": "index_recorded_timestamp",
      "isnull": false
    },
    {
      "column": "image_url",
      "isnull": false
    },
    {
      "column": "link_name",
      "in": ["bounding box"]
    },
    {
      "column": "video_uri",
      "like": "http%"   
    }
  ],
  "limit": 5000,
  "offset": 0,
  "concurrentObservations": true,
  "relatedAssociations": false
}

You can run this from the command line with:

curl -X 'POST' \
  'http://portal.shore.mbari.org:8100/v1/query/run' \
  -H 'Content-Type: application/json' \
  -d '{
  "select": [
    "concept",
    "index_recorded_timestamp",
    "video_sequence_name",
    "video_uri",
    "image_url",
    "link_value"
  ],
  "where": [
    {
      "column": "concept",
      "in":["Nanomia", "Nanomia bijuga"]
    },
    {
      "column": "index_recorded_timestamp",
      "isnull": false
    },
    {
      "column": "image_url",
      "isnull": false
    },
    {
      "column": "link_name",
      "in": ["bounding box"]
    },
    {
      "column": "video_uri",
      "like": "http%"   
    }
  ],
  "limit": 5000,
  "offset": 0,
  "concurrentObservations": true,
  "relatedAssociations": false
}
'
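
For apps, the same request can be made from Python with only the standard library. This is a minimal sketch: the helper names are hypothetical, and treating the first line of the tab-delimited response as a header row is an assumption, not something confirmed above.

```python
import csv
import io
import json
import urllib.request


def build_request(url, body):
    """Build (but do not send) a POST request carrying the JSON query body."""
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_tsv(text):
    """Parse tab-delimited text into dicts.

    Assumes the first line holds the column names. Quoting is disabled so
    that values containing double quotes are passed through untouched.
    """
    reader = csv.DictReader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)


def run_query(url, body, timeout=60):
    """POST the query to a /v1/query/run URL and return the parsed rows."""
    with urllib.request.urlopen(build_request(url, body), timeout=timeout) as resp:
        return parse_tsv(resp.read().decode("utf-8"))
```

Calling `run_query("http://portal.shore.mbari.org:8100/v1/query/run", body)` with the JSON body above as a Python dict would then give a list of row dictionaries.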

@lonnylundsten @kevinsbarnard @NancyJS I would really appreciate any feedback so that this endpoint addresses current SQL use cases. (Kevin, especially for apps). Please be mindful it's not meant to address ALL needs for SQL (Lonny, that's especially true for reporting). Also, none of the operator names are set in stone, so we can tweak them if there's consensus (e.g. min -> gt). It's relatively easy to add other operators too if needed.

This is currently deployed starting with release 1.2.0 and is now running internally at MBARI. I'm waiting for feedback before I start writing any apps against it.

@hohonuuli hohonuuli marked this pull request as draft October 17, 2024 16:09
@lonnylundsten

@hohonuuli @kevinsbarnard

If I want to constrain a query by date, how can I do that using this API?

Is it something like this -- this doesn't work?
{"column": "index_recorded_timestamp", "between": "1996-01-01 and 2002-01-01"}

@lonnylundsten

@hohonuuli If I update the query so there is no limit (i.e., fetch all the data) when I do a big query (i.e., Nanomia bijuga) I get a time out error. Is that expected or should I be able to get all the data?

@lonnylundsten

@hohonuuli If I update the query so there is no limit (i.e., fetch all the data) when I do a big query (i.e., Nanomia bijuga) I get a time out error. Is that expected or should I be able to get all the data?

I may have crashed VARS....

Screenshot 2024-10-17 at 2 21 13 PM

@hohonuuli
Member Author

@lonnylundsten

  1. With great power comes great responsibility
  2. Don't do that.

That unbounded query will return a little over a half-million rows ('cause table joins). The service converts that to 1. an in memory data structure which is 2. converted to a String to be written back to the client. I'm pretty sure I haven't configured the service with enough memory to handle that query.

@hohonuuli
Member Author

The proper order to fetch large sets is:

  1. Use the count endpoint to get an estimate of the number of rows
  2. Page through the data using limit and offset to read in smaller chunks. I don't know what the max number of rows is and honestly, it depends on how busy the server is. Start with 5000 as the upper bound.
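
The paging step could look something like the sketch below. `fetch_page` is a hypothetical callable that takes a request body and returns a list of rows (for example a wrapper that POSTs to /v1/query/run and parses the response); the loop stops when a page comes back with fewer rows than the page size.

```python
def fetch_all(fetch_page, query, page_size=5000):
    """Yield every row of `query`, requesting `page_size` rows at a time.

    `query` is the JSON body as a dict; its limit and offset are overridden
    on each request and the original dict is left unmodified.
    """
    offset = 0
    while True:
        body = dict(query, limit=page_size, offset=offset)
        rows = fetch_page(body)
        yield from rows
        if len(rows) < page_size:
            break
        offset += page_size
```

Because this is a generator, rows can be processed as each page arrives instead of holding the whole result in memory.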

@hohonuuli
Member Author

hohonuuli commented Oct 17, 2024

If I want to constrain a query by date, how can I do that using this API?

It's supposed to be the following, but it might be broken ATM as I'm changing the API.

{
    "column": "index_recorded_timestamp",
    "between": [
        "1996-01-01T00:00:00Z",
        "2002-01-01T00:00:00Z"
    ]
}
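
If the dates start out as Python datetimes, a constraint in that shape can be built with a small helper. This is only a sketch: `between_dates` is a hypothetical name, and it assumes the endpoint accepts UTC timestamps formatted exactly like the ones in the example.

```python
from datetime import datetime, timezone


def between_dates(column, start, end):
    """Build a `between` constraint from two timezone-aware datetimes."""

    def iso(dt):
        # Normalize to UTC and format as in the example, e.g. 1996-01-01T00:00:00Z
        return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")

    return {"column": column, "between": [iso(start), iso(end)]}


constraint = between_dates(
    "index_recorded_timestamp",
    datetime(1996, 1, 1, tzinfo=timezone.utc),
    datetime(2002, 1, 1, tzinfo=timezone.utc),
)
```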

@mbari-org mbari-org deleted a comment from lonnylundsten Oct 17, 2024
@hohonuuli
Member Author

hohonuuli commented Oct 21, 2024

@lonnylundsten @kevinsbarnard I've updated the query params that /v1/query/run accepts. The docs above reflect the new parameters. Here are some new examples:

Get localizations of Nanomia that are missing an image, but for which one could be fetched using beholder:

curl -X 'POST' \
  'http://m3.shore.mbari.org/anno/v1/query/run' \
  -H 'Content-Type: application/json' \
  -d '{
  "select": [
    "concept",
    "index_recorded_timestamp",
    "video_sequence_name",
    "video_uri",
    "image_url",
    "link_value"
  ],
  "where": [
    {
      "column": "concept",
      "in":["Nanomia", "Nanomia bijuga"]
    },
    {
      "column": "index_elapsed_time_millis",
      "isnull": false
    },
    {
      "column": "image_url",
      "isnull": true
    },
    {
      "column": "link_name",
      "in": ["bounding box"]
    },
    {
      "column": "video_uri",
      "like": "http%"   
    }
  ],
  "distinct": true,
  "limit": 5000,
  "offset": 0,
  "concurrentObservations": true,
  "relatedAssociations": false
}
'

Get all concept names used in annotations

curl -X 'POST' \
  'http://m3.shore.mbari.org/anno/v1/query/run' \
  -H 'Content-Type: application/json' \
  -d '{
  "select": [
    "concept"
  ],
  "where": [
    {
      "column": "concept",
      "isnull": false
    }
  ],
  "strict": true,
  "orderby": ["concept"],
  "distinct": true
}
'

These changes have been deployed internally as v1.2.1

@hohonuuli hohonuuli marked this pull request as ready for review October 22, 2024 22:42
@hohonuuli hohonuuli merged commit ae39e2c into master Oct 22, 2024
2 checks passed
@hohonuuli hohonuuli deleted the feature/query branch October 22, 2024 22:52
@hohonuuli
Member Author

Released as v1.2.2
