Backend should return results data in a task agnostic format for frontend consumption #678

njbrake · 2025-01-17T20:48:58Z

Motivation

I believe this is also a relevant issue for the task that @HareeshBahuleyan is working on:

As I understand it currently, the metric names and settings are hardcoded into the frontend, e.g. https://github.com/mozilla-ai/lumigator/blob/main/lumigator/frontend/src/components/molecules/LExperimentResults.vue#L69

As we support new metrics and custom user metrics in the future, I'm thinking that the frontend should be able to remain ignorant about what the specific metric is.

In order to support this, I think we need to restructure the file whose path we provide to the frontend, which relates to #670.

Currently, the eval job creates a file and saves it to S3; the backend then creates a presigned URL so that the frontend can download it. The file looks like this:

{
  "rouge": {
    "rouge1": [
      0.27272727272727276,
      0.27586206896551724
    ],
    "rougeLsum": [
      0.27272727272727276,
      0.20689655172413796
    ],
    "rouge1_mean": 0.27429467084639503,
    "rougeLsum_mean": 0.23981191222570536
  },
......
  "bertscore": {
    "precision": [
      0.8871338963508606,
      0.8618351817131042
    ],
    "recall": [
      0.8770406246185303,
      0.8888148069381714
    ],
...
    "hashcode": "hashymchashhash",
    "precision_mean": 0.8744845390319824,
    "recall_mean": 0.8829277157783508
  },
  "predictions": [
    "a summary",
    "a summary2"
  ],
  "ground_truth": [
    "the real summary",
    "the real summary2"
  ]
}

Can we reorganize this into something that looks more like this:

{
  "metrics": {
    "rouge1": {
      "displayName": "ROUGE-1",
      "values": [
        0.27272727272727276,
        0.27586206896551724
      ],
      "mean": 0.27429467084639503
    },
    "rougeL": {
      "displayName": "ROUGE-L",
      "values": [
        0.27272727272727276,
        0.20689655172413796
      ],
      "mean": 0.23981191222570536
    },
    "bertscorePrecision": {
      "displayName": "BERTScore Precision",
      "values": [
        0.8871338963508606,
        0.8618351817131042
      ],
      "mean": 0.8744845390319824
    },
    "bertscoreRecall": {
      "displayName": "BERTScore Recall",
      "values": [
        0.8770406246185303,
        0.8888148069381714
      ],
      "mean": 0.8829277157783508
    }
  },
  "artifacts": {
    "predictions": [
      "a summary",
      "a summary2"
    ],
    "ground_truth": [
      "the real summary",
      "the real summary2"
    ]
  },
  "parameters": {
    "max_input_length": 512,
    "max_output_length": 150,
    "num_beams": 4,
    "length_penalty": 2.0,
    "early_stopping": true
  }
}
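
To make the reshaping concrete, here is a minimal Python sketch of how the current file could be flattened into the proposed layout. The function name `to_task_agnostic`, the `ARTIFACT_KEYS` set, and the key-naming rule are assumptions made for illustration, not existing Lumigator code. It leaves out `displayName` and `parameters`, since neither can be derived from the current file.

```python
from statistics import fmean
from typing import Any

# Assumption: these top-level keys of the current file are artifacts, not metrics.
ARTIFACT_KEYS = {"predictions", "ground_truth"}


def to_task_agnostic(legacy: dict[str, Any]) -> dict[str, Any]:
    """Flatten the current result layout into {"metrics": ..., "artifacts": ...}."""
    metrics: dict[str, Any] = {}
    artifacts: dict[str, Any] = {}
    for family, content in legacy.items():
        if family in ARTIFACT_KEYS:
            artifacts[family] = content
            continue
        for name, values in content.items():
            # Only per-sample score lists become metrics; scalars such as
            # "rouge1_mean" or "hashcode" are skipped at this point.
            if not isinstance(values, list):
                continue
            # Hypothetical naming rule: "rouge1" stays as-is,
            # "precision" under "bertscore" becomes "bertscorePrecision".
            key = name if name.startswith(family) else f"{family}{name.capitalize()}"
            # Reuse the stored mean when the job provided one, else compute it.
            mean = content.get(f"{name}_mean")
            if mean is None:
                mean = fmean(values)
            metrics[key] = {"values": values, "mean": mean}
    return {"metrics": metrics, "artifacts": artifacts}
```

Anything the frontend iterates over then lives under `metrics`, so a new metric only has to show up in the job output to be picked up.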

This would be the first step towards a more structured output. As I understand it, the backend is agnostic about what is in the result.json output by a job, and the frontend relies on the job to create result.json in the right format. My thought was that it would help if the backend were responsible for ensuring that the job output conforms to the format the frontend requires.
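
As a rough illustration of what backend-side validation could look like, the sketch below checks a parsed result against the `metrics`/`artifacts` layout proposed above. It uses only the standard library; `validate_result` and its error messages are hypothetical, and where such a check would actually live in the backend is still an open question.

```python
from numbers import Real
from typing import Any


def _is_number(value: Any) -> bool:
    # bool is a subclass of int, so exclude it explicitly.
    return isinstance(value, Real) and not isinstance(value, bool)


def validate_result(result: Any) -> list[str]:
    """Return a list of problems; an empty list means the result conforms."""
    if not isinstance(result, dict):
        return ["result must be a JSON object"]
    metrics = result.get("metrics")
    if not isinstance(metrics, dict):
        return ["'metrics' must be an object"]
    errors: list[str] = []
    for name, entry in metrics.items():
        if not isinstance(entry, dict):
            errors.append(f"metrics.{name} must be an object")
            continue
        values = entry.get("values")
        if not isinstance(values, list) or not all(_is_number(v) for v in values):
            errors.append(f"metrics.{name}.values must be a list of numbers")
        if not _is_number(entry.get("mean")):
            errors.append(f"metrics.{name}.mean must be a number")
    # Artifacts are optional in this sketch, but must be an object if present.
    if not isinstance(result.get("artifacts", {}), dict):
        errors.append("'artifacts' must be an object")
    return errors
```

Note that the check never mentions a specific metric name, which is the point: the backend enforces the shape, not the content.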

I think(?) this should help to make our code more flexible to handle new tasks.

To keep this backwards compatible at first, we can update each job to save both the existing result.json file and a new file that conforms to our newly agreed-upon format. This way we don't need to coordinate the backend and frontend changes as tightly.
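
A minimal sketch of the dual-write idea, assuming the job already holds both representations in memory. The helper `save_results` and the file name `result_v2.json` are placeholders, and the sketch writes to a local directory rather than S3 to stay self-contained.

```python
import json
from pathlib import Path
from typing import Any


def save_results(
    output_dir: Path, legacy: dict[str, Any], structured: dict[str, Any]
) -> tuple[Path, Path]:
    """Write the existing result.json plus a second file in the new format."""
    output_dir.mkdir(parents=True, exist_ok=True)
    legacy_path = output_dir / "result.json"
    # Hypothetical name for the new file; the real name is still to be agreed.
    structured_path = output_dir / "result_v2.json"
    legacy_path.write_text(json.dumps(legacy, indent=2))
    structured_path.write_text(json.dumps(structured, indent=2))
    return legacy_path, structured_path
```

The frontend can keep reading result.json until it is ready to switch, at which point the old file can be dropped.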

Alternatives

This is more of an incremental step. We could instead take a bigger bite and completely overhaul how files are saved and provided to the frontend, for example by saving and providing 3+ files to the frontend instead of the single result.json.

Contribution

Happy to help work on this if this direction sounds good.

Have you searched for similar issues before submitting this one?

Yes, I have searched for similar issues


njbrake · 2025-01-17T20:53:57Z

I think @george-mzai is the right person to ask about what he needs the new file to contain so that he can access everything he needs to display the content? I'm thinking that this restructure will make your life on the frontend much easier, but let me know if that's not the case 😆

I think @peteski22 may be the right person to pull in to ask about whether this API makes sense from the backend perspective? Right now, as I understand it, the backend doesn't do any validation of the result.json artifact created by a job, so I'm not sure where the validation should go.

I think @ividal or @veekaybee or @aittalam may be the ones to comment on whether it makes sense to alter the job output to conform to this format?

ividal · 2025-01-24T18:04:20Z

My 2 cents, this makes perfect sense.

Two upcoming bigger bodies of work will be:

Supporting Translation - this can start with metrics we already support, but it will likely make sense to extend to COMET or others.
Supporting task-agnostic metrics (e.g. related to robustness, persona drift...). It's indeed not practical at all to have to change hardcoded values in the UI and BE for each of them, when the UI could just read which metrics were returned.

We just need to keep an eye on how we show experiments that contain different sets of metrics (e.g. once we have summarization, translation, etc.).
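
One possible way to handle experiments with different metric sets is to build the comparison table from the union of the metric names that were returned, leaving gaps where an experiment lacks a metric. The sketch below is written in Python purely for illustration and assumes each result exposes a `metrics` object with a `mean` per metric; `build_comparison` is a hypothetical helper.

```python
from typing import Any, Optional


def build_comparison(
    experiments: dict[str, dict[str, Any]],
) -> tuple[list[str], dict[str, list[Optional[float]]]]:
    """Return (columns, rows) for a table comparing experiments.

    columns: union of metric names, in first-seen order.
    rows: experiment name -> one mean per column, None where missing.
    """
    columns: list[str] = []
    for result in experiments.values():
        for name in result.get("metrics", {}):
            if name not in columns:
                columns.append(name)
    rows = {
        exp: [result.get("metrics", {}).get(col, {}).get("mean") for col in columns]
        for exp, result in experiments.items()
    }
    return columns, rows
```

With a summarization run reporting `rouge1` and a translation run reporting a hypothetical `comet` metric, each row simply gets `None` in the column it does not have, and the UI can decide how to render the gap.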

aittalam · 2025-01-27T11:26:08Z

I think having task-agnostic visualization of results will help a lot with generalizing our evals! What'd be the best place to store this mapping information?

Currently, all the jobs that run under Ray are agnostic of what happens in the frontend (as people might call them via the SDK or directly through the API). I think that makes sense, as requiring people who want to write a new job to also know about the UI raises the barrier to participation.
If we want to change something in the presentation of results and the viz information is saved together with the results, we won't be able to automatically apply the new mappings to old data.
If we opt for having this in the backend, any small change in the UI (e.g. a label is too long to fit a display, so a shorter text has to be provided) will require a change in the backend code. This is probably not a deal-breaker (it would likely be a configuration file, as it would be in the frontend), but I wanted to raise it because IMO it makes FE/BE more coupled than needed.

WDYT?

njbrake · 2025-02-05T01:05:46Z

Thank you for the feedback! You made good points that I hadn't considered. It sounds like:

The design should focus on keeping the jobs as a flexible entity that doesn't require knowledge about how the frontend works.
The design should allow for new visualizations of existing data, which means that the frontend should be able to make decisions about how to display the data, and it shouldn't be tied to the backend.
TL;DR: it sounds like the ideal design will decouple the presentation layer (frontend) from the data management/storage layer (backend), as well as from the execution layer (job).

With this in mind, it seems like the design I proposed in the original post on this thread is mostly good, with the exception that I shouldn't add that displayName key. I'll take a crack at pulling something together for this 👍
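
To illustrate the decoupling, here is a small sketch of how labels could be resolved entirely in the presentation layer once displayName is dropped from the job output. It is written in Python only for illustration (the actual frontend is a Vue app), and `DISPLAY_NAMES`, `display_name`, and `render_rows` are hypothetical names.

```python
from typing import Any

# Hypothetical presentation-layer mapping. It lives with the UI, so labels can
# change (and apply to old results) without touching jobs or the backend.
DISPLAY_NAMES = {
    "rouge1": "ROUGE-1",
    "rougeL": "ROUGE-L",
    "bertscorePrecision": "BERTScore Precision",
    "bertscoreRecall": "BERTScore Recall",
}


def display_name(metric_key: str) -> str:
    """Return a friendly label, falling back to the raw key for unknown metrics."""
    return DISPLAY_NAMES.get(metric_key, metric_key)


def render_rows(result: dict[str, Any]) -> list[tuple[str, float]]:
    """Build (label, mean) pairs for whatever metrics the job returned."""
    return [(display_name(key), entry["mean"]) for key, entry in result["metrics"].items()]
```

A custom user metric with no entry in the mapping still renders under its raw key, so new metrics do not require a UI change just to show up.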

njbrake added the enhancement, sdk, backend, api, and schemas labels Jan 17, 2025

This was referenced Jan 17, 2025

Define interface between a job and the backend #670

Closed

Implement job_result_download for experiment service #632

Merged

Use Job Schemas to validate input and output of jobs (inference & eval_lite) #692

Merged

ividal changed the title ~~[FEATURE]: Decouple metric names from the frontend~~ Decouple metric names from the frontend Jan 22, 2025

njbrake assigned njbrake and unassigned njbrake Jan 22, 2025

njbrake changed the title ~~Decouple metric names from the frontend~~ Backend return results data in a task agnostic format for frontend consumption Jan 22, 2025

njbrake changed the title ~~Backend return results data in a task agnostic format for frontend consumption~~ Backend should return results data in a task agnostic format for frontend consumption Jan 22, 2025

njbrake self-assigned this Jan 22, 2025

njbrake linked a pull request Feb 6, 2025 that will close this issue

Jobs return standardized and flexible output #815

Merged

4 tasks

njbrake closed this as completed in #815 Feb 10, 2025
