
ML module API #31

Closed · maxdml opened this issue Aug 11, 2016 · 31 comments

maxdml commented Aug 11, 2016

Hello,

In an effort to build the global cognoma architecture, it would be very useful to determine an API which defines exactly what is given to the ML module (and incidentally what it will return).

As an example of strong API documentation, I believe OpenStack is a good starting point. Note how every module's API is listed, and how each route within those modules is described.

A direct example for a cognoma API can be found here. It is a first specification for the frontend module.


awm33 commented Aug 11, 2016

What is the ML module? I don't think the machine learning code will provide any APIs; it will be consuming them. From what it sounds like, the backend group will need to figure out what the input to some function like run will be. I think we need to write something that will feed off the task queue and run a function, perhaps loading a class/module specific to an algorithm.

I think the input to this function could be the full classifier object. It would expect a JSON-serializable object or map in response.

results = svg_classifier.run(task.data)

or even

svg_classifier.run(task)

Where task could even have functions like task.progress(80, "Fold 4/5 completed")
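
As a sketch of that idea, here is what a minimal task handle might look like on the client side; the endpoint path, field names, and class shape are all illustrative assumptions, not an agreed design:

import requests


class Task:
    """Illustrative handle passed into an algorithm's run() function."""

    def __init__(self, task_id, data, service_url):
        self.id = task_id
        self.data = data  # the JSON-serializable classifier input
        self.service_url = service_url

    def progress(self, percent, message=""):
        # Report progress back to the task-service so it knows the worker is alive.
        requests.post(
            "{}/tasks/{}/progress".format(self.service_url, self.id),
            json={"percent": percent, "message": message},
        )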


maxdml commented Aug 11, 2016

The ML module is whatever library is used to implement the machine learning algorithms. Actually, if I am not mistaken, this library is pandas (please confirm/correct :).

It will receive requests from the Asynchronous Task Queue, as you mentioned. However, from the (outdated) schema in the meta cognoma repository, and from the schema @cgreene drew last night (if someone could upload it somewhere, that would be awesome), we see the need to decouple the ML component from the django backend.

Even if we just forward the full classifier object, each component has a specific API, which is why I opened this issue.


cgreene commented Aug 12, 2016

@maxdml : I think the machine learning group is primarily using sklearn, though others may use something else. I think that we should define what gets provided to these methods, and each one should get passed the same information. Maybe some of the implementers can let us know what type of information they use. We should also define what we want the algorithm to report at the conclusion of the run.


awm33 commented Aug 12, 2016

I think this is more a question of: what is the schema of the input/output?

If the input is a classifier object, then the only thing not defined yet is the algorithm_parameters. The parameters will vary from algorithm to algorithm. On the algorithm object, I put a parameters field to store the parameter schema for the algorithm as JSON schema. JSON schema allows us to validate the parameters in both JS and Python. There are also Angular libs to automatically generate forms and display views based on JSON schema. Swagger, Google, and other API spec formats also use JSON schema.

We should create algorithm parameter schemas for each algorithm. Creating them and storing them as JSON schema inside the algorithms repo would be ideal. We could also have algorithm implementors write out the schema as tables in markdown if JSON schema is too complex.

Here is an example:

{
    "title": "SVG classifier parameters",
    "type": "object",
    "properties": {
        "threshold_a": {
            "type": "number",
            "title": "Threshold A",
            "description": "Threshold A controls yada yada",
            "minimum": 0.0,
            "maximum": 2.0
        },
        "category_example": {
            "type": "string",
            "title": "Category Example",
            "description": "Category Example yada yada",
            "enum": ["blue","green"]
        }
    }
}

Here's a good guide: https://spacetelescope.github.io/understanding-json-schema/

Creating JSON schemas for the output could be useful as well.
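
For example, validating submitted parameters against a schema like the one above is a one-liner in Python with the jsonschema package (assuming that is the validation library we adopt):

import jsonschema

parameter_schema = {
    "title": "SVG classifier parameters",
    "type": "object",
    "properties": {
        "threshold_a": {"type": "number", "minimum": 0.0, "maximum": 2.0},
        "category_example": {"type": "string", "enum": ["blue", "green"]},
    },
}

submitted = {"threshold_a": 1.5, "category_example": "blue"}

# Raises jsonschema.ValidationError if the parameters don't satisfy the schema.
jsonschema.validate(instance=submitted, schema=parameter_schema)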


maxdml commented Aug 12, 2016

Right, it is a question of "what is the schema of the input/output". The reason I am mentioning an API is that, from my understanding, we are trying to set up a modular infrastructure where the machine learning code is decoupled from the django backend, and the link between them is the ATQ.


awm33 commented Aug 12, 2016

Ah, ok. So that is a good question. I think we'll need some sort of client daemon to run the machine learning code in. The client will need to pull classifier tasks off the queue via RESTful HTTP calls to the task-service.

I don't know what the handoff code will look like. Right now, these are all scripts. We could keep them as scripts, the client could pipe in the JSON classifier and expect JSON output when done.

They could be wrapped into Python modules, which I think is the more ideal approach. The public functions that the client hits should be standard across all the algorithm modules; I think this is what @maxdml is interested in. The API could be fairly simple. Like I suggested above, just a run function.

We may also need to write a wrapper lib for the Cognoma API or pass a module as an argument, in case they need to access data from the primary database directly.

I would like to see a task.progress-like function to report progress, since these will be long-running tasks. The progress function could also touch the task in the task-service so that it knows the worker is alive.


maxdml commented Aug 12, 2016

Right, I think the public functions should be consistent across all ML modules. Once we have settled on a first API proposal, I would like to implement a simple container setup which would look like the following:

  • 1 container for the django backend
  • 1 container for the JS frontend
  • 1 container for the ATQ (I can mock the requests that fill that queue)
  • 1 container for machine learning code (I can just mock the modules and expose the API)

The goal is to start thinking "deployment". Having independent containers will greatly simplify automated testing (with Jenkins, for example) and help provide mock containers to each team (e.g. provide a mock ML container to the backend team, and a mock ATQ container to the ML team).


awm33 commented Aug 13, 2016

@maxdml Sure. I don't know if the daemon code will live in this repo or another. I think it could be in another repo or be a directory within this one. From a deployment standpoint, the daemon will probably be a python script running inside some sort of process manager like pm2. We could make a job JSON blob or file path an optional argument to the daemon script so jobs can be run manually in the terminal for dev/testing.

@dhimmel Are the scripts like https://github.com/cognoma/machine-learning/blob/master/algorithms/scripts/SGDClassifier-master.py the final deliverable from the ML team? These are written for ipython notebook and write output and graphs to the shell. Can we ask the ML team to write these or a version of them as python modules meant to be run as application tasks? The backend team could provide some scaffolding. I think it could be boiled down to just a run function.


cgreene commented Aug 15, 2016

Quick question while we're on the topic - does it make sense to use something like Celery for this? It does integrate with django and then we could require each method to provide a task definition with standard parameters & behavior expectations.

http://www.celeryproject.org/


awm33 commented Aug 15, 2016

@cgreene It might, but that would mean an architectural change. I would have to stop working on the task service, and we may need to collapse all the backend code into a single repo.

I do like the idea of storing the parameters in a portable data structure like JSON schema more, though. Then we would be able to:

  • Validate the algorithm parameters in JS and server-side in the REST API call
  • Generate input fields automatically
  • Generate parameters section of the job summary, and any other part that needs to display them.
  • Might also be able to generate docs.

The handoff from the ML implementer may not need to be JSON. It could be YAML, or even something as simple as a markdown table that gets manually translated or converted by a script.


cgreene commented Aug 15, 2016

Gotcha - as long as we have considered it and had use cases that it doesn't solve. :)


awm33 commented Aug 15, 2016

Ok. I'm proposing this.

Each algorithm would need a module that exposes:

  • definition - a dictionary defining the algorithm with name, title, description, and parameters schema.
  • run - a function that takes a task instance. This will contain the classifier input data. We could also pass the classifier object and/or algorithm parameters separately if that makes it easier.

I'd also like the task instance to contain that progress function I mentioned before. It may not be necessary, but it could really help communicate what's going on to the user, beyond the fact that it's been running for X amount of time. Though run histories from the task table may eventually be able to be used to predict how long it will take.

A script would exist in this repo that could generate the algorithms table using the info from each module.

The daemon for ML tasks would live in a directory in this repo. It could be run as a daemon connected to the task-service or as a script in the terminal that does one off jobs manually for development purposes.

We could create an example module and documentation to assist in creating a module.
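
To make that concrete, here is a minimal sketch of what one algorithm module could look like under this interface; the module name, the task.data fields, and the reported metric are illustrative assumptions, not a finished spec:

# sgd_classifier.py -- illustrative algorithm module, not a final implementation.
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score

definition = {
    "name": "sgd-classifier",
    "title": "SGD Classifier",
    "description": "Linear model trained with stochastic gradient descent.",
    "parameters": {  # JSON schema describing the algorithm's parameters
        "type": "object",
        "properties": {},
    },
}


def run(task):
    """Train on task.data and return a JSON-serializable result."""
    X = task.data["expression"]       # hypothetical field: samples-by-genes matrix
    y = task.data["mutation_status"]  # hypothetical field: outcome array

    model = SGDClassifier()
    scores = cross_val_score(model, X, y, cv=5)
    task.progress(80, "Cross-validation completed")

    model.fit(X, y)
    task.progress(100, "Training completed")

    return {
        "cv_accuracy": float(scores.mean()),
        "coefficients": model.coef_.tolist(),
    }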


dhimmel commented Aug 16, 2016

@awm33, Currently, the notebooks we're creating are solely for prototyping (although we may actually want to provide users their custom notebook). The scripts exported from the notebooks are solely for tracking, not for execution.

Thoughts on the machine learning architecture

The machine learning team will write a Python module with a run function. The run function will have a parameter for a JSON input (either the raw JSON text, the loaded object, or the loaded object split into **kwargs). The input JSON will contain a sample subset array, a gene subset array, a mutation status array, and an algorithm string. Here's an example payload:

{
  "classifier": "elastic-net-logistic-regression",
  "expression_subset": [
    1421,
    5203,
    5818,
    9875,
    10675,
    10919,
    23262
  ],
  "mutation_status": [
    0,
    1,
    1,
    0,
    0
  ],
  "sample_id": [
    "TCGA-22-4593-01",
    "TCGA-2G-AALW-01",
    "TCGA-3G-AB0O-01",
    "TCGA-3N-A9WD-06",
    "TCGA-49-4487-01"
  ]
}

For expression_subset, "all" is also a valid value corresponding to when the user does not want to subset the genes whose expression is used as features.

I made a few design choices above which I'll explicitly state now.

  1. The frontend/django-cognoma will not be able to pass algorithm hyperparameters. The machine learning team will therefore pre-specify hyperparameter grids for each algorithm. For some hyperparameter settings where we want to give the user direct access, we can use multiple algorithm names. For example, ridge-logistic-regression, lasso-logistic-regression, and elastic-net-logistic-regression may all use the SGDClassifier in sklearn, but will appear as three algorithm options. The machine learning team will thus create an algorithms table/list that the frontend/django-cognoma can consume.
  2. The frontend/django-cognoma will be required to compute the outcome array of mutation status. An alternative would be to have the machine learning group receive a formula rather than the actual binary values.

In the future, we may want to add more options to the payload, such as a transformation or variable_selection arguments.
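
As a rough illustration only, a run function consuming this payload might look like the sketch below; the expression matrix path, its samples-by-genes layout, and the returned fields are assumptions, not part of the spec:

import json

import pandas as pd


def run(payload_json, expression_path="expression-matrix.tsv.bz2"):
    """Sketch: build X and y from the payload described above."""
    payload = json.loads(payload_json)

    # Outcome (y): mutation status aligned to sample_id.
    y = pd.Series(payload["mutation_status"], index=payload["sample_id"], name="mutation_status")

    # Features (X): gene expression rows for the requested samples
    # (assumes a samples-by-genes matrix with gene ids as column names).
    expression = pd.read_csv(expression_path, sep="\t", index_col=0)
    X = expression.loc[payload["sample_id"]]

    # "all" means no gene subsetting; otherwise keep only the listed genes.
    subset = payload["expression_subset"]
    if subset != "all":
        X = X[[str(gene_id) for gene_id in subset]]

    # ... dispatch on payload["classifier"], fit the model, and return a
    # JSON-serializable result ...
    return {"classifier": payload["classifier"], "n_samples": len(y), "n_features": X.shape[1]}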


CCing @RenasonceGent who was interested in these topics at the last meetup.

RenasonceGent commented

I was waiting on more examples to be completed before moving on. I added a gist here with the code I have so far. I broke up the example script into functions, which should make it easier to change things later. I can see having a default set of hyperparameters to start, but isn't that something that we should make optional for the user later? I wouldn't expect the optimal set of hyperparameters for one set of data to even be near optimal for another.


maxdml commented Aug 16, 2016

@dhimmel I am a bit of a layperson when it comes to the actual meaning of the ML computations. Do you know where I can learn more about the "outcome array of mutation status"?

The reason I want to understand this is that I am concerned about business-layer computation being implemented in the frontend. If some outcome of the ML algorithm has to be further refined before being delivered to a user, shouldn't that refinement be kept in the ML module?

Sorry if the question looks dumb.


dhimmel commented Aug 16, 2016

Do you know where I can learn more about the "outcome array of mutation status"?

@maxdml, All this means is whether a sample is mutated (0 coded) or normal (1 coded) for a user-specified gene (or set of genes). The goal of the machine learning model is to learn how to classify samples as either mutated or not-mutated. Outcome is maybe a confusing term here -- it refers to what the model is trying to predict and must be available before the model can be trained.

Just so we're clear, the goal is to use gene expression (also referred to as features/X/predictors) to predict mutation status (also referred to as outcome/y/status). If you have any general questions about machine learning, perhaps ask them at #7.


awm33 commented Aug 16, 2016

@dhimmel

The frontend/django-cognoma will not be able to pass algorithm hyperparameters.

Is that because of a limitation of the frontend/backend, or because it just doesn't seem necessary for the user?

Looking at the classifier object in https://github.com/cognoma/django-cognoma/blob/master/doc/api.md:

  • If we aren't going to be passing any algorithm parameters, we can drop those from the schema.
  • Looks like genes maps to expression_subset. Is expression_subset a better name? I'd rather use one field name consistently. Couldn't we infer that an empty [] equals "all"? It seems like bad data-modeling practice to mix them; we could add a boolean field to mean "all" if you need something explicit.
  • tissues is only used by the result viewer?
  • mutation_status This is the outcome array generated from the sparse matrix? Not the full one?
  • sample_id Is this just a list of samples that connect to the outcome array? Not all (7.5k) sample ids?

The machine learning team will thus create an algorithms table/list that the frontend/django-cognoma can consume.

How do they want to maintain this? Would the table be a SQL table? We could just write a script to generate/update it if we add a couple of fields to the modules, like a user-friendly name/title and a description.

The frontend/django-cognoma will be required to compute the outcome array of mutation status.

Can you or someone from the Greene Lab create an issue describing how to calculate / create the outcome array?

Another thing is logging. Will the ML group be logging using the python logging module? We may want to send the logs to disk and/or something like logstash.

If the module code could periodically hit a progress function, that would be great. It could report progress, and, most importantly, we could know if it's stuck; otherwise we would have to wait for a long timeout.
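
On the logging point, something as simple as the standard-library logging module writing to disk would be a reasonable starting point, with a logstash handler added later if needed; a minimal sketch (the file name and logger name are placeholders):

import logging

logging.basicConfig(
    filename="ml-worker.log",
    level=logging.INFO,
    format="%(asctime)s %(name)s %(levelname)s %(message)s",
)
logger = logging.getLogger("cognoma.ml")
logger.info("classifier task started")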


dhimmel commented Aug 17, 2016

Is that because of a limitation of the frontend/backend, or that just doesn't seem necessary for the user?

Choosing hyperparameter values is a great hindrance. If the machine learning team does our job, we can hopefully not subject the user to this nuisance. In the future if some users want direct access to setting hyperparameters, perhaps we can expand functionality.

Regarding the schema, which I hadn't actually seen yet (nice work) -- I think we're on the same page; I am just envisioning the simplest possible system.

  • tissues is used to identify the relevant sample_id set, so we will need either the sample_id array or the tissues array.
  • genes is the same thing as expression_subset. I wanted to be more specific than just "genes" since mutations also are in genes. expression_genes is a possibility.
  • mutation_status -- it looks like the current api docs are missing a field for the outcome (y) for the machine learning classifier.

How do they want to maintain this?

For the list of algorithms, we'll export either a TSV or JSON file with the algorithm, name, and description.

We can log and hit the progress function. @awm33 -- let's deal with these two issues later.


awm33 commented Aug 17, 2016

@dhimmel Cool.

  • Tissues - I think I'd like to keep it as tissues on the classifier model, since we will want to store that as the user selection. It sounds like the sample id list is based on something like select sample_id from samples where tissue in (tissues). That could be done during task queueing, inside the job but outside ml.run, or inside the ml.run function, perhaps using some shared function. I'm leaning towards doing it in the task.
  • expression_genes or gene_expression_set ?
  • mutation_status yep, that's not there, I didn't know what it would look like or how to construct/calculate it.
  • We also need to figure out what fields are on the samples table. It just has id now. Once the cancer data and ML groups have decided what the fields are, we can add them.
  • If the ML code needs to hit the main API / database, the application group (what I've been calling the combined frontend/backend groups) can help write a client module.


dhimmel commented Aug 17, 2016

I think I'd like to keep it as tissues on the classifier model, since will want to store that as the user selection.

Sounds good. In the future, however, users may want to select samples based on criteria other than tissue. We can always change this then.

expression_genes or gene_expression_set?

Don't care, but if we use set, then should we use tissue_set as well?

mutation_status yep, that's not there, I didn't know what it would look like or how to construct/calculate it.

If we don't store the sample_id array, we will need to store a formula for how to compute mutation status. Will start a separate issue for this.


awm33 commented Aug 17, 2016

@dhimmel

Sounds good. In the future, however, users may want to select samples based on criteria other than tissue. We can always change this then.

Ok, it sounded to me like tissues is a filter on the samples. If there are more filter criteria, I think we should still store them for the state of the UI and for knowing how the user generated the list.

If we don't store the sample_id array, we will need to store a formula for how to compute mutation status.

I think we need more clarification on where mutation_status and sample_ids are coming from. Maybe that can be in your issue. The model does have mutations, which connects genes to samples in a many-to-many relationship. Are mutation_status and sample_ids just entries for each selected gene's mutation status on each sample? So if you choose 10 genes, you would have 10 * number-of-samples entries? If that's true, then is number-of-samples the full 7.5k or the number of samples matching the tissue filter?

awm33 mentioned this issue Aug 18, 2016

dhimmel commented Sep 12, 2016

Minimum Viable Product Specification

For the minimum viable product (i.e. the first release of the machine-learning package), I'm thinking we can have a simplified input to machine-learning (ML). The main ML function would consume a JSON file with a mutation_status and sample_id array. Do people prefer mutation_status or mutation_statuses / sample_id or sample_ids? Also we could encode the information as a sample_id_to_mutation_status object.

{
  "mutation_status": [
    0,
    1,
    1,
    0,
    0
  ],
  "sample_id": [
    "TCGA-22-4593-01",
    "TCGA-2G-AALW-01",
    "TCGA-3G-AB0O-01",
    "TCGA-3N-A9WD-06",
    "TCGA-49-4487-01"
  ]
}

Based on this design choice, the ML module never gets passed information on which sample filters were applied (such as disease type, gender, or age). While this information should be stored, the ML portion of the project won't actually need it.

@awm33 I know I didn't answer your questions, just let me know which ones are still outstanding.


awm33 commented Sep 21, 2016

@dhimmel Looking at your original example from above:

{
  "classifier": "elastic-net-logistic-regression",
  "expression_subset": [
    1421,
    5203,
    5818,
    9875,
    10675,
    10919,
    23262
  ],
  "mutation_status": [
    0,
    1,
    1,
    0,
    0
  ],
  "sample_id": [
    "TCGA-22-4593-01",
    "TCGA-2G-AALW-01",
    "TCGA-3G-AB0O-01",
    "TCGA-3N-A9WD-06",
    "TCGA-49-4487-01"
  ]
}

Is this data related/tabular? If so, the above could be written as:

[
    [1421,0,"TCGA-22-4593-01"],
    [5203,1,"TCGA-2G-AALW-01"],
    [5818,1,"TCGA-3G-AB0O-01"],
    [9875,0,"TCGA-3N-A9WD-06"],
    [10675,0,"TCGA-49-4487-01"]
]

or

[
    {
        "expression_subset": 1421,
        "mutation_status": 0,
        "sample_id": "TCGA-22-4593-01"
    },
    {
        "expression_subset": 5203,
        "mutation_status": 1,
        "sample_id": "TCGA-2G-AALW-01"
    },
    {
        "expression_subset": 5818,
        "mutation_status": 1,
        "sample_id": "TCGA-3G-AB0O-01"
    },
    {
        "expression_subset": 9875,
        "mutation_status": 0,
        "sample_id": "TCGA-3N-A9WD-06"
    },
    {
        "expression_subset": 10675,
        "mutation_status": 0,
        "sample_id": "TCGA-49-4487-01"
    }
]

I'm just assuming that, since a sample is mutated for a specific gene, that's what we are trying to pass: a row per sample.


dhimmel commented Sep 21, 2016

sample_id and mutation_status are two columns of a single table. expression_subset is of a different nature (and will not be included in the MVP implementation). Therefore, we could use any method for representing the sample/observation table (containing sample_id and mutation_status, and possibly more columns in the future). Let me know what you think is best.
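
For what it's worth, pandas can consume either representation and end up with the same observation table, so the choice is mostly about what's convenient for the task-service; a quick sketch (values reused from the examples above):

import pandas as pd

# Column-oriented payload: parallel arrays keyed by field name.
column_payload = {
    "sample_id": ["TCGA-22-4593-01", "TCGA-2G-AALW-01", "TCGA-3G-AB0O-01"],
    "mutation_status": [0, 1, 1],
}

# Record-oriented payload: one object per sample.
record_payload = [
    {"sample_id": "TCGA-22-4593-01", "mutation_status": 0},
    {"sample_id": "TCGA-2G-AALW-01", "mutation_status": 1},
    {"sample_id": "TCGA-3G-AB0O-01", "mutation_status": 1},
]

observations = pd.DataFrame(column_payload)
observations_from_records = pd.DataFrame.from_records(record_payload)

# Both yield the same sample/observation table.
assert observations.equals(observations_from_records[observations.columns])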


awm33 commented Sep 30, 2016

Looking at #51

Would it make sense to have the worker/ task runner code in this repo or in a separate one?

I was thinking of exposing it as a CLI, like:

python ./task-runner.py run-task ./some/path/task.json (for local machine testing/development)

python ./task-runner.py worker (starts a worker; would be run as a daemon in prod)

The worker code would do from cognoml.analysis import classify and run the classify function in the task process, passing it the specific task data.
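
A rough sketch of that CLI, assuming argparse sub-commands and that cognoml.analysis exposes a classify function (its exact signature is an assumption here):

# task-runner.py -- sketch of the proposed CLI; the worker loop is left as a stub.
import argparse
import json

from cognoml.analysis import classify  # assumed public entry point


def run_task(task_path):
    """Development mode: run a single task from a local JSON file."""
    with open(task_path) as f:
        task_data = json.load(f)
    results = classify(**task_data)  # signature assumed; adjust to the real cognoml API
    print(json.dumps(results, indent=2, sort_keys=True))


def worker():
    """Production mode: poll the task-service and run tasks as they arrive."""
    raise NotImplementedError("would poll the task-service REST API in a loop")


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Cognoma ML task runner (sketch)")
    subparsers = parser.add_subparsers(dest="command")

    run_parser = subparsers.add_parser("run-task", help="run one task from a JSON file")
    run_parser.add_argument("task_file")

    subparsers.add_parser("worker", help="start a daemon that consumes the task queue")

    args = parser.parse_args()
    if args.command == "run-task":
        run_task(args.task_file)
    elif args.command == "worker":
        worker()
    else:
        parser.print_help()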


dhimmel commented Oct 2, 2016

Would it make sense to have the worker/ task runner code in this repo or in a separate one?

My preference is a separate repo. This repo already contains multiple things. Also I think people may be interested in using the cognoml package without the task runner.

The task runner environment can install the cognoml package using pip install, either from specific commits on GitHub or if we upload cognoml to PyPI.


awm33 commented Oct 4, 2016

@dhimmel That sounds good, do you want to create the repo? Trying to think of a good name, maybe "task-workers" or "ml-workers" in case we have other background tasks.


dhimmel commented Oct 4, 2016

That sounds good, do you want to create the repo? Trying to think of a good name, maybe "task-workers" or "ml-workers" in case we have other background tasks.

@awm33, you pick the name and I'll create it. I like both suggestions. How will this repo be different from task-service?


awm33 commented Oct 6, 2016

@dhimmel ml-workers works then. The repo will house code consuming that service and the core API, but is not part of the service itself. It's the "Machine Learning Worker(s)" in the architecture diagram.


dhimmel commented Oct 6, 2016

@awm33 I created ml-workers -- see https://github.com/cognoma/ml-workers. You're the maintainer.


awm33 commented Oct 7, 2016

@dhimmel Thanks! 🌮
