ML module API #31
What is the ML module? I don't think the machine learning code will provide any APIs; it will be consuming them. From what it sounds like, the backend group will need to figure out what the input to some function would look like. I think the input to this function could be the full classifier object. It would expect a JSON-serializable object or map in response, or even a task object, where the task could have functions for things like reporting progress. |
The ML module is whatever library is used to implement the machine learning algorithms. Actually, if I am not mistaken, this library is pandas (please confirm/correct :). It will receive requests from the Asynchronous Task Queue, as you mentioned. However, from the (outdated) schema in the meta cognoma repository, and from the schema @cgreene drew last night (if someone could upload it somewhere, that would be awesome), we see the need to decouple the ML component from the django backend. Even if we just forward the full classifier object, each component has specific APIs, which explains why I opened the issue. |
@maxdml : I think the machine learning group is primarily using sklearn, though others may use something else. I think that we should define what gets provided to these methods, and each one should get passed the same information. Maybe some of the implementers can let us know what type of information they use. We should also define what we want the algorithm to report at the conclusion of the run. |
I think this is more a question of: what is the schema of the input/output? If the input is a classifier object, then the only thing not defined yet is the algorithm parameters. We should create algorithm parameter schemas for each algorithm. Creating them and storing them as JSON schema inside the algorithms repo would be ideal. We could also have algorithm implementors write out the schema as tables in markdown if JSON schema is too complex. Here is an example:
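A purely illustrative sketch of what one such parameter schema could look like, written as a Python dict and checked with the jsonschema library (the parameter names are borrowed from scikit-learn's SGDClassifier and are not a decided schema):

```python
# Illustrative only: a JSON-Schema-style description of one algorithm's
# tunable parameters, validated with the jsonschema package.
import jsonschema

SGD_CLASSIFIER_PARAMS = {
    "type": "object",
    "properties": {
        "alpha": {"type": "number", "minimum": 0},                   # regularization strength
        "l1_ratio": {"type": "number", "minimum": 0, "maximum": 1},  # elastic-net mixing
        "loss": {"type": "string", "enum": ["log", "hinge"]},
    },
    "additionalProperties": False,
}

# Validate a user-supplied parameter payload before queueing the task.
jsonschema.validate({"alpha": 0.01, "l1_ratio": 0.15, "loss": "log"}, SGD_CLASSIFIER_PARAMS)
```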
Here's a good guide: https://spacetelescope.github.io/understanding-json-schema/. Creating JSON schemas for the output could be useful as well. |
Right, it is a question of "what is the schema of the input/output?". The reason I am mentioning an API is that, from my understanding, we are trying to set up a modular infrastructure where the machine learning code is decoupled from the django backend, and the link between them is the ATQ. |
Ah ok. So that is a good question. I think we'll need some sort of client daemon to run the machine learning code in. The client will need to pull classifier tasks off the queue via RESTful HTTP calls to the task-service. I don't know what the handoff code will look like. Right now, these are all scripts. We could keep them as scripts, and the client could pipe in the JSON classifier and expect JSON output when done. Or they could be wrapped into python modules, which I think is more ideal. The public functions that the client hits should be standard across all the algorithm modules. I think this is what @maxdml is interested in. Wrapping them as modules seems more ideal to me. I think the API could be fairly simple. Like I suggested above, just a small set of standard functions. We may also need to write a wrapper lib for the Cognoma API, or pass a module as an argument, if the algorithms need to access data from the primary database directly. |
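To make that hand-off more concrete, here is a minimal sketch of such a client daemon, assuming a hypothetical task-service endpoint and assuming each algorithm module exposes a run() function (every name and URL below is a placeholder, not an agreed API):

```python
import importlib
import time

import requests

TASK_SERVICE_URL = "http://task-service.example/tasks"  # placeholder URL

def poll_forever():
    while True:
        # Ask the task-service for the next queued classifier task (hypothetical endpoint).
        response = requests.get(TASK_SERVICE_URL, params={"status": "queued", "limit": 1})
        tasks = response.json()
        if not tasks:
            time.sleep(10)
            continue
        task = tasks[0]
        classifier = task["classifier"]
        # Dynamically load the algorithm module named in the classifier object
        # and call its standard entry point (assumed here to be run()).
        module = importlib.import_module("algorithms." + classifier["algorithm"])
        result = module.run(classifier)
        # Report the JSON-serializable result back to the task-service.
        requests.post(TASK_SERVICE_URL + "/" + str(task["id"]) + "/result", json=result)

if __name__ == "__main__":
    poll_forever()
```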
Right, I think the public functions should be consistent across all ML modules. Once we are settled on a first API proposal, I would like to implement a simple container setup which would look like the following:
The goal is to start thinking about deployment. Having independent containers will greatly simplify automated testing (with Jenkins, for example), and will help provide mock containers to each team (e.g. provide a mock ML container to the backend team, and a mock ATQ container to the ML team). |
@maxdml Sure. I don't know if the daemon code will live in this repo or another. I think it could be in another repo or be a directory within this one. From a deployment standpoint the daemon will probably be a python script running inside of some sort of process manager like pm2. We could do something like make a job JSON blob or file path an optional argument to the daemon script, so jobs can be run manually in the terminal for dev/testing. @dhimmel Are the scripts like https://github.com/cognoma/machine-learning/blob/master/algorithms/scripts/SGDClassifier-master.py the final deliverable from the ML team? These are written for ipython notebook and write output and graphs to the shell. Can we ask the ML team to write these, or a version of them, as python modules meant to be run as application tasks? The backend team could provide some scaffolding. I think it could be boiled down to just a handful of standard functions. |
Quick question while we're on the topic - does it make sense to use something like Celery for this? It does integrate with django and then we could require each method to provide a task definition with standard parameters & behavior expectations. |
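For reference, a task definition along those lines might look roughly like this (purely illustrative; the broker URL, task name, field names, and module layout are all assumptions):

```python
import importlib

from celery import Celery

app = Celery("cognoma", broker="redis://localhost:6379/0")  # placeholder broker URL

@app.task
def run_classifier(classifier):
    # Dispatch to the algorithm module named in the classifier object; each
    # module is assumed to expose the same run() entry point.
    module = importlib.import_module("algorithms." + classifier["algorithm"])
    return module.run(classifier)
```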
@cgreene It might, but that would mean an architectural change. I would have to stop working on the task service, and we may need to collapse all the backend code into a single repo. I do like the idea of storing the parameters in a portable data structure like JSON schema more. Then we would be able to:
The hand-off from the ML implementer may not need to be JSON. It could be YAML, or even something as simple as a markdown table that gets manually translated, or a script could convert it. |
Gotcha - as long as we have considered it and have use cases that it doesn't cover. |
Ok. I'm proposing this. Each algorithm would need a module that exposes:
I'd also like the task instance to contain that information as well. A script would exist in this repo that could generate the algorithm list. The daemon for ML tasks would live in a directory in this repo. It could be run as a daemon connected to the task-service, or as a script in the terminal that does one-off jobs manually for development purposes. We could create an example module and documentation to assist in creating a module. |
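To illustrate the kind of per-algorithm module being proposed here (the function names, signatures, and return shapes are placeholders, not a settled interface):

```python
# algorithms/sgd_classifier.py -- hypothetical layout of one algorithm module.

def get_parameter_schema():
    """Return a JSON-serializable schema describing this algorithm's parameters."""
    return {
        "type": "object",
        "properties": {"alpha": {"type": "number", "minimum": 0}},
    }

def run(classifier, progress_callback=None):
    """Fit the classifier described by the task payload and return a
    JSON-serializable result. progress_callback, if given, is called
    periodically with a float in [0, 1]."""
    if progress_callback is not None:
        progress_callback(0.0)
    # ... model fitting would happen here ...
    result = {"status": "complete", "scores": {}}
    if progress_callback is not None:
        progress_callback(1.0)
    return result
```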
@awm33, Currently, the notebooks we're creating are solely for prototyping (although we may actually want to provide users their custom notebook). The scripts exported from the notebooks are solely for tracking, not for execution.

Thoughts on the machine learning architecture

The machine learning team will write a Python module whose main function consumes a JSON payload along the lines of:

{
"classifier": "elastic-net-logistic-regression",
"expression_subset": [
1421,
5203,
5818,
9875,
10675,
10919,
23262
],
"mutation_status": [
0,
1,
1,
0,
0
],
"sample_id": [
"TCGA-22-4593-01",
"TCGA-2G-AALW-01",
"TCGA-3G-AB0O-01",
"TCGA-3N-A9WD-06",
"TCGA-49-4487-01"
]
}

I made a few design choices above which I'll explicitly state now.
In the future, we may want to add more options to the payload. CCing @RenasonceGent, who was interested in these topics at the last meetup. |
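For concreteness, here is a rough sketch of how an algorithm module might consume a payload shaped like the one above, assuming the module has local access to the full expression matrix (the file path, column naming, and model settings are assumptions, not decisions):

```python
import json

import pandas as pd
from sklearn.linear_model import SGDClassifier

def run(payload_path, expression_path="data/expression-matrix.tsv.bz2"):
    # expression_path is an assumed location of a samples-by-genes matrix
    # indexed by sample_id, with Entrez gene ids as column names.
    with open(payload_path) as f:
        payload = json.load(f)

    expression = pd.read_table(expression_path, index_col=0)
    genes = [str(g) for g in payload["expression_subset"]]
    X = expression.loc[payload["sample_id"], genes]
    y = pd.Series(payload["mutation_status"], index=payload["sample_id"])

    # "log" is the logistic-regression loss (renamed "log_loss" in newer scikit-learn).
    model = SGDClassifier(loss="log", penalty="elasticnet", l1_ratio=0.15)
    model.fit(X, y)
    return {"training_accuracy": float(model.score(X, y))}
```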
I was waiting on more examples to be completed before moving on. I added a gist here with the code I have so far. I broke up the example script into functions. It should make it easier to change things later. I can see having a default set of hyperparameters to start, but isn't that something that we should make optional for the user later? I wouldn't expect the optimal set of hyperparameters for one set of data to even be near optimal for another. |
@dhimmel I am a bit of a layperson when it comes to the actual meaning of the ML computations. Do you know where I can learn more about the "outcome array of mutation status"? The reason I want to understand this is that I am concerned about business-layer computation being implemented in the frontend. If some outcome of the ML algorithm has to be further refined before being delivered to a user, shouldn't that refinement be kept in the ML module? Sorry if the question looks dumb. |
@maxdml, All this means is whether a sample is mutated (1) or not mutated (0) for the gene of interest. Just so we're clear, the goal is to use gene expression (also referred to as features/X/predictors) to predict mutation status (also referred to as outcome/y/status). If you have any general questions about machine learning, perhaps ask them at #7. |
Is that because of a limitation of the frontend/backend, or does it just not seem necessary for the user? Looking at the classifier object in https://github.com/cognoma/django-cognoma/blob/master/doc/api.md.
How do they want to maintain this? Would the table be a SQL table? We could just write a script to generate/update it if we add a couple of fields to the modules, like a user-friendly name/title and description.
Can you or someone from the Greene Lab create an issue describing how to calculate / create the outcome array? Another thing is logging. Will the ML group be logging using the python logging module? We may want to send the logs to disk and/or to something like logstash. If the module code could periodically hit a progress callback, that would help us report status back to the user. |
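As a point of reference, standard-library logging to a file could be as simple as the following; the logger name, handler choice, and format string are only an example, and a logstash handler could be swapped in later:

```python
import logging

logger = logging.getLogger("cognoma.ml")

def configure_logging(log_path="ml-worker.log"):
    # Send INFO-and-above messages to a file on disk; other handlers (e.g. for
    # logstash) could be added later without touching the algorithm code.
    handler = logging.FileHandler(log_path)
    handler.setFormatter(logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)

# Inside an algorithm module, progress could then be both logged and reported:
# logger.info("cross-validation fold %d of %d complete", fold, n_folds)
```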
Choosing hyperparameter values is a great hindrance. If the machine learning team does its job, we can hopefully avoid subjecting the user to this nuisance. In the future, if some users want direct access to setting hyperparameters, perhaps we can expand functionality. Regarding the schema, which I hadn't actually seen yet (nice work) -- I think we're on the same page; I am just envisioning the simplest possible system.
For the list of algorithms, we'll export either a TSV or JSON file with the relevant fields (e.g. a name and description). We can log and hit the progress function. @awm33 -- let's deal with these two issues later. |
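A generation script along the lines discussed above might look roughly like this, assuming each algorithm module lives in a hypothetical algorithms package and defines TITLE and DESCRIPTION attributes (all of which are assumptions, not decisions):

```python
import importlib
import json
import pkgutil

import algorithms  # hypothetical package containing one module per algorithm

def export_algorithm_list(path="algorithms.json"):
    records = []
    for module_info in pkgutil.iter_modules(algorithms.__path__):
        module = importlib.import_module("algorithms." + module_info.name)
        records.append({
            "name": module_info.name,
            "title": getattr(module, "TITLE", module_info.name),
            "description": getattr(module, "DESCRIPTION", ""),
        })
    with open(path, "w") as f:
        json.dump(records, f, indent=2)
```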
@dhimmel Cool.
|
Sounds good. In the future, however, users may want to select samples based on criteria other than tissue. We can always change this then.
Don't care, but if we use a set, then should we use it consistently?
If we don't store the |
Ok. It sounded to me like tissues are a filter on the samples. If there are more filter criteria, I think we should still store them for the state of the UI and for knowing how the user generated the list.
I think we need more clarification on where |
Minimum Viable Product Specification

For the minimum viable product (i.e. the first release of the machine-learning package), I'm thinking we can have a simplified input to machine-learning (ML). The main ML function would consume a JSON file with a mutation_status and sample_id field, for example:

{
"mutation_status": [
0,
1,
1,
0,
0
],
"sample_id": [
"TCGA-22-4593-01",
"TCGA-2G-AALW-01",
"TCGA-3G-AB0O-01",
"TCGA-3N-A9WD-06",
"TCGA-49-4487-01"
]
}

Based on this design choice, the ML module never gets passed information on which sample filters were applied (such as disease type, gender, or age). While this information should be stored, the ML portion of the project won't actually need to know it. @awm33 I know I didn't answer your questions, just let me know which ones are still outstanding. |
@dhimmel Looking at your original example from above, is this data relational/tabular? So the above could be written as:
or
I'm just assuming that since a sample is mutated for a specific gene, that's what we are trying to pass: a row per sample. |
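The two layouts hold the same information; as a quick illustration using the sample values from the payload above:

```python
import pandas as pd

# Column-oriented, as in the payload above.
payload = {
    "mutation_status": [0, 1, 1, 0, 0],
    "sample_id": ["TCGA-22-4593-01", "TCGA-2G-AALW-01", "TCGA-3G-AB0O-01",
                  "TCGA-3N-A9WD-06", "TCGA-49-4487-01"],
}

df = pd.DataFrame(payload)
# Row-per-sample ("records") orientation: one dict per sample.
records = df.to_dict(orient="records")
# records[0] == {"mutation_status": 0, "sample_id": "TCGA-22-4593-01"}
```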
|
Looking at #51, would it make sense to have the worker / task runner code in this repo or in a separate one? I was thinking of exposing it as a CLI, something like the sketch below.
The worker code would pull classifier tasks off the queue, run them, and post the results back. |
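One possible shape for that CLI, sketched with argparse (the command name, flags, defaults, and helper functions are all placeholders):

```python
import argparse

def run_local_task(path):
    """Placeholder: load a task JSON file from path and run it once."""
    raise NotImplementedError

def poll_task_service(url):
    """Placeholder: poll the task-service at url for queued classifier tasks."""
    raise NotImplementedError

def main():
    parser = argparse.ArgumentParser(prog="cognoma-worker",
                                     description="Run Cognoma classifier tasks.")
    parser.add_argument("--task-service-url", default="http://localhost:8000",
                        help="Base URL of the task-service to poll for tasks.")
    parser.add_argument("--task-file",
                        help="Run a single task from a local JSON file instead of polling.")
    args = parser.parse_args()
    if args.task_file:
        run_local_task(args.task_file)
    else:
        poll_task_service(args.task_service_url)

if __name__ == "__main__":
    main()
```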
My preference is a separate repo. This repo already contains multiple things. Also, I think people may be interested in using the machine-learning package on its own. The task runner environment can install the package as a dependency. |
@dhimmel That sounds good. Do you want to create the repo? Trying to think of a good name, maybe "task-workers" or "ml-workers" in case we have other background tasks. |
@awm33, you pick the name and I'll create it. I like both suggestions. How will this repo be different from task-service? |
@dhimmel |
@awm33 I created |
@dhimmel Thanks! 🌮 |
Hello,
In an effort to build the global cognoma architecture, it would be very useful to determine an API which defines exactly what is given to the ML module (and incidentally what it will return).
As an example of strong API documentation, I believe OpenStack is a good start. Note how every module's API is listed, and how each route is described for each of those modules.
A direct example for a cognoma API can be found here; it is a first specification for the frontend module.