How to retrain in v0.3.3? #206
-
updateLabels can remark existing labels, but I do not think mis-marked labels are the problem in some of my clusters. Here is an example of labels and of a cluster that is poorly matched. I think I need to convert the data sample to fit the preexisting training data format and train a new model, i.e. https://docs.zingg.ai/zingg/stepbystep/createtrainingdata/addowntrainingdata

This format is not clear to me:
- Is the z_cluster field the same as z_cluster in the output, or just an arbitrary grouping for the label?
- The example data does not have headers at all? https://github.com/zinggAI/zingg/blob/main/examples/febrl/training.csv
- Rather than the preexisting labels in the config file, can we just add the labels to the marked/ collection?
- Marked files produced by Zingg only have two records per file. Could we have n records that share the same mark in a single file?
-
As discussed today on the call, I would advise against directly manipulating the model data. Instead, you can use trainingSamples with the z_cluster and z_isMatch fields. I am opening an issue to add a header to the preexisting training data.
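For reference, a labelled sample file along the lines discussed might look like the sketch below. This is a hedged illustration only: the header names `z_cluster`/`z_isMatch` and the sample fields `fname`/`lname` are assumptions based on this thread and the febrl example, not a confirmed spec.

```python
import csv
import io

# Illustrative sketch (assumed format, not an official Zingg spec):
# each z_cluster value groups the two records of one labelled pair,
# and z_isMatch records the label for that pair (1 = match, 0 = no match).
rows = [
    {"z_cluster": "0", "z_isMatch": "1", "fname": "jon",   "lname": "smith"},
    {"z_cluster": "0", "z_isMatch": "1", "fname": "john",  "lname": "smith"},
    {"z_cluster": "1", "z_isMatch": "0", "fname": "mary",  "lname": "jones"},
    {"z_cluster": "1", "z_isMatch": "0", "fname": "maria", "lname": "johns"},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["z_cluster", "z_isMatch", "fname", "lname"])
writer.writeheader()  # unlike the headerless febrl example, write a header row
writer.writerows(rows)
print(buf.getvalue())
```

Writing the header explicitly avoids the ambiguity raised above about the headerless example file.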
-
**File formats** — you can have any kind of file or format here: JSON/CSV/Parquet, or even plug in the Zingg Pipe of your JDBC/NoSQL store. All we need are the z_cluster and the fieldDefinition fields.

**Reuse labels** — yes, you can export and reuse trainingdata/marked or trainingdata/unmarked as you like, in a format of your choice. Skipping the labelling is not the right way, as we need an optimal selection of edge cases to have a well-tuned model.

**Single source of truth** — sometimes users have a few records that they want to seed the training with, and sometimes one or two types of matching scenarios can be omitted from discovery by the usual findTrainingData + label route. We don't want a lot of training data; from what we have seen, it is the quality of the samples that matters most.

**Data scope** — the z_ fields are not universal ids, hence we can't bank on them across runs. We also cannot always count on identifier fields in the data. If the schema is changing, the model will need to be updated, not by re-labelling but potentially by reshaping: renaming columns or adding empty dont_use columns. We can build that separately.
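The reshaping idea at the end can be sketched in plain Python. This is a hypothetical helper, not an existing Zingg utility: the column names, the rename map, and the added field are made up for illustration.

```python
# Hypothetical reshaping of old training rows to a changed schema:
# rename columns that changed, and add empty placeholder columns for
# new fields (which could then be marked DONT_USE in fieldDefinition).
old_rows = [
    {"z_cluster": "0", "surname": "smith"},
    {"z_cluster": "0", "surname": "smyth"},
]
rename = {"surname": "lname"}  # old column name -> new column name (illustrative)
new_fields = ["email"]         # columns added to the schema, left empty here

reshaped = []
for row in old_rows:
    # carry every value over, under its new name where one applies
    new_row = {rename.get(col, col): value for col, value in row.items()}
    # pad with empty values for columns the old data never had
    for field in new_fields:
        new_row.setdefault(field, "")
    reshaped.append(new_row)

print(reshaped)
```

The labels themselves (z_cluster groupings) survive untouched; only the record schema around them is adjusted to match the new fieldDefinition.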