How to retrain in v0.3.3? #206
-
updateLabels can remark existing labels, but I do not think mis-marked labels are the problem in some of my clusters. Here is an example of labels and of a cluster that is poorly matched. I think I need to convert the data sample to fit the preexisting training data format and train a new model, i.e. https://docs.zingg.ai/zingg/stepbystep/createtrainingdata/addowntrainingdata

This format is not clear to me:
- Is the z_cluster field the same as z_cluster in the output, or just an arbitrary grouping for the label?
- The example data does not have headers at all? https://github.com/zinggAI/zingg/blob/main/examples/febrl/training.csv
- Rather than the preexisting labels in the config file, can we just add the labels to the marked/ collection?
- Marked files produced by Zingg only have two records per file. Could we have n records that share the same mark in a single file?
-
As discussed today on the call, I would advise against directly manipulating the model data. Instead, you can use trainingSamples with the z_cluster and z_isMatch fields. I am opening an issue to add a header to the preexisting training data.
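For reference, a labelled sample file along the lines discussed might look like the sketch below. This is a hedged illustration only: the header names `z_cluster`/`z_isMatch` and the sample fields `fname`/`lname` are assumptions based on this thread and the febrl example, not a confirmed spec.

```python
import csv
import io

# Illustrative sketch (assumed format, not an official Zingg spec):
# each z_cluster value groups the two records of one labelled pair,
# and z_isMatch records the label for that pair (1 = match, 0 = no match).
rows = [
    {"z_cluster": "0", "z_isMatch": "1", "fname": "jon",   "lname": "smith"},
    {"z_cluster": "0", "z_isMatch": "1", "fname": "john",  "lname": "smith"},
    {"z_cluster": "1", "z_isMatch": "0", "fname": "mary",  "lname": "jones"},
    {"z_cluster": "1", "z_isMatch": "0", "fname": "maria", "lname": "johns"},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["z_cluster", "z_isMatch", "fname", "lname"])
writer.writeheader()  # unlike the headerless febrl example, write a header row
writer.writerows(rows)
print(buf.getvalue())
```

Writing the header explicitly avoids the ambiguity raised above about the headerless example file.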
-
**File formats** — you can have any kind of file or format here: JSON/CSV/Parquet, or even plug in the Zingg Pipe of your JDBC/NoSQL store. All we need are the z_cluster and the fieldDefinition fields.

**Reuse labels** — yes, you can export and reuse trainingdata/marked or trainingdata/unmarked as you like, in a format of your choice. Skipping the labelling is not the right way, as we need an optimal selection of edge cases to have a well-tuned model.

**Single source of truth** — sometimes users have a few records that they want to seed the training with, and sometimes one or two types of matching scenarios can be omitted from discovery by the usual findTrainingData + label route. We don't want a lot of training data; from what we have seen, it is the quality of the samples that matters most.

**Data scope** — the z_ fields are not universal ids, hence we can't bank on them across runs. We also cannot always count on identifier fields in the data. If the schema is changing, the model will need to be updated, not by re-labelling but potentially by reshaping: renaming columns or adding empty dont_use columns. We can build that separately.
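The reshaping idea at the end can be sketched in plain Python. This is a hypothetical helper, not an existing Zingg utility: the column names, the rename map, and the added field are made up for illustration.

```python
# Hypothetical reshaping of old training rows to a changed schema:
# rename columns that changed, and add empty placeholder columns for
# new fields (which could then be marked DONT_USE in fieldDefinition).
old_rows = [
    {"z_cluster": "0", "surname": "smith"},
    {"z_cluster": "0", "surname": "smyth"},
]
rename = {"surname": "lname"}  # old column name -> new column name (illustrative)
new_fields = ["email"]         # columns added to the schema, left empty here

reshaped = []
for row in old_rows:
    # carry every value over, under its new name where one applies
    new_row = {rename.get(col, col): value for col, value in row.items()}
    # pad with empty values for columns the old data never had
    for field in new_fields:
        new_row.setdefault(field, "")
    reshaped.append(new_row)

print(reshaped)
```

The labels themselves (z_cluster groupings) survive untouched; only the record schema around them is adjusted to match the new fieldDefinition.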