feat: Add contraints and embedding directives #3405
base: develop
@@ -520,6 +525,11 @@ func (c *collection) update(
	ctx context.Context,
	doc *client.Document,
) error {
	err := c.setEmbedding(ctx, doc, false)
question: This is making a remote call to a third-party service (Ollama or OpenAI), no?
If an embedding is defined, yes.
That sounds discussion-worthy 😁
todo: Please significantly expand the documentation on the Embedding-related types/properties in client, so both ourselves and users are very aware that this behaviour exists.
todo: Please describe to the team why we really need to do this (on doc create/update) in the PR description, including what we are looking to gain from the call (where/when-ever it is made from), how long you expect it to remain in its current state (e.g. synchronous, on doc create/update), and any alternative and long-term thoughts you may have on it.
todo: Please flag this in discord, or perhaps the standup (if everyone is going to be present), as I think the team needs to be aware of this - making mandatory calls (depending on field kind etc) to third party APIs on document creation and update from a database does sound a little unusual and does appear to me to come with significant downsides.
PR description has been updated
Follow up: Discussed during the standup. We'll leave the "in-request" embedding network call as is for now, and start expanding other options for async embedding generation soon.
cc: @fredcarle can you create a follow up ticket for this?
I think the calls out to third-party APIs need expanded documentation and discussion with the team. I have reviewed very little so far but will continue whilst that issue is being resolved.
@@ -74,6 +74,12 @@ jobs:
  DEFRA_MUTATION_TYPE: ${{ matrix.mutation-type }}

steps:
  - name: Install ollama
    run: curl -fsSL https://ollama.com/install.sh | sh
question: Is this only needed for the basic test matrix?
Yes. Why are you asking?
because:
- I don't understand why this is not needed for other test types (acp, view, encryption, etc.).
- Is this a required dependency now? If so, then:
  - document in README
  - add a make rule in Makefile
  - perhaps add this to the setup composite instead (./.github/composites/setup-defradb)
- document in
- When you see "Ollama", do you need more documentation to know what this is about? I might be wrong to assume that would be enough to understand that this is only for embedding related testing.
- It's not a Defra dependency. Only a testing dependency for the embedding specific tests.
- Maybe I missed it, but last I checked the embedding tests didn't have a SupportedTypes-like config, so I assumed they would run with all configurations. For example, when the encryption or macOS tests run, wouldn't they still run and hence need this dep?
- Yeah, so similar to how we have the Rust dep for the lens tests, this should have a make rule IMO and be part of the test/dev deps.
I've given the production code a first-pass review, and will shortly look through the tests. The code looks good, most of the comments are about moving stuff around, or questions/thoughts RE the feature design.
thought: It looks very tedious and error prone to introduce new types atm, although numerics are probably most painful. Perhaps we should review certain aspects in the future to improve this.
}
if size != 0 && len(array) != size {
	return nil, NewErrArraySizeMismatch(array, size)
question: This error only appears to be returned when getting the value. I thought we had a dedicated space for validating doc contents; would it be more appropriate to handle this there? Or at least on doc-value write (maybe in addition to read)?
This function is called by validateFieldSchema, so it's done on write.
Nice, thanks!
for _, colField := range c.Definition().GetFields() {
	if colField.Embedding != nil {
		fieldsVal := make(map[string]client.NormalValue)
		needsGeneration := false
question: It looks like if a user explicitly provides their own vector value it will be overwritten by the result of the remote call. Would it not be better to allow users to provide their own values?
Besides being less surprising for them, it would give them greater control over the determinism of the resultant CID/DocID, which I believe can vary depending on the timing of the remote calls (under most AI models).
It shouldn't overwrite. If it does it's a mistake on my part. Users should be free to use their own vectors.
Ah okay nice!
todo: I've not read the tests yet, could you please make sure that there is at least one that covers that scenario?
suggestion: Please also note this behaviour in the func documentation
func (c *collection) setEmbedding(ctx context.Context, doc *client.Document, isCreate bool) error {
	embeddingGenerated := false
	for _, colField := range c.Definition().GetFields() {
		if colField.Embedding != nil {
suggestion: Inverting this if and continue-ing would remove quite a long indentation, and IMO make the code a bit easier to read.
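The inversion suggested here can be sketched as a guard clause. This is a minimal sketch with hypothetical stand-in types (`field`, `embedding`) rather than the real client definitions; only the control-flow shape is the point.

```go
package main

import "fmt"

// embedding is a hypothetical stand-in for the embedding description.
type embedding struct {
	Fields []string
}

// field is a hypothetical stand-in for a collection field definition.
type field struct {
	Name      string
	Embedding *embedding
}

// embeddingFieldNames returns the names of fields that declare an embedding.
// Fields without one are skipped up front, so the main logic stays unindented.
func embeddingFieldNames(fields []field) []string {
	var names []string
	for _, f := range fields {
		if f.Embedding == nil {
			continue // guard clause: nothing to do for this field
		}
		names = append(names, f.Name)
	}
	return names
}

func main() {
	fields := []field{
		{Name: "name"},
		{Name: "name_v", Embedding: &embedding{Fields: []string{"name"}}},
	}
	fmt.Println(embeddingFieldNames(fields))
}
```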
text := ""
for _, fieldName := range colField.Embedding.Fields {
	if val, ok := fieldsVal[fieldName]; ok {
		text += fmt.Sprintf("%v\n", val.Unwrap())
suggestion: A strings.Builder might be more appropriate here. It might be more readable, and should be more efficient, although the performance gains would probably be small compared to the cost of a remote call.
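A minimal sketch of that suggestion, assuming plain `any` values in place of `client.NormalValue` (so there is no `Unwrap()` call) and a hypothetical helper name `buildEmbeddingText`:

```go
package main

import (
	"fmt"
	"strings"
)

// buildEmbeddingText writes each present field value on its own line,
// in the order given by fieldNames. Missing fields are skipped.
func buildEmbeddingText(fieldNames []string, values map[string]any) string {
	var sb strings.Builder
	for _, name := range fieldNames {
		val, ok := values[name]
		if !ok {
			continue
		}
		// Fprintf appends to the builder without allocating a new string per field.
		fmt.Fprintf(&sb, "%v\n", val)
	}
	return sb.String()
}

func main() {
	values := map[string]any{"name": "John", "about": "Likes databases"}
	fmt.Print(buildEmbeddingText([]string{"name", "about"}, values))
}
```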
switch arg.Name.Value {
case types.ConstraintsDirectivePropSize:
	if !kind.IsArray() {
		return constraintDescription{}, errors.New("size constraint can only be applied to array fields")
todo: This error should be declared in an errors.go file
@@ -13,6 +13,7 @@ package schema
import (
todo: There appears to be a lot of schema/collection description validation taking place in this file. Please remove as much as you can and instead place it in internal/db/definition_validation.go.
This will allow all our schema validation to live in one place, and lets it operate on our lowest-level/core types, instead of within query-language dependent code-spaces.
Are you referring to the new code in this PR or in general for this file?
The new code in this PR. We have existing validation in this file, but only for GQL/SDL specific validation. The current PR contains a lot of general validation, such as validating that target is an array-float kind - this needs to move.
func Float32ListOperatorBlock(op *gql.InputObject) *gql.InputObject {
	return gql.NewInputObject(gql.InputObjectConfig{
		Name: "Float32ListOperatorBlock",
		Description: "These are the set of filter operators available for use when filtering on [Float32] values.",
todo: Please move these descriptions to the descriptions.go file
var Float32 = graphql.NewScalar(graphql.ScalarConfig{
	Name: "Float32",
	Description: "The `Float32` scalar type represents signed single-precision fractional " +
todo: Please move this description to descriptions.go
I've reviewed the tests, thanks for adding them. I think there are a couple of important test gaps though; please add tests for:
1. Patching a size constraint, including removing it
2. Patching an embedding property, and its children:
2.1 Adding an embedding description to a field that didn't have one before
2.2 Mutating each of the embedding properties on a field that declared one on create
2.3 Removing an embedding property from a field that had one on create
3. Including a size constraint in a View declaration; if you don't/haven't blocked it off, please include a test that shows what happens if that constraint is violated
4. Including an embedding property in a View declaration (I assume we want this to error for now).
Results: map[string]any{
	"Users": []map[string]any{
		{
			"_docID": testUtils.NewDocIndex(0, 0),
suggestion: _docID is quite an expensive property maintenance-wise to include in a test, and it does not appear to have a use here. I suggest removing it from the query.
Same goes for the p[n]counter tests.
// by the Apache License, Version 2.0, included in the file
// licenses/APL.txt.

//go:build linux
todo: Please document in this file why these tests are Linux only. I know you noted this in the PR description, but if this constraint persists it will take some figuring out as to why it is Linux only.
suggestion: Consider creating a ticket and linking it to and from here to remove this constraint.
"User": []map[string]any{ | ||
{ | ||
"_docID": testUtils.NewDocIndex(0, 0), | ||
"name_v": []float32{-0.005572272, 0.11054294, -0.14823098, 0.007955516, -0.039147116, 0.024685992, -0.07684083, 0.009803459, 0.034218226, -0.028475128, -0.035921726, 0.03413789, 0.04778687, 0.06861559, -0.0005298276, -0.02227662, -0.0024149693, -0.0028286104, 0.044666544, 0.07396666, -0.01592924, -0.062988736, -0.002934097, -0.052807238, 0.06903765, -0.008730141, -0.030993376, 0.025784684, -0.04041025, -0.0003130241, -0.026256371, -0.039696295, -0.031230215, 0.08520995, -0.002201601, -0.0075100223, 0.03557135, 0.049701013, 0.003201713, -0.0622082, -0.052659452, 0.013204338, 0.025808379, -0.06746355, 0.06658112, -0.01420459, -0.002811362, 0.043729357, 0.003082942, 0.024417346, 0.024631174, 0.0060707806, -0.024436437, -0.061332986, 0.0068888417, -0.020600287, 0.01386111, -0.017386531, -0.028814659, -0.01508969, 0.06513925, -0.017392348, 0.019955589, 0.07307108, 0.014680802, 0.027661886, 0.03331586, 0.025785519, -0.06021169, -0.017004002, 0.054138314, -0.017119048, 0.0021329697, 0.035362158, -0.024475744, -0.012421996, 0.034326375, 0.034637466, -0.03635416, 0.0023013682, 0.01953218, 0.010028518, 0.09421262, -0.022285048, 0.038090277, -0.016491888, 0.024973793, -0.0062092217, -0.0006079076, 0.012070617, -0.010586575, 0.014457437, 0.018606706, 0.0132330125, -0.0012271218, 0.029310074, -0.06372368, 0.0027326255, -0.0036948905, -0.036406342, -0.06319537, 0.004964318, -0.016415307, -0.037278935, 0.032909706, 0.0039533414, 0.055153944, 0.0077766413, 0.04752943, -0.020662656, 0.018654374, 0.027579581, -0.009995172, -0.030590994, 0.0032329047, 0.0051636375, -0.019894218, -0.034079403, -0.004074199, 0.027430773, -0.04688943, -0.006935151, -0.037497897, 0.06076331, -0.014011966, 0.025153061, 0.02982776, 0.028703934, -0.06938982, -0.0053709135, -0.015808186, 0.054642905, 0.022894323, 0.0067457263, 0.025538877, 0.040291548, 0.06352461, -0.047891434, 0.015845068, 0.021532292, 0.029393502, 0.06712895, 0.03418742, -0.020285899, 0.04999721, -0.015223504, 0.008664959, -0.08734331, 
-0.03180259, -0.060385752, 0.045383543, 0.022910262, 0.023367167, 0.06202772, -0.041573163, -0.06919459, 0.008020132, -0.035788614, -0.022974774, 0.017490545, 0.012094521, -0.008700876, 0.023880692, 0.029243862, 0.03433071, -0.022300672, 0.02793599, 0.04557679, -0.0074306466, 0.106741145, 0.011710485, -0.022114152, -0.05479797, -0.026192376, 0.017863747, -0.059329823, -0.016414663, -0.023017433, 0.046126083, -0.037351213, -0.004975373, -0.06755006, 0.003153917, 0.05803753, -0.0030448795, -0.018085295, -0.011378983, -0.032927986, -0.0356213, -0.017273922, 0.0057821423, 0.026892597, -0.056277, -0.049027056, -0.03585269, -0.002762686, 0.022830265, 0.01952595, -0.0023518447, 0.019815316, -0.0308199, -0.056283116, -0.05646737, 0.040788855, -0.06935688, 0.0016162894, -0.07576725, 0.0021788045, -0.027595945, 0.016406938, 0.00914925, -0.030377423, -0.04072336, 0.004285908, -0.053028747, -0.043620903, 0.013743935, -0.0060078623, -0.05145278, -0.01240032, -0.0019821713, 0.0459069, -0.000397437, -0.040874127, 0.02232409, -0.021987809, 0.016709065, -0.019955955, -0.043258507, -0.012822493, 0.0015544997, 0.008750475, 0.08029463, -0.016551215, 0.0106922155, 0.04511181, 0.012863488, 0.08101362, 0.015255131, -0.0378599, -0.026632613, -0.007132961, 0.022561012, -0.035132837, -0.051327687, 0.012184595, -0.053649075, -0.06340343, -0.038676884, 0.05588606, 0.04636262, -0.0017865195, 0.0065502143, -0.0021936935, 0.037256364, -0.0117199775, 0.001200583, 0.03416475, -0.008868033, -0.09372393, 0.02795748, -0.03450235, 0.0191968, -0.04304435, -0.039260436, -0.039282955, -0.026117781, 0.028008305, 0.020413136, -0.069958426, 0.007468339, 0.03535249, -0.013168222, 0.03160398, -0.03536105, -0.012599075, 0.012694869, -0.047842525, -0.023844596, -0.002578087, 0.018921396, -0.038521487, 0.016861906, 0.018677093, 0.009483494, 0.06958433, 0.020350413, 0.002132844, -0.015051295, 0.012355237, 0.036700185, 0.041566264, 0.02166237, -0.017458526, -0.003880696, 0.021615662, 0.071755074, -0.06163507, 
-0.018081794, -0.04193382, 0.0038752954, 0.029048733, 0.0021254525, 0.028686205, 0.0021548586, 0.0153228, 0.027537907, -0.06318449, 0.013970559, -0.027223982, 0.007815344, -0.044364017, -0.009059432, 0.093419164, -0.0075638997, 0.0349247, -0.04389202, 0.03379941, 0.032499135, -0.05383602, -0.0075482763, -0.05727751, 0.029602082, -0.023207424, 0.009572965, 0.015227847, 0.017614286, 0.017081033, 0.029041866, -0.015170309, -0.016623484, -0.0040708343, 0.07524853, -0.029914094, 2.6077309e-05, 0.022064127, 0.0061305235, 0.031021086, -0.04173227, 0.034938656, 0.010578506, -0.014831277, -0.008377155, -0.055739824, 0.00076083303, 0.04864599, -0.028258063, 0.007825028, 0.06374719, -0.008117671, -0.0529908, -0.0008744555, -0.026621709, 0.057709042, -0.028861143, 0.015160184, 0.06823767, -0.03448378, -0.012032499, 0.007125862, -0.0021329958, 0.021073861, 0.035434462, -0.005443737, -0.037596166, 0.01867455, 0.021003755, -0.027144607, 0.0040687053, 0.0073787137, -0.02148215, -0.01370821, 0.028337331, 0.03438015, 0.025714653, 0.016099878, -0.1280255, -0.028904129, 0.054412697, -0.00027886606, 0.010180635, 0.006911521, -0.072732806, 0.040548593, 0.054354455, 0.022762984, 0.083944574, -0.032127433, 0.031493276, 0.025450544, 0.0014648492, -0.053174727, -0.089231305, -0.05349475, 0.02832863, -0.042825792, 0.040540442, 0.0034059666, 0.012527256, -0.0070269583, 0.004348014, -0.0012775054, 0.020653797, -0.028427351, -0.012061082, 0.08474403, 0.018355045, -0.03782953, 0.01780059, -0.03124967, -0.02972917, 0.06377995, 0.03622817, -0.038774543, -0.008168061, 0.0008756734, 0.063374326, -0.017164867, 0.03089118, -0.030859854, 0.018398924, -0.01290288, -0.013464138, 0.07005317, -0.031074032, 0.0065023694, 0.01833391, 0.086340636, 0.0103744725, -0.017039359, -0.054524146, -0.02184828, 0.03237553, -0.0051302775, -0.03880768, 0.0042719403, -0.018497659, 0.039791476, -0.0029515014, 0.047630522, 0.009005865, 0.012332961, -0.065514065, -0.017099215, 0.020070605, 0.033033818, 0.03819908, 
-0.002926078, 0.02485725, -0.02492589, 0.02524227, 0.015304151, 0.010357196, -0.0012361126, 0.06055463, -0.052393332, 0.0018451725, -0.033913672, 0.034882527, -0.040411558, 0.029277325, 0.0066468045, -0.028638732, 0.030994104, -0.0038566508, -0.026518226, 0.023721188, 0.053882364, 0.015044844, 0.009445471, 0.008423282, 0.006908645, 0.033433937, 0.0013090725, -0.043953463, 0.009207147, -0.02148394, 0.009764033, 0.026332734, 0.07524719, -0.013170156, 0.0311968, -0.028490331, -0.054174777, 0.008361257, 0.026559152, 0.0019873658, 0.010297786, 0.05164019, -0.017219178, 0.026013039, 0.029987952, 0.018734867, -0.054145347, -0.00529485, -0.027970446, -0.083762854, 0.021667952, 0.03809247, -0.016397795, 0.024940109, -0.011692556, -0.015403972, -0.037349943, 0.05150695, -0.002360465, -0.038538422, -0.033100877, -0.078322336, 0.019312337, -0.039709445, -0.028119525, 0.0075941817, 0.044093464, 0.04312815, -0.01860829, -0.014516086, -0.027146114, -0.05374159, 0.018937763, 0.055734564, -0.055264555, -0.021128979, 0.013552821, -0.023388384, 0.028102161, -0.06574926, 0.0005777815, 0.05832509, -0.021472303, 0.008772928, 0.05439286, -0.015902612, 0.0072142733, -0.030573256, 0.0070295557, -0.0415438, 0.035653543, -0.01455105, 0.020005474, 0.07986046, -0.035983853, 0.035677813, 0.011222366, -0.023504755, -0.03807093, 0.041671403, -0.017686117, -0.027882392, -0.021149794, -0.020939926, 0.005685982, 0.008118187, -0.02225789, 0.028101409, -0.0077426652, -0.058306817, -0.013176626, -0.0018302508, -0.042805985, 0.03499617, 0.006960921, 0.0521078, 0.013127357, 0.012807865, -0.0041235113, 0.0364854, -0.029623492, 0.0036852285, 0.010436406, -0.012979821, -0.06214568, 0.02496504, -0.012057285, 5.3822554e-05, 0.068691194, -0.005192032, -0.00840417, -0.03474282, -0.041815113, 0.01586503, -0.06975514, -0.022814387, -0.01747859, -0.03028506, 0.00899632, -0.030455945, -0.02612898, 0.09748062, -0.060053565, 0.073910706, -0.045878906, 0.017043162, 0.036452837, 0.033786267, 0.07532076, -0.012363524, 
0.014629511, -0.051726285, -0.060937867, 0.009702624, 0.0037173766, 0.025674582, 0.0073014116, -0.034176115, 0.014616017, 0.020726405, 0.05494357, -0.023246752, 0.023410168, 0.0058005634, -0.02368794, 0.030081518, -0.020378673, -0.0042353794, -0.02886309, 0.030399263, 0.009913847, -0.019511933, 0.039135117, -0.029321602, -0.057733648, 0.018555647, 0.04234238, -0.024444107, -0.01830207, -0.01449405, -0.055994887, -0.02198859, 0.022180306, -0.017542241, 0.020654257, -0.07833157, 0.029106235, -0.016729757, -0.037108444, -0.027721336, -0.0055683763, -0.011531618, 0.021272335, -0.010220706, -0.009008802, 0.0025321587, 0.007843418, -0.009176056, -0.035371616, 0.035749473, 0.04374431, -0.0133530535, -0.027428798, 0.052815367, 0.08725027, 0.07851333, 0.036350124, 0.022137968, -0.060914032, -0.007600696, -0.03680366, -0.030651785, -0.039126765, 0.0125571685, 0.016707521, -0.03751044, 0.03910936, 0.031825088, -0.0029502262, 0.007766306, -0.03814175, -0.07744442, -0.0054074563, 0.030857768, 0.015186418, -0.021467306, -0.015918817, -0.011368014, 0.0019515429, -0.014827345, -0.0022194595, 0.02624271, 0.0066641932, 0.019060731, 0.05616408, 0.01946424, 0.050201155, 0.016927207, -0.038973548, 0.037738074, -0.056299437, 0.00967483, -0.018431908, 0.01668865, -0.047797382, 0.015456402, 0.018490652, -0.012237633, 0.012219139, 0.026376488, -0.023400653, -0.04556467, 0.064960316, 0.0019083915, 0.05282892, 0.043711416, 0.07134014, -0.029614225, -0.040969286, 0.045358557, -0.031595223, -0.01910029, -0.061545778, 0.024926651, 0.04099986, 0.01620945, 0.042243116, 0.04717351, -0.0048233513, -0.0019400177, -0.036986295, -0.039084256, 0.025065007, 0.03538771, -0.00042725189, 0.011488134, -0.018427324, -0.019081326, -0.03214962, 0.05602044, -0.07190716, 0.006701344, -0.027393665, 0.025218304, 0.021957463, -0.08623837, 0.026349807, -0.020874213, 0.016827395, -0.051633146, -0.016894199, 0.009786709, 0.026273636, 0.013940125, 0.011852815, 0.036727257, 0.055551354, 0.008069757, -0.015989762, 
0.01829905, 0.060527466, 0.048552617, 0.01770399, -0.03225409, -0.042072583, -0.026071887, -0.036898546, -0.05571102, -0.0058531873, -0.047945924, -0.0003473616, 0.038228065, 0.018455006, 0.025350647, 0.00965173, 0.00064714643, 0.01915388, -0.015175882, -0.040475022, -0.0338664, 0.034016117}, |
thought: I'm not sure including the full array helps people read the test, and the only benefit to asserting anything beyond 'is it an array with some values' is perhaps the change detector.
question: Do you have plans to introduce more tests like this in the near future?
suggestion: If you do want to add more, can we perhaps look at asserting on this value in a slightly different way, perhaps only checking that it is of an expected size, with non-zero values.
As an idea, we can create a new Validator for asserting this, similar to NewUniqueCid in tests/integration/results.go.
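The kind of assertion being proposed might look roughly like the sketch below. It is a standalone, hypothetical helper (`checkEmbedding`) and does not implement the test framework's actual Validator interface, whose shape is not shown in this conversation. It also assumes "non-zero values" means at least one non-zero element, since a real embedding could legitimately contain a zero.

```go
package main

import "fmt"

// checkEmbedding returns an error unless vec has exactly wantSize elements
// and at least one of them is non-zero. It deliberately ignores the exact
// values, which depend on the model that produced them.
func checkEmbedding(vec []float32, wantSize int) error {
	if len(vec) != wantSize {
		return fmt.Errorf("expected %d values, got %d", wantSize, len(vec))
	}
	for _, v := range vec {
		if v != 0 {
			return nil
		}
	}
	return fmt.Errorf("expected at least one non-zero value in %d-element vector", len(vec))
}

func main() {
	fmt.Println(checkEmbedding([]float32{-0.0055, 0.1105, -0.1482}, 3)) // passes
	fmt.Println(checkEmbedding([]float32{0, 0, 0}, 3))                   // all zeros
	fmt.Println(checkEmbedding([]float32{0.1}, 3))                       // wrong size
}
```

A check like this would keep the test readable and avoid pinning hundreds of model-specific floats in the expected results.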
// by the Apache License, Version 2.0, included in the file
// licenses/APL.txt.

//go:build linux
todo: Please change the way you constrain these tests. The current setup will prevent me, or any other Linux users, from running go test ./... or even make test, which could be a bit of a nuisance. Perhaps introduce something similar to the SupportedClients TestCase property.
docIDMap := make(map[string]struct{})
for _, docID := range docIDs {
	docIDMap[docID.String()] = struct{}{}
}

if action.ExpectedError == "" {
	waitForUpdateEvents(
		s,
		action.NodeID,
		action.CollectionID,
		getEventsForCreateDoc(s, action),
		action.Identity,
	)
	waitForUpdateEvents(s, action.NodeID, action.CollectionID, docIDMap, action.Identity)
question: curious why this needed to be done?
Because the IDs were already available on create, and since the embedding vector is not known in advance, we cannot depend on the document's content in the test's definition to determine the docIDs.
Marking this as "request for changes" (you can dismiss at any time), even though there aren't any explicit todos.
I have mainly focused on the definition/description code along with the collection/doc updating. Speed-ran through the encoding section and the GQL parsing.
Haven't reviewed the tests.
Main question/discussion points are "when" you are generating the embedding in the call path, and the location of the EmbeddingDescription.
// This may cause increase latency in the completion of the mutation requests.
// This is necessary to ensure that the generated docID is representative of the
// content of the document.
Embedding *EmbeddingDescription
suggestion: What if the structure/placement of the EmbeddingDescription was similar to how we have IndexDescription, where it actually lives on the CollectionDescription instead of here on the CollectionFieldDescription?
This would make it much easier to iterate through the existing embedding descriptions when mutating documents to update/generate the embeddings. Whereas right now you have to iterate through all the fields on a doc.
It also opens up the door to not having to actually have the embedding in the document graph itself, and instead is stored as an index outside the graph (future work).
@@ -520,6 +525,11 @@ func (c *collection) update(
	ctx context.Context,
	doc *client.Document,
) error {
	err := c.setEmbedding(ctx, doc, false)
question: Is there a particular reason why the first thing the create/update call does is generate the embeddings? In the update case, it gets the embeddings before the ACP check is even done to determine if a mutation is allowed. It could even be in the save func deeper in the call graph instead of being duplicated in the create and update paths.
func (c *collection) setEmbedding(ctx context.Context, doc *client.Document, isCreate bool) error {
	embeddingGenerated := false
	for _, colField := range c.Definition().GetFields() {
		if colField.Embedding != nil {
Related to my suggestion above: moving the EmbeddingDescription to the CollectionDescription instead of the CollectionFieldDescription would simplify this function, as it could go straight to the embeddings rather than scanning all the fields to see if there is a description on them.
		}
	}
}
if needsGeneration && len(missingFieldsForGeneration) > 0 {
question: Should there be an additional check here if we are creating a new doc? Since we can resolve the missingFieldsForGeneration by getting the old doc version if this is a create call. Or is this covered by the c.get call returning a not-found error?
	false,
)
if err != nil {
	return err
suggestion: Might want to wrap this error so it's clear to the caller why a document is being fetched.
} else {
	fieldDef, ok := c.def.GetFieldByName(embedField)
	if !ok {
		return errors.New("field not found", errors.NewKV("field", embedField))
question: I can't remember where we are at regarding dev policies for inline error creation. Does this need to be a properly defined error in the errors.go file?
Yes please :)
}
embedding, err := embeddingFunc(ctx, text)
if err != nil {
	return err
suggestion: wrap this error
// generateAndSetDocID generates the DocID and then (re)sets `doc.id`. | ||
// GenerateAndSetDocID generates the DocID and then (re)sets `doc.id`. | ||
func (doc *Document) GenerateAndSetDocID() error { | ||
return doc.generateAndSetDocID() |
nitpick: I don't see the point of having the private function
FieldKind_FLOAT64_ARRAY, | ||
FieldKind_STRING_ARRAY, | ||
FieldKind_NILLABLE_BOOL_ARRAY, | ||
FieldKind_NILLABLE_INT_ARRAY, | ||
FieldKind_NILLABLE_FLOAT_ARRAY, | ||
FieldKind_NILLABLE_FLOAT64_ARRAY, |
todo: please add tests for float32
FieldKind_NILLABLE_FLOAT64_ARRAY ScalarArrayKind = 20 | ||
FieldKind_NILLABLE_STRING_ARRAY ScalarArrayKind = 21 | ||
FieldKind_NILLABLE_FLOAT32 ScalarKind = 22 | ||
FieldKind_FLOAT32_ARRAY ScalarArrayKind = 23 |
question: why not make them 8 and 9 so that they are next to their siblings?
@@ -54,8 +54,8 @@ func TestField_ScalarArray_HasSubKind(t *testing.T) { | |||
}, | |||
{ | |||
name: "nillable float array", | |||
arrKind: FieldKind_NILLABLE_FLOAT_ARRAY, | |||
subKind: FieldKind_NILLABLE_FLOAT, | |||
arrKind: FieldKind_NILLABLE_FLOAT64_ARRAY, |
todo: please cover float32 as well
"User": []map[string]any{ | ||
{ | ||
"_docID": testUtils.NewDocIndex(0, 0), | ||
"name_v": []float32{-0.005572272, 0.11054294, -0.14823098, 0.007955516, -0.039147116, 0.024685992, -0.07684083, 0.009803459, 0.034218226, -0.028475128, -0.035921726, 0.03413789, 0.04778687, 0.06861559, -0.0005298276, -0.02227662, -0.0024149693, -0.0028286104, 0.044666544, 0.07396666, -0.01592924, -0.062988736, -0.002934097, -0.052807238, 0.06903765, -0.008730141, -0.030993376, 0.025784684, -0.04041025, -0.0003130241, -0.026256371, -0.039696295, -0.031230215, 0.08520995, -0.002201601, -0.0075100223, 0.03557135, 0.049701013, 0.003201713, -0.0622082, -0.052659452, 0.013204338, 0.025808379, -0.06746355, 0.06658112, -0.01420459, -0.002811362, 0.043729357, 0.003082942, 0.024417346, 0.024631174, 0.0060707806, -0.024436437, -0.061332986, 0.0068888417, -0.020600287, 0.01386111, -0.017386531, -0.028814659, -0.01508969, 0.06513925, -0.017392348, 0.019955589, 0.07307108, 0.014680802, 0.027661886, 0.03331586, 0.025785519, -0.06021169, -0.017004002, 0.054138314, -0.017119048, 0.0021329697, 0.035362158, -0.024475744, -0.012421996, 0.034326375, 0.034637466, -0.03635416, 0.0023013682, 0.01953218, 0.010028518, 0.09421262, -0.022285048, 0.038090277, -0.016491888, 0.024973793, -0.0062092217, -0.0006079076, 0.012070617, -0.010586575, 0.014457437, 0.018606706, 0.0132330125, -0.0012271218, 0.029310074, -0.06372368, 0.0027326255, -0.0036948905, -0.036406342, -0.06319537, 0.004964318, -0.016415307, -0.037278935, 0.032909706, 0.0039533414, 0.055153944, 0.0077766413, 0.04752943, -0.020662656, 0.018654374, 0.027579581, -0.009995172, -0.030590994, 0.0032329047, 0.0051636375, -0.019894218, -0.034079403, -0.004074199, 0.027430773, -0.04688943, -0.006935151, -0.037497897, 0.06076331, -0.014011966, 0.025153061, 0.02982776, 0.028703934, -0.06938982, -0.0053709135, -0.015808186, 0.054642905, 0.022894323, 0.0067457263, 0.025538877, 0.040291548, 0.06352461, -0.047891434, 0.015845068, 0.021532292, 0.029393502, 0.06712895, 0.03418742, -0.020285899, 0.04999721, -0.015223504, 0.008664959, -0.08734331, 
-0.03180259, -0.060385752, 0.045383543, 0.022910262, 0.023367167, 0.06202772, -0.041573163, -0.06919459, 0.008020132, -0.035788614, -0.022974774, 0.017490545, 0.012094521, -0.008700876, 0.023880692, 0.029243862, 0.03433071, -0.022300672, 0.02793599, 0.04557679, -0.0074306466, 0.106741145, 0.011710485, -0.022114152, -0.05479797, -0.026192376, 0.017863747, -0.059329823, -0.016414663, -0.023017433, 0.046126083, -0.037351213, -0.004975373, -0.06755006, 0.003153917, 0.05803753, -0.0030448795, -0.018085295, -0.011378983, -0.032927986, -0.0356213, -0.017273922, 0.0057821423, 0.026892597, -0.056277, -0.049027056, -0.03585269, -0.002762686, 0.022830265, 0.01952595, -0.0023518447, 0.019815316, -0.0308199, -0.056283116, -0.05646737, 0.040788855, -0.06935688, 0.0016162894, -0.07576725, 0.0021788045, -0.027595945, 0.016406938, 0.00914925, -0.030377423, -0.04072336, 0.004285908, -0.053028747, -0.043620903, 0.013743935, -0.0060078623, -0.05145278, -0.01240032, -0.0019821713, 0.0459069, -0.000397437, -0.040874127, 0.02232409, -0.021987809, 0.016709065, -0.019955955, -0.043258507, -0.012822493, 0.0015544997, 0.008750475, 0.08029463, -0.016551215, 0.0106922155, 0.04511181, 0.012863488, 0.08101362, 0.015255131, -0.0378599, -0.026632613, -0.007132961, 0.022561012, -0.035132837, -0.051327687, 0.012184595, -0.053649075, -0.06340343, -0.038676884, 0.05588606, 0.04636262, -0.0017865195, 0.0065502143, -0.0021936935, 0.037256364, -0.0117199775, 0.001200583, 0.03416475, -0.008868033, -0.09372393, 0.02795748, -0.03450235, 0.0191968, -0.04304435, -0.039260436, -0.039282955, -0.026117781, 0.028008305, 0.020413136, -0.069958426, 0.007468339, 0.03535249, -0.013168222, 0.03160398, -0.03536105, -0.012599075, 0.012694869, -0.047842525, -0.023844596, -0.002578087, 0.018921396, -0.038521487, 0.016861906, 0.018677093, 0.009483494, 0.06958433, 0.020350413, 0.002132844, -0.015051295, 0.012355237, 0.036700185, 0.041566264, 0.02166237, -0.017458526, -0.003880696, 0.021615662, 0.071755074, -0.06163507, 
-0.018081794, -0.04193382, 0.0038752954, 0.029048733, 0.0021254525, 0.028686205, 0.0021548586, 0.0153228, 0.027537907, -0.06318449, 0.013970559, -0.027223982, 0.007815344, -0.044364017, -0.009059432, 0.093419164, -0.0075638997, 0.0349247, -0.04389202, 0.03379941, 0.032499135, -0.05383602, -0.0075482763, -0.05727751, 0.029602082, -0.023207424, 0.009572965, 0.015227847, 0.017614286, 0.017081033, 0.029041866, -0.015170309, -0.016623484, -0.0040708343, 0.07524853, -0.029914094, 2.6077309e-05, 0.022064127, 0.0061305235, 0.031021086, -0.04173227, 0.034938656, 0.010578506, -0.014831277, -0.008377155, -0.055739824, 0.00076083303, 0.04864599, -0.028258063, 0.007825028, 0.06374719, -0.008117671, -0.0529908, -0.0008744555, -0.026621709, 0.057709042, -0.028861143, 0.015160184, 0.06823767, -0.03448378, -0.012032499, 0.007125862, -0.0021329958, 0.021073861, 0.035434462, -0.005443737, -0.037596166, 0.01867455, 0.021003755, -0.027144607, 0.0040687053, 0.0073787137, -0.02148215, -0.01370821, 0.028337331, 0.03438015, 0.025714653, 0.016099878, -0.1280255, -0.028904129, 0.054412697, -0.00027886606, 0.010180635, 0.006911521, -0.072732806, 0.040548593, 0.054354455, 0.022762984, 0.083944574, -0.032127433, 0.031493276, 0.025450544, 0.0014648492, -0.053174727, -0.089231305, -0.05349475, 0.02832863, -0.042825792, 0.040540442, 0.0034059666, 0.012527256, -0.0070269583, 0.004348014, -0.0012775054, 0.020653797, -0.028427351, -0.012061082, 0.08474403, 0.018355045, -0.03782953, 0.01780059, -0.03124967, -0.02972917, 0.06377995, 0.03622817, -0.038774543, -0.008168061, 0.0008756734, 0.063374326, -0.017164867, 0.03089118, -0.030859854, 0.018398924, -0.01290288, -0.013464138, 0.07005317, -0.031074032, 0.0065023694, 0.01833391, 0.086340636, 0.0103744725, -0.017039359, -0.054524146, -0.02184828, 0.03237553, -0.0051302775, -0.03880768, 0.0042719403, -0.018497659, 0.039791476, -0.0029515014, 0.047630522, 0.009005865, 0.012332961, -0.065514065, -0.017099215, 0.020070605, 0.033033818, 0.03819908, 
-0.002926078, 0.02485725, -0.02492589, 0.02524227, 0.015304151, 0.010357196, -0.0012361126, 0.06055463, -0.052393332, 0.0018451725, -0.033913672, 0.034882527, -0.040411558, 0.029277325, 0.0066468045, -0.028638732, 0.030994104, -0.0038566508, -0.026518226, 0.023721188, 0.053882364, 0.015044844, 0.009445471, 0.008423282, 0.006908645, 0.033433937, 0.0013090725, -0.043953463, 0.009207147, -0.02148394, 0.009764033, 0.026332734, 0.07524719, -0.013170156, 0.0311968, -0.028490331, -0.054174777, 0.008361257, 0.026559152, 0.0019873658, 0.010297786, 0.05164019, -0.017219178, 0.026013039, 0.029987952, 0.018734867, -0.054145347, -0.00529485, -0.027970446, -0.083762854, 0.021667952, 0.03809247, -0.016397795, 0.024940109, -0.011692556, -0.015403972, -0.037349943, 0.05150695, -0.002360465, -0.038538422, -0.033100877, -0.078322336, 0.019312337, -0.039709445, -0.028119525, 0.0075941817, 0.044093464, 0.04312815, -0.01860829, -0.014516086, -0.027146114, -0.05374159, 0.018937763, 0.055734564, -0.055264555, -0.021128979, 0.013552821, -0.023388384, 0.028102161, -0.06574926, 0.0005777815, 0.05832509, -0.021472303, 0.008772928, 0.05439286, -0.015902612, 0.0072142733, -0.030573256, 0.0070295557, -0.0415438, 0.035653543, -0.01455105, 0.020005474, 0.07986046, -0.035983853, 0.035677813, 0.011222366, -0.023504755, -0.03807093, 0.041671403, -0.017686117, -0.027882392, -0.021149794, -0.020939926, 0.005685982, 0.008118187, -0.02225789, 0.028101409, -0.0077426652, -0.058306817, -0.013176626, -0.0018302508, -0.042805985, 0.03499617, 0.006960921, 0.0521078, 0.013127357, 0.012807865, -0.0041235113, 0.0364854, -0.029623492, 0.0036852285, 0.010436406, -0.012979821, -0.06214568, 0.02496504, -0.012057285, 5.3822554e-05, 0.068691194, -0.005192032, -0.00840417, -0.03474282, -0.041815113, 0.01586503, -0.06975514, -0.022814387, -0.01747859, -0.03028506, 0.00899632, -0.030455945, -0.02612898, 0.09748062, -0.060053565, 0.073910706, -0.045878906, 0.017043162, 0.036452837, 0.033786267, 0.07532076, -0.012363524, 
0.014629511, -0.051726285, -0.060937867, 0.009702624, 0.0037173766, 0.025674582, 0.0073014116, -0.034176115, 0.014616017, 0.020726405, 0.05494357, -0.023246752, 0.023410168, 0.0058005634, -0.02368794, 0.030081518, -0.020378673, -0.0042353794, -0.02886309, 0.030399263, 0.009913847, -0.019511933, 0.039135117, -0.029321602, -0.057733648, 0.018555647, 0.04234238, -0.024444107, -0.01830207, -0.01449405, -0.055994887, -0.02198859, 0.022180306, -0.017542241, 0.020654257, -0.07833157, 0.029106235, -0.016729757, -0.037108444, -0.027721336, -0.0055683763, -0.011531618, 0.021272335, -0.010220706, -0.009008802, 0.0025321587, 0.007843418, -0.009176056, -0.035371616, 0.035749473, 0.04374431, -0.0133530535, -0.027428798, 0.052815367, 0.08725027, 0.07851333, 0.036350124, 0.022137968, -0.060914032, -0.007600696, -0.03680366, -0.030651785, -0.039126765, 0.0125571685, 0.016707521, -0.03751044, 0.03910936, 0.031825088, -0.0029502262, 0.007766306, -0.03814175, -0.07744442, -0.0054074563, 0.030857768, 0.015186418, -0.021467306, -0.015918817, -0.011368014, 0.0019515429, -0.014827345, -0.0022194595, 0.02624271, 0.0066641932, 0.019060731, 0.05616408, 0.01946424, 0.050201155, 0.016927207, -0.038973548, 0.037738074, -0.056299437, 0.00967483, -0.018431908, 0.01668865, -0.047797382, 0.015456402, 0.018490652, -0.012237633, 0.012219139, 0.026376488, -0.023400653, -0.04556467, 0.064960316, 0.0019083915, 0.05282892, 0.043711416, 0.07134014, -0.029614225, -0.040969286, 0.045358557, -0.031595223, -0.01910029, -0.061545778, 0.024926651, 0.04099986, 0.01620945, 0.042243116, 0.04717351, -0.0048233513, -0.0019400177, -0.036986295, -0.039084256, 0.025065007, 0.03538771, -0.00042725189, 0.011488134, -0.018427324, -0.019081326, -0.03214962, 0.05602044, -0.07190716, 0.006701344, -0.027393665, 0.025218304, 0.021957463, -0.08623837, 0.026349807, -0.020874213, 0.016827395, -0.051633146, -0.016894199, 0.009786709, 0.026273636, 0.013940125, 0.011852815, 0.036727257, 0.055551354, 0.008069757, -0.015989762, 
0.01829905, 0.060527466, 0.048552617, 0.01770399, -0.03225409, -0.042072583, -0.026071887, -0.036898546, -0.05571102, -0.0058531873, -0.047945924, -0.0003473616, 0.038228065, 0.018455006, 0.025350647, 0.00965173, 0.00064714643, 0.01915388, -0.015175882, -0.040475022, -0.0338664, 0.034016117}, |
As an idea, we can create a new `Validator` for asserting this, similar to `NewUniqueCid` in `tests/integration/results.go`.
fieldDef, ok := c.def.GetFieldByName(embedField) | ||
if !ok { | ||
return errors.New("field not found", errors.NewKV("field", embedField)) | ||
} |
suggestion: I would move this part to the GraphQL side to catch the error earlier and potentially provide a better user experience.
It should work pretty much the same way as we check fields for indexes. Search the project for `indexFieldFromAST`.
I would make it a "todo" actually, but our experiment... :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure what you mean by move to the GQL side. This is the generation part so the embedding on the GQL side isn't really involved prior to this.
Relevant issue(s)
Resolves #3350
Resolves #3351
Description
Sorry for having 3 technically separate features as part of 1 PR. The reason for this is that I started working on embeddings and the size constraint was initially part of the embedding. Since we discussed that it could be applied to all array fields (and later even to String and Blob), I extracted it into a constraints directive that has a size parameter (more constraints can be added in the future). Furthermore, embeddings returned from ML models are arrays of float32. This caused some precision issues because we only supported float64: saving a float32 array and then querying it would return a float64 array with slightly different values. I decided to add the float32 type.
You can review the first commit for constraint and embedding related code and the second commit for the float related changes. Some float stuff might have leaked into the first commit. Sorry for this. I tried hard to separate the float32 related changes.
Note that the `gql.Float` type is now `Float64` internally.
is the same as
The embedding generation relies on a 3rd party package called `chromem-go` to call the model provider API. As long as one of the supported provider APIs is configured and accessible, the embeddings will be generated when adding new documents. I've added a step in the test workflow that runs the embedding-specific tests on Linux only, using Ollama because it runs locally (installation on Windows and Mac is less straightforward).
The call to the API has to be done synchronously, otherwise the docID/CID won't be representative of the contents. The only alternative would be for the system to automatically update the document when returning from the API call, but I see that as an inferior option as it hides the update step from the user. It could also make doc anchoring more complicated, as the user would have to remember to wait on the doc update before anchoring the doc at the right CID.
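A simplified sketch of why the embedding has to be present before the identifier is computed. It uses a SHA-256 hash over JSON-encoded fields as a stand-in for a content identifier; this is an assumption for illustration only and does not reproduce the project's actual docID/CID computation. `contentHash` is a hypothetical helper.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// contentHash is a simplified stand-in for a content identifier: any change
// to the fields changes the hash. json.Marshal encodes map keys in sorted
// order, so the output is deterministic for a given set of fields.
func contentHash(fields map[string]any) (string, error) {
	encoded, err := json.Marshal(fields)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(encoded)
	return hex.EncodeToString(sum[:]), nil
}

func main() {
	doc := map[string]any{"name": "John"}
	before, _ := contentHash(doc)

	// Adding the generated embedding afterwards changes the content, and
	// therefore the identifier derived from it.
	doc["name_v"] = []float32{0.1, 0.2}
	after, _ := contentHash(doc)

	fmt.Println(before != after) // true
}
```

If the embedding were filled in asynchronously, the identifier handed back from the create call would describe the document without its embedding, which is the mismatch the synchronous call avoids.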
We could avoid having embedding generation support and let the users do that call themselves and store the embedding vectors directly. However, having it as a feature allows us to support RAG vector search which would let users get started with AI with very little work. This seems to be something our partners are looking forward to.
I don't see the 3rd party API call inline with a mutation as a problem since this is something that has to be configured by users and users will expect the mutation calls to take more time as a result.
If you're interested in running it locally, install Ollama and define a schema like so
Next steps:
`_similarity` operation to calculate the cosine similarity between two arrays.
Tasks
How has this been tested?
make test and manual testing
Specify the platform(s) on which this was tested: