The current norm encoding consistently produces a footprint of more than 5 bytes on disk, irrespective of the actual field length. This bloat can be avoided by encoding the field length directly to disk and decoding it correctly at read time. More than a 30% reduction in index size is observed with this change. The change is also restricted to zap itself. As this fix changes the meaning of data present on disk, it mandates a zap version bump. There could be variations in the norm's precision stemming from floating-point precision loss in the uint64-to-float64 conversion at higher field length values, but that is traded off for the simpler conversion logic. Any alternate suggestions appreciated.
new.go
Outdated
@@ -210,7 +210,7 @@ type interimStoredField struct {

 type interimFreqNorm struct {
 	freq uint64
-	norm float32
+	norm uint64
As discussed, can leave this float32.
new.go
Outdated
@@ -456,7 +456,7 @@ func (s *interim) processDocument(docNum uint64,
 	// now that it's been rolled up into fieldTFs, walk that
 	for fieldID, tfs := range fieldTFs {
 		dict := s.Dicts[fieldID]
-		norm := float32(1.0 / math.Sqrt(float64(fieldLens[fieldID])))
+		norm := uint64(fieldLens[fieldID])
Here we can use
norm := math.Float32frombits(fieldLens[fieldID])
@mschoch, tried to move the norm decoding logic out of zap and added that in bleve here - blevesearch/bleve#1432
posting.go
Outdated
@@ -478,8 +478,7 @@ func (i *PostingsIterator) nextAtOrAfter(atOrAfter uint64) (segment.Posting, err
 	if err != nil {
 		return nil, err
 	}

-	rv.norm = math.Float32frombits(uint32(normBits))
+	rv.norm = float32(normBits)
Sorry, this is still incorrect. The normBits contains the integer length; we cannot just stick it directly into a float32 here, as the upstream scoring code expects the actual computed norm. This line needs to be:

rv.norm = float32(1.0 / math.Sqrt(float64(normBits)))

The normBits has our field length as an integer. We put that into a float64 so that we can use math.Sqrt, compute the inverse, then stick that result in a float32.
I think if we do this, the higher-level scoring code will not need any changes.
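The suggested decode step can be sketched as a standalone helper (the function name is illustrative, not from the PR):

```go
package main

import (
	"fmt"
	"math"
)

// decodeNorm turns the on-disk field length back into the computed
// norm the upstream scoring code expects: 1/sqrt(fieldLength).
func decodeNorm(normBits uint64) float32 {
	return float32(1.0 / math.Sqrt(float64(normBits)))
}

func main() {
	fmt.Println(decodeNorm(1), decodeNorm(4)) // prints: 1 0.5
}
```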
@mschoch, I think we need to sync up.
One problem with the approach of decoding it here is that, while merging, the merger takes the decoded norm from the postings and then has to encode it back to the field length before persisting to disk. So if we keep doing this encode/decode within zap, precision will be compromised for higher-order field lengths (e.g. field length > 2400).
I made similar changes in the first commit, but then we decided that the norm decoding logic would be moved out of zap.
Now the norm decoding logic is in bleve.
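The round-trip precision concern can be illustrated with a small sketch (names are illustrative): each merge cycle would decode the stored length into a float32 norm and then re-derive the length from that norm, and float32 rounding plus integer truncation is where repeated cycles can drift.

```go
package main

import (
	"fmt"
	"math"
)

// roundTrip simulates one write/read/write cycle: encode the field
// length as a float32 norm, then recover the length from that norm.
func roundTrip(fieldLen uint64) uint64 {
	norm := float32(1.0 / math.Sqrt(float64(fieldLen)))
	// Recover via length = 1/norm^2; the truncation here is where
	// repeated cycles can accumulate error.
	return uint64(1.0 / (float64(norm) * float64(norm)))
}

func main() {
	for _, l := range []uint64{100, 2400, 1000000} {
		fmt.Println(l, roundTrip(l)) // recovered value may be off by one
	}
}
```

Even a single cycle can land one below the original length; keeping the raw integer length inside zap and decoding only at the consumer avoids the repeated conversion entirely.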
This code functionally works fine with the corresponding bleve change.
What corresponding change in bleve are you referring to?
I have already mentioned it above -
@mschoch , Tried to move the norm decoding logic out of zap and added that in bleve here - blevesearch/bleve#1432
Ah, in that case I'm a hard -1. We're making the dependencies between bleve and zap worse and digging a hole we'll never get out of.
hehe... :-) I had the same feeling earlier, and hence I made the change contained within zap. For me that is always the best solution. But you were proposing that the decoding logic should better be outside of zap, ideally in the scorer/search pkg, in the context of the bluge experiments. Now the main challenge in making a zap-contained change is the precision loss between write (new.go) / read (posting.go) / write (merge.go), and that change is available in commit 1 here. Any thoughts on improving that, or any other proposals?
OK, here are 2 thoughts.

First, even with this current approach, you should not use

Second, instead of changing the way zap encodes the norm value, we should take this opportunity to move that logic outside of zap. Basically, when calling zap.New(), you would also pass in a function of signature

Why go through all this extra work? Because we expect that the norm encoding could change again in the future, possibly soon. So that if-then-else logic decoding the norm value inside scorch is just going to get worse. With this change, it's a bit more structured: you always have to pair the correct encode/decode functions, with just one exception for legacy segments. Thoughts?
@mschoch , trying to make another round of changes before our sync ups.. Thinking is to contain this change without any higher New() api level changes.. Zap for all its internal use, need not depend on the bleve level Segment.Posting implementation. |
I had a different thought about whether we even need the private interface, see inline...
merge.go
Outdated
@@ -487,7 +487,12 @@ func mergeTermFreqNormLocs(fieldsMap map[string]uint16, term []byte, postItr *Po
 	newRoaring.Add(uint32(hitNewDocNum))

 	nextFreq := next.Frequency()
-	nextNorm := uint64(math.Float32bits(float32(next.Norm())))
+	var nextNorm uint64
+	if pi, ok := next.(postingsAdv); ok {
Sorry, I didn't notice this during the discussion. Why do we even need the private interface? Couldn't we simply type assert to the implementation struct?
next.(*Posting)
merge.go
Outdated
+	if pi, ok := next.(postingsAdv); ok {
+		nextNorm = pi.NormValue()
+	} else {
+		return 0, 0, 0, nil, fmt.Errorf("internal norm read error")
I would make the error message clearer, something like:
fmt.Errorf("unexpected posting type %T", next)
posting.go
Outdated
 }

+// NormValue returns the undecorated, raw norm value for the internal use in zap.
+func (p *Posting) NormValue() uint64 {
How about we call this NormUint64()?
value access within zap. The processed/decoded value of the norm would still be available via the existing Norm() method.

done!
Nicely done. Really pleased with how small this change ended up being.