
New norm disk format for zap #27

Merged
merged 3 commits into blevesearch:master on Oct 1, 2020

Conversation

sreekanth-cb
Contributor

@sreekanth-cb sreekanth-cb commented Jul 28, 2020

The current norm encoding consistently leaves a footprint greater than 5 bytes on disk, irrespective of the actual field length.

This bloat can be avoided if we encode the field length directly to disk and decode it accordingly at read time.

More than a 30% reduction in index size is observed with this change.

The change is also restricted to zap itself.

As this fix changes the meaning of the data present on disk, it mandates a zap version bump.


There could be variations in the norm's precision stemming from floating-point precision loss in the uint64 to float64 conversions for some larger field length values, but we are trading that off for the simpler conversion logic.

Any alternate suggestions appreciated.
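
For readers skimming the thread, here is a minimal sketch of the two encodings being compared. The function names are illustrative rather than the actual zap code, and the sketch assumes the stored uint64 is varint-encoded, which is why the float32 bit pattern of a small norm needs a full 5-byte varint while a typical field length fits in 1 or 2 bytes:

package main

import (
	"fmt"
	"math"
)

// encodeNormOld mimics the existing scheme: compute 1/sqrt(fieldLen) up front
// and persist the float32 bit pattern. For norms below 1.0 the bit pattern is
// a large uint32, so it needs a full 5-byte varint on disk.
func encodeNormOld(fieldLen int) uint64 {
	norm := float32(1.0 / math.Sqrt(float64(fieldLen)))
	return uint64(math.Float32bits(norm))
}

// encodeNormNew mimics the proposed scheme: persist the raw field length,
// which varint-encodes to 1 or 2 bytes for typical documents.
func encodeNormNew(fieldLen int) uint64 {
	return uint64(fieldLen)
}

// decodeNormNew recomputes the norm from the stored field length at read time.
func decodeNormNew(stored uint64) float32 {
	return float32(1.0 / math.Sqrt(float64(stored)))
}

func main() {
	for _, fieldLen := range []int{1, 12, 250, 4096} {
		fmt.Printf("len=%d oldBits=%d newBits=%d decodedNorm=%f\n",
			fieldLen, encodeNormOld(fieldLen), encodeNormNew(fieldLen),
			decodeNormNew(encodeNormNew(fieldLen)))
	}
}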
new.go Outdated
@@ -210,7 +210,7 @@ type interimStoredField struct {

 type interimFreqNorm struct {
 	freq uint64
-	norm float32
+	norm uint64
Contributor

As discussed, can leave this float32.

new.go Outdated
@@ -456,7 +456,7 @@ func (s *interim) processDocument(docNum uint64,
 	// now that it's been rolled up into fieldTFs, walk that
 	for fieldID, tfs := range fieldTFs {
 		dict := s.Dicts[fieldID]
-		norm := float32(1.0 / math.Sqrt(float64(fieldLens[fieldID])))
+		norm := uint64(fieldLens[fieldID])
Contributor

Here we can use

norm := math.Float32frombits(fieldLens[fieldID])
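
Combining this with the earlier suggestion to keep the struct field a float32, a small self-contained sketch (the literal values below are stand-ins, not the surrounding zap code):

package main

import (
	"fmt"
	"math"
)

// interimFreqNorm keeps its float32 field as suggested; the raw field length
// travels inside the float32 bit pattern rather than as a numeric value.
type interimFreqNorm struct {
	freq uint64
	norm float32
}

func main() {
	fieldLen := 128 // stand-in for fieldLens[fieldID]
	fn := interimFreqNorm{
		freq: 3,
		norm: math.Float32frombits(uint32(fieldLen)),
	}
	// The exact field length is recoverable from the bits later on.
	fmt.Println(math.Float32bits(fn.norm)) // prints 128
}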

@sreekanth-cb
Contributor Author

@mschoch, I tried to move the norm decoding logic out of zap and added it in bleve here: blevesearch/bleve#1432

posting.go Outdated
@@ -478,8 +478,7 @@ func (i *PostingsIterator) nextAtOrAfter(atOrAfter uint64) (segment.Posting, err
 	if err != nil {
 		return nil, err
 	}

-	rv.norm = math.Float32frombits(uint32(normBits))
+	rv.norm = float32(normBits)
Contributor

Sorry, this is still incorrect. normBits contains the integer field length; we cannot just stick it directly into a float32 here, as the upstream scoring code expects the actual computed norm. This line needs to be:

rv.norm = float32(1.0 / math.Sqrt(float64(normBits)))

normBits holds our field length as an integer. We put it into a float64 so that we can use math.Sqrt, compute the inverse, and then stick that result in a float32.

I think if we do this, the higher-level scoring code will not need any changes.

Contributor Author

@sreekanth-cb sreekanth-cb Aug 7, 2020

@mschoch, I think we need to sync.
One problem with the approach of decoding it here is that, while merging, the merger takes the decoded norm from the postings and then has to encode it back to the field length before persisting to disk. So if we keep doing this encode/decode within zap, precision will be compromised for larger field lengths (e.g. field length > 2400).

I had made similar changes in the first commit, but then we decided that the norm decoding logic would be moved out of zap.

Now the norm decoding logic is in bleve.
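
To make the round-trip concern concrete, here is a small self-contained sketch. The reencode step assumes a truncating inversion of the norm, which may differ from the actual merge code; whichever field lengths fail the check depends purely on float32 rounding, which is the precision issue described above:

package main

import (
	"fmt"
	"math"
)

// decode turns a stored field length into the norm the scorer expects.
func decode(fieldLen uint64) float32 {
	return float32(1.0 / math.Sqrt(float64(fieldLen)))
}

// reencode inverts decode so a merger could persist the field length again.
// This assumed (truncating) inversion is where larger field lengths can end
// up off by one after the decode/re-encode cycle.
func reencode(norm float32) uint64 {
	return uint64(1.0 / (float64(norm) * float64(norm)))
}

func main() {
	// Report any field length that does not survive the round trip.
	for fieldLen := uint64(1); fieldLen <= 5000; fieldLen++ {
		if got := reencode(decode(fieldLen)); got != fieldLen {
			fmt.Printf("field length %d round-trips to %d\n", fieldLen, got)
		}
	}
}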

Contributor Author

This code functionally works fine with the corresponding bleve change.

Contributor

What corresponding change in bleve are you referring to?

Contributor Author

I have already mentioned it above:

@mschoch, I tried to move the norm decoding logic out of zap and added it in bleve here: blevesearch/bleve#1432

@mschoch
Contributor

mschoch commented Aug 7, 2020

Ah, in that case I'm a hard -1. We're making the dependencies between bleve and zap worse and digging a hole we'll never get out of.

@sreekanth-cb
Contributor Author

hehe... :-)

I had the same feeling earlier, and hence I made the change contained within zap. For me that is always the best solution.

But you were proposing that the decoding logic would be better outside of zap, ideally in the scorer/search pkg, in the context of the blugelag experiments.

Now, the main challenge in making a zap-contained change is the precision loss between write (new.go) / read (posting.go) / write (merge.go), and that change is available in commit 1 here.

Any thoughts on improving that, or any other proposals?

@mschoch
Contributor

mschoch commented Aug 7, 2020

OK, here are 2 thoughts:

First, even with this current approach, you should not use rv.norm = float32(normBits); it will work most of the time, but it still risks losing some precision because you're actually converting an int to a float. Instead, it would be safer to use math.Float32frombits(uint32(normBits)), and since you have higher-level logic to process this anyway, you can unpack it later. The benefit is that during a merge the bits never actually change, whereas in your current approach the bits go from the int encoding of N to the float encoding of N and back again.
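
A quick self-contained illustration of that bit-preservation point (the values are arbitrary):

package main

import (
	"fmt"
	"math"
)

func main() {
	// normBits as read from disk would be the raw field length.
	var normBits uint64 = 2401

	// Float32frombits is a bit-level reinterpretation, not a numeric
	// conversion, so packing the bits back out is exact.
	viewed := math.Float32frombits(uint32(normBits))
	fmt.Println(uint64(math.Float32bits(viewed)) == normBits) // true

	// A numeric conversion, by contrast, can silently round once the value
	// exceeds what a float32 mantissa can hold exactly.
	big := uint64(1)<<25 + 1
	fmt.Println(uint64(float32(big))) // 33554432, not 33554433
}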

Second, instead of changing the way zap encodes the norm value, we should take this opportunity to move that logic outside of zap. Basically, when calling zap.New(), you would also pass in a function with the signature func(int) float32. While building the original segments, zap invokes this function on the field length to produce the norm value. All internal operations of zap operate on this float32 value unchanged, and this exact value is returned by the Norm() method on a posting. Then (and this is the part that is still debatable), at a suitable higher-level location, you have to use the correct decode() method to properly work with this value. We would need a legacy version available for all segments written before this new API change was introduced.

Why go through all this extra work? Because we expect that the norm encoding could change again in the future, possibly soon, so the if-then-else logic decoding the norm value inside scorch is just going to get worse. With this change it's a bit more structured: you always have to pair the correct encode/decode functions, with just one exception for legacy segments. Thoughts?
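
As a rough sketch of what that hook might look like (the type and function names below are hypothetical, not an actual zap or bleve API):

package main

import (
	"fmt"
	"math"
)

// NormEncoder is the proposed hook shape: zap would call it on the field
// length while building a segment, and carry the resulting float32 through
// every internal operation (merges included) unchanged.
type NormEncoder func(fieldLength int) float32

// rawLengthEncoder stores the field length bit-for-bit inside the float32.
func rawLengthEncoder(fieldLength int) float32 {
	return math.Float32frombits(uint32(fieldLength))
}

// rawLengthDecoder is the matching decode a higher layer would pair with
// rawLengthEncoder to recover the norm used for scoring.
func rawLengthDecoder(norm float32) float32 {
	return float32(1.0 / math.Sqrt(float64(math.Float32bits(norm))))
}

// legacyEncoder reproduces the pre-change behavior of persisting the
// precomputed norm itself, for segments written before the format bump.
func legacyEncoder(fieldLength int) float32 {
	return float32(1.0 / math.Sqrt(float64(fieldLength)))
}

func main() {
	var enc NormEncoder = rawLengthEncoder
	stored := enc(250)                    // what zap would persist and return from Norm()
	fmt.Println(rawLengthDecoder(stored)) // what the scorer would work with
	fmt.Println(legacyEncoder(250))       // same norm, old encoding path
}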

@sreekanth-cb
Contributor Author

sreekanth-cb commented Sep 29, 2020

@mschoch, trying to make another round of changes before our sync-ups. The thinking is to contain this change without any higher New() API-level changes.
The idea is that the original Norm() API would continue to return and behave the same way as it does today, i.e. it returns the decoded norm value as per the given version of zap.

Zap, for all its internal uses, need not depend on the bleve-level Segment.Posting implementation. It could have its own internal implementation to read whatever norm details it needs, perhaps?

Contributor

@mschoch mschoch left a comment

I had a different thought about whether we even need the private interface, see inline...

merge.go Outdated
@@ -487,7 +487,12 @@ func mergeTermFreqNormLocs(fieldsMap map[string]uint16, term []byte, postItr *Po
 		newRoaring.Add(uint32(hitNewDocNum))

 		nextFreq := next.Frequency()
-		nextNorm := uint64(math.Float32bits(float32(next.Norm())))
+		var nextNorm uint64
+		if pi, ok := next.(postingsAdv); ok {
Contributor

Sorry, I didn't notice this during the discussion. Why do we even need the private interface? Couldn't we simply type assert to the implementation struct?

next.(*Posting)

merge.go Outdated
+		if pi, ok := next.(postingsAdv); ok {
+			nextNorm = pi.NormValue()
+		} else {
+			return 0, 0, 0, nil, fmt.Errorf("internal norm read error")
Contributor

I would make the error message clearer, something like:

fmt.Errorf("unexpected posting type %T", next)

posting.go Outdated
 }

+// NormValue returns the undecorated, raw norm value for the internal use in zap.
+func (p *Posting) NormValue() uint64 {
Contributor

How about we call this NormUint64()?
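
Taken together, the suggestions in this review (asserting to the concrete type, the clearer error message, and the NormUint64 name) might land as something like the following self-contained sketch; the types here are stand-ins rather than the real zap ones:

package main

import "fmt"

// Posting is a stand-in for the interface the merger iterates over.
type Posting interface {
	Frequency() uint64
}

// zapPosting is a stand-in for the concrete *Posting implementation in zap.
type zapPosting struct {
	freq     uint64
	normBits uint64
}

func (p *zapPosting) Frequency() uint64 { return p.freq }

// NormUint64 returns the raw, undecoded norm bits for internal use in zap.
func (p *zapPosting) NormUint64() uint64 { return p.normBits }

// rawNorm type-asserts to the concrete implementation, as suggested, and
// reports the unexpected type in the error message.
func rawNorm(next Posting) (uint64, error) {
	if p, ok := next.(*zapPosting); ok {
		return p.NormUint64(), nil
	}
	return 0, fmt.Errorf("unexpected posting type %T", next)
}

func main() {
	norm, err := rawNorm(&zapPosting{freq: 3, normBits: 42})
	fmt.Println(norm, err)
}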

value access within zap. Processed/decoded value of norm would still be available over the existing Norm() method.
@sreekanth-cb
Contributor Author

done!

Contributor

@mschoch mschoch left a comment

Nicely done. Really pleased with how small this change ended up being.

@sreekanth-cb sreekanth-cb merged commit 468c784 into blevesearch:master Oct 1, 2020