
New norm disk format for zap #27

Merged
merged 3 commits into blevesearch:master on Oct 1, 2020

Conversation

sreekanth-cb
Contributor

@sreekanth-cb sreekanth-cb commented Jul 28, 2020

The current norm encoding consistently leaves a footprint greater than 5 bytes on disk, irrespective of the actual field length.

This bloat can be avoided if we encode the field length directly to disk and decode it accordingly at read time.

More than a 30% reduction in index size is observed with this change.

The change is also restricted to zap itself.

As this fix changes the meaning of the data present on disk, it mandates a zap version bump.


There could be variations in the norm's precision stemming from floating-point precision loss in the uint64 to float64 conversions for some larger field length values, but we are trading that off for the simpler conversion logic.

Any alternate suggestions appreciated.
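
For readers skimming the thread, here is a minimal sketch of the two encodings being compared. The function names are illustrative rather than the actual zap code, and the sketch assumes the stored uint64 is varint-encoded, which is why the float32 bit pattern of a small norm needs a full 5-byte varint while a typical field length fits in 1 or 2 bytes:

package main

import (
	"fmt"
	"math"
)

// encodeNormOld mimics the existing scheme: compute 1/sqrt(fieldLen) up front
// and persist the float32 bit pattern. For norms below 1.0 the bit pattern is
// a large uint32, so it needs a full 5-byte varint on disk.
func encodeNormOld(fieldLen int) uint64 {
	norm := float32(1.0 / math.Sqrt(float64(fieldLen)))
	return uint64(math.Float32bits(norm))
}

// encodeNormNew mimics the proposed scheme: persist the raw field length,
// which varint-encodes to 1 or 2 bytes for typical documents.
func encodeNormNew(fieldLen int) uint64 {
	return uint64(fieldLen)
}

// decodeNormNew recomputes the norm from the stored field length at read time.
func decodeNormNew(stored uint64) float32 {
	return float32(1.0 / math.Sqrt(float64(stored)))
}

func main() {
	for _, fieldLen := range []int{1, 12, 250, 4096} {
		fmt.Printf("len=%d oldBits=%d newBits=%d decodedNorm=%f\n",
			fieldLen, encodeNormOld(fieldLen), encodeNormNew(fieldLen),
			decodeNormNew(encodeNormNew(fieldLen)))
	}
}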
new.go Outdated
@@ -210,7 +210,7 @@ type interimStoredField struct {

 type interimFreqNorm struct {
 	freq uint64
-	norm float32
+	norm uint64
Contributor

As discussed, can leave this float32.

new.go Outdated
@@ -456,7 +456,7 @@ func (s *interim) processDocument(docNum uint64,
 	// now that it's been rolled up into fieldTFs, walk that
 	for fieldID, tfs := range fieldTFs {
 		dict := s.Dicts[fieldID]
-		norm := float32(1.0 / math.Sqrt(float64(fieldLens[fieldID])))
+		norm := uint64(fieldLens[fieldID])
Contributor

Here we can use

norm := math.Float32frombits(fieldLens[fieldID])
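
Combining this with the earlier suggestion to keep the struct field a float32, a small self-contained sketch (the literal values below are stand-ins, not the surrounding zap code):

package main

import (
	"fmt"
	"math"
)

// interimFreqNorm keeps its float32 field as suggested; the raw field length
// travels inside the float32 bit pattern rather than as a numeric value.
type interimFreqNorm struct {
	freq uint64
	norm float32
}

func main() {
	fieldLen := 128 // stand-in for fieldLens[fieldID]
	fn := interimFreqNorm{
		freq: 3,
		norm: math.Float32frombits(uint32(fieldLen)),
	}
	// The exact field length is recoverable from the bits later on.
	fmt.Println(math.Float32bits(fn.norm)) // prints 128
}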

@sreekanth-cb
Contributor Author

@mschoch, I tried to move the norm decoding logic out of zap and added it in bleve here: blevesearch/bleve#1432

posting.go Outdated
@@ -478,8 +478,7 @@ func (i *PostingsIterator) nextAtOrAfter(atOrAfter uint64) (segment.Posting, err
 	if err != nil {
 		return nil, err
 	}

-	rv.norm = math.Float32frombits(uint32(normBits))
+	rv.norm = float32(normBits)
Contributor

Sorry, this is still incorrect. normBits contains the integer field length; we cannot just stick it directly into a float32 here, as the upstream scoring code expects the actual computed norm. This line needs to be:

rv.norm = float32(1.0 / math.Sqrt(float64(normBits)))

normBits holds our field length as an integer. We put it into a float64 so that we can use math.Sqrt, compute the inverse, and then stick that result in a float32.

I think if we do this, the higher-level scoring code will not need any changes.

Contributor Author

@sreekanth-cb sreekanth-cb Aug 7, 2020

@mschoch, I think we need to sync.
One problem with the approach of decoding it here is that, while merging, the merger takes the decoded norm from the postings and then has to encode it back to the field length before persisting to disk. So if we keep doing this encode/decode within zap, precision will be compromised for larger field lengths (e.g. field length > 2400).

I had made similar changes in the first commit, but then we decided that the norm decoding logic would be moved out of zap.

Now the norm decoding logic is in bleve.
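
To make the round-trip concern concrete, here is a small self-contained sketch. The reencode step assumes a truncating inversion of the norm, which may differ from the actual merge code; whichever field lengths fail the check depends purely on float32 rounding, which is the precision issue described above:

package main

import (
	"fmt"
	"math"
)

// decode turns a stored field length into the norm the scorer expects.
func decode(fieldLen uint64) float32 {
	return float32(1.0 / math.Sqrt(float64(fieldLen)))
}

// reencode inverts decode so a merger could persist the field length again.
// This assumed (truncating) inversion is where larger field lengths can end
// up off by one after the decode/re-encode cycle.
func reencode(norm float32) uint64 {
	return uint64(1.0 / (float64(norm) * float64(norm)))
}

func main() {
	// Report any field length that does not survive the round trip.
	for fieldLen := uint64(1); fieldLen <= 5000; fieldLen++ {
		if got := reencode(decode(fieldLen)); got != fieldLen {
			fmt.Printf("field length %d round-trips to %d\n", fieldLen, got)
		}
	}
}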

Contributor Author

This code functionally works fine with the corresponding bleve change.

Contributor

What corresponding change in bleve are you referring to?

Contributor Author

I have already mentioned it above:

@mschoch, I tried to move the norm decoding logic out of zap and added it in bleve here: blevesearch/bleve#1432

@mschoch
Contributor

mschoch commented Aug 7, 2020

Ah, in that case I'm a hard -1. We're making the dependencies between bleve and zap worse and digging a hole we'll never get out of.

@sreekanth-cb
Contributor Author

hehe... :-)

I had the same feeling earlier, and hence I made the change contained within zap. For me that is always the best solution.

But you were proposing that the decoding logic would be better outside of zap, ideally in the scorer/search pkg, in the context of the blugelag experiments.

Now, the main challenge in making a zap-contained change is the precision loss between write (new.go) / read (posting.go) / write (merge.go), and that change is available in commit 1 here.

Any thoughts on improving that, or any other proposals?

@mschoch
Contributor

mschoch commented Aug 7, 2020

OK, here are 2 thoughts:

First, even with this current approach, you should not use rv.norm = float32(normBits); it will work most of the time, but it still risks losing some precision because you're actually converting an int to a float. Instead, it would be safer to use math.Float32frombits(uint32(normBits)), and since you have higher-level logic to process this anyway, you can unpack it later. The benefit is that during a merge the bits never actually change, whereas in your current approach the bits go from the int encoding of N to the float encoding of N and back again.
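
A quick self-contained illustration of that bit-preservation point (the values are arbitrary):

package main

import (
	"fmt"
	"math"
)

func main() {
	// normBits as read from disk would be the raw field length.
	var normBits uint64 = 2401

	// Float32frombits is a bit-level reinterpretation, not a numeric
	// conversion, so packing the bits back out is exact.
	viewed := math.Float32frombits(uint32(normBits))
	fmt.Println(uint64(math.Float32bits(viewed)) == normBits) // true

	// A numeric conversion, by contrast, can silently round once the value
	// exceeds what a float32 mantissa can hold exactly.
	big := uint64(1)<<25 + 1
	fmt.Println(uint64(float32(big))) // 33554432, not 33554433
}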

Second, instead of changing the way zap encodes the norm value, we should take this opportunity to move that logic outside of zap. Basically, when calling zap.New(), you would also pass in a function with the signature func(int) float32. While building the original segments, zap invokes this function on the field length to produce the norm value. All internal operations of zap operate on this float32 value unchanged, and this exact value is returned by the Norm() method on a posting. Then (and this is the part that is still debatable), at a suitable higher-level location, you have to use the correct decode() method to properly work with this value. We would need a legacy version available for all segments written before this new API change was introduced.

Why go through all this extra work? Because we expect that the norm encoding could change again in the future, possibly soon, so the if-then-else logic decoding the norm value inside scorch is just going to get worse. With this change it's a bit more structured: you always have to pair the correct encode/decode functions, with just one exception for legacy segments. Thoughts?
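
As a rough sketch of what that hook might look like (the type and function names below are hypothetical, not an actual zap or bleve API):

package main

import (
	"fmt"
	"math"
)

// NormEncoder is the proposed hook shape: zap would call it on the field
// length while building a segment, and carry the resulting float32 through
// every internal operation (merges included) unchanged.
type NormEncoder func(fieldLength int) float32

// rawLengthEncoder stores the field length bit-for-bit inside the float32.
func rawLengthEncoder(fieldLength int) float32 {
	return math.Float32frombits(uint32(fieldLength))
}

// rawLengthDecoder is the matching decode a higher layer would pair with
// rawLengthEncoder to recover the norm used for scoring.
func rawLengthDecoder(norm float32) float32 {
	return float32(1.0 / math.Sqrt(float64(math.Float32bits(norm))))
}

// legacyEncoder reproduces the pre-change behavior of persisting the
// precomputed norm itself, for segments written before the format bump.
func legacyEncoder(fieldLength int) float32 {
	return float32(1.0 / math.Sqrt(float64(fieldLength)))
}

func main() {
	var enc NormEncoder = rawLengthEncoder
	stored := enc(250)                    // what zap would persist and return from Norm()
	fmt.Println(rawLengthDecoder(stored)) // what the scorer would work with
	fmt.Println(legacyEncoder(250))       // same norm, old encoding path
}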

@sreekanth-cb
Contributor Author

sreekanth-cb commented Sep 29, 2020

@mschoch, trying to make another round of changes before our sync-ups. The thinking is to contain this change without any higher New() API-level changes.
The idea is that the original Norm() API would continue to return and behave the same way as it does today, i.e. it returns the decoded norm value as per the given version of zap.

Zap, for all its internal uses, need not depend on the bleve-level Segment.Posting implementation. It could have its own internal implementation to read whatever norm details it needs, perhaps?

Contributor

@mschoch mschoch left a comment

I had a different thought about whether we even need the private interface, see inline...

merge.go Outdated
@@ -487,7 +487,12 @@ func mergeTermFreqNormLocs(fieldsMap map[string]uint16, term []byte, postItr *Po
 		newRoaring.Add(uint32(hitNewDocNum))

 		nextFreq := next.Frequency()
-		nextNorm := uint64(math.Float32bits(float32(next.Norm())))
+		var nextNorm uint64
+		if pi, ok := next.(postingsAdv); ok {
Contributor

Sorry, I didn't notice this during the discussion. Why do we even need the private interface? Couldn't we simply type assert to the implementation struct?

next.(*Posting)

merge.go Outdated
+		if pi, ok := next.(postingsAdv); ok {
+			nextNorm = pi.NormValue()
+		} else {
+			return 0, 0, 0, nil, fmt.Errorf("internal norm read error")
Contributor

I would make the error message clearer, something like:

fmt.Errorf("unexpected posting type %T", next)

posting.go Outdated
 }

+// NormValue returns the undecorated, raw norm value for the internal use in zap.
+func (p *Posting) NormValue() uint64 {
Contributor

How about we call this NormUint64()?
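
Taken together, the suggestions in this review (asserting to the concrete type, the clearer error message, and the NormUint64 name) might land as something like the following self-contained sketch; the types here are stand-ins rather than the real zap ones:

package main

import "fmt"

// Posting is a stand-in for the interface the merger iterates over.
type Posting interface {
	Frequency() uint64
}

// zapPosting is a stand-in for the concrete *Posting implementation in zap.
type zapPosting struct {
	freq     uint64
	normBits uint64
}

func (p *zapPosting) Frequency() uint64 { return p.freq }

// NormUint64 returns the raw, undecoded norm bits for internal use in zap.
func (p *zapPosting) NormUint64() uint64 { return p.normBits }

// rawNorm type-asserts to the concrete implementation, as suggested, and
// reports the unexpected type in the error message.
func rawNorm(next Posting) (uint64, error) {
	if p, ok := next.(*zapPosting); ok {
		return p.NormUint64(), nil
	}
	return 0, fmt.Errorf("unexpected posting type %T", next)
}

func main() {
	norm, err := rawNorm(&zapPosting{freq: 3, normBits: 42})
	fmt.Println(norm, err)
}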

value access within zap. Processed/decoded value of norm would still be available over the existing Norm() method.
@sreekanth-cb
Contributor Author

done!

Contributor

@mschoch mschoch left a comment

Nicely done. Really pleased with how small this change ended up being.

@sreekanth-cb sreekanth-cb merged commit 468c784 into blevesearch:master Oct 1, 2020