Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Store compressed vectors in dense ByteSequence for PQVectors #370

Merged
merged 9 commits into from
Dec 2, 2024

Conversation

michaeljmarshall
Copy link
Collaborator

@michaeljmarshall michaeljmarshall commented Nov 23, 2024

The current design stores the compressed vectors using a list of ByteSequence objects where each object represents a single vector. That costs us an overhead of 16 bytes for the ByteSequence wrapper class and 16 more bytes for the array in the ArrayByteSequence. In the case of a 128 dimension vector, we compress PQ down to 64 bytes, so the cost of the wrapper/array is 32 bytes out of the total 96 bytes, which is 33% of the total size. In cases where we have many of these vectors in memory, an extra 33% is quite large.

This PR proposes removing the per-vector wrapper and opts for a more dense representation by

  1. Storing the vectors in an array of ByteSequence references where the ByteSequence has as many vectors as it can possibly hold.
  2. Introducing a ByteSequence#slice method to allow for getting a slice of the whole ByteSequence. This method is intended to be light weight.
  3. Introduce ArraySliceByteSequence to store the offset and length information to simplify element retrieval.
  4. Introducing a ByteSequence#offset method to be used in conjunction with the get method to ensure that the Panama and SimdOps methods do not need to perform an array copy when passing the array to Panama.
  5. Added assertions in the native code when we call get on a MemorySegmentByteSequence since we're technically not passing the offset. FWIW, we could consider updating the c to take a long offset and a long length since those MemorySegments can exceed int length.

We also make some minor breaking changes to the VectorCompression interface to allow implementations to construct the CompressedVectors object directly. This change simplifies most usages of the encodeAll method.

This PR is a more invasive alternative to #369, which only removed one of the two wrappers.

Rejected alternatives

Before implementing the PR this way, I tried a few alternatives.

  1. Use MemorySegment in the PQVectors to remove the chunking required because we can have more PQVector bytes than int max, which pushes us to store the compressedVectors field as a ByteSequence[] instead of a constant ByteSequence. Further, a MemorySegment would have allowed for a simpler API because of its support in the Panama API. However, this solution proved challenging due to compilation requirements and concerns about using a preview api (MemorySegment) in java 20.
  2. Attempt to use a ByteBuffer instead of ArraySliceByteSequence. This is plausible, but really just comes down to API design preference. It felt a bit clunky to add it in. The main advantage is that if we add this, we can avoid the offset method on the ByteSequence interface, which might be convenient. One way we could add it is by making PQVectors know how to create a ByteBuffer wrapping a slice of the compressedVectors data.

@michaeljmarshall
Copy link
Collaborator Author

@jbellis @jkni @marianotepper - this is ready for review. The other alternative, described above, where we use a ByteBuffer instead of creating an ArraySliceByteSource seems viable, but comes down to preference. Let me know what you think.

return Objects.hash(pq, compressedVectors);
// We don't use the array structure in the hash code calculation because we allow for different chunking
// strategies. Instead, we use the first entry in the first chunk to provide a stable hash code.
return Objects.hash(pq, count(), count() > 0 ? get(0).get(0) : 0);
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you're worried about the performance of hashing millions of vectors, I'd prefer adding the first codepoint from each vector, to using all the first vector's codepoints

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ended up just hashing based on every value since that is the norm. If it becomes a performance bottleneck, we can address it then--I doubt we compute the hash code frequently.

// TODO how do we want to determine equality? With the current change, we are willing to write one
// thing and materialize another. It seems like the real concern should be whether the compressedVectors have
// the same data, not whether they are in a MemorySegment or a byte[] and not whether there is one chunk of many
// vectors or many chunks of one vector. This technically goes against the implementation of each of the
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with changing the ByteSequence equals implementations to match. You get the class comparison by default when intellij generates it for you and I think we just ran with that.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I implemented this, but it's not as elegant as I wanted. I could have made the ByteSequence interface an abstract class and overridden equals/hashCode, but that seemed like it went too far, so I added default methods that implementations of the interface can call. In practice, I doubt these objects are compared for strict equality frequently.

@jbellis
Copy link
Owner

jbellis commented Nov 25, 2024

I like where this is going, I'm just concerned that it doesn't break compatibility enough. :)

That is, the PQVectors API is hard to use "correctly" now since the only method that gives you One Big Array is load(). (Technically the constructor used by load is public but everyone else just calls it with an array of one vector per chunk.)

I think we should push the changes into VectorCompressor to solve this. Instead of introducing a new class to abstract long[] vs ByteSequence, encodeAll should return CompressedVectors directly. (By far the most common way encodeAll is used is to immediately wrap the result in CV anyway.) I think there's a code path or two where you don't need the full CV but it's not a big deal to construct one for those cases.

CompressedVectors is not generic so VC wouldn't have to be either, which makes me happy.

@jbellis
Copy link
Owner

jbellis commented Nov 25, 2024

I'm 80% sure that it's correct to go ahead and change the API and release 3.1 instead of doing unnatural contortions to maintain compatibility.

@jbellis
Copy link
Owner

jbellis commented Nov 25, 2024

Also I do not insist that BQ change to the chunking approach. If we can conform to the "right" API without doing that, it's fine to leave optimizing BQ until somebody needs it.

@michaeljmarshall
Copy link
Collaborator Author

@jbellis - how would this work in the CompactionGraph? I left the hybrid creation method in place because it seemed like we needed it for that class, but perhaps we just need to add an add method to the PQVectors instead of letting them mutate the ArrayList (the current design).

@jbellis
Copy link
Owner

jbellis commented Nov 25, 2024

Yes, I think an add() or a set() method would be straightforward since CG knows up front how many it's going to generate. This is a better design than leaking the underlying List to be mutated, but that was the path of least resistance at one point!

@jbellis
Copy link
Owner

jbellis commented Nov 26, 2024

Is it possible to unify encode/set to void encodeTo(VectorFloat, CompressedVectors, int) to avoid the copy?

@michaeljmarshall
Copy link
Collaborator Author

I had started to look at that, but stopped because of our logic to for encodedOmittedVector. Turns out that is just setting a zero vector, so as long as we have an on-heap data structure, we're good to go, but we might not want to assume that for long since an off heap MemorySegment looks like it might have undefined values for regions that are not explicitly set.

Getting the change written up now.

@michaeljmarshall
Copy link
Collaborator Author

Note: I'll update the PR description tomorrow to align with recent commits.

@jbellis
Copy link
Owner

jbellis commented Nov 26, 2024

LGTM

do you mind updating UPGRADING and README?

public long[][] encodeAll(RandomAccessVectorValues ravv, ForkJoinPool simdExecutor) {
return simdExecutor.submit(() -> IntStream.range(0, ravv.size())
public CompressedVectors encodeAll(RandomAccessVectorValues ravv, ForkJoinPool simdExecutor) {
var cv = simdExecutor.submit(() -> IntStream.range(0, ravv.size())
.parallel()

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

out of curiosity:
does it really help to use "parallel" here ?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Massively.

@jbellis jbellis merged commit 78274e6 into jbellis:main Dec 2, 2024
6 checks passed
@michaeljmarshall michaeljmarshall deleted the one-big-array branch December 2, 2024 16:32
@jbellis
Copy link
Owner

jbellis commented Dec 2, 2024

LGTM, merged

Thoughts for if/when we revisit this:

  1. I don't love PQVectors being mutable or not based on which constructor was called. Would factory methods improve the clarity? Separate classes?
  2. Could we eliminate ArrayByteSequence and just make everything an ArraySliceBS? I don't think there's a path left where we care about the overhead from the extra fields and I'd rather have a single implementation instead of having to make fixes like setLittleEndianShort in two places.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants