Remote Vector Index Build Component -- Repository Integration Low Level Design #2465
Comments
@jed326 thanks for putting up the details. On the download part, we should see how we can parallelize; otherwise, for larger graphs it will reduce the performance of the system.
Thanks @navneet1v, how about instead something like: "As for the download part, I've mentioned a few potential improvements in #2464; we will revisit this in the near future based on perf benchmarking."
The first step here is to build an initial POC to validate the following:
For this I have created 2 POCs. With POC 2 I have also done some preliminary benchmarking to validate the parallel upload performance, using a 1 shard, 0 replica setup on AWS
Tangential to the POC and benchmarking, I also want to highlight some of the nuances specific to the S3 parallel upload implementation and list out the various performance tuning knobs.

Chunk Sizing

Today by default a 16mb chunk size is used for parallel uploads (ref), so in the benchmarking above the 2.9GB file was split into ~180 parts. There is no way to directly control the number of parts, however the part size is adjustable. This is configurable via the

Memory Usage

The number of parts that are uploaded in parallel is determined by the size of the threadpool processing these uploads. This also depends on the priority of the write context we use, but for example the

If the repository is configured to allow upload retries, configurable via the
On an updated POC that does not load vectors on heap:
@jed326 is this good or bad as compared to your older runs?
@navneet1v It's much better; the "bad" version took 37789 ms on the same setup. 7k ms for 2.9 GB is approximately 400 MB/s for upload (including all of the disk reads), which is a good speed. We will see if we can push this even higher with tuning in the future.
Thanks for the update. Benchmark numbers look awesome then.
Updated POC: jed326@29a8230

Updated numbers:
Overview
This is the low level design follow-up to #2392. Specifically, the following are covered:
User Experience
First, we define the user experience for how a user can configure and control the remote vector build feature.
Vector Repository Configuration
We will expose a cluster setting for users to indicate the name of the (registered) repository they would like to use as the vector repository. A user must register the repository on their own.
It is difficult to get a reference to the `RepositoriesService` from `KNNSettings`, so for now we will not validate that the repository specified here is registered. We will leave this for #2464.
From #2392 we outlined that we will use the `vectors` path outside of the indices path to avoid collision with any snapshots, so a user will still be able to use this repository for snapshots if they wish (although we will recommend against it).

There are many knobs exposed by the specific repository implementations that can be used for performance tuning. For example, repository-s3 allows configuring the chunk size, buffer size, etc. (https://opensearch.org/docs/latest/api-reference/snapshots/create-repository/#s3-repository). We will also perform performance benchmarks and provide performance tuning suggestions in #2464.
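For illustration, a minimal sketch of how such a cluster setting could be defined in the plugin (the setting key `knn.remote_index_build.vector_repo` is a hypothetical placeholder, not the final name):

```java
import org.opensearch.common.settings.Setting;

public class KNNRemoteIndexBuildSettings {
    // Hypothetical setting key -- placeholder name for illustration only.
    // Dynamic so users can point at a different registered repository without a restart.
    public static final Setting<String> KNN_REMOTE_VECTOR_REPO_SETTING = Setting.simpleString(
        "knn.remote_index_build.vector_repo",
        Setting.Property.NodeScope,
        Setting.Property.Dynamic
    );
}
```

A user would register a repository (e.g. an s3 repository) through the usual snapshot repository APIs and then set this cluster setting to that repository's name.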
Feature Controls
The various settings with which a user can enable/disable this feature are listed below.
Visibility
Because we are giving field level controls over this feature, ideally we should also provide field level visibility/metrics. However, since k-NN stats today only support node level stats, in the first version we will provide only node level stats, and users will need to reference the node level stats to understand whether the remote vector index build feature is being used successfully. Shard/index level metrics will be explored as a part of #2464.
Previously outlined metrics:
Proposed Metric Names
Additionally, we will also add corresponding "remote" versions of the existing merge/refresh metrics:
k-NN/src/main/java/org/opensearch/knn/plugin/stats/StatNames.java, lines 46 to 47 (at commit 9a52b2b)
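As a rough illustration of what the "remote" counterparts might look like (the stat names below are hypothetical placeholders following the existing pattern, not the final names):

```java
// Hypothetical sketch -- in practice these would be new constants in StatNames.java.
public enum RemoteIndexBuildStatNames {
    REMOTE_MERGE_TIME("remote_merge_time_in_millis"),      // placeholder name
    REMOTE_REFRESH_TIME("remote_refresh_time_in_millis");  // placeholder name

    private final String name;

    RemoteIndexBuildStatNames(String name) {
        this.name = name;
    }

    public String getName() {
        return name;
    }
}
```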
Implementation
This section covers the low level abstractions that we will implement. We will create a new `RemoteVectorIndexBuilder` component in the KNN Plugin that will manage both the `RepositoriesService` as well as the new remote vector service client. Below is an overview of how the new components will fit into the existing k-NN plugin:

A reference to this new `RemoteVectorIndexBuilder` component will be passed down to the vectors writers, and the `RemoteVectorIndexBuilder` will be responsible for the repository read/write paths and the remote build service interactions described in the sections below.
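As a rough sketch of how this component might be wired up (everything except `RepositoriesService` is an assumed name for illustration; the remote client abstraction is not yet designed):

```java
import java.util.function.Supplier;

import org.opensearch.repositories.RepositoriesService;
import org.opensearch.repositories.Repository;

// Stub for the not-yet-designed remote build service client.
interface RemoteVectorServiceClient {}

// Hypothetical sketch of the new component.
public class RemoteVectorIndexBuilder {
    // Supplier, since RepositoriesService is not yet available when the plugin is constructed.
    private final Supplier<RepositoriesService> repositoriesService;
    private final RemoteVectorServiceClient remoteBuildClient;

    public RemoteVectorIndexBuilder(
        Supplier<RepositoriesService> repositoriesService,
        RemoteVectorServiceClient remoteBuildClient
    ) {
        this.repositoriesService = repositoriesService;
        this.remoteBuildClient = remoteBuildClient;
    }

    // Resolve the user-configured vector repository by name from cluster settings.
    Repository vectorRepository(String repositoryName) {
        return repositoriesService.get().repository(repositoryName);
    }
}
```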
Repositories Service

This section contains all of the specific low level design decisions related to the repository service and reading/writing vectors to a repository.
Converting Vector Values to `InputStream`

The OpenSearch repository interface uses the Java `InputStream` interface to read and write from a given repository, which implements methods that can read an arbitrary number of bytes in a sequential manner (see javadoc). In the KNN plugin we maintain a generic `KNNVectorValues` class that acts as an abstraction over the various Lucene `DocIdSetIterator`s (for example `FloatVectorValues`), and this `KNNVectorValues` is what we use in the native engines index writer to iterate over the vectors during merge or flush. This means that ultimately we need to convert the `KNNVectorValues` into an `InputStream` in order to write the vector values to the repository; we will refer to this for now as the `VectorValuesInputStream`. The rest of this section discusses how we can do so while keeping memory consumption in check.

A naive solution would be to iterate through all of the vectors, copying them into an array, and then use that array as the backing data structure to read from in the `VectorValuesInputStream`; however, the size of this array would run into memory limitations for larger segments. For example, each 1k dimension fp32 vector takes 4k bytes to store, so with 10m documents the vectors would take up 40GB, exceeding the heap space of many typical setups.

If we look at the repository-s3 implementation as an example, when uploading large blobs S3 will split the blob into multiple configurable buffer sizes and perform a multipart upload (ref). By default the repository-s3 buffer size is between 5mb and 5% of JVM heap size (ref), so this also means that the `VectorValuesInputStream` will not need to read more than this buffer size from `KNNVectorValues` at a time.

In the POC, we back the `VectorValuesInputStream` with a single vector sized byte buffer which is refilled one vector at a time; from the analysis above, we can instead maintain a buffer between 5mb and 1% of JVM heap size, similar to what is being done for the vector transfer to JNI today (see: #1506).
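As a rough illustration, a minimal sketch of this buffered conversion (the `VectorIterator` interface below is a simplified stand-in for `KNNVectorValues`, not its real API):

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;

// Simplified stand-in for KNNVectorValues: a sequential iterator over per-document vectors.
interface VectorIterator {
    boolean next() throws IOException;              // advance to the next live document
    byte[] currentVectorBytes() throws IOException; // serialized bytes of the current vector
}

// Sketch: expose the vectors as a sequential InputStream, refilling a small buffer
// one vector at a time so heap usage is bounded by a single vector (or a fixed buffer).
public class VectorValuesInputStream extends InputStream {
    private final VectorIterator vectors;
    private ByteBuffer buffer = ByteBuffer.allocate(0);

    public VectorValuesInputStream(VectorIterator vectors) {
        this.vectors = vectors;
    }

    @Override
    public int read() throws IOException {
        return refill() ? buffer.get() & 0xFF : -1;
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        if (!refill()) {
            return -1;
        }
        int n = Math.min(len, buffer.remaining());
        buffer.get(b, off, n);
        return n;
    }

    // Load the next vector into the buffer once the current one is exhausted.
    private boolean refill() throws IOException {
        while (!buffer.hasRemaining()) {
            if (!vectors.next()) {
                return false;
            }
            buffer = ByteBuffer.wrap(vectors.currentVectorBytes());
        }
        return true;
    }
}
```

Growing the buffer to the 5mb-to-1%-of-heap size discussed above only changes the refill step; the stream contract stays the same.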
Parallel Blob Upload

Today there are 2 write methods in the `BlobContainer` interface, `asyncBlobUpload` and `writeBlob`, with the differences outlined below:

`asyncBlobUpload`:

- Writes the `InputStream`s in parallel via a queueing mechanism
- Requires an `AsyncMultiStreamBlobContainer`, which only repository-s3 implements today
- Requires N `InputStream`s of part size `S` over the same file

`writeBlob`:

- Writes a single `InputStream` sequentially; supported by all repository implementations

Remote Store Reference

Original Parallel Upload Design: opensearch-project/OpenSearch#6632

Based on the performance analysis in opensearch-project/OpenSearch#6632, we are going to need the parallel upload feature for performance. That said, in the POC where vectors are buffered one by one, the transfer of ~1.6m 768 dimension vectors only takes ~1 minute to complete, so we can revisit the performance aspect here as needed. For comparison, the single threaded POC in opensearch-project/OpenSearch#6632 took 110s to upload 4.5GB.
First, `writeBlob` must be used as a fallback regardless; otherwise (at least for now) repository-s3 would be the only repository implementation supported by the feature. Unfortunately, for `asyncBlobUpload` the implementation will not be as straightforward, for the following reasons:

- With `asyncBlobUpload` the JVM utilization becomes `num_threads * partSize`, but neither of these is directly controllable by repository settings today, so based on benchmarking results we may need to implement additional knobs here.
- It requires multiple `InputStream`s over the same `KNNVectorValues`, which itself is only a sequential iterator.
- Each part is described by an `InputStreamContainer` object which contains the length of the input stream, and this length is required in the S3 upload request. However, given a `KNNVectorValues` iterator with `M` live docs, if we want to take a subset of `K` doc ids, `1 < K < M`, it's impossible to know how many docs are present between doc ids 1 through `K` without iterating through `K`, due to the potential presence of deleted docs. This is actually how the native engines vector writer gets the live docs count today (ref). This means that if we want to split a given `KNNVectorValues` into N `InputStream`s, we have to iterate through each part in order to get the content length to be used in the S3 upload.

The specific implementation of the parallel upload would then look like this (a sketch follows the list):

1. Split the vector values into N parts of `K` doc ids each; the corresponding `InputStreamContainer`s will know the number of live docs within each part via iterating through the vector values.
2. To read the N `InputStream`s in parallel, we will use N corresponding `KNNVectorValues` iterators.
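A rough sketch of the first pass under the same stand-in iterator assumption as earlier (all names here are illustrative, not the plugin's actual API):

```java
import java.io.IOException;

// Stand-in for a KNNVectorValues-style iterator that also exposes the current doc id.
interface DocVectorIterator {
    boolean next() throws IOException; // advance to the next live document
    int docId();                       // doc id of the current document
}

public final class PartLengths {
    /**
     * Pass 1: iterate the vector values once to compute each part's content length.
     * Deleted docs mean a doc id range's live-doc count (and therefore its byte
     * length) cannot be known without actually iterating through the range.
     */
    public static long[] compute(DocVectorIterator it, int[] partStartDocIds, int bytesPerVector)
            throws IOException {
        long[] lengths = new long[partStartDocIds.length];
        int part = 0;
        while (it.next()) {
            // Advance to the part whose doc id range contains the current doc.
            while (part + 1 < partStartDocIds.length && it.docId() >= partStartDocIds[part + 1]) {
                part++;
            }
            lengths[part] += bytesPerVector; // feeds this part's InputStreamContainer length
        }
        return lengths;
    }
}
```

Pass 2 would then open N independent iterators, each advancing to its part's first doc id and backing one `InputStream` of the precomputed length so the parts can upload in parallel.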
Writing Doc Ids

In addition to the vector values, we also need to write the doc ids to the remote repository. However, unlike the vector values, the doc ids have a much smaller, bounded size, so for the doc ids we do not need to buffer in smaller parts; we can instead take the naive approach above of writing all the doc ids in one go. Below is copied from the same JNI memory improvement issue as above:
Read InputStream to IndexOutput
Similar to `writeBlob`, the repository `readBlob` method also returns an `InputStream`, and we must use this `InputStream` to write to an `IndexOutput` in order to write the graph file to disk. Similarly, we need to do so in a way that keeps the JVM utilization under control, so we can take the same approach described above in Converting Vector Values to `InputStream` and buffer the bytes to disk in chunks. Specific `IndexOutput` implementations already buffer chunks to disk; however, in order to keep our memory management solution generic, we should implement this chunking logic within the remote index builder as well, so that whether or not chunking happens does not depend on the Directory or Repository implementation.
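A minimal sketch of the chunked copy (the 1mb chunk size is an arbitrary placeholder; the real size would follow the heap-based sizing discussed above):

```java
import java.io.IOException;
import java.io.InputStream;

import org.apache.lucene.store.IndexOutput;

public final class RepositoryGraphReader {
    // Copy a repository InputStream to a Lucene IndexOutput one fixed-size chunk at a
    // time, so only a single chunk is held on heap regardless of the blob's size.
    public static void copyToIndexOutput(InputStream input, IndexOutput output) throws IOException {
        byte[] chunk = new byte[1024 * 1024]; // placeholder chunk size
        int read;
        while ((read = input.read(chunk)) != -1) {
            output.writeBytes(chunk, 0, read);
        }
    }
}
```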
Similar to `asyncBlobUpload`, there is also a corresponding `readBlobAsync` to be used for parallel multipart download of a blob. However, this API is still marked as experimental and is unused due to limitations related to encryption. For now, we will only use the `readBlob` implementation with a similar buffered approach to `writeBlob` to keep memory utilization in check. For #2464 there are 2 tracks we can explore: