# Step-by-Step Guide for Testing Cloud Vector Database Services

## Introduction

To compare the performance of vector search database cloud services, including MyScale, Pinecone, Weaviate Cloud, Qdrant Cloud, and Zilliz Cloud, we have developed this framework based on [qdrant/vector-db-benchmark](https://github.com/qdrant/vector-db-benchmark/). The performance tests are run against the following two datasets:

| Dataset name | Description | Number of vectors | Number of queries | Dimension | Distance | Filters | Payload columns | Download link |
|---|---|---|---|---|---|---|---|---|
| laion-768-5m-ip | Provided by MyScale. Generated from [LAION 2B images](https://huggingface.co/datasets/laion/laion2b-multi-vit-h-14-embeddings/tree/main). | 5,000,000 | 10,000 | 768 | IP | N/A | 0 | [link](https://myscale-datasets.s3.ap-southeast-1.amazonaws.com/laion-5m-test-ip.hdf5) |
| arxiv-titles-384-angular | Provided by [Qdrant](https://github.com/qdrant/ann-filtering-benchmark-datasets). Generated from arXiv titles. | 2,138,591 | 10,000 | 384 | Cosine | Match keywords and timestamp ranges | 2 | [link](https://storage.googleapis.com/ann-filtered-benchmark/datasets/arxiv_small_payload.tar.gz) |

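If you want to inspect a downloaded dataset before running a benchmark, a few lines of Python with `h5py` are enough. This is a sketch under one assumption of ours: that the files follow the common ann-benchmarks-style HDF5 layout with `train`, `test`, and `neighbors` keys.

```python
# Sketch: inspect a downloaded dataset with h5py. Assumes the common
# ann-benchmarks-style HDF5 layout ("train", "test", "neighbors" keys);
# check list(f.keys()) first if your file differs.
import h5py

with h5py.File("laion-5m-test-ip.hdf5", "r") as f:
    print(list(f.keys()))    # the datasets actually present in the file
    print(f["train"].shape)  # expected: (5000000, 768) base vectors
    print(f["test"].shape)   # expected: (10000, 768) query vectors
```
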
## Preparation

Install the required libraries on the client machine used for testing:

```shell
pip install -r requirements.txt
```

## Testing steps

For any cloud vector database, the testing process follows the flowchart below.

![test-process](assets/test-process.png)

Below are the specific testing processes for each cloud vector database.

### MyScale

#### Step 1. Create Cluster

Go to the [MyScale official website](https://myscale.com/) and create a cluster. In the [cluster console](https://console.myscale.com/clusters), record the cluster connection information: `host`, `port`, `username`, and `password`.

![myscale_cluster_info](assets/myscale_cluster_info.png)

#### Step 2. Modify the configuration

We have provided two configuration files for testing MyScale:
- [myscale_cloud_mstg_laion-768-5m-ip.json](experiments/configurations/myscale_cloud_mstg_laion-768-5m-ip.json)
- [myscale_cloud_mstg_arxiv-titles-384-angular.json](experiments/configurations/myscale_cloud_mstg_arxiv-titles-384-angular.json)

Write the cluster connection information obtained in Step 1 into both files: locate the `connection_params` section and update the values for `host`, `port`, `user`, and `password`.

Here is an example of how the modified section may look:

```json
"connection_params": {
    "host": "your_host.aws.dev.myscale.cloud",
    "port": 8443,
    "http_type": "http",
    "user": "your_username",
    "password": "your_password"
},
```

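Before starting a long benchmark run, you may want to confirm that the credentials work. Since MyScale is ClickHouse-compatible, one way is a quick check with the `clickhouse-connect` package; this is our own sketch, not part of the framework:

```python
# Sketch: verify the MyScale connection parameters before benchmarking.
# Uses clickhouse-connect (MyScale is ClickHouse-compatible); this check
# is an assumption of ours, not something the framework requires.
import clickhouse_connect

client = clickhouse_connect.get_client(
    host="your_host.aws.dev.myscale.cloud",  # same values as connection_params
    port=8443,
    username="your_username",
    password="your_password",
    secure=True,  # cloud endpoints on port 8443 are served over TLS
)
print(client.command("SELECT 1"))  # prints 1 if the cluster is reachable
```
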
#### Step 3. Run the tests

```shell
python3 run.py --engines "*myscale*"
```

#### Step 4. View the test results

```shell
cd results
grep -E 'rps|mean_precision' $(ls -t)
```

![myscale_test_result](assets/myscale_test_result.png)

### Pinecone

#### Step 1. Create Cluster

Register with [Pinecone](https://docs.pinecone.io/docs/overview) and obtain the cluster connection information: the `Environment` and the API key `Value`.

![pinecone_cluster_info](assets/pinecone_cluster_info.png)

#### Step 2. Modify the configuration

We have provided two configuration files for testing Pinecone:
- [pinecone_cloud_s1_laion-768-5m-ip.json](experiments/configurations/pinecone_cloud_s1_laion-768-5m-ip.json)
- [pinecone_cloud_s1_arxiv-titles-384-angular.json](experiments/configurations/pinecone_cloud_s1_arxiv-titles-384-angular.json)

Write the cluster connection information obtained in Step 1 into both files: locate the `connection_params` section and update the values for `api-key` and `environment`.

Here is an example of how the modified section may look:

```json
"connection_params": {
    "api-key": "your_api_key",
    "environment": "your_environment"
},
```

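To confirm the key and environment are valid before a run, a quick check is possible with the `pinecone-client` package. This sketch assumes the v2-style client API (the one that still uses environments, matching the configuration above) and is not part of the framework:

```python
# Sketch: verify Pinecone credentials before benchmarking. Assumes the
# v2-style pinecone-client API that uses environments; not part of the
# benchmark framework itself.
import pinecone

pinecone.init(api_key="your_api_key", environment="your_environment")
print(pinecone.list_indexes())  # lists the indexes visible to this key
```
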
#### Step 3. Run the tests

```shell
python3 run.py --engines "*pinecone*"
```

#### Step 4. View the test results

```shell
cd results
grep -E 'rps|mean_precision' $(ls -t)
```

![pinecone_test_result](assets/pinecone_test_result.png)

### Zilliz

#### Step 1. Create Cluster

You need to find the cluster connection information, including `end_point`, `user`, and `password`, in the [Zilliz Cloud console](https://cloud.zilliz.com/projects/MA==/databases). The `user` and `password` are the credentials you specified when creating the cluster.

![zilliz_cluster_info](assets/zilliz_cluster_info.png)

#### Step 2. Modify the configuration

We have provided two configuration files for testing Zilliz:
- [zilliz_cloud_1cu_storage_optimized_laion-768-5m-ip.json](experiments/configurations/zilliz_cloud_1cu_storage_optimized_laion-768-5m-ip.json)
- [zilliz_cloud_1cu_storage_optimized_arxiv-titles-384-angular.json](experiments/configurations/zilliz_cloud_1cu_storage_optimized_arxiv-titles-384-angular.json)

Write the cluster connection information obtained in Step 1 into both files: locate the `connection_params` section and update the values for `end_point`, `cloud_user`, and `cloud_password`.

Here is an example of how the modified section may look:

```json
"connection_params": {
    "cloud_mode": true,
    "host": "127.0.0.1",
    "port": 19530,
    "user": "",
    "password": "",
    "end_point": "https://your_host.zillizcloud.com:19538",
    "cloud_user": "your_user",
    "cloud_password": "your_password",
    "cloud_secure": true
},
```

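Zilliz Cloud is served by Milvus, so the connection can be smoke-tested with the `pymilvus` package. This is our own sketch, not part of the framework:

```python
# Sketch: verify the Zilliz Cloud connection parameters before
# benchmarking. Uses pymilvus; not part of the benchmark framework itself.
from pymilvus import connections, utility

connections.connect(
    alias="default",
    uri="https://your_host.zillizcloud.com:19538",  # the end_point value
    user="your_user",
    password="your_password",
    secure=True,
)
print(utility.list_collections())  # an empty list on a fresh cluster
```
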
#### Step 3. Run the tests

```shell
python3 run.py --engines "*zilliz*"
```

#### Step 4. View the test results

```shell
cd results
grep -E 'rps|mean_precision' $(ls -t)
```

![zilliz_test_result](assets/zilliz_test_result.png)

### Weaviate Cloud

#### Step 1. Create Cluster

Register with [Weaviate Cloud](https://console.weaviate.cloud/dashboard) and create a cluster. Record the cluster connection information: the `cluster URL` and the `Authentication` API key.

![weaviate_cluster_info](assets/weaviate_cluster_info.png)

#### Step 2. Modify the configuration

We have provided two configuration files for testing Weaviate Cloud:
- [weaviate_cloud_standard_arxiv-titles-384-angular.json](experiments/configurations/weaviate_cloud_standard_arxiv-titles-384-angular.json)
- [weaviate_cloud_standard_laion-768-5m-ip.json](experiments/configurations/weaviate_cloud_standard_laion-768-5m-ip.json)

Write the cluster connection information obtained in Step 1 into both files: locate the `connection_params` section and update the values for `host` and `api_key`. The `host` corresponds to the `cluster URL`, and the `api_key` is the `Authentication` key.

Here is an example of how the modified section may look:

```json
"connection_params": {
    "host": "https://your_host.weaviate.cloud",
    "port": 8090,
    "timeout_config": 2000,
    "api_key": "your_api_key"
},
```

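A quick way to confirm the URL and key are correct is the Weaviate Python client (the v3-style API is assumed here); this sketch is not part of the framework:

```python
# Sketch: verify the Weaviate Cloud connection before benchmarking.
# Assumes the v3-style weaviate-client API; not part of the framework.
import weaviate

client = weaviate.Client(
    url="https://your_host.weaviate.cloud",
    auth_client_secret=weaviate.AuthApiKey(api_key="your_api_key"),
)
print(client.is_ready())  # True if the cluster accepts the key
```
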
#### Step 3. Run the tests

```shell
python3 run.py --engines "*weaviate*"
```

#### Step 4. View the test results

```shell
cd results
grep -E 'rps|mean_precision' $(ls -t)
```

![weaviate_test_result](assets/weaviate_test_result.png)

### Qdrant

#### Step 1. Create Cluster

Register with [Qdrant Cloud](https://cloud.qdrant.io/) and create a cluster. Record the cluster connection information: the `URL` and the `API key`.

![qdrant_cluster_info](assets/qdrant_cluster_info.png)

#### Step 2. Modify the configuration

We have provided three configuration files for testing Qdrant:
- [qdrant_cloud_hnsw_2c16g_storage_optimized_laion-768-5m-ip.json](experiments/configurations/qdrant_cloud_hnsw_2c16g_storage_optimized_laion-768-5m-ip.json)
- [qdrant_cloud_hnsw_2c16g_storage_optimized_arxiv-titles-384-angular.json](experiments/configurations/qdrant_cloud_hnsw_2c16g_storage_optimized_arxiv-titles-384-angular.json)
- [qdrant_cloud_quantization_2c16g_storage_optimized_laion-768-5m-ip.json](experiments/configurations/qdrant_cloud_quantization_2c16g_storage_optimized_laion-768-5m-ip.json)

Write the cluster connection information obtained in Step 1 into all three files: locate the `connection_params` section and update the values for `host` and `api_key`. Note that the `host` value must not include a port: remove the `:port` suffix from the end of the cluster `URL` when copying it into `host`, since the port is configured separately.

Here is an example of how the modified section may look:

```json
"connection_params": {
    "host": "https://your_host.aws.cloud.qdrant.io",
    "port": 6333,
    "grpc_port": 6334,
    "prefer_grpc": false,
    "api_key": "your_api_key"
},
```

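The same kind of smoke test works for Qdrant with the `qdrant-client` package; again, this sketch is ours and not part of the framework:

```python
# Sketch: verify the Qdrant Cloud connection before benchmarking.
# Uses qdrant-client; not part of the benchmark framework itself.
from qdrant_client import QdrantClient

client = QdrantClient(
    url="https://your_host.aws.cloud.qdrant.io",  # host without the :port suffix
    port=6333,
    api_key="your_api_key",
)
print(client.get_collections())  # an empty collection list on a fresh cluster
```
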
#### Step 3. Run the tests

```shell
python3 run.py --engines "*qdrant*"
```

#### Step 4. View the test results

```shell
cd results
grep -E 'rps|mean_precision' $(ls -t)
```

![qdrant_test_result](assets/qdrant_test_result.png)

# vector-db-benchmark

![Screenshot from 2022-08-23 14-10-01](https://user-images.githubusercontent.com/1935623/186124809-36e5c498-6bc5-4326-b34b-dcbf13d1c1d5.png)

There are various vector search engines available, and each of them may offer a different set of features and efficiency. But how do we measure the performance? There is no universal definition: in a specific case you may care about one aspect while paying little attention to others. This project is a general framework for benchmarking different engines under the same hardware constraints, so you can choose what works best for you.

Running any benchmark requires choosing an engine, choosing a dataset, and defining the scenario against which it should be tested. A specific scenario may assume running the server in single or distributed mode, a different client implementation, and a number of client instances.

## How to run a benchmark?

Benchmarks are implemented in server-client mode, meaning that the server runs on one machine and the client runs on another.

### Run the server

All engines are served using Docker Compose. The configurations are in the [servers](./engine/servers/) directory.

To launch a server instance, run the following commands:

```bash
cd ./engine/servers/<engine-configuration-name>
docker compose up -d
```

Containers are expected to expose all necessary ports, so the client can connect to them.

### Run the client

Install dependencies:

```bash
pip install poetry
poetry install
```

Run the benchmark:

```bash
Usage: run.py [OPTIONS]

  Example: python3 -m run --engines *-m-16-* --datasets glove-*

Options:
  --engines TEXT                  [default: *]
  --datasets TEXT                 [default: *]
  --host TEXT                     [default: localhost]
  --skip-upload / --no-skip-upload
                                  [default: no-skip-upload]
  --install-completion            Install completion for the current shell.
  --show-completion               Show completion for the current shell, to
                                  copy it or customize the installation.
  --help                          Show this message and exit.
```

The command allows you to specify wildcards for engines and datasets. Results of the benchmarks are stored in the `./results/` directory.

## How to update benchmark parameters?

Each engine has a configuration file, which is used to define the parameters for the benchmark. Configuration files are located in the [configuration](./experiments/configurations/) directory.

Each step in the benchmark process uses a dedicated section of the configuration:

* `connection_params` - passed to the client during the connection phase.
* `collection_params` - parameters used to create the collection; indexing parameters are usually defined here.
* `upload_params` - parameters used to upload the data to the server.
* `search_params` - passed to the client during the search phase. The framework allows multiple search configurations for the same experiment run.

Exact values of the parameters are individual for each engine.

## How to register a dataset?

Datasets are configured in the [datasets/datasets.json](./datasets/datasets.json) file. The framework will automatically download the dataset and store it in the [datasets](./datasets/) directory.

## How to implement a new engine?

There are a few base classes that you can use to implement a new engine:

* `BaseConfigurator` - defines methods to create collections and set up indexing parameters.
* `BaseUploader` - defines methods to upload the data to the server.
* `BaseSearcher` - defines methods to search the data.

See the examples in the [clients](./engine/clients) directory, and the hypothetical sketch below.

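As a rough orientation, a new engine might be wired up along the lines of the skeleton below. All class names, method names, signatures, and the import path are illustrative assumptions of ours; consult the actual base classes and the existing clients for the real interfaces.

```python
# Hypothetical skeleton of a new engine implementation. The import path,
# method names, and signatures below are illustrative assumptions; check
# the real base classes in this repository before implementing.
from engine.base_client import BaseConfigurator, BaseSearcher, BaseUploader

class MyEngineConfigurator(BaseConfigurator):
    def recreate(self, dataset, collection_params):
        ...  # drop and re-create the collection, applying index parameters

class MyEngineUploader(BaseUploader):
    @classmethod
    def upload_batch(cls, ids, vectors, metadata):
        ...  # push one batch of vectors (and optional payloads) to the server

class MyEngineSearcher(BaseSearcher):
    @classmethod
    def search_one(cls, vector, meta_conditions, top):
        ...  # run a single query, returning (id, score) pairs
```
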
Once all the necessary classes are implemented, you can register the engine in the [ClientFactory](./engine/clients/client_factory.py).