Commit

add README

MochiXu committed May 12, 2023
1 parent a114453 commit f0751c7
Showing 14 changed files with 300 additions and 80 deletions.
286 changes: 206 additions & 80 deletions README.md
# Step-by-Step Guide for Testing Cloud Vector Database Services

[//]: # (![Benchmark Results]&#40;images/benchmark_results.png&#41;)

## Introduction
To compare the performance of vector database cloud services, including MyScale, Pinecone, Weaviate Cloud,
Qdrant Cloud, and Zilliz Cloud, we developed this framework based on
[qdrant/vector-db-benchmark](https://github.com/qdrant/vector-db-benchmark/).
The performance tests are run against the following two datasets:

| Dataset name | Description | Number of vectors | Number of queries | Dimension | Distance | Filters | Payload columns | Download link |
|--------------------------|-------------------------------------------------------------------------------------------------------------------------------------------|-------------------|-------------------|-----------|----------|-------------------------------------|-----------------|---------------------------------------------------------------------------------------------------|
| laion-768-5m-ip | Provided by MyScale. Generated from [LAION 2B images](https://huggingface.co/datasets/laion/laion2b-multi-vit-h-14-embeddings/tree/main). | 5,000,000 | 10,000 | 768 | IP | N/A | 0 | [link](https://myscale-datasets.s3.ap-southeast-1.amazonaws.com/laion-5m-test-ip.hdf5) |
| arxiv-titles-384-angular | Provided by [Qdrant](https://github.com/qdrant/ann-filtering-benchmark-datasets). Generated from arXiv titles. | 2,138,591 | 10,000 | 384 | Cosine | Match keywords and timestamp ranges | 2 | [link](https://storage.googleapis.com/ann-filtered-benchmark/datasets/arxiv_small_payload.tar.gz) |
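
If you want to sanity-check a downloaded dataset before running a benchmark, the HDF5 file can be inspected with `h5py`. This is only a sketch: the `train`/`test`/`neighbors` keys follow the common ann-benchmarks layout and are an assumption here, so print the actual keys first (the arxiv dataset is a tar.gz archive with a different layout).

```python
# Quick inspection of a downloaded HDF5 dataset (sketch; the key names are an
# assumption based on the usual ann-benchmarks layout -- check f.keys()).
import h5py

with h5py.File("laion-5m-test-ip.hdf5", "r") as f:
    print("keys:", list(f.keys()))
    for key in ("train", "test", "neighbors"):
        if key in f:
            # Print the shape and dtype of each dataset we recognise.
            print(key, f[key].shape, f[key].dtype)
```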

## Preparation
Install the required libraries on the client machine used for testing:
```shell
pip install -r requirements.txt
```

## Testing steps
For every cloud vector database, the testing process follows the flowchart below.
![](images/cloud%20test%20steps.png)
The sections below describe the specific testing process for each service.
### MyScale
#### Step1. Create Cluster
Go to the [MyScale official website](https://myscale.com/) and create a cluster.
In the [cluster console](https://console.myscale.com/clusters),
record the cluster connection information: `host`, `port`, `username`, and `password`.
![MyScaleConsole.jpg](images/MyScaleConsole.jpg)
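
Optionally, you can verify the recorded credentials before touching any configuration. The snippet below is a sketch, not part of the framework: it uses the `clickhouse-connect` driver (MyScale is ClickHouse-compatible), and the host, port, and `secure=True` setting are placeholders to adjust to your cluster.

```python
# Optional sanity check of the MyScale connection details from Step 1.
# Sketch only: clickhouse-connect is not required by the framework, and the
# host/port/secure values below are placeholders -- adjust to your cluster.
import clickhouse_connect

client = clickhouse_connect.get_client(
    host="your_host.aws.dev.myscale.cloud",
    port=8443,
    username="your_username",
    password="your_password",
    secure=True,  # assumption: the cluster endpoint is served over HTTPS
)
print(client.command("SELECT version()"))
```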

#### Step2. Modify the configuration
We have provided two configuration files for testing MyScale:
- [myscale_cloud_mstg_laion-768-5m-ip.json](experiments/configurations/myscale_cloud_mstg_laion-768-5m-ip.json)
- [myscale_cloud_mstg_arxiv-titles-384-angular.json](experiments/configurations/myscale_cloud_mstg_arxiv-titles-384-angular.json)

Write the cluster connection information obtained in Step 1 into the configuration files:
open each file, locate the `connection_params` section, and update the values for
`host`, `port`, `user`, and `password`.

Here is an example of how the modified section may look:

```json
"connection_params": {
  "host": "your_host.aws.dev.myscale.cloud",
  "port": 8443,
  "http_type": "http",
  "user": "your_username",
  "password": "your_password"
},
```
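
If you prefer to script this instead of editing the JSON by hand, a small helper like the sketch below works. It assumes the configuration file is a JSON document (or a list of experiment objects) containing a `connection_params` section, as shown above; the path and credential values are placeholders. The same approach applies to the Pinecone, Zilliz, Weaviate Cloud, and Qdrant Cloud configurations later in this guide.

```python
# Sketch: write the connection details into a configuration file
# programmatically. The path and credential values are placeholders.
import json
from pathlib import Path

config_path = Path("experiments/configurations/myscale_cloud_mstg_laion-768-5m-ip.json")
config = json.loads(config_path.read_text())

# Handle either a single experiment object or a list of experiments.
experiments = config if isinstance(config, list) else [config]
for experiment in experiments:
    experiment["connection_params"].update({
        "host": "your_host.aws.dev.myscale.cloud",
        "port": 8443,
        "user": "your_username",
        "password": "your_password",
    })

config_path.write_text(json.dumps(config, indent=2))
```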

#### Step3. Run the tests
```shell
python3 run.py --engines *myscale*
```

#### Step4. View the test results
```shell
cd results
grep -E 'rps|mean_precision' $(ls -t)
```
![MyScaleResuts.jpg](images/MyScaleResuts.jpg)
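
If you want the metrics in a structured form rather than raw grep output, the sketch below collects them in Python. It only assumes what the grep above already relies on: the files in `./results/` are JSON documents that contain `rps` and `mean_precision` fields somewhere. Adjust it if your result files differ; the same snippet works for the other engines below.

```python
# Sketch: collect rps / mean_precision from the result files without grep.
# Assumes the files under ./results/ are JSON and contain those fields.
import json
from pathlib import Path

def find_metrics(obj, wanted=("rps", "mean_precision")):
    """Recursively yield (key, value) pairs whose key contains a wanted substring."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            if any(w in key for w in wanted):
                yield key, value
            else:
                yield from find_metrics(value, wanted)
    elif isinstance(obj, list):
        for item in obj:
            yield from find_metrics(item, wanted)

for path in sorted(Path("results").glob("*")):
    if not path.is_file():
        continue
    try:
        data = json.loads(path.read_text())
    except (json.JSONDecodeError, UnicodeDecodeError):
        continue  # skip anything that is not a JSON result file
    metrics = dict(find_metrics(data))
    if metrics:
        print(path.name, metrics)
```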

### Pinecone
#### Step1. Create Cluster
Register with [Pinecone](https://docs.pinecone.io/docs/overview) and obtain the connection information
from the console: the `Environment` and the API key `Value`.
![PineconeConsole.jpg](images/PineconeConsole.jpg)

#### Step2. Modify the configuration
We have provided two configuration files for testing Pinecone:
- [pinecone_cloud_s1_laion-768-5m-ip.json](experiments/configurations/pinecone_cloud_s1_laion-768-5m-ip.json)
- [pinecone_cloud_s1_arxiv-titles-384-angular.json](experiments/configurations/pinecone_cloud_s1_arxiv-titles-384-angular.json)

Write the connection information obtained in Step 1 into the configuration files:
modify the `connection_params` section of each file and update the values for `environment` and `api-key`.

Here is an example of how the modified section may look:
```json
"connection_params": {
  "api-key": "your_api_key",
  "environment": "your_environment"
},
```
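
As with MyScale, you can optionally confirm the credentials before running the benchmark. This is a sketch using the v2-style `pinecone` Python client (newer client releases changed this entry point); the values are placeholders.

```python
# Optional sanity check of the Pinecone credentials (sketch; v2-style client).
import pinecone

pinecone.init(api_key="your_api_key", environment="your_environment")
print(pinecone.list_indexes())  # an empty list is expected before the first run
```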

#### Step3. Run the tests
```shell
python3 run.py --engines *pinecone*
```

#### Step4. View the test results
```shell
cd results
grep -E 'rps|mean_precision' $(ls -t)
```
![PineconeResults.jpg](images/PineconeResults.jpg)

### Zilliz
#### Step1. Create Cluster
Create a cluster in the [Zilliz Cloud console](https://cloud.zilliz.com/projects/MA==/databases) and record
the cluster connection information: `end_point`, `user`, and `password`.
The `user` and `password` are the credentials you specified when creating the cluster.
![ZillizConsole.jpg](images/ZillizConsole.jpg)

#### Step2. Modify the configuration
We have provided two configuration files for testing Zilliz:
- [zilliz_cloud_1cu_storage_optimized_laion-768-5m-ip.json](experiments/configurations/zilliz_cloud_1cu_storage_optimized_laion-768-5m-ip.json)
- [zilliz_cloud_1cu_storage_optimized_arxiv-titles-384-angular.json](experiments/configurations/zilliz_cloud_1cu_storage_optimized_arxiv-titles-384-angular.json)

Write the cluster connection information obtained in Step 1 into the configuration files:
open each file, locate the `connection_params` section, and update the values for
`end_point`, `cloud_user`, and `cloud_password`.

Here is an example of how the modified section may look:

```json
"connection_params": {
  "cloud_mode": true,
  "host": "127.0.0.1",
  "port": 19530,
  "user": "",
  "password": "",
  "end_point": "https://your_host.zillizcloud.com:19538",
  "cloud_user": "your_user",
  "cloud_password": "your_password",
  "cloud_secure": true
},
```
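
An optional connectivity check with `pymilvus` (the client library Zilliz Cloud is compatible with) is sketched below; the endpoint and credentials are placeholders, and the exact connect arguments may vary with your pymilvus version.

```python
# Optional sanity check of the Zilliz Cloud credentials (sketch only).
from pymilvus import connections, utility

connections.connect(
    alias="default",
    uri="https://your_host.zillizcloud.com:19538",
    user="your_user",
    password="your_password",
    secure=True,
)
print(utility.list_collections())  # empty before the first benchmark run
```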

#### Step3. Run the tests
```shell
python3 run.py --engines *zilliz*
```

#### Step4. View the test results
```shell
cd results
grep -E 'rps|mean_precision' $(ls -t)
```
![ZillizResults.jpg](images/ZillizResults.jpg)

### Weaviate Cloud
#### Step1. Create Cluster
Register with [Weaviate Cloud](https://console.weaviate.cloud/dashboard) and create a cluster.
Record the cluster connection information: the `cluster URL` and the `Authentication` API key.
![WeaviateConsole.jpg](images/WeaviateConsole.jpg)

#### Step2. Modify the configuration
We have provided two configuration files for testing Weaviate Cloud:
- [weaviate_cloud_standard_arxiv-titles-384-angular.json](experiments/configurations/weaviate_cloud_standard_arxiv-titles-384-angular.json)
- [weaviate_cloud_standard_laion-768-5m-ip.json](experiments/configurations/weaviate_cloud_standard_laion-768-5m-ip.json)

Write the cluster connection information obtained in Step 1 into the configuration files:
modify the `connection_params` section of each file and update the values for `host` and `api_key`.
The `host` corresponds to the cluster URL, and the `api_key` is the `Authentication` key.

Here is an example of how the modified section may look:

```json
"connection_params": {
  "host": "https://your_host.weaviate.cloud",
  "port": 8090,
  "timeout_config": 2000,
  "api_key": "your_api_key"
},
```
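
An optional connectivity check with the v3-style `weaviate-client` is sketched below; the URL and API key are placeholders, and the auth helper path may differ in other client versions.

```python
# Optional sanity check of the Weaviate Cloud credentials (sketch; v3-style client).
import weaviate

client = weaviate.Client(
    url="https://your_host.weaviate.cloud",
    auth_client_secret=weaviate.auth.AuthApiKey(api_key="your_api_key"),
)
print(client.is_ready())  # True means the cluster is reachable
```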

#### Step3. Run the tests
```shell
python3 run.py --engines *weaviate*
```

#### Step4. View the test results
```shell
cd results
grep -E 'rps|mean_precision' $(ls -t)
```
![WeaviateResults.jpg](images/WeaviateResults.jpg)

### Qdrant
#### Step1. Create Cluster
Register with [Qdrant Cloud](https://cloud.qdrant.io/) and create a cluster.
Record the cluster connection information: `URL` and `API key`.
![QdrantConsole.jpg](images/QdrantConsole.jpg)

#### Step2. Modify the configuration
We have provided three configuration files for testing Qdrant:
- [qdrant_cloud_hnsw_2c16g_storage_optimized_laion-768-5m-ip.json](experiments/configurations/qdrant_cloud_hnsw_2c16g_storage_optimized_laion-768-5m-ip.json)
- [qdrant_cloud_hnsw_2c16g_storage_optimized_arxiv-titles-384-angular.json](experiments/configurations/qdrant_cloud_hnsw_2c16g_storage_optimized_arxiv-titles-384-angular.json)
- [qdrant_cloud_quantization_2c16g_storage_optimized_laion-768-5m-ip.json](experiments/configurations/qdrant_cloud_quantization_2c16g_storage_optimized_laion-768-5m-ip.json)

Write the cluster connection information obtained in Step 1 into the configuration files:
modify the `connection_params` section of each file and update the values for `host` and `api_key`.
Note that the `host` value must not include the port; remove the port suffix (e.g. `:6333`) from the end of the cluster URL.
Here is an example of how the modified section may look:

```json
"connection_params": {
  "host": "https://your_host.aws.cloud.qdrant.io",
  "port": 6333,
  "grpc_port": 6334,
  "prefer_grpc": false,
  "api_key": "your_api_key"
},
```
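
An optional connectivity check with `qdrant-client` is sketched below; the URL and API key are placeholders.

```python
# Optional sanity check of the Qdrant Cloud credentials (sketch only).
from qdrant_client import QdrantClient

client = QdrantClient(
    url="https://your_host.aws.cloud.qdrant.io",
    port=6333,
    api_key="your_api_key",
)
print(client.get_collections())  # empty collection list before the first run
```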

#### Step3. Run the tests
```shell
python3 run.py --engines *qdrant*
```

#### Step4. View the test results
```shell
cd results
grep -E 'rps|mean_precision' $(ls -t)
```
![QdrantResults.jpg](images/QdrantResults.jpg)
94 changes: 94 additions & 0 deletions README_OLD.md
# vector-db-benchmark
![Benchmark Results](images/benchmark_results.png)

There are various vector search engines available, and each of them offers a different set of
features and levels of efficiency. But how do we measure performance? There is no single
definition: in a specific use case you may care a lot about one aspect while paying little
attention to others. This project is a general framework for benchmarking different engines
under the same hardware constraints, so you can choose what works best for you.

Running any benchmark requires choosing an engine, a dataset, and defining the
scenario against which it should be tested. A specific scenario may involve
running the server in single-node or distributed mode, a different client
implementation, and a varying number of client instances.

## How to run a benchmark?

Benchmarks are implemented in server-client mode, meaning that the server runs
on one machine and the client runs on another.

### Run the server

All engines are served using docker compose. The configurations are in the [servers](./engine/servers/) directory.

To launch the server instance, run the following command:

```bash
cd ./engine/servers/<engine-configuration-name>
docker compose up -d
```

Containers are expected to expose all necessary ports, so the client can connect to them.

### Run the client

Install dependencies:

```bash
pip install poetry
poetry install
```

Run the benchmark:

```bash
Usage: run.py [OPTIONS]

Example: python3 -m run --engines *-m-16-* --datasets glove-*

Options:
--engines TEXT [default: *]
--datasets TEXT [default: *]
--host TEXT [default: localhost]
--skip-upload / --no-skip-upload
[default: no-skip-upload]
--install-completion Install completion for the current shell.
--show-completion Show completion for the current shell, to
copy it or customize the installation.
--help Show this message and exit.
```

The command allows you to specify wildcards for engines and datasets.
Results of the benchmarks are stored in the `./results/` directory.

## How to update benchmark parameters?

Each engine has a configuration file, which is used to define the parameters for the benchmark.
Configuration files are located in the [configuration](./experiments/configurations/) directory.

Each step in the benchmark process uses a dedicated section of the configuration:

* `connection_params` - passed to the client during the connection phase.
* `collection_params` - parameters used to create the collection; indexing parameters are usually defined here.
* `upload_params` - parameters used to upload the data to the server.
* `search_params` - passed to the client during the search phase. The framework allows multiple search configurations for the same experiment run.

Exact values of the parameters are individual for each engine.

## How to register a dataset?

Datasets are configured in the [datasets/datasets.json](./datasets/datasets.json) file.
The framework will automatically download each dataset and store it in the [datasets](./datasets/) directory.

## How to implement a new engine?

There are a few base classes that you can use to implement a new engine.

* `BaseConfigurator` - defines methods to create collections and set up indexing parameters.
* `BaseUploader` - defines methods to upload the data to the server.
* `BaseSearcher` - defines methods to search the data.

See the examples in the [clients](./engine/clients) directory.

Once all the necessary classes are implemented, you can register the engine in the [ClientFactory](./engine/clients/client_factory.py).
Binary file added images/MyScaleConsole.jpg
Binary file added images/MyScaleResuts.jpg
Binary file added images/PineconeConsole.jpg
Binary file added images/PineconeResults.jpg
Binary file added images/QdrantConsole.jpg
Binary file added images/QdrantResults.jpg
Binary file added images/WeaviateConsole.jpg
Binary file added images/WeaviateResults.jpg
Binary file added images/ZillizConsole.jpg
Binary file added images/ZillizResults.jpg
File renamed without changes
Binary file added images/cloud test steps.png
