What exactly does the numPartitions parameter do? #57

sudhanshugarg · 2023-04-13T18:56:01Z

Hi,
In KnnAlgorithm.scala on line 524, there is this line:

        .flatMap { case (queryId, vector) =>
          Range(0, getNumPartitions).map { partition =>
            (partition, (queryId, vector))
          }
        }

After looking through the code, I realized that the greater the value returned by getNumPartitions, the longer my job takes. In fact, at one point I set this parameter to 16000, and the job ran for 22 hours before I had to kill it.

I've now reduced it to 3, and the job runs in 55 minutes. I'm unclear on what this parameter is and what exactly each partition represents. Could you please elaborate a bit, @jelmerk?

Thanks.

jelmerk (Maintainer) · 2023-06-16T08:32:30Z

If you have 3 partitions, Spark will create 3 indices, each containing roughly 1/3 of the input embeddings.

This means that you can index arbitrarily large datasets. However, querying becomes more expensive with every partition you add, because in order to find the nearest neighbours across all inputs it needs to query every index (all 3 in this example).
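To make the cost concrete, here is a minimal plain-Scala sketch of the fan-out performed by the flatMap quoted in the question. It does not use Spark or the library itself; `FanOutSketch` and `fanOut` are hypothetical names introduced only for illustration. It only counts how many (partition, query) pairs are produced, which is the number of index lookups needed.

```scala
object FanOutSketch {
  // Hypothetical helper mirroring the quoted flatMap:
  // every query is paired with every partition number.
  def fanOut[Q](queries: Seq[Q], numPartitions: Int): Seq[(Int, Q)] =
    queries.flatMap { query =>
      Range(0, numPartitions).map(partition => (partition, query))
    }

  def main(args: Array[String]): Unit = {
    val queries = Seq("q1", "q2")

    // One lookup per (partition, query) pair.
    println(s"3 partitions: ${fanOut(queries, 3).size} index lookups")
    println(s"16 partitions: ${fanOut(queries, 16).size} index lookups")
  }
}
```

The number of lookups grows as queries × numPartitions, so a very large partition count multiplies the query-side work even though each individual index is smaller.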

If you partition your data manually, you can mitigate this somewhat. Say you know with high likelihood that the best matches will only be found in partition 1: you can then use queryPartitionsCol to query only that index and avoid 2/3 of the work.
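The saving can be modelled the same way. This is a rough plain-Scala sketch, not the library's implementation: the `Query` case class and its `partitions` field are hypothetical stand-ins for a row that carries its target partitions, which is the role queryPartitionsCol is described as playing above.

```scala
object QueryPartitionsSketch {
  // Hypothetical query record: an id plus the partitions it should be searched against.
  final case class Query(id: String, partitions: Seq[Int])

  // Each query is only sent to the partitions it names.
  def fanOut(queries: Seq[Query]): Seq[(Int, String)] =
    queries.flatMap(q => q.partitions.map(p => (p, q.id)))

  def main(args: Array[String]): Unit = {
    val everywhere = Seq(Query("q1", Seq(0, 1, 2))) // default: search all 3 indices
    val targeted   = Seq(Query("q1", Seq(1)))       // search only partition 1

    println(s"all partitions: ${fanOut(everywhere).size} lookups")
    println(s"partition 1 only: ${fanOut(targeted).size} lookups")
  }
}
```

With 3 partitions, targeting a single one drops the lookups per query from 3 to 1, which is the 2/3 saving mentioned above. It only helps if the data was partitioned so that the likely neighbours really do live in that partition.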
