What exactly does the numPartitions parameter do? #57
sudhanshugarg
asked this question in
Q&A
If you have 3 partitions, then Spark will create 3 indices, each containing roughly 1/3 of the input embeddings. This means that you can index arbitrarily large datasets; however, querying becomes more expensive with every partition you add, because in order to find the nearest neighbours across all inputs it needs to query all 3 indices. If you manually partition your data you can mitigate this somewhat: say you know with high likelihood that the best matches will only be found in partition 1, then you can use queryPartitionsCol to query only that index and avoid doing 2/3 of the work.
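To make the trade-off concrete, here is a minimal sketch in plain Python. It does not use the library's API: each "index" is just a list searched by brute force, and the names `build_partitions`, `query_partition` and `query_partitions` are made up for illustration. It only shows why a full query touches every partition, while restricting the query to one partition (the idea behind queryPartitionsCol) skips the rest of the work.

```python
import heapq
import random


def build_partitions(items, num_partitions):
    """Split (id, vector) pairs round-robin; each list stands in for one index."""
    partitions = [[] for _ in range(num_partitions)]
    for i, item in enumerate(items):
        partitions[i % num_partitions].append(item)
    return partitions


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def query_partition(partition, query_vec, k):
    """Brute-force top-k inside one partition (hypothetical stand-in for an index lookup)."""
    scored = ((squared_distance(vec, query_vec), item_id) for item_id, vec in partition)
    return heapq.nsmallest(k, scored)


def query_partitions(partitions, query_vec, k, partition_ids=None):
    """Query the chosen partitions (all of them by default) and merge the results."""
    ids = range(len(partitions)) if partition_ids is None else partition_ids
    candidates = []
    lookups = 0
    for p in ids:
        candidates.extend(query_partition(partitions[p], query_vec, k))
        lookups += 1
    # The global top-k is the best k among the per-partition top-k lists.
    return heapq.nsmallest(k, candidates), lookups


random.seed(0)
data = [(i, [random.random() for _ in range(4)]) for i in range(300)]
partitions = build_partitions(data, 3)
q = [0.5, 0.5, 0.5, 0.5]

all_hits, all_lookups = query_partitions(partitions, q, k=5)
one_hits, one_lookups = query_partitions(partitions, q, k=5, partition_ids=[0])
print(all_lookups, one_lookups)  # 3 1
```

In this toy model the number of index lookups per query grows linearly with the number of partitions, which is consistent with a job getting slower as the partition count goes up. Restricting the query to partition 0 does a third of the work, but it only returns the true nearest neighbours if they actually live in that partition.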
Hi,
In KnnAlgorithm.scala on line 524, there is this line:
I realized after looking through the code that the greater the value returned by getNumPartitions, the longer my job takes. In fact, I was setting this parameter to 16000 at one point, and the job ran for 22 hours before I had to kill it. Now I've reduced it to 3, and the job runs in 55 minutes. I'm unclear on what this parameter is and what exactly each partition represents. Could you please elaborate a bit, @jelmerk?
Thanks.