We did a hackathon #2871
MatthewAry
started this conversation in
Show and tell
Replies: 2 comments 1 reply
-
Oh yeah, I forgot to mention that we also managed to crash SurrealDB. We posted about that here #2677 (comment) |
Beta Was this translation helpful? Give feedback.
0 replies
-
Your hackathon work involving SurrealDB has really caught my interest, especially its performance aspects and the use of Bun and Elysia.js. Is there any chance you could share the source code or provide some insights into your experience? I completely understand if you can't, but I'm just eager to learn from your approach. |
Beta Was this translation helpful? Give feedback.
1 reply
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
-
This October, I decided to organize a Hackathon within my company. We spent 36+ man-hours working on building a SurrealDB powered application. It was simply a data retrieval application that was populated with data from an FDA database.
💡 The data that we used can be found here:Our objective was to take the Raw FDA device data and put it into SurrealDB. We also wanted to create embeddings for the product classifications because the FDA classification system is pretty bad. We thought that we could use OpenAI’s
text-ada-001
embedder to generate our embedding values for the classifications.Our hackathon was conducted by 3 people, and only I had any experience playing around with SurrealDB but had never used it on a data set this big. Anyways, we created a schema, albeit a flawed one and not fully utilized.
From the FDA Data we renamed classification to category, and from the 510K clearances file, we split it up into two tables, vendors and products. This is because in the 510K data set, vendors are called
APPLICANT
s and a vendor can sell many products. We thought it would be cool to see what products each vendor sells. We used thePRODUCTCODE
field in the two datasets to link the products to classifications.System:
SurrealDB ran using RocksDB on a 2018 MacBook Pro with 32GB of RAM, 2.6GHz 6‑core 8th‑gen Intel i7. Team members connected to it over a TailScale network. We all connected to the same user account,
root
. We ran SurrealDB directly on the OS.Stack:
We used Python and the corresponding SurrealDB SDK to process and load the data into our database and built an application that connected to it using Bun and Elysia. We used the latest version of SurrealDB 1.0.0
Things were working fine until we loaded in the FDA data and did preparations for the embeddings. Disclaimer: This was a hackathon and we readily admit that we could have done things better, but we believe our observations are worth sharing.
We noticed the following:
SurrealDB became slow and sometimes seemingly unresponsive for all participants when we tried to execute a query such as
SELECT name, description, (SELECT name FROM product WHERE category = $parent.id) as products FROM category LIMIT 5;
we believe the performance would have been better if we had not done a JOIN like query and instead prepared an edge table that would bi-directionallyRELATE
the product and category tables together. (Could we have an Anti-Pattern blog post?) Regardless, when we ran this query, it seemed like nothing was happening and we couldn’t talk to the database. In addition, looking at the system activity, the SurrealDB process seemed compute bound, activity manager said that it was using 200% of the CPU (which I interpreted to mean that it was only using 2 CPU cores out of 6). Memory and Iops on the disk seemed negligible.We noticed that when we tried to perform a seemingly simple query such as
Count(SELECT * FROM category;)
orSELECT * FROM category LIMIT 5;
the query also ran extremely slowly. We think it might have something to do with the embedding field but aren’t sure. In order for us to create an index for the embedding field we set a default value for the embedding field to be an array with a length of 1536, with all values set to0
. This length represents the number of dimensions we observed thetext-ada-001
embedder returning when we did experiments to generate embeddings.Or when we ran
Count(SELECT * FROM products;)
it was also slow (there were a lot of product records!).We thought that maybe RocksDB might be to blame with how the database was getting locked up but when we tried doing stuff using TiKV as the storage layer (with only one node), we got similar results. We would see SurrealDB lock up when executing long running queries.
Perhaps we were doing something wrong, but we didn’t have a lot of time budgeted to try to figure out what the problem was, and we weren’t sure what we could to diagnose it either. Regardless, it was an interesting exercise, but unfortunately had us deciding to table using SurrealDB for the time being until we can budget the time to revisit this again and try to address the problems we ran into.
We intend to discuss our experience with the SurrealDB team during their weekly office hours. We will post updates to this thread as our journey with SurrealDB progresses.
Update: We talked with the developers during their office hours on October 20th about our experience. These are our takeaways from that conversation.
TLDR: SurrealDB is largely feature complete, but those features have a lot of room for optimization.
Our experiences with our query performance are largely due to the fact that there is a lot of optimization work that still needs to be done for a variety of SurrealDB query features. For this reason the
Count
,LIMIT
and our JOIN like query were slow.Count
SurrealDB would grab the entire record object when it executed it's operation. It is not efficient like what you might get with MySQL when you would do something likeSELECT Count(*) FROM products;
because SurrealDB would end up processing more information than what is needed to perform the operation requested efficiently.LIMIT
, those queries could be slow because if the data selected is not ordered then it has to figure out what objects to retrieve. The developers weren't clear as to what was happening under the hood, but they did tell us thatLIMIT
is currently not as efficient at execution it should be.products.length * customers.length
comparisons to figure out what records should go together. They told us that they have a very primitive query planner and that we could use theEXPLAIN
argument against our query to possibly see what SurrealDB would end up doing. They also told us that we could have used thePARALLEL
argument to have the database take a divide and conquer approach for query execution, possibly allowing it to consume more compute resources to complete the query faster.The developers weren't exactly sure what could have caused slow downs for other connected clients, or the crash.
Beta Was this translation helpful? Give feedback.
All reactions