Next Stop, Vector Databases: Building a Music Discovery App
Documenting the lessons learnt from building our discovery system outlined in “Bit Of This, Bit Of That: Revisiting Search & Discovery”
Disclaimer 1: This is the third instalment in the How Not to Build a Music Discovery App series, based on our paper titled Bit of This, Bit Of That: Revisiting Search and Discovery. In Part 1, we present the initial monolithic version, and in Part 2, we discuss the transition to a microservice-based architecture.
Disclaimer 2: The design choices mentioned in this article are made with a low-cost setup in mind. As a result, some of the design choices may not be the most straightforward.
Brief Summary
Genre-fluid music is any musical item (a song or a playlist) that contains more than a single genre. Think Stressed Out by Twenty One Pilots, or Peaches En Regalia by Frank Zappa. Genre-fluid music has been gaining popularity over the last few decades. However, the search interfaces in music apps like Spotify and Apple Music are still designed for single-genre searches. Our paper proposes a platform for discovering genre-fluid music through a combination of expressive search and a user experience built around the core idea of genre-fluid search.
Part 1 of this series outlines the initial monolithic architecture used to build this platform. In Part 2, we break the monolith into three components: web server, discovery engine, and vector search server, using PostgreSQL for keyword search, Spotify ANNOY library for sparse genre-vector search, Gensim Word2vec for similarity-based search, and Redis as the cache storage.
Limitations
In design version 2, we broke the monolithic architecture into smaller components to simplify horizontal scaling. However, this came with some design issues of its own. These are as follows:
Too Many Search & Lookup Sources
Our search/lookup source schema looks like the following:
- PostgreSQL for full-text search
- Redis for storing cached and app data
- Spotify ANNOY data structures for sparse vector search
- Gensim library for dense vector search
- In-memory sparse genre-vectors needed by the scoring module
That is a lot. Maintaining this many search/lookup sources adds operational complexity: each one has to be provisioned, scaled, and performance-tuned separately.
Sub-Optimal Vector Database Implementation
Our vector search component, which packages the Spotify ANNOY and Gensim libraries behind a FastAPI service, is far from an optimized vector-search setup.
Spotify ANNOY was already a mid-tier vector search library in terms of speed in 2023, and on top of that we used its mmap mode, which keeps memory usage down at the cost of pushing search times up further.
In addition, we have the same problem as before: using two different libraries for vector search where a single vector database would have sufficed.
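For context, here is a minimal sketch of that mmap trade-off with ANNOY, assuming a recent version of the library; the dimensionality and file name are illustrative, not our actual index.

from annoy import AnnoyIndex

# Hypothetical genre-vector index; the dimensionality is illustrative.
index = AnnoyIndex(5000, "angular")

# load() memory-maps the index file; with prefault=False the pages are pulled
# in lazily, which keeps RAM usage low but makes queries pay for page faults,
# i.e. slower searches.
index.load("genre_vectors.ann", prefault=False)

# Top-10 approximate nearest neighbours for item 42.
neighbours = index.get_nns_by_item(42, 10)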
Redis Overhead
Redis is a very convenient caching solution, but as the application data schema grows more complex, we start to see code snippets like this everywhere:
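# Redis returns raw bytes, so every value has to be decoded and cast back by hand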
all_integer_items = [int(item.decode('utf-8')) for item in all_cached_items]
We must decode the byte response and cast it back to the required data type, because Redis stores values as strings and returns them as bytes.
Additionally, Redis works best with flat structures (lists, sets, or hashes) and is not designed to store nested data. This makes Redis a less-than-ideal candidate for data storage (or caching) as the application data grows more complicated.
Finally, batching matters: making one bulk call instead of many separate calls can make all the difference for performance. Redis offers only limited support here, with bulk GET queries available for hashes (dictionaries) but not for lists.
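To illustrate both pain points, here is a rough sketch; the key names and data shapes are made up for this example and are not our actual schema.

import json
import redis

r = redis.Redis()

# Nested data has to be flattened into a string before it can be stored...
playlist = {"id": 7, "genres": {"rock": 0.6, "jazz": 0.4}}
r.set("playlist:7", json.dumps(playlist))

# ...and parsed back on every read.
playlist = json.loads(r.get("playlist:7"))

# Hash fields can be fetched in a single round trip...
r.hset("genre_counts", mapping={"rock": 120, "jazz": 45})
rock, jazz = r.hmget("genre_counts", ["rock", "jazz"])

# ...but there is no equivalent bulk GET for several lists; each list
# needs its own LRANGE call (or a manually managed pipeline).
ids_a = r.lrange("results:query_a", 0, -1)
ids_b = r.lrange("results:query_b", 0, -1)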
Design Changes
Based on the limitations outlined above, we can make the following design changes:
Replacing Redis With MongoDB
In design versions 1 and 2, we used Redis for caching and storing the application data. This worked fine until we ran into the problems described above: extra serialization/deserialization code around every read and write, no direct support for nested data, and limited bulk query capabilities.
We can solve all those problems by moving to a NoSQL solution, such as MongoDB*. By doing so, we get two advantages straight off the bat:
- No more serialization/deserialization overhead
- No more storage format restrictions. We can store our data as JSON documents, nesting included.
We also get one more benefit from migrating to MongoDB: its default storage engine, WiredTiger, lets us cap how much memory is allocated to its cache, so we can control the performance-memory tradeoff by controlling the amount of memory MongoDB is allowed to use.
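A minimal sketch of what this looks like with pymongo follows; the database, collection, and document shape are illustrative, and the WiredTiger cache cap is set at server startup.

from pymongo import MongoClient

# Assumes a mongod started with a capped WiredTiger cache, e.g.:
#   mongod --wiredTigerCacheSizeGB 2
client = MongoClient("mongodb://localhost:27017")
db = client["discovery"]

# Nested documents go in as-is; no serialization layer on our side.
db.playlists.insert_one({
    "spotify_id": "37i9dQ",
    "name": "genre benders",
    "genres": {"rock": 0.55, "electronic": 0.30, "jazz": 0.15},
    "tracks": [{"title": "Stressed Out", "artist": "Twenty One Pilots"}],
})

# ...and come back as plain Python dicts, in the shape we actually use.
doc = db.playlists.find_one({"spotify_id": "37i9dQ"})
rock_weight = doc["genres"]["rock"]  # no byte-decoding or casting needed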
Using Elasticsearch For All-Things-Search
Instead of using two separate solutions (PostgreSQL and Spotify ANNOY/Gensim) for full-text search and vector search, we can use a service that supports both, such as Elasticsearch. Elasticsearch has been around for a long time as a distributed full-text search engine, and it has added vector search capabilities in recent years. Using it as our out-of-the-box vector database gives us a well-established, stable, and highly optimized service for both search use cases, and it makes our application design much cleaner.
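As a rough sketch of how one index can cover both use cases: the index name, field names, and dimensionality below are illustrative, and the knn parameter assumes Elasticsearch 8.x.

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# One index holding both a full-text field and a dense vector field.
es.indices.create(index="items", mappings={
    "properties": {
        "name": {"type": "text"},
        "embedding": {"type": "dense_vector", "dims": 128,
                      "index": True, "similarity": "cosine"},
    }
})

# Full-text search over the name field.
text_hits = es.search(index="items", query={"match": {"name": "peaches en regalia"}})

# Approximate kNN search over the embedding field.
query_vector = [0.0] * 128  # placeholder for the real query embedding
knn_hits = es.search(index="items", knn={
    "field": "embedding",
    "query_vector": query_vector,
    "k": 100,
    "num_candidates": 1000,
})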
Search Workflow
The search workflow in version 3 remains similar to the one discussed in version 2.
- The web server forwards the user query to the core discovery service.
- The query parser module parses the query, builds a query payload and forwards it to the caching module.
- The caching module checks whether the query results have already been stored in MongoDB. In case of a cache hit, steps 4–6 are skipped, and the result set is returned to the browser. In case of a cache miss, the query payload is forwarded to the Candidate Aggregation module.
- The candidate aggregation module sends the query to Elasticsearch for vector search, which returns 10k candidates.
- The scoring module scores the candidates with respect to the query.
- The filtering module removes duplicate candidates with regard to the item name and primary artist composition.
- Finally, the detail population module fills in the details for the result-set candidates.
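To make steps 3 and 4 concrete, here is a minimal sketch of the cache lookup and the fall-through to Elasticsearch; the collection, index, and field names are illustrative, and the exact cache-key scheme is an assumption.

import hashlib

def lookup_cached_results(query_payload: dict, db):
    # Step 3: key the cache on a stable hash of the parsed query payload.
    cache_key = hashlib.sha256(repr(sorted(query_payload.items())).encode()).hexdigest()
    doc = db.query_cache.find_one({"_id": cache_key})
    return cache_key, (doc["results"] if doc else None)

def fetch_candidates(query_payload: dict, es):
    # Step 4: on a cache miss, pull ~10k vector-search candidates from Elasticsearch.
    response = es.search(index="items", size=10_000, knn={
        "field": "embedding",
        "query_vector": query_payload["embedding"],
        "k": 10_000,
        "num_candidates": 10_000,
    })
    return [hit["_source"] for hit in response["hits"]["hits"]]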
Thoughts
The Good
Search Source Aggregation
What this setup succeeds in achieving is search source aggregation. We replace PostgreSQL, Spotify ANNOY, and Gensim entirely with Elasticsearch, making our design much cleaner and easier to manage in terms of infrastructure management, scaling, and performance tuning.
MongoDB Convenience
By using MongoDB in place of Redis for caching and application data storage, we now have data stored in a JSON format that closely resembles how we use it in the application, and we no longer need to cast data back to its intended type, resulting in cleaner code. All of this comes with the option to cap how much memory MongoDB gets, making the setup adaptable to the hardware resources available.
The Bad
Incomplete In-Memory Cleanup
The in-memory cleanup remains incomplete: the sparse genre vectors still live in memory. We could store them in MongoDB, but even with sufficient memory allocated, fetching genre vectors from MongoDB would still be slower than keeping them in-process.
10k Scoring Time
Since the beginning, one persistent issue with the scoring module has been that it takes longer than expected to calculate scores for 10,000 candidates. This results in an overall increase in search response time.
The Ugly
While Elasticsearch beats Spotify ANNOY (in mmap mode) on search performance and offers the convenience of handling both vector search and full-text search, the memory consumption in this setup blew past all our self-imposed restrictions. With a total index size of around 25GB, we must allocate a whole new 64GB server for search, which amounts to roughly $576 a month for running a self-hosted single instance of Elasticsearch. Ouch!
Conclusion
We cleaned up the application design by using Elasticsearch for both full-text and vector search queries, but it came at the expense of memory consumption. We also replaced Redis with MongoDB for caching and data storage, aligning the storage format (JSON) with the way the data is used. Going forward, we need to cut costs significantly while keeping the design clean, substantially reduce the 10k scoring time, and complete the in-memory cleanup.
Until the next iteration.
* In place of MongoDB, we could also use RedisJSON, a Redis module that adds document-style storage and retrieval of data as JSON, similar in spirit to MongoDB.