<?xml version="1.0" encoding="utf-8" standalone="yes" ?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Microservices | Bits And Music</title>
    <link>https://bitsandmusic.com/tag/microservices/</link>
      <atom:link href="https://bitsandmusic.com/tag/microservices/index.xml" rel="self" type="application/rss+xml" />
    <description>Microservices</description>
    <generator>Source Themes Academic (https://sourcethemes.com/academic/)</generator><language>en-us</language><lastBuildDate>Wed, 01 Jan 2025 00:00:00 +0000</lastBuildDate>
    <image>
      <url>https://bitsandmusic.com/images/icon_hua5672c1e15dce4d511903ad7fb945fd0_28771_512x512_fill_lanczos_center_2.png</url>
      <title>Microservices</title>
      <link>https://bitsandmusic.com/tag/microservices/</link>
    </image>
    
    <item>
      <title>Playlist2vec: DIY Autoscaler For Docker Swarm - 2</title>
      <link>https://bitsandmusic.com/post/playlist2vec-docker-swarm-and-a-diy-autoscaler-2/</link>
      <pubDate>Wed, 01 Jan 2025 00:00:00 +0000</pubDate>
      <guid>https://bitsandmusic.com/post/playlist2vec-docker-swarm-and-a-diy-autoscaler-2/</guid>
      <description>&lt;p&gt;&lt;strong&gt;Disclaimer 1:&lt;/strong&gt; &lt;em&gt;This post continues the last post about building a playlist search and discovery application on a Raspberry Pi powered by the sequence-to-sequence model described in the post &amp;quot;Building Music Playlists Recommendation System.&amp;quot;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Disclaimer 2:&lt;/strong&gt; &lt;em&gt;The design choices mentioned in this article are made keeping in mind a low-cost setup. As a result, some of the design choices may not be the most straightforward.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Disclaimer 3:&lt;/strong&gt; &lt;em&gt;You can explore the vector search application, playlist2vec, here: &lt;a href=&#34;https://playlist2vec.com/&#34;&gt;https://playlist2vec.com/&lt;/a&gt;. You can find the code for the demo application &lt;a href=&#34;https://github.com/piyp791/playlist2vec&#34;&gt;here&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;h2 id=&#34;brief-summary&#34;&gt;Brief Summary&lt;/h2&gt;

&lt;p&gt;In the initial setup for our vector search application, we had a &lt;a href=&#34;https://expressjs.com/&#34;&gt;NodeJS (Express JS)&lt;/a&gt; web server and &lt;a href=&#34;https://fastapi.tiangolo.com/&#34;&gt;FastAPI&lt;/a&gt; microservices for autocomplete and vector search features. All three are deployed as &lt;a href=&#34;https://www.docker.com/&#34;&gt;docker&lt;/a&gt; containers for ease of installation and scalability. We use the &lt;a href=&#34;https://github.com/unum-cloud/usearch&#34;&gt;USearch vector search&lt;/a&gt; library for vector search, and for autocomplete, we use a Python-based &lt;a href=&#34;https://en.wikipedia.org/wiki/Directed_acyclic_word_graph&#34;&gt;Directed Word Graph&lt;/a&gt; library called &lt;a href=&#34;https://github.com/seperman/fast-autocomplete&#34;&gt;fast-autocomplete&lt;/a&gt;. The entire setup is behind &lt;a href=&#34;https://nginx.org/en/&#34;&gt;Nginx&lt;/a&gt;, which acts as a reverse proxy for our setup.&lt;/p&gt;

&lt;figure&gt;
  &lt;img src=&#34;https://bitsandmusic.com/assets/images/playlist2vec.drawio.png&#34;  /&gt;
  &lt;figcaption&gt;
      &lt;p&gt;Playlist2vec design architecture. Scalable microservice-based architecture with NodeJS webserver, FastAPI-based vector search and autocomplete APIs, and Nginx as a reverse proxy.&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;h2 id=&#34;limitations&#34;&gt;Limitations&lt;/h2&gt;

&lt;p&gt;Although the initial setup was designed with HTTPS and scaling in mind, it implemented neither.
In this post, we will focus on adding scaling capability to our application using &lt;a href=&#34;https://docs.docker.com/engine/swarm/&#34;&gt;Docker Swarm&lt;/a&gt;.&lt;/p&gt;

&lt;hr&gt;

&lt;h2 id=&#34;docker-swarm-from-containers-to-services&#34;&gt;Docker Swarm: From Containers To Services&lt;/h2&gt;

&lt;p&gt;Docker Swarm enables the deployment and management of multiple instances of applications, ensuring high availability and resilience.
The key distinction between deploying an application in swarm mode and using conventional Docker Compose lies in the added abstraction of &lt;a href=&#34;https://docs.docker.com/engine/swarm/services/&#34;&gt;services&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Containers&lt;/strong&gt; are units of deployment, each with its own runtime. They encapsulate an application and its dependencies, including libraries, binaries, and configuration files, into a single, lightweight package. Each container runs in isolation from the others, sharing the host operating system&#39;s kernel but maintaining its own filesystem, processes, and network stack.&lt;/p&gt;

&lt;p&gt;On the other hand, &lt;strong&gt;services&lt;/strong&gt; represent a higher-level abstraction that defines how a specific application or a set of related applications should run in a container orchestration platform like Docker Swarm or &lt;a href=&#34;https://kubernetes.io/&#34;&gt;Kubernetes&lt;/a&gt;. A service specifies the desired state for a group of containers, including the number of replicas (instances) to run, the networking configuration, and the load-balancing strategy. When you create a service, the orchestration platform automatically manages the deployment and scaling of the underlying containers to meet the defined specifications.&lt;/p&gt;

&lt;h2 id=&#34;scaling-up-the-setup&#34;&gt;Scaling Up The Setup&lt;/h2&gt;

&lt;p&gt;Before transitioning to a Docker Swarm setup, we take the following steps to incorporate scaling into our configuration:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Add a second Raspberry Pi to our machine cluster, ensuring that both machines can communicate with each other.&lt;/li&gt;
&lt;li&gt;Modify the vector search implementation to be memory-based, moving away from the previous &lt;a href=&#34;https://en.wikipedia.org/wiki/Mmap&#34;&gt;MMAP-based&lt;/a&gt; one. This change allows us to better understand the resource requirements associated with scaling.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&#34;docker-swarm-config&#34;&gt;Docker Swarm Config&lt;/h2&gt;

&lt;p&gt;Here’s a snippet of the &lt;code&gt;docker-compose.yaml&lt;/code&gt; file for one of the services, &lt;code&gt;autocomplete-service&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;autocomplete-service:
    build: ./autocomplete-service
    image: ${REGISTRY_HOST}:${REGISTRY_PORT}/autocomplete-image:latest
    networks:
      - p2v-network
    env_file:
      - .env
    deploy:
      replicas: 2
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;When the &lt;a href=&#34;https://docs.docker.com/engine/swarm/stack-deploy/&#34;&gt;docker stack&lt;/a&gt; is deployed, this configuration runs two instances of the &lt;code&gt;autocomplete-service&lt;/code&gt;. Because incoming requests are spread across both replicas, the service can handle more traffic than a single-instance configuration.&lt;/p&gt;
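
&lt;p&gt;As a rough sketch of how the &lt;code&gt;deploy&lt;/code&gt; section can describe more of the desired state, the fragment below adds a memory limit, a restart policy, and a rolling-update setting. These keys are part of the Compose file &lt;code&gt;deploy&lt;/code&gt; specification, but the values are illustrative assumptions, not the settings used in this project:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;autocomplete-service:
    image: ${REGISTRY_HOST}:${REGISTRY_PORT}/autocomplete-image:latest
    deploy:
      replicas: 2
      resources:
        limits:
          memory: 512M          # assumed cap; size it from your own measurements
      restart_policy:
        condition: on-failure   # restart a replica only if it exits with an error
      update_config:
        parallelism: 1          # replace one replica at a time during updates
&lt;/code&gt;&lt;/pre&gt;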

&lt;hr&gt;

&lt;h2 id=&#34;autoscaling&#34;&gt;Autoscaling&lt;/h2&gt;

&lt;p&gt;&lt;a href=&#34;https://en.wikipedia.org/wiki/Autoscaling&#34;&gt;Autoscaling&lt;/a&gt; is a cloud computing feature that automatically adjusts a service&#39;s number of active instances based on current demand. This ensures optimal resource utilization, maintains performance, and minimizes costs by dynamically scaling resources up or down in response to varying workloads.&lt;/p&gt;

&lt;p&gt;The core concept of an autoscaler is to define conditions that trigger scaling actions. Common criteria for scaling include CPU usage, memory consumption, or the number of requests indicative of traffic load.&lt;/p&gt;

&lt;p&gt;An autoscaler can operate in two primary ways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Event-Driven Scaling:&lt;/strong&gt; Scaling actions are triggered by specific events, such as a sudden spike in traffic.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Polling-Based Scaling:&lt;/strong&gt; A service continuously monitors a metric and initiates scaling actions when that metric crosses a defined threshold.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Docker Swarm does not natively support autoscaling capabilities, unlike Kubernetes, which offers robust &lt;a href=&#34;https://kubernetes.io/docs/concepts/workloads/autoscaling/&#34;&gt;autoscaling features&lt;/a&gt;. However, it is possible to implement a basic autoscaling solution within an existing Docker Swarm setup.&lt;/p&gt;

&lt;h2 id=&#34;diy-autoscaling&#34;&gt;DIY Autoscaling&lt;/h2&gt;

&lt;p&gt;In this implementation, &lt;strong&gt;we focus on the number of requests within a specific timeframe as the primary condition for autoscaling.&lt;/strong&gt; We use Nginx&#39;s &lt;code&gt;access.log&lt;/code&gt; as our primary source of information for the requests logged.&lt;/p&gt;

&lt;p&gt;Our approach employs a polling mechanism with a 15-second interval. A bash script runs every 15 seconds to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Retrieve each endpoint&#39;s total number of requests within the last 15 seconds by using &lt;a href=&#34;https://en.wikipedia.org/wiki/AWK&#34;&gt;awk&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Determine the required number of replicas from the request count, using our custom load-based scaling logic.&lt;/li&gt;
&lt;li&gt;Execute the &lt;code&gt;docker service scale&lt;/code&gt; command to adjust the number of replicas horizontally, effectively increasing or decreasing the number of service instances.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here&#39;s a snippet of code as an example to get the total request count for an endpoint from the Nginx logs:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;awk -v start=&amp;quot;$(date --date=&#39;15 seconds ago&#39; &#39;+%d/%b/%Y:%H:%M:%S&#39;)&amp;quot; \
    -v end=&amp;quot;$(date &#39;+%d/%b/%Y:%H:%M:%S&#39;)&amp;quot; \
    &#39;$4 &amp;gt;= &amp;quot;[&amp;quot;start&amp;quot;]&amp;quot; &amp;amp;&amp;amp; $4 &amp;lt;= &amp;quot;[&amp;quot;end&amp;quot;]&amp;quot; &amp;amp;&amp;amp; $7 ~ /\/populate/ {count++} \
    END {print count + 0}&#39; /var/log/nginx.log
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;To establish the relationship between the number of requests and the required replicas, we conduct load testing on our services using the load testing tool &lt;a href=&#34;https://k6.io/&#34;&gt;k6&lt;/a&gt;. By performing &lt;a href=&#34;https://grafana.com/docs/k6/latest/using-k6/scenarios/executors/constant-arrival-rate/&#34;&gt;constant-arrival-rate tests&lt;/a&gt;, we identify the maximum number of requests a single container can handle for each service on our specific hardware. These per-instance limits become the thresholds in our autoscaling logic.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// Autoscaling Pseudocode

// Read request counts from the Nginx access log for each endpoint
request_counts = read_log_parser_output()

// Determine the required number of service replicas from the
// request counts, using our custom logic and prior load testing
required_service_replicas_lookup = get_scale(request_counts)

// Execute scaling commands for each service
FOR EACH service, replicas IN required_service_replicas_lookup:
    scale_service(service, replicas)
&lt;/code&gt;&lt;/pre&gt;

&lt;h2 id=&#34;thoughts&#34;&gt;Thoughts&lt;/h2&gt;

&lt;h3 id=&#34;the-good&#34;&gt;The Good&lt;/h3&gt;

&lt;h4 id=&#34;lowcost-setup-which-works&#34;&gt;Low-Cost Setup Which Works&lt;/h4&gt;

&lt;p&gt;This setup is ideal for small to medium-scale projects due to its low resource requirements. It avoids the complexity of more advanced frameworks like Kubernetes, requires only awk and a Python installation, and is simple to set up, deploy, and manage.&lt;/p&gt;

&lt;h4 id=&#34;customizability&#34;&gt;Customizability&lt;/h4&gt;

&lt;p&gt;It offers greater control over the autoscaling logic, allowing for adjustments such as adding custom logic to monitor additional metrics, making the scaling logic more sophisticated or simply modifying the polling duration.&lt;/p&gt;

&lt;h3 id=&#34;the-bad&#34;&gt;The Bad&lt;/h3&gt;

&lt;h4 id=&#34;dependency-on-load-testing&#34;&gt;Dependency on Load Testing&lt;/h4&gt;

&lt;p&gt;Given that our scaling setup uses predefined load thresholds, the primary limitation of our DIY setup is its dependence upon manual load testing to determine appropriate scaling thresholds.&lt;/p&gt;

&lt;h4 id=&#34;polling-limitations&#34;&gt;Polling Limitations&lt;/h4&gt;

&lt;p&gt;Another limitation of our polling-based scaling setup is that it may miss traffic peaks since any decision on whether to scale or not can come only after a predefined duration of 15 seconds, leading to delayed scaling responses.&lt;/p&gt;

&lt;h4 id=&#34;clunky-setup&#34;&gt;Clunky Setup&lt;/h4&gt;

&lt;p&gt;Given that the setup involves setting up a cron job every 15 seconds, setting up the correct path to the nginx logs, the autoscale scripts, etc., it can feel quite clunky compared to industry-standard autoscaling frameworks such as Kubernetes.&lt;/p&gt;

&lt;h4 id=&#34;limited-metrics&#34;&gt;Limited Metrics&lt;/h4&gt;

&lt;p&gt;The setup considers only the number of incoming requests, as read from the Nginx logs. It does not consider other vital metrics, such as CPU and memory usage, which would be valuable indicators when evaluating scaling needs. Tools such as &lt;a href=&#34;https://github.com/google/cadvisor&#34;&gt;cAdvisor&lt;/a&gt;, which report container health metrics such as CPU and memory usage, can be added to this setup to get a more complete picture before deciding to scale.&lt;/p&gt;
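
&lt;p&gt;As a minimal, untested sketch of what that could look like, cAdvisor can be added as one more service in the stack file. The image name and read-only mounts follow the commonly documented way of running cAdvisor under Docker; the service name, the image tag, and the &lt;code&gt;global&lt;/code&gt; deploy mode are assumptions made for illustration:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest   # assumption: pin a version that ships an ARM build
    networks:
      - p2v-network
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
    deploy:
      mode: global   # one cAdvisor instance on every swarm node
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;With a service like this in place, the autoscale script could query per-container CPU and memory figures on port 8080 inside the stack network, alongside the request counts it already reads.&lt;/p&gt;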

&lt;h2 id=&#34;conclusion&#34;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;We added a simple (auto)scaling capability to our vector search application deployed on a cluster of Raspberry Pis. The setup is very low-cost but has limitations: it is prone to missing traffic peaks, requires manual load testing up front, and considers only a limited set of metrics when scaling. Moving to an orchestrator with built-in autoscaling, such as Kubernetes, would be the next step.&lt;/p&gt;

&lt;p&gt;Until the next iteration.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Playlist2vec: A Raspberry-Pi Powered Vector Search System - 1</title>
      <link>https://bitsandmusic.com/post/playlist2vec-a-raspberry-pi-powered-vector-search-system-1/</link>
      <pubDate>Wed, 04 Dec 2024 00:00:00 +0000</pubDate>
      <guid>https://bitsandmusic.com/post/playlist2vec-a-raspberry-pi-powered-vector-search-system-1/</guid>
      <description>&lt;p&gt;&lt;strong&gt;Disclaimer 1:&lt;/strong&gt; &lt;em&gt;The design choices mentioned in this article are made keeping in mind a low-cost setup. As a result, some of the design choices may not be the most straightforward ones.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Disclaimer 2:&lt;/strong&gt; &lt;em&gt;You can explore the vector search application, playlist2vec, here: &lt;a href=&#34;https://playlist2vec.com/&#34;&gt;https://playlist2vec.com/&lt;/a&gt;. You can find the code for the demo application &lt;a href=&#34;https://github.com/piyp791/playlist2vec&#34;&gt;here&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;h2 id=&#34;introduction&#34;&gt;Introduction&lt;/h2&gt;

&lt;p&gt;In 2019, we published a &lt;a href=&#34;https://bitsandmusic.com/publication/playlist2vec/&#34;&gt;paper&lt;/a&gt; titled &amp;quot;Representation, Exploration, and Recommendation of Music Playlists.&amp;quot; In this work, we utilized &lt;a href=&#34;https://en.wikipedia.org/wiki/Seq2seq&#34;&gt;sequence-to-sequence models&lt;/a&gt; to create playlist &lt;a href=&#34;https://www.cloudflare.com/en-gb/learning/ai/what-are-embeddings/&#34;&gt;embeddings&lt;/a&gt;, which can be employed for various downstream tasks like search and discovery. You can see these embeddings in action at &lt;a href=&#34;https://playlist2vec.com/&#34;&gt;playlist2vec.com&lt;/a&gt;. The purpose of this post is to explain how we built Playlist2Vec, a playlist search application powered by the embeddings mentioned earlier.&lt;/p&gt;

&lt;h2 id=&#34;main-features&#34;&gt;Main Features&lt;/h2&gt;

&lt;p&gt;The main features of the app are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A search box with typeahead search where users can enter the item&#39;s name they are looking for.&lt;/li&gt;
&lt;li&gt;After selecting their preferred playlist name and submitting it, the system will display playlists similar to the one queried.&lt;/li&gt;
&lt;li&gt;The app provides &lt;a href=&#34;https://open.spotify.com/&#34;&gt;Spotify&lt;/a&gt; URLs for the items, allowing users to navigate to them from the results page easily.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id=&#34;design-considerations&#34;&gt;Design Considerations&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;The typeahead search must be instantaneous.&lt;/li&gt;
&lt;li&gt;The &lt;a href=&#34;https://www.ibm.com/topics/vector-search&#34;&gt;vector search&lt;/a&gt; should be capable of completing in under 2 seconds on a low-cost machine, like a &lt;a href=&#34;https://www.raspberrypi.com/&#34;&gt;Raspberry Pi&lt;/a&gt;, under normal traffic load.&lt;/li&gt;
&lt;li&gt;Both vector and full-text search should be deployable onto a single machine with 4GB of RAM.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id=&#34;developer-friendly-outline&#34;&gt;Developer Friendly Outline&lt;/h2&gt;

&lt;p&gt;We primarily wanted a setup with a lower footprint but still scalable if needed. So, we designed our system using a &lt;a href=&#34;https://microservices.io/&#34;&gt;microservice architecture&lt;/a&gt; so that the system components are decoupled from each other and can be &lt;a href=&#34;https://wa.aws.amazon.com/wellarchitected/2020-07-02T19-33-23/wat.concept.horizontal-scaling.en.html&#34;&gt;horizontally scaled&lt;/a&gt; independently. With that in consideration, our tech stack looks like this for the app:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://expressjs.com/&#34;&gt;NodeJS (ExpressJS)&lt;/a&gt;&lt;/strong&gt; webserver&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://fastapi.tiangolo.com/&#34;&gt;FastAPI&lt;/a&gt;&lt;/strong&gt; for building two of our APIs; one is the search API for vector search, and the second is the autocomplete API for the typeahead search.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://github.com/unum-cloud/usearch&#34;&gt;USearch Vector Search Library&lt;/a&gt;&lt;/strong&gt; for vector search. Given a query playlist, similar playlists are found from a corpus of 745,543 playlists using vector search. This particular library was chosen because of its speed and &lt;a href=&#34;https://en.wikipedia.org/wiki/Mmap&#34;&gt;mmap&lt;/a&gt;* support.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://github.com/seperman/fast-autocomplete&#34;&gt;Fast Autocomplete Python library&lt;/a&gt;&lt;/strong&gt;, a &lt;a href=&#34;https://en.wikipedia.org/wiki/Directed_acyclic_word_graph&#34;&gt;Directed Word Graph-based&lt;/a&gt; library for the typeahead search.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://www.sqlite.org/&#34;&gt;SQLite database&lt;/a&gt;&lt;/strong&gt; to store additional details for playlists such as Spotify ID, name, and cover image link. We chose this database for its portability and small footprint.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://www.docker.com/&#34;&gt;Docker&lt;/a&gt;&lt;/strong&gt; containers to run the APIs and the webserver, which facilitates horizontal scaling.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://nginx.org/en/&#34;&gt;Nginx, as a reverse proxy&lt;/a&gt;&lt;/strong&gt; for our setup, so that features such as caching and rate limiting do not have to be baked into the code. It is installed on the host machine instead of running as a Docker container**.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id=&#34;overall-system-design&#34;&gt;Overall System Design&lt;/h2&gt;

&lt;p&gt;Here&#39;s what the workflow looks like:&lt;/p&gt;

&lt;figure&gt;
  &lt;img src=&#34;https://bitsandmusic.com/assets/images/playlist2vec.drawio.png&#34;  /&gt;
  &lt;figcaption&gt;
      &lt;p&gt;Playlist2vec design architecture. Scalable microservice-based architecture with NodeJS webserver, FastAPI-based vector search and autocomplete APIs, and Nginx as a reverse proxy.&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;ol&gt;
&lt;li&gt;When you begin typing the item you want to search for, Nginx will return a cached response if one is available.&lt;/li&gt;
&lt;li&gt;If there is no cached response, the request is forwarded to the web server, which then sends it to the autocomplete API. The API returns a list of suggested item names and their corresponding IDs.&lt;/li&gt;
&lt;li&gt;Once the user selects an item from the list, the ID is sent to the web server, which forwards it to the search API. The USearch vector index retrieves the k-closest results.&lt;/li&gt;
&lt;li&gt;Additional details about these closest results, such as the playlist name, ID, and the cover image link, are then fetched from the SQLite database and returned to the browser.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id=&#34;thoughts&#34;&gt;Thoughts&lt;/h2&gt;

&lt;h3 id=&#34;the-good&#34;&gt;The Good&lt;/h3&gt;

&lt;h4 id=&#34;scalable-architecture&#34;&gt;Scalable architecture&lt;/h4&gt;

&lt;p&gt;Using Docker containers to hold our system modules (web server, search, and autocomplete APIs) makes it easy to scale the setup if needed.&lt;/p&gt;

&lt;h4 id=&#34;memory-friendly-setup&#34;&gt;Memory friendly setup&lt;/h4&gt;

&lt;p&gt;We chose SQLite as our database and USearch as our vector search library due to their lower memory footprint and MMAP-based implementation. Since this system is read-only, we do not require the concurrency features offered by enterprise databases like MySQL or PostgreSQL. Regarding vector search, the mmap support allows us to avoid loading the vector search index into memory, which helps conserve system RAM. While the performance may not match memory-based alternatives, it is sufficient for our needs.&lt;/p&gt;

&lt;h4 id=&#34;robust-traffic-support-by-using-nginx&#34;&gt;Robust traffic support by using Nginx&lt;/h4&gt;

&lt;p&gt;Using Nginx enables robust support for traffic management, whether it is &lt;a href=&#34;https://blog.nginx.org/blog/rate-limiting-nginx&#34;&gt;rate limiting&lt;/a&gt; to mitigate DDoS attacks (or any volume of traffic beyond what our application can handle), &lt;a href=&#34;https://docs.nginx.com/nginx/admin-guide/content-cache/content-caching/&#34;&gt;caching&lt;/a&gt; (to use system resources efficiently), or serving static resources such as images, CSS, and JS files.&lt;/p&gt;

&lt;h3 id=&#34;the-bad&#34;&gt;The Bad&lt;/h3&gt;

&lt;h4 id=&#34;autocomplete-memory-consumption&#34;&gt;Autocomplete Memory Consumption&lt;/h4&gt;

&lt;p&gt;The memory consumption of the autocomplete API can be pretty high under heavy load. Considering mmap-based alternatives may be beneficial in this case.&lt;/p&gt;

&lt;h4 id=&#34;no-https-support&#34;&gt;No HTTPS support&lt;/h4&gt;

&lt;p&gt;The &lt;a href=&#34;https://github.com/piyp791/playlist2vec/releases/tag/v1.0.0&#34;&gt;v1.0.0&lt;/a&gt; setup doesn&#39;t support HTTPS, so the setup requires something like a &lt;a href=&#34;https://developers.cloudflare.com/cloudflare-one/connections/connect-networks/&#34;&gt;Cloudflare tunnel&lt;/a&gt; for HTTPS support.&lt;/p&gt;

&lt;p&gt;Additionally, there is no built-in HTTPS support for the search and autocomplete APIs. The Node.js web server communicates directly with the API containers without a proxy in place to manage SSL or other networking rules. This means that all traffic between the web server and the APIs is unencrypted.&lt;/p&gt;

&lt;h4 id=&#34;no-autoscaling-yet&#34;&gt;No (Auto)scaling (Yet)&lt;/h4&gt;

&lt;p&gt;Although the application has been designed to support scaling, the current version, v1.0.0, does not yet include a scaling configuration, whether &lt;a href=&#34;https://kubernetes.io/&#34;&gt;Kubernetes&lt;/a&gt;-based or otherwise.&lt;/p&gt;

&lt;h2 id=&#34;conclusion&#34;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;We designed a playlist search application powered by the embeddings from the sequence-to-sequence model we described in our paper. The setup is low-cost, enabling it to be deployed on a Raspberry Pi while still being designed to be scalable if needed. This version does depend on an HTTPS frontend (such as a Cloudflare tunnel) and does not yet have any scaling configuration.&lt;/p&gt;

&lt;p&gt;Until the next iteration.&lt;/p&gt;

&lt;h2 id=&#34;notes&#34;&gt;Notes&lt;/h2&gt;

&lt;p&gt;*Memory-mapped I/O (mmap) is a technique that allows a file or a portion of a file to be directly mapped into the memory address space of a process. This enables applications to access the file&#39;s contents as though they were part of the program&#39;s memory, facilitating efficient file input/output (I/O) operations.&lt;/p&gt;

&lt;p&gt;**While Nginx could also have been run as a Docker container, we decided to install it on the host machine so that it does not depend on Docker being up. This way, Nginx can still serve a maintenance page during Docker upgrades.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Next Stop, Vector Databases: Building a Music Discovery App - 3</title>
      <link>https://bitsandmusic.com/post/building-music-discovery-app-3/</link>
      <pubDate>Mon, 05 Feb 2024 00:00:00 +0000</pubDate>
      <guid>https://bitsandmusic.com/post/building-music-discovery-app-3/</guid>
      <description>&lt;p&gt;&lt;strong&gt;Disclaimer 1:&lt;/strong&gt; &lt;em&gt;This is the third instalment in the &lt;a href=&#34;https://bitsandmusic.com/series/how-not-to-build-a-music-discovery-app/&#34;&gt;How Not to Build a Music Discovery App&lt;/a&gt; series, based on our paper titled &lt;a href=&#34;https://archives.ismir.net/ismir2021/latebreaking/000054.pdf&#34;&gt;Bit of This, Bit Of That: Revisiting Search and Discovery&lt;/a&gt;. In &lt;a href=&#34;https://bitsandmusic.com/post/building-music-discovery-app-1/&#34;&gt;Part 1&lt;/a&gt;, we present the initial monolithic version, and in &lt;a href=&#34;https://bitsandmusic.com/post/building-music-discovery-app-2/&#34;&gt;Part 2&lt;/a&gt;, we discuss the transition to a microservice-based architecture.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Disclaimer 2:&lt;/strong&gt; &lt;em&gt;The design choices mentioned in this article are made with a low-cost setup in mind. As a result, some of the design choices may not be the most straightforward.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Disclaimer 3:&lt;/strong&gt; &lt;em&gt;You can explore our music discovery platform, This &amp;amp; That Music, here: &lt;a href=&#34;https://discover.thisandthatmusic.com/&#34;&gt;https://discover.thisandthatmusic.com/&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;h2 id=&#34;brief-summary&#34;&gt;Brief Summary&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Genre-fluid music&lt;/strong&gt; is any musical item (a song or a playlist) that spans more than one genre. Think &lt;a href=&#34;https://www.youtube.com/watch?v=pXRviuL6vMY&#34;&gt;Stressed Out&lt;/a&gt; by Twenty One Pilots. Or &lt;a href=&#34;https://www.youtube.com/watch?v=FoYdeEDdtK4&#34;&gt;Peaches En Regalia&lt;/a&gt; by Frank Zappa. Genre-fluid music has been &lt;a href=&#34;https://www.waterandmusic.com/tracking-genre-diversity-and-fluidity-in-the-billboard-charts&#34;&gt;gaining popularity over the last few decades&lt;/a&gt;. However, the search interfaces in music apps like Spotify and Apple Music are still designed for single-genre searches. Our paper proposes a platform to discover genre-fluid music through a combination of expressive search and a user experience built around the core idea of genre-fluid search.&lt;/p&gt;

&lt;p&gt;&lt;a href=&#34;https://bitsandmusic.com/post/building-music-discovery-app-1/&#34;&gt;Part 1&lt;/a&gt; of this series outlines the initial monolithic architecture used to build this platform. In &lt;a href=&#34;https://bitsandmusic.com/post/building-music-discovery-app-2/&#34;&gt;Part 2&lt;/a&gt;, we break the monolith into three components: web server, discovery engine, and vector search server, using PostgreSQL for keyword search, Spotify ANNOY library for sparse genre-vector search, Gensim Word2vec for similarity-based search, and Redis as the cache storage.&lt;/p&gt;

&lt;figure&gt;
  &lt;img src=&#34;https://bitsandmusic.com/assets/images/discovery-app/version2.drawio.svg&#34;  /&gt;
  &lt;figcaption&gt;
      &lt;p&gt;&lt;b&gt;Previously designed microservice-based architecture (version2).&lt;/b&gt;&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt; 

&lt;h2 id=&#34;limitations&#34;&gt;Limitations&lt;/h2&gt;

&lt;p&gt;In &lt;code&gt;version2&lt;/code&gt; of the design, we broke the monolithic architecture into smaller components to simplify horizontal scaling. However, this introduced some design issues of its own:&lt;/p&gt;

&lt;h3 id=&#34;too-many-search--lookup-sources&#34;&gt;Too Many Search &amp;amp; Lookup Sources&lt;/h3&gt;

&lt;p&gt;Our search/lookup source schema looks like the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://www.postgresql.org/&#34;&gt;PostgreSQL&lt;/a&gt; for full-text search&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://redis.io/&#34;&gt;Redis&lt;/a&gt; for storing cached and app data&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/spotify/annoy&#34;&gt;Spotify ANNOY&lt;/a&gt; data structures for sparse vector search&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://radimrehurek.com/gensim/models/word2vec.html&#34;&gt;Gensim library&lt;/a&gt; for dense vector search&lt;/li&gt;
&lt;li&gt;In-memory sparse genre-vectors needed by the scoring module&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That seems a bit much. Having this many search/lookup sources makes the system harder to manage, particularly when it comes to scaling and performance tuning.&lt;/p&gt;

&lt;h3 id=&#34;suboptimal-vector-database-implementation&#34;&gt;Sub-Optimal Vector Database Implementation&lt;/h3&gt;

&lt;p&gt;Our vector search component, which uses the Spotify ANNOY and Gensim libraries and is packaged as a &lt;a href=&#34;https://fastapi.tiangolo.com/&#34;&gt;FastAPI&lt;/a&gt; service, is not well optimized.&lt;/p&gt;

&lt;p&gt;Spotify ANNOY was already a &lt;a href=&#34;https://ann-benchmarks.com/glove-100-angular_10_angular.html&#34;&gt;mid-tier vector search library&lt;/a&gt; in terms of speed in 2023, and we &lt;a href=&#34;https://bitsandmusic.com/post/building-music-discovery-app-2/#reducing-memory-consumption&#34;&gt;used the mmap mode&lt;/a&gt; on top of that, which made searches slower still.&lt;/p&gt;

&lt;p&gt;In addition, we also have the same problem of using two different libraries for vector search where using one &lt;a href=&#34;https://www.pinecone.io/learn/vector-database/&#34;&gt;vector database&lt;/a&gt; would have sufficed.&lt;/p&gt;

&lt;h3 id=&#34;redis-overhead&#34;&gt;Redis Overhead&lt;/h3&gt;

&lt;p&gt;Redis is a very convenient caching solution, but as the application data schema becomes more complex, we start to see code snippets like this everywhere.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;all_integer_items = [int(item.decode(&#39;utf-8&#39;)) for item in all_cached_items]
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;We must decode the byte response and cast it to the required data type because &lt;a href=&#34;https://redis.io/docs/data-types/&#34;&gt;Redis stores the data as string data type&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Additionally, Redis works best with flat storage (list, set, or dictionary) and is not designed to store nested data. This makes Redis a less-than-ideal data storage (or caching) candidate as the application data becomes more complicated.&lt;/p&gt;

&lt;p&gt;Finally, making a single bulk call instead of many individual calls can make all the difference from a performance perspective, and Redis provides only limited support here: it offers &lt;a href=&#34;https://redis.io/commands/mget/&#34;&gt;bulk GET queries&lt;/a&gt; for plain key-value entries but has no equivalent for lists.&lt;/p&gt;
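
&lt;p&gt;A common workaround, which is not part of the setup described here, is to batch the individual list reads in a pipeline so that they share one network round trip. The sketch below assumes a redis-py style client object exposing &lt;code&gt;pipeline()&lt;/code&gt;, &lt;code&gt;lrange()&lt;/code&gt;, and &lt;code&gt;execute()&lt;/code&gt;; the function name &lt;code&gt;fetch_lists&lt;/code&gt; is hypothetical.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;def fetch_lists(client, keys):
    # Queue one LRANGE per key; nothing is sent until execute() is called.
    pipe = client.pipeline()
    for key in keys:
        pipe.lrange(key, 0, -1)
    # execute() sends the queued commands together and returns the replies in order.
    replies = pipe.execute()
    return dict(zip(keys, replies))
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The values that come back are still byte strings, so this reduces the number of round trips but not the casting overhead.&lt;/p&gt;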

&lt;h2 id=&#34;design-changes&#34;&gt;Design Changes&lt;/h2&gt;

&lt;p&gt;Based on the abovementioned limitations, we can make the following design changes:&lt;/p&gt;

&lt;h3 id=&#34;replacing-redis-with-mongodb&#34;&gt;Replacing Redis With MongoDB&lt;/h3&gt;

&lt;p&gt;In design versions 1 and 2, we used Redis for caching and storing the application data. This worked fine until we ran into three problems: extra serialization/deserialization code for storing and reading data, no direct support for nested data, and limited bulk query capabilities.&lt;/p&gt;

&lt;p&gt;We can solve all those problems by moving to a &lt;a href=&#34;https://aws.amazon.com/nosql/&#34;&gt;NoSQL&lt;/a&gt; solution, such as &lt;a href=&#34;https://www.mongodb.com/&#34;&gt;MongoDB&lt;/a&gt;*. By doing so, we get two advantages straight off the bat:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No more serialization/deserialization overhead.&lt;/li&gt;
&lt;li&gt;No more data storage format restrictions. We can store our data as JSON documents, with support for nesting as well.&lt;/li&gt;
&lt;/ul&gt;
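
&lt;p&gt;To make these two points concrete, here is a minimal sketch that assumes a PyMongo-style collection object with a &lt;code&gt;find()&lt;/code&gt; method. The document layout, the field names, and &lt;code&gt;fetch_entities&lt;/code&gt; are hypothetical and are not taken from the application.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# A nested document can be stored as-is; integers and lists keep their types.
example_entity = {
    &#34;_id&#34;: 101,
    &#34;name&#34;: &#34;Example Playlist&#34;,
    &#34;genres&#34;: {&#34;rock&#34;: 0.7, &#34;blues&#34;: 0.3},
    &#34;track_ids&#34;: [11, 12, 13],
}


def fetch_entities(collection, entity_ids):
    # A single query with $in retrieves every requested document at once.
    cursor = collection.find({&#34;_id&#34;: {&#34;$in&#34;: list(entity_ids)}})
    return {doc[&#34;_id&#34;]: doc for doc in cursor}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The returned documents are ordinary Python dictionaries, so no per-field casting is needed before the scoring code uses them.&lt;/p&gt;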

&lt;p&gt;We also get one more benefit by migrating to MongoDB: its default storage engine, &lt;a href=&#34;https://www.mongodb.com/docs/manual/core/wiredtiger/&#34;&gt;WiredTiger&lt;/a&gt;, lets us choose how much memory to allocate to its cache, meaning we can control the performance-memory tradeoff by controlling the amount of memory allocated to MongoDB.&lt;/p&gt;

&lt;h3 id=&#34;using-elasticsearch-for-allthingssearch&#34;&gt;Using Elasticsearch For All-Things-Search&lt;/h3&gt;

&lt;p&gt;Instead of using two separate solutions (PostgreSQL and Spotify ANNOY/Gensim) for full-text search and vector search, we can use a service that supports both, such as &lt;a href=&#34;https://www.elastic.co/elasticsearch&#34;&gt;Elasticsearch&lt;/a&gt;. Elasticsearch has existed for a long time as a distributed full-text search engine, and it has added &lt;a href=&#34;https://elastic.co/elasticsearch/vector-database&#34;&gt;vector search functionality&lt;/a&gt; over the past few years. Using it as our out-of-the-box vector database gives us a well-established, stable, and highly optimized service for our search use cases, making our application design much cleaner and more optimized.&lt;/p&gt;
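
&lt;p&gt;As a rough illustration of what the vector side of such a query can look like, the sketch below builds the body of an approximate kNN search in the form accepted by Elasticsearch 8.x. It assumes the vectors are indexed in a &lt;code&gt;dense_vector&lt;/code&gt; field; the field name &lt;code&gt;genre_vector&lt;/code&gt; and the function &lt;code&gt;build_knn_body&lt;/code&gt; are hypothetical, and the actual mapping used by the application may differ.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;def build_knn_body(query_vector, k, num_candidates):
    # The knn section asks for the k nearest neighbours of query_vector,
    # considering num_candidates candidates on each shard before keeping the best k.
    return {
        &#34;knn&#34;: {
            &#34;field&#34;: &#34;genre_vector&#34;,
            &#34;query_vector&#34;: list(query_vector),
            &#34;k&#34;: k,
            &#34;num_candidates&#34;: num_candidates,
        }
    }


body = build_knn_body([0.1, 0.0, 0.9], k=10, num_candidates=100)
print(body[&#34;knn&#34;][&#34;k&#34;])
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;A full-text request against the same cluster would use a regular text query instead, so both kinds of search are served by one service.&lt;/p&gt;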

&lt;figure&gt;
  &lt;img src=&#34;https://bitsandmusic.com/assets/images/discovery-app/version3.drawio.svg&#34;  /&gt;
  &lt;figcaption&gt;
      &lt;p&gt;&lt;b&gt;Version3 design architecture. Search source aggregation achieved by using Elasticsearch for full-text and vector search. MongoDB replaces Redis for caching and application data storage.&lt;/b&gt;&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;h2 id=&#34;search-workflow&#34;&gt;Search Workflow&lt;/h2&gt;

&lt;p&gt;The search workflow in &lt;code&gt;version3&lt;/code&gt; remains similar to that discussed in &lt;code&gt;version2&lt;/code&gt;.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The web server forwards the user query to the core discovery service.&lt;/li&gt;
&lt;li&gt;The query parser module parses the query, builds a query payload and forwards it to the caching module.&lt;/li&gt;
&lt;li&gt;The caching module checks whether the query results have already been stored in MongoDB. In case of a cache hit, steps 4–6 are skipped, and the result set is returned to the browser. In case of a cache miss, the query payload is forwarded to the Candidate Aggregation module.&lt;/li&gt;
&lt;li&gt;The candidate aggregation module sends the query to Elasticsearch for vector search, which returns 10k candidates.&lt;/li&gt;
&lt;li&gt;The scoring module scores the candidates with respect to the query.&lt;/li&gt;
&lt;li&gt;The filtering module removes duplicate candidates with regard to the item name and primary artist composition.&lt;/li&gt;
&lt;li&gt;The detail population module finally populates the result set candidates.&lt;/li&gt;
&lt;/ol&gt;
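
&lt;p&gt;The control flow of these steps can be sketched as follows. This is only an outline under stated assumptions: every function passed in is a hypothetical stand-in for the corresponding module, the cache is modelled as a plain dictionary rather than MongoDB, and the cache is assumed to hold the filtered candidates so that a hit skips steps 4 to 6 only.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;def run_search(query, cache, parse, aggregate, score, dedupe, populate):
    payload = parse(query)                    # step 2: build the query payload
    candidates = cache.get(payload)           # step 3: cache lookup
    if candidates is None:
        candidates = aggregate(payload)       # step 4: fetch vector search candidates
        scored = score(payload, candidates)   # step 5: score against the query
        candidates = dedupe(scored)           # step 6: drop duplicates
        cache[payload] = candidates           # keep the result for identical queries
    return populate(candidates)               # step 7: add item details


# Toy stand-ins, just to show the call shape.
cache = {}
result = run_search(
    &#34;Rock Blues&#34;,
    cache,
    parse=lambda q: q.lower(),
    aggregate=lambda p: [3, 1, 3, 2],
    score=lambda p, c: sorted(c),
    dedupe=lambda c: list(dict.fromkeys(c)),
    populate=lambda c: [{&#34;id&#34;: i} for i in c],
)
print(result)
&lt;/code&gt;&lt;/pre&gt;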

&lt;h2 id=&#34;thoughts&#34;&gt;Thoughts&lt;/h2&gt;

&lt;h3 id=&#34;the-good&#34;&gt;The Good&lt;/h3&gt;

&lt;h4 id=&#34;search-source-aggregation&#34;&gt;Search Source Aggregation&lt;/h4&gt;

&lt;p&gt;This setup achieves search source aggregation. We replace PostgreSQL, Spotify ANNOY, and Gensim entirely with Elasticsearch, making our design much cleaner and easier to manage in terms of infrastructure, scaling, and performance tuning.&lt;/p&gt;

&lt;h4 id=&#34;mongodb-convenience&#34;&gt;MongoDB Convenience&lt;/h4&gt;

&lt;p&gt;By using MongoDB in place of Redis for caching and application data storage, we now have data stored in a JSON format that closely resembles how we use the data in our application. We no longer need to cast the data back to its intended data types, resulting in cleaner code. All of this comes with an option to specify the memory allocated to MongoDB, which lets us fit the setup to the hardware resources available.&lt;/p&gt;

&lt;h3 id=&#34;the-bad&#34;&gt;The Bad&lt;/h3&gt;

&lt;h4 id=&#34;incomplete-inmemory-cleanup&#34;&gt;Incomplete In-Memory Cleanup&lt;/h4&gt;

&lt;p&gt;The in-memory cleanup remains incomplete, with the sparse genre vectors still held in memory. We could store those in MongoDB, but even with sufficient memory allocated to MongoDB, fetching genre vectors from it would still take longer than reading them from memory.&lt;/p&gt;

&lt;h4 id=&#34;10k-scoring-time&#34;&gt;10k Scoring Time&lt;/h4&gt;

&lt;p&gt;Since the beginning, one persistent issue with the scoring module has been that it takes longer than expected to calculate scores for 10,000 candidates. This results in an overall increase in search response time.&lt;/p&gt;

&lt;h3 id=&#34;the-ugly&#34;&gt;The Ugly&lt;/h3&gt;

&lt;p&gt;While Elasticsearch beats Spotify ANNOY (mmap mode) on search performance and is more convenient because it handles both vector search and full-text search, the memory consumption in this setup went past all our self-imposed restrictions. With a total index size of around 25GB, we must allocate a whole new 64GB server for search, which amounts to roughly &lt;a href=&#34;https://instances.vantage.sh/aws/ec2/m4.4xlarge&#34;&gt;$576 monthly for running a self-hosted single instance&lt;/a&gt; of Elasticsearch. Ouch!&lt;/p&gt;

&lt;h2 id=&#34;conclusion&#34;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;We cleaned up the application design by using Elasticsearch for both full-text and vector search queries, but this came at the expense of memory consumption. We also replaced Redis with MongoDB for caching and data storage, aligning the data storage format (JSON) with the usage format. Going forward, we must reduce costs significantly while keeping the design clean. The 10k scoring time also needs to be substantially reduced. And lastly, the memory cleanup remains to be completed.&lt;/p&gt;

&lt;p&gt;Until the next iteration.&lt;/p&gt;

&lt;p&gt;&lt;sub&gt;* In place of MongoDB, we can also use &lt;a href=&#34;https://redis.com/modules/redis-json/&#34;&gt;RedisJSON&lt;/a&gt;, a Redis module that, like MongoDB, supports storing and retrieving data as JSON documents.&lt;/sub&gt;&lt;/p&gt;

&lt;hr&gt;

&lt;p&gt;&lt;em&gt;Ready to explore genre-fluid music? Visit our music discovery platform, This &amp;amp; That Music, here: &lt;a href=&#34;https://discover.thisandthatmusic.com/&#34;&gt;https://discover.thisandthatmusic.com/&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Say Hello To Microservices: Building a Music Discovery App - 2</title>
      <link>https://bitsandmusic.com/post/building-music-discovery-app-2/</link>
      <pubDate>Mon, 15 Jan 2024 00:00:00 +0000</pubDate>
      <guid>https://bitsandmusic.com/post/building-music-discovery-app-2/</guid>
      <description>&lt;p&gt;&lt;strong&gt;Disclaimer 1:&lt;/strong&gt; &lt;em&gt;This post is in continuation of the &lt;a href=&#34;https://bitsandmusic.com/post/building-music-discovery-app-1/&#34;&gt;last post&lt;/a&gt; about building a music discovery platform based on our paper: &lt;a href=&#34;https://archives.ismir.net/ismir2021/latebreaking/000054.pdf&#34;&gt;Bit Of This, Bit Of That: Revisiting Search and Discovery&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Disclaimer 2:&lt;/strong&gt; &lt;em&gt;The design choices mentioned in this article are made with a low-cost setup in mind. As a result, some of the design choices may not be the most straightforward.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Disclaimer 3:&lt;/strong&gt; &lt;em&gt;You can explore our music discovery platform, This &amp;amp; That Music, here: &lt;a href=&#34;https://discover.thisandthatmusic.com/&#34;&gt;https://discover.thisandthatmusic.com/&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;h2 id=&#34;brief-summary&#34;&gt;Brief Summary&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Genre-fluid music&lt;/strong&gt; is any musical item (a song or a playlist) that spans more than a single genre. Think &lt;a href=&#34;https://www.youtube.com/watch?v=r7qovpFAGrQ&#34;&gt;Old Town Road.&lt;/a&gt; Or &lt;a href=&#34;https://www.youtube.com/watch?v=eVTXPUF4Oz4&#34;&gt;Linkin Park&lt;/a&gt; &lt;a href=&#34;https://en.wikipedia.org/wiki/Nu_metal&#34;&gt;(Nu-Metal genre)&lt;/a&gt;. Genre-fluid music has been &lt;a href=&#34;https://www.waterandmusic.com/tracking-genre-diversity-and-fluidity-in-the-billboard-charts&#34;&gt;gaining popularity over the last few decades.&lt;/a&gt; However, the search interfaces in music apps like Spotify and Apple Music are still designed for single-genre searches. Our paper proposes a platform to discover genre-fluid music through a combination of expressive search and a user experience built around the core idea of genre-fluid search.&lt;/p&gt;

&lt;p&gt;&lt;a href=&#34;https://bitsandmusic.com/post/building-music-discovery-app-1/&#34;&gt;Part 1&lt;/a&gt; of this series outlines the initial system architecture (&lt;code&gt;version1&lt;/code&gt;) used to build this platform. In &lt;code&gt;version1&lt;/code&gt;, we use a monolithic architecture with in-memory lookup objects as data sources, &lt;a href=&#34;https://www.postgresql.org/&#34;&gt;PostgreSQL&lt;/a&gt; for keyword search, &lt;a href=&#34;https://github.com/spotify/annoy&#34;&gt;Spotify ANNOY library&lt;/a&gt; for sparse genre-vector search, &lt;a href=&#34;https://radimrehurek.com/gensim/models/word2vec.html&#34;&gt;Gensim Word2vec&lt;/a&gt; for similarity-based search, and finally, package the whole system as a Python Flask application.&lt;/p&gt;

&lt;figure&gt;
  &lt;img src=&#34;https://bitsandmusic.com/assets/images/discovery-app/version1.drawio.svg&#34;  /&gt;
  &lt;figcaption&gt;
      &lt;p&gt;&lt;b&gt;Previously designed version1 architecture. Monolithic in nature, with in-memory objects as the primary data source.&lt;/b&gt;&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt; 

&lt;h2 id=&#34;limitations&#34;&gt;Limitations&lt;/h2&gt;

&lt;p&gt;While the main strength of &lt;code&gt;version1&lt;/code&gt; design is its simplicity of implementation, there are quite a few shortcomings as well. These are as follows:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In-memory litter:&lt;/strong&gt;  Instead of aggregated data sources such as Redis or MongoDB, we have multiple in-memory objects, which give memory a disjointed look.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;High Memory Consumption:&lt;/strong&gt; Due to the in-memory objects that cannot be shared across multiple application instances, horizontal scaling becomes difficult.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Monolith Problems:&lt;/strong&gt; Setting up the whole design as a monolith makes the scaling challenges even worse.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Significant Search Time:&lt;/strong&gt; &lt;a href=&#34;https://bitsandmusic.com/post/building-music-discovery-app-1/#candidate-aggregation-module&#34;&gt;Searching the ANNOY search trees sequentially&lt;/a&gt; increases search time, worsening the user experience.&lt;/p&gt;

&lt;h2 id=&#34;design-refactoring&#34;&gt;Design Refactoring&lt;/h2&gt;

&lt;p&gt;The best way to refactor the design would be to consider the abovementioned limitations and make changes accordingly.&lt;/p&gt;

&lt;h3 id=&#34;reducing-memory-consumption&#34;&gt;Reducing Memory Consumption&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;We can start by migrating the in-memory app data (entity data) to Redis. It adds a bit of serialisation/deserialisation overhead, but on the plus side, our in-memory data can now be shared across multiple app instances.&lt;/li&gt;
&lt;li&gt;We can use the &lt;a href=&#34;https://en.wikipedia.org/wiki/Mmap&#34;&gt;mmap mode&lt;/a&gt; for Spotify ANNOY search to reduce memory consumption further. It searches the search tree without loading it into memory, reducing memory consumption. The downside of this is relatively slower (but still okay) search times.&lt;/li&gt;
&lt;/ul&gt;
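
&lt;p&gt;To illustrate what memory mapping buys us, the sketch below uses Python&#39;s standard &lt;code&gt;mmap&lt;/code&gt; module rather than ANNOY itself. It reads one fixed-size vector from a file by offset, so only the part of the file that is touched needs to be brought into memory. The file layout, the &lt;code&gt;DIM&lt;/code&gt; constant, and both function names are hypothetical.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import mmap
import struct

DIM = 4                                      # hypothetical vector length
RECORD = struct.Struct(f&#34;{DIM}f&#34;)  # one vector = DIM 32-bit floats


def write_vectors(path, vectors):
    # Store the vectors back to back as fixed-size binary records.
    with open(path, &#34;wb&#34;) as handle:
        for vector in vectors:
            handle.write(RECORD.pack(*vector))


def read_vector(path, index):
    # Map the file instead of reading it whole; the slice touches only the needed bytes.
    with open(path, &#34;rb&#34;) as handle:
        with mmap.mmap(handle.fileno(), 0, access=mmap.ACCESS_READ) as view:
            start = index * RECORD.size
            return RECORD.unpack(view[start:start + RECORD.size])
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Because the mapping is backed by the file, several processes can map the same index and let the operating system share the pages between them.&lt;/p&gt;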

&lt;h3 id=&#34;breaking-the-monolith&#34;&gt;Breaking the Monolith&lt;/h3&gt;

&lt;p&gt;As part of breaking the monolith to make scaling more manageable, we can remove the vector search component from the main application and make it its own service—our own &lt;a href=&#34;https://www.cloudflare.com/learning/ai/what-is-vector-database/&#34;&gt;vector database.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We further divide the main application into two parts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The component that contains the core search logic, which we call the &lt;strong&gt;Core Discovery Service&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;public-facing web&lt;/strong&gt; component that forwards requests to the core discovery service&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This leaves us with three separate services: web, core discovery, and vector search.&lt;/p&gt;

&lt;h3 id=&#34;upgrading-the-query-parser-module&#34;&gt;Upgrading The Query Parser Module&lt;/h3&gt;

&lt;p&gt;For the final piece of refactoring, we can upgrade the &lt;strong&gt;Query Parser Module&lt;/strong&gt; by using &lt;a href=&#34;https://www.turing.com/kb/a-comprehensive-guide-to-named-entity-recognition&#34;&gt;Named Entity Recognition&lt;/a&gt; as its core component and moving it from the JavaScript side to the core discovery service. The purpose of this module is to automatically identify and extract genres (&lt;strong&gt;Rock Blues&lt;/strong&gt; playlists), their quantifiers (Blues playlists with &lt;strong&gt;a little&lt;/strong&gt; Rock), and other related entities, such as artists, from the user-written query and pass them on to the search workflows.&lt;/p&gt;

&lt;p&gt;With this upgrade, the user no longer needs to specify a search mode for their query explicitly. This module can intelligently decide if the query is to be categorised as a keyword or genre-based search and appropriately forward the request to PostgreSQL or search-related modules.&lt;/p&gt;

&lt;figure&gt;
  &lt;img src=&#34;https://bitsandmusic.com/assets/images/discovery-app/version2.drawio.svg&#34;  /&gt;
  &lt;figcaption&gt;
      &lt;p&gt;&lt;b&gt;Version2 design architecture. Microservice-based.&lt;/b&gt;&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;h2 id=&#34;services&#34;&gt;Services&lt;/h2&gt;

&lt;h3 id=&#34;web&#34;&gt;Web&lt;/h3&gt;

&lt;p&gt;This service, packaged as a &lt;a href=&#34;https://fastapi.tiangolo.com/&#34;&gt;FastAPI&lt;/a&gt; application, is the web layer of the system. It takes in the query from the browser and forwards the requests to the core discovery service endpoints. This way, we keep our core discovery service client-agnostic. It performs the following functions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Input validation&lt;/li&gt;
&lt;li&gt;Request Authentication&lt;/li&gt;
&lt;li&gt;User input conversion to an appropriate payload as accepted by the discovery service.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We can also customise this layer to add support for multiple client-specific workflows, keeping the discovery service untouched. Also, we can horizontally scale it with ease and keep it behind a reverse proxy solution such as &lt;a href=&#34;https://www.nginx.com/&#34;&gt;NGINX&lt;/a&gt; for even better performance.&lt;/p&gt;

&lt;h3 id=&#34;core-discovery&#34;&gt;Core Discovery&lt;/h3&gt;

&lt;p&gt;This is the main application layer of the system and contains the core search logic. It is also packaged as a FastAPI application, is not public-facing, and only accepts requests from the web layer over HTTP. It has the following data sources:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Genre sparse vectors, stored &lt;strong&gt;in memory&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Redis&lt;/strong&gt; for the application data (entity lookups) needed for scoring and item detail population modules.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;PostgreSQL&lt;/strong&gt; for keyword-based searches.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This service can be scaled by using multiple &lt;a href=&#34;https://fastapi.tiangolo.com/deployment/server-workers/?h=workers#gunicorn-with-uvicorn-workers&#34;&gt;Uvicorn workers&lt;/a&gt; or &lt;a href=&#34;https://fastapi.tiangolo.com/deployment/docker/&#34;&gt;packaging it in a Docker container&lt;/a&gt; and using something like &lt;a href=&#34;https://kubernetes.io/&#34;&gt;Kubernetes&lt;/a&gt; to manage multiple Docker containers.&lt;/p&gt;

&lt;h3 id=&#34;vector-search&#34;&gt;Vector Search&lt;/h3&gt;

&lt;p&gt;This is the last layer in our system and contains the code for vector search. It can be viewed as our custom-implemented vector database. It is also packaged as a FastAPI application and accepts vector search requests from the core discovery service over HTTP.&lt;/p&gt;

&lt;p&gt;Each genre vector search spawns two processes that search the dot and angular metric trees; the service then combines their results and returns them to the core discovery service. Spawning separate processes parallelises the search workflow, cutting the search time in half. This layer can also be scaled using Docker or Uvicorn workers. The ANNOY vector trees can still be shared among multiple processes using mmap.&lt;/p&gt;
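
&lt;p&gt;The fan-out and merge described above can be sketched with &lt;code&gt;concurrent.futures&lt;/code&gt;. Several things here are assumptions: &lt;code&gt;search_dot&lt;/code&gt; and &lt;code&gt;search_angular&lt;/code&gt; are hypothetical stand-ins that return fixed item ids instead of querying ANNOY trees, the results are combined as an order-preserving union because the exact merge rule is not spelled out in this post, and the demo uses a thread pool for brevity. Passing a &lt;code&gt;ProcessPoolExecutor&lt;/code&gt; instead gives the process-based variant.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from concurrent.futures import ThreadPoolExecutor


def search_dot(query_vector):
    return [7, 3, 9]    # stand-in for the dot-product tree


def search_angular(query_vector):
    return [3, 5, 7]    # stand-in for the angular tree


def search_both(executor, query_vector):
    # Submit both searches before waiting, so that they run concurrently.
    futures = [executor.submit(fn, query_vector) for fn in (search_dot, search_angular)]
    combined = []
    for future in futures:
        for item_id in future.result():
            if item_id not in combined:
                combined.append(item_id)
    return combined


with ThreadPoolExecutor(max_workers=2) as pool:
    print(search_both(pool, [0.2, 0.8]))
&lt;/code&gt;&lt;/pre&gt;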

&lt;h2 id=&#34;search-workflow&#34;&gt;Search Workflow&lt;/h2&gt;

&lt;p&gt;The search workflow in &lt;code&gt;version2&lt;/code&gt; remains similar to the one discussed in &lt;code&gt;version1&lt;/code&gt;.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The web server forwards the user query to the core discovery service.&lt;/li&gt;
&lt;li&gt;The query parser module parses the query, builds a query payload and forwards it to the caching module.&lt;/li&gt;
&lt;li&gt;The caching module checks whether the query results have already been stored in Redis. In case of a cache hit, steps 4–6 are skipped, and the result set is returned to the browser. In case of a cache miss, the query payload is forwarded to the Candidate Aggregation module.&lt;/li&gt;
&lt;li&gt;The candidate aggregation module sends the query to the vector search service, which returns the candidates from the ANNOY search trees.&lt;/li&gt;
&lt;li&gt;The scoring module scores the candidates with respect to the query.&lt;/li&gt;
&lt;li&gt;The filtering module removes duplicate candidates with regard to the item name and primary artist composition.&lt;/li&gt;
&lt;li&gt;The detail population module finally populates the result set candidates.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id=&#34;thoughts&#34;&gt;Thoughts&lt;/h2&gt;

&lt;h3 id=&#34;the-good&#34;&gt;The Good&lt;/h3&gt;

&lt;h4 id=&#34;memory-consumption&#34;&gt;Memory Consumption&lt;/h4&gt;

&lt;p&gt;We free up over 7 GB of application memory by transferring the in-memory app data to Redis, and around 3 GB more by using mmap for ANNOY search, although the latter comes at the expense of some speed.&lt;/p&gt;

&lt;h4 id=&#34;scalability&#34;&gt;Scalability&lt;/h4&gt;

&lt;p&gt;Now that we have broken down the monolith into the web, core discovery, and vector search services, this design version, &lt;code&gt;version2&lt;/code&gt;, lends itself far better to horizontal scaling than the previous version, as we can scale the services independently.&lt;/p&gt;

&lt;h3 id=&#34;the-bad&#34;&gt;The Bad&lt;/h3&gt;

&lt;h4 id=&#34;incomplete-memory-cleanup&#34;&gt;Incomplete Memory Cleanup&lt;/h4&gt;

&lt;p&gt;Since we need the genre vectors for the main scoring module, they must be in memory for the fastest possible retrieval. So, we are still left with some lookup data structures in memory. This data, as before, cannot be shared with other instances of the core discovery service, which makes horizontal scaling suboptimal.&lt;/p&gt;

&lt;h4 id=&#34;redis-overhead&#34;&gt;Redis Overhead&lt;/h4&gt;

&lt;p&gt;Redis does make our data shareable across instances, but not without some added overhead. And as we move some of our app data from in-memory to Redis, it becomes more evident.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The first of these is the serialisation/deserialisation overhead. Compared to in-memory objects, Redis cannot store the data with the same freedom (no support for integer keys or nested objects), leading to Redis serialisation/deserialisation code all over the place.&lt;/li&gt;
&lt;li&gt;Secondly, bulk retrievals can become time-consuming compared to in-memory lookups, especially when storing lists. For example, if we want to store vectors in Redis as lists, there is no way to make bulk calls similar to &lt;a href=&#34;https://redis.io/commands/mget/&#34;&gt;MGET.&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id=&#34;suboptimal-vector-search-service&#34;&gt;Sub-Optimal Vector Search Service&lt;/h4&gt;

&lt;p&gt;The problem with our vector search service is that it is just like a vector database minus all the optimisations provided by the out-of-box solutions. Everything has plenty of scope for improvement, ranging from communication to serialisation/deserialisation protocols, from storage to search implementations.&lt;/p&gt;

&lt;h2 id=&#34;conclusion&#34;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;We broke the monolith outlined in design &lt;code&gt;version1&lt;/code&gt; into three smaller components. This new version enables smoother horizontal scaling, and the memory view is much more aggregated than in the previous version. The vector search service, however, looks as if it has been stitched together like Frankenstein&#39;s monster. The Redis overhead also needs to be addressed by replacing Redis with NoSQL storage. Another area for improvement is aggregating the search/retrieval sources, which currently include PostgreSQL, Redis, the Spotify ANNOY search trees, the Gensim word2vec indices, and the in-memory data of the core discovery service.&lt;/p&gt;

&lt;p&gt;Until the next iteration.&lt;/p&gt;

&lt;hr&gt;

&lt;p&gt;&lt;em&gt;Ready to explore genre-fluid music? Visit our music discovery platform, This &amp;amp; That Music, here: &lt;a href=&#34;https://discover.thisandthatmusic.com/&#34;&gt;https://discover.thisandthatmusic.com/&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
</description>
    </item>
    
  </channel>
</rss>
