Word2vec | Bits And Music

Next Stop, Vector Databases: Building a Music Discovery App - 3

Mon, 05 Feb 2024 00:00:00 +0000

Disclaimer 1: This is the third instalment in the How Not to Build a Music Discovery App series, based on our paper titled Bit of This, Bit Of That: Revisiting Search and Discovery. In Part 1, we present the initial monolithic version, and in Part 2, we discuss the transition to a microservice-based architecture.

Disclaimer 2: The design choices mentioned in this article are made with a low-cost setup in mind. As a result, some of the design choices may not be the most straightforward.

Disclaimer 3: You can explore our music discovery platform, This & That Music, here: https://discover.thisandthatmusic.com/.

Brief Summary

Genre-fluid music is any musical item (a song or a playlist) that spans more than one genre. Think Stressed Out by Twenty One Pilots. Or Peaches En Regalia by Frank Zappa. Genre-fluid music has been gaining popularity over the last few decades. However, the search interfaces in music apps like Spotify and Apple Music are still designed for single-genre searches. Our paper proposes a platform to discover genre-fluid music through a combination of expressive search and a user experience created around the core idea of genre-fluid search.

Part 1 of this series outlines the initial monolithic architecture used to build this platform. In Part 2, we break the monolith into three components: web server, discovery engine, and vector search server, using PostgreSQL for keyword search, Spotify ANNOY library for sparse genre-vector search, Gensim Word2vec for similarity-based search, and Redis as the cache storage.

Previously designed microservice-based architecture (version2).

Limitations

In design version2, we broke the monolithic architecture into smaller components to simplify horizontal scaling. However, this came with some design issues of its own. These are as follows:

Too Many Search & Lookup Sources

Our search/lookup source schema looks like the following:

PostgreSQL for full-text search
Redis for storing cached and app data
Spotify ANNOY data structures for sparse vector search
Gensim library for dense vector search.
In-memory sparse genre-vectors needed by the scoring module

That seems a bit much. Having this many search/lookup sources makes the system harder to manage, scale, and tune for performance.

Sub-Optimal Vector Database Implementation

Our vector search component, which uses the Spotify ANNOY library and the Gensim library for vector search, packaged as a FastAPI service, is not the most optimized vector-search component.

Spotify ANNOY was already a mid-tier vector search library in terms of speed in 2023, and we ran it in mmap mode on top of that, which increased the search time further.

In addition, we also have the same problem of using two different libraries for vector search where using one vector database would have sufficed.

Redis Overhead

Redis is a very convenient caching solution, but as the application data schema becomes more complex, we start to see code snippets like this everywhere.

all_integer_items = [int(item.decode('utf-8')) for item in all_cached_items]

We must cast the byte response to the required data type because Redis stores the data as strings.

Additionally, Redis works best with flat storage (list, set, or dictionary) and is not designed to store nested data. This makes Redis a less-than-ideal data storage (or caching) candidate as the application data becomes more complicated.
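To make this overhead concrete, here is a minimal sketch that uses a plain dictionary of bytes as a stand-in for Redis. No Redis client is involved, and `fake_redis`, `cache_set`, and `cache_get` are hypothetical names. It shows the extra JSON encode/decode step that a nested value needs when the store only holds flat strings.

```python
import json

# Stand-in for a Redis string store: keys and values are raw bytes.
fake_redis = {}

def cache_set(key, value):
    """Serialize a nested Python object to JSON bytes before storing it."""
    fake_redis[key.encode("utf-8")] = json.dumps(value).encode("utf-8")

def cache_get(key):
    """Fetch raw bytes and decode them back into a Python object."""
    raw = fake_redis.get(key.encode("utf-8"))
    return None if raw is None else json.loads(raw.decode("utf-8"))

# A nested result set (made-up values) that cannot be stored as a flat list.
result = {"query": "rock playlists with blues", "items": [{"id": 42, "score": 0.93}]}
cache_set("q:rock+blues", result)
restored = cache_get("q:rock+blues")
```

Every read and write passes through the two helper functions, which is the kind of boilerplate that spreads through the codebase as the data schema grows.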

Finally, since we know that making multiple calls vs making a single bulk call can make all the difference from a performance perspective, Redis also provides limited support in that context by only supporting bulk GET queries for dictionaries and not lists.

Design Changes

Based on the abovementioned limitations, we can make the following design changes:

Replacing Redis With MongoDB

In the design versions 1 and 2, we used Redis for caching and storing the application data. This worked fine until we ran into the problems of having to write additional serialization/deserialization code for storing and using the retrieved data from Redis, having no direct support for storing nested data, and limited bulk query capabilities.

We can solve all those problems by moving to a NoSQL solution, such as MongoDB*. By doing so, we get two advantages straight off the bat:

No more serialization/deserialization overhead.
No more data storage format restrictions. We can store our data as JSONs with support for nesting as well.

We also get one more benefit by migrating to MongoDB: with its WiredTiger storage engine, we can choose how much memory is allocated to its cache, meaning we can control the performance-memory tradeoff by controlling the amount of memory allocated to MongoDB.

Using Elasticsearch For All-Things-Search

Instead of using two separate solutions (PostgreSQL and Spotify ANNOY/Gensim) for full-text search and vector search, we can use a service which supports both, such as Elasticsearch. Elasticsearch has existed for a long time as a distributed full-text search engine, and it has added vector search functionality over the past few years. Using it as our out-of-the-box vector database gives us a well-established, stable, and highly optimized service for our search use cases, making our application design much cleaner and more optimized.

Version3 design architecture. Search source aggregation achieved by using Elasticsearch for full-text and vector search. MongoDB replaces Redis for caching and application data storage.

Search Workflow

The search workflow in version3 remains similar to that discussed in version2.

The web server forwards the user query to the core discovery service.
The query parser module parses the query, builds a query payload and forwards it to the caching module.
The caching module checks whether the query results have already been stored in MongoDB. In case of a cache hit, steps 4–6 are skipped, and the result set is returned to the browser. In case of a cache miss, the query payload is forwarded to the Candidate Aggregation module.
The candidate aggregation module sends the query to Elasticsearch for vector search, which returns 10k candidates.
The scoring module scores the candidates with respect to the query.
The filtering module removes duplicate candidates with regard to the item name and primary artist composition.
The detail population module finally populates the result set candidates.
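The workflow above can be sketched as a single function. This is a simplified illustration, not the platform's code: a plain dictionary stands in for MongoDB, and the `aggregate`, `score`, `dedupe`, and `populate` callables are hypothetical placeholders for the corresponding modules.

```python
def search(query, cache, aggregate, score, dedupe, populate):
    """Run the workflow; `cache` is any dict-like store standing in for MongoDB."""
    key = query.strip().lower()
    ranked_ids = cache.get(key)
    if ranked_ids is None:  # cache miss: run aggregation, scoring, filtering
        candidates = aggregate(key)
        scored = sorted(candidates, key=score, reverse=True)
        ranked_ids = dedupe(scored)
        cache[key] = ranked_ids
    return populate(ranked_ids)  # detail population runs on hit and miss alike

# Toy stand-ins so the flow can be exercised end to end.
calls = []

def toy_aggregate(q):
    calls.append(q)          # record how often the expensive path runs
    return [3, 1, 2, 2]

cache = {}
modules = (
    cache,
    toy_aggregate,
    lambda candidate: candidate,                   # score: the id itself
    lambda ids: list(dict.fromkeys(ids)),          # dedupe, keeping order
    lambda ids: [{"id": i} for i in ids],          # populate details
)
first = search("Rock playlists with blues", *modules)
second = search("rock playlists with blues ", *modules)
```

The second call normalizes to the same key, so it is served from the cache and `toy_aggregate` runs only once.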

Thoughts

The Good

Search Source Aggregation

What this setup succeeds in achieving is search source aggregation. We replace PostgreSQL, Spotify ANNOY, and Gensim entirely with Elasticsearch, making our design much cleaner and easier to manage in terms of infrastructure management, scaling, and performance tuning.

MongoDB Convenience

By using MongoDB in place of Redis for caching purposes and application data storage, we now have data stored in a JSON format that closely resembles how we use the data in our application. We also no longer need to cast the data back to their intended data types, resulting in cleaner code. All of this comes with an option to specify the memory allocated to MongoDB, thus making this setup suitable for the hardware resources available.

The Bad

Incomplete In-Memory Cleanup

The in-memory cleanup remains incomplete, with the sparse genre vectors still held in memory. We could store those in MongoDB, but even with sufficient memory allocated to it, fetching genre vectors from MongoDB would still take longer than reading them from in-process memory.

10k Scoring Time

Since the beginning, one persistent issue with the scoring module has been that it takes longer than expected to calculate scores for 10,000 candidates. This results in an overall increase in search response time.

The Ugly

While Elasticsearch is better than Spotify ANNOY (mmap mode) in terms of search performance and the convenience of handling both vector search and full-text search, the memory consumption in this setup went past all our self-imposed restrictions. With a total index size of around 25GB, we must allocate a whole new 64GB server for search, which amounts to roughly $576 monthly for running a self-hosted single instance of Elasticsearch. Ouch!

Conclusion

We cleaned up the application design by using Elasticsearch for both full-text and vector search queries, but this came at the expense of memory consumption. We also replaced Redis with MongoDB for caching and data storage purposes, thus aligning the data storage format (JSON) with the usage format. As we advance, we must reduce costs significantly while keeping the design clean. The time taken to score 10k candidates also needs to be substantially reduced. And lastly, the in-memory cleanup remains to be completed.

Until the next iteration.

* In place of MongoDB, we can also use RedisJSON, a document-based database similar to MongoDB that supports data storage and retrieval as JSON.

Ready to explore genre-fluid music? Visit our music discovery platform, This & That Music, here: https://discover.thisandthatmusic.com/

Of Monoliths & In-Memory Litter: Building A Music Discovery App - 1

Wed, 10 Jan 2024 00:00:00 +0000

Disclaimer 1: This post is the first post in the series about building a music discovery platform based on our paper: Bit Of This, Bit Of That: Revisiting Search and Discovery.

Disclaimer 2: The design choices mentioned in this article are made keeping in mind a low-cost setup. As a result, some of the design choices may not be the most straightforward ones.

Disclaimer 3: You can explore our music discovery platform, This & That Music, here: https://discover.thisandthatmusic.com/.

The term genre-fluid music sounds so fancy and sophisticated and something you aren't supposed to know, yet you do. Even if you didn't realize you did, you do. After all, it's all around us. Think Old Town Road by Lil Nas X. How would you describe its genre? Is it country, or is it hip-hop? Or is it both? And that's what genre fluidity is: any musical item consisting of more than a single genre.

Over the past few decades, genre fluidity has been gaining prominence in the Billboard charts. However, search interfaces have yet to keep up with this trend and continue to be designed for single-genre searches, while genre fluidity is served mostly via recommendations and curated playlists. This may soon change as recent advancements in AI, specifically Large Language Models (LLMs), offer more expansive and expressive access to content.

So why are we talking about genre fluidity and old-town roads? A few years ago, we published a paper proposing a platform to discover genre-fluid music through a combination of expressive search and a user experience created around the core idea of genre-fluid search. This post explains the initial architecture we used to build that discovery system and the lessons we learned.

Search

Our proposed system enables users to search for music by combining different genres and specifying the proportion of each genre they want in the results. We call it genre-fluid search. Adopting the design principles of existing apps, we keep the existing keyword search ("Adele playlists") along with our proposed search to keep the cognitive load as low as possible.

System search modes: Keyword search mode is the conventional search mode, while genre search mode is the proposed one.

Data Summary

Before going into the depth of the architecture, here is a summary of the data used for creating the system:

There are a total of 88 genres supported by the system.
There are three entity types: artists, playlists, and tracks.
The system has around one million playlists, seven million tracks, and a million artists.
Each entity has demographic data, such as name, popularity, preview audio link, and cover image.

Embeddings

We use neural embeddings for our search and similarity-based recommendation purposes. There are two types of embeddings in our system:

Genre-based embeddings
Similarity-based embeddings

To get the genre-based embedding for each entity, we annotate entities with their associated genres so each entity can be represented as an 88-length sparse vector. This vector consists of real values, meaning the nonzero values range between 0 and 1. In addition to entity-genre annotation, we make use of genre relationships and have those relationships reflected in the entity genre annotation.
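As a rough illustration of the genre-based representation, the sketch below builds such a vector from a handful of annotations. The four-genre vocabulary, the `PARENT` mapping, and the `PARENT_SHARE` factor are all hypothetical; the post does not spell out how genre relationships are applied, so propagating half of a sub-genre's weight to its parent is only an assumption.

```python
GENRES = ["rock", "soft rock", "country", "hip hop"]  # the real system has 88
PARENT = {"soft rock": "rock"}                        # hypothetical relationship
PARENT_SHARE = 0.5                                    # hypothetical propagation weight

def genre_vector(annotations):
    """Map {genre: weight in (0, 1]} to a fixed-length list with zeros elsewhere."""
    vec = [0.0] * len(GENRES)
    for genre, weight in annotations.items():
        i = GENRES.index(genre)
        vec[i] = max(vec[i], weight)
        parent = PARENT.get(genre)
        if parent is not None:
            # Reflect the genre relationship by giving the parent a share.
            j = GENRES.index(parent)
            vec[j] = max(vec[j], weight * PARENT_SHARE)
    return vec

track_vec = genre_vector({"soft rock": 0.8, "country": 0.2})
# track_vec is [0.4, 0.8, 0.2, 0.0]
```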

Similarity-based embeddings are obtained by training word2vec models on our corpus, treating playlists as sentences and tracks (and artists) as words. These embeddings are 300-length vectors, with values ranging between 0 and 1. Finally, to perform Approximate Nearest Neighbor search on our embeddings, we use the Spotify ANNOY library and store the indexes for each entity on disk.

Data Structures And Storage

The main consideration for our search workflow is retrieval as fast as possible, which means it should be memory-based instead of disk or mmap-based. We use the Python pickle module to store our entity details objects as a dictionary, with entity-id being our key and entity details stored in an array ([name, preview audio link, cover image link]). Popularity lookups are stored in the same format as well. We store the sparse genre embeddings as H5 files. Word2vec and Spotify ANNOY data structures are saved as their corresponding supported files.
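The entity details lookup described above might look like the following sketch. The entity id and the `example.com` links are placeholders, and `pickle.dumps`/`pickle.loads` are used here in place of writing to a file so that the example stays self-contained.

```python
import pickle

# entity-id -> [name, preview audio link, cover image link]
entity_details = {
    "track:1": [
        "Old Town Road",
        "https://example.com/preview/1",   # placeholder link
        "https://example.com/cover/1",     # placeholder link
    ],
}

# Serialize to bytes (pickle.dump would write the same bytes to a file).
blob = pickle.dumps(entity_details, protocol=pickle.HIGHEST_PROTOCOL)

# At application start-up, the whole dictionary is loaded back into memory.
loaded = pickle.loads(blob)
name, preview_url, cover_url = loaded["track:1"]
```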

System Modules

The main components of this system, packaged as a Python Flask application, are:

Query Parser Module
Candidate Aggregation Module
Scoring Module
Filtering Module
Detail Population Module
Caching Module

Query Parser Module

This module converts the search query typed in natural language ("rock playlists with some country") into an 88-length genre vector. We apply the genre-relationship rules mentioned in the embeddings section to the query vector.
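A toy version of such a parser is sketched below. The four-genre vocabulary and the `QUANTIFIERS` weights are hypothetical; the post does not describe how words like "some" are mapped to proportions, and the genre-relationship rules are left out here.

```python
import re

GENRES = ["rock", "country", "blues", "metal"]              # the real system has 88
QUANTIFIERS = {"some": 0.5, "little": 0.25, "mostly": 1.0}  # hypothetical weights

def parse_query(text):
    """Return a genre vector; a quantifier scales the genre that follows it."""
    vec = [0.0] * len(GENRES)
    weight = 1.0
    for token in re.findall(r"[a-z]+", text.lower()):
        if token in QUANTIFIERS:
            weight = QUANTIFIERS[token]
        elif token in GENRES:
            vec[GENRES.index(token)] = weight
            weight = 1.0  # a quantifier applies only to the next genre
    return vec

query_vec = parse_query("rock playlists with some country")
# query_vec is [1.0, 0.5, 0.0, 0.0]
```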

Candidate Aggregation Module

Even though we create a tree data structure built with binary space partitioning using the Spotify ANNOY library from our genre embeddings, we use our scoring module instead of the conventional query-to-candidate distance as the final scoring metric. We create two indexes per entity type for genre embeddings to get the maximum possible candidates for a given query: one using the Angular distance and another using the Dot Product distance. We sequentially query the indexes and retrieve N=10,000 candidates from the indexes using the candidate aggregator module.
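The idea of querying two indexes and merging their candidates can be illustrated with a brute-force stand-in. This sketch does not use ANNOY; it ranks by cosine similarity (which gives the same ordering as angular distance) and by dot product, and all function names are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    norm = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / norm if norm else 0.0

def top_n(items, query, similarity, n):
    """Brute-force stand-in for one index: ids of the n most similar items."""
    ranked = sorted(items, key=lambda item_id: similarity(items[item_id], query),
                    reverse=True)
    return ranked[:n]

def aggregate_candidates(items, query, n):
    """Query both 'indexes' in sequence and merge ids, keeping first-seen order."""
    merged = top_n(items, query, cosine, n) + top_n(items, query, dot, n)
    return list(dict.fromkeys(merged))

items = {"a": [0.2, 0.1], "b": [1.0, 0.9], "c": [0.0, 0.6]}
candidates = aggregate_candidates(items, [1.0, 0.5], n=1)
# "a" has the same direction as the query; "b" has the largest dot product.
```

The two metrics surface different items here, which is the motivation for combining them before handing candidates to the scoring module.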

Scoring Module

We use our custom scoring function to calculate a score for each of those 10,000 candidates, where we consider things such as the candidate values corresponding to each query genre, missing genres in a candidate, additional candidate genres not specified in the query, and candidate popularity.
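The actual scoring function is not given in this post, so the following is only a hypothetical sketch of how the four listed factors could be combined. The penalty values and the popularity weight are made up for illustration.

```python
def score_candidate(query, candidate, popularity,
                    missing_penalty=0.5, extra_penalty=0.25, popularity_weight=0.1):
    """Toy score over equal-length genre vectors; popularity is assumed in [0, 1]."""
    score = 0.0
    for q, c in zip(query, candidate):
        if q > 0 and c > 0:
            score += 1.0 - abs(q - c)      # reward closeness on requested genres
        elif q > 0:
            score -= missing_penalty       # requested genre absent from candidate
        elif c > 0:
            score -= extra_penalty * c     # genre present but not requested
    return score + popularity_weight * popularity

query = [1.0, 0.5, 0.0]
exact = score_candidate(query, [1.0, 0.5, 0.0], popularity=0.0)
missing = score_candidate(query, [1.0, 0.0, 0.0], popularity=0.0)
extra = score_candidate(query, [1.0, 0.5, 1.0], popularity=0.0)
```

With these made-up weights, an exact match scores highest, an extra unrequested genre costs a little, and a missing requested genre costs the most.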

Filtering Module

Since our corpus contains user-created playlists, it is possible to have multiple playlists with the same name. Also, many playlists for a given search query can mainly comprise the same artist. To avoid returning duplicate results in their name or primary artist composition, we have a separate module to filter out such items and finally return K=300 result items sorted by their score.
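A minimal sketch of this filtering step follows. It assumes each scored item carries a `name`, a single `primary_artist`, and a `score`; these field names, and the rule of keeping one item per name and per primary artist, are simplifying assumptions rather than the platform's exact logic.

```python
def filter_results(scored, k):
    """Keep the best-scoring item per name and per primary artist, then take top k."""
    seen_names, seen_artists, kept = set(), set(), []
    for item in sorted(scored, key=lambda it: it["score"], reverse=True):
        name = item["name"].strip().lower()
        if name in seen_names or item["primary_artist"] in seen_artists:
            continue  # duplicate by name or by primary artist
        seen_names.add(name)
        seen_artists.add(item["primary_artist"])
        kept.append(item)
    return kept[:k]

scored = [
    {"name": "Rock Classics", "primary_artist": "Queen", "score": 0.9},
    {"name": "rock classics", "primary_artist": "AC/DC", "score": 0.8},
    {"name": "Best of Queen", "primary_artist": "Queen", "score": 0.7},
    {"name": "Country Roads", "primary_artist": "John Denver", "score": 0.6},
]
top = filter_results(scored, k=300)
```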

Detail Population Module

This is where we get details for each result item and prepare the final paginated result set to be sent back to the client.

Caching Module

Since our system serves results without considering user personalization, each query should get the same response. With this in mind, we can cache our query results using Redis, as it is the simplest and the most common solution. We serialize each query as a string and store the indexes returned by the filtering module as the value for each query. Only the query and detail population modules get called for repeated searches, bypassing the rest.
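One way to serialize a query into a cache key is sketched below. The key layout (`entity_type|index:value,...`) and the rounding precision are assumptions made for illustration; any deterministic string form of the query would serve the same purpose.

```python
def cache_key(entity_type, query_vector, precision=2):
    """Build a deterministic string key from the entity type and nonzero genre slots."""
    parts = [f"{i}:{round(v, precision)}" for i, v in enumerate(query_vector) if v]
    return f"{entity_type}|" + ",".join(parts)

key = cache_key("playlist", [1.0, 0.0, 0.5])
# key is "playlist|0:1.0,2:0.5"
```

Because the response does not depend on the user, two identical queries always map to the same key and can share one cached result.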

Overall System Design

System Design Architecture `version1`. Monolithic. Simple.

Here's how the application design looks, zoomed out. The query "Rock playlists with blues" is first entered in a search box in a web application.

The query parser module parses this text and converts it into an 88-length vector.
The query vector is sent to the server, where the caching module checks whether the query results have already been stored in Redis. In case of a cache hit, steps 3-5 are skipped, and the result set is returned to the browser. In case of a cache miss, the query vector is forwarded to the Candidate Aggregator module.
The candidate aggregator module uses this query vector to query the Spotify ANNOY indices, retrieves 10,000 candidates, and passes them on to the scorer module.
The scorer module runs our custom scoring function to calculate scores for each of those 10000 candidates.
The filtering module removes duplicate candidates by name and primary artist. If the resulting set has items under a certain threshold K, it adds the items removed in the filtering stage.
Finally, the demographic data is retrieved for the first k=30 candidates to be returned to the browser.
Candidates calculated are stored in Redis corresponding to the search query.
In order to support existing keyword-based searches, we use the PostgreSQL database as it supports full-text search from disk with a decent performance.

Thoughts

The Good

The best thing about this setup is its simplicity. Its monolithic design is as simple as possible, adding little coding complexity overhead. The straightforward nature of this design makes it an ideal choice for prototyping. The design is also modular, making it easier to move away from monolithic architecture in the future. Lastly, using PostgreSQL over other full-text solutions, such as Elasticsearch, works fine without needing the memory most memory-based full-text search solutions need.

The Bad

In-Memory Litter

Using the pickle module for serializing data to disk is excellent for entry-level applications and prototypes. Pickle files are even more efficient in data storage than other candidates, such as JSON. However, there are better options than pickle when it comes to interoperability, with JSON being one of them. Additionally, moving pickled data across different Python versions can also lead to problems.

Another problem that having in-memory objects creates is high memory consumption. While data access is high-speed and straightforward, it also makes deploying the application more complex, as the memory needed for a single instance of the application equals the size of these in-memory objects. This data tied to an instance's memory cannot be shared with other application instances.

Search Time

Owing to our sequential search of the Spotify ANNOY indices and the scoring function being run over 10k candidates, the search time is well over 5 seconds for a new query, which is far above the ideal response time.

Caching At A Cost

Caching using Redis is great until it is not. It usually starts with simpler string serialized keys and values, which later becomes a whole thing of writing logic to serialize and deserialize data.

Monolith

The monolithic architecture has notable drawbacks, such as the need to manage, test, and deploy all components as one unit. Additionally, the entire system is bound to a single programming language, which may become less than ideal as the application expands and becomes more complex.

The Ugly

Deploying an application that requires 32GB of memory presents a challenge due to the complexity and high cost of horizontal scaling. For instance, the monthly cost of an AWS t4g.2xlarge instance, which could support such an application, is approximately $180.

Conclusion

The design mentioned above is a decent place to start as it lets the main application idea take focus, owing to its simplicity. Its limiting factors, however, are quite a few and critical as well. Monolith is all well and good up to a certain point, beyond which it stops making sense. In-memory objects lead to substantial memory consumption, complicating the scaling and deployment processes. As we aim to improve this design, our approach would involve reducing memory usage and transitioning from a monolithic structure to a more modular, component-based design.

Until the next design.

Ready to explore genre-fluid music? Visit our music discovery platform, This & That Music, here: https://discover.thisandthatmusic.com/

Building Music Playlists Recommendation System

Mon, 09 Sep 2019 00:00:00 +0000

Quick Summary

Goal

The goal of this work is to represent playlists in a way which captures the true essence of the playlist, i.e. information such as type, genre, variety, order, and the number of songs in the playlist, and which can be used for tasks such as playlist discovery and recommendation.

Contribution

Built a recommendation engine for playlists using sequence-2-sequence learning.
Evaluated the work using a recommendation-based evaluation task.
Assembled a dataset of 1 million Spotify playlists and 13 million tracks for this work.

Applications

Playlist Discovery/Recommendation Engine

Our system can be used for playlist discovery and recommendation. Given a query playlist, the system returns the playlists from the database which are most similar to the input playlist.

Image displaying the usage for our playlist recommendation engine. The system returns playlists most similar to the query playlist

Developer Friendly Outline

Here’s a quick outline of our proposed approach:

Download playlists data using Spotify developer API and everynoise.com.
Filter the data by removing noise (rare songs, duplicate songs, outlier sized playlists, etc.)
Train a sequence-2-sequence⁹ model over the data to learn playlist embeddings.
Annotate the data (songs and playlists) for genre information.
Evaluate the embeddings using our proposed evaluation tasks.
Build a recommendation engine by populating a KD-tree with the learned playlist embeddings, and retrieving search results by utilizing the nearest neighbor approach.

Introduction

Playlists have become a significant part of our music listening experience today. There are over three billion of these on Spotify alone¹. There are playlists for every moment, every mood, every season, and so on. With millions of songs at their fingertips, users today have grown accustomed² to:

Immediate attainment of their music demands.
An extended experience.

While recommendation engines service the first aspect, playlists handle the second aspect of this changing behavior, making playlist recommendation extremely important, both for the users and music companies.

What is a Playlist, and Why Should I Care?

“A playlist is a set of songs supposed to be listened together, usually in an explicit order.” ³

Playlists are extremely important today from the perspective of both users and music researchers. From the user perspective, playlists are an effective way to discover new music and artists. From the researcher perspective, it is important to understand that music is consumed through listening and playlists formalize that listening experience³. Playlists are a unit component which can be discovered and recommended, just like artists, songs and albums.

So What’s the problem?

As mentioned before, owing to the meteoric rise in the usage of playlists, playlist recommendation is crucial to music services today. However, over the past couple of years, from a research perspective, playlist recommendation has become analogous to playlist prediction/creation⁷ ⁸ and continuation⁵ ⁶ rather than playlist discovery. Yet playlist discovery forms a significant part of the overall playlist recommendation pipeline, as it is an effective way to help users discover existing playlists on the platform.

Discovery is the name of the Game

Our work aims to represent playlists in a way which can be used to discover and recommend existing playlists. We use sequence-to-sequence learning⁹ to learn embeddings for playlists that capture their semantic meaning without any supervision. These fixed-length embeddings can then be used for recommendation purposes.

Intuition Behind the Approach

Why Sequence-to-sequence Learning?

The primary intuition behind choosing sequence-to-sequence learning is that playlists can be interpreted just as sentences, and songs as words in a sentence. In the past few years, sequence-to-sequence learning has been widely used to learn effective sentence embeddings in applications like neural machine translation¹⁰. We make use of the relationship playlist:songs:: sentences:words, and take inspiration from research in the field of natural language processing to model playlist embeddings the way sentences are embedded.

Sequence-to-Sequence Learning

The name sequence-to-sequence learning implies, at its very core, that the network is trained to take in sequences and output sequences. So instead of predicting a single word, the network outputs an entire sentence, which could be a translation into a foreign language, the next predicted sentence from the corpus, or even the same sentence if the network is trained like an autoencoder.

For this work, we use the seq2seq framework as an autoencoder, where the task of the network is to reconstruct the input playlist and, in doing so, learn a compact representation of the input playlist that captures its properties.

The overall concept of using seq2seq network like an autoencoder.

Seq2seq Models

Because playlists are relatively long (50–1000 songs), we use the attention technique in our seq2seq models so that the learned playlist embeddings capture long-term dependencies between the songs in a playlist. We experiment with two variants of seq2seq models:

Unidirectional seq2seq networks
Bidirectional seq2seq networks

A bidirectional seq2seq network is different from the unidirectional variant in the sense that a bidirectional RNN is used, meaning the hidden state is the concatenation of a forward RNN and a backward RNN that read the sequences in two opposite directions. This allows the network to capture more contextual information for the decoder to predict the output symbol.

Data: Curation, Filtering, and Annotation

Need

For this work, we need a list of playlists (sentences), with each playlist consisting of a list of songs (words). To solve this problem in its simplest form, we need just the playlist IDs and the song IDs mapped to the appropriate playlist IDs.

Dataset Creation

We download the data using the Spotify Web API. To collect a big enough set of terms to query the Spotify system, we use everynoise.com. This interactive website contains a list of some 2600+ genres, graphed out according to their relationship with each other, along with an audio example for each genre. We parse the data from the home page of this website and get the list of all the genres. Then for each genre, we download playlists (along with the corresponding song information) using the Spotify Web API. The whole flow is shown in Fig 1.

Fig 1: Data download workflow

Here are the downloaded data details:

1 Million Playlists
3 Million Artists
13 Million Tracks
3 Million Unique Tracks
3 Million Albums
2680 Genres

Data Filtering

We follow [1] in cleaning up the data by removing rare tracks and outlier-sized playlists (having fewer than 10 or more than 5000 songs). This leaves us with 755k unique playlists and 2.4 million unique tracks.
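The clean-up step can be sketched as follows. The rare-track threshold (`min_track_count`) and the order of the two filters are assumptions, since the post only names the playlist length bounds.

```python
from collections import Counter

def filter_playlists(playlists, min_track_count=2, min_len=10, max_len=5000):
    """Drop rare tracks, then drop playlists whose length is outside [min_len, max_len]."""
    # Count in how many playlists each track appears.
    counts = Counter(track for tracks in playlists.values() for track in set(tracks))
    cleaned = {}
    for pid, tracks in playlists.items():
        kept = [t for t in tracks if counts[t] >= min_track_count]
        if min_len <= len(kept) <= max_len:
            cleaned[pid] = kept
    return cleaned

# Tiny toy corpus; min_len is lowered so the example stays readable.
playlists = {"p1": ["a", "b", "c"], "p2": ["a", "b"], "p3": ["c", "d"]}
cleaned = filter_playlists(playlists, min_len=2)
```

Here track `d` appears in only one playlist and is removed, which leaves `p3` with a single song, so `p3` is dropped as an outlier.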

Annotation

Image of a subset of genres taken from everynoise.com

Problem: Genres, Genres Everywhere

Although this step is not directly needed for training, it is crucial for the evaluation phase. This step aims to label the playlists with their appropriate genre. There are certain problems with the information available so far:

Spotify provides genre information for the artist, but not the song. Labeling the song genre same as the artist genre would not be entirely correct as an artist can have songs of different genres.
Assigning the songs the same genre as that of the playlist, which in turn is derived from the query term for which it was fetched, would be problematic. The issue in going this route is the subjectivity associated with it. Provided this fine-grained annotation, what would be the difference between soft rock, 80’s rock, classic rock, and rock from the perspective of classification?

Hence, we need to bring down the number of genres (output labels) from 2680 to a more manageable number.

Proposed Solution

Data annotation workflow

To solve for this, we train a word2vec¹¹ model on our corpus to get song embeddings, which capture the semantic characteristics (such as genre) of the songs by their co-occurrence in the corpus.
The resulting song embeddings are then clustered into 200 clusters (an arbitrarily chosen number that attempts to balance the feasibility of the annotation process against the size of the formed clusters; smaller clusters and less annotation time are desired).
For each cluster:

• Artist genre is applied to each corresponding song and a genre-frequency (count) dictionary is created. A sample genre-count dictionary for a cluster with 17 songs would look like {rock: 5, indie-rock: 3, blues: 2, soft-rock: 7}

• From this dictionary, the genre having a clear majority is assigned as the genre for all the songs in that cluster.

• All the songs in a cluster with no clear genre majority are discarded for annotation.

Based on the observed genre distribution in the data, and as a result of clustering sub-genres (such as soft-rock) into parent genres (such as rock), the genres finally chosen for annotating the clusters are:

Rock, Metal, Blues, Country, Reggae, Latin, Electronic, Hip Hop, Classical.
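The cluster labeling rule can be sketched with the sample dictionary from above. The `PARENT` mapping and the interpretation of "clear majority" as more than half of the songs are assumptions, not details stated in the post.

```python
from collections import Counter

PARENT = {"indie-rock": "rock", "soft-rock": "rock"}  # hypothetical sub-genre mapping

def cluster_genre(genre_counts, majority=0.5):
    """Return the parent genre holding more than `majority` of the songs, else None."""
    merged = Counter()
    for genre, count in genre_counts.items():
        merged[PARENT.get(genre, genre)] += count
    genre, count = merged.most_common(1)[0]
    return genre if count / sum(merged.values()) > majority else None

label = cluster_genre({"rock": 5, "indie-rock": 3, "blues": 2, "soft-rock": 7})
# After merging sub-genres, rock covers 15 of 17 songs, so label is "rock".
```

A cluster for which the function returns `None` has no clear majority and would be discarded for annotation.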

To validate our approach, we train a classifier on our dataset consisting of annotated song embeddings. With training and test set kept separate at the time of training, we achieve a 94% test accuracy.

t-SNE plot for genre-annotated songs, with 1000 sampled songs for each genre

For playlist-genre annotation, only the playlists in which every song is annotated are considered. Further, a playlist is assigned a genre only if more than 70% of its songs agree on that genre.
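These two rules translate into a short function. The names `playlist_genre` and `track_genre` are hypothetical; the 70% threshold comes from the text above.

```python
from collections import Counter

def playlist_genre(tracks, track_genre, agreement=0.7):
    """Return the playlist genre, or None if the playlist cannot be annotated."""
    if not tracks or any(t not in track_genre for t in tracks):
        return None  # every song must already carry a genre label
    genre, count = Counter(track_genre[t] for t in tracks).most_common(1)[0]
    return genre if count / len(tracks) > agreement else None

track_genre = {"t1": "Rock", "t2": "Rock", "t3": "Rock", "t4": "Blues"}
mostly_rock = playlist_genre(["t1", "t2", "t3", "t4"], track_genre)  # 75% Rock
split = playlist_genre(["t1", "t4"], track_genre)                    # 50% each
unlabeled = playlist_genre(["t1", "t9"], track_genre)                # t9 has no label
```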

Evaluation

Since the aim of our work is to learn playlist embeddings which can be used for recommendation, we evaluate the quality of embeddings using a recommendation task.

Recommendation Task

Recommendation, being inherently subjective, is best evaluated with user-labeled data. However, in the absence of such annotated datasets, we evaluate our proposed approach by measuring the extent to which the playlist space created by the embedding models is relevant, in terms of the similarity of genre and length information of closely-lying playlists. We use the Approximate Nearest Neighbors algorithm from the Spotify ANNOY library¹³ to populate the tree structure with the playlist embeddings. A query playlist is randomly selected, and the search results are compared with the queried playlist in terms of genre and length information. There are nine possible genre labels. For comparing length, ten output classes (spanning the range {30…250}, corresponding to bins of size 20) are created. An average of 100 precision values for each query is considered.
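The precision computation for a single query can be sketched as below. Both helper names are hypothetical, and the handling of playlists whose length falls outside the stated range is not shown because the post does not describe it.

```python
def genre_precision(query_genre, neighbor_genres):
    """Fraction of retrieved playlists whose genre label matches the query's."""
    if not neighbor_genres:
        return 0.0
    return sum(g == query_genre for g in neighbor_genres) / len(neighbor_genres)

def length_bin(n_songs, low=30, width=20):
    """Index of the length class for a playlist, with bins of size 20 starting at 30."""
    return (n_songs - low) // width

p = genre_precision("Rock", ["Rock", "Rock", "Blues", "Rock"])
# Three of four neighbors share the query genre, so p is 0.75.
```

Length precision works the same way, with `length_bin` values compared in place of genre labels.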

Baseline Comparison

To evaluate the performance of our proposed technique, we need some sort of baseline performance as well. As our baseline model, we experiment with a weighted variant of Bag-of-words model¹⁴, which uses a weighted averaging scheme to get the sentence embedding vectors followed by their modification using singular-value decomposition (SVD). This method of generating sentence embeddings proves to be a stronger baseline compared to traditional averaging.

Results

The recommendation task, as shown in the figure below, offers some interesting insights into how well different models capture different characteristics. First, the high precision values demonstrate the relevance of the playlist embedding space, which is the first and foremost expectation from a recommendation system. Second, BoW models capture genre information better than seq2seq models, while seq2seq models capture length information better, demonstrating the suitability of different models for different tasks.

Applications

One of the direct applications of this work is a recommendation engine for playlists. Given a query, the system would recommend or retrieve similar playlists from the corpus. The tree data structure discussed in the Recommendation Task section can be used directly for this purpose: given a query playlist, its k-nearest neighbors are the items most similar to it, and these become the system's recommendations. A demonstration of our work can be seen in the video.

Recommendation System Demo video

And that’s that. We have presented a seq2seq-based approach for learning playlist embeddings, which can be used for tasks such as playlist discovery and recommendation. Our approach can also be extended to learn even better playlist representations by integrating content-based (lyrics, audio, etc.) song-embedding models, and to generate new playlists by using variational sequence models. The paper covers many more evaluation techniques for assessing the quality of playlist embeddings with respect to the encoded information, which are out of scope for this post. I will be discussing those in another post. Until next time!

P.S. Here’s the link to the paper.

References

[1] https://newsroom.spotify.com/2018-10-10/celebrating-a-decade-of-discovery-on-spotify/
[2] Keunwoo Choi, George Fazekas, and Mark Sandler. Towards playlist generation algorithms using RNNs trained on within-track transitions. arXiv preprint arXiv:1606.02096, 2016.
[3] Fields, Ben, and Paul Lamere. “Finding A Path Through The Jukebox: The Playlist Tutorial, ISMIR.” ISMIR, Utrecht (2010).
[4] De Mooij, A. M., and W. F. J. Verhaegh. “Learning preferences for music playlists.” Artificial Intelligence 97.1–2 (1997): 245–271.
[5] Ching-Wei Chen, Paul Lamere, Markus Schedl, and Hamed Zamani. RecSys Challenge 2018: Automatic music playlist continuation. In Proceedings of the 12th ACM Conference on Recommender Systems, pages 527–528. ACM, 2018.
[6] Maksims Volkovs, Himanshu Rai, Zhaoyue Cheng, Ga Wu, Yichao Lu, and Scott Sanner. Two-stage model for automatic playlist continuation at scale. In Proceedings of the ACM Recommender Systems Challenge 2018, page 9. ACM, 2018.
[7] Andreja Andric and Goffredo Haus. Automatic playlist generation based on tracking user’s listening habits. Multimedia Tools and Applications, 29(2):127–151, 2006.
[8] Beth Logan. Content-based playlist generation: Exploratory experiments. In ISMIR, 2002.
[9] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. Advances in Neural Information Processing Systems, pages 3104–3112, 2014.
[10] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
[11] Mikolov, Tomas, et al. “Efficient estimation of word representations in vector space.” arXiv preprint arXiv:1301.3781 (2013).
[12] Anita Shen Lillie. MusicBox: Navigating the space of your music. PhD thesis, Massachusetts Institute of Technology, 2008.
[13] Bernhardsson, E. “ANNOY: Approximate nearest neighbors in C++/Python optimized for memory usage and loading/saving to disk.” GitHub, https://github.com/spotify/annoy (2017).
[14] Arora, Sanjeev, Yingyu Liang, and Tengyu Ma. “A simple but tough-to-beat baseline for sentence embeddings.” (2016).