
Shazam Audio Recognition Design Deep Dive with Google SWE! | Systems Design Interview Question 23 

Jordan has no life
40K subscribers
4.4K views

Only noise I recognize is the sound of your girl's cheeks clappin
00:00 Introduction
00:56 Functional Requirements
01:27 Capacity Estimates
02:34 API Design
02:53 Database Schema
03:31 Architectural Overview

Science

Published: 7 Jul 2024

Comments: 22
@Ms452123 · a year ago
Probably my favorite design so far. I was thinking about Shazam just yesterday.
@theinfamous4231 · a year ago
Thank you so much for the explanation, man!!
@jordanhasnolife5163 · a year ago
Np! It's a cool algo!
@jayantprakash6425 · a year ago
Good explanation
@jh0720 · a year ago
Another banger 🤷‍♂️
@jordanhasnolife5163 · a year ago
Thanks for all the kind words Jonathan, I appreciate it :)
@almogkurtser4557 · a year ago
Great stuff! I think it's good you avoided the whole DFT thing, which just diverts the discussion into very specialized algorithmic land.

As for partitioning: while it's clearly a viable solution, I think this specific use case has a few properties that might lend themselves well to relying solely on single-leader replication instead:
1. Immutable data.
2. Writes (insertions of new songs) do not need to be reflected immediately (= we should be okay with gigantic replication lag caused by a batch job).
3. Even if all reads were on the order of 10ms, song identification might still take as much as 10 seconds, so slow DB reads that hit the disk will be negligible, I think.

So I guess both strategies are viable (we'll need some sort of replication on top of partitioning in any case for high availability), so maybe the discussion can be diverted to pros and cons in terms of cost — which leads to another nest of vipers, namely capacity estimation for how many nodes we'd need with each approach and what they cost. :(
@jordanhasnolife5163 · a year ago
You make a good point - I'd imagine the matching phase is the bottleneck, and as a result those index reads don't have to be as fast! Though based on another paper I read, it seemed best to do everything in memory, in which case I'd think partitioning is necessary.
@almogkurtser4557 · a year ago
@@jordanhasnolife5163 Could you point me to the paper that mentions the significance of keeping everything in memory?
@jordanhasnolife5163 · a year ago
@@almogkurtser4557 Just faster than on disk. By keeping things in memory you can use a hash table index as opposed to a B-Tree or LSM-tree + SSTables
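The in-memory hash-table index mentioned here can be sketched in a few lines of Python. This is an illustrative sketch, not Shazam's actual schema: it assumes each fingerprint maps to a posting list of (song_id, offset) pairs, and the names and example values are made up.

```python
# Minimal in-memory fingerprint index: a hash table mapping each
# fingerprint to every (song_id, offset_seconds) where it occurs.
from collections import defaultdict

index = defaultdict(list)

def add_fingerprint(fingerprint, song_id, offset):
    # Record that `fingerprint` occurs in `song_id` at `offset` seconds.
    index[fingerprint].append((song_id, offset))

def lookup(fingerprint):
    # O(1) average-case lookup, vs. O(log n) for a B-Tree / LSM-tree on disk.
    return index.get(fingerprint, [])

add_fingerprint(0xDEADBEEF, song_id=42, offset=12.5)
add_fingerprint(0xDEADBEEF, song_id=7, offset=90.0)
print(lookup(0xDEADBEEF))  # [(42, 12.5), (7, 90.0)]
```

A real matcher would then compare the stored offsets against the offsets in the query clip to find a consistent time alignment, which is why the posting list keeps offsets at all.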
@almogkurtser4557 · a year ago
@@jordanhasnolife5163 That's true; a B-Tree is how I imagined it working in a replication-only setup. I probably just misunderstood your previous comment, thinking you were alluding to additional challenges inherent to storing on disk. The main disadvantage is potentially needing more nodes to serve the same number of requests due to higher latency per request; the advantages are simpler routing, since we only need to know which nodes exist rather than using consistent hashing, and better availability, since every key is contained on every node. I guess it comes down to complexity vs. cost. I have no idea whether, say, 5x the B-Tree nodes would be cheaper or more expensive than R*X servers with more memory (R being the replication factor for each partition). We can make the partitioned solution cheaper (= more complex) by saying we don't need R*X servers but rather just X, each being a partition follower that keeps its partition in cold storage and only warms it into memory when the partition leader fails.
@cd-ux9ot · a year ago
What about an ML approach? Train an autoencoder model where the input data is all the songs plus random/background noise and the output is the actual song ID. Then use the model weights as embeddings and store them in a vector database. During search, create an embedding from the query audio and look up candidates in the vector database using similarity search. Then use another model to score the candidates and select the candidate with the highest score.
@jordanhasnolife5163 · a year ago
This could perhaps work; the biggest problem imo is that the clip of the song can be from any starting point. I think something like an RNN would work best here, since RNNs are good with sequences, and you could make each fingerprint an element of the sequence you're trying to classify.
@cd-ux9ot · a year ago
@@jordanhasnolife5163 An RNN is a great choice. Transformers have recently shown great performance with sequence data. Convolution layers can even work with audio data because a spectrogram is image-like.
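The similarity-search step this thread describes can be sketched without any ML framework. The embeddings below are made-up values standing in for whatever a trained model would produce; the candidate lookup is just nearest-neighbor by cosine similarity.

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_candidate(query, catalog):
    # catalog: {song_id: embedding}. Brute-force stand-in for the
    # vector database's approximate nearest-neighbor search.
    return max(catalog, key=lambda sid: cosine(query, catalog[sid]))

catalog = {"song_a": [1.0, 0.0, 0.2], "song_b": [0.1, 0.9, 0.0]}
print(top_candidate([0.9, 0.1, 0.1], catalog))  # song_a
```

A production vector database would replace the brute-force `max` with an approximate index (e.g. HNSW) to stay fast over millions of songs.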
@jordanhasnolife5163 · a year ago
@@cd-ux9ot Cool point! I will say I don't know too much about machine learning, but I'd be interested to see whether anyone has taken an inference-based approach to this problem. It may be a bit overkill, I guess, since Shazam was already working pretty well without it :)
@user-se9zv8hq9r · a year ago
based and jewfro-pilled
@sandhyaaa24 · a year ago
How singing song app created???
@AyushPanditofficial · a year ago
I'm from India. I love your videos and have watched all of them — we need more! I'm learning many things from your videos and implementing them in my web app.
@Jashan77114 · a year ago
What is sharding the indexes?
@jordanhasnolife5163 · a year ago
We use consistent hashing on the 64-bit fingerprint, which is composed of the start note, end note, and time delta. If you mean what is responsible for upholding the consistent hashing, then we could use a coordination service like ZooKeeper to do that for us.
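As a concrete illustration of that reply, here is a Python sketch of packing (start note, end note, time delta) into a 64-bit fingerprint and routing it with consistent hashing. The 16/16/32-bit field widths and the ring implementation are assumptions for illustration, not details from the video.

```python
import hashlib
from bisect import bisect_right

def make_fingerprint(start_note, end_note, dt_ms):
    # Pack the three components into one 64-bit integer
    # (assumed layout: 16 bits, 16 bits, 32 bits).
    return ((start_note & 0xFFFF) << 48) | ((end_note & 0xFFFF) << 32) | (dt_ms & 0xFFFFFFFF)

def _hash(key):
    # Stable 64-bit hash for placing keys and nodes on the ring.
    return int.from_bytes(hashlib.md5(str(key).encode()).digest()[:8], "big")

class ConsistentHashRing:
    def __init__(self, nodes, vnodes=100):
        # Each node gets `vnodes` positions on the ring to smooth the load.
        self.ring = sorted((_hash(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))
        self.keys = [h for h, _ in self.ring]

    def node_for(self, fingerprint):
        # First ring position at or after the key's hash, wrapping around.
        i = bisect_right(self.keys, _hash(fingerprint)) % len(self.ring)
        return self.ring[i][1]

ring = ConsistentHashRing(["shard-0", "shard-1", "shard-2"])
fp = make_fingerprint(60, 67, 250)
print(ring.node_for(fp))  # a stable shard assignment for this fingerprint
```

Adding or removing a shard only remaps the keys adjacent to its ring positions, which is the advantage of consistent hashing over plain `hash % N`.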
@UjjwalBhardwaj96 · 7 months ago
@@jordanhasnolife5163 You mentioned in the video that we can shard the data such that all generated hashes (fingerprints) with the same anchor point go to the same node. Can we achieve this using consistent hashing?
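One way the colocation asked about here could work (a sketch of mine, not an answer from the video): hash only the anchor field rather than the full fingerprint, so every fingerprint sharing an anchor lands on the same shard. This assumes the anchor note occupies the top 16 bits of the 64-bit fingerprint, which is an illustrative layout.

```python
import hashlib

def shard_for(fingerprint, num_shards):
    # Route by the anchor (top 16 bits) only, so all fingerprints with the
    # same anchor point are colocated on one shard. With consistent hashing,
    # the anchor would likewise be the key placed on the ring.
    anchor = fingerprint >> 48
    h = int.from_bytes(hashlib.md5(str(anchor).encode()).digest()[:8], "big")
    return h % num_shards

# Two fingerprints with the same anchor (5) but different other fields:
fp_a = (5 << 48) | (10 << 32) | 100
fp_b = (5 << 48) | (99 << 32) | 7777
print(shard_for(fp_a, 8) == shard_for(fp_b, 8))  # True
```

The trade-off is hotter shards if some anchor notes are far more common than others.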