Thanks Jordan for all the good content! I had an interview at Microsoft, and I only knew DSA, design patterns, and AWS microservices; I didn't have any idea about the tradeoffs and challenges we face when designing systems. I went through a lot of system design content, but the creators did not propose alternative solutions to the problems, nor did they justify the decisions they took while designing the system. I binge-watched the two playlists - system design and system design questions - in the span of a week. I was able to design a graph database using NoSQL in the design round and got selected for the role. Also, your view counts don't do justice to the quality of content you have in these playlists. I will surely promote this channel later in a LinkedIn post, and your friends too will know how big of a nerd you are.
Since reads are 100x writes, I think we need a consistent DB like MongoDB or Postgres and then leverage a non-consensus algorithm. Yes, you said single leader. I'd go more with a linearizable compare-and-swap, which you also hinted at. However, this will increase latency. OR we can use Google Spanner for our key generator DB so that it ALWAYS doles out unique hashes to requests. It could be centralized with its own atomic clock. But again, that's SUPER expensive. :)
Thanks a lot for the great content, and especially for your dedication to keep producing it. Quick question: you mentioned Riak can be used, so that if a conflict occurs, we assign one of the conflicting writes the next available sequence. But at that point we might have already handed out the short URLs to both users - wouldn't that still be the same issue as with Cassandra? Am I missing anything here about using Riak?
Please excuse my ignorance if this is covered in the video or comments. The key generation service approach has another disadvantage: for the same actual URL passed by a user, the system will generate a new TinyURL every time. That creates duplicate data compared to the MD5 or SHA-1 approach, where the tiny URL has a 1:1 mapping with the actual URL.
Agreed - though I think that this is actually intended behavior for TinyURL just so that each user can track the performance of their own TinyURL. That's a great point to make during an interview though!
Hey Jordan, awesome content! 1. Does single leader replication imply there is always only one leader per DB, or one leader per partition? 2. What do you say about having a key generation service generate X keys/second, storing them in a Redis Cluster [key (string) -> used (bool)], and only using keys from Redis when needed? Redis uses single-leader replication and locks the data during updates, so theoretically there should always be one unique key returned? And for the link table we can use Cassandra freely (as the keys are already generated and stored in Redis)? 3. Would you use some message brokers? How should the services communicate?
1) One per partition. 2) That's reasonable, but I doubt you're going to get better performance using a separate node as a lock than you would just using locking on each node; otherwise you need a two-phase commit. 3) If I was gonna do the approach in step 2, I'd probably just have to use a two-phase commit here, no message broker.
Hey Jordan, I had a question about what caching strategy you would use on writes, and how to keep the cache coherent. Of all the strategies you've mentioned in your previous video, which would you suggest? 1) I'm guessing we would not use a write-back strategy, as we don't want to batch writes in our cache and then send them to our DB; having write consistency for our long URL -> short URL mapping is important, and if we batch writes, then two different users could end up having the same short URL for different long URLs. 2) I'm guessing we don't use write-around, as we don't generally update the same key (same shortened URL) very often, so there's no need to write the key to the DB and invalidate it in the cache. 3) So I'm guessing we'd use a write-through strategy, where we update the DB and the cache in parallel, but without using 2PC, as we only care about the write succeeding in the DB, not so much about the write to the cache. I see from your diagram that MongoDB is reaching out to your cache - does that mean you've used write-around?
Personally I think that write-around or write-through is fine. It's no big deal if there are a couple of cache misses on the first couple of reads, so long as the popular items are being pulled into the cache. Keep in mind the issue with write-through is that every URL written would be sent to our cache, and there are probably a lot of TinyURLs created that aren't read for a while, or at all.
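To make the write-around discussion above concrete, here's a minimal cache-aside sketch (a toy, not anything from the video - the `db` and `cache` dicts are stand-ins for the real database and something like Redis): new URLs are written straight to the DB, and the cache is only populated when a short link is actually read, so rarely-read links never take up cache space.

```python
class UrlCache:
    """Toy write-around + cache-aside store for short -> long URL lookups."""

    def __init__(self, db):
        self.db = db      # stand-in for the real database
        self.cache = {}   # stand-in for Redis/Memcached

    def create(self, short, long_url):
        # Write-around: only the DB is updated; the cache fills on first read.
        self.db[short] = long_url

    def resolve(self, short):
        if short in self.cache:       # cache hit: no DB round trip
            return self.cache[short]
        long_url = self.db[short]     # cache miss: fall through to the DB
        self.cache[short] = long_url  # pull the (now popular) item in
        return long_url
```

A cache miss on the first read of each link is the price paid for keeping never-read links out of the cache entirely.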
Hey Jordan. Thanks for all the great videos! Shouldn't the fetch/write servers query ZooKeeper in order to know which partition they should use for a given key?
Thanks, fantastic content, much more in-depth than other similar design videos (although requires more prior knowledge). What do you think about the idea of each service instance pre-loading (claiming) say 1000 keys each to minimise DB locking and synchronising on a single ever-increasing counter i.e. rowId. I think this was suggested in Grokking series you mentioned.
That's the key generation service idea! I believe that I mention that in this video as well, albeit perhaps less in depth :) Thanks for the nice comment!
Hey Jordan, I'm curious if you could elaborate more on the "key generation service". You glossed over this pretty quickly after elaborating on why we wouldn't want to use hashing for this. So... how do you create the keys? How do you make them only 8 characters long? Is this simply something you can do in Mongo, i.e., populate the LinkTable with a bunch of entries that are empty except for the key? How is this managed? Thanks. Also... where's the video of the train being run on u?
Oh, I actually think hashing is fine for this problem (and is probably what you should do); just realize that you have to accept the possibility of collisions and change one of the keys accordingly. As for the key generation service, the general premise is that all of the possible keys are pre-populated in the database, even if they don't have a website associated with them. Then, every time a user requests a new shortlink, they are assigned one of the available keys and that key is marked as used. The key thing to note here is that because each process is first going to read a row and then try to claim it, locks should be used on the rows to ensure that somebody does not claim a key only to have their claim overwritten by another concurrent request - it is important that all subsequent users see that the key has been taken. Thus, you would want a database with ACID properties, such as a relational one.
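As a rough sketch of that read-then-claim flow, here's what claiming a pre-populated key could look like against a relational store (SQLite here purely for illustration; the `keys` table and `claim_key` function are made-up names, not from the video). `BEGIN IMMEDIATE` takes the write lock before the SELECT, so no concurrent writer can claim the same row between our read and our update:

```python
import sqlite3

def claim_key(con):
    """Atomically claim the next free pre-generated key.

    The connection must be in autocommit mode (isolation_level=None)
    so this explicit BEGIN isn't nested inside an implicit transaction.
    """
    con.execute("BEGIN IMMEDIATE")  # take the write lock up front
    row = con.execute(
        "SELECT key FROM keys WHERE used = 0 LIMIT 1").fetchone()
    if row is None:
        con.rollback()
        raise RuntimeError("key space exhausted")
    # No other writer can have claimed this row between the SELECT and
    # this UPDATE, because we already hold the write lock.
    con.execute("UPDATE keys SET used = 1 WHERE key = ?", (row[0],))
    con.commit()
    return row[0]
```

In a real relational DB you'd express the same idea with `SELECT ... FOR UPDATE` or a single conditional UPDATE, but the principle is identical: the read and the claim happen inside one serialized transaction.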
Hey Jordan, I have a doubt; not sure if I understood correctly. 1) The create-key servers generate the key using hashing and store it in a particular partition - how do we decide which partition that short URL is stored in? 2) Does our hash function generate the same key if the same URL comes to the system again?
1) Based on the hash 2) It should yes, unless of course you don't want it to because you want different people to have different links for the same URL, in which case you should pick a different shortened key
Hey Jordan, thanks for this video. Appreciate your efforts. However, I have a question related to single leader replication. I would like to understand more about the two scenarios below: 1. Assuming there is one leader and our service is hosted across multiple regions (possibly in different countries as well), do all calls go to the same leader hosted in one region? 2. If the above is not true, and every region has a single leader with multiple replicas, then how do calls from multiple data centers guarantee uniqueness? Do all leaders in all regions call each other to see if the key has already been created?
1) For writes, yes. Since this is TinyURL, I'm not too concerned about write latency. If we really needed to, we could shard our leaders based on IP address and then have those leaders in separate data centers (e.g., North American clients write to a database in the USA, Indian clients write to a database in India). You could assign each region a disjoint subset of the possible TinyURLs to hand out so we don't have to worry about conflicts. 2) The above is true.
Great video! Question - if this is going to be a user oriented tinyurl generation, why not use the sha of (long url + userid)? Wouldn’t this ensure no collisions?
Hey, I'd recommend watching the newer version of this video - that's what I did. It doesn't ensure no collisions, but it should greatly reduce them.
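For illustration, a sketch of the sha-of-(long url + userid) idea from the question above - `short_key` is a hypothetical helper, and as the reply notes, a real write path would still need to detect the (now rare) collisions:

```python
import hashlib
import string

# Base62 alphabet: digits + lowercase + uppercase, as short-URL keys typically use.
BASE62 = string.digits + string.ascii_lowercase + string.ascii_uppercase

def short_key(long_url, user_id, length=8):
    """Derive a deterministic short key from sha256(long_url + user_id).

    Different users shortening the same long URL get different keys,
    which shrinks (but does not eliminate) the chance of collisions.
    """
    digest = hashlib.sha256(f"{long_url}|{user_id}".encode()).digest()
    n = int.from_bytes(digest, "big")
    out = []
    for _ in range(length):
        n, r = divmod(n, 62)  # peel off one base62 digit at a time
        out.append(BASE62[r])
    return "".join(out)
```

With 8 base62 characters there are 62^8 (~218 trillion) possible keys, so two distinct (url, user) pairs colliding is rare but still possible.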
Why do you say "200 writes per second is no joke. We are gonna need to support high write throughput"? Most RDBMS can sustain that load. Could you elaborate more on rules of thumb you use for QPS and what different systems can support?
Hey Jordan, Thanks for the effort in putting the series. Few questions though 1. Why is it problematic if two users want to generate tinyURLs for same long URL? Beyond analytics use-cases and audit information for the user who created the tinyURL, what other drawback is there in returning the same URL? 2. Why not use something like Twitter Snowflake instead of Key Generation Service?
Hi! I think you mainly voiced the problem right there; otherwise it would be totally fine. However, afaik many interviewers will want unique URLs. As for question 2, I'll have to look into Twitter Snowflake, but generally niche technologies like that tend to be less viable for systems design interviews because the interviewer may not know them lol, so you'll have to explain it really well.
@Jordan, wondering if you have prepared slides for all the system design questions by any chance (similar to the system design playlist slides). If yes, those would greatly help as a refresher, as each video is 30 minutes and it's not possible to go through all of them to refresh the concepts/applications quickly before an interview.
@@jordanhasnolife5163 these sys design slides and videos are simply unparalleled btw.. love them man! Thanks for making these! Upped my sys design game by 10x for sure 😄. Great alternative to reading DDIA as I did twice, it is brick for sure yes 😀
Ahh, fair point - I always forget DynamoDB and Dynamo-style are different. Though I believe DynamoDB isn't open source, so it's hard to know exactly what's going on behind the scenes there.
Hey Jordan, can you elaborate a bit more on how single-leader replication helps with the collision problem better than a leaderless replication scheme?
In single leader replication the writes can only go to one place, so you can use locking to ensure that keys will not be duplicated. In a leaderless scheme, on the other hand, this is not possible, because two requests claiming the same short link will go to different replicas and only find out about one another later down the line. Then we will have a collision.
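The "writes go to one place, so locking works" point above can be sketched with a toy in-process allocator - the lock plays the role of the single leader serializing claims (all names here are made up for illustration):

```python
import threading

class KeyAllocator:
    """Single-writer-style allocator: one lock serializes all claims,
    the way a single leader serializes writes, so no two concurrent
    requests can ever receive the same key."""

    def __init__(self, keys):
        self._free = list(keys)
        self._lock = threading.Lock()

    def claim(self):
        with self._lock:  # read-and-claim is one atomic step
            return self._free.pop() if self._free else None

def demo():
    """Have 10 threads race to claim 100 keys each from a pool of 1000."""
    alloc = KeyAllocator(f"k{i}" for i in range(1000))
    claimed, claimed_lock = [], threading.Lock()

    def worker():
        for _ in range(100):
            k = alloc.claim()
            with claimed_lock:
                claimed.append(k)

    threads = [threading.Thread(target=worker) for _ in range(10)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return claimed
```

Under leaderless replication there is no single lock like this to go through, which is exactly why two replicas can hand out the same key and only detect the conflict later.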
Hey Jordan, if there is a URL generator service that pre-populates free URLs to use (the GET API queries for a free URL while using locking), in that case how would a write request assigning long URL -> short URL cause write conflicts (and hence could we use leaderless replication on the database that stores the short URL -> long URL association)?
Hey Rahul, I'm a bit confused about what you're asking here! For the key generation service, there shouldn't be two users requesting the same short URL, because we just take the next free short URL - as opposed to hashing, where two distinct long URLs might hash to the same short URL. That being said, with a key generation service there can still be write conflicts under leaderless replication, because different replicas of the generation service may assign the same short URL to two different long URLs.
@@jordanhasnolife5163 You mentioned "different replicas of the generation service may assign the same short URL to two different long URLs" - how can that happen? I assumed the key generation service would have a database of free URLs that is partitioned (using consistent hashing to distribute load), and each partition grabs a lock on the row of the free URL and then marks the URL provisioned. How would two requests receive the same URL (which could potentially cause a write conflict)? Do you mean replication between free-URL partitions might cause replicas to be out of sync, so even locking will not guarantee unique free URLs? Would using single leader replication be better at this stage, rather than at the key generation service (detecting the conflict sooner)?
Hey Jordan! Awesome video. For the key generation service, if we are storing, let's say, 6 characters for each shortened URL, and we have a service that has a list of all the possible shortened URLs we can give out, how come we don't add a boolean column called "used" to our shortened URL database? Also, is there a process by which we can search for the next available shortened URL (i.e., some sort of indexing capability)?
Hi Stephen! Both of these suggestions definitely work! The biggest thing is to be very careful of race conditions - e.g., you and I try to claim abcdef at the same time, both of us see it is available, and then we both successfully claim it one after the other. The point is that locking is necessary here.
I am somewhat surprised people are worried about double key allocation and are using complex solutions with used/unused keys. If you request a key and then abandon it, just forget about the unused value. What do you think?
I think the one scenario you just want to avoid is where two people end up getting the same short URL for their link, but you can pretty easily avoid this with transactions
How do we come up with these calculations in an interview? Do we need to remember or calculate them, or can we ask the interviewer about the scale we should cater to?
I'd work with the interviewer and ask them how many users you can expect, and then start making up numbers based off of that - this part isn't overly important, you just need to get something in the ballpark of the correct scale.
Can you elaborate more on ZooKeeper, please? Does the fetch service, or the load balancer between the fetch service and the DB, communicate with ZooKeeper to get the details of MongoDB partitioning, i.e., which master/replica contains the full-length URL for a given shortened URL? Also, what happens when we add/remove servers from the MongoDB cluster - will ZooKeeper detect it automatically and adjust its partition mapping?
The job of ZooKeeper is basically just to tell database nodes which of the nodes in the cluster are up or down. Upon figuring out this information, the nodes in the cluster can make decisions to rebalance the partitions and things like that. I think MongoDB specifically may handle this for you, so perhaps ZooKeeper was unnecessary here, but if you were making your own distributed database you would likely need a coordination service, unless you elect to use a gossip protocol.
Makes sense! Most NoSQL DBs now support automatic sharding, but if we're building something on our own then we definitely need some sort of coordination, either through ZooKeeper or by making an HA configuration with specific nodes assigned as master nodes for coordination, just how Elasticsearch does it.
I have an interview in 5 days - where can I find condensed information about different databases, like "this one is single-leader" or "this one uses last-write-wins during conflicts"? Basically, I want to know if there is a source of tidbit info about various databases.
Tbh I don't really think there's a great source for that, which is unfortunate. I generally would just say Cassandra if you want leaderless, MongoDB for single-leader NoSQL, and MySQL for single-leader relational databases. You could always talk about NewSQL solutions in a system design interview, but to me those seem like the three most relevant.
@@jordanhasnolife5163 Thank you for your reply - I just used those terms because they were mentioned in the video. If you could make a video on something like this, it would be unique across YouTube and helpful for people looking to gain some info: a series comparing the different database solutions on different criteria, like collision handling, sharding support, and so on. It would be a good source for system design prep. Thank you for your time.
@@sagarmoghe While I agree, and had originally planned on doing something like this, the truth of the matter is that there are a lot of possible database solutions - if you look on my channel, I more or less made a video on each of the notable ones; I'd just recommend watching them all.
@@jordanhasnolife5163 Thank you. I am really early career-wise and don't have much experience with DBs. When u used terms like leaderless, they were intuitive to understand, and I didn't know something like this existed. Even if u don't make a dedicated video on its workings, I understood the gist of it without going through the DB internals in detail. Thank you 😁 I was just lazy and wanted condensed info/properties about databases without going deep into their workings. But I guess there is no way around it.
I understand we are using single leader replication to avoid race conditions. But how do we know whether a single replica can handle 200 writes/sec? At what point (what breakpoint write RPS) would the write load become a bottleneck, such that we'd consider other options?
Yeah this is where partitioning comes in handy. The writes for the same hashes go to the same server, so the write conflict will still be solved, while allowing us to scale out write throughput.
I do not understand the responsibility of ZooKeeper here. In an earlier video, you said that it holds the info about which hash can be found on which partition, but I do not understand what it does here.
Hey Jordan, Grokking estimates that tinyURL will need 12,500 servers... I think that is way too many and they must have made a mistake, but I want your confirmation before I go around saying that they're wrong.
@@jordanhasnolife5163 I disagree with the formula Number of Servers = DAU / 8000. It would make sense if each DAU were making one request per second, but it's more like each DAU makes one or two requests per day. So it should be (DAU * 2) / (8000 * 10^5), where 10^5 is an estimate of the number of seconds in a day. So given a DAU of 100M (as per Grokking), that should be like one server. Add a few more for replication, different availability zones, and maybe to account for peak traffic, but no more than 10.
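Spelling out the arithmetic in the comment above with the thread's own numbers (100M DAU, ~2 requests per user per day, 8,000 requests/second per server):

```python
dau = 100_000_000            # daily active users, per Grokking
reqs_per_user_per_day = 2    # each user resolves a link or two per day
seconds_per_day = 10 ** 5    # ~86,400, rounded up for easy math

# Average request rate across the whole user base.
avg_rps = dau * reqs_per_user_per_day / seconds_per_day   # 2,000 req/s

rps_per_server = 8_000       # the thread's per-server throughput figure

# Fraction of one server needed to absorb the average load.
servers = avg_rps / rps_per_server                        # 0.25
```

So at average load, a quarter of one server suffices; multiply by a generous peak factor, replication, and availability zones, and you still land near 10 servers rather than 12,500.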
Oh that’s the standard that Grokking chose to use as an average of the number of requests per second a server can handle. I’m not entirely convinced about it though. Would love to hear your thoughts about server load capacity and how many connections it can handle per second.
@@alicezhang3504 Yeah truthfully I'm not sure there and I'm sure it changes from framework to framework and CPU to CPU, but 8,000 doesn't seem terribly unreasonable.
Thanks! Assuming that you are completely new to coding - I'd first recommend learning a decently simple language like Python at freecodecamp, then make a personal project or two with it, and then finally start practicing medium questions at leetcode.com for a couple of months before applying to jobs!