
9: Design Yelp/Google Places | Systems Design Interview Questions With Ex-Google SWE 

Jordan has no life
51K subscribers
12K views

Will be selling Jordan's juicy buns shirts imminently along with the foot pics, get in line

Published: Oct 3, 2024

Comments: 79
@idiot7leon · 5 months ago
Brief Outline
00:00:41 Brief Introduction to Yelp
00:01:21 Problem Requirements
00:02:25 Capacity Estimates
00:03:45 Reading Reviews
00:05:59 Finding Restaurants
00:08:48 Finding Restaurants
00:11:54 Finding Restaurants - Example
00:16:38 Restaurant Data Choices
00:19:48 Data Caching
00:21:56 Geo Sharding
00:23:20 Geo Sharding
00:25:40 Partition Balancing Illustrated
00:27:07 Searching by Name
00:28:32 Final Diagram - Yelp
Thanks, Jordan~
@matthewsaucedo2471 · 8 months ago
Do you have a Patreon? Had some successful system design interviews and I feel like your high level coverage of most common building blocks really got me over a hump in my designs, specifically with learning how to incorporate replication and more complex streaming into trade off discussions. Would love to show support!
@jordanhasnolife5163 · 8 months ago
Nope! No need to pay me man, just pay it forward!
@anjukrishna8102 · 6 months ago
Very good explanation for understanding geospatial indexes. Thank you so much
@yiannigeorgantas1551 · 16 hours ago
If we use the same hash function on restaurantId for both the reviews and restaurants table, wouldn't the review and restaurant data end up in the same partition (ie. all the reviews for a particular restaurant will end up on the same partition as the details of that restaurant)? Once the tables get particularly large, couldn't we just repartition, and move the reviews for a restaurant, as well as its details, together, to a new partition? By having the two tables colocated together on the same MySQL partition, I think we can get rid of the "Review Change Data Kafka" and "Spark Streaming" components. We'd be able to update the restaurant's rating and post a user's review atomically because both tables would live on the same machine.
@allenxxx184 · 6 months ago
How come your explanation is so clear for every single video?!
@jordanhasnolife5163 · 6 months ago
I've got a lot of practice gaslighting the romantic partners in my life, gotta learn how to talk clearly
@buddh4r · 8 months ago
Really appreciate your system design videos! In case you run out of ideas, I'd love to see a Slack/Discord system design, especially with regard to SaaS and multi-tenant architecture. I also assume the messaging aspect differs heavily between Slack/Discord and Facebook Messenger/WhatsApp. I have been working on such a platform as a side project for some time and plan to release it as open source soon. Even though I don't plan to make it as scalable as Slack or Discord, it would be interesting to see your take on such a system. Anyway, thanks for the great work you've done so far.
@jordanhasnolife5163 · 8 months ago
Hey Buddha! Why do you feel that the architecture is much different from messenger?
@buddh4r · 8 months ago
@@jordanhasnolife5163 Hey, thanks for answering. The main reason for my assumption is that platforms like Discord support significantly more users: instead of just 2-10 users per chat, a Discord server can handle up to 500,000 concurrent users (per chat). Additionally, a Slack workspace or Discord server acts as its own tenant, compared to a global chat system. But yeah, 'heavily' probably is not the right word. Some other general differences include Slack's support for various integrations (e.g., Webhooks, OAuth), channels, threads, organizational hierarchy, permissions, etc. I know some of these details might be too specific for a system design overview; this is just for the purpose of inspiration. There is also an interesting article about how Discord evolved from MongoDB to Cassandra to ScyllaDB. Hope I don't come across as the 'Please design my system for me' type of guy. My side project is more or less complete and, in terms of system architecture, will be much simpler than any system you've discussed on your channel; I'm just interested in this type of system at the moment.
@jordanhasnolife5163 · 8 months ago
@@buddh4r Haha not at all, thanks for sharing! Yeah I'll try to look into it/give it some thought, and if it feels like there's some unique elements in terms of the design here I'll eventually make a video on it!
@byte_easel · 8 months ago
Hi Jordan, thank you for these system design videos, whose quality I'm pretty sure most people take INSANELY for granted. These videos are better than the ones creators on Udemy literally charge $100 for. But anyways, I had a request. Can you make a guide on recruiting for new grad 2025 (or just new grad in general)? I don't have anything lined up for the summer, and I heard system design is apparently expected ON TOP of all the leetcode patterns you need to be well practiced in for technical interviews. It's overwhelming and I'm not sure how best to prepare myself to secure a new grad role. I would really appreciate it if you could make a video on that (if you can/have the time). I'd gladly donate to a Patreon or PayPal link for all the insane trouble you go through to make these videos (and the one I'm requesting, if possible).
@ingenieroriquelmecagardomo4067 · 8 months ago
I don't think you need to prepare that much on system design as a new grad. New engineers are expected to have strong technical coding skills, because most high-level technical decisions are made by senior and staff engineers, who expect the engineers under their supervision to implement their direction competently. tldr: focus more on algos and not bombing the DSA interview
@jordanhasnolife5163 · 8 months ago
Agreed with the above comment. As a new grad, I would mainly focus on leetcoding and just getting referrals for my applications. In my day, I had what I would consider to be two genuine systems design interviews as a new grad (Robinhood and HubSpot), but the rest cared much more about coding. Other companies have "system design" interviews at the new grad level, but it's really more low-level stuff rather than distributed systems. Not to say you shouldn't learn this stuff! Just want you to prioritize your time well. I have videos on this channel devoted to recruitment; you've gotta look through them, but they're there. Mainly it boils down to:
1) Have a resume with some internships or personal projects that show you can write code
2) Apply with referrals, you can get them pretty easily off of Blind or LinkedIn
3) Study a LOT of leetcode
Good luck! I don't want your money for now man, keep killing it ;)
@TheCyanide01 · 8 months ago
Hey Jordan! Thanks a lot for the incredible content. I have a quick question: I've seen a bunch of system design videos about Proximity Services (such as Yelp), and most of them seem to use GeoHashes and/or QuadTrees. Is there a reason R-Trees aren't typically presented in these solutions? Aren't they considered to be faster than QuadTrees for KNN as well as Range queries (such as the ones covered in this video)? 🤔 Would really appreciate your insight here. Thank you, again!
@jordanhasnolife5163 · 8 months ago
Hey!! R-Trees are mainly for indexing actual shapes on a map, whereas we're mostly concerned with radius search here. Hope that makes sense!
@5atc · 8 months ago
Do you have a video that explains the CDC / derived data / Kafka / Flink pattern? You use this a lot and I'd like to understand it better. Great video as always!
@jordanhasnolife5163 · 8 months ago
Yep! So I have like 3 videos dedicated to stream processing in the 2.0 playlist and one for flink as well, would recommend watching those!
@jordiesteve8693 · 2 months ago
Great video! I was thinking that for the reviews database one could make the case for DynamoDB. It would allow reviews to be stored sorted by timestamp and speed up reads (i.e., no need to sort them before showing them to the user). A bit of a stretch (we could just cache sorted results), but an interviewer might like to hear some pros/cons.
@jordiesteve8693 · 2 months ago
Apart from this, I know you love CDC (I do too), but in a situation with a low volume of writes, one could argue for 2PC, since it is cheaper than maintaining Kafka, Spark, and Flink clusters. WDYT? Anyway, we don't pay the bills lol
@jordanhasnolife5163 · 2 months ago
Yeah fair enough on the 2pc point. As for DynamoDB, why can't other databases sort by timestamp? This is just an index you're describing
@jordiesteve8693 · 2 months ago
@@jordanhasnolife5163 indeed, I got confused and mixed things up. Thanks!
@htm332 · 2 months ago
- For geospatial partitioning you could probably use population as an initial proxy until some usage data is collected
- CDC for reviews updating restaurant rows is overkill. There just aren't that many reviews per item on Yelp (e.g. the super touristy Katz's deli in NYC only has 15k). 2PC would be fine.
@jordanhasnolife5163 · 2 months ago
1) agree 2) agree, but it's a systems design interview so figured I may as well show the overoptimized solution just in case
@yiannigeorgantas1551 · 22 days ago
@@jordanhasnolife5163 Both the reviews and restaurants table are sharded by restaurantId, so I wonder if 2PC or CDC is really required here. Since the sharding is the same, can't both these tables live on the same database instance, but on two separate tables. Assuming we're using MySQL, which is ACID-compliant, we'd be able to update both these tables atomically.
@mangeshshikrodkar6192 · 8 months ago
Very good videos. Lots to learn.
@tysonliu2833 · 1 day ago
I am a bit confused about the final design diagram. So you have a MySQL table for restaurants (populated by business owners, I guess, which is not shown here) that is constantly updated with the aggregated results of likes/views etc., which makes sense. But then there is a geo index at the bottom: I thought the geo index would be an embedded part of the restaurant DB (is that possible)? Otherwise, how do we support the range query and all that good stuff if we simply shard by restaurant ID (unless the key is prefixed with the geohash)?
@thunderzeus8706 · 2 months ago
Hi Jordan, I have a question about the overall DB choice that developed while following each of your videos. I noticed that you always look at "read heavy/write heavy" + "replication pattern" to make decisions, but almost never at the overall load. Like in this video, for the comments DB choice, since it is obviously read heavy and there are no real challenges to single-leader replication, you chose MySQL. I wonder, if the number of comments grows very large, say 100B comments (I think YouTube has more than that scale), and the read QPS is in the hundreds of thousands, should we still stay with MySQL? What is your threshold on overall item count and QPS load for staying with or moving away from MySQL? (My old impression of SQL's QPS limit was just thousands of QPS, but maybe that is for a single MySQL instance, and maybe that was the story before "MySQL Cluster" was introduced?) In between SQL and NoSQL there are also "NewSQL" DBs like Spanner and CockroachDB. As far as I'm aware, both use LSM trees + SSTables (or a variant), which are actually slower to read than a B-tree index. But in reality, I feel the sentiment is that Spanner is favored over MySQL in such large-scale use cases. Does the slight edge of "B-tree indexes make lookups/range scans faster" get offset by other considerations? Thanks a lot!
@jordanhasnolife5163 · 2 months ago
1) I think regardless of how many QPS we need, your database choice isn't going to make you able to handle orders of magnitudes more - at that point you just need more shards. 2) Spanner is nice when you want to make consistent reads spanning multiple partitions, or you want strong consistency, neither of which we really need here. I'm not saying spanner will be slower in reality, but I can't make any theoretical argument in favor of it here other than looking up some actual benchmarks.
@rakeshvarma8091 · 5 months ago
Great video. Can you please share the slides for the system design 2.0 series as well as the design questions series? Thanks.
@jordanhasnolife5163 · 5 months ago
Going to do this in batch once I finish the current series
@Anonymous-ym6st · 5 months ago
I am wondering how you feel about the tradeoff between updating the reviews/restaurant info offline vs. in real time? My thinking on the key points: 1. Update frequency (if the update frequency is not insane, I think online updates should be fine). 2. A real-time update's customer experience could be better, but it could lead to more outages if not handled carefully (here we use CDC to enable at-least-once delivery, but Spark's micro-batching adds some delay to data freshness, which I think is still much shorter than any offline solution like a snapshot service every hour). I didn't find a lot of material discussing online vs. offline data updates on the write path (for a read-heavy service), and would like to learn from your expertise here!
@jordanhasnolife5163 · 5 months ago
To be honest, I think that's a requirement of the service, and basically just whatever your interviewer asks of you. Running batch jobs is easier 100% of the time, but sometimes people want their data quick haha
@TruptiJoshi-gi2qf · 4 months ago
Hey Jordan! Great content as usual. Thank you for doing all the hard work. I have a question. We are capturing the restaurant change data and processing it in Flink. However, if the user just wants to read about the restaurant or its stars, it will fetch the data from a follower replica of the Restaurants table, right? Is my understanding correct? Because the read arrow comes directly from the Flink node, I got confused.
@jordanhasnolife5163 · 4 months ago
Yes you're correct
@useryash09 · 8 months ago
Hi Jordan, I am a big fan of your videos. I want to know if we can get the slides for these videos as well (playlist: "system design questions 2.0").
@jordanhasnolife5163 · 8 months ago
Hey! I am going to get around to this eventually, but will probably upload them all at once in bulk after I stop making this round of videos (give it a couple months).
@annon8350 · 7 months ago
From some of the videos I have seen that use geohashes for locality services, the idea is to present a customer with restaurants within 1/5/10 miles (let's say), and essentially the length of the geohash ID can determine this. For example, if the geohash for the customer's current location is 98bewp, then show the restaurants whose geohash ID matches 98bew% or 98be%%. This makes sense to me and is simple to implement, mostly because I think you can cover up to a 1 mile radius with just 6 characters (hence no issue with scalability). However, the idea you presented here with the sorting, binary search, and Pythagorean theorem isn't fully clear to me.
@jordanhasnolife5163 · 7 months ago
I think what I said is just a generalizable approach to get the exact radius that you want, as opposed to yours which relies on the size of the bounding box. Same idea though. I just get all of the points in the neighboring bounding boxes, and then use their lat long to determine the exact distance from me, and filter them out if they're outside the radius I care about. The sorting is just to demonstrate that if you make an index on the geohashes (which is effectively sorting them), points that are closest to one another will be next to each other in the index.
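The approach Jordan describes (index the geohashes, binary-search to the current cell's range, then distance-filter the candidates) can be sketched as follows. This is a toy illustration, not Yelp's implementation: it uses a base-4 geohash for readability (real geohashes are base32, but the sort/prefix property is the same), and the restaurant data is made up.

```python
import bisect
import math

def geohash(lat, lon, precision=8):
    """Toy base-4 geohash: one character per quadtree level."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    out = []
    for _ in range(precision):
        lat_mid = (lat_lo + lat_hi) / 2
        lon_mid = (lon_lo + lon_hi) / 2
        lat_bit = 1 if lat >= lat_mid else 0
        lon_bit = 1 if lon >= lon_mid else 0
        out.append("0123"[lat_bit * 2 + lon_bit])
        lat_lo, lat_hi = (lat_mid, lat_hi) if lat_bit else (lat_lo, lat_mid)
        lon_lo, lon_hi = (lon_mid, lon_hi) if lon_bit else (lon_lo, lon_mid)
    return "".join(out)

def haversine_km(lat1, lon1, lat2, lon2):
    """Exact great-circle distance, used for the final radius filter."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    a = (math.sin(math.radians(lat2 - lat1) / 2) ** 2
         + math.cos(p1) * math.cos(p2)
         * math.sin(math.radians(lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def radius_search(index, lat, lon, radius_km, prefix_len=4):
    """index: list of (geohash, name, lat, lon) tuples, sorted by geohash.
    Binary-search to the contiguous block of rows sharing our cell's
    prefix, then distance-filter them. A full version would also scan
    the neighboring cells to handle queries near a cell edge."""
    prefix = geohash(lat, lon)[:prefix_len]
    lo = bisect.bisect_left(index, (prefix,))
    hi = bisect.bisect_right(index, (prefix + "\x7f",))  # "\x7f" > "0123"
    return [name for gh, name, rlat, rlon in index[lo:hi]
            if haversine_km(lat, lon, rlat, rlon) <= radius_km]
```

Because the index is sorted, all restaurants in the same cell form one contiguous run, which is exactly why "points close to one another are next to each other in the index."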
@annon8350 · 7 months ago
@@jordanhasnolife5163 This makes sense. Thank you for the response
@davidrice974 · 6 months ago
Thanks for the great content Jordan. Just had a question about geohashes: in this case do they represent a location for a restaurant, or a square region that could contain multiple restaurants? It seems like the former, but if that's the case, does a geohash represent a fixed point that a restaurant may not be assigned to? I.e., there are restaurants at aaa and aac but none at aab?
@jordanhasnolife5163 · 6 months ago
Yep GeoHashes are just bounding boxes. However, eventually geohashes become so small that it's more or less accurate to just say some geohash is equal to my current location. As a restaurant, I am technically located within multiple bounding boxes, but the geohash we'd want to use here is the smallest one that contains my whole restaurant.
@iamlarissa87 · 1 month ago
May I ask what widget you use to draw on your laptop? Are you using a graphics pen and drawing tablet? I am interviewing with companies, and I find it hard to draw system design diagrams on sites like Miro using the touchpad on my MacBook Pro.
@jordanhasnolife5163 · 1 month ago
Drawing on my iPad with OneNote and screen recording. Don't know if this will work/be allowed for interviews.
@trisqwit · 3 months ago
Hey Jordan, thanks for the great video. One question I have: do we really need to partition the geohash table? 10 million restaurants isn't a lot, maybe not exceeding 2GB in size. We can distribute the workload via replication. If we want to further reduce latency and achieve data locality for dense areas, we can utilize CDNs. I also saw that in the System Design Interview book the database was partitioned without clear justification. So I don't get the point of partitioning here; is there anything I am missing? Thanks
@jordanhasnolife5163 · 3 months ago
Agree with your point; in reality we may not need to partition. It's just helpful to understand how partitioning works for location-based data. One scenario where you may still want to partition is caching, since caches are smaller.
@yiannigeorgantas1551 · 12 days ago
Both the reviews and restaurants table are sharded by restaurantId, so I wonder if CDC (or 2PC) is really required here. Since the sharding is the same, can't both these tables live on the same database instance, but on two separate tables. Assuming we're using MySQL, which is ACID-compliant, we'd be able to update both these tables atomically.
@jordanhasnolife5163 · 11 days ago
Definitely could, it's just a matter of whether that's feasible when these tables get particularly large. Sharding by restaurantId doesn't necessarily mean that the same groups of restaurant ids are in identical partitions for both tables.
@yiannigeorgantas1551 · 11 days ago
@@jordanhasnolife5163 If we use the same hash function on restaurantId for both the reviews and restaurants table, wouldn't the review and restaurant data end up in the same partition (ie. all the reviews for a particular restaurant will end up on the same partition as the details of that restaurant)? Once the tables get particularly large, don't we just repartition, and move the reviews for a restaurant, as well as its details, together, to a new partition? By having the two tables colocated together on the same MySQL partition, we can get rid of the "Review Change Data Kafka" and "Spark Streaming" components.
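The colocation idea in this exchange can be sketched in a few lines. This is a hypothetical illustration (the shard count, function names, and IDs are made up): both tables route through the same deterministic hash of restaurantId, so a restaurant's row and its reviews land on one shard, and Jordan's caveat is that resharding must then move both tables together or colocation silently breaks.

```python
import hashlib

NUM_SHARDS = 16  # hypothetical shard count

def shard_for(restaurant_id: str) -> int:
    """One deterministic hash used by every table keyed on restaurantId,
    so a restaurant row and all of its reviews land on the same shard."""
    digest = hashlib.md5(restaurant_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

def colocated(restaurant_id: str) -> bool:
    """Holds as long as BOTH tables use the same hash and shard count;
    reshard one table independently and this breaks."""
    reviews_shard = shard_for(restaurant_id)      # reviews table routing
    restaurants_shard = shard_for(restaurant_id)  # restaurants table routing
    return reviews_shard == restaurants_shard
```

With colocation, inserting the review and bumping the restaurant's total_stars/num_reviews can share a single local ACID transaction on that shard's MySQL instance, which is what would let you drop the CDC pipeline.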
@zhonglin5985 · 5 months ago
In the earlier section of the design, the restaurant table is partitioned by geohash, but when it comes to the final architecture, the partition key suddenly changes to restaurant id. Are we using two global indexes here?
@jordanhasnolife5163 · 5 months ago
Yeah I ended up just going with derived data from the original restaurants table to build out the geospatial index
@Anonymous-ym6st · 5 months ago
Thanks for the great content as always. I am a bit lost about the second binary search relative to the # of restaurants: we are doing a range query since the restaurants are indexed on their geohash, so can we basically describe the range query as log(N)? Another question: I am wondering what we usually select for the bounding box dimension (say the radius is 0.3 miles; what's the tradeoff between choosing 1 mile, 0.5 miles, and 0.25 miles)? Does it depend on the user's location (like, if we're close to a smaller cell's boundary, do we choose a bigger cell)? Thanks a lot in advance!
@jordanhasnolife5163 · 5 months ago
1) I think you described it very well, what exactly are you confused about? 2) You'd just want to choose the smallest bounding box that includes your whole radius ideally (or potentially a smaller one than that but now you have to look at multiple bounding boxes). It all just boils down to whatever is going to let you have to evaluate the fewest number of points. A bigger bounding box will have more points that you need to check to see if they're actually in the radius that you want.
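The "smallest bounding box that includes your whole radius" heuristic can be made concrete with the commonly cited approximate geohash (base32) cell sizes. This is a sketch, not a library API; the cell dimensions are approximations and the function name is made up.

```python
# Approximate base32 geohash cell sizes by prefix length,
# (width_km, height_km); commonly cited approximations.
CELL_KM = {4: (39.1, 19.5), 5: (4.89, 4.89), 6: (1.22, 0.61), 7: (0.153, 0.153)}

def precision_for_radius(radius_km: float) -> int:
    """Heuristic: pick the longest prefix (smallest cell) whose cell can
    still cover the whole search circle, keeping the candidate set small.
    Near a cell edge you still have to scan neighboring cells either way."""
    for p in sorted(CELL_KM, reverse=True):  # try smallest cells first
        width, height = CELL_KM[p]
        if min(width, height) >= 2 * radius_km:
            return p
    return min(CELL_KM)  # fall back to the coarsest precision we track
```

A bigger cell means more candidate points to distance-check; a smaller cell means more neighboring cells to visit, which is exactly the tradeoff in the question above.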
@Anonymous-ym6st · 5 months ago
@@jordanhasnolife5163 Thanks a lot! It makes sense! I am wondering, if the returned list of restaurants is very long and we only want the nearest 100 results, how can we do it? (From the video I learned we might want to avoid sending the raw list to the application layer.) Is there a way to do it in the DB, like keeping a heap there?
@jordanhasnolife5163 · 5 months ago
@@Anonymous-ym6st Yeah so to be clear, the process that I described is probably actually something that would run on a db itself. If we were building this from scratch, we'd have to code that logic ourselves, but I think that for something like Postgres, you can get it all done there!
@indraneelghosh6607 · 8 months ago
Hi, I had a general question about the write flow. How do you determine whether we should write directly to a DB or write via a message queue? Is there a rule of thumb for the TPS at which you'd want to introduce a message queue?
@jordanhasnolife5163 · 8 months ago
TPS, probably not realistically. But I'd say that if you have any fear that a write is going to involve a significant amount of processing (such that returning an instant result could hog our server capacity), or that it would be better to just be able to handle all of the writes on one thread for correctness's sake (but maybe later), introducing a message broker could be a good idea!
@meenalgoyal8933 · 6 months ago
Hey there! Thank you for the video. I was wondering if reads will feel slow, since for every nearby call we first spend O(log n) time on the geohash lookup, another O(log n) to get all the placeIds from the geohash index, and then filter for radius r. Is there any optimization besides caching that we can do to make reads faster?
@jordanhasnolife5163 · 6 months ago
Hey! I think in reality, partitioning will make things feel faster, you're not doing a logarithmic search on the whole table. Other than that though, yeah you're basically out of luck
@Anonymous-ym6st · 4 months ago
I am wondering whether you think we should keep a separate DB for the location (geohash-indexed) data used for queries, or just a separate table (but in the same DB as the restaurants)?
@jordanhasnolife5163 · 4 months ago
I'm not sure it really matters for the sake of our interview beyond like convenience for the developer/maybe overloading the physical machine.
@MuhammadUmarHayat-b2d · 3 months ago
Is the search logarithmic to the base 4?
@jordanhasnolife5163 · 3 months ago
Yeah
@zuowang5185 · 4 months ago
What happens when your num_reviews and total_stars don't align? Let's say you see 501 total stars on 100 reviews, where the max stars per review is 5.
@jordanhasnolife5163 · 4 months ago
Nothing - it's not the end of the world for this to happen, it's okay to deal with eventual consistency
@ariali2067 · 7 months ago
Curious, is Elasticsearch a DB table? I.e., are the geohash index and the inverted index stored in 2 separate DBs?
@jordanhasnolife5163 · 7 months ago
I wouldn't call it a database, it just provides a few different types of indexing methods. Conceptually, think of those indexes as separate tables though yeah
@ariali2067 · 6 months ago
Thanks for the response! Just to confirm: the search indexes and geohashes are built on top of our existing tables, not created as new, separate tables, right? I always assumed we were building new tables on top, but that's not the case; they are mainly secondary indexes on existing tables. Can you help confirm that my understanding is correct? Thanks again! :) @@jordanhasnolife5163
@jordanhasnolife5163 · 6 months ago
@@ariali2067 no, they're new tables in this case
@maxvettel7337 · 8 months ago
Why don't you use timecodes in the video? I remember you used to do this.
@jordanhasnolife5163 · 8 months ago
General laziness - maybe I'll go back and add them in at some point
@maxvettel7337 · 8 months ago
@@jordanhasnolife5163 This is not a problem, just a question. The content is the best, but I am currently prepping and it would be easier for me to navigate through the video.
@jordanhasnolife5163 · 8 months ago
@@maxvettel7337 Understood - I've gotta get around to doing this but I will at some point, thanks for the suggestion!
@ingenieroriquelmecagardomo4067 · 8 months ago
Nice video, though you admitting being an avid taco bell lover just highlights that you know very little about authentic mexican system design.
@jordanhasnolife5163 · 8 months ago
Murica number 1