0:00 Introduction
0:50 What we will cover in DynamoDB
1:20 Who I am, and books I wrote
3:05 What is DynamoDB
5:10 What DynamoDB was designed for
6:57 Key concepts of DynamoDB
8:41 Primary Keys in DynamoDB
9:41 Composite Primary Key
9:46 API Actions - Item-based
10:51 API Actions - Query Actions
11:37 API Actions - Scan Actions. Such Very Expensive! Wow.
12:05 Secondary Indexes
13:43 Data modeling example
15:12 Forget concepts from Relational Databases
16:30 The example
16:50 The example - ERD
17:53 The example - Identify your access patterns
19:29 The example - Design your primary keys & secondary indexes
22:27 The example - One-to-many relationships
23:00 De-normalize + document types (Maps, Lists)
28:58 One-to-many relationship patterns recap
30:00 Filtering: "Build filter in primary key"
31:42 The example - Orders in different partitions
33:21 The example - Filtering access patterns
33:34 The example - Composite sort key
37:34 The example - Sparse index
39:00 Filtering patterns recap
Hi Vikram, I was wondering if you know how Amazon can use DynamoDB for their shopping cart. When someone buys something, don't they need ACID transactions?
@@tango12341234 Amazon gets by without transactions because: 1) With the way inventory (a network of distribution centers throughout the country) and the supply chain are managed, they can meet demand almost all of the time. In the rare case an item is not available, they either offer a refund or provide a later delivery date. 2) Airline tickets, concert tickets, etc. require transactions because every seat is unique w.r.t. location/time; contrast this with a particular book title, which is the same irrespective of the quantity. As a result, the cost of not meeting an order is minimal in Amazon's (or a brick-and-mortar store's) case. P.S.: I've had a race condition only once with Amazon, and they offered a refund in that case.
For a newbie like me, Alex is a lifesaver. I've watched Rick's talks before, but many things went over my head. Alex did a wonderful job of filling in the gaps for me here. So I'd suggest that anyone who is just starting with DynamoDB watch this video first. Rick is the wizard, so to understand him properly you'll need some normal-guy talk first (the basics), which Alex provides in this video at a much more comfortable pace. Thanks Alex.
Thanks! This really helped me understand things. I started with Rick's lecture, and it was a bit hard for a beginner to follow. This lecture was really well made and clear; now, after I practice a bit, I'm going to rewatch Rick's lecture :)
This is such a great talk. I've been doing lots of theory on DynamoDB for the last couple of weeks with an idea in my head of how GSIs might work, and this explained it really well. It would be great to hear you do an extended talk about projections and the like. Thanks for imparting your knowledge!
Glad to learn that I have been using DynamoDB wrong for over a year. Seriously though, these access patterns make a ton of sense when handling multiple entities. I had been maintaining multiple tables, one for each entity, as if I were still using an RDBMS. I had created "ghetto" joins by invoking Lambda functions that would make two independent queries and then join them computationally inside the Lambda, merging the JSON results of the two queries and returning the joined result. This works, but it is slower (80-100ms instead of ~10ms), on top of the cost of a Lambda invocation for every join query. I will definitely consider this new organizational pattern for a single table. The Lambda-joining method could still work for occasional complicated queries, like running a report, but it's not a good technique for everyday queries.
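For anyone curious, the join-in-Lambda approach described above can be sketched roughly like this. The entity names and shapes here are made up for illustration; real code would issue two separate `table.query()` calls against DynamoDB before merging.

```python
# Results of two independent queries (stubbed here with plain dicts;
# field names are illustrative, not from the talk).
users = {"u1": {"UserId": "u1", "Name": "alexdebrie"}}
orders = [
    {"OrderId": "o1", "UserId": "u1", "Amount": "142.23"},
    {"OrderId": "o2", "UserId": "u1", "Amount": "18.50"},
]

# The "join": merge the matching user record into each order before returning.
joined = [{**order, "User": users[order["UserId"]]} for order in orders]
print(joined[0]["User"]["Name"])  # alexdebrie
```

With a single-table design, the same data would come back in one Query against a shared partition key, with no merge step and no extra round trip.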
Edit: this was just before watching Rick Houlihan, the wizard. Go check out his videos if you haven't. Original: Amazon really needs to select their developer advocates/speakers better. This guy is the one speaker, out of the ten I've recently listened to while trying to understand DynamoDB basics and data modeling, who speaks plain English. The other guys just get lost in their own abstractions.
I'd really be glad to see a comparison of WCU/RCU counts (cost) for each method, because I suppose adding more and more GSIs gets pretty expensive, in every sense.
Very helpful. I’d love to see a follow-up that shows the code and setup for these, including creation of data, deletion, and updates. The latter two are important as they’re obviously messier when you have denormalised data.
At 27:46, couldn't I put OrderID in the PK and ItemID in the SK, and then create a GSI the same way Alex did, i.e., with OrderID as the PK and ItemID as the SK? To me that should solve the problem. Please correct me if I'm missing something here.
Great presentation!! However, for strongly consistent requirements: if the Order Items are updated and we use an inverted index (GSI1) to fetch them, there is a chance that we may get an inconsistent status for the Order Items. How does one ensure strong consistency in this case?
I don't understand the logic behind modeling the OrderItem entity as PK: ITEM#, SK: ORDER#. Couldn't we use PK: USER#, SK: ITEM#? Is it to spread the order items more evenly across different partitions? Even PK: ORDER#, SK: ITEM# would make more sense to me, because it would follow the previous pattern of one (PK) to many (SK).
I am new to DynamoDB and am probably missing something here. Why does order item have a different PK? I thought you could only have one primary key for the entire table.
Thanks for this really great video. I'm surprised at how much you can do with DynamoDB in terms of querying; you've shown some really great strategies. It's like you can do almost anything you can do with SQL. Can you, therefore, give some guidance on when *not* to do things in DynamoDB? E.g., when does the extra effort become too expensive, and when should you stick to SQL/MySQL?
Thanks, Alex. Your SKs are prefixed by the entity type, but the USER SK had "#PROFILE#" which was prefixed with a hash (at 27:24). Why is this one done differently?
Probably just because it's a sort key, to distinguish its formatting from the partition key. Anyway, how you choose to construct your composite keys is totally up to you as a developer. I have a similar setup to this.
Good question, Chris. It's one that a few people have asked, so I probably should have explained it better :). The short answer is: it doesn't really matter. I mostly did it to make the table look nice. I wanted the User Profile item to appear before the Order items in the table, but the sort key is sorted lexicographically. This means "PROFILE" would sort *after* "ORDER". As such, I prefixed the "PROFILE" record with a "#" so that it would appear before the Order items. If I had an access pattern of "Fetch a User Profile and the User's Orders", I might want this sorting so that I'd know my User Profile item would always be the first one when converting the results into objects in my application. There are a few other situations where you might need something like a "#" to help with ordering as well; they're a little hard to explain in this comment section though.
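The ordering effect is easy to see with a plain lexicographic sort. A small sketch (the key values mirror the style used in the talk, but the exact IDs and dates are made up):

```python
# DynamoDB sorts items within a partition lexicographically by sort key
# (UTF-8 byte order). "#" (0x23) sorts before all ASCII letters, so
# prefixing PROFILE with "#" pulls the profile item above the ORDER items.
without_hash = sorted(["PROFILE#alexdebrie", "ORDER#2020-01-04", "ORDER#2020-03-01"])
with_hash = sorted(["#PROFILE#alexdebrie", "ORDER#2020-01-04", "ORDER#2020-03-01"])

print(without_hash)  # profile lands *after* the orders
print(with_hash)     # profile lands *before* the orders
```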
@LaggyOnline Just to make sure I understand -- is your question why do the filtering in DynamoDB? Instead, just fetch all items related to a user (both Orders and the Profile), and then filter client-side to get just the Profile? If I'm understanding correctly, my response would be that it could be a significant amount of extra data you're reading without any benefit. In this example with 11 items that are pretty small, it doesn't seem like a big deal. But what if a User has 200 Orders? Or what if each Order has a lot of data on it and is 50KB per item? Now you're using extra read capacity units to fetch data that you're just going to throw away. Also, if your item collection is more than 1MB, you'll have to page through multiple requests just to find that one item you want. Hope that helps. Let me know if I misunderstood :)
@LaggyOnline Ahh gotcha. Yea, if you wanted to fetch two 1-N relationships in a single request, you're probably going to need to make multiple requests *unless* you have an idea of how many related items there are. One note is that you can basically model two 1-N relationships within a single item collection. For your sort key, have your parent item right in the middle, with one relationship going ascending and one relationship going descending. I've got a few examples detailing this further in the book --> dynamodbbook.com
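A minimal sketch of the "parent in the middle" idea described above. The `#ADDR`, `#PROFILE`, and `ORDER` prefixes are hypothetical, chosen only so the lexicographic sort places the parent item between the two relationships:

```python
# Sort keys chosen so one 1-N relation (#ADDR) sorts before the parent
# (#PROFILE) and the other (ORDER) sorts after it. A single Query can then
# read ascending or descending from the parent to walk either relation.
sks = [
    "ORDER#2020-03-01",
    "#ADDR#home",
    "#PROFILE#alexdebrie",
    "ORDER#2020-01-04",
    "#ADDR#work",
]
print(sorted(sks))  # parent item ends up in the middle
```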
I was wondering if putting "USER#" at the beginning of a partition key (as you can see at 21:31) would ever result in a hot partition? I guess I'm imagining that the partitions would be created based on alphabetical order of partition keys so in the above scenarios there would be a lot of entries in the partition for partition keys starting with U, but probably something more sophisticated is going on. I just wanted to check.
LaggyOnline is correct -- partition keys are hashed before being placed on storage nodes to prevent this issue. DynamoDB even announced some interesting rebalancing features recently where they will move frequently-accessed items to a different partition to help alleviate pressure on other items that happened to be on the same node. aws.amazon.com/about-aws/whats-new/2019/11/amazon-dynamodb-adaptive-capacity-now-handles-imbalanced-workloads-better-by-isolating-frequently-accessed-items-automatically/
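To illustrate why a shared "USER#" prefix doesn't cluster keys: DynamoDB's internal hash function isn't public, so md5 here is only a stand-in for the idea that the *full* key is hashed before placement.

```python
import hashlib

# DynamoDB hashes the full partition key value to pick a storage node, so
# keys sharing a "USER#" prefix are spread out rather than clustered.
# (md5 is just an illustration; the real hash function is internal.)
keys = ["USER#alexdebrie", "USER#nathanj", "USER#the-paul"]
for pk in keys:
    print(pk, "->", hashlib.md5(pk.encode()).hexdigest()[:8])
```

The digests share no common prefix even though the keys do, which is why alphabetical clustering of partition keys isn't a hot-partition risk by itself.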
Not a dig on DynamoDB in particular, but he glosses over the severe modeling issues you'll encounter in this kind of database. It seems to work fine in the scaled-down data model he demonstrates, but in a more real-world scenario, NoSQL will be unable to reasonably deal with queries such as "find the average monthly sales volume of customers who spend 80% of their purchases on bananas that come from Guatemala". The model would have to be designed and pre-joined in advance to answer this specific query. You'll find yourself creating multitudes of models, each designed for a different query. I suppose that's good for people who sell compute and disk space, though.
@Alex DeBrie Really good talk; every time I try to watch the Rick talk, I end up watching yours to clear my mind hahaha. But my big question is not about access patterns but about how to store the data in real life, thinking front end -> back end. From your table, it's very clear to me to use USER#userId as the partition key to store users and orders; it makes sense, since every time a user creates an order you store it one at a time, so a user cannot create many orders at the same time. But in the case of items: when you create an order, the same order can have many items, and from a UI point of view you don't create items at the same time you create an order; you pick them from a list. So when you store an item, you shouldn't have any OrderId yet, right? And if you do have a list of items, every time you create an order, do you also store each of its items? I don't get it. Thanks for this video :)
This is exactly the question running through my mind right now. The user creates an order, which gets an OrderID, and then with that OrderID each order item from the list is stored in the table? Wouldn't that require multiple writes?!
From a GUI perspective, the user selects the products, and once the order is placed those products are stored on the same Order item as a Map/List, similar to how Alex showed the one-to-many relation between the user and user_address. Or did I misunderstand your question?
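In code, that "products as a Map/List on the order" shape might look something like this (attribute names are illustrative, not from the talk):

```python
# An Order item carrying its products as a List of Maps, analogous to the
# Addresses map on the User item in the talk. Written once, atomically,
# when the order is placed.
order = {
    "PK": "USER#alexdebrie",
    "SK": "ORDER#2020-03-01",
    "Status": "PLACED",
    "Items": [
        {"ItemId": "456", "Description": "Bananas", "Qty": 2},
        {"ItemId": "789", "Description": "Coffee", "Qty": 1},
    ],
}
print(sum(i["Qty"] for i in order["Items"]))  # 3
```

The trade-off versus separate OrderItem items is that the embedded list can't be indexed or queried on its own; it works best when items are only ever read together with their order.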
If Amazon has switched over to using DynamoDB for their shopping cart, how can they guarantee ACID transactions when someone buys stuff from the cart? Don't you have to use a relational database for that?
Great talk, Alex. Thank you very much! I have a question though. Let's say we have another order status called 'DELIVERED'. How do you filter orders with status 'SHIPPED' or 'DELIVERED' efficiently? In SQL, it would look like "select * from orders where status = 'SHIPPED' OR status = 'DELIVERED'".
Thanks, Christian! Glad you liked it. Good question. One follow-up: do you know you'll be filtering for just those two statuses (SHIPPED and DELIVERED), or do you want the ability to flexibly specify multiple statuses (SHIPPED and DELIVERED here, but maybe another access pattern has SHIPPED and CANCELED)? If the former, you could use the Composite Sort Key pattern or the Sparse Index pattern discussed in the talk. If using the composite sort key, make the sort key something like 'SHIPPED_OR_DELIVERED#', which would allow direct queries on that. If using the sparse index, you would have an attribute that only exists on items in the SHIPPED or DELIVERED status; create an index on that attribute and then query the index. If you want the latter -- the flexibility to fetch arbitrary combinations of statuses -- it's a little tricky. DynamoDB wants you to be very specific about the access patterns you have. In this case, I would just make two parallel queries to DynamoDB on the index with the composite sort key: one that queries for Status = 'SHIPPED' and one that queries for Status = 'DELIVERED'. Does that make sense?
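The "two parallel queries" suggestion could be sketched like this. The index is stubbed with a dict so the sketch is runnable; in real code each lookup would be a `table.query()` with a `KeyConditionExpression` on the status attribute of the GSI:

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for the GSI keyed on order status; OrderIds are made up.
FAKE_INDEX = {
    "SHIPPED": [{"OrderId": "1"}, {"OrderId": "4"}],
    "DELIVERED": [{"OrderId": "2"}],
    "CANCELED": [{"OrderId": "3"}],
}

def query_by_status(status):
    # Real code: table.query(IndexName=..., KeyConditionExpression=...)
    return FAKE_INDEX.get(status, [])

# Issue both status queries in parallel, then merge the result pages.
with ThreadPoolExecutor(max_workers=2) as pool:
    pages = pool.map(query_by_status, ["SHIPPED", "DELIVERED"])
orders = [item for page in pages for item in page]
print(len(orders))  # 3
```

Because the two queries are independent, they run concurrently, so the overall latency is roughly one round trip rather than two.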
Thanks @@alexbdebrie, and yes, it makes sense. My use case is the latter. The closest example is the "Refine by" section on amazon.com, where you can select multiple brands and/or sellers. It looks to me like DynamoDB is not suited for this use case.
@@ianpogi5 Making two parallel requests isn't the end of the world either. Where you really get into trouble with DynamoDB is where you're waiting to make multiple, dependent requests that create a waterfall. That's when your requests get slow.
That's what the "inverted index" pattern is for, which he explains at 28:10. In other words, the original composite keys are PK=USER#nathanj, SK=ORDER#123 and PK=ITEM#456, SK=ORDER#123. The inverted keys (in a global secondary index) would give you PK=ORDER#123, SK=USER#nathanj and PK=ORDER#123, SK=ITEM#456. Thus, if all you have is orderId=123, you can query the index with PK="ORDER#123", and that will return the items whose sort keys start with USER# and ITEM#.
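That key flip can be sketched in a few lines, simulating the GSI with plain dicts (the key values are the same ones used in the comment above):

```python
# Base table items keyed on (PK, SK); the inverted index (GSI1) simply
# swaps the two, so ORDER#123 becomes the partition key.
base_items = [
    {"PK": "USER#nathanj", "SK": "ORDER#123", "Type": "Order"},
    {"PK": "ITEM#456", "SK": "ORDER#123", "Type": "OrderItem"},
]
gsi1 = [{"GSI1PK": item["SK"], "GSI1SK": item["PK"]} for item in base_items]

# "Query" GSI1 with PK = ORDER#123: the order's user and items come back together.
hits = [item["GSI1SK"] for item in gsi1 if item["GSI1PK"] == "ORDER#123"]
print(hits)  # ['USER#nathanj', 'ITEM#456']
```

In real DynamoDB you'd populate GSI1PK/GSI1SK as attributes on each item and declare them as the index's key schema; the service maintains the index copy for you.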
This is a little dated, but I just found it in the context of your book being released. I'm wondering about the STATUS sparse index example... I'd imagine status needs to be queried as part of many business processes. Once the status moves from 'picked' to 'ready', doesn't the shipping process need to query all the items in 'ready' status? Burn another GSI? Then it goes from 'shipped' to 'delivered'... again, burn another GSI? Or is there some implied microservice architecture where, once a status is reached, the order lives in another app, and at some point the order status in this process is updated just by order_id? Or maybe an inverted index on USER# STATUS#ORD_DATE? The last seems problematic: can't there be multiple orders for a customer with the same STATUS#ORD_DATE? And since you'd need the full STATUS#ORD_DATE to query the inverted index as the PK, how do you handle a distant backorder that's now filled?
I have the same feeling. Quick access at the expense of too much accidental complexity. Not really concerned about the cost of duplicating everything for each access pattern we need (i.e: a secondary index is essentially creating another table with some or all attributes projected/duplicated), but the modeling, maintenance and complexity seems just too much compared to a MongoDB or similar document DB. Add pagination to the mix and it's just crazy.
So we will persist the same dataset repeatedly based on access patterns, and hence have different secondary indexes. We are adding a lot of storage cost here by storing the same records with different PK and SK attributes and tags. The solutions in the NoSQL world need to be simplified! Miles to go...
This is a problem with DynamoDB and key-value stores, not with NoSQL in general. In MongoDB you can query, index, and filter documents without having to worry about duplicating information for every single access pattern.
I was with you until you added the model for OrderItem here: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-DIQVJqiSUkE.html. It seems like a bad idea to put the OrderItem ID in the primary/partition key, as the order items of a particular order are supposed to stay together in a single partition for better performance. By making it the partition key, you're going to have multiple order items in different partitions, making the query perform badly. And just to overcome that, we're adding an extra index and more, making it less efficient overall.