Episode #35: Advanced NoSQL Data Modeling in DynamoDB with Rick Houlihan (Part 2)

Serverless Chats

English - February 10, 2020 09:00 - 52 minutes - 48.1 MB - ★★★★★ - 29 ratings
Technology Education serverless faas baas cloud aws lambda

This is PART 2 of my conversation with Rick Houlihan. View PART 1.

About Rick Houlihan:

Rick has 30+ years of software and IT expertise and holds nine patents in Cloud Virtualization, Complex Event Processing, Root Cause Analysis, Microprocessor Architecture, and NoSQL Database technology. He currently runs the NoSQL Blackbelt team at AWS and for the last 5 years has been responsible for consulting with and onboarding the largest and most strategic customers our business supports. His role spans technology sectors, and as part of his engagements he routinely provides guidance on industry best practices, distributed systems implementation, cloud migration, and more. He led the architecture and design effort at Amazon for migrating thousands of relational workloads from Oracle to NoSQL and built the center of excellence team responsible for defining the best practices and design patterns used today by thousands of Amazon internal service teams and AWS customers. He currently works on the DynamoDB service team as a Principal Technologist focused on building the market for NoSQL services through design consultations, content creation, evangelism, and training.

Twitter: @houlihan_rick
LinkedIn: https://www.linkedin.com/in/rickhoulihan/
Best Practices for DynamoDB: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/best-practices.html
2017 re:Invent Talk: https://www.youtube.com/watch?v=jzeKPKpucS0
2018 re:Invent Talk: https://www.youtube.com/watch?v=HaEPXoXVf2k
2019 re:Invent Talk: https://www.youtube.com/watch?v=6yqfmXiZTlM

Transcript:

Jeremy: So one of the things that you have never mentioned, or at least I don't think I've ever seen you mention it, at least not in any of your talks on modeling, is local secondary indexes.

And I used to think, "Hey, this is great. They've got really strong guarantees and then it's sort of this great use case if you want to do a couple of different sorts." But LSIs are not quite ...

Rick: Not the panacea you might think they are.

Jeremy: Yes, correct.

Rick: So LSIs, I'm not exactly sure. I mean, I think you're exactly correct. The biggest value of LSI is the strong consistency, right? But the limiting factor of the LSI is it doesn't really let you kind of regroup the data, right?

Jeremy: Right.

Rick: You have to use the same partition key as the table. So the only thing you can really do is re-sort the data, right? So right there, that's a limited set of use cases, right? There's not a lot of access patterns. I mean there are, but there's not necessarily a ton of access patterns or applications that only require me to re-sort the data. Most applications are going to require me to group the data on multiple dimensions, so that limits the effectiveness of the LSI. The other thing about the LSI that kind of stinks is they have to be created at the time the table is created, and they can never be deleted.
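[Editor's sketch] The constraint Rick describes can be illustrated with key schemas shaped like DynamoDB `CreateTable` parameters. This is a minimal sketch, not a definitive implementation: the table, index, and attribute names (`PK`, `SK`, `CreatedAt`, `Status`, `ByCreatedAt`, `ByStatus`) and both helper functions are hypothetical, and nothing is sent to AWS.

```python
# Hypothetical key schemas in the shape used by DynamoDB CreateTable.
# All attribute and index names are assumptions made for illustration.
TABLE_KEY_SCHEMA = [
    {"AttributeName": "PK", "KeyType": "HASH"},    # partition key
    {"AttributeName": "SK", "KeyType": "RANGE"},   # sort key
]

# An LSI keeps the table's partition key and only swaps the sort key,
# so it can re-sort items within a partition but cannot regroup them.
LOCAL_INDEX = {
    "IndexName": "ByCreatedAt",
    "KeySchema": [
        {"AttributeName": "PK", "KeyType": "HASH"},
        {"AttributeName": "CreatedAt", "KeyType": "RANGE"},
    ],
}

# A GSI may choose a different partition key, which regroups the data.
GLOBAL_INDEX = {
    "IndexName": "ByStatus",
    "KeySchema": [
        {"AttributeName": "Status", "KeyType": "HASH"},
        {"AttributeName": "CreatedAt", "KeyType": "RANGE"},
    ],
}


def partition_key(key_schema):
    """Return the HASH (partition) attribute name from a key schema."""
    return next(k["AttributeName"] for k in key_schema if k["KeyType"] == "HASH")


def can_be_local_index(table_schema, index_schema):
    """True when the index reuses the table's partition key, as an LSI must."""
    return partition_key(table_schema) == partition_key(index_schema)
```

Under these assumptions, `can_be_local_index(TABLE_KEY_SCHEMA, LOCAL_INDEX["KeySchema"])` holds, while the `ByStatus` schema fails the check and would have to be a GSI.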

So if you mess it up, then you've got to recreate the table to get rid of them, and I find them to be of extremely limited use. I mean, most developers can tell you that strong consistency is an absolute requirement, but when you get down to it and start looking at the nature of their application, yeah, what they really need is read-after-write consistency, right? It's worth kind of talking about the difference, right? Strong consistency implies that no update to the database is going to be acknowledged to the client unless all copies, or all indexed copies, of that data are also updated, right?

Jeremy: Yeah.

Rick: That's strong consistency. That means if I'm in a highly concurrent environment, that no two clients could read different data, okay? Unless the read is not, or the write is not yet fully committed. As long as the write hasn't committed, you're not going to get two copies of the data. Well, most use cases are really more about like if I make the write and I read back, did I get the right data?

So what we're really talking about is read-after-write consistency. If you think about the round trip between the client and the system, let's say in DynamoDB, GSI replication is 10 milliseconds or less. It's highly unlikely that the client is ever going to be able to return to the server and ask for the same data back in 10 milliseconds.
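[Editor's sketch] The timing argument above can be modeled with a toy, asynchronously replicated index. This is a sketch under stated assumptions, not DynamoDB's actual replication mechanism: the `LaggedIndex` class, the `read_after_write` helper, and the example key are hypothetical. Only the 10 millisecond lag comes from the conversation; the round-trip times are illustrative.

```python
class LaggedIndex:
    """Toy model of an index that receives writes after a fixed lag."""

    def __init__(self, lag_ms):
        self.lag_ms = lag_ms
        self._pending = []   # (visible_at_ms, key, value) not yet replicated
        self._visible = {}   # what a reader of the index can currently see

    def write(self, now_ms, key, value):
        # The base-table write is acknowledged at now_ms; the index copy
        # only becomes visible lag_ms later.
        self._pending.append((now_ms + self.lag_ms, key, value))

    def read(self, now_ms, key):
        # Apply every replicated write whose visibility time has passed.
        still_pending = []
        for visible_at, k, v in self._pending:
            if visible_at <= now_ms:
                self._visible[k] = v
            else:
                still_pending.append((visible_at, k, v))
        self._pending = still_pending
        return self._visible.get(key)


def read_after_write(lag_ms, round_trip_ms):
    """Write at t=0, then read the index after one client round trip."""
    index = LaggedIndex(lag_ms)
    index.write(0, "order#1", "SHIPPED")
    return index.read(round_trip_ms, "order#1")
```

With a 10 ms lag, a client that comes back after an assumed 50 ms round trip reads its own write, while a hypothetical 5 ms round trip would still miss it. That second case is the one the conversation treats as unlikely in practice.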

Jeremy: And honestly, if you do, welcome to distributed systems.

Rick: That's exactly right. I mean, that's the other thing I was going to say: in most distributed systems, what you'll find is there's a propagation delay on configuration data. So oftentimes, even if you get to the point where the developers are going to tell you that there's going to be concurrent access on this data, when you back up a step, you're going to find configuration data is going to live in multiple entities. So hey, all bets are off, right?

So let's take a look at that need for strong consistency and not make arbitrary requirements, because as developers when we make arbitrary requirements, it's like hooking a fire hose up to our wallets. Let's make sure that we're actually making requirements that are meaningful to our business. 90% of the application workloads I work with, I would say maybe even higher, don't require strong consistency. So let's just use those GSIs. They're much more flexible, right? They can be created anytime, they carry their own capacity allocations. They don't pillage capacity from the table. Overall they're just a lot more flexible.

Jeremy: Yeah, and you've got more control. I mean, that's one of those things too. If you are doing the single table design and you're using all those different entity types and so forth, what are the chances that all those LSIs and the sorts all align with one another too? It seems like a lot of wasted capacity.

Rick: Inevitably, you're going to end up using GSIs, right?

Jeremy: Right. Exactly.

Rick: You may be able to use an LSI for one use case, but you can't use them for all of them.

Jeremy: Yeah, and I mean, I think the important thing about LSIs too is, regardless of the inflexibility of them, it also doubles the cost, right?

Rick: Well, all indexes double the cost, right? I mean [crosstalk 00:49:35]

Jeremy: Of course, yeah.

Rick: Because actually, one of the things people kind of ... It's kind of an incorrect assumption about LSIs is that customers believe that, "Oh, they use the same capacity as the table. Oh, they must be free." No, they're not free. You still pay for the storage, you still pay for the capacity. I'm just going to have to allocate twice as much capacity to the table now.
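[Editor's sketch] The "twice as much capacity" point can be put as rough arithmetic. This is a simplified estimate, not a billing formula: it assumes a standard write consumes one write capacity unit per 1 KB of item size (rounded up), that every index projects the whole item, and that each write touches each index exactly once. The function name `write_units_per_item` is hypothetical.

```python
import math


def write_units_per_item(item_size_bytes, index_count):
    """Rough write units for one item write across the table and its indexes.

    Assumptions (simplified): 1 unit per 1 KB rounded up, each index stores
    a full copy of the item, and no index key changes (which would cost more).
    """
    per_copy = math.ceil(item_size_bytes / 1024)
    # One copy goes to the table, plus one copy per index.
    return per_copy * (1 + index_count)
```

Under these assumptions a 900-byte item costs 1 unit with no index and 2 units with one index, whether that index is an LSI drawing on the table's capacity or a GSI with its own allocation.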

Jeremy: Moving on from LSIs and GSIs, the other thing that always comes up is this idea of hot keys or hot partitions where you basically have one key that gets accessed quite a bit. You sort of pointed this out in your slides where you s...
