@UC0FVA9DdusgF7y2gwveSsng This, and it's not really different from life, most things have tradeoffs, some almost don't and some barely have any advantages
They refer to AWS as another product because Prime Video and AWS are actually two different companies, although they are both owned by Amazon. I believe Prime Video still pays AWS for using its cloud services, which makes Prime Video an AWS user, as was stated in some parts of the text
Exactly this, and add the fact that long ago Bezos told all of AWS they were to become a platform, and could only communicate/use each other through APIs and be each other's "customer".
@@Sanglierification Cos they are two different companies; it has to do with their business model, I think... so Prime Video is just like any other AWS customer
Having worked with both, there are definitely pros and cons. However, doing microservices correctly is very, very difficult, and most people tend to prefer microservices and serverless because they're cool and fresh. At the end of the day, monoliths should be a great default until there is proof that your business requires something else.
I think a more accurate description would be they moved from serverless distributed asynchronous nanoservices to monitor streaming data, to a synchronous microservice which is obviously more performant for real time video and data analytics. This doesn't really have anything to do with a "monolith", just seems like they broke up the service too fine grained and saw the problems of that. This video streaming quality monitoring service is just one service in the bigger prime video system. ECS tasks are designed for microservices, same idea as a pod in Kubernetes, can be scaled up based on rules. Seems like a lot of people are caught up in the definitions of terms rather than choosing the right technologies and patterns to solve a particular problem.
I think you hit the nail on the head Neal. The smaller-scale corollary is splitting up your code into async functions that live in separate packages, deploying each one with some sort of auto-scaling containerization env, then having each one constantly trigger the other ones and eat load balancing, spin-up, and other associated costs, while calling each function a microservice.
Using an S3 bucket as an intermediate step sounds like something suitable for a prototype. But if I wanted to move the project to production, I'd consider using a message broker like Kafka instead of S3. What do you think?
Serverless was always solid performance- and cost-wise when you had not-so-constant loads. A video streaming platform will almost always see traffic, and almost every outgoing stream is significant (not a short-lived connection/load), so it made sense for them to switch. Excited to check out the video now, because when I read the blog post last month, it left out a lot of details. Thanks for the great content as always :D
It is literally not the case. A video streaming platform will never have consistent traffic; it varies heavily hour by hour. Serverless does grant you significant up/downscaling advantages there, albeit arguably not really necessary ones considering the cost
Won't video streaming be greatly impacted by the content (content release schedule specifically)? Example, the Lord of The Rings series being released would definitely bring a different load.
That is the strangest thing to me. I think it's fairly obvious that serverless costs more when handling load 100% of the time; it has a higher price per minute than EC2, but with the advantage that you pay only for what you use. I believe for sure there is a cost difference between monolith and microservices architectures, but a 90% cost reduction doesn't seem like the right number. Another thing: they keep using a microservices architecture (maybe you can convince me it's SOA); they have just changed the boundaries the system has. Big title, but the technical aspect is a mess
@@noobgam6331 Would you explain, please, why a streaming platform never has consistent traffic? My naive opinion would be that there are multiple time zones as well as multiple shifts people work in; hence, we could expect a certain pattern of traffic.
This wasn't a microservice vs monolith issue at all. Read the Amazon article. The issue was more with choosing a serverless-functions approach for a computationally heavy image processing task. Not much to do with it being a microservice. Their two main issues were: 1. Having to pass large image files between functions, with higher S3 cost. 2. Hitting serverless platform limits due to frequent state changes. Most of this could have gone away by properly partitioning the service and deploying on a platform that could handle intensive loads. Both their issues made it clear that going with a serverless approach for this production workload was a mistake. It might have been fine for testing or a smaller need.
For sure. Another thing that bothers me is the fact that they analyze image files. I know that codecs are hard to handle, but if I'm a billion-dollar company, for sure I'll take my time to handle this using codec information instead of RGB images.
The payment part: Amazon is internally a customer of AWS, and each department in Amazon is billed by AWS separately. That's why they write about the user paying
I don't understand why people call this a monolith. The thing is still decoupled into containers that are also fault tolerant, but just because the s#i7 is running in a single EC2 instance per AZ or region (which is not specified, but it has to be), they dare to call it a "monolith". Sometimes I believe this kind of article is just marketing driven.
17:15 AWS Step Functions is actually different from AWS Lambda. It's like a workflow orchestration tool / state machine. I only know this because I was also initially confused by Step Functions and thought it was the same as a normal Lambda
Maybe I'm getting lost in the terminology, but IMO "serverless" != "microservices". This is still a microservice in the sense that it completes a singular task within a larger ecosystem of services. I'm not sure where the rule came in that microservices needed to be serverless by definition.
You are right, it's YouTube gurus who've never written microservices/monoliths; they just treat microservices and serverless as a bible because it's trendy.
@@mannycalavera121 Those terms describe different things. Serverless means you as a developer don't need to worry about setting up the environment; most platforms also autoscale and shut down if there are no pending requests. Microservices describe how a large application is split into smaller, manageable pieces that are much easier to develop and maintain, with the added benefit of fine-tuning the number of active instances based on their load. Commonly, microservices are deployed on serverless platforms, since their benefits align for lightweight use cases. But this isn't always the case: microservices can be deployed on a variety of production environments, mostly as containerized deployments, and they can be programmed to run nonstop on a platform you set up and maintain. Likewise, a large monolith "can" be run on serverless. There won't be much benefit, given slow start-up times and increased costs based on how the platform charges.
Thank you for the video! I'm working on a personal project I hope to turn into a business. I'm starting with a monolith, because I believe it's the right choice. In the last 10 years I've exclusively worked on microservices or on breaking a monolith into microservices. I've never seen a well-engineered monolith. It's very hard to find such content in order to learn about best practices, as well. It'd be great if you could make a video/Udemy course on building a proper monolith in modern times: best architecture practices, the best way to deploy and monitor it, etc. I know how to do that for microservices, but not for a monolith... The irony...
@@stevenhe3462 I can check Elixir out, but I'd expect a technology agnostic approach of a proper monolith architecture along with a good way of deploying, scaling, monitoring it.
@@skwtf I recommend the book "Functional Web Development with Elixir, OTP, and Phoenix." In a nutshell: separate the logic and communication, put them into separate light processes, and send messages around. Of course, there are other ways, but this is a way that is easy to manage and scales "infinitely."
IMO, the monolith system should follow a producer/consumer model with an in-memory message queue (for now) in between. The media converter can be the producer, starting conversion and pushing work onto the queue, and the detectors could be consumer threads pulling from the queue. Not sure why there is a need for an orchestrator. In the future, the in-memory message queue could be replaced by something like a Redis-based message queue to scale media converters and detectors separately.
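A minimal sketch of the producer/consumer idea in the comment above, assuming Python threads and the standard-library `queue` stand in for the real converter and detectors (all names here are illustrative, not Prime Video's actual code):

```python
import queue
import threading

frame_queue = queue.Queue(maxsize=100)  # in-memory queue between producer and consumers
results = []
results_lock = threading.Lock()

def media_converter(num_frames, num_detectors):
    """Producer: decode frames and push them onto the queue."""
    for frame_id in range(num_frames):
        frame_queue.put(frame_id)      # blocks when full, giving natural backpressure
    for _ in range(num_detectors):
        frame_queue.put(None)          # one sentinel per consumer to signal shutdown

def detector(name):
    """Consumer: pull frames off the queue and run defect detection on them."""
    while True:
        frame = frame_queue.get()
        if frame is None:
            break
        with results_lock:
            results.append((name, frame))  # stand-in for real defect analysis

detectors = [threading.Thread(target=detector, args=(f"det{i}",)) for i in range(3)]
for t in detectors:
    t.start()
media_converter(num_frames=10, num_detectors=3)
for t in detectors:
    t.join()

print(len(results))  # every frame was analyzed by some detector
```

No orchestrator is needed here because the queue itself coordinates the handoff; swapping `queue.Queue` for a Redis list would let the two sides scale independently, as the comment suggests.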
My 2 cents: AWS Step Functions is their orchestration tool (Lambda is the serverless product). We use orchestration when we need a central party controlling the order of execution of a flow like A-B-C-D (called a workflow in orchestration), where A, B, C, D are different network components (called tasks/activities in orchestration). Every transfer from A to B and from B to C is a state transition, and orchestration systems like Step Functions and Temporal charge the customer (the Prime Video team here) per state transition; for any real business flow these quickly add up.
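A back-of-the-envelope sketch of why per-state-transition billing adds up, as described above. The $25-per-million figure is the published Step Functions Standard Workflows list price at one point in time; treat it, and the example workload numbers, as assumptions:

```python
PRICE_PER_MILLION_TRANSITIONS = 25.0  # USD, assumed list price; check current pricing

def monthly_cost(transitions_per_execution, executions_per_second):
    """Rough monthly orchestration bill for a steady stream of workflow executions."""
    seconds_per_month = 60 * 60 * 24 * 30
    transitions = transitions_per_execution * executions_per_second * seconds_per_month
    return transitions / 1_000_000 * PRICE_PER_MILLION_TRANSITIONS

# e.g. a 4-task A-B-C-D flow (~5 transitions with start/end states),
# triggered at a hypothetical 100 executions per second:
print(round(monthly_cost(5, 100)))  # 32400 (USD/month), before any compute costs
```

The point of the arithmetic: even a modest per-transition price becomes tens of thousands of dollars a month once a workflow fires continuously at scale.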
Thanks for enlightening us on this topic. You made a lot of educated guesses and I think your interpretation is very logical despite the fact that so much data is missing.
I think there are 2 things here that were not mentioned: 1) they seem to have realized that they didn't need to make this async (in the microservice approach they were using S3 + Step Functions as a queue); 2) they seem to have realized that everything could fit into RAM, and they didn't need S3
The Prime Video symbol looks like an arrow moving from microservices to monolith. Companies need to analyze the design and the cost before deciding which way to go. Prime Video coming out with a blog on this is really superb! And I understand it has a business impact on AWS.
I agree that the diagrams do seem like something a random developer at a small company would produce to fulfill the ask of an "architecture diagram" for the product, and are rather rough around the edges / quickly made.
I believe the SNS notification is for the client application, so it knows that monitoring is actually going on and it's not just sending data into a dead service
Gosh. You don’t say? Imagine if you will, that a simpler architecture with simpler deployment is substantially simpler to maintain. Man. Who would have ever thought? Oh, yeah, literally any developer that’s actually thought about it for more than 30 seconds.
If we think of Prime as a different company and a customer of AWS, then the AWS SNS customer notifications and the state transition charges make sense
If the orchestration workflow is too expensive, then their initial segregation into microservices was probably too fine-grained. So they have now found a better bounded context. It does not mean the concept of microservices was flawed.
The team used an OLTP architectural style (microservices) on an OLAP problem (analytical). This mismatch of problem space to solution space leads to suboptimal outcomes. There are tools, techniques, and architectural processing styles specifically for real-time big-data analytical use cases; data engineering is an entire discipline dedicated to this problem space. Apache Spark is an example of technology optimized for this type of processing, and AWS has managed services like EMR that let you run these workloads easily. Not everything is a nail 🙂
Umm, why develop the monitoring service this way? 1. Spawn random test Prime apps in different regions, streaming the required live event as-is. 2. Do any analytics operations on the same machine (where the test Prime app is running). 3. Send results to one source. 4. Set up alerts on that source based on the analytics received. Benefits: 1. Highly scalable. 2. Doesn't use the actual customer's bandwidth (which is critical; suppose a customer experiences buffering because uploading performance metrics is taking time). 3. Even if one test app crashes, data would still be sent by the other apps, so higher reliability in case of an actual performance issue.
I believe the users and customers topic is just about the AWS account serving the infrastructure: the user is the AWS user account, and the customer is just an SNS topic dedicated to receiving the user data.
Microservices mean code that performs a single function, and you have lots of it spread across services. A monolith could be an application that handles all kinds of requests, and it usually deals with consistency very well because it doesn't have to wait for other services over the network. Maybe Amazon Prime realized they don't need to break down all the functions and a single container can handle all the clients' requests. My company pays $0 every month for ECS + EC2, as we have the cost savings plan on the EC2 instance type.
+1. This blog was very high level. When I read it, I couldn't follow a lot of the user flows, which I think are missing from the doc. The diagrams could have been better. Came to this video after reading the blog :)
Writing to S3 is often good. It needs to be asynchronous, and if the analysis fails, it can be retried later, so data will not be lost. In this case I believe it is used solely for triggering the serverless Step Functions. S3 traffic is free within the region; their problem is the number of requests hitting it, as they are charged for each of them. Instance storage is a non-network disk, unlike an EBS volume; it doesn't persist data over a reboot, I guess, but it is still SSD. I think they are still doing microservices or, as you said, macroservices. They just moved away from the serverless stack.
Going from a monolith to microservices, I learnt one thing... If your microservices are overly complicated: a. your staff are not skilled enough to manage a microservices environment; b. you designed your microservices incorrectly; or c. your service app does not fit a microservice design, which goes with b. (you designed it wrong) and with a. (your staff are unskilled).
To your point about the 2nd diagram, in the original diagram they showed the S3 content being consumed by the frame checking service. They don't have a solid black line pointing at the frame checking (step) functions, but it can be intuited that any time a new frame hits the bucket an S3 event triggers a step function consuming it (the dotted line pointing at S3). I don't think however that the media conversion service triggered by the lambda answers to that lambda at all, as you seem to allude to with your mouse gestures when you say "I called it".
When you say that the asynchronous operation runs in the background and calls a callback function when it finishes so that the main event loop can execute it, isn't that "background" a different thread, which would make JavaScript not single-threaded?! 🤔
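The usual answer to the question above is that the blocking work is offloaded (to OS facilities or a worker-thread pool, e.g. libuv in Node), while your callbacks still run on the single event-loop thread. Here is a hedged Python analogue using `asyncio.run_in_executor`, chosen because it makes both threads observable; it illustrates the pattern rather than JavaScript's actual runtime:

```python
import asyncio
import threading

def blocking_work():
    # Runs on an executor (worker) thread, like I/O offloaded by the runtime.
    return threading.current_thread().name

async def main():
    loop = asyncio.get_running_loop()
    # Await the offloaded work; the coroutine suspends, the loop keeps running.
    worker = await loop.run_in_executor(None, blocking_work)
    # Code after the 'await' (the "callback") resumes on the event-loop thread.
    resumer = threading.current_thread().name
    return worker, resumer

worker, resumer = asyncio.run(main())
print(worker != resumer)        # the blocking part ran on a different thread
print(resumer == "MainThread")  # but the continuation ran back on the loop's thread
```

So "single-threaded" refers to where your callbacks execute, not to the whole process: helper threads exist, but they never run your callback code concurrently with the loop.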
This is exactly why one of my company's products failed. They wanted a fully serverless backend. It worked __okay__ during development, despite issues like latency, which can be solved, but once it went into production they quickly realized the amount of data transfer between microservices would outweigh any benefit provided by the architecture. In my opinion, microservices are great for asynchronous functions, but not great for building complex systems that pass data back and forth. Also, Step Functions, I feel, are just a marketing gimmick to get you to incur greater transfer costs, and you get charged for each step. Why do I need Step Functions when I can just code the steps into a single Lambda? Yes, there are use cases, and we do use them... just not for everything
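A minimal sketch of the "just code the steps into a single Lambda" point above: the orchestration becomes ordinary function calls inside one handler, so no state transitions are billed and no payload is shuttled through S3 between steps. All function names here are illustrative, not a real API:

```python
def fetch_frame(event):
    # Step A: look up the frame to analyze (stubbed with fake data).
    return {"frame": event["frame_id"], "data": b"\x00" * 4}

def detect_defects(frame):
    # Step B: run defect detection on the frame (stubbed as "no defects").
    return {"frame": frame["frame"], "defects": []}

def publish_result(result):
    # Step C: attach a status and hand the result back.
    return {"status": "ok", **result}

def handler(event, context=None):
    # The whole A -> B -> C chain is in-process: data stays in memory.
    frame = fetch_frame(event)
    result = detect_defects(frame)
    return publish_result(result)

print(handler({"frame_id": 7}))
```

The trade-off, as the comment concedes, is that you lose Step Functions' per-step retries, visibility, and long-running waits, which is why there are still legitimate use cases for it.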
We had a third-party API which did not notify us, after triggering some process, that we could consume their data. So polling was the only choice, but polling every minute or so would be expensive. BTW, it might take up to 20 minutes for the data to be ready for consumption, and only users can trigger that process. Eventually, we decided to use Step Functions with their timers to run the same Lambda multiple times with some delays in between, for a better user experience. I don't know if there is a better design, but at least it works 😅.
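A hedged sketch of the delayed-polling idea above: check the slow third-party API with an increasing wait schedule instead of a fixed one-minute loop. The fake API and the delay values are illustrative, not the commenter's actual setup; in their design, a Step Functions Wait state would do the pausing that the comment below marks:

```python
import itertools

def poll_with_backoff(check_ready, delays=(60, 120, 300, 600)):
    """Return (ready, total_wait_seconds). A real version would actually sleep,
    or let a Step Functions Wait state pause between Lambda invocations."""
    waited = 0
    for delay in itertools.chain([0], delays):
        waited += delay          # where the Wait state / sleep would happen
        if check_ready():
            return True, waited
    return False, waited

# Fake API: the data becomes ready on the third check.
calls = iter([False, False, True])
ready, waited = poll_with_backoff(lambda: next(calls))
print(ready, waited)  # True 180
```

With a schedule like this, a 20-minute worst case costs only five checks instead of twenty one-minute polls, which is the cost saving the comment is after.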
Really poor content from Amazon engineering, as expected. It's their weird Leadership Principles that tell them to label every end device/user a "customer". Not enough details were provided, the diagrams are poorly done, and it looks like this blog hasn't been through QA. They do not even describe it properly; it's just that they wanted to prove to the world that they achieved a cost reduction by converting a microservice into a monolith. I'd argue that their initial microservice design was sub-optimal; there was already a lot of room for improvement even if it were to remain a microservice. Looks like an L4 engineer designed this and wrote the blog for their promo.
For how huge Amazon is, I don't expect the separate divisions to consider that they all belong together. To Prime Video, AWS is just another cloud provider.
When the article says "AWS charges users", in this case it means AWS charges Prime Video. Basically, Prime Video is a customer of AWS. I think it's also a cultural thing at AWS/Amazon: even if an internal team is using your service, it's just a customer, treated almost like an external customer, and charged just like one.
I think maybe the reason they went with the monolithic approach in this case comes down to one thing... Their main goal was cost reduction, so the solution they picked was the one that reduced cost the most, with respect to other factors & standards. I believe Amazon's culture is a business-centered culture, not an engineering-centered culture (cough cough, Google), which is why they did not go for engineering perfection but rather for business perfection.
I assumed it was just what they encoded for the device. But I'm pretty sure they won't just be uploading things from the client again, at least not without the consent of the user. Anything back from the customer would not necessarily reflect what the customer saw.
I don't know whether to laugh or cry at the initial architecture. Tools like ffmpeg work better on chips with large amounts of cache because pulling in data from RAM can be too slow! Of course network access is going to cause slowdowns. At the very least, the I-frame needs to be available to render all the deltas from it; they literally have a data locality problem. They would have seen similar issues just going to multi-socket systems. The "monolithic" approach they are taking is just doing everything they need in one place: is the feed frozen, is the audio out of sync, is it breaking up, is the quality really different from the reference feed, etc. The approach is no different from being dealt with by one person when you go to the bank, versus being sent to one department for part 1 and then another department for part 2, where the first department needs to send over your file (the context), or, more applicably, department 2 asks almost everything all over again (in this case only the media conversion is shared). Now that I think about it, in threaded programming it is said that if your code is computationally intensive, threading it might actually make it slower: context switches are not free, and constantly tearing down and rebuilding that context every time a switch between threads occurs hurts performance.
Just looking at the diagram, here's what I'd say initially, from my experience with Step Functions and Lambda (I'll delete this comment if my opinion changes): neither Step Functions nor Lambdas are great for a highly used production system, at least with respect to cost. They can scale really well, but they'll cost a lot.
From the limited info, keeping related things together is the better option to minimize communication overhead. Not sure if vertical scaling would ever be an issue, since the media conversion and detectors run in the same box. If there are x detectors running on the box, would there ever be a need for 2^x detectors in the future, such that media conversion and all the detectors can't fit on the box?
This is a by-product of designing by trends vs designing to address a particular problem. There are basic concepts that have been lost over the years, for instance co-locating the data and the processing. Distributed computing comes at a cost, in various ways. With the cloud, Docker, Terraform, etc., it's easy to design something poorly that works but comes at hidden costs: stability and money.
I really wonder whether this whole system initially started as a monolith and is now coming back, or whether it was built in a microservices architecture from the ground up.
"AWS Step Functions charges users per state transition". This is totally clear, right? 😑 OBVIOUSLY the user of these step functions is Amazon themselves (the title also makes this fairly clear). They're simply outlining how step functions are charged to companies/developers utilizing them; how is the word "user" here confusing? It's how all the documentation for services reads too.
I don't think Step Functions get triggered like Lambdas... you need a Lambda to start a Step Function, and you need the newly started Step Function to point to S3... I don't think you can use a Lambda for data transfer to a Step Function because of the 15-minute timeout and payload limitations!
Can't you have a hybrid system that runs as a monolith under normal loads and spins up instances as required? If you keep the code modular, I don't really understand the problem? Dunning-Kruger effect, I guess
If you see Primegean's take on it (he, btw, works at Netflix), you'll see that Amazon's use of their resources is atrocious to begin with. So I wouldn't read too deeply into their white paper.
They can run the defect detectors in the browser of the user using a web worker; they don't need all this mess. It doesn't matter if it's microservices or a monolith; to me this says they don't even understand the problem, they don't even understand distributed systems.
Okay, so they gave an intern some AWS tools and they hacked something together. Threads with shared memory can communicate more efficiently than writing everything to an object/file store first. No surprise. If this weren't Amazon, nobody would find this notable. I feel like the whole thing is a waste of time.
I wonder whether it is really necessary to monitor every single stream; this sounds like a complete waste of resources. Nobody would be harmed if a stream of a movie had corrupted frames or audio. Wouldn't it be enough to just let a few people randomly switch through the streams, listening and watching for issues, then check the next stream? Yes, some issues might remain unnoticed, but then the customer will probably complain to the support center anyway
Summary automatically generated by Alphy:
1. Background and challenges faced by Amazon Prime Video's engineering team
- Prime Video offers thousands of live streams to customers and has an architecture to detect and monitor the user's experience.
- To address high scale and expensive infrastructure issues, they moved all components into a single process and kept data transfer within the process memory.
- The service consists of three major components: the media converter, the defect detectors, and the orchestrator.
- AWS Step Functions served as their orchestration workflow, which became one of their main scaling bottlenecks.
2. The solution: moving from serverless and microservices to a monolith
- Amazon Prime Video's engineering team moved a monitoring service from serverless and microservices to a monolith, saving 90% of the cost.
- They achieved higher scale, resilience, and reduced cost by moving to a monolith application.
- The company decided to re-architect by creating a monolith and added an orchestration layer to load-balance and forward requests to the servers.
- Microservices and serverless components are tools that do work at high scale, but whether to use them depends on the case.
3. Technical details of the new solution
- They faced cost issues related to state transitions and passing frames around different components, since the media converter split videos into frames and temporarily uploaded the images to S3.
- The old system involved downloading and uploading large files, which has its limits, particularly with S3 and related Amazon products.
- The diagram shows a solution where media conversion feeds the detectors directly, without S3 and serverless functions for frame input. This eliminates the need for orchestration.
- The components of the system remain the same, and when the capacity of a single instance is exceeded, the service is cloned multiple times with different parameters.
- Advantages of moving the solution to Amazon EC2 include using compute savings plans that help reduce costs.
4. Benefits and results of the new solution
- The changes allow Prime to monitor all streams viewed by customers, resulting in even higher quality.