
How to handle message retries & failures in event-driven systems? Handling retries with Kafka?

Daniel Tammadge
1.4K subscribers
31K views

How to handle message retries & failures in event driven systems?
Make sure to watch • Apache Kafka: Keeping ... (Keeping the order of events when retrying due to failure) after this one
Event-driven architecture is great when your services are running and processing without problems, but handling failures can be hard.
How do you handle retries in Apache Kafka?
#eventdrivenarchitecture #danieltammadge #ApacheKafka
-
I use www.lucidchart... for my diagrams & www.flaticon.com where I use my pro subscription to find images for my content
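The retry-topic pattern discussed in the video can be sketched as simple routing logic: on failure, a message moves to the next retry topic, and to a dead-letter topic once retries are exhausted. The topic names and retry limit below are illustrative, not from the video; a real consumer would publish via a Kafka producer rather than return a topic name.

```python
MAX_RETRIES = 3  # illustrative retry budget

def next_topic_on_failure(current_topic: str) -> str:
    """Decide where a failed message goes next: the first retry topic,
    the next retry topic, or the dead-letter topic once exhausted."""
    if current_topic == "orders":
        return "orders-retry-1"
    if current_topic.startswith("orders-retry-"):
        attempt = int(current_topic.rsplit("-", 1)[1])
        if attempt >= MAX_RETRIES:
            return "orders-dlq"  # give up; park for manual inspection
        return f"orders-retry-{attempt + 1}"
    raise ValueError(f"unknown topic: {current_topic}")
```

A consumer of each retry topic would apply the same function when processing fails again, so messages walk the chain `orders → orders-retry-1 → … → orders-dlq`.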

Published: 20 Oct 2024

Comments: 52
@IsabelPalomar
@IsabelPalomar 2 years ago
Great video! I really like your conclusion and final comments. I have been working with Kafka a lot this year and event driven is definitely complex.
@Danieltammadge
@Danieltammadge 2 years ago
Thank you for taking the time to comment. Glad you liked it
@StephenTD
@StephenTD 2 years ago
Awesome, it cleared up my questions around how to handle retries using an event streaming platform like Apache Kafka, and thank you for part 2, where you went into how to keep ordering. Again, amazing videos!!!!
@Danieltammadge
@Danieltammadge 2 years ago
Thank you, Danny, for taking the time to watch my videos and to write a comment on each one. I'm glad that you found them helpful.
@tibi536
@tibi536 3 years ago
Nicely explained - I really liked the presentation :)
@Danieltammadge
@Danieltammadge 3 years ago
Thanks I am glad it helped
@ricardo.fontanelli
@ricardo.fontanelli 3 years ago
Great video. I would just add one small thing to the retry mechanism: think about event order! Do you really want to consume event 5 after consuming event 7? In many cases, if you have already consumed event 7, for example to update an entity copy in a microservice, all you need is to discard event 5. To do so, you need to record the id/offset of the last event successfully processed.
@Danieltammadge
@Danieltammadge 3 years ago
I'm glad you liked it. When you use the terms "entity copy" and "microservices", I'm assuming you are looking at this from a change data capture perspective, where the order is of course important as you are looking to maintain a local copy of data using the events. In this use case, one could ignore the failed event if a later event for the same entity is processed. Or, if you cannot ignore the earlier event, then trigger different logic or a process to remedy the out-of-sync data. In specific solutions where requirements meant the processor needed to adhere to exactly-once processing, or where events could be processed out of order, we implemented a processing log table against which guard checks can be performed. Hopefully, Ricardo, I have understood your point. Let me know if I have misunderstood anything. And thank you for taking the time to point this out, as it shows that not all solutions fit all cases, and we need to understand our requirements and design the "least-worst."
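The guard-check idea in this exchange (record the id/offset of the last successfully processed event per entity and discard stale arrivals) can be sketched as follows. All names here are illustrative; a real implementation would back the log with a table rather than an in-memory dict.

```python
# In-memory stand-in for the "processing log" (a real system would
# persist this, e.g. in a database table, per entity).
last_processed: dict[str, int] = {}

def should_process(entity_id: str, offset: int) -> bool:
    """Return True if this event is newer than the last one
    successfully processed for the entity; otherwise discard it."""
    if offset <= last_processed.get(entity_id, -1):
        return False  # stale, e.g. event 5 arriving after event 7
    last_processed[entity_id] = offset
    return True
```

With this guard, a retried event 5 arriving after event 7 for the same entity is simply discarded, matching Ricardo's suggestion above.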
@Danieltammadge
@Danieltammadge 2 years ago
I’ve uploaded part 2 to this video where I describe an approach which keeps the order of events ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-FO2ptQNQKhM.html
@MrBillJDavis
@MrBillJDavis 1 year ago
This is great, thank you. It would be really helpful to talk about what issues might lead to a message getting retried and how that might dictate deciding on X number of retry topics.
@Danieltammadge
@Danieltammadge 7 months ago
Thanks. Here are some common reasons a message might get retried:
1. Transient failures: if an event fails due to a transient issue (e.g., a temporary network failure, or a dependent service being momentarily unavailable), retrying the event after a delay might result in successful processing. Moving the event to a retry topic allows the system to handle it separately without blocking the processing of new events.
2. Rate limiting and backpressure: external systems or APIs might enforce rate limits, and surpassing these limits can result in failed event processing. Publishing failed events to a retry topic enables you to implement backoff strategies and control the rate at which you attempt to reprocess these events.
3. Resource contention: if processing fails due to resource contention (e.g., database locks, high CPU utilization), moving events to a retry topic allows the system to alleviate immediate pressure and retry processing later, possibly under more favorable conditions.
4. Error isolation and analysis: moving failed events to a separate topic makes it easier to isolate and analyze errors without disrupting the flow of successfully processed events. This separation facilitates monitoring, debugging, and fixing issues specific to the failed events.
5. Prioritization of events: in some scenarios, certain events might be more critical than others. If an event fails but does not immediately need to be retried (due to lower priority), it can be moved to a retry topic, allowing higher-priority events to be processed without delay.
6. Maintaining event order: if the order of events is crucial, and a failed event needs to be processed before subsequent events, retrying the event while continuing to process others might violate the order. By using a retry topic, you can control the order of reprocessing to ensure that events are handled in the intended sequence.
7. Handling poison messages: some events might repeatedly fail processing due to being malformed or due to an issue that cannot be resolved immediately (poison messages). Moving these events to a separate topic prevents them from repeatedly causing failures in the main processing flow and allows for special handling or manual intervention.
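The backoff strategy mentioned for rate limiting above is often implemented as exponential backoff with a cap, which can also guide how many retry topics to create and what delay each one covers. A minimal sketch, with illustrative base and cap values not taken from the video:

```python
def backoff_delay_seconds(attempt: int, base: float = 30.0,
                          cap: float = 600.0) -> float:
    """Delay before retry number `attempt` (1-based): the base delay
    doubles with each attempt but never exceeds the cap."""
    return min(base * 2 ** (attempt - 1), cap)
```

For example, attempts 1, 2, 3 wait 30 s, 60 s, 120 s; each delay tier could map to its own retry topic so consumers of each topic wait a fixed, known interval.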
@doganaysahin9770
@doganaysahin9770 2 years ago
I agree. You can use this kind of implementation. But you should also be careful when you retry, because you can lose the order and end up with stale data. I have a question: how can you handle an exception that occurs when you try to send to the retry topic?
@Danieltammadge
@Danieltammadge 2 years ago
Thank you for taking the time to watch and comment. You are right. The approach shown here would not ensure that events are processed in the correct order. To preserve the order of changes to a particular business object, you would need to hold any events relating to an object which has an earlier event pending retry and successful processing in a holding area, and only process the later events after the initial failed event is reprocessed.
@Danieltammadge
@Danieltammadge 2 years ago
Please check out my latest video in response to your question, where I go into detail on how to keep the order of events ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-FO2ptQNQKhM.html
@sagarbhong-f5q
@sagarbhong-f5q 1 year ago
Thank you for sharing @Daniel
@kristinaribena1654
@kristinaribena1654 2 years ago
Awesome. Thanks for posting
@eduardleroux9550
@eduardleroux9550 1 year ago
Great video! Would love to get your take on using Kafka vs AWS SNS/SQS. It would be great if Kafka had a built-in retry mechanism (one that does not require additional topics), and once that fails, the message is moved to a DLQ.
@Danieltammadge
@Danieltammadge 1 year ago
Great suggestion! Eduard. I am working on a video with my take currently so stay tuned. Thank you for taking the time to watch and comment. And apologies for taking so long to reply.
@eduardleroux9550
@eduardleroux9550 1 year ago
​@@Danieltammadge No worries mate, life happens! Looking forward to it, and thanks for posting awesome content and sharing the knowledge!
@kevinding0218
@kevinding0218 1 year ago
Great video! I'm interested in the design and would like to dive in a little. We usually have different retry schedules for the 2nd/3rd topics; for example, we want the 2nd retry after 5 mins and the 3rd retry after 10 mins, but Kafka doesn't support a delay queue. How should the producer publish a 2nd/3rd retry event so it is processed after the scheduled waiting time?
@Danieltammadge
@Danieltammadge 1 year ago
Thank you for taking the time to comment. Hopefully the following will help danieltammadge.com/2023/02/delaying-apache-kafka-retry-consuming/
@Danieltammadge
@Danieltammadge 1 year ago
Try this link. It looks like it got corrupted when I copied: kafka.apache.org/0102/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html The waiting logic is in the downstream retry consumer, which consumes the retry topic. When the upstream event processor needs to retry, the processor should publish events without delay. When the retry consumer retrieves messages from the retry topic, it must check whether 5 minutes have passed since the upstream processor published the event. If 5 minutes have not passed, the consumer needs to pause the consumer group and set a timer in the service to resume the consumer group after x seconds or minutes. Regarding consumer lag, you would want a 5-minute consumer lag: lag showing the consumer is processing events quicker than 5 minutes shows the retry consumer is not waiting the designated time.
@kevinding0218
@kevinding0218 1 year ago
@@Danieltammadge Thank you so much! That makes the process much clearer!
@cuongnguyenmanh4554
@cuongnguyenmanh4554 2 years ago
Thanks for sharing 👍. I have a question about the waiting time for the retry topic: how do I configure it? Thanks
@Danieltammadge
@Danieltammadge 2 years ago
Hope you found it helpful. Let's say you have a consumer subscribing to a retry topic, and for the messages in this topic you want to wait 5 minutes after publishing before reprocessing. You can take advantage of Consumption Flow Control, which allows you to manually control the flow (kafka.apache.org/0102/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html). The steps are:
1. Consume messages.
2. Check if the message's published-at timestamp is older than the retry interval. If yes (> 5 mins), process it; if not (< 5 mins), continue to the next step.
3. Pause the consumer.
4. Wait the required time.
5. Resume the consumer (in some cases, you may need to close and restart the consumer after resuming).
Note: you cannot pause processing without pausing the consumer, or Kafka may think the client is in a faulted state and push the message to another consumer. Also, remember the next message will always have been published later, so if you are still waiting on the message at index 5, then index 6 will have a longer time to wait.
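The wait check in step 2 of this flow reduces to a small calculation: given a message's publish timestamp and the retry interval, either process it now or pause the consumer for the remaining time. A minimal sketch of that decision, with times in seconds (the 5-minute interval matches the example above; the function name is illustrative):

```python
RETRY_INTERVAL = 5 * 60  # 5 minutes, as in the example above

def seconds_until_ready(published_at: float, now: float,
                        retry_interval: float = RETRY_INTERVAL) -> float:
    """Return 0 if the message is ready to process, otherwise the
    number of seconds to pause the consumer before resuming."""
    remaining = (published_at + retry_interval) - now
    return max(0.0, remaining)
```

In the consumer loop, a result of 0 means process the record; a positive result means call `pause()` on the partitions, keep calling `poll()` to stay in the group, and `resume()` once the time has elapsed.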
@DesuTechHub
@DesuTechHub 1 month ago
5:50 is the lesson from experience
@rajapattanayak
@rajapattanayak 1 year ago
Hi Daniel, great video indeed. I have a question: how can we manage an unhandled exception? If we handle the exception, then we can send the message to the retry topic.
@Danieltammadge
@Danieltammadge 1 year ago
Hi, I'm not sure I understand your question; could you maybe rephrase…
@kristinaribena1654
@kristinaribena1654 2 years ago
Great video
@Danieltammadge
@Danieltammadge 2 years ago
Part 2 is uploaded, so after you watch this one, be sure to check it out. The link is at the end of the video.
@kristinaribena1654
@kristinaribena1654 2 years ago
Awesome
@abhishekbajpai1208
@abhishekbajpai1208 3 months ago
Good explanation.
@Danieltammadge
@Danieltammadge 2 months ago
Glad you liked it
@musicmania6214
@musicmania6214 3 years ago
Great video👏
@Danieltammadge
@Danieltammadge 3 years ago
Thanks
@jincyv7386
@jincyv7386 2 years ago
Hi, how can we handle a persistent error on the producer side with Spring Cloud Stream?
@Danieltammadge
@Danieltammadge 2 years ago
Not sure I understand your comment. But if your system is not ensuring at-least-once publishing, I would recommend you watch ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-yUmzJ7mP3Iw.html
@amseager
@amseager 2 years ago
Really wanted to implement a monolith after all of that lol
@Danieltammadge
@Danieltammadge 2 years ago
Event-driven architecture is not simple
@xinyuzhang
@xinyuzhang 8 months ago
Thank you!!!!!!
@Danieltammadge
@Danieltammadge 8 months ago
You're welcome!
@saritakumar1039
@saritakumar1039 3 years ago
Can you please share some code for retry?
@Danieltammadge
@Danieltammadge 3 years ago
I don't have code to share. But you need to look at pausing the consumer (kafka.apache.org/25/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html), then waiting until the message should be processed, and then unpausing by polling again. Searching for "Kafka consumer pausing" should get you to what you need, or at minimum provide you with the building blocks.
@chessmaster856
@chessmaster856 1 year ago
Any code, or only this? Anybody can write code but only some can talk.
@Danieltammadge
@Danieltammadge 1 year ago
ChessMaster, thank you for taking the time to comment. Quick question: is your comment a question?
@chessmaster856
@chessmaster856 1 year ago
@@Danieltammadge Yes. Can you provide some code configuration examples about how many error scenarios need to be handled in a message queue?
@manideepkumar959
@manideepkumar959 1 year ago
If hands-on examples were included it would have been better; I can't get the most out of it.
@Danieltammadge
@Danieltammadge 1 year ago
Hopefully the following will help: danieltammadge.com/2023/02/delaying-apache-kafka-retry-consuming/ Thanks for watching and taking the time to comment.
@tejashwinihampannavar8398
@tejashwinihampannavar8398 2 years ago
Thank you, Sir 🙏
@Danieltammadge
@Danieltammadge 2 years ago
Glad you found it helpful. Please ask any questions you have.
@amarnathcherukuri3076
@amarnathcherukuri3076 2 years ago
Great video
@Danieltammadge
@Danieltammadge 2 years ago
Thank you. Glad you liked it