"Workflows, a new abstraction for distributed systems" by Dominik Tornow (Strange Loop 2022)

For the past 45 years, the database systems community has enjoyed an unparalleled developer experience: Database Transactions mitigate challenges such as failure on a platform level, entirely eliminating these challenges on an applications level.
Unfortunately, the distributed systems community has not enjoyed a similar developer experience: There was no equivalent abstraction that mitigates challenges like failure on a platform level.
However, many companies, including Snap, Uber, and Netflix, are adopting a new paradigm: Workflows. Workflows are to distributed systems what transactions are to databases.
This talk explores how Workflow Systems mitigate challenges on a platform level and provide a developer experience for distributed systems that rivals the developer experience for databases, allowing you to literally code as if failure does not even exist!
Dominik Tornow
Temporal, Principal Engineer
@DominikTornow
Dominik Tornow is a Principal Engineer at Temporal. He focuses on systems modeling, specifically conceptual and formal modeling, to support the design and documentation of complex software systems.

Published: Oct 12, 2022

Comments: 16

@GrigorySapunov 1 year ago

A very insightful and useful talk! Thanks for it, Dominik!

@joshgraham4711 1 year ago

Great talk and a compelling approach for distributed systems. Also, whenever you slipped into Arnold Schwarzenegger voice, I had no choice but to agree with anything you were saying.

@joshgraham4711 1 year ago

What happens, however, if there is a failure in writing the execution log? The task completed but its result was not recorded. Or perhaps recording the log is eventually consistent and the entry hasn't been recorded by the time the platform re-iterates the main coroutine. It seems that to avoid distributed transactions, the task should record the log as part of its database transaction (i.e. the account table is updated and the log table is inserted within the same transaction). This isn't particularly transparent but could be included in data layer framework mechanics (with information from the platform like an idempotency ID).

@dominiktornow1052 1 year ago

@joshgraham4711 You are correct, that is indeed a concern: There is an uncertainty interval from the point of executing a step to the point of logging that execution. If the execution completes and logging the execution fails, the step will be retried. So steps have to be idempotent. In the paper Fault Tolerance via Idempotence, G. Ramalingam and Kapil Vaswani propose a solution along the lines of your suggestion, where the execution of a side effect and the logging of that execution happen within the same transaction: www.microsoft.com/en-us/research/wp-content/uploads/2016/02/popl38-ramalingam.pdf

@klasus2344 1 year ago

@joshgraham4711 Is this perhaps related to causal logging? You attach the log to the result in case the logging fails.

@xrrocha 1 year ago

Enlightening!

@dominiktornow1052 1 year ago

Thank you :)

@linerider195 1 year ago

Fascinating talk. I would love to have a list of the reading references he gave, and/or a textual presentation. I felt lost in some of the technical details when following along with the presentation.

@dominiktornow1052 1 year ago

Hey Pablo, I am happy you enjoyed the talk. Here is the list of references, in order of mention:
• What Color is Your Function?, Bob Nystrom, journal.stuffwithstuff.com/2015/02/01/what-color-is-your-function
• Structure and Interpretation of Computer Programs, Abelson, Sussman, web.mit.edu/6.001/6.037/sicp.pdf
• Modern Operating Systems, Tanenbaum, Bos, csc-knu.github.io/sys-prog/books/Andrew%20S.%20Tanenbaum%20-%20Modern%20Operating%20Systems.pdf
• Beyond Distributed Transactions, Pat Helland, www.ics.uci.edu/~cs223/papers/cidr07p15.pdf
• The Weekend Read, Newsletter, Dominik Tornow, www.getrevue.co/profile/dtornow
If you have questions, find me on Twitter @DominikTornow

@megamaser 1 year ago

This is not a complete solution. There is no explanation for how the runtime can be certain that a remote job has actually completed. This was the hard problem and it's still not resolved. It's just moved to a different layer of abstraction within the orchestrator. Coroutines within the orchestration language give no such guarantees of exactly once execution on a remote system. The runtime is just a layer within the orchestrator, and it still doesn't know how to distinguish between a request never being received vs the job completing but the response never being received, plus there is no way to know if the remote system is idempotent. Somehow the remote system needs to be made aware of the coroutine scope to interoperate with it. This is the interesting problem to solve and I don't see any explanation for it.

@GiveMeSomeMeshuggah 8 months ago

This is the Two Generals Problem. You can increase the statistical likelihood of a failure being detected and decrease the likelihood of false positives, but guaranteeing this detection has been proven to be unsolvable.

@karlfimm 1 year ago

Interesting talk but it seemed to imply that any failure could be resolved by retrying which is not always the case. You can retry as many times as you like but if some idiot has dropped the table (yes, I've had that happen) it's never going to work.

@richardmcdaniel4467 1 year ago

Some failures require a code deploy to fix them or some external change. Once the fix is deployed, the retry will work and you haven’t lost anything.

@MisFakapek 1 year ago

Yeah, nice ideas, but it's too complex to even be practiced in 99.5% of services and systems. Another part that I didn't enjoy is the very theoretical nature of the talk. For any such system you would have to employ an extremely disciplined way of implementing any business logic, with so many boundaries and conventions that we would effectively spend most effort on keeping the "mechanism working" rather than implementing the actual business logic. It is so hard to establish and maintain cohesion around this kind of convention at a small team level, let alone in a medium to large organization.

@warever9999 1 year ago

Isn't it better to implement such things on top of the actor model than on language-specific coroutines / logs?

@dominiktornow1052 1 year ago

The actor model is a great model for distributed systems but with a different developer experience: Traditional actors are not resumable, they are restartable. That is, after a failure an actor restarts in its initial state, not its latest state.
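
To make the "resumable, not restartable" distinction concrete, here is a minimal sketch of how a coroutine could be resumed by replaying an execution log. It is a toy model under stated assumptions, not Temporal's implementation: the names `workflow`, `STEPS`, `run`, and the `crash_after` parameter are all hypothetical, and the log is a plain in-memory list standing in for durable storage.

```python
def workflow():
    # Each yield asks the runtime to execute a step and receives its result.
    a = yield ("add", 1, 2)
    b = yield ("add", a, 10)
    return b


# Hypothetical step registry; real steps would have side effects.
STEPS = {"add": lambda x, y: x + y}


def run(workflow_fn, log, crash_after=None):
    """Drive the coroutine: replay logged results first, then execute and log new steps.

    Returns (final_value, number_of_steps_actually_executed_in_this_run).
    """
    gen = workflow_fn()
    executed = 0
    position = 0
    try:
        request = next(gen)
        while True:
            if position < len(log):
                result = log[position]  # replay: reuse the recorded result
            else:
                if crash_after is not None and executed == crash_after:
                    raise RuntimeError("simulated crash")
                name, *args = request
                result = STEPS[name](*args)
                log.append(result)  # record the result before moving on
                executed += 1
            position += 1
            request = gen.send(result)
    except StopIteration as done:
        return done.value, executed
```

If a first call such as `run(workflow, log, crash_after=1)` fails after one step, a second call `run(workflow, log)` with the same log replays the first result and executes only the remaining step, so the workflow continues from its latest state rather than its initial one. The gap between executing a step and appending to the log is exactly the uncertainty interval discussed earlier in the thread, which is why steps still need to be idempotent.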