
Discussion: Making Programming Language Parsers, etc (Q&A is in separate video). 

Jonathan Blow
85K subscribers
110K views

Published: 28 Sep 2024

Comments: 189
@ZeroZ30o · 3 years ago
Wanted to find something in the video, figured I would timestamp along the way for others:
0:00 Meta-intro
1:30 Technical problems
2:22 Casey's intro (how to get to a known point, academia's take, "how do you think about parsers?")
4:45 JBlow disclaimer (experience with parsers) and rant about academia tools (bad errors, errors are user feedback for languages), academia obsession with how many tokens you look ahead
11:00 Parsers are just another program
11:31 What is valid about academia parsing concepts (text->tokens->...)
13:00 On-screen example of how the parser "models" the text
14:23 Back to lexer (text->tokens->...), shows on-screen enum of the token types and what information tokens store
18:07 Casey sums up why having a "lexer step" is a good idea
19:50 Small discussion about what tokens store, separation between lexer and requesting token info (read tokens are buffered)
21:20 Text files are buffered once (one OS operation)
22:55 Lexer performance is negligible compared to other stages of the compiler, Casey talks about caching lexed files, lexing is not a problem that interacts with other things, C language rant
27:40 Announces there are tricks to simplify the generation of the syntax tree, writing a program
31:39 Explaining syntax decisions with the program written, decisions on the syntax can make the parser's job MUCH harder
36:35 Casey says "you don't have to do all the parsing in the step that goes after the lexer", "that step doesn't have to give a finished result, things should be done in the step where they are easier"
38:28 Question "if you had to change syntax, would you have to re-do everything" and generalization with an example
41:20 Casey's take on domain-specific languages: benefit vs drawbacks (amount written vs amount of byte code generated)
42:22 Jon's (and Casey's at the end) take on domain-specific languages: benefit vs drawbacks (expertise needed, more work per amount of time when using the language vs limitations)
47:38 Most powerful tool/paradigm...? is flow control
48:30 Casey asks about trees
49:30 Jon's example with math syntax trees, operator priority explained for binary operators
51:18 Speedrun ice tea any%
52:58 How Jon generates the tree, with examples (for binary operators)
1:04:40 You might have non-binary operators
1:06:47 Rant on a question
1:07:50 The better technique to fix up operator priority
Good luck finding anything else xd
@smiley_1000 · 3 years ago
1:26:55 Blow shits on SSA
@nyrtzi · 4 years ago
I remember a C++ guru talking at a C++ conference about the problem that, because everything in the language is so overloaded, a single fragment of code can be interpreted in so many different non-obvious ways depending on the context that it's far more difficult and more work to build tools to help programmers than it is for other languages. Which is why, according to him, C++ IDEs are a decade behind Java IDEs.
@____uncompetative · 2 years ago
Yep. You can basically give C++ line noise and it stays quiet because it means something in some interpretative context.
@carlyounger6262 · 2 years ago
Vaughan Pratt's _Top Down Operator Precedence_ parsing algorithm is an excellent way to start writing your own parsers. The original paper is quite difficult to understand, but there are articles online that explain it in Python, JavaScript _et cetera_ that make it easy to understand. Lexers are trivial.
@sigmundwong2489 · 10 months ago
The amendment to the simple RD parser Jonathan describes at around 1:12:00 to 1:15:00 is the basic principle of Pratt's TDOP parser, isn't it? AFAICT the only missing bit is differentiating nud() and led() from parse_expr(). For example, "left = 3*4" -> "we see +" -> "+ means recursively call parse_expr()" would instead be rephrased: "we already parsed something to the left, so we call the led of '+'" -> "plus_token.led(left) returns left + parse_expr(next token)". The idea being that if you parsed input to the left, the next token should act as an infix operator (so call its led); otherwise, you're at the start of an expression (so call nud). Instead of a single recursive parse_expr(), parse_expr() and nud()/led() are mutually recursive.
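
(A minimal sketch of that nud/led split, for anyone following along. This is my own illustration, not code from the stream: the token stream is hardcoded, there is no error handling, and it evaluates directly instead of building tree nodes.)

    #include <stdio.h>

    typedef struct { char kind; int value; } Token;

    /* hardcoded token stream for "3 * 4 + 5"; kind 'n' is a number literal */
    static Token toks[] = { {'n',3}, {'*',0}, {'n',4}, {'+',0}, {'n',5}, {0,0} };
    static int pos = 0;

    static Token peek(void)    { return toks[pos]; }
    static Token consume(void) { return toks[pos++]; }

    /* left binding power: how tightly an infix operator holds its left operand */
    static int lbp(char kind) {
        switch (kind) {
            case '+': return 10;
            case '*': return 20;
            default:  return 0;   /* end of input: never binds */
        }
    }

    static int parse_expr(int rbp);

    /* nud: the token starts an expression (here, only number literals) */
    static int nud(Token t) { return t.value; }

    /* led: the token follows an already-parsed left operand (here, + and *) */
    static int led(Token t, int left) {
        int right = parse_expr(lbp(t.kind));  /* recurse at this operator's power */
        return t.kind == '+' ? left + right : left * right;
    }

    static int parse_expr(int rbp) {
        int left = nud(consume());            /* start of a (sub)expression */
        while (lbp(peek().kind) > rbp)        /* next operator binds tighter? */
            left = led(consume(), left);
        return left;
    }

    int main(void) {
        printf("%d\n", parse_expr(0));        /* 3*4+5 -> 17 */
        return 0;
    }

Exactly as described above: nud handles "start of expression", led handles "we already have a left", and the two are mutually recursive through parse_expr().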
@dietermuller4345 · 4 years ago
Who would also want Jon to do like a "how to write a compiler" series?
@bitw1se · 3 years ago
me
@Dominik-K · 2 years ago
Me
@jackbenson5314 · 2 years ago
Pleeeaaase
@japroz · 2 years ago
me
@____uncompetative · 2 years ago
I doubt he knows how
@blenderpanzi · 3 years ago
A recursive parser that creates the correct tree without that complicated fixing is really easy to write. I think this part of the video is making it look more difficult than it is. Just divide up the parser functions so that one function parses things of the same operator precedence in a loop, and call the parser function of the things that have higher precedence from the ones that have lower precedence, and you're done. Very simple excerpt in pseudocode:

    parse_add_sub() -> AstNode {
        let node = parse_mul_div();
        while (peek_token() == '-' || peek_token() == '+') {
            let token = consume_token();
            let right = parse_mul_div();
            node = BinaryOp { operation: token, left: node, right: right };
        }
        return node;
    }

parse_mul_div() will look similar, except it'll call the parse function of the next higher precedence syntax element, until you reach parse_atom_or_parenthesis(). Inside of the parenthesis you'll call the top-level parse_expression() again. Doing that you get the correct tree. Btw. you don't need to emit a parenthesis node. Just return the expression node you get inside of the parenthesis. :D

And yeah, I'm not a fan of yacc & Co. either. These tools take me longer to understand than writing it all by hand. And doing so I don't have any of the restrictions of these tools! What I described above is how we learned to write recursive parsers at university, btw. Well, in one exercise. In the next we had to use yacc, bison, and that optimizer framework, was it burg or something? On one hand it gave you a pattern matching language for optimizations, which is nice, but on the other it was again weirdly restrictive.
@LoganKearsley · 4 years ago
The OCaml compiler does a really neat thing for stack traces; they trace out the call graph and pre-compute every possible traversal (with some special-casing for handling cycles) and assign each one a hash value, so stack traces can be passed around as simple integers.
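
(I don't know OCaml's actual scheme, so this is only a sketch of the flavor of the idea: if every feasible call path is enumerated and hashed ahead of time, a runtime stack trace can be passed around as one integer. The function IDs and the FNV-1a hash here are my own choices.)

    #include <stdint.h>
    #include <stdio.h>

    /* hash a call path (a sequence of function IDs) down to a single integer */
    static uint32_t hash_path(const uint32_t *callee_ids, int n) {
        uint32_t h = 2166136261u;          /* FNV-1a offset basis */
        for (int i = 0; i < n; i++) {
            h ^= callee_ids[i];
            h *= 16777619u;                /* FNV prime */
        }
        return h;
    }

    int main(void) {
        /* hypothetical path main -> parse -> lex, precomputed at compile time */
        uint32_t path[] = { 1, 7, 42 };
        printf("trace id: %u\n", hash_path(path, 3));
        return 0;
    }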
@fzy81 · 4 years ago
That's so neat! How does it work with recursive calls?
@LoganKearsley · 4 years ago
@@fzy81 I am not certain; the presentation I watched on it didn't go into great detail about that. I assume that they just treat loops as if they were only traversed once, and make some kind of note in the cached stack trace that there may have been an arbitrary number of cycles around each relevant section.
@gabrielcoronelcascante9111 · 3 years ago
Yeah but the OCaml compiler does not give any useful error message, that is why I prefer F#
@rikamayhem · 3 years ago
@@gabrielcoronelcascante9111 You mean for stack traces? Run with the env var OCAMLRUNPARAM=b and you'll get the full stack trace with file location information (unless the exception was raised with raise_notrace, but not catching those should be considered a bug).
@____uncompetative · 2 years ago
@@rikamayhem Rust has nice error messages. Shame the language design is an abortion on fire.
@ricardopieper11 · 1 year ago
This video was what pushed me to try and make a toy programming language interpreter, and now a compiler, turning the previous one into a statically-typed language. The "write regular, every day code" advice is really helpful. Of course what I did is not production ready in the slightest. Now, rewatching it, I see how I also bamboozled myself by not adding proper error reporting early on, which is now fixed. Also, I learned LLVM is annoying, but it's not some otherworldly tool, especially since I have a lower-level representation of the code that is somewhat close to what LLVM or Cranelift accept as input.
@iestynne · 4 years ago
Love the approach of building into the language the totality of program binary generation (compilation, optimization, linking) and key debugging primitives (stack trace, data breakpoints), so that these development essentials are well-defined, platform-independent and free of other irksome dependencies. The IDE should just be a UI, not infiltrating its tendrils disruptively into the development process! Additional debugging tools built as JAI libraries would also be preferable to anything external.
@Gismo359 · 4 years ago
Here are a few examples to illustrate the ambiguity in C++'s grammar:

1. "A < B > C;" can be parsed as "the expression ((A less than B) greater than C), which throws out the return value" _or_ "declaring a variable named C whose type is a template A with argument B". Both can be valid code but depend on the semantic meaning of the identifiers A, B and C.

2. "A B(C);" can be parsed as "declaring a variable B of type A where C is an argument to A's constructor" _or_ "declaring a function B whose return type is A and which takes a single unnamed argument of type C". Both can be valid code but depend on the semantic meaning of the identifiers A, B and C. (This was named the "Most Vexing Parse".)

The problem this introduces is that grammatical analysis and semantic analysis become mutually dependent: you can't parse something without knowing what it means, but you have to parse something in order to know its meaning. That forces you to do both at the same time, which is very tedious to do fast compared to grammars which allow independent semantics and grammar. This is the reason you sometimes need to add a hint to the compiler so it knows what you are trying to say in heavily templated code (adding typename or template before an expression).
@Gismo359 · 4 years ago
​@UCzsytK0iZ6JAqz-bk2zvpKQ The grammar is clearly ambiguous: that single snippet can have multiple different meanings depending on the values of A/B/C when placed in a function body. How you parse it depends on how you have previously parsed something completely unrelated (which MUST define A/B/C). If A is undefined, you have no idea how to parse the rest, because you have multiple equally valid (from the compiler's perspective) routes to take. So the compiler must take a guess and hope that guess leads to better error reporting overall. Jon's language, from what I have seen, is not ambiguous in this way: defining the equivalent of a template would be "C: A(B);" and a free expression would be "A < B > C;" (if that is even allowed). That way, if A is undefined, the compiler still knows _with absolute certainty_ which one of the routes should be taken, giving it a chance to emit more relevant error messages (e.g. it _knows_ that you wanted to declare a variable named C, but it doesn't know its type). In fact, the snippet "A < B > C = D" is equally ambiguous.
@metalim · 4 years ago
Hey, Jon! Could you put names of persons you chat with into video description? In this case I found out it's Casey Muratori.
@sandwich_technology · 2 years ago
"The problem is the problem" Hearing things like this brings a smile to my face.
@uncoherentramblings2826 · 4 years ago
I hope you continue this series.
@BillyLemonZest · 2 years ago
I have *never* heard the words: "Oh you can do that probably by sabotaging your hash function". I would never have considered such a thing in my life maybe. Amazing conversation.
@yyny0 · 4 years ago
1:00:00 The algorithm I've been using recently for parsing expressions is "holes". By carefully choosing a "holefor" function that gives the right place to put an expression for a given operator/precedence level, we can avoid having to recursively call a function or allocate a stack at runtime. By using the `NULL` value available in many programming languages, we can save all the relevant information for parsing within the AST itself. (You would probably want a stack anyway for performance reasons, but even so, it saves having to take multiple passes through the AST to correct mistakes, which can take a lot of time for complicated expressions.)

5 + 6 * 7 would be parsed as:

    struct Literal { int value; };
    struct Add { Expr *left; Expr *right; };
    struct Mul { Expr *left; Expr *right; };

    Expr *mkOp(int optype, Expr *lhs, Expr *rhs);
    Expr **holefor(Expr **e, int optype); // Where the magic happens: basically just compares precedence to find the right hole

    Expr *e = NULL; // The expression we will parse
    Expr **h = &e;  // The hole

    // Found Literal 5.
    h = holefor(&e, OP_LITERAL); // Returns &e
    assert(*h == NULL);
    *h = mkLiteral(5);
    // e is now: Literal(5)

    // Found Add.
    h = holefor(&e, OP_ADD); // Returns &e
    *h = mkOp(OP_ADD, *h, NULL);
    // e is now: Add(Literal(5), NULL)

    // Found Literal 6.
    h = holefor(&e, OP_LITERAL); // Returns &e->right
    assert(*h == NULL);
    *h = mkLiteral(6);
    // e is now: Add(Literal(5), Literal(6))

    // Found Multiply.
    h = holefor(&e, OP_MUL); // Returns &e->right
    *h = mkOp(OP_MUL, *h, NULL);
    // e is now: Add(Literal(5), Mul(Literal(6), NULL))
    // Notice how the multiply "stole" the literal 6. More specifically, the
    // "hole" for the current operator precedence was defined to include the
    // literal 6.

    // Found Literal 7.
    h = holefor(&e, OP_LITERAL); // Returns &e->right->right (the `NULL` inside the Mul)
    assert(*h == NULL);
    *h = mkLiteral(7);
    // e is now: Add(Literal(5), Mul(Literal(6), Literal(7)))
@BrandNewByxor · 4 years ago
"Regular everyday code" is a powerful way of reframing writing code to solve seemingly difficult problems
@SimGunther · 4 years ago
Expression cluster parsing seems to be the right track here (see that 2015 paper published by Laurent and Mens for reference) from the algorithm described at 1:14:00 (minimum precedence was what threw me for a loop), but I don't know if that's what it is since I can't read their mind.
@0xCAFEF00D · 4 years ago
01:15:40 I can find sites describing the algorithm that were made recently like this one: eli.thegreenplace.net/2012/08/02/parsing-expressions-by-precedence-climbing which has many good links in it. But Wikipedia says it first came about in 1979 en.wikipedia.org/wiki/Operator-precedence_parser#cite_ref-Richards1979_3-0
@nyrtzi · 4 years ago
ACM's Digital Library has Vaughan R. Pratt's original paper "Top down operator precedence" from October 1973. They've made all of the ACM DL free access until June 30, 2020 due to Corona-chan. Douglas Crockford wrote a JavaScript implementation of it in 2007 and posted it on his homepage. Crockford should also have a video about it on YouTube somewhere.
@SimGunther · 4 years ago
@@nyrtzi there's a website mirroring the findings of pratt's paper tdop.github.io
@shohidulhaque759 · 2 years ago
I really enjoyed this video. The conversation style makes it easier to understand things.
@piotrarturklos · 2 years ago
This guy is an absolute master of content marketing.
@kennybentley1161 · 4 years ago
I liked Casey's comment about the parser generating more and more specific things each phase; it made a lot of sense to me. I've written a few parsers in my life, but I thought it was silly to make the tokeniser stage. I just made functions to deal with the ASCII more directly (essentially reading the ASCII in token increments). I now see the light :) He made a new picture in my mind of a really abstract language which was very contextual, meaning that the parser may have more than like 4-5 phases to it. Since parsing is so cheap, many combinations of really abstract concepts could generate a wide variety and increase the bang for the grammar buck. It kind of makes me want to explore that idea in my next DSL.
@casperes0912 · 3 years ago
I used Menhir and OCamlLex to make a programming language. I felt like I had full freedom over the error message output honestly
@gettem6341 · 2 years ago
The thing I got out of this is that the modulus operator is the key to all programming
@Schcarraffone · 1 year ago
The best programmers like John Carmack or Linus Torvalds or Stallman are able to write entire 3d games or operating systems using only the modulus operator... but they are very good admittedly...
@c2ashman · 4 years ago
Amazing content. Keep 'em coming. Love your work.
@sortof3337 · 4 years ago
Hello, I am also making a programming language for my university project. It is a DSL focused on teaching programming to kids in their own language, rather than being general purpose. I am very interested in hearing your opinion on the Lex, Bison and LLVM workflow.
@jouniosmala9921 · 4 years ago
I assume that the precedence question was about a specific implementation method that is a fast way to parse expressions, but cannot be used as a general parser for everything. The method's name is "operator-precedence parser", so it is quite easy to ask about it in a way that misleads your interpretation of the question. It is used in the LLVM tutorial's parser to handle expressions, if people want an actual source code example of how it is used. en.wikipedia.org/wiki/Operator-precedence_parser
@iWouldWantSky · 4 years ago
Afraid to watch this given I lost a month of my life to a context-free grammar.
@joshuadonahue5871 · 4 years ago
That's exactly why you need to watch this one in particular :)
@potatoxel7800 · 4 years ago
someone make a language with problem being a threading word xD
@blenderpanzi · 3 years ago
You could mmap() the file instead of reading it all at once, if you reeeeeally want to ensure you have enough memory for huge files.
@blenderpanzi · 3 years ago
@@shaurz Exactly!
@godDIEmanLIVE · 4 years ago
Your videos are extremely interesting, thanks for making this publicly available. Software engineering is awesome.
@MrRed400 · 4 years ago
I believe the second parser described is called a pratt parser
@jonahbranch5625 · 3 years ago
Pratt parser is from the 70s. The one in this vid is supposedly recent
@MadPumpkinGames · 4 years ago
Casey's audio is better here, I think it was mostly just some echo or gain last time.
@antoniocs8873 · 4 years ago
01:16:18 - Does anyone know what the paper is?
@magikworx3748 · 3 years ago
Looks like this algo: en.wikipedia.org/wiki/Operator-precedence_parser#cite_note-Richards1979-2
@fourscoreand9884 · 4 years ago
Great stuff.
@The-cyber-imbiber · 4 years ago
Why not mention Casey in the title?
@nivimani · 4 years ago
@Jonathan Blow you probably don't read these, but I believe that saying "this type of question represents something that's wrong with the internet" is wrong, since this behaviour is very common in e.g. college lectures, where students ask annoying lazy questions because they weren't focused or whatever. So this type of question is actually related to something that's wrong with all people (or at least lazy/impulsive people).
@m4rt_ · 2 years ago
oh Jonathan, he looks healthy ... oh 1:32:42 ...
@sonicfreak94 · 4 years ago
2:43:39 I get the point here, but Visual Studio isn't that bad these days. You can absolutely link across visual studio versions. Granted, this was broken (by design) until 2015, but VC++ 2015 through 2019 are all cross-compatible, no questions asked. I use this extensively. Regarding project upgrades: the Visual Studio project format has been largely the same for years upon years. What you guys are describing with upgrading is exclusively compiler targets and Windows SDK versions, which are naturally stored in the project settings. These upgrades are optional if you have the correct target compiler/windows sdk installed (i.e. vs2017 build tools using vs2019), just like any other build system. The format remains the same, maybe with some new GUIDs because I guess Microsoft likes those.
@SF-nh9nc · 2 years ago
Memo: 1:47:00
@anupkodlekere8604 · 4 years ago
Where can I find the source code to this?
@leonardosnts · 4 years ago
It isn't public yet.
@noodle-eater · 4 years ago
Wow cool, I always want to make programming language but do not understand c++
@AlexCouch65 · 4 years ago
You don't need to use C++. You can use C or really any other language, but if you're serious about performance, capabilities, and maintainability, I'd recommend using one of:
C
C++
Kotlin (I'd recommend Kotlin/Native, which is what my "production compiler" is done in)
Go (I guess)
Rust
And there is a really great resource for creating languages here: craftinginterpreters.com/ I am also planning a youtube series very similar to that resource, where we design and implement a programming language, starting with the use of tools like ANTLR and creating a VM, but then moving on to creating a compiler using LLVM (for which I am working on a Kotlin internal DSL called kotlinx-llvm: github.com/AlexCouch/kotlinx-llvm). I really do encourage you to try whatever it takes to practice and understand language design and development. strumenta.com has some great resources, as does tomassetti.me/. I would also recommend looking into open sourced compilers like Zig, Kotlin, Rust, etc.
@noodle-eater · 4 years ago
​@@AlexCouch65 Wow that's cool, thank you. I followed this tutorial before but haven't finished it yet: ruslanspivak.com/lsbasi-part1/ So my knowledge of language development is still not much, but I've tried to make a scripting language in a game jam using my own logic and it's fun XD. It seems I should give more effort and time to learning this. Thank you for the resources :D that was awesome. Looking forward to your youtube series.
@noodle-eater · 4 years ago
I opened the links and only just learned that we can do that kind of thing when we can create a programming language. At first, I didn't think there would be anyone interested in developing a programming language commercially or for something great; so far I only make them for fun.
@AlexCouch65 · 4 years ago
@@noodle-eater If you wanna follow my compilers, I have an experimental compiler inside the link to kotlinx-llvm as a submodule called toylang, and here is a link to my primary compiler, which is being planned out in more detail, called Beagle: github.com/AlexCouch/beagle-lang Unfinished specifications: github.com/AlexCouch/beagle-lang-specifications
@noodle-eater · 4 years ago
@@AlexCouch65 Wow thanks that's cool, I will take a look :)
@godDIEmanLIVE · 4 years ago
Jonathan seriously looks like the brother of the canadian arm wrestler Devon Larratt :D
@DeGuerre · 4 years ago
This is an excellent talk, and there's very little in here that I disagree with. Nonetheless, I have done a lot of compiler writing (I worked in semantic analysis and optimisation), both commercially and in academia. So naturally I have comments.

Jon is 100% correct that a lot of academic compiler education isn't necessarily practical for a modern programmer who just wants to write a compiler. An undergraduate science course, no matter what the field of science, is at least partly a history course by design. On that point, Jon is also 100% correct that a lot of published compiler advice is for environments and machines that no longer exist. The edition of the Dragon Book that I used as an undergraduate in the early 90s doesn't mention SSA form, bottom-up code generation, graph colouring register allocation, superscalar instruction scheduling, or cache optimisation. But it does talk about how to handle Fortran's COMMON blocks.

Back then, you had to interleave lexical analysis and file reading. You typically buffered a line at a time, and had special code to ensure that peeking over the end of a line was handled correctly. Unless you're doing a REPL command line, today you can just read or memory map whole files.

But that was also partly to fix the "make_sandwich" problem. One thorny issue in any programming language is distinguishing between type names and variable names. Languages like C and C++ require identifiers to be defined before they are used, so this information can be fed back into the lexical analyser to make the grammar unambiguous. Some languages (e.g. Prolog, Haskell; see also sigil languages like BASIC and Perl) distinguish types of symbols using lexical syntax, for essentially the same reason. The main alternatives are to design the grammar carefully so that type names and variable names can never be confused (e.g. Wirth languages), or to avoid the problem altogether by not having a type language in the first place (most scripting languages).

Incidentally, the reason why we use the word "analyser" is that in the early days, people thought of compilation as essentially two phases: "analysis" followed by "synthesis". We still tend to use this model, though not for the compiler as a whole. An optimisation pass, for example, typically proves that some precondition is true, and then applies transformations which are only valid if those preconditions are true. Analysis, followed by synthesis.

There is an extremely important reason why we split parsing into lexical analysis followed by syntax analysis that is worth mentioning in detail: lexical analysis lets us hide the formal ambiguity. If you see, for example, "ifwhile" in C, that is an identifier, not the keyword "if" followed by "while". There are two kinds of ambiguity resolution: the "maximal munch rule" states that the next token is the longest sequence of characters that looks like a token, and a "disambiguation rule" states that in the case of a tie, some token takes precedence; for example, an identifier is a sequence of characters which looks like [something], unless it's a keyword.

There are good reasons why we do all that formal language theory in academia. We do want to understand what notation means, and the deep links between languages and automata go to the heart of what we mean by "computation". But more importantly for compiler writers, we need to be able to define a programming language without referring to a specific implementation. We as an industry started out doing it the other way, and we must never go back.
Academic parsing education has never shunned recursive descent, or at least not in my experience. It's not covered in detail in academic courses because there's not much to say. Recursive ascent, on the other hand, is almost never mentioned. Recursive descent is the recursive version of LL parsing, whereas recursive ascent is the recursive version of LR parsing. The operator parsing presented around the 1:08:30 mark, by the way, is recursive ascent applied to operator precedence grammars. It's all been there since the early 80s; just nobody bothered to write it up.

On trees as an intermediate form: it's true that we generally don't use trees for optimisation of imperative languages any more. It made sense in the 70s when memory was tight, but not today. The main advantage of using abstract syntax trees at the front end is that it matches the structure of the source code. So this is the ideal form for generating error messages or warnings that the programmer is going to see (e.g. type errors). Once you start transforming/optimising, the structure is lost, so messages/diagnostics are inevitably going to be less comprehensible to the programmer. This point was underlined at the end of the stream.

Because many modern compilers are expected to be retargetable, compiler writers who work in that space tend to think in terms of three "levels":
- High-level optimisations are source language-specific, and generally done on the AST.
- Low-level optimisations are target language-specific, and generally done on the assembly language of the target machine.
- Intermediate-level optimisations are independent of both, and generally done on the IR (e.g. SSA form).
Common subexpression elimination is generally considered "intermediate", so it wouldn't be done on the AST these days. An example of a high-level optimisation in C++ is copy elision; if the standard didn't say that copy elision was kosher, you couldn't do it.

These intermediate-level optimisations are really the place where SSA form gives you a clear advantage compared to pretty much all other representations that came before it. It's overhyped, to be sure, but it's still the best we have for this specific job. If your compiler has a single source language and only one or two related target languages, it may not make sense to think in terms of an "intermediate level", so it may not give you an advantage. And, of course, compilers for non-imperative languages would generally use a different IR that is better suited, like lambda calculus for Haskell.

To see what I mean by "the best we have", we can compare with def-use chains. SSA form gives you slightly more information, because it also gives you the paths along which the definitions reach. This is useful information if you want to do optimisations like partial redundancy elimination.
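
(A minimal sketch of the maximal munch rule plus keyword disambiguation described above; a hypothetical lexer that knows only identifiers and the keywords "if" and "while".)

    #include <ctype.h>
    #include <stdio.h>
    #include <string.h>

    typedef enum { TOK_IDENT, TOK_IF, TOK_WHILE } TokKind;

    /* munch the longest run of word characters, THEN decide keyword-ness */
    static TokKind lex_word(const char **src, char *buf, size_t bufsz) {
        size_t n = 0;
        while (isalnum((unsigned char)**src) || **src == '_') { /* maximal munch */
            if (n + 1 < bufsz) buf[n++] = **src;
            (*src)++;
        }
        buf[n] = '\0';
        if (strcmp(buf, "if") == 0)    return TOK_IF;    /* disambiguation rule: */
        if (strcmp(buf, "while") == 0) return TOK_WHILE; /* keywords win the tie */
        return TOK_IDENT;
    }

    int main(void) {
        const char *src = "ifwhile";
        char buf[64];
        TokKind k = lex_word(&src, buf, sizeof buf);
        /* "ifwhile" munches as one identifier, not "if" followed by "while" */
        printf("%s -> %d\n", buf, (int)k);   /* prints: ifwhile -> 0 */
        return 0;
    }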
@mattewlefty991 · 4 years ago
Pseudonym73 Hi, as a mathematics student interested in parsing, what resource/path would you recommend? I want to make parsers for inserting data in a friendly way (like matrices {{1,2,3},{4,5,6}}) and for evaluating arithmetic expressions. Thank you
@sortof3337 · 4 years ago
Hello, what would you suggest when I have multiple keywords in different languages for the same identifier? Would it be easier to have preprocessor directives defined in a syntax header, or to implement dynamic parsing in the compiler? E.g. write and écrire for the same function.
@lzh97 · 4 years ago
I have some ideas about IR. I think graph-based IR is worth a look, and the most well-known graph-based IR is the "sea of nodes" SSA proposed by Cliff Click, which is actually very similar to the PDG. Sea-of-nodes SSA is widely used, for example in the Graal compiler (in Oracle's JVM) and the TurboFan compiler (in Google's V8 JavaScript engine).

The magic starts here: there are articles proving the commonality of SSA and lambda (CPS and ANF), and pointing out which programming language features are suitable for different IRs to handle (e.g. SSA fits loop optimizations, CPS fits function inlining and partial evaluation) [4]. There has been further development in understanding the commonality of machine-independent optimization algorithms. Some compilers use CPS to optimize high-level programs, and then use SSA for medium-level machine-independent optimization. For example, Apple's JavaScriptCore: they convert JS programs into AST-form CPS, and when the call count of a method reaches a threshold, the CPS is converted to graph-based SSA. This adds some complexity to the implementation.

It has previously been proved that SSA is a special form of CPS, and that algorithms for SSA can be converted to CPS ones. The problem is that algorithms for CPS don't always convert to SSA ones, and loop optimization is verbose for CPS; SSA is better at handling loops. So many compilers still use CPS and SSA separately to handle different features of a program. In 2015, the difference between SSA and CPS became smaller: a paper proposed a graph-based CPS IR [5]. I found it very similar to the sea-of-nodes SSA, and algorithms for SSA do not require much adaptation; it's completely like an extended version of the sea-of-nodes SSA. This means that one IR data structure can benefit both CPS and SSA.

Because the most important property of compiler infrastructure is stability, and compiler infrastructure development requires a lot of capital, I don't think there will be any industrial compilers using this idea within 15 years, but I still expect to see it.

[1] C. Click, M. Paleczny. A Simple Graph-Based Intermediate Representation. In ACM SIGPLAN Workshop on Intermediate Representations, pages 35-49, Jan. 1995.
[2] Richard Kelsey. A correspondence between continuation passing style and static single assignment form. In Intermediate Representations Workshop, pages 13-23, 1995.
[3] M. M. T. Chakravarty, G. Keller, and P. Zadarnowski. A functional perspective on SSA optimisation algorithms. Electr. Notes Theor. Comput. Sci., 82(2):347-361, 2003.
[4] C. Flanagan, A. Sabry, B. F. Duba, and M. Felleisen. The essence of compiling with continuations. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, 1993.
[5] Roland Leißa, Marcel Köster, and Sebastian Hack. A Graph-Based Higher-Order Intermediate Representation. Proceedings of the 2015 International Symposium on Code Generation and Optimization (CGO), pp. 202-212, San Francisco, CA, USA, February 7-11, 2015. (2nd prize: Artifact Evaluation for CGO/PPoPP'15)
@JannisAdmek · 3 years ago
Man, that was interesting to read! You don't happen to have a blog or a twitter?
@timeyyydaman · 4 months ago
Very interesting
@kiwec · 4 years ago
Totally unrelated to the video, but the camera and the lighting is amazing.
@999a0s · 4 years ago
compilers are fascinating but the real takeaway is how this man said "let me take 10 seconds to grab an iced tea" and resumed the conversation exactly 10 seconds later. jblow's mind running a RTOS confirmed
@in0lasc0 · 4 years ago
Off topic, but I'm amazed that Jonathan said he'd be back in 10 seconds and later in 2 minutes, and he actually came back after 10 seconds and 2 minutes respectively.
@DonaldDuvall · 4 years ago
Some people are really good with time/spatial awareness. Not Me...
@dannywithnuggets · 4 years ago
@@DonaldDuvall No, it's just autismo powers.
@nexusclarum8000 · 4 years ago
I'd like a language like C, but with namespaces and good meta-programming facilities (not templates) and no OOP.
@siddharthupadhyay6347 · 3 years ago
You defined something similar to Rust.
@nexusclarum8000 · 3 years ago
@@siddharthupadhyay6347 Yeah, but Rust lacks good taste. That's just my opinion.
@siddharthupadhyay6347 · 3 years ago
@@nexusclarum8000 By taste you mean ?
@moizifty · 3 years ago
@@siddharthupadhyay6347 Not true. I don't know why you gave an example of a kernel, when something like Linux is written in C, which isn't bloated at all. You don't need a bloated language to make complicated software; that's foolish thinking.
@siddharthupadhyay6347 · 3 years ago
@@moizifty I am sorry if I wasn't clear. By "bloated" I mean full of features, with a standard library complete with features for each and every situation. There are a lot of Rust features that can feel overwhelming, especially std::cell, std::UnsafeCell and other atomic operations. To a normal programmer this may feel excessive and "bloated", but these are the features which come in most handy when making concurrent databases. This is what I meant by "bloated" in my statement.
@keroppl · 4 years ago
Spot on commentary about lex/yacc and flex/bison. When first starting out, I naively believed those tools would save me time, but they just led me down a path of learning those tools and hating my resulting code structure. I later rewrote everything by hand (recursive descent) and it took less time, resulting in smaller, cleaner code. Added bonus: I knew everything my code was doing, end to end.
@markotikvic · 4 years ago
It's basically a parallel to the GC problem. It's trying to do memory cleanup in a generalized way for ALL programs written in that particular language, when in fact every program has an optimal way of managing memory, and the more complex your code base becomes, the less chance there is that the GC is doing the optimal thing. It's the same with generalized parsers: you're almost always going to be better off writing your own parser instead of fighting some generalized tool to do what you want.
@keroppl · 4 years ago
Lonesomecoder If you want to leverage an existing tool, I recommend LLVM IRBuilder. Take advantage of their optimizers and backend generators. That’s where the hard stuff is, not parsing.
@whiitehead · 4 years ago
You should put Casey in the title. I was surprised to hear him.
@buttonasas · 4 years ago
It almost seems like Jon calls him Casey "Discussion" Muratori considering the titles of these videos :D
@oblivion_2852 · 4 years ago
I absolutely love the idea of compilation units, and how the abstraction of header and body is used so that definitions can be out of order, but the declaration (name, arg types, return types) is a separate compilation unit from the body. Such a beautiful idea.
@saniel2748 · 2 years ago
In 2022 Unity finally realized the things Jon says here about ECS and started moving towards a hybrid approach, where you use ECS when it is reasonable to do so. Who would've thought.
@AlexeyShamrin · 4 years ago
1:15:10 There is a mention of a paper with a novel approach to precedence handling. Where can I find the paper?
@DudeOfSlam · 4 years ago
I do not know the paper they talk about but it sounds similar to this: eli.thegreenplace.net/2010/01/02/top-down-operator-precedence-parsing
@nyrtzi · 4 years ago
@@DudeOfSlam The original paper from Pratt from 1973 is available for now (during Corona) from ACM's Digital Library: dl.acm.org/doi/abs/10.1145/512927.512931 Douglas Crockford's presentation might be easier to read though: crockford.com/javascript/tdop/tdop.html
@julesjacobs1 · 4 years ago
SSA is just a way to extend expression trees / DAGs to conditionals and loops. It's a good idea because most of the time when you do optimisation in a compiler you do (1) an analysis that analyses the possible values of an expression, and (2) local rewrites based on this information. Both (1) and (2) are almost as easy on SSA as on expression trees: you analyse the value of f(x,y) based on the values of x,y, and you do local rewrites by walking bottom up over the tree and applying rewrites on the way down. With SSA this simple algorithm for expression trees can be extended to conditionals and loops by (a) deciding what to do with phi nodes and (b) iterating the analysis/local rewrites until it stabilises.

Doing these things directly on an imperative IR with mutable variables is a huge pain. For example, doing common subexpression elimination on imperative IR is very complicated. On expression trees you can turn them into a DAG by the hashing algorithm that you described. Pretty much the same hashing algorithm works on SSA. There is no complicated invalidation of the hash table or anything like what you'd need for imperative IR.

To pull loop-invariant code out of a loop, you just walk backwards over the SSA expression and see if it contains a phi for the loop. If not, it's loop invariant and you can move it out by copying it out of the loop verbatim. With imperative IR it's more difficult to detect loop-invariant code, because you need some kind of fixed-point computation, and the transformation is more complicated too: you can't just copy the expression outside of the loop, because the state of the variables might be different there.
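
(A tiny before/after of my own, not from the video, showing why the hashing trick carries over to SSA:)

    mutable IR:            SSA:
      x = a + b              x1 = a + b
      y = a + b              y1 = a + b     <- hash key (ADD, a, b) hits: reuse x1
      x = x * 2              x2 = x1 * 2
      y = x + 1              y2 = x2 + 1

In SSA every name is defined exactly once, so the key (ADD, a, b) can be hashed and reused with no further checks; in the mutable IR, the second "a + b" is only a common subexpression if you can prove no store to a or b happens in between.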
@willw2596 · 10 months ago
For the '+' and '*' operator precedence problem, why not use the correct grammar? That would be the simplest solution, done with the natural recursions of the recursive descent parser. No need to do the post fix-up on the tree or to add separate precedence-level definitions for all the operators.

Expr -> Expr '+' Term | Term
Term -> Term '*' Factor | Factor
Factor -> '(' Expr ')' | number
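
(A minimal sketch, mine rather than the video's, of how that grammar is conventionally turned into a recursive-descent parser. One wrinkle: the left-recursive rules can't be transcribed literally, since parse_expr would call itself forever on the same input, so each left-recursive production becomes a loop; this is the same loop-per-precedence-level shape blenderpanzi sketched further up the thread. Single digits stand in for numbers, and it evaluates rather than building a tree.)

    #include <stdio.h>

    static const char *p = "(1+2)*3+4";    /* input cursor */

    static int parse_expr(void);

    static int parse_factor(void) {        /* Factor -> '(' Expr ')' | number */
        if (*p == '(') {
            p++;                           /* consume '(' */
            int v = parse_expr();
            p++;                           /* consume ')' (no error checking) */
            return v;
        }
        return *p++ - '0';                 /* single-digit number */
    }

    static int parse_term(void) {          /* Term -> Factor { '*' Factor } */
        int left = parse_factor();
        while (*p == '*') { p++; left *= parse_factor(); }
        return left;
    }

    static int parse_expr(void) {          /* Expr -> Term { '+' Term } */
        int left = parse_term();
        while (*p == '+') { p++; left += parse_term(); }
        return left;
    }

    int main(void) {
        printf("%d\n", parse_expr());      /* (1+2)*3+4 -> 13 */
        return 0;
    }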
@m4rt_ · 1 year ago
21:50 A 16 GB source file would be 16,000,000,000 chars (16 billion). So if on average you have 50 chars per line, that would be 320,000,000 lines (320 million). If you are writing a source file that size, you have other problems than the compiler having to load that 16 GB file.
@danjames1922 · 10 months ago
That problem at 56:30 takes me back to solving the exact same problem, pre-YouTube, during my CS/Software Engineering degree. Writing a parser/compiler/assembler by hand was among the most enjoyable/challenging pieces of programming I've done. Game development is up there as well. Then I fell into an ERP systems career that primarily involved shuffling data around for 20 years. Sucked the life right out of programming for me.
@nephew_tom · 8 months ago
Programming life is still going... and you can make it not suck. Cheers!
@oblivion_2852 · 4 years ago
"There a small set of predefined things that have rigid meaning usually, except static in c++ like what the hell does that mean." I relate to this so much. Many of the keywords I've found in c++ have very odd definitions and take a bit to wrap my head around what it could mean. Also things like extern and defining instances or names to static classes just boggle me. I get that it's a restriction of how c++ is processed mostly in order of its tree definition (if you follow each include etc). But overall the way that everything links, having to declare a name aswell as later declaring the function associated to that name to stop dependency loops and all that is just a mess. I love how fast c++ is but there's so many odd things in it.
@Spiderboydk · 4 years ago
What makes this worse is that the C++ committee strongly favors overloading existing keywords with new meaning in new contexts over introducing new ones. I think the rationale is that introducing new keywords risks breaking existing code. The additional mental burden of reasoning about these keywords is IMO an expensive cost for this backward compatibility, which only avoids a relatively simple search-and-replace operation.
@oblivion_2852 · 4 years ago
@@Spiderboydk I've done search and replace file by file. But are there any tools to select a whole directory and subdirectories and recursively search and replace across all files? (maybe even excluding comments)
@pleggli · 4 years ago
@@oblivion_2852 Some languages allow code to overwrite keywords, which can also help keep things backwards compatible, since old code won't use new keywords; but it doesn't work in all languages and all situations, and maybe other features would be better. I believe Go is maybe gearing up towards the solution that, since the language version is specified in the module file (for a project/repository/library), if there is a breaking change the compiler could potentially choose which features of the language are enabled for each individual package in a build. It is interesting, but it also might complicate the compiler's work a lot if it needs to support a range of different versions of the language for all time.
@viciouswaffle · 4 years ago
@@oblivion_2852 If you are using GNU/Linux then you can use 'sed' for this. You can run this command "sed -i -- 's/foo/bar/g' *", where s is search, 'foo' is what is searched for, 'bar' is what to replace it with, and 'g' means replace all occurrences. '*' is the files; here it's all files in the working folder, and you can replace it with individual file names. Hope that helps. You can probably also set up sed on Windows with cygwin or mingw. PS. The weird sign after sed -i is a double '-'.
@Jcarr250 · 3 years ago
Not sure why everyone seems to have this idea that parser generators are hard. They save ridiculous amounts of time when things change: changing the entire structure of how something is parsed is a one-line change. I'd hate to have to change it by hand. And for something that's supposedly easier, they spend a long time figuring out ways to parse by hand.
@nyrtzi · 4 years ago
Pratt's paper from 1973: dl.acm.org/doi/abs/10.1145/512927.512931 (Free access until July 30, 2020 due to Corona). Crockford's Javascript version of it: crockford.com/javascript/tdop/tdop.html
@nyrtzi · 4 years ago
This is probably not hard to reason about once someone just makes an animation out of it. Is there a tool around which could show step by step like a debugger how the algorithm works? All it would need to show is the list of lexical tokens left to process, the state of each call of parse_expression and the constructed syntax tree.
@krux02 · 4 years ago
What you say about endless recursion in type checking for recursive functions, that is actually a bug in Nim.
@5Gazto · 4 years ago
12:00 Yeah, and people wonder why software and programming documentation sucks: programmers have no idea what a good name looks like, so they just reuse English words, making their meanings intractable at some point. But of course, if you ask a pedantic documentation writer, he will blame it on the incompetent reader, unable to disambiguate from a myriad of contexts that could all be valid in some cases. Portmanteaus are not a new thing, though to a software developer and documentation writer it most definitely seems so.
@InforSpirit · 8 months ago
The problem is, we only sell functional programs (at least that was the goal at some point in history), not functional source code. It is a similar problem to the one fantasy writers have: you need to convey some novel idea without burdening the reader's mind with all the little trinkets and tricks. Imagine an author calling paying readers stupid and dropping sales afterwards... Any classic story has gone through many writing iterations (on paper or in mind) before publishing. Programmers call this process refactoring. Documentation and refactoring is a major skill issue that most programmers have, and often a failure of theory of mind. Often it's seniors who have a mental model of the codebase and are against refactoring (even if you are good, you are still a single point of failure for the organization, and the code base should be renamed and refactored). Writing good code is in some sense harder than writing a novel, because you have two different readers to please: humans and the machine.
@torgnyandersson403 · 2 years ago
About the parsing of binary operators: I honestly don't understand why the "correct way" (as they call it) is considered the complicated way, when it should be the obvious way if you just sit down and write the code needed. Here's how you come up with it or understand it: 1) write separate parse functions for each precedence level, 2) realize that they all look the same, 3) factor them into one function that takes the precedence as an argument.
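
(A minimal sketch of step 3, my code rather than the stream's; it is the same algorithm as the nud/led sketch further up, just folded into a single function that takes the minimum precedence as an argument. Single digits stand in for primaries, and it evaluates directly.)

    #include <stdio.h>

    static const char *p = "3*4+5";        /* input cursor */

    static int prec(char c) { return c == '+' ? 1 : c == '*' ? 2 : 0; }

    static int parse(int min_prec) {
        int left = *p++ - '0';             /* "primary": a single digit */
        while (prec(*p) > min_prec) {      /* next operator binds tighter? */
            char op = *p++;
            int right = parse(prec(op));   /* recurse at the operator's level */
            left = op == '+' ? left + right : left * right;
        }
        return left;
    }

    int main(void) {
        printf("%d\n", parse(0));          /* 3*4+5 -> 17 */
        return 0;
    }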
@giantisopod · 4 years ago
The operator precedence part was interesting. I didn't know about the first method. The second method is actually very similar to what I did when I wrote a calculator program for practice, the only difference is that I pushed tokens onto a stack instead of using recursion. But for a general programming language, recursion is probably much easier.
@torarinvik4920 · 1 year ago
We really need a complete programming language that is made specifically for generating compilers and interpreters. The closest we come to that at the moment is OCaml. But I'm talking about a language that is specific enough to be easy to use, and expressive enough to do anything you would want. I really dislike the parser generators; they are simply not fun to use, and it will take longer to learn them than to actually hand-write a parser. They also don't help with ASTs, type checking, evaluation, code generation, storage, etc. EDIT: Parser combinators are highly useful!
@m4rt_ · 1 year ago
I watched this video a while back, before ever having tried writing a lexer/parser... but now, after doing some small projects doing that, I think less of it is going over my head, and it's a bit more enjoyable to listen in. It was a lot of fun listening when I knew nothing, but it's even more fun now.
@m4rt_ · 2 years ago
1:33:10 Why not just use the crappy non-optimized build for quick builds and testing, and then add a feature to the compiler, enabled with a flag or something, that starts a compile that will last a while, goes through all the stuff and tries to optimize it, and then caches stuff so you don't do it all from scratch every time unless you changed it? Kind of like how Rust has the --release flag, but spending more time optimizing.
@gregg4 · 4 years ago
2:32:00 Casey says compilers cannot tell you which #if was not closed. Since those things can be nested, a compiler cannot reliably tell you which one wasn't closed, and any guess it makes has the potential to mislead. It is probably better that it doesn't give a hint, as no information is usually better than misleading information.
@gettem6341 · 2 years ago
This explanation is really helpful in writing parsers, for some reason I never thought to do them the way he describes here, but it makes it much easier.
@georgemargaris · 4 years ago
highly valuable, thank you
@ronanderson1023 · 4 years ago
Dropping the video resolution to make room for all the poor people watching Netflix in my building. Loving the content
@Spongman · 4 years ago
the msvc front-end runs multiple compilers in parallel, it also shares the parsed precompiled header info between compiler instances that share the same pch. what's more, the code-generation phase is moved into the linker in order to support whole-program optimization.
@jerzydziaa1819 · 4 years ago
read about Parser Combinators, they make writing parsers from scratch extremely easy
@mortenhaukesolvang5665 · 4 years ago
how do parser combinators not introduce the issues discussed in the video, namely obtuse error messages?
@999a0s · 4 years ago
lmao these are two of the most experienced and skilled programmers currently working, imo. they know what parser combinators are.
@jpratt8676 · 4 years ago
@@mortenhaukesolvang5665 They do generate bad error messages without manual error management, but they do normally provide tools for that.
@5Gazto · 4 years ago
3:40 Oh, crap, you are telling me that, my life-long experience with computers and programming. [Pressured vapor escapes from the pipeline's valve]
@BxBL85 · 4 years ago
56:33 critical error, expression not found..
@thedoublehelix5661 · 3 years ago
2:23:10 Casey is onto something here
@HAL_NlNETH0USAND · 4 years ago
Mysticized 36:25
@seanarooni · 4 years ago
had a turkey flavored sandwich for lunch today
@josiahmanson · 4 years ago
I disagree with Casey about wanting to have optimizations constantly run in the background. This has several bad effects. First of all, you don't want to ship an unoptimized build and count on the customer optimizing it. That will result in a product that starts slow and speeds up over time. It also means having every customer pay the cost of running an optimizer, rather than the developer running it once and distributing the results. Also, every customer will effectively wind up with a slightly different build, and bugs in the optimizer could cause a build to break in some non-deterministic way for a subset of customers. Even for local testing, non-determinism of performance characteristics will be bad: how can you test whether some new code you write is faster than the code it replaces? BTW, this sort of JIT compilation and optimization is what Java-style languages and shaders in graphics APIs do, so it is not a new idea.
@dmoore764 · 4 years ago
I don't think he meant have it run on the customer's computer, but on the developer's computer as they are working on the program. As far as performance testing goes, if the code is constantly optimizing as you are developing, then essentially your performance testing would always be on an optimized version, unless you just loaded a bunch of code from somewhere and immediately started performance testing on it or something (although maybe if you are code sharing you could also share the background work the compiler has done?).
@SimonBuchanNz · 2 years ago
This is generally referred to as an "online" compiler. They're super common in specific domains, but haven't caught on more generally. I hope the rise of native support for LSP in languages eventually leads to more common support for this, given they can share a codebase.
@shitheadjohnson2797 · 2 years ago
Tokenize -> then run the parser calculator. But if you're just using assembler, there's no need for a compiler. :)
@shitheadjohnson2797 · 2 years ago
@FichDich InDemArsch I thought assembler didn't even have a tokenizer. It's just raw to the RAM address, with no variable names.
@daviddunleavy187 · 4 years ago
1:15:04 saving for later
@remyclarke4020 · 3 years ago
A couple months ago I watched this video and was surprised by how interesting the content was despite my complete lack of knowledge in programming. Since then I have gone down a rabbit hole of learning about software and programming, set up a Linux laptop, tried some programming, and watched and read a whole lot about computer science. I'm happy to say that coming back to this video, I have only found it more interesting and entertaining. I'm really happy to hear about your experience with other languages, and how they have informed you about what you care about. I know that you focus on game-related subjects, but I would be really interested to hear about your experiences with other languages. What languages did you enjoy before, but had to decide not to use because of performance concerns?
@FalconGames109 · 4 years ago
Great commentary on SSA. I've always felt that way and it seems like a lot of people worship it as if it's something special, but it seems like they just fail to understand the tradeoffs it introduces (and the fact that it's not required for anything). I'm glad that my compilers professor was very good about that and was pretty upfront about SSA being mostly bullshit.
@KilgoreTroutAsf · 4 years ago
37:00 to be fair, parsers are much more general than lexers in the kind of rules they can resolve
@buttonasas · 4 years ago
Isn't it the other way around? Lexers read just about anything and will only fail to resolve when you have very simple combinations of characters. Parsers are more picky and actually require some actual structure. Picky = less general.
@skepticmoderate5790 · 4 years ago
@@buttonasas Lexers are basically finite state machines, and they simply produce a list of tokens with no structure, while parsers can have arbitrary state and produce abstract syntax trees. In fact, sometimes you can choose to skip lexing completely and combine the lexical analysis into the parsing step.
@larfeeheelarfeehee9600 · 4 years ago
Can anyone give a short answer to the approximate compiler/language release date?
@SimGunther · 4 years ago
When it's done
@ekrem_dincel · 4 years ago
@@SimGunther I agree
@RobertHildebrandt · 4 years ago
This is pure gold!
@KhalilArafan · 4 years ago
hey, memory lane, lex and yacc + ocaml here around 2002 in uni in france :p
@TheMRJewfro · 4 years ago
Flex and Bison in 2017 in California
@Schcarraffone · 1 year ago
And baguettes too?
@KhalilArafan · 1 year ago
@@Schcarraffone and all the cheeses haha
@alonecoder600 · 4 years ago
If 0..255 is reserved for ASCII codes, where the rest of letters (i.e. Cyrillic, Greek etc.) go?
@pleggli · 4 years ago
You could have used the emacs REPL to demonstrate an emacs lisp evaluation of (* 5 (+ 6 7)): a REPL in a GCed LISP with a tree structure... You can say a lot of things about lisp, but it sure is a good base to teach programming languages on, even emacs lisp to some extent.
@PankajDoharey · 4 years ago
It is easy to do this in Lisp: (+ 1 2). The precedence and the tree are already defined by the syntax. Designing a Lisp for a game language would be the best way.
@TheSandvichTrials · 4 years ago
Not best for the user. I much prefer the syntax of modern languages...
@Cons-Cat · 3 years ago
​@@TheSandvichTrials Lisps are modern lmao. Look at Racket, Chicken, Joker, Clojure, Guile, etc. What you mean to say is that you prefer C-flavored languages, and you prefer them for purely arbitrary reasons.
@thedoublehelix5661 · 3 years ago
@@Cons-Cat It's not really arbitrary because that's the way he's been taught how to do math since childhood. It's also the way we've been writing and doing math for hundreds of years (at least in Europe), so why would you not support it.
@Cons-Cat · 2 years ago
@@thedoublehelix5661 I am quite certain that people have not been writing bitwise operators for hundreds of years.
@thedoublehelix5661 · 2 years ago
@@Cons-Cat I was talking about + and *
@TheaDragonSpirit · 4 years ago
I think these two videos show how a computer works: basically you start off with really simple logic gates, then you build hardware to run those logic gates faster, then you write a computer language that works with those logic gates, which is ASCII, and then you have a language which makes it easier for humans to compile ASCII, and that is it. A bit overly simplified, but basically you would start with this music box: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-HjBhO9iqEc0.html Then you would build it into something more along these lines: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-IvUU8joBb1Q.html Progression basically.
@ekrem_dincel · 4 years ago
ASCII is a character encoding, not a language
@TheaDragonSpirit · 4 years ago
@@ekrem_dincel What is the difference between encoding and language? That just seems like semantics to me.
@ekrem_dincel · 4 years ago
@@TheaDragonSpirit The difference is really big. An encoding determines what the values on a physical computer will mean. For example, the byte 0x12 may mean "a" in ASCII but also may mean "b" in UTF-8. en.m.wikipedia.org/wiki/ASCII Python is a programming language; you can write your code as ASCII or UTF-8...
@TheaDragonSpirit · 4 years ago
@@ekrem_dincel I would consider Binary a language. ASCII to me is also a language, it's a formation of numbers which can be interpreted. I would say Encoding is just the type of language it is. In my opinion. For example take the alphabet, it means nothing till you put it in a certain order, but I would not say the alphabet isn't a language.
@riomh · 4 years ago
@@TheaDragonSpirit It might help to think of it this way: The language is something we communicate with, and the encoding is how we define things so that they can stored/transmitted. With computers, we need a way to store/transmit (read: encode) language. For example, spoken language is encoded in waves moving through the air, and written language is encoded in photons. Since computers use bits, we must find some way to convert a language to bits - so we created ASCII which is like a rulebook for how language will be stored/transmitted in computers.
@sysInt64 · 4 years ago
why so low quality - 360p only :(
@ReFreezed · 4 years ago
It takes some time for YouTube to process the higher quality settings.
@sysInt64 · 4 years ago
@@ReFreezed I see, I'll wait a little bit then )
@slutmonke · 4 years ago
The key takeaway from this and many of the other times you've talked about your disagreements with CS as taught in academia, is that it's really easy to optimize for the wrong thing when aggressively pursuing intuitive metrics for efficiency instead of measured execution efficiency itself. Like intuitively we might expect shorter code to be faster, but in some situations that leads to tons of extra branching, or we might expect that generalizing the lexer and parser into their own separate generic design spaces allows us to optimize each one, but in some cases we'd lose so much due to the translation of generics at the interfaces that it's actually not helping us.