From NUL to DEL: Why 7 Bit ASCII Is Actually Really Clever

Dylan Beattie

Published: 25 Oct 2024

Comments: 380

@ke9tv 2 months ago

Once upon a time, I worked in the office next door to Bob Bemer, the editor of the first ASCII standard. Which, by the way, also specified EBCDIC. IBM was the only manufacturer that embraced EBCDIC rather than ASCII because EBCDIC was more punched-card friendly, and IBM virtually owned the market in 80-column card equipment. Single newline came from the B programming language. Multics used . X-ON and X-OFF are misidentified in your table. They're DC1 and DC3 respectively. ETB was the standard 'file mark' that separated multiple files on a magnetic tape. EM 'end medium' was a mark that meant, 'this file spans multiple reels, time to switch to the next reel.' NAK - negative acknowledgment - is the ^U that you use to cancel the stuff you're typing at the command line.

@allangibson8494 2 months ago

IBM literally invented the punched data card via the Hollerith Company in 1889… Punched cards controlling machines, however, date from 1798.

@lbgstzockt8493 2 months ago

@@allangibson8494 It's crazy that modern IDEs still have markings at 80 characters. That technology was so far ahead of its time.

@allangibson8494 2 months ago

@@lbgstzockt8493 80 Characters was what was determined by the U.S. census as being adequate to store the population information as a line item…

@thezipcreator 2 months ago

I didn't even know ^U existed, I've always just done ^C (and shells are smart so they know to catch that and not just terminate)

@TheEvertw 2 months ago

@@thezipcreator Shells terminate with ^D, not ^C. ^D is End of Transmission (i.e. connection). ^C is passed on from the shell to the program that is running at the time.

@fritzp9916 2 months ago

Great video. Though I think what deserves to be mentioned is backspace. On paper terminals, you can't delete a character you've already written, so all that backspace did was go back one space, allowing you to print over the previous character. This was useful for making text bold - as you mentioned when discussing carriage return - but also for creating combined characters. Want to type "café"? Just type "cafe", hit backspace, and type an apostrophe. The fonts used for paper terminals were carefully designed to make this look good. Likewise, o with " on top was a "good enough" approximation of ö. Some ASCII characters were included specifically for this reason: the tilde, the grave/backtick, the caret. But most importantly, the underscore. The only reason why it was included was to underline words to highlight them.
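
The overprinting described above can be sketched as a tiny simulation of a print head. This is only an illustration: the function name `overstrike` and the choice to model the paper as a list of columns are assumptions made here, not anything from the video.

```python
def overstrike(stream: str) -> list[str]:
    """Simulate a print head: return, per column, every glyph struck there."""
    cells: list[str] = []
    col = 0
    for ch in stream:
        if ch == "\b":            # BS (ASCII 8): move the head one column left
            col = max(col - 1, 0)
        elif ch == "\r":          # CR (ASCII 13): return the head to column 0
            col = 0
        else:
            while len(cells) <= col:
                cells.append("")
            if ch != " ":         # a space moves the head but strikes nothing
                cells[col] += ch
            col += 1
    return cells


# "cafe", backspace, apostrophe: the last column holds both 'e' and "'".
accented = overstrike("cafe\b'")
# A CR without LF lets a second pass underline the whole line.
underlined = overstrike("hi\r__")
```

Nothing is ever erased in this model, which is the point: on paper, backspace and carriage return only move the head.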

@billwall267 2 months ago

Important context!

@gcewing 2 months ago

There was a phase of my life when I was using a 5-bit teleprinter as an I/O device for my homebrew 8-bit system. It unfortunately didn't have any backspace ability, which was very annoying when I wanted to print zeroes with slashes through them. I ended up doing a CR and then going over the whole line again to fill in the slashes.

@jasonclark1149 2 months ago

My browser decided to buffer at the perfect moment, in the Morse Code section. "The code for A is , but if you leave a gap, ..." and then it just started spinning. It was a VERY long gap 😂

@Huntracony 2 months ago

I always wondered how terminal progress bars and such worked! This also explains why these often kinda break when there's an error or warning during the progress bar. Thanks, this was entertaining _and_ useful.

@exciting-burp 2 months ago

On modern systems, since roughly the '80s (support was added in Windows 10, which previously used a *very* different method), it's all done using VT100 and its successors. Here you'll find ways to encode things like "move to row X column Y", and "set color to red" (there are hundreds of commands). The trade-off is that commands are no longer single bytes. This was used for the first digital displays, especially for dumb terminals.
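
A minimal sketch of the two commands mentioned above, assuming a terminal that understands the common VT100/ANSI sequences. The helper names (`move_to`, `set_foreground`, `red_at`) are made up for this example; only the byte sequences themselves are standard.

```python
ESC = "\x1b"       # ASCII 27, the escape character
CSI = ESC + "["    # "control sequence introducer" that starts most commands
RESET = CSI + "0m"


def move_to(row: int, col: int) -> str:
    """Cursor position: move to a 1-based row and column."""
    return f"{CSI}{row};{col}H"


def set_foreground(color: int) -> str:
    """Select one of the 8 basic foreground colours (0-7; 1 is red)."""
    if not 0 <= color <= 7:
        raise ValueError("basic colours are 0..7")
    return f"{CSI}{30 + color}m"


def red_at(row: int, col: int, text: str) -> str:
    """Build the bytes for: go to (row, col), switch to red, print, reset."""
    return move_to(row, col) + set_foreground(1) + text + RESET


# repr() shows the raw bytes instead of letting the terminal interpret them.
print(repr(red_at(5, 10, "error")))
```

Each command here is several bytes long, which is the trade-off the comment points out compared with single-byte control codes.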

@declanmoore 2 months ago

Windows has even added support for these newer control codes to their console host. Before that, and what still works, is to send commands to the console host driver (condrv.sys) via device IO controls.

@backpackvacuum9520 2 months ago

As an American I will proudly ignore all further episodes as I now have everything I need. /s 😂

@_f355 2 months ago

so you don't need that emoji over there, right? :)

@nickwallette6201 2 months ago

It's kind of funny, but that decision became a self-fulfilled prophecy. Because it wasn't a given that you would have consistently-mapped upper ASCII characters to represent even the most common international letters, it got to be fairly commonplace to see letters with accent marks dropped back to their un-accented variants. Granted, I'm a native English speaker, and so 26 letters ought to be enough for anybody. ;-) But, it didn't seem to have much of an effect on the intelligibility of words that used those letters. I recall seeing discussions on this where, specifically, Spanish-language and German speakers shrugged it off as, "eh... we knew what it meant." And, again, as a native English speaker, I have rarely considered the word "jalapeno" spelled with anything other than a plain 'n', and yet I recognize it easily enough in either form. On a related note, I got a crash-course in the peculiarities of various languages when I started writing a driver for FAT filesystems. Plain FAT (as in, pre-LFN) is case-insensitive, and meant to only consider letters in the low-ASCII range A to Z. All lowercase alpha chars are converted to uppercase with that toggling of bit 5. But, when LFN support was added, well... now we're dealing with Unicode characters in UTF-16 form, and ... _technically_ ... we should be case-folding everything (I think?) to uppercase to store the 8-dot-3 compatibility entry, and when searching for or comparing filenames in either 8.3 or LFN form. I say "I think" because the official FAT LFN spec is a bit quiet on what to do about chars with ordinals above 127, probably for the most obvious reason: It's kind of a pain to handle those. You have languages that case-fold differently depending on context, and when you're converting from Unicode to local code-pages, that character might not even exist. 
While (IIRC) the common US English DOS code page has upper- and lower-case variants of all the accented characters >127, not all of the code pages do, making it impossible to represent properly uppercased versions of any given filename entered with lowercase chars. And, of course, if you change code pages (by either changing the local code page, or moving a file to a system with a different code page), the filename might "change" completely, resulting in lowercase chars in the filename, potentially making it inaccessible by normal means, or causing match collisions with files that get created with uppercasing applied. I think some (or maybe most?) implementations just continue case-folding lower-ASCII chars, and letting everything else slide. All of this because the original implementations were designed in a relatively simple (cultural) language with straightforward rules, and when -- or if -- the thought occurred to anyone about how to handle other languages, they just shrugged and thought, "eh... that's a problem for future developers."
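
The "fold only a-z and let everything else slide" approach mentioned in this comment might look roughly like the sketch below. It is not taken from any FAT specification or driver; `ascii_upper` is a hypothetical helper that only illustrates the bit-5 trick.

```python
def ascii_upper(name: str) -> str:
    """Upper-case only a-z by clearing bit 5 (0x20); leave the rest alone."""
    out = []
    for ch in name:
        if "a" <= ch <= "z":
            out.append(chr(ord(ch) & ~0x20))   # 'a' (0x61) -> 'A' (0x41)
        else:
            out.append(ch)                     # digits, dots, non-ASCII: untouched
    return "".join(out)


short_name = ascii_upper("readme.txt")
# The accented letter is outside a-z, so it is passed through unchanged.
mixed = ascii_upper("café")
```

The range check matters: clearing bit 5 blindly would also turn `{` into `[` and corrupt characters that are not letters.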

@timseguine2 2 months ago

It took me longer than I care to admit to figure out that ASCII is also BCD with extra "tag" nibbles in between. You can read numbers off easily in a hex dump just by ignoring the extra 3s everywhere. Well and if you get used to it, you can also read letters pretty easily from the hex dump but that feels more like using one of those old cereal box decoder rings.

@ShenLong991 2 months ago

The thing with the numbers is even prettier if you look closely at the bits and keep your hexadecimal in mind. 0x30 to 0x39 are decimal '0' to '9'. So if you are doing embedded programming and plan your decimals accordingly, you can see what each one is without having to dissect the bits.

@JoQeZzZ 2 months ago

This is cool, although an inevitable side effect of the "& 0b1111" thing. In order to get a string to int using only an AND, they would have to be LSB aligned, and because there are 10 digits they need 4 bits.

@timseguine2 2 months ago

This is a side effect of it being backwards compatible with BCD. If you wanted to you could actually do arithmetic directly in string form because of that.

@gcewing 2 months ago

And I still have burned into my brain that there are 7 character codes between '9' and 'A', from all the hexadecimal to binary conversion routines I wrote in machine language. I've sometimes fancied that if I were to design an improved character encoding I would make the first 36 codes be all the digits followed by all the uppercase letters. It just makes sense. To a programmer, at least.
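
The points made in this thread (digits as BCD with a 3 in the high nibble, the "& 0b1111" trick, and the 7 codes between '9' and 'A') can be sketched as follows. The function names are invented for this example.

```python
def digit_value(ch: str) -> int:
    """'0'..'9' are 0x30..0x39, so the low nibble is the digit itself."""
    if not "0" <= ch <= "9":
        raise ValueError(f"not a decimal digit: {ch!r}")
    return ord(ch) & 0b1111


def hex_value(ch: str) -> int:
    """Hex digit to int, the way a small machine-code routine might do it."""
    code = ord(ch.upper())
    if ord("0") <= code <= ord("9"):
        return code - ord("0")
    if ord("A") <= code <= ord("F"):
        # Skip the 7 codes ( : ; < = > ? @ ) that sit between '9' and 'A'.
        return code - ord("0") - 7
    raise ValueError(f"not a hex digit: {ch!r}")


def digits_from_hex_dump(dump: str) -> str:
    """Read a decimal number out of a hex dump by dropping the 3 'tag' nibbles."""
    pairs = dump.split()
    if any(len(p) != 2 or p[0] != "3" or p[1] not in "0123456789" for p in pairs):
        raise ValueError("dump contains bytes that are not ASCII digits")
    return "".join(p[1] for p in pairs)


year = digits_from_hex_dump("31 39 38 34")
```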

@peterlinddk 2 months ago

A lot of the "skipped" codes, like ACK, NAK, and SYN, were used in a lot of early communication protocols, like XMODEM and the like. And for some reason I don't understand, DC1 and DC3 were used for XON and XOFF, which I think we all remember from the old modem days. I don't know why SO and SI are called X-On and X-Off in some ASCII tables ... maybe some other protocols used those? Ah, the days of RS-232 ASCII-based protocols!

@timseguine2 2 months ago

STX (start text) and ETX (end text) are used sometimes for framing purposes.

@ke9tv 2 months ago

DC1 and DC3 turned on and off the paper tape reader on Teletypes. DC2 and DC4 turned on and off the paper tape punch. When you were sending a paper tape down the line, if you were threatening to overrun a buffer, the other end would send DC3 to say, 'hold on there, tiger', and DC1 again when it was ready to slurp down more. ^S and ^Q still work that way on most Unix terminal emulators.
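
The DC3/DC1 handshake described above can be sketched as a toy simulation. The buffer thresholds (`high`, `low`) and the drain rate are arbitrary assumptions chosen for illustration; real devices and drivers pick their own.

```python
from collections import deque

XON, XOFF = 0x11, 0x13   # DC1 (^Q) and DC3 (^S)


def simulate(data: bytes, high: int = 6, low: int = 2, drain: int = 4):
    """Return (bytes received, control codes sent back) for one transfer."""
    pending = deque(data)     # bytes the sender still has to send
    buffer = deque()          # the receiver's input buffer
    received = bytearray()
    signals = []
    paused = False
    while pending or buffer:
        if pending and not paused:
            buffer.append(pending.popleft())
            if len(buffer) >= high:           # threatening to overrun
                paused = True
                signals.append(XOFF)          # "hold on there, tiger"
        else:
            for _ in range(min(drain, len(buffer))):
                received.append(buffer.popleft())
            if paused and len(buffer) <= low:
                paused = False
                signals.append(XON)           # ready for more
    return bytes(received), signals


received, signals = simulate(b"0123456789AB")
```

With these assumed settings a 12-byte transfer pauses and resumes twice, and no byte is lost or reordered.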

@darrennew8211 2 months ago

Shift in and shift out controlled what you might think of as the typeface.

@edgeeffect 2 months ago

In your chart, you've got X-On and X-Off as 14,15 SO,SI "control-N" / "control-O" but, in any system I remember, XON and XOFF are DC1/DC3 17,19 "control-Q" / "control-S" ... and that takes me back to writing printer handshaking diagnostics for the repair centre at work and saying, "Oh that's why some of the old 8-bit machines had control-S to pause scrolling". My old manual typewriter didn't even have an exclamation mark... because you could make one out of single-quote, backspace, full stop.

@fburton8 2 months ago

Control-O was commonly used to throw away the rest of the current terminal output.

@ShadowKestrel 2 months ago

^D isn't dead and gone. in most CLI/TUI contexts it's a semi-standard way to close out into the parent shell, and still works well in cases where ^C is taken (e.g. in the python shell, where it will raise KeyboardInterrupt).

@RoamingAdhocrat 2 months ago

incidentally, if you're using the python shell more than very very occasionally... install ptpython, an infinitely nicer python shell

@pidgeonpidgeon 2 months ago

Except on Windows, where it's usually Ctrl-Z

@darrennew8211 2 months ago

What ^D does is it sends any buffered data, including whatever you've already typed, but not the ^D. If you type it with nothing buffered, then it sends zero bytes. And Unix treats a read of zero length as an end-of-file. ^Z is the character that CP/M decided to put inline to mark the end of a text file, because all files in CP/M were a multiple of 128 characters long. You never saw a file that was like 74 bytes long, so if you had a 74-byte text string in a file, you tacked ^Z on as byte 75.
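
The CP/M convention described above can be sketched like this. Filling the whole remainder of the last 128-byte record with ^Z is an assumption made here for simplicity; the comment only says that a ^Z follows the last byte of text.

```python
RECORD = 128          # CP/M files are a whole number of 128-byte records
CTRL_Z = b"\x1a"      # ASCII 26 (SUB), typed as ^Z


def pad_cpm(text: bytes) -> bytes:
    """Pad text with ^Z up to the next multiple of 128 bytes."""
    shortfall = -len(text) % RECORD
    return text + CTRL_Z * shortfall


def read_cpm_text(data: bytes) -> bytes:
    """Return the text part of a file: everything before the first ^Z."""
    end = data.find(CTRL_Z)
    return data if end == -1 else data[:end]


text = b"x" * 74                 # the 74-byte example from the comment
stored = pad_cpm(text)           # byte 75 (index 74) is now ^Z
```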

@RoamingAdhocrat 2 months ago

@@darrennew8211 I don't know why YouTube sent me a notification about your comment but I'm glad it did.

@nickwallette6201 2 months ago

Huh. While I knew the conventions of the above, I did not know the reasons why. This has been educational.

@Vennotius 2 months ago

I enjoyed this one very much. I still remember discovering an ASCII table in one of my father's handbooks when I was a kid. This video took me back.

@karlfimm 2 months ago

I still remember hitting my first EBCDIC files (about 1985) and being amazed that the A-Z characters were scattered around at what looks like random.

@timseguine2 2 months ago

If I remember correctly, EBCDIC was designed to be backward compatible with IBM's punchcard systems, which were still relevant at the time. I think there were considerations for efficient electromechanical sorting and also for not producing too many consecutive holes in the card which could clog the reader or the hole punch machine. Back when it was invented, IBM was almost superstitious about punchcards because they were a huge reason for their financial success, and continuing financial success. In hindsight they don't seem so important of course.

@peterholzer4481 2 months ago

@@timseguine2 Right. The punchcards didn't use a binary encoding of the digits 0-9. Instead they had 1 row for each digit. So it made sense to use only the digits 0-9 in the lower nibble for the letters, too. There is a picture of a punchcard in the Wikipedia article about EBCDIC. It looks quite neat and not random at all.

@darrennew8211 2 months ago

They're lined up properly if you ignore the right holes on the punch card, rather than ignoring the right bits in a byte.

@pjl22222 2 months ago

EBCDIC was just a newer, fancier version of BCDIC, binary coded decimal interchange code, which itself was more a group of similar but different encodings. BCDIC was a 6 bit encoding where the numbers 0-9 were encoded as the values 0-9 and everything else was distributed basically randomly. The letters (uppercase only) were divided into three groups which were backwards, S-Z was encoded with smaller numbers than J-R which were smaller than A-I. EBCDIC is an 8-bit encoding (although many code points were left undefined) which didn't fix the noncontiguous problem but it did fix the order of the letter groups.
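
The letter layout discussed in this thread can be inspected with Python's built-in `cp037` codec, which is one common EBCDIC code page (other EBCDIC variants differ in places, so treat this as a sketch for that one page). The helper names are made up for the example.

```python
import string


def ebcdic(ch: str) -> int:
    """Code of a single character in the cp037 EBCDIC code page."""
    return ch.encode("cp037")[0]


def letter_runs() -> list[str]:
    """Group A-Z into runs of letters whose EBCDIC codes are consecutive."""
    runs: list[str] = []
    previous = None
    for ch in string.ascii_uppercase:
        code = ebcdic(ch)
        if previous is not None and code == previous + 1:
            runs[-1] += ch
        else:
            runs.append(ch)       # a gap in the codes starts a new run
        previous = code
    return runs


for run in letter_runs():
    # Show each run with the codes of its first and last letter.
    print(run, hex(ebcdic(run[0])), hex(ebcdic(run[-1])))
```

The three runs match the A-I, J-R and S-Z groups named above, and within every run the low nibble stays in the range 1 to 9, echoing the digit rows of a punched card.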

@jasmijnwellner6226 2 months ago

ASCII 27 (ESC, generally written in source code as \x1b, \033 or \e) is still used a lot for terminal applications for more complex than can do, including changing the colour of the text or background!

@ke9tv 2 months ago

There was a whole ANSI standard that came later for what the various escape codes were supposed to do. (Nobody implemented the whole thing, and no two vendors implemented the same parts.)

@jovetj 2 months ago

Don't forget *^[* 😉 ESC is a pretty important character. Not as important as 0x0A or 0x0D, though.

@BradHouser 2 months ago

VT100 and later ANSI escape sequences made BBS pages colorful and graphical (boxes, symbols, etc.). DEC added ReGIS graphics to the escape sequences, and graphic primitives could be drawn on the screen, enabling interactive graphics terminals, all using 7-bit ASCII.

@pitan9445 2 months ago

First time viewing your channel - this was excellent. Before HTML was a thing, I worked for an organisation selling structured news (sports results &c). We used record separators (RS, ASCII 30) and file separators (FS, ASCII 28) to split up our rows and fields. It took me a long time to realise we were redefining the acronyms.

@ke9tv 2 months ago

RS was right to separate records. The fields should have been separated with US, unit separator. GS and FS were higher level.
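
The layout suggested in this reply (US between fields, RS between records) can be sketched as below. The function names `dumps` and `loads` are just borrowed naming habits, not an existing format or library, and the sketch assumes no field ever contains the separator bytes themselves.

```python
US = "\x1f"   # ASCII 31, unit separator: between fields
RS = "\x1e"   # ASCII 30, record separator: between records


def dumps(rows: list[list[str]]) -> str:
    """Join fields with US and records with RS; no quoting or escaping needed."""
    for fields in rows:
        for field in fields:
            if US in field or RS in field:
                raise ValueError("field contains a separator character")
    return RS.join(US.join(fields) for fields in rows)


def loads(blob: str) -> list[list[str]]:
    """Inverse of dumps() for non-empty input."""
    if not blob:
        return []
    return [record.split(US) for record in blob.split(RS)]


results = [
    ["Arsenal", "2", "Chelsea, FC", "1"],      # a comma needs no escaping
    ['say "hi"', "line\nbreak", "", "x"],      # neither do quotes or newlines
]
round_trip = loads(dumps(results))
```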

@mhzellers 2 months ago

If you have ever punched a Hollerith card, EBCDIC makes a certain amount of sense.

@kevinmcnamee6006 2 months ago

Excellent video. It certainly brought back memories. My first job as a programmer (1975) was working on code that allowed IBM mainframes to communicate with ASCII terminals. This involved translating ASCII to EBCDIC and of course worrying about how all the control characters worked, like CR, LF, TAB, NULL, etc. On the old Teletype 33 terminals you even had to worry about how long it would take for the carriage to return to the left margin after printing a long line, and insert enough NULLs to allow it time to happen before the next printable character arrived. We referred to them as dumb-ASCII terminals. One thing that made things more tricky was that the guy who wrote the specs for the communications controller on the IBM mainframe got the bit order reversed, so the low order bit from the IBM system was sent as the high order bit on the wire. Another difference was sort order. In ASCII, digits sort first, followed by upper case letters, and then lower case. In EBCDIC, lower case letters sort first, followed by upper case, and then numbers.
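
The difference in sort order is easy to check with Python's built-in `cp037` codec, used here as a stand-in for "EBCDIC" (an assumption, since there are several EBCDIC code pages). In ASCII the order of the groups is digits, upper case, lower case; in cp037 it is lower case, upper case, digits.

```python
def ebcdic_key(text: str) -> bytes:
    """Sort key: the bytes the text would have in the cp037 EBCDIC code page."""
    return text.encode("cp037")


words = ["apple", "Zebra", "42"]

ascii_order = sorted(words)                     # compares ASCII code points
ebcdic_order = sorted(words, key=ebcdic_key)    # compares EBCDIC byte values

print("ASCII :", ascii_order)
print("EBCDIC:", ebcdic_order)
```

The same three strings come out in exactly the opposite order, which is the kind of surprise a translation layer between the two worlds has to account for.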

@nuk1964 2 months ago

One of the frustrations that I remember from the early 1980s was the occasional mangling of data when going between the EBCDIC and ASCII worlds. Alphabetic characters and digits were OK, as was most of the punctuation. Mangled were things such as horizontal tabs, circumflex, backslash, curly braces and square brackets (apparently some versions of EBCDIC had these, and some did not, and in those that did they sometimes appeared in different locations). E-mail and general text would generally pass through OK (or if it was "mangled" in translation, it was still understandable). What was not so great was when you tried to transfer some source code in languages like C or Pascal. I learned quickly NOT to use TAB characters for indentation (due to the inconsistent translation -- sometimes a tab translated directly into a single space, other times it would get "expanded" to a sequence of spaces but inconsistently -- if you were lucky it expanded to the right number of spaces to preserve the indentation, but more often than not, it didn't). This helped to preserve the indentation of code -- allowing for easier recovery when the curly braces got lost (and you had a better chance of guessing correctly the location of those missing curly braces). The loss of curly braces, square brackets and backslashes would render C source code unusable -- but the "somewhat obscure" feature of trigraphs became quite useful in this case. The downside is that they make your code *really ugly*. For Pascal code, I found some of the alternates used in Pascal/VS on the IBM useful -- such as the "(." and ".)" aliases for the square brackets, and the "->" alias for the caret.

@nuk1964 2 months ago

My first encounter with a double-byte character set was on the Control Data mainframe -- where a double-byte system was used to get beyond the limitation of 6-bit bytes. It was also on the Control Data systems that I finally understood why Pascal used the eoln() function (rather than looking at the character value and checking for carriage return or linefeed) -- end-of-line was a very specific pattern (iirc it was something like a word-aligned sequence of contiguous zero bytes -- where there were 10 6-bit bytes in a 60-bit word).

@LordPhobos6502 2 months ago

Looking forward to next week's video! Reading ascii codes in decimal hurts my poor lil brain though, I was taught early on in hexadecimal, and it always made more sense to me that way :)

@Lord-Sméagol 2 months ago

I learned BASIC at school using an ASR-33 TeleType dialling in to an HP 2000F, saving my programs to paper tape. Sometimes, classmates would want to know which program was on their paper tape that they forgot to write the name on. This was easy enough if the terminal wasn't being used, but I could read the holes and tell them :)

@VoyVivika 2 months ago

Clicked on this video only to discover it's by the guy who made the Rockstar programming language, lmao wasn't expecting that. Loved the video btw!

@ReneKnuvers74rk 2 months ago

13:14 I’m pretty sure not the creators of ASCII threw all the hyphens and quotes on a couple of piles - it was the teletype-makers that around 1900 to 1960’s had no 1, only an i without a dot, a separate dot that doubled as a single quote, and no separate characters for o and 0. That meant that ASCII adding back these additional characters would force mechanical changes to the devices that were supposed to use the new standard. Since computers need a distinction between a letter and a number the 1/i and 0/O issue was required to be solved, but the start and end quotes have no functional meaning in a computer.

@darrennew8211 2 months ago

Not just teletype. That was pretty common on typewriters too.

@McDuffington 2 months ago

One of my favorite subjects! Looking forward to the follow up parts!

@TimSavage-drummer 2 months ago

EOT (Ctrl+D) is still used in Unix/Linux to end a terminal session. I also find it odd that 28-31 aren't used more, they are perfect for use in CSV(like) files to avoid needing to do escaping etc.

@lupinzar 2 months ago

The utility of CSV is that you can edit it in pretty much any text editor in a pinch and it still remains (fairly) human readable. Once you introduce control codes that won't be visible at all in some editors and require special settings in others, you might as well develop a binary format that is more efficient. That said, if you can't influence the design of a data format and need an extra set of delimiters they are useful, but probably not best practice.

@darrennew8211 2 months ago

Control D doesn't end a terminal session. It flushes the keyboard buffer without adding anything to it. If you're at the start of a line, then you flush zero bytes. A read from a file of zero bytes indicates an end of file in Unix. So the terminal reads zero bytes, thinks its input is closed, and exits. Write a program that sits in a loop reading the stdin and writing what it gets without any buffering. Then type "ABC" and hit ^D, and you'll see instead of exiting it just prints ABC.
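
The experiment suggested above could be sketched like this, using raw `os.read` and `os.write` calls so that no library buffering gets in the way. The function name `echo_unbuffered` and the 1024-byte read size are arbitrary choices for the example. The demonstration at the end uses a pipe rather than a terminal so it can run unattended: closing the write end plays the role of ^D on an empty line.

```python
import os


def echo_unbuffered(in_fd: int, out_fd: int) -> int:
    """Copy in_fd to out_fd one read() at a time; stop on a zero-length read.

    Returns the number of non-empty reads that were echoed.
    """
    reads = 0
    while True:
        chunk = os.read(in_fd, 1024)
        if not chunk:              # zero bytes: Unix treats this as end of file
            return reads
        reads += 1
        os.write(out_fd, chunk)


# Self-contained demonstration with pipes instead of a keyboard.
source_r, source_w = os.pipe()
sink_r, sink_w = os.pipe()
os.write(source_w, b"ABC")         # like typing ABC ...
os.close(source_w)                 # ... and then nothing more arrives
count = echo_unbuffered(source_r, sink_w)
os.close(sink_w)
echoed = os.read(sink_r, 1024)
```

To try it on a real Unix terminal, call `echo_unbuffered(0, 1)`: typing ABC and then ^D should echo ABC without exiting, and ^D at the start of a line should end the loop.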

@lennartbenschop656 2 months ago

They even did take care to support foreign western languages to some degree. ASCII includes the grave accent `, circumflex accent ^ and tilde ~ and you could backspace and print it over a letter (on a real teletype, not on a video screen). The single-quote/apostrophe character 0x27 ' did triple duty as an acute accent and in some old fonts it looks like a mirror image of the grave accent. The double quote character " could be used as umlaut/diaeresis in a pinch. The double-quote and single-quote characters were also common on typewriters and these did not have separate opening and closing quotes. The underscore character was meant to be overprinted on other text as well, just doing a CR without LF.

@lennartbenschop656 2 months ago

@greggoog7559 That has nothing to do with ASCII as such. Compose combinations are substituted with codepoints for accented letters (formerly in your favorite 8-bit code page, today in Unicode). I was talking about old printers that only had 7-bit ASCII and could print a letter, then backspace then the accent.

@AdrianDerBitschubser 2 months ago

11:50 The Rest contains one really important character: The ESC, or Escape-Character. It is used with ANSI Escape Codes to generate all the wonderful color and other formatting in terminals even to this day. Maybe that is worth a video.

@dj196301 2 months ago

Subscribed! No dumb-ass stock footage, no tangent shots, just an entertaining and informative chap talking about cool stuff. Looking forward to "Why UTF8 is Actually Very Clever"--unless you've done it and I just haven't seen it. Thank you.

@DylanBeattie 2 months ago

@@dj196301 thank you! UTF-8 is coming in a few weeks. Got some other stuff to talk about first :)

@davidh.4944 2 months ago

I've always liked how caret notation makes clever use of the ascii scheme. If you ever hit backspace in a terminal and see ^H^H^H or cat -A a text file written in windows notepad and see a bunch of ^Ms (or see the programmers use them in comments here), it's because the display has taken the non-printing character, flipped one bit, and is presenting it as its corresponding alphabetic block character. So NUL (00000000) becomes ^@ (01000000), TAB (00001001) becomes ^I (01001001), etc. It also works in reverse to enter these characters, as the Control-C bit in the video explained. Very clever.
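
A small sketch of caret notation in both directions, following the bit patterns given above. The function names `caret` and `control` are made up for the example.

```python
def caret(code: int) -> str:
    """Caret notation for a control code: flip the 0x40 bit and prefix '^'."""
    if not (0 <= code < 0x20 or code == 0x7F):
        raise ValueError("not an ASCII control character")
    return "^" + chr(code ^ 0x40)


def control(key: str) -> int:
    """The code sent by Control plus a key: keep only the low five bits."""
    return ord(key.upper()) & 0x1F


print(caret(0x00), caret(0x09), caret(0x0D))   # NUL, TAB, CR
print(caret(0x7F))                             # DEL flips to '?'
print(control("C"), control("D"), control("["))
```

The same flip explains why DEL is shown as `^?` and why Control-[ produces ESC (27): `[` sits in the same column of the table as ESC, two rows further along.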

@billwall267 2 months ago

Excellent. Thanks.

@foo0815 2 months ago

Thanks for the DEL story!

@dragonfly-7 10 days ago

That was awesome! Quite an excellent wrap-up of lots of things I have been learning in the past 50 years or so. Thanks a lot!!! When I was porting some software to an Amdahl machine back in 1993, I was driven crazy trying to test the s/w (BTW: the compile of pure C code went through without a glitch). I made lots of attempts at entering the license key. After launching the debugger it turned out that a character was missing. Finally the system admin asked which characters were in the license key. It turned out that that was the right question: the '#' (a.k.a. hash or pound) was used as a "DEL"/delete character to 'X' out unwanted input. Typewriter-style software at its best ...

@bishaladhikari9499 2 months ago

Loved every second of it

@MattJoyce01 2 months ago

Some of this I knew, but I didn't realise the deliberate design elements. Good job.

@RaceriEmil 2 months ago

Thanks. That was very informative and insightful. I like your delivery and the small jabs/jokes you put in. I am looking forward to your next video!

@xdcountry 2 months ago

That was great. Excellent tour through the origins. Just incredible.

@niczoom 2 months ago

Great video and very well explained! The point about why certain commands are still in use today and their origins was very interesting. I learned something new, thanks for sharing.

@amarqueze 2 months ago

Very nice video. I have worked with computers since the 80s, and never thought about ASCII. Now I know how a Python progress bar is built, and other clever ideas. Well done Dylan!

@DragoniteSpam 2 months ago

Lol I didn't expect that little shower thought to turn into a whole video, good fun!

@OrigamiMarie 2 months ago

Ctrl-d is still used a little with Bash. If you want to quit a user session fast (and can't be bothered with "exit"), ctrl-d will end it.

@edgeeffect 2 months ago

Reminds me of the MCP in Tron with his "End of Line". Ctrl-d can be used anywhere you want to end a file like `cat - >my_file.txt` - type a line, type another line, ctrl-d

@rogerramjet8395 2 months ago

And CTRL-L to "clear" the screen. (Maps to "Form Feed" … which shifted the paper to the start of the next - blank - page).

@pidgeonpidgeon 2 months ago

Ctrl-D is used a lot on Linux in general. Any time you use a pipe, it takes one process's stdout and connects it to another's stdin, and the convention to say that the stdout is empty is to send Ctrl-D

@0LoneTech 2 months ago

@@pidgeonpidgeon No, ctrl-d for end of transmission is in the terminal (tty) layer. Between processes end of file is indicated by closing the connection, see shutdown or close system calls. The terminal in cooked mode also permits using ctrl-d to input an unterminated line without ending the file, similar to fflush, or actually transmitting EOT with ctrl-v ctrl-d. More details in e.g. stty(1); try "stty -a".

@cigmorfil4101 2 months ago

It's more than bash. *nix uses ^D to mean EOF. Any program reading from STDIN getting an EOF would exit as it can no longer read any input; eg:
$ cat > hello
World
^D
$ cat hello
World
$
Thus, when you put an EOF (as the first character) to bash, it gets an EOF and exits, as do sh, csh, tcsh, etc.

@OranCollins 2 months ago

I've always loved your talk on ascii. Love seeing more stuff from your brain! keep it commin!

@lostcarpark 2 months ago

You skipped over 16-31 very fast. I think the Escape character at least deserves a mention! You mention Morse code, but there were several other digital codes that predate even computers. Baudot was developed in France in the 1870s for telegraph machines as a 5-bit digital code. The early consoles used a piano-like keyboard, and required operators to press keys together to make chords, so the code was designed to be easier for operators, with more common letters in single bit positions, and even the numbers weren't continuous. This was later adapted into Murray code, in the early 20th century, with the development of teletype terminals and teleprinters that let operators use a QWERTY style keyboard. As they were mechanical, the code was designed to minimise wear on the machinery. Finally, fully electronic machines started appearing in the 1930s, leading to the development of ITA2 (which at least put the numbers back in a contiguous block). Having been developed for one purpose and evolved and tweaked for others, the code was quite messy, so we can probably be grateful that the designers of ASCII decided to go with a clean sheet design. There probably is a universe in which they decided to take Baudot/ITA2 and extend it into a 7 bit code. ASCII effectively has four 5-bit "pages". I could imagine taking the "letter" and "figure" modes of ITA2 as two of those pages, then adding lower-case and control codes as the other two. Then, your video would be explaining why the ASCII code letters weren't in alphabetical order.

@BradHouser 1 month ago

Fun Fact: Some of us remember the key-strokes Ctrl-S and Ctrl-Q. They are the ASCII codes to stop and resume display output. They use the codes for Device Control 1 (ASCII 11 Hex) and Device Control 3 (13 Hex) to tell the sending device to stop sending data.

@Dominik-K 2 months ago

Thanks a bunch for this video. I've known most of these things already, but in my programming career knowing those fundamental bit layouts and tricks has been so valuable for writing efficient and understandable code

@andythebritton 2 months ago

This seems to be an abridged version (or possibly the first episode) of Dylan's 'No such thing as plain text' talk, which is well worth a watch.

@rabidbigdog 2 months ago

Good lawd, this was awesome. Kinda hilarious how everyone else tried to ensure IBM was out there in the wind.

@bread8070 2 months ago

One more thing, following on from how upper and lower case are separated by a single bit: look at the number keys on a keyboard and the symbols on them. Starting from 1 you’ll notice the codes for the numbers and symbols are also separated by a single bit. It goes a bit wrong about half way along, but on old keyboards (pre IBM PC) this usually works for the whole set. Now look at the keys for the non alphabetic symbols in those two alphabetic ‘blocks’. You’ll find the symbol in the lower case block is on the same key as the equivalent symbol in the upper case block. Thus, the symbols and numbers on most keys differ only by a single bit. Why? Because taking a keyboard scan code and converting it to ASCII requires a bunch of code and a look up table. Old computers were very slow and had very little memory. So old keyboards generated ASCII codes in hardware, to be returned to the processor. Arranging the keys so the symbols on them were one bit apart made the hardware much simpler. To be fair, it’s probably right to say that the ASCII codes were derived from existing typewriter layouts. So it’s actually the ASCII code ordering being chosen to match the keyboard layout rather than the layout being designed to match the ASCII. But that just makes the ASCII design even smarter. (And I suspect the same is true for teletypes and the symbol pairings on the hammers - which were probably inherited from typewriters anyway).
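
The single-bit pairing described above can be checked directly. The sketch below assumes the old "bit-paired" arrangement the comment refers to; the function names are invented here, and modern PC layouts only match the first half of the digit row.

```python
def shifted_digit(ch: str) -> str:
    """On a bit-paired keyboard, Shift flips the 0x10 bit of a digit key."""
    if not "1" <= ch <= "9":
        raise ValueError("expected a digit key 1-9")
    return chr(ord(ch) ^ 0x10)          # '1' (0x31) <-> '!' (0x21)


def other_case_symbol(ch: str) -> str:
    """Symbols in the two alphabetic blocks pair up through the 0x20 bit."""
    return chr(ord(ch) ^ 0x20)          # '[' (0x5B) <-> '{' (0x7B)


digit_row = "".join(shifted_digit(d) for d in "123456789")
bracket_row = "".join(other_case_symbol(s) for s in "@[\\]^")

print(digit_row)
print(bracket_row)
```

The first five shifted digits match a modern US keyboard and the rest do not, which lines up with the "goes a bit wrong about half way along" remark.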

@ke9tv 2 months ago

There was also a design for conversion between EBCDIC and ASCII that required only a handful of transistors. The two standards were developed together. (IBM 026 and 029 card code preceded EBCDIC.)

@probablypablito 2 months ago

Incredible video!

@lennartbenschop656 2 months ago

Between Morse code and ASCII there was also ITA2 (sometimes incorrectly called Baudot code), a five-bit code for mechanical teletypes. It used control codes (letters and figures shifts) to switch between letters and digits/punctuations. ASCII still has SO/SI control codes to make it possible to temporarily switch to a different character set. ITA2 has a Null character, CR and LF and even Bell and "Who are you" (similar to the ENQ control code in ASCII).

@agranero6 2 months ago

It is mostly forgotten that we have SOH, STX, ETX, EOT, ENQ, ACK, SYN, ETB, FS, GS, RS, US, and particularly EM: end of medium. ASCII was primarily designed for data transmission, like Baudot, and not for use on the computers themselves (like memory and files), as the very name states: "for Information Interchange". It is interesting to analyze those systems by their purpose (a teleology if you want): Morse made the most used characters shorter (he went to a printing press and looked at the size of the type cases, the most common were bigger, and yes, this is why we say uppercase and lowercase); Baudot was first designed to minimize the wear in the mechanical parts of the telegraph (not the modern Baudot); and ASCII, well, we see a few hints of a protocol attached to a machine in those mentioned and in DC1, DC2, DC3 and DC4. I always wonder if it was used this way or if that part of the standard was simply ignored. Yeah, a teleprinter used many of them, but certainly not FS, GS, RS and US: they are used for sending files, not only inside of files. You do not need FS inside a file (except maybe a file like a TAR) but you need it on a data stream that has several files, like a paper tape, a magnetic tape or something like that.

@KhalilEstell 2 months ago

Amazing video, loved it.

@ib9rt 2 months ago

When I was first introduced to computers in 1977, I used an ASR-33 Teletype complete with paper tape punch/reader. The ASR-33 only had uppercase letters, so it was with a sense of wonder I discovered that some more advanced terminals could also do lowercase! And everyone wrote the obligatory program that scanned through codes 0 to 127 and printed them out to see what they would do. Sending a string of ^G characters to an ASR-33 produced a sound never equaled by later devices, especially since they never seemed to insert a gap between the beeps.

@dimitrioskalfakis 2 months ago

useful and well presented.

@threee1298 2 months ago

New to the channel, this is wonderful

@gcasar 2 months ago

so happy i got this as a suggested vid

@clasqm 2 months ago

ASCII 27 still maps to the Escape key.

@briansepolen4917 2 months ago

One great thing about these blocks described is that one can see that like using Ctrl-C for ASCII 3 (ETX), one can also use Ctrl-[ (ESC) instead of lifting hands off the home row for Escape. Great for increasing TUI speed and efficiency.
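Why Ctrl-[ is the same as Escape can be sketched in a couple of lines: the Control key traditionally keeps only the low five bits of the key's code. The helper name `ctrl` is an illustrative assumption, not a standard function.

```python
def ctrl(key: str) -> int:
    """Return the control code traditionally sent for Ctrl+key.

    The key's code is reduced to its low five bits, which maps the
    0x40-0x5F block ('@', 'A'..'Z', '[', ...) onto control codes 0-31.
    """
    return ord(key.upper()) & 0x1F


if __name__ == "__main__":
    for key in "CDGHIJM[":
        print(f"Ctrl-{key} -> {ctrl(key)}")
```

So Ctrl-C gives 3 (ETX), Ctrl-D gives 4 (EOT), and Ctrl-[ gives 27 (ESC).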

@nurmr 2 months ago

Yep, ESC is essential for CSI (and SGR in particular), so without it there would be no ANSI terminal colors!

@flamewingsonic 1 month ago

You missed one very important use of characters in the control block: character 27 (ESC) is used by terminal emulators as part of the "control sequence introducer" ("CSI") to do things such as changing foreground/background color, setting bold/italics/underline, etc. Although this is more prevalent in the UNIX world, even DOS (and the Windows command prompt) had a device driver (ANSI.SYS) supporting these ANSI escape codes.
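For anyone who hasn't seen one spelled out, here is a minimal sketch of building a CSI sequence by hand. The helper names `sgr` and `styled` are hypothetical; the parameter values used (0 reset, 1 bold, 4 underline, 31 red foreground) are the common SGR ones, and whether they render depends on the terminal.

```python
ESC = "\x1b"     # ASCII 27
CSI = ESC + "["  # control sequence introducer


def sgr(*params: int) -> str:
    """Build a 'select graphic rendition' sequence, e.g. ESC [ 1 ; 31 m."""
    return CSI + ";".join(str(p) for p in params) + "m"


def styled(text: str, *params: int) -> str:
    """Wrap text in the given attributes and reset afterwards."""
    return sgr(*params) + text + sgr(0)


if __name__ == "__main__":
    print(styled("warning", 1, 31))  # bold red on most terminals
    print(styled("note", 4))         # underlined
```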

@jovetj 2 months ago

Excellent video!

@chrisd561 2 months ago

Great video!

@ChannelSho 2 months ago

Another neat thing about the way the digits are organized in ASCII is if you convert it to hex, you just look at the lower half and you'll get the number. Also I like how the alphabet characters start with bit 0 as 1, because it makes more sense to use that A = 1 rather than A = 0.
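Both observations come down to masking off the high bits. A quick sketch, with made-up helper names:

```python
def digit_value(ch: str) -> int:
    """'0'..'9' are 0x30..0x39, so the low nibble is the digit itself."""
    return ord(ch) & 0x0F


def letter_position(ch: str) -> int:
    """'A' is 0x41 and 'a' is 0x61, so the low five bits give A=1 .. Z=26."""
    return ord(ch) & 0x1F


if __name__ == "__main__":
    print([digit_value(c) for c in "0179"])
    print([letter_position(c) for c in "AaZz"])
```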

@KX36 2 months ago

The Device Control characters are still very important for configuring barcode scanners. How do you change the settings of a barcode scanner about e.g. whether or not to insert a or or nothing after scanning a barcode, you send combinations of device control characters followed by alphanumerics. Exact combinations are device specific. Also, we just last year migrated away from a 1980s unix program (still a very popular program) that uses a database of literal ascii strings, each field separated by the Record Separator character.

@philipoakley5498 2 months ago

I remember doing port-a-punch cards in EBCDIC for my first computer programmes at grammar school! 10-6-8 everyone (or was it 11-6-8;-).

@OhhCrapGuy 2 months ago

I've actually used 0x1F instead of commas when I needed to save something with the sheer simplicity of a CSV file while not having to figure out the logic of how to handle data with commas or quotes in them. Works great. You know, since that's what it's for, haha
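A rough sketch of that idea. The comment only mentions 0x1F (unit separator) in place of commas; using 0x1E (record separator) between rows instead of newlines is an extra assumption here, as are the function names. It only works if the data itself never contains those two bytes.

```python
US = "\x1f"  # unit separator: between fields
RS = "\x1e"  # record separator: between rows (assumption, see above)


def dumps(rows: list[list[str]]) -> str:
    """Join fields with US and rows with RS; no quoting or escaping needed."""
    for row in rows:
        for field in row:
            if US in field or RS in field:
                raise ValueError("field contains a separator character")
    return RS.join(US.join(row) for row in rows)


def loads(text: str) -> list[list[str]]:
    """Inverse of dumps()."""
    if not text:
        return []
    return [record.split(US) for record in text.split(RS)]


if __name__ == "__main__":
    rows = [["Smith, John", 'said "hi"'], ["multi\nline", ""]]
    assert loads(dumps(rows)) == rows
    print(repr(dumps(rows)))
```

Commas, quotes and even newlines pass through untouched, which is the whole appeal.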

@unvergebeneid 2 months ago

Thanks for bringing the GIF/JIF debate to EBCDIC ;D

@DylanBeattie 2 months ago

The first c in EBCDIC is pronounced like the c in "Pacific Ocean" - what's the problem? 🤣

@unvergebeneid 2 months ago

@@DylanBeattie ;p luckily I have yet to see someone argue that those 256-color images are pronounced "SHIF" :D

@Colaholiker 2 months ago

I don't normally comment on the clothing style of YouTube creators - but that t-shirt rocks. 🤣

@luserdroog 2 months ago

I like this, but what about the earlier threads like Jacquard Looms? There's some fascinating stuff in the first APL books (I forget if it's in A Programming Language or Automatic Data Processing) about how to design encodings for punch cards with various numbers of holes.

@KX36 2 months ago

I got distracted at 3:50 and reimplemented morse code as a canonical Huffman code. By hand, in Excel, for fun. 😅 Each character is 3-9 bits long but it's a binary prefix code so no need for gaps in transmission.

@BradHouser 2 months ago

My first programming was over a dialup teletype at 110 Baud or 10 characters per second. I was in high school in the '70s and dial up time share systems running BASIC cost $6.00 per hour, so connect time was precious. You wrote your program offline on paper, then entered it on the teletype, punching it on tape as you typed, and if you made a mistake, the DEL key was like digital White out. Of course, it did not speed up data transmission. Once you had it all typed onto paper tape, you dialed the number with a Touch-Tone keypad, logged in and then played the paper tape back to upload your program. Then you ran it, you could also renumber, and list it back and re-punch it for later. When I told my mom I needed money to learn BASIC programming, she asked what I did on the computer. I told her games. I love her: she didn't complain. I became an Electrical Engineer/Computer Science guy.

@BradHouser 2 months ago

One of my friends' dads had a 300 baud terminal/printer, and we used to dial up GE's free modem line and just print out stuff in order to watch it work.

@JamieBainbridge 2 months ago

Ctrl+d is still commonly used on Linux. It's the way to log out of a shell, and also the way to get out of a REPL like Python.

@Squossifrage 1 month ago

4:51 While eight-bit bytes were already common when work on ASCII began in the early 1960s, they did not become ubiquitous until the mid-to-late 1970s.

@sheridenboord7853 2 months ago

Great talk, thanks. I always suspected DEL, because how it sat in the ASCII table didn't look right: a control character all by itself, as if it was an afterthought. So a program reading from a stream would just ignore DEL characters.

@herbie_the_hillbillie_goat 2 months ago

Love the DL inspired shirt.

@mag-icus 2 months ago

You missed the cleverness behind codes 33-41. These punctuation marks come in the same order as they do on the number keys on an (American) keyboard; this means that, similar to how lower case letters were converted to upper case by resetting a single bit (toggled by the shift key), the same was actually true of pressing shift + a number key.

@Tweekism86 2 months ago

7:12 Speak for yourself! I still use Control-D, to close terminals and exit SSH sessions, quit python or node.js and the like. Edit: Love the video btw, can't wait for the next one :)

@SirusStarTV 2 months ago

On Windows, the Python REPL only accepts ^Z, and the Enter key needs to be pressed for it to work.

@Tweekism86 2 months ago

Dammit Windows, this is why we can't have nice things!

@0LoneTech 2 months ago

@@Tweekism86 In this case you can blame CP/M, in particular where file length in bytes was not recorded.

@__christopher__ 2 months ago

Control character 4 (EOT), that is, Ctrl-D, still lives on in terminal emulators of Unix-derived systems like Linux as the end-of-file character (although technically it's the flush-input-buffer character, but returning an empty input is interpreted as end of file on Unix-derived systems, therefore it effectively acts as end-of-file for terminal input and also is commonly referred to as such; the difference can be seen if you try to use it on a non-empty line).

@notthedroidsyourelookingfo4026 2 months ago

8:18: Dylan taking a stance on tabs vs. spaces 😂

@rollinwithunclepete824 2 months ago

Very interesting! Thank you

@andrewjameswelch 2 months ago

Great vid, thanks. A follow up vid could be a similar explainer about how utf-8 uses multiple bytes and what happens when that is read using a single byte encoding.

@SteveOnTheInterweb 2 months ago

More about the ASCII graveyard, please! For instance, RS Record Separator, now used in application/json-seq format to separate JSON objects, e.g. in a streaming event log that will never finish. Lots of goodies in the graveyard...
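A minimal sketch of that framing, as I understand application/json-seq: each JSON text is preceded by RS (0x1E) and followed by a line feed. The function names are made up, and a real parser would also need to cope with truncated records, which this one does not.

```python
import json

RS = "\x1e"  # ASCII record separator


def encode_seq(objects) -> str:
    """Frame each JSON text as RS + JSON + LF."""
    return "".join(RS + json.dumps(obj) + "\n" for obj in objects)


def decode_seq(text: str) -> list:
    """Split on RS and parse each non-empty chunk as JSON."""
    return [json.loads(chunk) for chunk in text.split(RS) if chunk.strip()]


if __name__ == "__main__":
    events = [{"event": "login", "user": "ada"}, {"event": "logout", "user": "ada"}]
    stream = encode_seq(events)
    print(repr(stream))
    assert decode_seq(stream) == events
```

Because `json.dumps` escapes control characters inside strings, a raw 0x1E never appears within a record, so splitting on it is safe.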

@orterves 2 months ago

Good video, nice refresher of a topic I haven't really thought about directly since university - except for bloody Windows crlf when working with cross platform code

@BobFrTube 2 months ago

The extra bit also provided parity. CR and LF were separate because going to the next line on a teletype took two character times. Multics chose LF as the NL because CR could be considered as not doing anything. _ was originally a left arrow.

@aaronbredon2948 2 months ago

ASCII being 7 bit covered most of the generic characters including accented characters via overprinting. If the inventors had wanted to include all possible characters across the world, they would have needed at least 2 bytes per character to be able to handle Chinese and Japanese ideographs. Leaving the remaining 128 values of a byte unspecified allowed different countries to add country specific characters. In the IBM PC world these were implemented as “code pages”, and were a bit of a problem when talking between countries. Unicode eventually resolved this communication problem, but it requires up to 32 bits or 4 bytes per character to encode the over 140,000 characters, and there are visually identical Unicode characters that are logically different, which makes it easier for scammers to fake internet addresses. And something as large as Unicode wasn’t practical in the early days of computing, when every single byte saved was significant. EBCDIC had the advantage that numbers were readable on hex crash dump printouts, but numbers and letters shared the same character codes (C1 represented either A or positive signed 1, depending on what the data type was.)

@DrCoomerHvH 2 months ago

I like how you've recycled some of the points from your talks into their own little videos, especially when the video topics are directly interactive with the community or fans.

@cfhay 2 months ago

EOT (End of Transmission) is Ctrl+D and can still be used today. Ctrl+D in Linux (and other similar systems) will flush the current buffer. If this buffer is empty, it will result in a zero-byte read. A zero-byte read means end of file/end of input in most contexts. For example, using it at a shell prompt will cause the shell to exit with exit code zero. If that was a login shell, it causes a logout. I use it every day. Also ESC is widely used to decorate Linux console output (colors etc).
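The "zero-byte read means end of input" convention can be seen without a terminal. This sketch uses a pipe as a stand-in for the tty (an assumption made to keep it runnable anywhere): closing the write end plays the role of Ctrl+D on an empty line.

```python
import os


def read_until_eof(fd: int) -> bytes:
    """Read from a file descriptor until a read returns zero bytes."""
    chunks = []
    while True:
        chunk = os.read(fd, 1024)
        if not chunk:  # zero-byte read: treated as end of input
            break
        chunks.append(chunk)
    return b"".join(chunks)


# A pipe stands in for the terminal here.
read_fd, write_fd = os.pipe()
os.write(write_fd, b"hello\n")
os.close(write_fd)  # nothing more to send: the next read returns b""
data = read_until_eof(read_fd)
os.close(read_fd)

if __name__ == "__main__":
    print(data)
```

The reader never sees a byte with value 4; it only sees a read that returns nothing, which matches the description above.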

@zeitgenosse 2 months ago

I'm very much looking forward to the next episode (kohuept and èÁÒÉ ðÏÔÅÒ).

@williamlyerly3114 2 months ago

As one who lived and died on TTY33/35 devices this was very interesting. Programmed in SLEUTH (Univac assembler) later BAL. Lived in ASCII land.

@aDifferentJT 2 months ago

Ctrl-D in the terminal is great, it will exit most REPLs or shells

@mag-icus 2 months ago

Also, ctrl-d is still used to mark end of streams on unix. So it is not just ctrl-d and ctrl-g that have survived until this day.

@chri-k 2 months ago

I can't believe you just called ^[ and ^D unimportant

@darrennew8211 2 months ago

Fun facts: The ASCII underscore character was originally a left-pointing arrow, which is why Smalltalk (from around 1976) uses "_" as the assignment operator, and why Pascal (designed to work with EBCDIC also) uses ":=" instead, to look like an arrow as close as you can get on punched cards. EBCDIC has the same sort of bitwise feature for letters that the upper/lower trick in ASCII uses, except it's designed for punched cards. So with a card 11 columns high, the letters are in "contiguous" numbers if you ignore the proper holes on the card rather than ignoring the proper bits in the byte.

@__christopher__ 2 months ago

Actually, Pascal had several digraphs to be used when certain characters were not available. For example, Pascal comments were written in curly braces {like this}, but in case curly braces were not available, you could also use parentheses with asterisks (*like this*). Now the only character used by Pascal that was not available in ASCII was the left arrow, whose digraph replacement was :=, which is why that one became commonly known as the Pascal assignment operator.

@cigmorfil4101 2 months ago

@@__christopher__ Were '" not available? "" for assignment, eg: RA -> VARLOC means the contents of the A register are stored in the location pointed to by VARLOC (effectively a variable).

@__christopher__ 2 months ago

@@cigmorfil4101 that's already the less-equal operator.

@cigmorfil4101 2 months ago

@@__christopher__ Interesting how all the BASICs I've used overload the '=' operator to mean both "assign" and "compare equal" - the meaning based on context. How about "<-"? (That looks more like an arrow than ":=".)

@__christopher__ 2 months ago

@@cigmorfil4101 that is already a less-than followed by a unary minus operator. Also, := was already in use in mathematics for definitions, so it fits quite well. Note also that a proper assignment statement in BASIC was LET var = value. A lot of BASIC interpreters (in particular Microsoft's) allowed omitting the LET though.

@u9vata 2 months ago

The ESCAPE ASCII character is often used in various APIs. For example, with the old BIOS interrupts for reading the keyboard you can grab either "scan" codes or ASCII codes. Most people who wrote games went for scan codes, as did much other software, but even though no proper ASCII code is returned for the arrow keys, for example, the Escape key does generate the ESC character properly in the BIOS.

@pepe6666 2 months ago

awesome content. subscribbbed. also props for the def leppard shirt

@dfs-comedy 2 months ago

Ctrl-D is still "End-of-File" in UNIX tty land.

@darrennew8211 2 months ago

Technically not. It's "send the buffered output without sending the ctrl-D". If there's no buffered output, the program gets a read length of zero, which is treated as EOF. But if you type something first and hit control D, it just sends what you typed.

@timothynewton5231 2 months ago

I'd love to hear more about the characters in lines 16 through 31 and their uses and if there are still any uses for them today.

@jensschroder8214 2 months ago

The Baudot code is older than ASCII. A 5-bit code for teleprinters. But this code has the disadvantage that it does not accommodate all 26 letters and 10 digits. That's why there are two shift codes and two different characters per code. The 7-bit ASCII code accommodates the Latin alphabet, but lacks special characters used for French, Spanish, German and other languages. Therefore an 8-bit ASCII with code pages was used. Other languages cannot be represented. Unicode is used today. The first 128 characters correspond to 7-bit ASCII.
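The last point is easy to check, at least for UTF-8, the most common Unicode encoding. A short sketch (the helper name `utf8_len` is made up): ASCII text encodes to the same bytes, and everything else takes two to four bytes.

```python
def utf8_len(ch: str) -> int:
    """Number of bytes a single character occupies in UTF-8."""
    return len(ch.encode("utf-8"))


# All 128 ASCII code points, as text.
ascii_text = bytes(range(128)).decode("ascii")

# Encoding that text as UTF-8 gives back exactly the same bytes.
same_bytes = ascii_text.encode("utf-8") == bytes(range(128))

if __name__ == "__main__":
    print("ASCII is unchanged in UTF-8:", same_bytes)
    for ch in ["A", "é", "€", "😀"]:
        print(repr(ch), f"U+{ord(ch):04X}", utf8_len(ch), "byte(s)")
```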

@cigmorfil4101 2 months ago

Before 8-bit characters came into use, though, the 8th bit was used by serial devices as a parity check (leaving only 7 bits for characters), so devices such as printers had different regions programmed into them, which could be selected via a code sequence and substituted the locale characters for standard characters. E.g. a printer set to UK would substitute '£' for '#', so that you sent "#5,899.99" to it and it printed "£5,899.99". Working with an Apple ][ with a printer set to UK, listings would include things like PR£3 instead of PR#3.

@AutomatedChaos 2 months ago

While working in IT for more than 2 decades now, it surprises me that developers try to invent character separated values (csv) for columnar data again and again while there are literally 4 ASCII characters reserved to handle these cases. But no, let's use the comma, semicolon, tab (\t), pipe, tilde or even the |~| combination as separator with all problems that can occur like escaping, quoting and in-field newlines.

@DylanBeattie 2 months ago

10 years working tech support made me realise that if regular folks can't read it on their screens and type it on their keyboards, they're not gonna use it... and, honestly, I think they're right. We wanna bring back ASCII field and record separators, we should be putting them on keyboards.

@jovetj 2 months ago

Yep. Control characters are generally un-keyable and non-displayable. Not very practical for most people.

@ABaumstumpf 2 months ago

@@jovetj "Control characters are generally un-keyable and non-displayable. " No, they were simple control-character - literal keys on the keyboard, and they are very much displayable as even MSWord can show them.

@cigmorfil4101 2 months ago

MSWord might, but Notepad doesn't (other than as an undecipherable character as to which control code it actually is - try looking at a PDF using notepad) - CSV being a plain text format, Notepad, a plain text editor, would be _the_ tool for the job, not MSWord.

@ABaumstumpf 2 months ago

@@cigmorfil4101 "Notepad, a plain text editor, would be the tool for the job, not MSWord." notepad is just a scratch TEXT-editor and NOT for working with csv. I mentioned word cause most programs do display them correctly. And notepad is just far far off from being the correct program for anything. for CSV you would use a program that either can actually deal with ASCII (so not notepad) or better - a program designed for handling tabular data.

@gwaptiva 2 months ago

Thanks; now I know how that blasted vertical tab got into that text field that then didn't serialize to XML CDATA, but in fact errored out completely

@sashatz3387 2 months ago

Thank you for this! I've been working with a lot of regex, and been wondering a lot about this. Would you have any resources for further reading?

@gcewing 2 months ago

It's not quite true that CP/M didn't have device drivers. It had a BIOS that contained all the machine-dependent code for dealing with I/O devices. Conversion from a single end-of-line character to CRLF could have been done there, but for whatever reason it was chosen not to.

@kevskevs 2 months ago

Praised be the Algorithm ... it happens VERY rarely that I want to upvote a video and notice that I have already done so. Guess I'll have to subscribe ...

@rascta 2 months ago

Sadly lost and not mentioned here, the FS, GS, RS, and US characters (28-31). Meant to serve as distinct bytes that wouldn't be part of text data, and therefore could easily be used to delineate it. But alas instead we just totally forgot they existed and therefore ended up with formats like CSV, which gave double meaning to commas, newlines, quotes, etc. With special escaping rules and incompatibilities between systems. And we've spent generations figuring out how to handle that properly and handle all the edge cases. Just because we didn't have and didn't bother to come up with a few symbols to represent those 4 characters. Some of those other low code points were perfect for networking, sending a single byte to communicate something that now we need an entire packet to communicate the same thing.

@darrennew8211 2 months ago

The number of self-taught computer programmers who reinvent the wheel because they were never taught what already works always astounds me.

@cigmorfil4101 2 months ago

Interestingly, Pick uses characters 252-254 as markers in dynamic arrays (and filed items) between the "elements":
FE - 254 - Attribute mark
FD - 253 - Value mark
FC - 252 - Sub Value mark
The whole dynamic array is a string with the elements separated by the marks. If an element is required that doesn't exist, Pick adds enough of the relevant marks to create it when setting the value of the "element", or returns a null when reading it. This means you get to access things like:
Data = ''
Data = 'attr 1'
Data = 'at 2, v1, sv 3'
Data = 'at 2, v 3'
Data = 'at 4, v2'
CopyData = Data
Element2 = Data
The strings Data and CopyData contain:
attr 1[am][sm][sm]at 2, v 1, sv 3[vm][vm]at 2, v 3[am][am][vm]at 4, v 2
And Element2 contains:
[sm][sm]at 2, v 1, sv 3[vm][vm]at 2, v 3
where [am] is char(254), [vm] is char(253) and [sm] is char(252). Pick is a multi-value DBMS OS with all fields of variable length and type (though as the whole is stored as a string, they're effectively all strings which are converted to the relevant type at time of use).
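The mark-padding behaviour described here can be imitated in a few lines of Python. This is only a sketch: the function names are hypothetical and are not Pick syntax, and since the subscripts on the assignments in the comment appear to have been lost, the (attribute, value, subvalue) positions used below are inferred from the resulting string it quotes.

```python
AM, VM, SM = chr(254), chr(253), chr(252)  # attribute, value, sub-value marks


def _get_part(text: str, sep: str, index: int) -> str:
    """Return the 1-based element of text split on sep, or '' if missing."""
    parts = text.split(sep)
    return parts[index - 1] if index <= len(parts) else ""


def _set_part(text: str, sep: str, index: int, new: str) -> str:
    """Replace the 1-based element, padding with empty elements as needed."""
    parts = text.split(sep)
    parts += [""] * (index - len(parts))
    parts[index - 1] = new
    return sep.join(parts)


def assign(data: str, new: str, attr: int, val=None, sub=None) -> str:
    """Set attribute / value / sub-value and return the updated string."""
    if val is None:
        return _set_part(data, AM, attr, new)
    attr_text = _get_part(data, AM, attr)
    if sub is None:
        attr_text = _set_part(attr_text, VM, val, new)
    else:
        val_text = _set_part(_get_part(attr_text, VM, val), SM, sub, new)
        attr_text = _set_part(attr_text, VM, val, val_text)
    return _set_part(data, AM, attr, attr_text)


def extract(data: str, attr: int) -> str:
    """Return a whole attribute, marks and all."""
    return _get_part(data, AM, attr)


# Positions inferred from the result string in the comment (an assumption).
data = ""
data = assign(data, "attr 1", 1)
data = assign(data, "at 2, v 1, sv 3", 2, 1, 3)
data = assign(data, "at 2, v 3", 2, 3)
data = assign(data, "at 4, v 2", 4, 2)
element2 = extract(data, 2)

if __name__ == "__main__":
    show = {AM: "[am]", VM: "[vm]", SM: "[sm]"}
    print("".join(show.get(c, c) for c in data))
    print("".join(show.get(c, c) for c in element2))
```

Printed with the marks made visible, this gives the same two strings as in the comment.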

@cigmorfil4101 2 months ago

The use of CSV is to _avoid_ non-printing control characters (other than a line break) so that the data is easily edited as plain text by a plain text editor. A plain text editor generally only understands line breaks; how control characters are displayed depends upon their programming: some may display as ^c, some may display a '?' regardless of the character, some may let the display driver decide what to do (hence the smiley face, musical notes, etc, that the original IBM PCs displayed for control characters). As there was no consensus on how to handle control codes, CSVs avoided them and stuck to plain text, using commas (hence the name: _Comma_ Separated Values), requiring some sort of escape for commas - enclose a field with commas within quotes - and a mechanism to handle the quoting character within fields.

@darrennew8211 2 months ago

@@cigmorfil4101 I always found this argument bizarre. ASCII was invented well before any "plain text editor" was, so saying "we changed this because plain text editors couldn't handle ASCII" sounds like working around the problems in tools rather than just fixing the tools. There was also an image format called NetPBM which was great, and one of the options was to represent all the bytes with decimal digits. Like, you could read it with BASIC even. Red would literally be "255 0 0" with nothing other than ASCII digits and spaces.

@darrennew8211 2 months ago

@@cigmorfil4101 Wow. It has been *ages* since I heard anyone else who ever used Pick. :-) Blast from the past there.

@brnddi 2 months ago

I actually didn't know that CLI progress bars are done by just printing a carriage return. I mean, it's painfully obvious in hindsight, but I never really thought of it even though I've written my fair share of CLI apps. I kinda always just assumed they're using some special API to clear the last line or something.
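For the curious, a bare-bones sketch of the carriage-return trick. The names `render_bar` and `run` are made up, and real progress-bar libraries do a good deal more (terminal width detection, clearing leftover characters, and so on).

```python
import sys
import time


def render_bar(done: int, total: int, width: int = 20) -> str:
    """Return one frame of the bar, starting with CR to go back to column 1."""
    filled = width * done // total
    return "\r[" + "#" * filled + "." * (width - filled) + f"] {done}/{total}"


def run(total: int = 5, delay: float = 0.2, out=sys.stdout) -> None:
    """Redraw the same line repeatedly, then finish with a newline."""
    for done in range(total + 1):
        out.write(render_bar(done, total))
        out.flush()  # without this the frames may sit in the buffer
        time.sleep(delay)
    out.write("\n")


if __name__ == "__main__":
    run()
```

Each frame overwrites the previous one because CR moves the cursor back without advancing to a new line.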

@cigmorfil4101 2 months ago

Try using a glass teletype that only understands: ^G (sound bell), ^H (non destructive back space), ^J (scroll up a line) and ^M (goto column 1), and only adds stuff to the bottom line of the screen - happy days...

@TheEvertw 2 months ago

The ESC code is still MUCH used by all sorts of programs across platforms. You shouldn't have skipped over that one. And you might have mentioned ^S and ^Q, which are the flow control characters. If you ever press ^S in e.g. a UNIX shell, it will hang until you press ^Q. Which is unfortunate, as many programs have re-purposed ^S as a shortcut for "Save".

@BradHouser 2 months ago

The eighth bit was often used for parity checking.
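As an illustration, here is one way the eighth bit can carry parity. This sketch assumes even parity with the parity bit in the top position; real links also used odd, mark or space parity, and the function names are invented for the example.

```python
def add_even_parity(code: int) -> int:
    """Set bit 7 so that the total number of 1 bits in the byte is even."""
    if not 0 <= code < 128:
        raise ValueError("not a 7-bit ASCII code")
    parity = bin(code).count("1") & 1  # 1 if the 7-bit code has odd weight
    return code | (parity << 7)


def check_even_parity(byte: int) -> bool:
    """True if the received byte has an even number of 1 bits."""
    return bin(byte & 0xFF).count("1") % 2 == 0


def strip_parity(byte: int) -> int:
    """Recover the 7-bit character code."""
    return byte & 0x7F


if __name__ == "__main__":
    for ch in "AC":
        sent = add_even_parity(ord(ch))
        print(ch, f"{ord(ch):07b} -> {sent:08b}", check_even_parity(sent))
    # A single flipped bit is detected (two flipped bits would not be).
    print(check_even_parity(add_even_parity(ord("C")) ^ 0x01))
```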

@TheEvertw 2 months ago

"The next two have fallen out of use" You skip over Ctrl-D (End Of Transmission). It is still used A LOT. Most UNIX (and Linux) tools accept Ctrl-D (End Of Transmission) to end a stream, file or connection. It is the recommended way of closing e.g. an interactive Python session, an SSH session, etc, etc, etc.