Code Pages, Character Encoding, Unicode, UTF-8 and the BOM - Computer Stuff They Didn't Teach You #2

Thanks to Carlos Schults for his hard work on the English and Portuguese Subtitles!

Category: Science

Published: 18 Nov 2019

Comments: 211

@ammarsalmi 2 years ago

I hit subscribe automatically after watching this single video. It's the first time I found this channel.

@tomkirbygreen 4 years ago

Never gets old this stuff, I’m trying to debug Japanese PDF generation using an ancient open source library. Hours of fun.

@wolfstelheilm3211 4 years ago

Sir, you're right, not everyone shares the same class, really appreciate what you're doing. I hope your channel grows.

@markfarnsworth8685 3 years ago

Great video! I’ve been designing and building software full-time (mostly large UNIX backend systems) for the last 30 years, and I still learn so much from all of your videos. I love your from-the-ground-up approach. It reminds me of things I’d forgotten, and fills in the details of things I never consciously thought about, but have just used without fully understanding them for years and years. Many thanks for helping humanity!

@raj-we9yr 1 day ago

Thank you very much. You showed us so clearly how this is used in practice. This is a tech topic that has never been demo'ed adequately. Humbly requesting you to share a video on how Unicode text and its encoding is handled in PDF files and displayed/printed.

@GardenFatherReviews 1 year ago

I love your video, your course is more understandable than anyone's. Thanks

@JordonPatrickMears11211988 3 years ago

I've had your tools page bookmarked for a decade, subbed immediately after seeing this video. Top quality 👌

@matteroftouch 3 years ago

I am learning so much, and I love the conversational tone. I did not finish my formal education, but I am pursuing an IT career and this really helps make the task less daunting. Thank you.

@dalevross 4 years ago

Brought back memories. My high school crush's name had an accented e, so Alt-0223 was committed to memory really early :).

@dalevross 4 years ago

0233, just noticed when someone liked it :)

@RichardNobel 3 years ago

@@dalevross I've always been using *ALT+130* for *é* 😁 instead of 0233. Later I started using the Character Map in different versions of Windows (or "Insert Symbol" in Microsoft Word). Then in recent years I've gotten used to switching keyboard/input languages in Windows by pressing ALT+SHIFT (I switch Dutch and English), so I can press the applicable accent key before pressing the letter key. In current versions of Windows 10, it's also possible to use the Windows key + (period), or Windows key + (semicolon), keyboard shortcut to open the emoji/symbols panel. Then click the Omega button to access the symbols, where *é* can be found with the Latin symbols ( Ç ). 👉 pureinfotech.com/insert-symbols-windows-10/ But still the _"old school"_ ALT-codes can be typed faster, of course. ;-)
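
The two Alt codes in this thread reach the same character through different code pages. A minimal Python sketch of that lookup, assuming Alt+NNN goes through OEM code page 437 and Alt+0NNN goes through Windows-1252 (typical for a US or Western European setup; other locales use other pages). The helper name `alt_code` is made up for illustration:

```python
# Assumption: Alt+NNN is looked up in the OEM code page (cp437 here),
# Alt+0NNN in the Windows "ANSI" code page (cp1252 here).
def alt_code(number: int, leading_zero: bool) -> str:
    codec = "cp1252" if leading_zero else "cp437"
    return bytes([number]).decode(codec)

print(alt_code(130, leading_zero=False))  # é  (Alt+130)
print(alt_code(233, leading_zero=True))   # é  (Alt+0233)
print(alt_code(223, leading_zero=True))   # ß  (Alt+0223)
```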

@jvsnyc 3 years ago

@@RichardNobel super-helpful, I hadn't been using Win-.

@SonoHaynombre 2 years ago

Bro this is the cutest thing I've read all day

@SimSimsTECHcrunch 2 years ago

Reading this made me smile ngl

@dennisrasmussen3899 4 years ago

Thanks Scott! I never learned this stuff. It is very useful and you present the material in a very simple way, so it is easily understood. I'll watch every single video you make in this series!

@DeliberateGeek 4 years ago

Great content! It's amazing the things you realize that you either forgot or never knew. I was aware of the various encodings. The BOM, however, I either never knew or completely forgot about. Thanks Scott!

@MahendraKumar-fu7bp 2 years ago

Excellent explanation. Feels like I learned something unique today. Thanks

@theFoodieCyclist 4 years ago

Really cool series! The basics are often overlooked although they are essential building blocks.

@ballafolife19 3 years ago

Thank you for taking the time to share, and enlighten us as well, with your acquired knowledge.

@101orbitaldefence 3 years ago

Well, I for one have never been taught this before, thank you for sharing your knowledge.

@pavelsapehin4308 4 years ago

00:42 Character Encoding
01:26 ASCII
02:19 ASCII: writing an app
05:57 "who decides that this character means A" - understanding what encoding is
07:11 encoding: "another way to think about it"
07:24 notepad: "it looks like crap"
09:11 notepad2: new file & Chinese chars & CMD
11:32 encoding: signature/utf BOM
12:24 unicode website
12:31 windows charmap utility
13:40 encoding: you can potentially lose information
13:58 why BOM is useful
14:50 the app: hard-coding the BOM
16:36 summary

This table of contents was created using the "Smart Bookmarks for RU-vid" Chrome extension. You can import and edit them using this extension. You can install it from the official Chrome Store page (shortened link): smb.page.link/store

@alexandertorres8854 4 years ago

Good stuff Scott, keep them coming. Thank you for all your work. 🙂

@heavycavalry9919 3 years ago

You are the teacher I never got. Thanks for uploading!

@petermcclymont7347 4 years ago

I really enjoyed this video. I was aware of encodings, but this has always been an area of confusion to me so I appreciate it. Keep it up Scott!

@GankablePlayer 7 months ago

Love the title and the series. Thanks for helping the community!

@ruiprincipe3484 4 years ago

Awesome and very useful series. Looking forward to the next videos. Thanks Scott! :)

@ryanrichard4805 4 years ago

Just found this list of videos. Really great. Thanks for this, Scott.

@sanjin1986 3 years ago

Entire series "Computer Stuff They Didn't Teach You" is excellent. Please continue and I wish you good luck ! Idea for future video: "Asynchronous/Synchronous Requests"

@dansanger5340 3 years ago

Back in the 1990s, Microsoft tried to get ahead of the curve by making strings of 16-bit characters the preferred format for its APIs. The hope apparently was that every character in every language in the world would be assigned a unique 16-bit number. This was basically UCS-2. It soon became apparent that there were too many characters in the world to accomplish this. So people came up with UTF-16, which has a single 16-bit representation for most common characters, while other characters are represented by two 16-bit words. UTF-16 is what Windows uses internally. Although strings in this format are typically 16 bits per character, UTF-16 is technically a variable-length encoding, so you don't truly have the advantage of one character per word: you can't reliably index directly into strings, and string length doesn't necessarily correspond to array length. The UTF-32 encoding does have one character per word, but it wastes quite a bit of space and isn't very popular. For both UTF-16 and UTF-32 you also have to be concerned with endianness.

Meanwhile, the internet moved in the opposite direction, with around 95% of all web pages now using UTF-8, in which every character is encoded in 1, 2, 3, or 4 bytes. As of late 2019, Microsoft seems to have reversed course and is now recommending that UTF-8 encoding be used internally for new apps. docs.microsoft.com/en-us/windows/uwp/design/globalizing/use-utf8-code-page

* All the discussion above ignores the fact that you can combine two or more Unicode characters to form characters. For example, you might combine a base Unicode character with an accent Unicode character to form the accented form of that base character. So, you can't really always treat a character in a language as a single UTF-32, UTF-16, or UTF-8 character.
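
The size differences and the combining-character caveat described above can be checked with a small Python sketch. The function name `encoded_sizes` and the sample characters are chosen here for illustration; the little-endian codec variants are used only so that no BOM is counted:

```python
import unicodedata

def encoded_sizes(text: str) -> dict:
    """Return how many bytes each encoding needs for `text` (no BOM)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }

# U+0041, U+00E9, U+20AC, and U+1F600 (outside the 16-bit range,
# so UTF-16 needs a surrogate pair for it).
for sample in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
    print(repr(sample), encoded_sizes(sample))

# Combining sequence: 'e' followed by U+0301 COMBINING ACUTE ACCENT.
decomposed = "e\u0301"
composed = unicodedata.normalize("NFC", decomposed)
print(len(decomposed), len(composed))  # 2 1
```

Both `decomposed` and `composed` render as the same accented letter, which is why counting code points (or bytes) does not reliably count what a reader sees as characters.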

@AnonEMuss-gw8fm 4 years ago

FYI... The Unicode standard says "Use of a BOM is neither required nor recommended for UTF-8". Using a BOM with UTF-8 is almost exclusively a "Windows thing" -- you'll rarely see it on other platforms. Most of the non-Windows world treats text files as UTF-8 by default.
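
For anyone who has to read files that may or may not carry a UTF-8 BOM, a minimal Python sketch: the standard `utf-8-sig` codec drops a leading BOM if one is present and otherwise behaves like plain `utf-8`. The sample bytes are made up for illustration:

```python
import codecs

# A UTF-8 file as some Windows tools write it: EF BB BF, then the text.
data = codecs.BOM_UTF8 + "hello".encode("utf-8")

plain = data.decode("utf-8")         # BOM survives as U+FEFF
stripped = data.decode("utf-8-sig")  # BOM removed if present

print(repr(plain))     # '\ufeffhello'
print(repr(stripped))  # 'hello'
```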

@dentjoener 3 years ago

Man the amount of times I had to forcefully strip off this BOM crap in code is just REALLY REALLY annoying. The BOM should disappear. Just my 2 cents.

@DanielKarbach 3 years ago

Using a byte order mark with an encoding that does not use more than one byte at a time is at best redundant, but honestly really stupid anyway.

@AndersJackson 2 years ago

Actually, the interesting part starts at 0x20, where space is in ASCII (ISO-646). First they started with ASCII (7-bit, as you said). Then came Latin-1 and the rest of the ISO-8859 family (ISO-8859-1 to ISO-8859-16), which are 8-bit. Then came the "final" solution to the encoding problem with Unicode (ISO-10646), where the first 256 characters are Latin-1. UTF-8 is a compression/encoding of Unicode (ISO-10646) that makes an ASCII file a proper Unicode file, but a Latin-1 (8-bit) file will not be treated as a proper UTF-8 file. So UTF-8 is strictly not Unicode; it is a compression of Unicode. ASCII gets through unchanged, but for 128 and higher you need to use UTF-8 coding. That is why the higher part is garbage in your file: UTF-8 needs properly chosen bytes after the first one to work, and you don't have that. And please, don't use a BOM. 🙂

@sarcasmasaservice 4 years ago

Great stuff, Scott, thanks so much! I'd welcome your insights on Int32/Int64 and single- and double-precision IEEE-754. My students would appreciate it as well!

@gabrielleluna2322 3 years ago

Idk who's been in your comments section telling you, you are anything other than awesome but I love your videos and the name because 1)the name feels extremely accurate (I always go to your videos for things I wasn't taught) and 2) As the Bob Ross of coding tutorials you could never deliver a subject poorly.

@courseprovider9871 3 years ago

this channel is a gift for humanity :)

@DB-nl9xw 4 years ago

Keep making videos like this. Amazing stuff. I like how productive you are with VS Code.

@MsJoeshmoo 4 years ago

Thanks for tutoring Computer Zen Mastery skills, Scott.

@hesamkalhor3263 3 years ago

Providing Fantastic Quality Content as always, Thank You Sir.

@sabrinaspecv 3 years ago

I was aware of some of this but never learned it. How have I just now found your videos? I am hooked! Thank you!

@vinodcs80 3 years ago

Wow, clearing up the basic concept with a live example, loved it. Thank you

3 years ago

I love this series. Thank you so much.

@natsaan 4 years ago

Great idea! Looking forward to more

@Eskinsaurus 3 years ago

Thanks for making these, this is a great series.

@RomuloMagalhaesAutoTOPO 3 years ago

Scott, thank you for sharing your time and extensive knowledge with us. For me these tips are amazing.

@justinmarshall1930 2 years ago

Thank you Mr Hanselman, as you might have guessed, I was getting a little antsy trying to understand character encoding before finding this video.

@zaraahuja3144 4 years ago

Great explanation, I was familiar with them still learnt more about tools and background. Please keep up this series

@martinlottering244 2 years ago

I would thumbs up you twice if I could. I don't know where I missed this course but I've been programming for more than 20 years. Just shows you how little that means.

@jackbench5427 3 years ago

I agree, this is a useful series. There's just a bunch of extra information that's so helpful that you wish you'd picked up earlier.

@salsarhangi9963 4 years ago

Thanks Scott! Great stuff.

@mateiionita4543 4 years ago

A really nice explanation! I enjoyed listening to it.

@DavidCSaint 2 years ago

Dug this. I encountered the critical importance of character encoding in the wild and it made my life hell lol. So important to get this in one's head from the jump.

@outerheaven01 4 years ago

Thanks Scott! Great series

@yYggdtyy5433 3 years ago

Such a nice video. This made me subscribe to the channel. Everybody should know these kinds of fundamental concepts.

@TheKarmjit435 4 years ago

Awesome.. enlightening. Best explanation ever.

@LuizBGomide 3 years ago

One correction: in Notepad2, when you change Encoding, you are actually converting the displayed characters to a new code page. So an "á" or a "ç" would still be an "á" or a "ç" (if supported), but in a different code page and (probably) with a different byte value. What you meant to use was Encoding -> Recode. That rereads the file from disk, interpreting it in the selected code page. Also, code page 437 was, I think, used mostly in the US; elsewhere code page 850 was more common since it had more accented characters.
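
The distinction drawn here (convert the characters versus reinterpret the bytes) can be sketched in Python. This is only an illustration of the two operations, not of Notepad2's actual implementation; cp437 and cp1252 are picked as example code pages:

```python
text = "\u00e1 \u00e7"  # "á ç"

# "Change encoding": keep the characters, produce new bytes.
bytes_437 = text.encode("cp437")
bytes_1252 = text.encode("cp1252")
print(bytes_437.hex(" "), "|", bytes_1252.hex(" "))  # a0 20 87 | e1 20 e7

# "Recode": keep the bytes, reinterpret them under another code page.
reread = bytes_437.decode("cp1252")
print(repr(reread))  # '\xa0 ‡' (a no-break space and a double dagger)
```

The first operation preserves what the reader sees and changes the file; the second leaves the file alone and changes what the reader sees.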

@j7ndominica051 11 months ago

Other code pages sacrificed some DOS box character spots used to draw dialogs to get more diacritics in: the half filled cell, connections between double and single lines, and the hamburger menu (math symbol: identical to). If you open an "NFO" in a non-US codepage, some pixels might look out of place.

@awright18 4 years ago

That was da BOM! Seriously very cool even for seasoned people.

@arty2k 3 years ago

🏆 pun

@lucaslra 4 years ago

Great stuff once more!

@mohsenvafa 4 years ago

This video was simply awesome! I love it! Thanks Scott. Can you make one for Networking IP address and VNET from Azure Networking perspective

@ricardomlourenco 4 years ago

Thanks Scott! Can you make one explaining the HTTP protocol? And explaining from the bottom up what happens when the user types a URL in the browser? Also another suggestion: how TCP/IP works at the bit level?

@guy6311 4 years ago

Teach us about the 7 layer burrito. What I meant to say was the OSI 7 layers. I got Taco Bell mixed up with networking.

@poles1c 3 years ago

This was great, thanks!

@qfksspecial7866 4 years ago

Finally a simple explanation of code page.

@jessusrdev 3 years ago

Great stuff, thank you!

@tomthetitan101 1 year ago

I love me a good scott hanselman vid, thank you for putting these together!!!

@tz3p9v0 4 years ago

Great Stuff -- as someone dealing with enterprise legacy systems -- aka IBM Cobol reports ... I often deal with reports formatted for "green bar" paper. So this means reports of a given width and given page length -- typically 132 by 60 .... or is it 133 by 66? The reasons for this and the history may prove interesting and help those that see these legacy reports and wonder why? Also -- while you passed over it briefly -- big-endian and little-endian encoding of data may also prove interesting -- and for those of us who remember -- don't forget to mention octal notation ... Thanks again for passing along the "lore" of computing.

@maheshwarraju2792 3 years ago

Thanks for your lessons. Can we have a video about what makes the difference between each object-oriented program?

@loaderladdy 3 years ago

Idea for a future video Scott... “Little / Big Endian, What is that all about?!” This video took me back to my school days where we wrote code with a pencil on paper. My school of about 1,500 pupils had 2 computers. I’m still not a coder, but this was fascinating. Thank you

@youtube.com-handle 3 years ago

Thanks for the videos. As someone who works with many languages, I find this series very helpful and very informative. Also, what's the path highlight extension inside the console?

@NinjaGhostWarrior123 3 years ago

Great video. Thanks!

@francoisjoly9720 4 years ago

Very informative, even for advanced developers. I couldn't get over the 3 spaces though :P

@coldwire3684 4 years ago

Great video Scott! You da BOM!

@imumamaheswaran 4 years ago

This is Gold!

@pilotboba 4 years ago

Scott... you're doing good work here. Question... how do you get your VS Code terminal fonts etc. to match your Windows Terminal and PowerShell terminals? It seems when I set one up, it doesn't apply to the others.

@warpiggaming4237 3 years ago

awesome channel!

@senadmeskincoding 4 years ago

Good job. Thanks for this.

@jt9277 4 years ago

I was unaware of code pages! Thanks.

@j7ndominica051 11 months ago

Code pages used to be part of everyday computer use, and didn't need to be taught. There were a few Windows code pages for European languages (Western, Central, Baltic), and way more DOS pages because of fewer open spots after the DOS box-drawing symbols. Why does the hex dump ignore the top bit and output ABCD again? I've never seen a hex dump behave like that. It should display the complete set of DOS symbols, or replace some with dots, to aid visual extraction of text. I recall a case where there was an incompatibility even within Unicode, where one non-Windows program did a simplistic conversion of Latin-1 into Unicode by putting all characters into the range 0 to 255, instead of their correct spots for things like double quotation marks down into general punctuation, and the copyright sign where that goes. This resulted in files being unopenable if those characters were in the names; and they put all sorts of crap into filenames today.
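
A small Python sketch of the kind of "simplistic conversion" described above, assuming the source bytes were really Windows-1252 (the sample bytes are invented for illustration). Decoding them as Latin-1 maps every byte N straight to U+00NN, so the curly quotes land on control codes instead of in the General Punctuation block:

```python
# Bytes as a Windows-1252 program would write curly double quotes.
raw = b"\x93quoted\x94"

naive = raw.decode("latin-1")   # byte N -> U+00NN, quotes become C1 controls
correct = raw.decode("cp1252")  # 0x93/0x94 -> U+201C/U+201D

print([hex(ord(c)) for c in (naive[0], naive[-1])])      # ['0x93', '0x94']
print([hex(ord(c)) for c in (correct[0], correct[-1])])  # ['0x201c', '0x201d']
```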

@alwaseem5309 2 years ago

Really cool title, rich content, thanks 👍

@chewmanfoo 2 years ago

I love this video! I found it searching for howtos on dealing with file encoding in golang programming. I have written a program that parses a text file with a very old internal syntax (character 127 is the token delimiter, character 212 is the line delimiter etc.) I found that my program runs great if and only if the file input is a utf-8 file. When I fed it a file encoded ISO-8859-1, it broke. Now, I'm trying to figure out what I should do to fix the problem. I think the first step is to "detect the encoding of the input file" and second might be to "convert from the given encoding to utf-8" before processing can begin. I know that I can encode the file on my mac in the shell using iconv, and it works fine (the program processes the file with no issues after I do this manual conversion). I'm wondering if you have any comments?
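
One possible shape for the "detect, then convert" step, sketched in Python rather than Go. It assumes only two candidate encodings, UTF-8 and ISO-8859-1, and the function name `to_utf8` is hypothetical. Strict UTF-8 decoding serves as the detector because ISO-8859-1 text with bytes above 0x7F is very unlikely to be valid UTF-8 by accident; this is a heuristic, not real detection:

```python
def to_utf8(raw: bytes, fallback: str = "iso-8859-1") -> bytes:
    """Return `raw` re-encoded as UTF-8, guessing the source encoding."""
    try:
        text = raw.decode("utf-8")   # strict: fails on invalid sequences
    except UnicodeDecodeError:
        text = raw.decode(fallback)  # ISO-8859-1 accepts every byte value
    return text.encode("utf-8")

latin1_line = b"field\x7ffield\xd4"  # delimiters 127 and 212 as single bytes
print(to_utf8(latin1_line))          # b'field\x7ffield\xc3\x94'
```

Note that character 212 (U+00D4) is the single byte 0xD4 in ISO-8859-1 but the two bytes 0xC3 0x94 in UTF-8, so a parser that compares raw bytes may see different delimiters depending on the input encoding.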

@dixztube 2 years ago

man i really enjoyed this video. taught me a lot

@Codestud 3 years ago

I went through 4 years of my Computer Science degree not entirely getting to grips with all of this, and it still gives me a headache to this day.

@dmitrypichugin7449 4 years ago

Great video.

@takeshiasahi5494 2 years ago

Thank you sir .... No, they never taught me this, even though we were using it again and again and I asked them again and again. Again ... thanks a lot sir.

@B_dev 2 years ago

thanks scott, very cool

@digitalman2112 3 years ago

Ran into this gem a while back when doing some data science stuff. Opened a text file and was seeing Chinese characters. Open notepad, even in Win 10. Type the following without the quotes "colourcolourcolourcolourcolourcolourcolourcolourcolourcolourcolourcolourcolourcolourcolourcolourcolourcolourcolourcolourcolourcolourcolourcolourcolour". Save it, then open it again with notepad.
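
Why plain ASCII can turn into Chinese characters can be sketched in Python, assuming the reader guesses UTF-16 little-endian for a BOM-less file. Whether a given Notepad version actually makes that guess depends on its detection heuristic; the sketch shows only the byte reinterpretation:

```python
raw = ("colour" * 25).encode("ascii")  # 150 bytes of plain ASCII
misread = raw.decode("utf-16-le")      # pair the bytes up as 16-bit units

print(len(raw), len(misread))          # 150 75
print([hex(ord(c)) for c in misread[:3]])  # ['0x6f63', '0x6f6c', '0x7275']
```

Each pair of ASCII letters ("co", "lo", "ur") becomes one code unit in the CJK Unified Ideographs range, which is why the file displays as Chinese text.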

@jvsnyc 3 years ago

Well, I'll be danged!! Even in Win 10, 20H2!!

@felfut870216 4 years ago

It was so helpful, thank you. Could you explain the difference between encoding and encrypting?

@Retic_01 4 years ago

Awesome Stuff, Is that Hex dump view an Extension? Thanks Scott

@Retic_01 4 years ago

It is indeed an extension 'vscode-hexdump'

@michaelwplde 4 years ago

16:50 Great crash course introducing the issue! 'Course, could be a post-grad, possibly even undergrad, course devoted to this subject, I would imagine.

@toddburr4542 3 years ago

Scott, First, Great videos! In your first two videos you've used hex. A lot of non-CS people that I've worked with don't understand hex, octal, or binary. Actually, even some CS people don't get it. If you're going with basics, maybe a short video on how all of those number systems work would be good.

@nickguerra8460 3 years ago

I'm 41 years old, and have been wondering what happens to cause my .txt file to turn to gibberish since the 8th grade. Thank you Scott!

@rayzenpetrovski7070 3 years ago

Sometimes we need the basics to reach mastery. Thanks for creating this video.

@gregc6107 4 years ago

Suggestion: explaining the different bases, hex, octal, binary, bit shifting, xoring... could be an interesting one

@MrMoscs 2 years ago

Thank you so much.

@rovinox 3 years ago

@Scott Hanselman love the video and the theme on that VS Code. Can you please tell me the name of it?

@joen5000 3 years ago

I was just wondering whether it is possible to type using a keyboard alt-number combination or other combination any characters beyond the 256 set?

@TCP0011708 4 years ago

Thank you. Your videos are great!

@birdofhermes6152 3 years ago

Thanks a lot for this video

@secondaccount5196 3 years ago

This was on my TO DO list.

@onlypiku 4 years ago

How do applications (notepad/notepad2 whatever) understand the first 3 bytes of a file are BOM or file data?
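
In general an application cannot know; it checks whether the file starts with one of the known signatures and assumes that a match is a BOM rather than data. A minimal Python sketch of such a check (the function name `sniff_bom` is made up, and real editors add further heuristics on top):

```python
import codecs

# Longer signatures first: the UTF-32 LE BOM (FF FE 00 00) begins with
# the UTF-16 LE BOM (FF FE), so order matters.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def sniff_bom(raw: bytes):
    """Return (encoding, bom_length), or (None, 0) when no BOM is found."""
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name, len(bom)
    return None, 0

print(sniff_bom(b"\xef\xbb\xbfhello"))  # ('utf-8', 3)
print(sniff_bom(b"hello"))              # (None, 0)
```

A caller would then decode `raw[bom_length:]` with the returned encoding, or fall back to a default when no signature is found.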

@nagasudhirpulla 4 years ago

What is the VS Code theme used in this video, I like this theme

@quelorepario 2 years ago

question, before fonts existed, how was the ascii letter drawn on screen?

@AlexanderKrivacsSchrder 3 years ago

In UTF-8, bytes above 0x7F require special encoding to be valid. As such, those lone 0x80, 0x81, 0x82, etc. are not valid UTF-8 values, and a proper UTF-8 parser will give you an error there.
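
This is easy to confirm with a strict decoder. A minimal Python sketch (the helper name `is_valid_utf8` is made up for illustration):

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if `raw` decodes as strict UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

print(is_valid_utf8(b"\x41"))      # True:  plain ASCII 'A'
print(is_valid_utf8(b"\x80"))      # False: lone continuation byte
print(is_valid_utf8(b"\xc2\x80"))  # True:  U+0080 encoded as two bytes
```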

@funtertainment2552 3 years ago

Scott- Thank You! One request mate, Can you please tell us a short cut while coding in Vscode.

@TheCrimemas 3 years ago

The approach taken was very practical. I have a question: shouldn't the range of the loop be 0 - 127 instead of 0 - 128??? Why was the second loop (0 - 256) converted to be an integer ??? Byte is 7 bits right ?? So its range is from 0 - 255 ??

@Timanator 2 years ago

Great video! now I understand how I made those characters in chat lol

@cprashanthreddy 3 years ago

Thanks Scott... :)

@dmsanz_youtube 4 years ago

Brilliant!

@Krptodr 3 years ago

Loved the video, and the knowledge you shared. I know you're probably keeping to a specific time frame, but I have a suggestion. Can you make a video on the advanced character encodings and conversions. It's important to know that ISO-8859-* itself has issues and it has various versions that either made improvements and made breaking changes. It'd be important to know that going from ISO-8859-2 (pulling the -2 from memory) and going to UTF-8 has a block of bytes that doesn't convert over so it's lost during conversion.

@Krptodr 3 years ago

Primarily the lost bytes that I can recall most immediately are the German characters for the bottom left quotation and the top right quotation.