This is based on an email I sent my .NET team at work.
Let's talk about Unicode. �
Unicode is just 16-bit characters instead of 8, right? Wrong. That depends on the encoding. Wait, what.
Unicode is a deep topic, but I think it's important to know the basics. The apps we build will have an increasingly global audience that depends on Unicode to be able to read and write their native languages on a computer.
Plus we want to make sure people’s emoji display correctly. 😎👍
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
This is a really helpful, though a bit dated, article that gives the historical background around the development of Unicode and what the differences between things like ASCII, Latin-1, and UTF-8 are. I'm sure these are all things you've heard of, but if you're like me, they're kinda fuzzy and interrelated and orthogonal to the problems I really want to solve, so why can't the runtime just figure it out for me…
The long and short of Unicode is that you can't assume a character directly maps to a particular bit sequence in memory or on disk. A Unicode character is just a number (a code point). How that number is translated (i.e. encoded) into a bit sequence is up to the program.
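To make that concrete, here's a small sketch in C# of one code point coming out as different bytes under different encodings (the sample character is just an illustrative pick):

```csharp
using System;
using System.Text;

// "é" is the single code point U+00E9 -- just the number 233.
// The encoding decides how that number becomes bytes.
string s = "é";

Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetBytes(s)));    // C3-A9
Console.WriteLine(BitConverter.ToString(Encoding.Unicode.GetBytes(s))); // E9-00 (UTF-16 LE)
Console.WriteLine(BitConverter.ToString(Encoding.UTF32.GetBytes(s)));   // E9-00-00-00
```

Same number, three different byte sequences, and none of them is "the" representation of é.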
I wanna raise a comment on this statement though:
It does not make sense to have a string without knowing what encoding it uses. You can no longer stick your head in the sand and pretend that “plain” text is ASCII.
This is certainly true in languages like C, where a string is really just a pointer to a block of memory (char*). In .NET, a string is internally encoded in UTF-16, which (usually) uses 2 bytes to encode each character.
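That "(usually)" is worth a quick sketch: code points above U+FFFF (like most emoji) need two UTF-16 code units, a so-called surrogate pair, which is visible right from string.Length:

```csharp
using System;
using System.Text;

// .NET's string.Length counts 2-byte UTF-16 code units, not characters.
Console.WriteLine("hello".Length);                         // 5

// This emoji is code point U+1F60E, above U+FFFF, so it takes a
// surrogate pair: two code units, four bytes.
Console.WriteLine("😎".Length);                            // 2
Console.WriteLine(Encoding.Unicode.GetBytes("😎").Length); // 4
```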
His advice is, however, valid when you're working with a representation of text, such as in a Stream. If you want to load that data into C# string instances so you can do nice string things with it, you need to know what encoding the underlying data is in. That will tell you (or the runtime) how to translate the bytes into UTF-16 for the string instance. That's why the classes that read text from bytes have constructors that take an Encoding.
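StreamReader is one such class. As a sketch, the very same bytes decode to different strings depending on which encoding you hand it (Encoding.Latin1 here needs .NET 5 or later):

```csharp
using System;
using System.IO;
using System.Text;

byte[] bytes = { 0xC3, 0xA9 }; // "é" encoded as UTF-8

// Decoded with the right encoding, we get the original string back...
using (var reader = new StreamReader(new MemoryStream(bytes), Encoding.UTF8))
    Console.WriteLine(reader.ReadToEnd()); // é

// ...decoded with the wrong one, we get mojibake.
using (var reader = new StreamReader(new MemoryStream(bytes), Encoding.Latin1))
    Console.WriteLine(reader.ReadToEnd()); // Ã©
```

The runtime can't tell which reading is "right" from the bytes alone; that's exactly the knowledge the Encoding parameter supplies.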
This is a pretty interesting and geeky deep dive into how the UTF-8 encoding actually works. UTF-8, as alluded to a bit in Joel's article, is a variable-length encoding, which means that any given Unicode code point (character) might be encoded in 1, 2, 3, or 4 bytes. For example, A has code point 65, which nicely fits in 1 byte. But others have code points in the thousands: these don't fit in one byte and will have their value split out over multiple bytes.
James does a great job explaining a bit about how this dark magic works.
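You can watch those 1-to-4-byte widths happen from C# (the characters here are just example picks at each width):

```csharp
using System;
using System.Text;

// UTF-8 spends more bytes as the code point gets bigger.
Console.WriteLine(Encoding.UTF8.GetBytes("A").Length);  // 1 (U+0041)
Console.WriteLine(Encoding.UTF8.GetBytes("é").Length);  // 2 (U+00E9)
Console.WriteLine(Encoding.UTF8.GetBytes("€").Length);  // 3 (U+20AC)
Console.WriteLine(Encoding.UTF8.GetBytes("😎").Length); // 4 (U+1F60E)
```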
UTF-8 is widely regarded as the best encoding for documents that have to be shared over the internet. It has very wide device support, and has the nice benefit of being backwards compatible with ASCII: existing ASCII documents will open just fine in a UTF-8 reader.
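A quick sketch of that ASCII compatibility (the sample text is arbitrary):

```csharp
using System;
using System.Linq;
using System.Text;

// Pure-ASCII text encodes to identical bytes under ASCII and UTF-8,
// which is why old ASCII files open fine in a UTF-8 reader.
byte[] ascii = Encoding.ASCII.GetBytes("Hello, world!");
byte[] utf8  = Encoding.UTF8.GetBytes("Hello, world!");
Console.WriteLine(ascii.SequenceEqual(utf8)); // True
```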
This site argues that you should create every document in UTF-8 to maximize interoperability.
Our goal is to promote usage and support of the UTF-8 encoding and to convince that it should be the default choice of encoding for storing text strings in memory or on disk, for communication and all other uses. We believe that our approach improves performance, reduces complexity of software and helps prevent many Unicode-related bugs. We suggest that other encodings of Unicode (or text, in general) belong to rare edge-cases of optimization and should be avoided by mainstream users.
Following this recommendation is a little more complicated in .NET though.
The creators of .NET opted to go with UTF-16 instead, as that was pretty much Windows' native string format. UTF-16 meant that users could work with existing Windows technologies through things like COM interop without expensive marshalling of string data back and forth between different encodings.
Watch out for utility methods like File.WriteAllText, though: the overloads that don't take an Encoding pick a default for you (modern .NET writes UTF-8 without a byte order mark). If you're intending to share that file over the internet or with other operating systems, consider using an overload that takes an Encoding so there's no guesswork about what ends up on disk.
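As a sketch (the file name is made up), being explicit on both the write and the read removes the ambiguity entirely:

```csharp
using System;
using System.IO;
using System.Text;

// Passing an Encoding makes the on-disk bytes unambiguous, whatever
// the parameterless overload's default happens to be.
File.WriteAllText("greeting.txt", "héllo 😎", Encoding.UTF8);

// Read it back with the same encoding and nothing is lost.
string roundTripped = File.ReadAllText("greeting.txt", Encoding.UTF8);
Console.WriteLine(roundTripped == "héllo 😎"); // True
</imports>
```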