Introduction to charsets and encodings
Lately I have again and again heard and read misunderstandings on this topic, or heard people say things like “Uh, now my umlauts are broken, why do I have an A with a tilde over it?”, so today I am writing a little about charsets and encodings.
First, two important definitions of terms that are often confused or misinterpreted.
Basics
Charset: A charset is a list of characters that are combined into a group, for example ASCII, ISO-8859-1 or Unicode. A number is assigned to each character (for example, 65 is assigned to the character A in the ASCII character set). The charset itself initially has nothing to do with the digital processing or the representation of the characters.
Encoding: An encoding is used to store characters digitally. In the simplest case, ASCII characters are stored directly as bytes in main memory or on the hard drive (the number 65 of the character A can be stored as it is, because the numbers of the ASCII characters never exceed the byte limit of 255). In more complicated cases (such as Unicode), individual characters have to be spread over several bytes, since the assigned numbers can be greater than 255 (the most common encoding for Unicode is UTF-8).
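The difference between the two terms can be tried out in a few lines. The following is only a minimal sketch in Python (the language is my choice for illustration, the variable names are made up): ord() returns the number a character has in the charset, while encode() produces the bytes that an encoding stores for it.

```python
# Charset: every character has a number. Python reports it with ord().
number_of_a = ord("A")               # 65, the same number as in ASCII

# Encoding: the number is turned into bytes for storage.
ascii_bytes = "A".encode("ascii")    # one byte with the value 65
utf8_bytes = "Ä".encode("utf-8")     # two bytes, because Ä is outside ASCII

print(number_of_a, list(ascii_bytes), list(utf8_bytes))
```

The character A ends up as a single byte, while the Ä needs two bytes in UTF-8.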
ASCII
The American Standard Code for Information Interchange is the simplest of all charsets. It is supported and displayed correctly by all common computers, operating systems and programs. This is where the advantages end, because apart from punctuation marks, digits and the Latin letters A-Z and a-z, there are not many other characters in ASCII (for example no German umlauts). The characters are numbered from 0 to 127 and can thus be stored directly as bytes in memory without any special encoding.
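As a small illustration of the 0 to 127 limit, here is a sketch in Python. The helper names is_ascii and try_ascii are invented for this example and are not part of any library.

```python
def is_ascii(text: str) -> bool:
    """Return True if every character has a number from 0 to 127."""
    return all(ord(ch) <= 127 for ch in text)


def try_ascii(text: str):
    """Return the ASCII bytes of text, or None if a character is not in ASCII."""
    try:
        return text.encode("ascii")
    except UnicodeEncodeError:
        return None


print(is_ascii("Hello"), try_ascii("Hello"))
print(is_ascii("Müller"), try_ascii("Müller"))
```

The first string passes, the second one fails because of the ü.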
8-bit charsets
Since you don’t get very far with these 128 characters, there are “larger” standardized 8-bit charsets. All of these 8-bit charsets correspond exactly to the ASCII charset in the assignment of 0-127. The remaining characters with the numbers 128-255 (the “upper half”) often contain region-specific characters. ISO-8859-1 (also known as Latin 1) contains, among other things, the German and French special characters, ISO-8859-2 for example the Polish and Hungarian special characters. Another common charset in this category is Windows-1252, which is based on ISO-8859-1 but contains a few more characters (for example the euro symbol, the dashes or the curly quotation marks, in short all the characters you already know from users copying text out of a Word document into a text area). If one of these charsets is specified in the HTML header using the meta element with http-equiv="content-type", the browser interprets bytes with values > 127 as characters of this charset and displays them correctly.
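That the meaning of the upper half depends on the charset can be checked with a short sketch. It uses Python's built-in codecs iso-8859-1, iso-8859-2 and cp1252 (Python's name for Windows-1252), and the variable names are only illustrative. The last lines reproduce, as far as I can tell, the “A with a tilde” effect from the introduction: UTF-8 bytes read as if they were Windows-1252.

```python
upper_half_byte = bytes([0xB3])

# The same byte value means different characters in different 8-bit charsets.
in_latin1 = upper_half_byte.decode("iso-8859-1")   # superscript three
in_latin2 = upper_half_byte.decode("iso-8859-2")   # small l with stroke

# Windows-1252 has the euro sign at 0x80.
euro = bytes([0x80]).decode("cp1252")

# Bytes written as UTF-8 but interpreted as Windows-1252 produce garbage.
broken_umlaut = "Ä".encode("utf-8").decode("cp1252")

print(in_latin1, in_latin2, euro, broken_umlaut)
```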
These 8-bit character sets are suitable for applications where only one language is used on a page. The advantage is the simple encoding, in which, as with ASCII, the numerical value of the character can simply be stored 1:1 as a byte. The disadvantage is the restriction to the corresponding region. For example, with ISO-8859-2 alone, only some of the Eastern European languages can be displayed on a website, not all Western European languages at the same time, for example French.
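The regional restriction can be made visible with a small check. This is only a sketch in Python, and can_encode is a hypothetical helper written for this example.

```python
def can_encode(text: str, charset: str) -> bool:
    """Return True if every character of text exists in the given charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


for sample in ("ł", "è"):
    for charset in ("iso-8859-1", "iso-8859-2", "utf-8"):
        print(sample, charset, can_encode(sample, charset))
```

The Polish ł only fits into ISO-8859-2, the French è only into ISO-8859-1, and only UTF-8 can hold both at once.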
PHP works with ISO-8859-1 internally, so functions such as utf8_encode() implicitly refer to the conversion from ISO-8859-1 to Unicode with the encoding UTF-8. Conversely, utf8_decode() can handle German umlauts that are UTF-8-encoded, but Polish special characters cannot be represented in ISO-8859-1 and therefore turn into question marks. In PHP 6 there will be other options here.
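To make the two conversion directions concrete, here is a rough Python approximation. It is not PHP's actual implementation: the function names latin1_to_utf8 and utf8_to_latin1 are made up, and the question mark behaviour is imitated with Python's errors="replace" option.

```python
def latin1_to_utf8(data: bytes) -> bytes:
    """Roughly what utf8_encode() does: read ISO-8859-1 bytes, write UTF-8."""
    return data.decode("iso-8859-1").encode("utf-8")


def utf8_to_latin1(data: bytes) -> bytes:
    """Roughly what utf8_decode() does: characters missing in ISO-8859-1 become '?'."""
    return data.decode("utf-8").encode("iso-8859-1", errors="replace")


umlaut_latin1 = b"\xc4"                       # Ä in ISO-8859-1
print(latin1_to_utf8(umlaut_latin1))          # Ä as two UTF-8 bytes
print(utf8_to_latin1("Äł".encode("utf-8")))   # Ä survives, ł does not
```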
Unicode
In order to display several fundamentally different languages on one page (e.g. in a backend which displays orders from different countries in a list), Unicode is required. With Unicode, almost every character that exists on this planet is uniquely assigned a number. This assignment is understood by all modern operating systems and programs and can also be represented by standard fonts such as Arial and Helvetica.
As with the 8-bit character sets, the numbers of the ASCII code correspond to the same characters in Unicode (so the number 65 corresponds to the character A in Unicode as well). The German umlauts, the Polish or even Chinese characters have larger values, for example the Ä is assigned the number 196, and the small L with stroke (ł) corresponds to 322. This brings us to the main disadvantage of Unicode: a simple encoding with one byte per character, as with ASCII, is out of the question here.
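The numbers mentioned above can be verified quickly. The following Python sketch (fits_in_one_byte is an invented helper) prints the Unicode number of each character and whether that number would still fit into a single byte.

```python
def fits_in_one_byte(ch: str) -> bool:
    """Return True if the Unicode number of ch is at most 255."""
    return ord(ch) <= 255


for ch in "AÄł":
    # Character, its Unicode number, whether it fits into one byte,
    # and how many bytes UTF-8 needs for it.
    print(ch, ord(ch), fits_in_one_byte(ch), len(ch.encode("utf-8")))
```

The ł with its number 322 no longer fits into one byte, so an encoding has to spread it over several bytes.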
UTF-8
The most common encoding for Unicode
