Encode to UTF8 Format

What is UTF8

UTF8 is a Unicode character encoding that represents each code point with one to four 8-bit bytes. It can represent every character defined in the Unicode standard. It is backward compatible with ASCII: the 128 ASCII characters are encoded as single bytes with the same binary values they have in ASCII. ASCII uses only 7 bits, so in UTF8 these bytes always have the highest bit set to 0, which makes ASCII a subset of UTF8.
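The ASCII compatibility described above can be checked with a short Python sketch. It only uses the built-in `str.encode` method; the sample string is an arbitrary choice for illustration.

```python
# ASCII-only text produces identical bytes under both encodings.
text = "Hello"
ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")
print(ascii_bytes == utf8_bytes)  # True

# Every byte of ASCII text has its highest bit clear (value below 0x80).
high_bit_clear = all(b < 0x80 for b in utf8_bytes)
print(high_bit_clear)  # True

# A character outside ASCII needs more than one byte in UTF8.
rupee_length = len("\u20b9".encode("utf-8"))
print(rupee_length)  # 3
```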

Today, UTF8 is the most popular character encoding for websites. Most people never notice it, because the browser automatically decodes the bytes into readable characters, including non-English characters.

Pros and Cons of UTF8 encoding


Pros of UTF8 encoding

UTF8 supports many languages.
Most programming languages support UTF8.
UTF8 is compatible with ASCII.
UTF8 text can be converted to other charsets easily with tools such as iconv.
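
As a rough illustration of charset conversion, the sketch below does in Python what a tool like iconv does on the command line: decode the UTF8 bytes to text, then encode the text in another charset. The choice of UTF-16 (big-endian) as the target is an assumption made only for this example.

```python
# UTF8 bytes for the rupee sign (U+20B9).
utf8_bytes = bytes.fromhex("E2 82 B9")

# Step 1: decode the UTF8 bytes into a Python string.
text = utf8_bytes.decode("utf-8")

# Step 2: encode the string in the target charset (UTF-16 big-endian here).
utf16_bytes = text.encode("utf-16-be")
print(utf16_bytes.hex(" ").upper())  # 20 B9
```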

Cons of UTF8 encoding

UTF8 uses a variable-length encoding, so the number of bytes a string needs cannot be determined from its character count alone; higher code points take more bytes.
Some programming languages require an encoding module to work with UTF8.
Finding the Nth character takes more processing time, because the variable-length sequences must be scanned from the start.
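
The variable-length behaviour can be seen in a small Python sketch. The four sample characters are arbitrary picks, one for each possible byte length.

```python
# One sample character for each UTF8 length: A, e-acute, rupee sign, an emoji.
samples = ["A", "\u00e9", "\u20b9", "\U0001F600"]

byte_lengths = []
for ch in samples:
    encoded = ch.encode("utf-8")
    byte_lengths.append(len(encoded))
    print(f"U+{ord(ch):04X}: {len(encoded)} byte(s) -> {encoded.hex(' ').upper()}")
# U+0041: 1 byte(s) -> 41
# U+00E9: 2 byte(s) -> C3 A9
# U+20B9: 3 byte(s) -> E2 82 B9
# U+1F600: 4 byte(s) -> F0 9F 98 80

# The character count and the byte count of a string can differ.
text = "A\u20b9"
char_count = len(text)
byte_count = len(text.encode("utf-8"))
print(char_count, byte_count)  # 2 4
```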

How to encode UTF8 (UTF8 Converter)


Example – Encode string “₹” to UTF8 hexadecimal. (UTF8 Encode)

  1. Search for the code point of “₹” (the rupee sign), which is “U+20B9”.
  2. Convert the hexadecimal “20B9” to binary.

Hexadecimal   Binary
2             0010
0             0000
B             1011
9             1001

"20B9" = "0010 0000 1011 1001"
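
The hexadecimal-to-binary conversion in step 2 can be reproduced with Python's built-in `int` and `format`. This is only a sketch of that one step; the variable names are illustrative.

```python
# Parse the hexadecimal code point, then format it as 16 binary digits.
code_point = int("20B9", 16)
bits = format(code_point, "016b")

# Group the bits in fours so they line up with the hexadecimal digits.
grouped = " ".join(bits[i:i + 4] for i in range(0, 16, 4))
print(grouped)  # 0010 0000 1011 1001
```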

  3. Refer to the table UTF8 Code Point Prefix. A 16-bit code point needs the 3-byte format below.

Code Point 16 Bits = "1110(XXXX) 10(XXXXXX) 10(XXXXXX)"

Starting from the left-hand side, regroup the 16 bits from step 2 into groups of 4, 6, and 6 bits to match the UTF8 format.

Rearrange: 0010 0000 1011 1001 -> 0010 000010 111001

Put the prefix bits in front of each group to form the three bytes.

UTF8 with prefix: "1110(0010) 10(000010) 10(111001)"
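
The regrouping and prefixing in step 3 can be sketched with plain string slicing. This sketch covers only the 3-byte case used in this example, and the variable names are illustrative.

```python
# The 16 bits of U+20B9 from step 2, written without spaces.
bits = "0010000010111001"

# Split into groups of 4, 6, and 6 bits, reading from the left.
chunks = [bits[0:4], bits[4:10], bits[10:16]]
print(chunks)  # ['0010', '000010', '111001']

# Prefix for the leading byte of a 3-byte sequence, then two continuation prefixes.
prefixes = ["1110", "10", "10"]
utf8_bits = [prefix + chunk for prefix, chunk in zip(prefixes, chunks)]
print(utf8_bits)  # ['11100010', '10000010', '10111001']
```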

  4. You now have 3 bytes of UTF8 in binary. Convert each byte back to hexadecimal.

Binary     Hexadecimal
11100010   E2
10000010   82
10111001   B9

The UTF8 encoding of “₹” is:

Hexadecimal : E2 82 B9
Hex notation : \xE2\x82\xB9
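
The whole encoding procedure can be written as one function using bit shifts and masks. The following is a minimal sketch, not a reference implementation: the function name `utf8_encode` is made up for this example, and in practice Python's built-in `str.encode("utf-8")` does the same job.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point as UTF8 bytes (illustrative sketch)."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a valid Unicode scalar value")
    if code_point <= 0x7F:
        # 1 byte: 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:
        # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


rupee = utf8_encode(0x20B9)
print(rupee.hex(" ").upper())  # E2 82 B9
```

The result can be cross-checked against the built-in encoder with `chr(0x20B9).encode("utf-8")`.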

How to decode UTF8 (UTF8 Converter)

  1. Convert all hexadecimal values to binary bits.
  2. Read the binary bits and identify the prefix of each byte, as listed in the table UTF8 Code Point Prefix.
  3. Remove the prefix bits and join the remaining bits to get the Unicode code point.
  4. Map the code point back to a character.
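
The four decoding steps above can be sketched as a small Python function. The name `utf8_decode_one` is hypothetical, and the sketch decodes exactly one character. It checks the prefixes but does not reject overlong or surrogate encodings, which a strict decoder such as the built-in `bytes.decode("utf-8")` would.

```python
def utf8_decode_one(data: bytes) -> int:
    """Decode a single UTF8 sequence to its code point (illustrative sketch)."""
    first = data[0]
    # Step 2: the prefix of the first byte gives the sequence length.
    if first < 0x80:              # 0xxxxxxx
        length, code_point = 1, first
    elif first >> 5 == 0b110:     # 110xxxxx
        length, code_point = 2, first & 0x1F
    elif first >> 4 == 0b1110:    # 1110xxxx
        length, code_point = 3, first & 0x0F
    elif first >> 3 == 0b11110:   # 11110xxx
        length, code_point = 4, first & 0x07
    else:
        raise ValueError("invalid leading byte")
    if len(data) != length:
        raise ValueError("unexpected number of bytes")
    # Step 3: strip the "10" prefix of each continuation byte and append its 6 bits.
    for byte in data[1:]:
        if byte >> 6 != 0b10:
            raise ValueError("invalid continuation byte")
        code_point = (code_point << 6) | (byte & 0x3F)
    return code_point


# Step 1: start from the hexadecimal bytes.
raw = bytes.fromhex("E2 82 B9")
decoded = utf8_decode_one(raw)
print(f"U+{decoded:04X}")  # U+20B9

# Step 4: map the code point back to a character.
character = chr(decoded)
```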

Table UTF8 Code Point Prefix