NSAttributedString, HTML and Unicode Encoding


While converting an HTML string to an NSAttributedString, I discovered a peculiar thing about its Unicode encoding format.

HTML as NSAttributedString

Converting an HTML string to an NSAttributedString can be easily accomplished with this String extension:

extension String {
    func htmlAttributedString() -> NSAttributedString? {
        // Encode the HTML string as UTF-16 data
        guard let data = self.dataUsingEncoding(NSUTF16StringEncoding, allowLossyConversion: false) else { return nil }
        // Parse the data as an HTML document; return nil if parsing fails
        guard let html = try? NSMutableAttributedString(
          data: data,
          options: [NSDocumentTypeDocumentAttribute: NSHTMLTextDocumentType],
          documentAttributes: nil) else { return nil }
        return html
    }
}

To use it:

label.attributedText = "<b>Hello</b> \u{2022} World".htmlAttributedString()

In the above, I have purposely added a Unicode bullet (\u{2022}) to show that Unicode characters are rendered correctly.

NSAttributedString uses NSUTF16StringEncoding

If you observe the code carefully, when we call dataUsingEncoding, we use NSUTF16StringEncoding.

Sidetrack: A small introduction to Unicode and encoding is in the next section.

This is because, unless an encoding is specified, NSAttributedString expects the data to be encoded in UTF-16.

NOT UTF-8, as I expected at first.

Also note that NSUnicodeStringEncoding is the same as NSUTF16StringEncoding.

While the documentation never specifies what encoding the data has to be in, we can get a hint from NSString, which is conceptually UTF-16.

Another way is to specify the encoding in the NSAttributedString initializer:

// Tell the HTML importer explicitly that the data is UTF-8
guard let html = try? NSMutableAttributedString(
  data: data,
  options: [NSDocumentTypeDocumentAttribute: NSHTMLTextDocumentType,
            NSCharacterEncodingDocumentAttribute: NSUTF8StringEncoding],
  documentAttributes: nil) else { return nil }

In the above, the data has to be UTF-8 encoded.

An Introduction to Encoding Strings

Why do we need to encode?

We need to handle encoding and decoding whenever a string is converted to data (e.g. NSData) and back.

When writing to a file, ultimately, the string representation is written as bytes of data. This conversion of string to data requires an encoding form.

This is because 1 character in a string (or specifically, a Unicode string) can be represented by 1 to 4 bytes, depending on the encoding form.
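To make the 1 to 4 bytes concrete, here is a minimal sketch using only the Swift standard library. The sample characters are my own picks for illustration, and the `utf8` view is used to count the bytes each one takes when encoded as UTF-8.

```swift
// Hypothetical sample characters, written as escapes to avoid ambiguity.
let samples = ["A", "\u{E9}", "\u{2022}", "\u{1F600}"]  // A, é, •, 😀

// String.utf8 is the standard library's view of the UTF-8 code units (bytes).
let utf8ByteCounts = samples.map { $0.utf8.count }

print(utf8ByteCounts)  // [1, 2, 3, 4]
```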

We discuss 2 of the most common forms here.

A UTF-16 character is at minimum 2 bytes (or 16 bits).

But a Unicode character is up to 21 bits, so while UTF-16 can represent most Unicode characters (specifically, about 63K of them) in 2 bytes, the remaining characters need 4 bytes, encoded as a pair of 16-bit units known as a surrogate pair.
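As a rough illustration, the sketch below inspects the UTF-16 code units of two characters I chose as examples: the bullet fits in a single 16-bit unit, while the emoji does not. The last lines show one way to arrive at the 63K figure, on the assumption that it counts the 16-bit range minus the values reserved for surrogates.

```swift
let bullet = "\u{2022}"   // •, fits in one 16-bit unit
let emoji = "\u{1F600}"   // 😀, needs more than 16 bits

// String.utf16 is the standard library's view of the UTF-16 code units.
let bulletUnits = Array(bullet.utf16)
let emojiUnits = Array(emoji.utf16)

print(bulletUnits.count * 2)  // 2 (bytes)
print(emojiUnits.count * 2)   // 4 (bytes)
print(emojiUnits.map { String($0, radix: 16) })  // ["d83d", "de00"]

// 65,536 possible 16-bit values, minus the 2,048 reserved for surrogates.
let singleUnitCodePoints = 0x10000 - 0x800
print(singleUnitCodePoints)  // 63488
```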

A UTF-8 character is at minimum 1 byte (or 8 bits). Only 128 characters are encoded in 1 byte (these are the very common ASCII characters). For text that is mostly ASCII, this is much more space efficient than UTF-16.
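A quick way to compare the two forms is to count the encoded size of a whole string. The helper `encodedSizes(of:)` below is a hypothetical name of mine, not a Foundation API, and it ignores any byte order mark that an actual conversion to data may add.

```swift
// Hypothetical helper: byte sizes of a string in UTF-8 and UTF-16.
// The UTF-16 size is the number of 16-bit code units times 2 (no byte order mark).
func encodedSizes(of text: String) -> (utf8: Int, utf16: Int) {
    return (utf8: text.utf8.count, utf16: text.utf16.count * 2)
}

let ascii = encodedSizes(of: "Hello World")
print(ascii.utf8, ascii.utf16)  // 11 22

// For non-ASCII text the picture can flip: these two Chinese characters
// take 3 bytes each in UTF-8 but only 2 bytes each in UTF-16.
let chinese = encodedSizes(of: "\u{4F60}\u{597D}")  // 你好
print(chinese.utf8, chinese.utf16)  // 6 4
```

So the space saving applies to ASCII-heavy content such as HTML markup, which is likely the common case here.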

That’s it for an introduction.

For more, read about how Swift deals with String, and the article from objc.io.

@samwize
