Tuesday, 25 September 2012

Understanding Unicode and UTF8 in Perl

I know there is already a lot of material out there on the subject, but when it comes to Unicode, a quick review of the core concepts and a few bug-avoidance guidelines is never a waste of time.


I'm not a Unicode guru, but when working with third parties I often find that many people consistently get the basics of Unicode and encoding wrong. There must be something esoteric about it. So here's yet another set of slides about Unicode/UTF8 in Perl.

It's not meant to be a comprehensive presentation of all things Unicode in Perl. It's meant to stress a couple of guidelines and give some pointers for getting a good start on writing a Unicode-compliant application and avoiding common issues.
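
To give a flavour of the kind of guideline meant here, below is a minimal sketch (not taken from the slides) of one widely used rule: decode bytes into characters on the way in, work on characters inside the program, and encode back to bytes on the way out. The sample string and the names `$bytes`, `$chars`, `$out` and `report` are made up for the illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Bytes as they might arrive from a file, a socket or a database driver:
# the UTF-8 encoding of "café" (the "é" takes two bytes, 0xC3 0xA9).
my $bytes = "caf\xC3\xA9";

# Input boundary: bytes in, characters out.
my $chars = decode('UTF-8', $bytes);

# Inside the program, operate on characters only.
my $upper = uc $chars;

# Output boundary: characters in, bytes out.
my $out = encode('UTF-8', $upper);

# ASCII-only summary, so it prints safely on any console.
sub report {
    return sprintf "%d bytes in, %d characters, %d bytes out",
        length $bytes, length $chars, length $out;
}

print report(), "\n";
# prints "5 bytes in, 4 characters, 5 bytes out"
```

Note that `length` gives a different answer before and after decoding. Mixing up those two views of the same data is where many of the common bugs seem to come from.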




Comments are open for questions. Happy coding!

8 comments:

  1. The recipe on slide 11 ("Set STDOUT to encode as UTF8") is not portable. UTF-8 is not expected on the console on Windows.
    Thanks for the tip on email headers.

    1. I suggest we petition M$ to get a UTF8 Windows console :)

  2. A character isn't a combination of one or more glyphs. Glyphs are used to represent characters and are provided by fonts, which are collections of glyphs. Unicode doesn't define glyphs (although it includes example glyphs in the code charts) and Perl doesn't have any understanding of glyphs. Unicode defines characters, which have abstract meaning but not specific shapes. These characters are assigned code points and can be stored in bytes using character encoding forms. Throughout your slides, the word "glyph" can generally be replaced with the word "character". The Unicode standard uses the terms "base character", "combining character", and "precomposed character". See http://www.unicode.org/glossary/ for details.

    1. Thanks for the comment and pointer. My use of 'glyph' is indeed an over-simplification. Those slides are there to insist on the fact that multiple 'coded characters' can map to one 'abstract character'. I'll do some fixing. As one of the slides says: People are confused, trust no one. That includes me :)

    2. I agree that the naming of concepts related to character codes and encodings is very confusing. When used in the technical sense, the word character is equivalent to code point (e.g., base character, combining character, control character) but for common English usage it's equivalent to grapheme. My main point is that glyph isn't the right word to use here.

      There are entire three-day conferences dedicated to Unicode, so it's admittedly hard to compress the topic into a set of slides. You did a great job in your 19 slides. I attempted to explain these concepts in 63 slides at YAPC::NA 2012.

    3. I don't know if I'd survive 3 days :) The idea here was to have an introduction that is as small as possible (yet useful) for my busy developer colleagues. Although we live in a global world and it's 2012 already, it's still hard to find people who have a basic and sane understanding of these things.

      Thanks for your slides!

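
The distinction drawn in the comments between code points, precomposed characters and graphemes can be sketched in a few lines of Perl. This is only an illustration: the helper `grapheme_count` and the choice of "é" as the sample are assumptions made for the example, and `Unicode::Normalize` is the core module used for normalization.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC);

# "é" as a single precomposed character: U+00E9.
my $precomposed = "\x{E9}";

# "é" as a base character followed by a combining character:
# U+0065 (e) + U+0301 (combining acute accent).
my $combined = "e\x{301}";

# Count graphemes (user-perceived characters) by matching \X,
# which matches one extended grapheme cluster at a time.
sub grapheme_count {
    my ($string) = @_;
    my $count = () = $string =~ /\X/g;
    return $count;
}

printf "precomposed: %d code point(s), %d grapheme(s)\n",
    length $precomposed, grapheme_count($precomposed);
printf "combined:    %d code point(s), %d grapheme(s)\n",
    length $combined, grapheme_count($combined);
# prints:
# precomposed: 1 code point(s), 1 grapheme(s)
# combined:    2 code point(s), 1 grapheme(s)

# The two strings differ code point by code point, but compare
# equal once both are normalized to the same form (NFC here).
printf "equal as-is: %s\n", $precomposed eq $combined ? 'yes' : 'no';
printf "equal after NFC: %s\n",
    NFC($precomposed) eq NFC($combined) ? 'yes' : 'no';
# prints:
# equal as-is: no
# equal after NFC: yes
```

Here `length` counts code points while `\X` counts graphemes, which matches the technical and the everyday senses of "character" discussed above.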