SIGSTP: Understanding Unicode and UTF8 in Perl

Tuesday, 25 September 2012

I know there is already a lot of material out there on the subject, but when it comes to Unicode, a quick review of core concepts and bug-avoidance guidelines is never a waste of time.

I'm not a Unicode guru, but when working with third parties I often find that many people consistently get the basics of Unicode and encodings wrong. There must be something esoteric about it. So here's yet another set of slides about Unicode/UTF8 in Perl.

It's not meant to be a comprehensive presentation of everything Unicode in Perl. It's meant to emphasize a few guidelines and give some pointers for getting a good start on writing a Unicode-compliant application and avoiding common issues.

Comments are open for questions. Happy coding!

8 comments:

LeoNerd, 26 September 2012 at 14:38
Looks good.
dolmen, 26 September 2012 at 17:59
The recipe on slide 11 ("Set STDOUT to encode as UTF8") is not portable. UTF-8 is not expected on the console on Windows.
Thanks for the tip on email headers.
Anonymous, 9 October 2012 at 17:22
A character isn't a combination of one or more glyphs. Glyphs are used to represent characters and are provided by fonts, which are collections of glyphs. Unicode doesn't define glyphs (although it includes example glyphs in the code charts) and Perl doesn't have any understanding of glyphs. Unicode defines characters, which have abstract meaning but not specific shapes. These characters are assigned code points and can be stored in bytes using character encoding forms. Throughout your slides, the word "glyph" can generally be replaced with the word "character". The Unicode standard uses the terms "base character", "combining character", and "precomposed character". See http://www.unicode.org/glossary/ for details.