Tuesday 6 May 2008
According to Google (via: DF), UTF-8 is now the most popular character set on the web! I wonder how much this is down to sensible defaults in web authoring tools, rather than a conscious shift in mindset. It's a long time since I looked at it, but as far as I can remember Dreamweaver defaults to UTF-8 for new web pages, so a lot of beginning web designers are probably building Unicode sites without even realising it.
I think there are a couple of reasons that many web designers and developers still aren't using Unicode across the board.
“I don't need Unicode, because my site is in English!”
I'll bet this is the most common (and stupid) excuse. Even assuming all your content is in English, many of your visitors may not use English as their first language. If you've got any areas where users can contribute content (for example, forums, contact forms, blog comments etc.), anything they type that your page's character set can't represent can end up garbled. Even if all your visitors are native English-speaking monoglots, it's more than likely that some will have characters in their name that can't be represented in Windows Latin or ASCII.
Even if all the pages on your website are hand-coded by you, and users have no opportunities to break your site by posting content in a different character set, there are still huge advantages to using Unicode. You can stop worrying about accents in English words like Résumé(i), or the pounds sterling symbol (£), or “quotation marks”. Basically, Unicode means you can stop worrying about HTML entities (except for & / < / > / ") forever.
“Unicode is hard!”
Actually, this one is partly valid, for a website at least, because there are quite a few steps involved in making a site fully Unicode-compliant. Let's go through the key steps for a typical PHP + MySQL / Postgres site:
A quick note on UTF-8 and Unicode
There are actually several formats of Unicode data, but UTF-8 is the most commonly used online. In this post, I'll refer to UTF-8 and Unicode as being the same thing. UTF-8 is a variable-width encoding, where each character takes up between 1 and 4 bytes. This sounds confusing and dumb, but there are actually two pretty good reasons for this:
- Very frequently used characters (e.g. roman letters, numbers, punctuation) only use 1 byte, while less frequently used characters use more. This means text takes up less space than it would if every character took 4 bytes (in a typical English-language document, around 4 times less!) Obviously, this probably won't be quite so much of a benefit if you generally write in a language that doesn't use these characters.
- UTF-8 is backwards compatible with ASCII. UTF-8 stores the characters that are valid in an ASCII file in the same way as they would be stored if that file were saved as ASCII text. This means that an ASCII text file is also a valid UTF-8 text file, which makes converting your legacy files much easier for the most part.
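To make those two points a bit more concrete, here's a minimal PHP sketch. It assumes PHP 7 or later (for the "\u{...}" escapes), and the function names utf8_byte_length() and is_valid_utf8() are made up for this example. It counts the bytes a few characters need, and checks that a plain ASCII string already passes as UTF-8:

```php
<?php
// Hypothetical helper: number of bytes a string occupies.
// strlen() counts bytes, not characters, which is what we want here.
function utf8_byte_length(string $text): int
{
    return strlen($text);
}

// Hypothetical helper: true if $text is well-formed UTF-8.
// An empty pattern with the /u modifier makes PCRE validate the subject
// as UTF-8; preg_match() returns false when the subject is malformed.
function is_valid_utf8(string $text): bool
{
    return preg_match('//u', $text) === 1;
}

// Escapes are used so the example doesn't depend on how this file is saved.
$samples = [
    'letter A'  => 'A',
    'e-acute'   => "\u{00E9}",
    'euro sign' => "\u{20AC}",
    'emoji'     => "\u{1F600}",
];

foreach ($samples as $label => $char) {
    printf("%-10s %d byte(s)\n", $label, utf8_byte_length($char));
}

// The letter "A" is the single byte 0x41 in ASCII and in UTF-8 alike.
echo bin2hex('A'), "\n";

// A pure ASCII string is already valid UTF-8...
var_dump(is_valid_utf8('plain old ASCII'));
// ...but the lone Windows Latin byte for e-acute (0xE9) is not.
var_dump(is_valid_utf8("caf\xE9"));
```

The last check is why "just treat everything as UTF-8" only works for free on files that really are pure ASCII; anything saved in Windows Latin with accented characters needs converting.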
STEP 1: Set up your text editor / IDE to talk in UTF-8
I probably use a different text editor from you, so I won't go into the steps involved for any particular editor. What you need to do is set your editor so that:
- New files are created as UTF-8 without a BOM (more on this in a sec)
- Existing files are read as UTF-8 when the character set cannot be detected
A BOM (or Byte Order Mark) is a character that can appear at the very start of a text file to indicate which encoding it uses. Since plain text files are the simplest type of file there is, they don't have headers or metadata to tell software what type of data they contain. As a particular byte value could represent two totally different characters in two different character sets, a hint about the encoding becomes useful when software can no longer guess it from the content. UTF-8 text files can optionally start with a BOM (the three bytes EF BB BF) to tell software that reads them that they contain UTF-8 data. If your editor supports Unicode, you won't see this character, as the editor hides it when you open the file and writes it back to the start of the file when you save.
But you probably don't want to use a BOM if you're building a site in PHP, since PHP treats a BOM at the top of a script or included file as ordinary output and sends it to the browser. As long as your editor is set up to assume UTF-8 where appropriate, leaving the BOM out shouldn't be a problem.
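If you suspect some of your files were saved with a BOM, something like the following rough sketch can check for one and strip it. The UTF-8 BOM is the three bytes EF BB BF; the constant and function names here are my own, not anything built into PHP:

```php
<?php
// The UTF-8 encoding of the BOM character (U+FEFF) is the bytes EF BB BF.
const UTF8_BOM = "\xEF\xBB\xBF";

// Hypothetical helper: does this string start with a UTF-8 BOM?
function has_utf8_bom(string $contents): bool
{
    return strncmp($contents, UTF8_BOM, 3) === 0;
}

// Hypothetical helper: return the string without a leading UTF-8 BOM.
// Strings that don't start with a BOM are returned unchanged.
function strip_utf8_bom(string $contents): string
{
    return has_utf8_bom($contents) ? substr($contents, 3) : $contents;
}

$withBom = UTF8_BOM . '<html>';
var_dump(has_utf8_bom($withBom));                 // starts with a BOM?
var_dump(strip_utf8_bom($withBom) === '<html>');  // BOM removed?
var_dump(has_utf8_bom('<html>'));                 // no BOM here
```

To audit a real file you could pass it the result of file_get_contents() for each of your includes and re-save any that turn out to have a BOM.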
STEP 2: Add the appropriate <meta> tag to your HTML header
For HTML:
<meta http-equiv="content-type" content="text/html; charset=utf-8">
For XHTML:
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
Another alternative for XHTML documents is to use the XML declaration to set the encoding for your web pages:
<?xml version="1.0" encoding="utf-8"?>
However, this approach has one serious downside: IE 6 will jump back to 1997 and render the page in quirks mode. So, it's best if you just stick to the meta tag approach.
STEP 3: Set up PHP to work with Unicode
Of course, Unicode isn't just about how data is stored on disk. Any program that handles that data needs to be able to handle multi-byte characters, and ensure that the data remains valid UTF-8 when it changes.
Unicode is not quite a first class citizen in PHP, so you'll have to do some tweaking to get it to grok UTF-8.
Firstly, you need to ensure that you have MBString enabled in your copy of PHP. If you're on Linux and using a packaged PHP, it may be installed by default. If not, it's probably just a case of adding it with:
$ yum install php-mbstring
...or whatever the equivalent is for your package manager.
If you build PHP from source, all you need to do is make sure you add
--enable-mbstring
...to your configure string.
Assuming you have multi-byte support built-in, now you need to make sure PHP knows that you want to handle text as UTF-8 internally. Add the following to an include that gets parsed before anything else, and you should be good to go:
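// Set up PHP for working with Unicode data
mb_internal_encoding('UTF-8');    // default encoding for the mb_* functions
mb_http_output('UTF-8');          // encoding used for HTTP output
mb_language('uni');               // "uni" (universal) means UTF-8, used e.g. by mb_send_mail()
mb_regex_encoding('UTF-8');       // encoding for the mb_ereg* functions
ob_start('mb_output_handler');    // convert buffered output to the mb_http_output() encoding
// Note: mb_http_input() only reports the detected input encoding (its argument
// is an input type such as 'G' or 'P'), so it can't set anything and is left out.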
If you're doing anything with strings other than reading them from a database and outputting them, you'll probably want to read about PHP's multi-byte functions. Basically, many string functions have multi-byte capable alternatives, with the prefix 'mb_'. So, substr() becomes mb_substr().
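Here's a small sketch of why that matters, assuming the mbstring extension is loaded and PHP 7 or later. The helper names first_bytes() and first_chars() are invented for the example. Plain substr() counts bytes, so it can cut a multi-byte character in half, while mb_substr() counts characters:

```php
<?php
// "résumé", written with escapes so the example doesn't depend on
// the encoding this file happens to be saved in.
$word = "r\u{00E9}sum\u{00E9}";

// Hypothetical helper: first $n bytes (what plain substr() gives you).
function first_bytes(string $text, int $n): string
{
    return substr($text, 0, $n);
}

// Hypothetical helper: first $n characters, treating $text as UTF-8.
function first_chars(string $text, int $n): string
{
    return mb_substr($text, 0, $n, 'UTF-8');
}

// Only run the demo if mbstring is actually available.
if (function_exists('mb_substr')) {
    echo strlen($word), " bytes\n";                    // 8
    echo mb_strlen($word, 'UTF-8'), " characters\n";   // 6

    // Cutting after 2 bytes splits the two-byte e-acute in half,
    // leaving a string that is no longer valid UTF-8.
    echo bin2hex(first_bytes($word, 2)), "\n";         // 72c3

    // Cutting after 2 characters keeps it whole.
    echo bin2hex(first_chars($word, 2)), "\n";         // 72c3a9
}
```

The same trap applies to things like strlen(), strpos() and strtoupper(), so it's worth going through any string handling code you have and checking which version you really want.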
STEP 4: Set up your database to store UTF-8
It's probably best to set UTF-8 at database level, that is, specifying the charset to use for each database, rather than having a single charset for the whole server. If you do things this way, you won't get caught out by the default character set changing when you move your database to a different server.
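Sample SQL for creating a database using UTF-8 in MySQL:
CREATE DATABASE mydatabase CHARACTER SET utf8 COLLATE utf8_unicode_ci;
Note that MySQL's utf8 charset stores at most 3 bytes per character, so it can't hold the 4-byte characters mentioned earlier; MySQL 5.5.3 and later offer utf8mb4 (with the utf8mb4_unicode_ci collation) for the full range.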
In Postgres, you can create databases from the terminal:
$ createdb -E UTF8 mydatabase
...or in SQL:
CREATE DATABASE mydatabase WITH ENCODING 'UTF8';
STEP 5: Set up your database server to handle UTF-8
We also need to tell our database server that we want to talk to it in UTF-8.
MySQL has a bewildering range of options for configuring charsets in my.cnf. The safest way to ensure your scripts are sending and receiving UTF-8 from MySQL is to set the character set of the connection _after_ you connect to the server, by sending these queries:
SET NAMES utf8;
SET CHARACTER SET utf8;
For Postgres, it's nearly the same thing:
SET NAMES 'UTF8';
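As a rough sketch of how you might wire this into your connection code: the example below assumes you connect through PDO, which is my assumption rather than a requirement, and the function names charset_queries() and use_utf8() are made up. It simply sends the queries shown above straight after connecting:

```php
<?php
// Hypothetical helper: the charset queries from this step, keyed by
// PDO driver name ("mysql" or "pgsql").
function charset_queries(string $driver): array
{
    switch ($driver) {
        case 'mysql':
            return ['SET NAMES utf8', 'SET CHARACTER SET utf8'];
        case 'pgsql':
            return ["SET NAMES 'UTF8'"];
        default:
            throw new InvalidArgumentException(
                "No charset queries known for driver: $driver"
            );
    }
}

// Hypothetical helper: call this straight after connecting, before
// any other queries are sent.
function use_utf8(PDO $connection): void
{
    $driver = $connection->getAttribute(PDO::ATTR_DRIVER_NAME);
    foreach (charset_queries($driver) as $sql) {
        $connection->exec($sql);
    }
}

// Usage (host, database name and credentials are placeholders):
//   $pdo = new PDO('mysql:host=localhost;dbname=mydatabase', $user, $password);
//   use_utf8($pdo);
```

Keeping this in one place means every script that connects gets the same charset setup, whatever the server's my.cnf happens to say.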
It might sound like all this is a lot of effort, but once you build this into your workflow, it becomes trivial, and your sites can enjoy Unicode goodness to the end of their days.
- Well, I laughed.
Posted by Ben @ 3:52 PM