clean UTF-8 | /contrib/famzah

This is my implementation of a Perl regular expression which sanitizes a multi-byte character sequence by filtering only the valid UTF-8 characters in it. Any non-UTF-8 character sequences are deleted and in the end you get a clean, valid UTF-8 multi-byte string.

Note that this works only for a subset of the UTF-8 alphabet. I.e. this is not a general filtering regular expression, but it leaves the standard ASCII and only the Cyrillic UTF-8 characters. You can easily extend the regular expression and add another UTF-8 subset.

Let’s get to the requirements:

Standard ASCII symbols: As it is described at the Wikipedia UTF-8 page, the ASCII characters from Hex 00-7F are encoded without modification in a UTF-8 sequence, as they are “Single-byte encoding (compatible with US-ASCII)”. Therefore, any character between Hex 00-7F is valid in a UTF-8 sequence. Though, for our current example, we will leave only certain ASCII symbols and namely a few of the control ones, and the printable ones:
- ASCII control symbols: \t -> Hex 09, \n -> Hex 0A, \r -> Hex 0D.
- Printable single-byte ASCII symbols: Hex 20-7E.
Cyrillic multi-byte UTF-8 characters, only the Russian/Bulgarian ones: If you open the Unicode/UTF-8 character table, and navigate to the “U+0400…U+04FF: Cyrillic” block, you can visually choose which characters you want to allow in your UTF-8 sequence by looking in the “character” column. In my case, I want to allow the characters “А”, “Б”, “В”, “Г” and so on until “ю”, “я”. If you look at the “UTF-8 (hex.)” column, you will notice that the range of these Cyrillic characters is from Hex d0 91 to Hex d0 bf, and from Hex d1 80 to Hex d1 8f. Yes, two ranges.

Therefore, our regular expression has to allow only the following sequences:

Single-byte, standard ASCII: \t, \n, \r, and x20-x7E.
Multi-byte, Cyrillic UTF-8: xD090-xD0BF, and xD180-xD18F.

Once you have established these rules, it’s very easy to construct the regular expression:

$my_string =~ s/.*?((?:[\t\n\r\x20-\x7E])+|(?:\xD0[\x90-\xBF])+|(?:\xD1[\x80-\x8F])+|).*?/$1/sg;

Update: 19/Nov/2010

If you want to allow some more characters, for example, the German umlaut letters “ä”, “ö”, “ü”, you have to include the following sequence too:

Multi-byte, UTF-8 Latin letters with diaeresis, tilde, etc: \xC380-\xC3BF.

The new UTF-8 filtering regular expression then becomes the following:

$my_string =~ s/.*?((?:[\t\n\r\x20-\x7E])+|(?:\xD0[\x90-\xBF])+|(?:\xD1[\x80-\x8F])+|(?:\xC3[\x80-\xBF])+|).*?/$1/sg;

If you are wondering why I would need only certain ASCII control and only the printable ASCII characters, the answer is – because of the XML standard. As the XML W3C Recommendations state, only certain Hex characters and character sequences are valid in an XML document, even as HTML entities: #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF].

Libexpat is very strict in what you feed as input, and if your input isn’t a valid UTF-8 sequence, you will end up with the error message “XML parse error: not well-formed (invalid token)”.

/contrib/famzah

Enthusiasm never stops

Tag Archives: clean UTF-8

Filter a character sequence leaving only valid UTF-8 characters