/contrib/famzah

Enthusiasm never stops


4 Comments

Parse XML into a PHP array

There are many different examples on how to parse an XML document into an array with PHP. What mine is different with is that it:

  • is very memory efficient by using PHP references (similar to pointers in C)
  • uses no recursion, thus there is no limit on the XML subtree levels
  • is very strict and paranoid about correctness

The parsing is done using XML Parser.

An example input XML data follows:

<?xml version="1.0" encoding="ISO-8859-1"?>
<root>
	<first_item>Test 1st item</first_item>
	<first_level_nested>
		<item idx="0">value #1</item>
		<item idx="1">value #2</item>
		<second_level_nested>
			<item idx="0">value #3</item>
			<item idx="1">value #4</item>
		</second_level_nested>
	</first_level_nested>
	<second_item>Test 2nd item</second_item>
</root>

There is one specific hack here. Since XML allows it to have an element with the same name multiple times on the same subtree level (see <item> on lines #05, #06, #08, #09), and at the same time it does not allow to have an element with only numeric name, we need to make the following exception for arrays which have numeric indexes:

  • If an element is named <item>, and it has an attribute named “idx”, then we will use this attribute as name, and respectively array key.

This is handled in the XmlCallback() class, method startElement(), lines #44, #45, #46, which are also highlighted. You can see the sources at the end of the article.

XML also allows it that an element contains both DATA and sub-elements. This cannot be parsed into a PHP array, and will result in an Exception.

The parsed PHP array would look like as follows:

Array
(
	[root] => Array
	(
		[first_item] => Test 1st item
		[first_level_nested] => Array
		(
			[0] => value #1
			[1] => value #2
			[second_level_nested] => Array
			(
				[0] => value #3
				[1] => value #4
			)

		)

		[second_item] => Test 2nd item
	)

)

If you liked the results, you can download the sources which follow (click “show source” below):

<?php

function xml_decode($output) {
	$xml_parser = xml_parser_create();
	$xml_callback = new XmlCallback();
	
	if (!xml_set_element_handler(
		$xml_parser,
		array($xml_callback, 'startElement'),
		array($xml_callback, 'endElement')
	)) throw new Exception('xml_set_element_handler() failed');
	if (!xml_set_character_data_handler($xml_parser, array($xml_callback, 'data'))) {
		throw new Exception('xml_set_character_data_handler() failed');
	}
	if (!xml_parser_set_option($xml_parser, XML_OPTION_CASE_FOLDING, 0)) {
		throw new Exception('xml_parser_set_option() failed');
	}
	
	if (!xml_parse($xml_parser, $output, TRUE)) {
		$xml_error = sprintf(
			"%s at line %d",
			xml_error_string(xml_get_error_code($xml_parser)),
			xml_get_current_line_number($xml_parser)
		);
		throw new Exception("XML error: $xml_error\nXML data: $output");
	}
	
	xml_parser_free($xml_parser);
	
	return $xml_callback->getResult();
}

class XmlCallback {
	private $ret = null;
	/* assign and use references directly to the array, or else you'll be in trouble */
	private $ptr_stack = array();
	private $level = 0;

	public function __construct() {
		$this->ptr_stack[$this->level] =& $this->ret;
	}

	public function startElement($parser, $name, $attrs) {
		if ($name == 'item' && isset($attrs['idx'])) {
			$name = $attrs['idx']; /* reconstruct arrays with numeric indexes */
		}

		if (!isset($this->ptr_stack[$this->level])) {
			$this->ptr_stack[$this->level] = array();
			$this->ptr_stack[$this->level][$name] = null;
		} else {
			if (!is_array($this->ptr_stack[$this->level])) {
				if (!strlen(trim($this->ptr_stack[$this->level]))) {
					/* if until now we got only whitespace (thus scalar data),
					but now we start a nested elements structure, discard this
					whitespace, as it is most probably just space between the
					element tags */
					$this->ptr_stack[$this->level] = array();
				} else {
					throw new Exception('Mixed array and scalar data');
				}
			}
			if (isset($this->ptr_stack[$this->level][$name])) {
				/* isset() == (isset() && !is_null()) */
				throw new Exception("Duplicate element name: $name");
			}
		}

		/* array_push() */
		++$this->level;
		$this->ptr_stack[$this->level] =& $this->ptr_stack[$this->level-1 /* MINUS ONE! */][$name];
	}

	public function endElement($parser, $name) {
		if (!array_key_exists($this->level, $this->ptr_stack)) {
			throw new Exception('XML non-existing reference');
		}

		/* array_pop() */
		unset($this->ptr_stack[$this->level]);
		--$this->level;

		if ($this->level < 0) throw new Exception('XML stack underflow');
	}

	public function data($parser, $data) {
		if (is_array($this->ptr_stack[$this->level])) {
			if (strlen(trim($data))) { # check if this is just whitespace
				throw new Exception('Mixed array and scalar data');
			} else {
				/* we tolerate AND skip whitespace, if we're already in
				a nested elements structure, as this whitespece is most
				probably just space between the element tags */
				return;
			}
		}
		if (is_null($this->ptr_stack[$this->level])) {
			$this->ptr_stack[$this->level] = ''; /* first data input */
		}
		$this->ptr_stack[$this->level] .= $data; /* we may be called several times, in chunks */
	}

	public function getResult() {
		return $this->ret;
	}
}

Update, 20/Jul/2011: The source code was modified to handle white-space better, in order to fix the following tricky sample XML input: <item6> &amp; &lt; </item6>

Update, 30/Jul/2011: Another bugfix which handles empty responses like: <response/>


References:


Leave a comment

Filter a character sequence leaving only valid UTF-8 characters

This is my implementation of a Perl regular expression which sanitizes a multi-byte character sequence by filtering only the valid UTF-8 characters in it. Any non-UTF-8 character sequences are deleted and in the end you get a clean, valid UTF-8 multi-byte string.

Note that this works only for a subset of the UTF-8 alphabet. I.e. this is not a general filtering regular expression, but it leaves the standard ASCII and only the Cyrillic UTF-8 characters. You can easily extend the regular expression and add another UTF-8 subset.

Let’s get to the requirements:

  • Standard ASCII symbols: As it is described at the Wikipedia UTF-8 page, the ASCII characters from Hex 00-7F are encoded without modification in a UTF-8 sequence, as they are “Single-byte encoding (compatible with US-ASCII)”. Therefore, any character between Hex 00-7F is valid in a UTF-8 sequence. Though, for our current example, we will leave only certain ASCII symbols and namely a few of the control ones, and the printable ones:
    • ASCII control symbols: \t -> Hex 09, \n -> Hex 0A, \r -> Hex 0D.
    • Printable single-byte ASCII symbols: Hex 20-7E.
  • Cyrillic multi-byte UTF-8 characters, only the Russian/Bulgarian ones: If you open the Unicode/UTF-8 character table, and navigate to the “U+0400…U+04FF: Cyrillic” block, you can visually choose which characters you want to allow in your UTF-8 sequence by looking in the “character” column. In my case, I want to allow the characters “А”, “Б”, “В”, “Г” and so on until “ю”, “я”. If you look at the “UTF-8 (hex.)” column, you will notice that the range of these Cyrillic characters is from Hex d0 91 to Hex d0 bf, and from Hex d1 80 to Hex d1 8f. Yes, two ranges.

Therefore, our regular expression has to allow only the following sequences:

  • Single-byte, standard ASCII: \t, \n, \r, and x20-x7E.
  • Multi-byte, Cyrillic UTF-8: xD090-xD0BF, and xD180-xD18F.

Once you have established these rules, it’s very easy to construct the regular expression:

$my_string =~ s/.*?((?:[\t\n\r\x20-\x7E])+|(?:\xD0[\x90-\xBF])+|(?:\xD1[\x80-\x8F])+|).*?/$1/sg;


Update: 19/Nov/2010

If you want to allow some more characters, for example, the German umlaut letters “ä”, “ö”, “ü”, you have to include the following sequence too:

  • Multi-byte, UTF-8 Latin letters with diaeresis, tilde, etc: \xC380-\xC3BF.

The new UTF-8 filtering regular expression then becomes the following:

$my_string =~ s/.*?((?:[\t\n\r\x20-\x7E])+|(?:\xD0[\x90-\xBF])+|(?:\xD1[\x80-\x8F])+|(?:\xC3[\x80-\xBF])+|).*?/$1/sg;


If you are wondering why I would need only certain ASCII control and only the printable ASCII characters, the answer is – because of the XML standard. As the XML W3C Recommendations state, only certain Hex characters and character sequences are valid in an XML document, even as HTML entities: #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF].

Libexpat is very strict in what you feed as input, and if your input isn’t a valid UTF-8 sequence, you will end up with the error message “XML parse error: not well-formed (invalid token)”.