Parse XML into a PHP array

There are many examples of how to parse an XML document into an array with PHP. Mine differs in that it:

  • is very memory efficient, because it uses PHP references (loosely similar to pointers in C)
  • uses no recursion, so the code itself imposes no limit on the depth of the XML subtree
  • is very strict and paranoid about correctness
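The first two points can be illustrated with a minimal sketch, separate from the parser itself. It assumes nothing beyond plain PHP; the helper name set_by_path() is hypothetical and does not appear in the sources below. A single reference is re-pointed one level deeper on every step, so a nested array is built in place with a loop instead of recursion:

```php
<?php
// Hypothetical helper (not part of the article's source): store a value deep
// inside a nested array by walking a list of keys with one reference.
function set_by_path(array &$root, array $path, $value) {
	$ptr =& $root;                 // $ptr aliases $root, nothing is copied
	foreach ($path as $key) {
		if (!isset($ptr[$key]) || !is_array($ptr[$key])) {
			$ptr[$key] = array();  // create the missing level in place
		}
		$ptr =& $ptr[$key];        // re-point the reference one level deeper
	}
	$ptr = $value;                 // the write goes through to $root
}

$tree = array();
set_by_path(
	$tree,
	array('root', 'first_level_nested', 'second_level_nested', 0),
	'value #3'
);
print_r($tree);
```

The XmlCallback class below follows the same idea, but keeps a stack of such references so that it can step back up when an element closes.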

The parsing is done with PHP's XML Parser extension (xml_parser_create(), xml_parse() and the related handler functions).

An example input XML data follows:

<?xml version="1.0" encoding="ISO-8859-1"?>
<root>
	<first_item>Test 1st item</first_item>
	<first_level_nested>
		<item idx="0">value #1</item>
		<item idx="1">value #2</item>
		<second_level_nested>
			<item idx="0">value #3</item>
			<item idx="1">value #4</item>
		</second_level_nested>
	</first_level_nested>
	<second_item>Test 2nd item</second_item>
</root>

There is one specific hack here. XML allows several elements with the same name on the same subtree level (see the four <item> elements in the sample above), but it does not allow an element whose name is purely numeric. To represent arrays with numeric indexes, we therefore need the following exception:

  • If an element is named <item> and it has an attribute named “idx”, then the value of this attribute is used as the element name, and therefore as the array key.

This is handled in the XmlCallback class, in the first “if” block of the startElement() method (the check for $name == 'item' and $attrs['idx']). You can see the sources at the end of the article.
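As a stand-alone illustration of this rule, here is a small sketch. The function name element_key() is hypothetical and is not part of the sources at the end of the article; it only isolates the renaming decision:

```php
<?php
// Illustrative version of the renaming rule; element_key() is a hypothetical
// name and does not appear in the article's source.
function element_key($name, array $attrs) {
	if ($name === 'item' && isset($attrs['idx'])) {
		return $attrs['idx'];  // attribute values arrive as strings, e.g. "1"
	}
	return $name;              // every other element keeps its own name
}

var_dump(element_key('item', array('idx' => '1')));       // string(1) "1"
var_dump(element_key('item', array()));                   // string(4) "item"
var_dump(element_key('first_item', array('idx' => '1'))); // string(10) "first_item"
```

Note that PHP converts a decimal string such as "1" to the integer 1 when it is used as an array key, which is why the parsed array ends up with real numeric indexes.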

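XML also allows an element to contain both character data and sub-elements (mixed content). Such an element cannot be represented in this PHP array format, because an array key holds either a string or a nested array, so it results in an Exception.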
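To see what mixed content looks like from the parser's point of view, the following sketch records the order of the callbacks for a tiny hand-written input. It is only an illustration and assumes the XML Parser extension is available; the $events array and the input string are made up for this example:

```php
<?php
// Record the order of XML Parser callbacks for an element with mixed content.
$events = array();
$parser = xml_parser_create();
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);
xml_set_element_handler(
	$parser,
	function ($p, $name, $attrs) use (&$events) { $events[] = "start:$name"; },
	function ($p, $name) use (&$events) { $events[] = "end:$name"; }
);
xml_set_character_data_handler(
	$parser,
	function ($p, $data) use (&$events) { $events[] = "data:$data"; }
);

// <a> holds the text "text" and the sub-element <b> at the same time.
xml_parse($parser, '<a>text<b>x</b></a>', true);
print_r($events);
```

Non-whitespace data arrives for <a> before <b> is opened, so <a> would have to be a string and an array at once. This is the situation that the 'Mixed array and scalar data' exception reports.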

The parsed PHP array looks as follows:

Array
(
	[root] => Array
	(
		[first_item] => Test 1st item
		[first_level_nested] => Array
		(
			[0] => value #1
			[1] => value #2
			[second_level_nested] => Array
			(
				[0] => value #3
				[1] => value #4
			)

		)

		[second_item] => Test 2nd item
	)

)
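Reading values back out of such a structure is plain array access. In the sketch below the array is typed in by hand to mirror the dump above (in practice it would be the return value of xml_decode()), and the variable name $parsed is only an assumption for the example:

```php
<?php
// Hand-written copy of the dump above, used only to show how it is accessed.
$parsed = array(
	'root' => array(
		'first_item' => 'Test 1st item',
		'first_level_nested' => array(
			0 => 'value #1',
			1 => 'value #2',
			'second_level_nested' => array(0 => 'value #3', 1 => 'value #4'),
		),
		'second_item' => 'Test 2nd item',
	),
);

// Element names become string keys, <item idx="N"> becomes numeric key N.
echo $parsed['root']['first_item'], "\n";
echo $parsed['root']['first_level_nested'][1], "\n";
echo $parsed['root']['first_level_nested']['second_level_nested'][0], "\n";
```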

If you like the results, the complete source follows:

<?php

function xml_decode($output) {
	$xml_parser = xml_parser_create();
	$xml_callback = new XmlCallback();
	
	if (!xml_set_element_handler(
		$xml_parser,
		array($xml_callback, 'startElement'),
		array($xml_callback, 'endElement')
	)) throw new Exception('xml_set_element_handler() failed');
	if (!xml_set_character_data_handler($xml_parser, array($xml_callback, 'data'))) {
		throw new Exception('xml_set_character_data_handler() failed');
	}
	if (!xml_parser_set_option($xml_parser, XML_OPTION_CASE_FOLDING, 0)) {
		throw new Exception('xml_parser_set_option() failed');
	}
	
	if (!xml_parse($xml_parser, $output, TRUE)) {
		$xml_error = sprintf(
			"%s at line %d",
			xml_error_string(xml_get_error_code($xml_parser)),
			xml_get_current_line_number($xml_parser)
		);
		throw new Exception("XML error: $xml_error\nXML data: $output");
	}
	
	xml_parser_free($xml_parser);
	
	return $xml_callback->getResult();
}

class XmlCallback {
	private $ret = null;
	/* assign and use references directly to the array, or else you'll be in trouble */
	private $ptr_stack = array();
	private $level = 0;

	public function __construct() {
		$this->ptr_stack[$this->level] =& $this->ret;
	}

	public function startElement($parser, $name, $attrs) {
		if ($name == 'item' && isset($attrs['idx'])) {
			$name = $attrs['idx']; /* reconstruct arrays with numeric indexes */
		}

		if (!isset($this->ptr_stack[$this->level])) {
			$this->ptr_stack[$this->level] = array();
			$this->ptr_stack[$this->level][$name] = null;
		} else {
			if (!is_array($this->ptr_stack[$this->level])) {
				if (!strlen(trim($this->ptr_stack[$this->level]))) {
					/* if until now we got only whitespace (thus scalar data),
					but now we start a nested elements structure, discard this
					whitespace, as it is most probably just space between the
					element tags */
					$this->ptr_stack[$this->level] = array();
				} else {
					throw new Exception('Mixed array and scalar data');
				}
			}
			if (isset($this->ptr_stack[$this->level][$name])) {
				/* note: isset() is also false when the key exists but holds null */
				throw new Exception("Duplicate element name: $name");
			}
		}

		/* array_push() */
		++$this->level;
		$this->ptr_stack[$this->level] =& $this->ptr_stack[$this->level-1 /* MINUS ONE! */][$name];
	}

	public function endElement($parser, $name) {
		if (!array_key_exists($this->level, $this->ptr_stack)) {
			throw new Exception('XML non-existing reference');
		}

		/* array_pop() */
		unset($this->ptr_stack[$this->level]);
		--$this->level;

		if ($this->level < 0) throw new Exception('XML stack underflow');
	}

	public function data($parser, $data) {
		if (is_array($this->ptr_stack[$this->level])) {
			if (strlen(trim($data))) { # check if this is just whitespace
				throw new Exception('Mixed array and scalar data');
			} else {
				/* we tolerate AND skip whitespace, if we're already in
				a nested elements structure, as this whitespace is most
				probably just space between the element tags */
				return;
			}
		}
		if (is_null($this->ptr_stack[$this->level])) {
			$this->ptr_stack[$this->level] = ''; /* first data input */
		}
		$this->ptr_stack[$this->level] .= $data; /* we may be called several times, in chunks */
	}

	public function getResult() {
		return $this->ret;
	}
}

Update, 20/Jul/2011: The source code was modified to handle white-space better, in order to fix the following tricky sample XML input: <item6> &amp; &lt; </item6>
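The input above is tricky presumably because the parser may deliver the text of one element in several pieces, and some of those pieces consist of whitespace only. The following sketch (an illustration that assumes the XML Parser extension is available; $chunks is a made-up name) collects every piece passed to the character data handler. The number of pieces depends on the underlying XML library, so only the concatenated result should be relied upon:

```php
<?php
// Collect every chunk that the parser passes to the character data handler.
$chunks = array();
$parser = xml_parser_create();
xml_set_character_data_handler(
	$parser,
	function ($p, $data) use (&$chunks) { $chunks[] = $data; }
);
xml_parse($parser, '<item6> &amp; &lt; </item6>', true);

var_dump(count($chunks));       // number of handler calls, library-dependent
var_dump(implode('', $chunks)); // the complete element text
```

A handler that discards whitespace-only chunks one by one would lose the spaces around “&” and “<”. This is why data() above appends every chunk, and whitespace is dropped only when a nested element starts.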

Update, 30/Jul/2011: Another bugfix which handles empty responses like: <response/>

Author: Ivan Zahariev

An experienced Linux & IT enthusiast, Engineer by heart, Systems architect & developer.

4 thoughts on “Parse XML into a PHP array”

  1. I would recommend fixing the duplicate name problem. That is the only thing I could find wrong with your script. I tried using it with the response result sent back from the USPS API and it broke on multiple SpecialServices. Being that I can’t control the results and there being duplicate names without index, I couldn’t use your (otherwise nice) script. Just a thought, but maybe try counting the number of duplicate names in a result set and setting an index to them yourself?

    • I had a different idea when creating this PHP example. The purpose is to parse an XML document into a PHP array where the keys of the PHP array use the very same names as the XML element names. The exception (<item idx="0">) was introduced only because XML does not support an element with a numeric name, like <0>.

      Therefore, if you want to parse an XML document that has duplicate names for XML elements on the same level, my implementation won’t work out of the box. I’m sorry that I don’t have time to develop a version which suits your needs.
