There are many examples of how to parse an XML document into an array with PHP. Mine differs in that it:
- is memory efficient, because it builds the array in place through PHP references (loosely comparable to pointers in C)
- uses no recursion, so the code itself imposes no limit on the nesting depth of the XML tree
- is very strict and paranoid about correctness
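To make the first two points concrete, here is a minimal sketch (not the parser itself) of how a stack of references can build a nested array in a plain loop. The function name build_nested() and the "open"/"close"/"text" event format are hypothetical and exist only for this illustration.

```php
<?php
// Hypothetical helper: builds a nested array from a flat list of events,
// using a stack of references instead of recursion.
function build_nested(array $events) {
    $result = null;
    $stack = array();
    $level = 0;
    $stack[$level] =& $result;          // level 0 refers to the result itself

    foreach ($events as $event) {
        list($type, $value) = $event;
        if ($type === 'open') {
            if (!is_array($stack[$level])) {
                $stack[$level] = array();   // written through the reference
            }
            // push: the new top of the stack refers to the child slot
            $stack[$level + 1] =& $stack[$level][$value];
            ++$level;
        } elseif ($type === 'close') {
            // pop: only the reference is dropped, the data stays in $result
            unset($stack[$level]);
            --$level;
        } else {
            $stack[$level] = $value;        // scalar data for the current element
        }
    }
    return $result;
}

$tree = build_nested(array(
    array('open', 'root'),
    array('open', 'child'),
    array('text', 'hello'),
    array('close', 'child'),
    array('close', 'root'),
));
print_r($tree);
```

Nothing is copied while the tree grows: every write goes straight into the final array, and the depth is tracked by the $level counter rather than by nested function calls.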
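The parsing is done with PHP's XML Parser extension (the xml_* functions), which reports start tags, end tags and character data through callbacks.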
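As a rough illustration of that event model, the following sketch only records the callbacks the XML Parser fires for a tiny document. The $events array and the "start:"/"end:"/"data:" labels are made up for this example.

```php
<?php
$events = array();
$parser = xml_parser_create();
// keep element names as written instead of upper-casing them
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

xml_set_element_handler(
    $parser,
    function ($parser, $name, $attrs) use (&$events) {
        $events[] = "start:$name";
    },
    function ($parser, $name) use (&$events) {
        $events[] = "end:$name";
    }
);
xml_set_character_data_handler(
    $parser,
    function ($parser, $data) use (&$events) {
        if (trim($data) !== '') {       // ignore whitespace between tags
            $events[] = "data:$data";
        }
    }
);

if (!xml_parse($parser, '<root><a>x</a></root>', true)) {
    throw new Exception(xml_error_string(xml_get_error_code($parser)));
}
print_r($events);
```

The array builder shown at the end of the article reacts to exactly these three kinds of events.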
An example XML input follows:
<?xml version="1.0" encoding="ISO-8859-1"?>
<root>
  <first_item>Test 1st item</first_item>
  <first_level_nested>
    <item idx="0">value #1</item>
    <item idx="1">value #2</item>
    <second_level_nested>
      <item idx="0">value #3</item>
      <item idx="1">value #4</item>
    </second_level_nested>
  </first_level_nested>
  <second_item>Test 2nd item</second_item>
</root>
There is one specific hack here. XML allows several elements with the same name on the same subtree level (see <item> on lines #05, #06, #08, #09 of the example), but it does not allow an element name that consists only of digits. To represent arrays with numeric indexes, we therefore make the following exception:
- If an element is named <item> and has an attribute named "idx", the value of this attribute is used as the element name, and therefore as the array key.
This is handled in the startElement() method of the XmlCallback class, by the check on $name and $attrs['idx'] at the top of the method. You can see the sources at the end of the article.
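The rule can be sketched in isolation as follows. The helper name resolve_key() is hypothetical; in the actual sources the same check sits inline in startElement().

```php
<?php
// Hypothetical helper: picks the array key for an element.
// <item idx="N"> becomes key N, any other element keeps its own name.
function resolve_key($name, array $attrs) {
    if ($name == 'item' && isset($attrs['idx'])) {
        return $attrs['idx'];
    }
    return $name;
}

$arr = array();
$arr[resolve_key('item', array('idx' => '0'))] = 'value #1';
$arr[resolve_key('item', array('idx' => '1'))] = 'value #2';
$arr[resolve_key('second_item', array())] = 'Test 2nd item';
var_dump(array_keys($arr));
```

Since PHP converts numeric string keys such as '0' to integers, the result is an ordinary numerically indexed array next to the named keys.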
XML also allows an element to contain both character data and sub-elements (mixed content). This cannot be represented in a plain PHP array, so the parser throws an Exception.
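To show what counts as mixed content, here is a small standalone sketch that only detects it. The function has_mixed_content() is hypothetical and is not part of the sources below, which reject such input while building the array instead of checking it separately.

```php
<?php
// Hypothetical helper: returns true if any element holds both
// non-whitespace text and child elements.
function has_mixed_content($xml) {
    $stack = array();   // one entry per open element: seen text? seen child?
    $mixed = false;
    $parser = xml_parser_create();

    xml_set_element_handler(
        $parser,
        function ($p, $name, $attrs) use (&$stack, &$mixed) {
            $top = count($stack) - 1;
            if ($top >= 0) {
                $stack[$top]['child'] = true;
                if ($stack[$top]['text']) {
                    $mixed = true;      // text came first, then a child
                }
            }
            $stack[] = array('text' => false, 'child' => false);
        },
        function ($p, $name) use (&$stack) {
            array_pop($stack);
        }
    );
    xml_set_character_data_handler(
        $parser,
        function ($p, $data) use (&$stack, &$mixed) {
            $top = count($stack) - 1;
            if ($top < 0 || trim($data) === '') {
                return;                 // whitespace between tags is tolerated
            }
            $stack[$top]['text'] = true;
            if ($stack[$top]['child']) {
                $mixed = true;          // a child came first, then text
            }
        }
    );

    if (!xml_parse($parser, $xml, true)) {
        throw new Exception(xml_error_string(xml_get_error_code($parser)));
    }
    return $mixed;
}

var_dump(has_mixed_content('<a>text<b>child</b></a>'));
var_dump(has_mixed_content('<a> <b>child</b> </a>'));
```

The second document contains only whitespace around <b>, which is the case the parser tolerates.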
The parsed PHP array looks as follows:
Array
(
    [root] => Array
        (
            [first_item] => Test 1st item
            [first_level_nested] => Array
                (
                    [0] => value #1
                    [1] => value #2
                    [second_level_nested] => Array
                        (
                            [0] => value #3
                            [1] => value #4
                        )

                )

            [second_item] => Test 2nd item
        )

)
If you like the result, the full sources follow:
<?php
function xml_decode($output) {
    $xml_parser = xml_parser_create();
    $xml_callback = new XmlCallback();
    if (!xml_set_element_handler(
        $xml_parser,
        array($xml_callback, 'startElement'),
        array($xml_callback, 'endElement')
    )) {
        throw new Exception('xml_set_element_handler() failed');
    }
    if (!xml_set_character_data_handler($xml_parser, array($xml_callback, 'data'))) {
        throw new Exception('xml_set_character_data_handler() failed');
    }
    if (!xml_parser_set_option($xml_parser, XML_OPTION_CASE_FOLDING, 0)) {
        throw new Exception('xml_parser_set_option() failed');
    }
    if (!xml_parse($xml_parser, $output, TRUE)) {
        $xml_error = sprintf(
            "%s at line %d",
            xml_error_string(xml_get_error_code($xml_parser)),
            xml_get_current_line_number($xml_parser)
        );
        throw new Exception("XML error: $xml_error\nXML data: $output");
    }
    xml_parser_free($xml_parser);
    return $xml_callback->getResult();
}

class XmlCallback {
    private $ret = null;
    /* assign and use references directly to the array, or else you'll be in trouble */
    private $ptr_stack = array();
    private $level = 0;

    public function __construct() {
        $this->ptr_stack[$this->level] =& $this->ret;
    }

    public function startElement($parser, $name, $attrs) {
        if ($name == 'item' && isset($attrs['idx'])) {
            $name = $attrs['idx']; /* reconstruct arrays with numeric indexes */
        }
        if (!isset($this->ptr_stack[$this->level])) {
            $this->ptr_stack[$this->level] = array();
            $this->ptr_stack[$this->level][$name] = null;
        } else {
            if (!is_array($this->ptr_stack[$this->level])) {
                if (!strlen(trim($this->ptr_stack[$this->level]))) {
                    /* if until now we got only whitespace (thus scalar data),
                       but now we start a nested elements structure, discard this
                       whitespace, as it is most probably just space between the
                       element tags */
                    $this->ptr_stack[$this->level] = array();
                } else {
                    throw new Exception('Mixed array and scalar data');
                }
            }
            if (isset($this->ptr_stack[$this->level][$name])) {
                /* isset() == (isset() && !is_null()) */
                throw new Exception("Duplicate element name: $name");
            }
        }
        /* array_push() */
        ++$this->level;
        $this->ptr_stack[$this->level] =& $this->ptr_stack[$this->level-1 /* MINUS ONE! */][$name];
    }

    public function endElement($parser, $name) {
        if (!array_key_exists($this->level, $this->ptr_stack)) {
            throw new Exception('XML non-existing reference');
        }
        /* array_pop() */
        unset($this->ptr_stack[$this->level]);
        --$this->level;
        if ($this->level < 0) throw new Exception('XML stack underflow');
    }

    public function data($parser, $data) {
        if (is_array($this->ptr_stack[$this->level])) {
            if (strlen(trim($data))) { # check if this is just whitespace
                throw new Exception('Mixed array and scalar data');
            } else {
                /* we tolerate AND skip whitespace, if we're already in
                   a nested elements structure, as this whitespace is most
                   probably just space between the element tags */
                return;
            }
        }
        if (is_null($this->ptr_stack[$this->level])) {
            $this->ptr_stack[$this->level] = ''; /* first data input */
        }
        $this->ptr_stack[$this->level] .= $data; /* we may be called several times, in chunks */
    }

    public function getResult() {
        return $this->ret;
    }
}

Update, 20/Jul/2011: The source code was modified to handle whitespace better, in order to fix the following tricky sample XML input: <item6> &amp; &lt; </item6>
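The reason such input is tricky is that the character data handler may be called several times for one element, and some of the chunks can consist of whitespace only. The sketch below collects the raw chunks for that sample, assuming the entities are written as &amp; and &lt; in the XML source. How the text is split depends on the underlying parser library, so only the concatenated value is meaningful.

```php
<?php
$chunks = array();
$parser = xml_parser_create();
xml_set_character_data_handler(
    $parser,
    function ($parser, $data) use (&$chunks) {
        $chunks[] = $data;              // keep every chunk, including whitespace
    }
);

if (!xml_parse($parser, '<item6> &amp; &lt; </item6>', true)) {
    throw new Exception(xml_error_string(xml_get_error_code($parser)));
}

var_dump($chunks);                      // the individual chunks as delivered
$value = implode('', $chunks);          // the complete text of the element
var_dump($value);
```

This is why data() appends with .= instead of assigning, and why whitespace may only be discarded once a nested element actually starts.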
Update, 30/Jul/2011: Another bugfix which handles empty responses like: <response/>