There are many examples of how to parse an XML document into an array with PHP. Mine differs in that it:
- is very memory efficient by using PHP references (similar to pointers in C)
- uses no recursion, thus there is no limit on the XML subtree levels
- is very strict and paranoid about correctness
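The first two points can be illustrated with a minimal sketch. It is not the article's parser: build_tree() and its event list (['open', name], ['text', value], ['close']) are hypothetical stand-ins for the parser callbacks, and all error checking is left out. It only shows how a stack of references lets a loop write directly into a nested array without recursion.

```php
<?php
// Hypothetical helper, for illustration only: builds a nested array from a
// flat list of events using a stack of references instead of recursion.
function build_tree(array $events)
{
    $result = null;
    $stack = [];                 // stack of references into $result
    $level = 0;
    $stack[$level] =& $result;   // the bottom of the stack aliases the result

    foreach ($events as $event) {
        switch ($event[0]) {
            case 'open':
                if (!is_array($stack[$level])) {
                    $stack[$level] = [];      // first child: turn the slot into an array
                }
                $name = $event[1];
                $stack[$level][$name] = null;
                // Push: the new top of the stack aliases the child slot.
                $stack[$level + 1] =& $stack[$level][$name];
                ++$level;
                break;
            case 'text':
                $stack[$level] = $event[1];   // writes straight into $result
                break;
            case 'close':
                unset($stack[$level]);        // pop: drops the alias, not the data
                --$level;
                break;
        }
    }
    return $result;
}

$tree = build_tree([
    ['open', 'root'],
    ['open', 'a'], ['text', 'one'], ['close'],
    ['open', 'b'],
    ['open', 'c'], ['text', 'two'], ['close'],
    ['close'],
    ['close'],
]);
print_r($tree);
```

Because each stack entry is an alias for a slot inside the result, no subtree is ever copied or returned from a nested call. Nesting depth only grows the stack array, not the call stack.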
The parsing is done with PHP's XML Parser extension (xml_parser_create() and the related functions).
An example XML input follows:
<?xml version="1.0" encoding="ISO-8859-1"?>
<root>
  <first_item>Test 1st item</first_item>
  <first_level_nested>
    <item idx="0">value #1</item>
    <item idx="1">value #2</item>
    <second_level_nested>
      <item idx="0">value #3</item>
      <item idx="1">value #4</item>
    </second_level_nested>
  </first_level_nested>
  <second_item>Test 2nd item</second_item>
</root>
There is one specific hack here. XML allows several elements with the same name on the same subtree level (see <item> on lines #05, #06, #08, #09), but it does not allow an element whose name is purely numeric. Arrays with numeric indexes therefore need the following exception:
- If an element is named <item> and has an attribute named “idx”, the value of that attribute is used as the element name, and therefore as the array key.
This is handled at the very beginning of the startElement() method of the XmlCallback class. You can see the sources at the end of the article.
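As a rough illustration of that rule in isolation, here is a small sketch. The function name array_key_for() is hypothetical and does not appear in the article's source; the condition itself mirrors the check in startElement().

```php
<?php
// Hypothetical helper: picks the array key for an element, following the
// "<item idx=...>" exception described above.
function array_key_for(string $name, array $attrs): string
{
    if ($name === 'item' && isset($attrs['idx'])) {
        return $attrs['idx'];   // use the attribute value instead of the element name
    }
    return $name;               // every other element keeps its own name
}

$array = [];
$array[array_key_for('item', ['idx' => '0'])] = 'value #1';
$array[array_key_for('item', ['idx' => '1'])] = 'value #2';
$array[array_key_for('first_item', [])] = 'Test 1st item';

var_dump(array_keys($array));
```

Note that PHP converts numeric string keys such as '0' to integers when they are used as array keys, which is why the parsed result shows plain numeric indexes.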
XML also allows an element to contain both character data and sub-elements (mixed content). This cannot be represented in the resulting PHP array, so the parser throws an Exception.
The parsed PHP array looks as follows:
Array
(
    [root] => Array
        (
            [first_item] => Test 1st item
            [first_level_nested] => Array
                (
                    [0] => value #1
                    [1] => value #2
                    [second_level_nested] => Array
                        (
                            [0] => value #3
                            [1] => value #4
                        )
                )
            [second_item] => Test 2nd item
        )
)
If you like the results, the full source follows:
<?php

function xml_decode($output)
{
    $xml_parser = xml_parser_create();
    $xml_callback = new XmlCallback();

    if (!xml_set_element_handler(
        $xml_parser,
        array($xml_callback, 'startElement'),
        array($xml_callback, 'endElement')
    )) {
        throw new Exception('xml_set_element_handler() failed');
    }
    if (!xml_set_character_data_handler($xml_parser, array($xml_callback, 'data'))) {
        throw new Exception('xml_set_character_data_handler() failed');
    }
    if (!xml_parser_set_option($xml_parser, XML_OPTION_CASE_FOLDING, 0)) {
        throw new Exception('xml_parser_set_option() failed');
    }
    if (!xml_parse($xml_parser, $output, true)) {
        $xml_error = sprintf(
            "%s at line %d",
            xml_error_string(xml_get_error_code($xml_parser)),
            xml_get_current_line_number($xml_parser)
        );
        throw new Exception("XML error: $xml_error\nXML data: $output");
    }
    xml_parser_free($xml_parser);

    return $xml_callback->getResult();
}

class XmlCallback
{
    private $ret = null;

    /* assign and use references directly to the array, or else you'll be in trouble */
    private $ptr_stack = array();
    private $level = 0;

    public function __construct()
    {
        $this->ptr_stack[$this->level] =& $this->ret;
    }

    public function startElement($parser, $name, $attrs)
    {
        if ($name === 'item' && isset($attrs['idx'])) {
            $name = $attrs['idx']; /* reconstruct arrays with numeric indexes */
        }

        if (!isset($this->ptr_stack[$this->level])) {
            $this->ptr_stack[$this->level] = array();
            $this->ptr_stack[$this->level][$name] = null;
        } else {
            if (!is_array($this->ptr_stack[$this->level])) {
                if (!strlen(trim($this->ptr_stack[$this->level]))) {
                    /* if until now we got only whitespace (thus scalar data),
                       but now we start a nested elements structure, discard
                       this whitespace, as it is most probably just space
                       between the element tags */
                    $this->ptr_stack[$this->level] = array();
                } else {
                    throw new Exception('Mixed array and scalar data');
                }
            }
            if (isset($this->ptr_stack[$this->level][$name])) {
                /* isset() == (isset() && !is_null()) */
                throw new Exception("Duplicate element name: $name");
            }
        }

        /* array_push() */
        ++$this->level;
        $this->ptr_stack[$this->level] =& $this->ptr_stack[$this->level - 1 /* MINUS ONE! */][$name];
    }

    public function endElement($parser, $name)
    {
        if (!array_key_exists($this->level, $this->ptr_stack)) {
            throw new Exception('XML non-existing reference');
        }

        /* array_pop() */
        unset($this->ptr_stack[$this->level]);
        --$this->level;

        if ($this->level < 0) {
            throw new Exception('XML stack underflow');
        }
    }

    public function data($parser, $data)
    {
        if (is_array($this->ptr_stack[$this->level])) {
            if (strlen(trim($data))) { /* check if this is just whitespace */
                throw new Exception('Mixed array and scalar data');
            } else {
                /* we tolerate AND skip whitespace, if we're already in a
                   nested elements structure, as this whitespace is most
                   probably just space between the element tags */
                return;
            }
        }

        if (is_null($this->ptr_stack[$this->level])) {
            $this->ptr_stack[$this->level] = ''; /* first data input */
        }
        $this->ptr_stack[$this->level] .= $data; /* we may be called several times, in chunks */
    }

    public function getResult()
    {
        return $this->ret;
    }
}
Update, 20/Jul/2011: The source code was modified to handle whitespace better, in order to fix the following tricky sample XML input: <item6> &amp; &lt; </item6>
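To see why such input is tricky, the following sketch records what the character data handler receives. It assumes the sample was originally written with escaped entities (&amp;amp; and &amp;lt;), since a bare & or < would not be well-formed XML. The variable names are made up for this example.

```php
<?php
// Collect every chunk passed to the character data handler.
$chunks = [];

$parser = xml_parser_create();
xml_set_character_data_handler($parser, function ($parser, $data) use (&$chunks) {
    $chunks[] = $data;
});

// Assumption: the sample input used escaped entities.
if (!xml_parse($parser, '<item6> &amp; &lt; </item6>', true)) {
    throw new Exception(xml_error_string(xml_get_error_code($parser)));
}

var_dump($chunks);               // the handler may be called several times
$value = implode('', $chunks);   // only the concatenation is the element's text
var_dump($value);
```

The parser is free to deliver the text in several pieces, and some of those pieces may consist of whitespace only. A handler that treats every whitespace-only chunk as "space between tags" would lose part of the value, so whitespace can only be discarded once a child element actually starts.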
Update, 30/Jul/2011: Another bugfix which handles empty responses like: <response/>
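The empty-element case can be observed with a short sketch that logs the handler calls. The event labels are invented for this example and are not part of the article's source.

```php
<?php
// Log which handlers fire for an empty element.
$events = [];

$parser = xml_parser_create();
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);
xml_set_element_handler(
    $parser,
    function ($parser, $name, $attrs) use (&$events) {
        $events[] = "start:$name";
    },
    function ($parser, $name) use (&$events) {
        $events[] = "end:$name";
    }
);
xml_set_character_data_handler($parser, function ($parser, $data) use (&$events) {
    $events[] = "data:$data";
});

if (!xml_parse($parser, '<response/>', true)) {
    throw new Exception(xml_error_string(xml_get_error_code($parser)));
}

print_r($events);
```

Only the start and end handlers are called, so no data ever arrives for the element. In the code above this should leave the element's value as null rather than an empty string.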