Parse XML into a PHP array

July 15, 2011

There are many examples of how to parse an XML document into an array with PHP. Mine differs in that it:

  • is very memory efficient by using PHP references (similar to pointers in C)
  • uses no recursion, thus there is no limit on the XML subtree levels
  • is very strict and paranoid about correctness

The parsing is done with PHP's XML Parser extension (the xml_parser_create() family of functions).
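
To illustrate the reference technique in isolation, here is a minimal sketch that is not taken from the sources at the end of the article. The helper name set_by_path() is hypothetical. It walks down a nested array with a reference, much like the ptr_stack member of the XmlCallback class does, so no recursion is needed:

```php
<?php
// Hypothetical helper (not part of the original sources): stores $value
// under the nested keys listed in $path, descending with a reference
// instead of recursing.
function set_by_path(array &$tree, array $path, $value) {
	$ptr =& $tree;              // reference to the current level
	foreach ($path as $key) {
		if (!is_array($ptr)) {
			$ptr = array();     // create the level on first use
		}
		$ptr =& $ptr[$key];     // descend; a missing key is created as null
	}
	$ptr = $value;              // writes straight into $tree
}

$tree = array();
set_by_path($tree, array('root', 'first_item'), 'Test 1st item');
set_by_path($tree, array('root', 'first_level_nested', 0), 'value #1');
print_r($tree);
```

Because $ptr is rebound to a deeper element on each step, the final assignment modifies the original array in place.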

An example input XML data follows:

<?xml version="1.0" encoding="ISO-8859-1"?>
<root>
	<first_item>Test 1st item</first_item>
	<first_level_nested>
		<item idx="0">value #1</item>
		<item idx="1">value #2</item>
		<second_level_nested>
			<item idx="0">value #3</item>
			<item idx="1">value #4</item>
		</second_level_nested>
	</first_level_nested>
	<second_item>Test 2nd item</second_item>
</root>

There is one specific hack here. XML allows an element with the same name to appear multiple times on the same subtree level (see <item> on lines 5, 6, 8 and 9 of the sample above), but it does not allow an element name that consists only of digits. To represent arrays with numeric indexes, we therefore make the following exception:

  • If an element is named <item>, and it has an attribute named “idx”, then we use the value of this attribute as the element name, and therefore as the array key.

This is handled in the startElement() method of the XmlCallback class, in the if block at the very start of the method (lines 44 to 46 of the listing). You can see the sources at the end of the article.
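
A small sketch of what this renaming means for the resulting array, assuming the attribute values from the sample above. The $items structure is made up for the illustration. PHP casts string keys that look like decimal integers (such as "0" or "1") to integer keys, so the result is an ordinary numerically indexed array:

```php
<?php
// Made-up stand-in for the two <item> elements of the sample:
// the "idx" attribute arrives as a string.
$items = array(
	array('idx' => '0', 'value' => 'value #1'),
	array('idx' => '1', 'value' => 'value #2'),
);

$result = array();
foreach ($items as $item) {
	$name = $item['idx'];            // the string "0" or "1"
	$result[$name] = $item['value']; // stored under the integer key 0 or 1
}

var_dump(array_keys($result));
```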

XML also allows an element to contain both character data and sub-elements (mixed content). This cannot be represented in a plain PHP array, and results in an Exception.

The parsed PHP array looks as follows:

Array
(
	[root] => Array
	(
		[first_item] => Test 1st item
		[first_level_nested] => Array
		(
			[0] => value #1
			[1] => value #2
			[second_level_nested] => Array
			(
				[0] => value #3
				[1] => value #4
			)

		)

		[second_item] => Test 2nd item
	)

)

If you like the results, the sources follow:

<?php

function xml_decode($output) {
	$xml_parser = xml_parser_create();
	$xml_callback = new XmlCallback();
	
	if (!xml_set_element_handler(
		$xml_parser,
		array($xml_callback, 'startElement'),
		array($xml_callback, 'endElement')
	)) throw new Exception('xml_set_element_handler() failed');
	if (!xml_set_character_data_handler($xml_parser, array($xml_callback, 'data'))) {
		throw new Exception('xml_set_character_data_handler() failed');
	}
	if (!xml_parser_set_option($xml_parser, XML_OPTION_CASE_FOLDING, 0)) {
		throw new Exception('xml_parser_set_option() failed');
	}
	
	if (!xml_parse($xml_parser, $output, TRUE)) {
		$xml_error = sprintf(
			"%s at line %d",
			xml_error_string(xml_get_error_code($xml_parser)),
			xml_get_current_line_number($xml_parser)
		);
		throw new Exception("XML error: $xml_error\nXML data: $output");
	}
	
	xml_parser_free($xml_parser);
	
	return $xml_callback->getResult();
}

class XmlCallback {
	private $ret = null;
	/* assign and use references directly to the array, or else you'll be in trouble */
	private $ptr_stack = array();
	private $level = 0;

	public function __construct() {
		$this->ptr_stack[$this->level] =& $this->ret;
	}

	public function startElement($parser, $name, $attrs) {
		if ($name == 'item' && isset($attrs['idx'])) {
			$name = $attrs['idx']; /* reconstruct arrays with numeric indexes */
		}

		if (!isset($this->ptr_stack[$this->level])) {
			$this->ptr_stack[$this->level] = array();
			$this->ptr_stack[$this->level][$name] = null;
		} else {
			if (!is_array($this->ptr_stack[$this->level])) {
				if (!strlen(trim($this->ptr_stack[$this->level]))) {
					/* if until now we got only whitespace (thus scalar data),
					but now we start a nested elements structure, discard this
					whitespace, as it is most probably just space between the
					element tags */
					$this->ptr_stack[$this->level] = array();
				} else {
					throw new Exception('Mixed array and scalar data');
				}
			}
			if (isset($this->ptr_stack[$this->level][$name])) {
				/* isset() == (isset() && !is_null()) */
				throw new Exception("Duplicate element name: $name");
			}
		}

		/* array_push() */
		++$this->level;
		$this->ptr_stack[$this->level] =& $this->ptr_stack[$this->level-1 /* MINUS ONE! */][$name];
	}

	public function endElement($parser, $name) {
		if (!array_key_exists($this->level, $this->ptr_stack)) {
			throw new Exception('XML non-existing reference');
		}

		/* array_pop() */
		unset($this->ptr_stack[$this->level]);
		--$this->level;

		if ($this->level < 0) throw new Exception('XML stack underflow');
	}

	public function data($parser, $data) {
		if (is_array($this->ptr_stack[$this->level])) {
			if (strlen(trim($data))) { # check if this is just whitespace
				throw new Exception('Mixed array and scalar data');
			} else {
				/* we tolerate AND skip whitespace, if we're already in
				a nested elements structure, as this whitespace is most
				probably just space between the element tags */
				return;
			}
		}
		if (is_null($this->ptr_stack[$this->level])) {
			$this->ptr_stack[$this->level] = ''; /* first data input */
		}
		$this->ptr_stack[$this->level] .= $data; /* we may be called several times, in chunks */
	}

	public function getResult() {
		return $this->ret;
	}
}

Update, 20/Jul/2011: The source code was modified to handle white-space better, in order to fix the following tricky sample XML input: <item6> &amp; &lt; </item6>
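
The character data handler may be called several times for one element, so the pieces have to be concatenated before any decision about whitespace is made. Here is a minimal sketch with the same XML Parser functions. The variable $chunks exists only for this illustration, and the exact chunk boundaries are up to the underlying parser:

```php
<?php
$chunks = array();

$parser = xml_parser_create();
xml_set_character_data_handler($parser, function ($p, $data) use (&$chunks) {
	$chunks[] = $data; // may be called more than once per element
});

if (!xml_parse($parser, '<item6> &amp; &lt; </item6>', true)) {
	throw new Exception(xml_error_string(xml_get_error_code($parser)));
}

$text = implode('', $chunks); // entities arrive already decoded
var_dump($chunks, $text);
```

Only the concatenated $text is meaningful. Trimming each chunk separately could lose the spaces around the decoded entities.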

Update, 30/Jul/2011: Another bugfix which handles empty responses like: <response/>




Testing exception message with PHPUnit

July 7, 2011

PHPUnit has a built-in method to test if an expected exception occurred during a test case:

$this->setExpectedException('Exception');

You cannot, however, test the message of the exception this way. There are cases where a program may throw the same exception type, but with different messages for different errors, and you want to differentiate between them.

Here is my example code on how to reliably test for the type and message of an exception:

class staticSessionTest extends PHPUnit_Framework_TestCase {
	...
	function test_bad_data() {
		$emess = null;
		try {
			$this->sess->start('must be array', FALSE, FALSE); # we expect an Exception here
		} catch (Exception $e) { $emess = $e->getMessage(); }
		$this->assertEquals('Session data must be an array', $emess); # expected value first, then actual
	}
	...
}

Putting the assertEquals() outside of the try…catch block ensures that you cannot forget to test for the message: if no exception is thrown, $emess stays null and the assertion fails. The expected exception type is specified in the catch (…) clause.
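
The same pattern can be tried outside PHPUnit. In the sketch below, start_session() and exception_message_of() are hypothetical names; they are not part of PHPUnit or of the test class above:

```php
<?php
// Hypothetical stand-in for the code under test.
function start_session($data) {
	if (!is_array($data)) {
		throw new InvalidArgumentException('Session data must be an array');
	}
	return true;
}

// Hypothetical helper: runs $fn and returns the message of the exception
// it throws, or null if nothing was thrown. Other exception types are
// re-thrown, so the type is checked as well.
function exception_message_of($fn, $type = 'Exception') {
	try {
		call_user_func($fn);
	} catch (Exception $e) {
		if (!($e instanceof $type)) {
			throw $e;
		}
		return $e->getMessage();
	}
	return null;
}

$emess = exception_message_of(function () {
	start_session('must be array');
}, 'InvalidArgumentException');

var_dump($emess === 'Session data must be an array');
```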


UPDATE: I just re-read the latest PHPUnit Annotations documentation, and this feature is already included in the standard PHPUnit suite. The difference between my custom code and the “@expectedExceptionMessage” annotation is that the annotation applies to the whole test method, while with try…catch you can specify precisely where you expect the exception to occur.




Print to STDERR in PHP

November 30, 2010

If you are writing a command line tool in PHP and want to write to STDERR, here is the command:

file_put_contents('php://stderr', 'This text goes to STDERR');

The relevant documentation page is PHP input/output streams.
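
As an alternative sketch: the PHP command line interpreter predefines the STDERR constant as an open stream, so fwrite() can be used as well. The helper name write_line() is hypothetical, and the fallback to fopen() is an assumption for environments where the constant is not defined:

```php
<?php
// Hypothetical helper: writes one line to any open stream and
// returns the number of bytes written.
function write_line($stream, $text) {
	return fwrite($stream, $text . "\n");
}

// STDERR is predefined by the PHP CLI; elsewhere we open it by hand.
$stderr = defined('STDERR') ? STDERR : fopen('php://stderr', 'w');
write_line($stderr, 'This text goes to STDERR');
```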

P.S. I actually wanted to reply to this page, but there was no way to leave a comment there…


PHP non-interactive usage in a cron job

September 14, 2010

Using a PHP script in a crontab is fairly easy, as stated in the “Using PHP from the command line” documentation… Until you start to get the following warning during the execution:

No entry for terminal type "unknown";
using dumb terminal settings.

The script works, but this nasty warning really bothers you.

Here is a sample crontab entry (in the system crontab format, which has a user field):

* * * * * root sudo -u www-data php -r 'echo "test";'

When executed, it prints the warning on STDERR.

Yes, I know I don’t need “sudo” here, but this was my usage pattern when I discovered the problem, and at first I suspected that “sudo” had gone crazy. It turned out that PHP was to blame, not “sudo”.

Here is the fixed crontab entry:

* * * * * root sudo -u www-data TERM=dumb php -r 'echo "test";'

The issue was encountered on an Ubuntu 10.04 server. I thought crond usually sets $TERM to something… Anyway, problem solved.


C++ vs. Python vs. Perl vs. PHP performance benchmark (part #2)

August 2, 2010

This time we will focus on the startup time. The process start time is important if your processes are not persistent. If you are using FastCGI, mod_perl, mod_php, or mod_python, then these statistics are not so important to you. However, if you are spawning many processes which do something small and live for a very short time, then you should consider the CPU resources which get wasted while the script interpreter is being initialized.

The benchmarked scripts do only one thing – say “Hello, world” on the standard output. They do not include any additional modules in their source code – this may or may not match your use-case. Scripting languages often have quite a few built-in functions, though, and for simple tasks you may never need to include other modules.

Here are the benchmark results:

Language | CPU time (User) | CPU time (System) | CPU time (Total) | Slower than C++ | Slower than previous
C++ (with or w/o optimization) | 2.568 | 3.536 | 6.051 | - | -
Perl | 12.561 | 6.096 | 18.723 | 209% | 209%
PHP (w/o php.ini) | 20.473 | 13.877 | 34.918 | 477% | 86%
Python | 27.014 | 11.881 | 39.318 | 550% | 13%
Python + Psyco | 32.986 | 14.845 | 48.132 | 695% | 22%
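
The “Slower than” columns appear to be the relative difference between the Total times. This is an inference from the numbers, not something taken from the benchmark scripts. The sketch below, with the hypothetical helper percent_slower(), recomputes three of the values from the totals above:

```php
<?php
// Hypothetical helper: how many percent slower $time is than $baseline.
function percent_slower($baseline, $time) {
	return (int) round(($time - $baseline) / $baseline * 100);
}

// "Total" values from the table above
$cpp  = 6.051;
$perl = 18.723;
$php  = 34.918;

printf("Perl vs. C++:  %d%%\n", percent_slower($cpp, $perl));
printf("PHP vs. C++:   %d%%\n", percent_slower($cpp, $php));
printf("PHP vs. Perl:  %d%%\n", percent_slower($perl, $php));
```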

The clear winner among the scripting languages this time is… Perl. :)

All scripts were invoked 3000 times using the following Bash loop:

time ( i=3000 ; while [ "$i" -gt 0 ]; do $CMD >/dev/null ; i=$(($i-1)); done )

All tests were done on a Kubuntu Lucid box. The versions of the software packages used follow:

  • g++ (GNU project C and C++ compiler) 4.4.3
  • Python 2.6.5
  • Python Psyco 1.6 (1ubuntu2)
  • Perl 5.10.1
  • PHP 5.3.2 (1ubuntu4.2 with Suhosin-Patch), Zend Engine 2.3.0

The C++ implementation follows:

#include <iostream>
using namespace std;

int main() {
	cout << "Hello, world!\n";
	return 0;
}

The Perl implementation follows:

use strict;
use warnings;

print "Hello, world!\n";

The PHP implementation follows:

<?php
echo "Hello, world!\n";

The Python implementation follows:

#import psyco
#psyco.full()

print 'Hello, world!'


Update (Jan/14/2012): Copied the used test environment info here.


C++ vs. Python vs. Perl vs. PHP performance benchmark

July 1, 2010

Update: There is a part #2 of the benchmark results.


This all began when a colleague of mine stated that Python was so damn slow for maths. This really astonished me and made me check it out, as my father once told me that he was very satisfied with Python because it was very maths oriented.

The benchmarks here do not try to be complete. They show the performance of the languages in one area only, namely: loops, arrays of numbers, and basic math operations.

Note: Give your ideas and use-cases on what to benchmark, and I’ll try to implement it for you. For example: “benchmark the languages for reading a file, then splitting it to tokens by white-space and finally outputting all unique elements and their count”.

Out of curiosity, Python was also benchmarked with and without the Psyco Python extension, which people say could greatly speed up the execution of any Python code without any modifications.

Here are the benchmark results:

Language | CPU time (User) | CPU time (System) | CPU time (Total) | Slower than C++ | Slower than previous | Language version
C++ (optimized with -O2) | 1.520 | 0.188 | 1.708 | - | - | g++ 4.5.2
C++ (not optimized) | 3.208 | 0.184 | 3.392 | 99% | 99% | g++ 4.5.2
Javascript (nodejs) | 3.096 | 0.384 | 3.480 | 104% | 3% | 0.2.6
Java | 8.521 | 0.192 | 8.713 | 410% | 150% | 1.6.0_26
Python + Psyco | 13.305 | 0.152 | 13.457 | 688% | 54% | 2.6.6
Python | 27.886 | 0.168 | 28.054 | 1543% | 108% | 2.7.1
Perl | 41.671 | 0.100 | 41.771 | 2346% | 49% | 5.10.1
PHP | 94.622 | 0.364 | 94.986 | 5461% | 127% | 5.3.5

The clear winner among the scripting languages is… Python. :)

NodeJS JavaScript is pretty fast too, but internally it works more like a compiled language.

The times include the interpretation/parsing phase for each language, but it’s so small that its significance is negligible. The math function is called 10 times, in order to have more reliable results. All scripts are using the very same algorithm to calculate the prime numbers in a given range. The correctness of the implementation is not so important, as we just want to check how fast the languages perform. The original Python algorithm was taken from http://www.daniweb.com/code/snippet216871.html.
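
For readers who want a feel for this kind of workload, here is a minimal sketch of a prime-number loop in PHP. It is a plain Sieve of Eratosthenes written for illustration, not the benchmarked algorithm (that one is in the download below). The function name primes_up_to() and the limit of 100000 are assumptions:

```php
<?php
// Illustrative sieve, NOT the algorithm used in the benchmark.
function primes_up_to($n) {
	if ($n < 2) {
		return array();
	}
	$is_prime = array_fill(0, $n + 1, true);
	$is_prime[0] = $is_prime[1] = false;
	for ($i = 2; $i * $i <= $n; $i++) {
		if ($is_prime[$i]) {
			// cross out every multiple of $i, starting at its square
			for ($j = $i * $i; $j <= $n; $j += $i) {
				$is_prime[$j] = false;
			}
		}
	}
	return array_keys($is_prime, true, true); // indexes still marked prime
}

// Call the function 10 times, as the benchmark does with its math function.
$start = microtime(true);
for ($run = 0; $run < 10; $run++) {
	$primes = primes_up_to(100000);
}
printf("%d primes, %.3f s wall clock\n", count($primes), microtime(true) - $start);
```

Note that this measures wall-clock time inside the script, while the table above reports CPU time for the whole process.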

The tests were run on an Ubuntu Linux machine.

You can download the source code, an Excel results sheet, and the benchmark batch script at:
http://www.famzah.net/download/langs-performance/


Update (Jul/24/2010): Added the C++ optimized values.
Update (Aug/02/2010): Added a link to the benchmarks, part #2.
Update (Mar/31/2011): Using range() in PHP improves performance by 5%.
Update (Jan/14/2012): Re-organized the results summary table and the page. Added Java.

