ext/dom and libxml2: charset and entity behavior

In case you are unaware, as of PHP 5.1.0 there is a second argument to the DOMDocument->saveXML() method.

This argument currently supports only one value, the constant LIBXML_NOEMPTYTAG. This option makes sure that you do not end up with <tag/> but with <tag></tag> instead. This can make things easier if you need more predictable text to perform other changes on later.
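To see the flag in isolation, here is a minimal sketch. The `<root>` and `<tag>` element names are made up for illustration and are not from the document discussed in this post.

```php
<?php
// Minimal sketch: compare default serialization with LIBXML_NOEMPTYTAG.
// The <root>/<tag> markup is a hypothetical example document.
$dom = new DOMDocument();
$dom->loadXML('<root><tag /></root>');

// Default: the empty element is written in its collapsed form.
$collapsed = $dom->saveXML();

// With the flag: the empty element is written as an open/close pair.
$expanded = $dom->saveXML(null, LIBXML_NOEMPTYTAG);

echo $collapsed;
echo $expanded;
```

Passing null as the first argument serializes the whole document, the same as calling saveXML() with no arguments.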

However, in playing around with the option, I noticed that the size of my markup changed significantly (it’s a large document). Some further playing showed that the following six uses of DOMDocument->saveXML() yield different results:

&#xA0; is a non-breaking space character (&nbsp; in HTML). ext/dom defaults to UTF-8.

<?php
// loadXML() is an instance method; calling it statically is an error in PHP 8
$dom = new DOMDocument();
$dom->loadXML("<xml><test />&#xA0;</xml>");

echo $dom->saveXML();
/*
Default behavior, entities stay as entities, no encoding added to the XML prolog
<?xml version="1.0"?>
<xml><test/>&#xA0;</xml>
*/

echo $dom->saveXML($dom->documentElement);
/*
Entities are transformed to output charset, no XML prolog
<xml><test/>[nbsp char]</xml>
*/

echo $dom->saveXML($dom);
/*
Entities are transformed to output charset, encoding added to the XML prolog
<?xml version="1.0" encoding="UTF-8"?>
<xml><test/>[nbsp char]</xml>
*/

echo $dom->saveXML($dom->documentElement, LIBXML_NOEMPTYTAG);
/*
Entities are transformed to output charset, no XML prolog, tags expanded
<xml><test></test>[nbsp char]</xml>
*/

echo $dom->saveXML($dom, LIBXML_NOEMPTYTAG);
/*
Entities are transformed to output charset, encoding added to the XML prolog, tags expanded
<?xml version="1.0" encoding="UTF-8"?>
<xml><test></test>[nbsp char]</xml>
*/

echo $dom->saveXML(null, LIBXML_NOEMPTYTAG);
/*
Entities stay as entities, no encoding added to the XML prolog, tags expanded
<?xml version="1.0"?>
<xml><test></test>&#xA0;</xml>
*/
?>
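The size difference comes from how the character is written, not from a change in content: a numeric character reference and the raw UTF-8 bytes parse back to the same text. The sketch below is one way to check that, assuming the same test document as above; the `$variants` and `$texts` names are illustrative. It prints byte counts rather than asserting them, since the exact serialization can depend on the PHP and libxml2 versions in use.

```php
<?php
// Sketch: serialize the same document two ways, then re-parse each result
// and confirm the text content is identical even if the byte counts differ.
$dom = new DOMDocument();
$dom->loadXML('<xml><test />&#xA0;</xml>');

$variants = [
    'whole document'   => $dom->saveXML(),
    'document element' => $dom->saveXML($dom->documentElement),
];

$texts = [];
foreach ($variants as $label => $xml) {
    $copy = new DOMDocument();
    $copy->loadXML($xml);

    // ext/dom exposes text as UTF-8; U+00A0 is the two bytes C2 A0.
    $texts[$label] = $copy->documentElement->textContent;

    printf(
        "%-16s %3d bytes, parses back to nbsp: %s\n",
        $label,
        strlen($xml),
        $texts[$label] === "\xC2\xA0" ? 'yes' : 'no'
    );
}
```

So if you compare or post-process the serialized strings directly, normalize on one calling style first; if you only re-parse them, the variants are interchangeable.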

Just something to keep in mind next time you’re fooling around with the DOM.

- Davey
