ext/dom and libxml2 charset and entities behaviors

In case you are unaware, the DOMDocument::saveXML() method has accepted a second argument, $options, since PHP 5.1.0.

This argument currently supports only one value, the constant LIBXML_NOEMPTYTAG. This option makes sure that you do not end up with <tag/> but with <tag></tag> instead. This can make things easier if you need more predictable text to perform other changes on later.
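
As a quick illustration of what the option does, here is a minimal sketch. The `<root><empty/></root>` document is a made-up example, not the markup I was actually working with.

```php
<?php
// Minimal sketch: the input document below is a made-up example.
$dom = new DOMDocument();
$dom->loadXML('<root><empty/></root>');

// Without the option, the empty element is serialized in self-closing form.
$short = $dom->saveXML($dom->documentElement);

// With LIBXML_NOEMPTYTAG, it is written as an explicit open/close pair.
$expanded = $dom->saveXML($dom->documentElement, LIBXML_NOEMPTYTAG);

echo $short, "\n";    // <root><empty/></root>
echo $expanded, "\n"; // <root><empty></empty></root>
```

Passing the root element as the first argument keeps the XML prolog out of the output, which makes the two strings easier to compare.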

However, while playing around with the option, I noticed that the size of my markup changed quite noticeably (it’s a large document). Some further experimenting showed that the following six uses of DOMDocument::saveXML() yield different results:

Note that &#xA0; is the non-breaking space character (&nbsp; in HTML), and that ext/dom defaults to UTF-8.

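[php]
<?php
// The original input markup was lost from this listing; the document below
// is an assumed stand-in containing an entity and an empty tag.
$dom = new DOMDocument();
$dom->loadXML("<root><empty/>&#xA0;</root>");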

echo $dom->saveXML();
/*
Default behavior, entities stay as entities, no encoding added to the XML prolog
[XML prolog without encoding]
&#xA0;
*/

echo $dom->saveXML($dom->documentElement);
/*
Entities are transformed to output charset, no XML prolog
[nbsp char]
*/

echo $dom->saveXML($dom);
/*
Entities are transformed to output charset, encoding added to the XML prolog
[XML prolog with encoding]
[nbsp char]
*/

echo $dom->saveXML($dom->documentElement, LIBXML_NOEMPTYTAG);
/*
Entities are transformed to output charset, no XML prolog, tags expanded
[nbsp char]
*/

echo $dom->saveXML($dom, LIBXML_NOEMPTYTAG);
/*
Entities are transformed to output charset, encoding added to the XML prolog, tags expanded
[XML prolog with encoding]
[nbsp char]
*/

echo $dom->saveXML(null, LIBXML_NOEMPTYTAG);
/*
Entities stay as entities, no encoding added to the XML prolog, tags expanded
[XML prolog without encoding]
&#xA0;
*/
?>
[/php]
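
If you want to convince yourself that these variants differ only in their bytes and not in their content, one option is to parse each result again and compare the text. The following is a rough sketch under that assumption. The input document and the `$variants` and `$texts` names are made up for illustration, and only two of the six calls are shown.

```php
<?php
// Hypothetical input; any document containing &#xA0; would do.
$dom = new DOMDocument();
$dom->loadXML('<root>&#xA0;</root>');

// Two of the serializations discussed above.
$variants = [
    'saveXML()'                      => $dom->saveXML(),
    'saveXML($dom->documentElement)' => $dom->saveXML($dom->documentElement),
];

$texts = [];
foreach ($variants as $label => $xml) {
    // Re-parse the serialized string and read the text back as UTF-8.
    $copy = new DOMDocument();
    $copy->loadXML($xml);
    $texts[$label] = $copy->documentElement->textContent;

    // The byte length may differ between variants; the parsed text should not.
    printf("%-32s %3d bytes, text as hex: %s\n",
        $label, strlen($xml), bin2hex($texts[$label]));
}
```

Whichever form the serializer picks, the text read back is the UTF-8 encoding of U+00A0 (hex c2a0), so the size difference comes from how the character is written and not from what it is.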

Just something to keep in mind next time you’re fooling around with the DOM.

– Davey