XML Canonicalization Testing

Software testers need a good grasp of techniques that involve XML. One concept that is not widely known is XML canonicalization. The W3C defines Canonical XML, which you can think of as XML in a standard form. This means, for example, that whitespace inside start and end tags is normalized (whitespace in text content is kept), attribute values are always delimited by double quotes, the XML declaration is removed, empty elements are written as start-tag/end-tag pairs, and attributes are placed in lexicographic order.
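To make these rules concrete, here is a minimal Python sketch. It assumes Python 3.8 or later, whose standard library function xml.etree.ElementTree.canonicalize implements the newer C14N 2.0 specification rather than the original Canonical XML 1.0, so treat it as an illustration of the idea and not of every rule in the W3C documents. The sample document is made up.

```python
from xml.etree.ElementTree import canonicalize

# Hypothetical input: an XML declaration, single-quoted attribute values,
# attributes out of order, extra whitespace inside the start tag,
# and an empty-element tag.
raw = "<?xml version='1.0'?><root b='2'   a=\"1\" ><child/></root>"

# With no output file given, canonicalize() returns the canonical text.
canonical = canonicalize(raw)
print(canonical)
```

The printed form should be <root a="1" b="2"><child></child></root>: the declaration is gone, the quotes are double quotes, the attributes are sorted, and the empty element is expanded.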

Canonicalization is usually abbreviated C14N (the letter C, 14 more letters, then N), and the full specification is quite complex. In a testing situation, you might run into XML canonicalization if you are testing a system that produces XML files or in-memory documents. To test such a system, you need an expected result, which is itself an XML file or document. One strategy for comparing the actual XML result with the expected XML result is to canonicalize both, and then do a simple string comparison for equality. Keep in mind that canonicalization preserves whitespace in text content, so two documents that differ only in indentation still compare as different unless you also strip that whitespace.
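The compare-by-canonical-form strategy can be sketched in a few lines of Python. This is only one way to do it, assuming Python 3.8 or later and its C14N 2.0 implementation; the helper name xml_equal and the sample order documents are hypothetical. The strip_text=True option goes beyond plain canonicalization by trimming whitespace around text, which is convenient when the expected file is pretty-printed but may be wrong for documents where that whitespace matters.

```python
from xml.etree.ElementTree import canonicalize


def xml_equal(actual: str, expected: str) -> bool:
    """Return True if both XML strings have the same canonical form."""
    # strip_text=True trims leading/trailing whitespace around text content,
    # so indentation differences alone do not cause a mismatch.
    return (canonicalize(actual, strip_text=True)
            == canonicalize(expected, strip_text=True))


# Hypothetical expected result, pretty-printed as it might be stored on disk.
expected = """<?xml version="1.0"?>
<order id="7" status="open">
  <item>widget</item>
</order>"""

# Hypothetical actual result: compact, single quotes, attributes reordered.
actual = "<order status='open' id='7'><item>widget</item></order>"

print(xml_equal(actual, expected))
print(xml_equal(actual.replace("widget", "gadget"), expected))
```

The first call reports a match because the two documents differ only in surface form; the second reports a mismatch because the text content really changed.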

XML canonicalization was actually created for use in security-related scenarios such as XML digital signatures. The idea is that XML is often transmitted over a network. To verify that the XML has not been maliciously or accidentally corrupted, the sender can compute a hash (such as MD5 or SHA1) of the XML and publish that hash value. The receiver can compute a hash of the received data and compare it with the published value to validate the data. Unfortunately, many harmless changes can happen to XML as it travels over a network or the Internet, whitespace differences in particular, and any such change produces a completely different hash. So a way is needed to standardize XML before hashing so that hashes can be meaningfully compared.
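The hashing idea can be sketched as follows. This is a simplified illustration, not an implementation of XML Signature: it assumes Python 3.8 or later, uses SHA-256 from hashlib in place of the MD5 or SHA1 mentioned above, and the helper name canonical_digest and the sample messages are hypothetical.

```python
import hashlib
from xml.etree.ElementTree import canonicalize


def canonical_digest(xml_text: str) -> str:
    """Hash the canonical form of the XML rather than its raw text."""
    canonical = canonicalize(xml_text)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical message as sent, and the same message after harmless
# changes in transit (added declaration, quoting, attribute order, spacing).
sent = '<msg to="bob" from="alice">hi</msg>'
received = "<?xml version='1.0'?>\n<msg from='alice'  to='bob'>hi</msg>"

raw_sent = hashlib.sha256(sent.encode("utf-8")).hexdigest()
raw_received = hashlib.sha256(received.encode("utf-8")).hexdigest()

print(raw_sent == raw_received)                                  # raw text hashes
print(canonical_digest(sent) == canonical_digest(received))      # canonical hashes
```

Hashing the raw text gives different values for the two messages, while hashing the canonical form gives the same value, which is what makes the published hash useful to the receiver.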

I devote a section of chapter 12 of my book ".NET Test Automation Recipes" to the details of C14N XML canonicalization. See http://www.amazon.com/gp/product/1590596633/qid=1150213081/sr=2-1/ref=pd_bbs_b_2_1/103-1065536-4310244?s=books&v=glance&n=283155 for details.

