Parsing XML Files with PowerShell

In the context of using Windows PowerShell for lightweight software test automation, one of the most common tasks you need to perform is parsing data from XML files. For example, you may want to extract test case input and expected result data from an XML test cases file, or you might want to pull out results data from an XML test results file. Compared to parsing a flat text file, parsing most XML files is a bit tricky because of XML’s hierarchical structure. There are several approaches you can take when parsing XML with PowerShell. In general, the most flexible technique is to read the entire XML file into memory as an XmlDocument object and then use methods such as SelectNodes(), SelectSingleNode(), GetAttribute(), and get_InnerXml() to parse the object in memory. Let me demonstrate with a typical example. Suppose you want to parse this dummy XML test case data file:
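<?xml version="1.0" ?>
<testCases>
  <testCase id="001">
    <inputs>
      <arg1 optional="no">3</arg1>
      <arg2>4</arg2>
    </inputs>
    <expected>7</expected>
  </testCase>
  <testCase id="002">
    <inputs>
      <arg1 optional="yes">5</arg1>
      <arg2>6</arg2>
    </inputs>
    <expected>11</expected>
  </testCase>
</testCases>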
The dummy file represents test case data for a hypothetical Sum() method. A PowerShell script, listed further below, parses the XML file and produces this output:
PS C:\XMLwithPowerShell> .\parseXML.ps1
Parsing file testCases.xml
Case ID = 001 Arg1 = 3 Optional = no Arg2 = 4 Expected value = 7
Case ID = 002 Arg1 = 5 Optional = yes Arg2 = 6 Expected value = 11
End parsing
The complete script is:
# parseXML.ps1
write-host "`nParsing file testCases.xml`n"
[System.Xml.XmlDocument] $xd = new-object System.Xml.XmlDocument
$file = resolve-path("testCases.xml")
$xd.load($file)
$nodelist = $xd.selectnodes("/testCases/testCase") # XPath is case sensitive
foreach ($testCaseNode in $nodelist) {
  $id = $testCaseNode.getAttribute("id")
  $inputsNode = $testCaseNode.selectSingleNode("inputs")
  $arg1 = $inputsNode.selectSingleNode("arg1").get_InnerXml()
  $optional = $inputsNode.selectSingleNode("arg1").getAttribute("optional")
  $arg2 = $inputsNode.selectSingleNode("arg2").get_InnerXml()
  $expected = $testCaseNode.selectSingleNode("expected").get_innerXml()
  #$expected = $testCaseNode.expected
  write-host "Case ID = $id Arg1 = $arg1 Optional = $optional Arg2 = $arg2 Expected value = $expected"
}
write-host "`nEnd parsing`n"
The first three statements of the script load file testCases.xml into memory as an XmlDocument object:
[System.Xml.XmlDocument] $xd = new-object System.Xml.XmlDocument
$file = resolve-path("testCases.xml")
$xd.load($file)
I could have loaded the XML file in a single line like so:
[xml] $xd = get-content ".\testCases.xml"
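As a rough check that the two loading styles are interchangeable here, the sketch below loads the same file both ways and compares the resulting types. The file name testCasesDemo.xml and its one-element content are made up for illustration and are not the article's data file. Load() is given a full path because it may resolve relative paths against the process working directory rather than the current PowerShell location, which is presumably why the script calls resolve-path.

```powershell
# Create a small sample file in the temp directory (hypothetical data, not the article's file)
$path = Join-Path ([System.IO.Path]::GetTempPath()) "testCasesDemo.xml"
Set-Content -Path $path -Value '<testCases><testCase id="001" /></testCases>'

# Approach 1: construct an XmlDocument, then call Load() with a full path
$xd1 = New-Object System.Xml.XmlDocument
$xd1.Load($path)

# Approach 2: cast the file's text to [xml]
[xml] $xd2 = Get-Content $path

# Both variables should hold the same .NET type
$type1 = $xd1.GetType().FullName
$type2 = $xd2.GetType().FullName
Write-Host "$type1 / $type2"

# Clean up the temporary file
Remove-Item $path
```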
For a file like this one, the three-statement approach in the script has no technical advantage, but it is somewhat more readable to an engineer with C# coding experience. Next I fetch all the testCase nodes into memory:
$nodelist = $xd.selectnodes("/testCases/testCase")
The SelectNodes() method accepts an XPath string which is case-sensitive. With the testCase nodes now in memory I can iterate through each node with a foreach loop. Alternatively I could have iterated using a for loop with an index variable (say $i) in conjunction with the Item() method. For each node, I first fetch the test case ID attribute:
$id = $testCaseNode.getAttribute("id")
I use the GetAttribute() method of the XmlElement class. Interestingly I could have written this instead:
$id = $testCaseNode.id
This alternative illustrates an important point. In an effort to make parsing XML with PowerShell easier than with C# or VB.NET, the designers of PowerShell decided to directly expose attributes and values of XML elements in the form of properties. But because arbitrary XML data is available as properties, PowerShell does not expose standard .NET Framework properties (such as InnerXml) in the same way, since there could be a name conflict. Note that PowerShell does expose standard .NET Framework methods such as GetAttribute(). Continuing in my script, next I grab the values of arg1:
$inputsNode = $testCaseNode.selectSingleNode("inputs")
$arg1 = $inputsNode.selectSingleNode("arg1").get_InnerXml()
$optional = $inputsNode.selectSingleNode("arg1").getAttribute("optional")
I use the SelectSingleNode() method to grab the single <inputs> node. Now instead of using the standard InnerXml property, which PowerShell does not expose, I use the get_InnerXml() method, which corresponds to the non-exposed InnerXml property. OK, but just how did I know about this get_InnerXml() method? As with many PowerShell scripting tasks, before writing my script I had experimented by issuing interactive commands at the PowerShell prompt. For example, after interactively loading the XML file into memory (by typing the first three statements in my script), I typed commands such as:
> $nodelist = $xd.selectnodes("/testCases/testCase")
> $firstnode = $nodelist.item(0)
> $inputs = $firstnode.selectSingleNode("inputs")
> $arg1 = $inputs.selectSingleNode("arg1")
> $arg1 | get-member | more
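A related experiment, sketched here under the assumption of a tiny inline XML fragment rather than the real testCases.xml, restricts the listing to properties. It should show that the optional attribute of arg1 surfaces as a property name, which is the behavior described earlier. The variable name $propertyNames is only illustrative.

```powershell
# Build a small in-memory document (hypothetical fragment, not the article's file)
$xd = New-Object System.Xml.XmlDocument
$xd.LoadXml('<inputs><arg1 optional="no">3</arg1><arg2>4</arg2></inputs>')
$arg1 = $xd.SelectSingleNode("/inputs/arg1")

# Restrict the listing to properties; the XML attribute should appear among them
$propertyNames = $arg1 | Get-Member -MemberType Property | ForEach-Object { $_.Name }
$propertyNames
```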
Using the get-member cmdlet is the key to discovering exactly which properties and methods an object makes available. The rest of the script should be reasonably self-explanatory because it uses the same coding techniques. To summarize, although there are several ways to parse an XML file using PowerShell, a flexible approach is to use the XmlDocument class. After reading an XML file into memory as an XmlDocument object, you can select multiple nodes into a collection using the SelectNodes() method, grab a single node using the SelectSingleNode() method, retrieve an attribute using either the standard GetAttribute() method or the name of the attribute, which PowerShell exposes as a property, and obtain an element value using the get_InnerXml() method.
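The index-based iteration with Item() and the property-style attribute access mentioned earlier can be combined in a short sketch. It assumes the test case data is supplied as an inline string (loaded with LoadXml() instead of from a file), and the $lines array exists only so the results can be inspected afterward.

```powershell
# Inline copy of the test case data (assumed to match the structure the script expects)
$xmlText = @'
<testCases>
  <testCase id="001">
    <inputs><arg1 optional="no">3</arg1><arg2>4</arg2></inputs>
    <expected>7</expected>
  </testCase>
  <testCase id="002">
    <inputs><arg1 optional="yes">5</arg1><arg2>6</arg2></inputs>
    <expected>11</expected>
  </testCase>
</testCases>
'@

$xd = New-Object System.Xml.XmlDocument
$xd.LoadXml($xmlText)

$nodelist = $xd.SelectNodes("/testCases/testCase")
$lines = @()

# Walk the node list by index instead of using foreach
for ($i = 0; $i -lt $nodelist.Count; $i++) {
  $node = $nodelist.Item($i)
  $id = $node.id    # attribute read through the property PowerShell exposes
  # A relative XPath can reach arg1 in one step, skipping the intermediate inputs variable
  $arg1 = $node.SelectSingleNode("inputs/arg1").get_InnerXml()
  $lines += "Case ID = $id Arg1 = $arg1"
}

$lines
```

The foreach form in the script is shorter; the indexed form is mainly useful when the position of a node matters, for example to report which test case number failed.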

6 Responses to Parsing XML Files with PowerShell

  1. Shay says:

    Great post, Thank you.
    I think that:
    [System.Xml.XmlDocument] $xd = new-object System.Xml.XmlDocument
    Can be just
    $xd = new-object System.Xml.XmlDocument
    Since they both result in the same type, the second is more readable.
    --
    Shay Levi
    $cript Fanatic
    http://scriptolog.blogspot.com
    Hebrew weblog:

  2. Douglas says:

    You can also use dot notation and GetElementsByTagName.
    $xml = [xml] (get-content testCases.xml)
    ForEach($node in $xml.GetElementsByTagName("testCase")){
       $id = $node.Id
       ForEach($item in $node)
       {
          $expected= $item.expected
          ForEach($argItem in $item.inputs)
          {
             $arg1 = $argItem.arg1
             $arg1opt = $argItem.arg1.optional
             $arg2 = $argItem.arg2
          }
       }
       "Case ID = $id   Arg1 = $($arg1.'#text') Optional = $arg1Opt   Arg2 = $arg2   Expected Value = $expected   "
    }
    Case ID = 001   Arg1 = 3 Optional = no   Arg2 = 4   Expected Value = 7
    Case ID = 002   Arg1 = 5 Optional = yes   Arg2 = 6   Expected Value = 11

  3. jon says:

    Thanks for the post, very helpful, I had forgotten about .item(x) to reference an element in the collection.

  4. Shaiket says:

    good post

  5. gh0st says:

    hmmm, I started off a little home project (no powershell experience…no programming experience either, so please excuse the outrageous naivety) based on the xml parse stuff on this thread - hugely helpful example - but I've hit a dead-end. My aim is to parse out a particular attribute in a given XML node (works fine), read in the name of folders that match a pattern in a given folder (done) and compare the two, with the ultimate aim of deleting the folders that are absent from the XML. It's just for some lightweight automated housekeeping.
    I'm stuck at the point of comparing the values - for sheer lack of understanding I've captured the values in two different ways into files, but as far as I can tell (see output below) the variables end up containing the data that I want to compare, but when I do…inexplicable results so far as I can tell.
    >> Here's my hacky attempt to get the values into files:
    # Cleanup temp files
    c:
    If (Test-Path -path "c:\zq.csv"){ Remove-Item "c:\zq.csv" }
    If (Test-Path -path "c:\zd.csv"){ Remove-Item "c:\zd.csv" }
    # Parse XML for job numbers (prepend J to match foldername)
    $xml = [xml] (get-content QueueStatus.xml)
    ForEach($node in $xml.GetElementsByTagName("QueueEntry")){
       $id = $node.QueueEntryID
       ForEach($item in $node) { Add-Content zq.csv "J$id," }
    }
    # Log folderlist in path
    $dir = get-Childitem C:\QueueJobDrops | Where-Object {$_.psIsContainer -and $ -like "J*"} | select name | Export-Clixml zd.xml
    >> And here's what I get:
    PS C:\> $zq = Import-Csv zq.csv -Header name
    PS C:\> $zd = Import-Clixml zd.xml
    PS C:\> $zd
    Name
    ----
    J11111
    J310
    J314
    J399
    J888
    PS C:\> $zq
    name
    ----
    J317
    J363
    J341
    J334
    J310
    J311
    J212
    J314
    PS C:\> Compare-Object -ReferenceObject $zd -DifferenceObject $zq
    InputObject     SideIndicator
    -----------     -------------
    @{name=J311}    =>
    @{name=J212}    =>
    @{name=J314}    =>
    huh? Anyone got any ideas? Much obliged.

  6. Manu says:

    THANKS Man
