XSL Pipeline Processing

Pipeline processing is a powerful XSL programming technique that leads to programs that are much easier to maintain and enhance. Complex transformations can be achieved by chaining together a series of simple XSL transforms. This essay demonstrates the value of a pipeline processing approach and covers some implementation specifics.

Developers familiar with the power of pipeline operations central to the UNIX operating system know how simple, modular tools can be chained together to accomplish a wide variety of complex tasks.

XSL pipelines offer the same advantage for XML transformation. Where UNIX pipelines are based around standard input and output of lines of text, XSL pipelines rely on the structure of well-formed XML between stages.

The Ideal Transform Rule

Sometimes the XML you need to transform is poorly suited to the output you’re trying to produce. Sometimes that output is quite complex in its own right. In these situations, it’s advantageous to break a transform into two stages. The first stage produces an “ideal input” for the second stage. To paraphrase Einstein, the second stage therefore becomes “as simple as possible, but no simpler.”

There are many reasons an XML input may not be ideal. Data pulled from legacy systems or databases may have an awful structure or an antiquated naming convention that makes your code difficult to understand. It’s not uncommon to have many processes share a large XML structure with each process only requiring a small subset of the data. A pre-processing XSL transform can eliminate these problems with ugly XML. Never deal with ugly XML!

Ideal Transform Rule
Work from ideal input when writing complex style sheets.

In a two-stage transform, the first stage is usually simple because it only restructures the actual input into the ideal input. The second stage is simple because the first stage has tailored the ideal input for its operation. Both transforms benefit from not trying to accomplish both restructuring and final output at once. Simple transforms are desirable because they’re easier to write, understand, and maintain.

Multi-stage Transforms

Multi-stage transforms can be assembled as batch files, through API calls, or by a single XSL style sheet using an intermediate result tree fragment. The following example illustrates the use of a result tree fragment stored in a variable:

1 |<xsl:stylesheet version="1.0"
2 |   xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
3 |   xmlns:msxsl="urn:schemas-microsoft-com:xslt">
4 |
5 |   <!-- Stage One Result Tree Fragment -->
6 |   <xsl:variable name="sorted-names">
7 |         <SortedNames>
8 |               <xsl:apply-templates select="//Name" mode="s1">
9 |                     <xsl:sort select="." />
10|               </xsl:apply-templates>
11|         </SortedNames>
12|   </xsl:variable>
13|
14|   <!-- Stage One Name Template -->
15|   <xsl:template match="Name" mode="s1">
16|         <xsl:copy-of select="." />
17|   </xsl:template>
18|
19|   <!-- Stage Two Root Template -->
20|   <xsl:template match="/">
21|         <xsl:apply-templates 
22|               select="msxsl:node-set($sorted-names)//Name" />
23|   </xsl:template>
24|
25|   <!-- Stage Two Name Template -->
26|   <xsl:template match="Name">
27|         My name is:
28|         <b><xsl:value-of select="." /></b><br />
29|   </xsl:template>
30|
31|</xsl:stylesheet>

In the example above, an extract containing only sorted Name elements is created as a first stage, then a set of HTML-formatted name labels is produced from the extract in the second stage. This sample is not sufficiently complex to demonstrate the benefits of multiple stage transforms in general, but it does demonstrate the use of the result tree fragment mechanism.

When working with result tree fragments, you need to use an extension function provided by your XSL processor. XSLT 1.0 does not allow a result tree fragment to be navigated like an ordinary node-set, which was an oversight in the specification. All the major XSL processor implementations have created extension functions to handle result tree fragments because the feature is too useful to ignore. In XSLT 2.0, a variable holds a temporary tree that can be processed directly, without an extension function.

This sample shows the Microsoft XML toolset result tree fragment solution in particular, but the other implementations are very similar.
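For comparison, here is a sketch of the same two-stage style sheet written against the EXSLT common module, which processors such as libxslt and Xalan support. Only the namespace declaration and the function prefix change. Whether your own processor supports exsl:node-set() is an assumption to confirm in its documentation.

```xml
<xsl:stylesheet version="1.0"
   xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
   xmlns:exsl="http://exslt.org/common"
   exclude-result-prefixes="exsl">

   <!-- Stage one: collect the sorted Name elements in a result tree fragment -->
   <xsl:variable name="sorted-names">
      <SortedNames>
         <xsl:apply-templates select="//Name" mode="s1">
            <xsl:sort select="." />
         </xsl:apply-templates>
      </SortedNames>
   </xsl:variable>

   <xsl:template match="Name" mode="s1">
      <xsl:copy-of select="." />
   </xsl:template>

   <!-- Stage two: exsl:node-set() takes the place of msxsl:node-set() -->
   <xsl:template match="/">
      <xsl:apply-templates select="exsl:node-set($sorted-names)//Name" />
   </xsl:template>

   <xsl:template match="Name">
      My name is:
      <b><xsl:value-of select="." /></b><br />
   </xsl:template>

</xsl:stylesheet>
```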

Debugging Tip
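Use xsl:copy-of to put the contents of a variable containing a result tree fragment into your output. Wrap the call in a distinctively named element to separate the output from the rest of your transform if need be. Note that XSLT 1.0 permits only text nodes inside xsl:comment, so a processor will either report an error or drop the copied elements if you wrap the call in a comment instead.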
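A minimal sketch of this tip, assuming a sorted-names variable like the one in the sample above. The debug wrapper element is a hypothetical name chosen for illustration; it keeps the dumped markup intact and easy to find in the output. Because xsl:copy-of accepts a result tree fragment directly, no extension function is needed here.

```xml
<xsl:stylesheet version="1.0"
   xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

   <!-- Stage one result tree fragment -->
   <xsl:variable name="sorted-names">
      <SortedNames>
         <xsl:for-each select="//Name">
            <xsl:sort select="." />
            <xsl:copy-of select="." />
         </xsl:for-each>
      </SortedNames>
   </xsl:variable>

   <!-- Debugging aid: dump the fragment into the output.
        The debug element name is hypothetical; remove this
        once the intermediate result looks right. -->
   <xsl:template match="/">
      <debug>
         <xsl:copy-of select="$sorted-names" />
      </debug>
   </xsl:template>

</xsl:stylesheet>
```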

To use an extension function, you must first declare the extension function namespace in your style sheet. The xmlns:msxsl namespace declaration on line three in the sample above accomplishes this. The extension function msxsl:node-set(), as seen on line 22, is only available when the extension namespace has been declared. The node-set() function converts the result tree fragment into a node-set, so processing can navigate the fragment instead of the input XML document.

Any number of result tree fragments may be created and processed in a single style sheet. Multiple style sheets can always be combined into a single style sheet using result tree fragments. However, this technique should be used sparingly for XSL pipeline processing because combined style sheets are often considerably more complex and therefore less maintainable.

The sample’s use of the mode attribute when building the result tree fragment is not entirely necessary, but it’s often helpful. Modes segregate templates with match patterns that would otherwise conflict during processing. When mode is changed during processing, only templates in the current mode match.

Modes are used to make multiple passes over an XML document producing different outputs. For example, a single style sheet may produce both a table of contents and the body of a report in two passes over the body of the report.
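The table of contents case might look like the sketch below. The Report, Section, Title, and Para element names are assumptions made for illustration, not a format defined in this essay. The same Section elements are visited twice: once in a toc mode that emits links, and once in the default mode that emits the body.

```xml
<xsl:stylesheet version="1.0"
   xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

   <!-- Root template: two passes over the same Section elements -->
   <xsl:template match="/Report">
      <html>
         <body>
            <!-- Pass one: table of contents -->
            <ul>
               <xsl:apply-templates select="Section" mode="toc" />
            </ul>
            <!-- Pass two: report body, default mode -->
            <xsl:apply-templates select="Section" />
         </body>
      </html>
   </xsl:template>

   <!-- toc mode: one list entry per section, linking to its anchor -->
   <xsl:template match="Section" mode="toc">
      <li>
         <a href="#{generate-id()}"><xsl:value-of select="Title" /></a>
      </li>
   </xsl:template>

   <!-- Default mode: the section heading and its paragraphs -->
   <xsl:template match="Section">
      <h2 id="{generate-id()}"><xsl:value-of select="Title" /></h2>
      <xsl:apply-templates select="Para" />
   </xsl:template>

   <xsl:template match="Para">
      <p><xsl:value-of select="." /></p>
   </xsl:template>

</xsl:stylesheet>
```

Without the mode attribute, the two Section templates would conflict, since both match the same nodes.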

In the multi-stage sample above, mode is used to create the $sorted-names variable containing the result tree fragment beginning on line five. The output of the templates matched in the s1 mode is accumulated in the variable as a result tree fragment. The s1 mode is entered and exited within the xsl:variable tags via the xsl:apply-templates call with the mode attribute on line eight.

The xsl:apply-templates call on line 21, in default mode, moves processing context to the result tree fragment, allowing the stage two name template to match.

Pipeline Flexibility

XSL pipelines are powerful because they are easily extended to accommodate additional functionality. Consider the following simple pipeline:

[Figure: Dataset.xml -> Table.xsl -> HtmlTable.xsl -> HTML table]

A dataset is extracted from a database as Dataset.xml. This XML is transformed into an intermediate table XML format by Table.xsl that decorates the data with column headers, alignment and other formatting hints specific to the display of this dataset. Finally, the generic HtmlTable.xsl style sheet produces an HTML table from the intermediate table XML. The wisdom of the intermediate table XML format will be revealed shortly.

When the dataset gets large, it’s natural to want to add paging and sorting to the implementation. With a pipeline approach, this simply means inserting some additional stages into the pipeline:

[Figure: the same pipeline with Sort.xsl and Page.xsl stages inserted]

Both the sorting and paging style sheets need parameters. Sort needs a column name and direction, and page needs a page size and page number. How you provide these parameters is up to your specific implementation, but parameterized transforms are a typical component of XSL pipelines.
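As a sketch of a parameterized stage, a Sort.xsl might look like the following. The intermediate table format is not specified in this essay, so the Rows, Row, and Cell elements, the column attribute, and the parameter names and defaults are all hypothetical. The order attribute of xsl:sort is an attribute value template, which is what lets the direction come from a parameter.

```xml
<xsl:stylesheet version="1.0"
   xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

   <!-- Parameters supplied by the caller; defaults are illustrative -->
   <xsl:param name="sort-column" select="'Name'" />
   <xsl:param name="sort-order" select="'ascending'" />

   <!-- Copy everything not handled below unchanged -->
   <xsl:template match="@*|node()">
      <xsl:copy>
         <xsl:apply-templates select="@*|node()" />
      </xsl:copy>
   </xsl:template>

   <!-- Re-emit the rows ordered by the chosen column -->
   <xsl:template match="Rows">
      <xsl:copy>
         <xsl:apply-templates select="@*" />
         <xsl:apply-templates select="Row">
            <xsl:sort select="Cell[@column = $sort-column]"
                      order="{$sort-order}" />
         </xsl:apply-templates>
      </xsl:copy>
   </xsl:template>

</xsl:stylesheet>
```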

The Sort.xsl and Page.xsl style sheets are written against and produce the intermediate table XML format. This makes the style sheets more modular and reusable. AHA! By sharing an intermediate format we get three reusable style sheets out of this pipeline implementation. Pipelines like this one are a valuable addition to the developer’s toolkit you bring to every project.

The pipeline stage style sheets are typically based on the identity transform. Stages may change the structure of the data, filter the data, or decorate the data by adding elements or attributes. Variations on identity transforms keep the stages simple.
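The sketch below shows an identity-based filtering stage in the spirit of Page.xsl. The identity template copies every node unchanged, and a single overriding template keeps only the rows on the requested page. As before, the Rows and Row element names and the parameter names are assumptions for illustration.

```xml
<xsl:stylesheet version="1.0"
   xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

   <!-- Parameters supplied by the caller; defaults are illustrative -->
   <xsl:param name="page-size" select="10" />
   <xsl:param name="page-number" select="1" />

   <!-- Identity template: copy every node and attribute as-is -->
   <xsl:template match="@*|node()">
      <xsl:copy>
         <xsl:apply-templates select="@*|node()" />
      </xsl:copy>
   </xsl:template>

   <!-- Filter: keep only the rows that fall on the requested page -->
   <xsl:template match="Rows">
      <xsl:variable name="first"
         select="($page-number - 1) * $page-size + 1" />
      <xsl:variable name="last"
         select="$page-number * $page-size" />
      <xsl:copy>
         <xsl:apply-templates select="@*" />
         <xsl:apply-templates
            select="Row[position() &gt;= $first and position() &lt;= $last]" />
      </xsl:copy>
   </xsl:template>

</xsl:stylesheet>
```

With both overriding templates removed, what remains is a pure identity transform, which is a convenient stub for a stage that has not been written yet.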

Imagine how easy it would be to add another stage to this pipeline that flags rows meeting certain criteria with a highlight or checkmark attribute. Such a style sheet could form the basis for searching or selecting rows in the result set for other operations.
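One way such a decorating stage could look is sketched below. The search-text parameter, the highlight attribute value, and the Row and Cell element names are hypothetical. The attribute is added before the row's children are copied, because XSLT requires attributes to be written before child nodes.

```xml
<xsl:stylesheet version="1.0"
   xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

   <!-- Hypothetical search parameter; empty means no highlighting -->
   <xsl:param name="search-text" select="''" />

   <!-- Identity template: copy every node and attribute as-is -->
   <xsl:template match="@*|node()">
      <xsl:copy>
         <xsl:apply-templates select="@*|node()" />
      </xsl:copy>
   </xsl:template>

   <!-- Decorate: flag rows that have a cell containing the search text -->
   <xsl:template match="Row">
      <xsl:copy>
         <xsl:apply-templates select="@*" />
         <xsl:if test="$search-text != '' and Cell[contains(., $search-text)]">
            <xsl:attribute name="highlight">true</xsl:attribute>
         </xsl:if>
         <xsl:apply-templates select="node()" />
      </xsl:copy>
   </xsl:template>

</xsl:stylesheet>
```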

Pipeline Performance

When developing pipelines, a key performance guideline is to create the smallest subset of the XML document as early as possible in the pipeline. For example, if a filter is going to select only five out of a hundred records, then that filter ought to be as early in the pipeline as possible. By reducing the size of the XML flowing through the pipeline, performance can be improved all around.

Pipelines may become slow for a variety of reasons including heavy usage, excessively large XML, or poorly written stages. In general, you will be surprised at how well a pipeline approach performs in practice. But if you do encounter performance problems with pipelines, you’ll find they are well structured for optimization.

Simple timings reveal which pipeline stages are running slowly. Consider rewriting slow single stages as DOM operations. DOM operations are more work but can lead to big gains for certain kinds of transforms. Also consider combining similar stages into a single transform if the complexity doesn’t become unreasonable.

Inefficient XPath expressions and poor style sheet processing flow are other common performance problems. Taking advantage of keys and caching intermediate results in variables are helpful XSL performance improvement techniques. Future essays will be devoted to XSL performance.
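As a small illustration of both techniques, the sketch below joins orders to customers. The Order and Customer elements and their attributes are assumed for the example. The xsl:key declaration lets the processor index customers by id, instead of rescanning the document with an expression such as //Customer[@id = current()/@customerId] for every order, and the lookup result is held in a variable so it is evaluated once per order.

```xml
<xsl:stylesheet version="1.0"
   xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

   <!-- Declare a key so Customer elements can be looked up by id -->
   <xsl:key name="customer-by-id" match="Customer" use="@id" />

   <xsl:template match="/">
      <Orders>
         <xsl:apply-templates select="//Order" />
      </Orders>
   </xsl:template>

   <xsl:template match="Order">
      <!-- Cache the lookup result in a variable and reuse it -->
      <xsl:variable name="customer"
         select="key('customer-by-id', @customerId)" />
      <Order number="{@number}"
             customer="{$customer/Name}"
             city="{$customer/City}" />
   </xsl:template>

</xsl:stylesheet>
```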

Microsoft XSL Processing Pipelines

Prefer the read-only XPathDocument class in your pipeline implementations. Load and transform operations are much faster with XPathDocument.

There are many ways to implement an XSL transform pipeline with the Microsoft .NET XML services. Use XmlReader- and XmlWriter-based classes for IO, either XmlDocument or XPathDocument classes as a transform source, and the XslTransform class to perform the transform processing.

The diagram below illustrates the data flows between the .NET XML classes commonly used for pipelines:

[Figure: data flows between the .NET XML classes]

Note the following features indicated by the flows:

XmlReader loads XmlDocument or XPathDocument
XslTransform reads XmlDocument or XPathDocument
XslTransform targets XmlReader or XmlWriter
XmlWriter outputs XmlDocument or XPathDocument

The following C# code fragment shows how to implement a pipeline:

using System.Xml;
using System.Xml.XPath;
using System.Xml.Xsl;

// load the input document and style sheets
XPathDocument docIn = new XPathDocument( "list.xml" );
XslTransform xslStageA = new XslTransform( );
xslStageA.Load( "a.xsl" );
XslTransform xslStageB = new XslTransform( );
xslStageB.Load( "b.xsl" );
XslTransform xslStageC = new XslTransform( );
xslStageC.Load( "c.xsl" );
XmlUrlResolver res = new XmlUrlResolver( );

// three stage pipeline, null XsltArgumentList
XmlReader xpipe;

// stage A: read the transform result into a new XPathDocument
xpipe = xslStageA.Transform( docIn, null, res );
docIn = new XPathDocument( xpipe );

// stage B: same pattern, feeding on the output of stage A
xpipe = xslStageB.Transform( docIn, null, res );
docIn = new XPathDocument( xpipe );

// stage C: write the final output directly to a file
XmlTextWriter docOut = new
   XmlTextWriter( "out.xml", System.Text.Encoding.UTF8 );
xslStageC.Transform( docIn, null, docOut, res );
docOut.Close( ); // flush and release the output file

The XmlReader xpipe variable references the XmlReader created by each call to Transform. I’ve found that letting the XslTransform class handle the creation of the XmlReader performs well, though I haven’t benchmarked this against a user-managed MemoryStream and XmlReader.

The transform result is loaded from the XmlReader into a new XPathDocument, docIn, for each stage. The last stage targets an XmlWriter to send the output of the transform directly to a text file. With ASP.NET you may choose to target the HTTP output stream associated with your page response if you’re creating HTML.

This sample pipeline implementation is unfortunately rather dumbed-down. Error handling, a parameter facility, and a set of classes to encapsulate pipeline functionality are beyond the scope of what I wanted to include here. The XslPipe project will provide a robust .NET pipeline implementation.

Using a series of easy to understand and maintain transforms leads to better and more reliable software.

An XSL pipeline processing approach has considerable advantages. With changing business requirements, pipeline processing enables stages to be modified or added as new features are requested. It’s easier to change an XSL file than to change code and recompile an application.

Development of a pipeline can proceed incrementally, adding stages and delivering functionality in an iterative process typical of modern project lifecycle methodologies. The project technical lead can stub out a pipeline with identity transforms for each stage early in the project, allowing developers to flesh out the stages during development. For project managers, pipeline stages also provide a natural partitioning of tasks among a team of developers.

In future essays, pipelines will be used for a variety of code generation tasks and XSL demonstrations. The XslPipe project will also implement a pipeline processor with an accompanying pipeline specification language.

In the meantime, two Java XSL pipeline projects worth checking out are Norman Walsh’s SXPipe and the Apache Cocoon project. Cocoon allows for very sophisticated XML pipelines, including non-XSL generator stages that produce XML from databases and web service requests. Build your own pipeline system with batch files and start playing with pipelines!

References

Microsoft .NET Framework System.Xml Reference: http://msdn.microsoft.com/library/default.asp?url=/library/en-us/cpref/html/frlrfsystemxml.asp

Microsoft .NET Framework System.Xml.XPath Reference: http://msdn.microsoft.com/library/default.asp?url=/library/en-us/cpref/html/frlrfSystemXmlXPath.asp

SXPipe: Simple XML Pipelines: http://norman.walsh.name/2004/06/20/sxpipe

Apache Cocoon Project: http://cocoon.apache.org/