Flat File Parsing
Most KFS implementations integrate with other enterprise systems in a variety of ways. Picking up and parsing batch files is an extremely common way for external applications to push data to KFS. With KFS's flat file parser, an implementing institution can configure the parsing of pushed flat files in Spring configuration, which greatly simplifies the process.
Configuring a Flat File Parser
Imagine, for instance, a file which told KFS to create documents of certain types with accounting lines. Here is our imaginary file format:
DI  KCOPLEY   2010/12/01EXTERNAL DIST INC AND EXP
SRCBL1031420-----500000001000
TRGBL1031490-----507000001000
IB  KMOUTLAW  2010/12/03EXTERNAL INTERNAL BILL
SRCBL1031420-----180000001500
SRCBL4831496-----180000001000
TRGBL1031490-----500000002500
Let's pretend that this file is asking to create two documents: a Distribution of Income and Expense initiated by kcopley on 12/01/2010 with a description of "EXTERNAL DIST INC AND EXP", with one source accounting line and one target accounting line; and an Internal Billing, initiated by kmoutlaw on 12/03/2010, with description of "EXTERNAL INTERNAL BILL", two source accounting lines and one target accounting line.
We'll create the actual documents themselves during processing after the parse. Therefore, during the parse, we will utilize transient objects which will hold the parsed information to be subsequently passed to post-processing. We will call the top level object edu.kuali.kfs.fp.batch.DocumentHolder; the object which holds source accounting line information will be edu.kuali.kfs.fp.batch.DocumentSourceLine; and the object holding the target accounting line information will be edu.kuali.kfs.fp.batch.DocumentTargetLine.
package edu.kuali.kfs.fp.batch;

import java.util.List;
// KualiDecimal must be imported as well; its package depends on the Rice version in use.

class DocumentHolder {
    private String documentType;
    private String initiatorPrincipalName;
    private java.sql.Date initiationDate;
    private String documentDescription;
    private List<DocumentSourceLine> documentSourceLines;
    private List<DocumentTargetLine> documentTargetLines;
    // ... getters and setters ...
}

class DocumentSourceLine {
    // Field names must match the propertyName values used in the Spring configuration.
    private String chartOfAccountsCode;
    private String accountNumber;
    private String objectCode;
    private KualiDecimal amount;
    // ... getters and setters ...
}

// DocumentTargetLine looks a *lot* like DocumentSourceLine
There's nothing special about these objects other than their public getters and setters, which will be used to populate them during parsing. After the parse, we will expect a List of DocumentHolder objects, each one holding the source and target accounting lines associated with that document. The FlatFileParser will always return a List of top level objects, even if it has only parsed one top level object. FlatFileParser is also going to build the object graph for us: it will figure out which child objects belong with which top level objects. We're only going one level deep in this example, but FlatFileParser should be able to handle an object graph as many levels deep as the file requires, as long as the object graph comes back in a "tree" structure.
Now, in edu.kuali.kfs.fp's spring-fp.xml override, we'll add the following configuration:
 1. <bean id="batchDocumentFileType" parent="FlatFileParser">
 2.   <property name="flatFileSpecification">
 3.     <bean parent="FixedWidthFlatFileSpecification" p:defaultBusinessObjectClass="edu.kuali.kfs.fp.batch.DocumentHolder">
 4.       <property name="objectSpecifications">
 5.         <list>
 6.           <bean parent="FlatFileObjectSpecification" p:businessObjectClass="edu.kuali.kfs.fp.batch.DocumentHolder">
 7.             <property name="parseProperties">
 8.               <list>
 9.                 <bean parent="FixedWidthFlatFilePropertySpecification" p:propertyName="documentType" p:start="0" p:end="4" p:rightTrim="true" />
10.                 <bean parent="FixedWidthFlatFilePropertySpecification" p:propertyName="initiatorPrincipalName" p:start="4" p:end="14" p:rightTrim="true" />
11.                 <bean parent="FixedWidthFlatFilePropertySpecification" p:propertyName="initiationDate" p:start="14" p:end="24" p:dateFormat="yyyy/MM/dd" />
12.                 <bean parent="FixedWidthFlatFilePropertySpecification" p:propertyName="documentDescription" p:start="24" p:rightTrim="true" />
13.               </list>
14.             </property>
15.           </bean>
16.           <bean parent="FlatFileObjectSpecification" p:businessObjectClass="edu.kuali.kfs.fp.batch.DocumentSourceLine" p:prefix="SRC" p:parentBusinessObject="edu.kuali.kfs.fp.batch.DocumentHolder" p:parentTargetProperty="documentSourceLines">
17.             <property name="parseProperties">
18.               <list>
19.                 <bean parent="FixedWidthFlatFilePropertySpecification" p:propertyName="chartOfAccountsCode" p:start="3" p:end="5" />
20.                 <bean parent="FixedWidthFlatFilePropertySpecification" p:propertyName="accountNumber" p:start="5" p:end="12" />
21.                 <bean parent="FixedWidthFlatFilePropertySpecification" p:propertyName="objectCode" p:start="17" p:end="21" />
22.                 <bean parent="FixedWidthFlatFilePropertySpecification" p:propertyName="amount" p:start="21" p:end="29" p:formatterClass="org.kuali.kfs.sys.businessobject.format.ExplicitKualiDecimalFormatter" />
23.               </list>
24.             </property>
25.           </bean>
26.           <bean parent="FlatFileObjectSpecification" p:businessObjectClass="edu.kuali.kfs.fp.batch.DocumentTargetLine" p:prefix="TRG" p:parentBusinessObject="edu.kuali.kfs.fp.batch.DocumentHolder" p:parentTargetProperty="documentTargetLines">
27.             <property name="parseProperties">
28.               <list>
29.                 <bean parent="FixedWidthFlatFilePropertySpecification" p:propertyName="chartOfAccountsCode" p:start="3" p:end="5" />
30.                 <bean parent="FixedWidthFlatFilePropertySpecification" p:propertyName="accountNumber" p:start="5" p:end="12" />
31.                 <bean parent="FixedWidthFlatFilePropertySpecification" p:propertyName="objectCode" p:start="17" p:end="21" />
32.                 <bean parent="FixedWidthFlatFilePropertySpecification" p:propertyName="amount" p:start="21" p:end="29" p:formatterClass="org.kuali.kfs.sys.businessobject.format.ExplicitKualiDecimalFormatter" />
33.               </list>
34.             </property>
35.           </bean>
36.         </list>
37.       </property>
38.     </bean>
39.   </property>
40.   <property name="processor" ref="DocumentHolderFlatFileProcessor" />
41. </bean>
This configuration could be improved: since the property lists for DocumentSourceLine and DocumentTargetLine are the same, we likely could have created a single bean to hold all of their FixedWidthFlatFilePropertySpecification objects. But it's good enough to look at what we've got. This configuration by itself will build an object graph out of the file above.
There are four levels to our configuration: FlatFileParser, which deals with the parse process as a whole; FixedWidthFlatFileSpecification, which handles the parsing of the whole file into an object graph; FlatFileObjectSpecification, which handles the parsing of a single line into an object; and FixedWidthFlatFilePropertySpecification, which populates a business object with data from substrings of a given line. Let's take a look at each of these in turn.
FlatFileParser configuration
There are three major pieces of FlatFileParser configuration: the normal BatchInputFileType properties, the parse specification, and the processor injection.
org.kuali.kfs.sys.batch.FlatFileParserBase extends BatchInputFileType and therefore inherits all of the properties which BatchInputFileType already has: titleMessageKey, batch file name, etc. In FlatFileParserBase, many of these properties have been made Spring-injectable so that FlatFileParserBase need not be extended.
The flatFileSpecification property holds a bean that explains how to turn the collection of lines that is the file into an actual object graph. There are currently two implementations of this kind of bean: FixedWidthFlatFileSpecification, where elements on a line are consistently positioned from line to line, and DelimitedFlatFileSpecification, where a special character is used to split one element from the next. Line 3 of our configuration says that we're using FixedWidthFlatFileSpecification; we'll take a closer look in the next section.
The processor injection occurs on line 40. It's assumed that lurking somewhere else in spring-sys.xml, there is a bean named DocumentHolderFlatFileProcessor which implements the interface org.kuali.kfs.sys.batch.FlatFileDataHandler. This interface has two methods on it: validate and process, which correspond to BatchInputFileType's validate and process directly. When BatchInputFileType#validate is called, FlatFileParserBase will attempt to hand validation off to whatever its injected processor property is - in this case, DocumentHolderFlatFileProcessor. In our case, we would likely use the process method to create the documents specified in the file and attempt to save them.
FixedWidthFlatFileSpecification configuration
The next level is the file level. Here we need to make two choices: first, how do we know which elements in which lines should populate properties in our business objects? And second, what prefixes are associated with what objects?
There are currently two implementations for how to split elements apart: FixedWidthFlatFileSpecification and DelimitedFlatFileSpecification. A FixedWidthFlatFileSpecification could be used to parse an origin entry file (or, obviously, our example above), because elements always exist in the same character position in the line throughout the file. Different lines may mean different things and may therefore have information in different positions - just like our file above reads header records differently from accounting line records, but in both cases, the data is positional. A delimited file would be something more like a CSV file where a certain character (",") has a special meaning in terms of separating elements. To make this choice, we either use FixedWidthFlatFileSpecification or DelimitedFlatFileSpecification as parent beans in our configuration, as we used FixedWidthFlatFileSpecification in line 3.
The second piece of configuration is knowing which line correlates to which object. From our investigations of different file types, there seem to be two big choices here: either every single line is the same object, or the line has some kind of "prefix" - characters which tell what kind of object the line correlates to. (The prefix may be in the middle of the String; both FlatFileSpecification beans have a property called "prefixStartingPosition" which defaults to 0 but which can pick up characters from any point in the line.)
Typically, if a prefix is associated with a given object, it is the FlatFileObjectSpecification which specifies that, as we'll soon see. However, if there are no prefixes and every line is the same, then we need to specify a defaultBusinessObjectClass to say "every line will be an object of this class."
In fact, as our example above shows on line 3, the defaultBusinessObjectClass can be used in a file format where certain other lines have prefixes. Our header lines don't have a standard prefix, so we set the defaultBusinessObjectClass. Since our source and target lines have prefixes, though, we specify those in their FlatFileObjectSpecifications.
That brings up an issue, though: what about lines where the file data is meaningless to the KFS process? To deal with that, the FlatFileSpecifications have a property called insignificantPrefixes which takes in a List of Strings. Any line with one of the insignificantPrefixes will be ignored during processing.
Note, finally, that we need not specify a defaultBusinessObjectClass if EVERY line has a prefix. In that case, we would have no need to generate a list of insignificantPrefixes either: if no defaultBusinessObjectClass is specified, then the parser will ignore lines with prefixes it cannot match. Instead, just have FlatFileObjectSpecification do all the class determination, as we're about to see.
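The line-to-class rules described in this section can be sketched in a few lines of plain Java. This is only an illustration of the decision order, not the KFS implementation: the class name LineClassifierSketch, its methods, and the choice to check insignificant prefixes first are all assumptions made for the example.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for the prefix-matching step; not KFS code. */
class LineClassifierSketch {
    private final Map<String, String> prefixToClass = new LinkedHashMap<>();
    private final List<String> insignificantPrefixes;
    private final String defaultBusinessObjectClass; // may be null
    private final int prefixStartingPosition;

    LineClassifierSketch(List<String> insignificantPrefixes,
                         String defaultBusinessObjectClass,
                         int prefixStartingPosition) {
        this.insignificantPrefixes = insignificantPrefixes;
        this.defaultBusinessObjectClass = defaultBusinessObjectClass;
        this.prefixStartingPosition = prefixStartingPosition;
    }

    /** Associates a prefix with the class its lines should be parsed into. */
    void register(String prefix, String className) {
        prefixToClass.put(prefix, className);
    }

    /** Returns the class name for a line, or null if the line should be ignored. */
    String classify(String line) {
        // Lines carrying an insignificant prefix are skipped outright.
        for (String prefix : insignificantPrefixes) {
            if (line.startsWith(prefix, prefixStartingPosition)) {
                return null;
            }
        }
        // Otherwise the first registered prefix that matches wins.
        for (Map.Entry<String, String> entry : prefixToClass.entrySet()) {
            if (line.startsWith(entry.getKey(), prefixStartingPosition)) {
                return entry.getValue();
            }
        }
        // No match: fall back to the default class, or ignore the line if there is none.
        return defaultBusinessObjectClass;
    }
}
```

With "SRC" and "TRG" registered and DocumentHolder as the default, the header lines of the sample file fall through to the default while the accounting lines match their prefixes. With a null default, unmatched lines are simply dropped.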
FlatFileObjectSpecification configuration
Now, we're getting down to the object level and our parsing is really taking off. We're now at the point where a line gets turned into an object. We've got three points of configuration in our example above, at lines 6, 16, and 26, and from those we can see that there's a number of elements each configuration has in common.
First, FlatFileObjectSpecification associates a class with a List of parseProperties.
Second, FlatFileObjectSpecification has a prefix associated with it. Naturally, this prefix is not needed in cases when the FlatFileObjectSpecification is for the defaultBusinessObjectClass - that's the case with DocumentHolder. For all other FlatFileObjectSpecifications, such as DocumentSourceLine and DocumentTargetLine, it is required.
Note, too, that the FlatFileObjectSpecification objects for DocumentSourceLine and DocumentTargetLine specify a parentBusinessObject and a parentTargetProperty. These basically tell the parser that the current object is supposed to be the child of some other object - in the case of DocumentSourceLine and DocumentTargetLine, the parent object should be DocumentHolder. What parentTargetProperty tells the parser is what property of the parent object the given child belongs in.
If the type of the parent target property is a java.util.Collection, the child object will be added to the end of it. Otherwise, the parser will attempt to set the child object directly as the value of that property.
Again, there's no theoretical depth limit on how many parents and children you can have (only the constraints of physical memory). An object can have as many types of child objects as needed to parse the file, and it can be the child of some other parent as well. The FlatFileParser - specifically, the ParseTracker implementation - will attempt to correctly build the object graph based on this parent/child hierarchy.
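To make the parent/child bookkeeping concrete, here is a deliberately simplified sketch of how a one-level object graph could be assembled from the sample file. It hard-codes the SRC/TRG prefixes and a stand-in Holder class; the real ParseTracker is driven by configuration and is not shown in this document, so every name below is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified illustration of graph building; not the KFS ParseTracker. */
class ObjectGraphSketch {

    /** Stand-in for DocumentHolder: keeps the raw header and its child lines. */
    static class Holder {
        String header;
        List<String> sourceLines = new ArrayList<>();
        List<String> targetLines = new ArrayList<>();
    }

    /** Returns the top level objects, each with its children attached. */
    static List<Holder> build(List<String> lines) {
        List<Holder> topLevel = new ArrayList<>();
        Holder current = null; // the most recently seen parent
        for (String line : lines) {
            boolean isSource = line.startsWith("SRC");
            boolean isTarget = line.startsWith("TRG");
            if (isSource || isTarget) {
                if (current == null) {
                    throw new IllegalStateException("Child line before any header: " + line);
                }
                // The target property is a Collection, so the child is appended to it.
                (isSource ? current.sourceLines : current.targetLines).add(line);
            } else {
                // Any other line starts a new top level object.
                current = new Holder();
                current.header = line;
                topLevel.add(current);
            }
        }
        return topLevel;
    }
}
```

Run over the seven lines of the sample file, this yields two Holder objects: the first with one source and one target line, the second with two source lines and one target line.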
FixedWidthFlatFilePropertySpecification configuration
We've got our line, we've got the object that it needs to be parsed into. Now the final step: finding the significant substrings in the line and populating the object based on those. Let's take a look at the first of these in our sample configuration above, line 9:
<bean parent="FixedWidthFlatFilePropertySpecification" p:propertyName="documentType" p:start="0" p:end="4" p:rightTrim="true" />
Our FixedWidthFlatFilePropertySpecification obviously needs to give us the property name, so the FlatFileParser can populate the property on the object correctly. It's expected these are unique...but the FlatFileParser doesn't check to make sure you're not overwriting data.
Since this is a FixedWidthFlatFilePropertySpecification, we need to say where the string begins and where it ends. These indexes work precisely like java.lang.String.substring works, which is to say that the character at the end index will not be included in the substring (this is for technical reasons too complicated to explain here).
Also, we note that the rightTrim property of the specification has been set to true. This simply tells the FlatFileParser to trim the String before setting the property (there's a leftTrim property as well - both default to false).
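The start/end and trimming rules can be tried out with nothing but java.lang.String. The sketch below is a standalone illustration, not parser code: the slice helper is a hypothetical name, and the header line is assumed to be space-padded to the column widths given in the configuration.

```java
/** Standalone illustration of fixed-width slicing; not KFS code. */
class FixedWidthSliceSketch {

    /** start is inclusive and end is exclusive, exactly as in String.substring. */
    static String slice(String line, int start, int end, boolean rightTrim) {
        String value = line.substring(start, end);
        // rightTrim strips trailing whitespace only; leading whitespace is kept.
        return rightTrim ? value.replaceAll("\\s+$", "") : value;
    }

    public static void main(String[] args) {
        String header = "DI  KCOPLEY   2010/12/01EXTERNAL DIST INC AND EXP";
        String source = "SRCBL1031420-----500000001000";

        System.out.println("[" + slice(header, 0, 4, true) + "]");    // [DI]
        System.out.println("[" + slice(header, 4, 14, true) + "]");   // [KCOPLEY]
        System.out.println("[" + slice(header, 4, 14, false) + "]");  // [KCOPLEY   ]
        System.out.println("[" + slice(source, 3, 5, false) + "]");   // [BL]
        System.out.println("[" + slice(source, 5, 12, false) + "]");  // [1031420]
        System.out.println("[" + slice(source, 17, 21, false) + "]"); // [5000]
    }
}
```

Note how the character at index 4 of the header ("K") belongs to the second field, not the first, because the end index is exclusive.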
Of course, simply trimming the String may not be all the data formatting we need to accomplish. Indeed, we may need to convert the substring to a different data type entirely. FixedWidthFlatFilePropertySpecifications can be given the class of a normal KNS Formatter to assist, as we see from line 22 of our sample configuration above:
<bean parent="FixedWidthFlatFilePropertySpecification" p:propertyName="amount" p:start="21" p:end="29" p:formatterClass="org.kuali.kfs.sys.businessobject.format.ExplicitKualiDecimalFormatter" />
Here, we're telling the property specification to use ExplicitKualiDecimalFormatter to turn the substring into a KualiDecimal. It should be noted that ExplicitKualiDecimalFormatter was introduced specifically to format a common case in flat files - that occurring when a String such as "100" should be converted to the KualiDecimal 1.00.
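The "implied decimal point" conversion described above is easy to reproduce with java.math.BigDecimal. This is only a sketch of the arithmetic: BigDecimal stands in for KualiDecimal, the two-decimal-place shift is taken from the "100" to 1.00 example in the text, and the helper name is hypothetical. The internals of ExplicitKualiDecimalFormatter are not shown in this document.

```java
import java.math.BigDecimal;

/** Illustrates the implied-decimal conversion; not the KFS formatter. */
class ImpliedDecimalSketch {

    /** Treats the last two digits of the raw string as cents. */
    static BigDecimal parseImpliedCents(String raw) {
        // Leading zeros are accepted by BigDecimal, so "00001000" parses as 1000.
        return new BigDecimal(raw.trim()).movePointLeft(2);
    }

    public static void main(String[] args) {
        System.out.println(parseImpliedCents("100"));      // 1.00
        System.out.println(parseImpliedCents("00001000")); // 10.00
    }
}
```

Applied to the first source accounting line of the sample file, the amount substring "00001000" therefore becomes 10.00.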
In flat file formats, there's also a plethora of choices about how to represent dates. Therefore, FixedWidthFlatFilePropertySpecification has a special property, dateFormat, which simply says "parse this property into a Date using this format". It's used on line 11 of our sample configuration above to turn the initiationDate substring into a java.sql.Date:
<bean parent="FixedWidthFlatFilePropertySpecification" p:propertyName="initiationDate" p:start="14" p:end="24" p:dateFormat="yyyy/MM/dd" />
There is also a property "formatToTimestamp" which makes sure the date is formatted into a java.sql.Timestamp. Otherwise, the date will be formatted to java.sql.Date.
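A small standalone sketch of the date handling follows. It assumes the dateFormat string follows java.text.SimpleDateFormat conventions, which is an assumption since the parser's internals are not shown here; the class and method names are hypothetical. In SimpleDateFormat patterns, yyyy is the year, MM the month, and dd the day of month, while lowercase mm means minutes.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;

/** Standalone sketch of date field parsing; not KFS code. */
class DateFieldSketch {

    /**
     * Parses raw with the given pattern and returns a java.sql.Timestamp
     * when toTimestamp is true, otherwise a java.sql.Date.
     */
    static java.util.Date parseDate(String raw, String pattern, boolean toTimestamp) {
        SimpleDateFormat format = new SimpleDateFormat(pattern);
        format.setLenient(false); // reject impossible dates such as 2010/13/45
        try {
            long millis = format.parse(raw).getTime();
            if (toTimestamp) {
                return new java.sql.Timestamp(millis);
            }
            return new java.sql.Date(millis);
        } catch (ParseException e) {
            throw new IllegalArgumentException("Cannot parse '" + raw + "' as " + pattern, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parseDate("2010/12/01", "yyyy/MM/dd", false));
        System.out.println(parseDate("2010/12/01", "yyyy/MM/dd", true));
    }
}
```

Turning off leniency is a choice made for this sketch so that malformed dates fail loudly instead of rolling over into a later month.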
And with that, we've parsed our first flat file.
Delimited Files
Time for our next challenge. Let's say we had a file where every line looked like this:
BL~1031400~~5000~50.00 BL~1031400~ADV~5070~7523.17
Here, we don't have a fixed position for every element; instead, we have to split the String by the "~" character and find which element we want within an array. The FlatFileParser also supports this kind of parsing. There are only two configuration changes: using DelimitedFlatFileSpecification in the place of FixedWidthFlatFileSpecification and using DelimitedFlatFilePropertySpecifications instead of FixedWidthFlatFilePropertySpecifications.
DelimitedFlatFileSpecification has one property that FixedWidthFlatFileSpecification lacks: a delimiter. It's configured like so:
<bean parent="DelimitedFlatFileSpecification" p:delimiter="~" p:defaultBusinessObjectClass="...">
DelimitedFlatFilePropertySpecification objects are just like FixedWidthFlatFilePropertySpecification objects: they can have formatters and date formats, and they always specify a propertyName. The difference is that instead of start and end properties, they have a property "lineSegmentIndex" which picks which segment of the line to populate into the property, with the left-most segment at lineSegmentIndex 0. For instance, if we wanted to set the accountNumber in the file into an object, we'd do the following:
<bean parent="DelimitedFlatFilePropertySpecification" p:propertyName="accountNumber" p:lineSegmentIndex="1" />
Delimited files can use prefixes just like fixed width files.
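The lineSegmentIndex lookup can be illustrated with String.split. The sketch below is not the parser's code and the helper name is hypothetical. It quotes the delimiter so that regex metacharacters are treated literally, and passes a negative limit so that empty segments, like the blank third element of the first sample line, are preserved.

```java
import java.util.regex.Pattern;

/** Standalone illustration of delimited-line segment lookup; not KFS code. */
class DelimitedLineSketch {

    /** Returns the segment at lineSegmentIndex, counting from 0 at the left. */
    static String segment(String line, String delimiter, int lineSegmentIndex) {
        // Pattern.quote makes the delimiter literal; -1 keeps trailing empty segments.
        String[] parts = line.split(Pattern.quote(delimiter), -1);
        return parts[lineSegmentIndex];
    }

    public static void main(String[] args) {
        String line = "BL~1031400~~5000~50.00";
        System.out.println(segment(line, "~", 1));               // 1031400
        System.out.println("[" + segment(line, "~", 2) + "]");   // []
        System.out.println(segment(line, "~", 4));               // 50.00
    }
}
```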
Messages for logical files
Oftentimes, a flat file will be made up of a few major header objects with many children. It makes sense in these situations to organize the errors and informational messages recorded during validation around these header objects rather than around the file as a whole. The FlatFileParser also includes support for that.
Basically, the header object needs to implement the interface FlatFileData. This interface has one method, a getter for a FlatFileTransactionInformation object.
org.kuali.kfs.sys.batch.FlatFileTransactionInformation is simply a holder for errors and info messages. With this convenient holder, e-mails can be sent out to the users responsible for each "transaction" object. For instance, in our example above, we could e-mail the intended initiator of a document to tell them that validation failed.
Thoughts for improvements
Just some off the cuff thoughts for improvements here...
- Configuring different ParseTrackers is tricky right now, but making the name of a default ParseTracker bean configurable would likely be pretty easy to accomplish
- And of course, the obvious alternate implementation of a ParseTracker is one that simply sticks each line into a list, without attempting to build an object graph at all (which would be faster and less complicated on files, like origin entry files, where the whole file has only one homogeneous object type).
- It would be lovely if the Delimited Files parser was smart enough to parse CSV files and handle stuff like the "commas within quotes" rule...
- Return data dictionary if p:start missing in FixedWidthFlatFilePropertySpecification?