Append rows in Pentaho Data Integration

Last year, right after the summer, in version 4.1 of Pentaho Data Integration, we introduced the notion of dynamically inserted ETL metadata (YouTube video here). Since then we have received a lot of positive feedback on this functionality, which encouraged me to extend it to a few more steps. Already with support for “CSV Input” and “Select Values” we could do a lot of dynamic things. However, we can clearly do a lot better by extending our initiative to a few more steps: “Microsoft Excel Input” (which can also read ODS, by the way), “Row Normalizer” and “Row De-normalizer”.

Below I’ll describe an actual (obfuscated) example that you will probably recognize, as it is as hideous as it is simple in its horrible complexity.

Let’s assume that the spreadsheet describes the number of products sold on a given date. Spreadsheets like these are usually generated automatically by some kind of pivoting program. Because of that, they contain a varying number of columns with a dimension value in the column header. In our sample we have one column for each date since the beginning of this year. Not all dates are listed, however; that would be too simple. Only days on which products were actually sold are listed. In our sample that means no weekend days are present in the spreadsheet. In short: this spreadsheet will look different all year round.

You just want to load this data into a database somewhere and be done with it. Surely, this can’t be too hard to read in by an ETL tool, right? Databases don’t support tables with a varying number of columns. So we actually want to get only 3 columns from this spreadsheet: the product code, the date of sales and the number of goods sold (the metric). In order to do that, we need to know the layout of the spreadsheet before we do the ETL. We also need to un-pivot or normalize the data. Again, to be able to do that, we need to know the exact layout of the file.

Once at this point, any serious ETL consultant will be doing one of the following:

Convince the customer that this is a “bad file” and that the un-pivoted source data is needed. Days will be spent convincing and acquiring this data… when in fact it’s already there.

Ask for the file to be delivered in CSV format so that some clever Perl-Python-JavaScript-Ruby-awk-bash script can be applied to the problem… Days will be spent by the customer to convert the file, and weeks to get the script right and to debug and maintain it.

The ones who have heard of the PDI “Split Field to rows” step might know (from our forum) how to solve the problem by reading the whole line and splitting it into un-pivoted form with a wad of JavaScript.

A senior ETL consultant (with evil tendencies) will try to convince the customer that the requirement is “off scope” or shift it to the next “iteration”. It’s probably something along those lines.

Ignore the problem, it’s clearly too hard to do.

Pull out hair (not an option in my case).

Now, most open source software starts with this realization: “There has got to be a better way!” This is no different…
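
To make the un-pivot requirement described above concrete, here is a minimal sketch in plain Python. It is not PDI and not how the “Row Normalizer” step is implemented; it only illustrates the idea of discovering the date columns from the header at run time and emitting one (product code, date of sales, number sold) row per filled cell. The function name `unpivot`, the key column name `product_code`, the date format and the sample data are all assumptions made up for illustration.

```python
from datetime import datetime


def unpivot(header, rows, key_column="product_code", date_format="%Y-%m-%d"):
    """Yield one (product_code, sales_date, units_sold) tuple per filled cell."""
    # Discover the layout at run time: every column except the key column
    # is assumed to carry a date in its header.
    key_index = header.index(key_column)
    date_columns = [
        (i, datetime.strptime(name, date_format).date())
        for i, name in enumerate(header)
        if i != key_index
    ]
    for row in rows:
        for i, sales_date in date_columns:
            value = row[i] if i < len(row) else None
            if value in (None, ""):
                continue  # nothing sold for this product on this date
            yield row[key_index], sales_date, int(value)


# Hypothetical sample: only days with sales have a column, so the
# number of columns changes from one delivery of the file to the next.
header = ["product_code", "2011-01-03", "2011-01-04", "2011-01-06"]
rows = [
    ["P-001", 5, "", 2],
    ["P-002", "", 7, 1],
]

for record in unpivot(header, rows):
    print(record)
```

Because the list of date columns is derived from the header on every run, the same code keeps working when new date columns appear. The output always has the same three columns, which is what a database table needs.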