Friday, September 20, 2013

Pentaho Data Integration (PDI) - Generic Design Guidelines

Design for Failure Handling

It is recommended to ensure that the data source is available before a process is kicked off. One basic design principle is that the ETL job must be able to fail gracefully when a data availability test fails.

Kettle provides the following features to do this:

  • Test a repository connection.
  • Ping a host to check whether it's available.
  • Wait for a SQL command to return success/failure based on a row count condition.
  • Check for empty folders.
  • Check for the existence of a file, table, or column.
  • Compare files or folders.
  • Set a timeout on FTP and SSH connections.
  • Create failure/success outputs on every available job step.
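The same pre-flight idea can be sketched outside Kettle as well. Below is a minimal, hypothetical Python illustration of two of the checks above (file existence and a row-count condition); the table and path names are invented for the example:

```python
import os
import sqlite3

def source_file_exists(path):
    """Check for the existence of a file before the ETL job starts."""
    return os.path.isfile(path)

def row_count_at_least(conn, table, minimum):
    """Wait-for-SQL style check: succeed only when the row count condition holds."""
    (count,) = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()
    return count >= minimum

# Hypothetical in-memory stand-in for the source database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE src (id INTEGER)")
conn.executemany("INSERT INTO src VALUES (?)", [(1,), (2,)])

# Fail gracefully: abort with a clear status instead of starting the load.
ready = row_count_at_least(conn, "src", 1)
```

In a real PDI job these checks would be job entries with failure hops routing to a clean abort, rather than hand-written code.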

Multi Tier Loading Strategies

For ETL processes it is recommended to employ a pre-load staging area. This provides the ability to pull data from the production system and use it for data cleaning and harmonization activities without interfering with the production systems. By leveraging this type of strategy, we can see real production data sooner and follow the guiding principle of 'Convert Early, Convert Often, and with Real Production Data'.
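The staging pattern can be sketched in a few lines. This is a hypothetical illustration using two in-memory SQLite databases as stand-ins for production and staging; all table and column names are invented:

```python
import sqlite3

# Hypothetical stand-ins for the production and staging databases.
prod = sqlite3.connect(":memory:")
prod.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
prod.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(1, "  Alice "), (2, "Bob")])

staging = sqlite3.connect(":memory:")
staging.execute("CREATE TABLE stg_customers (id INTEGER, name TEXT)")

# 1. Pull raw rows out of production in one quick pass...
rows = prod.execute("SELECT id, name FROM customers").fetchall()
staging.executemany("INSERT INTO stg_customers VALUES (?, ?)", rows)

# 2. ...then clean and harmonize in staging, never touching production.
staging.execute("UPDATE stg_customers SET name = TRIM(name)")
cleaned = [r[0] for r in
           staging.execute("SELECT name FROM stg_customers ORDER BY id")]
```

The key point is that every cleansing rule runs against the staging copy, so the production system only ever sees the short extract query.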

Table Driven

Developers frequently need to perform a large amount of cross-referencing, hard-coding of values, or other repeatable data transformations during data migration. These transformations are likely to change over time. Without a table-driven approach, each change causes code changes, bug fixes, re-testing, and re-deployments during the development effort. Much of this work is unnecessary and could be avoided with the use of reference tables. It is recommended to use table-driven approaches such as these whenever possible.

Some common table driven approaches include:

  • Default Values
Hard-coded values for a given column are stored in a table where they can be changed whenever a requirement changes. For example, if a hard-coded value of NA is used for any value that is not populated and we later want to change that value to NV, we could simply change the value in a default-value table rather than change numerous hard-coded values. This should be considered for all lookup steps and stream lookup steps.
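A minimal sketch of the default-value idea, with a hypothetical `DEFAULTS` dictionary standing in for the default-value table:

```python
# Hypothetical default-value table: column name -> default to substitute.
DEFAULTS = {"state": "NA", "country": "US"}

def apply_defaults(row, defaults):
    """Fill unpopulated columns from the default-value table."""
    return {col: (defaults.get(col, val) if val in (None, "") else val)
            for col, val in row.items()}

row = apply_defaults({"state": "", "country": "FR"}, DEFAULTS)
# Changing NA to NV later only requires updating DEFAULTS, not the code.
```

In PDI the same effect is usually achieved with a lookup step against the default-value table rather than in-line code.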

  • Cross-Reference Values
Frequently in data migration projects there is a need to take values from the source system and convert them to the value of the target system. These values are usually identified up-front, but as the source system changes additional values are also needed. In a typical development situation this would require adding additional values to a series of IF statements. With a table driven situation, new data could be added to a cross-reference table and no coding, testing, or deployment would be required.
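The contrast with a series of IF statements can be sketched as follows; the mapping values are hypothetical:

```python
# Hypothetical cross-reference table: source-system value -> target-system value.
XREF = {"M": "Male", "F": "Female"}

def translate(value, xref, unmapped="UNKNOWN"):
    """Convert a source value to its target value via the cross-reference table."""
    return xref.get(value, unmapped)

# When the source system starts sending a new value, it becomes one new table
# row -- no new IF branch, no re-coding, no re-deployment:
XREF["U"] = "Undisclosed"
```

Routing unmapped values to a visible sentinel such as `UNKNOWN` also makes missing cross-reference entries easy to spot during testing.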

  • Code-Driven Table
In some instances a set of understood rules are known. By taking those rules and building code against them, a table-driven/code solution can be very productive. For example, if we had a rules table that was keyed by table/column/rule id, then whenever that combination was found a pre-set piece of code would be executed. If at a later date the rules change to a different set of pre-determined rules, the rule table could change for the column and no additional coding would be required.
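The table/column/rule-id dispatch described above can be sketched like this; the rule ids and table names are invented for illustration:

```python
# Pre-set pieces of code, one per rule id.
RULE_CODE = {
    "upper": str.upper,
    "strip": str.strip,
    "mask":  lambda v: "***",
}

# Hypothetical rules table keyed by (table, column) -> rule id.
RULES = {("customers", "name"): "strip",
         ("customers", "ssn"):  "mask"}

def apply_rule(table, column, value):
    """Look up the table/column combination and run its pre-set code."""
    rule_id = RULES.get((table, column))
    return RULE_CODE[rule_id](value) if rule_id else value

# Switching a column to a different pre-determined rule is a table change only:
RULES[("customers", "name")] = "upper"
```

In a production solution `RULES` would live in a database table, so the switch is a data update rather than a deployment.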


Re-use should always be considered during Pentaho ETL development. There is often tremendous opportunity for re-use of jobs, transformations, processes, scripts, and testing documents. This reduces staff time and lowers project costs.

Being able to reuse existing parts of an ETL solution is an indispensable PDI feature. An easy way of doing this is to copy and paste or duplicate existing transformation steps, but that is not really reuse. The term reuse refers to the capability to define a step or transformation once and call the same component from different places. Within Kettle this is achieved with the Mapping step, which lets you reuse existing transformations over and over as subcomponents in other transformations. Transformations themselves can be used multiple times in multiple jobs, and the same applies to jobs, which can be reused as sub-jobs in other jobs as well.
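As a rough analogy for the Mapping step (the function and field names here are invented), the component is defined once and called from different places:

```python
# Define the "sub-transformation" once...
def normalize_name(row):
    """Reusable component: trim and title-case a name field."""
    row["name"] = row["name"].strip().title()
    return row

# ...then call the same component from different "jobs".
def customer_job(rows):
    return [normalize_name(r) for r in rows]

def vendor_job(rows):
    return [normalize_name(r) for r in rows]

out = customer_job([{"name": "  alice smith "}])
```

Copy-pasting the body of `normalize_name` into both jobs would work too, but a fix would then have to be made twice; that is the difference between duplication and reuse.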

Common Error Handling

For any ETL solution, it is critical to find data exceptions and write appropriate rules to correct these situations throughout the data migration effort. It is highly recommended to build a common method for capturing and recording these exceptions. This common method should then be deployed for all data migration processes.
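One common shape for such a shared method is a single error table that every process writes to through the same routine. A hypothetical sketch (table name, columns, and the validation rule are all invented):

```python
import sqlite3

# Hypothetical shared exception log, used by every migration process.
log_db = sqlite3.connect(":memory:")
log_db.execute(
    "CREATE TABLE etl_errors (process TEXT, record_id TEXT, message TEXT)")

def record_exception(process, record_id, message):
    """The one common method for capturing and recording data exceptions."""
    log_db.execute("INSERT INTO etl_errors VALUES (?, ?, ?)",
                   (process, record_id, message))

def load_row(process, row):
    """Example loader that routes bad rows to the common error log."""
    try:
        int(row["amount"])  # a rule this row must satisfy
        return True
    except (KeyError, ValueError) as exc:
        record_exception(process, row.get("id", "?"), str(exc))
        return False

ok = load_row("customer_load", {"id": "42", "amount": "abc"})
errors = log_db.execute(
    "SELECT process, record_id FROM etl_errors").fetchall()
```

Because every process records exceptions the same way, one query over the error table gives a complete picture of data quality across the whole migration.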


For data migration projects it is possible to build one process that pulls all data for a given entity from all systems into the target system. While this may seem ideal, these types of complex processes take much longer to design and develop, are challenging to test, and are very difficult to maintain over time. Due to these drawbacks, it is recommended to develop as many simple processes as needed to complete the effort rather than a few complex processes.
