Pentaho Data Integration - Get file names step

The Get Filenames step allows you to retrieve information associated with file names in the file system. Each file name found is added to the stream as a row, and files can be selected using wildcard (regular expression) patterns.

A stepwise illustration of how to use the "Get file names" step is given below.
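
For intuition, here is a minimal Java sketch of selecting files with a wildcard (regular expression), similar in spirit to the step's wildcard field. The directory and pattern are illustrative values, not taken from the original post.

  import java.io.File;
  import java.util.regex.Pattern;

  public class WildcardFileList {
      public static void main(String[] args) {
          // Illustrative values, mirroring the step's "File or directory"
          // and "Wildcard (RegExp)" fields.
          File dir = new File("/tmp/input");
          Pattern wildcard = Pattern.compile(".*\\.csv$");

          File[] files = dir.listFiles();
          if (files == null) {
              System.out.println("Directory not found: " + dir);
              return;
          }
          for (File f : files) {
              // Keep only file names matching the regular expression.
              if (f.isFile() && wildcard.matcher(f.getName()).matches()) {
                  System.out.println(f.getAbsolutePath());
              }
          }
      }
  }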



Pentaho Data Integration - Community Edition Install for Mac

 Pentaho is an end-to-end data integration and analytics platform designed to manage data at scale for rapid business innovation, ease of use, and self-service automation and orchestration. Pentaho tightly ties data integration and business analytics in a modern platform that connects IT and business users to access, visualize, and explore all the data that impacts business outcomes. Pentaho Kettle enables IT and developers to integrate data from different sources and deliver it to business applications. 


A stepwise illustration of how to install Pentaho Data Integration Community Edition 9.3.0.0 is given below.


Pentaho Data Integration - PDI 7.0 Installation for Windows 64 bit

Pentaho 7 is the latest Pentaho version with powerful features including enhanced big data security features and advanced data exploration functionality.

A stepwise illustration of how to install Pentaho Data Integration 7 is given below.

Here are some of the highlights of the new version.

  • Inspect Data in the Pipeline.
  • Advanced security features for big data, including Kerberos.
  • Integrated installation of Business Analytics (BA) and Data Integration (DI) components.
  • Spark Submit job entry for Scala and Python.
  • Expanded Metadata Injection Support.



Pentaho Data Integration : Aggregation using Group By step

This step can be used to perform various types of aggregations such as sum, average, min, and max. Input data must always be sorted on the grouping fields for this step to work properly (a small sketch follows the list below).

This step supports the following aggregation methods.
  1. Sum
  2. Average or Mean
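
The sorted-input requirement exists because the step emits a group's aggregates when the grouping key changes. A minimal Java sketch of that idea, using hypothetical sample rows, is given below.

  public class GroupBySketch {
      public static void main(String[] args) {
          // Rows sorted by the group key, as the Group By step requires (sample data).
          Object[][] rows = {
              {"east", 10.0}, {"east", 30.0},
              {"west", 5.0},  {"west", 15.0}, {"west", 25.0}
          };

          String currentKey = null;
          double sum = 0;
          int count = 0;
          for (Object[] row : rows) {
              String key = (String) row[0];
              double value = (Double) row[1];
              if (currentKey != null && !currentKey.equals(key)) {
                  // Key changed: emit aggregates for the previous group.
                  System.out.println(currentKey + " sum=" + sum + " avg=" + (sum / count));
                  sum = 0;
                  count = 0;
              }
              currentKey = key;
              sum += value;
              count++;
          }
          if (currentKey != null) {
              System.out.println(currentKey + " sum=" + sum + " avg=" + (sum / count));
          }
      }
  }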


Pentaho Data Integration - Data Grid Input step

This step is generally used for testing, reference, or demo purposes. It lets you create static rows of data in a grid.

  • Meta tab : Enter field names and metadata information.
  • Data tab : Enter static data in a grid.

Here are the stepwise illustrations of how to use the Data Grid step.
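
Conceptually, the step is just field metadata plus a fixed set of rows. A minimal Java sketch of that idea, with hypothetical field names and values, is shown here.

  import java.util.Arrays;
  import java.util.List;

  public class StaticDataGrid {
      public static void main(String[] args) {
          // "Meta tab": field names (types omitted for brevity); illustrative values only.
          List<String> fieldNames = Arrays.asList("id", "country", "rate");

          // "Data tab": the static rows entered in the grid.
          List<Object[]> rows = Arrays.asList(
              new Object[] {1, "US", 1.00},
              new Object[] {2, "IN", 83.10},
              new Object[] {3, "DE", 0.92}
          );

          System.out.println(fieldNames);
          for (Object[] row : rows) {
              System.out.println(Arrays.toString(row));
          }
      }
  }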

Pentaho Common Errors : Error converting data while looking up value

Error Message

Stream lookup.0 - ERROR (version 5.4.0.1-130, build 1 from 2015-06-14_12-34-55 by buildguy) : Unexpected error
Stream lookup.0 - ERROR (version 5.4.0.1-130, build 1 from 2015-06-14_12-34-55 by buildguy) : org.pentaho.di.core.exception.KettleStepException:
Stream lookup.0 - Error converting data while looking up value
Stream lookup.0 -

Pentaho Data Integration - CSV File Input with parallel execution enabled

CSV file input is a commonly used input step for reading delimited files. Its options are similar to those of the Text file input step. Here are the general configurable options (a minimal read sketch follows the list).

  1. File name - Input file name.
  2. Delimiter - Supports common delimiters like comma, tab, pipe, etc.
  3. Enclosure - Optional enclosures like double quotes.
  4. NIO buffer size - Read buffer size.
  5. Lazy Conversion - Significant performance improvement by avoiding data type conversions. Check this option only if the downstream logic is a simple pass-through.
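
As a rough illustration of the delimiter and enclosure options, here is a minimal Java sketch that reads a delimited file. The file name is hypothetical and the parsing is deliberately simplified.

  import java.io.BufferedReader;
  import java.io.FileReader;
  import java.io.IOException;
  import java.util.ArrayList;
  import java.util.List;

  public class CsvReadSketch {
      public static void main(String[] args) throws IOException {
          String fileName = "/tmp/input/quotes.csv"; // hypothetical input file
          char delimiter = ',';                      // "Delimiter" option
          char enclosure = '"';                      // "Enclosure" option

          try (BufferedReader reader = new BufferedReader(new FileReader(fileName))) {
              String line;
              while ((line = reader.readLine()) != null) {
                  // Simplified split that honours the enclosure character; a real parser
                  // also handles escaped enclosures and newlines inside enclosed fields.
                  List<String> fields = new ArrayList<>();
                  StringBuilder field = new StringBuilder();
                  boolean inEnclosure = false;
                  for (char c : line.toCharArray()) {
                      if (c == enclosure) {
                          inEnclosure = !inEnclosure;
                      } else if (c == delimiter && !inEnclosure) {
                          fields.add(field.toString());
                          field.setLength(0);
                      } else {
                          field.append(c);
                      }
                  }
                  fields.add(field.toString());
                  System.out.println(fields);
              }
          }
      }
  }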


Pentaho Data Integration - PDI 5.4 Installation for Windows 64 bit

Pentaho 5.4 is the latest Pentaho version with powerful features.
A stepwise illustration of how to install Pentaho Data Integration 5.4 is given below.

Here are some of the highlights of the new version.



Pentaho Data Integration : Google Analytics

The Google Analytics service provides details about a website's traffic. It tracks various statistics and can be integrated with AdWords to review online campaigns.

The Pentaho Google Analytics step allows you to extract Google Analytics data.
A stepwise illustration is given below.

Step 1

Enable Google Analytics and generate API key.


Pentaho Common Errors : Driver class 'org.gjt.mm.mysql.Driver' could not be found

Error Message
Error connecting to database [MySQLDev] : org.pentaho.di.core.exception.KettleDatabaseException:
Error occured while trying to connect to the database

Driver class 'org.gjt.mm.mysql.Driver' could not be found, make sure the 'MySQL' driver (jar file) is installed.
org.gjt.mm.mysql.Driver
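
This error usually means the MySQL JDBC driver jar is not on the PDI classpath; the common fix is to place the MySQL Connector/J jar in the data-integration lib folder and restart Spoon. Below is a minimal Java sketch that checks whether the driver classes are visible on the classpath; the class names are the standard Connector/J ones, shown only for illustration.

  public class MySqlDriverCheck {
      public static void main(String[] args) {
          // Legacy driver class name (as in the error) and the newer Connector/J name.
          String[] driverClasses = { "org.gjt.mm.mysql.Driver", "com.mysql.jdbc.Driver" };
          for (String driverClass : driverClasses) {
              try {
                  Class.forName(driverClass);
                  System.out.println("Found on classpath: " + driverClass);
              } catch (ClassNotFoundException e) {
                  System.out.println("Missing: " + driverClass
                          + " (is the Connector/J jar in the lib folder?)");
              }
          }
      }
  }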



Pentaho Big Data : Pig Script Executor

Apache Pig is a high-level data analysis language capable of handling very large data volumes. Ease of programming, parallelization, extensibility, and optimization opportunities are some of the key features of this platform.

The Pig Script Executor job entry can be used to execute a "Pig Latin" script on a Hadoop cluster.
A stepwise illustration of how to configure the PDI "Pig Script Executor" is given below.
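
For comparison, a Pig Latin script can also be driven programmatically through Pig's Java API. The sketch below assumes the Apache Pig libraries are on the classpath and uses a hypothetical script path; it is not the PDI job entry itself.

  import org.apache.pig.ExecType;
  import org.apache.pig.PigServer;

  public class PigScriptSketch {
      public static void main(String[] args) throws Exception {
          // Local mode for illustration; the PDI job entry targets a Hadoop cluster.
          PigServer pigServer = new PigServer(ExecType.LOCAL);

          // Register and run a Pig Latin script file (hypothetical path).
          pigServer.registerScript("/tmp/scripts/wordcount.pig");
      }
  }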


Pentaho Big Data : Hadoop File Input

The Hadoop File Input step can be used to extract data from a Hadoop cluster. This step can read comma-separated, tab-delimited, fixed-width, and other common types of text files.

A stepwise illustration of how to configure the Pentaho Hadoop File Input step is given below.

The Cloudera QuickStart VM is used for demo purposes. Refer to the link below for more information.
http://www.cloudera.com/content/cloudera-content/cloudera-docs/DemoVMs/Cloudera-QuickStart-VM/cloudera_quickstart_vm.html
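
For orientation, a minimal Java sketch that reads a text file directly from HDFS with the Hadoop FileSystem API is given below. The NameNode address and file path are assumptions matching QuickStart VM defaults, not values from the original post.

  import java.io.BufferedReader;
  import java.io.InputStreamReader;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class HdfsReadSketch {
      public static void main(String[] args) throws Exception {
          // Assumed NameNode address; adjust to your cluster.
          Configuration conf = new Configuration();
          conf.set("fs.defaultFS", "hdfs://quickstart.cloudera:8020");

          FileSystem fs = FileSystem.get(conf);
          Path path = new Path("/user/cloudera/input/sample.txt"); // hypothetical HDFS file

          try (BufferedReader reader =
                   new BufferedReader(new InputStreamReader(fs.open(path)))) {
              String line;
              while ((line = reader.readLine()) != null) {
                  System.out.println(line);
              }
          }
      }
  }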


Pentaho Data Integration : JSON input Step

JSON (JavaScript Object Notation) is a text-based, lightweight data interchange format.
It enjoys a wide availability of implementations and is platform independent.

A stepwise illustration of the usage of the Pentaho JSON Input step is given below.
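
As background, here is a minimal Java sketch that parses a small JSON document with the Jackson library. The sample document is hypothetical, and the PDI step itself selects fields with JSONPath-style expressions rather than Java code.

  import com.fasterxml.jackson.databind.JsonNode;
  import com.fasterxml.jackson.databind.ObjectMapper;

  public class JsonReadSketch {
      public static void main(String[] args) throws Exception {
          // Hypothetical sample document.
          String json = "{ \"data\": [ { \"name\": \"alpha\", \"value\": 10 },"
                      + " { \"name\": \"beta\", \"value\": 20 } ] }";

          ObjectMapper mapper = new ObjectMapper();
          JsonNode root = mapper.readTree(json);

          // The JSON Input step would address these fields with an expression
          // such as $.data[*].name; here they are navigated directly.
          for (JsonNode item : root.get("data")) {
              System.out.println(item.get("name").asText() + " = " + item.get("value").asInt());
          }
      }
  }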

Step 1



Pentaho Data Integration : HTTP Client

The HTTP Client step provides the ability to call a base URL with parameter values and return the result as a string. A sample transformation is given below.

The free Yahoo Finance API for downloading stock quotes is used here for demo purposes.
Current stock prices with a 15-minute delay can be retrieved using this API.
The service returns data in CSV format.

Base URL : http://finance.yahoo.com/d/quotes.csv
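
A rough Java equivalent of the call made by the transformation is sketched below; the query parameters follow the retired Yahoo CSV quotes API and are shown only for illustration.

  import java.net.URI;
  import java.net.http.HttpClient;
  import java.net.http.HttpRequest;
  import java.net.http.HttpResponse;

  public class HttpClientSketch {
      public static void main(String[] args) throws Exception {
          // Base URL from the post; "s" (symbols) and "f" (fields) are illustrative
          // parameters of the retired CSV quotes API.
          String url = "http://finance.yahoo.com/d/quotes.csv?s=GOOG&f=sl1d1t1";

          HttpClient client = HttpClient.newHttpClient();
          HttpRequest request = HttpRequest.newBuilder().uri(URI.create(url)).build();

          // The response body is returned as a single string, as in the HTTP Client step.
          HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
          System.out.println(response.body());
      }
  }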



Pentaho Data Integration 5.0.2 - Configure DI server ( Linux )

Basic configuration steps for the Pentaho Data Integration server are given below. A PDI installation performed with the installation wizard on Linux is used for demo purposes. The server was installed on the bundled Apache Tomcat server.

Step 1 : Start DI Server

Script "ctlscript.sh" can be used to manage the DI server. Here are the available script arguments.