Pentaho Data Integration - CSV File Input with parallel execution enabled

CSV file input is a commonly used step for reading delimited files. Its options are similar to those of the Text file input step. Here are the main configurable options.

  1. File name - Name of the input file.
  2. Delimiter - Supports common delimiters such as comma, tab, pipe, etc.
  3. Enclosure - Optional enclosure character, such as double quotes.
  4. NIO buffer size - Size of the read buffer.
  5. Lazy Conversion - Gives a significant performance improvement by postponing data type conversions. Check this option only if the logic is a mere pass-through.
  6. Header row present - Enable if the file has a header row.
  7. Add file name to result - The file name(s) will be added to the in-memory result and can be passed on to the next job entry.
  8. Running in Parallel - This step can run in multiple copies, each reading a separate block of the file. Configure along with clustered or multiple-copy mode.
  9. File encoding - UTF-8, ASCII, etc.
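To make these options concrete, here is a rough Python equivalent of what the step does with a delimited file. The file name, sample rows, and settings below are made up for illustration; this is a sketch of the behaviour, not PDI's implementation.

```python
import csv

# Hypothetical settings mirroring the step options above.
FILENAME = "products_data_sample.csv"   # File name (illustrative)
DELIMITER = ","                         # Delimiter: comma, tab, pipe, ...
ENCLOSURE = '"'                         # Enclosure: optional double quotes
ENCODING = "UTF-8"                      # File encoding

# Create a tiny sample file so the sketch is self-contained.
with open(FILENAME, "w", encoding=ENCODING) as f:
    f.write('id,code,description,category,price\n')
    f.write('1,PRD_CD_ACF,"aafwtscm-desc",A,252.44767\n')
    f.write('2,PRD_CD_ZWQ,"zyirgmfn-desc",Z,1246.517\n')

with open(FILENAME, newline="", encoding=ENCODING) as f:
    reader = csv.reader(f, delimiter=DELIMITER, quotechar=ENCLOSURE)
    header = next(reader)              # Header row present: read and skip it
    rows = [row for row in reader]     # each row would flow to the next step

print(header)     # ['id', 'code', 'description', 'category', 'price']
print(len(rows))  # 2
```

Note that the enclosure character is stripped from the values, so `"aafwtscm-desc"` arrives as `aafwtscm-desc`.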

Here are step-by-step illustrations of how to use the CSV file input step.

 Without Parallel Execution Option

Step 1 
Create a new transformation for the demo and drag the CSV file input step onto the canvas.

Step 2
Double-click the CSV file input step and configure the options.

  • File name : Browse and select the file from the local server.
  • Delimiter : Comma, for demo purposes.
  • Lazy conversion : Disabled.
  • Running in parallel : Disabled.
  • File encoding : UTF-8, selected for the demo.

Step 3
Click "Get Fields" to populate the field names based on sample data.
Configure an appropriate trim type.

Here are the results of the document scan.
---------------------------------------
Result after scanning 100 lines.
---------------------------------------
Field nr. 1 :
  Field name           : id
  Field type           : Integer
Field nr. 2 :
  Field name           : code
  Field type           : String
  Maximum length       : 10
  Minimum value        : PRD_CD_ACF
  Maximum value        : PRD_CD_ZWQ
  Nr of null values    : 1
Field nr. 3 :
  Field name           : description
  Field type           : String
  Maximum length       : 13
  Minimum value        : aafwtscm-desc
  Maximum value        : zyirgmfn-desc
  Nr of null values    : 1
Field nr. 4 :
  Field name           : category
  Field type           : String
  Maximum length       : 1
  Minimum value        : A
  Maximum value        : Z
  Nr of null values    : 1
Field nr. 5 :
  Field name           : price
  Field type           : Number
  Estimated length     : 15
  Estimated precision  : 5
  Number format        : #.#
  WARNING: More then 1 number format seems to match all sampled records:
  Number format        : #.#
  Trim Type            : 0
  Minimum value        : 252.44767
  Maximum value        : 1246.517
  Example              : #.#, number [252.44767] gives 252.44767
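"Get Fields" arrives at this report by scanning sample rows and inferring a type, maximum length, and null count per column. The sketch below mimics that inference with deliberately simplified rules; it is an illustrative assumption, not PDI's actual scan logic, and the sample values echo the scan above.

```python
def infer_type(values):
    """Classify a column as Integer, Number, or String, ignoring empty values."""
    non_null = [v for v in values if v != ""]
    if all(v.lstrip("-").isdigit() for v in non_null):
        return "Integer"
    try:
        for v in non_null:
            float(v)
        return "Number"
    except ValueError:
        return "String"

# Hypothetical sample rows, echoing the scan result above.
sample = [
    ["1", "PRD_CD_ACF", "aafwtscm-desc", "A", "252.44767"],
    ["2", "PRD_CD_ZWQ", "zyirgmfn-desc", "Z", "1246.517"],
    ["3", "", "", "", "310.5"],
]
names = ["id", "code", "description", "category", "price"]

report = {}
for name, col in zip(names, zip(*sample)):
    non_null = [v for v in col if v != ""]
    report[name] = {
        "type": infer_type(col),
        "max_length": max((len(v) for v in non_null), default=0),
        "nulls": sum(1 for v in col if v == ""),
    }

print(report["id"]["type"])        # Integer
print(report["price"]["type"])     # Number
print(report["code"]["max_length"])  # 10
```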


Step 4
Add a Dummy (do nothing) step to complete the flow.

Step 5
Here are the execution results for the demo transformation.
Step Metrics

Preview Data

Execution Log
  General - Logging plugin type found with ID: CheckpointLogTable
  General - Starting agile-bi
  Version checker - OK
  class org.pentaho.agilebi.platform.JettyServer - WebServer.Log.CreateListener localhost:10000
  Spoon - Transformation opened.
  Spoon - Launching transformation [trf_Demo_Load_CSV_File]...
  Spoon - Started the transformation execution.
  trf_Demo_Load_CSV_File - Dispatching started for transformation [trf_Demo_Load_CSV_File]
  CSV file input.0 - Header row skipped in file 'products_data.csv'
  CSV file input.0 - Finished processing (I=10001, O=0, R=0, W=10000, U=0, E=0)
  Dummy (do nothing).0 - Finished processing (I=0, O=0, R=10000, W=10000, U=0, E=0)
  Spoon - The transformation has finished!!


 With Parallel Execution Option

Step 6
Enable the "Running in parallel" option.

Step 7
Change the number of copies to start from 1 to, say, 5 for demo purposes.
Step 8
Select "Round robin" as the data movement option.
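Round robin hands each outgoing row to the step copies in turn: row 0 to copy 0, row 1 to copy 1, and so on, wrapping back around. A minimal sketch of that distribution, assuming 5 copies and 10,000 hypothetical rows as in this demo:

```python
from itertools import cycle

COPIES = 5                    # matches the 5 copies in the demo
rows = list(range(10000))     # hypothetical rows leaving the previous step

# Round robin: deal rows to the copies like cards, one at a time.
buffers = [[] for _ in range(COPIES)]
for row, copy_no in zip(rows, cycle(range(COPIES))):
    buffers[copy_no].append(row)

print([len(b) for b in buffers])  # [2000, 2000, 2000, 2000, 2000]
```

With an even row count each copy receives the same share, which is why round robin balances load well for uniform rows.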

Step 9
Execute the demo transformation. Five parallel copies of the input step are executed, each reading a different block of the input file.

Step 10
Execution Log
 Spoon - Transformation opened.
 Spoon - Launching transformation [trf_Demo_Load_CSV_File]...
 Spoon - Started the transformation execution.
 trf_Demo_Load_CSV_File - Dispatching started for transformation [trf_Demo_Load_CSV_File]
 CSV file input.0 - We start to read from file 'products_data.csv' (430326 bytes).  Then we skip 0 bytes to the starting position and read   86065 bytes.
 CSV file input.1 - We start to read from file 'products_data.csv' (430326 bytes).  Then we skip 86065 bytes to the starting position and  read 86065 bytes.
 CSV file input.0 - Header row skipped in file 'products_data.csv'
 CSV file input.4 - We start to read from file 'products_data.csv' (430326 bytes).  Then we skip 344260 bytes to the starting position and read 86065 bytes.
 CSV file input.1 - Finished processing (I=1995, O=0, R=0, W=1995, U=0, E=0)
 CSV file input.0 - Finished processing (I=2022, O=0, R=0, W=2021, U=0, E=0)
 CSV file input.2 - We start to read from file 'products_data.csv' (430326 bytes).  Then we skip 172130 bytes to the starting position and read 86065 bytes.
 CSV file input.3 - We start to read from file 'products_data.csv' (430326 bytes).  Then we skip 258195 bytes to the starting position and read 86065 bytes.
 CSV file input.4 - Finished processing (I=1995, O=0, R=0, W=1995, U=0, E=0)
 CSV file input.2 - Finished processing (I=1993, O=0, R=0, W=1993, U=0, E=0)
 CSV file input.3 - Finished processing (I=1996, O=0, R=0, W=1996, U=0, E=0)
 Dummy (do nothing).0 - Finished processing (I=0, O=0, R=10000, W=10000, U=0, E=0)
 Spoon - The transformation has finished!!
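The log above reveals the splitting arithmetic: each copy skips copy number × (file size ÷ number of copies) bytes and reads one block of roughly that size. (PDI additionally aligns block boundaries to line endings so no row is split between copies; that detail is omitted here.) A sketch reproducing the offsets seen in the log:

```python
FILE_SIZE = 430326   # bytes, from the log
COPIES = 5

# Each copy reads one block of file_size / copies bytes.
block = FILE_SIZE // COPIES
# Copy n skips n blocks before it starts reading.
offsets = [copy_nr * block for copy_nr in range(COPIES)]

print(block)    # 86065
print(offsets)  # [0, 86065, 172130, 258195, 344260]
```

These values match the log exactly: copy 0 skips 0 bytes, copy 2 skips 172130, copy 4 skips 344260, and each reads 86065 bytes.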

Comments:

anupam kumar tejaswi said...

Nice article, buddy :)
I am expecting a similar kind of explanation for all three data movement options, i.e. round robin, load balance, and copy data to next step.

