Pentaho Data Integration - CSV File Input with parallel execution enabled

CSV file input is a commonly used step for reading delimited files. Its options are similar to those of the Text file input step. Here are the main configurable options.

  1. File name - Name of the input file.
  2. Delimiter - Supports common delimiters such as comma, tab, pipe, etc.
  3. Enclosure - Optional enclosure character, such as double quotes.
  4. NIO buffer size - Size of the read buffer.
  5. Lazy Conversion - Gives a significant performance improvement by postponing data type conversions. Check this option only if the logic is a mere pass-through.
  6. Header row present - Enable if the file has a header row.
  7. Add file name to result - The file name(s) will be added to the in-memory result and can be passed on to the next job entry.
  8. Running in Parallel - This step can run in multiple copies, each reading a separate block of the file. Configure along with clustered or multiple-copy mode.
  9. File encoding - UTF-8, ASCII, etc.
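To make these options concrete, here is a rough Python equivalent of what the step does with a delimited file. The file name, sample rows, and settings below are made up for illustration; this is a sketch of the behaviour, not PDI's implementation.

```python
import csv

# Hypothetical settings mirroring the step options above.
FILENAME = "products_data_sample.csv"   # File name (illustrative)
DELIMITER = ","                         # Delimiter: comma, tab, pipe, ...
ENCLOSURE = '"'                         # Enclosure: optional double quotes
ENCODING = "UTF-8"                      # File encoding

# Create a tiny sample file so the sketch is self-contained.
with open(FILENAME, "w", encoding=ENCODING) as f:
    f.write('id,code,description,category,price\n')
    f.write('1,PRD_CD_ACF,"aafwtscm-desc",A,252.44767\n')
    f.write('2,PRD_CD_ZWQ,"zyirgmfn-desc",Z,1246.517\n')

with open(FILENAME, newline="", encoding=ENCODING) as f:
    reader = csv.reader(f, delimiter=DELIMITER, quotechar=ENCLOSURE)
    header = next(reader)              # Header row present: read and skip it
    rows = [row for row in reader]     # each row would flow to the next step

print(header)     # ['id', 'code', 'description', 'category', 'price']
print(len(rows))  # 2
```

Note that the enclosure character is stripped from the values, so `"aafwtscm-desc"` arrives as `aafwtscm-desc`.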

Here are step-by-step illustrations of how to use the CSV file input step.

 Without Parallel Execution Option

Step 1 
Create a new transformation for the demo and drag the CSV file input step onto the canvas.

Step 2
Double-click the CSV file input step and configure the options.

  • File name : Browse and select the file from the local server.
  • Delimiter : Comma, for demo purposes.
  • Lazy conversion : Disabled.
  • Running in parallel : Disabled.
  • File encoding : UTF-8, selected for the demo.

Step 3
Click "Get Fields" to populate the field names based on sample data.
Configure an appropriate trim type.

Here are the results of the document scan.
---------------------------------------
Result after scanning 100 lines.
---------------------------------------
Field nr. 1 :
  Field name           : id
  Field type           : Integer
Field nr. 2 :
  Field name           : code
  Field type           : String
  Maximum length       : 10
  Minimum value        : PRD_CD_ACF
  Maximum value        : PRD_CD_ZWQ
  Nr of null values    : 1
Field nr. 3 :
  Field name           : description
  Field type           : String
  Maximum length       : 13
  Minimum value        : aafwtscm-desc
  Maximum value        : zyirgmfn-desc
  Nr of null values    : 1
Field nr. 4 :
  Field name           : category
  Field type           : String
  Maximum length       : 1
  Minimum value        : A
  Maximum value        : Z
  Nr of null values    : 1
Field nr. 5 :
  Field name           : price
  Field type           : Number
  Estimated length     : 15
  Estimated precision  : 5
  Number format        : #.#
  WARNING: More then 1 number format seems to match all sampled records:
  Number format        : #.#
  Trim Type            : 0
  Minimum value        : 252.44767
  Maximum value        : 1246.517
  Example              : #.#, number [252.44767] gives 252.44767
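"Get Fields" arrives at this report by scanning sample rows and inferring a type, maximum length, and null count per column. The sketch below mimics that inference with deliberately simplified rules; it is an illustrative assumption, not PDI's actual scan logic, and the sample values echo the scan above.

```python
def infer_type(values):
    """Classify a column as Integer, Number, or String, ignoring empty values."""
    non_null = [v for v in values if v != ""]
    if all(v.lstrip("-").isdigit() for v in non_null):
        return "Integer"
    try:
        for v in non_null:
            float(v)
        return "Number"
    except ValueError:
        return "String"

# Hypothetical sample rows, echoing the scan result above.
sample = [
    ["1", "PRD_CD_ACF", "aafwtscm-desc", "A", "252.44767"],
    ["2", "PRD_CD_ZWQ", "zyirgmfn-desc", "Z", "1246.517"],
    ["3", "", "", "", "310.5"],
]
names = ["id", "code", "description", "category", "price"]

report = {}
for name, col in zip(names, zip(*sample)):
    non_null = [v for v in col if v != ""]
    report[name] = {
        "type": infer_type(col),
        "max_length": max((len(v) for v in non_null), default=0),
        "nulls": sum(1 for v in col if v == ""),
    }

print(report["id"]["type"])        # Integer
print(report["price"]["type"])     # Number
print(report["code"]["max_length"])  # 10
```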


Step 4
Add a Dummy (do nothing) step to complete the flow.

Step 5
Here are the execution results for the demo transformation.
Step Metrics

Preview Data

Execution Log
  General - Logging plugin type found with ID: CheckpointLogTable
  General - Starting agile-bi
  Version checker - OK
  class org.pentaho.agilebi.platform.JettyServer - WebServer.Log.CreateListener localhost:10000
  Spoon - Transformation opened.
  Spoon - Launching transformation [trf_Demo_Load_CSV_File]...
  Spoon - Started the transformation execution.
  trf_Demo_Load_CSV_File - Dispatching started for transformation [trf_Demo_Load_CSV_File]
  CSV file input.0 - Header row skipped in file 'products_data.csv'
  CSV file input.0 - Finished processing (I=10001, O=0, R=0, W=10000, U=0, E=0)
  Dummy (do nothing).0 - Finished processing (I=0, O=0, R=10000, W=10000, U=0, E=0)
  Spoon - The transformation has finished!!


 With Parallel Execution Option

Step 6
Enable the "Running in parallel" option.

Step 7
Change the number of copies to start from 1 to, say, 5 for demo purposes.
Step 8
Select "Round robin" as the data movement option.
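Round robin hands each outgoing row to the step copies in turn: row 0 to copy 0, row 1 to copy 1, and so on, wrapping back around. A minimal sketch of that distribution, assuming 5 copies and 10,000 hypothetical rows as in this demo:

```python
from itertools import cycle

COPIES = 5                    # matches the 5 copies in the demo
rows = list(range(10000))     # hypothetical rows leaving the previous step

# Round robin: deal rows to the copies like cards, one at a time.
buffers = [[] for _ in range(COPIES)]
for row, copy_no in zip(rows, cycle(range(COPIES))):
    buffers[copy_no].append(row)

print([len(b) for b in buffers])  # [2000, 2000, 2000, 2000, 2000]
```

With an even row count each copy receives the same share, which is why round robin balances load well for uniform rows.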

Step 9
Execute the demo transformation. Five parallel copies of the input step are executed, each reading a different block of the input file.

Step 10
Execution Log
 Spoon - Transformation opened.
 Spoon - Launching transformation [trf_Demo_Load_CSV_File]...
 Spoon - Started the transformation execution.
 trf_Demo_Load_CSV_File - Dispatching started for transformation [trf_Demo_Load_CSV_File]
 CSV file input.0 - We start to read from file 'products_data.csv' (430326 bytes).  Then we skip 0 bytes to the starting position and read   86065 bytes.
 CSV file input.1 - We start to read from file 'products_data.csv' (430326 bytes).  Then we skip 86065 bytes to the starting position and  read 86065 bytes.
 CSV file input.0 - Header row skipped in file 'products_data.csv'
 CSV file input.4 - We start to read from file 'products_data.csv' (430326 bytes).  Then we skip 344260 bytes to the starting position and read 86065 bytes.
 CSV file input.1 - Finished processing (I=1995, O=0, R=0, W=1995, U=0, E=0)
 CSV file input.0 - Finished processing (I=2022, O=0, R=0, W=2021, U=0, E=0)
 CSV file input.2 - We start to read from file 'products_data.csv' (430326 bytes).  Then we skip 172130 bytes to the starting position and read 86065 bytes.
 CSV file input.3 - We start to read from file 'products_data.csv' (430326 bytes).  Then we skip 258195 bytes to the starting position and read 86065 bytes.
 CSV file input.4 - Finished processing (I=1995, O=0, R=0, W=1995, U=0, E=0)
 CSV file input.2 - Finished processing (I=1993, O=0, R=0, W=1993, U=0, E=0)
 CSV file input.3 - Finished processing (I=1996, O=0, R=0, W=1996, U=0, E=0)
 Dummy (do nothing).0 - Finished processing (I=0, O=0, R=10000, W=10000, U=0, E=0)
 Spoon - The transformation has finished!!
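The log above reveals the splitting arithmetic: each copy skips copy number × (file size ÷ number of copies) bytes and reads one block of roughly that size. (PDI additionally aligns block boundaries to line endings so no row is split between copies; that detail is omitted here.) A sketch reproducing the offsets seen in the log:

```python
FILE_SIZE = 430326   # bytes, from the log
COPIES = 5

# Each copy reads one block of file_size / copies bytes.
block = FILE_SIZE // COPIES
# Copy n skips n blocks before it starts reading.
offsets = [copy_nr * block for copy_nr in range(COPIES)]

print(block)    # 86065
print(offsets)  # [0, 86065, 172130, 258195, 344260]
```

These values match the log exactly: copy 0 skips 0 bytes, copy 2 skips 172130, copy 4 skips 344260, and each reads 86065 bytes.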

Comments:

anupam kumar tejaswi said...

Nice article, buddy :)
I am expecting a similar kind of explanation for all three data movement options, i.e. round robin, load balance, and copy data to next step.

