Friday, October 2, 2015

Pentaho Data Integration : Aggregation using Group By step

This step can be used to perform various types of aggregation, such as sum, average, min, max, etc. The input data always needs to be sorted for this step to work properly.

This step supports the following aggregation methods.
  1. Sum
  2. Average or Mean
  3. Median
  4. Minimum
  5. Maximum
  6. Standard deviation
  7. Percentile
  8. Number of values
  9. Number of distinct values 
  10. Number of rows
  11. First non-null value
  12. Last non-null value
  13. First value null inclusive
  14. Last value null inclusive
  15. Cumulative sum for all rows
  16. Cumulative average for all rows
  17. Concatenate strings separated by comma
  18. Concatenate strings separated by a value.
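The requirement that input be sorted can be illustrated with Python's `itertools.groupby`, which behaves the same way as this step: it only merges consecutive rows with the same key. This is a conceptual sketch with made-up rows, not PDI itself.

```python
from itertools import groupby

rows = [("A", 10), ("B", 5), ("A", 20)]  # not sorted by category

# Grouping unsorted input: "A" appears as two separate groups,
# which is why the Group By step requires sorted input.
unsorted_groups = [(key, [price for _, price in grp])
                   for key, grp in groupby(rows, key=lambda r: r[0])]
# → [('A', [10]), ('B', [5]), ('A', [20])]

# Sorting first (the role of the Sort rows step) merges the groups.
rows.sort(key=lambda r: r[0])
sorted_groups = [(key, sum(price for _, price in grp))
                 for key, grp in groupby(rows, key=lambda r: r[0])]
# → [('A', 30), ('B', 5)]
```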

Here are the commonly used options for this step.

  • Group fields - The fields to group by. 
  • Aggregate fields - Specify the aggregate field names and aggregation methods.  
  • Include all rows - The output will contain all rows instead of just the aggregated rows. 
  • Add a line number, restart in each group - An option to be used along with "Include all rows". 
  • Always give back a result row - Ensures there is an output row even when there are no input rows (for a count of 0). 

A step-by-step illustration of how to use the "Group By" step is given below.

Step 1 

Create a demo transformation.

This demo transformation will read from a CSV input file which contains product information. The "Group By" step will be used to aggregate product prices by product category. The input records need to be sorted before the "Group By" step; a "Sort rows" step is used for this.

Here are the configurations for the "CSV file input" and "Sort rows" steps.

Step 2

Drag the "Group By" step into the canvas.

Step 3

Set the group by field and aggregate type.

Category is used as the group-by field here for demo purposes. Sum, maximum, minimum, first value and last value are selected as the aggregate types.
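The aggregates selected above can be sketched in Python as follows. The rows here are hypothetical stand-ins for the CSV data (category, product, price); the real transformation reads them from the input file.

```python
from itertools import groupby

# Hypothetical product rows, already sorted by category.
rows = [
    ("Books", "Novel",  12.0),
    ("Books", "Atlas",  30.0),
    ("Toys",  "Ball",    5.0),
    ("Toys",  "Puzzle", 15.0),
    ("Toys",  "Robot",  25.0),
]

aggregates = {}
for category, grp in groupby(rows, key=lambda r: r[0]):
    prices = [price for _, _, price in grp]
    aggregates[category] = {
        "sum_price":   sum(prices),    # Sum
        "max_price":   max(prices),    # Maximum
        "min_price":   min(prices),    # Minimum
        "first_price": prices[0],      # First value
        "last_price":  prices[-1],     # Last value
    }
# aggregates["Books"]["sum_price"] → 42.0
```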

Step 4

Drag the "Dummy" step into the canvas to complete the flow.

Execution Log 

trf_Demo_Aggregates - Transformation is pre-loaded.
Spoon - Transformation opened.
Spoon - Launching transformation [trf_Demo_Aggregates]...
Spoon - Started the transformation execution.
trf_Demo_Aggregates - Dispatching started for transformation [trf_Demo_Aggregates]
trf_Demo_Aggregates - Allocating Steps & StepData...
trf_Demo_Aggregates -  Transformation is about to allocate step [CSV file input] of type [CsvInput]
CSV file input.0 - distribution activated
trf_Demo_Aggregates -  Transformation is about to allocate step [Sort rows] of type [SortRows]
Sort rows.0 - distribution activated
trf_Demo_Aggregates -  Transformation is about to allocate step [Group by] of type [GroupBy]
Group by.0 - distribution activated
Group by.0 - Starting allocation of buffers & new threads...
Sort rows.0 - The number of binary string to data type conversions done in this sort block is 0
Sort rows.0 - Finished processing (I=0, O=0, R=10000, W=10000, U=0, E=0)
Group by.0 - Finished processing (I=0, O=0, R=10000, W=26, U=0, E=0)
Dummy (do nothing).0 - Finished processing (I=0, O=0, R=26, W=26, U=0, E=0)
trf_Demo_Aggregates - Handling extension point for plugin with id 'DataRefineryTransFinishListener' and extension point id 'TransformationFinish'
Spoon - The transformation has finished!!

Include All rows - Disabled

Results preview with the "Include all rows" option disabled. Only the aggregated rows appear in the output: in this case, one row per category, since the grouping was done on category.


Include All rows - Enabled

Results preview with the "Include all rows" option enabled. All input rows appear in the output; aggregated values such as sum_price, max_price, etc. are repeated for every row of the same category.
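The "Include all rows" behavior can be sketched in Python: each input row is emitted, with the group's aggregate values attached to every row. The rows and column names here are illustrative, not taken from the actual demo file.

```python
from itertools import groupby

# Hypothetical (category, price) rows, sorted by category.
rows = [("Books", 12.0), ("Books", 30.0), ("Toys", 5.0)]

output = []
for category, grp in groupby(rows, key=lambda r: r[0]):
    group_rows = list(grp)
    sum_price = sum(p for _, p in group_rows)
    max_price = max(p for _, p in group_rows)
    for cat, price in group_rows:
        # The aggregate columns are repeated on every row of the group.
        output.append((cat, price, sum_price, max_price))
# → [('Books', 12.0, 42.0, 30.0), ('Books', 30.0, 42.0, 30.0),
#    ('Toys', 5.0, 5.0, 5.0)]
```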


Include All rows - Enabled with Line number

Results preview with both "Include all rows" and "Add line number" enabled. All rows appear in the output, each with a line number that restarts for every group.
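The restarting line number amounts to enumerating the rows within each group from 1. A minimal sketch with made-up rows:

```python
from itertools import groupby

# Hypothetical (category, product) rows, sorted by category.
rows = [("Books", "Novel"), ("Books", "Atlas"), ("Toys", "Ball")]

numbered = []
for category, grp in groupby(rows, key=lambda r: r[0]):
    # enumerate(..., start=1) restarts the counter for each group.
    for line_no, (cat, product) in enumerate(grp, start=1):
        numbered.append((cat, product, line_no))
# → [('Books', 'Novel', 1), ('Books', 'Atlas', 2), ('Toys', 'Ball', 1)]
```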


Always give back a result row

Add a filter step to the demo transformation to filter out all input records.
There will be no result rows if "Always give back a result row" is disabled. Enable this option to get an output row with the count as 0 and the other aggregate values as null.
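The effect of this option can be sketched as a count-per-group function that, given empty input, emits a single fallback row instead of nothing. `group_count` is a hypothetical helper written for this illustration, not part of PDI.

```python
from itertools import groupby

def group_count(rows, always_give_back_result_row=False):
    """Count rows per group key; with empty input, optionally emit a
    single row with a count of 0, mirroring the PDI option."""
    out = [(key, sum(1 for _ in grp))
           for key, grp in groupby(rows, key=lambda r: r[0])]
    if not out and always_give_back_result_row:
        # Count is 0; other aggregates (sum, max, ...) would be null.
        out = [(None, 0)]
    return out

# With the option off, empty input yields no rows at all:
# group_count([]) → []
# With the option on, one fallback row comes back:
# group_count([], always_give_back_result_row=True) → [(None, 0)]
```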


