Pentaho Data Integration : Aggregation using Group By step

This step can be used to perform various types of aggregations such as sum, average, min, max e.t.c. Input data always need to be sorted for this step to work properly.

This step support following aggregation methods.
  1. Sum
  2. Average or Mean
  3. Median
  4. Minimum
  5. Maximum
  6. Standard deviation
  7. Percentile
  8. Number of values
  9. Number of distinct values 
  10. Number of rows
  11. First non-null value
  12. Last non-null value
  13. First value null inclusive
  14. Last value null inclusive
  15. Cumulative sum for all rows
  16. Cumulative average for all rows
  17. Concatenate strings separated by comma
  18. Concatenate strings separated by a value.

Here are the commonly used options for this step.

  • Group fields - The fields for Group by. 
  • Aggregates fields - Specify aggregate field names and aggregate methods.  
  • Include all rows - Output will contain all rows instead of just the aggregate rows 
  • Add a line number, Restart in each group -  Option to be used along with "Include all rows" 
  • Always give back a result row - To make sure there is an output row in case of no input rows ( for count of 0 ). 

Step wise illustration on how to use "Group By" step given below.


Step 1 

Create a demo transformation.

This demo transformation will read from a CSV input file which contain product information. "Group By" step will be used to aggregate product price based on product category. The input records need to be sorted to use Group BY. A sort step is used for this.

Here are the configurations for the CSV input and Sort step.

Step 2

Drag the "Group By" step into the canvas.















Step 3

Set the group by field and aggregate type.

Category is used as the group by field here for demo purposes. Sum, maximum, minimum, first value and last value selected as aggregate types.


















Step 4

Drag the "Dummy" step into the canvas to complete the flow.



Execution Log 


trf_Demo_Aggregates - Transformation is pre-loaded.
Spoon - Transformation opened.
Spoon - Launching transformation [trf_Demo_Aggregates]...
Spoon - Started the transformation execution.
trf_Demo_Aggregates - Dispatching started for transformation [trf_Demo_Aggregates]
trf_Demo_Aggregates - Allocating Steps & StepData...
trf_Demo_Aggregates -  Transformation is about to allocate step [CSV file input] of type [CsvInput]
CSV file input.0 - distribution activated
trf_Demo_Aggregates -  Transformation is about to allocate step [Sort rows] of type [SortRows]
Sort rows.0 - distribution activated
trf_Demo_Aggregates -  Transformation is about to allocate step [Group by] of type [GroupBy]
Group by.0 - distribution activated
Group by.0 - Starting allocation of buffers & new threads...
Sort rows.0 - The number of binary string to data type conversions done in this sort block is 0
Sort rows.0 - Finished processing (I=0, O=0, R=10000, W=10000, U=0, E=0)
Group by.0 - Finished processing (I=0, O=0, R=10000, W=26, U=0, E=0)
Dummy (do nothing).0 - Finished processing (I=0, O=0, R=26, W=26, U=0, E=0)
trf_Demo_Aggregates - Handling extension point for plugin with id 'DataRefineryTransFinishListener' and extension point id 'TransformationFinish'
Spoon - The transformation has finished!!




Include All rows - Disabled

Results preview where "Include all rows" option is disabled. Only aggregated rows will be there in the output, in this case one row per category since group by was done based on category.


 

Include All rows - Enabled

Results preview where "Include all rows" option is enabled. All input rows will be there in the output, aggregated values such as sum_price, max_price e.t.c will be repeated for the same category in the output.



 

Include All rows - Enabled with Line number

Results preview where "Include all rows" option is enabled and "Add line number" is enabled. All rows will be there in the output with a line number for each row.



 

Always give back a result row

Add a filter step to the dummy transformation to filter out all input records.
There will be no results rows if the "Always give back a result row" is disabled. Need to enable this to get output result row with Counts as 0 and other aggregate values as Null.










 

15 comments :

Unknown said...

How does Percentile aggregation work?

Unknown said...

Great post! I am actually getting ready to across this information, It's very helpful for this blog.Also great with all of the valuable information you have Keep up the good work you are doing well.

Informatica training in chennai

Unknown said...

Great article. It helps to explain complicated things to people who are not competent and explicate why this is important.
dataroom software

Unknown said...

Good post. I like your post, Thanks for sharing.
web design institute in chennai

Unknown said...

Dude, there are too many ads. I can't even search. It anonying...

srihariparu said...


Its interesting and informative. Thanks for your great information
Embedded Project Center in Chennai | Best Embedded Project Center in Velachery

Ancy merina said...

Thank you for your post. This is excellent information. It is amazing and wonderful to visit your site
Website Design and Development Company in Bangalore
ECommerce Web Design Company in bangalore

Mounika said...

I simply wanted to write down a quick word to say thanks to you for those wonderful tips and hints you are showing on this site.

python training in chennai
python course in chennai
python training in bangalore

Unknown said...

I would assume that we use more than the eyes to gauge a person's feelings. Mouth. Body language. Even voice. You could at least have given us a face in this test.
Selenium training in Chennai | Selenium training institute in Chennai | Selenium course in Chennai

Selenium training in Bangalore | Selenium training institute in Bangalore | Selenium course in Bangalore
Java training in Chennai | Java training institute in Chennai | Java course in Chennai

Java training in Bangalore | Java training institute in Bangalore | Java course in Bangalore

Anonymous said...

This is a nice article here with some useful tips for those who are not used-to comment that frequently. Thanks for this helpful information I agree with all points you have given to us. I will follow all of them.
Best Devops training in sholinganallur
Devops training in velachery
Devops training in annanagar
Devops training in tambaram

Unknown said...

Thank you for allowing me to read it, welcome to the next in a recent article. And thanks for sharing the nice article, keep posting or updating news article.
Data Science training in rajaji nagar | Data Science Training in Bangalore
Data Science with Python training in chennai
Data Science training in electronic city
Data Science training in USA
Data science training in pune

siva said...

It's interesting that many of the bloggers to helped clarify a few things for me as well as giving.
Most of ideas can be nice content.The people to give them a good shake to get your point and across the command.

Java training in Bangalore|best Java training in Bangalore



Julia Loi said...

I am really happy to say it’s an interesting post to read. I learn new information from your article, you are doing a great job, Keep it up.
mobile phone repair in Novi
iphone repair in Novi
cell phone repair in Novi
phone repair in Novi
tablet repair in Novi
ipad repair in Novi
mobile phone repair Novi
iphone repair Novi
cell phone repair Novi
phone repair Novi

Rekha said...


Thanks for the efforts in writing the wonderful article.
php interview questions and answers
salesforce interview questions
blue prism interview questions
javascript interview questions
ethical hacking interview questions
networking interview questions and answers


Dynamic Sales Solutions said...

Thanks for sharing such a useful blog. Really! This Blog is very informative for us which contains a lot of information about the Writing. I like this post. SEO Cheltenham

Post a Comment

UA-46724997-1