What is the difference between reduceByKey and aggregateByKey in Spark?


reduceByKey()

reduceByKey() has the following properties:

The result of the combination (e.g. a sum) is of the same type as the values being combined.

The operation used to combine results from different partitions is the same as the operation used to combine values within a partition.
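
For reference, here is a simplified form of the reduceByKey signature from Spark’s PairRDDFunctions. The combine function takes and returns the value type V, which is why the result type cannot differ from the value type:

def reduceByKey(func: (V, V) => V): RDD[(K, V)]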

Example

scala> val colors = sc.parallelize(Array(("RED", 1), ("RED", 1), ("GREEN", 1), ("RED", 1)))
scala> val colorsReduced = colors.reduceByKey(_ + _)
scala> colorsReduced.collect
res0: Array[(String, Int)] = Array((GREEN,1), (RED,3))


aggregateByKey()

aggregateByKey() has the following properties, which make it more flexible and extensible than reduceByKey():

The result of the combination can be any object that you specify and does not have to be the same type as the values that are being combined.

You have to specify a function that describes how values are combined within a single partition.

You have to specify a function that describes how the partial results from different partitions are combined.

The two functions can be the same or different.
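
For reference, here is a simplified form of the aggregateByKey signature from Spark’s PairRDDFunctions. The accumulator type U is independent of the value type V, which is what gives aggregateByKey its extra flexibility:

def aggregateByKey[U](zeroValue: U)(seqOp: (U, V) => U, combOp: (U, U) => U): RDD[(K, U)]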

Example

Let’s use aggregateByKey() to perform the same operation as reduceByKey().

scala> val colors = sc.parallelize(Array(("RED", 1), ("RED", 1), ("GREEN", 1), ("RED", 1)))

Let’s examine the aggregateByKey call below. The first parameter, 0, is the initial (zero) value and also determines the type of the output.

The first _+_ is the map-side combine function, applied to values within each partition, and the second _+_ is the reduce-side combine function, applied to partial results across partitions. Both functions happen to be the same in this case.

scala> val colorsAggregated = colors.aggregateByKey(0)(_+_,_+_)
scala> colorsAggregated.collect
res1: Array[(String, Int)] = Array((GREEN,1), (RED,3))
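
If the underscore placeholders are hard to read, here is a sketch of the same call with the two functions written out and named (the names are ours, not Spark’s):

scala> val seqOp = (partialSum: Int, value: Int) => partialSum + value // combines values within a partition
scala> val combOp = (sum1: Int, sum2: Int) => sum1 + sum2 // combines partial sums across partitions
scala> val colorsAggregated2 = colors.aggregateByKey(0)(seqOp, combOp)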

Let’s try another example. This time we want the result to be of type Set and not Int. 

Note that we have changed the colors data slightly.

scala> val colors = sc.parallelize(Array(("RED", 1), ("RED", 1), ("GREEN", 1), ("RED", 3)))

We initialize the output to an empty HashSet[Int].

_+_ – adds (not sums) an element to the HashSet during the map-side combine.

_++_ – here we specify a different function for the reduce-side combine. This function merges the sets produced on different partitions.

scala> import scala.collection.mutable.HashSet
scala> val colorsSet = colors.aggregateByKey(new HashSet[Int])(_+_, _++_)

Note that a Set holds unique values, which is why we don’t see the duplicate 1s for RED.

scala> colorsSet.collect

res8: Array[(String, scala.collection.mutable.HashSet[Int])] = Array((GREEN,Set(1)), (RED,Set(1, 3)))
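
As one more sketch of aggregateByKey()’s flexibility (reusing the colors RDD above; the variable names are our own), we can compute the average value per key in a single pass by carrying a (sum, count) pair as the accumulator. Note that the accumulator type (Int, Int) is different from the value type Int.

scala> val sumCount = colors.aggregateByKey((0, 0))(
     |   (acc, v) => (acc._1 + v, acc._2 + 1), // within a partition: add the value, bump the count
     |   (a, b) => (a._1 + b._1, a._2 + b._2)) // across partitions: merge partial sums and counts
scala> val avgByKey = sumCount.mapValues { case (sum, count) => sum.toDouble / count }
scala> avgByKey.collect

With the data above, RED averages 5/3 and GREEN averages 1.0.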

