What is the difference between RDD, Dataframe and Dataset in Spark?

RDD, DataFrame and Dataset are Spark APIs introduced at different points in time. The goal of these APIs is to help us work with large datasets in a distributed fashion, with performance in mind.

RDD

  • Introduced in 2011; available in Spark since the beginning
  • RDD is now considered a low level API
  • RDD is still the core of Spark
  • Whether you use DataFrame or Dataset, all your operations eventually get transformed into operations on RDDs
  • The RDD API provides transformation functions like map() and filter() and actions like reduce() for performing computations on data
  • An RDD is a distributed collection of JVM objects
  • Since RDDs are statically typed, syntax and type errors are caught at compile time

RDD Example

case class Student(name: String, age: Int)

val rdd = sc.textFile("student.txt") // RDD[String], one comma separated line per student
val students = rdd.map { line => val f = line.split(","); Student(f(0), f(1).trim.toInt) } // transformation
val rdd_filtered = students.filter(_.age > 13) // transformation
rdd_filtered.saveAsTextFile("students_over13") // action: writes the RDD out as a directory of text files

In the above example, we first read a file and get an RDD of lines, then map each line to a Student object. Next we apply filter() on that RDD, which again results in an RDD that we call rdd_filtered; filter() is a transformation function, and all transformation functions result in an RDD. Finally we call saveAsTextFile() on rdd_filtered, which is an action function, to save the contents of the RDD.
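Transformations in Spark are also lazy: Spark only records the lineage of transformations, and nothing is computed until an action is called. Here is a minimal sketch illustrating this, assuming the same SparkContext sc as above:

val lines = sc.textFile("student.txt") // nothing is read yet
val nonEmpty = lines.filter(_.nonEmpty) // transformation: still nothing executed
println(nonEmpty.count()) // count() is an action: Spark now reads the file and runs the filter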

DataFrame

  • Introduced in 2013
  • Considered high level API
  • All operations on DataFrames go through Spark’s Catalyst optimizer, which converts our DataFrame code into an optimized plan that is ultimately executed as operations on RDDs (see the sketch after the example below)
  • Lets you perform “SQL like” operations on data
  • A DataFrame is a distributed collection of Row objects
  • Catches syntax errors at compile time, but analysis errors (like a misspelled column name) only at run time

DataFrame Example

val df = spark.read.option("header", "true").option("inferSchema", "true").csv("student.txt") // needs a header row: name,age

val df_filtered = df.filter("age > 13") // transformation using a "SQL like" expression
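Because every DataFrame operation runs through the Catalyst optimizer, you can ask Spark to print the plan it generated and even get at the underlying RDD it executes. A short sketch, continuing from the example above:

df_filtered.explain() // prints the physical plan produced by the Catalyst optimizer
val rows = df_filtered.rdd // the underlying RDD[Row] the optimized plan executes on

// Analysis errors surface only at run time: this typo compiles fine
// but throws an AnalysisException when the job runs
// df.filter("agee > 13")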

Dataset

  • Introduced in 2015
  • Considered high level API
  • Just like DataFrames, all operations on Datasets go through Spark’s Catalyst optimizer, which converts our Dataset code into an optimized plan executed as operations on RDDs
  • Just like DataFrames, lets you perform “SQL like” operations on data
  • A Dataset lets you work with data represented as user defined objects
  • Datasets are considered type safe
  • A Dataset is a distributed collection of user defined JVM objects, internally represented by Spark in its optimized binary Row format
  • A DataFrame is nothing but a Dataset of type Row [DataFrame = Dataset[Row]]
  • Catches syntax errors at compile time and, thanks to type safety, analysis errors at compile time as well

Dataset Example

case class Student(name: String, age: Int)
import spark.implicits._ // brings the Encoder needed by .as[Student] into scope

val ds = spark.read.option("header", "true").option("inferSchema", "true").csv("student.txt").as[Student]

val ds_filtered = ds.filter(_.age > 13) // typed filter on Student fields

In the above example, we have created a Dataset of type Student, so all elements in the Dataset are Student objects. This gives us type safety: referring to a field that does not exist on Student fails at compile time.
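To see the compile time vs run time difference in practice, contrast the untyped and typed filters below. A short sketch, assuming the Student case class and the Dataset ds from the example above:

import org.apache.spark.sql.{DataFrame, Dataset, Row}

val df: DataFrame = ds.toDF() // back to the untyped view
val ds2: Dataset[Row] = df // compiles as-is: DataFrame is just an alias for Dataset[Row]

// df.filter("agee > 13") // typo compiles, fails with an AnalysisException at run time
// ds.filter(_.agee > 13) // typo does not compile: value agee is not a member of Student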
