What is an efficient way to check if a Spark DataFrame is empty? - Big Data In Real World



A quick answer that might come to your mind is to call count() on the DataFrame and check whether the count is greater than 0. But count() on a DataFrame with a lot of records is very inefficient.

count() does a full count of the records in every partition of the DataFrame and then adds all the intermediate counts together to get the final total. You will find this approach very slow for big DataFrames.
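As a sketch of the slow approach (assuming `df` is an existing DataFrame):

```scala
// Assumes `df` is an already-built DataFrame.
// count() triggers a full scan of every partition just to compare the total to 0.
val isEmptySlow: Boolean = df.count() == 0
```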


Optimal way to check if a DataFrame is empty

Use the head() function in place of count().

df.head(1).isEmpty

This is efficient because, to determine whether a DataFrame is empty, all you need to know is whether it has at least one record. head(1) only needs to fetch a single row instead of scanning every partition.

Note that calling head() with no arguments on an empty DataFrame will throw a java.util.NoSuchElementException, because there is no first row to return. head(1), by contrast, returns an (empty) array, so make sure to use head(1).
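A minimal runnable sketch, assuming Spark running in local mode and a hypothetical empty DataFrame built from an empty sequence:

```scala
import org.apache.spark.sql.SparkSession

object EmptyCheck {
  def main(args: Array[String]): Unit = {
    // Local-mode session for illustration only.
    val spark = SparkSession.builder()
      .appName("empty-check")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical empty DataFrame with a single Int column.
    val df = Seq.empty[Int].toDF("value")

    // head(1) returns an Array containing at most one Row,
    // so the check touches at most one record instead of counting all of them.
    val isEmpty = df.head(1).isEmpty
    println(isEmpty)  // prints: true

    spark.stop()
  }
}
```

If you are on Spark 2.4 or later, the Dataset API also offers df.isEmpty, which performs the same limit-one check internally.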
