How to find directories in HDFS which are older than N days? - Big Data In Real World

How to find directories in HDFS which are older than N days?

Cleaning up old or obsolete files in HDFS is important. Even if you have a large cluster with a lot of space, little things add up when you don't have good cleanup scripts, and before you know it you will run out of space in your cluster.

HDFS does not have an out-of-the-box command to list all the directories that are older than N days, but you can write a simple script to do so.

Script

Here is a small script to list directories older than 10 days.

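# Current time in seconds since the epoch
now=$(date +%s)
# List recursively and keep only directory entries (lines starting with "d")
hadoop fs -ls -R | grep "^d" | while read -r f; do
  # Field 6 of each listing line is the modification date
  dir_date=$(echo "$f" | awk '{print $6}')
  # Age in whole days (date -d requires GNU date)
  difference=$(( (now - $(date -d "$dir_date" +%s)) / (24 * 60 * 60) ))
  if [ "$difference" -gt 10 ]; then
    echo "$f"
  fi
done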

The hadoop fs -ls -R command recursively lists files and directories in HDFS. With no path argument, as here, it starts from your HDFS home directory. grep "^d" keeps only the directories. The while..do loop then goes through each directory.

hadoop fs -ls -R | grep "^d" | while read -r f; do
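To see what the grep filter does without touching a cluster, here is a small sketch that runs it on a hand-written listing. The sample lines only imitate the usual hadoop fs -ls -R layout; the owner, group and paths are made up for illustration.

```shell
# Illustrative listing in the usual `hadoop fs -ls -R` layout (made-up paths).
sample_listing='drwxr-xr-x   - hdfs supergroup          0 2017-01-10 14:22 /data/logs
-rw-r--r--   3 hdfs supergroup       1024 2017-01-10 14:25 /data/logs/app.log
drwxr-xr-x   - hdfs supergroup          0 2017-01-20 09:00 /data/tmp'

# Keep only the entries whose permission string starts with "d" (directories).
only_dirs=$(printf '%s\n' "$sample_listing" | grep "^d")
printf '%s\n' "$only_dirs"
```

The file entry starts with "-" and is dropped, so only the two directory lines remain.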

awk '{print $6}' extracts the date of the directory, which is then saved in dir_date.

dir_date=$(echo "$f" | awk '{print $6}')
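As a quick check of the field numbering, the sketch below pulls the date, time and path out of one listing line. The line itself is a made-up example in the usual hadoop fs -ls layout, and dir_time and dir_path are names introduced here for illustration only.

```shell
# One made-up listing line: permissions, replication, owner, group, size, date, time, path.
line='drwxr-xr-x   - hdfs supergroup          0 2017-01-10 14:22 /data/logs'

# awk splits on runs of whitespace, so the padding between columns does not matter.
dir_date=$(echo "$line" | awk '{print $6}')
dir_time=$(echo "$line" | awk '{print $7}')
dir_path=$(echo "$line" | awk '{print $8}')

echo "$dir_date $dir_time $dir_path"
```

Note that a path containing spaces would be split across several fields, so $8 alone would not hold the full path in that case.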

The line below calculates the difference between the directory's date and the current date, and converts that difference to a number of days.

difference=$(( (now - $(date -d "$dir_date" +%s)) / (24 * 60 * 60) ))
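Here is the same arithmetic worked through with fixed values, so the result is reproducible. Both dates are assumptions chosen for illustration, and the -u flag (UTC) is added only to keep the example independent of the local time zone. It assumes GNU date, as the original script does.

```shell
# Fixed reference values (assumptions for illustration, not read from HDFS).
now=$(date -u -d "2017-01-26" +%s)
dir_date="2017-01-10"

# Seconds between the two dates, divided by the number of seconds in a day.
difference=$(( (now - $(date -u -d "$dir_date" +%s)) / (24 * 60 * 60) ))

echo "$difference"   # prints 16
```

The division is integer division, so any partial day is dropped.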

Finally, print the directory if the difference is more than 10 days.

  if [ "$difference" -gt 10 ]; then
    echo "$f"
  fi
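Since the threshold of 10 days is hard-coded above, one possible way to make it the "N" from the title is to wrap the loop in a function. This is only a sketch: older_than_days and the NOW_EPOCH override are hypothetical names introduced here, and the function reads listing lines from standard input so it can be tried without a cluster. Like the original script, it relies on GNU date.

```shell
# older_than_days N
# Reads `hadoop fs -ls -R` style lines on stdin and prints the directory
# entries whose date is more than N days before now.
# NOW_EPOCH (hypothetical) overrides the current time, which helps testing.
# The body runs in a subshell ( ... ) so its variables do not leak out.
older_than_days() (
  days=$1
  now=${NOW_EPOCH:-$(date +%s)}
  grep "^d" | while read -r f; do
    dir_date=$(echo "$f" | awk '{print $6}')
    difference=$(( (now - $(date -d "$dir_date" +%s)) / (24 * 60 * 60) ))
    if [ "$difference" -gt "$days" ]; then
      echo "$f"
    fi
  done
)
```

On a cluster it could be used as hadoop fs -ls -R /some/path | older_than_days 30, where /some/path is a placeholder for the directory you want to scan.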

 

Big Data In Real World