Calculate Resource Allocation for Spark Applications


In this post we will look at how to calculate resource allocation for Spark applications. Figuring out how to allocate resources for a Spark application requires a good understanding of the resource allocation properties in YARN as well as the resource-related properties in Spark. Let's look at both.

Interested in Spark? Check out our Spark Developer In Real World course for interesting use cases and real world projects in Spark.

Understanding YARN Properties

It's important to think about how resources are requested by Spark, because each request needs to fit into what YARN has available. The relevant YARN properties are:

yarn.nodemanager.resource.memory-mb controls the maximum total memory that containers can use on each node.

yarn.nodemanager.resource.cpu-vcores controls the maximum total number of cores that containers can use on each node.
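As a quick illustration, here is a minimal Python sketch (with hypothetical values) of how these two properties cap what fits on a single node. YARN can only grant containers while both the memory and vcore budgets have room:

```python
# Hypothetical per-node YARN capacity, mirroring the two properties above.
node_memory_mb = 64512   # yarn.nodemanager.resource.memory-mb
node_vcores = 15         # yarn.nodemanager.resource.cpu-vcores

# A hypothetical container (executor) request.
container_memory_mb = 21 * 1024
container_vcores = 5

# A node fits the smaller of the two counts, since a container
# needs both memory and vcores at the same time.
containers_per_node = min(node_memory_mb // container_memory_mb,
                          node_vcores // container_vcores)
print(containers_per_node)  # 3
```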

Understanding Spark Application Properties

--executor-cores / spark.executor.cores determines the number of cores per executor, which is also the number of concurrent tasks an executor can run.

--executor-memory / spark.executor.memory determines the heap memory per executor.

--num-executors / spark.executor.instances determines the number of executors per application.

spark.executor.memoryOverhead defaults to max(384 MB, 0.10 * spark.executor.memory).

JVMs use some off-heap memory, for example for interned strings and direct byte buffers, and PySpark's Python worker processes are also spawned in this overhead memory.
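The command-line flags and the configuration properties are interchangeable. Here is a minimal PySpark sketch (with illustrative values) setting the same three properties programmatically instead of on the spark-submit command line:

```python
from pyspark.sql import SparkSession

# Illustrative values only; the right numbers depend on your cluster,
# as worked out in the rest of this post.
spark = (SparkSession.builder
         .appName("resource-allocation-demo")       # hypothetical app name
         .config("spark.executor.cores", "5")       # concurrent tasks per executor
         .config("spark.executor.memory", "19g")    # heap per executor
         .config("spark.executor.instances", "17")  # executors per application
         .getOrCreate())
```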

Allocating Resources

Suppose you have a cluster with 6 nodes and each node is equipped with 16 cores and 64GB of memory.

YARN Container Configurations

yarn.nodemanager.resource.memory-mb = 63 * 1024 = 64512 (megabytes)

yarn.nodemanager.resource.cpu-vcores = 15

We avoid allocating 100% of a node's resources to YARN containers because the node needs some resources to run the OS and Hadoop daemons. In this case, we leave 1 GB of memory and 1 core on each node for those system processes and VM overheads.
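Here is the same arithmetic as a small Python sketch, assuming the 1 GB / 1 core per-node reservation described above:

```python
# Per-node hardware from the example cluster.
node_memory_gb = 64
node_cores = 16

# Reserve 1 GB and 1 core for the OS and Hadoop daemons.
yarn_memory_mb = (node_memory_gb - 1) * 1024   # 64512
yarn_vcores = node_cores - 1                   # 15
print(yarn_memory_mb, yarn_vcores)
```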

Resources Available for Spark Application

Total Number of Nodes = 6

Total Number of Cores = 6 * 15 = 90

Total Memory = 6 * 63 = 378 GB

So the total amount of memory requested per executor must satisfy:

spark.executor.memory + spark.executor.memoryOverhead < yarn.nodemanager.resource.memory-mb
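A minimal sketch of this constraint in Python, plugging in the values derived later in this post (19 GB of heap per executor):

```python
# Heap requested per executor (19 GB, derived below).
executor_memory_mb = 19 * 1024

# Default overhead: max(384 MB, 10% of executor memory).
memory_overhead_mb = max(384, int(0.10 * executor_memory_mb))

# Per-node memory that YARN can hand out.
nodemanager_memory_mb = 63 * 1024

# The container request (heap + overhead) must fit on the node.
assert executor_memory_mb + memory_overhead_mb < nodemanager_memory_mb
```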


Calculating Resources for Spark Application

As a best practice, keep the number of cores per executor at 5 or fewer. Beyond that, the high number of concurrent threads per executor starts to hurt HDFS I/O, so capping cores at 5 preserves full HDFS write throughput.

Total Number of Executors = Total Number of Cores / 5 => 90 / 5 = 18.

18 executors across 6 nodes gives us 3 executors per node. With 63 GB of memory per node, each executor could get 63 / 3 = 21 GB, but handing all 21 GB to the heap would be wrong because heap + overhead must fit within the container. So:

Overhead Memory = max(384 MB, 0.10 * 21 GB) ≈ 2 GB (roughly)

Heap Memory = 21 - 2 = 19 GB

On YARN we also need to make room for the Application Master, so we reserve the equivalent of 1 executor for it, leaving 18 - 1 = 17 executors.
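Putting the whole calculation together, here is a minimal Python sketch that reproduces the numbers above, assuming 5 cores per executor and the default 10% (minimum 384 MB) memory overhead:

```python
# Cluster resources available to YARN (from the sections above).
nodes = 6
cores_per_node = 15
memory_per_node_gb = 63

total_cores = nodes * cores_per_node            # 90
executors = total_cores // 5                    # 18 (5 cores per executor)
executors_per_node = executors // nodes         # 3

# Split per-node memory across executors, then carve out the overhead.
memory_per_executor_gb = memory_per_node_gb // executors_per_node   # 21
overhead_gb = max(384 / 1024, 0.10 * memory_per_executor_gb)        # ~2
heap_gb = memory_per_executor_gb - round(overhead_gb)               # 19

# Reserve the equivalent of one executor for the Application Master.
num_executors = executors - 1                   # 17

print(heap_gb, num_executors)  # 19 17
```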

Finally

--executor-cores / spark.executor.cores = 5

--executor-memory / spark.executor.memory = 19G

--num-executors / spark.executor.instances = 17

We will have 3 executors on each node (except on the node running the Application Master), with 19 GB of heap memory and 5 cores for each executor.
