Driver Node Step by Step (created by Luke Thorp)

The driver node is like any other machine: it has hardware such as a CPU, memory, disks and a cache. However, these hardware components are used to host the Spark Program and manage the wider cluster. The driver is the user's link between themselves and the physical compute required to complete any work submitted to the cluster.

As Spark is written in Scala, it is important to remember that any machine within the cluster needs a JVM (Java Virtual Machine) running so that Spark can work with the hardware on the host. The Spark Program runs inside this JVM and is used to create the SparkContext, which is the user's access point to the Spark cluster.

The driver contains the DAG (Directed Acyclic Graph) scheduler, task scheduler, backend scheduler and the block manager. These driver components are responsible for translating user code into Spark jobs executed on the cluster. Hidden away within the driver node is the cluster manager, which is responsible for acquiring resources on the Spark cluster and allocating them to a Spark job. These resources come in the form of worker nodes.

Worker nodes form the "distributed" part of the cluster. These nodes come in all shapes and sizes: they don't always have to be identical to join a cluster, and they can vary in size and performance. That said, having worker nodes which perform equally can be a benefit when investigating performance bottlenecks further down the line.

Worker Node Step by Step (created by Luke Thorp)

Worker nodes are generally stand-alone machines which have hardware like any other: a CPU, memory, disks and a cache. As with the driver, for Spark to run on a worker node, we need to ensure that the system has a compatible version of Java installed so that the code can be interpreted in the correct and meaningful way.

When we have Spark installed on a worker node, the driver node can use its cluster management capabilities to map what hardware is available on the worker. The cluster manager will track the number of "slots", which are essentially the number of cores available on the device. These slots are classified as available chunks of compute, and the driver node can assign them tasks to complete.

There is an amount of available memory which is split into two sections: storage memory and working memory. The default ratio is 50:50, but this can be changed in the Spark config.

Each worker also has a number of disks attached. But I know what you are going to say: Spark works in memory, not on disk! However, Spark still needs disks to allocate shuffle partitions (discussed later in this article), and it also needs space for persistence to disk and spill to disk.
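To make the storage/working split concrete, here is a small Python sketch of the arithmetic. It assumes the unified memory model described in the Spark tuning docs: a fixed reserve (roughly 300 MB) is set aside, `spark.memory.fraction` (default 0.6) of the remaining heap becomes unified memory, and `spark.memory.storageFraction` (default 0.5, the 50:50 ratio mentioned above) decides how much of that is storage versus working (execution) memory. The exact reserve and defaults can vary between Spark versions, so treat the numbers as illustrative.

```python
# Sketch of how Spark's unified memory manager splits an executor heap
# into storage memory and working (execution) memory. Defaults assumed
# from the Spark tuning guide: spark.memory.fraction = 0.6,
# spark.memory.storageFraction = 0.5, ~300 MB reserved for internals.

RESERVED_MB = 300  # assumed fixed reserve for Spark's internal objects

def memory_split(heap_mb, memory_fraction=0.6, storage_fraction=0.5):
    """Return (storage_mb, execution_mb) for a given executor heap size."""
    usable = (heap_mb - RESERVED_MB) * memory_fraction
    storage = usable * storage_fraction   # cached / persisted blocks
    execution = usable - storage          # shuffles, joins, sorts, aggregations
    return storage, execution

# Example: a 4 GB executor heap with the default 50:50 split
storage, execution = memory_split(4096)
print(f"storage: {storage:.0f} MB, execution: {execution:.0f} MB")
```

Changing the ratio in the Spark config (for example, submitting with `--conf spark.memory.storageFraction=0.3`) corresponds to passing a different `storage_fraction` here; note that in real Spark the boundary is also soft, with execution able to borrow unused storage memory.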