
Apache Spark RDD & DataFrame

Park Sehun
May 6, 2023


RDD stands for Resilient Distributed Dataset, a fundamental data structure in Apache Spark. It is an immutable distributed collection of objects that can be processed in parallel across a cluster of computers.

In Spark, RDDs are the basic building blocks for all Spark operations and transformations. RDDs are fault-tolerant and can automatically recover from node failures, making them resilient. They can be cached in memory for faster processing and partitioned across multiple nodes in a cluster for parallel processing.
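A minimal sketch of those ideas using Spark's Scala API (the `local[*]` master and the toy numbers stand in for a real cluster and dataset): the RDD is split into partitions, a derived RDD is cached in memory, and its lineage, which Spark replays to rebuild lost partitions after a node failure, is printed with `toDebugString`.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object RddBasics {
  def main(args: Array[String]): Unit = {
    // local[*] runs Spark on all local cores; on a real cluster this would be a cluster URL
    val spark = SparkSession.builder()
      .appName("RddBasics")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Split the data into 4 partitions so the work can run in parallel
    val numbers = sc.parallelize(1 to 1000, numSlices = 4)
    println(s"Partitions: ${numbers.getNumPartitions}")

    // Cache the derived RDD in memory so later actions reuse it instead of recomputing it
    val doubled = numbers.map(_ * 2).persist(StorageLevel.MEMORY_ONLY)

    // The lineage printed here is what Spark replays to rebuild lost partitions after a failure
    println(doubled.toDebugString)

    println(doubled.count()) // first action computes and caches the partitions
    println(doubled.sum())   // second action reads them back from memory

    spark.stop()
  }
}
```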

RDDs can be created from data stored in the Hadoop Distributed File System (HDFS), the local file system, or other data sources such as Amazon S3, Cassandra, and HBase. Once an RDD is created, it can be transformed using operations like map, filter, reduce, join, and more. These operations run in parallel across the nodes in the cluster, making Spark a powerful tool for big data processing.
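As a rough sketch of that workflow, reusing the `sc` SparkContext from the sketch above; the HDFS path is only a placeholder and would need to point at a real file:

```scala
// The path below is a placeholder; point it at a real file on HDFS, S3, or local disk
val lines = sc.textFile("hdfs:///data/access.log")

// Transformations (filter, map, ...) lazily describe new RDDs
val errors  = lines.filter(_.contains("ERROR"))
val lengths = errors.map(_.length)

// Actions (reduce, count, collect, ...) trigger the parallel computation
val totalErrorChars = lengths.reduce(_ + _)
println(s"${errors.count()} error lines, $totalErrorChars characters in total")
```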

Spark RDDs are a key part of Spark’s processing engine and provide a scalable and fault-tolerant way to process large datasets quickly and efficiently.

Before RDDs, many frameworks struggled with multi-step workloads because intermediate results had to be written to reliable storage such as HDFS between steps.

Unlike HDFS-based processing, which reads and writes all of the data to disk on every iteration, RDDs keep the working data in memory across iterations.
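A toy sketch of that difference, again reusing `sc` from the first sketch: the input is cached once, and a small iterative loop (a damped update toward the mean, standing in for a real algorithm like k-means or PageRank) then reads it from memory on every pass instead of going back to disk.

```scala
// Without cache(), every iteration below would recompute the input from its source;
// with cache(), the data stays in executor memory after the first pass
val points = sc.parallelize(Seq(1.0, 2.0, 3.0, 4.0, 5.0)).cache()

var estimate = 0.0
for (_ <- 1 to 10) {
  // Each pass reads the cached in-memory data rather than re-reading from disk
  val meanDiff = points.map(p => p - estimate).mean()
  estimate += 0.5 * meanDiff
}
println(s"Converged estimate: $estimate") // approaches the mean, 3.0
```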
