Spark overview

I'm a software developer who has been working with Apache Spark for a few years, and I thought I would share some of the knowledge I have collected over this time.

From the Spark website: "Apache Spark™ is a unified analytics engine for large-scale data processing."

In my own words, Apache Spark is software that lets you perform complex manipulations on huge amounts of data. It can work in a distributed manner, allowing parallelism and reducing the total computation time.

Take, for example, summing a collection of N numbers. Running such code sequentially, with no parallelism, takes time proportional to N, i.e. O(N):

# Summing sequentially in a single loop
total = 0
for n in numbers:   # numbers holds the N values
    total += n

However, if we split the collection into, say, K sub-collections of roughly N/K numbers each, we can sum each of them in parallel, producing K partial sums, and then add those K numbers into a single value.
The parallel phase takes O(N/K) and the final combine takes O(K), so the total running time is O(N/K + K) instead of O(N).

# Each of the K workers sums its own sub-collection C
sub_sum = 0
for n in C:
    sub_sum += n

# Combining the K partial sums in a single thread
total = sub_sum_1 + sub_sum_2 + ... + sub_sum_K
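
To make this concrete, here is a runnable sketch of the same idea using Python's multiprocessing module. Everything in it is illustrative and not Spark-specific: the collection, the choice of K = 4, and the sum_chunk helper are assumptions made for the example.

import multiprocessing as mp

def sum_chunk(chunk):
    # Each worker sums its own sub-collection
    sub_sum = 0
    for n in chunk:
        sub_sum += n
    return sub_sum

if __name__ == "__main__":
    numbers = list(range(1, 1001))   # N = 1000 numbers
    K = 4                            # illustrative number of sub-collections
    size = len(numbers) // K
    chunks = [numbers[i * size:(i + 1) * size] for i in range(K)]
    with mp.Pool(K) as pool:
        # The K summations run in parallel, one per worker process
        partial_sums = pool.map(sum_chunk, chunks)
    total = 0
    for s in partial_sums:           # combine the K partial sums in one thread
        total += s
    print(total)                     # 500500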

This is a simple example, but it demonstrates the power of parallel computation. Spark implements such a mechanism and, on top of it, adds many optimizations to speed up the computation.
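
Here is a minimal sketch of the same sum expressed in Spark itself, assuming a local PySpark installation; the local[4] master, the app name, and the one-million-number range are illustrative choices, not requirements.

from pyspark.sql import SparkSession

# Start a local Spark session with 4 worker threads
spark = SparkSession.builder.master("local[4]").appName("sum-example").getOrCreate()

# parallelize() splits the numbers across partitions; each partition
# is summed in parallel and the partial results are combined into one value
numbers = spark.sparkContext.parallelize(range(1, 1_000_001))
total = numbers.sum()
print(total)   # 500000500000

spark.stop()

Note how the splitting, the parallel summation, and the final combine are all handled by Spark; the code only describes what to compute.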

In addition, the Spark ecosystem is very rich: names like Hadoop, YARN, Mesos, Hive, Parquet, and many more come up around it. I will give more details about these in later posts.
