Spark overview
I'm a software developer who has been working with Apache Spark for a few years, and I thought I could share some of the knowledge I have collected over this time.
From the Spark web site: "Apache Spark™ is a unified analytics engine for large-scale data processing".
In my own words, Apache Spark is software that lets you run complex manipulations on huge amounts of data. It can work in a distributed manner, exploiting parallelism to reduce the total computation time.
For example, consider summing a collection of N numbers. Without the ability to run in parallel, the code takes time proportional to N, i.e. O(N).
sum = 0
for n in collection:
    sum += n
However, by splitting the collection into, say, K sub-collections, we could sum each of them in parallel, producing K partial sums, and then add those K numbers into a single value.
Each sub-collection of size N/K is summed in parallel in O(N/K), and combining the K partial sums takes O(K), so the total run time complexity will be O(N/K + K).
# Summing each sub-collection C, in parallel
sub_sum = 0
for n in C:
    sub_sum += n

# Combining all K partial sums in a single thread
total = sub_sum_1 + ... + sub_sum_K
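To make the idea concrete, here is a minimal runnable sketch of the split-and-combine pattern using only Python's standard library (this is plain Python, not Spark itself; the function name and chunking scheme are my own illustration). Note that Python threads illustrate the structure of the computation; true CPU parallelism would require processes or a framework like Spark.

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_sum(numbers, k=4):
    # Split the collection into k roughly equal sub-collections.
    chunks = [numbers[i::k] for i in range(k)]
    with ThreadPoolExecutor(max_workers=k) as pool:
        # Sum each sub-collection in a worker, producing k partial sums.
        partial_sums = list(pool.map(sum, chunks))
    # Combine the k partial sums in a single thread.
    return sum(partial_sums)

print(parallel_sum(list(range(1, 101))))  # 5050
```

Spark applies the same pattern, but the sub-collections (partitions) live on different machines, and the partial results are combined across the cluster.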
This is a simple example, but it demonstrates the power of parallel computation. Spark implements such a mechanism and, in addition, adds many optimizations to speed up the computation.