This article is the first of a series on how to create a custom data source that plugs into Spark's data source mechanism. The idea is to write something like:

    Dataset<Row> df = sparkSession.read()
            .option("filename", "path-to-data-file")
            .option("other-options", "some value")
            .format("com.roizaig.mydatasource")
            .load();

to read a file, parse it according to some custom logic, and return a Dataset that represents it.

Why Create a Custom Data Source?

Spark supports Parquet, CSV, JSON, ORC and more out of the box. In most cases these formats will be more than okay, and if not, you can always read the data as a text file and do some string manipulation in Spark. However, there are some cases where we would like to "teach" Spark to read our own custom format. Flexibility is a key point: with a custom data source you can read every format you need.
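To make the "plug in" part concrete before the series dives in, here is a minimal sketch of the entry point such a source might expose, assuming Spark 3's DataSource V2 API. The class names DefaultSource and MyTable and the hard-coded one-column schema are illustrative; the actual read path (SupportsRead and a ScanBuilder that parses the file) is deliberately omitted here.

    package com.roizaig.mydatasource;

    import java.util.Collections;
    import java.util.Map;
    import java.util.Set;

    import org.apache.spark.sql.connector.catalog.Table;
    import org.apache.spark.sql.connector.catalog.TableCapability;
    import org.apache.spark.sql.connector.catalog.TableProvider;
    import org.apache.spark.sql.connector.expressions.Transform;
    import org.apache.spark.sql.types.DataTypes;
    import org.apache.spark.sql.types.StructType;
    import org.apache.spark.sql.util.CaseInsensitiveStringMap;

    // Spark resolves .format("com.roizaig.mydatasource") by loading the
    // DefaultSource class inside that package.
    public class DefaultSource implements TableProvider {

        @Override
        public StructType inferSchema(CaseInsensitiveStringMap options) {
            // A real source would open options.get("filename") and derive
            // the schema from the file; this sketch hard-codes one column.
            return new StructType().add("line", DataTypes.StringType);
        }

        @Override
        public Table getTable(StructType schema, Transform[] partitioning,
                              Map<String, String> properties) {
            return new MyTable(schema);
        }

        // Minimal Table. To actually serve rows it would also implement
        // SupportsRead and return a ScanBuilder that parses the custom
        // format; that read path is omitted from this sketch.
        private static final class MyTable implements Table {
            private final StructType schema;

            MyTable(StructType schema) { this.schema = schema; }

            @Override public String name() { return "my-custom-source"; }
            @Override public StructType schema() { return schema; }
            @Override public Set<TableCapability> capabilities() {
                return Collections.singleton(TableCapability.BATCH_READ);
            }
        }
    }

With a class like this on the classpath, the read snippet above resolves to it, and load() would start returning rows once the scan machinery is filled in.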
 