This article is the first of a series on how to create a custom data source that plugs into Spark's data source mechanism. The idea is to write something like:

    Dataset<Row> df = sparkSession.read()
            .option("filename", "path-to-data-file")
            .option("other-options", "some value")
            .format("com.roizaig.mydatasource")
            .load();

to read a file, parse it according to some custom logic, and return a Dataset that represents it.

Why Create a Custom Data Source?

Spark supports Parquet, CSV, JSON, ORC and more out of the box. In most cases these formats will be more than okay, and if not, you can always read the data as a text file and do some string manipulation in Spark. However, there are some cases where we would like to "teach" Spark to read our own custom format. Flexibility is a key point: with a custom data source you can read every format you need.
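To make the "plug in" part concrete before the series dives in, here is a minimal sketch of the entry point such a source might expose, assuming Spark 3's DataSource V2 API. The class names DefaultSource and MyTable and the hard-coded one-column schema are illustrative; the actual read path (SupportsRead and a ScanBuilder that parses the file) is deliberately omitted here.

    package com.roizaig.mydatasource;

    import java.util.Collections;
    import java.util.Map;
    import java.util.Set;

    import org.apache.spark.sql.connector.catalog.Table;
    import org.apache.spark.sql.connector.catalog.TableCapability;
    import org.apache.spark.sql.connector.catalog.TableProvider;
    import org.apache.spark.sql.connector.expressions.Transform;
    import org.apache.spark.sql.types.DataTypes;
    import org.apache.spark.sql.types.StructType;
    import org.apache.spark.sql.util.CaseInsensitiveStringMap;

    // Spark resolves .format("com.roizaig.mydatasource") by loading the
    // DefaultSource class inside that package.
    public class DefaultSource implements TableProvider {

        @Override
        public StructType inferSchema(CaseInsensitiveStringMap options) {
            // A real source would open options.get("filename") and derive
            // the schema from the file; this sketch hard-codes one column.
            return new StructType().add("line", DataTypes.StringType);
        }

        @Override
        public Table getTable(StructType schema, Transform[] partitioning,
                              Map<String, String> properties) {
            return new MyTable(schema);
        }

        // Minimal Table. To actually serve rows it would also implement
        // SupportsRead and return a ScanBuilder that parses the custom
        // format; that read path is omitted from this sketch.
        private static final class MyTable implements Table {
            private final StructType schema;

            MyTable(StructType schema) { this.schema = schema; }

            @Override public String name() { return "my-custom-source"; }
            @Override public StructType schema() { return schema; }
            @Override public Set<TableCapability> capabilities() {
                return Collections.singleton(TableCapability.BATCH_READ);
            }
        }
    }

With a class like this on the classpath, the read snippet above resolves to it, and load() would start returning rows once the scan machinery is filled in.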
 