This article is the first in a series on how to create a custom data source that plugs into Spark's data source mechanism. The idea is to be able to write something like:

    Dataset<Row> df = sparkSession.read()
        .option("filename", "path-to-data-file")
        .option("other-options", "some value")
        .format("com.roizaig.mydatasource")
        .load();

This reads a file, parses it according to some custom logic, and returns a dataset that represents it.

Why Create a Custom Data Source?

Spark supports Parquet, CSV, JSON, ORC, and more out of the box. In most cases these formats are more than enough, and if not, you can always read the data as a text file and do some string manipulation in Spark. However, there are cases where we would like to "teach" Spark to read our own custom format. Flexibility is a key point: with a custom data source you can read every format y...