Spark - How to read a JSON file



The first thing to notice is the file structure. It is actually a collection of JSON objects, *each one on a single line*:

{ "field_name": "value", "field_name": "value", ... }
{ "field_name": "value", "field_name": "value", ... }
{ "field_name": "value", "field_name": "value", ... }

Unlike the csv() method, the json() method does not have an inferSchema parameter; the schema is always inferred from the JSON types.
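As a quick aside (plain Python, not Spark), this line-per-object layout — often called JSON Lines — can be parsed by feeding each line to the standard json module separately:

```python
import json

# Two records in the line-per-object layout described above.
raw = '\n'.join([
    '{"Name": "Arike", "Age": 28}',
    '{"Name": "Bob", "Age": 32}',
])

# Each line is a complete JSON document on its own.
records = [json.loads(line) for line in raw.splitlines()]
print(records[0]['Name'])  # Arike
```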

from pyspark.sql import SparkSession
import pyspark.sql.types as t
import pyspark.sql.functions as f
spark = SparkSession.builder \
    .master('local') \
    .appName('Read JSON file') \
    .getOrCreate()
df = spark.read \
    .json('data.json')
df.printSchema()
df.show(truncate=False)

root
 |-- Address: string (nullable = true)
 |-- Age: long (nullable = true)
 |-- Birth Date: string (nullable = true)
 |-- Id: long (nullable = true)
 |-- Name: string (nullable = true)
 |-- Score: double (nullable = true)
+--------------------+---+----------+----------+-----+-----+
|             Address|Age|Birth Date|        Id| Name|Score|
+--------------------+---+----------+----------+-----+-----+
|New York, NY 1002...| 28|1991-02-14|3698547891|Arike| 78.6|
|1843-1701 S Osage...| 32|1987-06-07|6984184782|  Bob|45.32|
|R. Olavo Bilac, 1...| 65|1954-12-26|9782472174|Corry|98.47|
|20-16 B5036, Wirk...| 18|2001-10-22|2316324177|David| 3.77|
+--------------------+---+----------+----------+-----+-----+

If you wish to read the file without type inference, you can set the primitivesAsString parameter to True. All values will then be read as strings.

df = spark.read \
    .json('data.json', primitivesAsString=True)
df.printSchema()
df.show()

root
 |-- Address: string (nullable = true)
 |-- Age: string (nullable = true)
 |-- Birth Date: string (nullable = true)
 |-- Id: string (nullable = true)
 |-- Name: string (nullable = true)
 |-- Score: string (nullable = true)
+--------------------+---+----------+----------+-----+-----+
|             Address|Age|Birth Date|        Id| Name|Score|
+--------------------+---+----------+----------+-----+-----+
|New York, NY 1002...| 28|1991-02-14|3698547891|Arike| 78.6|
|1843-1701 S Osage...| 32|1987-06-07|6984184782|  Bob|45.32|
|R. Olavo Bilac, 1...| 65|1954-12-26|9782472174|Corry|98.47|
|20-16 B5036, Wirk...| 18|2001-10-22|2316324177|David| 3.77|
+--------------------+---+----------+----------+-----+-----+

Alternatively, you can specify the schema yourself.

schema = t.StructType([
    t.StructField("Name", t.StringType(), True),  # Changed to string
    t.StructField("Age", t.DoubleType(), True),  # Changed to double
    t.StructField("Birth Date", t.TimestampType(), True),
    t.StructField("Id", t.LongType(), True),  # Changed to long
    t.StructField("Address", t.StringType(), True),
    t.StructField("Score", t.StringType(), True)
])
df = spark.read \
    .json('data.json', schema=schema)
df.printSchema()
df.show()

root
 |-- Name: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- Birth Date: timestamp (nullable = true)
 |-- Id: long (nullable = true)
 |-- Address: string (nullable = true)
 |-- Score: string (nullable = true)
+-----+----+-------------------+----------+--------------------+-----+
| Name| Age|         Birth Date|        Id|             Address|Score|
+-----+----+-------------------+----------+--------------------+-----+
|Arike|28.0|1991-02-14 00:00:00|3698547891|New York, NY 1002...| 78.6|
|  Bob|32.0|1987-06-07 00:00:00|6984184782|1843-1701 S Osage...|45.32|
|Corry|65.0|1954-12-26 00:00:00|9782472174|R. Olavo Bilac, 1...|98.47|
|David|18.0|2001-10-22 00:00:00|2316324177|20-16 B5036, Wirk...| 3.77|
+-----+----+-------------------+----------+--------------------+-----+

You can add Java-style comments inside the JSON objects and tell Spark to ignore them by setting the allowComments parameter to True. For example:

{ "Name": "Arike", "Age": 28, "Birth Date": "1991-02-14", "Id": 3698547891, "Address": "New York, NY 10022, USA", "Score": 78.6 /* lsdjflsdkjf */ }

df = spark.read \
    .json('data.json', allowComments=True)
df.printSchema()
df.show()

root
 |-- Address: string (nullable = true)
 |-- Age: long (nullable = true)
 |-- Birth Date: string (nullable = true)
 |-- Id: long (nullable = true)
 |-- Name: string (nullable = true)
 |-- Score: double (nullable = true)
+--------------------+---+----------+----------+-----+-----+
|             Address|Age|Birth Date|        Id| Name|Score|
+--------------------+---+----------+----------+-----+-----+
|New York, NY 1002...| 28|1991-02-14|3698547891|Arike| 78.6|
|1843-1701 S Osage...| 32|1987-06-07|6984184782|  Bob|45.32|
|R. Olavo Bilac, 1...| 65|1954-12-26|9782472174|Corry|98.47|
|20-16 B5036, Wirk...| 18|2001-10-22|2316324177|David| 3.77|
+--------------------+---+----------+----------+-----+-----+

If you don't set the allowComments parameter, its default value is False, and reading the file will result in an additional column called _corrupt_record:

+-------------------------------------------------------------+----+----------+----------+-----+-----+---------------------------------------------------------------------------------------------------------------------------------------------------+
|Address                                                      |Age |Birth Date|Id        |Name |Score|_corrupt_record                                                                                                                                    |
+-------------------------------------------------------------+----+----------+----------+-----+-----+---------------------------------------------------------------------------------------------------------------------------------------------------+
|null                                                         |null|null      |null      |null |null |{ "Name": "Arike", "Age": 28, "Birth Date": "1991-02-14", "Id": 3698547891, "Address": "New York, NY 10022, USA", "Score": 78.6 /* lsdjflsdkjf */ }|
|1843-1701 S Osage Dr, Ottawa, KS 66067, USA                  |32  |1987-06-07|6984184782|Bob  |45.32|null                                                                                                                                               |
|R. Olavo Bilac, 143 - Batel, Curitiba - PR, 80440-040, Brazil|65  |1954-12-26|9782472174|Corry|98.47|null                                                                                                                                               |
|20-16 B5036, Wirksworth, Matlock DE4 4FP, UK                 |18  |2001-10-22|2316324177|David|3.77 |null                                                                                                                                               |
+-------------------------------------------------------------+----+----------+----------+-----+-----+---------------------------------------------------------------------------------------------------------------------------------------------------+

The allowBackslashEscapingAnyCharacter parameter allows the backslash to escape any character, so new-line, tab (and more) sequences can be read and parsed. You can find a list of escape characters at https://en.wikipedia.org/wiki/Escape_character. The example below uses \n, \t, \\ and \":

{ "Name": "Arike", "Age": 28, "Birth Date": "1991-02-14", "Id": 3698547891, "Address": "New York\nNY 10022,\nUSA", "Score": 78.6 }
{ "Name": "Bob", "Age": 32, "Birth Date": "1987-06-07", "Id": 6984184782, "Address": "1843-1701\tS Osage Dr,\tOttawa, KS 66067, USA", "Score": 45.32 }
{"Name": "Corry", "Age": 65, "Birth Date": "1954-12-26", "Id": 9782472174, "Address": "R. \\ \\ \\ Olavo Bilac, 143 - Batel, Curitiba - PR, 80440-040, Brazil", "Score": 98.47}
{"Name": "David", "Age": 18, "Birth Date": "2001-10-22", "Id": 2316324177, "Address": "20-16 B5036, \"Wirksworth\", Matlock DE4 4FP, UK", "Score": 3.77}

df = spark.read \
    .json('data.json', allowBackslashEscapingAnyCharacter=True)
df.show(truncate=False)

+-------------------------------------------------------------------+---+----------+----------+-----+-----+
|Address                                                            |Age|Birth Date|Id        |Name |Score|
+-------------------------------------------------------------------+---+----------+----------+-----+-----+
|New York
NY 10022,
USA                                             |28 |1991-02-14|3698547891|Arike|78.6 |
|1843-1701 S Osage Dr, Ottawa, KS 66067, USA                        |32 |1987-06-07|6984184782|Bob  |45.32|
|R. \ \ \ Olavo Bilac, 143 - Batel, Curitiba - PR, 80440-040, Brazil|65 |1954-12-26|9782472174|Corry|98.47|
|20-16 B5036, "Wirksworth", Matlock DE4 4FP, UK                     |18 |2001-10-22|2316324177|David|3.77 |
+-------------------------------------------------------------------+---+----------+----------+-----+-----+
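For context, strict JSON only defines a fixed set of escape sequences (\", \\, /, \b, \f, \n, \r, \t and \uXXXX); a backslash before any other character is an error, which is the restriction allowBackslashEscapingAnyCharacter lifts. Python's strict json module illustrates the baseline behavior:

```python
import json

# The standard escapes parse fine in strict JSON.
assert json.loads('"a\\tb"') == 'a\tb'             # tab
assert json.loads('"say \\"hi\\""') == 'say "hi"'  # escaped quote

# A backslash before an arbitrary character is rejected.
try:
    json.loads('"a\\qb"')
    parsed = True
except json.JSONDecodeError:
    parsed = False
print(parsed)  # False
```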

The multiLine option (False by default) allows a JSON object to span multiple lines. However, there is a catch: Spark will process only the first JSON object in the file and ignore the rest. For the content:

{ "Name": "Arike",
  "Age": 28,
  "Birth Date": "1991-02-14",
  "Id": 3698547891,
  "Address": "New York NY 10022, USA",
  "Score": 78.6 }
{ "Name": "Bob", "Age": 32, "Birth Date": "1987-06-07", "Id": 6984184782, "Address": "1843-1701 S Osage Dr, Ottawa, KS 66067, USA", "Score": 45.32 }
{ "Name": "Corry", "Age": 65, "Birth Date": "1954-12-26", "Id": 9782472174, "Address": "R. Olavo Bilac, 143 - Batel, Curitiba - PR, 80440-040, Brazil", "Score": 98.47 }

df = spark.read \
    .json('data.json', multiLine=True)
df.show()

The output will be:

+----------------------+---+----------+----------+-----+-----+
|Address               |Age|Birth Date|Id        |Name |Score|
+----------------------+---+----------+----------+-----+-----+
|New York NY 10022, USA|28 |1991-02-14|3698547891|Arike|78.6 |
+----------------------+---+----------+----------+-----+-----+

You can handle a custom date-only format, or a date-and-time (i.e. timestamp) format, with the dateFormat and timestampFormat parameters respectively. The default date format is yyyy-MM-dd and the default timestamp format is yyyy-MM-dd'T'HH:mm:ss.SSSXXX. For the input:

{ "Name": "Arike", "Age": 28, "Birth Date": "1991/02/14", "Id": 3698547891, "Address": "New York, NY 10022, USA", "Score": 78.6 }

schema = t.StructType([
    t.StructField("Name", t.StringType(), True),
    t.StructField("Age", t.DoubleType(), True),
    t.StructField("Birth Date", t.DateType(), True),  # Date column
    t.StructField("Id", t.LongType(), True),
    t.StructField("Address", t.StringType(), True),
    t.StructField("Score", t.StringType(), True)
])
df = spark.read \
    .json('data.json', dateFormat="yyyy/MM/dd", schema=schema)
df.printSchema()
df.show()

Note that the Birth Date input uses a non-default format, while the dates in the output are printed in the standard hyphen-separated format.

root
 |-- Name: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- Birth Date: date (nullable = true)
 |-- Id: long (nullable = true)
 |-- Address: string (nullable = true)
 |-- Score: string (nullable = true)
+-----+----+----------+----------+--------------------+-----+
| Name| Age|Birth Date|        Id|             Address|Score|
+-----+----+----------+----------+--------------------+-----+
|Arike|28.0|1991-02-14|3698547891|New York, NY 1002...| 78.6|
+-----+----+----------+----------+--------------------+-----+

And for the input:

{ "Name": "Arike", "Age": 28, "Birth Date": "1991/02/14 12_43_43", "Id": 3698547891, "Address": "New York, NY 10022, USA", "Score": 78.6 }

schema = t.StructType([
    t.StructField("Name", t.StringType(), True),
    t.StructField("Age", t.DoubleType(), True),
    t.StructField("Birth Date", t.TimestampType(), True),  # Timestamp column
    t.StructField("Id", t.LongType(), True),
    t.StructField("Address", t.StringType(), True),
    t.StructField("Score", t.StringType(), True)
])
df = spark.read \
    .json('data.json', timestampFormat="yyyy/MM/dd HH_mm_ss", schema=schema)
df.printSchema()
df.show()

root
 |-- Name: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- Birth Date: timestamp (nullable = true)
 |-- Id: long (nullable = true)
 |-- Address: string (nullable = true)
 |-- Score: string (nullable = true)
+-----+----+-------------------+----------+--------------------+-----+
| Name| Age|         Birth Date|        Id|             Address|Score|
+-----+----+-------------------+----------+--------------------+-----+
|Arike|28.0|1991-02-14 12:43:43|3698547891|New York, NY 1002...| 78.6|
+-----+----+-------------------+----------+--------------------+-----+
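Note that Spark's dateFormat/timestampFormat strings are Java DateTimeFormatter patterns, not Python strptime codes. Just to confirm what the custom pattern above matches, here is the equivalent parse in plain Python (with the corresponding strptime codes):

```python
from datetime import datetime

# Spark's "yyyy/MM/dd HH_mm_ss" corresponds to "%Y/%m/%d %H_%M_%S"
# in Python's strptime syntax.
ts = datetime.strptime("1991/02/14 12_43_43", "%Y/%m/%d %H_%M_%S")
print(ts)  # 1991-02-14 12:43:43
```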

https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=json#pyspark.sql.DataFrameReader.json

Additional posts: How to create a DataFrame (different ways) or 5 ways to calculate max of row in Spark
