Spark - How to read a JSON file
The first thing to notice is the file structure. It's actually a collection of JSON objects, *each one on a single line*:
{ "field_name": "value", "field_name": "value", ... }
{ "field_name": "value", "field_name": "value", ... }
{ "field_name": "value", "field_name": "value", ... }
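This "one object per line" layout (often called JSON Lines) means every line must parse on its own. A quick sketch with Python's built-in json module (the field names below are illustrative, not the full sample data):

```python
import json

# Each line of the file is a complete, self-contained JSON object
# (the JSON Lines layout Spark expects by default).
lines = [
    '{"Name": "Arike", "Age": 28, "Score": 78.6}',
    '{"Name": "Bob", "Age": 32, "Score": 45.32}',
]

# Every line parses independently, which is what lets Spark split
# the file into records: one line = one row.
records = [json.loads(line) for line in lines]
print(records[0]["Name"])  # -> Arike
```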
Unlike the csv() method, the json() method has no inferSchema parameter; the schema is always inferred from the JSON value types.
from pyspark.sql import SparkSession
import pyspark.sql.types as t
import pyspark.sql.functions as f

spark = SparkSession.builder \
    .master('local') \
    .appName('Read JSON file') \
    .getOrCreate()

df = spark.read \
    .json('data.json')

df.printSchema()
df.show()
root
 |-- Address: string (nullable = true)
 |-- Age: long (nullable = true)
 |-- Birth Date: string (nullable = true)
 |-- Id: long (nullable = true)
 |-- Name: string (nullable = true)
 |-- Score: double (nullable = true)

+--------------------+---+----------+----------+-----+-----+
|             Address|Age|Birth Date|        Id| Name|Score|
+--------------------+---+----------+----------+-----+-----+
|New York, NY 1002...| 28|1991-02-14|3698547891|Arike| 78.6|
|1843-1701 S Osage...| 32|1987-06-07|6984184782|  Bob|45.32|
|R. Olavo Bilac, 1...| 65|1954-12-26|9782472174|Corry|98.47|
|20-16 B5036, Wirk...| 18|2001-10-22|2316324177|David| 3.77|
+--------------------+---+----------+----------+-----+-----+
If you wish to read the file without types, set the primitivesAsString parameter to True. All values will then be read as strings.
df = spark.read \
    .json('data.json', primitivesAsString=True)

df.printSchema()
df.show()
root
 |-- Address: string (nullable = true)
 |-- Age: string (nullable = true)
 |-- Birth Date: string (nullable = true)
 |-- Id: string (nullable = true)
 |-- Name: string (nullable = true)
 |-- Score: string (nullable = true)

+--------------------+---+----------+----------+-----+-----+
|             Address|Age|Birth Date|        Id| Name|Score|
+--------------------+---+----------+----------+-----+-----+
|New York, NY 1002...| 28|1991-02-14|3698547891|Arike| 78.6|
|1843-1701 S Osage...| 32|1987-06-07|6984184782|  Bob|45.32|
|R. Olavo Bilac, 1...| 65|1954-12-26|9782472174|Corry|98.47|
|20-16 B5036, Wirk...| 18|2001-10-22|2316324177|David| 3.77|
+--------------------+---+----------+----------+-----+-----+
Alternatively, you can specify the schema yourself.
schema = t.StructType([
    t.StructField("Name", t.StringType(), True),
    t.StructField("Age", t.DoubleType(), True),            # changed from long to double
    t.StructField("Birth Date", t.TimestampType(), True),  # changed from string to timestamp
    t.StructField("Id", t.LongType(), True),
    t.StructField("Address", t.StringType(), True),
    t.StructField("Score", t.StringType(), True)           # changed from double to string
])

df = spark.read \
    .json('data.json', schema=schema)

df.printSchema()
df.show()
root
 |-- Name: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- Birth Date: timestamp (nullable = true)
 |-- Id: long (nullable = true)
 |-- Address: string (nullable = true)
 |-- Score: string (nullable = true)

+-----+----+-------------------+----------+--------------------+-----+
| Name| Age|         Birth Date|        Id|             Address|Score|
+-----+----+-------------------+----------+--------------------+-----+
|Arike|28.0|1991-02-14 00:00:00|3698547891|New York, NY 1002...| 78.6|
|  Bob|32.0|1987-06-07 00:00:00|6984184782|1843-1701 S Osage...|45.32|
|Corry|65.0|1954-12-26 00:00:00|9782472174|R. Olavo Bilac, 1...|98.47|
|David|18.0|2001-10-22 00:00:00|2316324177|20-16 B5036, Wirk...| 3.77|
+-----+----+-------------------+----------+--------------------+-----+
You can add Java-style comments inside the JSON objects and tell Spark to ignore them by setting the allowComments parameter to True.
For example:
{ "Name": "Arike", "Age": 28, "Birth Date": "1991-02-14", "Id": 3698547891, "Address": "New York, NY 10022, USA", "Score": 78.6 /* lsdjflsdkjf */ }
df = spark.read \
    .json('data.json', allowComments=True)

df.printSchema()
df.show()
root
 |-- Address: string (nullable = true)
 |-- Age: long (nullable = true)
 |-- Birth Date: string (nullable = true)
 |-- Id: long (nullable = true)
 |-- Name: string (nullable = true)
 |-- Score: double (nullable = true)

+--------------------+---+----------+----------+-----+-----+
|             Address|Age|Birth Date|        Id| Name|Score|
+--------------------+---+----------+----------+-----+-----+
|New York, NY 1002...| 28|1991-02-14|3698547891|Arike| 78.6|
|1843-1701 S Osage...| 32|1987-06-07|6984184782|  Bob|45.32|
|R. Olavo Bilac, 1...| 65|1954-12-26|9782472174|Corry|98.47|
|20-16 B5036, Wirk...| 18|2001-10-22|2316324177|David| 3.77|
+--------------------+---+----------+----------+-----+-----+
If you don't set the allowComments parameter, its default value is False, and the commented record ends up in an additional column called _corrupt_record:
+-------------------------------------------------------------+----+----------+----------+-----+-----+---------------------------------------------------------------------------------------------------------------------------------------------------+
|Address                                                      |Age |Birth Date|Id        |Name |Score|_corrupt_record                                                                                                                                    |
+-------------------------------------------------------------+----+----------+----------+-----+-----+---------------------------------------------------------------------------------------------------------------------------------------------------+
|null                                                         |null|null      |null      |null |null |{ "Name": "Arike", "Age": 28, "Birth Date": "1991-02-14", "Id": 3698547891, "Address": "New York, NY 10022, USA", "Score": 78.6 /* lsdjflsdkjf */ }|
|1843-1701 S Osage Dr, Ottawa, KS 66067, USA                  |32  |1987-06-07|6984184782|Bob  |45.32|null                                                                                                                                               |
|R. Olavo Bilac, 143 - Batel, Curitiba - PR, 80440-040, Brazil|65  |1954-12-26|9782472174|Corry|98.47|null                                                                                                                                               |
|20-16 B5036, Wirksworth, Matlock DE4 4FP, UK                 |18  |2001-10-22|2316324177|David|3.77 |null                                                                                                                                               |
+-------------------------------------------------------------+----+----------+----------+-----+-----+---------------------------------------------------------------------------------------------------------------------------------------------------+
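The record is rejected because /* ... */ comments are not legal JSON, so a strict parser refuses the whole line, and Spark shunts it into _corrupt_record. This can be sketched with Python's strict json module (illustrative only; Spark uses its own Jackson-based parser, which is equally strict unless allowComments is set):

```python
import json

line = '{ "Name": "Arike", "Score": 78.6 /* lsdjflsdkjf */ }'

# A strict JSON parser rejects the comment, so the whole record is
# unparseable; Spark keeps the raw text in _corrupt_record instead.
try:
    json.loads(line)
    parsed = True
except json.JSONDecodeError:
    parsed = False

print(parsed)  # -> False
```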
The allowBackslashEscapingAnyCharacter parameter lets a backslash escape any character, so escape sequences such as new-line and tab are read and parsed.
You can find a list of escape characters here: https://en.wikipedia.org/wiki/Escape_character
The example below uses \n, \t, \\ and \":
{ "Name": "Arike", "Age": 28, "Birth Date": "1991-02-14", "Id": 3698547891, "Address": "New York\nNY 10022,\nUSA", "Score": 78.6 }
{ "Name": "Bob", "Age": 32, "Birth Date": "1987-06-07", "Id": 6984184782, "Address": "1843-1701\tS Osage Dr,\tOttawa, KS 66067, USA", "Score": 45.32 }
{"Name": "Corry", "Age": 65, "Birth Date": "1954-12-26", "Id": 9782472174, "Address": "R. \\ \\ \\ Olavo Bilac, 143 - Batel, Curitiba - PR, 80440-040, Brazil", "Score": 98.47}
{"Name": "David", "Age": 18, "Birth Date": "2001-10-22", "Id": 2316324177, "Address": "20-16 B5036, \"Wirksworth\", Matlock DE4 4FP, UK", "Score": 3.77}
df = spark.read \
    .json('data.json', allowBackslashEscapingAnyCharacter=True)

df.show(truncate=False)
+-------------------------------------------------------------------+---+----------+----------+-----+-----+
|Address                                                            |Age|Birth Date|Id        |Name |Score|
+-------------------------------------------------------------------+---+----------+----------+-----+-----+
|New York NY 10022, USA                                             |28 |1991-02-14|3698547891|Arike|78.6 |
|1843-1701 S Osage Dr, Ottawa, KS 66067, USA                        |32 |1987-06-07|6984184782|Bob  |45.32|
|R. \ \ \ Olavo Bilac, 143 - Batel, Curitiba - PR, 80440-040, Brazil|65 |1954-12-26|9782472174|Corry|98.47|
|20-16 B5036, "Wirksworth", Matlock DE4 4FP, UK                     |18 |2001-10-22|2316324177|David|3.77 |
+-------------------------------------------------------------------+---+----------+----------+-----+-----+
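Worth noting: \n, \t, \\ and \" are already standard JSON escape sequences that any strict parser accepts; allowBackslashEscapingAnyCharacter is only required when a backslash precedes a character outside that standard set. The standard decoding can be sketched with Python's json module:

```python
import json

# Standard JSON escapes decode into real control / literal characters.
addr = json.loads('"New York\\nNY 10022,\\nUSA"')           # \n -> newline
quoted = json.loads('"20-16 B5036, \\"Wirksworth\\", UK"')  # \" -> "

print(addr.splitlines())  # -> ['New York', 'NY 10022,', 'USA']
```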
The multiLine option (False by default) allows a JSON document to span multiple lines. However, there is a catch: Spark will process only the first JSON object in the file and ignore the rest.
For the content:
{
"Name": "Bob",
"Age": 32,
"Birth Date": "1987-06-07",
"Id": 6984184782,
"Address": "1843-1701 S Osage Dr, Ottawa, KS 66067, USA",
"Score": 45.32
}
{
"Name": "Corry",
"Age": 65,
"Birth Date": "1954-12-26",
"Id": 9782472174,
"Address": "R. Olavo Bilac, 143 - Batel, Curitiba - PR, 80440-040, Brazil",
"Score": 98.47
}
df = spark.read \
    .json('data.json', multiLine=True)

df.show()
The output will be:
+--------------------+---+----------+----------+----+-----+
|             Address|Age|Birth Date|        Id|Name|Score|
+--------------------+---+----------+----------+----+-----+
|1843-1701 S Osage...| 32|1987-06-07|6984184782| Bob|45.32|
+--------------------+---+----------+----------+----+-----+
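The catch follows from the JSON grammar: two concatenated top-level objects are not one valid document, so the multiLine reader stops after the first. The usual workaround is to wrap the records in a top-level array, which Spark (with multiLine=True) reads as one row per element. Both cases can be checked with Python's json module:

```python
import json

# Two concatenated top-level objects: not a single valid JSON document.
try:
    json.loads('{"Name": "Bob"}\n{"Name": "Corry"}')
    whole_file_valid = True
except json.JSONDecodeError:
    whole_file_valid = False

print(whole_file_valid)  # -> False

# A top-level array IS one valid document; with multiLine=True Spark
# turns each array element into a row.
records = json.loads('[{"Name": "Bob"}, {"Name": "Corry"}]')
print(len(records))  # -> 2
```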
You can handle a custom date-only format or a date-and-time format (a.k.a. timestamp) with the dateFormat and timestampFormat parameters, respectively. The default date format is yyyy-MM-dd and the default timestamp format is yyyy-MM-dd'T'HH:mm:ss.SSSXXX.
For the input:
{ "Name": "Arike", "Age": 28, "Birth Date": "1991/02/14", "Id": 3698547891, "Address": "New York, NY 10022, USA", "Score": 78.6 }
schema = t.StructType([
    t.StructField("Name", t.StringType(), True),
    t.StructField("Age", t.DoubleType(), True),
    t.StructField("Birth Date", t.DateType(), True),  # date column
    t.StructField("Id", t.LongType(), True),
    t.StructField("Address", t.StringType(), True),
    t.StructField("Score", t.StringType(), True)
])

df = spark.read \
    .json('data.json', dateFormat="yyyy/MM/dd", schema=schema)

df.printSchema()
df.show()
Note that the Birth Date field in the input is not in the default format, and that the standard date output is hyphen-separated.
root
 |-- Name: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- Birth Date: date (nullable = true)
 |-- Id: long (nullable = true)
 |-- Address: string (nullable = true)
 |-- Score: string (nullable = true)

+-----+----+----------+----------+--------------------+-----+
| Name| Age|Birth Date|        Id|             Address|Score|
+-----+----+----------+----------+--------------------+-----+
|Arike|28.0|1991-02-14|3698547891|New York, NY 1002...| 78.6|
+-----+----+----------+----------+--------------------+-----+
And for the input:
{ "Name": "Arike", "Age": 28, "Birth Date": "1991/02/14 12_43_43", "Id": 3698547891, "Address": "New York, NY 10022, USA", "Score": 78.6 }
schema = t.StructType([
    t.StructField("Name", t.StringType(), True),
    t.StructField("Age", t.DoubleType(), True),
    t.StructField("Birth Date", t.TimestampType(), True),  # timestamp column
    t.StructField("Id", t.LongType(), True),
    t.StructField("Address", t.StringType(), True),
    t.StructField("Score", t.StringType(), True)
])

df = spark.read \
    .json('data.json', timestampFormat="yyyy/MM/dd HH_mm_ss", schema=schema)

df.printSchema()
df.show()
root
 |-- Name: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- Birth Date: timestamp (nullable = true)
 |-- Id: long (nullable = true)
 |-- Address: string (nullable = true)
 |-- Score: string (nullable = true)

+-----+----+-------------------+----------+--------------------+-----+
| Name| Age|         Birth Date|        Id|             Address|Score|
+-----+----+-------------------+----------+--------------------+-----+
|Arike|28.0|1991-02-14 12:43:43|3698547891|New York, NY 1002...| 78.6|
+-----+----+-------------------+----------+--------------------+-----+
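The pattern letters (yyyy, MM, dd, HH and so on) are Java date-format syntax, not Python's strptime codes. For comparison only, the same yyyy/MM/dd parse expressed with Python's datetime (purely illustrative; Spark does its own parsing via dateFormat):

```python
from datetime import datetime

# Java pattern "yyyy/MM/dd" corresponds to Python's "%Y/%m/%d".
parsed = datetime.strptime("1991/02/14", "%Y/%m/%d").date()

print(parsed)  # -> 1991-02-14
```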
For the full list of json() parameters, see the official documentation:
https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=json#pyspark.sql.DataFrameReader.json