Spark - How to read CSV File


In this post, I would like to show how to read a CSV file located on your local file system.

First, lets consider the below CSV file. You can write it using Excel, Libra Office Calc or any simple text editor. Note, we have 6 columns - A to F and 5 rows where the first row is a header row.



Name,Age,Birth Date,Id,Address,Score
Arike,28,1991-02-14,3698547891,"New York, NY 10022, USA",78.6
Bob,32,1987-06-07,6984184782,"1843-1701 S Osage Dr, Ottawa, KS 66067, USA",45.32
Corry,65,1954-12-26,9782472174,"R. Olavo Bilac, 143 - Batel, Curitiba - PR, 80440-040, Brazil",98.47
David,18,2001-10-22,2316324177,"20-16 B5036, Wirksworth, Matlock DE4 4FP, UK",3.77

Assuming we have an initialized SparkSession object called spark we can write the following.
Here, the "read" property returns a DataFrameReader we can execute the csv method with a path to the csv.
The line df.printSchema() prints to the console the scheme of the file.
df = spark.read\
    .csv('data.csv')
df.printSchema()
df.show()

And the output will be:
root
 |-- _c0: string (nullable = true)
 |-- _c1: string (nullable = true)
 |-- _c2: string (nullable = true)
 |-- _c3: string (nullable = true)
 |-- _c4: string (nullable = true)
 |-- _c5: string (nullable = true)

+-----+---+----------+----------+--------------------+-----+
|  _c0|_c1|       _c2|       _c3|                 _c4|  _c5|
+-----+---+----------+----------+--------------------+-----+
| Name|Age|Birth Date|        Id|             Address|Score|
|Arike| 28|1991-02-14|3698547891|New York, NY 1002...| 78.6|
|  Bob| 32|1987-06-07|6984184782|1843-1701 S Osage...|45.32|
|Corry| 65|1954-12-26|9782472174|R. Olavo Bilac, 1...|98.47|
|David| 18|2001-10-22|2316324177|20-16 B5036, Wirk...| 3.77|
+-----+---+----------+----------+--------------------+-----+

Note that the column names are _c0, _c1 etc. and all the types are strings. So let fix that.

df = spark.read\
    .csv('data.csv', header=True, inferSchema=True)

df.printSchema()
df.show()
The optional input "header" is set to True so the Spark will refer to the first line as header names otherwise Spark will generate its own names like _c0, _c1 etc. The optional input "inferSchema" is set to True so Spark will sample the data and try to guess its type. Lets see the output:
root
 |-- Name: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Birth Date: timestamp (nullable = true)
 |-- Id: long (nullable = true)
 |-- Address: string (nullable = true)
 |-- Score: double (nullable = true)

+-----+---+-------------------+----------+--------------------+-----+
| Name|Age|         Birth Date|        Id|             Address|Score|
+-----+---+-------------------+----------+--------------------+-----+
|Arike| 28|1991-02-14 00:00:00|3698547891|New York, NY 1002...| 78.6|
|  Bob| 32|1987-06-07 00:00:00|6984184782|1843-1701 S Osage...|45.32|
|Corry| 65|1954-12-26 00:00:00|9782472174|R. Olavo Bilac, 1...|98.47|
|David| 18|2001-10-22 00:00:00|2316324177|20-16 B5036, Wirk...| 3.77|
+-----+---+-------------------+----------+--------------------+-----+

Alternatively, we could specify the schema ourselves. In the example below I remove the inferSchema input, and the default value of inferSchema is False. Additionally, I specify that the Id column is a string not long as before.

from pyspark.sql.types import StructType, StructField, StringType, IntegerType, TimestampType, DoubleType

schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Age", IntegerType(), True),
    StructField("Birth Date", TimestampType(), True),
    StructField("Id", StringType(), True),
    StructField("Address", StringType(), True),
    StructField("Score", DoubleType(), True)
])
df = spark.read\
    .csv('data.csv', header=True, schema=schema)

df.printSchema()
df.show()


root
 |-- Name: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Birth Date: timestamp (nullable = true)
 |-- Id: string (nullable = true)
 |-- Address: string (nullable = true)
 |-- Score: double (nullable = true)

+-----+---+-------------------+----------+--------------------+-----+
| Name|Age|         Birth Date|        Id|             Address|Score|
+-----+---+-------------------+----------+--------------------+-----+
|Arike| 28|1991-02-14 00:00:00|3698547891|New York, NY 1002...| 78.6|
|  Bob| 32|1987-06-07 00:00:00|6984184782|1843-1701 S Osage...|45.32|
|Corry| 65|1954-12-26 00:00:00|9782472174|R. Olavo Bilac, 1...|98.47|
|David| 18|2001-10-22 00:00:00|2316324177|20-16 B5036, Wirk...| 3.77|
+-----+---+-------------------+----------+--------------------+-----+

Of course, there are more parameters to this API, I will leave it up to you to explore it in this link to Spark documentation.



Comments

Popular posts from this blog

5 ways to calculate max of row in Spark

Create Custom Datasource With Spark 3 - Part 1

Spark - How to read a JSON file