Spark - How to Read a CSV File
In this post, I would like to show how to read a CSV file located on your local file system.
First, let's consider the CSV file below. You can create it using Excel, LibreOffice Calc, or any simple text editor. Note that we have 6 columns (A to F) and 5 rows, where the first row is a header row.
Name,Age,Birth Date,Id,Address,Score
Arike,28,1991-02-14,3698547891,"New York, NY 10022, USA",78.6
Bob,32,1987-06-07,6984184782,"1843-1701 S Osage Dr, Ottawa, KS 66067, USA",45.32
Corry,65,1954-12-26,9782472174,"R. Olavo Bilac, 143 - Batel, Curitiba - PR, 80440-040, Brazil",98.47
David,18,2001-10-22,2316324177,"20-16 B5036, Wirksworth, Matlock DE4 4FP, UK",3.77
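Throughout the post I will use PySpark. If you do not yet have a SparkSession, a minimal sketch for creating one locally could look like this (the application name is arbitrary):

from pyspark.sql import SparkSession

# Minimal local session; adjust master/config to your environment
spark = SparkSession.builder \
    .appName("read-csv-example") \
    .master("local[*]") \
    .getOrCreate()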
Assuming we have an initialized SparkSession object called spark, we can write the following.
Here, the "read" property returns a DataFrameReader, on which we call the csv method with the path to the CSV file.
The line df.printSchema() prints the schema of the resulting DataFrame to the console.
df = spark.read\
    .csv('data.csv')
df.printSchema()
df.show()
And the output will be:
root
 |-- _c0: string (nullable = true)
 |-- _c1: string (nullable = true)
 |-- _c2: string (nullable = true)
 |-- _c3: string (nullable = true)
 |-- _c4: string (nullable = true)
 |-- _c5: string (nullable = true)
+-----+---+----------+----------+--------------------+-----+
|  _c0|_c1|       _c2|       _c3|                 _c4|  _c5|
+-----+---+----------+----------+--------------------+-----+
| Name|Age|Birth Date|        Id|             Address|Score|
|Arike| 28|1991-02-14|3698547891|New York, NY 1002...| 78.6|
|  Bob| 32|1987-06-07|6984184782|1843-1701 S Osage...|45.32|
|Corry| 65|1954-12-26|9782472174|R. Olavo Bilac, 1...|98.47|
|David| 18|2001-10-22|2316324177|20-16 B5036, Wirk...| 3.77|
+-----+---+----------+----------+--------------------+-----+
Note that the column names are _c0, _c1, etc. and all the types are strings. So let's fix that.
df = spark.read\
    .csv('data.csv', header=True, inferSchema=True)
df.printSchema()
df.show()
root
 |-- Name: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Birth Date: timestamp (nullable = true)
 |-- Id: long (nullable = true)
 |-- Address: string (nullable = true)
 |-- Score: double (nullable = true)

+-----+---+-------------------+----------+--------------------+-----+
| Name|Age|         Birth Date|        Id|             Address|Score|
+-----+---+-------------------+----------+--------------------+-----+
|Arike| 28|1991-02-14 00:00:00|3698547891|New York, NY 1002...| 78.6|
|  Bob| 32|1987-06-07 00:00:00|6984184782|1843-1701 S Osage...|45.32|
|Corry| 65|1954-12-26 00:00:00|9782472174|R. Olavo Bilac, 1...|98.47|
|David| 18|2001-10-22 00:00:00|2316324177|20-16 B5036, Wirk...| 3.77|
+-----+---+-------------------+----------+--------------------+-----+
Alternatively, we could specify the schema ourselves. In the example below I remove the inferSchema argument (its default value is False), and I specify that the Id column is a string rather than a long as before.
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, TimestampType, DoubleType

schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Age", IntegerType(), True),
    StructField("Birth Date", TimestampType(), True),
    StructField("Id", StringType(), True),
    StructField("Address", StringType(), True),
    StructField("Score", DoubleType(), True)
])

df = spark.read\
    .csv('data.csv', header=True, schema=schema)
df.printSchema()
df.show()
root
 |-- Name: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Birth Date: timestamp (nullable = true)
 |-- Id: string (nullable = true)
 |-- Address: string (nullable = true)
 |-- Score: double (nullable = true)

+-----+---+-------------------+----------+--------------------+-----+
| Name|Age|         Birth Date|        Id|             Address|Score|
+-----+---+-------------------+----------+--------------------+-----+
|Arike| 28|1991-02-14 00:00:00|3698547891|New York, NY 1002...| 78.6|
|  Bob| 32|1987-06-07 00:00:00|6984184782|1843-1701 S Osage...|45.32|
|Corry| 65|1954-12-26 00:00:00|9782472174|R. Olavo Bilac, 1...|98.47|
|David| 18|2001-10-22 00:00:00|2316324177|20-16 B5036, Wirk...| 3.77|
+-----+---+-------------------+----------+--------------------+-----+
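With the columns properly typed we can run numeric operations directly on the DataFrame. As a quick sanity check (this query is just an illustration, not part of the original example):

from pyspark.sql import functions as F

# Average score of people under 40 - works because Age and Score are numeric now
df.filter(F.col("Age") < 40) \
  .agg(F.avg("Score").alias("avg_score")) \
  .show()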
Of course, there are more parameters to this API; I will leave it up to you to explore them in the Spark documentation.
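As a taste, here is a sketch using a few of the other commonly used csv options; the values below are only illustrative and should be adapted to your own file.

df = spark.read.csv(
    'data.csv',
    header=True,
    schema=schema,
    sep=',',                  # field delimiter
    quote='"',                # quoting character for fields containing the delimiter
    dateFormat='yyyy-MM-dd',  # pattern used when parsing date columns
    mode='PERMISSIVE'         # corrupt-record handling: PERMISSIVE, DROPMALFORMED or FAILFAST
)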