Reading CSV & JSON files in Spark – Word Count Example

One of the really nice things about Spark is its ability to read input files of different formats right out of the box. While this is a convenient feature, reading files in Spark is not always consistent and has kept changing across Spark releases. This article will show you how to read CSV and JSON files and compute word counts on selected fields. The examples assume Spark 2.0+ with Python 3 and above.

Data Files

To illustrate by example, let's make some assumptions about the data files. Assume that we have data files containing a title field and a corresponding text field. A toy example of the JSON format, with made-up records and one record per line (as Spark's JSON reader expects), is as follows:
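    {"title": "doc1", "text": "spark makes reading files easy"}
    {"title": "doc2", "text": "reading json files with spark"}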

And the same toy records in CSV, with a header row, are as follows:
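    title,text
    doc1,spark makes reading files easy
    doc2,reading json files with spark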

Assume that we want to compute word counts based on the text field.

Reading JSON File

Reading the JSON file is actually pretty straightforward: first, you create an SQLContext from the SparkContext. This gives you the capability of querying the JSON file with regular SQL-style syntax.
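A minimal sketch, assuming you are in the pyspark shell, where a SparkContext named sc already exists:

    from pyspark.sql import SQLContext

    # sc is the SparkContext created automatically by the pyspark shell
    sqlContext = SQLContext(sc)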

In the next step, you use the sqlContext to read the JSON file and select only the text field. Remember that we have two fields, title and text, and in this case we are only going to process the text field. This step returns a Spark DataFrame in which each entry is a Row object; to access the text field in a given row, you would use row.text.
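A sketch of that step, where 'data.json' is a hypothetical path to the data file above:

    # Read the JSON file into a DataFrame and keep only the text field
    df = sqlContext.read.json('data.json')
    df = df.select('text')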

To view what you have just read, you can use df.show().

You should see something like this:
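Assuming the toy records above, the output would look roughly like this:

    +--------------------+
    |                text|
    +--------------------+
    |spark makes readi...|
    |reading json file...|
    +--------------------+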


Reading CSV File

Reading the CSV file is similar to JSON, with a small twist: you use sqlContext.read.load(...) and pass it a format, as shown below. Note that this method of reading also works for other file types, including JSON and Parquet.
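A sketch under the same assumptions, where 'data.csv' is a hypothetical path:

    # Load the CSV file; header='true' uses the first row for column names,
    # and inferSchema='true' asks Spark to guess the column types
    df = sqlContext.read.load('data.csv',
                              format='csv',
                              header='true',
                              inferSchema='true')
    df = df.select('text')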

Since the CSV data file in this example has a header row, we pass header='true' so that Spark takes the column names from that row, and inferSchema='true' so that it guesses the column types. In this example, we are again selecting only the text field. This method of reading a file also returns a DataFrame identical to the one from the previous example on reading a JSON file.

Generating Word Counts

Now that we know that reading the CSV file or the JSON file returns identical DataFrames, we can use a single method to compute the word counts on the text field. The idea here is to break each row's text into tokens and return a count of 1 for each token. The tokenized words serve as the keys and the corresponding counts as the values; when you then reduce by key, you add up all the counts to get the total count for each word.

As you can see below, accessing the text field is pretty simple when you are dealing with DataFrames.
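A minimal sketch of such a method; the function name get_word_counts and the simple lowercase, whitespace-based tokenization are assumptions for illustration:

    def get_word_counts(df):
        # Tokenize each row's text, emit a (word, 1) pair per token,
        # then sum the counts for each word
        return (df.rdd
                  .flatMap(lambda row: row.text.lower().split())
                  .map(lambda word: (word, 1))
                  .reduceByKey(lambda a, b: a + b))

    # Example usage: collect the (word, count) pairs and print them
    for word, count in get_word_counts(df).collect():
        print(word, count)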

And voilà, now you know how to read files with PySpark and use them for some basic processing! For the full source code, please see the links below.

Source Code
