How to read CSV & JSON files in Spark – word count example

One of the really nice things about Spark is its ability to read input files of different formats right out of the box. That said, reading files in Spark is not always consistent and tends to change between Spark releases. This article will show you how to read CSV and JSON files and compute word counts on selected fields. It assumes you are using Spark 2.0+ with Python 3.0 or above. Full working code can be found in this repository.

Data files

To illustrate by example, let’s make some assumptions about the data files. Let’s assume that we have data files containing a title field and a corresponding text field. The toy example format in JSON is as follows:
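The sample file itself is not preserved here; below is a made-up illustration, with one JSON object per line (the line-delimited layout Spark's JSON reader expects) and the two fields described above:

```json
{"title": "doc1", "text": "spark makes word counting simple"}
{"title": "doc2", "text": "spark reads json and csv"}
```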

And the format in CSV is as follows:
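Again a made-up illustration, this time with a header row naming the two columns:

```csv
title,text
doc1,spark makes word counting simple
doc2,spark reads json and csv
```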

Assume that we want to compute word counts based on the text field.

Reading JSON File

Reading the JSON file is actually pretty straightforward: first, you create an SQLContext from the SparkContext. This gives you the ability to query the JSON file in regular SQL-style syntax.

In the next step, you use the sqlContext to read the JSON file and select only the text field. Remember that we have two fields, title and text, and in this case we are only going to process the text field. This step returns a Spark DataFrame in which each entry is a Row object. To access the text field in each row, you would use row.text. Note that the select here is conceptually the same as in traditional SQL, where you would write: select text from .....

To view what you have just read, you can call show() on the DataFrame. You should see something like this:

SQL Query to Read JSON file

Note that you can achieve the same result by issuing an actual SQL query on the dataset. For this, you first register the dataset as a temporary view and then issue the query. This also returns the same DataFrame as above.

Reading CSV File

Reading the CSV file is similar to JSON, with a small twist: you would use the generic reader and provide a format to it, as below. Note that this method of reading also works for other file types, including JSON, Parquet, and CSV, and probably others as well.

Since the CSV data file in this example has a header row, it can be used to infer the schema, hence header='true' as seen above. In this example, we are again selecting only the text field. This method of reading a file also returns a DataFrame identical to the one in the previous example on reading a JSON file.

Generating Word Counts

Now that we know that reading the CSV file or the JSON file returns identical DataFrames, we can use a single method to compute the word counts on the text field. The idea here is to break the text in each row of the DataFrame into tokens and return a count of 1 for each token. This function returns a list of lists, where each inner list contains just the word and a count of 1 ([w, 1]). The tokenized word serves as the key and the corresponding count as the value. Then, when you reduce by key, you can add up all the counts on a per-word (key) basis to get the total count for each word. Note that add here is a Python function from the operator module.

As you can see below, accessing the text field is pretty simple if you are dealing with DataFrames.
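The original listing is not preserved here; a minimal sketch of the approach described above, with tokenize and word_counts as hypothetical names:

```python
from operator import add

def tokenize(text):
    # split a text string into [word, 1] pairs; each word is a key
    # with an initial count of 1
    return [[w, 1] for w in text.split()]

def word_counts(df):
    # flatMap emits one [word, 1] pair per token in each row's text
    # field (accessed via row.text); reduceByKey then adds the counts
    # per word to produce (word, total) pairs
    return df.rdd.flatMap(lambda row: tokenize(row.text)).reduceByKey(add)
```

Calling word_counts(df).collect() on either DataFrame would then return the (word, count) pairs.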

And voilà, now you know how to read files with PySpark and use them for some basic processing! For the full source code, please see the links below.

Source Code


