THE SQL Server Blog Spot on the Web

Welcome to - The SQL Server blog spot on the web Sign in | |
in Search

Jamie Thomson

This is the blog of Jamie Thomson, a data mangler in London working for Dunnhumby

Creating a Spark dataframe containing only one column

I’ve been doing lots of Apache Spark development using Python (aka PySpark) recently, specifically Spark SQL, and one thing I’ve found very useful to be able to do for testing purposes is create a Spark SQL dataframe from literal values. The documentation at pyspark.sql.SQLContext.createDataFrame covers this pretty well however the code there describes how to create a dataframe containing more than one column like so:

l = [('Alice', 1)]
# returns [Row(_1=u'Alice', _2=1)]
sqlContext.createDataFrame(l, ['name', 'age']).collect()
# returns [Row(name=u'Alice', age=1)]

For simple testing purposes I wanted to create a dataframe that has only one column so you might think that the above code could be amended simply like so:

l = [('Alice')]
sqlContext.createDataFrame(l, ['name']).collect()

but unfortunately that throws an error:

TypeError: Can not infer schema for type: <type 'str'>

The reason is simple, ('Alice', 1) returns a tuple whereas ('Alice') returns a string.

print type(('Alice',1))
# returns <type 'tuple'>
print type(('Alice'))
#returns <type 'str'>

The latter causes an error because createDataFrame() only creates a dataframe from a RDD of tuples, not a RDD of strings.

There is a very easy fix which will be obvious to any half-decent Python developer, unfortunately that’s not me so I didn’t stumble on the answer immediately. Its possible to create a one-element tuple by including an extra comma like so:

print type(('Alice',))
# returns <type 'tuple'>

hence the earlier failing code can be adapted to this:

l = [('Alice',)]
# returns [Row(_1=u'Alice')]
sqlContext.createDataFrame(l, ['name']).collect()
# returns [Row(name=u'Alice')]

It took me far longer than it should have done to figure that out Smile


Here is another snippet that creates a dataframe from literal values without letting Spark infer the schema (behaviour which, I believe, is deprecated anyway):

from pyspark.sql.types import *
schema = StructType([StructField("foo", StringType(), True)])
l = [('bar1',),('bar2',),('bar3',)]
sqlContext.createDataFrame(l, schema).collect()
# returns: [Row(foo=u'bar1'), Row(foo=u'bar2'), Row(foo=u'bar3')]

or, if you don’t want to use the one-element tuple workaround that I outlined above and would rather just pass a list of strings:

from pyspark.sql.types import *
from pyspark.sql import Row
schema = StructType([StructField("foo", StringType(), True)])
l = ['bar1','bar2','bar3']
rdd = sc.parallelize(l).map (lambda x: Row(x))
sqlContext.createDataFrame(rdd, schema).collect()
# returns [Row(foo=u'bar1'), Row(foo=u'bar2'), Row(foo=u'bar3')]

Happy sparking!
Published Monday, December 12, 2016 2:59 PM by jamiet

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS



RichB said:

Interesting... but I'm intrigued, what is the correct pronunciation of PysPark?

December 12, 2016 7:34 PM

jamiet said:

RichB. Does this help:




December 13, 2016 5:18 AM

RichB said:

Ah... how disappointing.

December 13, 2016 7:41 PM

Leave a Comment


This Blog


Privacy Statement