Expanding JSON arrays to rows with SQL on RedShift

12 December 2013  —  Torsten Becker

Amazon's RedShift is a really neat product that solves a lot of our problems at work. However, its SQL dialect has some limitations when compared to Hive or PostgresSQL. I hit a limit when I needed table-generating functions but found a work-around.

Some of the data we store in RedShift contains JSON arrays. However, when running analytical queries, there is no out-of-the box way to join on "nested data" inside of arrays so up until now this data was very hard to use.

In October, RedShift added new functions to work with JSON 1  but the support is missing something like Hive's explode() or Postgres' unnest() functions to expand an array from one column into one row for each element.

As a work-around, I came up with a simple hack: Joining the JSON array with a predefined sequence of integers and then extracting the element at each index into a new relation through that join.

If you want to follow along the queries and play with the data, I created a Gist which has all the queries to create the dummy tables and that fills them in with test data.

In this example, I am assuming a table clusters where each row represents a cluster of "things" and each cluster consists of many nodes modeled as a JSON array. Each node then has its size stored in this array – you could ask "What is the maximum node size over all clusters?".

id node_sizes
1 '[1, 2]'
2 '[5, 1, 3]'
3 '[2]'

Assuming the above data in the table clusters, you can use the following SQL query in RedShift to extract the maximum node size from all arrays:

WITH exploded_array AS (
    SELECT id, JSON_EXTRACT_ARRAY_ELEMENT_TEXT(node_sizes, seq.i) AS size
    FROM clusters, seq_0_to_100 AS seq
    WHERE seq.i < JSON_ARRAY_LENGTH(node_sizes)
SELECT max(size)
FROM exploded_array;

The example query above uses a Common Table Expression to create a intermediate relation exploded_array which looks like this:

id size
1 1
1 2
2 5
2 1
2 3
3 2

As you can see, each array was expanded into many rows, but the id is still the same for each element.

So where does the magical seq_0_to_100 in the above queries come from?

Since RedShift is currently missing any kind of sequence-generating functions, I had to emulate this as well. For that I created a view like this:

CREATE VIEW seq_0_to_100 AS (
    -- You get the idea...
    SELECT 100

One constraint of the presented technique is that the maximum length of the longest array has to be known upfront to generate the sequence of integers. This constraint did not provide any obstacles for me in practice yet since this size can just be queried in most cases:

SELECT MAX(JSON_ARRAY_LENGTH(node_sizes)) FROM clusters;

Of course, this example is not all you can do. Once the data is in the shape of exploded_array you can work with the resulting intermediate relation in any other way and join it with all the things in your data warehouse.

Thanks to Jenny and Hans for reading drafts of this post!