Expanding JSON arrays to rows with SQL on Redshift
Amazon's Redshift is a really neat product that solves a lot of our problems at work. However, its SQL dialect has some limitations when compared to Hive or PostgreSQL. I hit one of those limits when I needed table-generating functions, but found a work-around.
Some of the data we store in Redshift contains JSON arrays. However, when running analytical queries, there is no out-of-the-box way to join on "nested data" inside of arrays, so up until now this data was very hard to use.
In October, Redshift added new functions to work with JSON, but the support is still missing something like Hive's explode() or Postgres' unnest() function to expand an array from one column into one row for each element.
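For reference, the two JSON building blocks I use below are JSON_ARRAY_LENGTH and JSON_EXTRACT_ARRAY_ELEMENT_TEXT, which only operate on one value at a time:

SELECT JSON_ARRAY_LENGTH('[5, 1, 3]');                    -- 3
SELECT JSON_EXTRACT_ARRAY_ELEMENT_TEXT('[5, 1, 3]', 1);   -- '1' (the index is zero-based)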
As a work-around, I came up with a simple hack: joining the JSON array against a predefined sequence of integers and then extracting the element at each index into a new relation through that join.
If you want to follow along with the queries and play with the data, I created a Gist that contains all the queries to create the dummy tables and fill them with test data.
In this example, I am assuming a table clusters, where each row represents a cluster of "things" and each cluster consists of many nodes modeled as a JSON array. Each node's size is stored in this array, so you could ask, for example: "What is the maximum node size across all clusters?"
| id | node_sizes  |
|----|-------------|
| 1  | '[1, 2]'    |
| 2  | '[5, 1, 3]' |
| 3  | '[2]'       |
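If you do not want to use the Gist, a minimal sketch to recreate this setup (assuming the arrays fit into a plain VARCHAR column) looks like this:

CREATE TABLE clusters (
    id         INT,
    node_sizes VARCHAR(256) -- the JSON array, stored as plain text
);

INSERT INTO clusters VALUES
    (1, '[1, 2]'),
    (2, '[5, 1, 3]'),
    (3, '[2]');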
Assuming the above data is in the table clusters, you can use the following SQL query in Redshift to extract the maximum node size from all arrays:
WITH exploded_array AS (
    -- Pair every cluster with each index into its node_sizes array and
    -- extract the element at that index; the cast to int makes MAX compare
    -- the sizes numerically instead of as strings.
    SELECT id, JSON_EXTRACT_ARRAY_ELEMENT_TEXT(node_sizes, seq.i)::int AS size
    FROM clusters, seq_0_to_100 AS seq
    WHERE seq.i < JSON_ARRAY_LENGTH(node_sizes)
)
SELECT MAX(size)
FROM exploded_array;
The example query above uses a Common Table Expression to create an intermediate relation exploded_array, which looks like this:
| id | size |
|----|------|
| 1  | 1    |
| 1  | 2    |
| 2  | 5    |
| 2  | 1    |
| 2  | 3    |
| 3  | 2    |
As you can see, each array was expanded into many rows, but the id stays the same for each element.
So where does the magical seq_0_to_100 in the above queries come from?
Since Redshift is currently missing any kind of sequence-generating function, I had to emulate that as well. For that, I created a view like this:
CREATE VIEW seq_0_to_100 AS (
SELECT 0 AS i UNION ALL
SELECT 1 UNION ALL
-- You get the idea...
SELECT 99 UNION ALL
SELECT 100
);
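If writing out a hundred UNION ALLs by hand feels too tedious, a window function works as well. This is just a sketch: it assumes you have some table with at least 101 rows lying around (called some_big_table here) to generate the numbers from:

CREATE VIEW seq_0_to_100 AS (
    -- Number the rows of any sufficiently large table and shift to start at 0;
    -- the ordering does not matter since we only need the distinct numbers.
    SELECT ROW_NUMBER() OVER () - 1 AS i
    FROM some_big_table
    LIMIT 101
);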
One constraint of the presented technique is that the maximum length of the longest array has to be known upfront to generate the sequence of integers. In practice, this has not been an obstacle for me so far, since the required length can simply be queried in most cases:
SELECT MAX(JSON_ARRAY_LENGTH(node_sizes)) FROM clusters;
Of course, this example is not all you can do. Once the data is in the shape of exploded_array, you can work with the resulting intermediate relation in any other way and join it with all the other things in your data warehouse.
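For example, here is a sketch that joins the exploded data against a hypothetical cluster_owners table (with columns id and owner) to find the largest node size per owner:

WITH exploded_array AS (
    SELECT id, JSON_EXTRACT_ARRAY_ELEMENT_TEXT(node_sizes, seq.i)::int AS size
    FROM clusters, seq_0_to_100 AS seq
    WHERE seq.i < JSON_ARRAY_LENGTH(node_sizes)
)
SELECT o.owner, MAX(e.size) AS max_node_size
FROM exploded_array AS e
JOIN cluster_owners AS o ON o.id = e.id
GROUP BY o.owner;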