Using Tableau: Massive queries and slick SQL

As a data exploration tool, Tableau is second to none, but sometimes the SQL it generates to access SQL databases is just eye watering. Working smart can cure that.

In one the reports I generate, there are two important dates. The reporting month and the dataset extraction date. In fact one defines the other.

OK, quick background, we summarise our data at the monthly level and we only report on complete months. So once a month we snapshot the live data and store it in tables that the Tableau dashboards use. This ensures that all the datasets are self-consistent, and means if someone runs a report at anytime during the month they'll always get the same results.

The dataset extraction date is important because it's when the datasets were snapshotted and appears on our reports as the "as-at" date. Although I have to say, it appears in small type and near the bottom of the dashboard because we discovered that if people see a date with, for example, 03/06/2017, they immediately ask why the data from June isn't in the dashboard. However it does allow us to create the Reporting Month, which is the extraction date minus one month eg for 03/06/2017, the reporting month would be May 2017. This is the one that we display in large characters at the top of the dashboard :-)

To create and use the Reporting Month from within Tableau you can use the dataset extraction date and the following calculated field.

DATE(DATETRUNC('month', DATEADD('month', -1, [dataset_extraction_dt]) ))

The problem is the SQL that Tableau generates behind the scenes to get this date is massive. Mainly this is because we know somethings that Tableau doesn't - for instance this field only ever has a single value in it, and because of that you don't need to apply any filters to it.

SELECT CASE WHEN CAST(CASE WHEN ADD_MONTHS("tar_combined_abs_data"."dataset_extraction_dt", -1) IS NULL THEN CAST(NULL AS TIMESTAMP) ELSE CAST( CAST(ADD_MONTHS("tar_combined_abs_data"."dataset_extraction_dt", -1) AS DATE) - (EXTRACT(DAY FROM ADD_MONTHS("tar_combined_abs_data"."dataset_extraction_dt", -1)) - 1) * INTERVAL '1' DAY AS TIMESTAMP) END AS DATE) IS NULL THEN CAST(NULL AS TIMESTAMP) ELSE CAST( CAST(CAST(CASE WHEN ADD_MONTHS("tar_combined_abs_data"."dataset_extraction_dt", -1) IS NULL THEN CAST(NULL AS TIMESTAMP) ELSE CAST( CAST(ADD_MONTHS("tar_combined_abs_data"."dataset_extraction_dt", -1) AS DATE) - (EXTRACT(DAY FROM ADD_MONTHS("tar_combined_abs_data"."dataset_extraction_dt", -1)) - 1) * INTERVAL '1' DAY AS TIMESTAMP) END AS DATE) AS DATE) - (EXTRACT(DAY FROM CAST(CASE WHEN ADD_MONTHS("tar_combined_abs_data"."dataset_extraction_dt", -1) IS NULL THEN CAST(NULL AS TIMESTAMP) ELSE CAST( CAST(ADD_MONTHS("tar_combined_abs_data"."dataset_extraction_dt", -1) AS DATE) - (EXTRACT(DAY FROM ADD_MONTHS("tar_combined_abs_data"."dataset_extraction_dt", -1)) - 1) * INTERVAL '1' DAY AS TIMESTAMP) END AS DATE)) - 1) * INTERVAL '1' DAY AS TIMESTAMP) END AS "tmn:calculation_13595241565024"
FROM "LDB_SBOX_PDW"."tar_combined_abs_data" "tar_combined_abs_data"
  INNER JOIN "LDB_SBOX_PDW"."Abs_Employee_Static_Base" "abs_employee_static_base" ON ("tar_combined_abs_data"."person_id" = "abs_employee_static_base"."Person_ID")
  INNER JOIN "LDB_SBOX_PDW"."HRC_Reporting_Hierarchy" "hrc_reporting_hierarchy" ON ("tar_combined_abs_data"."unit_hierarchy_cd" = "hrc_reporting_hierarchy"."hierarchy_key")
WHERE (("tar_combined_abs_data"."active_ind" = 'Y') AND ("hrc_reporting_hierarchy"."hierarchy_id" = 1))
GROUP BY 1

On our big datesets, the ones with more than a million rows in them, this causes Teradata to blow out with a "No More Spool Space" error. This happens because there is only one value in the field which means that Teradata is doing all the processing on one AMP (aka a node).

One way round this is to ask the DBA's very nicely to increase your SPOOL SPACE, but this is not a good solution - it's like fixing a leak in an inflatable paddling pool with duct tape. It'll work for a while, but it'll only break again later especially if someone else jumps into the pool.

A better way to fix it is to create a View on the Table in SQL that only has the extraction date in it and only returns one value. At the same time we can also create a text version of the reporting month in exactly the format we want :-)

CREATE VIEW  ldb_sbox_pdw.Abs_Single_Value_Fields  AS
SELECT
MAX(dataset_extraction_dt) AS dataset_extraction_dt,
ADD_MONTHS(dataset_extraction_dt - (EXTRACT(DAY FROM dataset_extraction_dt)-1),-1) AS reporting_month_dt,
TRIM(TO_CHAR(reporting_month_dt, 'Month')) \|| ' ' \|| TRIM(TO_CHAR(reporting_month_dt, 'YYYY')) AS reporting_month_str
FROM ldb_sbox_pdw.tar_combined_abs_data
GROUP BY dataset_extraction_dt
;

And then connect we this as a new data source in Tableau. Making it a new data source means none of the existing filters will be applied to it and means Tableau doesn't try to JOIN to any other datasets. No WHERE clauses and no JOINS. In fact the SQL Tableau generates remarkably clean.

SELECT reporting_month_str 
FROM "LDB_SBOX_PDW"."tar_combined_abs_data" "tar_combined_abs_data";

Compare that to the original SQL and we can see that we're in a much better place. No possibility of that blowing the Spool Space :-).

What we've done here is shifted the onus of the processing and formatting from Tableau onto Teradata where we can apply our knowledge of the data. (See Full Spectrum Developers).

So with a little bit of thought, and some understanding of how data is processed and we've managed to fix the problem and greatly increase the efficiency of our dashboard.


“The combination of hard work and smart work is efficient work”
— Robert Half