Tableau and Databricks part 1 – getting started

Two data tools I really enjoy working with are Tableau and Databricks. Tableau lets you visually explore and communicate your data. And Databricks is a cloud-based data and AI platform. If you’ve started to learn about or work with Tableau Next, then you’ll be aware that it reboots the Tableau product stack on the power of Data Cloud and agentic AI (making it a data and AI platform too). But if you’re not fortunate enough to be there yet, or you already work with Databricks, then this series of blog posts could be for you!

You can connect Tableau to Databricks (and vice-versa). And there is a free edition of Databricks for non-commercial/production use. You can use the free edition to learn about the products, and potentially as a platform for your own personal projects. In this “getting started” post I’ll cover signing up for the Databricks free edition, a quick tour, and how to connect to the data in your instance from Tableau Desktop.

Databricks free edition

Firstly, it’s important to note that Databricks free edition is not the same as the free trial, which is often a more prominent link. To access the free edition:

  • Go to www.databricks.com/try-databricks
  • Scroll down and select “get free edition”, noting the caveats re personal use, etc
  • Choose a login method
  • I used my email address, after which I was prompted to enter a confirmation code sent to that email (and prove I was a human)
  • That’s it, you’re in!

Databricks overview

Once you’re into your Databricks instance you should see a page like this:

databricks-01


There is a menu to the left, which I’ll run through briefly below. And there is an Intro to Databricks video linked prominently in the main part of the page. The video is around 10 minutes long and gives a good walk-through of the menu items too.

Working down the first set of menu items and key concepts:

  • Workspace = where you work on and share any code (e.g. notebooks)
  • Catalog = the metadata about the data you have in Databricks: catalogs, schemas, tables and field definitions (well described = better for AI, arguably)
  • Jobs & Pipelines = your data ingestion pipelines, ETL pipelines, and jobs
  • Compute = the compute resource that will handle your workloads and queries. In free edition there is just a Serverless Starter Warehouse (see also SQL Warehouse below)
  • Marketplace = connectors and integrations, resources (data), sample notebooks

In the SQL section

  • SQL Editor / Queries = write and save SQL scripts to query your databases (for SELECT queries; use notebooks in the Workspace for updates, etc.)
  • Dashboards = not yet as mature as Tableau and other BI/dashboarding tools (but see Genie!)
  • Genie = natural language querying of well-defined data within your Catalog, available within dashboards too
  • Alerts = like a trigger, e.g. rows loaded today = 0
  • SQL Warehouse = the compute resource SQL queries are run against, and that tools like Tableau connect to

The Data Engineering section is in essence an alternative way to drill into Jobs & Pipelines mentioned above.

And the AI/ML section allows you to try out AI/ML workloads, run experiments, develop and serve up AI/ML models.

Reviewing some sample data

A good way to dig in further is to review some of the sample data that is shared into your instance – we’ll then go on to connect to that data in Tableau.

If you click into the Catalog menu item, you’ll see the Catalog explorer just to the right of the left-hand menu. If you expand Workspace and Default, you’ll see that you currently have no data of your own in your instance. If you expand the Delta Shares Received section and the Samples catalog, you’ll see a list of Schemas (databases). If you expand Accuweather you’ll see the tables available. And if you select a table, you’ll see information about that table on the right:

databricks-02


Next up let’s query some of that data. To do that I go to the SQL Editor menu item and create a new query. I can start my compute resource over on the right (arrow, or green circle as shown below once running). If I don’t, I’ll be prompted to start it when I run the query. I select the “samples” catalog, and can then write a query against the schemas/tables in that catalog. In this case the SQL pulls back min, average and max temperatures for London by date from the table we looked at above, and we can see the results at the bottom of the page:

databricks-03


The query was run against the SQL Warehouse (compute resource). And it’s this that Tableau will connect to. Click into the SQL Warehouses menu item, then into your “Serverless Starter Warehouse”, and then into the “Connection details” tab:

databricks-04


You will need “Server hostname” and “HTTP path” when you connect from Tableau, and you can use the “copy” icon/button to copy the details.

Connecting from Tableau Desktop

In Tableau:

  • Connect to data
  • In the “To a Server” section select More and search for Databricks
  • Select the top “Databricks” option and on your first connection you should be prompted to download and install the driver.
  • Go ahead and do that (NB: you can go directly here to get the driver: https://databricks.com/spark/odbc-driver-download)
  • Reselect Databricks now that you have the driver installed, and you should be prompted for details
  • Copy and paste in the Server Hostname and HTTP Path of your Databricks SQL Warehouse
  • For Authentication you can proceed with Databricks login (recommended)

databricks-05


At this point:

  • You should get a browser pop up
  • If you’re already logged into your Databricks workspace it will say you’re authenticated
  • Otherwise, re-login
  • NB: you could instead set up and use a personal access token (see link on the SQL Warehouse page, connection details tab, top right)

Select catalog, schema and table(s):

  • As per other server connections you can now select a catalog and a database (schema)
  • And then you can choose tables to drag into your data source
  • NB: in the Database and Table sections the lists don’t show until you search, but you can search for all – just click the Search icon!

databricks-06


More info on connecting from Tableau to Databricks is available here: https://help.tableau.com/current/pro/desktop/en-us/examples_databricks.htm. Note that there will be extra details and steps required when using a paid instance.

Building a view

Building a view in Tableau Desktop is now the same as for any other data source (if this is your first time building a view in Tableau I’ll cover more below the screenshot). In this example I’ve created a graph of the query I ran in Databricks and showed above, and I can see the variation in range of min, average and max temperatures for London much more clearly:

databricks-07


EDIT New to Tableau Desktop? Here are some quick tips:

  • You can sign up for a 14-day trial
  • And there is a handy free trial starter kit
  • In the example above I have connected to data (as previously described)
  • Then I can drag fields from the Data pane on the left onto areas (called shelves) in the middle
  • These show up as the green and blue “pills” as you can see in the screenshot above
  • And their placement tells Tableau what you want to see, and what query to generate behind the scenes
  • I have dragged City Name into the Filters section (top left) and filtered to “london” only
  • Dragged Date onto Columns (at the top) to get a chart over time
  • Measure Values onto Rows (again at the top) – to plot the measure values over time
  • And Measure Names also on Filters, and filtered to just three measures
  • Also added Measure Names onto Marks > Color (left), so I get a coloured line per measure
  • Clicked Color (in the Marks area) and picked a suitable colour per measure (line)

Wrapping up

This post covered signing up to Databricks free edition, exploring the UI and sample data, and connecting to that data in Tableau Desktop.

In future posts I plan to cover ingesting your own data, processing that data (including AI/ML use cases), and using the end results in Tableau. Options to generate extracts for Tableau Public workbooks would also be a great topic as Databricks free edition + Tableau Public could be a compelling pairing!

In the meantime, I hope you get to enjoy exploring the two tools together.

Tableau expander tables – 3

In previous blog posts I introduced the use of expander tables in Tableau when needing to handle a single row in multiple ways, along with three specific use cases.

In this post, part three of the series, I’m going to cover three more use cases. Let’s take a look…

4. To create custom sub totals or grand totals

Here I have a view showing sales by country and region. I’ve turned on grand totals and sub totals for rows so that I get the overall total and a sub-total per country:

custom-totals-1


If I only wanted the breakdown to region for the US, with Canada shown as a single combined row, then I could create a calc that only returns region for the US. But if I still want a sub-total for the US, I would also get an unwanted sub-total for Canada:

custom-totals-2


In this case I can use an expander table with three rows:

custom-totals-3


Let’s take a look at how we use this in the view. NB: the final view won’t look quite like this but it’s useful to see (and test!) what’s happening behind the scenes:

custom-totals-4


What you can see highlighted in yellow is that:

  • “Row Group” A = the detail level (region or country)
  • “Row Group” B = the sub-total row for the US
  • “Row Group” C = the grand total row

Pointed out in red, there is also a “Region 3 Group” calc, which ensures that the sub-total row for the US sorts “less” than the detail row for Canada (“02” vs “03”).
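
As a rough sketch, the sort-key calc could look something like this (the field names and literal values here are hypothetical, not the exact ones from the workbook):

IF [Row Group] = "C" THEN "04"                                       // grand total row sorts last
ELSEIF [Row Group] = "B" AND [Country] = "United States" THEN "02"   // US sub-total, after the US regions
ELSEIF [Row Group] = "A" AND [Country] = "United States" THEN "01"   // US detail rows
ELSEIF [Row Group] = "A" THEN "03"                                   // Canada as a single combined row
ELSE "Hide"                                                          // unused expander copies, to filter out
END

The matching “Region 3” calc uses the same branching to return the region, country, or total label instead of a sort key.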

If I now remove “Row Group” from the view, and hide “Region 3 Group” I get the view I wanted:

custom-totals-5


I’ve added row banding so that the sub-total and the grand total have different backgrounds. And you’ll see that the “Region 3” calc has similar logic to the previous calc, returning region, country, or sub-total/grand-total text. Nice!

Benefits: more control over sub-totals and row ordering, as long as you can write a calc to do it. As noted above a good trick is to include all columns in your view to test your calcs before gradually reducing down to the ones you need on the view.

Example: forum example.

5. To have a category or date in multiple groupings

Sometimes you want to include a single row into more than one grouping or category. People often try to use IF / THEN / ELSE logic for this. In this example the analyst wants to treat any Tomatoes as both a fruit and a vegetable:

IF [Category] IN ("Apples", "Tomatoes", "Oranges")
THEN "Fruit"
ELSEIF [Category] IN ("Carrot", "Cucumber", "Tomatoes")
THEN "Vegetable"
ELSE "Other"
END

Unfortunately, this doesn’t do the job. Once a row in the data has fallen into the IF logic (fruit) it won’t also be able to count in the ELSEIF (vegetable). You can’t count the row twice.

Instead, you can use an expander table to give yourself two rows for every single Tomatoes row, and handle version 1 as fruit and version 2 as vegetable. You will need to ensure that you have a calc that can be used to filter out version 2 of all the other rows.
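
For example, a grouping calc along these lines (a sketch, assuming a two-row expander with a [Row Set] field holding 1 and 2):

IF [Row Set] = 1 AND [Category] IN ("Apples", "Tomatoes", "Oranges") THEN "Fruit"
ELSEIF [Row Set] = 2 AND [Category] IN ("Carrot", "Cucumber", "Tomatoes") THEN "Vegetable"
ELSEIF [Row Set] = 1 AND NOT [Category] IN ("Carrot", "Cucumber", "Tomatoes") THEN "Other"
ELSE "Hide"  // the unused second copy of every other row, filter these out
END

Filtering out “Hide” leaves one visible row per category, plus the extra Vegetable row for Tomatoes.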

A similar example is where you want to show sales today and this month. Any records for today should appear in both totals.

Benefits: you won’t try to achieve the impossible in a calc (please do feel free to prove me wrong of course!).

Example: Forum example.

6. Data densification

This technique used to be more prevalent when polygons had to be created for certain advanced chart types, but it still applies when doing advanced charting using map layers.

If you need to plot a square for each single data point in a data set, then you can relate to an expander table with 5 rows.

  • Row 1 will = bottom left
  • Row 2 will = bottom right
  • Row 3 will = top right
  • Row 4 will = top left
  • Row 5 will = bottom left again

You’ll use calculations to determine the X/Y coordinate of each. E.g. if your data point has coordinates N, M and you want that to be the bottom left of a square that is 1×1…

  • Row 1: X=N, Y=M
  • Row 2: X=N+1, Y=M
  • Row 3: X=N+1, Y=M+1
  • Row 4: X=N, Y=M+1
  • Row 5: X=N, Y=M
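
As calcs, that could look something like this (a sketch: [Row Num] is the expander table’s row number, and [N]/[M] stand in for your data point’s coordinates):

X:

IF [Row Num] IN (2, 3) THEN [N] + 1 ELSE [N] END

Y:

IF [Row Num] IN (3, 4) THEN [M] + 1 ELSE [M] END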

Putting X/Y on Rows and Columns, an identifier on Detail, picking Polygon mark type, and then placing Row Num on Path should give you a filled 1×1 square for each of your data points.

Nowadays if you can use GEOMETRY/GEOGRAPHY types or shapefiles then it’s best to do so. However, this technique can come in handy if you’re drawing advanced charts yourself using map layers.

Benefits: gives you more control over what Tableau can “draw” by allowing you to create additional marks for data points.

Examples:

Closing comments

I hope you’ve enjoyed this series of posts and examples. Expander tables have been a key part of my Tableau toolset for a while now. If you haven’t used them, then I’d highly recommend considering them in the future for use cases like those I’ve covered. And I’ll leave you with the question: how else could you make use of expander tables?

Tableau expander tables – 2

In a previous blog post I introduced the use of expander tables in Tableau when needing to handle a single row in multiple ways. You may have heard of expander tables before under the more generalised term of “scaffolding”.

In this blog post I’m going to cover some different use cases for expander tables. If you didn’t read part 1 yet, go ahead and have a skim through that…

Right, let’s get into some use cases! For each case I’m going to describe the approach and benefits, as well as providing an example from the Tableau Community Forums or Tableau Public that you can dig into in more detail.

1. To count a single row on multiple dates without using a UNION

I covered this use case in part 1 so pop back there for the detail. To recap: with an expander table you can turn a single row with an opened and closed date into two rows. To do this you relate your existing table onto another table which has two rows, with relationship calcs of 1=1. This means every original row joins to both of the additional rows (as 1=1 is always true for the join). Then you can have a calculation that says “for copy 1 use opened date, and for copy 2 use closed date”. This allows you to count the row on both the opened date (as a +1) and the closed date (as a -1, say).

If you have more than two dates then add extra rows to the expander table (one expander row per date), and extra clauses to your calculations.

Benefits: with a UNION your data set is always duplicated meaning that you have to cater for that in other parts of your workbook that use the data (i.e. deduplicate it). With a relationship to an expander table your data will only be duplicated when you need it to be.

Example: here is the example from my VizIt Sydney presentation published to Tableau Public. The third view is the expander table version. The first view demonstrates why we need to duplicate the data, and the second view is a more traditional UNION-based solution.

2. To avoid issues from pivoting your data

Sometimes your data will look a little like this, where a measure is split out by a category or dimension:
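
For example (illustrative figures):

Product     Sales East    Sales West    Stock On Hand
Alpha       100           150           30
Beta        80            120           25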

In this case we don’t have an overall “sales” figure with a dimension for “region”. Instead we have sales for the East region in one column, and sales for the West region in another. Another example that often arises is where each year’s value is in a separate column (Sales 2023, Sales 2024, Sales 2025, etc). Unfortunately Tableau much prefers the measure to be in a single column, and there to be a row per category.

To get around this we can use the built-in pivot option when building the data source:

And that will give us two new columns: Pivot Field Names (with our category), and Pivot Field Values (the measure). NB: usually you would rename these, in this case to Region and Sales. And you would usually remove the word “Sales” prefixing the Region, but I won’t do that in this example.

One problem you’ll spot above is that we have now duplicated Stock On Hand!

So whilst we can use Pivot Field Names (on Colour) and Pivot Field Values (on Columns) to create a stacked bar chart breaking Sales down by region, if we also want to show Stock On Hand, it is now double counted:

pivot-1


Instead we can use an expander table.

The table has a row per Region:
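
Region
East
West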

We need to relate that table to the original data set, with a relationship of 1=1. To do this we pick “relationship calculation” from each of the drop-downs and enter 1:

We can now drag Region onto Colour. Then we need a calculation for Sales which says “when the Region is East use Sales East, and when the Region is West use Sales West”, and we drag that onto Columns:

expander-1
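
The Sales calc is along these lines (a sketch; the column names [Sales East] and [Sales West] are assumed from the data above):

IF [Region] = "East" THEN [Sales East]
ELSEIF [Region] = "West" THEN [Sales West]
END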


We still get our stacked bar chart of sales broken down by region. But now if we drag Stock on Hand onto the view it doesn’t double count as Tableau doesn’t need to use the expander table for that part of the view.

Benefits: similarly to avoiding a UNION, avoiding a PIVOT in favour of an expander table means that additional data isn’t always duplicated. It is only duplicated when we need it to be. And that can help avoid having to deduplicate other values in your visualisation.

Example: Forum example.

3. Instead of Measure Names and Measure Values

In the example above you may have been thinking “I could just use the special Measure Names and Measure Values pills instead of a pivot or an expander table”. And you would be right:

meas-name-values-1


With this approach the Stock On Hand isn’t double counted either. Nice!

But … yep, there was always going to be a but … the special Measure Names and Measure Values pills do have some restrictions. You can’t use them in calculations, they are harder to pass through dashboard actions, and you are restricted on sort options which can be a pain.

Using an expander table as shown in the previous section can give you your own “measure” and “value” columns, but in a way that you can use in calculations, actions and sorting. The example linked below goes into more detail.

Benefits: the special Measure Names and Measure Values pills are very handy but do have some restrictions. Those restrictions don’t apply to the normal pills you can get via use of an expander table.

Example: Forum example (scroll down further than the “best” answer, as an early post goes into the expander table option in more detail).

Further use cases

In the final blog post in the series I’ll cover the use of expander tables to:

  • Create custom sub totals or grand totals
  • Count a category or date in multiple groupings
  • Densify your data in other ways

Until then, thanks for reading!

Tableau expander tables – 1

In a recent VizIt Sydney and Tableau User Group presentation, Fantastic Tableau tricks, and how to avoid them, I talked about expander tables and their benefits when working with data in Tableau. In this blog series I’ll cover:

  • expander tables in more depth (part 1); and
  • use cases they can help with (parts 2+)

But first up…

Why might you need to use an expander table?

When working with data in Tableau it’s not unusual to find that you need to treat a single row of data in more than one way in the same view.  A common example is where a row has both an opened date and a closed date, and you need to plot items opened and closed per date. Each item, or row, needs to be counted twice. Another example is where you have multiple measures per row, you need to use more than one on the same view, and a dual axis or the built in Measure Names and Measure Values pills aren’t flexible enough.

An expander table can help you to handle a single row in more than one way.

An example

Let’s use the example of support ticket data:

Excel of support tickets with open and close date


We have a support ticket number and for each of these there is a date opened and a status. When the status is closed we have a date closed.

If we want to plot the number of tickets opened and closed on any given date, we can’t do that easily with the data like this. For each ticket we only have a single row so we can only use the date opened OR the date closed. As the data stands, we can’t use both to count the ticket when it was opened and when it was closed.

You might be used to seeing date scaffold tables to help with this situation, pivoting the data, or possibly using a union like so:

Unioning the data set to itself in Tableau


The union (first black arrow) gives you two copies of the data, differentiated by two new fields – Sheet and Table Name (second black arrow).

The problem with these approaches is they often reshape the data in more situations than you want, or in ways that cause other impacts to your charts. For example, with the support ticket data, if we just count the rows in the now union-ed data set we will end up double counting support tickets!

Instead we can use an expander table:

Tableau data source relating to an expander table with two rows on a calc of 1=1


Here I have related an “expander” table onto the support tickets (top black arrow); the expander table has two rows in it (Row Set A and B, highlighted with the diagonal arrow); and the relationship is on a relationship calc of 1=1, which is always true (bottom arrow, note: you choose “relationship calc” in the drop-downs here and then enter 1 for each).

This relationship ensures that for every support ticket we can choose to have two rows. But that will only happen when we reference the Row Set from the expander table. In other cases the support tickets data will act as it always has, with no double ups. This ability to only have the extra rows when we choose to is a key advantage over a union or pivot, as it means calcs like the total number of tickets don’t double count by default.

Let’s take a look at how we use the expander table when building a view:

Graph of opened and closed support tickets with calc 1


Here I build a combined date. When the Row Set is A, the date will be the date opened. And when the Row Set is B, it will be the date closed. I also create a similar calc for Ticket Numbers: +1 when a ticket was opened (Row Set A), and -1 when a ticket was closed (Row Set B where status is closed):

Graph of opened and closed support tickets with calc 2
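
Spelled out, the two calcs are along these lines (field names and Row Set values assumed to match the data shown earlier):

Date Combined:

IF [Row Set] = "A" THEN [Date Opened]
ELSE [Date Closed]
END

Ticket Count:

IF [Row Set] = "A" THEN 1
ELSEIF [Status] = "Closed" THEN -1
END

Note that the ELSEIF leaves the Row Set B copy of a still-open ticket as NULL, so it doesn’t affect the count.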


We’re using one set of rows for opened, and the other for closed.

With this we can plot bars for both tickets opened (orange) and tickets closed (light blue). And unlike a union, we can still get the correct total ticket count (black arrow below) without having to adjust that calc … we only get the second set of rows when we need it, thanks to the way relationships work.

Graph of opened and closed support tickets with correct total


Whilst this is a very simple concept it can be incredibly useful when your data is not quite in the shape you need in Tableau – cases where you might otherwise turn to Measure Names and Values, a union, a pivot, or complex calculations.

What if I can’t add another table / sheet?

Firstly, if you’re using a database connection you don’t necessarily need an expander table in the same database. You could try adding an extra connection in the Data Source pane, selecting a CSV or Excel file, and then using that extra connection:

You can also copy the contents of a simple text file with a header and X rows, and then use “Paste Data as Connection” from the “Data” menu, which will achieve the same thing.
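
For a two-row expander, the pasted text can be as simple as:

Row Set
A
B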

And finally, if you are using a database, you could try adding a Custom SQL where the SQL you enter selects X number of rows from a table that you do have and converts to what you need, e.g. if you have a table “YourTable” that has a Location in it with values North, South, East and West, and you want an expander table with just two rows, you could use this Custom SQL:

SELECT CASE WHEN Location='North' THEN 1 ELSE 2 END AS RowNum
FROM YourTable 
WHERE Location IN ('North', 'South') 
GROUP BY CASE WHEN Location='North' THEN 1 ELSE 2 END

NB: This sort of Custom SQL approach does run the risk of reducing performance!

Use cases

In part 2 and part 3 of this blog series I cover six use cases for expander tables:

  1. Count a single row on multiple dates without using a UNION
  2. Avoid issues from pivoting your data
  3. Instead of Measure Names and Measure Values
  4. Create custom sub totals or grand totals
  5. Count a category or date in multiple groupings
  6. Densify your data in other ways

You can also check out the YouTube video at the start of this post (~16m50s to ~21m30s).

LOD equivalent of INDEX and RANK (part 2)

In my last blog post I looked at a LOD equivalent of RANK / INDEX table calculations in Tableau. That approach was limited to ranking a very small range of whole numbers, and left me considering other options. This post outlines another LOD approach using spatial functions!

WARNING: As before I will stress that you should rarely need a LOD-based equivalent of rank or index, and can often use table calculations when you don’t think that you can. That said, there are scenarios where a LOD equivalent can be useful: onward use of the calc or use in spatial functions being the cases I’ve seen on the Tableau community forums. And it’s also a fun challenge!

To revisit why this is a tough (fun!) challenge we have to remember that LODs really only break out of the viz level of detail to a higher or lower level of detail. So if you’re working at a month level of detail, a LOD expression can get you the figure for that month’s quarter or year, or it can work with figures at the day level. However we would usually say that there is no LOD expression (or indeed level of detail) that equates to the rows before or after any given row. And that makes finding an index or rank difficult with LODs.

The previous solution I wrote about takes any value and switches on a 1 at that value’s position in a string of 1s and 0s. So 5 = 10000 and 3 = 00100. These strings of 1s and 0s can be considered as a binary number. And that can be summed up using a LOD. For 5 and 3 the sum gives us 10100 (in binary). And from this we can count all of the 1s including and to the right of the n-th digit. So the number of 1s including and to the right of the 3rd digit = 1, and the number including and to the right of the 5th digit = 2. Hence 3 = rank/index of 1, and 5 = rank/index of 2.

As noted above, this approach is constrained to quite small ranges of whole numbers. However it does give us a generalised approach to solving our problem with LODs. What we need is an alternative way to add up all the values in a range, such that we can still count how many there are “including and before” any particular value.

This is where spatial functions come in. We can wrap spatial functions into LODs, and we can do things like UNION and COLLECT to combine spatial values, and we can do things like INTERSECT to find overlaps.

Approach using LODs and spatial functions

Here’s how we can stitch this together for a LOD equivalent of RANK_DENSE (which is the same as INDEX when there are only unique values):

  • Convert the values (that we want to rank) to a latitude between 0 and 90
  • Convert these latitudes to buffered points (that have an area) along longitude 0
  • Use a LOD to COLLECT all these buffered points (circles) together
  • Also convert each value’s latitude to a similarly buffered line from that latitude back to 0
  • INTERSECT the buffered line with the LOD COLLECT of buffered points
  • This leaves us with just the circle equivalents of all the values up to and including this one!
  • We can’t count these (as far as I know!) but we can get the combined area
  • And we can divide that area by the known area of a single buffered point
  • Which gives us the equivalent of a count, or RANK_DENSE
  • And for ranges with only unique values, this will be the same as INDEX

Illustration

To illustrate, here is a table of sales by sub-category using sample superstore data, along with INDEX and RANK_DENSE using table calcs, and a LOD version.

index-1


Art is ranked fourth here. If we convert the sales values to circles between latitudes 0 and 90, and also draw a line from Art back to latitude 0 (buffered to the same radius as the circles), then we can see that this line intersects four of the circles, equating to the rank/index of 4 for Art:

index-2


Calculations to replicate RANK_DENSE

Let’s take a look at the calculations to replicate RANK_DENSE. For this example I’m using data with non-unique values:

rank-1


I’m going to use INCLUDE and EXCLUDE LODS, but you can try with FIXED too. Just remember that if you use FIXED you’ll need to account for any other filters/fields on your view (either adding filters to context or including the dimensions that you need to fix by), otherwise the FIXED LOD will bust out beyond the data you want to consider – e.g. include years that you wanted to filter out.

First up some base calcs that will be needed regardless of which version of RANK that we’re trying to replicate:

We need to know the minimum and maximum values in the range as these will end up equating to latitude 0 and 90…

In these calculations Sub-Category is the thing I want to rank and [Total Sales] is the measure I want to rank by. You will replace these with the dimension and measure you want to use from your data.

BASE (a) min

{EXCLUDE [Sub-Category]: MIN(
 {INCLUDE [Sub-Category]: SUM([Total Sales])}
 )}

BASE (b) max

{EXCLUDE [Sub-Category]: MAX(
 {INCLUDE [Sub-Category]: SUM([Total Sales])}
 )}

BASE (c) lat

( SUM([Total Sales]) - MIN([BASE (a) min]) )
/
( MIN([BASE (b) max]) - MIN([BASE (a) min]) )
*
90

Next up we convert the latitudes to a point, and we define a buffer radius to get a circle. Note that you may need to tweak this buffer radius if you’re finding that your values, when converted to latitudes and buffered, are overlapping too much and giving you incorrect results.

BASE (d) point

MAKEPOINT([BASE (c) lat],0)

BASE (e) buffer m

10

Now we’re in to the RANK_DENSE specifics. We want a buffered circle, and a buffered line back to lat 0, per value. One thing to note here is that some of these spatial functions count as aggregations. And that means they won’t accept aggregated values, so we have to wrap any aggregations in another LOD first – hence the extra INCLUDE in the next calc.

RANKD (g) circle

COLLECT(
 {INCLUDE [Sub-Category]: 
 BUFFER([BASE (d) point],[BASE (e) buffer m],'m')
 })

RANKD (h) line

BUFFER(
 MAKELINE( MAKEPOINT(0,0), [BASE (d) point] ),
 [BASE (e) buffer m], 
 'm')

The penultimate step is to return the overall collection of circles that intersect with a specific value’s line. To do this we use the INTERSECTION spatial function. And we use an EXCLUDE LOD on the circles and an INCLUDE LOD on the line to ensure one is the overall set and the other is specific to the value.

RANKD (i) intersect

INTERSECTION(
 {INCLUDE [Sub-Category]:[RANKD (h) line]}, 
 {EXCLUDE [Sub-Category]:[RANKD (g) circle]}
 )

That leaves the final step being to convert the circles up to and including this specific value that we’re left with back into a count. I don’t think we can count the circles with the spatial functions that Tableau has, but we can return the AREA of the collection of circles, and we can then divide that area by the known area of a single circle (“pi r squared” – where r is the buffer)

RANKD!

1 + FLOOR(ZN(
 AREA([RANKD (i) intersect],'m') 
 / 
 ( PI() * SQUARE([BASE (e) buffer m]) )
))

The use of FLOOR here caters for values that are close enough together that, when converted to latitudes and buffered, their circles slightly overlap. And this is the RANK_DENSE!

What about RANK_MODIFIED?

In the screenshot above I also included a LOD version of RANK_MODIFIED. The approach for this is a little more complex.

It involves a FIXED LOD to determine how many occurrences there are of any given value (in my case the SUM of [Total Sales] per Sub-Category):

BASE (f) num same

{FIXED 
 {FIXED [Sub-Category]:SUM([Total Sales])}:
 COUNTD([Sub-Category])
 }

Working out the radius of a circle whose area would be that number of times larger than the circle for a single occurrence of the value:

RANKM (g) radius

// where r = normal radius
// what R (radius) do we need for a circle with N times the area
// pi*r*r*N = pi*R*R
// r*r*N = R*R
// R = SQRT(r*r*N)
SQRT( SQUARE([BASE (e) buffer m]) * [BASE (f) num same])

We then make our buffered points (circles) use this radius, and our buffered line use the max of these radiuses (radii?)

RANKM (h) circle

COLLECT(
 {INCLUDE [Sub-Category]: 
 BUFFER([BASE (d) point],AVG([RANKM (g) radius]),'m')
 })

RANKM (i) line – note the FIXED MAX for radius here

BUFFER(
 MAKELINE( MAKEPOINT(0,0), [BASE (d) point] ),
 MAX({FIXED: MAX([RANKM (g) radius])}), // line buffer max of circle buffers
 'm')

The intersection and final calc are then pretty much the same as the RANK_DENSE approach. And because the values with multiple occurrences have bigger circles, the final division by the area for a single occurrence gives us the RANK_MODIFIED.

RANKM (j) intersect

INTERSECTION(
 {INCLUDE [Sub-Category]:[RANKM (i) line]}, 
 {EXCLUDE [Sub-Category]:[RANKM (h) circle]}
 )

RANKM!

1 + FLOOR(ZN(
 AREA([RANKM (j) intersect],'m') 
 / 
 ( PI() * SQUARE([BASE (e) buffer m]) )
))

RANK_MODIFIED … LOD-ified!

What about INDEX?

If your values are all unique then either of the above approaches will give you the equivalent of INDEX too – pick the first / simpler one!

However, if your values are more like my example, i.e. more than one occurrence of some values, and you were hoping for a LOD equivalent of INDEX, then…

Sorry! That’s eluding me at the moment.

Conceptually you “just” need to subtract something off the RANK_MODIFIED calc to spread the values back out into a unique index. But the value you need to subtract is itself an index, and in the example above (ranking sub-categories) we now need an index across a text field, which this solution doesn’t cater for! In some ways it doesn’t need to be an index, just a unique value for each row shown. And I’d love to hear your ideas on this final step!

Caveats and closing remarks

A couple of caveats with the above approach if you do think that you need to use it.

Firstly you should assess performance for your data. The use of spatial functions and LODs is going to introduce additional work for Tableau to do.

Secondly really do sense check if you need a LOD-based approach in the first place. Often you can still use table calcs in cases you think that you might not be able to.

And finally if you got this far I hope that you found it as fun a challenge as I did. If nothing else it’s enjoyable to test the limits of Tableau. Here’s hoping that in the future we get inbuilt functionality like LEAD and LAG in SQL!

LOD equivalent of INDEX

Is there a level of detail expression equivalent of a particular table calculation in Tableau? This question comes up on the Tableau community forums every now and then, and almost always intrigues me.

Often a level of detail (LOD) expression isn’t really necessary, but occasionally an alternative to table calculations is. And like a mountain to be climbed, or a trail to be explored, I’m fascinated by whether it is even possible to implement LOD equivalents of some common table calcs.

I’ve previously posted on why this is a challenge, and how we can implement a LOD equivalent of LOOKUP in certain situations.

But what about INDEX? Or RANK_UNIQUE? The answer is yes, but I’ve only achieved it in quite specific situations:

  • calculating an index or rank over whole numbers only
  • where the numbers are all unique
  • and cover a relatively small range (though you can get around this as I discuss at the end of the post)

I’m posting about this in case it helps others, or triggers further ideas to improve the approach.

Solution walkthrough

For this walkthrough I’m going to use the following test data:

We have a list of fruit, each of which has a value. And we want to rank the fruit by that value. With a table calc this is easy! The third column below uses INDEX() computed table (down), for example:

Firstly the big constraint: with a LOD we can jump out of the level of detail we have, to a higher or lower level of detail. However there really is no level of detail that tells us how many records have come before or after a particular record, as that before/after “group” is different for each record and so isn’t actually a different level of detail. My thinking is that we can’t simply use a LOD that counts the values (COUNT or COUNTD) before a particular record.

However, we can use a LOD to SUM up values across the data. And if any given value always had a discernible position within that SUM we might be able to derive an index. At this point my mind turns to bits in binary numbers, or converting numbers to and from base 2.

In binary:

  • 1 is represented by 001
  • 2 is represented by 010
  • 3 is represented by 011
  • 4 is represented by 100
  • 5 is represented by 101, and so on

Perhaps we can switch on “bits” (the 1s above) at the position for each value we have and use that…

Unfortunately Tableau doesn’t offer great binary arithmetic but we can use strings and a brute force calc (this is where the limit on the number range comes in).

The LOD version in purple matches the table calc version in green

And here is a breakdown of the results of each step of the calc…

LOD INDEX


Let’s take a look at how it works…

x1 =

[Value]-{FIXED:MIN([Value])}

The x1 calc ensures that the range of values starts at zero by subtracting off the minimum value across the data set (note that if you’re filtering your data you’ll need those filters to be context filters, or include them as a list of dimensions you’re fixing by between FIXED and the colon).

Next we convert x1 into a binary number, switching on the bit (from 0 to 1) at the right position.

x2 = 

POWER(2,[x1])

And we can sum that up over the whole data set using a FIXED LOD (noting again that you’ll need to take into account any filters on your view, adding to context or the dimensions fixed by).

x3 = 

{FIXED:SUM([x2])}

This number (144243 in the screenshot above) isn’t that useful as a normal decimal number. We need to convert it back to binary, and we’ll need to be able to work with that binary string of 1s and 0s. Annoyingly we need to brute force this as you’ll see below!

x4 =

// ref: https://help.salesforce.com/s/articleView?id=001456706&type=1
// should handle numbers up to ~1M
STR(INT(([x3]/2^20))%2)+
STR(INT(([x3]/2^19))%2)+
STR(INT(([x3]/2^18))%2)+
STR(INT(([x3]/2^17))%2)+
STR(INT(([x3]/2^16))%2)+
STR(INT(([x3]/2^15))%2)+
STR(INT(([x3]/2^14))%2)+
STR(INT(([x3]/2^13))%2)+
STR(INT(([x3]/2^12))%2)+
STR(INT(([x3]/2^11))%2)+
STR(INT(([x3]/2^10))%2)+
STR(INT(([x3]/2^9 ))%2)+
STR(INT(([x3]/2^8 ))%2)+
STR(INT(([x3]/2^7 ))%2)+
STR(INT(([x3]/2^6 ))%2)+
STR(INT(([x3]/2^5 ))%2)+
STR(INT(([x3]/2^4 ))%2)+
STR(INT(([x3]/2^3 ))%2)+
STR(INT(([x3]/2^2 ))%2)+
STR(INT(([x3]/2^1 ))%2)+
STR(INT(([x3]/2^0 ))%2)

You’ll need to adjust this to suit your range of numbers and will almost certainly hit limitations with anything bar small ranges of numbers (you can see that I needed 17 lines of the above calc just to cater for a range of 3 to 20!)

The binary string we end up with, 000100011001101110011, has ten 1s in it (corresponding to the 10 values we had), with the position of each 1 representing its value (adjusted so the range starts at 0).

From here we can say “take the digits to the right of my position” for any given number.

x5 = 

RIGHT([x4],[x1]+1)

And that does give us a chance to count the numbers up to and including this number … by counting the 1s. We can do this by removing the 0s and seeing how long the remaining string is…

i LOD = 

LEN(REPLACE([x5],'0',''))

Which gives us our LOD equivalent of INDEX!

Hang on, what about…

What if I need a larger range of numbers? You may be able to extend the large 2^X string function above to cater for more digits.

What if that isn’t helping, or I need to index strings not numbers? You could pre-calculate an index in the data set (using Tableau Prep, say), and then use the technique above on the pre-calculated index to allow for cases where rows are filtered out in the view. The assumption here is that the pre-calculated index will span a smaller range than the original numbers (or ensure you have a number in the first place if working to index strings). This was the approach taken in the forum post I shared above, and the challenge was adjusting the pre-calculated index as the view was filtered … without using a table calc.

What other alternatives are there to table calcs and LODs? Self-joining your data can allow you to see how many records fall before any given record, and that can be a simpler solution. E.g. join the data to itself where A.value >= B.value, meaning every row in A links to every row with the same or lower value.
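
As a sketch in SQL (table and column names are illustrative):

SELECT a.Fruit, a.Value, COUNT(*) AS RankValue
FROM Fruits a
INNER JOIN Fruits b ON b.Value <= a.Value
GROUP BY a.Fruit, a.Value

With unique values, RankValue is each row’s ascending rank/index.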

So why do this with LODs? You really shouldn’t have to do this very often, if at all! Editing the compute by for table calculations often solves cases where you don’t think you can use a table calc. In the case that triggered this the results needed to be used in MAKEPOINT which didn’t accept a table calc.

Is that the only reason? No, for me it’s also a fun challenge. And one I don’t feel like I’m fully done with … if you find it intriguing too, can you take the idea further? Or do you have better ways to calculate an INDEX or RANK with a LOD for each row in a data set/view?

Tableau Prep and #PreppinData 2024 week 13

An Easter themed #PreppinData for 2024 week 13. Preparing sales of products in the 12 weeks running up to Easter to allow for easy comparison of the period across years in Tableau Desktop.

A nice one step solution this week (see screenshot at the end of this post): a FIXED level of detail calc to get the first sale date per year; then date calcs to get the week, day and day order.
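
The per-year first-sale-date part is essentially a calc like this (field name assumed):

{FIXED YEAR([Sale Date]) : MIN([Sale Date])}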

This week marks a quarter of a year learning Tableau Prep. I started with the Tableau getting started guide (2 hours), committed around an hour a week for 13 weekly #PreppinData challenges (a great resource and progression of learning), answered a handful of Tableau Prep questions on the Tableau community forums (5 hours), and blogged about my learnings to reinforce them (5 hours). 25 hours all up and I feel like I’ve got a good grasp of the product and its capabilities. It’s a great tool for analysts and those needing to do ad-hoc but often repeatable data preparation, cleaning and transformation prior to analysis. Even better, it’s included with your Tableau Creator license if you’re using Desktop or Cloud as a creator! Definitely give it a go if you use Tableau and have a need to tidy data.

PD 2024 Wk 13

Tableau Prep and #PreppinData 2024 week 12

#PreppinData 2024 week 12, graduate student loan repayment calculator. Good to try out the “value ranges from two fields” option within a “new rows” step. Like some others my interest figure is a little different from the supplied output; however, the calc appears to be the same. I also shortcutted the join onto repayment info for undergraduates with a filter (down to just the undergrad row), and joiner fields allowing a simple join on 1=1.

PD 2024 Wk 12

Tableau Prep and #PreppinData 2024 week 11

Week 11 of #PreppinData, and the question: what if there were 13 months in a year? Nice concept to have consistent 28 day months, with 4 weeks per month and each month starting on a Monday and ending on a Sunday. As we found out when we expanded the two row data set though … it’s not as neat as it seems, ending up with a spare day (or two in a leap year).

Part one of my flow answers Jenny’s question “which dates would change months?”. Output gives me 190 as expected.

Part two of the flow looks at Carl’s question of “what the new month(s) would be called”. Turns out the extra month most logically slots in between June and July. This is based on which old month is most associated with a new month number (the number of days of each new month that fall in an old month). And then the “average” of June and July (based on ASCII codes) is Juno. This may (or may not) be Dominican slang for extremely drunk, which coincidentally may (or may not) be what you’d need to be to suggest changing the current calendar in the first place! My preference for the extra day or two at the end is Extrember … as it’s a little extra after December.

PD 2024 Wk 11