What is Analytics Engineering?!

“It is easier to ship recipes than cakes and biscuits.” -John Maynard Keynes


As an Analytics Engineer, you may hear this a lot: ‘Wait, so you just make dashboards and reports, right?’ It’s a common question, but it’s often based on a deep misunderstanding of the analytics process. Let me attempt to clarify.

As an Analytics Engineer, you are responsible for moving the analytics and data processes into the layer above the data warehouse. Essentially, your task is to take a basic set of data and shape it into something both actionable and useful. Naturally, this task calls for a skill set that is both broad and specialized, and it is probably why people are so quick to assume you just ‘make reports and dashboards.’ They see the result of the work, not the process that created that report or dashboard. They did not see you toil with a dataset that was too wide and at the wrong granularity. They did not see you automate the ETL of said table into an aggregated table with security, alerting, and logging baked into the process. And, after all that, they did not see you worry about the font, color, and styling of the dashboard, because the effects of a terrible design are catastrophic for your message.

An Analytics Engineer must be, primarily, a technologist and a thinker. With the sheer volume of options available (throughout the analytics and data layers), ‘context and task shifting’ is an understatement. It’s common to spend most of the day at the data (SQL) layer while spending a smaller portion of time creating and implementing dashboards in a dashboard tool. And it’s that balance of time that allows for functional dashboards.

One vital piece worth mentioning (and possibly elaborating on) is that this field is not locked into a single technology or stack (for example, a LAMP stack, a particular cloud vendor, or an analytics tool); rather, an Analytics Engineer should be able to use all the tools and should leverage a more abstract stack of their own, one not tied to any single technology. So, today it might be MySQL, but tomorrow it could be Redshift or the like. Therein rests the challenge: being able to adapt to the tool at hand without missing a beat.

 


Adding Tableau Server users

If you’re leveraging local authentication for Tableau Server, you might be wondering whether there’s an automated / ad hoc way to add, update, or remove users on your server and its sites (where appropriate). Sure, you can do it manually. But who wants to do that? No one. If you have a lot of activity, these one-offs are bound to add up and eat away at both time and security.

Here’s a quick PowerShell function that will do that for you. It’s designed to update users and assumes you already have an automated way of provisioning your new users and removing your old ones (please say you do). Again, by ‘update’ I mean removing the ability for a user to go back to the site and select ‘Forgot your password’ after they have been terminated. That would be bad, especially if they still have a valid email…

Get the code here.
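If you just want the shape of the idea, here is a minimal sketch (not the original function) that neutralizes a terminated local-auth account by unlicensing it through the Tableau Server REST API’s Update User call. The server URL, API version, and how you obtain the site ID, user ID, and auth token from the sign-in call are all assumptions:

function Disable-TableauSiteUser {
    param(
        [string]$Server = 'https://tableau.example.com',   # hypothetical server
        [string]$ApiVersion = '3.4',
        [Parameter(Mandatory)][string]$SiteId,
        [Parameter(Mandatory)][string]$UserId,
        [Parameter(Mandatory)][string]$AuthToken            # token returned by the REST sign-in call
    )
    # Unlicensing is one way to lock the account down; a DELETE on the same URI removes the user from the site entirely
    $body = "<tsRequest><user siteRole='Unlicensed' /></tsRequest>"
    $uri  = "$Server/api/$ApiVersion/sites/$SiteId/users/$UserId"
    Invoke-RestMethod -Uri $uri -Method Put -Body $body -Headers @{ 'X-Tableau-Auth' = $AuthToken } -ContentType 'application/xml'
}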

Let me know if there are questions!

-Mike

Analytics as a Service: Tableau Rule 2 – Data


 

Forget about perfection

Your data (information) is a free-flowing, dynamic set of instructions about how the business (or whatever) is understood. It will never be perfect. In fact, you don’t want it to be; if it were perfect (which, again, is impossible), there would be no room for improvement or self-reflection. What’s more, it would lack creative impulse: you don’t think freely about something that is ‘perfect’ and done.

Adapt with data

Data is fluid

The goal should be to bring the certification/governance process *to* the data. If you must wait, collect, meet, agree, and on and on, there is a critical piece missing: data should be certified by what it produces (or its many derivations). How does this work? How can you certify a CSV file? Simple: Alert, Integrate, and Monitor your Analytics Infrastructure.

Essentially, if nothing is created from the data, why would there be a need to certify it? Once something is created, then you relentlessly certify, in flight, what is being produced. It’s a sort of fact-checking, data-alerting mechanism that’s completely possible with the right framework. Which leads to the next point…

Collect data about patterns of usage

If you’re not analyzing usage patterns, you’re missing valuable data. With all analytics, there are reasons behind (1) why a specific set of data is selected and (2) what the user is attempting to do with it. You can easily keep this metadata in AWS S3 (with a good lifecycle policy) or store it somewhere else for potential later use. The point is that if you aren’t understanding the *why*, then you are only seeing one side of the coin.
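As a hypothetical sketch of that, using the AWS Tools for PowerShell (the bucket and key naming are assumptions, and the lifecycle policy itself would be configured on the bucket separately):

Write-S3Object -BucketName 'analytics-usage-metadata' `
    -Key "tableau/usage/$(Get-Date -Format 'yyyy-MM-dd').csv" `
    -File 'C:\temp\usage_snapshot.csv'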

Keep everything, then figure out what to do with it, and then how to certify it.

More data

Preserve Order (CSV Order in Tableau)

Ever created a wonderful Tableau dashboard with the added ‘Export to CSV’ functionality? We all have. Click the super-sleek Excel icon and, voilà, the download begins. Send the file, walk away, and think: ‘My, was that cool.’

[Image ts-csv-twb: Nice and ordered.]

But wait. You get an email complaining about column order. For some reason, the columns you added so carefully are all messed up. In fact, some would say they’re in alphabetical order. What the?!

[Image ts-csv-file: Oh come on.]

Anyway, here’s an easy PowerShell function that will fix that and send the email with the columns in the correct order.
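A minimal sketch of the idea, with placeholder column names and file paths: re-read the exported CSV, force the column order you actually want, re-save, and then email it.

$columnOrder = 'Region', 'Category', 'Order Date', 'Sales'   # the order from the workbook
Import-Csv 'tableau_export.csv' |
    Select-Object $columnOrder |
    Export-Csv 'tableau_export_ordered.csv' -NoTypeInformation
# Send-MailMessage -To 'user@example.com' -Attachments 'tableau_export_ordered.csv' ...   # then send it along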

 

Automatically Choose the Max Filter Date in your Tableau Filter

Here’s a quick tip for Tableau users out there who may have tried to get their individual date filter to automatically choose the max date while (here’s the catch) still allowing for the drop-down with the other dates in it. Sure, it’s possible to use an LOD calculation to get you a Boolean, but that would be too easy. Sometimes people just want the max date selected in their filter automatically. Oh, and this also works for dynamic parameters.

[Image dynamic-filter-2: I really want the max date. Please.]

Here are the steps:

  • SQL to query for the max date in the column (this is the field used in your filter)
  • Match the date format (e.g., mm-d-yyyy or mm-dd-yyyy)
  • Update the workbook section with the filter
  • Save the workbook
  • Re-publish

The little bit of code you’ll need to dynamically do this will look like:

# Load the workbook XML, find the 'Status Date' filter, and swap in the new max date
[xml]$TsUpdateTWB = Get-Content "someWorkbook.twb"
# $tsFilterMax holds the max date returned by your SQL query
$TsUpdateTWB.SelectNodes('//filter/*') | Where-Object { $_.level -like '*Status Date*' } | ForEach-Object { $_.member = "#$tsFilterMax#" }
$TsUpdateTWB.Save("someWorkbook.twb")

That’s it! Since PowerShell makes it so easy to deal with XML, all you need to do is find the filter name and update it.

This method can work with Excel files, databases, and everything in between. As long as you can query it, you’re golden.
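The $tsFilterMax value above is just the result of that max-date query. Here’s a sketch against SQL Server (the connection string and table are assumptions; the date format only has to match what the workbook’s filter expects):

$conn = New-Object System.Data.SqlClient.SqlConnection('Server=myserver;Database=analytics;Integrated Security=True')
$conn.Open()
$cmd = $conn.CreateCommand()
$cmd.CommandText = 'SELECT MAX([Status Date]) FROM dbo.reporting_table'   # hypothetical table
$tsFilterMax = ([datetime]$cmd.ExecuteScalar()).ToString('M/d/yyyy')      # match the format your filter uses
$conn.Close()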

What I’ve done to make this totally automatic is use the FileSystemWatcher class to trigger the update to run as soon as someone either saves a file or drops the name of the extract in Slack.
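A rough sketch of the file-side trigger (the folder, filter, and the Update-WorkbookFilter helper wrapping the XML edit above are hypothetical):

$watcher = New-Object System.IO.FileSystemWatcher
$watcher.Path   = 'C:\tableau\drop'
$watcher.Filter = '*.twb'
$watcher.EnableRaisingEvents = $true

Register-ObjectEvent -InputObject $watcher -EventName Created -Action {
    $path = $Event.SourceEventArgs.FullPath
    Write-Host "New workbook detected: $path"
    # Update-WorkbookFilter -Path $path   # hypothetical wrapper around the XML edit above
}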

Please let me know if you’d like a demo, or just come to TC16 and I’ll be showing this (and more)!

Slack your Tableau Extract – Part II

Ever wish you could drop the name of your Tableau workbook and/or data source into a Slack channel and have it automatically refresh? What if you’re developing and need to get a quick refresh completed? What if you don’t have tabcmd installed on your machine? What if you want to add a step at the end of your pipeline that drops the name of the content into the Slack channel?

[Image ts-refresh-extract-0: Example of the name of Tableau content.]

I’ve talked about this before, but what had to happen was that the extract needed to already exist in the ‘background_jobs’ table. Well, that won’t always be the case, as people will be doing this for the first time. So, we needed to expand it a bit to include *all* possibilities (workbooks and data sources). Also, in this much-improved version, we Slack back to the user to let them know their extract is complete.

[Image ts-refresh-extract-2: Process for *each* extract (all dynamic).]

That’s the beauty of the ‘Tableau-Slack-Logentries’ integration. When you have a decent number of parts, the whole becomes a fascinating thing.

Here are the steps:

  • Get the data from the Logentries webhook
  • Process the data for each extract
  • Get the current Slack users (you don’t need to do this often, unless you want to)
  • Get the valid list of workbooks and data sources
  • Process the list of extracts (rolling 24 hours)
  • Get the valid list of workbook / data source owners
  • Create the Slack content object: basically, everything must add up before it runs (for example, if the person who dropped the name in the channel isn’t the owner, it won’t succeed); the Slack call back to the user is sketched after this list
  • Log it!
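The Slack call back to the user is the simple part. Here is a sketch using an incoming webhook (the webhook URL and message text are placeholders):

$payload = @{ text = "Your extract refresh for 'Sales Workbook' is complete." } | ConvertTo-Json
Invoke-RestMethod -Uri 'https://hooks.slack.com/services/T000/B000/XXXX' -Method Post -Body $payload -ContentType 'application/json'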

 

[Image ts-refresh-extract-3: Tableau Server example of the completed extract.]
[Image ts-refresh-extract-1: Corresponding message back from Tableau Server.]

If anyone is interested in the code and / or a demo, please let me know and I’ll be happy to show it.

Understanding Tableau Server Workbook Performance by knowing Geography

A few weeks back, I did a webinar with Tableau and the Server Admin User Group and I chatted about our TWB Internals database. In that talk, I mentioned a couple of tables:

  • Workbook Performance
  • Workbook Metadata

Well, I’m happy to announce the third and final piece of the data model: geo data. This is often the missing piece in the general ‘performance’ discussion on Tableau Server. What’s more, if your organization serves users both internally and externally, then performance becomes a big first impression. For example, you want to know why users in various locations mention that a view is slow to load even though it’s been tested at, say, less than two seconds. Adding the geography (and ISP) of the people requesting the view makes the data (and workbooks) a bit easier to diagnose. Bonus: you can also understand the most requested views, HTTP status codes (500s, etc.), and more.

[Image ts-geo-perf-0: TWB Internals database.]

Geo data: steps to implement

  • Pull the Apache logs (I used Logentries and its open API to pre-filter before exporting)
  • Parse the logs for basic columns: ip, datetime, request, port, etc. (see the sketch after this list)
  • Get a distinct list of IPs and look them up
  • Add to the database (choose your flavor)
  • Automate
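For the parsing step, a rough sketch against a combined-format Apache access log might look like the following; the regex, log path, and output file are assumptions and should be adjusted to your log format:

$pattern = '^(?<ip>\S+) \S+ \S+ \[(?<datetime>[^\]]+)\] "(?<request>[^"]*)" (?<status>\d{3}) (?<bytes>\S+)'
Get-Content 'C:\logs\access.log' | ForEach-Object {
    if ($_ -match $pattern) {
        [pscustomobject]@{
            ip       = $Matches.ip
            datetime = $Matches.datetime
            request  = $Matches.request
            status   = $Matches.status
        }
    }
} | Export-Csv 'parsed_access_log.csv' -NoTypeInformation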

Connect the dots

[Image ts-geo-perf-2: Parsing the Apache logs.]

The final table in the model took a few steps. It’s an IP helper table that stores information about the users who request views (city, state, latitude, longitude, etc.). Rather than look up every IP, I check this table first and omit an IP from the lookup if we already have it stored. Once the basic script was complete, we automated it and then joined the previous two tables together in order to understand the performance of a view as it relates to geography. So, your SQL to join them all together might look something like this:

SELECT
w.site
,w.user
,w.workbook
,a.datetimemst
,a.ip
,a.httpstatus
,g.city
,g.region
,g.country
,g.loc
,SUM(w.elapsed_sec) AS load_time
FROM tableau_workbook_perf AS w
JOIN tableau_apache_perf AS a ON w.req = a.requestid
JOIN tableau_apache_geo AS g ON g.ip = a.ip
GROUP BY
w.site
,w.user
,w.workbook
,a.datetimemst
,a.ip
,a.httpstatus
,g.city
,g.region
,g.country
,g.loc
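For the IP helper table itself, each new IP only needs to be looked up once. A sketch of that lookup (ipinfo.io is a stand-in here for whichever geo/ISP service you use, and $distinctIpsFromLogs / $knownIps are hypothetical) might look like:

$newIps = $distinctIpsFromLogs | Where-Object { $knownIps -notcontains $_ }
$geoRows = foreach ($ip in $newIps) {
    $geo = Invoke-RestMethod -Uri "https://ipinfo.io/$ip/json"
    [pscustomobject]@{
        ip      = $ip
        city    = $geo.city
        region  = $geo.region
        country = $geo.country
        loc     = $geo.loc
        isp     = $geo.org
    }
}
# $geoRows then gets inserted into tableau_apache_geo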

Assessing Performance

Now that we have all the pieces together, it’s time to visualize it back in Tableau (ah yes, the circle of life, or of data). We can now take the highly viewed workbooks and understand whether performance is a local/geographical issue and/or a time issue. When we combine this with the TWB Internals metadata, well, we have a very complete picture of a workbook’s value.

[Image ts-geo-perf-1: Tracking view performance along with ISP / geo data.]

Tableau Vapor Rub: Getting the goodness of Tableau, all over.

Some background:

While I had originally joked about the ‘Vapor Rub’ title, it has really, er, stuck to me at this point. It sort of describes the wonder of getting the right data into Tableau: once it’s there and properly formatted, everything just feels better.

[Image vapor_rub_0: The ideal state.]

Part I: Getting the data from all the places

The goal, then, is to get all the data we need for analysis into some sort of table (or tables). Ideally, it would be one table, but when that’s not actually possible, we’ll create our own data model. If your organization has data in a multitude of places, it becomes challenging for Tableau developers to acquire it and shape it into useful analysis without sacrificing some performance. Adding additional data sources can, depending on your approach, slow your workbook. Further, when we’re talking about massive data, well, performance will suffer if one doesn’t approach it efficiently.


Part II: Processing the data by distributing the load (PowerShell Workflows) across computers

Here’s an example: pivoting data, a lot of data. We needed to pivot pipe-delimited fields into their own columns, add them to a database, and then create a Tableau Server data source. A few caveats:

  • It needs to happen in sequence (part A needs to happen before part B)
  • It could be a long-running task, so if a part fails, I want the code to pick up where it left off
  • I need to distribute the load across computers because this is a lot of data

Now, grabbing the data via code isn’t that difficult a task. We iron out our approach, sketch out the table or tables we might need to create, and, hopefully, the data speaks. To do all this, we’ll use PowerShell Workflows.

Workflows are built around activities. These allow us to implement logic that wouldn’t necessarily fit within the function paradigm (though they are related to some extent). Activities exist on their own, so ‘bucketing’ each piece of your code helps a lot.

The difficult part is making sure we add the right checkpoints and split the load accordingly; we have a ton of files that need data pivoted, and doing that work on one machine isn’t ideal. What’s more, if we do all the hard work of parsing and it doesn’t work, or the computer crashes, we would need to start over. Having a checkpoint (via Checkpoint-Workflow) makes life a lot easier. At a very high level, our process will look like this (a condensed sketch of the workflow follows the list):

  • Find files/data (S)
  • Do the parsing/splitting, etc (S)
  • Add to database (P)
  • Create Tableau data source (P)
  • Alert users (via Slack) (P/S)

NOTE: S = Sequence; P = Parallel
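To make the shape concrete, here is a condensed sketch of such a workflow; it is not the full pipeline, and the paths and helper calls are placeholders:

workflow Invoke-PivotPipeline {
    param([string]$SourcePath)

    # Sequence: find and pivot the raw files first
    $files = InlineScript { Get-ChildItem -Path $using:SourcePath -Filter '*.txt' }
    InlineScript { Write-Output "Pivoting $(($using:files).Count) files..." }
    Checkpoint-Workflow   # if anything after this fails, resume here instead of re-parsing

    # Parallel: database load and Tableau data source creation run side by side
    parallel {
        InlineScript { Write-Output 'Loading pivoted data into the database...' }
        InlineScript { Write-Output 'Creating and publishing the Tableau data source...' }
    }

    # Alert users via Slack once both branches finish
    InlineScript { Write-Output 'Posting a completion message to Slack...' }
}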

 

Part III: Adding consolidated data to a database

After doing the work of parsing and pivoting our data, we now need to add it to the database. This is also pretty straightforward if you’ve done it once, and the right connection string can make all the difference. As mentioned above, this will be done in parallel.
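As a sketch of what that step might look like (the connection string, destination table, and the $pivotedTable DataTable built during the parsing step are all assumptions):

$connString = 'Server=myserver;Database=analytics;Integrated Security=True'
$bulk = New-Object System.Data.SqlClient.SqlBulkCopy($connString)
$bulk.DestinationTableName = 'dbo.pivoted_extract'   # hypothetical target table
$bulk.WriteToServer($pivotedTable)                   # $pivotedTable is a System.Data.DataTable
$bulk.Close()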

Part IV: Consume in Tableau

Having done the work mentioned above, making a Tableau Server data source is the easy part. We can dynamically create the extract or simply leverage our typical workflow (connect to the data and publish to Server).
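If you go the typical-workflow route, the publish step can be as simple as a tabcmd call; the server, credentials, project, and file names below are placeholders:

tabcmd login -s 'https://tableau.example.com' -u 'publisher' -p 'secret'
tabcmd publish 'pivoted_extract.tdsx' --project 'Analytics' --overwrite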

Other Applications:

In doing this, you might be wondering what other ways we can leverage PowerShell Workflows along with Tableau Server. Here are a few that work very well:

  • Tableau Extract Mover/Creator
  • Backups / Restores to various environments
  • CSV to TDE creator on a large scale
  • Creating custom reports/data files for Internal & External Users
  • A Vizable file delivery platform