Use the Logs: Tableau Server Log Analytics – Part 4

Week 4: Data Engine

For a review of what we’ve covered, make sure to check out the previous three weeks of log analytics for Tableau Server:

Week 1: Backgrounder

Week 2: Apache

Week 3: VizQL

Introduction

If you’ve read through all of the Log Analytics series, then you’re in pretty good shape when it comes to understanding the ‘voice’ of your Tableau Server. For our final week, we’re going to talk about the Tableau Server workhorse: the Data Engine. Yes, its logs might not be the friendliest of formats, but with the right Log Analytics tool and strategy, you’ll finally be able to understand how uploads progress, how long queries take to execute, how many ‘TEMP.#Tableau…’ tables there are, and more! It’s a genuinely interesting log set because it gives you a good idea of how well users are managing their extracts’ design. You should rarely see excessive query times, as the Tableau Data Engine is pretty well put together. However, as with all things, sometimes stuff just doesn’t work.

 

ts-logs-de-1
The Data Engine EKG

If your environment is both extract heavy and view heavy, following this log (and the others mentioned above) will allow you to tune your environment with ease. You can, as we do, support over 3,500 users with hundreds of extracts on one machine.

Where’s the Log

From the Tableau Server admin guide:

The Tableau Server log directory is C:\ProgramData\Tableau\Tableau Server\data\tabsvc\logs if you installed Tableau Server on drive C, unless otherwise noted in the table below.

Data Engine, then, is in the folder dataengine.

What does it do for me?

Again, from the Tableau Server admin guide:

‘There will be a tdeserver log file for each day with information about data extracts and queries, and responses to VizQL server requests.’

But there’s so much more. Some things (but not all) we watch for with the Data Engine:

  • Upload size over time
  • Query Execution time > 10 seconds
  • ‘tdeserver: connection dead’ errors

For example, if you’re using a log analytics tool, you might be able to run a query like this:

ts-logs-de-4
graph of data engine activity

 

where(/UploadSetSize:  upload_guid=\d size=(?P<UploadSize>\d*)/) calculate(sum:UploadSize) timeslice(100)

ts-logs-de-5
text details of data engine activity

With the above query, we can get information on the size of extracts being uploaded. You can even trend this over time to correlate with latency issues for users who may be experiencing ‘slowness’. You do this by picking out the session IDs from the log and tying them back to the files in the vizqlserver/logs directory (see how it all connects 🙂 ).
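The Logentries query above can be approximated offline. Here’s a minimal Python sketch that sums upload sizes per hour; the log lines are hypothetical stand-ins (real tdeserver entries carry more fields), but the regex mirrors the one in the query:

```python
import re
from collections import defaultdict

# Hypothetical tdeserver log lines; real entries carry more fields.
LINES = [
    "2016-05-01 10:00:03 UploadSetSize:  upload_guid=1 size=1048576",
    "2016-05-01 10:04:41 UploadSetSize:  upload_guid=2 size=524288",
    "2016-05-01 11:12:09 UploadSetSize:  upload_guid=3 size=2097152",
]

UPLOAD_RE = re.compile(r"UploadSetSize:\s+upload_guid=\d+ size=(?P<UploadSize>\d+)")

def upload_size_by_hour(lines):
    """Sum extract upload sizes, bucketed by the hour of the timestamp."""
    totals = defaultdict(int)
    for line in lines:
        m = UPLOAD_RE.search(line)
        if m:
            hour = line[:13]  # 'YYYY-MM-DD HH'
            totals[hour] += int(m.group("UploadSize"))
    return dict(totals)
```

The hourly bucket plays the role of the query’s timeslice(); swap in whatever granularity your correlation work needs.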

ts-logs-de-2
linking to another log via sessionid

Keep in mind, this is also how you’d want to leverage both alerting and tagging with your log analytics tool; if query time is consistently high and you’re seeing a lot of longer load times on dashboards, it may be time to investigate how users are designing their dashboards with their extracts.

ts-logs-de-0
query timeline

where(/time:(?P<Time>\d*)/) calculate(SUM:Time) timeslice(100)

As the above query will show, a consistent increase in execution time (or total time) may be something to investigate. Be aware, you can see the query being executed by searching for the ‘StatementPrepare’ portion of the log. Some of them can get pretty crazy.
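As a rough offline equivalent of the query above, here’s a sketch that pulls out the time: values and flags anything over a threshold. The surrounding fields and the millisecond unit are assumptions for illustration:

```python
import re

# Hypothetical tdeserver entries; the 'time:' token is the duration
# field the where() query above extracts. Unit assumed to be ms here.
LINES = [
    "... StatementExecute ... time:42 ...",
    "... StatementExecute ... time:12500 ...",
    "... StatementPrepare ... time:7 ...",
]

TIME_RE = re.compile(r"time:(?P<Time>\d+)")

def slow_queries(lines, threshold=10000):
    """Return extracted durations that exceed the alert threshold."""
    times = [int(m.group("Time")) for line in lines
             for m in [TIME_RE.search(line)] if m]
    return [t for t in times if t > threshold]
```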

ts-logs-de-3
Daily sessionid info, grouped by sess_guid

What should I be Alerted on?

Only those alerts which would require you to take action. Namely, these two:

  • ‘tdeserver: connection dead’ errors
  • Query Execution time > 60 seconds

So that’s it! Four weeks of Tableau Server Log Analytics and, believe it or not, there could be so much more! The four logs we’ve talked about cover a good portion of what you’d need to understand the basics of your environment.

And now that I’ve told you about consuming Tableau logs, why don’t you create your own?! Yes, in a series I’m currently working on, I’ll show you how to generate more detailed, user and security focused, logs related to Tableau Desktop (and Server).

 

Use the Logs: Tableau Server Log Analytics – Part 3

Week 3: VizQL

Intro

While the Apache/httpd log may be the most interesting of the Tableau Server log files, the VizQL logs are equally impressive, simply because they are JSON formatted. The depth of information one can gather from these logs provides a very real-time look into the happenings of one’s Server environment. You can find anything from the dimensions of a dashboard as it renders on a client machine to the number of rows returned by a Tableau Server data source for each user. And the most valuable part, well, at least one of them, is the req part of the payload, which links to the Apache logs.

ts-logs-vizql-0
Understanding total query time by user on Tableau Server
  • ts:
  • pid:
  • tid:
  • sev:
  • req:
  • sess:
  • site:
  • user:
  • k:
  • v:

The above list shows the payload for the log and, from that, you can confidently understand your Tableau Server. What’s more, in combination with a Log Analytics tool, you can easily query the log to get a more detailed sense of, for example, how long a query takes to execute for each user.

where(k="end-query") groupby(user) calculate(percentile(90):v.elapsed) sort(desc)

ts-logs-vizql-1
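If you’d rather prototype that aggregation outside a log analytics tool, here’s a minimal Python sketch of the same idea, run against hypothetical, trimmed-down VizQL JSON entries (real payloads carry the full field list shown above):

```python
import json
from collections import defaultdict

# Hypothetical VizQL JSON entries, trimmed to the fields the query uses.
LINES = [
    '{"user": "amy", "k": "end-query", "v": {"elapsed": 0.5}}',
    '{"user": "amy", "k": "end-query", "v": {"elapsed": 2.0}}',
    '{"user": "bob", "k": "end-query", "v": {"elapsed": 9.0}}',
    '{"user": "bob", "k": "begin-query", "v": {}}',
]

def p90_elapsed_by_user(lines):
    """Approximate 90th-percentile end-query elapsed time per user
    (nearest-rank via rounding), sorted descending."""
    by_user = defaultdict(list)
    for line in lines:
        entry = json.loads(line)
        if entry.get("k") == "end-query":
            by_user[entry["user"]].append(entry["v"]["elapsed"])
    result = {}
    for user, vals in by_user.items():
        vals.sort()
        idx = max(0, int(round(0.9 * len(vals))) - 1)
        result[user] = vals[idx]
    return dict(sorted(result.items(), key=lambda kv: -kv[1]))
```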

Where’s the Log

From the Tableau Server admin guide:

The Tableau Server log directory is C:\ProgramData\Tableau\Tableau Server\data\tabsvc\logs if you installed Tableau Server on drive C, unless otherwise noted in the table below.

VizQL, then, is in the folder: vizqlserver\logs. Note the difference here as we’re directly in the vizqlserver directory.

What does it do for me?

A lot.

ts-logs-vizql-4
Tableau Server Resource Manager data. Very valuable.

But there’s so much more. Some things (but not all) we watch for:

  • Resource Issues (CPU/Memory) by process (very important)
  • Workbook load time > 10 seconds
  • Average query time
  • Average query time by site/user/time of day
  • Query execution > 10/20/60 seconds (these alerts go directly to Slack)
  • Rows returned by user/site/time of day
  • Sev=FATAL

For example, if you’re using a log analytics tool, you might be able to run a query like this:

where(k="read-metadata" AND v.attributes.dbname="<some data source>") calculate(sum:v.elapsed)

ts-logs-vizql-6
Analyzing meta-data read time for a data source

With the above query, we can get information on how long it takes to read a data source’s meta-data over time.

Keep in mind, this is also how you’d want to leverage both alerting and tagging with your log analytics tool; if query time is consistently high and a dashboard consistently takes a long time to load, then this is a learning opportunity for your users.

I’ve talked about many ways you can start to aggregate these metrics and distribute a daily or weekly summary to your users. If a workbook loads slowly, it’s time to re-train and re-focus the Tableau design and analytics. Most of the time, users aren’t even aware there is a performance issue!

ts-logs-vizql-2
Performance by Sheet on Tableau Server (probably the biggest indicator of poor performance)

 

What should I be Alerted on?

Only those alerts which would require you to take immediate action. Namely, these two:

  • SEV=Fatal
  • “end-update-sheet” > 60 seconds

Since this log is JSON formatted, it’s pretty much up to you as to what you need to look for (or be alerted on). I can say, however, there is plenty of valuable information in this log that provides the answer to the most common Tableau Server question: Why does my dashboard take so long to load?

 

phx-tug-1
Summarized version of a parsed log file. Giving hourly/daily performance data

 

Use the Logs: Tableau Server Log Analytics – Part 2

Week 2: Apache

Intro

I believe the Apache/httpd log is the most interesting of the Tableau Server log files because of the depth of information it provides. What’s more, with the Unique ID at the end of each entry, you can tie this to other, more verbose, logs (for that link, you’ll need to wait until next week when I talk about VizQL).

ts-logs-httpd-3
Alerts built off of the Apache log

Thankfully, the Apache log uses the Common Log Format and, with that, you have a standard way to parse this text file. Tableau, in a testament to its vision, has added a few bits to the end of this standard. The three most important are:

  • Content-Length
  • %D (this is the best part and it is, according to the docs, ‘The time taken to serve the request, in microseconds.’)
  • Unique_ID

If you want to see the configuration, look in the httpd.conf file and look for the bits that talk about ‘mod_log_config’.

ts-logs-httpd-4
Additions to the Apache CLF

Having the time to serve each request gives us the ability to understand Tableau Server performance and, in one query, get the average time it takes to serve a request over the last {n} days:

where(/HTTP\/1\.1" "-" \d{3} \d([0-9]|1[0-9]) "\d([0-9]|1[0-9])" (?P<TTS>\d*)/) calculate(average:TTS)

ts-logs-httpd-6
Daily performance summary
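The same average can be computed offline. Here’s a minimal Python sketch against hypothetical Apache-style lines, where the last numeric field stands in for %D (time to serve, in microseconds); the regex is a simplified cousin of the one in the query above:

```python
import re

# Hypothetical Apache entries with Tableau's trailing additions; the
# final number stands in for %D (time to serve, in microseconds).
LINES = [
    '10.0.0.1 - - [01/May/2016:10:00:00 -0700] "GET /views/Sales HTTP/1.1" "-" 200 512 "10" 125000 AbCdEf123',
    '10.0.0.2 - - [01/May/2016:10:00:05 -0700] "GET /views/Ops HTTP/1.1" "-" 200 256 "12" 375000 GhIjKl456',
]

TTS_RE = re.compile(r'" "-" \d{3} \d+ "\d+" (?P<TTS>\d+)')

def average_time_to_serve(lines):
    """Average the %D field (microseconds) across all matching requests."""
    tts = [int(m.group("TTS")) for line in lines
           for m in [TTS_RE.search(line)] if m]
    return sum(tts) / len(tts) if tts else 0.0
```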

Where’s the Log

From the Tableau Server admin guide:

The Tableau Server log directory is C:\ProgramData\Tableau\Tableau Server\data\tabsvc\logs if you installed Tableau Server on drive C, unless otherwise noted in the table below.

Apache, then, is in the folder: httpd. You’ll want to follow the *.log files and (for this one), the error.log.

What does it do for me?

Again, according to the docs, ‘Apache logs. Look here for authentication entries. Each request in the Apache log will have a request ID associated with it. This request ID is used throughout the server logs and you can use it to associate log entries with a request.’

But there’s so much more. Some things (but not all) we watch for:

  • http status (200,404,500, etc)
  • IP
  • DateTime
  • Request
  • Request Type (GET/POST, etc)
  • Request ID (mentioned above)
  • Time to Serve request
  • Csv downloads
  • Login attempts
  • 404/500 (the bad ones)

For example, if you’re using a log analytics tool, you might be able to run a query like this:

where(/HTTP\/1\.1" "-" (?P<status>\d{3})/ AND status!=200) groupby(status) calculate(count) sort(desc)

ts-logs-httpd-1
http statuses over {n} days

With the above query (which we run on Logentries), we can get information on http status codes and, hopefully, all of them are good.
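For a quick offline check, here’s a Python sketch of the same status-code grouping, run against hypothetical entries (the `!= 200` filter mirrors the query above):

```python
import re
from collections import Counter

# Hypothetical Apache entries; the status code follows the quoted
# request and the "-" field, as in the where() query above.
LINES = [
    '... "GET /views/a HTTP/1.1" "-" 200 ...',
    '... "GET /views/b HTTP/1.1" "-" 404 ...',
    '... "POST /manual HTTP/1.1" "-" 500 ...',
    '... "GET /views/b HTTP/1.1" "-" 404 ...',
]

STATUS_RE = re.compile(r'HTTP/1\.1" "-" (?P<status>\d{3})')

def bad_status_counts(lines):
    """Count non-200 statuses, most frequent first."""
    counts = Counter(m.group("status") for line in lines
                     for m in [STATUS_RE.search(line)] if m
                     if m.group("status") != "200")
    return counts.most_common()
```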

Keep in mind, this is also how you’d want to leverage both alerting and tagging with your log analytics tool; you might do this because if there are {n} requests over a threshold, say 404s, you want to know about it. Conversely, you’d also want to be made aware of inactivity (or anomalies) from the logs. For example, if it was normally pretty chatty, you want to know ASAP if it has gone silent.

If you’re really feeling up to it, you can leverage regular expression field extraction and make some custom groupings out of the text file.

In this example, let’s say you wanted to do a quick check on the 90th percentile of the time to serve (NOTE: this is in microseconds) amounts just by using a REST API to query your logs. Here’s what you send with the API call:

where(/HTTP\/1\.1" "-" \d{3} \d([0-9]|1[0-9]) "\d([0-9]|1[0-9])" (?P<TTS>\d*)/) calculate(percentile(90):TTS) timeslice(60)

You can send these as alerts as they happen to users via email and/or Slack. Or you can make this as part of a performance summary (via a Tableau Dashboard) about your Tableau infrastructure. I mentioned a more robust solution a while back. You can read about it here. We automatically parse the Apache log for the 200s and then use an IP lookup tool to geo-locate possible performance issues with our Tableau infrastructure.

ts-geo-perf-0
Custom data model we have for Server performance

What should I be Alerted on?

Don’t forget: Too many wrong alerts, you miss the right alerts.

For the Apache logs, we alert on these bits:

  • Time to serve > 5 seconds
  • Excessive ‘Auto-Refresh’ events
  • Excessive CSV downloads
  • 404s > a rolling average
ts-logs-httpd-0
Same alerts/tags for an Apache log

As mentioned above, I believe the Apache log is the most useful log to gauge a lot of critical Tableau events, not the least of which is access and performance of views/workbooks/data sources.

 

Use the Logs: Tableau Server Log Analytics – Part 1

Week 1: Backgrounder

Intro

While perhaps not the flashiest or most riveting subject, log files hold a lot of information about the performance and health of any system, and Tableau is no different. While some logs are just noise, most (if not all) issues can be diagnosed just by investigating the log files. More importantly, the analysis is augmented when one combines logs with many seemingly unrelated other elements of the system (memory, cpu, disk, perfmon counters, etc).

If you’re bored, read through this article: http://ajax.sys-con.com/node/3141925. In it, the author mentions another ‘famous/popular’ article by Jay Kreps — both highlight the value in log data and log analytics. It’s still fascinating to me that, despite its enormous benefit, log data is so out of reach and, gasp, out of mind for most analysts and engineers.

For the Tableau Logs (yes, it’s a big “L” for a reason), I’m going to be doing a month-long series on the most important log files to follow: Backgrounder, Apache, VizQL and Data Engine. Alas, not all of them are pretty; most are plain text (with the exception of the VizQL logs), so we’ll need to get creative with our log analytics tools and brush up on some of our regular expression knowledge. Don’t be afraid, however; the log tool we leverage makes searching through these files as easy as writing SQL, so it literally is as simple as writing: where(/Superstore/i).

ts-logs-bk-4
30 days of backgrounder log activity

Where’s the Log

From the Tableau Server admin guide:

The Tableau Server log directory is C:\ProgramData\Tableau\Tableau Server\data\tabsvc\logs if you installed Tableau Server on drive C, unless otherwise noted in the table below.

Backgrounder, then, is in the folder: backgrounder. You’ll want to follow the *.log files.

What does it do for me?

Again, according to the docs, ‘Logs related to subscriptions and scheduled activities like extract refreshes, “Run Now” tasks, and tabcmd tasks.’

But there’s so much more. Some things (but not all) we watch for:

  • Subscriptions
  • Extract failures
  • File delete failures
  • Backgrounder process start/stop
  • Extract size as it’s stored in the repository

For example, if you’re using a log analytics tool, you might be able to run a query like this:

where(/(WARN|INFO)  com\.tableausoftware\.(?P<type>(\w+\S+))/ AND type=model.workgroup.service.subscriptions.SubscriptionRunner) calculate(count)

ts-logs-bk-2
Grouping by job type.

With the above query (which we run on Logentries), we can get information on all subscription activities and any potential errors related to it.

If you’re really feeling up to it, you can leverage regular expression field extraction and make some custom grouping out of the text file.

In this example, let’s say you wanted to do a quick check on extract refresh amounts just by using a REST API to query your logs. Here’s what you send with the API call:

where(/Running job of type :(?P<TsJob>\w+)/) groupby(TsJob) sort(desc)

Yes, I know there are other ways to do that but (and here’s where it all becomes pretty powerful), you can send these as alerts as they happen to users via email and/or Slack. Or you can make this as part of a performance summary (via a Tableau Dashboard) about your Tableau infrastructure. There really is no limit when dealing with the log data; rather, it’s just making sure you don’t go insane reading through the noise.
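If you want to prototype the job-type grouping before wiring up alerts, here’s a minimal Python sketch of the same groupby, run against hypothetical backgrounder lines (the ‘Running job of type :’ token is the one the query above extracts):

```python
import re
from collections import Counter

# Hypothetical backgrounder lines; 'Running job of type :' is the
# token the where() query above extracts from.
LINES = [
    "2016-05-01 02:00:01 ... Running job of type :RefreshExtracts",
    "2016-05-01 02:05:12 ... Running job of type :Subscriptions",
    "2016-05-01 02:10:44 ... Running job of type :RefreshExtracts",
]

JOB_RE = re.compile(r"Running job of type :(?P<TsJob>\w+)")

def job_counts(lines):
    """Count backgrounder jobs by type, most frequent first."""
    return Counter(m.group("TsJob") for line in lines
                   for m in [JOB_RE.search(line)] if m).most_common()
```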

What should I be Alerted on?

As with all “alerting/text messaging/email (basically anything that forces you to switch context)”, there is a rule: too many wrong alerts, you miss the right alerts.  We integrate with Slack and leverage Tableau for summary views of the data.

ts-logs-bk-5
If not alerts, then tags for other necessary analytics

For the backgrounder logs, we alert on 2 log bits:

  • Backgrounder process starting/stopping
  • Subscriptions to certain workbooks (getting difference between total time and run time)
ts-logs-bk-3
Some examples of bits you can find in the log

While most alerts in the backgrounder logs aren’t mission critical, it’s still nice to know how often extracts refresh or fail, the total time for subscriptions, and the like. And, as we start putting together the larger Tableau Log Infrastructure, you’ll see how these logs start to connect to give a bigger picture of performance and health.

Stay tuned for next week where we’ll discuss everything about the Apache logs (httpd).

In the meantime, check out Paul Banoub’s look at all things related to the backgrounder.

 

Slack your Tableau Extract

Some background

As most of you might already know, I’m a pretty big fan of making things simple and more automated in order to get data to the people. When someone has to muck around in an area that might take him/her away from their analytics ‘flow’, I see opportunity.

Enter the ‘refresh-my-tableau-datasource-via-Slack’ bits.

Why would we do this? Well, perhaps tabcmd isn’t on the server or client. Or perhaps it’s easier to make an API call and post a message at the end of a pipeline rather than write to a file and script a refresh from there. But wait, even better: it might be easier for an analyst and the like to simply type a ‘Refresh: <some data source or twb>’ message into a Slack channel.

ts-slack-extract-
Sample text coming across the wire

Before you think, ‘wow, this could escalate quickly’, rest assured there are checks and balances. Here are a few that we’ve implemented:

  • Only the owner of the twb/data source can refresh
  • Can’t refresh more than 1x per max refresh time (you should already be keeping these stats)
  • We log all of these in addition to our regular Tableau logging
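The “can’t refresh more than 1x per max refresh time” check boils down to a simple interval comparison. Here’s a hedged Python sketch of that gate (the one-hour minimum is an illustrative choice, not necessarily the value we use):

```python
from datetime import datetime, timedelta

def may_refresh(last_refresh_at, now, min_interval=timedelta(hours=1)):
    """Allow a refresh only if the minimum interval has elapsed since
    the last one, or if there was no previous refresh at all."""
    if last_refresh_at is None:
        return True
    return now - last_refresh_at >= min_interval
```

In practice, last_refresh_at would come from your refresh stats (e.g. the background_jobs query shown below in the implementation section).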

Anyway, this is pretty easy to set up because of the early work with the ‘Reset my Tableau password via Slack’ bits.

 

 

The Tools

  • PowerShell
  • Logentries
  • Slack
  • Tableau Server, of course

 

Implementation

Create a Slack channel where all of this chatter will go and then with Logentries, set this up. After you’ve done that, run the code to make all the magic happen.

ts-slack-extract-1
Post in a Slack channel w/ name of TWB or Data Source

NOTE: For one of the checks and balances mentioned above, you will want to limit overuse of the extract refresh. For example, if someone gets excited and drops the TWB/data source name in there 15 times in an hour, this code will prevent that from happening:

SELECT
date_trunc('hour',created_at) - INTERVAL '7 hour' as "StartHour"
,COUNT(*) as "Total"
,AVG(EXTRACT(EPOCH FROM (completed_at-created_at))) as "TotalTime_Avg_Sec"
,AVG(EXTRACT(EPOCH FROM (completed_at-started_at))) as "RunTime_Avg_Sec"
FROM background_jobs
WHERE job_name LIKE 'Refresh Extracts' and title = '<your extract variable>'
GROUP BY
date_trunc('hour',created_at) - INTERVAL '7 hour'

After you’ve successfully triggered the extract, you can certainly send a message back to the user and/or notify them in the event of a potential extract refresh failure. Did I mention we have code for that?

For those interested in the nuts and bolts of the implementation, please don’t hesitate to let me know.