Postprocess searches – pitfalls galore

As you develop a custom view, you start with one chart or one table. After a while you've added and added, and now you're dispatching several searches. Often many of those searches are quite similar to each other: you're getting the same events off disk more than once, and you're making Splunk do extra work. If you get the nagging feeling that there's a better way, you're right; it's called "postProcess" and it's part of the core Splunk API.

Post process searches allow you to avoid this inefficiency. They let you dispatch only one "base search", get the events off disk only once, and then, at request time, carve up that base set of results in two or more different ways to render different 'slices'.

The first thing everyone does is very intuitive: they make a "base search" that is a simple search returning raw events, and they make postProcess searches that contain transforming commands like stats or timechart. Makes perfect sense, and it's a TERRIBLE IDEA. DO NOT DO THIS. Read on.

Skipping to the end – “what could go wrong?”

  1. PITFALL #1: base search is a “pure events” search that matches more than 10,000 events.

    Splunk behaves a little differently when the 'search results' are actually events. In particular, it does not preserve complete information about the events once you pass 10,000 rows. The problem is that you get no warning about this: the rows are silently discarded from your base search in the postProcess dashboard, and therefore your postProcessed results will be wrong. Conversely, if the base search contains transforming commands like stats, Splunk preserves all the rows in the base search results, to 10,000 rows and beyond.

    You have fallen into this pit when the postProcessed results displayed seem wrong or truncated, or WORSE, they don't seem wrong and you don't find out they are wrong until much later.

  2. PITFALL #2: base search is a “pure events” search and postprocess uses a field not explicitly named in base search.

    If a field is not mentioned explicitly somewhere in your base search, splunkd decides it doesn't need to extract and preserve that field's values when it runs the job. Then, come postProcess time, that field will be absent and you'll be extremely confused. If you always group your desired fields and rows with the stats command, everything is much more explicit and you sidestep this confusion.

    You have fallen into this pit when you've spent hours staring at your config wondering why your postProcess search acts like some field isn't there.

  3. PITFALL #3: avoid using postProcess searches in cases where the number of rows returned by the 'base search' is extremely high. You're setting yourself up for very bad dashboard performance.

    You have fallen into this pit when your slick postProcess-heavy dashboard actually has terrible performance.

    Note that a corollary of this pitfall is that you should avoid using a "pure events" search as your base search, because such searches return a large number of rows. Throw a "stats count sum(foo) by bar baz" on there and summarize the rows down to the ones you'll actually use.

  4. There are other strong reasons not to use a "pure events" search as the base search when you're using postProcess, but they're extremely technical and have to do with map-reduce, distributed search, and all kinds of tweaky things that would take too long to explain. Just don't do it, OK?
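To make the workaround for Pitfall #2 concrete: if for some reason you really must keep raw events in the base search (the stats-based base search described above is still the better fix), one sketch is to explicitly name every field your postProcess searches will need, for example with the fields command, so splunkd knows to extract and preserve them when the job runs:

```
Base search:  index=_internal source="*web_access.log" | fields file, status, bytes

PostProcess:  | stats count by status
```

This only addresses the missing-fields confusion; Pitfalls #1 and #3 still apply to an events-returning base search.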

If you’ve read this far perhaps you’re hunting for specific examples. Here are two!

Example 1) How to use postprocess when _time is not involved

Below we’re using access data from SplunkWeb to show a table of the bytes transferred by filename, and also the number of requests by HTTP status. In the normal approach we’d have to use two different searches:

  1. index=_internal source="*web_access.log" | stats sum(bytes) as totalBytes by file | sort - totalBytes
  2. index=_internal source="*web_access.log" | stats count by status | sort - count

Notice that both searches have to get the same events off disk. This makes it a good candidate for post process.

THE WRONG BUT INTUITIVE WAY

Base search: index=_internal source="*web_access.log"

  1. PostProcess 1: | stats sum(bytes) as totalBytes by file | sort - totalBytes
  2. PostProcess 2: | stats count by status | sort - count

This is wrong for several reasons, and it won't work anyway. See the "Pitfalls" section above to find out why.

THE RIGHT WAY

Base search: index=_internal source="*web_access.log" | stats count sum(bytes) as totalBytes by file, status

  1. PostProcess 1: | stats sum(totalBytes) as totalBytes by file | sort - totalBytes
  2. PostProcess 2: | stats sum(count) as count by status | sort - count
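As a sketch of how this wires up in a view: in a Simple XML dashboard, the base search gets an id attribute and each postProcess search references it via base=. (The dashboard label, time range, and panel layout below are illustrative assumptions, not part of the example above.)

```xml
<dashboard>
  <label>Web access overview</label>
  <!-- Base search: dispatched once, so the events come off disk only once -->
  <search id="base">
    <query>index=_internal source="*web_access.log"
      | stats count sum(bytes) as totalBytes by file, status</query>
    <earliest>-24h@h</earliest>
    <latest>now</latest>
  </search>
  <row>
    <panel>
      <table>
        <!-- PostProcess 1: bytes transferred by filename -->
        <search base="base">
          <query>| stats sum(totalBytes) as totalBytes by file | sort - totalBytes</query>
        </search>
      </table>
    </panel>
    <panel>
      <table>
        <!-- PostProcess 2: request count by HTTP status -->
        <search base="base">
          <query>| stats sum(count) as count by status | sort - count</query>
        </search>
      </table>
    </panel>
  </row>
</dashboard>
```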

Example 2) How to use postProcess when time is involved

When time is involved, you have to use Splunk’s “bin” command to bucket all the values into some reasonable number of time buckets.

Here’s a similar example to the above, except instead of the ‘request count by status’ on the right, we want the right side to show a ‘count over time by status’:

THE WRONG, BUT INTUITIVE WAY

Base search: index=_internal source="*web_access.log"

  1. PostProcess 1: | stats sum(bytes) as totalBytes by file | sort - totalBytes
  2. PostProcess 2: | timechart span=15min count by status

This is wrong for several reasons, and it won't work anyway. See the "Pitfalls" section above to find out why.

THE RIGHT WAY

Base search: index=_internal source="*web_access.log" | bin _time span=15min | stats count sum(bytes) as totalBytes by file, _time, status

  1. PostProcess 1: | stats sum(totalBytes) as totalBytes by file | sort - totalBytes
  2. PostProcess 2: | timechart span=15min sum(count) by status
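And the time-based version in Simple XML: the left panel stays a table, while the right panel becomes a chart fed by the pre-bucketed base search. (Again, the label, time range, and layout are illustrative assumptions.)

```xml
<dashboard>
  <label>Web access over time</label>
  <!-- Base search: rows pre-bucketed by _time so both panels can reuse them -->
  <search id="base">
    <query>index=_internal source="*web_access.log"
      | bin _time span=15min
      | stats count sum(bytes) as totalBytes by file, _time, status</query>
    <earliest>-24h@h</earliest>
    <latest>now</latest>
  </search>
  <row>
    <panel>
      <table>
        <!-- PostProcess 1: bytes transferred by filename -->
        <search base="base">
          <query>| stats sum(totalBytes) as totalBytes by file | sort - totalBytes</query>
        </search>
      </table>
    </panel>
    <panel>
      <chart>
        <!-- PostProcess 2: request count over time by HTTP status -->
        <search base="base">
          <query>| timechart span=15min sum(count) by status</query>
        </search>
      </chart>
    </panel>
  </row>
</dashboard>
```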