
John McGovern is a Principal Systems Engineer at ExtraHop, and a wizard at building metrics.
Part 4: So what do I do with all this data anyway?
Or How I stopped worrying and learned to love Open Data Stream
All right, we think we have a ton of awesome information, but how do we display this stuff so that we can figure out which users are having problems, or which actions have consistent issues, and then work on fixing them?
So remember that option from the Session table setup, notify: true? Well, that means when these username keys expire, a SESSION_EXPIRE event will fire and we can start storing this data in a number of places. First, let's store it in the native ExtraHop Datastore. Our datastore is built for key:value pairs and built for high-velocity data (think 40Gbps high velocity). So how do we take a JavaScript Object with tons of data and boil it down to a key:value pair? Well, remember why we are doing this: we want to figure out which users are having problems, and what is causing them the most problems on a transactional level. We can do that pretty easily by storing the slowest processes by user per tier. So, basically, what we need to do is sort the Object arrays for Web and App and get the highest processing time entries, and we can do this every time one of those Session table entries expires.
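As a refresher, the Session table setup from the earlier part would have looked something like the sketch below. The key format, the expiry value, and the username/userObject variables are illustrative assumptions; the important piece is the notify: true option, which is what makes SESSION_EXPIRE fire for these keys.
// Sketch of the earlier Session table setup (key name, expiry, and variables are assumptions).
var sessionKey = "user." + username;
Session.add(sessionKey, JSON.stringify(userObject), {
    expire: 1800,   // assumed idle expiry, in seconds
    notify: true    // fire SESSION_EXPIRE when this key expires
});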
In the following SESSION_EXPIRE event we will loop through all expired user sessions and begin to compute totals and commit/export data where configured. First we grab our Object containing the full user experience, then we sort through the Web and App tiers for the slowest individual responses. After that is complete, we commit a metric per user with the slowest Web and App tier requests. Lastly, we export the full JSON Object via HTTP, MongoDB, or Syslog.
// SESSION_EXPIRE: process every expired user session and find the slowest
// Web and App tier transactions for each user.
else if (event == "SESSION_EXPIRE") {
    var keys = Session.expiredKeys;
    for (var i = 0; i < keys.length; i++) {
        // Rehydrate the per-user Object we built up in the Session table.
        var tmpObject = JSON.parse(keys[i].value);
        // Seed the "slowest" trackers with the first entry from each tier.
        var web_tprocess = tmpObject.web[0].tprocess || 0;
        var web_path = tmpObject.web[0].path;
        var app_tprocess = tmpObject.app[0].tprocess || 0;
        var app_soapAction = tmpObject.app[0].soapAction;
        // Find the slowest Web tier request for this user.
        for (var j = 1; j < tmpObject.web.length; j++) {
            if (tmpObject.web[j].tprocess > web_tprocess) {
                web_tprocess = tmpObject.web[j].tprocess;
                web_path = tmpObject.web[j].path;
            }
        }
        // Find the slowest App tier SOAP action for this user.
        for (var k = 1; k < tmpObject.app.length; k++) {
            if (tmpObject.app[k].tprocess > app_tprocess) {
                app_tprocess = tmpObject.app[k].tprocess;
                app_soapAction = tmpObject.app[k].soapAction;
            }
        }
        // Build the key that will show up in the ExtraHop GUI.
        var displayKey = "User: " + keys[i].name + "\n" +
            "WEB: " + web_path + ":" + web_tprocess + "\n" +
            "APP: " + app_soapAction + ":" + app_tprocess;
        // Use the slower of the two tiers as the value for this user.
        var displayValue;
        if (web_tprocess > app_tprocess) { displayValue = web_tprocess; }
        else { displayValue = app_tprocess; }
        // (The commit and the Open Data Stream export shown below also run inside this loop.)
Those of you familiar with programming and scripting might recognize what is happening above, but let me break it down. Skipping the comment at the top and the SESSION_EXPIRE check gets us to the area where we do some work. When SESSION_EXPIRE occurs, we get all the keys that expired with the notify flag set to true. So that's right, we get a bunch of them all at the same time. We then loop through each key, parse the Object, and start looking for our slowest processing time per tier. In the first for loop we find the slowest Web tier path, and in the second we do the same for the App tier's SOAP action. Then all that work becomes what you will see in the ExtraHop GUI as a key:
var displayKey = "User: " + keys[i].name + "\n" + "WEB: " + web_path + ":" + web_tprocess + "\n" +"APP: " + app_soapAction + ":" + app_tprocess;
Here we build a key that has the user, the Web tier path that was slowest with its processing time, and the App tier SOAP action that was slowest with its corresponding processing time. Lastly, we set the value to be the higher of the two so we have a way to track which user had the most pain relative to their peers.
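The listing above stops just before the commit itself. Inside that same loop, the per-user commit could look something like this minimal sketch; the application name and the custom metric name are assumptions for illustration, not the original values:
// Hypothetical commit (application and metric names are assumptions):
// store each user's slowest Web/App transactions as a detail dataset entry.
Application("CRM").metricAddDetailDataset("crm-slowest_by_user", displayKey, displayValue);
With a detail dataset like this, the GUI can list users keyed by the displayKey string and rank them by the displayValue processing time, which is how you spot the users having the most pain.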
So that mainly handles the ExtraHop Streaming Datastore version, but as you can probably tell, there would be some value to being able to see all of the user's behavior, not just the slowest single transaction per tier. That's a case that we handle by sending transaction records to our ExtraHop Explore appliance, which we recently announced as part of version 5.0 of our platform. But what if you already have a third-party datastore that you want to utilize to index and store transaction data?
Enter Open Data Stream
Here at ExtraHop, we have a very strong belief that the data we produce and give you access to is your data, and as such, you should be able to take that data and do whatever you want with it. One way we uphold that promise is via Open Data Stream, which allows you to send data out of the ExtraHop in real time, via Application Inspection (AI) Triggers, to targets of your choice. So, for this example, I put together Open Data Stream configurations for three of the most popular mechanisms we offer today: Elasticsearch, MongoDB, and Syslog.
The following lines allow a full export of the complete JSON Object created per user. Examples are included for Elasticsearch (via HTTP), MongoDB, and Syslog, since we have a few options for output. The gain in sending this out is the ability to catalog every path and soapAction a user went to, along with timing information.
//HTTP (Elastic, AppD, etc.)
var headers = { "Content-Type": "application/json" };
Remote.HTTP("my_destination").post({
    path: "/",
    headers: headers,
    payload: JSON.stringify(tmpObject)
});
//MongoDB (Try an insert first; if that fails, attempt an update.)
var passInsert = Remote.MongoDB.insert('sessions.crm.user.' + keys[i].name,
    JSON.stringify(tmpObject));
if (!passInsert) {
    var passUpdate = Remote.MongoDB.update('sessions.crm.user.' + keys[i].name,
        JSON.stringify(tmpObject));
    if (!passUpdate) {
        debug("Could not add/update an entry in table sessions.crm: " +
            JSON.stringify(tmpObject));
    }
}
//Syslog (Any machine data tool for the most part.)
Remote.Syslog("my_destination").info("eh_event=usertrace" +
" user=" + keys[i].name +
" json=" + JSON.stringify(tmpObject));
Just like that, the ExtraHop will send the complete Object, with every transaction for both tiers, by username and processing time. At that point, the data can be commingled with whatever other data you have, and certain transaction problems may make themselves visible. If only there were a way to bridge the gap and have the ExtraHop tell you about transactions automatically when they exceed your nominal transaction time…
Part 4 Recap
In this part, we went through and reaped the benefits of all of our previous work. Now we know, in the ExtraHop UI, every user having slow performance and what they were doing when that occurred. Once that level of focus is obtained, solving a problem becomes a lot easier because you know exactly where to drill in. Plus, by using our Explore appliance or Open Data Stream, the full transaction log can be retained for other purposes like security, auditing, or negotiating more reasonable SLAs, because now you have the data to back you up.
Part 5: Bridging the Gap
In our last installment, we sent our data out of the ExtraHop appliance to other platforms using the Session table and the Flow table with Open Data Stream. Another option for more business- or SLA-based metrics, though, is to use the metrics we have created in the ExtraHop Streaming Datastore. To further the example we have been working on: now that we have information, by username, on which parts of the application are struggling, what if we used that data to validate the SLAs we have agreed to as an organization for our customers?
Why are SLAs important to monitor? Mainly because, for your internal clients, they represent a promise, and while failing to meet them may not incur a direct monetary penalty, it will incur more of a karmic penalty with those teams. If it is an external SLA, then money is on the line, and we need to meet it because we pay for failure.
How can the ExtraHop help with SLA monitoring of this CRM application? Great question. We already have the data we need for performance; all we need to do is check that against the SLA agreed upon.
Bridge triggers are a recently added piece of functionality that allows an ExtraHop user to access metrics just as they are being committed to the datastore. This means we can validate our SLA against the over 3,300 out-of-the-box metrics, or any of the custom metrics we have defined in our Application Inspection Triggers. In our example, we have a few metrics that cover user experience, and they all have either an HTTP path or a SOAP action. So we can leverage a bridge trigger to get this done.
Functions
Now a quick segue before we get to the bridge trigger in question. When you operate on the Bridge, you get ALL metrics committed during the cycle you choose. You could look at each metric one by one to see if it is a match, but a more efficient way is to build a few reusable JavaScript functions. Below we have an example that will process the metrics received during a Metric Cycle event. The function grabs the metrics and first checks whether the object the metric references matches the object we created with our Session table trigger. Then we search for any custom metrics committed to that object. This narrows our scope in the sea of metrics being committed.
We then look up the metric name we were committing back in our Session table trigger (the customDSETs.lookup() call). If we find both the proper object and the metric we want on that object, we use another function, processStat(), to process the content of that metric and see if it violates our SLA.
function processMetric() {
    var id = MetricRecord.id,
        deviceId = MetricRecord.object.id,
        fields = MetricRecord.fields;
    var f, stat;
    // Only look at custom detail metrics committed against our App tier object.
    if (id == 'extrahop.application.custom_detail' && deviceId == 'CRM-AppTier') {
        var customDSETs = fields['custom_dset'];
        if (customDSETs == null) {
            return;
        }
        // Find the custom metric we committed in the Session table trigger.
        var customApp_Metrics = customDSETs.lookup('crm-tprocess_soapAction');
        if (customApp_Metrics == null) {
            return;
        }
        var customEntries = customApp_Metrics.value.entries;
        // Check each entry (one per path/SOAP action) against the SLA.
        for (var entry in customEntries) {
            processStat(id,
                f,
                customEntries[entry].value,
                customEntries[entry].key);
        }
        // Commit a count per path/SOAP action that violated the SLA.
        // ("app" is assumed to be an Application object defined elsewhere in the trigger.)
        for (var i = 0; i < uriArray.length; i++) {
            var uriCount = uriCountArray[i] || 0;
            app.metricAddDetailCount('CRMApp-URI-Count', uriArray[i], uriCount);
        }
        uriArray = [];
        uriCountArray = [];
    }
}
Here is processStat(), so we can take a look at it as well:
This function takes a stat of any type, and if it's a Dataset, it reviews the URI and tprocess time to determine whether the timing is over 2 seconds (2000 ms). If so, it searches uriArray for that URI; if found, it increments the count, and if not, it adds the URI and starts the count.
function processStat(id, f, stat, key) {
    var keyStr = getKeyStr(key);
    if (stat instanceof Topnset) {
        // Topnsets contain nested stats; walk them with another helper.
        processTopnset(id, f, stat);
    } else if (stat instanceof Dataset) {
        var findUri = false;
        // SLA check: anything slower than 2000 ms (2 seconds) is a violation.
        if (stat.percentile(100) > 2000) {
            // If we have already seen this URI, just bump its count.
            for (var k = 0; k < uriArray.length; k++) {
                if (uriArray[k] == keyStr) {
                    findUri = true;
                    uriCountArray[k] = uriCountArray[k] + 1;
                }
            }
            // Otherwise, add the URI and start its count at 1.
            if (!findUri) {
                uriArray[uriArray.length] = keyStr;
                uriCountArray[uriCountArray.length] = 1;
            }
        }
    }
}
Even this function references other helper functions. Part of the beauty of having a full JavaScript interpreter in the ExtraHop platform is that you can start to build out bits of script to reuse over and over again. In this case, processStat() takes each custom metric entry found by processMetric() and checks whether the individual processing times are over 2 seconds, which is our SLA in this example. If they are, the function builds a JavaScript array and puts all of the violating HTTP paths and SOAP actions into it.
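One of those helpers, getKeyStr(), isn't shown here. A hypothetical version might look like the following; since detail-metric keys aren't always plain strings, it simply normalizes whatever it receives into a string:
// Hypothetical helper (the original getKeyStr() is not shown in the post).
function getKeyStr(key) {
    if (key === null || key === undefined) {
        return "";
    }
    if (typeof key === "string") {
        return key;
    }
    // Fall back to the key's string form for non-string key types.
    return key.toString();
}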
After all this is said and done, we commit a simple count metric for every HTTP path or SOAP action, each time it violates the SLA. This could be used for alerting, or to focus continuous-improvement efforts so that these actions perform better in the future.
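For completeness, the wiring at the top of such a bridge trigger would be little more than the following sketch. The event name, the application name, and where uriArray/uriCountArray get declared are assumptions on my part; the heavy lifting happens in the functions above.
// Sketch of the bridge trigger wiring (event name, application name, and declarations are assumptions).
var uriArray = [];        // HTTP paths / SOAP actions that violated the SLA
var uriCountArray = [];   // violation counts, parallel to uriArray
var app = Application("CRM");   // assumed Application object used by processMetric()

if (event === "METRIC_RECORD_COMMIT") {
    // Hand each committed metric record to the reusable processing function.
    processMetric();
}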
Part 5 Recap
This was a bit more of an advanced use case, but it can be implemented simply and provides a lot more value from the work done earlier. Another advantage of using bridge triggers is that you can create custom trouble groups, which can serve as new heuristics that ops teams can leverage to make better decisions more quickly.