SLA Engine and commitments


There are many ways and reasons to define an SLA inside ServiceNow, but most of you will be familiar with the concept of standard SLAs: you define conditions for the start, pause and stop of the SLA measurement, and the engine counts the time on and off the clock accordingly. It becomes interesting when you want to tie these SLA measurements to service performance and track the commitments you have agreed on. Especially when penalties are involved, it becomes critical that you define your SLAs well and keep proper track of them. My colleague Benjamin van de Water wrote a nice piece a couple of weeks ago with some tips on proper agreements and on making sure your IT organization is aligned across all levels of them: Things to Consider when defining SLAs, OLAs and UCs.

Assuming all of these agreements are well defined, how can we track them against specific commitments? If I have an offering for my clients, how do I make sure that my service performance is properly measured and tracked? For this, ServiceNow has a decent plugin available, called ‘Service Portfolio Management – SLA Commitments’. This plugin lets you track performance against each individual service commitment, per service offering. It also allows you to specify Service Offering SLAs that only measure performance against the specified service offering, even though the conditions might state otherwise. You can read more about this plugin on the wiki: Plugin Info.

Unfortunately, using this plugin turned out to be a strange affair, as I found out recently. When you activate the plugin, the service offering field becomes available on the task table. This field can be used to segregate the CIs that are in use for specific service offerings (e.g., I can show only the CIs that support my Email service offering for client X). But no matter which service offering I used, the properly set-up data would not trigger any SLA. To figure out what was happening, I started digging a bit.

Service Offering SLAs are quite a recent addition to the gamut of measurements in ServiceNow. Because they are recent, older versions still have to be supported as well, which means old and new engines live side by side. And sure enough, if you look into the code, there are two variants of the engine: 2010 and 2011. Of course this doesn’t make the code easier to read, but let’s march on.
I noticed the property was currently set to use the 2011 version of the engine. This means we will be investigating the ‘TaskSLAController’ script, and specifically the one piece of code that impacts the engine most. I had to read it several times before I got it, but here’s what happens:

// Check active SLA Definitions
var slaGR = new GlideRecord('contract_sla');
slaGR.addActiveQuery();
slaGR.addQuery('collection', this.taskGR.getRecordClassName());
// avoid service_offering_sla definitions, if they might exist
if (slaGR.isValidField('sys_class_name'))
    slaGR.addQuery('sys_class_name', 'contract_sla');

SelfCleaningMutex.enterCriticalSection(this.MUTEX_NEW + this.taskGR.sys_id, this, this._processNewSLAs_criticalSection, slaGR);
// TODO: optionally attach work-notes
if (this.timers)
    sw.log('TaskSLAController: Finished _processNewSLAs part 1');

// and active Service Offering SLA definitions
// (TODO: merge this contract_sla query with the previous one, to process all of them in one go)
var socGR = new GlideRecord('service_offering_commitment');
if (!socGR.isValid())
    return;

var commitmentFieldTest = new GlideRecord('service_commitment');
if (!commitmentFieldTest.isValidField("sla"))
    return;

if (this.timers)
    sw = new GlideStopWatch();
// (using contract_sla GlideRecord to easily avoid
// those that are currently active and assigned to the task)
slaGR.initialize();
slaGR.addActiveQuery();
slaGR.addQuery('collection', this.taskGR.getRecordClassName());
// service_commitment.type='SLA'
slaGR.addQuery('JOINcontract_sla.sys_id=service_commitment.sla!type=SLA');
// service_offering_commitment.service_offering=cmdb_ci
slaGR.addQuery('JOINservice_commitment.sys_id=service_offering_commitment.service_commitment!service_offering=' + this.taskGR.cmdb_ci);

SelfCleaningMutex.enterCriticalSection(this.MUTEX_NEW + this.taskGR.sys_id, this,
    this._processNewSLAs_criticalSection, slaGR);


This bit of code runs whenever the asynchronous processor runs to determine whether SLAs need updating. This includes attaching new SLAs, which is the bit I’m interested in (my SLAs weren’t getting attached). It collects both types of SLAs: Contract SLAs (normal and standard SLAs) and Service Offering SLAs (new type of SLA, tied to offerings).

It then processes them against the offerings they are tied to. Now focus on the following bit of code:

slaGR.initialize();
slaGR.addActiveQuery();
slaGR.addQuery('collection', this.taskGR.getRecordClassName());
// service_commitment.type='SLA'
slaGR.addQuery('JOINcontract_sla.sys_id=service_commitment.sla!type=SLA');
// service_offering_commitment.service_offering=cmdb_ci
slaGR.addQuery('JOINservice_commitment.sys_id=service_offering_commitment.service_commitment!service_offering=' + this.taskGR.cmdb_ci);

Wait, the offering is expected in cmdb_ci? Why would it join on that field, when our service offering is stored in the service_offering field? OK, apparently the query that runs on the Contract SLA table also runs on the Service Offering SLA table, and it decides which SLA to attach based on the join between service offering and service commitment. It finds both Contract SLAs and Service Offering SLAs, as one extends from the other.

If the service offering from the task is found in one of the commitments, and that commitment has SLA type records tied to it, the SLAs should start running. But the code also clearly specifies that the field in which the service offering needs to be stored is the ‘cmdb_ci’ field. That field is most commonly used for the CI that is affected by or causing the actual service disruption, or at the very least the most important CI for the type of ticket we are logging.
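To make the effect of that join concrete, here is a minimal plain JavaScript sketch that runs outside ServiceNow. It models the two tables as in-memory arrays and mimics only the matching condition from the excerpt. The record values and the `findOfferingSlas` helper are hypothetical, and this is not the platform's implementation.

```javascript
// Hypothetical in-memory stand-ins for the tables referenced in the join.
// Field names follow the excerpt; all values are invented for illustration.
var serviceCommitments = [
    { sys_id: 'commit1', type: 'SLA', sla: 'sla_email_p1' },
    { sys_id: 'commit2', type: 'Availability', sla: '' }
];
var serviceOfferingCommitments = [
    { service_offering: 'offering_email_client_x', service_commitment: 'commit1' },
    { service_offering: 'offering_email_client_x', service_commitment: 'commit2' }
];

// Returns the SLA ids whose commitment is of type 'SLA' and is linked to
// the given offering id. This mirrors only the matching condition.
function findOfferingSlas(offeringId) {
    var slas = [];
    serviceOfferingCommitments.forEach(function (soc) {
        if (soc.service_offering !== offeringId)
            return;
        serviceCommitments.forEach(function (sc) {
            if (sc.sys_id === soc.service_commitment && sc.type === 'SLA')
                slas.push(sc.sla);
        });
    });
    return slas;
}

// A task filled in the way I expected it to work: the primary CI in
// cmdb_ci and the offering in service_offering.
var task = {
    cmdb_ci: 'ci_mail_server_01',
    service_offering: 'offering_email_client_x'
};

// What the excerpt does: match on the value in cmdb_ci.
var attachedByEngine = findOfferingSlas(task.cmdb_ci);
// What I expected: match on the value in service_offering.
var attachedIfOfferingFieldUsed = findOfferingSlas(task.service_offering);

console.log(attachedByEngine.length);     // 0, nothing attaches
console.log(attachedIfOfferingFieldUsed); // contains 'sla_email_p1'
```

With the offering only in service_offering, the lookup keyed on cmdb_ci finds no commitment at all, which matches what I saw: no Service Offering SLA gets attached.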

There is also a bit of code in the ‘SLAConditionBase’ script include, referring to the cmdb_ci field:

_cancelServiceOffering: function() {
    var soc = new GlideRecord("service_offering_commitment");
    if (!soc.isValid())
        return false;
    if (this.sla.sys_class_name != "service_offering_sla")
        return false;

    soc.addQuery("service_commitment.type", "SLA");
    soc.addQuery("service_commitment.sla", this.sla);
    soc.addQuery("service_offering", this.task.cmdb_ci);
    soc.query();
    if (soc.hasNext())
        return false;

    this.lu.logInfo("canceling Service Offering SLA: " + this.sla.name + " - task is now against different CI");
    return true;
},
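The same decision can be sketched in plain JavaScript, again outside ServiceNow and on invented data. The `shouldCancelOfferingSla` function and the flattened `offeringCommitments` array are hypothetical stand-ins, and the sketch leaves out the table-validity check from the excerpt.

```javascript
// Hypothetical flattened view of service_offering_commitment joined to
// service_commitment; values are invented for illustration.
var offeringCommitments = [
    { service_offering: 'offering_email_client_x', type: 'SLA', sla: 'sla_email_p1' }
];

// Mirrors the decision in the excerpt: a Service Offering SLA is kept only
// while the value in the task's cmdb_ci field matches an offering that is
// committed to that SLA.
function shouldCancelOfferingSla(sla, task) {
    if (sla.sys_class_name !== 'service_offering_sla')
        return false; // this check never cancels plain contract SLAs
    var stillCommitted = offeringCommitments.some(function (soc) {
        return soc.type === 'SLA' &&
            soc.sla === sla.sys_id &&
            soc.service_offering === task.cmdb_ci;
    });
    return !stillCommitted;
}

var offeringSla = { sys_id: 'sla_email_p1', sys_class_name: 'service_offering_sla' };

// Offering stored in cmdb_ci: the SLA keeps running.
var cancelWhenOfferingInCi = shouldCancelOfferingSla(offeringSla, {
    cmdb_ci: 'offering_email_client_x'
});

// A real CI in cmdb_ci and the offering in service_offering: cancelled.
var cancelWhenRealCi = shouldCancelOfferingSla(offeringSla, {
    cmdb_ci: 'ci_mail_server_01',
    service_offering: 'offering_email_client_x'
});

console.log(cancelWhenOfferingInCi, cancelWhenRealCi); // false true
```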

That covers how the SLA engine keeps track of which SLAs should be running. It also explains why I wasn’t seeing any SLAs where I expected them to appear. I then took a little detour to determine how the commitments are actually being calculated. Surely there must be something there pointing towards the service offering field?

Let’s have a look at how these results are calculated in the ‘SLAResultCalculator’ script include, which is used for creating result records based on SLAs.

Unfortunately, in the calculate function we can see the engine look at the cmdb_ci field from task some more:

calculate: function(start, end) {
    var commits = this._getCommits();
    while (commits.next()) {
        var cmdb_ci = commits.service_offering + '';

        var ga = new GlideAggregate('task_sla');
        ga.addInactiveQuery();
        ga.addQuery('end_time', '>=', start);
        ga.addQuery('end_time', '<', end);
        ga.addQuery("sla", commits.service_commitment.sla);
        ga.addQuery('task.cmdb_ci', cmdb_ci);
        ga.addQuery("stage", "IN", ["breached", "achieved", "completed"]);
        ga.groupBy('stage');
        ga.groupBy('has_breached');
        ga.addAggregate('COUNT');
        ga.addAggregate('SUM', "duration");
        ga.addAggregate('SUM', "business_duration");
        ga.query();

        var totalTasks = 0;
        var total_dur = 0;
        var total_bdur = 0;
        var breached = 0;
        var achieved = 0;
        while (ga.next()) {
            var dur = this._getDuration(ga.getAggregate('SUM', "duration"));
            var bdur = this._getDuration(ga.getAggregate('SUM', "business_duration"));
            total_dur = total_dur + dur.getNumericValue();
            total_bdur = total_bdur + bdur.getNumericValue();
            if (ga.stage == "breached" || (ga.stage == "completed" && ga.has_breached))
                breached += parseInt(ga.getAggregate("COUNT"));
            if (ga.stage == "achieved" || (ga.stage == "completed" && !ga.has_breached))
                achieved += parseInt(ga.getAggregate("COUNT"));
        }
        totalTasks += breached + achieved;
        var meanDuration = 0;
        var meanBusinessDuration = 0;
        var achievedPct = 100;
        if (totalTasks > 0) {
            meanDuration = total_dur / totalTasks;
            meanBusinessDuration = total_bdur / totalTasks;
            achievedPct = (achieved / totalTasks) * 100;
        }
        var met = achievedPct >= commits.service_commitment.sla_percent;
        var totalDuration = this._getDuration(total_dur);
        var totalBusinessDuration = this._getDuration(total_bdur);
        var sr = new SLAResultRecord(cmdb_ci, start, end);
        sr.post(commits.service_commitment, achieved, breached, this._getDuration(meanDuration), this._getDuration(meanBusinessDuration), totalDuration, totalBusinessDuration, met);
    }
},
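The percentage arithmetic in that function is easy to check by hand. The sketch below is plain JavaScript with invented counts, not the script include itself. The `summarize` helper and the row shape are hypothetical; only the classification of stages and the default of 100 percent are taken from the excerpt.

```javascript
// Hypothetical aggregate rows, shaped like the grouped task_sla results
// (one row per stage / has_breached combination). Counts are invented.
var rows = [
    { stage: 'achieved', has_breached: false, count: 12 },
    { stage: 'breached', has_breached: true, count: 4 },
    { stage: 'completed', has_breached: true, count: 1 },
    { stage: 'completed', has_breached: false, count: 3 }
];

// Same classification as the excerpt: 'completed' rows count as breached
// or achieved depending on has_breached; with no tasks at all, the
// achieved percentage stays at its default of 100.
function summarize(aggregateRows, slaPercent) {
    var breached = 0;
    var achieved = 0;
    aggregateRows.forEach(function (row) {
        if (row.stage === 'breached' || (row.stage === 'completed' && row.has_breached))
            breached += row.count;
        if (row.stage === 'achieved' || (row.stage === 'completed' && !row.has_breached))
            achieved += row.count;
    });
    var totalTasks = breached + achieved;
    var achievedPct = 100;
    if (totalTasks > 0)
        achievedPct = (achieved / totalTasks) * 100;
    return {
        achieved: achieved,
        breached: breached,
        achievedPct: achievedPct,
        met: achievedPct >= slaPercent
    };
}

// 15 achieved out of 20 tasks is 75 percent, below a target of 80.
var result = summarize(rows, 80);
console.log(result.achievedPct, result.met); // 75 false

// No matching tasks at all: the default of 100 percent applies.
var emptyResult = summarize([], 80);
console.log(emptyResult.achievedPct, emptyResult.met); // 100 true
```

The empty case is worth noting. Going by the excerpt alone, a commitment whose offering never appears in task.cmdb_ci would find no task SLAs and would therefore be reported as 100 percent achieved.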

So basically the entire engine is built on the assumption that a service offering is stored in the cmdb_ci field. I tried making sense of this, but nothing came to mind that might explain what the intent was. Simply put, I don’t know how to work with the data to make sure the commitments are properly tracked without either customizing the engine or losing traceability of the primary CI in your processes.

The only sensible reason I could come up with is that we are dealing with a script that should work out of the box (TaskSLAController) and a field that only comes in with a plugin (Service offering on task). I hope ServiceNow will be able to address this quickly, as in my opinion this setup is not correct.

Please note that making any kind of change here would touch pretty much the entire engine, so be careful and act at your own risk. I would not recommend changing any of it for now. Unfortunately, as far as I can see, that also means that using the commitments with the service offering field will have to wait a little longer!

If you have any questions, or if you think that I should not be so picky: please let me know (and why)!
