Building a good customer experience is essential to the retention rate of your product. But you can build as many satisfying, pixel-perfect UI snacks as you want: if your product is unreliable, it will scare your customers off. Even more so when you’re selling a product that provides value by being always “on”, like a server.
Monitoring uptime is crucial to your product strategy and your retention rate. Customer satisfaction can be measured with several metrics: some are specific to your product or industry, others are widely adopted across SaaS. Service availability is one of the latter.
This article is addressed to PMs or CTOs who are looking to add a Service Level Agreement to their product. We’ll look into the definition of levels of service, what an SLA should include, and what processes should be covered in case of downtime.
Levels of service are commitments made by a service provider to deliver one or more services within a stated tolerance.
Levels of service set the expectations of quality and service that customers can count on when signing up for an offering; they ensure the continuity of service.
A Service Level Agreement (SLA) is a binding document between the service provider and the customer which states the levels of service and what happens when they are not met. Essentially, levels of service are the answer to “How frequently will this product be down?” and the SLA is the answer to “What happens when the product is down?”
Typically, levels of service are expressed using the service’s average monthly uptime. Uptime is the percentage of total possible minutes a service was available to its users during a specific period.
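As a rough illustration of that definition, monthly uptime can be computed from the total minutes in the period and the minutes of recorded downtime. The function name and the 30-day month used below are assumptions made for this example, not part of any provider's SLA.

```python
def uptime_percentage(total_minutes: float, downtime_minutes: float) -> float:
    """Return the share of the period during which the service was available, in percent."""
    if total_minutes <= 0:
        raise ValueError("total_minutes must be positive")
    if not 0 <= downtime_minutes <= total_minutes:
        raise ValueError("downtime_minutes must be between 0 and total_minutes")
    return (total_minutes - downtime_minutes) / total_minutes * 100


# Assumption: a 30-day month, i.e. 30 * 24 * 60 = 43,200 possible minutes.
minutes_in_month = 30 * 24 * 60

# 43.2 minutes of downtime in that month corresponds to 99.90% uptime.
print(f"{uptime_percentage(minutes_in_month, 43.2):.2f}%")
```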
Take Slack as an example. In its SLA, Slack commits to 99.99% uptime, and its uptime monitoring covers the ability to log in, post a message, share a file, or preview a link. It does not cover the ability to react to a message with an emoji, as that is not one of the product’s core features. Slack also differentiates its SLAs based on the user’s subscription plan.
To decide which services should be included in your SLA, ask yourself:
When you have a list of features without which your product wouldn’t work, describe which services make them work. That may be an API or a server. Those services are the ones you should be setting an uptime for: your “Service Level Indicators”.
Then, define what would constitute “being down”:
You now have the services that should be up at all times and know when they’ll be considered down.
Ultimately, you’ll have to commit to an availability rate. To make an informed decision, factor in the maturity and robustness of your product’s infrastructure. This availability commitment is central to the SLA: your users will monitor it closely, and it determines any credits you may have to give back to them if you fall short.
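To get a feel for what a given availability rate implies, it can help to translate it into the downtime it allows per period. The sketch below assumes a 30-day period by default; the function name is illustrative.

```python
def allowed_downtime_minutes(availability_percent: float, days_in_period: int = 30) -> float:
    """Return the minutes of downtime a commitment tolerates over the period."""
    total_minutes = days_in_period * 24 * 60
    return total_minutes * (1 - availability_percent / 100)


# Compare a few candidate commitments over an assumed 30-day month.
for target in (99.0, 99.9, 99.95, 99.99):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% -> {minutes:.1f} minutes of downtime allowed")
```

Under that assumption, a 99.99% commitment leaves a little over four minutes of downtime per month, which is worth weighing against how quickly your team can detect and resolve an incident.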
Enter your SRE or DevOps department, which should be able to set up one alert for when a service is down and another for when the uptime falls below the threshold for the current period. When that happens, you may want to share the alert with any concerned party and, if the downtime is confirmed, update your status page accordingly.
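The two alerts described above could be sketched as follows. This is a minimal illustration, not a monitoring setup: the `UptimeStatus` structure and the `alerts_for` function are hypothetical, and in practice the figures would come from whatever monitoring tool your team uses.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class UptimeStatus:
    service: str
    is_up: bool               # result of the latest health check
    downtime_minutes: float   # downtime accumulated in the current period
    elapsed_minutes: float    # minutes elapsed so far in the current period


def alerts_for(status: UptimeStatus, committed_uptime: float) -> List[str]:
    """Return the alert messages that apply to the given status."""
    alerts = []
    # Alert 1: the service is currently down.
    if not status.is_up:
        alerts.append(f"{status.service} is down")
    # Alert 2: uptime for the current period is below the commitment.
    if status.elapsed_minutes > 0:
        uptime = (status.elapsed_minutes - status.downtime_minutes) / status.elapsed_minutes * 100
        if uptime < committed_uptime:
            alerts.append(
                f"{status.service} uptime {uptime:.3f}% is below the {committed_uptime}% commitment"
            )
    return alerts


# Ten days into the period, with 10 minutes of downtime and the service currently down.
status = UptimeStatus("api", is_up=False, downtime_minutes=10, elapsed_minutes=10 * 24 * 60)
for message in alerts_for(status, committed_uptime=99.99):
    print(message)
```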
In the SLA, you’ll have to document the reimbursement policy you want to apply when your uptime commitment isn’t met. You can either proactively apply service credit to all affected users when that happens, or issue a service credit manually when an affected user requests it.
Additionally, you’ll have to decide how much service credit you want to issue for each period in which the commitment is missed. The tech industry consensus seems to be to issue a percentage of the monthly charges, depending on the user’s plan and the monthly uptime percentage.
Slack automatically adds service credit to affected accounts if it falls short of its uptime commitment. That service credit is equal to 10 times the amount the customer paid for the period during which Slack was down.
Google, on the other hand, asks users to contact their Google technical support representative within 30 days to receive Service Credits. Atlassian users have to do it within 15 days.
|Service Provider||Service Commitment||Service credit process||Service credit (X = monthly uptime)|
|Amazon API Gateway||99.95%||Automatically applied by service provider||99.0% ≤ X < 99.95%: 10%; 95.0% ≤ X < 99.0%: 25%; X < 95.0%: 100%|
|Slack||99.99%||Automatically applied by service provider||Credit = 10 x amount paid during downtime|
|Google Workspace||99.9%||Customer must request Service Credit||Credit = days of service added to the end of the service term. 99.0% ≤ X < 99.9%: 3 days; 95.0% ≤ X < 99.0%: 7 days; X < 95.0%: 15 days|
|Atlassian Premium||99.90%||Customer must request Service Credit||99.0% ≤ X < 99.90%: 10%; 95.0% ≤ X < 99.0%: 25%; X < 95.0%: 50%|
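A tiered schedule like the ones in the table above can be turned into a small lookup. The sketch below uses the Amazon API Gateway tiers from the table as sample data. Treating each tier's lower bound as inclusive is an assumption of this example, and the function names are illustrative.

```python
# Each tier is (lower bound of monthly uptime in %, credit as % of monthly charges),
# ordered from the highest lower bound to the lowest.
# Sample data: the Amazon API Gateway row of the table above.
API_GATEWAY_COMMITMENT = 99.95
API_GATEWAY_TIERS = [(99.0, 10), (95.0, 25), (0.0, 100)]


def service_credit_percent(monthly_uptime, commitment, tiers):
    """Return the credit, as a percentage of monthly charges, for a given uptime."""
    if monthly_uptime >= commitment:
        return 0  # commitment met, no credit due
    for lower_bound, credit in tiers:
        # Assumption: lower bounds are inclusive.
        if monthly_uptime >= lower_bound:
            return credit
    return 0  # only reached if the tiers do not cover the uptime value


def service_credit_amount(monthly_charge, monthly_uptime, commitment, tiers):
    """Return the credit amount in the same currency as monthly_charge."""
    return monthly_charge * service_credit_percent(monthly_uptime, commitment, tiers) / 100


# A customer paying 200 per month, in a month where uptime was 98.5%.
print(service_credit_amount(200.0, 98.5, API_GATEWAY_COMMITMENT, API_GATEWAY_TIERS))
```

Keeping the tiers as data rather than hard-coded conditions makes it easier to apply a different schedule per subscription plan.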
Now that you have the services you want to attach availability commitments to, their levels of service, and the process your users should follow when a service has been down, you’re ready to publish them on your website or app legal pages and notify your users.
Next, we’ll review how to communicate to your users in case of force majeure and scheduled maintenance.
To recap, see this perfect illustration of what Service Level Indicators, Service Level Objectives, and Service Level Agreements are: