Load testing is an important technique when building a reliable service. This type of testing involves subjecting the service to artificially generated load to gain confidence that it will perform as required and to identify any enhancements required. A happy side effect is that it encourages good monitoring and improves team understanding of the operational aspects of the system.
When building a service, the specific performance and capacity needs are called non-functional requirements (NFRs), and typically relate to how the service must perform in terms of things like concurrent users, number and size of requests, or time taken to respond. Load tests explore whether the service can meet these requirements.
Load tests must be representative to be useful. For example, if the service will receive 100 user requests per second then tests should mimic that customer behaviour as closely as possible.
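As a minimal sketch of what mimicking a request rate can look like, the Python below generates send times for a target rate of 100 requests per second. It assumes that real users arrive independently, so the gaps between requests are drawn at random rather than fixed at 10 ms. The function name and the choice of exponentially distributed gaps are illustrative assumptions, not something prescribed here.

```python
import random

def arrival_times(rate_per_second, duration_seconds, seed=0):
    """Return request send times (seconds from test start) with random gaps.

    Gaps are exponentially distributed, which models independent users
    arriving at an average of rate_per_second (an assumption).
    """
    rng = random.Random(seed)
    times = []
    t = rng.expovariate(rate_per_second)
    while t < duration_seconds:
        times.append(t)
        t += rng.expovariate(rate_per_second)
    return times

# Roughly 100 requests per second for 10 seconds.
schedule = arrival_times(rate_per_second=100, duration_seconds=10)
print(f"{len(schedule)} requests scheduled over 10 s")
```

A load tool would then send one request at each scheduled time, regardless of whether earlier responses have returned, so that a slow system does not reduce the load applied to it.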
When to test
Load testing should be done regularly throughout delivery, not only in preparation for the first go-live. If load testing is left too late, it can be prohibitively expensive to remedy any performance issues found, especially if they are caused by inherent features of the technical design.
For tests to be representative, they must be performed on a system which is as "live-like" as possible. The primary consideration is that it should be scaled to the same level as the live system. As well as scaling the speed and capacity of things like compute and storage, the volumes and type of data present in the system must be live-like to ensure data access will perform as required.
List the components of your system and third party dependencies and ensure each is scaled to live system capacity. In cloud environments it is relatively easy and cheap to temporarily create a live-scale load testing environment built using infrastructure as code. While it is important to ensure that tests are done on an environment that is as true to production as possible, it should not actually interact with production components to avoid the risk of impacting the live system. Particular attention must be paid to third party dependencies to ensure there is no potential for interaction between production and non-production environments. Also consider the capacity of the load testing tools and monitoring system to ensure these can keep pace with the capacity of the full scale system.
Be wary of trying to extrapolate results from tests on non-live-scale systems: scaling is often highly non-linear and the behaviour of a system can be very sensitive to unrepresentative performance from individual components. If no test instance of a particular component or dependency is available, or the integration is deemed out of scope for the test, it is possible to use a stub instead. Use this approach with caution, as it significantly reduces the reliability of test results.
Think about scope
It is natural to prioritise testing of the functionality which is most critical to the user and business. However, it is important not to focus too narrowly on testing individual flows as load in one area of the system can often impact performance in others. Ideally load tests will involve applying load in a live-like way, which typically means across all or most areas of functionality. It can be useful to test multiple scenarios with different load profiles such as times when many users log in at the start of the day or what happens after a push notification or marketing tweet is sent.
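The load profiles mentioned above can be written down as data before any tooling is chosen. The following sketch describes a profile as a target request rate for each minute of a test, with a single spike over a steady baseline. The function name and all of the numbers are made up for illustration.

```python
def load_profile(baseline_rps, peak_rps, spike_start, spike_length, total_minutes):
    """Return the target requests per second for each minute of a test."""
    profile = []
    for minute in range(total_minutes):
        in_spike = spike_start <= minute < spike_start + spike_length
        profile.append(peak_rps if in_spike else baseline_rps)
    return profile

# Hypothetical scenario: many users logging in at the start of the day.
morning_login = load_profile(
    baseline_rps=20, peak_rps=100, spike_start=5, spike_length=10, total_minutes=30
)

# Hypothetical scenario: a short, sharp burst after a push notification.
push_notification = load_profile(
    baseline_rps=20, peak_rps=300, spike_start=10, spike_length=2, total_minutes=30
)
```

Keeping each scenario as a named profile makes it easier to rerun the same shape of load after a change and compare results.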
Map functionality to services and integrations
Once you have a list of service functionality, you can map each item to the technical integration points it relies on. For example, logging in may actually require two internal services and a call out to a third party. Knowing which points you need to test highlights areas to investigate when ensuring your test environment is isolated and representative, and helps you build a comprehensive test and monitoring plan. However, avoid over-reliance on testing individual components: tests which mimic real user behaviour give far more reliable results.
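A simple way to hold this mapping is a dictionary from each piece of functionality to its integration points, which can then be inverted to show which functions load each integration. All service and function names below are hypothetical.

```python
# Hypothetical mapping of user-facing functionality to integration points.
FUNCTION_DEPENDENCIES = {
    "log in": ["auth-service", "profile-service", "third-party-identity-provider"],
    "view account": ["profile-service"],
    "checkout": ["order-service", "third-party-payments"],
}

def functions_by_integration(mapping):
    """Invert the mapping: for each integration point, list the functions using it."""
    inverted = {}
    for function, integrations in mapping.items():
        for integration in integrations:
            inverted.setdefault(integration, []).append(function)
    return inverted

by_integration = functions_by_integration(FUNCTION_DEPENDENCIES)
third_parties = sorted(name for name in by_integration if name.startswith("third-party"))
```

The inverted view shows, for instance, which integrations are shared between flows, and the list of third parties is a starting point for checking environment isolation.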
Identify and fix bottlenecks
Load testing requires good monitoring so that the behaviour of the system and each component can be understood and bottlenecks identified. If testing shows that system performance or capacity is inadequate, then the bottleneck must be optimised. During optimisation, avoid changing multiple parameters at once, as doing so can make it difficult to determine cause and effect. Carefully record changes made during this process along with the results. Work methodically and be forensic in your analysis, because it is very easy to lose track of exactly which changes have and have not made a measurable difference to system behaviour. Take time to put together a clear write-up of each load test so everyone who needs to can understand what was done and what the findings were.
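One way to support the "one change at a time" discipline is to record the configuration of every run and compare it with the previous run before starting. The sketch below does this with plain dictionaries. The parameter names and values are hypothetical.

```python
def changed_parameters(previous, current):
    """Return the sorted names of parameters that differ between two runs."""
    keys = previous.keys() | current.keys()
    return sorted(key for key in keys if previous.get(key) != current.get(key))

# Hypothetical configurations for two consecutive test runs.
run_1 = {"db_pool_size": 10, "app_instances": 2, "cache_enabled": False}
run_2 = {"db_pool_size": 20, "app_instances": 2, "cache_enabled": False}

diff = changed_parameters(run_1, run_2)
if len(diff) > 1:
    print("Warning: more than one parameter changed:", diff)
else:
    print("Changed since last run:", diff)
```

Storing the configuration alongside the results of each run also gives the write-up a reliable record of what was changed and when.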
At a minimum, test to some agreed level above the expected maximum load to compensate for uncertainty in those estimates. Typical choices are 1.5 or 2 times the expected load. However, where possible it is also good to test to breaking point to better understand how much headroom the system has and what will break first. This is known as stress testing.
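The arithmetic is simple but worth making explicit. The sketch below computes a target load from an expected peak and a safety factor, then lists increasing load steps for a stress test that continues beyond the target. The figures and function names are illustrative only.

```python
def target_load(expected_peak_rps, safety_factor):
    """Load to test to: the expected peak scaled by an agreed safety factor."""
    return expected_peak_rps * safety_factor

def stress_steps(start_rps, step_rps, max_rps):
    """Increasing load levels for a stress test, from start_rps up to max_rps."""
    return list(range(start_rps, max_rps + 1, step_rps))

expected_peak = 100  # requests per second (illustrative)
minimum_target = target_load(expected_peak, 1.5)   # 150.0
stretch_target = target_load(expected_peak, 2)     # 200
steps = stress_steps(start_rps=200, step_rps=100, max_rps=500)  # [200, 300, 400, 500]
```

In a real stress test the upper limit is not known in advance, so the steps would continue until the system fails or stops meeting its requirements.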
It is good to be aware of any known inefficiencies before testing, such as making unnecessary calls out to a third party or a missing database index. But testing and optimisation are best done iteratively with tests steering optimisation efforts to avoid wasting time on premature optimisation.
Finding performance bottlenecks is hard. Monitoring, application logs and distributed tracing can all help isolate exactly what is causing the bottleneck. Even once you have identified the slow component, it can be difficult to determine the underlying cause.
Some general points to consider are:
- Is this a point in time measurement or does the system react to high load, e.g. by automatically scaling?
- How can you reduce bottlenecks in databases, third parties and other dependencies?
- How do you reduce your own footprint as well as getting others to improve their performance? i.e. reduce calls and make them faster.
- Are you being limited by a hardware constraint such as CPU, memory, disk or network?
- Are you being limited by any application thread pools?
- Are HTTP servers configured correctly?
- Are you hitting a limit such as number of open sockets or concurrent connections?
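For the questions about thread pools and connection limits, a rough estimate of the concurrency a component needs can be obtained from Little's law: the average number of requests in flight equals the arrival rate multiplied by the average time each request spends in the system. This relation is not part of the guidance above and only holds for steady-state averages, so treat the result as a sanity check rather than a sizing rule. The numbers are illustrative.

```python
import math

def required_concurrency(requests_per_second, mean_latency_seconds):
    """Average number of in-flight requests by Little's law, rounded up."""
    return math.ceil(requests_per_second * mean_latency_seconds)

pool_size = 20  # hypothetical thread pool or connection limit
in_flight = required_concurrency(requests_per_second=100, mean_latency_seconds=0.25)

if in_flight > pool_size:
    print(f"About {in_flight} requests in flight on average, but the limit is {pool_size}")
```

If the estimate exceeds a configured limit, requests will queue at that limit and latency will rise even when CPU and memory appear to have spare capacity.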
Some considerations when deciding what and how to test:
- How well does the test mimic real user behaviour? For example, logging one customer in a million times might behave differently to logging a million customers in.
- Is the entry point the same? If you are faking API traffic, does it go through the same firewalls, proxies, etc? If you are injecting messages or files, do they go through the same controls as normal?
- Are individual pieces of functionality being tested separately or together? Both have value and a balance of each approach is often worthwhile.
- How easy is it to alter system and test parameters and retest? Can you quickly iterate test-modify cycles?
- How will you record results?
- What happens when functionality changes? How easy is it to update the test?
- What metrics are you interested in? (See section below)
- Can you automate the production of easily interpretable graphs to clearly illustrate results?
- Can tests run as part of the CI pipeline?
- Are all users of the test environment aware you're running the test?
- Are all integration points (other teams / third parties) aware you're running the test?
Monitoring during testing should centre around the metrics which matter most to your users. This will often include some form of response time (also known as latency) which must be maintained at predicted levels of load (often called traffic) without error. This has led to what the SRE book calls the 'four golden signals', which are an excellent starting point.
Latency: the time it takes to service a request, for example the response time in milliseconds for an HTTP API. When a system gets busy, latency can vary widely between individual requests. Since perception of system performance is highly sensitive to the speed of the slowest responses, it is often desirable to use a measure which recognises this, such as the 95th centile, rather than a simple mean.
Traffic: a measure of how much demand is being placed on your system or on an individual component or third party system, for example requests per second for an HTTP API.
Errors: the rate of requests that fail, either explicitly (e.g. HTTP 500s), implicitly (for example, an HTTP 200 success response, but coupled with the wrong content), or by policy (for example, "If you committed to one-second response times, any request over one second is an error"). An example measure is 5xx errors per second for an HTTP API.
Saturation: how "full" your service is. A measure of your system fraction, emphasizing the resources that are most constrained (e.g. in a memory-constrained system, show memory; in an I/O-constrained system, show I/O). Note that many systems degrade in performance before they achieve 100% utilization, so having a utilization target is essential.
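Three of these signals can be computed directly from per-request test results. The sketch below does so for a list of request records, counting explicit, implicit and policy failures as errors and reporting both the mean and the 95th centile latency. The record fields, the nearest-rank centile method and the sample data are all assumptions made for illustration. Saturation is not included because it comes from resource monitoring rather than from request records.

```python
def percentile(values, pct):
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # ceiling of pct% of n
    return ordered[rank - 1]

def summarise(requests, window_seconds, slow_threshold_ms):
    """Compute latency, traffic and error figures for one measurement window."""
    latencies = [r["latency_ms"] for r in requests]
    errors = [
        r for r in requests
        if r["status"] >= 500                      # explicit failure
        or not r["content_ok"]                     # implicit failure
        or r["latency_ms"] > slow_threshold_ms     # failure by policy
    ]
    return {
        "mean_latency_ms": sum(latencies) / len(latencies),
        "p95_latency_ms": percentile(latencies, 95),
        "traffic_rps": len(requests) / window_seconds,
        "error_rate": len(errors) / len(requests),
    }

# Made-up sample: 90 fast requests, 8 slow ones, 2 slow server errors.
sample = (
    [{"latency_ms": 100, "status": 200, "content_ok": True}] * 90
    + [{"latency_ms": 2000, "status": 200, "content_ok": True}] * 8
    + [{"latency_ms": 2000, "status": 500, "content_ok": True}] * 2
)
summary = summarise(sample, window_seconds=10, slow_threshold_ms=1000)
```

With this sample the mean latency is 290 ms while the 95th centile is 2000 ms, which shows how a mean can hide the slow responses that users notice most.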
As a bottleneck of the system nears its saturation point, latency typically starts to increase rapidly in a 'hockey stick' profile. This point where the latency starts to increase more rapidly is a good indication of the maximum capacity of the system.
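A crude way to locate this point from test results is to step through measurements in order of increasing load and report the last load level before latency exceeds some multiple of its low-load baseline. The sketch below uses that heuristic. The factor of 2, the function name and the data are assumptions for illustration, and plotting the results remains the more reliable check.

```python
def max_sustainable_load(measurements, factor=2.0):
    """Return the last load level before latency exceeds factor x baseline.

    measurements: list of (requests_per_second, latency_ms), sorted by load.
    Returns None if even the second measurement is already over the threshold
    and there is no earlier level to report other than the baseline itself.
    """
    baseline = measurements[0][1]
    last_good = None
    for rps, latency in measurements:
        if latency > baseline * factor:
            return last_good
        last_good = rps
    return last_good  # latency never crossed the threshold

# Made-up results showing a 'hockey stick' in latency.
results = [(100, 50), (200, 52), (300, 55), (400, 60), (500, 95), (600, 240), (700, 900)]
capacity = max_sustainable_load(results)
```

For this made-up data the function returns 500, the highest load at which latency stayed within twice the baseline.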
Choose the right tool for the job. Think about your team's skill set when it comes to implementing the tests. There are several popular tools for load testing, and language choice is important to consider along with the functionality provided by the test framework. For example, Gatling tests are written in Scala, JMeter in Java, and Locust in Python.