For many years, Black Friday has created traffic jams both physically and digitally. Black Friday has grown into Black Week and Cyber Monday, which spreads the traffic over a longer period of time. The same tactic is common for other large events: advance voting, for example, was introduced in parliamentary elections to avoid long queues on election day. But is spreading the traffic enough to handle the annual shopping bonanza?

I have previously worked on systems that receive heavy load in a short window, where the consequence of being one minute late is news coverage and a full-blown crisis. Many people face similar challenges around Black Friday. Here are my tips for avoiding news coverage and other unpleasantness.

Improve observability

Observability is about being able to understand the internal state of a system. It enables learning on a completely different level than you get by only observing the system from the outside. All the major cloud platforms have good support for observability, but many of them charge handsomely for the service, so you may want to consider free alternatives. In that case, a setup based on OpenTelemetry is a good choice.
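As an illustration, a minimal tracing setup for a Node.js service could look something like the sketch below. This is not a complete recipe: the package names are the official OpenTelemetry ones, but the service name and collector endpoint are placeholders you would replace with your own.

```javascript
// tracing.js – minimal OpenTelemetry setup for a Node.js service (illustrative sketch).
// Assumes @opentelemetry/sdk-node, @opentelemetry/auto-instrumentations-node and
// @opentelemetry/exporter-trace-otlp-http are installed.
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');

const sdk = new NodeSDK({
  serviceName: 'webshop-checkout',                  // placeholder service name
  traceExporter: new OTLPTraceExporter({
    url: 'http://otel-collector:4318/v1/traces',    // placeholder collector endpoint
  }),
  instrumentations: [getNodeAutoInstrumentations()], // auto-instrument HTTP, Express, etc.
});

sdk.start();

// Flush remaining telemetry on shutdown.
process.on('SIGTERM', () => {
  sdk.shutdown().then(() => process.exit(0));
});
```

A common pattern is to load this file before the rest of the application, for example with node --require ./tracing.js server.js, so that the auto-instrumentation is registered before other modules are imported.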

Graphical representations of things such as error rates, response times and resource usage leave you much better equipped to understand how the system is doing. Being able to point to a graph and discuss it has enormous power, and the path from symptom to root cause suddenly becomes much shorter.

Do load testing

Load testing is a type of performance testing where the focus is on putting load on the system and observing how it responds. There are lots of specialized tools for this. I usually use k6, because the tests can be written in JavaScript, it has good solutions for graphing results, and the same tests can be run from a cloud service so that they execute outside your own infrastructure.
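As a starting point, a minimal k6 script could look like the sketch below. The URL, number of virtual users and thresholds are illustrative placeholders, not recommendations.

```javascript
// load-test.js – minimal k6 script (run with: k6 run load-test.js).
import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = {
  vus: 50,               // 50 concurrent virtual users
  duration: '5m',        // hold the load for five minutes
  thresholds: {
    http_req_failed: ['rate<0.01'],    // less than 1% of requests may fail
    http_req_duration: ['p(95)<500'],  // 95% of requests must finish within 500 ms
  },
};

export default function () {
  const res = http.get('https://example.com/');             // placeholder URL
  check(res, { 'status is 200': (r) => r.status === 200 });
  sleep(1);                                                  // think time between iterations
}
```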

It is a good idea to create tests that follow the user journey as realistically as possible. That way you meet the bottlenecks in the same order as the user does, and solve the most important things first. For example, there is little to be gained from lightning-fast endpoints when the front page does not load.
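A sketch of such a user-journey test in k6 could look like this; the endpoints and payload are hypothetical and only meant to show the structure.

```javascript
// user-journey.js – k6 sketch following a typical shopping journey.
import http from 'k6/http';
import { group, check, sleep } from 'k6';

export default function () {
  group('front page', () => {
    const res = http.get('https://example.com/');
    check(res, { 'front page loads': (r) => r.status === 200 });
  });
  sleep(2); // the user looks around

  group('product page', () => {
    const res = http.get('https://example.com/products/1234');
    check(res, { 'product loads': (r) => r.status === 200 });
  });
  sleep(1);

  group('add to cart', () => {
    const res = http.post(
      'https://example.com/cart',
      JSON.stringify({ productId: 1234, qty: 1 }),
      { headers: { 'Content-Type': 'application/json' } },
    );
    check(res, { 'added to cart': (r) => r.status === 200 });
  });
}
```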

You should also have an idea of how much load to expect. Starting from last year's Black Friday figures and tripling them to set the maximum load is a good starting point. It is also worth running breakpoint tests, where the load is turned up over time to find the point where the system can no longer handle the traffic and goes into saturation. In that state, a lot will stop working, which is a very good opportunity for learning.
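A breakpoint test can be sketched in k6 with a ramping arrival rate, as below. The rates, durations and abort threshold are made-up example values; the idea is simply to keep increasing the request rate until the system saturates.

```javascript
// breakpoint-test.js – k6 sketch of a breakpoint test: ramp the request rate
// over time and stop once the error rate shows that the system has saturated.
import http from 'k6/http';

export const options = {
  scenarios: {
    breakpoint: {
      executor: 'ramping-arrival-rate', // drive a fixed request rate, independent of response times
      startRate: 10,                    // requests per second at the start
      timeUnit: '1s',
      preAllocatedVUs: 200,
      maxVUs: 2000,
      stages: [
        { target: 300, duration: '10m' }, // ramp steadily toward the expected maximum load
      ],
    },
  },
  thresholds: {
    // Abort the test when more than 5% of requests fail.
    http_req_failed: [{ threshold: 'rate<0.05', abortOnFail: true }],
  },
};

export default function () {
  http.get('https://example.com/'); // placeholder URL
}
```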

There is a lot of information in how the system fails. Does it recover on its own, or does it need help getting back up? Under high load, some parts of the system will typically be overworked while others have little to do. The parts using few resources are sitting behind a bottleneck. To push the bottlenecks further back, the overloaded resources can be scaled out or up. After repeating the tests and scaling several times, you reach a point where scaling out or up no longer changes the result.

If you still need more capacity, you have to look at other solutions to move forward. Maybe the architecture needs a review, or maybe an external endpoint is causing problems? Remember that the bottleneck may also sit in front of your applications. For example, the traffic routing into the Kubernetes cluster may be scaled too low, so that the traffic reaching the applications is already throttled.

CDN – Security and HTTP Caching

A Content Delivery Network (CDN) is a distributed network of nodes located in data centers around the world. These nodes offer caching close to the user and can therefore reduce the time it takes to load a web page or get a response to an HTTP request. There is great potential for improvement here if you learn to take advantage of it, and it is never a waste to spend time learning HTTP properly. Review what you are caching, and make sure the CDN is set up to respect cache headers. Endpoints that are particularly slow or that put a lot of load on the system are good candidates for caching.
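As a small illustration, here is a sketch of how a slow endpoint could be made cache-friendly. It assumes a Node.js/Express backend, and the route, lifetimes and helper function are hypothetical.

```javascript
// Illustrative sketch: setting cache headers on a slow, cache-friendly endpoint.
const express = require('express');
const app = express();

app.get('/api/products/popular', async (req, res) => {
  const products = await fetchPopularProducts(); // hypothetical helper

  // Let the CDN cache the response for 5 minutes (s-maxage), browsers for 1 minute (max-age),
  // and allow serving a stale copy while the cache revalidates in the background.
  res.set('Cache-Control', 'public, max-age=60, s-maxage=300, stale-while-revalidate=60');
  res.json(products);
});

app.listen(3000);
```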

All the major CDN providers offer a Web Application Firewall (WAF), which spares your system from a lot of unwanted traffic. In addition, you should set up rules to limit traffic (rate limiting).
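Rate-limiting rules are best configured at the CDN or WAF, and the exact setup varies by provider. As an illustration of the same idea at the application level, here is a sketch using the express-rate-limit package, assuming a Node.js/Express backend; the limits are placeholders.

```javascript
// Illustrative sketch: application-level rate limiting as a backstop behind the CDN/WAF.
const express = require('express');
const { rateLimit } = require('express-rate-limit');

const app = express();

const limiter = rateLimit({
  windowMs: 60 * 1000,   // 1 minute window
  max: 100,              // max 100 requests per IP per window ("limit" in newer versions)
  standardHeaders: true, // send RateLimit-* headers so well-behaved clients can back off
  legacyHeaders: false,
});

app.use('/api/', limiter); // apply to API routes only

app.listen(3000);
```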
