Scaling Logs for Customer Usage

Audit Logs

In a world of collecting activity logs and audit logs, starting off with a lean solution will usually work for the first five customers. But as your application grows, you’ll find yourself facing scaling issues that can affect your entire application cluster and, in specific cases, cause a denial of service (DoS).

Implementing an infrastructure for log collection can be even more challenging, as you are required to expose an endpoint for collecting logs from other applications: your service becomes the critical infrastructure and needs to maintain high availability.

The lean method

Building a simple REST API service for collecting audit logs should be rather simple, right?

Expose an endpoint, validate the input params, and save the data to the database.
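Something like the following minimal sketch, assuming Flask and SQLite (the endpoint, table, and field names are illustrative, not from the original service):

```python
# A minimal sketch of the lean approach: one endpoint, basic validation,
# and a direct synchronous write to the database.
import sqlite3
from flask import Flask, request, jsonify

app = Flask(__name__)
db = sqlite3.connect("audit.db", check_same_thread=False)
db.execute(
    "CREATE TABLE IF NOT EXISTS audit_logs (tenant_id TEXT, action TEXT, created_at TEXT)"
)

@app.route("/logs", methods=["POST"])
def collect_log():
    body = request.get_json(force=True)
    # Validate the input params before touching the DB.
    if not all(k in body for k in ("tenant_id", "action", "created_at")):
        return jsonify({"error": "missing fields"}), 400
    # Direct write -- this is exactly the coupling that breaks under
    # DB migrations, service crashes, and bursts of traffic.
    db.execute(
        "INSERT INTO audit_logs VALUES (?, ?, ?)",
        (body["tenant_id"], body["action"], body["created_at"]),
    )
    db.commit()
    return jsonify({"status": "ok"}), 201
```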

While this method is extremely easy to implement, it leaves the solution vulnerable:

  1. What happens when the DB is under migration? Writes are blocked
  2. What happens if the service fails? Logs are lost
  3. And how do we handle scale and bursts of data?

Controlling the scale via a queue

One of the most popular methods to control scale and reduce the DB migration issues is adding a queue in the middle. This allows us to scale pods (in the case of k8s) or lambdas (in the case of serverless) when load increases.
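With the queue in place, the API handler only enqueues and returns. A minimal sketch, assuming AWS SQS via boto3 (the queue URL and message shape are hypothetical):

```python
# Sketch of the API handler once a queue sits in the middle.
import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/audit-logs"  # hypothetical

def collect_log(body: dict) -> None:
    # Instead of writing to the DB, enqueue and return immediately.
    # The DB can be mid-migration or the consumer can be down; the
    # message waits safely on the queue either way.
    sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps(body))
```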

Now we are talking… 😁

Even if our DB is going through some kind of migration process, the data is stored on the queue, waiting, meaning we are not losing data. Furthermore, even if the consumer is under deployment or disconnected, the data is still stored safely on the queue, waiting for it to return.
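On the other side, the consumer pulls from the queue and deletes a message only after a successful DB write, which is what makes the buffering safe. A sketch under the same SQS assumption:

```python
# Sketch of the consumer: long-poll the queue, write to the DB, and
# delete only after a successful write, so a failed write leaves the
# message on the queue for a retry.
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/audit-logs"  # hypothetical

def drain_queue(write_to_db) -> None:
    while True:
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL,
            MaxNumberOfMessages=10,
            WaitTimeSeconds=20,  # long polling
        )
        for msg in resp.get("Messages", []):
            write_to_db(msg["Body"])  # raises on failure -> message is retried
            sqs.delete_message(QueueUrl=QUEUE_URL,
                               ReceiptHandle=msg["ReceiptHandle"])
```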

Now bear with me, we are almost there. This solution can take care of scaling up and down. However, there is still the issue of our API: it remains the single point of failure of the solution, meaning that if it fails, errors will be returned to our customers, leaving them with a solution that can, and will, break.

Eliminating the API gateway bottleneck

As mentioned, we are left with an API gateway bottleneck. Assuming we have really scaled our service and are now collecting thousands upon tens of thousands of logs every second, our API gateway is left vulnerable and we are constantly increasing its scale…

Wait… Let’s try to remember the day-to-day responsibilities of any API gateway:

  1. Check authentication
  2. Check authorization
  3. Proxy the request

This means that we have to “pay” that price for each log request (which will normally come from the same customer), leaving the gateway extremely busy…

Another popular method is to leave the API gateway to process REST calls, while moving your services to use external ingestion services (such as Amazon Kinesis, Azure Event Hubs or GCP Pub/Sub).

In these cases, a dedicated service will provide each of the authenticated applications with private access tokens to the ingestion service, while the consumers get the data from the ingestion service itself, thereby reducing the load and friction on the API gateway.
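On the SDK (application) side, such a solution might look something like this sketch, assuming Amazon Kinesis via boto3 (the token-vending endpoint and stream name are hypothetical; in practice the vended credentials would be scoped and temporary, e.g. via STS):

```python
# Sketch of the SDK writing straight to the ingestion service,
# bypassing the API gateway entirely.
import json
import boto3
import requests

# 1. A dedicated service authenticates the application and vends
#    private access credentials for the ingestion service.
creds = requests.post("https://auth.example.com/tokens",  # hypothetical
                      json={"app_id": "my-app"}).json()

kinesis = boto3.client(
    "kinesis",
    aws_access_key_id=creds["access_key_id"],
    aws_secret_access_key=creds["secret_access_key"],
    aws_session_token=creds["session_token"],
)

def emit_log(tenant_id: str, event: dict) -> None:
    # 2. Logs go directly to the ingestion service; the gateway
    #    never sees this traffic.
    kinesis.put_record(StreamName="audit-logs",  # hypothetical stream
                       Data=json.dumps(event).encode(),
                       PartitionKey=tenant_id)
```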

This gives us the following pros:

  1. Our API gateway is not a bottleneck anymore
  2. We rely on the HA of our public cloud provider (AWS / Azure / GCP) to shift the load
  3. The consumer remains unchanged

But it adds a bit of overhead:

  1. We need to sign the request on the SDK (application) side in order to verify the incoming source (see the sketch after this list)
  2. We need to deal with another authenticated service that distributes keys and maintains them
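The first point can be as simple as an HMAC over the payload. A minimal sketch, assuming a shared per-application secret handed out by the key service (asymmetric signatures would work just as well):

```python
# Sketch of request signing on the SDK side, so the consumer can
# verify the incoming source. Names are illustrative.
import hashlib
import hmac
import json

def sign(payload: dict, app_secret: bytes) -> str:
    # Canonicalize the payload so SDK and consumer hash the same bytes.
    body = json.dumps(payload, sort_keys=True).encode()
    return hmac.new(app_secret, body, hashlib.sha256).hexdigest()

def verify(payload: dict, signature: str, app_secret: bytes) -> bool:
    # Constant-time comparison on the consumer side.
    return hmac.compare_digest(sign(payload, app_secret), signature)
```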

Now we are able to ingest thousands upon tens of thousands of logs per second, while leaving our API gateway focused on low-rate REST calls.