When you collect activity logs and audit logs, a lean solution will usually work for the first 5 customers. But as your application grows, you’ll find yourself facing scaling issues that can affect your entire application cluster and, in some cases, cause a denial of service (DoS).
Implementing an infrastructure for log collection can be even more challenging: you are required to expose an endpoint for collecting logs from other applications, so your service becomes critical infrastructure and needs to maintain High Availability (HA).
The lean method
Building a simple REST API service for collecting audit logs should be rather simple, right?
Expose an endpoint, validate the input params, and save the data to the database.
While this method is extremely easy to implement, it leaves the solution vulnerable:
- What happens when the DB is under migration? Writing is blocked.
- What happens if the service fails? Logs are lost.
- And how do we meet scale? Bursts of data?
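To make the lean method concrete, here is a minimal sketch of such a handler in Python. It is only an illustration under stated assumptions: the field names, the `audit_logs` table, and the function names are hypothetical, and an in-memory SQLite database stands in for whatever database you actually use.

```python
import json
import sqlite3

# Hypothetical schema: these field names are assumptions for illustration.
REQUIRED_FIELDS = ("tenant_id", "actor", "action", "timestamp")


def create_store() -> sqlite3.Connection:
    """Create an in-memory table standing in for the real database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE audit_logs "
        "(tenant_id TEXT, actor TEXT, action TEXT, timestamp TEXT)"
    )
    return conn


def handle_post_audit_log(conn: sqlite3.Connection, body: str):
    """Validate the request body and write it straight to the database.

    Returns an (HTTP status code, response body) pair.
    """
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return 400, {"error": "body is not valid JSON"}
    if not isinstance(payload, dict):
        return 400, {"error": "body must be a JSON object"}

    missing = [f for f in REQUIRED_FIELDS if not isinstance(payload.get(f), str)]
    if missing:
        return 400, {"error": f"missing or invalid fields: {missing}"}

    # Synchronous write: if the database is blocked or down, the request
    # fails and the log is lost unless the caller retries.
    try:
        with conn:
            conn.execute(
                "INSERT INTO audit_logs VALUES (?, ?, ?, ?)",
                tuple(payload[f] for f in REQUIRED_FIELDS),
            )
    except sqlite3.Error:
        return 503, {"error": "storage unavailable"}
    return 201, {"status": "stored"}
```

Notice that the request only succeeds if the database write succeeds, which is exactly where the questions above come from.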
Controlling the scale via queue
One of the most popular methods to control scale and to reduce the DB migration issues is adding a queue in the middle. This lets us scale pods (in the case of k8s) or lambdas (in the case of serverless) when the load increases.
Now we are talking…
Even if our DB is going through some kind of migration process, the data is stored on the queue and waits there until the DB is back, meaning we are not losing data. Furthermore, even if the consumer is under deployment or disconnected, the data is still stored safely on the queue, waiting for the consumer to return.
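Here is a rough sketch of that decoupling, assuming Python's in-process `queue.Queue` as a stand-in for a real message broker. `FlakyStore`, `accept_log`, and `consume` are hypothetical names used only for this illustration; a real broker would redeliver unacknowledged messages rather than have the consumer put them back by hand.

```python
import queue


class FlakyStore:
    """Hypothetical stand-in for the database; can be switched to 'migrating'."""

    def __init__(self) -> None:
        self.rows = []
        self.available = True

    def save(self, log: dict) -> None:
        if not self.available:
            raise ConnectionError("database is under migration")
        self.rows.append(log)


def accept_log(q: queue.Queue, log: dict) -> int:
    """API side: enqueue and acknowledge immediately, without touching the DB."""
    q.put(log)
    return 202  # Accepted; storage happens asynchronously


def consume(q: queue.Queue, store: FlakyStore) -> int:
    """Consumer side: move logs from the queue to the store.

    Stops at the first storage failure and puts the log back, so nothing is
    lost while the database is unavailable. Returns the number of logs stored.
    """
    stored = 0
    while True:
        try:
            log = q.get_nowait()
        except queue.Empty:
            return stored
        try:
            store.save(log)
        except ConnectionError:
            # Re-queue the log; note this moves it to the back of the queue.
            q.put(log)
            return stored
        stored += 1


if __name__ == "__main__":
    q = queue.Queue()
    store = FlakyStore()
    store.available = False          # the DB is "under migration"
    accept_log(q, {"action": "login"})
    consume(q, store)                # fails, the log stays on the queue
    store.available = True           # migration finished
    consume(q, store)                # the log is now stored
```

The API only needs the queue to be up in order to accept a log; the consumer can fail, redeploy, or wait for the DB without the sender noticing.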
Now bear with me. We are almost there and this solution can take care of scaling up and down. However, there is still the issue of our API. It continues to be the single point of failure of the solution, meaning that if it fails, the errors will be returned to our customers leaving them with a solution that can, and will, break.
Eliminating the API gateway bottleneck
As mentioned, we are left with the API gateway bottleneck. Assuming we have really scaled our service and are now collecting thousands, or even tens of thousands, of logs every second, our API gateway is left vulnerable and we have to keep scaling it up.
Wait, let’s try to remember what the day-to-day responsibilities of any API gateway are:
- Check authentication
- Check authorization
- Proxy the request
This means that we pay this price for every single log request (and these requests normally come from the same customer), leaving the gateway extremely busy.
In these cases, a dedicated service provides each of the authenticated applications with a private access token to the digestion service, while the consumers get the data from the digestion service itself. This reduces the load and friction on the API gateway. This type of solution will look something like this:
This gives us the following pros:
- Our API gateway is not a bottleneck any more.
- We rely on the HA of our public cloud provider (AWS / Azure / GCP) to shift the load.
- The consumer remains unchanged.
But adds a bit of overhead:
- We need to sign the request on the SDK (application) side in order to verify the incoming source.
- We need to deal with another authenticated service that distributes keys and maintains them.
- Maintaining topic policies per broker is not an easy task either.
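To give a feel for the signing overhead listed above, here is a minimal sketch that assumes a shared-secret HMAC-SHA256 scheme. This is only one possible approach and not necessarily what your cloud provider's broker uses; `KeyService`, `sign`, `verify`, and the 5-minute replay window are all hypothetical choices for this illustration.

```python
import hashlib
import hmac
import secrets
import time
from typing import Optional

MAX_SKEW_SECONDS = 300  # assumed replay window, not taken from any standard


class KeyService:
    """Hypothetical service that issues and looks up per-application secrets."""

    def __init__(self) -> None:
        self._secrets = {}

    def issue(self, app_id: str) -> bytes:
        key = secrets.token_bytes(32)
        self._secrets[app_id] = key
        return key

    def lookup(self, app_id: str) -> Optional[bytes]:
        return self._secrets.get(app_id)


def sign(key: bytes, app_id: str, timestamp: int, body: bytes) -> str:
    """SDK side: HMAC-SHA256 over the app id, the timestamp, and the body."""
    message = f"{app_id}.{timestamp}.".encode() + body
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def verify(keys: KeyService, app_id: str, timestamp: int, body: bytes,
           signature: str, now: Optional[int] = None) -> bool:
    """Receiving side: recompute the signature and compare in constant time."""
    key = keys.lookup(app_id)
    if key is None:
        return False  # unknown application
    now = int(time.time()) if now is None else now
    if abs(now - timestamp) > MAX_SKEW_SECONDS:
        return False  # too old or too far in the future; limits replay
    expected = sign(key, app_id, timestamp, body)
    return hmac.compare_digest(expected, signature)


if __name__ == "__main__":
    keys = KeyService()
    key = keys.issue("app-1")  # done once, after the app authenticates
    body = b'{"action": "login"}'
    ts = int(time.time())
    signature = sign(key, "app-1", ts, body)
    print(verify(keys, "app-1", ts, body, signature))
```

The application authenticates once to obtain its key, and after that every log request can be verified without a round trip through the API gateway.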
Overall, choosing the right solution depends on your current scale and on the scale you expect your service to reach over the next couple of years (you don’t want to find yourself rewriting everything 6 months from now).
If you are processing petabytes of data, go with option 3 (direct, token-based access to the digestion service). If you are expected to handle only hundreds of thousands of logs a day, you can go with option 2 (a queue behind the API).
Keep in mind, both options leave you with a resilient, bulletproof solution and lay out a solid foundation for scaling your customers’ logs.