Tracing in Complex Microservices Architecture

June 09, 2020 | Aviad Mizrachi | 5 min read


The concept of microservices architecture (MSA) has existed for at least 15 years. With the growing trend toward microservices, it seems you need a real “out of this world” use case to develop a monolithic application (a quick glance at Google Trends tells that story). But the story of microservices tracing is a bit more complex than that…

Pros vs Cons

The advantages of building a distributed microservices environment are clear:

Easier to maintain: Each microservice is generally built to solve a specific problem, and its code base is relatively small. That makes it easier to maintain and easier to onboard new developers.
Faster CI/CD: Huge applications take time to test, build, migrate and deploy. When the services are small, the CI/CD process is much leaner and quicker.
Easier to test: As mentioned before, each micro-service is meant to solve a specific problem, which makes testing that micro-service much easier.
Easier to scale: With the MSA approach it is much easier to control the scaling granularity of the service. If one of the flows needs more computing power and memory, there’s no need to scale the entire application, just the relevant micro-service.

That being said, the MSA approach presents its own challenges:

Deployment complexity: For each of the micro-services, we need to create and maintain a deployment configuration. We want to have a single DB for each of the services, maintain migrations and collect metrics.
Monitoring and metrics: Monitoring a micro-services based environment is more complex than monitoring a single process and DB. You normally have to set up a Prometheus server or statsd daemon.
Testing full flows: Even though testing a single service becomes much easier with the MSA approach (as mentioned before), developers who need to test full flows now have to set up a range of Docker containers, databases, queues etc. on their machines.
In today’s advanced containerized cloud environments, going to the next level means that the DevOps engineers set up an environment at the click of a button for each pull request. (More on that in an upcoming blog post.)

Building the flows

So now you have started building a micro-service based application and have separated the relevant flows, by domain or scale, into the relevant micro-services.

Moving from this monolith:

To this:

In most real-life scenarios, our microservices will need to communicate with each other via REST/Queues in order to complete business flows.

Most microservices-based environments will log to the popular Elasticsearch. But how do you keep track of a flow across the GBs of logs sent to the Elasticsearch cluster? How do you apply microservices tracing?

Let’s take a simple authentication flow for example:

In this sample we have 4 different microservices taking part in the flow…
All of them are async… all of them are logging… Now let’s think of this at high scale!

We open the popular Kibana, which shows a mix of logs from different services and different flows… All mixed…

And if a customer calls claiming they cannot log in, how do we find the actual problem in the tons of developer logs that we have?

Organizing the clatter is easy

Correlating logs on a multi-tenant, micro-service environment is quite simple when you think about it.

Generate the context on your API gateway
Pass it to the micro-services via queues / REST API
Implement a main logger which will write all that information to your Logstash
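
The first two steps above can be sketched roughly as follows. This is a minimal illustration and not the implementation used in this post: the header names, the RequestContext fields (a trace id plus a tenant id) and the JSON envelope for queue messages are all assumptions made for the example.

```python
import json
import uuid
from dataclasses import asdict, dataclass

# Hypothetical header names; the post does not specify which ones are used.
TRACE_HEADER = "x-trace-id"
TENANT_HEADER = "x-tenant-id"


@dataclass(frozen=True)
class RequestContext:
    trace_id: str
    tenant_id: str


def create_context(tenant_id: str) -> RequestContext:
    """Step 1: the API gateway creates the context once per incoming request."""
    return RequestContext(trace_id=uuid.uuid4().hex, tenant_id=tenant_id)


def to_headers(ctx: RequestContext) -> dict:
    """Step 2 (REST): attach the context to the outgoing HTTP headers."""
    return {TRACE_HEADER: ctx.trace_id, TENANT_HEADER: ctx.tenant_id}


def from_headers(headers: dict) -> RequestContext:
    """Receiving service: rebuild the context from the incoming headers."""
    lowered = {key.lower(): value for key, value in headers.items()}
    return RequestContext(lowered[TRACE_HEADER], lowered[TENANT_HEADER])


def to_message(ctx: RequestContext, payload: dict) -> str:
    """Step 2 (queues): wrap the payload in an envelope carrying the context."""
    return json.dumps({"context": asdict(ctx), "payload": payload})


def from_message(raw: str):
    """Consumer side: split the envelope back into context and payload."""
    envelope = json.loads(raw)
    return RequestContext(**envelope["context"]), envelope["payload"]
```

The important design point is that the trace id is created exactly once, at the edge, and every downstream service only reads and forwards it.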

Now we have the context passing between the micro-services and we can correlate the logs.
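
A main logger of this kind could look something like the sketch below, which uses only the Python standard library. The field names (trace_id, tenant_id, service) and the one-JSON-object-per-line output are assumptions for illustration; shipping the lines on to Logstash is left out.

```python
import contextvars
import json
import logging

# Holds the context of the request currently being handled. Each asyncio task
# and each thread sees its own value, so concurrent requests do not mix.
current_context = contextvars.ContextVar("current_context", default=None)


class ContextFilter(logging.Filter):
    """Copies the current context onto every log record."""

    def filter(self, record):
        ctx = current_context.get() or {}
        record.trace_id = ctx.get("trace_id", "-")
        record.tenant_id = ctx.get("tenant_id", "-")
        return True


class JsonFormatter(logging.Formatter):
    """Renders each record as a single JSON line."""

    def format(self, record):
        return json.dumps({
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "service": record.name,
            "trace_id": record.trace_id,
            "tenant_id": record.tenant_id,
            "message": record.getMessage(),
        })


def build_logger(service_name, stream=None):
    """Creates a logger whose output always carries the current context."""
    logger = logging.getLogger(service_name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)  # defaults to stderr
    handler.setFormatter(JsonFormatter())
    handler.addFilter(ContextFilter())
    logger.addHandler(handler)
    return logger
```

A service would call current_context.set({...}) when it receives a request or message, and every log line written while handling it then carries the same trace id without the developer passing it around by hand.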

This is how it looks on our Kibana after we implement this method:

And we can now filter based on our trace id and see the entire flow between the micro-services:

Yay! Now we are seeing the complete flow for this request (in this sample, our Audits flow, going from the API gateway to the metadata service, back to the audits service and to the client via the API gateway). Our microservices tracing is now in place.

And what happens the next time we need to identify what is wrong with a customer login, or any other issue we run into? Should we go into Kibana?
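
For readers who want to see the filter as a query rather than a screenshot, here is a rough sketch. It assumes the logs are indexed with a trace_id field mapped as a keyword and a timestamp field; both names are hypothetical. In Kibana's search bar the equivalent would be a query along the lines of trace_id : "some-id".

```python
def trace_query(trace_id: str) -> dict:
    """Elasticsearch query body: every log with this trace id, oldest first."""
    return {
        "query": {"term": {"trace_id": trace_id}},
        "sort": [{"timestamp": {"order": "asc"}}],
    }


def filter_trace(logs: list, trace_id: str) -> list:
    """In-memory equivalent of the query above, handy for local experiments."""
    matching = [entry for entry in logs if entry.get("trace_id") == trace_id]
    return sorted(matching, key=lambda entry: entry["timestamp"])


# Made-up sample records, only to show the shape of the result.
sample_logs = [
    {"timestamp": "2020-06-09T10:00:02Z", "service": "audits", "trace_id": "a1"},
    {"timestamp": "2020-06-09T10:00:00Z", "service": "api-gateway", "trace_id": "a1"},
    {"timestamp": "2020-06-09T10:00:01Z", "service": "metadata", "trace_id": "a1"},
    {"timestamp": "2020-06-09T10:00:01Z", "service": "auth", "trace_id": "b2"},
]

flow = filter_trace(sample_logs, "a1")
print([entry["service"] for entry in flow])
```

Sorting by timestamp is what turns a pile of matching lines into a readable flow from one service to the next.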

If you have any further questions about how we monitor tenants and segments on our ELK stack, feel free to contact me at aviad@frontegg.com
