Webhooks are now an industry standard. Like it or not, there are barely any enterprise-facing products that don’t have webhook notifications as an integral part of them.
But why do we need webhooks integration in our products?
And how do we implement them?
In this post we will try to cover the main aspects you have to keep in mind when implementing webhooks, before adding them to your product.
The idea of webhooks is to expose events from a software product so that the product’s customers can act upon these events, carry out automations and more.
A real-life example is getting notifications such as this sample from GitHub:
Other examples include getting webhooks when creating or modifying tickets in JIRA:
So you get the idea. The way it works is pretty simple.
When an event is triggered in a product, the product handles it and then lets all the webhook subscribers know that this event has occurred by sending them an HTTP call (the webhook itself).
Something like this:
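As a minimal sketch of the sender side, assuming a hypothetical subscriber list and an illustrative payload shape (not a real GitHub or JIRA schema):

```python
import json
import urllib.request

def build_event(event_type, data):
    # Illustrative payload shape -- real products define their own schema.
    return json.dumps({"event": event_type, "data": data})

def send_webhook(url, body):
    # One HTTP POST per subscriber -- this call IS the webhook.
    req = urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return resp.status

body = build_event("ticket.created", {"id": 42, "title": "New ticket"})
# for url in subscribers:   # `subscribers` is a hypothetical list of URLs
#     send_webhook(url, body)
```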
Sounds easy, right?
Well it is!
But there are some items to pay attention to when it comes down to the actual implementation:
Securing webhook calls is crucial for a proper implementation.
We need to make sure that the receiving side of the webhook can validate the original request and protect it by using one, or all, of the following:
IP whitelisting
One of the common methods to allow or deny requests and to “authenticate” the sender is IP whitelisting. While this method can work in some cases, in today’s dynamic cloud environments (not to mention serverless architectures) the list is hard to maintain and can easily break the integration.
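A receiver-side allowlist check might look like the sketch below; the network ranges are purely illustrative (real senders would publish their egress ranges):

```python
import ipaddress

# Illustrative allowlist -- in practice, taken from the sender's docs.
ALLOWED_NETWORKS = [ipaddress.ip_network("203.0.113.0/24")]

def is_allowed(remote_ip):
    # Accept the request only if the source IP falls in a known range.
    addr = ipaddress.ip_address(remote_ip)
    return any(addr in net for net in ALLOWED_NETWORKS)
```

Note that these ranges change whenever the sender's infrastructure moves, which is exactly why this method is fragile in cloud environments.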
Pre-shared key
This method enables the webhook sender to add a header to each of the requests. The value of the header should contain a pre-shared key that was shared and configured between the two parties (the webhook sender and receiver).
The receiver’s responsibility is to validate the header and make sure it actually contains the value configured by the parties.
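A sketch of that receiver-side check; the header name and key value here are assumptions, use whatever both parties agreed on:

```python
import hmac

# Illustrative pre-shared key, configured out-of-band by both parties.
PRE_SHARED_KEY = "s3cr3t-configured-by-both-parties"

def validate_request(headers):
    # The header name "X-Webhook-Key" is an assumption for this sketch.
    provided = headers.get("X-Webhook-Key", "")
    # Constant-time comparison avoids leaking the key via timing.
    return hmac.compare_digest(provided, PRE_SHARED_KEY)
```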
Replay attack prevention
The use of pre-shared keys exposes the webhook receiver to MITM attacks (specifically, replay attacks).
This is what it looks like (in this case, password sniffing):
The attacker can hijack the request and reuse it over and over again (for example to wire funds as part of the automation).
In order to protect the receiving end from replay attacks, the webhook sender is expected to send the request’s origination timestamp along with a “ValidUntil” header, which is usually limited to 10–20 seconds after that timestamp.
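The receiver-side freshness check can be sketched as follows; the 20-second cap matches the window mentioned above, and the timestamps are plain epoch seconds for simplicity:

```python
import time

MAX_WINDOW_SECONDS = 20  # matches the 10-20 second window discussed above

def is_fresh(sent_at, valid_until, now=None):
    # Reject replayed or stale requests based on the sender's timestamps.
    now = time.time() if now is None else now
    # A window wider than allowed is itself suspicious.
    if valid_until - sent_at > MAX_WINDOW_SECONDS:
        return False
    return sent_at <= now <= valid_until
```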
The well-known JWT mechanism is an all-in-one solution for the security loopholes and complexities inherent in the approaches mentioned above. JWT lets us combine both the shared secret and the “ValidUntil” expiry in a single header, without reinventing the wheel.
Using this approach we will sign a token using symmetric HS256 JWT, and set it to expire after 20 seconds.
That single token gives us both sender validation and replay-attack prevention.
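To make the mechanics concrete, here is a stdlib-only sketch of HS256 signing and verification with a 20-second `exp` claim; in production you would use a vetted library such as PyJWT rather than rolling your own:

```python
import base64
import hashlib
import hmac
import json
import time

SECRET = b"shared-secret"  # illustrative; configured out-of-band

def _b64(data: bytes) -> bytes:
    # JWT uses unpadded URL-safe base64.
    return base64.urlsafe_b64encode(data).rstrip(b"=")

def make_token(now=None):
    # Sender side: HS256-signed token that expires 20 seconds from `iat`.
    now = int(time.time()) if now is None else now
    header = _b64(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64(json.dumps({"iat": now, "exp": now + 20}).encode())
    signing_input = header + b"." + payload
    sig = _b64(hmac.new(SECRET, signing_input, hashlib.sha256).digest())
    return (signing_input + b"." + sig).decode()

def verify_token(token, now=None):
    # Receiver side: check the signature, then the expiry claim.
    now = int(time.time()) if now is None else now
    header, payload, sig = token.encode().split(b".")
    signing_input = header + b"." + payload
    expected = _b64(hmac.new(SECRET, signing_input, hashlib.sha256).digest())
    if not hmac.compare_digest(sig, expected):
        return False
    padded = payload + b"=" * (-len(payload) % 4)
    claims = json.loads(base64.urlsafe_b64decode(padded))
    return now < claims["exp"]
```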
Cool, isn’t it? 😎
Scale & Fault tolerance
As the product and customer base grow, we need to handle more and more webhook calls, potentially millions per minute.
We need to keep in mind that webhook functionality is what customers build their automation flows on, and in most cases timing and reliability are crucial.
You DON’T want to find yourself with these types of notifications on your statuspage:
So scale is important, but not enough. We also need to make sure that in the case of network disconnections or drops we are able to retry sending — first automatically and then manually — so that the customer’s automation is not damaged.
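The automatic part of that retry policy is commonly implemented as exponential backoff; a sketch, where the attempt count and delays are illustrative:

```python
import time

def send_with_retries(send, max_attempts=3, base_delay=0.5):
    # `send` is any callable that raises on failure (network error, 5xx, ...).
    for attempt in range(1, max_attempts + 1):
        try:
            return send()
        except Exception:
            if attempt == max_attempts:
                raise  # give up; hand off to a dead-letter queue / manual retry
            # Exponential backoff: 0.5s, 1s, 2s, ...
            time.sleep(base_delay * 2 ** (attempt - 1))
```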
So how do we achieve both scale and fault tolerance?
One of the popular methods we tend to use with these kinds of challenges is a streaming platform like Apache Kafka.
This allows greater flexibility and follows the separation-of-concerns (SoC) principle. More importantly, it allows us to scale and partition in order to meet growing demand and load on webhook handling, and to handle failed items on separate partitions and log indexes.
The flow using Apache Kafka will look like this:
Why should we choose Apache Kafka for this use case? The added values are:
- Horizontal scale – we can grow the cluster and add more handlers as our scale and demand grow.
- Partitioning – Kafka’s partitioning lets us “isolate” problematic recipients without blocking the rest of the recipients.
- Message replay – by moving the offset of a specific topic we can replay problematic and failed messages, effectively “going back in time” in the case of errors.
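To illustrate the partition-per-recipient idea without a running broker, here is a toy in-memory stand-in for the Kafka log; with real Kafka you would use a client such as confluent-kafka and key each message by recipient, so that a slow or failing recipient only backs up its own partition:

```python
from collections import deque

class WebhookLog:
    # Toy stand-in for a partitioned Kafka topic (NOT a Kafka client).
    def __init__(self, num_partitions=4):
        self.partitions = [deque() for _ in range(num_partitions)]

    def produce(self, recipient, message):
        # Stable key -> partition mapping, like Kafka's default partitioner:
        # the same recipient always lands on the same partition.
        idx = hash(recipient) % len(self.partitions)
        self.partitions[idx].append((recipient, message))
        return idx

log = WebhookLog()
p1 = log.produce("customer-a.example.com", {"event": "ticket.created"})
p2 = log.produce("customer-a.example.com", {"event": "ticket.updated"})
# p1 == p2: customer-a's backlog is confined to a single partition,
# so other recipients' partitions keep draining independently.
```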
Webhooks are the de facto standard for communication between systems and platforms.
This post reviewed the main items to pay attention to when implementing webhooks in a product; taking them into consideration will help you avoid security and scaling pitfalls.
If you want to discuss this further with us, we are here! Your partners in SaaS.