To Protect To Serve
As a consultant, an architect, and a coach, I have worked with dozens of teams in the last 10 years to develop architectures that adopted API-First as a paradigm; teams that adopted the microservice architecture style to develop cloud native software products. In practically all cases the teams succeeded in developing APIs that proved to be robust and resilient, provided their consumers played by the rules. In almost all cases, the APIs turned out to be incapable of handling misbehaving or malicious consumers, at least for the first two or three versions of their implementation. The reason for this poor performance lies, in my experience, in their documented behavior, or rather in the lack of that documentation.
This article is based on my observations over the past decade and especially the second half of it. I made these observations while being involved in solving problems that product teams were facing in several organizations of various sizes, spanning different markets. Your particular situation might be the complete opposite of the picture I am painting. In that case, continue reading, feel good about yourself and your product, and treat it as a reminder to always do the right thing and stay diligent.
Passionate for APIs
If you are familiar with my blogs and articles, you know that I am very passionate about adopting API-first concepts. There are quite a few blogs by me on this topic. I am particularly passionate about service-oriented architectures that follow the best practices around service interfaces. An important aspect of interfaces is what we call ‘Interface Protection’.
Over the past decade I have worked with various teams that very actively embraced APIs, both providing and consuming them.
In the last year or so, I have been involved with various APIs and shared my perspective on them in the role of architecture consultant. A general observation I made was that the functionality exposed through a service interface was usually defined from a technology perspective instead of a business perspective. While working with the teams developing the API it became clear to me that this is the result of the engineering teams only considering the data that is exchanged between the service consumer and producer. Hardly ever was the information flow considered. This lack of business focus is common, and it is the cause of many challenges faced by teams developing APIs and microservices.
There are always very understandable reasons why the technology perspective is adopted, but hardly ever do I think that these reasons are good or justifiable. I will cover this extensively in another post. In this article I will focus on an important challenge that comes with constructing services that are exposed by products to be consumed by other products: interface protection.
The Misbehaving API-Consumer
As coach and architect to teams working on APIs I have often been involved in situations where the assumptions around service and API consumers turned out to be quite wrong. In several cases I was asked to support teams that found themselves in a tough spot. The API implementation was rather naïve, assuming that its consumers abided by all the rules, behaved as ideal citizens and followed the happy and sunny path. In other words, the implementation did not consider consumers accidentally misbehaving or even being outright malicious.
Imagine a consumer that provides incorrect data when consuming an API, that is not authorized or not even authenticated while consuming the API, or that keeps calling the API even when previous calls are consistently failing.
These are just a few cases that I have come across with different teams over the last 10 years or so.
The consumer is not always to blame though. It is not always the real culprit as I have seen time and again that there is hardly any documentation around the interface that is exposed. With teams that only recently started to develop APIs or teams that only develop APIs for internal consumption there often is only documentation in the form of OpenAPI files. These files typically cover only a happy scenario highlighting how the APIs will behave when the consumer ‘plays by the rules’. I agree that the intended behavior is in most cases obvious. But stating the obvious is sometimes required.
It becomes problematic when the scenario is not that happy, as in the examples I had you visualize before. Documentation around how the service should behave in case of a not so happy scenario is in too many cases nowhere to be found. Mind that these situations are hardly ever documented as required behavior either. And with a lack of requirements, there is no reason to expect explicit tests for these scenarios to be developed.
In my experience, the interface documentation typically only mentions the HTTP error codes returned by an API. Hardly ever will the documentation mention the behavior of the API in case the consumer is misbehaving. When I ask the engineers for this information, they refer to the code that implements the service instead of the documentation that describes the interface. In my humble opinion, this is a bad practice, since the implementation should only cover described behavior. Moreover, the behavior of the API producer depends on the behavior of the API consumer, and behavior like throttling is hard to deduce from the source code. When a consumer is allowed to call the API once every 5 seconds, a call made 4 seconds after the previous one will be handled differently by the API producer than a call made after 5 seconds or more.
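To make the throttling example concrete, here is a minimal sketch of a producer-side rule that allows one call per consumer every 5 seconds. The class name `MinIntervalThrottle`, the method `status_for`, and the choice to answer with status 429 (Too Many Requests) are assumptions made for illustration; the article does not prescribe a particular implementation or response code.

```python
from dataclasses import dataclass, field
from typing import Dict

# Assumed documented limit: one call per consumer every 5 seconds.
MIN_INTERVAL_SECONDS = 5.0


@dataclass
class MinIntervalThrottle:
    """Rejects a call that arrives too soon after the previous accepted call."""

    min_interval: float = MIN_INTERVAL_SECONDS
    # Timestamp (in seconds) of the last accepted call per consumer.
    last_accepted: Dict[str, float] = field(default_factory=dict)

    def status_for(self, consumer_id: str, now: float) -> int:
        """Return the HTTP status the producer would answer with at time `now`."""
        last = self.last_accepted.get(consumer_id)
        if last is not None and now - last < self.min_interval:
            return 429  # Too Many Requests: the consumer called too early
        self.last_accepted[consumer_id] = now
        return 200
```

The point of the sketch is that the limit is a single named value that can be copied into the interface documentation and verified by a test, instead of behavior a consumer has to discover in production.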
My perception is that service implementations and APIs lack explicit interface protection. The protection is not by design nor is the wanted, needed, or requested behavior documented as such. Too much is done implicitly, making it impossible to provide real protection from misbehaving and malicious API consumers.
In cases where I have been asked to join teams to find a resolution for a production problem, the service consumers are not free from blame in at least partly causing the problem. The typical service consumer did not handle error responses in a way that is in line with best practices. I am fully aware that my perception is skewed, since I get involved when things are going wrong or have gone wrong. Therefore, what I typically observe is possibly not fully representative of what is common within service consumer code.
Service consumers are almost never implemented such that they will gracefully handle the not so sunny scenarios, even when these are properly documented. How the consumer should behave when a status code other than 200 is returned was not defined in any of the situations I have been involved in. Furthermore, behavior related to unwanted scenarios is not documented, let alone implemented.
Endlessly retrying an API call while not authorized, however, is not uncommon behavior. It is also not uncommon for an API consumer to consistently provide ill-formed input to an API and not respond to the resulting errors in a meaningful way.
Producing HTTP Status Codes
Many of the software engineers I worked with who started consuming APIs over HTTP were not familiar with the HTTP status codes and what they mean. Not even code 200, which means OK. I almost never see consumers verify that the status code returned by an HTTP call is 200; it is assumed that when it is not a handled code it must be 200. This implies that every code that is not explicitly handled is in fact treated as 200, i.e. HTTP OK.
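A consumer can avoid the "anything unhandled is 200" assumption by classifying every status code explicitly. The sketch below is illustrative only: the function name `describe_status` and the returned labels are hypothetical, and only `http.HTTPStatus` comes from the Python standard library.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Classify an HTTP status code explicitly instead of assuming success."""
    if code == HTTPStatus.OK:
        return "ok"
    if 200 <= code < 300:
        return "success-but-not-200"
    if 400 <= code < 500:
        return "client-error"
    if 500 <= code < 600:
        return "server-error"
    # Anything else (1xx, 3xx, or out of range) is deliberately NOT treated as OK.
    return "unhandled"
```

The last branch is the important one: a code the consumer did not anticipate ends up in its own category, where it can be logged or escalated, rather than silently being processed as a successful response.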
There are 5 categories of HTTP status codes. These 5 categories allow for 500 different situations to be documented and to be identified. Many common use cases have a standard status code already defined.
1xx Informational response
2xx Success
3xx Redirection
4xx Client errors
5xx Server errors
Service providers are supposed to document as part of the interface documentation which situation is causing what error code to be returned, or rather which behavior is resulting in what error code. Multiple calls by the same consumer within a short timeframe resulting in an error 401 (Unauthorized) may cause an automated rule in an API gateway or even the firewall to kick in and ban that consumer for 30 minutes for example. This behavior is implementation specific and will have to be documented because it impacts the API consumer.
It is recommended to explicitly test this behavior and make it part of regression tests when testing the service, as it protects the interface/service from misbehaving and malicious consumers.
Consuming HTTP Status Codes
Similarly, the service consumer will have to be able to handle this behavior and prevent unwanted behavior from the service provider from kicking in because the consumer is misbehaving. In the example above, the service consumer should make sure that it is not banned from accessing the service due to repeatedly calling the service while not authorized. That behavior might cause the backend to completely ban the client on which the consumer runs, for example when the firewall is closed for the client IP address because it is considered a malicious actor.
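On the consumer side, this translates into a retry policy that never repeats a call the provider has rejected as unauthorized. The following is a minimal sketch under stated assumptions: the function `call_with_bounded_retries`, the `RETRYABLE` set, and the default attempt count and delays are all hypothetical choices, and `send` stands in for whatever performs the actual HTTP call and returns its status code.

```python
import time
from typing import Callable

# Assumption: only these statuses indicate a transient condition worth retrying.
RETRYABLE = {429, 500, 502, 503, 504}


def call_with_bounded_retries(
    send: Callable[[], int],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call `send` until it returns a non-retryable status or attempts run out."""
    status = send()
    for attempt in range(1, max_attempts):
        if status not in RETRYABLE:
            break  # e.g. 401: repeating the call cannot help and may trigger a ban
        sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff between attempts
        status = send()
    return status
```

With these defaults a 401 is returned to the caller after a single attempt, while a 503 is retried at most twice, with waits of 0.5 and 1.0 seconds. Passing `sleep` as a parameter keeps the policy testable without real delays.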
As we are maturing our products to benefit from the cloud, we also must be aware that with this newly gained great power comes great new responsibility. More and more companies are working towards API-first concepts and are venturing into the realms of microservices. Both concepts are among the most complex models in software engineering. Apart from the technical challenges that come with them, both must be driven through a business architecture to be applied successfully. When building robust services and API implementations, it is key to understand and appreciate the importance of testing the different scenarios that are applicable. Not just the happy scenario, but especially the implementation’s behavior in not so happy scenarios must be tested explicitly because that is where the interface protection is most relevant.
Behavior and Test-Driven Development
I recommend the use of the domain-specific language (DSL) Gherkin to document scenarios that describe your API’s or service’s behavior, and to implement that behavior. As an added benefit, you will be able to protect your interface iteratively and incrementally, as needed.
Interface protection also includes the principle that the interface itself does not change from a consumer perspective; that is, an update of the interface itself must not break existing consumers of that interface.
This means that an interface can only expand the input data it accepts, and must still accept all input data that was previously accepted. One way to do this is to allow only additional input parameters and to define reasonable defaults in case the service is called without the new parameters. Compatibility can be verified by ensuring that the (automated) tests also include calls to the interface according to a previous interface definition; these calls need to continue to succeed after the interface is updated. Such tests can be thought of as ‘Behavioral Regression Tests’.
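As an illustration of expanding an interface without breaking existing consumers, consider the sketch below. The operation `create_order`, its parameters, and the default value "EUR" are invented for this example and do not come from any real API; the assumption is that `currency` was added in a second version of the interface.

```python
def create_order(item_id: str, quantity: int, currency: str = "EUR") -> dict:
    """Version 2 of a hypothetical operation; `currency` was added with a default."""
    return {"item_id": item_id, "quantity": quantity, "currency": currency}


def test_v1_call_still_succeeds() -> None:
    # Behavioral regression test: a call shaped like the previous interface definition.
    assert create_order("sku-1", 2) == {
        "item_id": "sku-1",
        "quantity": 2,
        "currency": "EUR",
    }


def test_v2_call_uses_new_parameter() -> None:
    # New consumers can opt in to the added parameter.
    assert create_order("sku-1", 2, currency="USD")["currency"] == "USD"
```

Keeping the first test in the suite after every interface update is what turns it into a regression test: it fails as soon as a change makes the new parameter mandatory or alters the default.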
Describing behavior through scenarios
Scenarios take the form:
Given some initial situation
When something happens
Then something is the result.
The described ‘401 error’ scenario would then be something like:
Given the service consumer is not authorized
And the service is just called by that consumer
When the service is called again within a second by the same consumer
Then an error-response with error code 401 is returned
And the firewall is reconfigured to block the consumer for 30 minutes
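The scenario above can be turned into an executable test almost line by line. The sketch below uses plain Python rather than a Gherkin runner, and it models the firewall as an in-memory dictionary; the class `UnauthorizedCallGuard`, its methods, and the consumer name are hypothetical. Only the one-second window and the 30-minute block are taken from the scenario.

```python
from dataclasses import dataclass, field
from typing import Dict

BAN_SECONDS = 30 * 60  # from the scenario: block the consumer for 30 minutes
REPEAT_WINDOW = 1.0    # from the scenario: called again within a second


@dataclass
class UnauthorizedCallGuard:
    """In-memory stand-in for the service plus the firewall rule."""

    last_unauthorized: Dict[str, float] = field(default_factory=dict)
    banned_until: Dict[str, float] = field(default_factory=dict)

    def handle(self, consumer: str, authorized: bool, now: float) -> int:
        if authorized:
            return 200
        previous = self.last_unauthorized.get(consumer)
        self.last_unauthorized[consumer] = now
        if previous is not None and now - previous <= REPEAT_WINDOW:
            # Stand-in for "the firewall is reconfigured to block the consumer".
            self.banned_until[consumer] = now + BAN_SECONDS
        return 401

    def is_blocked(self, consumer: str, now: float) -> bool:
        return now < self.banned_until.get(consumer, 0.0)


def test_repeated_unauthorized_call_blocks_consumer() -> None:
    guard = UnauthorizedCallGuard()
    # Given the service consumer is not authorized
    # And the service is just called by that consumer
    assert guard.handle("consumer-a", authorized=False, now=100.0) == 401
    # When the service is called again within a second by the same consumer
    status = guard.handle("consumer-a", authorized=False, now=100.5)
    # Then an error response with error code 401 is returned
    assert status == 401
    # And the consumer is blocked for 30 minutes
    assert guard.is_blocked("consumer-a", now=100.5 + BAN_SECONDS - 1)
    assert not guard.is_blocked("consumer-a", now=100.5 + BAN_SECONDS)
```

Each Given, When, and Then step appears as a comment next to the statement that realizes it, so the test stays readable next to the documented scenario. A Gherkin runner would bind the same steps to functions instead of comments.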
Applying behavior driven development concepts allows for the definition of a full set of behaviors of a service. This set can be created by specifying requirements through scenarios and will ensure the interface is protected. A common practice is to define behavior for each individual error code that can be returned, and then to specify one or more scenarios that define under which circumstances which response is returned to the consumer. This way, every such scenario can have a relevant consumer-side counterpart that describes how the consumer should respond.
A common critique of BDD (as with TDD for that matter) is that it is too cumbersome, that there is too much overhead. In fact, behavior is tested in any case, but when this is not done through the application of TDD and BDD concepts it is done manually, which proves even more cumbersome, more involved, less consistent, and less adequate. The result is that there is a significant drop in quality of services and APIs. Consequently, interfaces are not protected. In my experience, taking the extra effort up front to at least define the required behavior through scenarios as part of the interface definition already improves the quality and therefore the protection of the interface significantly. Adding explicit testing of the scenarios will take software products several levels higher on the quality scale. Note that through TDD and BDD the code quality metric Code Coverage typically improves significantly.
Need proof? See the fourth episode of my ongoing series on how to do software engineering the right way, where I apply TDD and BDD concepts and as a result reach almost 100% code coverage as a side effect.
APIs are extremely powerful constructs in any software architecture. They are the key to business functionality and data and therefore must be protected from consumers that, by accident or on purpose, are not playing by the rules. Proper documentation of the API’s behavior is key in the protection of these business interfaces. Documentation in the form of use-case scenarios not only defines clearly and unambiguously how to use an API, but also serves as automated tests. By applying BDD and TDD in our development of APIs, we can truly lead in an API economy through high quality software products.
The text very explicitly communicates my own personal views, experiences and practices. Any similarities with the views, experiences and practices of any of my previous or current clients, customers or employers are strictly coincidental. This post is therefore my own, and I am the sole author of it and am the sole copyright holder of it.
Special thanks to my lovely wife Olcay, as well as my friend Sytse who took the time and made the effort to review my article. I am confident that the article’s quality was significantly improved by their feedback.