# Are my services talking to each other?

I am faced with an interesting thought experiment, which asks:

If I can see two of my friends, and I know they should be communicating with each other, what is the simplest way of making sure they are doing so?

Your first instinct is to look at them and listen. What if the communication method is subtler than that? What if you are, metaphorically speaking, deaf, and cannot eavesdrop on their conversation?

A problem like this arises when you have a non-trivial number of distributed components talking to each other, forming a complex network. Let’s start from the basics and consider a simple one:

Figure: A → B → C. Arrows indicate flows of information, i.e. x → y means x sends information to y.

You could assume that A is an event log of, say, financial transactions, B is a message queue, and C is a fast queryable cache for the transactions. We want to be able to query the cache quickly for log events and rely on the message queue to transport them from A to C, preferably without a hard software dependency from A to C.

The illusion is that while there are neither code nor protocol dependencies between A and C, a semantic dependency exists: the one in our heads! A is content with dumping information towards B, but what we’re really interested in is messages getting through all the way to C. So in reality, if we superimpose our perceived dependencies on top of the information flows, we end up with this: an implicit dependency between A and C, alongside the actual flows through B.

## Tolerating faults

What if the chain breaks? What happens when A can’t push messages onward to B, and we get a blackout? Who gets notified? C doesn’t know what’s happening in A; it’s just not getting information! In line with the original question, if I can see both A and C are doing fine, but they’re not talking to each other, where is or who is the broken phone?

With such a simple case as above, pointing this out is easy, so let’s make our network a bit more complicated.

A - an event log; B - a message queue; C - a cache; E - app back-end; P - a user-facing application; I - a business intelligence system; S - a storage system

Let’s assume each one of these components is an independent service, each load balanced and with redundancies that aren’t visible beyond the node itself [1], and that communication is done over a computer network using some protocol.

The depicted network consists of a set of applications that all, in one way or another, build on top of an event log, A. In one branch, there’s a fast queryable cache for the event log, the app back-end is an interface for the cache (like a REST API), and the storage acts as a long-term backup system. The second branch consists of a business intelligence system that analyzes the event log data and does something with it.

Indirectly, there are dependency arrows emanating from the root of the network tree (A) to its leaves S, P and I. From an observer’s perspective, these are the relationships that matter. These are the implicit dependencies. Furthermore, we can see those dependencies, but we build the code in such a way that it does not! The event log simply dumps data to a message queue, and that’s it. What is worse, the implicit dependencies each propagate up the chain. Not only does the leaf node depend on the root node, it also depends on the intermediate nodes.

Implicit dependencies

The inherent hazard in all this, of course, is that there’s a communication error. Even though we (hopefully) built the system following the robustness principle, data isn’t flowing from the root node to the leaf nodes and we have to quickly identify where the disconnect happened.

## Seeing is not enough

Our first instinct is to peer at the logs. So we go through each edge in the network and see if there’s a fault. For n nodes, this means looking at no fewer than n-1 edges for each fault! Moreover, the problem isn’t fixed by using something that gives me visibility of the nodes, like ZooKeeper or other service discovery tools. This is because I am interested in the flow of information from one node to another. The thought experiment already assumes that the nodes are there; only the communication between them is broken.

In the Internet world, with the Transmission Control Protocol, communication is made reliable using error-checking and acknowledgments. That means that if A were a network element and wanted to send things over to C, then in case of a successful delivery C would acknowledge this back to A.

For various reasons, it may be that in a distributed service network this approach is not feasible. This is the cost of abstractions: when you enforce loose coupling, you have to deal with the consequences of looseness. We could make the transaction log aware of the user-facing application, but that may be overkill.

For the particular problem of acknowledging from a message queue root to a consumer leaf, there are various solutions. You can implement this on your own, which, while laborious, essentially follows the principle of error-checking. The caveat is that this grows in complexity with every new node. The other option is to use a message queue (one of these things is not like the others) that supports this natively.

## The rescue signal

We could build a centralized logging system to which each node logs its events. This centralized system contains all events from all nodes. To make the data meaningful, you need to construct a way to determine the flow of information, that is, to group events together semantically. Worse, the system will require manual or semi-automated inspection to determine when any event is missing its acknowledgment, for example when A logged that it sent Foo to the message queue but the application back-end E never processed it.
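Once events can be correlated, the check itself is small. Here is a minimal sketch, assuming every logged event carries a correlation id and the name of the node that logged it; the `event` record, the node names and `unacknowledged` are all made up for illustration.

```ocaml
(* Hypothetical shape of an event in the central log. *)
type event = { id : int; node : string }

(* Ids that [sender] logged but [receiver] never did. *)
let unacknowledged ~sender ~receiver (events : event list) =
  let ids_of n =
    List.filter_map (fun e -> if e.node = n then Some e.id else None) events
  in
  let received = ids_of receiver in
  List.filter (fun id -> not (List.mem id received)) (ids_of sender)

(* A sent events 1, 2 and 3, but E only ever processed event 1. *)
let central_log =
  [ { id = 1; node = "A" }; { id = 2; node = "A" };
    { id = 1; node = "E" }; { id = 3; node = "A" } ]

let missing = unacknowledged ~sender:"A" ~receiver:"E" central_log
```

Here `missing` holds the ids 2 and 3. The hard part is not this function but agreeing on the correlation id across every node.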

A system like this could work using an FRP approach: since FRP signals map exactly to discrete events, one could build a rule engine. By integrating time flow and compositional events, a centralized system could use its rule engine to listen to signals. A signal can be any event, e.g., a financial transaction that was logged into the event log. You can combine this signal with another event in a system that consumes transactions and does something with them, like the business intelligence system. The sum of these two signals implies that “a financial transaction was consumed by the business intelligence system”. This is also a signal!

Building an FRP-based rule engine isn’t easy: you’d need to construct a rule engine that can map diverse data events into high-level signals and then create additional logic for summing the signals.

The sum of two signals is another signal. (Oh hey, this makes it a semigroup!)
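To make the semigroup remark concrete, here is a small sketch. Strictly speaking a semigroup also needs the sum to be associative, which holds here because list append is. Representing a signal as a list of event descriptions is my assumption, as are the names `signal` and `<+>`.

```ocaml
(* A signal is modelled, hypothetically, as the events it stands for. *)
type signal = Signal of string list

(* Summing two signals yields another signal. *)
let ( <+> ) (Signal a) (Signal b) = Signal (a @ b)

let logged = Signal [ "transaction logged in A" ]
let queued = Signal [ "transaction queued in B" ]
let consumed = Signal [ "transaction consumed by I" ]

(* Associativity: the grouping of the sums does not matter. *)
let left = (logged <+> queued) <+> consumed
let right = logged <+> (queued <+> consumed)
```

Both `left` and `right` are the same signal, so a rule engine is free to combine partial signals in whatever order they arrive.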

Once such a system is built, it can be queried to determine the state of the network quite efficiently (and perhaps elegantly), but it does not introduce any fault tolerance and will only tell you where data is moving, but not where it isn’t.

I guess that most of this underlines the difficulties of unraveling a monolith into microservices. Keeping track of network traffic is really hard, even at the hardware level (!), so when we push this abstraction to the software level, it is no surprise that it causes problems.

Playing with some toy solutions I thought of something I call a shadow network. Let’s say our principal information source is an event monitor X and we have a leaf node in the information dependency tree that is interested in data originating from X.

Each leaf node sends its data to the shadow node. The shadow node understands the data and can tell where it originated from, thereby seeing the implicit dependencies. The shadow node is effectively a mirror of the root node(s).

In the shadow network, X does not receive any new dependencies nor do the intermediaries, but the leaf nodes each push their actions to the shadow node. The shadow node contains a rule engine that can parse leaf events. A rule is something that identifies a source. It could be anything, from a simple parser (“this looks like Apache logs” → “it came from Apache!”) to something more sophisticated. This introduces a dependency only to leaf nodes, but the problem is that the shadow node has to be kept up to date on how to correctly map events to sources. When you change the format of the data traveling across the network, you have to update the rule engine.
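A rule engine of this kind could look roughly like the sketch below. It assumes events arrive as plain strings and that a prefix is enough to recognise a source; the prefixes, the source names and the helper `starts_with` are invented for the example.

```ocaml
(* A rule inspects a raw event and, if it recognises it, names the source. *)
type rule = string -> string option

(* Hypothetical prefix check standing in for a real parser. *)
let starts_with prefix s =
  String.length s >= String.length prefix
  && String.sub s 0 (String.length prefix) = prefix

let rules : rule list =
  [ (fun e -> if starts_with "GET " e then Some "Apache" else None);
    (fun e -> if starts_with "txn:" e then Some "X" else None) ]

(* The first rule that matches decides the source. *)
let source_of (event : string) : string option =
  List.find_map (fun rule -> rule event) rules
```

With these rules, `source_of "txn:42"` names X as the origin, while an event that no rule recognises yields `None`. Every change to a data format means revisiting `rules`, which is exactly the maintenance burden described above.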

Unfortunately, this doesn’t really help us: you can query the shadow node to get the implied dependencies, but that’s it. So while it requires less effort to develop, disregarding cases where creating rules causes difficulties, it suffers from the same flaw as the centralized FRP engine: it can only tell when data is flowing but not when it isn’t.

This makes both solutions rather untenable for monitoring a microservice architecture, but they can be used in cases where the service network grows large and you are working with opaque layers, that is, you don’t know what’s between the leaves and the root, and you want to construct the implicit dependency graph.

Bolting temporal awareness onto the shadow network works if the data is supposed to be regular. If the consuming leaf expects a tick from the origin(s) every n seconds, the shadow rule engine can be built to be aware of this. If ticks aren’t happening when they are supposed to, you can raise a fault on the implicit dependency. Alas, only regularly occurring data works here, so we’re out of luck for irregular events.
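The tick check can be sketched as follows, assuming the shadow node remembers when it last heard from each source. Timestamps are plain seconds, and the names `last_seen` and `overdue` are illustrative.

```ocaml
(* Last time, in seconds, that each source was heard from. *)
type last_seen = (string * float) list

(* Sources whose latest tick is older than [interval] seconds at time [now]. *)
let overdue ~interval ~now (seen : last_seen) =
  seen
  |> List.filter (fun (_, t) -> now -. t > interval)
  |> List.map fst

(* X ticked 2 seconds ago, Y ticked 10 seconds ago; ticks are due every 5. *)
let faults = overdue ~interval:5. ~now:100. [ ("X", 98.); ("Y", 90.) ]
```

Here only Y is reported as faulty. An irregular source has no meaningful `interval`, which is why this trick does not generalise.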

Either way, the original problem is an interesting one. I suppose the only reliable way of doing things is to do what TCP does: acknowledgment and error checking. While certainly a lot of work, it will be reliable. We all love reinventing wheels, don’t we?

My opinion? Don’t fix what isn’t broken! While we all benefit from loose coupling, and while microservices definitely are most of the time an improvement over monoliths, both bring hurdles and challenges of their own. The bottom line is that networking is not easy, and if one forgets this, problems will occur.

1. So for all intents and purposes the nodes represent services as a whole instead of individual physical units, whatever they may be.

# The expression problem as a litmus test

The expression problem is a famous problem in programming languages.

“The Expression Problem is a new name for an old problem. The goal is to define a datatype by cases, where one can add new cases to the datatype and new functions over the datatype, without recompiling existing code, and while retaining static type safety (e.g., no casts).”

Using interfaces (like in Java) as the datatype example, the problem simply asks whether it is possible to derive the interface and add new methods to the interface, without having to recompile existing code or to resort to using casts.

Obviously, in an OOP language it’s easy to derive interfaces, but the problem uncovers the rigidity of the type system: you can’t modify (i.e. extend) the interface without having to modify all the classes that implement it.

Conversely, in functional programming languages, adding new functions operating on the datatype is easy. Consider the canonical OCaml example:

type shape = Circle of float | Rectangle of float * float

let area shp = match shp with
Circle radius -> radius *. radius *. 3.14159
| Rectangle (width, height) -> width *. height

(* both arms must return a float, hence 4. rather than 4 *)
let vertices shp = match shp with
Circle _ -> infinity
| Rectangle (_, _) -> 4.


So in FP, you could create a function called volume that computes the volume for the existing types, and you needn’t touch the above code. However, as soon as you do that, you realize you’ve made a silly mistake: our shapes are flat so their volume is zero. Quickly, you realize you need a three-dimensional Cube shape.

let volume shp = match shp with
Circle _ -> 0.
| Rectangle _ -> 0.
| Cube a -> a *. a *. a        (* error: Cube isn't defined *)


Oops!

Here’s the onion: to get the Cube working, you’ll have to modify existing code in several places: the definition of shape and both functions, area and vertices.

In OOP, this isn’t the case, since you can just write a new class that implements the IShape interface and be done with it, but the problem arises when you’re adding the volume function, because you need to modify IShape, and thus every class that implements it.

In FP, adding new functions over the datatypes is easy, but adding new cases to the datatype is tricky, because you have to modify existing functions. In OOP, adding new functions over the datatype is hard, because you need to modify each implementation; adding new cases to the datatype is easy, since all you need to do is implement the interface. FP has invented a multitude of ways to deal with this problem, ranging from type classes and traits to protocols; OOP usually solves it with either patterns or open classes. Ruby’s refinements can be used for this purpose as well.
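OCaml has objects too, so the OOP half of the trade-off can be sketched in the same language. This is only an illustration: the class type `shape` plays the role of IShape, and the class names and `total_area` are my own.

```ocaml
(* The interface every shape class must satisfy. *)
class type shape = object
  method area : float
  method vertices : float
end

class circle (r : float) = object
  method area = r *. r *. 3.14159
  method vertices = infinity
end

class rectangle (w : float) (h : float) = object
  method area = w *. h
  method vertices = 4.
end

(* A new case: no existing class is touched. Area here means surface area. *)
class cube (a : float) = object
  method area = 6. *. a *. a
  method vertices = 8.
end

let total_area (shapes : shape list) =
  List.fold_left (fun acc s -> acc +. s#area) 0. shapes

let shapes : shape list = [ new circle 1.; new rectangle 2. 3.; new cube 1. ]
```

Adding `cube` was a purely local change. Adding a `volume` method, on the other hand, would mean editing `shape` and all three classes.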

## Polymorphic variants

That’s a quick introduction to the problem. I think the expression problem is a perfect litmus test of sorts for programming languages, that is, the measure of the expressive power of the language is the quality of the solutions the language presents to the expression problem.

The expression problem is theoretically solvable in any language, but to varying degrees of elegance. In Java one must resort to using the visitor pattern, and in my mind this is the most inelegant way of going about it. I would rate the solutions on a spectrum: at one end is the most basic solution, the visitor pattern, and at the other end we have something like polymorphic variants and type classes. Multimethods and protocols are somewhere in between.

When you compare polymorphic variants of OCaml with Haskell’s type classes, there’s a marked difference in brevity. Polymorphic variants are more succinct than type classes but cannot provide the same level of type safety.

type shape = [ `Circle of float | `Rectangle of float * float ]

let area shp = match shp with
`Circle radius -> radius *. radius *. 3.14159
| `Rectangle (w, h) -> w *. h

let vertices shp = match shp with
`Circle _ -> infinity
| `Rectangle (_, _) -> 4.


This is not too different from the above declaration: the type is surrounded with brackets and the tags are preceded with backticks. Recreating the volume function is easy.

let volume shp = match shp with
`Circle _ -> 0.
| `Rectangle _ -> 0.
| `Cube a -> a *. a *. a


So now I’ve extended the shape type with another tag, Cube, and I haven’t touched the vertices and area functions. The volume function can be written even more succinctly:

let short_volume shp = match shp with
(* no volume in two dimensions! *)
#shape -> 0.
| `Cube a -> a *. a *. a


It is also possible to constrain the polymorphic variants:

let flatten shp = match shp with
#shape as x -> x
(* a cube with side a flattens to an a-by-a rectangle *)
| `Cube a -> `Rectangle (a, a)


The type of this function is [< `Circle of float | `Cube of float | `Rectangle of float * float ] -> [> shape]. The [< `A | `B ] means a closed type: it can be only `A or `B, but nothing else, and [> `Foo ] means “`Foo or something else”. So the flatten function accepts `Circle, `Rectangle or `Cube and returns a shape (or possibly something else). Trying to run flatten (`Sphere 3) produces a type error:

# flatten (`Sphere 3);;
Characters 8-19:
flatten (`Sphere 3);;
        ^^^^^^^^^^^
Error: This expression has type [> `Sphere of int ]
but an expression was expected of type
[< `Circle of float
| `Cube of float
| `Rectangle of float * float ]
The second variant type does not allow tag(s) `Sphere


However, the following code compiles:

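type polytope = [ shape | `Cube of float | `Octahedron ]

let frobnicate pt =
  let flattened = flatten pt in
  match flattened with
  (* illustrative arms: any match that mentions `Octagon will do *)
  #shape -> "flat"
  | `Octagon -> "octagon"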


This compiles, although we didn’t tell the compiler that flatten does not return `Octagon. There are two ways to fix this. The first is to explicitly annotate pt to be of type polytope, which produces this error:

Error: This expression has type polytope
but an expression was expected of type
[< `Circle of float | `Cube of float | `Rectangle of float * float ]
The second variant type does not allow tag(s) `Octahedron


The second way is to constrain the return type with a type annotation. We can make sure that the flatten function returns only flat shapes:

let safe_flatten shp : [< shape] = match shp with
#shape as x -> x
| `Cube a -> `Rectangle (a, a)
| `Sphere r -> `Circle r


Using safe_flatten in place of flatten in frobnicate now produces the error:

Error: This pattern matches values of type [? `Octagon ]
but a pattern was expected which matches values of type shape
The second variant type does not allow tag(s) `Octagon


## Not a silver bullet

Unfortunately, polymorphic variants are problematic. You quickly reach an absurd level of complexity and are forced to use annotations or subtyping to ensure maximal type safety. So although polymorphic variants are nice, and they do let us solve the expression problem, they’re an unsteady compromise between type safety and brevity. You can certainly make elegant abstractions with them but they get unwieldy quickly. They aren’t as efficient as regular variants either.

So what are the options? Since OCaml 4.02, you can use extensible variant types:

type boring_shape = ..
type boring_shape += Circle of float | Square of float

let boring_area shp = match shp with
Circle r -> r *. r *. 3.14159
| Square a -> a *. a
| _ -> 0.

type boring_shape += Rectangle of float * float
let radical_area shp = match shp with
Circle _ as c -> boring_area c
| Square _ as s -> boring_area s
| Rectangle (w, h) -> w *. h
| _ -> 0.


An extensible variant is defined using .., and extension is done with the += operator. The caveat is that you must handle the default _ case in pattern matching. Extensible variants are another neat trick for solving the expression problem.

## A measure of expressive power

The expression problem is a great litmus test that measures the expressive power of a programming language. The actual measurement of the test can be either the brevity of the code or its type safety. The solutions range from the clumsy Visitor Pattern in Java to polymorphic and extensible variants in OCaml and to type classes in Haskell. Clojure and Elixir have protocols that are both quite nice but not so type-safe since both are dynamically typed languages. What is more, since the expression problem is also about type safety, then strictly speaking the problem isn’t valid in a dynamic language. Any Lisper knows that Lisps are super expressive anyway.

# Before we begin

Now that the design of the site is finished, I can finally focus on the essentials.

I’ve decided that this year I will be writing a bit more, here, and elsewhere. To that end, when it comes to this site, I’ve had to perform a simple but challenging task: lowering my standards.

Last year, I did not publish anything because I had absurd standards for content. In my mind, every blog post had to be a thoroughly researched and carefully argued piece, capable of standing the test of time.

This was a monumental mistake.

Researching something thoroughly requires an extraordinary amount of time, and writing opinionated articles that can stand the test of time requires an extraordinary amount of foresight — neither of which I yet have. I have already managed to delete one post which contained opinions I no longer agreed with. I thought its content was rubbish and I was a moron for publishing it, so my only recourse was to delete it, instead of learning from it.

My desire for more content stems from the process of gradual improvement: the more you write the better you get. I cannot do this unless I start from the very basics. I could, of course, simply write guides on how to do XYZ with $THING, but I want to tell stories, not write recipes. This doesn’t mean there won’t be any guides, however!

Another reason for wanting to write more stems from the simple funny fact that somebody reads this. Besides Google Analytics telling me so, a few weeks ago I even received a pull request for typo fixes.

So I know there’s at least one guy who actually reads every word. Disclosure: according to GA, most of the visits are “accidental redirects” (huh?), and actual, longer visits aren’t common, but there were enough for me to extrapolate that there were, at the very least, two readers.

So, that was the background. And now comes the disclaimer, of sorts.

This site is a blog. As such, the writings are, first and foremost, opinion pieces, not research articles, and opinions change. You may find me advocating for stricter type systems one day, and for looser ones the next. I will only guarantee that, at the time of writing, I will argue my points to the best of my abilities.

Longer, more in-depth stories, if ever completed, will be available under the articles category. This is to mark a distinction. To qualify as an article, the writing will a) contain properly researched writing and references and b) be reviewed by somebody else. I’m currently working on something that may one day be considered something like that.

You’re welcome to snoop into the drafts folder of the GitHub repository of the site, but be warned: that stuff is obviously incomplete.

To be continued.