
Observatory monitoring is based on two discrete steps: tracing and correlation. Tracing is done by the nodes. Correlation is internal to Observatory: it is how Observatory couples individual events together, using the tracing tokens.


Initially, we configure a stream. A stream is a directed acyclic graph (DAG) that models the flow of information in a distributed system. So if our system consists of two services, web-server and event-processor, it will initially look like this:

digraph { rankdir=TB; "web-server"; "event-processor"; }

We model this in the configuration as a stream whose nodes are ["web-server", "event-processor"]. This is configured using the following syntax:

name = "my-example-stream"
nodes = ["web-server", "event-processor"]

Streams are composed of nodes and edges. A node is identified by a unique UTF-8 string. An edge connects a pair of distinct nodes. Defining an edge means configuring the rate of monitored information flow.
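The node and edge rules above can be sketched as a small data structure. This is only an illustration: the Stream class and its add_edge method are hypothetical names, not part of Observatory, and the sketch does not check that the graph stays acyclic.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass
class Stream:
    """Hypothetical in-memory model of a stream: named nodes plus directed edges."""
    name: str
    nodes: List[str]
    edges: Set[Tuple[str, str]] = field(default_factory=set)

    def __post_init__(self):
        # A node is identified by a unique string.
        if len(set(self.nodes)) != len(self.nodes):
            raise ValueError("node names must be unique")

    def add_edge(self, source, target):
        # An edge connects two distinct nodes that both belong to the stream.
        if source == target:
            raise ValueError("an edge must connect two distinct nodes")
        for node in (source, target):
            if node not in self.nodes:
                raise ValueError("unknown node: " + node)
        self.edges.add((source, target))


stream = Stream("my-example-stream", ["web-server", "event-processor"])
stream.add_edge("web-server", "event-processor")
```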

How all of this works can be illustrated with a sequence diagram:

participant "Client" as C
participant "Web Service" as WS
participant "Event Processor" as EP
participant "Observatory" as O #00FF88

activate C
C -> WS: ""GET /foo""
activate WS
WS --> O: ""POST (web-server, **abcd1234**, T1)""
note left: async!
activate O

WS -> EP: POST /events ...\nX-Tracing-ID: abcd1234
note left: tracing id\nin headers
activate EP
EP --> O: ""POST (event-processor, **abcd1234**, T2)""
note left: async + upstream id

deactivate O

WS <- EP: ""HTTP 201 Created""
deactivate EP
C <- WS: ""HTTP 200 OK""
deactivate WS
deactivate C


Asynchronous sending

Remember asynchrony

In order to have minimum overhead, it’s important to not block when informing Observatory. In Pythonish pseudo-code, we could do it like this, using grequests:

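import datetime, json, uuid
import grequests

now = datetime.datetime.now(datetime.timezone.utc)
token_uuid = uuid.uuid4()
event = { 'node': 'web-server', 'timestamp': now.isoformat(), 'tracing_id': str(token_uuid) }

# non-blocking! grequests.send() queues the request on a gevent greenlet and returns immediately
grequests.send(grequests.post('http://observatory:1234/events', data=json.dumps(event)))

# ... blocking code ...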

If you send the event synchronously, the blocking code segment is delayed until Observatory responds. If you use an asynchronous HTTP request, the code resumes immediately.


So, node web-server receives an HTTP request. A unique ID, abcd1234, is generated. web-server sends the information packet (web-server, abcd1234, T1) to Observatory, where T1 is the current timestamp in ISO 8601 format. It then sends the request downstream to event-processor, passing the tracing ID in an HTTP request header, X-Tracing-ID: abcd1234. The event-processor system reads this header and correlates the packet by sending (event-processor, abcd1234, T2) to Observatory. Observatory sees that these events are part of a stream, because they share the tracing token, so it starts observing it. Because T2 > T1, it understands that information is flowing from web-server to event-processor. Now our DAG looks like this:

digraph { rankdir=LR; "web-server" -> "event-processor"[label="OK(pass=1/1 100%)", color="#00AA00"]; }

Note that the pass label says 100% because we haven’t yet described a check span, that is, how many successive events we should remember before determining edge health.
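The correlation step described above can be sketched as follows. This is not Observatory's actual implementation: infer_edges is a hypothetical helper that groups packets by tracing ID and orders them by timestamp, and the timestamps are made-up sample values.

```python
from collections import defaultdict
from datetime import datetime


def infer_edges(events):
    """Group (node, tracing_id, timestamp) packets by tracing ID and derive
    directed edges from the timestamp order within each group."""
    by_token = defaultdict(list)
    for node, tracing_id, timestamp in events:
        by_token[tracing_id].append((datetime.fromisoformat(timestamp), node))

    edges = set()
    for observations in by_token.values():
        observations.sort()  # earliest timestamp first
        for (_, upstream), (_, downstream) in zip(observations, observations[1:]):
            if upstream != downstream:
                edges.add((upstream, downstream))
    return edges


# Packets may arrive out of order; the timestamps decide the direction.
events = [
    ("event-processor", "abcd1234", "2024-01-01T12:00:00.120+00:00"),  # T2
    ("web-server", "abcd1234", "2024-01-01T12:00:00.000+00:00"),       # T1
]
print(infer_edges(events))  # {('web-server', 'event-processor')}
```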

Observation windows

The size of an observation window is the sum of the check's thresholds. If you define a check with a time window of 150ms and thresholds of 85 for OK, 10 for FAIL and 5 for WARN, the following rules apply:

  • If 85 or more events complete within 150ms, the edge is considered OK
  • If 10 or more events do not complete within 150ms, the edge is considered FAIL
  • If 5 or more events do not complete within 150ms, the edge is considered WARN

The checks are evaluated in the reverse of the order listed, i.e. WARN thresholds are checked first, then FAIL, and OK last.

graph { rankdir=LR; edge[style=invis]; subgraph cluster1 { label=<<b>observation window</b><br/>time &rarr;>; node [style=filled,color="#5D8AA8", fillcolor="#00AA00"]; e2; e4; e5; e6; node [style=filled,color="#5D8AA8", fillcolor="#AA0000"]; e2 -- e3 -- e4 -- e5 -- e6 } e7[label=<<i>future<br/>event</i>>] e1 -- e2 e6 -- e7 }

In an observation window where the sum of thresholds is five, at the arrival of the sixth event (e6), event e1 is truncated out of the window.

In the 150ms example above, the thresholds sum to 100 (85 + 10 + 5). This means that Observatory maintains an observation window of 100 events in memory. Once the 101st event arrives, the oldest event is dropped from the start of the list. The window size is simply the sum of all the thresholds.

When enough observations are available, the stream is considered observed. This is when the edge status is reliable.
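The truncation behaviour can be sketched with a bounded deque. The ObservationWindow class is hypothetical, and treating "enough observations" as "the window is full" is an assumption of this sketch. It only counts events and does not decide between OK, FAIL and WARN.

```python
from collections import deque


class ObservationWindow:
    """Hypothetical fixed-size window; its size is the sum of the thresholds."""

    def __init__(self, ok, fail, warn):
        self.size = ok + fail + warn
        # A deque with maxlen drops the oldest entry when a new one arrives.
        self.events = deque(maxlen=self.size)

    def record(self, completed_in_time):
        """Remember whether one event completed within the time window."""
        self.events.append(bool(completed_in_time))

    def counts(self):
        """Return (completed in time, not completed in time) for the window."""
        passed = sum(self.events)
        return passed, len(self.events) - passed

    def is_observed(self):
        # Assumption: the stream counts as observed once the window is full.
        return len(self.events) == self.size


window = ObservationWindow(ok=85, fail=10, warn=5)
for i in range(101):
    window.record(i != 0)  # only the very first event misses the deadline
print(window.counts(), window.is_observed())  # (100, 0) True
```

The 101st event pushes the single late event out of the window, which is why the counts no longer include it.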

Reliability considerations

The observation window is based on strict truncation. The larger the window, the more reliable the observation, but the slower it is to react to changes.