# Monitoring
Observatory monitoring is based on two discrete steps: tracing and correlation. Tracing is done by the nodes; correlation is internal to Observatory and is how it couples individual events together using the tracing tokens.
## Tracing
Initially, we configure a stream. A stream is a directed acyclic graph (DAG) that models the flow of information in a distributed system. So if our system consists of two services, `web-server` and `event-processor`, the initial graph contains just those two nodes. We model this in the configuration as a stream whose elements are defined as `["web-server", "event-processor"]`. This is configured using the following syntax:
```toml
[[stream]]
name = "my-example-stream"
nodes = ["web-server", "event-processor"]
```
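Since `[[stream]]` is a TOML array of tables, additional streams can presumably be declared by repeating the block; for example (the second stream and its node names are hypothetical):

```toml
[[stream]]
name = "billing-stream"
nodes = ["web-server", "billing-service", "ledger"]
```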
Streams are composed of nodes and edges. A node is identified by a unique UTF-8 string. An edge is an ordered pair of two distinct nodes. Defining an edge means configuring the rate of monitored information flow.
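As a mental model only (this is a sketch, not Observatory's implementation), a stream can be pictured in Python like this:

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Edge:
    """A directed edge between two distinct nodes."""
    source: str
    target: str

    def __post_init__(self):
        if self.source == self.target:
            raise ValueError("an edge must connect two distinct nodes")

@dataclass
class Stream:
    """A named DAG of node identifiers (unique UTF-8 strings)."""
    name: str
    nodes: set[str]
    edges: set[Edge] = field(default_factory=set)

stream = Stream("my-example-stream", {"web-server", "event-processor"})
stream.edges.add(Edge("web-server", "event-processor"))
```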
How all of this works can be illustrated with a sequence diagram:
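In rough text form (reconstructed from the Correlation section below), the exchange is:

```
web-server       ──(web-server, abcd1234, T1)──▶       Observatory
web-server       ──HTTP + X-Tracing-ID: abcd1234──▶    event-processor
event-processor  ──(event-processor, abcd1234, T2)──▶  Observatory
```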
> **Warning:** Asynchronous sending

## Remember asynchrony
To keep overhead to a minimum, it is important not to block when informing Observatory. In Pythonish pseudo-code, we could do it like this, using grequests:
```python
import uuid
from datetime import datetime
import grequests

event = {'node': 'web-server',
         'timestamp': datetime.now().isoformat(),
         'tracing_id': str(uuid.uuid4())}
# grequests.send() dispatches the POST on a gevent greenlet -- non-blocking!
grequests.send(grequests.post('http://observatory:1234/events', json=event))
# ... blocking code ...
```
If you send this synchronously, you delay the code that follows; with an asynchronous request, execution resumes immediately while the event is delivered in the background.
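If you would rather not pull in a gevent-based library, the same fire-and-forget behaviour can be sketched with the standard library and plain requests (the helper name is ours; the endpoint is the one from the example above):

```python
from concurrent.futures import ThreadPoolExecutor

import requests

# A small shared pool: submit() returns immediately, so the caller never blocks.
_pool = ThreadPoolExecutor(max_workers=4)

def report_event(event: dict) -> None:
    # Fire-and-forget: the POST runs on a worker thread.
    _pool.submit(requests.post, 'http://observatory:1234/events', json=event)
```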
## Correlation
So, node `web-server` receives an HTTP request. A unique id `abcd1234` is generated. `web-server` sends the information packet `(web-server, abcd1234, T1)` to Observatory, where `T1` is the current timestamp in ISO 8601 format. It then sends the request downstream to the `event-processor` system, passing the tracing ID in an HTTP request header: `X-Tracing-ID: abcd1234`. `event-processor` reads this header and correlates the packet by sending `(event-processor, abcd1234, T2)` to Observatory. Observatory now sees that these events are part of a stream, because they share the tracing token, so it starts observing it; and because `T2 > T1`, it understands that information is flowing from `web-server` to `event-processor`. Our DAG now has a directed edge from `web-server` to `event-processor`.
Note that the edge's pass label says 100% because we haven't defined a check span: how many successive events to remember before determining edge health.
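In code, the two reporting calls might look like the following sketch (the `report` helper is hypothetical, the endpoint is the one from the tracing example, and the two halves would of course live in separate services):

```python
import uuid
from datetime import datetime

import grequests

OBSERVATORY = 'http://observatory:1234/events'

def report(node: str, tracing_id: str) -> None:
    """Send one (node, tracing_id, timestamp) packet to Observatory, non-blocking."""
    event = {
        'node': node,
        'timestamp': datetime.now().isoformat(),
        'tracing_id': tracing_id,
    }
    grequests.send(grequests.post(OBSERVATORY, json=event))

# web-server: generate the token, report (web-server, abcd1234, T1), and
# propagate the token downstream in the X-Tracing-ID request header.
tracing_id = str(uuid.uuid4())
report('web-server', tracing_id)
downstream_headers = {'X-Tracing-ID': tracing_id}

# event-processor: read the header from the incoming request and
# report (event-processor, abcd1234, T2) using the same token.
report('event-processor', downstream_headers['X-Tracing-ID'])
```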
## Observation windows
Defining an observation window is essentially done by summing the different thresholds. If you define a check with a time window of 150 ms and thresholds of 85 for `OK`, 10 for `FAIL`, and 5 for `WARN`, the following list is processed:
- 85 or more events complete within 150 ms: the edge is considered `OK`
- 10 or more events do not complete within 150 ms: the edge is considered `FAIL`
- 5 or more events do not complete within 150 ms: the edge is considered `WARN`
The checks are done in reverse order, i.e. the `WARN` threshold is checked first, then `FAIL`, and last `OK`.
All of these thresholds sum to 100, which means Observatory maintains an observation window of 100 events in memory. Once the 101st event arrives, the list is truncated from the start. The window size is simply the sum of all the thresholds.
When enough observations are available, the stream is considered observed; only then is the edge status reliable.
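To make the mechanics concrete, here is a minimal sketch of such a window (the names, the `UNOBSERVED` placeholder, and the rule that `FAIL` wins over `WARN` when both thresholds are met are our assumptions, not necessarily Observatory's internals):

```python
from collections import deque

THRESHOLDS = {'OK': 85, 'FAIL': 10, 'WARN': 5}  # sums to 100
WINDOW_SIZE = sum(THRESHOLDS.values())          # window size = sum of thresholds

# Each entry records whether an event completed within the 150 ms time window;
# maxlen gives the strict truncation from the start described above.
window = deque(maxlen=WINDOW_SIZE)

def record(completed_in_time: bool) -> None:
    window.append(completed_in_time)

def edge_status() -> str:
    if len(window) < WINDOW_SIZE:
        return 'UNOBSERVED'  # not enough observations for a reliable status yet
    late = sum(1 for completed in window if not completed)
    # Checks run in reverse order: WARN first, then FAIL, last OK.
    if late >= THRESHOLDS['WARN']:
        return 'FAIL' if late >= THRESHOLDS['FAIL'] else 'WARN'
    return 'OK'  # fewer than 5 late events in a window of 100 implies >= 85 on time
```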
## Reliability considerations
The observation window is based on strict truncation. The larger the window, the more reliable the observation, but the slower it reacts to changes: doubling every threshold, for example, doubles the window to 200 events, so twice as many events must be observed before a sudden degradation shows up in the edge status.