If you work at a company that owns a lot of cloud services, it’s inevitable that you’ll have some high level meetings with other teams regarding the architecture of their systems. Usually people start drawing boxes while saying, “we own this service called Foo which talks to this service called Bar, which then talks to Pineapple and….”.
Five minutes later, everyone is lost. The diagram and the descriptions made while drawing it transfer nothing except the knowledge that they own webservices, which have names (probably unrelated to what they do), and at some undefined point in time communicate something with eachother. Everyone spends the rest of the meeting checking their email in between squinting at the whiteboard and wondering when they can go back to writing code.
To try and make describing a distributed system easier, I ask four questions.
1. What are the entities?
Your system does stuff with entities (SKUs, Reports, Documents, Processing Requests, Marketplaces, Orders, Resources…). Telling me all about your service that processes SKU Merger Requests is going to be Chinese to me unless you start with telling me what the heck a SKU Merger Request is. Systems deal with entities that are usually abstractions around business concepts and client functionality. Start with a non-technical explanation of what the system does, why it exists in the first place, and most importantly what are the entities it deals with.
2. Where is the state/persistence?
There are stateless services and services with state. If the service has state, where is it persisted? Oracle databases? DynamoDb? Elastic search clusters? Encoded on top of the Quantom superposition of hydrogen atoms? I don’t care much about the details of how things are encoded in the persistence layer (SQL, JSON, text files…). I just care about where on the diagram of distributed stuff the state of the system resides.
3. What triggers synchronous chains of events?
Drawing arrows between service boxes doesn’t represent when those arrows are exercised. In distributed systems, there are lots of triggers which start some chain of synchronous events. These triggers should be described. For example, a user on a webpage submits an order. This is a trigger to some chain of events. Perhaps that chain of events will end with a message being put on a queue. An asynchronous agent pops the message off that queue, which begins another chain of synchronous events, that perhaps ends with a database call. The database happens to have a post-commit trigger which starts another chain of events. You get the picture; describe the triggers and make sure you draw distinctions between different chains of events.
4. The functional bits: APIs, inputs, and outputs.
Finally, start breaking down the details of the interactions between system components. All software takes input and produces output. This can be hard to draw on a whiteboard, but can be described while drawing at least.
The symbols to draw on the whiteboard? It doesn’t matter. People are adaptable. Jeff can draw his chains of events using different colored markers, Bob can draw his chains of events using squiggly lines, and Jessica can draw hers with numeric labels. The whiteboard is just temporary storage to sketch out systems that would otherwise escape short term memory before they can be fully understood. It’s a scaffolding that must be filled in with descriptions and discussions during the drawing process.
Just remember to hit all of the high level talking points. What entities does your system deal with? Where are they stored? What triggers their mutation? What are the inputs and outputs to your APIs? This is usually enough to sketch out the high level workings of complex systems filled with workflows, asynchronous queues, distributed state, and all sorts of common cloud computing patterns. Really the only thing missing is conditionals, and my answer that? Don’t draw a bunch of crazy conditional paths on the same diagram. Make one diagram follow a single path through code and draw different diagrams for the alternative flows when authorization is denied, processing is cancelled, or whatever other branching and exception cases you’re trying to communicate.