There is a logical service, which consists of several logical roles (e.g., in our example application, the service was the overall application and the roles included a Web role and two Worker roles). There may be multiple instances of any single Logical Role, each of which is a Logical Role Instance (LRI). In the picture above, there are three instances of the Logical Role R_2. Each LRI is mapped to a single Logical Node, which is a hosting environment (e.g., a VM). Multiple Logical Nodes (VMs) may run on any given Physical Node (computer).
The purple ovals correspond to information provided by the application developer. That is, in the configuration (service model) for this app, the developer provides the Service Description, a Role Description for each role (which specifies attributes of that role, such as what hosting environment is required, how many resources are needed, and so on), and a Role Instance Description. The Fabric Controller maps each Role Instance Description to its corresponding Logical Role Instance. Similarly, each Role Description is mapped to the appropriate Logical Role. (Note that not all of these connections are shown in the picture above.)
In this way, deploying a service consists of mapping the graph describing an application's topology (as provided in that app's service model) onto the graph describing the inventory managed by the FC.
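The graph-mapping view of deployment can be made concrete with a small sketch. The class and function names below (RoleDescription, LogicalNode, deploy) are invented for illustration; they are not Azure APIs, and the "capacity" check stands in for whatever resource constraints the real placement algorithm enforces.

```python
# Sketch: deployment as mapping role instances (service-model graph)
# onto logical nodes (FC inventory graph). All names are hypothetical.
from dataclasses import dataclass, field

@dataclass
class RoleDescription:
    name: str           # logical role name, e.g. "Web"
    instances: int      # how many LRIs this role needs
    cores_needed: int   # stand-in for the role's resource requirements

@dataclass
class LogicalNode:
    name: str
    free_cores: int
    assigned: list = field(default_factory=list)

def deploy(roles, inventory):
    """Map every logical role instance (LRI) onto a node with capacity."""
    placement = {}
    for role in roles:
        for i in range(role.instances):
            lri = f"{role.name}_{i}"
            node = next((n for n in inventory
                         if n.free_cores >= role.cores_needed), None)
            if node is None:
                raise RuntimeError(f"no capacity for {lri}")
            node.free_cores -= role.cores_needed
            node.assigned.append(lri)
            placement[lri] = node.name
    return placement

roles = [RoleDescription("Web", 1, 2), RoleDescription("Worker", 2, 1)]
inventory = [LogicalNode("LN_1", 4), LogicalNode("LN_2", 2)]
print(deploy(roles, inventory))
```

The real FC presumably solves a much richer constraint problem (fault domains, update domains, network topology), but the shape is the same: one graph annotated with requirements, matched against another annotated with capacity.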
Driving the logical nodes
When an LRI is bound to a logical node (VM), that logical node's goal state is set to be whatever the LRI's goal state is. For example, Logical Node LN_22 in the picture above has a goal state that corresponds to LRI_22's goal state. The logical node also has a current state, which is obtained from the physical node on which this VM is running. The current state is kept up to date by communicating with the underlying physical hardware.
They don't say how it's constructed, but evidently the FC also knows the state machine for the corresponding LRI, in addition to the current state and the goal state for each VM. A state machine identifies all the possible states an instance can be in, as well as which events cause transitions between which states. A stream of events (obtained via polling or interrupts) provides updates on the physical hardware. These events are used, along with the state machine, to determine the current state. Periodically, the FC figures out the appropriate next steps given the current state, the state machine, and the goal state, and then performs those actions.
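A minimal sketch of this reconciliation loop, under the assumption that the state machine is a simple table mapping each state to the action that advances it (the state names and actions below are invented, since the source gives no concrete examples):

```python
# Hypothetical state machine: current state -> (action to take, next state).
# The FC's job each tick is to consult this table until current == goal.
TRANSITIONS = {
    "Provisioned": ("StartVM", "Booting"),
    "Booting":     ("WaitForAgent", "Running"),
    "Stopped":     ("StartVM", "Booting"),
}

def next_action(current, goal):
    """One reconciliation tick: what should the FC do next, if anything?"""
    if current == goal:
        return None          # already at the goal state; nothing to do
    action, _ = TRANSITIONS[current]
    return action

def reconcile(current, goal):
    """Drive a node from its current state to its goal state."""
    steps = []
    while current != goal:
        action, nxt = TRANSITIONS[current]
        steps.append(action)
        current = nxt        # in reality, updated from hardware events
    return steps

print(reconcile("Provisioned", "Running"))
```

In the real system the loop would not assume each action succeeds; the new current state would come back through the event stream, and the next tick would re-plan from whatever state was actually observed.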
Identifying when there is a problem and handling it (automatically)
The FC maintains a cache of what it believes is the current state of each node. The Fabric Agent (which lives on each node) communicates with the FC to help keep this cache updated. The FC detects when a Role dies. Multiple sources participate in monitoring Role health (and notifying the FC when a Role is not healthy): the Load Balancer issues probes to machines (i.e., pings them), and a Role can itself notify the FC that it is unhealthy. When the FC learns (through whatever channel) that some Role is not healthy, it updates the current state for that node. Then the appropriate next steps to take, to move that node closer to its goal state, are determined and undertaken.
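The multi-source health reporting might be sketched as follows; the class and method names are hypothetical, the point being that probe failures, agent heartbeats, and self-reports all funnel into the same cached current state, and any divergence from the goal state flags the node for reconciliation:

```python
# Sketch of the FC's cached view of node state. Any monitoring channel
# (load-balancer probe, Fabric Agent, the role itself) calls report();
# a state that diverges from the goal marks the node for corrective action.
class FabricControllerCache:
    def __init__(self):
        self.current = {}   # node -> believed current state
        self.goal = {}      # node -> goal state
        self.dirty = set()  # nodes whose state machine must be re-driven

    def report(self, node, state):
        """Called by any monitoring source with an observed state."""
        if self.current.get(node) != state:
            self.current[node] = state
            if state != self.goal.get(node):
                self.dirty.add(node)  # schedule reconciliation

fc = FabricControllerCache()
fc.goal["LN_22"] = "Running"
fc.report("LN_22", "Running")    # healthy report: nothing to do
fc.report("LN_22", "Unhealthy")  # probe failure: mark for reconciliation
print(fc.dirty)
```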
When a node goes offline, an attempt is made to recover it. If the cause was a hardware failure (or, more generally, something that cannot be remediated automatically), the role is migrated to another node.
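The recover-or-migrate policy reduces to a small decision rule. This is a sketch under stated assumptions: the `remediable` predicate and the helper name are invented, and real migration would of course involve re-running placement against the inventory rather than taking the first spare node.

```python
# Hypothetical failure-handling policy: recover in place when possible,
# otherwise migrate the role to a healthy node with spare capacity.
def handle_offline_node(node, lri, spare_nodes, remediable):
    """Return (action, target_node) for an offline node hosting `lri`."""
    if remediable(node):
        return ("recover", node)        # e.g., reboot or reimage the VM
    if not spare_nodes:
        raise RuntimeError("no spare capacity to migrate " + lri)
    return ("migrate", spare_nodes[0])  # move the role elsewhere

# Transient software fault: recover in place.
print(handle_offline_node("LN_7", "LRI_22", ["LN_9"], lambda n: True))
# Hardware failure: migrate to a spare node.
print(handle_offline_node("LN_7", "LRI_22", ["LN_9"], lambda n: False))
```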
(Note that the above exposition would have benefited greatly from a few concrete examples, such as what a typical goal state is, what a typical current state is, how the state machine (which identifies what's needed to transition between states) for a logical role instance is created, and so on.)