Wednesday, October 14, 2009

More detailed notes on Azure architecture

Below are some notes from "Introducing the Windows Azure Platform" (by David Chappell), as well as from the video presentation (from PDC 2008) by Erick Smith and Chuck Lenzmeier, "Under the Hood: Inside the Windows Azure Hosting Environment." The PPT slides associated with that video lecture can be found here:

  • General comments
  • A motivating app
  • Azure OS Compute: Roles, Agents, and what an app consists of
  • Azure OS Storage: Blobs, Tables, and Queues; how one can store, access, and update data using the Azure OS.
  • The Fabric Controller: manages application provisioning (resource allocation), deployment, monitoring, and maintenance (including identifying when an app instance has gone down and needs to be restarted, when other hardware needed for the app has failed, and when an app needs to be scaled up or down in response to demand).
  • The service model: a contract between the application creator and the Azure platform. A way for the app creator to define how he wants his app to be managed. The FC enforces the Azure side of this contract, ensuring the app stays "healthy," according to the definition of health provided by the app's creator.
  • The service lifecycle and its constituent parts.
  • Miscellaneous Other, including security
General comments
At a high level of abstraction, think of the Azure Platform as a gigantic computer, though in actuality it consists of many physical computers, switches, routers, load balancers, and so on. A single logical OS runs over all of these different pieces of hardware and manages them. So in the same way that a user adds a physical device (such as a NIC) to his PC by installing the driver for that device, the Azure OS adds switches, computers, and so on by installing a driver for the device. The driver in this case may not necessarily be the software that ships with the hardware (as would be the case when a user installs some hardware on his PC) but instead some Azure-specific code which lets the Azure OS monitor and fully manage the given machine, whatever it may be.

As noted, a user can run an app on Azure OS, store data using Azure OS, or both. If he runs an app on Azure OS, he can perform horizontal scaling, that is, run more than one instance of the same app on different machines. In effect the app is replicated across different resources so that the overall app bandwidth (e.g., ability to perform a distributed computation or serve clients) is increased.

An app itself can consist of more than one "role," where there are front-end roles (those which receive HTTP/S connections) and back-end roles (those which do not accept incoming network connections; they are more processing or computation focused). So there can be replication of a particular role (e.g., create a new instance of the front-end "Web" role for every 1500 new client connections I receive) or of the entire app itself (e.g., create another instance of this entire web application, where that instance will include some number of instances of front-end and back-end roles). (CONFIRM THIS: it is certainly the case that one can replicate instances of particular roles; unclear whether replication can also be done at the level of an entire app, i.e., at app-granularity.)

A motivating app, for concreteness
I think our exploration of Azure would benefit greatly from considering some particular example app. So let's imagine the following app (that I just made up right this second): a credit-checking app used by landlords to determine whether or not to rent to some would-be tenant. The front-end of the app accepts a connection from a landlord who does two main things (via filling out a Web form): (1) provides all of the information about the would-be tenant (name, age, birthdate, SSN, previous addresses, previous landlords, references, ...); (2) provides order-fulfillment information (identifying where the finished report should be sent, how the landlord is going to pay for use of the service, and so on).

Let's assume we can use a single front-end role to accept this information. Depending upon the design of the app, the landlord may need to submit more than one form (e.g., one form for info about the would-be tenant and a second form for order fulfillment info). For simplicity, let's assume there's a single form and a single front-end role. Naturally, the number of instances of that front-end role should depend upon the level of demand for the service at any given instant of time. (Note that an app can increase the number of front-end instances, back-end instances, or both. Similarly, an app can "shed load" by killing instances of front- or back-end roles.)

Then there are at least two back-end roles: one for payment processing and a second for generating and supplying the report. The back-end roles can make outbound network connections, as necessary, to process the credit card payment or to retrieve info on the would-be tenant (even to invoke other Web services to perform some part of the due diligence involved in report generation), and so on.

Clearly there must be some information sharing across these roles; the info gathered by the front-end role needs to be made available to the back-end roles. Further, if there are multiple instances of each back-end role, those instances need to cooperate so that they don't duplicate one another's work. There are a number of different ways to structure this workflow; an obvious approach is for the front-end role to populate a record or document with the info it has gathered, then to add that record/document to a queue from which the back-end roles take work. So there's the familiar producer/consumer relationship there. Clearly, we want some kind of notification to be sent to idle Workers whenever a new job is available.

I'm not sure whether it's even possible for the Web role to create a network connection directly to the Worker role. We know that the outside world cannot connect to the Worker role; not sure whether that prohibition also extends to internal network connections. Depending upon the support for real-time notification of Workers that work is available, it might be appealing to be able to directly poke a worker. Some tradeoffs in polling versus interrupt-based notification but no obvious new twists on those tradeoffs presented here.
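To make the producer/consumer handoff concrete, here's a tiny Python sketch of the pattern. An in-memory deque stands in for an Azure Storage queue; the role functions and record fields are invented for the rental-report example, not actual Azure APIs.

```python
# Sketch of the front-end/back-end handoff: the Web role gathers form
# data into a record and enqueues it; a Worker role pulls records off
# the queue. Names and fields are hypothetical.
from collections import deque

job_queue = deque()          # stands in for the Azure queue

def web_role_handle_form(tenant_info, order_info):
    """Front-end: package the form data into a record and enqueue it."""
    record = {"tenant": tenant_info, "order": order_info}
    job_queue.append(record)

def worker_role_poll():
    """Back-end: poll the queue; return None when no work is available."""
    if not job_queue:
        return None
    record = job_queue.popleft()
    # ... process payment, run credit checks, generate report ...
    return "report for %s" % record["tenant"]["name"]

web_role_handle_form({"name": "Alice"}, {"card": "****1111"})
print(worker_role_poll())   # -> report for Alice
print(worker_role_poll())   # -> None (queue drained)
```

A real Worker would loop, sleeping between polls (or, if some notification mechanism exists, waiting on it) rather than returning when the queue is empty.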

Azure OS Compute
True of both a Front-End Role and a Back-End Role:
  • Runs in a VM.
  • Can interact with Azure Storage:
    • Read/write messages from/to a queue.
    • Read/write data from/to a table.
    • ...
  • Interacts with an Agent, which runs in the same VM.
  • Can write to / read from the local filesystem (within the VM); such changes will not persist across reboot though.
A Front-End Role (referred to in MSFT literature as a "Web Role")
  • Receives HTTP/S connections.
  • Runs on IIS7 (in a VM).
  • Is stateless with respect to clients. What does that mean? Each separate "request" from a client (e.g., HTTP GET or POST message) could be routed to a different front-end role. Hence, all important client state must either be returned to the client (so that he can present that info with each request he makes) or put in the database (where all front-end roles can access the info). Clients can retain and re-present state by maintaining it in a URL parameter or cookie.
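The statelessness point can be illustrated with a toy handler (plain Python, no actual IIS/ASP.NET machinery): all client state rides along with the request, so any front-end instance can serve it.

```python
# Sketch of the statelessness constraint: each request carries all client
# state (here, a visit counter in a cookie), so any front-end instance
# can handle any request. The cookie name is illustrative.
def handle_request(cookies):
    """Any instance can run this: state arrives with the request."""
    count = int(cookies.get("visits", "0")) + 1
    response_cookies = {"visits": str(count)}   # state returned to the client
    return response_cookies

c = handle_request({})          # first request, served by some instance
c = handle_request(c)           # next request, possibly a different instance
print(c["visits"])              # -> 2
```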
A Back-End Role (referred to in MSFT literature as a "Worker Role")
  • Does NOT receive any incoming network connections.
  • Can make outbound network connections.
A Windows Azure Agent
  • Runs in the same VM as the front- or back-end role.
  • Exposes an API which the Roles can invoke in order to:
    • Write to the log
    • Send an alert to the application owner
    • ...
  • Is the way that an app can interact with the Fabric Controller.
An app consists of:
  • N instances of Front-end Roles (can be zero)
  • M instances of Back-end Roles (can be zero)
  • A "hardware" load balancer (only necessary if have >1 Web or front-end role)
  • A config file
    • How many Web role (front-end) instances
    • How many Worker role (back-end) instances
  • A model: which defines what it means for this app to be healthy, the ways in which the app can be unhealthy (i.e., constraints which, when violated, indicate unhealthiness), remedial actions to take when an app is unhealthy (e.g., healthy means 1500 or fewer client connections per Web role; if all Web roles are currently managing 1500 client connections and the app receives another client connection, create a new instance of the Web role and send the new client to that instance). 
  • A unified log which contains messages received both from Web and Worker instances.
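For concreteness, here is roughly what such a config file looks like for the credit-checking app. The shape follows the Azure ServiceConfiguration schema as I understand it, but the role names, instance counts, and setting are made up for this example.

```xml
<ServiceConfiguration serviceName="CreditCheck">
  <Role name="WebFrontEnd">
    <Instances count="3" />   <!-- how many Web (front-end) role instances -->
    <ConfigurationSettings>
      <Setting name="ReportTemplate" value="standard" />
    </ConfigurationSettings>
  </Role>
  <Role name="ReportWorker">
    <Instances count="2" />   <!-- how many Worker (back-end) role instances -->
  </Role>
</ServiceConfiguration>
```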
Azure Storage
"All info held in Azure Storage is replicated three times." Provides strong consistency.

Every blob, table, and queue is named using URIs and accessed via standard HTTP operations. So these objects can be accessed locally (by Worker and Web Roles running in the same data center, for example) or across the Web. One can use REST services to identify and access each Azure Storage data object. There are libraries which encapsulate the REST protocol operations (e.g., ADO.NET Data Services or Language Integrated Query (LINQ)), or one can make "raw HTTP calls" to achieve the same effect. If you query a table stored in Azure Storage and that query has lots of results, you can obtain a continuation object which lets you process some results, then get the next set of results, and so on. So a continuation object is like a list iterator.
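The continuation-object idea can be sketched as follows. Here fetch_page() fakes the server side; in the real Table service the token reportedly travels back in continuation response headers, but the token's contents are opaque to the client either way.

```python
# Sketch of paging through a large query with continuation tokens: keep
# asking for the next page until the server stops handing back a token.
# The data and the integer token are stand-ins.
DATA = list(range(10))          # pretend query results on the server

def fetch_page(token=0, page_size=4):
    """Return one page of results plus the token for the next page (or None)."""
    page = DATA[token:token + page_size]
    next_token = token + page_size if token + page_size < len(DATA) else None
    return page, next_token

results, token = [], 0
while token is not None:        # iterate like a list iterator
    page, token = fetch_page(token)
    results.extend(page)
print(results)                  # -> [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```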
  1. Blobs: a blob consists of binary data. Blobs can be up to 50 GB each and can have associated metadata.
    • A storage account can have multiple containers; a container cannot contain another container (i.e., containers can't be nested). A blob name can contain slashes, though, so one can create the illusion of hierarchical blob storage.
    • Each container (akin to a file folder) holds blobs.
    • Can subdivide a blob into blocks; then, if there is a network failure while that blob is being transferred, you don't have to re-transfer the entire blob, just the blocks that weren't successfully delivered the first time.
    • Each container can be marked public or private.
      • Public blobs: can be read by anyone; writing requires a signed token.
      • Private blobs: both read and write operations require a signed token.
    • A blob can be reached by:
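The block-subdivision point can be sketched like so: a blob is cut into fixed-size blocks, and after a failure only the missing blocks get re-sent. The 4-byte block size and the in-memory "store" are stand-ins for illustration.

```python
# Sketch of resumable block-based transfer: on failure, resend only the
# blocks that didn't arrive, not the whole blob.
BLOCK_SIZE = 4                 # tiny, for illustration; real blocks are MBs

def split_blocks(data):
    return [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]

def upload(blocks, store, fail_after=None):
    """Send blocks by index; optionally fail partway through."""
    for i, block in enumerate(blocks):
        if fail_after is not None and i >= fail_after:
            return False                  # simulated network failure
        store[i] = block
    return True

blob = b"hello azure blobs!"
blocks = split_blocks(blob)
store = {}
upload(blocks, store, fail_after=2)       # first attempt dies after 2 blocks
missing = [i for i in range(len(blocks)) if i not in store]
for i in missing:                         # retry: resend only what's missing
    store[i] = blocks[i]
print(b"".join(store[i] for i in range(len(blocks))) == blob)   # -> True
```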
  2. Table: actually a set of entities, each of which has a set of properties (and values for those properties).
    • Access a table via the conventions defined in ADO.NET Data Services. You cannot query a table in Azure Storage using SQL, and a table cannot be required to conform to a particular schema.
    • The reason for this non-relational representation is that it makes it easier to "scale out storage," i.e., to spread the data in the table across machines. It's harder to do this if the data is stored in a relational table (what's hard is probably performing aggregate functions over the entire table, which is not allowed for tables in Azure OS Storage).
    • An Azure table can contain billions of entities holding terabytes of data.
    • A table consists of some set of entities
      • Each entity consists of some set of properties.
        • Each property consists of a Name, Type, and Value.
    • So in a way, we can think of a table as a nested structure: every property conceptually consists of a table which holds three tuples, in particular those which give the property's name (<Name, NameVal>), value (<Value, ValueVal>), and the type of that value (<Type, TypeVal>). Supported types include Binary, Bool, DateTime, Double, GUID, Int, Int64, String. A property's type can change over that property's lifetime, depending upon what value is currently stored for the property.

      For example, some property p_12 might be represented by tuples: <Name, "CreationTime">, <Type, "DateTime">, and <Value, "12-01-2009 05:43:22 GMT">. So this property (p_12) says that the entity that it describes was created on December 1st, 2009, early in the morning (which is interesting since that date has not yet arrived but no mind!).

      The next layer of nesting is that an entity contains a table consisting of all of that entity's properties. The final layer of nesting is that an actual Azure Storage Table consists of some set of entities and each entity's associated properties table.
    • Each entity can be up to 1 MB in size. Accessing any part of an entity results in accessing the entire entity.
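Here's a minimal Python model of that nesting: properties are (Name, Type, Value) triples, entities are bags of properties, and a table is a set of entities with no shared schema. The data is invented.

```python
# Sketch of the table/entity/property nesting described above. Note the
# two entities carry different properties: no fixed schema.
def prop(name, type_, value):
    return {"Name": name, "Type": type_, "Value": value}

entity_a = [prop("CreationTime", "DateTime", "12-01-2009 05:43:22 GMT"),
            prop("Rating", "Int", 5)]
entity_b = [prop("Color", "String", "blue")]    # entirely different shape

table = [entity_a, entity_b]                    # a table is a set of entities

def get(entity, name):
    """Look up a property's value by name within one entity."""
    for p in entity:
        if p["Name"] == name:
            return p["Value"]

print(get(table[0], "Rating"))   # -> 5
```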

  3. Queue: the primary purpose of a queue is to provide a way for Web roles and Worker roles to communicate with one another. A queue contains messages, the format of which can be application-defined. Each message can be up to 8 KB. A Worker role reads a message from the queue, which does not result in the message being deleted. Rather, after a message is read, that message is not shown to anyone else for 30 seconds. Within those 30 seconds, the Worker who read the message could have handled the task and deleted the message. Or that Worker might have died before completing the task, in which case the message would be shown to other Workers after the 30 seconds have elapsed. This 30-second period is referred to as the visibility timeout.

    Note that a message might be delivered multiple times. Also, there is no guarantee on the order in which messages are delivered. Presumably this means that a particular message on the queue might get "starved," i.e., never delivered to any Worker (in contrast to traditional FIFO queue semantics).
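The visibility-timeout protocol can be sketched as follows (time is a simulated counter here; the real service uses wall-clock time). The takeaway is that delivery is at-least-once, so message handling should be idempotent.

```python
# Sketch of the read/delete protocol with a visibility timeout: a read
# hides the message for 30 "seconds"; if the reader dies without deleting
# it, the message reappears and another Worker gets it.
VISIBILITY_TIMEOUT = 30

class Queue:
    def __init__(self):
        self.messages = {}       # msg_id -> [body, invisible_until]
        self.clock = 0           # simulated time

    def put(self, msg_id, body):
        self.messages[msg_id] = [body, 0]

    def get(self):
        for msg_id, entry in self.messages.items():
            if entry[1] <= self.clock:
                entry[1] = self.clock + VISIBILITY_TIMEOUT   # hide it
                return msg_id, entry[0]
        return None

    def delete(self, msg_id):
        self.messages.pop(msg_id, None)

q = Queue()
q.put(1, "generate report")
msg = q.get()                # worker A reads; message now hidden
print(q.get())               # -> None (still within the timeout)
q.clock += 31                # worker A died; timeout elapses
print(q.get())               # -> (1, 'generate report'), redelivered
```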
Azure Storage access control: it sounds like they use a token-based system to control who can access what data. In particular, when a user creates a storage account, she's given a secret key. Every request made to the storage account (to modify account settings, to access some blob stored as part of this account, to edit a table, and so on) carries a signature generated with this secret key (like a token). Having such a token means that you can access all the data stored in that account (as well as the account settings themselves?); i.e., the granularity at which access-control decisions are made is account-level. Presumably, a user Alice who owns an account can provide a signed token to other users whom Alice wants to allow to read/modify stored data. (CONFIRM.) Regrettably, it appears that Alice has to give these other users all types of access (create, modify, delete) to everything (all data as well as account settings?). That said, part of Windows .NET Cloud Services is access control, so perhaps this could be layered on top of storage in order to achieve a more fine-grained access-control system? (CHECK.)
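A sketch of the signing scheme as I understand it: an HMAC over a string describing the request, keyed by the account secret, which the server recomputes to authorize the request. The real Shared Key scheme signs a canonicalized set of headers; the string-to-sign below is a simplification.

```python
# Sketch of shared-key request signing: client signs a description of the
# request with the account key; the server recomputes and compares.
import base64, hashlib, hmac

SECRET_KEY = base64.b64encode(b"account-secret")   # issued at account creation

def sign(verb, resource, date):
    string_to_sign = "\n".join([verb, resource, date]).encode()
    mac = hmac.new(base64.b64decode(SECRET_KEY), string_to_sign, hashlib.sha256)
    return base64.b64encode(mac.digest()).decode()

def authorize(verb, resource, date, signature):
    """Server side: recompute and compare in constant time."""
    return hmac.compare_digest(sign(verb, resource, date), signature)

sig = sign("GET", "/myaccount/photos/cat.jpg", "Wed, 14 Oct 2009 12:00:00 GMT")
print(authorize("GET", "/myaccount/photos/cat.jpg",
                "Wed, 14 Oct 2009 12:00:00 GMT", sig))    # -> True
print(authorize("DELETE", "/myaccount/photos/cat.jpg",
                "Wed, 14 Oct 2009 12:00:00 GMT", sig))    # -> False
```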

Update: there is evidently now a way to provide container-level or blob-level access (rather than only being able to give account-wide access). See:
Windows Azure storage operates with “shared key authentication.”  In other words, there’s a password for your entire storage account, and anyone who has that password essentially owns the account.  It’s an all-or-nothing permission model.  That means you can never give out your shared key.  The only exception to the all-or-nothing rule is that blob containers can be marked public, in which case anyone may read what’s inside the container (but not write to it).
Signed access signatures give us a new mechanism for giving permission while retaining security.  With signed access signatures, my application can produce a URL with built-in permissions, including a time window in which the signature is valid.  It also allows for the creation of container-level access policies which can be referenced in signatures and then modified or revoked later.  This new functionality enables a number of interesting scenarios...
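A rough sketch of how such a signed URL could work. The URL format, field names, and validation logic here are my invention for illustration, not the actual Azure SAS format; the key idea is that permissions and an expiry window are signed into the URL, so the holder never needs the account key.

```python
# Sketch of a signed access signature: the owner signs (path, permissions,
# expiry) with the account key and hands out the resulting URL.
import hashlib, hmac

ACCOUNT_KEY = b"account-secret"

def make_sas_url(blob_path, permissions, expiry):
    payload = "%s|%s|%d" % (blob_path, permissions, expiry)
    sig = hmac.new(ACCOUNT_KEY, payload.encode(), hashlib.sha256).hexdigest()
    return "https://example.blob.core/%s?perm=%s&exp=%d&sig=%s" % (
        blob_path, permissions, expiry, sig)

def check_sas(blob_path, permissions, expiry, sig, now):
    """Server side: signature must verify AND the window must still be open."""
    payload = "%s|%s|%d" % (blob_path, permissions, expiry)
    expected = hmac.new(ACCOUNT_KEY, payload.encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, sig) and now < expiry

url = make_sas_url("reports/42", "r", expiry=1000)
sig = url.split("sig=")[1]
print(check_sas("reports/42", "r", 1000, sig, now=500))    # -> True (valid window)
print(check_sas("reports/42", "r", 1000, sig, now=2000))   # -> False (expired)
```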
Azure Storage concurrency: the basic model appears to be that a user or app downloads a blob or entity, then modifies that blob/entity locally. After that, the user/app attempts to commit those local changes by writing the updated version back to storage. So, what happens if more than one app attempts to modify the same storage object (e.g., blob or table) at the same time? The first writer wins. Subsequent writes (of the same version) will fail. This is a form of optimistic concurrency and is achieved via maintaining version numbers. Note that an app can force its changes to stick by unconditionally updating an entity or blob.
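A sketch of first-writer-wins with version numbers; the store, keys, and the "*" wildcard for an unconditional update are illustrative stand-ins for whatever the service actually exposes.

```python
# Sketch of optimistic concurrency: writes succeed only if the caller's
# version matches the stored one; "*" forces the write unconditionally.
store = {"blob1": {"data": "v0", "version": 1}}

def read(key):
    e = store[key]
    return e["data"], e["version"]

def write(key, data, expected_version):
    e = store[key]
    if expected_version != "*" and e["version"] != expected_version:
        return False                      # someone else committed first
    e["data"], e["version"] = data, e["version"] + 1
    return True

data_a, ver_a = read("blob1")             # app A reads version 1
data_b, ver_b = read("blob1")             # app B reads version 1
print(write("blob1", "A's edit", ver_a))  # -> True  (first writer wins)
print(write("blob1", "B's edit", ver_b))  # -> False (stale version)
print(write("blob1", "B's edit", "*"))    # -> True  (unconditional update)
```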
The Fabric Controller (FC)
One per datacenter. Is replicated across several machines (maybe 5 to 7). Owns all of the resources in the fabric (including computers, switches, load balancers). Communicates with each machine's Fabric Agent. Every physical machine has a Fabric Agent (FA) running on it, which keeps track of how many VMs are running on that machine, how each Azure app in each VM is doing, and so on. The FA reports this info back to the FC, which uses it (along with the app's configuration file or model) to manage the app so that the app remains healthy (as defined in the app's configuration file).

The FC monitors all running apps, manages its own infrastructure (e.g., keeps VMs patched, keeps OSs patched, ...), and performs resource allocation using its global view of the entire fabric along with the configuration file for the app that needs to be deployed. All of this is done automatically. The FC presently (i.e., in the CTP) allocates VMs to processor cores in a one-to-one manner (that is, no core will have more than one VM running on it).

The FC maintains a graph which describes its hardware inventory. A node in the graph is a computer, load balancer, switch, router, or network management device. An edge is a networking cable (e.g., connecting a computer to a switch), power cable, or serial cable. When an app needs to be deployed, the FC may allocate an entire physical machine or, more commonly, may subpartition that machine into VMs and allocate one or more VMs to the given role. The size of the allocation depends upon the configuration settings for that role.
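A toy version of the inventory graph (hardware names invented), showing the kind of query the FC can answer from it, e.g., which computers sit behind a given switch:

```python
# Sketch of the FC's inventory graph: nodes are hardware, edges are
# cables. The topology here is made up.
nodes = {"sw1": "switch", "sw2": "switch", "lb1": "load balancer",
         "c1": "computer", "c2": "computer", "c3": "computer"}
edges = [("c1", "sw1"), ("c2", "sw1"), ("c3", "sw2"),
         ("sw1", "lb1"), ("sw2", "lb1")]          # networking cables

def behind(switch):
    """Computers cabled directly to the given switch."""
    return sorted(a for a, b in edges if b == switch and nodes[a] == "computer")

print(behind("sw1"))   # -> ['c1', 'c2']
print(behind("sw2"))   # -> ['c3']
```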

The service model
Is "declarative," which means rule-based. So the model or specification consists of a set of rules. Those rules are used by the Fabric Controller both in initially allocating/provisioning resources for your service and in managing the service over its lifetime (i.e., maintaining its health). The model contains the info that you might otherwise communicate to the IT department that was going to deploy and run/manage your service in-house. This technique/setup is referred to as model-driven service management.

What it looks like:
  • Contains the topology of your service (which comprises a graph):
    • Which roles? How many instances of each?
    • How are the roles linked (i.e., which roles communicate with which other roles)?
    • Which interfaces are exposed to the Internet?
    • Which interfaces are exposed to other services?
  • For each role, defines attributes of that role, e.g.,
    • What hosting environment does that role require? (E.g., IIS7, ASP.NET, CLR, ...)
    • How many resources does that role need? (E.g., CPU, disk, network latency, network bandwidth, ...)
  • Configuration settings: all are accessible at run-time. Can register to be notified when the value of a particular setting changes.
    • Application configuration settings: defined by the developer. Can think of these as akin to command-line arguments that one would pass to a program. The app will behave differently depending upon the values of these arguments. Can use these to create different versions of your app, where the versions differ on the basis of their values for these configuration settings.
    • System-defined configuration settings: predefined by the system.
      • How many fault domains?
      • How many update domains?
      • How many instances of each role?
  • In the CTP, you don't get to use the entire service-model language. Instead, you will be provided with various templates; you can choose one and customize it for your particular app. Eventually, more flexibility (a higher level of control over service-model configuration) will be exposed.
The service lifecycle
1. [Developer] Develop code. Build/construct the service model.
2. [Developer/Deployer] Specify the desired config.
3. [FC] Deploy the app; the FC maps the app to hardware in the way specified in the config/model.
  1. Resource allocation: choose hardware to allocate to the app based on the app's service model. This is basically a ginormous constraint-satisfaction problem. Performed as a transaction: either all resources are allocated for an app, or none are. Example constraints include:
    • Question: Are there both system constraints and application-defined constraints, and are the two sets then combined? For example, some of the sample constraints seem like something the system would specify, such as "Node must have enough resources to be considered" or "Node must have a compatible hosting environment," whereas others seem more app-specific, as in "Each node should have at most a single instance of any given role assigned to it."
  2. Provisioning: each item of allocated hardware is assigned a new "goal state." The FC will drive each piece of hardware to its goal state.
  3. Upgrades:
4. [FC] Monitor app state and maintain app health, according to the service model and the SLA.
  • Handle software faults, hardware failures, and the need to scale up/down.
  • The logging infrastructure gathers the data needed to diagnose an app as unhealthy (where "health" is application-defined and provided in the app's service model).
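Step 3's allocation can be sketched as a toy constraint solver: place role instances greedily subject to the sample constraints above ("enough resources," "at most one instance of a given role per node"), failing as a unit like a transaction. Node capacities and requests are made up; a real FC presumably solves a far larger problem.

```python
# Sketch of all-or-nothing resource allocation under two constraints:
# a node must have enough free cores, and a node may host at most one
# instance of any given role.
nodes = {"n1": 4, "n2": 4, "n3": 2}      # node -> free cores (invented)

def allocate(requests):
    """requests: list of (role, cores). Returns placements, or None if any
    instance cannot be placed (transaction semantics)."""
    free = dict(nodes)
    used_roles = {n: set() for n in nodes}
    placement = []
    for role, cores in requests:
        target = next((n for n in sorted(free)
                       if free[n] >= cores and role not in used_roles[n]), None)
        if target is None:
            return None                  # all or nothing
        free[target] -= cores
        used_roles[target].add(role)
        placement.append((role, target))
    return placement

print(allocate([("web", 2), ("web", 2), ("worker", 2)]))
# -> [('web', 'n1'), ('web', 'n2'), ('worker', 'n1')]
print(allocate([("web", 8)]))            # -> None (no node has 8 free cores)
```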
Miscellaneous Other
  • The development environment allows desktop simulation of the entire Azure cloud experience. You can run your app locally and debug that way.
  • Can specify which data centers your app runs in and stores its data in.
  • Two-stage deployment to the cloud (which can also be used to achieve zero-downtime upgrades):
    • Upload the application and access it via a URL that points to a load balancer (i.e., the DNS entry contains the load balancer's virtual IP (VIP)). Then connections to the app will be routed through this load balancer.
      • Presumably anyone can connect to the app (i.e., it's world reachable), but not everyone knows the particular globally unique ID associated with this app?
    • To make the app run live, a DNS entry is created which ties the application's domain name to the load balancer's VIP.
  • Can layer any desired HTTP-based authentication mechanism on top of this. Or can use Windows Cloud Services (the Live Services in particular), which provide authentication.
  • Machines are grouped into fault domains, where a fault domain is a collection of systems with some common dependency, meaning that the entire set of machines could go down at the same time. For example, all machines that sit behind a single switch are in the same fault domain, as are all machines that rely on the same power supply.
    • In configuring the settings for your Azure app and storage, you might specify that two instances of the same thing (e.g., two identical Web roles or two instances of Azure Storage) should not be located within the same fault domain.
    • And actually, for the case of Azure Storage, the Storage app takes care of making sure that it has more than one version of itself spread across different fault domains (rather than the Fabric Controller doing this).
    • It is easy to determine, given the FC's graph of all inventory (i.e., hardware), what the fault domains are. One can determine statistically what the likelihood of failure is for each such domain.
  • Update domains: if you're doing a rolling upgrade of your service, a certain percentage of your service will be taken offline at any time (and upgraded). That percentage is the size of your update domain.
  • Security/Isolation
    • Use VMs to isolate one role from others.
    • For PDC, an app can only be managed code, which can be easily sandboxed.
    • Run app code at a reduced privilege level.
    • Network-based isolation
      • IP filtering
      • Firewalls: each machine is configured to only allow traffic that machine expects.
    • Automatically apply Windows security patches (heh).
  • MIX 2009 introduced the ability to run Web or Worker roles with more privileges than previously allowed. In particular, there are Code Access Security (CAS) levels, such as Partial Trust and Full Trust. In order to run non-.NET (a.k.a. non-managed, or native) code, the Web or Worker role needs to be able to spawn a process or to issue the Platform Invoke command. Apparently, one can only do those things with Full Trust. Hence, this is how MSFT will support non-managed code: you configure your Web or Worker role to execute your non-managed code. So the Web or Worker role bootstraps your code.

    Thus, in order for this to work, the code you execute has to be copy-deploy, which means that you don't have to install the code (since even Full Trust does not include administrative or install privileges); instead, you can merely copy the code to the VM and run it.

  • The MSFT VMM is based on their Hyper-V hypervisor, which was released in June 2008.
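The fault-domain idea lends itself to a quick sketch: derive domains from a shared dependency (a switch, here), then place replicas so that no two share a domain. The topology is invented.

```python
# Sketch of fault domains: machines sharing a switch form one domain;
# replicas of the same thing go into different domains.
switch_of = {"m1": "sw1", "m2": "sw1", "m3": "sw2", "m4": "sw2"}

def fault_domains():
    """Group machines by their shared dependency."""
    domains = {}
    for machine, switch in switch_of.items():
        domains.setdefault(switch, []).append(machine)
    return domains

def place_replicas(count):
    """One replica per fault domain; fail if there aren't enough domains."""
    domains = sorted(fault_domains().values())
    if count > len(domains):
        return None
    return [d[0] for d in domains[:count]]

print(place_replicas(2))   # -> ['m1', 'm3'] (one machine from each domain)
print(place_replicas(3))   # -> None (only two fault domains here)
```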
  • Read up on REST.
  • Get more details on the fabric controller, data-center architecture, etc.
    • Can see more about the FC, as far as redundancy goes, here (starting at about 30:00).
  • Get more details on the "service model," which is how the app developer communicates with the Azure OS Compute platform about how to manage his app.
  • More details on how the FC drives nodes into the goal state (i.e., monitors and maintains app health).
  • How do these virtual IPs and dedicated IPs (where the latter make up a pool that "backs" the virtual IPs) work? Presumably there is some address translation at some point (mapping a connection to a virtual IP to instead be to a real IP), like NAT, but at which point? At the load balancer?
    • Maybe this just means an internal IP address, such as
