Friday, July 10, 2009

Cloud Service Providers

So this is all about determining what kind of app I can run on the Amazon cloud vs. the MSFT cloud vs. the Google cloud. Are these guys all considered "cloud service providers"? Who else should I care about? (surely, Yahoo! has something?) What do their offerings look like? How do they differ from one another? What kind of app do they let me build (i.e., what do I upload to the cloud)? What type of functionality does their API expose (what can my app do that it couldn't do before)? Another question: how nice is the interface for managing one's stored data or web service at each of these cloud service providers? Answer: the GUI web interface for managing cloud resources seems to be a selling point for all.

The biggies of which I am aware:
* Amazon AWS
* MSFT Azure
* Google AppEngine

Also look into:
* Coghead (on second thought...)

Overall Observations
  • All of the databases provided by Amazon, MSFT, and Google are schema-less. That is, they collectively represent a big shift away from SQL and relational databases and toward something more unstructured, like a hashtable or RDF triple store. Even Microsoft's SQL Service (which is a service that runs in the cloud and can be accessed by applications) doesn't actually expose a relational SQL interface!

  • Each vendor provides a way to upload blobs or large binary objects of data (Amazon S3, MSFT Azure Storage, and MSFT SkyDrive for Windows Live users).

Amazon Web Services (AWS)
We'll talk below about:
* Amazon SimpleDB
* Amazon Simple Storage Service (S3)
* Amazon Elastic Compute Cloud (EC2)


Amazon SimpleDB
An alternative to building and maintaining a DB (e.g., SQL) in-house. Used to store, process, and query data sets. Can be used in conjunction with EC2 and S3. Automatically indexes uploaded data; provides query interface for accessing that data.

The actual format (in which data is stored) sounds a lot like a triple (RDF) store. Create a "domain" which contains items, each of which holds <attr, value> pairs. Can only query within a single domain (not across domains). A single attribute (attr) can have multiple entries mapping it to various different values. No pre-defined or set schema. Amazon indexes the data (for fast query access). Can query via the SELECT or the QUERY API. Designed for fast access to a (relatively) small amount of data. Pricing is based on amount of storage, CPU time consumed by queries, and amount of data transferred to/from storage; note that transferring data from storage to EC2 is free of charge (i.e., to perform a calculation over that data). SimpleDB is not storing raw data the way that S3 does but instead storing hash-table-like entries (i.e., <attr, value> pairs), each of which it indexes.
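
To make that concrete, here's a minimal sketch using the (third-party) boto Python library; the domain, item, and attribute names are all invented:

    import boto

    # Connect using AWS credentials from the environment/boto config.
    conn = boto.connect_sdb()
    domain = conn.create_domain('products')   # hypothetical domain name

    # An item is a bag of <attr, value> pairs; an attr can map to multiple values.
    domain.put_attributes('item001', {'color': ['red', 'blue'], 'size': 'large'})

    # Query with the SELECT API; results come back as items.
    for item in domain.select("select * from products where color = 'red'"):
        print item.name, dict(item)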

Amazon Simple Storage Service (S3)
Simple interface (web service) to store and retrieve any data. How is S3 used? People store at S3: static web data, backups of files, files to share, and data to be processed using an EC2 application. That data is stored on the same infrastructure that Amazon's marketplace runs on; thus, quick access, high reliability, sophisticated, fault tolerant.

Can write, read, or delete an object; each object is anywhere from 1 byte to 5 GB in size. When uploading an object, provide a key for it, which is subsequently used to retrieve the object. Each object is stored in a bucket in the same location that the bucket is stored (either US or Europe). Can grant access to the object to the public, to specific users, or to no one at all. Pricing is based on amount of storage occupied, amount of data transferred in or out, and the number of requests on data objects. The operations one can perform on data include: PUT, COPY, POST, LIST; it seems as though a data object is treated a bit like an opaque blob (in contrast to SimpleDB, which enables performing SQL-like queries over data objects, which in that case are attribute-value pairs).
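
A minimal boto sketch of the object lifecycle (bucket and key names invented):

    import boto

    conn = boto.connect_s3()
    bucket = conn.create_bucket('my-example-bucket')   # hypothetical, globally-unique name

    # PUT an object under a key of our choosing...
    key = bucket.new_key('backups/notes.txt')
    key.set_contents_from_string('some opaque blob of data')
    key.set_acl('public-read')                # grant access to the public (optional)

    # ...and later GET or DELETE it via the same key.
    data = key.get_contents_as_string()
    key.delete()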

Amazon Elastic Compute Cloud (EC2)
The first two Amazon services that we looked at were ways to upload, store, and then access that stored data in the cloud. EC2 provides a way to effectively upload an application to the cloud; i.e., to convert an application that you run on a server that you maintain in an equipment room under your control to an application that runs on Amazon's infrastructure. A small outfit might create a VM (an Amazon Machine Image (AMI)) and install the Apache web server on that VM, along with all of its site-specific web data. Then they would remap the DNS hostname to point to this VM, which is part of EC2. This means that the small outfit probably doesn't have to worry about denial-of-service (DoS) attacks anymore; plus, they don't have to maintain backup/redundant network connections etc. Much of the IT department's job is pretty much outsourced to Amazon. The small outfit just produces the web site data and uploads it to its VM, as appropriate.

The way I think of EC2 is as a replacement for a server that is used to host a customer-facing app; the most obvious and common type of these is a web server. So instead of running Apache locally, I'll simply move it to the Amazon cloud. I'm still running the same app on top of the same hardware configuration; only the location of the app (and hence who is responsible for ensuring connectivity to the app etc.) has changed. You create an instance on EC2 using the same parameters that you would use in selecting a server: HW config (memory, CPU speed, amount of storage), which libraries should be loaded on the server, and which applications should be loaded on the server. These are not web-apps or applications built to some Amazon API but rather the same Linux-based or Windows-based applications; the only difference is that the apps are now running on an Amazon VM.
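
A sketch of launching such an instance with boto (the AMI ID is a placeholder):

    import boto

    conn = boto.connect_ec2()

    # Launch one small instance of a (placeholder) machine image -- the same
    # knobs you'd turn when picking a physical server.
    reservation = conn.run_instances('ami-12345678', min_count=1, max_count=1,
                                     instance_type='m1.small')
    instance = reservation.instances[0]
    print instance.id, instance.state

    # ...and tear it down when done.
    conn.terminate_instances([instance.id])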

By contrast, other cloud service providers actually provide an API; you then write an application using that API. Your application is most likely a web service. For example, below (with Google App Engine), instead of uploading a virtual machine (and specifying hardware configuration etc.), one merely uploads an application, written using a Google-provided API. So it's much more lightweight, but you might need to rewrite your application so that it can be hosted on Google, whereas with Amazon EC2 your app runs in an identical environment, so no changes are needed to the application itself. (Note that Amazon does have an API for managing your EC2 instances, such as instantiating and running them as well as configuring access control among them.)

Microsoft Azure
We'll talk below about the Azure Services Platform's basic offerings.

Their whitepaper (pdf) — I don't recommend it; it's sort of typically sucky whitepaper writing (where they say that certain things are "harder than most people think" without providing any insight at all as to how or why); that said, there is some information in there. The basic offerings:
  1. Azure Execution: Run a Windows application on Microsoft's cloud computing operating system (OS). Not for running arbitrary applications...

  2. Azure Storage: Store data using Microsoft's cloud computing OS.

  3. .NET cloud service: Provides "distributed infrastructure services," such as a directory service for providing a mapping between a cloud service URI and an actual physical location.

  4. SQL cloud service: Presumably provides data storage/access.

  5. Live cloud service: Synchronizes data across desktops and devices.

An app can do any single one of the above or any combination of the above; e.g., might have an app which runs at the customer site (so-called "on-premises applications") but stores data on the cloud using Windows Azure Storage or stores data using MSFT's SQL cloud service (or both). An app can also run on the cloud, store data locally (on-premises), and access cloud services. Where specifically might an app be running? An app might run on the Azure OS in the cloud, on Windows Server in some company's equipment room, or on a user's desktop (using Windows XP or Vista) or mobile device (using Windows Mobile). This app might access cloud services. So the Azure Services Platform consists of a cloud OS (Azure) on which applications can run as well as a set of "developer services," including .NET services, SQL services, and Live Services.

Here is a video of Ray Ozzie (chief software architect at MSFT) introducing Azure. In it, he positions Azure as part of the "Web tier" of MS offerings (where the experience tier contains Windows Vista and Mobile and the enterprise tier contains Windows Server). The goals for Azure were two-fold: (1) Backwards compatibility, or *respect the installed base*; in Ray Ozzie's words: let current Windows developers apply their existing skills and knowledge as well as deploy existing code. So provide the ability to leverage knowledge of Windows developer tools (such as Visual Studio) and environments (e.g., .NET framework). (2) Provide an "open environment" in which programmers can use arbitrary languages, tools, and run-times (i.e., not be relegated to only being able to use Microsoft tools).




Windows Azure OS: Compute and Storage
There seems to be a bit of redundancy in offerings between (2) and (4) above but there are some key differences. Azure Storage can be used to store blobs and does not use SQL or any other relational DB. Data is accessed via the REST protocol and using "a straight-forward query language" (that is not SQL), whatever that means. The unstructured-ness of the data storage makes Azure Storage sound a lot like Amazon's S3 service which enables uploading blobs (of widely varying sizes) to the cloud. However, one can also use Azure Storage to put data in tables which can be queried; this sounds much more like Amazon's SimpleDB service. So maybe Azure Storage is effectively a hybrid of both Amazon offerings, in which case look for it to have worse performance since it's not optimized for either (quite different) case.
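
As a rough sketch of that REST-style access (account, container, and blob names invented; a real request must also carry a SharedKey Authorization header, an HMAC over the canonicalized request, which is omitted here):

    import httplib

    # PUT a blob into a hypothetical container in a hypothetical storage account.
    conn = httplib.HTTPConnection('myaccount.blob.core.windows.net')
    conn.request('PUT', '/mycontainer/hello.txt', 'an opaque blob',
                 {'Content-Type': 'text/plain',
                  'x-ms-date': 'Fri, 10 Jul 2009 00:00:00 GMT'})
    print conn.getresponse().status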


Initially, could only run apps built using the .NET framework on the Azure OS. These are apps written in C# and "other .NET languages." It's expected that one will be able to run non-.NET apps on Azure, but presumably these must still be Windows apps (and there are no details here). Managing apps that run on Azure is done via the app's configuration file, which identifies the # of instances and other parameters. To use the web-based management interface, you need a hosting account if you are going to actually run an app on Azure and a storage account if you are going to store data on Azure.

.NET Services: to address common infrastructure challenges in creating distributed apps.
  1. Access Control: I think the challenge is that one organization provides a user with an authentication token while another organization hosts the app. The solution is to supply a user with a token which contains a set of claims. That token is provided to the app, which determines what type of access the user has, given his claims. This requires a way to translate claims created in one scope (the user's affiliation) into equivalent claims in a second scope (the application provider); this is referred to as claims transformation (see the toy sketch after this list). Alternatively, one needs identity federation, which lets claims created in one scope be accepted in another.

    As a concrete (if toy) example, say that ACM has a policy which lets university students read its articles for free. So we need one entity (the university) to provide a token to each student which the student would provide to the second entity (ACM) demonstrating that he should be provided access to the requested ACM article. How does ACM know whether the token is valid (actually generated by the university) or not? What about token revocation? What language should the token use to encode the "right to read any article for free" such that the ACM article service automatically understands what that right means in this context?

    We hear the term federated in tech a lot and it refers to the existence of autonomous domains that cede some power to a central authority which provides some service in exchange (that no single domain itself could provide). When we talk of federated identity management, we mean that one entity vouches for the identity of some user to another entity, which consequently provides a service to that user. We might also think of it as providing a unifying description of a single person across multiple domains, which might use different authentication technologies etc. So ACM and the University might know Joe Student in different ways and might store information about his identity differently, but they both agree on a particular representation so that each can refer to Joe S. in a way that the other understands. Some additional reading on federated identity mgmt. What's the big deal? Why would you care? FIM lets a company easily outsource various business functions without requiring that the company divulge all of its customer information.

  2. Service Bus: It sounds like this is a directory or mapping service; an on-premises application provides a web service endpoint to the SB. The SB generates a URI for this endpoint; clients can locate and access the service via this URI (and with help from the SB's directory service, which lists all such URIs). Presumably, the URI resolves to some network location in the cloud and all traffic to/from that location is forwarded on to the on-premises application. So the SB acts as a network proxy for traffic to/from the web service endpoint. The actual mechanics are that the web service endpoint opens a persistent connection to the SB; this connection is not torn down. Since the connection originates at the on-premises location, there is no need to punch any holes in the firewall. Traffic is sent over this connection. The client only ever sees the SB's IP address, not the IP of the on-premises web service.

    Maybe the Service Bus can also do authentication so then instead of merely passing all received (client) traffic to the target application (endpoint), the SB will check the client's credentials and only pass traffic through to the target if those credentials are sufficient.

  3. Workflow: Have some logic in the cloud that coordinates various applications, integrating them to achieve some overall effect. Maybe each individual application communicates with the Workflow service to get its inputs and provide its outputs; the Workflow service in turn passes the first app's output to the second app as input and so on. (Again, all conjecture!)
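
As promised above, a toy Python sketch of claims transformation, following the ACM/university example; every issuer, claim type, and value is invented, and this shows only the mapping step (not token issuance, signing, or revocation, which real systems handle with signed tokens such as SAML):

    # Rules translating claims issued in one scope into claims in another.
    TRANSFORM_RULES = {
        ('examplestate.edu', 'role', 'enrolled-student'):
            ('acm.org', 'entitlement', 'read-articles-free'),
    }

    def transform(claim):
        # Map a claim (issuer, type, value) to its equivalent, if any.
        return TRANSFORM_RULES.get(claim)

    incoming = ('examplestate.edu', 'role', 'enrolled-student')  # from the user's token
    outgoing = transform(incoming)
    if outgoing:
        print 'grant:', outgoing   # a claim in the article service's own vocabulary
    else:
        print 'deny: no applicable transformation rule'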

SQL Services: A Database in the Cloud
Exposes both SOAP and RESTful interfaces. Built on MSFT's SQL Server. It sounds like data is stored in an RDF-style format (unstructured); a single data item consists of a property which has: a name, type, and value, which sounds a lot like an RDF statement consisting of a subject, property, and value (and is also reminiscent of Amazon's SimpleDB service described above). No pre-defined schema. So why didn't they just offer the actual SQL Server in the cloud? Issues with relational DBMSs: scalability, availability, and reliability. The unstructured organization of data makes replication and load balancing easier and faster.
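
A toy rendering of that flexible-entity model (item and property names invented); no two items need share a schema, which is what makes partitioning and replication easy:

    # Each item is just a bag of (name, type, value) properties.
    item1 = [('title', 'string', 'Cloud Notes'), ('pages', 'int', 12)]
    item2 = [('title', 'string', 'Vacation Photo'), ('taken', 'date', '2009-07-10')]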

Live Services: MSFT's API for web applications
Windows Live is a family of web applications provided by MSFT. It includes an email app, an instant message app, an app for managing photos, and an app for managing contacts. There is also an app for storing data (SkyDrive) as well as a calendar, something like Evite, and something called Office Live (for creating/editing documents). These applications use Live services to store and manage their data. What the Live Framework is all about is making that data available to other applications, i.e., to applications that are not part of the Windows Live suite. This would let users (you!) create apps and leverage that data; e.g., an app which used the map or directions info provided by the Live Framework. The MSFT Live suite is very reminiscent of Google's set of web services (such as Gmail, Google Docs, Google Calendar, Google Maps, Picasa, and so on).

What's sort of interesting about this is that MSFT talks about making map, email, contact, etc. data available through Live Services rather than talking about exposing a web services endpoint on each individual application (e.g., creating a web service endpoint as part of Hotmail which responds to queries for its data; doing the same for Office Live, SkyDrive, Messenger, and so on). Rather than creating all of those endpoints, since these applications already manage their data through Live Services, just let arbitrary web applications also access that data through Live Services (via the Live Framework). So they're creating a single interface on Live Services rather than an interface on each MSFT Live application.

The below figure is my conceptualization of Live Services. There is some suite of Windows apps that interact with and rely upon Live Services in order to store their data; this includes those mentioned above: Windows Live email, contacts, SkyDrive, Maps, Photos, ... The basic idea is to allow other applications to also access that data (maps, directions, photos, calendar, contacts, email, chat, ...). The way this is achieved is that Live Services lives in the cloud and is accessed (using HTTP or RSS) via the Live Operating Environment (LOE), which also lives in the cloud. Thus, any application that can request data via HTTP can use Live Services data (doesn't need to be a .NET framework app or even a Windows app generally).

The most exciting thing about Live Services is that you can use them to synchronize data across systems; e.g., you could create a mesh which consisted of your desktop computer, laptop, and mobile phone. Each such device would run an instance of the Live Operating Environment (LOE), which would be responsible for synchronizing data across all systems in the mesh (think of it as a network of communicating LOEs). The user specifies which data on these different systems should be kept in sync. Note that the different devices may be running very different operating systems, even OSs not in the MSFT family. LOE can be run on: Vista, XP, Mac OS X, and Windows Mobile 6. Every mesh also implicitly includes the user's Live Services account and hence includes his data stored on the cloud. Thus, if a user has contacts stored in his Hotmail account, these could be synchronized with those on his mobile device and so on. A user can also share data from his mesh with other users.


An app (which runs on a user's device or in the cloud) accesses that user's mesh data through the local LOE or from the cloud LOE. If the app wants to access the mesh data belonging to a user's friend, the app presumably interacts with the cloud LOE, requesting the friend's data (presumably the friend has already provided the necessary permissions). A user can also run an application on his mesh (i.e., on any machine within that mesh); data modified by that app will be kept synchronized. There is a directory of mesh-enabled web apps. A mesh-enabled web app can also be shared with friends (this is their social-networking-apps angle).

Another option involves a user who has a Windows Live account (with lots of associated data — from various Windows Live applications) but he has an iPhone, Linux computer, and so on (no device which runs a Windows OS). And the data he wants to keep synchronized lives on these systems for which there is no LOE. The solution is for the user to have Silverlight installed as a browser plug-in on each device; .NET apps can run on Silverlight; it's like creating a Windows execution environment inside the browser on top of any arbitrary OS. Then this app (which is built for Silverlight) will be able to access the cloud LOE. This is MSFT's multi-platform support, I think, and it uses Silverlight as a bootstrap. The application in this context is a web app.

(How does this work for data which is represented using different formats on different systems? E.g., office documents on Mac? contacts on Mac? and so on. Maybe practically the system works much better when all devices are running a Windows OS. Also: how about Live Services data? Which of it is available to everyone? Only the generic info? Is it possible for a user to make his data available to everyone? Or temporarily? I'm thinking about my Personal Marketplace idea here. And how a user might provide temporary access to (some portion of) his Live services data for a fee.)

Questions
* What is the isolation between multiple apps all running on the cloud OS (Azure)?
* Is accessing the services described via a new API? Or an existing API with new arguments?
* What about non-MS apps? Can they run on Azure? Store data on Azure? Take advantage of cloud services?

Google AppEngine
Note to self: read the following and update the below: http://www.stanford.edu/class/ee380/Abstracts/081105-slides.pdf

AppEngine provides a way for folks to run their web applications on Google's infrastructure; it is not a general-purpose computing platform for any arbitrary code. They run and serve code for you; your code can use some of the functionality used by Google's own web apps (e.g., Google Docs, Gmail, ...), such as Google Accounts, Google File System (GFS), and BigTable. Can write an AppEngine app in Python or Java (and, via JVM implementations, languages such as JavaScript and Ruby). Your app can run either in response to receiving a web request or to a cron job. You can also limit who can execute your app. AppEngine exposes a bunch of APIs including: datastore, memcache, URL fetch, mail, images, and Google Accounts.

AppEngine consists of:
  1. A "scalable serving architecture" — i.e., a way for a request to your web service to reach a running instance of your code. When you submit code to Google, it will be pushed to a bunch of fault tolerant servers. Google automatically scales resources for your code on-demand (you don't neeed to specify a priori the # of machines, CPUs, etc.).

    Your code runs in a sandbox which provides limited access to the underlying OS. The sandbox disallows writing to the file system; an app can only read files that are uploaded with the app's code. Google provides some APIs for persistent storage. So for example if your app's Java bytecode tries to open a socket or write to a file, a run-time exception will be generated.

  2. Python run-time (and now a Java run-time too). The Python run-time includes a fast interpreter. There is a uniform infrastructure, consisting of the API and developer tools, which applies regardless of the language you're programming in. (A minimal Python app is sketched just after this list.)

  3. SDK: written in Python; there's a release of it for Linux, Mac, and Windows. Comes with an AppEngine simulator that lets you test, run, and debug your apps locally (no need to deploy code to their servers in order to see what the code does).

  4. Web-based Administrative Console: see app status, control who can administer the app, control which version of the app receives the most hits; also contains a bunch of tools, including those which: let you look at request logs, app logs, your data objects, let you hook up a domain to your app, let you see which errors are being received on which URLs, see traffic reports, etc.

  5. Scalable data store, i.e., their persistence layer: this is based on BigTable instead of SQL.
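
The minimal Python app mentioned above might look like this sketch, using the SDK's bundled webapp framework (the handler and URL mapping are my own):

    from google.appengine.ext import webapp
    from google.appengine.ext.webapp.util import run_wsgi_app

    class MainPage(webapp.RequestHandler):
        def get(self):
            # Runs in response to an HTTP GET on the mapped URL.
            self.response.headers['Content-Type'] = 'text/plain'
            self.response.out.write('Hello from AppEngine')

    # Map URLs to handlers; the serving infrastructure routes requests here.
    application = webapp.WSGIApplication([('/', MainPage)], debug=True)

    def main():
        run_wsgi_app(application)

    if __name__ == '__main__':
        main()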

The Application Environment
* Serves web pages
* Automatically scales resources and load balances on-demand
* Provides API for authenticating users
* Provides API for sending email (using Google Accounts)
* Provides persistent storage which can be queried and sorted.
* Provides local development environment which simulates AppEngine on a user's computer
* Provides scheduled tasks

The Sandbox: limits access to the underlying OS
Why? So that your app is portable... i.e., it can be migrated from any system to any other system (since it doesn't have any environmental dependencies). How does the sandbox constrain app execution?
  1. Network: Outbound: your app can only connect to other hosts via the URL fetch and email services. Inbound: your app can only receive connections via HTTP or HTTPS on the standard ports.

  2. File system: Writing: App cannot write to the file system. Reading: App can only read the files that were uploaded with the app initially. Instead, the app should use the datastore, memcache, or other services in order to maintain state across requests.

  3. Execution: an app is only executed when it receives a web request or a cron job fires. An AppEngine program consists of a map between URLs (on which requests might be received) and the applicable program to run for each. An app must return a response within 30 seconds of being executed. After returning that response, the request handler (i.e., app) cannot spawn a subprocess or execute further code.

The JVM implements the restrictions of the sandbox environment; hence, bytecode that tries to write a file or open a socket will throw a run-time exception. The AppEngine implementation for Java also includes an implementation of some standard Java API functions so that such functions use AppEngine services. For example, if your program invokes the java.net HTTP APIs, your program will be using the AppEngine URL fetch service. This might make porting existing apps easier since you don't need to program to an entirely new API; rather, the changes are under-the-hood. Similarly, the APIs for JavaMail, Java Data Objects, and Java Persistence are all reimplemented by AppEngine. That said, AppEngine does have some new APIs which can be used to access its datastore, memcache, Google Accounts, URL fetch, mail, and images services. (So in some cases it appears there is >1 way to skin — or URL fetch — a cat.)
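
The Python run-time exposes the same service directly; a minimal sketch (the URL is a placeholder, and handle() stands in for app logic):

    from google.appengine.api import urlfetch

    def handle(body):
        # Placeholder for application logic.
        print len(body), 'bytes fetched'

    # Outbound HTTP is only permitted through the URL fetch service.
    result = urlfetch.fetch('http://example.com/feed.xml')
    if result.status_code == 200:
        handle(result.content)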


The Datastore
So this sounds like an RDF store which contains triples, each of which consists of a subject, predicate (or property), and object (or property value). Queries can be executed over this store, and the store is actually spread across multiple machines (so it's distributed). A property's value can be a String, Boolean, byte string, ..., and so on. This is another example of a non-relational database; it's schemaless. That said, your application code can create and enforce some structure on your datastore (would this be akin to requiring that data be part of some ontology?). Use the Java Data Objects (JDO) and Java Persistence API (JPA) interfaces to access the datastore, or access the datastore directly using its own API.

The datastore is strongly consistent and uses optimistic concurrency control.
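
In the Python run-time, defining and querying entities looks roughly like this sketch (the Note model and its properties are my own example):

    from google.appengine.ext import db

    class Note(db.Model):
        # The model class imposes structure in app code; the underlying store
        # is schemaless.
        content = db.StringProperty()
        created = db.DateTimeProperty(auto_now_add=True)

    Note(content='hello datastore').put()

    # GQL: a SQL-like query language over entities, not a relational engine.
    for note in Note.gql("WHERE content = :1", 'hello datastore').fetch(10):
        print note.created, note.content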

What is app execution cycle?
* Receive request on URL X
* Google AppEngine identifies handler (i.e. program) for X
* AppEngine invokes that handler (executes that program)
* Program must return a response within 30 seconds

An AppEngine application
An app consists of a *.yaml file which identifies the application's name and version, as well as which run-time and which version of the AppEngine API the app uses. The YAML file also provides a mapping from URLs to code which should be invoked when such URLs are visited. You can start up the development web server (on the local machine) via: $ dev_appserver.py <app-directory>, which will print out a URL which you then visit in your browser to test the app. E.g., the following encodes that visiting any URL should result in invoking the main.py Python script.
handlers:
- url: /.*
  script: main.py
Below is from Campfire One, which was held in April 2008.



