Friday, June 5, 2009

Notes on Geambasu's CloudViews (HotCloud, June 2009)

Notes on CloudViews: Communal Data Sharing in Public Clouds
Venue: HotCloud 2009
Authors: Geambasu, Gribble, Levy (UW)

Increasingly, we're seeing corporate web services deployed on top of "the public cloud" (e.g., Amazon S3) rather than on the companies' own servers in their own data centers (private clouds). This creates some possibly new security issues when we have:
"thousands of independent and mutually distrustful Web services [that] share the same runtime environment, storage system, and cloud infrastructure"
This is referred to as a multi-tenant environment. There has already been some research into the issues relating to isolation, security, and privacy. But this (short) paper is about the opportunities presented by this new reality. In particular, the co-located web services are running on the same or nearby hardware, so it should be much easier to compose and integrate them. The ability to compose co-located web services comes from:
  1. Lots of network bandwidth and low latency enables tightly integrating web services in a way that couldn't be done when such services were separated by a WAN.
  2. Since co-located tenants are both using the same underlying storage system, it would be easy and efficient to allow such tenants to share data (stored on that system) with one another (i.e., doing this amounts to removing a barrier between the co-located tenants).
  3. The run-time system can be augmented with utilities that enable composing web services; e.g., a directory service that identifies services with which to integrate. There might also be utilities that implement composition operators; e.g., a generic utility that takes two web services and runs the first, feeding its output into the second.
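To make the "composition operator" idea in item 3 concrete, here is a minimal sketch of my own (not from the paper). It models each web service as a plain Python function; `compose`, `photo_lookup`, and `tagger` are hypothetical names invented for illustration, and a real utility would be making calls between co-located services rather than in-process function calls.

```python
from typing import Callable, TypeVar

A = TypeVar("A")
B = TypeVar("B")
C = TypeVar("C")


def compose(first: Callable[[A], B], second: Callable[[B], C]) -> Callable[[A], C]:
    """Return a new service that runs `first` and feeds its output into `second`."""
    def pipeline(request: A) -> C:
        return second(first(request))
    return pipeline


# Hypothetical stand-ins for two co-located services.
def photo_lookup(user: str) -> list[str]:
    """Pretend photo-hosting service: return the photo names for a user."""
    photos = {"alice": ["beach.jpg", "dog.jpg"]}
    return photos.get(user, [])


def tagger(photos: list[str]) -> dict[str, list[str]]:
    """Pretend tagging service: derive one tag per photo from its file name."""
    return {name: [name.split(".")[0]] for name in photos}


# The composed service: look up a user's photos, then tag them.
tag_user_photos = compose(photo_lookup, tagger)
```

Calling `tag_user_photos("alice")` runs the lookup first and hands its result to the tagger, which is all the generic operator described above needs to do.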
Client-side mashup: info from various parties aggregated in the client's browser
Server-side mashup: the mashup site aggregates the info from the various sites
Their focus is on server-side mashups.

The common storage infrastructure should provide:
  • Performance isolation: the sharer's performance is not adversely affected by the sharee accessing the sharer's data.
  • Access control: the sharee can only access the portion of the sharer's data that the sharee is supposed to be able to access.
  • Billing: the sharee can be billed for access to the sharer's data.
For example, Flickr wants to grant access to some subset of its photos to the web service Photosynth. With Amazon S3's "Requester Pays," Flickr can rent this data to Photosynth (i.e., provide Photosynth with access to the data in exchange for payment). There is also an "Authenticated query" feature to enable query access to your data. I think the latter is what CloudViews uses. The idea is that the data owner generates a cryptographically signed token which says that the owner gives permission to X to view the data returned by query Y. Then X presents this token to S3 and in return is granted access to the data returned by query Y.

  • To integrate/compose two (or more) webservices, the webservices need to be co-located. [Comment: By co-located, I mean located within the same datacenter. A datacenter may connect to a large Internet trunk and may contain multiple different cloud providers. Such cloud providers would be co-located within the same datacenter, though in different "cages" within that datacenter. This might be slightly different from the authors' working model, in which co-located web services are those that are part of the same public cloud (e.g., EC2). I think the focus should be on enabling integration of web services across clouds, because in some cases it will be efficient to do so. That is, two different cloud providers may in fact be separated by a WAN link, but it may be a low-latency, high-bandwidth link.]
  • An abstraction for describing the data to be shared. Should be able to describe this data precisely (i.e., with fine granularity). The abstraction should be independent of the format in which data is stored.
  • Protection: access control for enabling this sharing and preventing unintended sharing.
  • Resource allocation: If one webservice W1 is accessing another webservice W2's data, how do we prevent W1's access from disrupting W2's operation since they're both accessing the same data?
  • Richer ecosystem of utilities in order to enable easy composition of webservices. Could have thousands of utility services (e.g., a scheduling app that takes an arbitrary number of calendars and a desired event duration and determines the best time for that event based on the given calendars, which represent attendees' availability). There is lots of code replication at present, as each data center develops its own functionality substrate. It would be preferable to have each function implemented once and available to all.
  • Need common data formats. If we're going to expose a utility function that searches a photo database, then we need photo databases to all be in the same format in order to be able to use this utility function. E.g., if Flickr stored its photos in some proprietary format, we couldn't use ALIPR to process those photos if ALIPR doesn't understand that format.
  • Need incentives for developing these utility functions. If the utilities are part of a cloud infrastructure provider's platform, that would seem to be a competitive advantage (of one platform over another). Is that not sufficient incentive? Or do we want to allow someone who doesn't necessarily work for that provider to develop a utility for that provider's platform?
Their contribution: CloudViews
  • Implementation is being built on top of Hadoop HBase
  • Allows users "to create and share views over the common storage infrastructure"
  • View = query on the underlying DB.
  • All co-located webservices might have their data stored in the same underlying DB.
  • A webservice can share some portion of its data (i.e., a view of its data) by providing a query which returns that data. The type of access can be read-only or can also include edit/modify capability. E.g., Flickr can share a set of photos with ALIPR by providing a query which returns those photos; ALIPR uses this view to access the photos and then UPDATES each photo's tags.
  • Then we have:
  1. Each webservice's view is a query Q_w on the underlying DB.
  2. A webservice can provide access to some portion of its data D by providing a query, Q_p.
  3. To obtain D, the query describing the webservice's view (Q_w) would be composed with the query describing the shared data (Q_p) then executed on the underlying DB.
This is kind of "a view inside of a view"... Very Russian dolls.
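A small sketch of the "view inside of a view" composition, under loud assumptions: CloudViews is being built on HBase, whereas I use SQLite here only because it ships with Python and has nested views. The `photos` table, its columns, and the view names `flickr_view` (standing in for Q_w) and `shared_with_alipr` (standing in for Q_p) are all made up for illustration.

```python
import sqlite3

# An in-memory database stands in for the common storage infrastructure.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE photos (
        id INTEGER PRIMARY KEY,
        owner TEXT,
        title TEXT,
        public INTEGER
    );
    INSERT INTO photos (owner, title, public) VALUES
        ('flickr', 'sunset', 1),
        ('flickr', 'private party', 0),
        ('othersvc', 'invoice scan', 1);

    -- Q_w: the slice of the shared DB that belongs to the owning web service.
    CREATE VIEW flickr_view AS
        SELECT id, title, public FROM photos WHERE owner = 'flickr';

    -- Q_p: the portion the owner chooses to share, defined over its own view.
    CREATE VIEW shared_with_alipr AS
        SELECT id, title FROM flickr_view WHERE public = 1;
""")


def shared_titles(db: sqlite3.Connection) -> list[str]:
    """Evaluate the composed query (Q_p over Q_w) against the underlying DB."""
    rows = db.execute("SELECT title FROM shared_with_alipr ORDER BY id")
    return [row[0] for row in rows]
```

Querying `shared_with_alipr` makes the database compose the two queries and run the result on the base table, so the sharee never sees rows outside the owner's view or outside the shared subset.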

What is the authorization infrastructure?
Views are signed and "self-certifying" (all the info needed to verify that the entity who holds the view has access to it is contained within the signature); a signed view consists of:
  * the query which generates this view
  * the expiration date of the view
  * billing and resource management info
  * the signature over all of this and whatever data is needed to verify the signature

This self-certifying bit is a little unclear; they claim that mere possession of the signed view is proof that the holder is allowed to access the data specified by that view. But does that mean if Eve stole someone's signed view, then it would appear that Eve was allowed to access the data specified by the view? I.e., is the identity of the "allowed party" encoded in the signature? That would seem to be wise. And there seems to be a lot of handwaving about how no PKI is needed to support this. Definitely need more details here in order to evaluate that claim.

It is a bit unclear how these signed views are used; presumably, Flickr would use the S3 interface to obtain a signed view for ALIPR, which Flickr would then give to ALIPR. ALIPR would present this signed view (token) to Amazon S3; S3 would verify the signature and, if things checked out, would provide access to the data covered by the view.

Update Notifications
We also need to handle the fact that the data contained in a view may change after someone has "checked out" that view but before he's done with it. E.g., ALIPR gets the current view for query Q and then begins work on it. Flickr adds a bunch of photos; now Q would return a bunch more results. How to get the memo to ALIPR? Also, we don't want to have to re-scan the entire Flickr photo set periodically (in order to discover newly-added photos); instead, we want to scan the entire photo set once, at initiation, and then be notified when new photos are added. An even more important case is when Flickr removes a bunch of data and hence ALIPR's processing of that data is effectively worthless (since the data is no longer there).

They also imagine a new type of search engine that uses these views as its corpus. So instead of scanning the WWW, the search engine obtains the data contained in all of the views to which that engine has access. So the search engine is operating on DB data rather than on unstructured web pages. Then if a DB were updated, the search engine would want a notification of that change. Basically want an infrastructure that lets entities register to receive notification of changes.
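The registration idea can be sketched as a tiny publish/subscribe registry. This is my own illustration, not the paper's design: `ViewNotifier` is a hypothetical class, a view is modeled as a Python predicate over rows, and rows are plain dicts.

```python
from typing import Callable

Row = dict
Predicate = Callable[[Row], bool]
Callback = Callable[[str, Row], None]


class ViewNotifier:
    """Lets view holders register for change notifications on their view."""

    def __init__(self) -> None:
        self._subscriptions: list[tuple[Predicate, Callback]] = []

    def subscribe(self, view: Predicate, callback: Callback) -> None:
        """Register `callback` for changes to rows matched by `view`."""
        self._subscriptions.append((view, callback))

    def insert(self, row: Row) -> None:
        """Called by the data owner when a row is added."""
        self._publish("added", row)

    def delete(self, row: Row) -> None:
        """Called by the data owner when a row is removed."""
        self._publish("removed", row)

    def _publish(self, event: str, row: Row) -> None:
        # Only subscribers whose view covers the row hear about the change.
        for view, callback in self._subscriptions:
            if view(row):
                callback(event, row)


# Example: a subscriber that only cares about public photos.
events: list[tuple[str, str]] = []
notifier = ViewNotifier()
notifier.subscribe(lambda row: row["public"],
                   lambda event, row: events.append((event, row["title"])))
notifier.insert({"title": "sunset", "public": True})
notifier.insert({"title": "party", "public": False})
notifier.delete({"title": "sunset", "public": True})
```

With this in place the subscriber does one full scan at initiation and afterwards only handles "added" and "removed" events for rows its view covers, including the removal case where earlier processing becomes worthless.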

Directions from this work:
* Become familiar with major players in cloud infrastructure provider arena: Amazon AWS, Google AppEngine, Microsoft Azure
* Become familiar with Amazon S3 storage management options and query access options.
