Transactional Orchestration for Cloud Computing
Background and Challenges
The Infrastructure-as-a-Service (IaaS) cloud computing model, exemplified by Amazon EC2, gives users on-demand, near-instant access to a large pool of virtual cloud resources such as virtual machines (VMs), virtual block devices, and virtual private networks. The orchestration of these virtual resources over physical hardware, including provisioning, configuration, and decommissioning, is exposed to users as a service through programmable APIs. These APIs hide the complexity of the underlying orchestration.
From the cloud provider’s perspective, however, building a robust system to orchestrate cloud resources is challenging in terms of both scale and fault tolerance, as shown by recent studies of open-source cloud platforms. First, today’s large data centers typically run more than 10,000 machines built from commodity hardware. At that scale, software glitches and hardware failures, including power outages and network partitions, are the norm rather than the exception. This unreliability affects not only the virtual resources assigned to users, but also the controllers that orchestrate those resources. Second, the control logic needed to orchestrate a massively concurrent, multi-tenant IaaS environment is inherently complex. In particular, every engineering and service rule must be met while avoiding race conditions. The postmortem of the EC2 outage in April 2011 anecdotally reinforces both points: a human error in router configuration that violated an implicit service rule, together with a race condition in storage provisioning, contributed significantly to the prolonged downtime.
To address the challenges of cloud orchestration, we present DMF, a resource orchestration platform for building cloud services. As a programming and execution framework, DMF structures orchestration procedures as transactions: cloud service developers group accesses to and control of cloud resources into logical units that execute with atomicity, consistency, isolation, and durability (ACID) guarantees. Orchestration transactions are easy to use. They are written as ordinary procedures in an imperative language, and DMF automatically executes them in the cloud in a transactional manner. In particular, DMF evaluates constraints before physical deployment, so that a transaction cannot perform illegal resource manipulations or introduce misconfigurations. When it catches an unhandled exception during execution, DMF aborts the transaction and atomically rolls back to the state that held before the transaction started. DMF also performs concurrency control to avoid race conditions. Transactions thus give service providers a simple and powerful way to express and enforce consistency requirements for concurrent operations on the platform.
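The execution model described above can be illustrated with a minimal Python sketch. It is not DMF's actual API: the names `Orchestrator`, `Abort`, `spawn`, and `max_two_vms_per_host` are hypothetical, the "physical" layer is just an in-memory dictionary, and a single lock stands in for real concurrency control. The sketch first simulates a transaction on a copy of the state and checks constraints, and only then applies the steps, undoing them in reverse order if one fails.

```python
import copy
import threading


class Abort(Exception):
    """Raised when a transaction cannot commit (hypothetical name)."""


class Orchestrator:
    """Toy transactional executor; an illustration, not DMF's implementation."""

    def __init__(self, state, constraints):
        self.state = state              # e.g. {"host1": ["vm1"], "host2": []}
        self.constraints = constraints  # callables: state -> bool
        self._lock = threading.Lock()   # coarse stand-in for concurrency control

    def run(self, steps):
        """Run a list of (do, undo) pairs as one all-or-nothing unit."""
        with self._lock:
            # Phase 1: simulate on a copy and check constraints
            # before anything "physical" is touched.
            trial = copy.deepcopy(self.state)
            for do, _ in steps:
                do(trial)
            for check in self.constraints:
                if not check(trial):
                    raise Abort("constraint violated: " + check.__name__)

            # Phase 2: apply for real, remembering how to undo each step.
            undo_log = []
            try:
                for do, undo in steps:
                    do(self.state)
                    undo_log.append(undo)
            except Exception as exc:
                # Roll back completed steps in reverse order.
                for undo in reversed(undo_log):
                    undo(self.state)
                raise Abort("execution failed, rolled back") from exc


def spawn(host, vm):
    """Build a (do, undo) pair that places a VM on a host (hypothetical helper)."""
    def do(state):
        state[host].append(vm)

    def undo(state):
        state[host].remove(vm)

    return do, undo


def max_two_vms_per_host(state):
    """Example service rule, assumed for illustration: at most two VMs per host."""
    return all(len(vms) <= 2 for vms in state.values())
```

With `Orchestrator({"host1": ["vm1"], "host2": []}, [max_two_vms_per_host])`, a call such as `run([spawn("host1", "vm2")])` commits, while a transaction that would put a third VM on `host1` raises `Abort` during the simulation phase and leaves the state untouched. A production system would need finer-grained locking and durable logging, which this sketch omits.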
Cloud Services on DMF
We currently have working prototypes of two trial cloud services:
- TCloud: TCloud is a simplified version of what IaaS providers (e.g., Amazon EC2) might offer, with features similar to those of OpenStack. In TCloud, a customer may spawn new VMs from an existing disk image, and later start, shut down, or delete those VMs. To ensure traffic isolation, two VMs are placed on the same layer-2 broadcast domain (e.g., within a VLAN) if and only if they belong to the same customer.
- Follow-the-sun: VMs are migrated over the wide-area network across data centers to be closer to where work is being performed. A demo video shows what live VM migration across data centers looks like and how DMF automatically handles failures during orchestration. The demo is described in our CIDR paper.
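TCloud's isolation rule (same broadcast domain if and only if same customer) is the kind of constraint that can be checked against a logical model before deployment. The following Python sketch shows one way to express it. The function name `vlan_isolation_holds` and the `(name, customer, vlan)` tuple layout are assumptions made for illustration, not part of TCloud.

```python
from itertools import combinations


def vlan_isolation_holds(vms):
    """Return True if every pair of VMs shares a VLAN exactly when
    both VMs belong to the same customer.

    vms: list of (name, customer, vlan) tuples (layout assumed for this sketch).
    """
    return all(
        (cust_a == cust_b) == (vlan_a == vlan_b)
        for (_, cust_a, vlan_a), (_, cust_b, vlan_b) in combinations(vms, 2)
    )


# Hypothetical placement: each customer has a dedicated VLAN.
placement = [
    ("vm1", "alice", 10),
    ("vm2", "alice", 10),
    ("vm3", "bob", 20),
]
```

The "if and only if" matters in both directions: placing a VM of customer `bob` on VLAN 10 violates the rule, and so does placing a second VM of customer `alice` on a VLAN other than 10.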
Dec 12, 2012: We are integrating transactional cloud orchestration into OpenStack.
Dec 11, 2012: The TROPIC / DMF code has been released as open source under the Common Public License Version 1.0 (CPL).
Jan 10, 2012: AT&T announced the launch of a developer-friendly cloud based on OpenStack at CES 2012. The extension of the TROPIC work on transactional task management is highlighted in the announcement on the blog of AT&T's CTO, John Donovan. Also see OpenStack's blog and blueprints, and the coverage in major tech news outlets PCWorld and ITWorld.
- AT&T Labs - Research
- University of Pennsylvania
- Changbin Liu, Yun Mao, Xu Chen, Mary Fernandez, Boon Thau Loo, and Jacobus Van der Merwe. Transactional Resource Orchestration Platform In the Cloud. USENIX Annual Technical Conference (USENIX ATC'12), 2012.
- Changbin Liu, Lu Ren, Boon Thau Loo, Yun Mao, and Prithwish Basu. Cologne: A Declarative Distributed Constraint Optimization Platform. 38th International Conference on Very Large Data Bases (VLDB'12), 2012.
- Changbin Liu, Boon Thau Loo, and Yun Mao. Declarative Automated Cloud Resource Orchestration. In Proceedings of the ACM Symposium on Cloud Computing (SOCC), Cascais, Portugal, Oct 2011.
- Changbin Liu, Yun Mao, Xu Chen, Mary Fernandez, Boon Thau Loo, and Jacobus Van der Merwe. Towards Transactional Cloud Resource Orchestration. In USENIX NSDI'11, poster session.
- Changbin Liu, Yun Mao, Jacobus Van der Merwe, and Mary Fernandez. Cloud Resource Orchestration: A Data-Centric Approach. In Proceedings of the biennial Conference on Innovative Data Systems Research (CIDR), Asilomar, CA, Jan 2011.