Allocating Resources in a Federated, Distributed Computing Infrastructure

2.15.05 -- A team led by Alex Snoeren and Amin Vahdat in the Center for Networked Systems wants to determine the best ways to allocate CPU, network bandwidth, and storage (memory and disk) in large-scale, federated systems. They want to optimize three criteria: efficiency (all available resources are used), utility (the resources are used in a productive fashion), and robustness (the resources are used fairly despite outages or attack).

Alex Snoeren

To conduct their studies, the team is using PlanetLab, a network of more than 400 nodes donated by universities and companies around the world. “These nodes are owned by different people,” says Snoeren, “and they’re all being used at the same time for different purposes and to satisfy various needs. It’s a great ‘playpen’ for us to work in.” Snoeren presented this work on behalf of his faculty colleagues and students at the CNS research review held in mid-January.


The team’s study must address competing demands from a large user population that wants to carry out multiple, disjoint tasks with access to only limited resources.


“Resource constraints are a reality,” says Snoeren. “As far as PlanetLab is concerned, you can’t just buy more -- at least on short time scales.” He also notes that tasks are not infinitely divisible; each requires some minimum set of resources to make progress. Informal workarounds have limits of their own: a user won’t necessarily know which other users to ask for more resources, and even when he does, there can simply be too many people to ask. Differences in time zones among the sites where the various resources are located make such coordination harder still.

This project attempts to address the “tragedy of the commons” that often befalls shared resources: PlanetLab doesn't have enough resources for all users to run at the same time.  The more users that try to run, the less progress any one of them makes. Eventually, the system ends up thrashing, a condition in which almost no useful work gets done.


Commonly used allocation approaches today include time sharing and batch scheduling (the latter especially in the Grid community, where job priorities and fairness are managed out-of-band).

Instead, Snoeren's team is exploring computational “markets,” where system utility is maximized by understanding what users really want. Says Snoeren, “We’re trying to figure out what the users’ ‘pain threshold’ is. What are they willing to pay for?”


The team tries to promote fair distribution of resources using a market-based combinatorial auction, paired with a resource discovery service that enables efficient resource utilization. “You need to know about all available resources to determine how much you’re willing to pay for a particular subset of them,” says Snoeren.


As a first step, the team has developed the SHARE architecture. Users request resources using abstract descriptions of what they want and utility functions (how much they're willing to pay for what they’re requesting). The resource discovery process then looks at all the resources in the pool and enumerates all instantiations that would meet the request. It constructs a set of options (“bids”) with corresponding prices and submits them to the auction. Resources are then handed out to the winning bidders.
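That flow can be made concrete with a small sketch. The Python below mimics the discovery step as described above: an abstract request and a utility function are expanded against a hypothetical resource pool into a list of priced bids. The node names, request fields, and utility function are invented for illustration; they are not SHARE's actual interfaces.

    from itertools import combinations

    # Hypothetical resource pool: node name -> attributes (names and fields are made up).
    POOL = {
        "planetlab1.ucsd.edu":    {"region": "NA", "ram_mb": 1024, "cpu_mhz": 3000},
        "planetlab2.mit.edu":     {"region": "NA", "ram_mb": 512,  "cpu_mhz": 2400},
        "planetlab1.cam.ac.uk":   {"region": "EU", "ram_mb": 1024, "cpu_mhz": 2800},
        "planetlab3.kaist.ac.kr": {"region": "AS", "ram_mb": 2048, "cpu_mhz": 3200},
    }

    def discover(request, pool):
        """Enumerate every instantiation of an abstract request.

        request: {"count": n, "min_ram_mb": m, "region": r or None}
        Returns a list of node-name tuples, each one a concrete way to satisfy the request.
        """
        eligible = [name for name, attrs in pool.items()
                    if attrs["ram_mb"] >= request["min_ram_mb"]
                    and (request["region"] is None or attrs["region"] == request["region"])]
        return list(combinations(eligible, request["count"]))

    def make_bids(request, utility, pool):
        """Attach a price to each instantiation using the requester's utility function."""
        return [(nodes, utility(nodes, pool)) for nodes in discover(request, pool)]

    # Example: any two North American nodes with at least 512 MB of RAM,
    # where the requester values faster CPUs more highly.
    request = {"count": 2, "min_ram_mb": 512, "region": "NA"}
    utility = lambda nodes, pool: sum(pool[n]["cpu_mhz"] for n in nodes) / 1000.0

    for nodes, price in make_bids(request, utility, POOL):
        print(nodes, "-> willing to pay", round(price, 2))

Enumerating every instantiation is what lets a requester price alternatives it would not have known to ask for individually, which is why the discovery service matters to the auction.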


A bid includes acceptable start times, the duration of the reservation, the set of resources needed, and how much the requester is willing to pay. A user can submit multiple bids, of which at most one will be satisfied; if more than one could be satisfied, the system chooses the bid that maximizes overall system utility.
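A rough sketch of what such a bid might look like, together with a brute-force clearing step, appears below. The field names and the exhaustive search over bid subsets are illustrative assumptions rather than the project's actual bid format or auction-clearing algorithm, which has to scale well beyond what brute force allows.

    from dataclasses import dataclass
    from itertools import chain, combinations

    @dataclass(frozen=True)
    class Bid:
        user: str
        start: int            # acceptable start time (hours, for simplicity)
        duration: int         # length of the reservation, in hours
        resources: frozenset  # node names being requested
        payment: float        # how much the requester is willing to pay

    def conflicts(a, b):
        """Two bids conflict if they come from the same user (at most one per user
        can win) or if they reserve a common resource during overlapping times."""
        if a.user == b.user:
            return True
        overlap = a.start < b.start + b.duration and b.start < a.start + a.duration
        return overlap and bool(a.resources & b.resources)

    def clear_auction(bids):
        """Pick the conflict-free subset of bids with the highest total payment,
        used here as a stand-in for overall system utility."""
        best, best_value = (), 0.0
        subsets = chain.from_iterable(combinations(bids, k) for k in range(1, len(bids) + 1))
        for subset in subsets:
            if all(not conflicts(a, b) for a, b in combinations(subset, 2)):
                value = sum(b.payment for b in subset)
                if value > best_value:
                    best, best_value = subset, value
        return best

    bids = [
        Bid("alice", start=0, duration=4, resources=frozenset({"node1", "node2"}), payment=10.0),
        Bid("alice", start=4, duration=4, resources=frozenset({"node1"}), payment=6.0),
        Bid("bob",   start=0, duration=4, resources=frozenset({"node2", "node3"}), payment=8.0),
        Bid("carol", start=0, duration=4, resources=frozenset({"node3"}), payment=5.0),
    ]
    for winner in clear_auction(bids):
        print(winner.user, "wins", sorted(winner.resources), "at t =", winner.start)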


In addition to the team’s PlanetLab deployment, called Bellagio, the SHARE architecture has also been implemented in a second system called Mirage, which allocates 200 quarter-sized sensor nodes (Berkeley “Motes”) at UC Berkeley and Intel Research. Motes present an interesting contrast to the more traditional computing resources available in PlanetLab, and they produce specialized bidding patterns. For example, users have proven to be very interested in where individual nodes are located around the room, since location affects how long radio transmission takes between nodes.


“Of course, we’re only using ‘play money’ so far,” says Snoeren with a laugh.


Ongoing work on the project includes learning more about user demand by collecting user data and identifying bidder stereotypes, determining how currency should be distributed to promote fairness, refining the auction-clearing algorithm, and figuring out how to enable resource bindings for both resource reservation and time sharing.
