In this paper we address the problem of dimensioning infrastructure, comprising both network and server resources, for large-scale decentralized distributed systems such as grids or clouds. In particular, we design the resulting grid/cloud to be resilient against network link or server failures. To this end, we exploit relocation: under failure conditions, a grid job or cloud virtual machine may be served at an alternate destination (i.e., different from the one under failure-free conditions). We thus consider grid/cloud requests to have a known origin, but assume a degree of freedom as to where they end up being served, which is typically the case for grid applications of the bag-of-tasks (BoT) type or for hosted virtual machines in the cloud case. We present a generic methodology based on integer linear programming (ILP) that (1) chooses a given number of sites in a given network topology at which to install server infrastructure, and subsequently (2) determines the amount of both network and server capacity to provision, to cater for both the failure-free scenario and failures of links or nodes. For the latter, we consider either failure-independent (FID) or failure-dependent (FD) recovery strategies. Our case studies on European-scale sample networks show that the total amount of network and server resources can be considerably reduced if relocation is exploited, especially in sparse topologies and for a higher number of server sites. We also note that adopting a failure-dependent rerouting strategy does lead to lower resource dimensions, but only when we adopt relocation (especially for a high number of server sites). Without exploiting relocation, we find that the potential savings of FD versus FID are not meaningful.
Published July 2012, 29 pages