Cloud RAID Infrastructures

Summary

Cloud-based storage is commonly used for collaborative software solutions, like those of CREMA, because it offers several advantages over traditional storage solutions such as local storage or privately maintained storage systems. These advantages originate from the essential Cloud characteristics described by the NIST definition of Cloud Computing [1]:

  • Cloud-based data storage can be provisioned on demand, whenever it is needed.
  • The data stored in a cloud-based data storage can be accessed from any location using standard network mechanisms.
  • Cloud-based data storage supports resource pooling, which eases deployment for the storage provider.
  • The storage infrastructure adapts to changing resource demands, i.e., it can dynamically scale up and down.
  • The amount of storage used is measured, and the user only pays for the storage actually used.

These five advantages fostered the usage of Cloud-based storage services like Amazon S3, Google Cloud Storage or Microsoft Azure Storage. Besides these proprietary Cloud-based storage solutions, there are also several open solutions that emerged from EU research projects. Vision Cloud was an FP7 research project that designed and implemented a Cloud-based storage solution that can be set up as an individual Cloud-based storage. In addition, several other research projects have worked on related solutions, such as Contrail, which worked on a high-throughput elastic structured storage, and special-purpose Cloud-based storage solutions like BiobankCloud, which focuses on biobank data.

Although the Cloud-based storage solutions mentioned above feature several advantages, they also have downsides, namely vendor lock-in and inadequate security measures for ensuring the privacy of the data stored in the Cloud. Cloud-based RAID Infrastructures address these disadvantages and therefore represent an evolution of simple Cloud-based storage solutions.

Vendor lock-in. Vendor lock-in describes the situation in which data is locked in with one provider. It is one of the major weak spots of simple Cloud-based storage solutions, because a failure of that provider can lead to system failures. Further, lock-in makes it difficult to replace a storage provider if, for example, the provider goes out of business or another provider offers a better service. Cloud-based storage providers typically offer different application programming interfaces (APIs) to store data in the cloud. These proprietary APIs force software developers to design their software according to the guidelines of one provider, which makes it almost impossible to replace a Cloud-based storage provider at a later date. To mitigate the programmatic challenges of vendor lock-in, several programming frameworks have emerged, e.g., kloudless or Apache jclouds. These frameworks introduce an abstraction layer for Cloud-based data storage solutions, so that software developers can implement against a stable API that is compatible with several Cloud-based storage providers. This stable API eases migration among different cloud providers, but it also reduces the range of available functionality, since only a few functions are provided by all Cloud-based storage providers.
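
The abstraction-layer idea can be illustrated with a minimal Python sketch. It does not reproduce the API of kloudless or Apache jclouds; the names BlobStore, InMemoryStore and migrate are hypothetical and chosen only for illustration, and the in-memory class stands in for a real provider adapter. The interface is deliberately limited to operations that practically every provider offers, which mirrors the reduced range of functionality described above.

```python
from abc import ABC, abstractmethod


class BlobStore(ABC):
    """Hypothetical provider-neutral interface (not a real library API)."""

    @abstractmethod
    def put(self, key: str, data: bytes) -> None: ...

    @abstractmethod
    def get(self, key: str) -> bytes: ...

    @abstractmethod
    def delete(self, key: str) -> None: ...

    @abstractmethod
    def list_keys(self) -> list: ...


class InMemoryStore(BlobStore):
    """Stand-in for a provider adapter; a real one would call the provider's API."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._objects = {}

    def put(self, key: str, data: bytes) -> None:
        self._objects[key] = data

    def get(self, key: str) -> bytes:
        return self._objects[key]

    def delete(self, key: str) -> None:
        del self._objects[key]

    def list_keys(self) -> list:
        return sorted(self._objects)


def migrate(source: BlobStore, target: BlobStore) -> int:
    """Copy every object from source to target and return the number copied."""
    copied = 0
    for key in source.list_keys():
        target.put(key, source.get(key))
        copied += 1
    return copied


# Application code only depends on BlobStore, so the provider can be swapped.
old_provider = InMemoryStore("provider-a")
new_provider = InMemoryStore("provider-b")
old_provider.put("reports/2017.csv", b"id,value\n1,42\n")
print(migrate(old_provider, new_provider), new_provider.list_keys())
```

Because the application is written against the neutral interface, replacing a provider amounts to writing one new adapter and running a copy loop, at the price of giving up provider-specific features.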

Besides the migration challenges posed by vendor lock-in, there are also availability challenges, e.g., when a storage provider goes out of business on short notice or suffers a power outage. The scientific community has proposed several approaches to mitigate these effects. One of the most promising is to apply the principles of RAID storage to Cloud-based storage providers and combine several independent storage providers into one logical storage provider. This combination improves availability and latency, but it also introduces additional challenges, such as consistency across the different storage providers. Several researchers have already tackled these challenges by introducing an additional middleware that ensures the integrity of data stored in different locations.
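
To make the RAID principle concrete, the following Python sketch stripes an object across three storage providers using a single XOR parity stripe, similar in spirit to RAID 5. It is an illustrative assumption, not the scheme of any system cited on this page: the function names are hypothetical, the providers are simulated with plain dictionaries, and production systems typically use more general erasure codes.

```python
def split_with_parity(data: bytes) -> list:
    """Split data into two equally sized stripes plus one XOR parity stripe."""
    half = (len(data) + 1) // 2
    first = data[:half]
    second = data[half:].ljust(half, b"\0")  # pad so both stripes have equal length
    parity = bytes(a ^ b for a, b in zip(first, second))
    return [first, second, parity]


def recover(stripes: list, original_length: int) -> bytes:
    """Rebuild the data when at most one of the three stripes is missing (None)."""
    if sum(stripe is None for stripe in stripes) > 1:
        raise ValueError("single parity tolerates the loss of only one stripe")
    first, second, parity = stripes
    if first is None:
        first = bytes(a ^ b for a, b in zip(second, parity))
    elif second is None:
        second = bytes(a ^ b for a, b in zip(first, parity))
    # Drop the padding that was added to the second stripe.
    return (first + second)[:original_length]


# Three simulated providers, each holding exactly one stripe.
providers = [{}, {}, {}]
payload = b"CREMA sensor log"
for store, stripe in zip(providers, split_with_parity(payload)):
    store["log"] = stripe

providers[1].clear()  # simulate an outage of the second provider
stripes = [store.get("log") for store in providers]
print(recover(stripes, len(payload)) == payload)
```

In this layout the data survives the loss of any one of the three providers, at the cost of storing 1.5 times the original volume. The sketch ignores the consistency problem mentioned above: all three stripes must be written as a unit, which is what the additional middleware has to guarantee.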

Inspired by these scientific prototypes, a couple of commercial solutions have emerged that combine existing Cloud-based storage providers. One of them is MultCloud, which provides a common interface for several cloud providers to reduce the hassle of migrating files from one storage provider to another. Although MultCloud combines different Cloud-based storage providers, it does not offer any RAID-based replication features. SecureBeam, another commercial solution, integrates only three storage providers, but its application allows the user to distribute data across them. The only commercial solution so far that actually supports a distributed RAID deployment is Symform. Symform provides a software system to which users can upload their data. This data is then redundantly distributed across different cloud providers. The redundant deployment allows the system to retrieve the data even when one third of it is unavailable, whether due to a technical failure or to the bankruptcy of a Cloud-based storage provider.

Although several approaches already mitigate the risks of vendor lock-in, open research challenges remain. The goal is an open solution that deploys data in a RAID-like manner across different cloud storage providers, exploiting the advantages of Cloud-based storage while mitigating the potentially fatal consequences of a storage provider failure.

Security measures for Cloud-based storage solutions. The second major weak spot of Cloud-based storage solutions is the lack of privacy for the data stored in the cloud. Some storage providers, e.g., Amazon S3, already provide secure transmission between the user and the Cloud-based storage, as well as the possibility to encrypt the data automatically on the server side or with a client encryption library. Nevertheless, some issues remain unresolved. While secure transmission and encryption ensure that no third party can access the data during transmission, the storage provider may still be able to read and modify the data. Here again, RAID-like storage can help: because the data is distributed among several storage providers, each provider sees only a small part of the whole data. Furthermore, if one part of the data is modified, the modification can be detected and repaired using RAID. In addition, the encryption functionality of Amazon S3 does not cover metadata, e.g., the filename. To resolve the trust issues and the lack of metadata encryption, an additional encryption layer such as FADE is required. FADE is a middleware that not only ensures the privacy and integrity of the stored data, but also ensures that the data is no longer accessible once its deletion has been requested.
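
Two of the points above can be sketched with the Python standard library. The sketch is not how FADE or Amazon S3 work; it only illustrates, under stated assumptions, (a) hiding the filename from the provider by storing objects under a keyed hash of the name, and (b) locating a modified fragment with digests kept on the client side. The second point matters because a single parity stripe can reveal that the fragments are inconsistent, but not which fragment was changed. All function names and the secret are hypothetical, and encryption of the content itself is left out because it requires a cryptographic library beyond the standard library.

```python
import hashlib
import hmac


def opaque_key(secret: bytes, filename: str) -> str:
    """Derive a provider-visible object name that does not reveal the filename."""
    return hmac.new(secret, filename.encode("utf-8"), hashlib.sha256).hexdigest()


def fingerprint(fragment: bytes) -> str:
    """Digest stored by the client (not by the provider) for each fragment."""
    return hashlib.sha256(fragment).hexdigest()


def find_tampered(fragments: list, expected: list) -> list:
    """Return the indices of fragments that no longer match their digest."""
    return [
        index
        for index, (fragment, digest) in enumerate(zip(fragments, expected))
        if not hmac.compare_digest(fingerprint(fragment), digest)
    ]


secret = b"client-side secret (hypothetical)"
print(opaque_key(secret, "contracts/offer-2017.pdf"))

# One fragment per provider; the digests never leave the client.
fragments = [b"part-0", b"part-1", b"parity"]
expected = [fingerprint(fragment) for fragment in fragments]

fragments[1] = b"part-X"  # a provider silently modifies its fragment
print(find_tampered(fragments, expected))
```

Once the modified fragment is identified, it can be treated like a missing one and rebuilt from the remaining fragments using the redundancy of the RAID layout.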

Besides the scientific prototypes, there are also several client applications that encrypt the data before it is uploaded to a storage provider, like Boxcryptor or Arq. Although these software solutions apply trustworthy encryption to the data, they only support encryption at the file level, which is not acceptable for most software services, like those used in CREMA. To close this gap, the results of the FP7 research project TClouds need to be extended to ensure the privacy of the data while adding only little computational overhead.

Relation to CREMA

The CREMA Cloud RAID Infrastructure provides the foundation to store all kinds of data in the cloud, in a reliable manner, as required for CREMA.

General Links

  1. P. Mell and T. Grance, “The NIST definition of cloud computing,” National Institute of Standards and Technology, 2009, vol. 53, no. 6, pp. 50–56.

Articles

  1. P. Waibel, C. Hochreiner and S. Schulte, "Cost-Efficient Data Redundancy in the Cloud," 2016 IEEE 9th International Conference on Service-Oriented Computing and Applications (SOCA), Macau, China, 2016, pp. 1-9. Link
    Abstract: Nowadays, the usage of cloud storages to store data is a popular alternative to traditional local storage systems. However, besides the benefits such services can offer, there are also some downsides like vendor lock-in or unavailability. Furthermore, the large number of available providers and their different pricing models can turn the search for the best fitting provider into a tedious and cumbersome task. Furthermore, the optimal selection of a provider may change over time. In this paper, we formalize a system model that uses several cloud storages to offer a redundant storage for data. The according optimization problem considers historic data access patterns and predefined Quality of Service requirements for the selection of the best-fitting storages. Through extensive evaluations we show the benefits of our work and compare the novel approach against a baseline which follows a state-of-the-art approach.
  2. Y. F. R. Chen, "The Growing Pains of Cloud Storage," in IEEE Internet Computing, vol. 19, no. 1, pp. 4-7, Jan.-Feb. 2015. Link
    Abstract: The rapid growth of cloud storage has created challenges for storage architects to meet different customers' diverse performance and reliability requirements while controlling costs in a multitenant cloud environment. Erasure-coded storage and software-defined storage (SDS) could address these challenges and open up new opportunities for innovation, as well as ease the transition from traditional IT storage solutions to cloud storage solutions.
  3. D. Dobre et al., "Hybris: Robust Hybrid Cloud Storage," ACM Symposium on Cloud Computing SOCC, Seattle, USA, 2014. Link
    Besides well-known benefits, commodity cloud storage also raises concerns that include security, reliability, and consistency. We present Hybris key-value store, the first robust hybrid cloud storage system, aiming at addressing these concerns leveraging both private and public cloud resources.

    Hybris robustly replicates metadata on trusted private premises (private cloud), separately from data which is dispersed (using replication or erasure coding) across multiple untrusted public clouds. Hybris maintains metadata stored on private premises at the order of few dozens of bytes per key, avoiding the scalability bottleneck at the private cloud. In turn, the hybrid design allows Hybris to efficiently and robustly tolerate cloud outages, but also potential malice in clouds without overhead. Namely, to tolerate up to f malicious clouds, in the common case of the Hybris variant with data replication, writes replicate data across f + 1 clouds, whereas reads involve a single cloud. In the worst case, only up to f additional clouds are used. This is considerably better than earlier multi-cloud storage systems that required costly 3f + 1 clouds to mask f potentially malicious clouds. Finally, Hybris leverages strong metadata consistency to guarantee to Hybris applications strong data consistency without any modifications to the eventually consistent public clouds.

    We implemented Hybris in Java and evaluated it using a series of micro and macrobenchmarks. Our results show that Hybris significantly outperforms comparable multi-cloud storage systems and approaches the performance of bare-bone commodity public cloud storage.
  4. T. Papaioannou et al., "Scalia: an adaptive scheme for efficient multi-cloud storage," International Conference on High Performance Computing, Networking, Storage and Analysis, Salt Lake City, USA, 2012. Link
    A growing amount of data is produced daily resulting in a growing demand for storage solutions. While cloud storage providers offer a virtually infinite storage capacity, data owners seek geographical and provider diversity in data placement, in order to avoid vendor lock-in and to increase availability and durability. Moreover, depending on the customer data access pattern, a certain cloud provider may be cheaper than another. In this paper, we introduce Scalia, a cloud storage brokerage solution that continuously adapts the placement of data based on its access pattern and subject to optimization objectives, such as storage costs. Scalia efficiently considers repositioning of only selected objects that may significantly lower the storage cost. By extensive simulation experiments, we prove the cost-effectiveness of Scalia against static placements and its proximity to the ideal data placement in various scenarios of data access patterns, of available cloud storage solutions and of failures.
  5. D. Bermbach et al., "MetaStorage: A Federated Cloud Storage System to Manage Consistency-Latency Tradeoffs," IEEE 4th International Conference on Cloud Computing (CLOUD), Washington, USA, 2011. Link
    Cost and scalability benefits of Cloud storage services are apparent. However, selecting a single storage service provider limits availability and scalability to the selected provider and may further cause a vendor lock-in effect. In this paper, we present MetaStorage, a federated Cloud storage system that can integrate diverse Cloud storage providers. MetaStorage is a highly available and scalable distributed hash table that replicates data on top of diverse storage services. MetaStorage reuses mechanisms from Amazon's Dynamo for cross-provider replication and hence introduces a novel approach to manage consistency-latency tradeoffs by extending the traditional quorum (N,R,W) configurations to an (N_P,R,W) scheme that includes different providers as an additional dimension. With MetaStorage, new means to control consistency-latency tradeoffs are introduced.
  6. K. Abu-Libdeh et al., "RACS: a case for cloud storage diversity," 1st ACM symposium on Cloud computing, Indianapolis, USA, 2010. Link
    The increasing popularity of cloud storage is leading organizations to consider moving data out of their own data centers and into the cloud. However, success for cloud storage providers can present a significant risk to customers; namely, it becomes very expensive to switch storage providers. In this paper, we make a case for applying RAID-like techniques used by disks and file systems, but at the cloud storage level. We argue that striping user data across multiple providers can allow customers to avoid vendor lock-in, reduce the cost of switching providers, and better tolerate provider outages or failures. We introduce RACS, a proxy that transparently spreads the storage load over many providers. We evaluate a prototype of our system and estimate the costs incurred and benefits reaped. Finally, we use trace-driven simulations to demonstrate how RACS can reduce the cost of switching storage vendors for a large organization such as the Internet Archive by seven-fold or more by varying erasure-coding parameters.
  7. Y. Tang et al., "FADE: Secure Overlay Cloud Storage with File Assured Deletion," Proc. Security and Privacy in Communication Networks, Singapore, 2010. Link
    While we can now outsource data backup to third-party cloud storage services so as to reduce data management costs, security concerns arise in terms of ensuring the privacy and integrity of outsourced data. We design FADE, a practical, implementable, and readily deployable cloud storage system that focuses on protecting deleted data with policy-based file assured deletion. FADE is built upon standard cryptographic techniques, such that it encrypts outsourced data files to guarantee their privacy and integrity, and most importantly, assuredly deletes files to make them unrecoverable to anyone (including those who manage the cloud storage) upon revocations of file access policies. In particular, the design of FADE is geared toward the objective that it acts as an overlay system that works seamlessly atop today’s cloud storage services. To demonstrate this objective, we implement a working prototype of FADE atop Amazon S3, one of today’s cloud storage services, and empirically show that FADE provides policy-based file assured deletion with a minimal trade-off of performance overhead. Our work provides insights of how to incorporate value-added security features into current data outsourcing applications.
  8. P. Mell and T. Grance, "The NIST definition of cloud computing," National Institute of Standards and Technology, 2009. Link
  9. K. Bowers et al., "HAIL: A High-availability and Integrity Layer for Cloud Storage," 16th ACM Conference on Computer and Communications Security (CCS), Chicago, USA, 2009. Link
    We introduce HAIL (High-Availability and Integrity Layer), a distributed cryptographic system that allows a set of servers to prove to a client that a stored file is intact and retrievable. HAIL strengthens, formally unifies, and streamlines distinct approaches from the cryptographic and distributed-systems communities. Proofs in HAIL are efficiently computable by servers and highly compact---typically tens or hundreds of bytes, irrespective of file size. HAIL cryptographically verifies and reactively reallocates file shares. It is robust against an active, mobile adversary, i.e., one that may progressively corrupt the full set of servers. We propose a strong, formal adversarial model for HAIL, and rigorous analysis and parameter choices. We show how HAIL improves on the security and efficiency of existing tools, like Proofs of Retrievability (PORs) deployed on individual servers. We also report on a prototype implementation.

Software

  1. Amazon S3 Link
    Amazon Simple Storage Service (Amazon S3), provides developers and IT teams with secure, durable, highly-scalable object storage. Amazon S3 is easy to use, with a simple web services interface to store and retrieve any amount of data from anywhere on the web. With Amazon S3, you pay only for the storage you actually use. There is no minimum fee and no setup cost. Amazon S3 can be used alone or together with other AWS services such as Amazon Elastic Compute Cloud (Amazon EC2), Amazon Elastic Block Store (Amazon EBS), and Amazon Glacier, as well as third party storage repositories and gateways. Amazon S3 provides cost-effective object storage for a wide variety of use cases including cloud applications, content distribution, backup and archiving, disaster recovery, and big data analytics.
  2. Google Cloud Storage Link
    Google Cloud Storage offers developers and IT organizations durable and highly available object storage. Google created three simple product options to help you improve the performance of your applications while keeping your costs low. These three product options use the same API, providing you with a simple and consistent method of access.
  3. Microsoft Azure Storage Link
    Reliable, economical cloud storage for data big and small
  4. kloudless Link
    The last cloud storage API you’ll ever need Build multiple cloud storage services into your app by coding just once.
  5. JClouds Link
    Apache jclouds® is an open source multi-cloud toolkit for the Java platform that gives you the freedom to create applications that are portable across clouds while giving you full control to use cloud-specific features.
  6. MultCloud Link
    MultCloud - Put multiple cloud drives into one Free App for Managing Files and Transferring Files across Cloud Drives
  7. SecureBeam Link
    Combine all your clouds to the safest place in the web
  8. symform Link
    Cloud storage for all your digital stuff
  9. Boxcryptor Link
    Highest Security for your Files in the Cloud Boxcryptor – Secure your Cloud
  10. Arq Link
    Backup solution for OSX, that already encrypts the data locally.

Projects

  1. Biobankcloud (2012 - 2015): Scalable, Secure Storage of Biobank Data, FP7 ICT Programme of the European Commission Link
    The storage infrastructure for human biological material is generally known as a biobank. One of the main tenets of biobanking is the digitization of our genomic information for its archival and analysis. That is, in the future, vast amounts of genomic data will be derived from biomaterials stored in biobanks. The scale of the storage requirements for genomic data is huge - a single human genome amounts to coping with the analysis of three billion base pairs. In addition to the storage of genomic data, its analysis will require both massive parallel computing infrastructure and data-intensive computing tools and services to perform analyses in reasonable time. As of 2013, a huge wave of big data is approaching, driven by the decreasing cost of sequencing genomic data, which has been halving every 4 months since 2004. Biobanks store and catalogue human biological material, but they are not prepared to handle this wave of data - there is a biobank bottleneck.
  2. VISION CLOUD (2010 - 2013): Virtualized Storage Services Foundation for the Future Internet (IP), FP7 ICT Programme of the European Commission Link
    The goal of VISION Cloud is to introduce a powerful ICT infrastructure for reliable and effective delivery of data-intensive storage services, facilitating the convergence of ICT, media and telecommunications. This infrastructure will support the setup and deployment of data and storage services on demand, at competitive costs, across disparate administrative domains, while providing QoS and security guarantees.
  3. Contrail (2010 - 2014): Open Computing Infrastructures for Elastic Services, FP7 ICT Programme of the European Commission Link
    In the future of corporate IT, companies will rely on highly dynamic distributed IT infrastructures. Federation models are envisioned where a given organisation will be both a Cloud provider during periods when its IT infrastructure is not used at its maximal capacity, and a Cloud customer in periods of peak activity. The main contribution of CONTRAIL will be the development of an integrated approach to virtualization, offering Infrastructure as a Service (IaaS), services for federating IaaS Clouds, and Platform as a Service (PaaS) on top of federated Clouds. This service stack will be part of the CONTRAIL open source system, facilitating industrial up-take of Cloud computing. The main outputs of CONTRAIL are a collection of infrastructure services offering network, computation and storage as a service; services to federate IaaS Clouds; a set of high level services and runtime environments for typical Cloud applications, including efficient map/reduce, scalable service-oriented application hosting, and automatic workflow execution; and a set of applications and use cases from the domains of e-business, e-science, telecommunication and media using and demonstrating the CONTRAIL system. CONTRAIL leverages the open source XtreemOS system, developed in the successful XtreemOS European integrated project and which was designed for large scale dynamic infrastructures. XtreemOS integrates services for data, application, security and community management that can be adapted to provide a unified solution for building private, public and federated Cloud infrastructures. CONTRAIL has core virtualization technology integrated with its high-level services and its Cloud management facilities. This unique approach of covering "the whole Cloud" from the core infrastructure, via federation mechanisms, to management services, enables the construction of transparent, trusted and reliable Cloud platforms with operations governed by service level agreements.
  4. TClouds (2010 - 2013): Trustworthy Clouds – Privacy and Resilience for Internet-scale Critical Infrastructure, FP7 ICT Programme of the European Commission Link
    Protecting critical infrastructures providing communications, energy, or healthcare presents increasing ICT challenges as ICT itself has become vital to them. Internet-scale ICT infrastructures ("infrastructure clouds") promise scalable virtualised computing, network, and storage resources over the Internet. They provide scalability and cost-efficiency but pose significant new privacy and resilience challenges. Clouds may evolve into a single point of failure, threaten all dependent ICT, and put the Future Internet at risk. TCLOUDS builds a resilient Future Internet platform by progress in four areas: 1) Addressing the legal and business implications while building a regulatory framework for enabling privacy-enhanced cross-border infrastructure clouds. 2) Architecture and prototypes for a federation of trustworthy infrastructure clouds that build on complementary and mutually re-enforcing technical approaches: a) A Trustworthy Infrastructure Cloud enables individual providers to offer more resilient and privacy-aware infrastructure clouds. b) Privacy and Resilience for Commodity Clouds enables end users to put a security layer on top of existing commodity infrastructure clouds to enforce their security objectives. c) Federated Cloud-of-cloud Middleware offers privacy-protection and resilience beyond any individual cloud. This expands trust from trusted (enterprise-internal) clouds to less trusted (off-shored) ones or federates a set of partially trusted providers into a trustworthy and adaptive federation. 3) Validation and impact through benchmark scenarios: a) Smart power grids connect renewable energy sources and users. It is a premier example of an Internet of Things. b) Home healthcare provides prophylaxis to citizens. We focus on the privacy and usability challenges of cross-border usage of personal data. 4) Collaboration with complementary standardisation and FP7 projects maximises impact and fosters a European Trustworthy Cloud ecosystem.
This page was last changed on 9 May 2017, at 16:39.

