Cloud-based storage solutions are commonly used for collaborative software solutions, such as those of CREMA, because they offer several advantages over traditional storage solutions such as local storage or privately maintained storage systems. These advantages originate from the five essential Cloud characteristics described in the NIST definition of Cloud Computing: on-demand self-service, broad network access, resource pooling, rapid elasticity, and measured service.
These five advantages fostered the usage of Cloud-based storage services like Amazon AWS S3‡, Cloud Storage by Google‡ or the storage solution by Microsoft Azure‡. Besides these proprietary Cloud-based storage solutions, there are also several open solutions that emerged from EU research projects. Vision Cloud‡ was an FP7 research project that designed and implemented a Cloud-based storage solution that can be set up as an individual Cloud-based storage. Additionally, there have been several other research projects, like Contrail‡, which worked on a high-throughput elastic structured storage, and special-purpose Cloud-based storage solutions, like BiobankCloud‡, which focuses on biobank data.
Although the above-mentioned Cloud-based storage solutions feature several advantages, they also have downsides, namely vendor lock-in and inadequate security measures for ensuring the privacy of data stored in the Cloud. Cloud-based RAID infrastructures address both disadvantages and therefore represent an evolution of simple Cloud-based storage solutions.
Vendor lock-in. Vendor lock-in describes the situation in which data is locked in with a single provider. It is one of the major weak spots of simple Cloud-based storage solutions, because it can lead to system failures when that provider becomes unavailable. Further, the lock-in makes it difficult to replace a storage provider if, for example, the provider goes out of business or another provider offers a better service. Cloud-based storage providers typically offer different application programming interfaces (APIs) to store data in the cloud. These proprietary APIs force software developers to design their software according to the guidelines of one provider, which makes it almost impossible to replace a Cloud-based storage provider at a later date. To mitigate the programmatic challenges of vendor lock-in, several programming frameworks have emerged, e.g., kloudless‡ or Apache jclouds‡. These frameworks introduce an abstraction layer over the Cloud-based data storage solutions so that the software developer can implement against a stable API that is compatible with several Cloud-based storage providers. This stable API eases migration among different cloud providers, but it also reduces the range of available functionality, since only a few functions are provided by all Cloud-based storage providers.
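The abstraction-layer idea can be illustrated with a minimal sketch. The names `BlobStore`, `InMemoryStore` and `migrate` are hypothetical and chosen for illustration only; they are not the API of jclouds or kloudless. The interface is deliberately limited to the operations every provider can be assumed to support, which also shows why such a layer narrows the available functionality.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class BlobStore(ABC):
    """Hypothetical provider-neutral interface (an assumption, not a real library API)."""

    @abstractmethod
    def put(self, key: str, data: bytes) -> None: ...

    @abstractmethod
    def get(self, key: str) -> bytes: ...

    @abstractmethod
    def delete(self, key: str) -> None: ...

    @abstractmethod
    def list_keys(self) -> List[str]: ...


class InMemoryStore(BlobStore):
    """Stand-in for a real provider adapter; keeps objects in a dict."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._objects[key] = bytes(data)

    def get(self, key: str) -> bytes:
        return self._objects[key]

    def delete(self, key: str) -> None:
        del self._objects[key]

    def list_keys(self) -> List[str]:
        return sorted(self._objects)


def migrate(source: BlobStore, target: BlobStore) -> int:
    """Copy every object from source to target and return the number copied."""
    copied = 0
    for key in source.list_keys():
        target.put(key, source.get(key))
        copied += 1
    return copied
```

Because application code depends only on `BlobStore`, replacing a provider would amount to writing one new adapter class and calling `migrate`, rather than touching every call site.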
Besides the migration challenges posed by vendor lock-in, there are also availability challenges, e.g., when a storage provider goes out of business on short notice or suffers a power outage. The scientific community has proposed several approaches to mitigate these effects. One of the most promising is to apply the principles of RAID storage to Cloud-based storage providers and combine several independent storage providers into one logical storage provider‡‡‡. This fusion of different storage providers improves availability and latency, but it also introduces additional challenges, such as consistency across the different storage providers. Several researchers have already tackled these challenges by introducing an additional middleware that ensures the integrity of data stored in different locations‡‡‡.
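As a rough illustration of the RAID principle applied to storage providers, the following sketch splits an object into two data fragments plus one XOR parity fragment, one per provider, in the style of RAID 5. The function names and the 4-byte length header are assumptions made for this example; the cited systems use their own and often more elaborate coding schemes.

```python
from typing import List, Optional


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equally long byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def split_with_parity(data: bytes) -> List[bytes]:
    """Return [first half, second half, parity], one fragment per provider.

    A 4-byte length header is prepended so that padding can be removed later.
    """
    framed = len(data).to_bytes(4, "big") + data
    if len(framed) % 2:
        framed += b"\x00"  # pad to an even length so both halves match
    half = len(framed) // 2
    first, second = framed[:half], framed[half:]
    return [first, second, xor_bytes(first, second)]


def reconstruct(fragments: List[Optional[bytes]]) -> bytes:
    """Rebuild the object; None marks the fragment of an unavailable provider."""
    first, second, parity = fragments
    if [first, second, parity].count(None) > 1:
        raise ValueError("at most one fragment may be missing")
    if first is None:
        first = xor_bytes(second, parity)
    elif second is None:
        second = xor_bytes(first, parity)
    framed = first + second
    length = int.from_bytes(framed[:4], "big")
    return framed[4:4 + length]
```

With three providers, any single provider can fail without data loss, at a storage overhead of 50 percent instead of the 200 percent that three full copies would require.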
Inspired by these scientific prototypes, a couple of commercial solutions have emerged that combine existing Cloud-based storage providers. One of them is MultCloud‡, which provides a common interface for several cloud providers to reduce the hassle of migrating files from one storage provider to another. Although MultCloud combines different Cloud-based storage providers, it does not offer any RAID-based replication features. Securebeam‡, another commercial solution, integrates only three storage providers, but its application allows the user to distribute data across these three providers. The only commercial solution so far that actually supports a distributed RAID deployment is Symform‡. Symform provides a software system to which users can upload their data. The data is then redundantly distributed across different cloud providers. This redundant deployment allows the system to retrieve the data even when one third of it is unavailable, whether due to a technical failure or due to the bankruptcy of a Cloud-based storage provider.
Although several approaches already mitigate the risks of vendor lock-in, open research challenges remain on the way to an open solution that deploys data in a RAID-like manner across different cloud storage providers. Such a solution would exploit the advantages of Cloud-based storage while mitigating the potentially fatal consequences of a storage provider failure.
Security measures for Cloud-based storage solutions. The second major weak spot of Cloud-based storage solutions is the lack of privacy for data stored in the cloud. Some storage providers, e.g., Amazon S3, already provide secure transmission between the user and the Cloud-based storage, as well as the option to encrypt the data automatically on the server side or with a client encryption library, but some issues remain unresolved. While the secure transmission and encryption capabilities ensure that no third party can access the data during transmission, the storage provider may still be able to read and modify the data. Here again, RAID-like storage can help. Because the data is distributed among several storage providers, each provider sees only a small part of the whole data. Furthermore, if one part of the data is modified, the modification can be detected and repaired using the RAID redundancy. In addition, the encryption functionality of Amazon S3 does not cover metadata, e.g., the filename. To resolve the trust issues and the lack of metadata encryption, an additional encryption layer such as FADE‡ is required. FADE is a middleware that not only ensures the privacy and integrity of the stored data, but also ensures that the data is no longer accessible once its deletion has been requested.
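The detect-and-repair step can be sketched as follows. For brevity the example uses full mirrored copies (RAID 1 style) instead of striped fragments, and it assumes that a SHA-256 digest of each object is kept in a trusted place on the client side; both choices are assumptions made for illustration. With striped data, the same check would be applied per fragment.

```python
import hashlib
from typing import Dict, List


def digest(data: bytes) -> str:
    """SHA-256 digest, assumed to be stored in a trusted place by the client."""
    return hashlib.sha256(data).hexdigest()


def verify_and_repair(replicas: Dict[str, bytes], expected: str) -> List[str]:
    """Compare each provider's copy with the trusted digest.

    Corrupted copies are overwritten with an intact one. Returns the names
    of the providers whose copies had to be repaired.
    """
    good = [name for name, blob in replicas.items() if digest(blob) == expected]
    if not good:
        raise ValueError("no intact replica left")
    bad = [name for name in replicas if name not in good]
    for name in bad:
        replicas[name] = replicas[good[0]]
    return bad
```

The provider names used as dictionary keys are placeholders; in a real deployment each entry would be fetched from, and written back to, a separate storage provider.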
Besides the scientific prototypes, there are also several client applications, like boxcryptor‡ or Arq‡, which encrypt the data before it is uploaded to a storage provider. Although these software solutions apply trustworthy encryption to the data, they only support encryption at the file level, which is not acceptable for most software services, like those used in CREMA. To close this gap, the results of the FP7 research project TClouds‡ need to be extended so that the privacy of the data is ensured with only little computational overhead.
The CREMA Cloud RAID Infrastructure provides the foundation for reliably storing all kinds of data in the cloud, as required for CREMA.
Hybris robustly replicates metadata on trusted private premises (private cloud), separately from data, which is dispersed (using replication or erasure coding) across multiple untrusted public clouds. Hybris keeps the metadata stored on private premises on the order of a few dozen bytes per key, avoiding a scalability bottleneck at the private cloud. In turn, the hybrid design allows Hybris to efficiently and robustly tolerate not only cloud outages but also potential malice in clouds, without overhead. Namely, to tolerate up to f malicious clouds, in the common case of the Hybris variant with data replication, writes replicate data across f + 1 clouds, whereas reads involve a single cloud. In the worst case, only up to f additional clouds are used. This is considerably better than earlier multi-cloud storage systems, which required a costly 3f + 1 clouds to mask f potentially malicious clouds. Finally, Hybris leverages strong metadata consistency to guarantee strong data consistency to Hybris applications without any modifications to the eventually consistent public clouds. We implemented Hybris in Java and evaluated it using a series of micro- and macrobenchmarks. Our results show that Hybris significantly outperforms comparable multi-cloud storage systems and approaches the performance of bare-bone commodity public cloud storage.
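The common-case behaviour described above can be approximated with a short sketch. It is a strongly simplified illustration and not the Hybris implementation: the `Cloud` class, the use of SHA-256, and the plain dictionary standing in for the trusted metadata store are all assumptions, and timeouts, failover to additional clouds during writes, and erasure coding are not modelled.

```python
import hashlib
from typing import Dict, List, Tuple


class Cloud:
    """Hypothetical stand-in for an untrusted public cloud store."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.objects: Dict[str, bytes] = {}


def clouds_needed(f: int) -> Tuple[int, int]:
    """Return (f + 1, 3f + 1): common-case writes here vs. earlier systems."""
    return f + 1, 3 * f + 1


def write(key: str, data: bytes, clouds: List[Cloud],
          metadata: Dict[str, Tuple[str, List[str]]], f: int) -> None:
    """Replicate data to f + 1 clouds and record hash and locations as metadata."""
    targets = clouds[: f + 1]
    for cloud in targets:
        cloud.objects[key] = data
    metadata[key] = (hashlib.sha256(data).hexdigest(), [c.name for c in targets])


def read(key: str, clouds: List[Cloud],
         metadata: Dict[str, Tuple[str, List[str]]]) -> bytes:
    """Read from one cloud; fall back to the next only if the hash check fails."""
    expected, names = metadata[key]
    by_name = {c.name: c for c in clouds}
    for name in names:
        data = by_name[name].objects.get(key)
        if data is not None and hashlib.sha256(data).hexdigest() == expected:
            return data
    raise IOError("no cloud returned data matching the trusted hash")
```

For f = 1, `clouds_needed` yields 2 clouds for a common-case write compared with 4 for the earlier 3f + 1 designs. The saving rests on the trusted metadata: because the expected hash is known, a single correct copy is enough to answer a read.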