Data stream processing or complex event processing (CEP)‡ is one of the vital building blocks for dealing with real-time information, such as the sensor data provided by manufacturing machines. Stream processing operations cover different activities. These range from simple operations, like filtering data, i.e., discarding erroneous records, and aggregating data, i.e., computing key performance indicators from sensor data over a specific time window, to transformations that prepare the data for external systems, e.g., a proprietary monitoring infrastructure or the propagation of error notifications.
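To illustrate these two basic operations, the following minimal sketch in plain Java filters out erroneous readings and aggregates the remaining values over a tumbling time window; the SensorReading type, the NaN error marker, and the one-second window size are illustrative assumptions rather than part of any particular framework.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of two basic stream operations: filtering out
// erroneous readings and aggregating over a tumbling time window.
// SensorReading, the NaN error marker, and the window size are
// illustrative assumptions, not part of any specific framework.
public class BasicStreamOps {

    record SensorReading(long timestampMillis, double value) {}

    public static void main(String[] args) {
        List<SensorReading> stream = List.of(
                new SensorReading(0, 20.1),
                new SensorReading(400, Double.NaN),    // erroneous reading
                new SensorReading(900, 20.5),
                new SensorReading(1200, 21.0));

        long windowMillis = 1000;                      // tumbling window of 1 s
        List<Double> buffer = new ArrayList<>();
        long windowEnd = windowMillis;

        for (SensorReading r : stream) {
            if (Double.isNaN(r.value())) continue;     // filter: discard errors
            while (r.timestampMillis() >= windowEnd) { // window boundary reached
                emitAverage(buffer, windowEnd);
                buffer.clear();
                windowEnd += windowMillis;
            }
            buffer.add(r.value());
        }
        emitAverage(buffer, windowEnd);                // flush the last window
    }

    static void emitAverage(List<Double> values, long windowEnd) {
        if (values.isEmpty()) return;
        double avg = values.stream().mapToDouble(Double::doubleValue)
                           .average().orElse(0);
        System.out.printf("window ending %d ms: avg = %.2f%n", windowEnd, avg);
    }
}
```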
Data stream management systems. The first data stream processing systems, e.g., Aurora ‡ or Borealis ‡, originated from the database management domain. They extend the traditional database model to support the continuous nature of data streams and enable the user to query them. Besides these scientific prototypes, there are also commercial solutions for orchestrating different operations on streaming information, like System S by IBM ‡. System S allows processing different kinds of continuous data streams, e.g., financial, telecommunication, or health data, in a reliable and efficient manner. While System S is designed as a full-featured toolkit for professional users, including a graphical user interface, there are also several stream processing frameworks that originate from the need to process enormous amounts of streaming data for web-based applications. One of these frameworks is MillWheel ‡, which was integrated into the Google Cloud Platform ‡. Besides Google, other major cloud service providers also offer stream processing solutions, like Amazon IoT ‡ or Stream Analytics ‡ by Microsoft.
Besides these proprietary stream processing solutions, there are also open stream processing frameworks. One of the most popular open frameworks is Apache Storm ‡. Storm provides the developer with a programmatic toolkit to design complex sequences of stream processing operations, where these operations can contain any kind of logic, ranging from simple message forwarding to complex analysis operations. To cope with new requirements, like a distributed deployment, Heron ‡ has been developed as a seamless successor to Apache Storm. Other stream processing frameworks, like Apache S4 ‡, Apache Spark ‡, Apache Samza ‡, or Apache Apex ‡, pursue a more component-based approach: they provide different software components, i.e., software libraries, which can be used to build any kind of stream processing application, whereas Apache Storm focuses on processing the streaming information in a process-oriented manner, similar to System S.
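As a brief illustration of Storm's programming model, the following sketch wires a spout and two bolts into a topology. It assumes Storm 2.x; SensorSpout, FilterBolt, and AggregateBolt are hypothetical components introduced for illustration, not part of the Storm API.

```java
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

// Sketch of a Storm topology for sensor monitoring (assumes Storm 2.x).
// SensorSpout, FilterBolt, and AggregateBolt are hypothetical components.
public class SensorTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();

        // Source: emits raw sensor tuples.
        builder.setSpout("sensor-spout", new SensorSpout(), 2);

        // Stage 1: discard erroneous readings.
        builder.setBolt("filter-bolt", new FilterBolt(), 4)
               .shuffleGrouping("sensor-spout");

        // Stage 2: aggregate per machine over a time window.
        builder.setBolt("aggregate-bolt", new AggregateBolt(), 4)
               .fieldsGrouping("filter-bolt", new Fields("machineId"));

        // Run locally for demonstration purposes.
        try (LocalCluster cluster = new LocalCluster()) {
            cluster.submitTopology("monitoring", new Config(),
                                   builder.createTopology());
            Thread.sleep(10_000);
        }
    }
}
```

The shuffleGrouping and fieldsGrouping calls determine how tuples are repartitioned between the stages of the topology.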
Although most of these stream processing frameworks support deployment on clusters, there is only very limited support for resource elasticity based on the input data stream, as proposed by Ishii and Suzumura ‡. Some of the proprietary stream processing solutions, like Amazon IoT or Google Cloud Dataflow ‡, provide elastic scaling mechanisms to react to varying amounts of streaming data, but there are still numerous open research challenges regarding how to efficiently scale the computational resources so that all information is processed in real time while minimizing the costs for leasing those resources ‡‡.
The scientific community has already picked up some of these challenges and proposed basic scaling mechanisms based on thresholds ‡‡. Such threshold-based metrics allow for a coarse-grained scaling policy, but there is still much room for improvement, e.g., by considering predictive scaling based on Kalman filters ‡. Further, some projects already leverage the benefits of a distributed cloud for state-of-the-art stream processing frameworks; e.g., Cardellini et al. ‡ propose an extension for Apache Storm to realize a distributed deployment.
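A threshold-based scaling policy of the kind referenced above can be sketched as follows; the utilization thresholds and the cooldown period are illustrative assumptions, not values taken from any of the cited works.

```java
// Minimal sketch of a threshold-based scaling policy. All thresholds
// and the cooldown period are illustrative assumptions.
public class ThresholdScaler {
    private static final double UPPER = 0.80;        // scale out above 80% CPU
    private static final double LOWER = 0.30;        // scale in below 30% CPU
    private static final long COOLDOWN_MS = 60_000;  // wait between actions

    private long lastAction = 0;

    /** Returns the desired change in operator instances (+1, 0, or -1). */
    public int decide(double avgCpuUtilization, int currentInstances, long nowMs) {
        if (nowMs - lastAction < COOLDOWN_MS) return 0;  // avoid oscillation
        if (avgCpuUtilization > UPPER) {
            lastAction = nowMs;
            return +1;                                   // scale out
        }
        if (avgCpuUtilization < LOWER && currentInstances > 1) {
            lastAction = nowMs;
            return -1;                                   // scale in
        }
        return 0;
    }
}
```

Such a policy is coarse-grained because it only reacts after a threshold has been crossed; predictive approaches, e.g., based on Kalman filters, aim to provision resources before the load actually arrives.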
Besides the sole distribution on clusters, there are also several projects, e.g., CSA ‡ or VISP ‡‡, which incorporate a truly distributed approach, where the stream processing framework is deployed across different geographic locations. These distributed frameworks provide a promising solution to reduce the network load by preprocessing data close to its origin, e.g., filtering streaming data near the data provider.
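The following sketch hints at such edge-side preprocessing: readings are only forwarded upstream when they deviate noticeably from the last reported value. The deviation threshold and the forwarding target are illustrative assumptions.

```java
// Minimal sketch of edge-side preprocessing: only forward readings that
// deviate noticeably from the last forwarded value, reducing upstream
// network load. The deviation threshold and the forward() target are
// illustrative assumptions.
public class EdgeFilter {
    private static final double DELTA = 0.5;  // minimum change worth reporting
    private double lastForwarded = Double.NaN;

    public void onReading(double value) {
        if (Double.isNaN(lastForwarded)
                || Math.abs(value - lastForwarded) >= DELTA) {
            forward(value);                    // send to the central cluster
            lastForwarded = value;
        }                                      // otherwise drop locally
    }

    private void forward(double value) {
        System.out.println("forwarding " + value);  // placeholder transport
    }
}
```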
Data stream processing provides the foundation to process a continuous stream of information, such as monitoring data from manufacturing machines. Therefore, CREMA investigates cost-efficient resource provisioning algorithms for stream processing systems to minimize the costs of deploying the monitoring solutions required for predictive maintenance scenarios. Furthermore, research efforts within CREMA provide distributed stream processing engines that are capable of efficiently processing data originating from different geographic locations. Last, CREMA employs advanced RDF stream processing (RSP; see Semantic Stream Processing) for intelligent condition monitoring and optimization of manufacturing processes in the cloud.
Apache Storm. Storm has many use cases: real-time analytics, online machine learning, continuous computation, distributed RPC, ETL, and more. Storm is fast: a benchmark clocked it at over a million tuples processed per second per node. It is scalable and fault-tolerant, guarantees that data will be processed, and is easy to set up and operate. Storm integrates with existing queueing and database technologies. A Storm topology consumes streams of data and processes those streams in arbitrarily complex ways, repartitioning the streams between each stage of the computation as needed.
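One stage of such a topology could look like the following bolt, a possible implementation of the hypothetical FilterBolt from the topology sketch above (again assuming Storm 2.x): it drops erroneous readings and acknowledges every tuple so that Storm's processing guarantee can track it. The field names are illustrative.

```java
import java.util.Map;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

// Possible implementation of the hypothetical FilterBolt from the
// topology sketch above (assumes Storm 2.x). Field names are illustrative.
public class FilterBolt extends BaseRichBolt {
    private OutputCollector collector;

    @Override
    public void prepare(Map<String, Object> conf, TopologyContext ctx,
                        OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple tuple) {
        double value = tuple.getDoubleByField("value");
        if (!Double.isNaN(value)) {            // forward only valid readings
            collector.emit(tuple, new Values(  // anchor to the input tuple
                    tuple.getStringByField("machineId"), value));
        }
        collector.ack(tuple);                  // required for the processing guarantee
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("machineId", "value"));
    }
}
```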
PrEstoCloud. This solution addresses the challenge of cloud-based self-adaptive real-time Big Data processing, including mobile stream processing, and will be demonstrated and assessed in several challenging, complementary, and commercially promising pilots. There will be three PrEstoCloud pilots from the logistics, mobile journalism, and security surveillance application domains. The objective is to validate the PrEstoCloud solution, prove that it is domain-agnostic, and demonstrate its added value for attracting early adopters, thus initiating the exploitation process early on.
LeanBigData. The project pursues three main objectives:
• Architecting and developing three resource-efficient Big Data management systems typically involved in Big Data processing: a novel transactional NoSQL key-value data store, a distributed complex event processing (CEP) system, and a distributed SQL query engine. The project aims at least at one order of magnitude in efficiency gains by removing overheads at all levels of the Big Data analytics stack, taking into account technology trends in multicore technologies and non-volatile memories.
• Providing an integrated Big Data platform with these three main technologies (NoSQL, SQL, and streaming/CEP) that improves response times for unified analytics over multiple sources and large amounts of data, avoiding the inefficiencies and delays introduced by existing extract-transfer-load approaches. To achieve this, fine-grained intra-query and intra-operator parallelism will be used to reach sub-second response times.
• Supporting an end-to-end Big Data analytics solution that removes the four main sources of delays in data analysis cycles by using: 1) automated discovery of anomalies and root cause analysis; 2) incremental visualization of long analytical queries; 3) drag-and-drop declarative composition of visualizations; and 4) efficient manipulation of visualizations through hand gestures over 3D/holographic views.
Finally, LeanBigData will demonstrate these results in a cluster with 1,000 cores in four real industrial use cases with real data, paving the way for deployment in the context of realistic business processes.
CityPulse. The project will develop, build, and test a framework for semantic discovery and processing of large-scale real-time IoT and relevant social data streams for reliable knowledge extraction in a real city environment. Its key building blocks are:
• Virtualisation hiding the heterogeneity of the numerous data and information sources
• Large-scale data analytics for resource-efficient event detection in multiple data streams
• Semantic description frameworks and semantic analytics tools to provide machine-interpretable descriptions of data and knowledge
• Easy creation of real-time smart city applications through re-usable intelligent components