Active Data:

Managing Data-Life Cycle on Heterogeneous Systems and Infrastructures

The Big Data challenge consists in managing, storing, analyzing and visualizing huge and ever-growing data sets to extract sense and knowledge. As the volume of data grows exponentially, managing these data becomes correspondingly more complex.

A key point is handling the complexity of the Data Life Cycle, i.e. the various operations performed on data: transfer, archiving, replication, deletion, etc. Data-intensive applications span a large variety of devices and e-infrastructures, which implies that many systems are involved in data management and processing.

Active Data is a new approach to automating data management applications and improving their expressiveness. Active Data:

  • makes it possible to reason about data sets when they are handled by distributed and heterogeneous systems and infrastructures.
  • consists of a formal model that captures the essential data life cycle stages and properties: creation, deletion, replication, derivation, transient unavailability, uniform naming, and many more.
  • provides a programming model that simplifies the development of data life cycle management applications. Active Data allows code execution at each stage of the data life cycle: routines provided by programmers are executed when events (creation, replication, transfer, deletion) happen to any data item.
  • allows legacy systems to expose their intrinsic data life cycle.

How Does It Work?

The Data Life Cycle Model (DLCM) is loosely based on the Petri net formalism. Petri nets have key advantages for representing a DLCM: they are graphical and easy for end users to understand, yet they are powerful enough to deal with the complex situations found in distributed systems, such as synchronization.

Thus, a DLCM is made of:

  • Places (circles), which represent the data states. A DLCM always starts with the CREATED place and finishes with the TERMINATED place.
  • Transitions (rectangles), which represent the operations performed on the data items.
  • Tokens (black dots), which represent data items. Each token carries identifiers that link it to the actual piece of information in the system, for instance a filename.
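
The three ingredients above can be illustrated with a minimal sketch. It is not the Active Data API: the class names `LifeCycle`, `Transition` and `Token`, the `fire` method, and the `REPLICATED` place are assumptions made up for this example. Only the CREATED and TERMINATED places come from the description above.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Transition:
    """An operation on a data item: moves a token from `source` to `target`."""
    name: str
    source: str
    target: str


@dataclass
class Token:
    """A data item; `identifier` links it to the real data (e.g. a filename)."""
    identifier: str
    place: str = "CREATED"  # every life cycle starts in CREATED


@dataclass
class LifeCycle:
    """Hypothetical DLCM: a set of places plus the transitions between them."""
    places: set
    transitions: dict = field(default_factory=dict)

    def add_transition(self, name, source, target):
        if source not in self.places or target not in self.places:
            raise ValueError(f"unknown place in transition {name!r}")
        self.transitions[name] = Transition(name, source, target)

    def fire(self, name, token):
        """Fire a transition; allowed only if the token is in its source place."""
        transition = self.transitions[name]
        if token.place != transition.source:
            raise RuntimeError(
                f"{name!r} is not enabled for a token in {token.place!r}"
            )
        token.place = transition.target


# Illustrative life cycle for one file (names are assumptions, see above).
model = LifeCycle(places={"CREATED", "REPLICATED", "TERMINATED"})
model.add_transition("replicate", "CREATED", "REPLICATED")
model.add_transition("delete", "REPLICATED", "TERMINATED")

token = Token(identifier="results.dat")
model.fire("replicate", token)
model.fire("delete", token)
print(token.place)
```

Once the token has reached TERMINATED, no transition in this toy model is enabled any more, so a further call to `fire` raises an error instead of silently changing state.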

"Data Life Cycle Model"

As the DLCM progresses, transitions fire and move tokens from one place to another. Active Data developers can attach handler code to each transition in the DLCM; the handler code is then executed whenever the transition is fired. The system is distributed: any node in the network can publish transitions and receive transition notifications. This is how distributed DLCM applications are developed.
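
The publish/subscribe behaviour described above can be approximated in a few lines. This is only a sketch under stated assumptions: `TransitionBus`, `subscribe` and `publish` are hypothetical names, and a single in-process object stands in for what Active Data does across nodes of a network.

```python
from collections import defaultdict


class TransitionBus:
    """Hypothetical in-process stand-in for distributed transition notification."""

    def __init__(self):
        # transition name -> list of handlers attached to it
        self._handlers = defaultdict(list)

    def subscribe(self, transition, handler):
        """Attach handler code to a transition."""
        self._handlers[transition].append(handler)

    def publish(self, transition, token_id):
        """Announce that a transition fired; run every attached handler."""
        for handler in self._handlers[transition]:
            handler(transition, token_id)


log = []


def archive_on_transfer(transition, token_id):
    log.append(f"archive {token_id} after {transition}")


def notify_on_delete(transition, token_id):
    log.append(f"notify owner: {token_id} deleted")


bus = TransitionBus()
bus.subscribe("transfer", archive_on_transfer)
bus.subscribe("delete", notify_on_delete)

bus.publish("transfer", "scan-001.h5")
bus.publish("delete", "scan-001.h5")
bus.publish("replicate", "scan-001.h5")  # no handler attached: nothing runs

print(log)
```

The point of the design is that the code publishing a transition does not need to know which handlers exist, which is what lets independent systems cooperate on the same data item.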

In addition, Active Data proposes a set of high-level features:

  • DLCM composition is the mechanism by which several DLCMs can be assembled together.
  • Data Tags and transition Guards are a powerful way of filtering data, conveying information across the systems and triggering handler execution only on specific data.
  • DLCM verification and online checking.
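
As a rough illustration of tags and guards, the sketch below filters handler execution on the tags attached to a data item. The names `GuardedBus`, `subscribe` and `publish`, and the representation of tags as a set of strings and of a guard as a predicate over that set, are all assumptions for this example; the actual Active Data interface may differ.

```python
class GuardedBus:
    """Hypothetical notifier where each handler is protected by a guard."""

    def __init__(self):
        self._subscriptions = []  # (transition, guard, handler) triples

    def subscribe(self, transition, handler, guard=lambda tags: True):
        """Attach a handler; `guard` decides from the tags whether it runs."""
        self._subscriptions.append((transition, guard, handler))

    def publish(self, transition, token_id, tags):
        """Run only the handlers whose transition matches and whose guard passes."""
        for name, guard, handler in self._subscriptions:
            if name == transition and guard(tags):
                handler(token_id, tags)


alerts = []

bus = GuardedBus()
bus.subscribe(
    "transfer",
    lambda token_id, tags: alerts.append(token_id),
    guard=lambda tags: "raw" in tags,  # only react to data tagged "raw"
)

bus.publish("transfer", "scan-001.h5", tags={"raw", "beamline-7"})
bus.publish("transfer", "plot.png", tags={"derived"})  # filtered out by the guard

print(alerts)
```

Because the tags travel with the published transition, a handler on one system can select data according to information attached by another system.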

Source Code

You can download the source code here

Download Active Data source code

Use Case: Data Surveillance Framework

In collaboration with Kyle Chard and Ian Foster from Argonne National Lab/University of Chicago, we designed a Data Surveillance Framework for the Advanced Photon Source experiment.

Ongoing Projects

  • Active Data is a joint work with Matei Ripeanu from UBC (Vancouver, Canada) and Samer Al-Kiswany.
  • Asma Ben Cheick (Ph.D. student) and Heittem Abbes (Associate Professor, Univ. Tunis) are working on using AD as a model for describing data-centric application deployment on IaaS infrastructures.

Publications

  • Active Data: A Programming Model for Managing Big Data Life Cycle. Anthony Simonet, Gilles Fedak, Matei Ripeanu. Future Generation Computer Systems, 2015. HAL
  • Active Data to Provide Smart Data Surveillance to E-Science Users. A. Simonet, K. Chard, G. Fedak, I. Foster. In Proceedings of EuromicroPDP'15, Turku, Finland, March 4-6, 2015. PDF
  • Active Data: A Data-Centric Approach to Data Life-Cycle Management. Anthony Simonet, Gilles Fedak, Matei Ripeanu, and Samer Al-Kiswany. 8th Parallel Data Storage Workshop (PDSW'13), Proceedings of SC13 workshops, Denver, November 2013 (position paper, 5 pages). PDF
  • MapReduce on Desktop Grids with BitDew and Active Data. Anthony Simonet, Lu Lu, Xuanhua Shi, Bing Tang, Jose-Francisco Saray, and Gilles Fedak. In Grid5K Winter School, France, 2013. PDF