Streaming saturation for large RDF graphs with dynamic schema information

In the Big Data era, RDF data are produced in high volumes. While there exist proposals for reasoning over large RDF graphs using big data platforms, there is a dearth of solutions that do so in environments where RDF data are dynamic, and where new instance and schema triples can arrive at any time. In this work, we present the first solution for reasoning over large streams of RDF data using big data platforms. In doing so, we focus on the saturation operation, which seeks to infer implicit RDF triples given RDF schema constraints. Indeed, unlike existing solutions which saturate RDF data in bulk, our solution carefully identifies the fragment of the existing (and already saturated) RDF dataset that needs to be considered given the fresh RDF statements delivered by the stream. Thereby, it performs the saturation in an incremental manner. Experimental analysis shows that our solution outperforms existing bulk-based saturation solutions.


Introduction
To take full advantage of semantic data and turn them into actionable knowledge, the semantic web community has devised techniques for processing and reasoning over RDF data (e.g. [4,19,23]). However, in the Big Data era, RDF data, just like many other kinds of data, are produced in high volumes. This is partly due to sensor data produced in the context of health monitoring and financial market applications, feeds of user-content provided by social network platforms, as well as long-running scientific experiments that adopt a streamflow programming model [12]. This trend generated the need for new solutions for processing and reasoning over RDF datasets since existing state of the art techniques cannot cope with large volumes of RDF data.
A typical and fundamental operation for reasoning about RDF data is data saturation. This operation involves a set D of RDF data triples and a set S of semantic properties, expressed in RDF Schema and/or OWL, and aims at inferring the implicit triples that can be derived from D by using the properties in S. Data saturation is crucial to ensure that RDF processing and querying work on the complete informative content of an RDF database, without ignoring implicit information. To deal with the problem of saturating massive RDF datasets, a few approaches exploiting big data paradigms (namely Map-Reduce [11]) and platforms, notably Hadoop and Spark (see e.g., [8,20]), have already been proposed. In [20] Urbani et al. described the WebPIE system and showed how massive RDF data can be saturated by leveraging the Map-Reduce paradigm over Hadoop. In [8] Gu et al. presented the Cichlid system and showed how to speed up saturation by using Spark. In [15,16] the authors proposed a parallel reasoning method based on P2P self-organizing networks, while in [24] the authors proposed a parallel approach for RDF reasoning based on MPI. These approaches, however, assume that RDF datasets are fully available prior to the saturation, and as such are not designed to saturate RDF data produced continuously in streams. Indeed, when RDF data are produced in streams, such systems must reprocess the whole data collection in order to obtain the triples entailed by the newly received ones. This is because both the initial triples and the triples obtained by past saturation can entail new triples in the presence of newly received instance or schema triples. A number of works tackled the problem of incremental saturation [3,14,22,25], but these approaches, being mostly centralized, do not ensure scalable, distributed, and robust RDF streaming saturation.
To overcome these limitations, in this work we present the first distributed technique for saturating streams of large RDF data, relying on a Spark cluster to ensure scalability and robustness. We rely on RDF Schema as the language for defining property triples since, despite its simplicity, RDF Schema is rich enough to make the efficient saturation of large streaming RDF data far from trivial. The main challenge is to quickly process fresh data that must be joined with previously received data, whose volume can soon become very large in the presence of massive streams. To this end, unlike existing solutions [8,20] for large-scale RDF saturation, upon the arrival of new RDF statements (both schema and instance triples) our solution precisely identifies the subset of the existing (and already saturated) RDF dataset that needs to be considered. This is achieved by means of an indexing technique that we devised for our approach.
The paper is organized as follows. Section 2 presents preliminaries about RDF saturation and Spark streaming, while Section 3 describes the state of the art concerning large-scale RDF saturation using Spark. Section 4 presents an overview of our technique by means of examples, while Section 5 describes the algorithms. Section 6 is dedicated to the performance evaluation of our approach. Sections 7 and 8, respectively, discuss related work and future perspectives.

RDF and Semantic Data Reasoning
An RDF dataset is a set of triples of the form s p o. s is an IRI or a blank node that represents the subject. IRI stands for Internationalized Resource Identifier, and is used in the semantic web to identify resources. p is an IRI that represents the predicate, and o is an IRI, a blank node or a literal, and it stands for the object. Blank nodes, denoted as _:$b_i$, are used to represent unknown resources (IRIs or literals). RDF Schema (or RDFS for short) provides the vocabulary for specifying the following relationships between classes and properties, relying on a simplified notation borrowed from [7]: subclass relationship $\prec_{sc}$: the triple $c_1 \prec_{sc} c_2$ specifies that $c_1$ is a subclass of $c_2$; subproperty relationship $\prec_{sp}$: the triple $p_1 \prec_{sp} p_2$ specifies that $p_1$ is a sub-property of $p_2$; property domain $\leftarrow_d$: the triple $p \leftarrow_d x$ specifies that the property p has x as a domain; and property range $\rightarrow_r$: the triple $p \rightarrow_r z$ specifies that the property p has z as a range. For the sake of readability, in what follows we use simple strings instead of IRIs to denote predicates, subjects and objects in triples. Also, we abbreviate the rdf:type predicate with the symbol $\tau$.
Example 2.1. Figure 2 illustrates a set of RDF instance triples that we use as a running example, together with the equivalent graph representation. The graph describes the resource $doi_1$ that belongs to an unknown class, whose title is "Complexity of Answering Queries Using Materialized Views", whose author is "Serge Abiteboul", and which has an unknown contact author. This paper is in the proceedings of an unknown resource whose name is "PODS'98". Lastly, the IRI edbt2013 is a conference and hasName, the property associating names to resources, is created by "John Doe". Figure 1 lists schema triples. It specifies that posterCP is a subclass of ConfP, and that the property hasContactA is a subproperty of hasAuthor. It also specifies that the property hasAuthor has paper as its domain and a literal as its range. As in other works (e.g., [7,8,20]) we focus on the core rules of RDFS, the extension to other rules being trivial. In particular, we consider here rules 2, 3, 5, 7, 9, and 11 among the 13 RDFS rules illustrated in Table 1.
The realm of the semantic web embraces the Open World Assumption: facts (triples) that are not explicitly stated may hold given a set of RDFS triples expressing constraints. These are usually called implicit triples and, in our work, we consider the problem of RDF saturation, i.e., given a set of RDFS rules, inferring all possible implicit triples by applying these rules to explicit triples or, recursively, to implicit triples. For example, rule rdfs2 in Table 1 states that, if a property p has a domain x, then given a triple s p o we can infer that s is of type x. Similarly, rdfs9 specifies that, if s is of type x and x is a subclass of y, then we can infer that s is of type y. In the remaining part of the paper, we use the following notation to indicate derivations (inferences) of triples. A derivation tree is defined as follows:
$$T ::= t \;\mid\; \{T_1 \mid T_2\} \xrightarrow{\mathrm{rdfs}X} t$$
where the rule number X ranges over {2, 3, 5, 7, 9, 11}. A derivation tree can be empty, hence consisting of a given triple t, or can be of the form $\{T_1 \mid T_2\} \xrightarrow{\mathrm{rdfs}X} t$, meaning that the tree derives t by means of rule rdfsX, whose premises are (matched to) the two triples given by $T_1$ and $T_2$, respectively. So, for instance, we can have the following derivation tree $T_1$ for the G and S previously introduced:
$$T_1 = \{hasTitle \leftarrow_d confP \;\mid\; doi_1\ hasTitle\ \text{"CAQUMV"}\} \xrightarrow{\mathrm{rdfs}2} doi_1\ \tau\ confP$$
Moreover, we can have the following derivation $T_2$ relying on $T_1$:
$$T_2 = \{T_1 \;\mid\; confP \prec_{sc} paper\} \xrightarrow{\mathrm{rdfs}9} doi_1\ \tau\ paper$$
In the following, given a set of instance RDF triples D and a set of schema triples S, we say that T is over D and S if the derivation tree uses triples in D and S as leaves. Moreover, we define the saturation of D over S as D extended with all the possible instance triples obtained by means of derivation (below, derivation trees are assumed to be over D and S):
$$D^*_S = D \cup \{\, t \;\mid\; \{T_1 \mid T_2\} \xrightarrow{\mathrm{rdfs}X} t,\ X \in \{2, 3, 7, 9\} \,\}$$
Notice above that, say, $T_2$ can be a derivation tree totally over S, recursively applying rule 5 (or rule 11), thus deriving a triple in $S^*$, defined as:
$$S^* = S \cup \{\, t \;\mid\; \{T_1 \mid T_2\} \xrightarrow{\mathrm{rdfs}X} t,\ X \in \{5, 11\} \,\}$$
In the definition of $S^*$, note that since X ∈ {5, 11} the whole derivation tree consists of successive applications of rule 5 (or rule 11).
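As a concrete reading of these definitions, the following Python sketch computes the saturated schema and the saturated instance set by naive fixpoint iteration over the six rules considered here. It is only an illustration under stated assumptions: the short predicate strings ("sc", "sp", "domain", "range", "type") and the function names are hypothetical, triples are plain tuples, and nothing in it reflects the distributed implementation.

```python
# Triples are (subject, predicate, object) tuples. The predicate strings are
# illustrative stand-ins for rdfs:subClassOf, rdfs:subPropertyOf, rdfs:domain,
# rdfs:range and rdf:type.
SC, SP, DOM, RNG, TYPE = "sc", "sp", "domain", "range", "type"


def saturate_schema(schema):
    """Close the schema under rdfs5 (sp transitivity) and rdfs11 (sc transitivity)."""
    closed = set(schema)
    changed = True
    while changed:
        changed = False
        for (a, p1, b) in list(closed):
            for (c, p2, d) in list(closed):
                if p1 == p2 and p1 in (SP, SC) and b == c:
                    if (a, p1, d) not in closed:
                        closed.add((a, p1, d))
                        changed = True
    return closed


def saturate_instances(data, schema):
    """Extend data with every instance triple derivable by rdfs2, 3, 7 and 9."""
    s_star = saturate_schema(schema)
    result = set(data)
    changed = True
    while changed:
        changed = False
        for (s, p, o) in list(result):
            derived = set()
            for (x, q, y) in s_star:
                if q == DOM and x == p:
                    derived.add((s, TYPE, y))      # rdfs2
                elif q == RNG and x == p:
                    derived.add((o, TYPE, y))      # rdfs3
                elif q == SP and x == p:
                    derived.add((s, y, o))         # rdfs7
                elif q == SC and p == TYPE and x == o:
                    derived.add((s, TYPE, y))      # rdfs9
            new = derived - result
            if new:
                result |= new
                changed = True
    return result


# The two derivation trees of the running example: rdfs2, then rdfs9.
schema = {("hasTitle", DOM, "confP"), ("confP", SC, "paper")}
data = {("doi1", "hasTitle", "CAQUMV")}
saturated = saturate_instances(data, schema)
```

On this input the first pass derives the typing of doi1 as confP (rdfs2) and the second pass derives its typing as paper (rdfs9), mirroring the two trees above.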

Spark and Spark Streaming
Spark [26] is a widely used in-memory distributed cluster computing framework. It provides the means for specifying DAG-based data flows using operators like map, reduceByKey, join, filter, etc. over data collections represented as Resilient Distributed Datasets (RDDs). For our purposes, we use the streaming capabilities of Spark, whereby data arrive in micro-batches, each of which needs to be processed within a time interval (also referred to as a window).

Saturating Large RDF Graphs Using Spark
We already briefly discussed in the introduction the Cichlid system [8], which represents the state of the art of RDF saturation, and WebPIE [20]. As in our case, these systems focus on rules 2, 3, 5, 7, 9, and 11, illustrated in Table 1.
While the outcome of the saturation operation is independent of the order in which the rules are applied, the time and resources consumed by such an operation are not. Because of this, the authors of Cichlid (and WebPIE before them) identified a number of optimisations that influence the rule application order with a view to increasing the efficiency of the saturation. In what follows, we discuss the main ones.
1. RDF Schema is to be saturated first. The size of the RDF schema in an RDF graph is usually small, even when saturated. It is usually orders of magnitude smaller than the size of the remaining instance triples. This suggests that the schema of the RDF graph should be saturated first. By saturating the schema of an RDF graph we mean applying the rules that produce triples describing the vocabulary used in the RDF graph. Furthermore, because the schema is small, schema saturation can be done in a centralized fashion. In this respect, the RDFS rules presented in Table 1 can be divided into two disjoint categories: schema-level and instance-level RDFS rules. Schema-level RDFS rules (rdfs5 and rdfs11) are the rules that produce triples describing the vocabulary (classes, properties, and their relationships). Instance-level rules, on the other hand, produce triples that describe resource instances of the classes in the RDF vocabularies and their relationships. Each rule is made up of two premises and one conclusion, each of which is an RDF triple. While the premises of schema-level rules are schema triples, the premises of instance-level rules are a schema triple and an instance triple. Also, instance-level rules entail an RDF instance triple, while schema-level rules entail an RDF schema triple.
2. Dependencies between rules. When determining the rule execution order, the dependencies among rules must be taken into account too. In particular, a rule R i precedes a rule R j if the conclusion of R i is used as a premise for rule R j . For example rdfs7 has a conclusion that is used as a premise for rules rdfs2 and rdfs3. Therefore, rdfs7 should be applied before rdfs2 and rdfs3.
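The two considerations above can be sketched as a small dependency graph over the six rules, from which an admissible application order is read off. The edges below are an assumption: rdfs7 before rdfs2 and rdfs3 is stated in the text, rdfs2 and rdfs3 feed rdfs9 because they output type triples, and the schema rules rdfs5 and rdfs11 are placed ahead of the instance rules that consume subproperty and subclass triples. The function name is hypothetical and the resulting order is one valid order, not necessarily the exact one used by Cichlid.

```python
# Each rule maps to the rules whose conclusions it may consume (assumed edges).
DEPENDS_ON = {
    "rdfs5": set(),
    "rdfs11": set(),
    "rdfs7": {"rdfs5"},
    "rdfs2": {"rdfs7"},
    "rdfs3": {"rdfs7"},
    "rdfs9": {"rdfs2", "rdfs3", "rdfs11"},
}


def rule_order(depends_on):
    """Return an order in which every rule comes after its dependencies.

    Depth-first placement; assumes the dependency graph is acyclic.
    """
    order, placed = [], set()

    def place(rule):
        if rule in placed:
            return
        for dependency in sorted(depends_on[rule]):
            place(dependency)
        placed.add(rule)
        order.append(rule)

    for rule in sorted(depends_on):
        place(rule)
    return order


order = rule_order(DEPENDS_ON)
```

Any order returned this way applies schema rules before the instance rules that depend on them, and applies rdfs9 last.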
By taking (1) and (2) into consideration, the authors of Cichlid established the order of rule application illustrated in Figure 3. To illustrate how rules are implemented in Spark, we use rdfs9 as a concrete example. It can be expressed as follows: if a resource s is of type x, i.e. $s\ \tau\ x$, and x is a sub-class of y, i.e. $x \prec_{sc} y$, then s is also an instance of y, i.e. $s\ \tau\ y$. Note that, as the outputs of rdfs2 and rdfs3 are instance triples with predicate $\tau$, these rules are executed in Cichlid before rdfs9 (see [8] for more details). In our approach we rely on the same ordering for streaming saturation.

Algorithm 1. rdfs9 as implemented in Cichlid [8] (listing not reproduced).
To implement rdfs9 in Spark, Cichlid uses the filter, map, and collect operators, as shown in Algorithm 1. The algorithm first retrieves, over all the partitions, the RDFS schema (line 4) and the classes with their corresponding sub-classes (lines 5-7), by means of the filter transformation and the collect action (the latter is needed to gather the filtered information on the master/driver machine). This information is then broadcast (line 8), i.e., locally cached on each machine in the cluster, as pairs (e.g., x → y), thereby avoiding the cost of shipping it every time it is needed. Then, for each broadcast pair of subclass and superclass, the instances of the subclass are retrieved (line 9), and new triples are derived stating that such instances are also instances of the broadcast superclass, by means of the map transformation (lines 10-11). Spark provides other operators, which are used for implementing other rules, such as distinct, partitionBy, persist, union, mapPartitions, mapPartitionsWithIndex, etc.
Notice that as the saturation process may derive triples that are already asserted or have been derived in previous steps of the saturation operation, Cichlid [8] eliminates the duplicated triples from the derived ones.
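The steps described for rdfs9 (filter the subclass pairs, collect them, make them available to every worker, map over the type triples, drop duplicates) can be mimicked in plain Python. This is a single-process sketch, not Spark code: a dictionary plays the role of the broadcast variable, a set comprehension plays the role of the map transformation, and the function name and IRI strings are assumptions.

```python
TYPE, SC = "rdf:type", "rdfs:subClassOf"


def apply_rdfs9(triples):
    """One pass of rdfs9: (s rdf:type x) and (x rdfs:subClassOf y) give (s rdf:type y)."""
    # "filter" + "collect": gather the subclass pairs in one place.
    superclasses = {}
    for (x, p, y) in triples:
        if p == SC:
            superclasses.setdefault(x, set()).add(y)

    # The dictionary stands in for the broadcast pairs x -> y; the
    # comprehension stands in for the map over the instances of each subclass.
    derived = {
        (s, TYPE, y)
        for (s, p, x) in triples
        if p == TYPE
        for y in superclasses.get(x, ())
    }

    # Drop triples that were already asserted.
    return derived - set(triples)


triples = [
    ("p1", TYPE, "posterCP"),
    ("posterCP", SC, "confP"),
    ("p2", TYPE, "confP"),
    ("confP", SC, "paper"),
    ("p2", TYPE, "paper"),      # already asserted, so not reported again
]
new_triples = apply_rdfs9(triples)
```

Because this is a single pass, it relies on the subclass pairs having been saturated beforehand; with the unsaturated pairs above, the typing of p1 as paper is not produced.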

Streaming RDF Saturation
Our goal is to support the saturation of RDF streams by leveraging Spark stream processing capabilities. Using Spark, an RDF stream is discretized into a series of timestamped micro-batches that arrive (and are, therefore, processed) at different time intervals. In our work, we assume that a micro-batch contains a set of instance RDF triples, but may also contain schema (i.e., RDFS) triples.
Consider, for example, an RDF stream composed of the following series of micro-batches $[mb_i, \ldots, mb_n]$, where $i > 0$. A first approach for saturating such a stream using a batch-oriented solution would proceed as follows: when a micro-batch $mb_i$ arrives, it is unioned with the previous instance dataset (including the triples obtained by previous saturation), and the resulting dataset is then re-saturated in its entirety.
In contrast, our approach allows for RDF saturation in a streaming fashion, by considerably limiting the amount of data re-processing upon the arrival of a new micro-batch. To this end we have devised the following optimization techniques:
1. Rule pruning for schema saturation. Given a new micro-batch $mb_i$, we filter all the schema triples contained in it. Note that in the general case it is unlikely that these new schema triples trigger all the saturation rules, i.e., the new micro-batch rarely includes all kinds of RDFS triples (subPropertyOf, domain, range, and subClassOf) at once. So, to saturate the schema at the level of the new micro-batch, we first filter the new schema triples, and then obtain the set of new schema triples NST = Saturation(newly received schema ∪ past schema) − past schema. The Saturation operation is local and only triggers the rules that need to be applied, in the right order. Table 2 illustrates the rules to be activated given some matching schema triples: 1 indicates that a matching schema triple is available, and 0 that it is not. For example, if a schema triple specifying the domain of a property exists, then this triggers rule 2. All possible cases are indicated in Table 2, and Saturation selects one line of this table, depending on the kinds of schema predicates met in the new schema triples. This avoids triggering useless rules. Once saturation for the $mb_i$ schema triples is done in this optimized fashion, the obtained triples (i.e., NST) are merged with the existing RDFS schema for a second pass of global schema saturation, taking into account triples deriving from both $mb_i$ and the pre-existing schema.
Table 2. The 1 and 0 indicate the availability of that particular kind of schema triple in $mb_i$. X → Y means that the output of rule X is used as an input of rule Y. Columns: subPropertyOf, domain, range, subClassOf, Saturation order.
2. Efficiently saturate existing instance triples by leveraging our incremental indexing scheme. Given the new schema triples that are provided by the micro-batch $mb_i$ or inferred in (1), we need to scan the existing instance triples to identify those that, if combined with the new schema triples, will trigger the RDFS rules in Table 1. This operation can be costly, as it involves examining all the instance triples that have been provided by, or inferred from, the micro-batches received before $mb_i$. To alleviate this problem, we have devised an incremental indexing technique that allows for the fast retrieval of the instance triples that are likely to trigger the RDFS rules given some schema triples. The technique indexes instance triples based on their predicate and object and, as we will show later, greatly reduces the data processing effort for saturation under the new schema. Once retrieved, such instance triples are used together with the new schema triples to generate new instance triples. Notice that no new schema triples can be inferred in this step, because the rules for inferring new schema triples require two schema triples as premises (see Table 1).
3. Saturate new instance triples. The instance triples inferred in (2) need to be examined, as they may be used to infer new instance triples. Specifically, each of those triples is examined to identify the RDFS rule(s) to be triggered. Once identified, such rules are activated to infer instance triples. The instance triples in $mb_i$, as well as those inferred in (2) and (3), are stored and indexed using the method that we detail next.
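The rule pruning of step (1) can be sketched as follows. Since the rows of Table 2 are not reproduced here, the sketch only selects the rules whose schema premise is directly matched by a predicate present in the micro-batch, and lists them in an assumed global order (schema rules first, then rdfs7, rdfs2, rdfs3, rdfs9). The function names and the predicate strings are hypothetical, and cascading activations between rules are not modelled.

```python
# Schema predicate -> rules having a premise with that predicate.
RULES_BY_PREDICATE = {
    "rdfs:subPropertyOf": ["rdfs5", "rdfs7"],
    "rdfs:domain": ["rdfs2"],
    "rdfs:range": ["rdfs3"],
    "rdfs:subClassOf": ["rdfs11", "rdfs9"],
}

# Assumed global application order: schema rules first, rdfs9 last.
GLOBAL_ORDER = ["rdfs5", "rdfs11", "rdfs7", "rdfs2", "rdfs3", "rdfs9"]


def split_micro_batch(batch):
    """Separate the schema triples of a micro-batch from its instance triples."""
    schema = [t for t in batch if t[1] in RULES_BY_PREDICATE]
    instances = [t for t in batch if t[1] not in RULES_BY_PREDICATE]
    return schema, instances


def rules_to_trigger(schema_triples):
    """Return, in application order, the rules matched by the given schema triples."""
    present = {p for (_, p, _) in schema_triples}
    active = {rule for p in present for rule in RULES_BY_PREDICATE[p]}
    return [rule for rule in GLOBAL_ORDER if rule in active]


batch = [
    ("hasAuthor", "rdfs:domain", "paper"),
    ("doi1", "hasAuthor", "Serge Abiteboul"),
]
schema_part, instance_part = split_micro_batch(batch)
triggered = rules_to_trigger(schema_part)
```

With only a domain triple in the micro-batch, the sketch activates rdfs2 alone, which matches the example given in step (1).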
We now turn our attention to the indexing scheme mentioned above. For a micro-batch $mb_i$ received at timestamp t we create an HDFS directory named t, in which we store the indexing information related to $mb_i$, as follows. The instance triples that are asserted in $mb_i$, as well as those that are inferred (see (2) and (3) above), are stored in two separate sub-directories of t, which we name o and p.
The instance triples in $mb_i$ that provide information about the type of a resource, i.e., those having rdf:type as predicate, are stored in the o directory. They are grouped based on their object: instance triples with the same object are stored in the same file within the o directory of the micro-batch $mb_i$. Additionally, our indexing scheme uses an associative hash table, stored in an RDD cached in main memory, that associates each encountered object with the list of HDFS addresses of the files in the o directories that include at least one triple with that object. Notice that triples with the rdf:type predicate are used in the premises of rdfs9. Given a schema triple of the form $y \prec_{sc} z$, our indexing approach allows for the fast retrieval of the files in the o directories of the micro-batches that contain triples having the resource y as object, and that can therefore be used to trigger rdfs9.
The remaining instance triples in $mb_i$, i.e., those that do not have rdf:type as a predicate, are grouped based on their predicate and stored in files under the p directory. Additionally, we create and maintain an associative hash table, stored in an RDD persisted in main memory, that associates each encountered property with the list of HDFS addresses of the files in the p directories that include at least one triple with that property. This kind of indexing optimizes the application of rules rdfs2, rdfs3 and rdfs7: by inspecting the hash table, we retrieve only the files containing triples with the properties needed by these three rules.
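The two hash tables can be pictured with the following in-memory sketch. It is a simplification under stated assumptions: files are represented only by path strings of the hypothetical form timestamp/o/object or timestamp/p/predicate, ordinary dictionaries replace the cached RDDs, and the class and method names are invented for illustration.

```python
from collections import defaultdict

TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"


class TripleIndex:
    """In-memory stand-in for the object and predicate hash tables."""

    def __init__(self):
        self.by_object = defaultdict(set)      # class    -> paths under <t>/o
        self.by_predicate = defaultdict(set)   # property -> paths under <t>/p

    def add_micro_batch(self, timestamp, triples):
        """Group triples into (hypothetical) files and record the file paths."""
        files = defaultdict(list)
        for (s, p, o) in triples:
            if p == TYPE:
                path = f"{timestamp}/o/{o}"
                self.by_object[o].add(path)
            else:
                path = f"{timestamp}/p/{p}"
                self.by_predicate[p].add(path)
            files[path].append((s, p, o))
        return dict(files)

    def files_for(self, schema_triple):
        """Paths of the files whose triples can be combined with a schema triple."""
        x, p, _ = schema_triple
        if p == SUBCLASS:
            # rdfs9 needs the triples typed with the subclass x.
            return sorted(self.by_object.get(x, ()))
        # rdfs2, rdfs3 and rdfs7 need the triples whose predicate is x.
        return sorted(self.by_predicate.get(x, ()))


index = TripleIndex()
index.add_micro_batch("t1", [("doi1", TYPE, "posterCP")])
index.add_micro_batch("t4", [("doi1", "hasContactA", "_:b1")])
index.add_micro_batch("t5", [("doi2", "hasTitle", "Some title")])
hits = index.files_for(("hasContactA", "rdfs:subPropertyOf", "hasAuthor"))
```

A lookup touches only the paths registered for the relevant class or property, so micro-batches that never mentioned it are not loaded at all.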
To illustrate, assume that a new micro-batch $mb_i$ arrives at a given time instant t, and that it contains the schema triple $t_{sc}: s_1 \prec_{sc} s_2$. Such a schema triple can contribute to the inference of new schema triples (by means of rdfs11) as well as new instance triples (by means of rdfs9). Since our indexing mechanism is intended for the inference of instance triples, let us focus on rdfs9. To identify the instance triples that can be used together with the schema triple $t_{sc}$, we need to examine existing instance triples. Our indexing mechanism allows us to considerably restrict the set of triples that need to be examined, as the hash table indexing the files under the o directories enables the fast retrieval of the files containing triples with $s_1$ as object, which can be combined with $t_{sc}$ to trigger rdfs9. The index on the files in the p directories is used in a similar manner to efficiently retrieve the files containing instance triples with a given property, so that the triples they contain can trigger rdfs2, rdfs3 or rdfs7 upon the arrival of a corresponding schema triple in the stream. To illustrate our approach, we use the following example.
Example 4.1. We assume that we have the initial schema S of Figure 1. We saturate it, obtaining S′ as indicated below.
$S' = S \cup \{\, hasContactA \rightarrow_r \text{rdfs:Literal},\ \text{\_:}b_0 \prec_{sc} paper \,\}$
This operation is fast and centralized, as the initial schema is always relatively small in size. Our approach then proceeds according to the following steps.
1. The saturated schema S′ is broadcast to each task, so that it can access S′ with no further network communication.
2. Then the available micro-batches are processed. For the sake of simplicity, we make here the (unnatural) assumption that each micro-batch consists of only one triple. The stream of micro-batches is in Table 3.
3. The first received micro-batch triggers rdfs9, which derives two new triples. The received triple plus the two derived ones are then stored according to our indexing strategy. As already said, triples having the rdf:type property are grouped by their objects, which determines their file assignment under the directory of $t_1$, the timestamp of the current micro-batch.
4. The processing goes on by deriving new instance triples for micro-batches 2 to 6, as indicated in Table 4, which also indicates how instance triples are stored/indexed.
Now assume that in micro-batch 7 we have the following RDF schema triples: $paper \prec_{sc} publication$ and $hasContactA \prec_{sp} hasAuthor$. We now have three steps: i) infer the new schema triples, filtering out those already present in the schema; ii) broadcast the new schema triples, minus those that already exist and have been broadcast (Figure 4), to enable tasks to access them locally; iii) re-process the previously met or inferred instance triples, taking the new schema into consideration. Consider for instance $hasContactA \prec_{sp} hasAuthor$ as a new schema triple. This schema triple triggers rdfs7. Our indexing tells us that only the file p/$t_4$/file$_1$ (Table 4, line 4) needs to be loaded to infer new triples, which, of course, will in turn be stored according to our indexing strategy.
Figure 4. Newly received and inferred schema triples (NST): $NST = \{\, paper \prec_{sc} publication,\ hasContactA \prec_{sp} hasAuthor,\ posterCP \prec_{sc} publication,\ confP \prec_{sc} publication,\ \text{\_:}b_0 \prec_{sc} publication,\ hasContactA \leftarrow_d paper \,\}$
As we will see in our experimental analysis, the pruning of loaded files ensured by our indexing will entail fast incremental saturation. Also, note that our approach tends to create a non-negligible number of files, but fortunately without compromising efficiency thanks to distribution.

Streaming Saturation Algorithm
The overall streaming saturation algorithm is shown in Algorithm 2, and commented hereafter. Given a micro-batch $mb_i$, we first perform schema saturation if $mb_i$ contains schema triples (lines 12-13). The related instance triples are retrieved based on $mb_{NST}$ (line 14). Given the newly inferred schema triples, instance triples are retrieved and examined to identify cases where new instance triples may be inferred (line 15). The obtained schema triples (i.e., $mb_{NST}$) are added to the initial schema RDD and broadcast (lines 17-18). The inferred triples, if any, are merged with the instance triples of $mb_i$ (i.e., $mb_{ins}$), and saturation is applied to them. In the next step, the received and inferred instance triples are combined and duplicates, if any, are removed (line 22). In the last step, the instance triples from the previous step are saved and indexed using our method (lines 24-25).
The efficiency of our solution depends on the technique we use for incrementally indexing the new instance triples that are asserted or inferred given a new micro-batch. As mentioned in the previous section, indexed instance triples are classified into two disjoint categories: object-based and predicate-based triples. Specifically, a triple is considered object-based if its predicate is rdf:type. Triples of this kind are used as a premise of rdfs9 (see Table 1). On the other hand, a triple is considered predicate-based if its predicate is different from rdf:type. Triples of this kind are used as premises of rules rdfs2, rdfs3 and rdfs7 (see Table 1).
Algorithm 3 (excerpt).
14:   … _2, t)).partitionBy(number of different predicate).
      …
      … iterator.map(t ⇒ (t._2, uts + "-" + index + "_")) }).mapPartitions(
22:     _.map(t ⇒ (t, 1))).reduceByKey(_ + _).mapPartitions(_.map(_._1))
23:   // pIndexingRDD is a hash table that keeps the predicates of instance triples as keys and their physical paths as values.
24:   pIndexingRDD ∪= pPartition.mapPartitionsWithIndex((index, iterator) ⇒ {
25:     iterator.map(t ⇒ (t._2, uts + "-" + index + "_")) }).mapPartitions(
26:     _.map(t ⇒ (t, 1))).reduceByKey(_ + _).mapPartitions(_.map(_._1))
27:   return oIndexingRDD & pIndexingRDD
28: End
Labeling a new instance triple as object-based or predicate-based is not sufficient. To speed up the retrieval of the triples that are relevant for activating a given RDFS rule, object- and predicate-based triples are grouped in files based on their object and predicate, respectively. This allows triples having a given predicate or object to be located in only one file inside the directory associated with a micro-batch. More specifically, Algorithm 3 details how the indexing operation is performed. It takes as input the new instance triples that are asserted or inferred given the last micro-batch mb′. It filters the instance triples to create two RDDs. The first RDD is used for storing object-based triples (lines 9-11). Since the predicate of object-based triples is rdf:type, we only store the subject and object of object-based triples. The second RDD is used for predicate-based triples (lines 13-15). Notice that the triples of the two RDDs are grouped based on their object and predicate, respectively, by means of RDD partitioning. The Spark method partitionBy() takes as an argument the number of partitions to be created. In the case of the RDD used for storing object-based triples, we use the number of different objects that appear in the triples as an argument. In the case of the RDD used for storing predicate-based triples, we use the number of different predicates that appear in the triples.
It is worth mentioning here that we could have used the method sortBy() provided by Spark for RDDs instead of partitionBy(). However, sortBy() is computationally more expensive as it requires a local sort.
Besides grouping the RDDs containing the triples, the algorithm creates two auxiliary lightweight hash structures to keep track of the partitions that store triples with a given object (lines 20-22) and predicate (lines 24-26), respectively. These memory-based structures act as indexes: they are used during saturation to quickly identify the partitions that contain a given object or predicate. Note that all the steps of the algorithm, with the exception of the first one (line 7), are processed in parallel.
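The per-micro-batch processing described in this section can be summarised by the following single-process Python sketch. It is not the Spark implementation: sets replace RDDs, the stored instances are scanned linearly instead of being fetched through the index, and the class name, function names and short predicate strings are assumptions made for illustration. Its purpose is only to show that re-examining stored triples against the new schema triples alone, and then saturating the fresh triples under the full schema, reaches the same result as bulk saturation on a small example.

```python
SC, SP, DOM, RNG, TYPE = "sc", "sp", "domain", "range", "type"
SCHEMA_PREDICATES = {SC, SP, DOM, RNG}


def close_schema(schema):
    """Closure of the schema under rdfs5 and rdfs11 (sp and sc transitivity)."""
    closed = set(schema)
    while True:
        new = {(a, p, d)
               for (a, p, b) in closed for (c, q, d) in closed
               if p == q and p in (SC, SP) and b == c} - closed
        if not new:
            return closed
        closed |= new


def infer(instances, schema):
    """New instance triples entailed by rdfs2, 3, 7 and 9 under the given schema."""
    known, frontier = set(instances), set(instances)
    while frontier:
        derived = set()
        for (s, p, o) in frontier:
            for (x, q, y) in schema:
                if q == DOM and x == p:
                    derived.add((s, TYPE, y))
                elif q == RNG and x == p:
                    derived.add((o, TYPE, y))
                elif q == SP and x == p:
                    derived.add((s, y, o))
                elif q == SC and p == TYPE and x == o:
                    derived.add((s, TYPE, y))
        frontier = derived - known
        known |= frontier
    return known - set(instances)


class StreamingSaturator:
    def __init__(self):
        self.schema = set()      # saturated schema seen so far
        self.instances = set()   # saturated instance triples seen so far

    def process(self, micro_batch):
        new_schema = {t for t in micro_batch if t[1] in SCHEMA_PREDICATES}
        new_instances = set(micro_batch) - new_schema
        # (1) NST = Saturation(new schema ∪ past schema) − past schema
        nst = close_schema(self.schema | new_schema) - self.schema
        # (2) re-examine stored instances against the new schema triples only
        from_old = infer(self.instances, nst) if nst else set()
        self.schema |= nst
        # (3) saturate fresh and newly inferred triples under the full schema
        fresh = (new_instances | from_old) - self.instances
        fresh |= infer(fresh, self.schema)
        added = fresh - self.instances
        self.instances |= added
        return added


stream = [
    [("posterCP", SC, "confP"), ("hasAuthor", DOM, "paper"), ("doi1", TYPE, "posterCP")],
    [("doi1", "hasContactA", "b1")],
    [("hasContactA", SP, "hasAuthor"), ("confP", SC, "paper")],
]
saturator = StreamingSaturator()
added_per_batch = [saturator.process(mb) for mb in stream]
```

In the third micro-batch only schema triples arrive, yet two instance triples are added: the contact author becomes an author through rdfs7, and doi1 becomes a paper through rdfs9.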
Soundness and completeness. We need the following lemma, which is at the basis of the soundness and completeness of our system as well as of WebPIE [20] and Cichlid [8], and reflects the rule ordering expressed in Figure 3. To illustrate the lemma, assume we have $D = \{s\ \tau\ c_1\}$ while the schema S includes four triples of the form $c_i \prec_{sc} c_{i+1}$, for $i = 1, \ldots, 4$. Over S we can have the tree $T_1$:
$$T_1 = \{c_1 \prec_{sc} c_2 \mid c_2 \prec_{sc} c_3\} \xrightarrow{\mathrm{rdfs}11} c_1 \prec_{sc} c_3$$
and, over D and S, the tree $T_2$:
$$T_2 = \{s\ \tau\ c_1 \mid T_1\} \xrightarrow{\mathrm{rdfs}9} s\ \tau\ c_3$$
Imagine now we have $T_3$ defined as
$$T_3 = \{c_3 \prec_{sc} c_4 \mid c_4 \prec_{sc} c_5\} \xrightarrow{\mathrm{rdfs}11} c_3 \prec_{sc} c_5$$
We can go on by composing our derivation trees, obtaining $T_4$:
$$T_4 = \{T_2 \mid T_3\} \xrightarrow{\mathrm{rdfs}9} s\ \tau\ c_5$$
Note that the above tree $T_4$ includes two applications of rdfs9. At the same time we can have the tree $T_5$:
$$T_5 = \{T_1 \mid T_3\} \xrightarrow{\mathrm{rdfs}11} c_1 \prec_{sc} c_5$$
enabling us to have the tree $T_4'$, which is equivalent to $T_4$ but has only one application of rule 9:
$$T_4' = \{s\ \tau\ c_1 \mid T_5\} \xrightarrow{\mathrm{rdfs}9} s\ \tau\ c_5$$
As shown by this example, and as proved by the following lemma, repeated applications of the instance rules {2, 3, 7, 9} can be collapsed into only one, provided that this rule is then applied to an instance triple and to a schema triple in $S^*$, obtained by repeated applications of schema rules 5 and 11. This also proves that it is sound to first saturate the schema S and then apply the instance rules {2, 3, 7, 9} (each one at most once) over schema triples in $S^*$.
Lemma 5.1. Given an RDF dataset D of instance triples and a set S of RDFS triples, for any derivation tree T over D and S deriving $t \in D^*_S$, there exists an equivalent T′ deriving t such that each of the instance rules {2, 3, 7, 9} is used at most once, with rule 7 applied before either rule 2 or 3, which in turn is applied before rule 9 in T′. Moreover, each of these four rules is applied to an $S^*$ triple.
Proof. The proof can be found in the technical report [6].
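The collapsing argument can be checked on the subclass chain used to illustrate the lemma. The sketch below restricts itself to rdfs9 and rdfs11 and represents a type triple as a (resource, class) pair and a subclass triple as a (subclass, superclass) pair; these encodings and the function names are assumptions made for the illustration.

```python
def transitive_closure(pairs):
    """Closure of the subclass pairs under rdfs11."""
    closed = set(pairs)
    while True:
        new = {(a, d) for (a, b) in closed for (c, d) in closed if b == c} - closed
        if not new:
            return closed
        closed |= new


def rdfs9_once(typed, subclass_pairs):
    """A single application of rdfs9 to each typed resource."""
    return {(s, y) for (s, x) in typed for (a, y) in subclass_pairs if a == x}


def rdfs9_repeated(typed, subclass_pairs):
    """Chained applications of rdfs9 over the unsaturated schema, to fixpoint."""
    known = set(typed)
    while True:
        new = rdfs9_once(known, subclass_pairs) - known
        if not new:
            return known - set(typed)
        known |= new


# c1 is a subclass of c2, ..., c4 is a subclass of c5; s is of type c1.
schema = {(f"c{i}", f"c{i + 1}") for i in range(1, 5)}
data = {("s", "c1")}

collapsed = rdfs9_once(data, transitive_closure(schema))   # schema first, rdfs9 once
chained = rdfs9_repeated(data, schema)                      # rdfs9 applied repeatedly
```

Both strategies type s with c2 through c5; the first does so with a single application of rdfs9 per derived triple, which is the shape of derivation the lemma guarantees.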
Given the above lemma, we can now present the theorem stating the soundness and completeness of our approach.
Theorem 5.2. Given a set of instance triples D and a set of schema triples S, assume that the two sets are partitioned into n micro-batches $mb_i = D_i \cup S_i$ with $i = 1, \ldots, n$. Then there exists a derivation tree $\{T_1 \mid T_2\} \xrightarrow{\mathrm{rdfs}X} t$ over D and S, with $t \in D^*_S$, if and only if there exists $j \in \{1, \ldots, n\}$ such that t is derived by our system when $mb_j$ is processed, after having processed the micro-batches $mb_h$ with $h = 1, \ldots, j-1$.
Proof. The proof can be found in the technical report [6].

Evaluation
The saturation method we have just presented should, at least in principle, outperform state-of-the-art techniques, notably Cichlid, when dealing with streams of RDF data. This is particularly the case when the information about the RDF schema is also obtained in a stream-based fashion.
Empirical evaluation is, however, still needed to be able to answer the following question: Does our method actually outperform in practice the Cichlid solution for saturating streams of RDF? And if so, to what extent? To answer this question, we conducted an experimental analysis.

Datasets
We used for our experiments three RDF datasets that are widely used in the semantic web community: DBpedia [2], LUBM [9], and dblp (the computer science bibliography, https://dblp.uni-trier.de/faq/What+is+dblp.html). These datasets are not stream-based datasets, and therefore we had to partition them into micro-batches to simulate a setting where the data is received in a streamed manner. We make in our experiments the assumption that a substantial part of the data is received initially and that micro-batches then arrive in a streaming fashion. We consider this to be a realistic assumption in those scenarios where a substantial part of the data is known initially, and new triples arrive as time goes by. In what follows, and for the sake of space, we report on the experiment we ran against DBpedia. Readers interested in examining the results obtained using LUBM and dblp are invited to check the extended version of the paper [6].
Using DBpedia, we created three stream-based datasets: DBpedia-100, DBpedia-200, and DBpedia-300. They are composed of initial chunks that contain 100, 200, and 300 million instance triples respectively, and a series of 15 micro-batches, each composed of 160K instance triples plus between 64 and 2500 schema triples. For the initial chunk we reserve 25% of the schema triples, while the remaining ones are spread over the micro-batches as indicated above.
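The construction of such a simulated stream can be sketched as follows. This is only an illustration: the function name `make_stream` and its parameters are hypothetical, and the round-robin assignment of the remaining schema triples is an assumption, since in our datasets the number of schema triples per micro-batch varies between 64 and 2500.

```python
def make_stream(instance, schema, initial_size, batch_size, n_batches,
                initial_schema_share=0.25):
    """Split a static dataset into an initial chunk and a list of micro-batches."""
    # A fixed share of the schema triples goes into the initial chunk.
    cut = int(len(schema) * initial_schema_share)
    initial = (instance[:initial_size], schema[:cut])
    rest = schema[cut:]

    batches = []
    for i in range(n_batches):
        start = initial_size + i * batch_size
        batch_instance = instance[start:start + batch_size]
        # Assumption: remaining schema triples are assigned round-robin.
        batch_schema = rest[i::n_batches]
        batches.append((batch_instance, batch_schema))
    return initial, batches


# Tiny synthetic input standing in for the real instance and schema triples.
instance = [("s%d" % i, "p", "o") for i in range(100)]
schema = [("c%d" % i, "subClassOf", "c%d" % (i + 1)) for i in range(40)]
initial, batches = make_stream(instance, schema, initial_size=70,
                               batch_size=2, n_batches=15)
```

With the sizes used in the experiments, `initial_size` would be 100, 200, or 300 million, `batch_size` 160K, and `n_batches` 15.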

Experiment Setup
For each of the above datasets, we ran our saturation algorithm initially for the first chunk, and then incrementally for each remaining micro-batch. For comparison purposes, for each of the above datasets, we ran the Cichlid algorithm on the initial chunk, and then on each of the micro-batches. Given that Cichlid is not incremental, for each micro-batch, we had to consider the previous micro-batches and the initial chunk as well as the current micro-batch.
We performed our experiments on a cluster with 4 nodes (results for 8 nodes are reported in the extended version [6]), connected with 1 Gb/s Ethernet. One node was reserved to act as the master node and the remaining nodes as worker nodes. Each node has a Xeon Octet 2.4 GHz processor, 48 GB memory, and 33 TB Hadoop file system, and runs Linux Debian 9.3, Spark 2.1.0, Hadoop 2.7.0, and Java 1.8.
For each dataset we ran our experiment 5 times, and reported the average running time. Figure 5 shows the results obtained when saturating DBpedia-300, whose initial chunk contains 300 million triples. The x-axis represents the initial chunk and the micro-batches that compose the dataset. For the initial chunk, the y-axis reports the time required for its saturation. For each of the succeeding micro-batches, the y-axis reports the time required for saturating the dataset composed of the current micro-batch, the previous micro-batches, and the initial chunk put together.

Results
Figure 5 shows that the time required by Cichlid for saturating the stream increases substantially as the number of micro-batches increases, and is substantially higher than the time required by our algorithm. Specifically, the saturation takes more than 1000 minutes given the last micro-batch. That is roughly 22 times the amount of time required to saturate the first micro-batch, namely 45 minutes. On the other hand, the time taken by our incremental algorithm remains comparatively stable across micro-batches. Specifically, it takes 41 minutes given the first micro-batch, and 78 minutes given the last micro-batch.
We obtained similar trends using other datasets (dblp and LUBM) and other dataset sizes (100M and 200M). We do not report the results obtained in those cases due to space limitations. Interested readers are referred to the technical report [6].
The good performance of our algorithm is due to its incremental nature, but also to its underlying indexing mechanism. To demonstrate this, Figure 6 illustrates for DBpedia, and for each micro-batch, the number of triples that are fetched using the index as well as the total number of triples that the saturation algorithm would have to examine in the absence of the indexing structure. It shows that the number of triples fetched by the index is small compared to the total number of triples that compose the dataset.
Micro-batch size. So far, we have considered that the size of the micro-batch is specified a priori. Ultimately, the size of the micro-batch depends, at least partly, on the time interval and on the available resources (cluster configuration). To investigate this point, we considered a DBpedia instance of 25.4GB and ran 7 different incremental saturations. In saturation $i$, for $i = 1 \ldots 7$, the size of the micro-batch is $i \times 100$MB, resulting in $n_i$ micro-batches, over which the whole set of schema triples has been evenly distributed. We used for this experiment a cluster with 4 nodes, 11 executors, 4 cores per executor, and 5GB memory per executor. Figure 7 illustrates the average time required for performing the saturation given a micro-batch (blue line), and the average time required for the index management (red line). Regarding the saturation, the figure shows that micro-batches with different sizes require different processing times. For example, the time required for processing a 100MB micro-batch is smaller than the time required for processing larger micro-batches. The increase is not steady. In particular, we observe that micro-batches of 400MB and 500MB require the same processing time. This means the cluster could process a bigger chunk of data within the given time interval. We can also conclude that the cluster was idle for some time when processing 400MB micro-batches.
Regarding the index management (red line), the figure shows that its cost is comparatively small with respect to the saturation time: in the worst case it is less than half a minute. Besides the time interval, the configuration of the cluster impacts stream saturation. As shown in Figure 7, 500MB micro-batches require the same time as 400MB micro-batches for maintaining the index.
Concerning global execution time (for all micro-batches), experiments showed that when the number of micro-batches decreases, this time can decrease in some cases (this happens in particular for $i \in \{1, 2, 3\}$; see [6], Table 5, for details).
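The number of micro-batches $n_i$ in each of the seven runs follows directly from the dataset size and the micro-batch size. The sketch below computes it under two assumptions that are not stated in the experiment description: 1GB is taken to be 1000MB, and the last micro-batch of a run may be smaller than the others. The name `micro_batch_count` is hypothetical.

```python
import math

# 25.4GB expressed in MB, assuming 1GB = 1000MB.
DATASET_MB = 25_400


def micro_batch_count(i, dataset_mb=DATASET_MB):
    """Number of micro-batches n_i when each micro-batch holds i * 100 MB."""
    return math.ceil(dataset_mb / (i * 100))


# One entry per saturation run i = 1..7.
counts = {i: micro_batch_count(i) for i in range(1, 8)}
```

Under these assumptions the runs range from 254 micro-batches of 100MB down to 37 micro-batches of 700MB.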
To summarize, the results we presented here show that it is possible to saturate streams of RDF data in an incremental manner by using big data platforms, and that our approach outperforms the state of the art.

Related Work
RDF Saturation Using Big Data Platforms. To the best of our knowledge, the first proposal to use big data platforms, and MapReduce in particular, to scale the saturation operation is [13], but the authors did not present any experimental result. Other works then addressed the problem of large-scale RDF saturation by exploiting big data systems such as Hadoop and Spark (see e.g., [8,20,21]). For example, Urbani et al. [20,21] proposed a MapReduce-based distributed reasoning system called WebPIE. In doing so, they identified the order in which RDFS rules can be applied to efficiently saturate RDF data. Moreover, they specified for each RDFS rule how it can be implemented using map and/or reduce functions, and executed over the Hadoop system. Building on the work by Urbani et al., the authors of Cichlid [8] implemented RDF saturation over Spark using, in addition to map and reduce, other transformations that are provided by Spark, such as filter and union. Cichlid has shown that the use of Spark can speed up saturation with respect to the case where Hadoop is used. Our solution builds on and adapts the solutions proposed by WebPIE and Cichlid to cater for the saturation of streams of massive RDF data.
Incremental Saturation. The problem of incremental saturation of RDF data has been investigated by a number of proposals (see e.g., [3,5,7,20,22]). Volz et al. investigated the problem of maintenance of entailments given changes at the level of the RDF instances as well as at the level of the RDF schema [22]. In doing so, they adapted a previous state of the art algorithm for incremental view maintenance proposed in the context of deductive databases [18]. Barbieri et al. [3] build on the solution proposed by Volz et al. by considering the case where the triples are associated with an expiration date in the context of streams (e.g., for data that is location-based). They showed that the deletion, in this case, can be done more efficiently by tagging the inferred RDF triples with an expiration date that is derived from the expiration dates of the triples used in the derivation. While Volz et al. [22] and Barbieri et al. [3] seek to reduce the effort required for RDF saturation, they do not leverage any indexing structure to efficiently perform the incremental saturation. As reported by Volz et al. in the results of their evaluation study, even though the maintenance was incremental, the inference engine ran out of memory in certain cases. Barbieri et al. [3], for their part, considered in their evaluation a single transitive rule (Section 5 in [3]), and did not report on the size of the dataset or of the micro-batches used.
Chevalier et al. proposed Slider, a system for RDF saturation using a distributed architecture [5]. Although the objective of Slider is similar to ours, it differs in the following aspects. First, in Slider, each rule is implemented in a separate module. We adopt a different approach, where rules are broken into finer operations (map, reduce, union, etc.). This creates opportunities for sharing the results of processing at a finer level. For example, the result of a map can be used by multiple rules, thereby reducing the overall processing required. Second, Slider utilizes vertical partitioning [1] for indexing RDF triples. This indexing structure creates a table for each property in the RDF dataset. While it proved its efficiency in the context of RDF querying, it is heavy when it comes to RDF saturation. Indeed, in the context of RDF saturation we know the inference rules that can be triggered, and can therefore tune the indexing structure needed for this purpose, which we did in our solution.
Goasdoué et al. proposed an incremental solution for saturating RDF data [7]. The incrementality comes from the fact that only rules that have a premise triple that is newly asserted or derived are triggered. We adopt a similar approach to Goasdoué et al. However, we utilize an indexing structure to fetch existing triples that have been asserted or derived when processing previous micro-batches. Moreover, Goasdoué et al. apply the rules in an arbitrary order, whereas in our work, we order the rules so as to minimize the number of iterations required for saturating the RDF data.
The authors of WebPIE [20] touched on the problem of incremental saturation. In doing so, they time-stamped the RDF tuples to distinguish new and old tuples. An inference rule R is then activated only if the timestamp associated with one of its premises is new, i.e., greater than the last time the saturation was performed. We proceed similarly in our work. However, unlike our work, WebPIE does not leverage any indexing structures when querying the existing triples to identify those that may be used to activate a given rule R.
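The timestamp-based activation described above can be sketched for a single rule. This is a hedged illustration and not WebPIE's code: it assumes triples are stored in dictionaries mapping each triple to an integer timestamp, handles rdfs9 only, and the function name `rdfs9_if_new` is hypothetical.

```python
def rdfs9_if_new(typed, subclass, last_run):
    """Fire rdfs9 only when at least one premise is newer than last_run.

    typed:    {(s, c): timestamp} for triples "s type c"
    subclass: {(sub, sup): timestamp} for triples "sub subClassOf sup"
    """
    derived = set()
    # Without an index, every pair of existing triples is still examined.
    for (s, c), t_instance in typed.items():
        for (sub, sup), t_schema in subclass.items():
            if sub == c and max(t_instance, t_schema) > last_run:
                derived.add((s, sup))
    return derived


typed = {("s", "c1"): 1, ("t", "c1"): 5}
subclass = {("c1", "c2"): 2}

# Only ("t", "c1") arrived after the last saturation at time 3.
new_triples = rdfs9_if_new(typed, subclass, last_run=3)
```

Note that the nested loops still scan all existing triples to find candidate premises, which is the cost that an indexing structure is meant to avoid.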
Indexing Structures for RDF Data. The indexing mechanism we proposed here is comparable to those proposed by Weiss et al. [25], by Schätzle et al. [17], and by Kaoudi et al. [10] for efficiently evaluating SPARQL queries. For example, Weiss et al. developed Hexastore, a centralized system that maintains six indexes for all triple permutations, namely spo, sop, pso, pos, osp, and ops. For example, using the spo index, a subject $s_i$ is associated with a sorted list of properties $\{p^i_1, \ldots, p^i_n\}$. Moreover, each property is associated with a sorted list representing the objects. While this approach allows for efficiently evaluating SPARQL queries, it is expensive in terms of memory usage and index maintenance. According to the authors, Hexastore may require 5 times the storage space required for storing an RDF dataset due to the indexes. The solution developed by Schätzle et al. [17], on the other hand, is meant for distributed evaluation of SPARQL queries using Hadoop. To do so, they use an indexing scheme named ExtVP, which precomputes semi-join reductions between all properties. As shown by the authors, the computation of such indexes is heavy, e.g., it requires 290 seconds to index 100 million triples. To alleviate this, we proposed here an index that is aimed at speeding up RDF saturation, as opposed to arbitrary SPARQL queries, and that is amenable to incremental maintenance.
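To make the shape of an spo index concrete, the following sketch builds one of the six permutations in memory. It is a simplified illustration, not Hexastore's actual data structure: nested Python dictionaries stand in for the index, and the names `build_spo`, `properties_of`, and `objects_of` are hypothetical.

```python
from bisect import insort
from collections import defaultdict


def build_spo(triples):
    """Map each subject to its properties, and each property to a sorted object list."""
    index = defaultdict(lambda: defaultdict(list))
    for s, p, o in triples:
        objects = index[s][p]
        if o not in objects:
            insort(objects, o)  # keep the object list sorted on insertion
    return index


def properties_of(index, s):
    """Sorted list of properties attached to subject s."""
    return sorted(index.get(s, {}))


def objects_of(index, s, p):
    """Sorted list of objects for the pair (s, p)."""
    return index.get(s, {}).get(p, [])


triples = [("s1", "p2", "b"), ("s1", "p1", "a"), ("s1", "p2", "a")]
spo = build_spo(triples)
```

Maintaining all six such permutations is what makes the approach costly in memory, whereas an index tailored to the premises of the RDFS rules only needs the access paths those rules use.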

Conclusion and Future Work
In this work, we have shown how massive RDF data can be saturated in a stream-based fashion, and that in this context our solution outperforms state of the art solutions, namely Cichlid [8]. In our ongoing work, we are investigating the extension of the incremental saturation of RDF data to OWL Horst, a dialect of the Web Ontology Language (OWL) already dealt with in [8,20].