Scalable Saturation of Streaming RDF Triples

Abstract. In the Big Data era, RDF data are produced in high volumes. While there exist proposals for reasoning over large RDF graphs using big data platforms, there is a dearth of solutions that do so in environments where RDF data are dynamic, and where new instance and schema triples can arrive at any time. In this work, we present the first solution for reasoning over large streams of RDF data using big data platforms. In doing so, we focus on the saturation operation, which seeks to infer implicit RDF triples given RDF Schema or OWL constraints. Indeed, unlike existing solutions which saturate RDF data in bulk, our solution carefully identifies the fragment of the existing (and already saturated) RDF dataset that needs to be considered given the fresh RDF statements delivered by the stream. Thereby, it performs the saturation in an incremental manner. Experimental analysis shows that our solution outperforms existing bulk-based saturation solutions.


Introduction
To take full advantage of semantic data and turn them into actionable knowledge, the semantic web community has devised techniques for processing and reasoning over RDF data (e.g., [4,23,27]). However, in the Big Data era, RDF data, just like many other kinds of data, are produced in high volumes. This is partly due to sensor data produced in the context of health monitoring and financial market applications, feeds of user content provided by social network platforms, as well as long-running scientific experiments that adopt a stream-flow programming model [16]. This trend has generated the need for new solutions for processing and reasoning over RDF datasets, since existing state-of-the-art techniques cannot cope with large volumes of RDF data.
A typical and fundamental operation for reasoning about RDF data is data saturation. This operation involves a set D of RDF data triples and a set S of semantic properties, expressed using RDF Schema [5] and/or OWL, and aims at inferring the implicit triples that can be derived from D by using the properties in S. Data saturation is crucial to ensure that RDF processing and querying actually work on the complete informative content of an RDF database, without ignoring implicit information. To deal with the problem of saturating massive RDF datasets, a few approaches exploiting big data paradigms (namely Map-Reduce [15]) and platforms, notably Hadoop and Spark (see e.g., [11,24]), have already been proposed. In [24] Urbani et al. presented WebPIE, a system for RDF data saturation relying on the Map-Reduce paradigm over Hadoop. In [11] Gu et al. presented the Cichlid system and showed how to speed up saturation by using Spark and its underlying Resilient Distributed Datasets (RDDs) abstraction. The authors of [19,20] proposed a parallel reasoning method based on P2P self-organizing networks, while [28] proposes a parallel approach for RDF reasoning based on MPI. These approaches, however, assume that RDF datasets are fully available prior to the saturation, and as such are not instrumented to saturate RDF data produced continuously in streams. Indeed, when RDF data are produced in streams, such systems must re-process the whole data collection in order to obtain the triples entailed by the newly received ones. This is because both the initial triples and those already obtained by past saturation can entail new triples in the presence of newly received instance/schema triples. A number of works have addressed the problem of incremental saturation [3,18,26,29], but these approaches, being mostly centralized, do not ensure a scalable, distributed, and robust RDF streaming saturation.
To overcome these limitations, in this work we present the first distributed technique for saturating streams of large RDF data, relying on the Spark Streaming API, hence ensuring scalability and robustness. We present our approach in two steps. In the first one, we deal with streaming RDFS saturation in the presence of RDF Schema statements. The choice of focusing first on RDF Schema is motivated by the fact that, despite its simplicity, RDF Schema is rich enough to make the efficient saturation of streaming large RDF data far from trivial. The main challenge here is to quickly process fresh data, which must be joined with previously received data, whose volume can soon become particularly high in the presence of massive streams. To this end, unlike existing state-of-the-art solutions [11,24] for large-scale RDF saturation, upon the arrival of new RDF statements (both schema and instance triples) our solution finely identifies the subset of the existing (and already saturated) RDF dataset that needs to be considered. This is obtained by relying on a specific indexing technique we devised for our approach. Our indexing algorithm partitions triples into property and object triples, and creates distinct subindexes for each micro-batch; hash maps allow the system to quickly retrieve all triples having a given property or a given object.
In the second part of the presentation, we deal with OWL-Horst rules. In this case we show how our saturation technique, initially developed for RDFS only, can be easily adapted to OWL-Horst: indeed, here we have to deal with weaker constraints on the rule application order, as well as with the need to compute a fixpoint.
Finally, we validate our claims of efficiency and scalability through an extensive experimental evaluation, where we analyze the behavior of our algorithm on RDFS-based datasets as well as on OWL-based datasets.
Paper Outline. The paper is structured as follows. Section 2 presents preliminaries about RDF saturation and Spark Streaming, while Sect. 3 presents an overview of our technique on RDFS by means of examples. In Sect. 4, we describe the extension of our technique to the OWL-Horst rule set, while Sect. 5 is dedicated to the performance evaluation of our approach. Sections 6 and 7, respectively, discuss related work and future perspectives.
The work presented in this paper is an extension of a previous conference paper [8]. The material in Sect. 4, which examines incremental saturation considering OWL-Horst rules, is new, as is the evaluation of the effectiveness of saturation in the presence of OWL-Horst rules reported in Sect. 5. We also added in Sect. 3.3 a proof showing the soundness of our solution given the ordering of RDFS rules that we consider.

RDF and Semantic Data Reasoning
An RDF dataset is a set of triples of the form s p o, where s is an IRI (Internationalized Resource Identifier) or a blank node representing the subject, p is an IRI representing the predicate, and o is an IRI, a blank node, or a literal, standing for the object. Blank nodes, denoted _:bi, are used to represent unknown resources (IRIs or literals).
RDF Schema (or RDFS for short) provides the vocabulary for specifying the following relationships between classes and properties, relying on a simplified notation borrowed from [10]:
- subclass relationship ≺sc: the triple c1 ≺sc c2 specifies that c1 is a subclass of c2;
- subproperty relationship ≺sp: the triple p1 ≺sp p2 specifies that p1 is a subproperty of p2;
- property domain ←d: the triple p ←d x specifies that the property p has domain x;
- property range →r: the triple p →r z specifies that the property p has range z.
For the sake of readability, in what follows we use simple strings instead of IRIs to denote predicates, subjects, and objects in triples. Also, we abbreviate the rdf:type predicate with the τ symbol.

Example 1. Figure 2 illustrates a set of RDF instance triples that we use as a running example, together with the equivalent graph representation. The graph describes the resource doi1, which belongs to an unknown class, whose title is "Complexity of Answering Queries Using Materialized Views", whose author is "Serge Abiteboul", and which has an unknown contact author. This paper is in the proceedings of an unknown resource whose name is "PODS 98". Lastly, the IRI edbt2013 is a conference, and hasName, the property associating names to resources, was created by "John Doe". Figure 1 lists the schema triples. For example, it specifies that the class posterCP is a subclass of ConfP, and that the property hasContactA is a subproperty of hasAuthor. It also specifies that the property hasAuthor has paper as its domain and a literal as its range. As in other works (e.g., [10,11,24]), we focus on the core rules of RDFS, the extension to the other rules being straightforward. In particular, we consider here rules 2, 3, 5, 7, 9, and 11 among the 13 RDFS rules illustrated in Table 1.
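In the notation introduced above, these six rules read as follows (a restatement of the standard RDFS entailment rules of Table 1):

rdfs2:  p ←d x,  s p o  ⊢  s τ x
rdfs3:  p →r x,  s p o  ⊢  o τ x
rdfs5:  p1 ≺sp p2,  p2 ≺sp p3  ⊢  p1 ≺sp p3
rdfs7:  p1 ≺sp p2,  s p1 o  ⊢  s p2 o
rdfs9:  x ≺sc y,  s τ x  ⊢  s τ y
rdfs11: c1 ≺sc c2,  c2 ≺sc c3  ⊢  c1 ≺sc c3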
The realm of the semantic web embraces the Open World Assumption: facts (triples) that are not explicitly stated may hold given a set of RDFS triples expressing constraints. These are usually called implicit triples and, in our work, we consider the problem of RDF saturation, i.e., given a set of RDFS rules, inferring all possible implicit triples by applying these rules to explicit triples or, recursively, to implicit ones. For example, rule rdfs2 in Table 1 states that, if a property p has a domain x, then, given a triple s p o, we can infer that s is of type x. Rule rdfs9, instead, specifies that, if s is of type x and x is a subclass of y, then we can infer that s is of type y. In the remainder of the paper, we will use the following notation to indicate the derivation/inference of triples. A derivation tree T is defined as follows:
T ::= t  |  {T1 | T2} −rdfsX→ t

where the rule number X ranges over {2, 3, 5, 7, 9, 11}. A derivation tree can be empty, hence consisting of a given triple t, or can be of the form {T1 | T2} −rdfsX→ t, meaning that the tree derives t by means of rule rdfsX, whose premises are (matched to) the two triples derived by T1 and T2, respectively. So, for instance, for the graph G and schema S previously introduced, we can have the following derivation tree T1:

T1 = {doi1 hasContactA _:b2 | hasContactA ≺sp hasAuthor} −rdfs7→ doi1 hasAuthor _:b2

Moreover, we can have the following derivation T2 relying on T1:

T2 = {T1 | hasAuthor ←d paper} −rdfs2→ doi1 τ paper

In the following, given a set of instance RDF triples D and a set of schema triples S, we say that T is over D and S if the derivation tree uses triples in D and S as leaves. Moreover, we define the saturation of D over S, denoted D*S, as D extended with all the possible instance triples obtained by means of derivation (below, derivation trees are assumed to be over D and S):

D*S = D ∪ { t | t is an instance triple and there exists a derivation tree over D and S deriving t }

Notice above that, say, T2 can be a derivation tree totally over S, recursively applying rule 5 (or rule 11), thus deriving a triple in S*, defined as:

S* = S ∪ { t | t is derived by a tree whose leaves are in S, using only rules rdfsX with X ∈ {5, 11} }

Note that, since X ∈ {5, 11}, such a derivation tree consists of subsequent applications of rule 5 (or rule 11).

Spark and Spark Streaming
Spark [30] is a widely used in-memory distributed cluster computing framework. It provides the means for specifying DAG-shaped data flows using operators like map, reduceByKey, join, filter, etc., over data collections represented by means of Resilient Distributed Datasets (RDDs). For our purposes, we use the streaming capabilities of Spark, whereby data come in micro-batches that need to be processed within a time interval (also referred to as a window). In Spark, the data to be processed are mapped into RDDs, where an RDD is an immutable collection of objects (e.g., <key, value> pairs); RDDs are partitioned and distributed over the Spark cluster.
Spark essentially works by applying operations to RDDs. These operations can be divided into transformations and actions. Transformations are lazy operations that return a new RDD, and are evaluated only when an action (e.g., count, collect) is invoked. Typical transformations are map(), which applies a given function to all the objects in an RDD, and filter(), which applies a predicate to the input data (e.g., rdd.map(x => x + x), rdd.filter(x => x != 3)).
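As a minimal illustration of this lazy evaluation model (a sketch, assuming an already created SparkContext named sc):

val rdd = sc.parallelize(Seq(1, 2, 3, 4))  // a small distributed collection
val doubled = rdd.map(x => x + x)          // transformation: nothing runs yet
val kept = doubled.filter(x => x != 4)     // transformation: still lazy
val n = kept.count()                       // action: triggers the actual distributed job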

Streaming RDF Saturation
Our goal is to support the saturation of RDF streams by leveraging Spark's stream processing capabilities. Using Spark, an RDF stream is discretized into a series of timestamped micro-batches that arrive (and are, therefore, processed) at different time intervals. In our work, we assume that a micro-batch contains a set of instance RDF triples, but it may also contain schema (i.e., RDFS) triples.
Consider, for example, an RDF stream composed of a series of micro-batches mb_1, . . . , mb_n. A first approach for saturating such a stream using a batch-oriented solution would proceed as follows: when a micro-batch mb_i arrives, it is unioned with the previous instance dataset (including the triples obtained by previous saturations), and the resulting dataset is then entirely re-saturated.
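A sketch of this bulk baseline is given below; saturate(), schema, initialChunk, stream, and the Triple type are hypothetical names used only for illustration:

// Bulk baseline (sketch): re-saturate the whole dataset on each micro-batch.
var dataset: RDD[Triple] = initialChunk
stream.foreachRDD { mb =>
  // union the new micro-batch with everything seen and inferred so far,
  // then recompute the full saturation from scratch
  dataset = saturate(dataset.union(mb), schema)
}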
On the contrary, our approach performs RDF saturation in a streaming fashion, substantially limiting the amount of data re-processed upon the arrival of a new micro-batch. To this aim, our saturation approach leverages a novel indexing scheme for RDF triples, as well as a few heuristics and optimization techniques.

Indexing Scheme
Our triple indexing structure allows the system to quickly retrieve a triple given its object or its property. This structure, which is stored on HDFS, comprises two root HDFS directories, called o and p.
Assume that a new micro-batch mb_i arrives at time t. At mb_i's arrival, the indexing algorithm creates a new subdirectory t inside o, as well as a new subdirectory t inside p. In o/t the algorithm stores the triples having rdf:type as a predicate, which therefore provide information about the type of a resource; inside o/t, triples are further partitioned into files according to their object, so that triples with the same object are stored in the same file. Notice that triples with the rdf:type predicate are used in the premises of rdfs9. Given a schema triple of the form y ≺sc z, our indexing approach allows for the fast retrieval of the files in the o directories of the micro-batches that have the resource y as an object, and can therefore be used to trigger rdfs9.
In p/t the algorithm stores the remaining instance triples of mb_i, i.e., those that do not have rdf:type as a predicate. As in the previous case, triples inside p/t are partitioned according to their predicate, so that triples with the same predicate are stored in the same file.
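Schematically, after two micro-batches received at times t1 and t2, the layout could look as follows (a sketch using class and property names from the running example):

o/t1/paper        <- triples s rdf:type paper received/inferred at t1
o/t1/conference   <- triples s rdf:type conference at t1
p/t1/hasAuthor    <- non-rdf:type triples at t1, grouped by predicate
o/t2/paper
p/t2/hasName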
Our indexing scheme also exploits two hash maps, stored in RDDs persisted in main memory, that map each object o_i to the corresponding HDFS files, and each property p_i to the HDFS files storing the matching triples; these hash maps contain no mapping for objects and properties without triples. By means of this indexing, we can optimize the application of rules rdfs2, rdfs3, and rdfs7 to infer new instance triples, as we can inspect the previously described hash maps to retrieve only the files containing triples with the properties needed by these three rules.
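The following sketch illustrates how such a map can be probed; the names (oIndex, the pair layout, the objectFile storage format) are ours and are meant only to convey the idea:

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// oIndex: object -> HDFS files containing (object, subject) pairs with predicate rdf:type
def typedWith(oIndex: Map[String, Seq[String]], obj: String, sc: SparkContext): RDD[(String, String)] =
  oIndex.getOrElse(obj, Seq.empty)                       // only the files indexed under obj
        .map(path => sc.objectFile[(String, String)](path))
        .reduceOption(_ union _)                         // union the few matching files
        .getOrElse(sc.emptyRDD[(String, String)])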
To illustrate, assume for example that a new micro-batch mb_i arrives at a given time instant t, and that it contains the schema triple tsc: s1 ≺sc s2. Such a schema triple can contribute to the inference of new schema triples (by means of rdfs11) as well as of new instance triples by means of rdfs9. Since the indexing mechanism we designed is meant for the inference of instance triples, let us focus on rdfs9. To identify the instance triples that can be utilized together with the schema triple tsc, we need to examine existing instance triples. Our indexing mechanism allows us to considerably restrict the set of triples that need to be examined, as the hash map indexing the files under the o directories enables the fast retrieval of the files containing triples with s1 as an object resource, which can be combined with the schema triple tsc to trigger rdfs9. The indexing of the files in the p directories is exploited in a similar manner to efficiently retrieve the files containing instance triples with a given property, so as to use the included triples to trigger rdfs2/3/7 upon the arrival of a corresponding schema triple in the stream. To illustrate our approach in more detail, consider the following example.
Example 2. We assume that we have the initial schema S of Fig. 1 and that we saturate it, obtaining S* as indicated below.
This operation is fast and centralized, as the initial schema is always relatively small in size. Our approach then proceeds according to the following steps.
1. The saturated schema S* is broadcast to each task, which can then access S* without further network communication.
2. The available micro-batches are processed. For the sake of simplicity, we make here the (unnatural) assumption that each micro-batch consists of only one triple. The stream of micro-batches is given in Table 2.
3. The first received micro-batch triggers rdfs9, so that we have the derivation of two new triples. The received triple plus the two derived ones are then stored according to our indexing strategy. As already said, triples with the rdf:type property are grouped by their objects, so as to obtain the file assignment shown in the figure, t1 being the timestamp of the current micro-batch.
4. The processing goes on by deriving new instance triples for micro-batches 2 to 6, as indicated in Table 3, which also shows how instance triples are stored/indexed. Now assume that micro-batch 7 contains the following RDF schema triples: paper ≺sc publication, hasContactA ≺sp hasAuthor. So we now have three steps: i) infer the new schema triples by considering the already present schema triples; ii) broadcast these schema triples minus the already existing/broadcast ones (Fig. 3), to enable tasks to access them locally; iii) re-process previously met/inferred instance triples by taking the new schema into consideration.
Consider for instance {hasContactA ≺sp hasAuthor} as a new schema triple. This schema triple triggers rdfs7. Our indexing then tells us that only the file p/t4/file1 (Table 3, line 4) needs to be loaded to infer new triples, which, of course, will in turn be stored according to our indexing strategy.
As we will see in our experimental analysis, the pruning of loaded files ensured by our indexing entails fast incremental saturation.

Algorithm 1. Incremental RDFS Indexing Algorithm
1: // mb_i denotes the instance and implicit triples of the received micro-batch
2: Input: saturated mb_i
3: // The index information of mb_i is kept as two RDDs in memory
4: Output: oIndexingRDD, pIndexingRDD
5: Begin
6: // Get a fixed timestamp under which the mb_i triples are saved
7: val fts = System.currentTimeMillis.toString
8: // Partition the mb_i triples by their object when their predicate is rdf:type
9: val oPartition = mb_i.filter(t => t._2.contains("rdf-syntax-ns#type"))
10:     .map(t => (t._3, t._1))
11:     .partitionBy(new HashPartitioner(numDistinctObjects)) // number of distinct objects in mb_i
12: // Partition the remaining triples by their predicate
13: val pPartition = mb_i.filter(t => !t._2.contains("rdf-syntax-ns#type"))
14:     .map(t => (t._2, (t._1, t._3)))
15:     .partitionBy(new HashPartitioner(numDistinctPredicates)) // number of distinct predicates in mb_i
16: // Save the two partitioned RDDs on HDFS under the timestamp fts
17: oPartition.saveAsObjectFile("o/" + fts)
18: pPartition.saveAsObjectFile("p/" + fts)
19: // Record, for each object, the file that stores its triples
20: val oIndexingRDD = oPartition.keys.distinct
21:     .map(o => (o, "o/" + fts))
22:     .persist
23: // Record, for each predicate, the file that stores its triples
24: val pIndexingRDD = pPartition.keys.distinct
25:     .map(p => (p, "p/" + fts))
26:     .persist
27: End
The indexing algorithm (Algorithm 1) is responsible for storing triples on HDFS at the intended paths, but also for collecting the objects/predicates of triples, together with the paths of the corresponding files, into the index variables. We focus here on the indexing algorithm, as the technique we elaborated for incrementally indexing the new instance triples that are asserted or inferred given a new micro-batch is central both to our contribution and to the efficiency of the solution presented in the previous section.
Algorithm 1 takes as input the new instance triples that are asserted or inferred given the last micro-batch mb_i. It filters the instance triples to create two RDDs. The first RDD stores object-based triples (lines 9-11); since the predicate of object-based triples is rdf:type, we only store the subject and object of such triples. The second RDD stores predicate-based triples (lines 13-15). Notice that the triples of the two RDDs are grouped based on their object and predicate, respectively, by utilizing RDD partitioning. The Spark method partitionBy() takes as an argument a partitioner specifying the number of partitions to be created. For the RDD storing object-based triples, we use the number of distinct objects appearing in the triples; for the RDD storing predicate-based triples, the number of distinct predicates. It is worth mentioning that we could have used the RDD method sortBy() provided by Spark instead of partitionBy(). However, sortBy() is computationally more expensive, as it requires a local sort.
Besides grouping the RDDs containing the triples, the algorithm creates two auxiliary lightweight hash structures to keep track of the partitions that store triples with a given object (lines 20-22) and with a given predicate (lines 24-26), respectively. These lightweight memory-based structures act as indexes: they are utilized during the saturation to quickly identify the partitions that contain a given object or predicate. Note that all the steps of the algorithm, with the exception of the very first one (line 7), are processed in a parallel manner.
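For instance, the two index RDDs can be exposed as in-memory maps, as sketched below (our own reconstruction; the variable names follow Algorithm 1):

// object/predicate -> all HDFS files holding matching triples
val oIndex: Map[String, Iterable[String]] =
  oIndexingRDD.groupByKey().collectAsMap().toMap
val pIndex: Map[String, Iterable[String]] =
  pIndexingRDD.groupByKey().collectAsMap().toMap
// e.g., given a new schema triple y ≺sc z, only the files in oIndex(y)
// need to be loaded to trigger rdfs9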

Heuristics and Optimization Techniques
Our indexing scheme alone is not sufficient to improve the performance and scalability of RDF streaming saturation. Therefore, we also adopt the rule application strategy of Cichlid, and devised new optimization techniques. We briefly recall the Cichlid strategy, and then focus on our novel techniques.
Rule Application Order. While the outcome of the saturation operation is independent of the order in which the rules are applied, the time and resources consumed by such an operation are not. Because of this, the authors of Cichlid (and WebPIE before them) identified a number of optimisations that influence the rule application order with a view to increasing the efficiency of the saturation. In what follows, we discuss the main ones.
1. RDF Schema is to be saturated first. The size of the RDF schema in an RDF graph is usually small, even when saturated; it is usually orders of magnitude smaller than the size of the remaining instance triples. This suggests that the schema of the RDF graph is to be saturated first. By saturating the schema of an RDF graph we mean applying the rules that produce new triples describing the vocabulary used in the RDF graph. Furthermore, because the size of the schema is small, schema saturation can be done in a centralized fashion. In this respect, the RDFS rules presented in Table 1 can be categorised into two disjoint categories: schema-level and instance-level rules. Schema-level rules (rdfs5 and rdfs11) designate the rules that produce triples describing the vocabulary (classes, properties, and their relationships). Instance-level triples, on the other hand, specify resource instances of the classes in the RDF vocabularies and their relationships. Each rule is made up of two premises and one conclusion, each of which is an RDF triple. While the premises of schema-level rules are schema triples, the premises of instance-level rules are a schema triple and an instance triple. Also, instance-level rules entail an RDF instance triple, while schema-level rules entail an RDF schema triple.
2. Dependencies between rules. When determining the rule execution order, the dependencies among rules must be taken into account too. In particular, a rule Ri precedes a rule Rj if the conclusion of Ri is used as a premise of rule Rj. For example, rdfs7 has a conclusion that is used as a premise for rules rdfs2 and rdfs3; therefore, rdfs7 should be applied before rdfs2 and rdfs3.
By taking (1) and (2) into consideration, the authors of Cichlid established the order of application of rules illustrated in Fig. 4.
Rule Pruning for Schema Saturation. Given a new micro-batch mb_i, we filter the schema triples it contains. Note that, in the general case, these new schema triples are unlikely to trigger all the saturation rules, i.e., it is rarely the case that a new micro-batch includes all kinds of RDFS schema triples at once (subPropertyOf, domain, range, and subClassOf). Therefore, to saturate the schema at the level of the new micro-batch, we first filter the new schema triples, obtaining the set NST of new schema triples. The Saturation operation is local and only triggers the rules that actually need to be applied, in the right order. Table 4, whose columns are subPropertyOf, domain, range, subClassOf, and the resulting saturation order, illustrates the rules to be activated given the matching schema triples: a 1 indicates that a matching schema triple is available in mb_i, and a 0 that it is not, while X → Y means that the output of rule X is used as an input of rule Y. For example, if a schema triple specifying the domain of a property exists, then this triggers rule 2. All possible cases are covered by Table 4, and Saturation selects one line of this table, depending on the kinds of schema predicates met in the new schema triples; this avoids triggering useless rules. Once the saturation of the mb_i schema triples is done in this optimized fashion, the obtained triples (i.e., NST) are merged with the existing RDFS schema for a second pass of global schema saturation, taking into account triples deriving from both mb_i and the pre-existing schema.
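A sketch of this pruning logic is given below (the rule lists per schema predicate follow the dependencies discussed above; the variable names are ours):

// kinds of schema triples present in the new micro-batch
val kinds: Set[String] =
  newSchemaTriples.map(t => t._2).distinct().collect().toSet
// trigger only the rules for which a matching schema triple exists
val rulesToRun: Seq[String] = Seq(
  "subPropertyOf" -> Seq("rdfs5", "rdfs7"),
  "domain"        -> Seq("rdfs2"),
  "range"         -> Seq("rdfs3"),
  "subClassOf"    -> Seq("rdfs11", "rdfs9")
).collect { case (kind, rules) if kinds(kind) => rules }.flatten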

Efficiently Saturating Existing Instance Triples by Leveraging Our Incremental Indexing Scheme. Given the new schema triples that are provided by the micro-batch mb_i or inferred in (1), we need to scan existing instance triples to identify those that, combined with the new schema triples, will trigger the RDFS rules in Table 1. This operation can be costly, as it involves examining all the instance triples provided and inferred for the micro-batches received before mb_i. To alleviate this problem, we exploit the incremental indexing scheme of the previous section; this technique allows for the fast retrieval of the instance triples that will likely trigger the RDFS rules given some schema triples. Once retrieved, such instance triples are used together with the new schema triples to generate new instance triples. Notice that here we cannot infer new schema triples: the rules for inferring new schema triples require two schema triples as premises (see Table 1).
Incremental Loading. As we previously observed, our indexing technique may lead to the creation of a huge number of files on HDFS, which in turn may increase the risk of a failure when Spark must open so many files at once. We address this reliability issue by loading the index files incrementally (i.e., 150 files at a time), until all files have been loaded, and then by unioning the tuples inside them.
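A sketch of this incremental loading (indexFiles is an assumed list of HDFS paths selected through the index; at least one file is assumed):

// open the files 150 at a time, then union the partial RDDs
val loaded: RDD[String] =
  indexFiles.grouped(150)
            .map(group => group.map(p => sc.textFile(p)).reduce(_ union _))
            .reduce(_ union _)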
Saturate New Instance Triples. The instance triples inferred in (2) need to be examined, as they may be used to infer new instance triples. Specifically, each of these triples is examined to identify the RDFS rule(s) to be triggered. Once identified, such rules are activated to infer new instance triples. The instance triples in mb_i, as well as those inferred in (2) and (3), are stored and indexed using the technique described in Sect. 3.1.

Streaming Saturation Algorithm
The overall streaming saturation algorithm is shown in Algorithm 2, and commented hereafter.
Given a micro-batch mb_i, we first perform schema saturation if mb_i contains schema triples (lines 12-13). The related instance triples are retrieved based on mb_NST (line 14). Given the newly inferred schema triples, instance triples are retrieved and examined to identify cases where new instance triples may be inferred (line 15). The obtained schema triples (i.e., mb_NST) are added and broadcast within the initial schema RDD (lines 17-18). The inferred triples, if any, are merged with the instance triples of mb_i (i.e., mb_ins), and saturation is applied to them (line 21). In the next step, the received and inferred instance triples are combined and duplicates, if any, are removed (line 22). In the last step, the instance triples from the previous step are saved and indexed using our method (lines 24-25).
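The overall per-micro-batch flow can be sketched as follows (the function names are ours, not those of Algorithm 2; the line numbers refer to the description above):

stream.foreachRDD { mb =>
  val (mbSchema, mbInst) = splitSchemaInstance(mb)
  val mbNST = saturateSchema(mbSchema, globalSchema)        // lines 12-13
  val related = fetchIndexedTriples(mbNST)                  // line 14
  val fromPast = applyInstanceRules(related, mbNST)         // line 15
  globalSchema = broadcastUnion(globalSchema, mbNST)        // lines 17-18
  val fresh = applyInstanceRules(mbInst union fromPast, globalSchema) // line 21
  val all = (mbInst union fromPast union fresh).distinct()  // line 22
  indexAndStore(all)                                        // lines 24-25
}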

Soundness and Completeness.
We deal now with the proof of soundness and completeness of our approach.
We need the following lemma, which is at the basis of the soundness and completeness of our system, as well as of WebPIE [24] and Cichlid [11], and reflects the rule ordering expressed in Fig. 4. To illustrate the lemma, assume we have D = {s τ c1}, while the schema includes four triples of the form ci ≺sc ci+1, for i = 1 . . . 4. Over D and S we can have the tree T1 corresponding to:

T1 = {s τ c1 | c1 ≺sc c2} −rdfs9→ s τ c2

Imagine now we have T3 defined as:

T3 = {c2 ≺sc c3 | c3 ≺sc c4} −rdfs11→ c2 ≺sc c4

We can go on by composing our derivation trees, obtaining T4:

T4 = {T1 | T3} −rdfs9→ s τ c4

Note that the above tree T4 includes two applications of rdfs9. At the same time, we can have the tree T5:

T5 = {c1 ≺sc c2 | T3} −rdfs11→ c1 ≺sc c4

enabling us to build the tree T4′, equivalent to T4 but with only one application of rule 9, and consisting of:

T4′ = {s τ c1 | T5} −rdfs9→ s τ c4

As shown by this example, and as proved by the following lemma, repeated applications of the instance rules {2, 3, 7, 9} can be collapsed into only one, provided that the rule is then applied to an instance triple and to a schema triple in S*, obtained by repeated applications of schema rules 5 and 11. This also shows that it is sound to first saturate the schema S and then apply instance rules {2, 3, 7, 9} (each one at most once) over schema triples in S*.

Lemma 1. Given an RDF dataset D of instance triples and a set S of RDFS triples, for any derivation tree T over D and S deriving t ∈ D*S, there exists an equivalent tree T′ deriving t, such that each of the instance rules {2, 3, 7, 9} is used at most once, with rule 7 applied before either rule 2 or 3, which in turn is applied before rule 9 in T′. Moreover, each of these four rules is applied to an S* triple.
Proof. To prove the above lemma, we examine the dependencies between the rules {2, 3, 5, 7, 9, 11}. A rule r depends on a rule r′ (where possibly r and r′ are the same rule) if the activation of r′ produces a triple that can be used as a premise for the activation of r. This examination of rule dependencies reveals that:
- Rule 5 depends on itself only;
- Rule 11 depends on itself only;
- Rule 7 depends on rule 5: indeed, rule 7 uses as a premise triples of the form p ≺sp q, which are produced by the activation of rule 5;
- Rules 2 and 3 depend on rule 7: both rules use as a premise triples of the form s p o, which are either given in advance or produced by rule 7;
- Rule 9 depends on rules 2 and 3, and on the given triples with τ as a predicate: both rules produce triples of the form s τ x, a premise for activating rule 9. It also depends on rule 11.

With the exception of rules 5 and 11, the dependency graph is acyclic, meaning that the saturation can be performed in a single pass. Furthermore, the dependency graph shows that, in order for the saturation to be made in a single pass, schema rules 5 and 11 need to be (transitively) applied first, to saturate the schema, followed by the instance rules. Rule 7 is the first instance rule to be executed, followed by instance rules 2 and 3 (which can be applied simultaneously or in any order), before applying rule 9 at the end. That said, we now need to prove that for an arbitrary derivation tree T there exists an equivalent derivation tree T′ as described in the lemma statement. This follows from the fact that if (*) T contains more than one application of a rule rdfsX with X ∈ {2, 3, 7, 9}, then it must be because of subsequent applications of rule 9 (resp. rule 7), each one applied to a schema triple eventually derived by rule 11 (resp. rule 5), exactly as depicted in the example just before the lemma. As shown by the example, such a chain of rule 9 (resp. rule 7) applications can be contracted so as to obtain a unique application of rule 9 (resp. rule 7) applied to a schema triple in S*, obtained by subsequent applications of rule 11 (resp. rule 5). So, when (*) holds, the just described rewriting for chains of rule 9 (resp. rule 7) can be applied to T in order to obtain T′.
Given the above lemma, we can now present the theorem stating the soundness and completeness of our approach.

Theorem 1. Let mb_1, . . . , mb_n be the processed micro-batches, where each mb_j consists of instance triples D_j and schema triples S_j, and let D = D_1 ∪ . . . ∪ D_n and S = S_1 ∪ . . . ∪ S_n. Then t ∈ D*S if and only if t is derived by our system when processing mb_1, . . . , mb_n.

Proof. The 'if' direction (soundness) is the easiest one. We prove this case by induction on j. In case a triple t is derived by our system when processing the micro-batch mb_1, then we can see that in Algorithm 2 this triple is obtained by a derivation tree calculated by Saturate(), including at its leaves instance triples in D_1 and schema triples in S_1*. As D_1 ⊆ D and S_1* ⊆ S*, this derivation tree can derive t also from D and S. Assume now that t is derived by our system when processing the micro-batch mb_j with j > 1. Triple t is derived by a derivation tree T possibly using triples t′ derived for some mb_h with h < j, as well as triples in D_j and (S_1 ∪ . . . ∪ S_j)*. By induction, for each t′ derived at a step h < j there exists a derivation tree T′ over D and S deriving t′. So, to conclude, it is sufficient to observe that, if in T we replace the leaves corresponding to triples t′ with the corresponding trees T′, then we obtain the desired derivation tree for t.
Let us now consider the 'only-if' direction (completeness). We proceed by a double induction, first on n, the number of micro-batches, and then on the size of the derivation tree T deriving t. Assume n = 1; this means that we process only one micro-batch. By Lemma 1, there exists an equivalent tree T′ for t satisfying the properties stated in the lemma, and hence T′ can be produced by our algorithm, as we first saturate the schema and then apply instance rules in the sequence 7-2-9 or 7-3-9, as in T′.
Assume now n > 1. We proceed by induction on the derivation tree T = {T1 | T2} −rdfsX→ t. The base case is that both T1 and T2 are simple triples t1 and t2, in D and S respectively. In this case, let j be the minimal index ensuring that both triples have been met in the processed micro-batches mb_h, with h ≤ j. This j exists by hypothesis, and either t1 or t2 is in mb_j. Assume it is t1, a schema triple, and that t2 has been met in mb_s with s < j. Then, by means of our index, we recover t2 (line 14), and the saturation for step j (line 21) builds T to derive the triple t.
Assume now that neither T1 nor T2 consists of a simple triple (the case where only one of T1 and T2 is a triple is similar). By Lemma 1, there exists an equivalent T′ = {T1′ | T2′} −rdfsY→ t such that instance rules are used at most once (in the order of Fig. 4), where each rule uses a schema triple in S*. This means that, w.l.o.g., T2′ is a schema triple t2 in S*. By hypothesis (S = S_1 ∪ . . . ∪ S_n), there exists mb_h such that t2 is obtained by schema saturation (which is globally kept in memory), and there exists mb_s in which t1 is derived and indexed by our algorithm. Now consider j = max(s, h). At step j, both t1 (indexed) and t2 (in main memory) are available to our algorithm, which can hence produce {t1 | t2} −rdfsY→ t.
The remaining cases are similar.

Extension to Streaming OWL-Horst Saturation
This section is devoted to RDF streaming saturation in the presence of OWL-Horst ontologies [13]. For space reasons, we do not present OWL-Horst in detail, and assume the reader is already familiar with it.
In recent years, OWL-Horst has gained considerable attention from both the research and industrial communities, as it represents a good balance between expressivity and computational tractability. The rules are reported in Table 5 and, as can be seen, OWL-Horst is much more expressive than RDFS. The techniques we developed for RDFS saturation remain effective in the context of OWL-Horst rules, but the transposition is not direct, and we had to take particular care in integrating RDFS and OWL-Horst saturation in the presence of streaming instance and schema data. Still, for each single OWL-Horst rule the extension of our RDFS approach is almost direct, validating the effectiveness of our previously introduced indexing technique.
As already observed in [11], an important difference wrt RDFS saturation is that for OWL-Horst it is not possible to identify an ordering of rule application as fine-grained as that for RDFS (Fig. 5). That said, a careful analysis distinguishing the setting where new schema triples are considered from the setting where new instance triples are considered allows us to establish a partial ordering among the rules. Specifically, given some new schema triples coming with a new micro-batch, the examination of rule dependencies allows us to identify three groups of rules that may be triggered: Gsch1, Gsch2, and Gsch3. The first group contains OWL-Horst schema rules together with the two RDFS rules that produce schema triples, viz. RDFS5 and RDFS11. The OWL-Horst rules in the group can be applied once, in any order; however, RDFS rules 5 and 11 need to be applied multiple times until a fixpoint is reached. The second group Gsch2 contains RDFS rules that use as premises triples produced by the rules in Gsch1. The third group Gsch3, on the other hand, is independent of the other two groups. The above analysis suggests the following order of application of rules given newly acquired schema triples: the rules in group Gsch1 need to be applied before those in group Gsch2, whereas the rules in the third group Gsch3 can be applied in parallel with those in Gsch1 and Gsch2.

The analysis of rule dependencies considering newly acquired or inferred instance triples is less conclusive, since we cannot escape the iterative application of rules. That said, we identified five groups of rules, Gins1, . . . , Gins5, which are exploited in the saturation algorithm shown in Fig. 6. The first three groups mutually depend on each other, in the sense that the triples produced by the first two groups can be used as premises by the rules in Gins3, while the rules in Gins1 depend on those in Gins3; notice that this introduces a loop between the first three groups. The rules in the fourth group Gins4 depend on the triples produced by the last two schema groups (Gsch2 and Gsch3) and the first three instance groups (Gins1, Gins2, and Gins3), plus the triples received in the current micro-batch. Since the rules in Gins4 need two instance triples to be triggered, the objective of Gins4 is to find, in the already stored dataset DS, the complementary instance triple for each received instance triple, assuming that both schema triples plus one of the instance triples exist. The analysis of the rules in the fifth group, Gins5, reveals that these depend on the rules in the first four groups, whereas none of the rules in the first four groups depends on the rules in Gins5, as already assumed by Cichlid [11]. It is worth recalling that sameAs saturation, performed by Gins5, needs to be dealt with carefully in order to avoid a blow-up in triple creation. We use the approach introduced by WebPIE [24] and then reused by Cichlid [11], which we do not describe here again and which, in a nutshell, creates and manages in an efficient way a sameAs table in which, for instance, if a, b, c, and d are the same according to the sameAs relation, then those resources are stored in a unique line of the table, essentially containing one equivalence class induced by sameAs. Also observe that one could imagine that OWL-Horst rule 11 triggers other rules again, say rule 14a.
Actually, as shown in [11,24], these re-triggered rules would produce triples already inferred by our step B5. For instance, if rule 11 produces x τ y where (*) x and y are, respectively, sameAs u and v (already used by rule 14a in the premise u τ v), then, if we assume rule 14a produces again (once re-triggered) a triple by using x τ y, that triple would be x p y. Since step B5 takes as input u p v (produced in step B2 by rule 14a), we have that x p y is produced by step B5 due to (*), so there is no need to trigger rule 14a again.
The above analysis allowed us to design an algorithm, depicted in Fig. 6, for efficiently saturating RDF streams considering both RDFS and OWL-Horst rules. It is worth observing that the rule ordering in our solution is similar to that proposed by Cichlid [11], with the notable difference that we strive to perform the saturation incrementally.
As shown in Fig. 6, when a new micro-batch arrives, a simple filtering first separates new instance triples from new schema triples. Our algorithm then performs step A, in which the saturation of the new schema triples is performed, also taking into account the previously inferred schema triples. Note that this step is needed in order to avoid inferring the same triples multiple times starting from newly arrived schema triples. The idea is to infer the new schema and perform a first wave of instance triple derivation in terms of the new schema only once (as we will see in step B, newly derived instance triples are then considered for the fixpoint computation).

Fig. 6. Global overview of the saturation process for the OWL-Horst rules.
In step A, the driver first saturates the new schema triples in sub-step A1.1, dedicated to the derivation of new RDFS schema triples: first, the OWL-Horst rules that can produce premises for RDFS rules 5 and 11 are applied, and then these last ones are applied. In A1.1, once the new schema triples are derived, these plus the new schema triples in the current micro-batch (the old schema triples are not used here) are used for RDFS instance triple saturation, as in our algorithm for RDFS saturation (Sect. 3). The novelty in step A is step A2, which applies OWL-Horst rules for instance triple derivation, by using the new schema triples plus the old ones, and by using our indexing approach to retrieve the needed instance triples derived in the past (stored on HDFS). For instance, for rule 2, once the indicated schema triple identifies a property p, we use it to retrieve, by means of our indexing scheme, only the triples having p as a property, and we then perform the join required by the second and third premises (note that this kind of join does not occur in RDFS saturation). In this way, the number of triples involved in the join operation is considerably reduced. To summarize, step A is totally along the lines of our algorithm for RDFS saturation: we obtain new schema triples and use them to infer new instance triples, which are then used in step B, commented below.
Step B follows step A and takes as input the newly received instance triples of the current micro-batch, plus the instance triples derived in step A, plus the new global RDFS/OWL schema triples computed and broadcast in step A. The main part of step B consists of a loop iterating the saturation until a fixpoint is reached. The body of the iteration consists of three subsequent steps: a first one concerning RDFS rules for instance triple derivation (step B1), followed by OWL-Horst derivation (step B2) involving the rules that could be triggered by the RDFS derivations of step B1; if these two steps produce new triples, then step B3 uses those triples to apply OWL-Horst rules 15, 16 and 4. Once the fixpoint of the loop is reached, step B4 fetches instance triples from HDFS by using our indexing technique, whenever the driver detects the existence of schema triples that can trigger rules (e.g., rule 15). If B4 produces new instance triples, then step B5 uses those triples as well to apply OWL-Horst rules 1, 2, 7 and 11; their application requires the system to fetch instance triples through our indexes, plus the indexing of the newly inferred triples.

The rules in (1) and (2) can be implemented similarly to the RDFS rules presented earlier. The rules in (3) can be implemented straightforwardly, since they involve a single instance triple. The rules in (4), however, need to be processed differently. For this reason, we focus on detailing the processing of rule 15; the other rules in (4) can be implemented similarly.
For the sake of clarity, we recall the definition of rule 15: given the schema triples v owl:someValuesFrom w and v owl:onProperty p, and the instance triples u p x and x τ w, the rule derives u τ v. This rule is processed differently depending on whether it is triggered given a newly acquired schema triple (see Box A2 in Fig. 6), given a newly inferred instance triple (see Box B3 in Fig. 6), or based on previously received instance triples (see Box B4 in Fig. 6). Algorithm 3 details the processing of rule 15 (Fig. 6, Step A2) given corresponding new schema triples. It starts by retrieving the two kinds of schema triples that are necessary for triggering the rule, namely onProperty triples and someValuesFrom triples (lines 6-7). If such triples exist, then the algorithm tries to find their matches. For example, if a newly acquired triple is an onProperty triple, e.g., (v1 owl:onProperty p1), then the algorithm attempts to find a matching triple, e.g., (v1 owl:someValuesFrom w), from the received and already existing schema, and vice versa (lines 10-13). For every matching pair of someValuesFrom and onProperty triples (lines 14-15), the algorithm retrieves, using our index, the instance triples that can be used for triggering the rule (lines 18-19), and infers implicit triples accordingly, as specified by the rule (lines 20-22).
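To make the join structure of rule 15 concrete, the following plain-RDD sketch reconstructs it (the names and triple layout are ours, not those of Algorithm 3; schema and instances are assumed to be RDD[(String, String, String)] collections of (subject, predicate, object)):

val pairs = schema.filter(t => t._2 == "owl:onProperty").map(t => (t._1, t._3))
  .join(schema.filter(t => t._2 == "owl:someValuesFrom").map(t => (t._1, t._3)))
  .map { case (v, (p, w)) => (p, (v, w)) }              // keyed by property p
val upx = instances.filter(t => t._2 != "rdf:type")
  .map(t => (t._2, (t._1, t._3)))                       // p -> (u, x)
val xtw = instances.filter(t => t._2 == "rdf:type")
  .map(t => ((t._3, t._1), ()))                         // keyed by (w, x)
val r15 = pairs.join(upx)                               // p -> ((v, w), (u, x))
  .map { case (_, ((v, w), (u, x))) => ((w, x), (u, v)) }
  .join(xtw)                                            // keep only pairs with x rdf:type w
  .map { case (_, ((u, v), _)) => (u, "rdf:type", v) }  // conclusion: u rdf:type v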

Algorithm 4 details the processing of rule 15 given the received and inferred instance triples. This algorithm relies on the received instance triples and the total schema. It starts by retrieving the required instance triples from the received and inferred instance triples (lines 5-6) and triggers the rule by considering the schema triples that were present before, and arrived along with, the given micro-batch (lines 7-9). The inferred results (line 10) are used in another round of the saturation process if the previous round inferred new triples; otherwise, the saturation process exits the loop and goes to the next saturation step (Step B4).

Algorithm 5 aims at finding one of the two instance-triple premises of rule 15 among the already existing triples (DS) received through previous micro-batches. In this step, the necessary condition is the existence of both schema triples and of at least one of the instance triples. In this regard, we suppose that both necessary schema triples exist, and that one of the related instance triples was received and/or inferred via the current micro-batch. Therefore, as a first step, we extract the schema triples (owl:onProperty and owl:someValuesFrom) required by rule 15 among the schema triples received up to this moment of the process (lines 5-6). As said, only a complete pair of schema triples is eligible to trigger the rule. For this purpose, the algorithm makes use of the findMatches() subroutine, which, for owl:onProperty triples, returns the corresponding owl:someValuesFrom triples, and vice versa (lines 7-8). In the next step, assuming that both schema triples exist (line 9), we examine the provided instance triples against the selected schema triples, picking those instances for which both schema triples exist. For this purpose, we broadcast the collected schema triples via a broadcast operation (lines 10-13). Then, we pick from mb_inst/inf those triples having rdf:type as a predicate and the same object as the collected someValuesFrom objects (line 14). We extract all the corresponding owl:onProperty schema triples based on the owl:someValuesFrom schema and the candidate triples, i.e., those with rdf:type as a predicate (lines 15-16). Next, by exploiting the indexing information, we fetch from the disk (DS) the related predicate-based triples, i.e., those having the same predicate as the owl:onProperty objects (line 17). It is worth mentioning that the number of distinct objects and predicates in datasets is small enough to fit in memory; for example, the dataset examined in this section contains only 116 distinct objects and 83 distinct predicates for object- and predicate-based triples, respectively. We then apply the saturation process between the chosen schema triples, the candidate rdf:type triples, and their corresponding predicate-based triples fetched from the disk (lines 18-20). At this point, we have performed a complete saturation for every triple with rdf:type as a predicate received or inferred via the current mb_inst/inf. Lines 21-24 of Algorithm 5 are dedicated to the symmetric process (as in lines 14-20), except that here we need to find the matching rdf:type triples with corresponding objects: for this purpose, we fetch the matching object-based triples located on the disk, in DS, and apply the saturation process over the selected and fetched triples by considering both matched schema triples. Finally, the results of the two saturation processes (i.e., r15_1 and r15_2) are concatenated and returned (line 28).

Evaluation
The saturation method we have just presented lends itself, at least in principle, to outperforming state-of-the-art techniques, notably Cichlid, when dealing with streams of RDF data. This is particularly the case when the information about the RDF Schema or the OWL-Horst ruleset is also obtained in a stream-based fashion.
To validate this claim, and to understand to what extent our method outperforms its competitors, we performed an empirical evaluation on real-life RDF datasets. The results of this evaluation are shown in the next sections.

Datasets
For our RDFS saturation experiments we used three RDF datasets that are widely used in the semantic web community: DBpedia [2], LUBM [12], and dblp. Since these three datasets do not have any OWL-Horst schema triples, we chose a portion of UniProt [7] for OWL-Horst saturation. These datasets are not stream-based, and we therefore had to partition them into micro-batches to simulate a setting where the data are received in a streamed manner. In our experiments we make the assumption that a substantial part of the data is received initially and that micro-batches then arrive in a streaming fashion. We consider this to be realistic in those scenarios where a substantial part of the data is known initially, and new triples arrive as time goes by. In what follows, for space's sake, we report on the experiments we ran against DBpedia for RDFS saturation and UniProt for OWL-Horst saturation.
Using DBpedia, we created three stream-based datasets: DBpedia-100, DBpedia-200, and DBpedia-300. They are composed of initial chunks containing 100, 200, and 300 million instance triples, respectively, and a series of 15 micro-batches, each composed of 160K instance triples plus between 64 and 2500 schema triples. For the initial chunk we reserve 25% of the schema triples, while the remaining ones are spread over the micro-batches as indicated above. Regarding saturation using the OWL-Horst ruleset, we used the UniProt dataset, which contains 320 million triples and occupies 49.6 GB. For evaluation purposes, we partitioned the UniProt dataset into micro-batches, setting the size of micro-batches to 512 MB. The schema triples of UniProt (549 triples) are divided equally between the micro-batches; thus, each micro-batch has 5 to 6 schema triples.

Experiment Setup
In the case of RDFS saturation, for each of the above datasets we ran our saturation algorithm initially on the first chunk, and then incrementally on each remaining micro-batch. For comparison purposes, for each of the above datasets we ran the Cichlid algorithm on the initial chunk, and then on each of the micro-batches. Given that Cichlid is not incremental, for each micro-batch we had to consider the previous micro-batches and the initial chunk, as well as the current micro-batch.
As with the RDFS datasets, in the OWL-Horst dataset every micro-batch contains schema triples. Therefore, Cichlid needs to reload all past micro-batches upon receiving a new micro-batch.
We performed our experiments on a cluster with 4 nodes (for results with 8 nodes, check the extended version [9]), connected with 1 Gbps Ethernet. One node was reserved to act as the master node, and the remaining 3 nodes as worker nodes. Each node has a 2.4 GHz Xeon processor, 48 GB memory, and a 33 TB Hadoop file system, and runs Linux Debian 9.3, Spark 2.1.0, Hadoop 2.7.0, and Java 1.8.
For each dataset we ran our experiment 5 times, and reported the average running time.

Saturation Considering RDF Schema Rules
Figure 7 shows the results obtained when saturating 300 million triples of the DBpedia dataset. The x-axis represents the initial chunk and the micro-batches that compose the dataset. For the initial chunk, the y-axis reports the time required for its saturation. For each of the following micro-batches, the y-axis reports the time required for saturating the dataset composed of the current micro-batch, the previous micro-batches, and the initial chunk put together. The figure shows that the time required by Cichlid for saturating the stream increases substantially as the number of micro-batches increases, and is significantly higher than that required by our algorithm. Specifically, the saturation takes more than 1000 min for the last micro-batch, i.e., 22 times the amount of time required to saturate the first micro-batch, namely 45 min. On the other hand, our incremental algorithm takes almost the same time for all micro-batches: specifically, 41 min for the first micro-batch and 78 min for the last one.
We obtained similar trends using the other datasets: dblp (see Fig. 8) and LUBM (see Fig. 9); these datasets are smaller than the DBpedia one (190M and 69M triples, respectively).

The good performance of our algorithm is due to its incremental nature, but also to its underlying indexing mechanism. To demonstrate this, Fig. 10 illustrates, for DBpedia and for each micro-batch, the number of triples that are fetched using the index, as well as the total number of triples that the saturation algorithm would have to examine in the absence of the indexing structure (which requires the whole set of triples to be loaded). As can be observed, the number of triples fetched by the index is a small fraction of the total number of triples composing the dataset.

Micro-Batch Size. So far, we have considered that the size of the micro-batch is specified a priori. Ultimately, the size of the micro-batch depends, at least partly, on the time interval and on the resources we have (cluster configuration). To investigate this point, we considered a DBpedia instance of 25.4 GB and ran 7 different incremental saturations. In saturation i, for i = 1 . . . 7, the size of the micro-batch is i * 100 MB, resulting in n_i micro-batches, over which the whole set of schema triples has been evenly distributed. We used for this experiment a cluster with 4 nodes, 11 executors, 4 cores per executor, and 5 GB memory per executor. Figure 11 illustrates the average time required for performing the saturation given a micro-batch (blue line), and the average time required for index management (red line). Regarding saturation, the figure shows that micro-batches of different sizes require different processing times. For example, the time required for processing a 100 MB micro-batch is smaller than that required for micro-batches of larger sizes. The increase is not steady, though: we observe that micro-batches of 400 MB and 500 MB require the same processing time. This means the cluster could process a bigger chunk of data within the given time interval, and we can also conclude that the cluster was idle for some time when processing 400 MB micro-batches.
Regarding the index management time (red line), our experiment shows that it is very small with respect to the saturation time, costing in the worst case less than half a minute. Besides the time interval, the configuration of the cluster impacts stream saturation: as shown in Fig. 11, 500 MB micro-batches require the same time as 400 MB micro-batches for maintaining the index.
Concerning the global execution time (for all micro-batches), experiments showed that, when the number of micro-batches decreases, this time can decrease in some cases (this happens in particular for i ∈ {1, 2, 3}; see Table 5 in [9] for details).
To summarize, the results we presented here show that it is possible to saturate streams of RDF data in an incremental manner by using big data platforms, and that our approach outperforms the state of the art.

Saturation Considering OWL-Horst Rules

Figure 12 shows the execution time required by our incremental streaming method and the execution time required by the state of the art, i.e., Cichlid, to saturate the UniProt dataset. The figure also shows the exponential trendline for both approaches. The x-axis represents the received micro-batches, each composed of 512 MB of instance triples and a few schema triples, i.e., 5 to 6 schema triples (Table 6) per micro-batch. The y-axis reports the execution time required for the saturation of each micro-batch, in seconds. Figure 12 shows that the time required by Cichlid to saturate the dataset (depicted using a red line) increases substantially as the number of micro-batches does. Furthermore, Cichlid fails to saturate the dataset starting from the 18th micro-batch. This is in contrast with our incremental solution (depicted using a blue line), which manages to efficiently saturate the dataset given the received micro-batches. Specifically, we observe that the time required for saturation varies only slightly between micro-batches, and is far smaller than the time required by Cichlid, especially in later iterations.

Table 6. Types and numbers of schema triples per micro-batch in Fig. 12.
Figure 12 also illustrates that, using Cichlid, the saturation of the first 17 micro-batches takes 1086 min, while our solution takes 15 min to process those batches. Our solution takes 257 min to process the entire dataset, which consists of 100 micro-batches; this is still far smaller than the time Cichlid requires for the initial 17 micro-batches alone.

Figure 13 illustrates the time required by each step of our incremental solution using 4 nodes. The blue line shows the total time required to process each micro-batch. The green line shows the time required to process the schema (Step A in Fig. 6). The yellow line shows the time required to process the instance triples within the micro-batch (Step B in Fig. 6). The purple line shows the time required for index management, which consists of partitioning, compressing (using a default Spark compressor, i.e., gzip), and storing the saturated micro-batch, as well as collecting from it the information needed by our indexing technique. Finally, the gray line shows the time required to detect and retrieve data from the existing dataset DS (this time is embedded in both Steps A and B). The figure shows that the total fetching time (gray line) for the entire process is 61 min, i.e., 23.7% of the total processing time of 257 min.
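To make the per-micro-batch workflow concrete, the following is a minimal sketch of such a processing loop using Spark Streaming in Scala. It is illustrative only: the paths, the 60 s interval, the RDFS-namespace test, and the placeholder bodies of saturateSchema and saturateInstances (Steps A and B) are our assumptions, not the paper's exact implementation.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.hadoop.io.compress.GzipCodec

object SaturationLoopSketch {
  type Triple = (String, String, String) // (subject, predicate, object)

  val RdfsNs = "http://www.w3.org/2000/01/rdf-schema#"

  // Naive N-Triples parsing; good enough for a sketch.
  def parse(line: String): Triple = {
    val parts = line.trim.stripSuffix(".").trim.split("\\s+", 3)
    (parts(0), parts(1), parts(2))
  }

  // Placeholders for Steps A and B (Fig. 6); a real implementation would
  // apply the ordered RDFS/OWL-Horst rules here.
  def saturateSchema(schema: RDD[Triple]): RDD[Triple] = schema
  def saturateInstances(inst: RDD[Triple], schema: RDD[Triple]): RDD[Triple] = inst

  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("saturation-loop-sketch"),
      Seconds(60)) // hypothetical micro-batch time interval

    // Micro-batches arrive as N-Triples files in a monitored directory (hypothetical path).
    ssc.textFileStream("hdfs:///rdf/stream/in").foreachRDD { (rdd, time) =>
      val triples   = rdd.filter(_.trim.nonEmpty).map(parse)
      val schema    = triples.filter(_._2.startsWith(RdfsNs)) // input of Step A
      val instances = triples.subtract(schema)                // input of Step B

      val saturated = saturateSchema(schema).union(saturateInstances(instances, schema))

      // Index management: persist the saturated batch, gzip-compressed,
      // so that later micro-batches can fetch from it selectively.
      saturated
        .map { case (s, p, o) => s"$s $p $o ." }
        .saveAsTextFile(s"hdfs:///rdf/stream/out/batch-${time.milliseconds}",
                        classOf[GzipCodec])
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```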
On average, our indexing technique takes 37 s to detect and fetch the necessary triples from DS (stored on disk) when new schema triples are received. Notice that in Fig. 13 there is a spike at micro-batch 84 (mb_84). To find the reason for this leap, we compared what happens during the saturation of micro-batch 82 (mb_82), which takes 224 s, with what happens during the saturation of mb_84, which takes 593 s. Part of these times corresponds to the retrieval of almost 111 million and 31 million potential RDF triples (object- and predicate-based triples) from DS for mb_84 and mb_82, respectively; those triples are retrieved from 1075 files for mb_84 and from 964 files for mb_82. Since the number of files accessed is comparable in the two cases, we conclude that the fetching time is not the main reason for the difference in processing the two micro-batches.
Furthermore, given the growing volume of data to be fetched, our fetching algorithm retrieves the data in reasonable time. We also recorded the number of triples retrieved for a given new micro-batch using our incremental method, and compared it with the number of triples retrieved by Cichlid. Figure 14 depicts the results: our method retrieves far fewer triples than Cichlid. This is explained by the fact that our method utilizes indexing structures designed to retrieve only the triples that are likely to yield the activation of a saturation rule.
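As an illustration of this idea, the sketch below shows a selective fetch for one rule, rdfs7 ((s p o) and (p rdfs:subPropertyOf q) entail (s q o)). The predicate-to-files index layout and all names are assumptions for the sake of the example, not our exact data structures.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object SelectiveFetchSketch {
  type Triple = (String, String, String)

  val SubPropertyOf = "http://www.w3.org/2000/01/rdf-schema#subPropertyOf"

  def parse(line: String): Triple = {
    val parts = line.trim.stripSuffix(".").trim.split("\\s+", 3)
    (parts(0), parts(1), parts(2))
  }

  // predicateIndex: predicate -> files of DS known to contain triples with
  // that predicate (an illustrative stand-in for the paper's index).
  def fetchForRdfs7(sc: SparkContext,
                    newSchema: Seq[Triple],
                    predicateIndex: Map[String, Seq[String]]): RDD[Triple] = {
    // Predicates p for which a new (p, rdfs:subPropertyOf, q) axiom just arrived.
    val touched = newSchema.collect { case (p, SubPropertyOf, _) => p }.toSet
    val files   = touched.flatMap(p => predicateIndex.getOrElse(p, Nil))
    if (files.isEmpty) sc.emptyRDD[Triple]
    else
      sc.textFile(files.mkString(","))       // load only the indexed files, not all of DS
        .map(parse)
        .filter(t => touched.contains(t._2)) // keep triples that can actually fire rdfs7
  }
}
```

The point of the sketch is that only the files named by the index are read from disk; everything else in DS is never loaded, which is what keeps the fetched fraction small in Fig. 14.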
It is worth noting that for mb_84 we fetch 111 million triples (around one-third of the whole dataset) from the already existing dataset DS. Thanks to the incremental loading of our indexing data structures, our approach can fetch such a massive number of triples in reasonable time using a relatively small cluster.

Related Work
RDF Saturation Using Big Data Platforms. To the best of our knowledge, the first proposal to use big data platforms, and MapReduce in particular, to scale the saturation operation is [17], although the authors did not present any experimental results. Other works then addressed the problem of large-scale RDF saturation by exploiting big data systems such as Hadoop and Spark (see e.g., [11,24,25]). For example, Urbani et al. [24,25] proposed WebPIE, a MapReduce-based distributed reasoning system. In doing so, they identified the order in which RDFS rules should be applied to efficiently saturate RDF data, and specified, for each RDFS rule, how it can be implemented using map and/or reduce functions and executed over Hadoop. Building on the work by Urbani et al., the authors of Cichlid [11] implemented RDF saturation over Spark using, in addition to map and reduce, other transformations provided by Spark, such as filter, union, etc. Cichlid has shown that Spark can speed up saturation with respect to Hadoop. Our solution builds on and adapts the techniques proposed by WebPIE and Cichlid to cater for the saturation of streams of massive RDF data.
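As a rough illustration of this style of encoding (assumed for exposition, not taken verbatim from either system), a rule such as rdfs7 maps naturally onto a key-by-predicate join over Spark RDDs:

```scala
import org.apache.spark.rdd.RDD

// Illustrative encoding of rule rdfs7:
//   (s p o) and (p rdfs:subPropertyOf q)  =>  (s q o)
// as plain Spark transformations; filter/union then combine the outputs
// of several such rules, and intermediate results (e.g., the keyed map)
// can be shared across rules.
def rdfs7(instances: RDD[(String, String, String)],
          subPropertyOf: RDD[(String, String)] /* p -> q */): RDD[(String, String, String)] =
  instances
    .map { case (s, p, o) => (p, (s, o)) }       // key instance triples by predicate
    .join(subPropertyOf)                         // (p, ((s, o), q))
    .map { case (_, ((s, o), q)) => (s, q, o) }  // emit the inferred triple
```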
Incremental Saturation. The problem of incremental saturation of RDF data has been investigated in a number of proposals (see e.g., [3,6,10,24,26]). For example, Volz et al. [26] investigated the maintenance of entailments given changes at the level of RDF instances as well as at the level of the RDF schema. In doing so, they adapted a previous state-of-the-art algorithm for incremental view maintenance proposed in the context of deductive databases [22]. Barbieri et al. [3] build on the solution proposed by Volz et al. by considering the case where triples in a stream are associated with an expiration date (e.g., for data that are location-based). They showed that deletion, in this case, can be performed more efficiently by tagging the inferred RDF triples with an expiration date derived from the expiration dates of the triples used in the derivation. While Volz et al. and Barbieri et al. seek to reduce the effort required for RDF saturation, they do not leverage any indexing structure to perform the incremental saturation efficiently. As reported by Volz et al. in their evaluation study, even though the maintenance was incremental, the inference engine ran out of memory in certain cases. As for Barbieri et al. [3], their evaluation considered a single transitive rule (Sect. 5 in [3]), and reported neither the size of the dataset used nor the micro-batch size.
Chevalier et al. proposed Slider, a system for RDF saturation using a distributed architecture [6]. Although the objective of Slider is similar to ours, it differs in the following aspects. First, in Slider each rule is implemented in a separate module. We adopt a different approach, in which rules are broken into finer operations (map, reduce, union, etc.); this creates opportunities for sharing the results of processing at a finer level. For example, the result of a map can be used by multiple rules, thereby reducing the overall processing required. Second, Slider utilizes vertical partitioning [1] for indexing RDF triples (see the sketch below). This indexing structure is heavy, since it creates a table for each property in the RDF dataset. While vertical partitioning has proved efficient in the context of RDF querying, it is heavyweight when it comes to RDF saturation: in the context of saturation, the inference rules that can be triggered are known in advance, and the indexing structure can therefore be tuned accordingly, which is what we do in our solution.
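For intuition, here is an illustrative in-memory analogue of vertical partitioning (a toy rendering, not Slider's code): one two-column (subject, object) table is kept per distinct predicate, so a dataset with thousands of distinct predicates yields thousands of tables to create, store, and maintain.

```scala
object VerticalPartitioningSketch {
  type Triple = (String, String, String)

  // One (subject, object) table per distinct predicate, as in vertical
  // partitioning [1]; the number of tables equals the number of predicates.
  def verticalPartition(triples: Seq[Triple]): Map[String, Seq[(String, String)]] =
    triples
      .groupBy(_._2) // group by predicate
      .map { case (p, ts) => p -> ts.map(t => (t._1, t._3)) }
}
```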
Guasdoué et al. proposed an incremental solution for saturating RDF data [10]. The incrementality comes from the fact that only rules having a premise triple that is newly asserted or derived are triggered. We adopt a similar approach. However, we utilize an indexing structure to fetch existing triples that were asserted or derived when processing previous micro-batches. Moreover, Guasdoué et al. apply the rules in an arbitrary order, whereas in our work the rules are ordered so as to minimize the number of iterations required for saturating the RDF data.
The authors of WebPIE [24] briefly touched on the problem of incrementally saturating RDF data. In doing so, they time-stamped the RDF triples to distinguish new from old ones. An inference rule R is then activated only if the timestamp associated with one of its premises is new, i.e., greater than the time the saturation was last performed. We proceed similarly in our work. However, unlike ours, WebPIE does not leverage any indexing structure when querying the existing triples to identify those that may activate a given rule R.
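A minimal sketch of that timestamping idea (all names are illustrative assumptions):

```scala
object TimestampActivationSketch {
  // A triple carries the time at which it was asserted or derived.
  final case class TimedTriple(s: String, p: String, o: String, ts: Long)

  // A rule is (re-)activated only if at least one of its premise triples is
  // newer than the last completed saturation pass.
  def ruleIsActivated(premises: Seq[TimedTriple], lastSaturation: Long): Boolean =
    premises.exists(_.ts > lastSaturation)
}
```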
To sum up, compared with the existing state of the art in incremental RDF saturation, we leverage a lightweight indexing structure, a fine-tuned ordering of rule execution, and a big data platform, namely Spark, to efficiently saturate large micro-batches of RDF data.
Indexing Structures for RDF Data. The indexing mechanism we propose here is comparable to those proposed by Weiss et al. [29], Schätzle et al. [21], and Kaoudi et al. [14] for efficiently evaluating SPARQL queries. For example, Weiss et al. developed Hexastore, a centralized system that maintains six indexes covering all triple permutations, namely spo, sop, pso, pos, osp, and ops. In the spo index, for instance, a subject s_i is associated with a sorted list of properties {p_1, ..., p_n}, and each property in turn points to a sorted list of the associated objects. While this approach allows SPARQL queries to be evaluated efficiently, it is expensive in terms of memory usage and index maintenance: according to the authors, Hexastore may require 5 times the storage space of the RDF dataset itself due to the indexes. The solution developed by Schätzle et al. [21], on the other hand, targets the distributed evaluation of SPARQL queries using Hadoop. To this end, they use an indexing scheme named ExtVP, which precomputes semi-join reductions between all properties. As shown by the authors, computing such indexes is heavy, e.g., it requires 290 s to index 100 million triples. In contrast, the index we propose here is aimed at speeding up RDF saturation, as opposed to arbitrary SPARQL queries, and is amenable to incremental maintenance.
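To see where Hexastore's reported five-fold storage overhead comes from, note that each triple is materialized under all six orderings; a toy illustration (not Hexastore's actual storage layout) follows.

```scala
object HexastoreSketch {
  // The six index entries a Hexastore-style store derives from a single
  // triple (s, p, o); materializing all of them for every triple is what
  // drives the reported ~5x storage overhead.
  def entries(s: String, p: String, o: String): Map[String, (String, String, String)] =
    Map(
      "spo" -> ((s, p, o)), "sop" -> ((s, o, p)),
      "pso" -> ((p, s, o)), "pos" -> ((p, o, s)),
      "osp" -> ((o, s, p)), "ops" -> ((o, p, s))
    )
}
```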

Conclusion and Future Work
In this paper we presented a solution for incrementally saturating streams of massive RDF data considering RDFS and OWL-Horst rules. Our solution caters for the incremental processing of the saturation operation: it makes use of an indexing scheme that allows us to retrieve only the instance triples that are necessary for saturation given a newly arrived micro-batch. We have shown that our approach and techniques are effective for both RDFS and OWL-Horst rules and outperform the state-of-the-art solution, viz. Cichlid. As future work, we would like to extend our algorithm to the query answering problem, for which we believe our technique could enable further optimizations.