Appendix to : LH ∗ RS — A Highly-Available Scalable Distributed Data Structure

The Image Adjustment Algorithm that an LH* RS client executes to update its image when an IAM comes back is as follows. Here, a is the address of the last bucket that forwarded the request to the correct one, and j is the level of bucket a. Both values are carried in the IAM. Notice that they come from a different bucket than the one considered in Litwin et al. [1996], which was the first bucket to receive the request. In many cases, the change produces an image whose extent is closer to the actual one. The search for key c = 60 in the file in Figure 1(b) illustrates one such case.

partitioned files of records identified with a primary or with multiple keys. See [SDDS] for a partial list of references. A prototype system, SDDS 2000, for Wintel PCs, is freely available for a non-commercial purpose [CERIA].
Among the best-known SDDS schemes is the LH* scheme [LNS93,LNS96,KLR96,BVW96,B99a,K98v3,R98]. It creates scalable distributed hash partitioned files. Each server stores the records in a bucket. The buckets split when the file scales up. The splits follow the linear hashing (LH) principles [L80a, L80b]. Buckets are usually stored in distributed RAM. Only the maximum number of nodes of the multicomputer limits the file size. A search or an insert of a record in an LH* file can be hundreds of times faster than a disk access [BDNL00,B02].
An LH* server may become unavailable (fail), which makes it impossible to access its data. The likelihood that some server is unavailable increases as the file scales. Similarly, for any fixed k, the likelihood of k unavailable servers increases with the file size. Data loss or inaccessibility can be very costly [CRP06]. The well-known crash of eBay in June 1999 resulted in a loss of $4B of market value and of $25M in operations. The failure of a financial database may easily cost $10K-$27K per minute [B99].
The information-theoretical minimum storage overhead for k-availability of m data servers is k/m [H&al94]. It requires the encoding of k parity symbols (records, buckets…), per m data symbols (records, buckets…). Decoding k unavailable symbols requires access to m available symbols among m + k. Large values for m seem impractical. One approach to reasonably limit m is to partition a data file into groups with independent parity calculus, of m nodes (buckets) at most per group.
For a small file using a few servers, a failure of more than one node is unlikely. Thus, k = 1 availability should typically suffice. The parity overhead is then the smallest, 1/m, and the parity operations are the fastest, using only XORing, as in RAID-5. However, the probability that a server becomes unavailable increases with the size of the file. We then need availability levels of k > 1 despite the increased storage overhead. Any static choice of k eventually becomes too small: the probability that k servers become unavailable necessarily increases, and the file reliability, which is the probability that all the data are available to the application, necessarily declines as well. To offset this decline, we need scalable availability, where k grows dynamically with the file [LMR98].
Below, we present an efficient scalable availability scheme that we call LH* RS. It generalizes the LH* scheme, structuring the scaling data file into groups of m data buckets, as we indicated above. The parity calculus uses a novel variant of Reed-Solomon (RS) erasure correcting coding/decoding that we have designed. To the best of our knowledge, it offers the fastest encoding for our needs. In particular, the storage overhead remains close to optimal in practice, between k/m and (k+1)/m.
We recall that RS codes use a parity matrix in a Galois Field (GF), typically a GF(2^f). GF(2^16) turned out to be the most efficient for us at present. Addition in a GF is fast, amounting to XORing. Multiplication is slower, regardless of the algorithm used [MS97]. Our parity calculus for k = 1 applies XORing only, optimizing the most common case. We resort to GF multiplication only for k > 1. Our multiplication uses log and antilog tables. One novelty is that, even for k > 1, XOR-only encoding still suffices for the first parity symbol (record, bucket…) and for the decoding of a single unavailable data bucket. Another novelty is an additional acceleration of the encoding, by needing only XOR operations for the first bucket in each bucket group. Finally, we innovate by using a logarithmic parity matrix, as we will explain, which accelerates the parity calculus even more. Moreover, the high-availability management does not affect the speed of searches and scans. These operations perform as in an LH* file with the same data records and bucket size.
The study of LH* RS we present here stems from our initial proposals in [LS00]. The analysis reported below has perfected the scheme. This includes various improvements to the parity calculus, such as the use of GF(2^16) instead of GF(2^8), as well as more extensive use of XORing and of the logarithmic parity matrices. We will show other aspects of the evolution in what follows. We have also completed the study of various operational aspects of the scheme, especially of messaging, which is crucial for the performance. LH* RS is the only high-availability SDDS scheme operational to the extent we present, as demonstrated by a prototype for Wintel PCs [LMS04]. It is not, however, the only one known. There were proposals to use mirroring to achieve 1-availability [LN96,BV98,VBW98]. Two schemes using only XORing provide 1-availability [L&al97,LR01,L97]. Another XORing-only scheme, LH* SA, was the first to offer scalable availability [LMRS99]. Its encoding can be faster than that of LH* RS. The price is a sometimes greater storage overhead. We compare various schemes in the related work section.
We first describe the general structure of an LH* RS file and the addressing rules. Next, we discuss the mathematics of the parity calculus and its use for the LH* RS encoding and decoding. We then present the basic LH* RS file manipulations. We follow with a theoretical and experimental performance analysis of the prototype. Our measurements justify various design choices for the basic scheme and confirm its promising efficiency.
In one of our experiments, about 1.5 sec sufficed to recover 100 000 records in three unavailable data buckets, with more than 10MB of data. We also investigate variants with different trade-offs, including a different choice of an erasure correcting code. Finally, we discuss the related work, the conclusions and the directions for future work. The capabilities of LH* RS appear to open new perspectives for data intensive applications, including the emerging applications of grids and of P2P computing.
Section 2 describes the LH* RS file structure. Section 3 presents the parity encoding. Section 4 discusses the data decoding. We explain the LH* RS file manipulations in Section 5. Section 6 deals with the performance analysis. In Section 7 we investigate variants of the scheme. Section 8 discusses the related work. Section 9 concludes the study and proposes directions for future work. Appendix A shows our parity matrices for GF(2^16) and GF(2^8). Appendix B sums up our terminology.

THE LH* RS FILE STRUCTURE
LH* RS provides high availability to the LH* scheme [LNS93,LNS96,LMRS99]. LH* itself is the scalable distributed generalization of Linear Hashing (LH) [L80a, L80b]. An LH* RS file contains data records and parity records. Data records contain the application data. The application interacts with the data records as in an LH* file. Parity records provide high availability and are invisible to the application.
We store the data and parity records at the server nodes of the LH* RS file. The application does not access the servers directly, but uses the services of the LH* RS client component. A client usually resides at the application node. It acts as a middleware between the application and the servers.
An LH* RS operation is in normal mode as long as it does not access an unavailable bucket. If it does, the operation enters degraded mode. In what follows, we assume normal mode unless otherwise stated. We first present the storage and addressing of the data records. We introduce the parity management afterwards.

Storage
An LH* RS file stores data records as if they constituted an LH* LH file, a variant of LH* described in [KLR96]. The details and correctness proofs of the algorithms presented below are in [KLR96] and [LNS96]. A data record consists of a (primary) key field that identifies the record, and a non-key field (Figure 1). The application provides the data for both fields. We write c for the key. The records are stored in data buckets, numbered 0, 1, 2… and each located at a different server node of the multicomputer. The location of bucket a, i.e., the actual address A of the node supporting it, results from a mapping a → A, e.g., through static or dynamic allocation tables at the clients and servers [LNSb96]. A data bucket has the capacity to store b >> 1 data records. Additional records become overflow records. Each bucket reports an overflow to the coordinator component of LH*. Typically, the coordinator resides at the node of bucket 0.
Initially, an LH* RS file typically stores its data records only in bucket 0 at some server A_0. It also contains at least one parity bucket, at a different server, as we discuss later.
The file adjusts to growth by dynamically increasing (or decreasing) the number of buckets. New buckets are created by splits. Inserts that overflow their buckets trigger these. Inversely, deletion causing an underflow may trigger bucket merges. Each merge undoes the last split, freeing the last bucket. The coordinator manages both splits and merges.
We now show how LH* RS data buckets split in normal mode. In Section 5, we treat the processing of the parity records during the splits, as well as the degraded mode. We postpone the description of the merge operation to that section too.
As in an LH file, the LH* RS buckets split in a fixed order: 0; 0, 1; 0, 1, 2, 3; …; 0, …, n, …, 2^i - 1; 0, … The coordinator maintains the data (i, n) forming the file state. The variable i determines the hash function used to address the data records, as we discuss in Section 2.1.2 below. We call i the LH* file level. The variable n points to the bucket to split; we call it the split pointer. Initially, (i, n) = (0, 0).
To trigger a split, the coordinator sends a split message to bucket n. It also dynamically appends a new bucket to the file with the address n + 2^i. The address of each key c in bucket n is then recalculated using the hash function h_{i+1} : c → c mod 2^{i+1}. We call the functions h_i linear hash functions (LH-functions). For any record in bucket n, either h_{i+1}(c) = n or h_{i+1}(c) = n + 2^i. Accordingly, any record either remains in bucket n or migrates to the new bucket n + 2^i. Assuming as usual that key values are randomly distributed, both events are equally likely.
After the split, the coordinator updates n as follows. If n < 2^i - 1, it increments n to n + 1. Otherwise, it sets n = 0 and increments i to i + 1.
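The split rule and the update of the file state (i, n) can be illustrated with a small single-process model. This is only a sketch under simplifying assumptions: the class name `LHFile`, the in-memory list of buckets, and the absence of overflow detection, messaging and parity handling are all ours and not part of the scheme's description.

```python
def h(level: int, key: int) -> int:
    """Linear hash function h_level(c) = c mod 2^level."""
    return key % (1 << level)


class LHFile:
    """Toy model (hypothetical) of the file state (i, n) and the data buckets."""

    def __init__(self) -> None:
        self.i = 0           # file level
        self.n = 0           # split pointer
        self.buckets = [[]]  # bucket 0 only, initially

    def split(self) -> None:
        i, n = self.i, self.n
        new_addr = n + (1 << i)           # address of the appended bucket
        assert new_addr == len(self.buckets)
        stay, move = [], []
        for c in self.buckets[n]:
            # h_{i+1}(c) is either n or n + 2^i
            (stay if h(i + 1, c) == n else move).append(c)
        self.buckets[n] = stay
        self.buckets.append(move)
        # Update of the split pointer and, at the end of a round, of the level.
        if n < (1 << i) - 1:
            self.n = n + 1
        else:
            self.n, self.i = 0, i + 1


f = LHFile()
f.buckets[0] = list(range(16))
for _ in range(3):
    f.split()
print((f.i, f.n), f.buckets)
```

After three splits the model holds four buckets, bucket a containing the keys c with c mod 4 = a, and the file state is (2, 0).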
Internally, each LH* RS data bucket a is organized as an LH file. We call the internal buckets pages. There is also an internal file state, called the bucket state, (ĩ, ñ). We perform the LH* RS split (and the LH* LH split in general [KLR96]) by moving all odd pages to the new bucket a'. This is faster than the basic technique of recalculating the address for every record in bucket a. We rename the odd pages in the new bucket a' to 0, 1… by dropping the least significant bit. Likewise, we rename the even pages in bucket a. In other words, page number p becomes page number ⌊p/2⌋ in both buckets.

Addressing
The LH* storage rules guarantee that for any given file state (i, n), the following LH addressing algorithm [L80a] uniquely determines the bucket a for a data record with key c:
(A1) a := h_i(c);
if a < n then a := h_{i+1}(c).
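As a minimal sketch, (A1) translates directly into a few lines of code. The function names `h` and `a1_address` are ours, chosen for illustration.

```python
def h(level: int, key: int) -> int:
    """Linear hash function h_level(c) = c mod 2^level."""
    return key % (1 << level)


def a1_address(key: int, i: int, n: int) -> int:
    """Algorithm (A1): bucket address of `key` for the state (i, n)."""
    a = h(i, key)
    if a < n:              # bucket a has already split in this round
        a = h(i + 1, key)
    return a


# File state (i, n) = (2, 1): five buckets 0..4, bucket 0 has already split.
print(a1_address(4, 2, 1))  # h_2(4) = 0 < n, so h_3(4) = 4
print(a1_address(5, 2, 1))  # h_2(5) = 1 >= n, so bucket 1
```

The same function serves a client and the coordinator alike; only the pair (i, n) passed to it differs.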
Algorithm (A1) determines the correct (primary) address for key c. Its calculus depends on the file state at the coordinator. The general principles of SDDS mandate avoiding hot spots. Therefore, LH* RS clients do not access the coordinator to obtain the file state. Instead, each client uses (A1) on its private copy of the file state. We call this copy the client image, and write it as (i', n'). Initially, for a new client or file, (i', n') = (0, 0), as for the initial file state. In general, the client image differs from the file state, as the coordinator does not notify the clients of any bucket splits and merges. The strategy reflects the basic SDDS design, avoiding excessive messaging after a split, problems with unavailable clients, etc. [LNS96]. As a result, any split or merge causes all client images to be out of date.
Using (A1) on the client image can lead to an incorrect address. The client then sends the request to an incorrect bucket. The receiving bucket forwards the request, which normally reaches the correct server in one or at most two additional hops (as we show below). The correct LH* RS server then sends an image adjustment message (IAM) to the client.
The IAM adjusts the image so that the same addressing error cannot occur again. It does not guarantee that the image becomes equal to the file state. In general, different clients may have different images.
The binding between bucket numbers and node addresses is done locally as well.
Buckets can change their location. For instance, a bucket can merge with its "father" bucket, and then split off again, but on another (spare) server. A bucket located on a failed node is also reconstructed on a spare. We call a bucket, and a query to it, displaced if the bucket is located at a different server than the one the client knows. We use the IAM to update these data whenever the client deals with a displaced bucket. It will appear that the presence of a displaced bucket may lead to one or two additional hops.
Most LH* RS file operations are key-based. The exception is the scan operation, which returns all records satisfying a certain condition. We treat scans in Section 5.5. The key-based operations are insert, delete, update, and (key) search. To perform such an operation, the client uses (A1) to obtain a bucket address a_1 and sends the operation to this bucket. Most of the time [LNS96], the correct bucket number ã is identical to a_1.
However, if the file grew and now contains more buckets, then a_1 < ã is possible. If the file shrank, then a_1 > ã is possible as well. In both cases, the client image differed from the file state.
To be able to resolve the incorrectly addressed operations, every bucket stores the value j of the function h_j last used to split or to create the bucket. We call j the bucket level, and we have j = i or j = i + 1. A server that receives a message intended for bucket ã first tests whether it really has bucket ã. If not, it forwards the displaced query to the coordinator. The coordinator attempts to resolve the addressing using the file state. We treat this case later in this section.
Otherwise, bucket ã starts the LH* forwarding algorithm (A2):
(A2) a' := h_j(c);
if a' = ã then accept c;
else a'' := h_{j-1}(c);
if a'' > ã and a'' < a' then a' := a'';
send c to bucket a';
If the address a' provided by (A2) is not ã, then the client image was incorrect. Then, bucket ã forwards the query to bucket a_2 = a', a_2 > a_1. If the query is not displaced, bucket a_2 becomes the new intended bucket for the query, i.e., ã := a_2. It acts accordingly, i.e., executes (A2). This may result in the query being forwarded to yet another bucket a_3 := a', with a' recalculated at bucket a_2 and a_3 > a_2. Bucket a_3 becomes the next intended bucket, i.e., ã := a_3. If the query is not displaced, bucket a_3 acts accordingly in turn, i.e., it executes (A2). A basic property of the LH* scheme is then that (A2) must yield a' = a_3, i.e., bucket a_3 must be the correct bucket a. In other words, the scheme forwards a key-based operation at most twice.
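The forwarding rule can be exercised in a small simulation. The sketch below is illustrative only: it assumes a file that has only grown (no merges, no displaced buckets), derives each bucket level j from the file state (i, n) instead of storing it at the bucket, and uses the hypothetical names `bucket_level`, `a2_forward` and `route`.

```python
def h(level: int, key: int) -> int:
    """Linear hash function h_level(c) = c mod 2^level."""
    return key % (1 << level)


def a1_address(key: int, i: int, n: int) -> int:
    """Algorithm (A1) for the state or image (i, n)."""
    a = h(i, key)
    return h(i + 1, key) if a < n else a


def bucket_level(a: int, i: int, n: int) -> int:
    """Level j of bucket a in file state (i, n): buckets already split in the
    current round, and the buckets they created, are at level i + 1."""
    return i + 1 if a < n or a >= (1 << i) else i


def a2_forward(a_tilde: int, j: int, key: int):
    """Algorithm (A2) at bucket a_tilde of level j.
    Returns None if the bucket accepts the key, else the forwarding address."""
    a1 = h(j, key)
    if a1 == a_tilde:
        return None
    a2 = h(j - 1, key)
    if a_tilde < a2 < a1:
        a1 = a2
    return a1


def route(key: int, client_image, file_state):
    """Buckets visited by a key-based operation, starting at the client's guess."""
    i, n = file_state
    a = a1_address(key, *client_image)
    path = [a]
    while (nxt := a2_forward(a, bucket_level(a, i, n), key)) is not None:
        path.append(nxt)
        a = nxt
    return path


# File state (3, 2): ten buckets 0..9. A new client has the image (0, 0).
print(route(9, (0, 0), (3, 2)))   # [0, 1, 9]: two forwarding hops
print(route(15, (0, 0), (3, 2)))  # [0, 7]: one forwarding hop
```

In both runs, the last bucket on the path is the one that (A1) yields for the actual file state, and no path contains more than two forwarding hops.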
The correct bucket a performs the operation. In addition, if it received the operation through forwarding, i.e., a ≠ a_1, it sends an IAM to the client. The IAM contains the level j of bucket a_1 (as well as the locations known to bucket a, which are in fact all those of its preceding buckets). The client uses the IAM to update its image according to the LH* Image Adjustment algorithm (A3). (A3) guarantees that the client cannot repeat the same error, although the client image typically still differs from the file state. The client also updates its location data.
We process the query within the correct bucket as a normal LH query. We first locate the page that contains the key c. However, if the bucket has level j > 0, we do not apply the addressing algorithm (A1) directly to c, but rather to c shifted by j bits to the right. In other words, we apply (A1) to ⌊c/2^j⌋. The modification corresponds to the algorithm for splitting buckets described in the previous section [KLR96].
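The body of (A3) is not reproduced in the text above. As an illustration only, the sketch below assumes the adjustment rule commonly stated for LH* in [LNS96]: given the address a and the level j reported in the IAM, the client sets i' := j - 1 and n' := a + 1 when j > i', and wraps to n' := 0, i' := i' + 1 when n' reaches 2^i'. The function name `adjust_image` is ours.

```python
def adjust_image(image, a: int, j: int):
    """Assumed image adjustment in the style of [LNS96] (not quoted from the text).
    image = (i', n') is the client image; (a, j) come from the IAM."""
    i_c, n_c = image
    if j > i_c:
        i_c = j - 1
        n_c = a + 1
        if n_c >= (1 << i_c):   # the split pointer wrapped around
            n_c = 0
            i_c += 1
    return i_c, n_c


# Example: file state (3, 2), client image (0, 0), key c = 9.
# Under (A2), the request visits buckets 0, 1, 9; buckets 0 and 1 have level 4.
print(adjust_image((0, 0), 0, 4))  # IAM built from the first bucket: (3, 1)
print(adjust_image((0, 0), 1, 4))  # IAM built from the last forwarder: (3, 2)
```

Under this assumed rule, the values taken from the last forwarding bucket bring the image to the actual file state (3, 2) in this example, whereas those of the first bucket leave it at (3, 1).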
If a server forwards a displaced query to the coordinator, the latter calculates the correct address a. It does so according to (A1) and the file state. It sends the actual location of the displaced bucket to the query originator and to the server. These update their locations accordingly.

Parity Records
The LH* RS parity records protect an application against the unavailability of servers storing its data. The LH* RS file tolerates up to k ≥ 1 unavailable servers in a manner transparent to the application. As usual, we call this property k-availability. The (actual) availability level k depends on an LH* RS file specific parameter, the intended availability level K ≥ 1. We adjust K dynamically with the size of the file (Section 2.2.3).
Depending on the file state, the actual availability level k is either K or K-1.

Record Grouping
LH* RS parity records constitute a specific structure, invisible to the application, and separate from the LH* structure of the data records. The structure consists of bucket groups and record groups. We collect all buckets a with ⌊a/m⌋ = g in a bucket group g, g = 0, 1, … Here, m > 0 is a file parameter that is a power of 2. In practice, we only consider m ≤ 128, as we do not currently envision any applications where a larger value might be practical. A bucket group thus consists of m consecutively numbered buckets, except perhaps for the last bucket group, which may have fewer than m members.
Data records in a bucket group form record groups. Each record group (g,r) is identified by a unique rank r, and the bucket group number g. At most one record in a bucket has any given rank r. The record gets its rank, when an insert or split operation places it into the bucket. Essentially, records arriving at a bucket are given successive ranks, i.e. r = 1, 2… However, we can reuse ranks of deleted records. A record group contains up to m data records, each at a different bucket in the bucket group. A record that moves with a split obtains a new rank in the new bucket. For example, the first record arriving at a bucket a, a = 0, 1 ... m-1, because of an insert or a split, obtains rank 1. It thus joins the record group (0,1). If the record is not deleted, before the next record arrives at that bucket, that one joins group (0,2), etc.
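The grouping rules can be sketched as follows. The group number is ⌊a/m⌋ as defined above; the rank bookkeeping is only one possible policy. In particular, reusing the smallest freed rank first is our assumption (the text only states that ranks of deleted records can be reused), and the names `bucket_group` and `RankAllocator` are hypothetical.

```python
import heapq


def bucket_group(a: int, m: int) -> int:
    """Bucket group number g of data bucket a for group size m."""
    return a // m


class RankAllocator:
    """Per-bucket rank bookkeeping (illustrative): ranks 1, 2, ... are handed
    out in order, and ranks freed by deletions are reused, smallest first."""

    def __init__(self) -> None:
        self._next = 1
        self._free = []

    def assign(self) -> int:
        if self._free:
            return heapq.heappop(self._free)
        r = self._next
        self._next += 1
        return r

    def release(self, r: int) -> None:
        heapq.heappush(self._free, r)


m = 4
ranks = RankAllocator()
first = ranks.assign()             # first record arriving at bucket 5
group = (bucket_group(5, m), first)
print(group)                       # record group (g, r) = (1, 1)
```

A record that moves during a split would release its rank in the old bucket and call `assign` on the allocator of the new bucket.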
Each record group has k ≥ 1 parity records p_1 … p_k in addition to the data records. The parity records are stored at different parity buckets P_1, … P_k. The local availability level k depends on the intended availability level K, the bucket group number g, and the file state. Using the parity calculus presented in Section 3, and given any s ≤ k parity records and any m-s data records, we can recover the remaining s data records in the group. The availability level of the file is the minimum of the local availability levels.

Record Structure
Figure 1 (ii) shows the structure of a parity record for group (g,r). The first field of every record p_i of the group contains the rank r as the record key in P_i. The next field is the (record) group structure field C = c_0, c_1, … c_{m-1}. If there is a data record in the i-th bucket of the group, then c_i is the key of that data record. Otherwise, c_i is zero.
The final field is the parity field B. The contents of B are the parity symbols. We calculate them from the non-key data in the data records in the record group (Section 3) in a process called encoding. Inversely, given s parity fields and m-s non-key data fields from records in the group, we can recover the remaining non-key data fields using the decoding process from Section 3. We can recover the keys of the s lacking data records from any parity record. In [LS00], we used a variable length list of keys for existing data structures. However, the fixed structure presented here, introduced in [Lj00], proved more efficient. It typically needs slightly less storage. In addition, the position of a key in C indicates the bucket at which the data record is located. During record recovery, we can thus directly access the bucket instead of using the LH* addressing algorithm, which avoids possible forwarding messages.

Scalable Availability
Storing and maintaining parity creates storage overhead increasing with k. For a file with only 1 data bucket, the overhead is k. For a larger file, the overhead is at least equal to k/m. In addition, we have the run-time overhead to update all k parity records of a group whenever the application inserts, updates or deletes a data record. A file with few buckets is less likely to suffer from multiple unavailable buckets. However, as the size of the file increases, multiple failures become more probable. For any given k, the probability of catastrophic loss, i.e., loss of more than k buckets in a single group and the resulting inability to access all records, increases with the file size [H&al94]. In response, the LH* RS scheme provides scalable availability [LMR98]. When the growing file reaches certain sizes, the file starts to incrementally increase every local k-availability to (k+1)-availability. We illustrate the principle in Figure 2, which we discuss in more depth below.
Specifically, we maintain the file parameter termed intended availability level K. If we create a new bucket that is the first in the group, then this group gets k = K parity buckets. Every data record in this group has then k = K parity records. Initially, K = 1, and any group has one parity bucket. We basically increment K when (i) the split pointer returns to bucket 0, and (ii) the total number of buckets reaches some predetermined level. Then, any existing bucket group has k = K -1 parity buckets (as we will see by induction). Every new group gets then K parity buckets. In addition, whenever we split the first bucket in a group, we equip this group with an additional parity bucket. As the split pointer moves through the data buckets of the file, we create all new groups with K parity buckets, and add an additional parity bucket to all old groups. Thus, by the time the split pointer reaches bucket 0 again, all groups have local availability level k = K.
In Figure 2, a bucket group has the size of m = 4. Data buckets are white and parity buckets are grey. In Figure 2a, we create the file with one data bucket and one parity bucket, i.e., with K = 1. When the file size increases, we split the first bucket, but only maintain one parity bucket, Figure 2b,c. When the (m+1)-st bucket is created, the new bucket group also receives a parity bucket. Thus, the file is 1-available. Each new bucket group has this availability level until the number N of data buckets in the file reaches some N_1. In our example, N_1 = 16, Figure 2e. More generally, N_1 = 2^l with some l >> 1 in practice. That condition implies n = 0. The next bucket to split is bucket 0.
At this point of the file scale-up, K increases by one. From now on, starting with the split of bucket 0, each split creates two parity records per record group, Figure 2f. Both the existing group and the one started by the split have 2 parity buckets. The process continues until some size N_2, also a power of 2 and, necessarily this time, a multiple of m.
On the way, starting from the file size N = 2N_1, all the bucket groups are 2-available, Figure 2g. When the file grows to include N_2 buckets, K increases by 1 again, to K = 3 this time. The next split adds a third parity bucket to the group of bucket 0 and initializes a new group starting with bucket N_2 and carrying three parity buckets. The next series of splits provides all groups with K = 3 parity records. When the file size reaches N_3, K is again incremented, etc.
Basically for our scheme, and in Figure 2, the successive values N_i are predefined as N_{i+1} = 2^i N_1. We call this strategy uncontrolled availability and justify it in [LMR98].
Alternatively, a controlled availability strategy implies that the coordinator calculates the values of the N_i dynamically. The decision may be based on the probability of k unavailable buckets in a bucket group, whenever the split pointer n comes back to 0.
The global file availability level K_file is the maximum k such that we can recover any k buckets failing simultaneously. Obviously, K_file is equal to the minimum k over all groups in the file. We thus have K_file = K or K_file = K-1. K_file starts at 1 and increases to K_file = 2 when N reaches 2N_1. In general, K increases to i after N reaches N_i, and K_file reaches i when N reaches 2N_i. The growing LH* RS file is thus progressively able to recover from larger and larger numbers of unavailable buckets, as these events necessarily become increasingly likely.
One consequence of the scheme is the possible presence of transitional bucket groups where not all the data buckets are split yet. The split pointer n points there somewhere between the 2nd and the last bucket of the group. The first bucket group in Figure 2f is transitional, as well as in Figure 2h, both with n = 2. In such a group, the newly added parity bucket only encodes the contents of the data buckets that have already split. LH* RS recovery cannot use this additional parity bucket in conjunction with data buckets that have not yet split. As a result, the availability level of any transitional group is K-1. It becomes K when the last bucket splits (hence the group ceases to be transitional).

PARITY ENCODING
We now explain our parity encoding, that is, the calculation of the B field in a parity record. Section 4 deals with the decoding for the reconstruction of an unavailable record.
The parity encoding in general is based on Erasure Correcting Codes (ECC). We have designed a generalization of a Reed-Solomon (RS) code. RS codes are popular, [MS97], [P97], being sometimes indirectly referred to also as information dispersal codes, [R89].

Galois Field
Our GF has 2^f elements, f = 1, 2…, called symbols. Whenever the size 2^f of a GF matters, we note the field as GF(2^f). Each symbol in GF(2^f) is a bit-string of length f. One symbol is zero, written as 0, consisting of f zero-bits. Another is the one symbol, written as 1, with f-1 bits 0 followed by bit 1. Symbols can be added (+), multiplied (⋅), subtracted (-) and divided (/). These operations in a GF possess the usual properties of their analogues in the field of real or complex numbers, including the properties of 0 and 1. As usual, we may omit the '⋅' symbol.
Initially, we elaborated the LH* RS scheme for f = 4 [LS00]. First experiments showed that f = 8 was more efficient. The reason was the (8-bit) byte and word oriented structure of current computers [Lj00]. Later, the choice of f = 16 proved even more practical. It became our final choice, Section 6.3. For didactic purposes, we nevertheless discuss our parity calculus for f = 8, i.e., for GF(2^8) = GF(256). The reason is the sizes of the tables and matrices involved. We note this GF as F. The symbols of F are all the byte values. F thus has 256 symbols, which are 0, 1…255 in decimal notation, or 0, 1...ff in hexadecimal notation. We use the latter in Table 1 and often in our examples.
Addition and subtraction in any of our GF(2^f) are the same. Both are the bitwise XOR (Exclusive-OR) operation on f-bit bytes or words. That is: ξ + ψ = ξ - ψ = ξ XOR ψ. The XOR operation is widely available, e.g., as the ^ operator in C and Java, i.e., ξ XOR ψ = ξ ^ ψ. Our calculus of multiplication and division exploits the existence in every GF of primitive elements. If α is primitive, then any element ξ ≠ 0 is α^i for some integer power i, 0 ≤ i < 2^f - 1. We call i the logarithm of ξ and write i = log_α(ξ). Table 1 tabulates the non-zero GF(2^8) elements and their logarithms for α = 2. Likewise, ξ = α^i is then the antilogarithm of i, which we write as ξ = antilog(i).
The successive powers α^i for any i, including i ≥ 2^f - 1, form a cyclic group of order 2^f - 1. Using the logarithms and the antilogarithms, we can calculate multiplication and division through the following formulae. They apply to symbols ξ, ψ ≠ 0. If one of the symbols is 0, then the product is obviously 0. The addition and subtraction in the formulae are the usual ones of integers:
$\xi \cdot \psi = \mathrm{antilog}\big((\log \xi + \log \psi) \bmod (2^f - 1)\big)$,
$\xi / \psi = \mathrm{antilog}\big((\log \xi - \log \psi + 2^f - 1) \bmod (2^f - 1)\big)$.
To implement these formulae, we store symbols as char type (byte long) for GF(2^8) and as short integers (2-byte long) for GF(2^16). This way, we use them as offsets into arrays. We store the logarithms and antilogarithms in two arrays. The logarithm array log has 2^f entries. Its offsets are the symbols, 0x00 … 0xff for F, and entry i contains log(i), an unsigned integer. Since element 0 has no logarithm, that entry is a dummy value such as 0xffffffff. Table 1 shows the logarithms for F.
Our multiplication algorithm applies the antilogarithm to sums of logarithms modulo 2^f - 1. To avoid the modulus calculation, we use all possible sums of logarithms as offsets. The resulting antilog array then stores antilog[i] = antilog(i mod (2^f - 1)) for entries i = 0, 1, 2…, 2(2^f - 2). We thus double the size of the antilog array, which speeds up both encoding and decoding. We could similarly avoid the modulo operation for the division as well. In our scheme however, divisions are rare and the savings seem too minute to justify the additional storage (128KB for our final choice of f = 16). Figure 3 shows our final multiplication algorithm. Figure 4 shows the algorithm generating our two arrays. We call them respectively log and antilog arrays. The following example illustrates their use. The first equality uses our multiplication formula but for the first term. We use the logarithm array log to look up the logarithms. For the second term, the logarithms of 49 and 1a are 152 and 105 (in decimal) respectively (Table 1). We add these up as integers to obtain 257. This value is not in
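The two arrays and the table-driven multiplication can be sketched as follows for GF(2^8) with α = 2. This is not the code of Figures 3 and 4. The generator polynomial 0x11D is an assumption on our part: the text does not state it, but it reproduces the logarithms 152 and 105 quoted above for the symbols 49 and 1a. The names `LOG`, `ANTILOG`, `gf_mul` and `gf_div` are ours.

```python
F_BITS = 8
ORDER = (1 << F_BITS) - 1   # 2^f - 1 = 255 non-zero symbols
PRIM_POLY = 0x11D           # assumed generator polynomial (not given in the text)

# log array: 2^f entries; entry 0 is a dummy, since 0 has no logarithm.
LOG = [None] * (ORDER + 1)
# antilog array: entries 0 .. 2(2^f - 2), so a sum of two logarithms
# can be used as an offset without a modulo operation.
ANTILOG = [0] * (2 * ORDER - 1)

x = 1
for e in range(ORDER):
    ANTILOG[e] = x
    LOG[x] = e
    x <<= 1                 # multiply by alpha = 2
    if x & (1 << F_BITS):
        x ^= PRIM_POLY      # reduce modulo the generator polynomial
for e in range(ORDER, 2 * ORDER - 1):
    ANTILOG[e] = ANTILOG[e - ORDER]


def gf_mul(a: int, b: int) -> int:
    """Product in GF(2^8): one addition of logarithms, no modulo."""
    if a == 0 or b == 0:
        return 0
    return ANTILOG[LOG[a] + LOG[b]]


def gf_div(a: int, b: int) -> int:
    """Quotient in GF(2^8); divisions are rare, so the modulo is kept here."""
    if b == 0:
        raise ZeroDivisionError("division by the zero symbol")
    if a == 0:
        return 0
    return ANTILOG[(LOG[a] - LOG[b] + ORDER) % ORDER]


# Logarithms 152 + 105 = 257, and antilog[257] = antilog[2] = 0x04.
print(hex(gf_mul(0x49, 0x1A)))
```

For f = 16, the same construction applies with 16-bit symbols and a suitable generator polynomial; the arrays then grow to 2^16 and 2(2^16 - 2) + 1 entries.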

Parity Calculus
We recall that a parity record contains the keys of the data records and the parity data of the non-key fields of the data records in a record group, Figure 1. We encode the parity data from the non-key data as follows.
We number the data records in the record group 0, 1, …, m-1. We represent the non-key field of data record j as a sequence $a_{0,j}, a_{1,j}, a_{2,j}, \ldots$ of symbols. We also number the parity records in the record group 0, 1, …, k-1. We write $b_{0,j}, b_{1,j}, b_{2,j}, \ldots$ for the B-field symbols of the j-th parity record. We arrange the parity records also in a matrix $B = (b_{i,j})$ with l rows and k columns, l being the number of symbols per record. With $P = (p_{\nu,j})$ denoting the parity matrix with m rows and k columns, each parity symbol is thus the sum of m products of data symbols with the same offset times m coefficients of a column of the parity matrix:
(3.1) $b_{i,j} = \sum_{\nu=0}^{m-1} a_{i,\nu} \, p_{\nu,j}$.
The LH* RS parity calculus does not use P directly. Instead, we use the logarithmic parity matrix Q with coefficients $q_{i,j} = \log_\alpha(p_{i,j})$. The implementation of equation (3.1) gets the form:
(3.2) $b_{i,j} = \bigoplus_{\nu=0}^{m-1} \mathrm{antilog}\big(q_{\nu,j} + \log_\alpha(a_{i,\nu})\big)$.
Here, ⊕ designates XOR and antilog designates the calculus using our antilog table, which avoids the mod (2^f - 1) computation. A term with $a_{i,\nu} = 0$ contributes 0. Using (3.2) and Q instead of (3.1) and P speeds up the encoding, by avoiding half of the accesses to the log table.
The overall speed-up of the encoding is however more moderate than one could perhaps expect from these figures (Section 6.3.1). Although using Q is our actual approach, we continue to present the parity calculus in terms of P for ease of presentation.
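The table-based multiplication and the Q-based encoding described above can be sketched as follows. This is a minimal illustration, not the prototype code: it uses GF(2^8) rather than the final f = 16, and it assumes the primitive polynomial 0x11D with generator 2, which the text does not state. With that assumption the logarithms of 49 and 1a come out as 152 and 105, the values quoted earlier. The helper names and the sample coefficients other than 1 and 1a are hypothetical.

```python
# Sketch only: GF(2^8) log/antilog multiplication and Q-based parity encoding.
# Assumption: primitive polynomial 0x11D, generator 2 (not stated in the text).
F = 8
ORDER = (1 << F) - 1                      # 2^f - 1 nonzero field elements
POLY = 0x11D

def build_tables():
    log = [0] * (ORDER + 1)               # log[0] is unused: zero has no logarithm
    antilog = [0] * (2 * ORDER - 1)       # entries 0 .. 2(2^f - 2)
    x = 1
    for i in range(ORDER):
        antilog[i] = x
        log[x] = i
        x <<= 1
        if x & (1 << F):
            x ^= POLY
    for i in range(ORDER, 2 * ORDER - 1):
        antilog[i] = antilog[i - ORDER]   # antilog(i mod (2^f - 1)), precomputed
    return log, antilog

LOG, ANTILOG = build_tables()

def gf_mul(a, b):
    """Multiply via the doubled antilog table: no modulus at run time."""
    if a == 0 or b == 0:
        return 0
    return ANTILOG[LOG[a] + LOG[b]]

def gf_mul_bitwise(a, b):
    """Shift-and-add reference multiplication, used only for cross-checking."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & (1 << F):
            a ^= POLY
        b >>= 1
    return r

def parity_symbol(data_symbols, q_column):
    """One parity symbol from the logarithmic coefficients q = log(p).

    Only the data symbol needs a log lookup; zero data symbols contribute 0.
    """
    b = 0
    for a, q in zip(data_symbols, q_column):
        if a:
            b ^= ANTILOG[q + LOG[a]]
    return b
```

The sketch presumes nonzero parity coefficients, since a zero coefficient has no logarithm and would need separate handling.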

Generic Parity Matrices
We have designed several algorithms for generating parity matrices for LH* RS. We presented the first one in [LS00] for 4-bit symbols of GF(2^4). When implemented, operations turned out to be slower than they could be on the byte-oriented structure of modern computers [Lj00]. We turned to byte-sized symbols of GF(2^8), which proved faster, and to 2-byte symbols of GF(2^16), which proved even more effective. We have reported early results in [M03]. We show further outcomes below.
We upgraded our parity matrices with respect to [LS00] (and any other proposal we know about in the literature) so that the first column and the first row now contain only coefficients 1. The column of ones allows us to calculate the first parity records of the bucket group using XOR only, as for the "traditional" RAID-like parity calculus. Our prior parity matrices required GF multiplications for this column, which are slower than XOR alone, as we already discussed. Next, if one data bucket in a group has failed and the first parity bucket is available, then we can decode the unavailable records using XOR only. Before, we also needed the GF multiplications. The row of ones allows us to use XOR calculations for the encoding of each first record of a record group. This also contributes to the overall speed-up with respect to any proposal requiring the multiplications, including our own earlier ones. Our final change was the use of Q instead of the original P (Section 3.2.1). The experiments confirmed the interest of all these changes (Section 6.3.1).
LH* RS files may differ by their group size m and availability level k. A smaller m speeds up recovery but increases the storage overhead, and vice versa. The parity matrix P for a bucket group needs m rows and k columns, k = K or k = K - 1. Different files in a system may in this way need different matrices P. We show in Section 4.2 that the choice of GF(2^f) limits the possibilities for any P to m + k ≤ 2^f + 1. Except for this constraint, m and k can be chosen quite arbitrarily. We also prove that for any parity matrix P' with dimensions m' and k', every m < m' by k < k' top left corner of P' is also a parity matrix.
These properties govern our use of the parity matrices for different files. Namely, we use a generic parity matrix P' and its logarithmic parity matrix Q' in an LH* RS file system.
The m' and k' dimensions of P' and Q' should be big enough for any system application.
Any actual P and Q we use are then the m ≤ m' by k ≤ k' top left corners of P' and of Q'.
Their columns are derived dynamically when needed.
Section 4.2 below shows the construction of our P' for GF(2^8) (within the generator matrix containing it). We have to respect the condition that m' + k' ≤ 257. Because of LH* RS specifically, m' has to be a power of two. Our choice for m' was therefore m' = 128, to maximize the bound on the group size while allowing k' > 1. Hence, k' = 129. Figure 16 displays the 20 leftmost columns of Q'. Figure 17 displays these columns of P'. The selection suffices for 20-available files. We are not aware of any application that needs a higher level of availability.
This allows LH* RS files with more than 128 buckets per group. Ultimately, even a very large file could consist of a single group, if such an approach would ever prove useful.
Example 2 We continue to use GF(2^8) and the conventions of Section 3.1. We now illustrate the encoding principles presented so far by determining the parity data that should be in k = 3 parity records for the record group of size m = 4 described below. Figure 5 shows P and Q. These are the top left four rows and three columns of P' and Q' in Figure 17 and Figure 16. We suppose the encoded data records to have the non-key fields as follows: "En arche en o logos …", "In the beginning was the word …", "Au commencement était le mot …", and "Am Anfang war das Wort…". Using ASCII coding, these strings translate to the symbols forming matrix A. In our implementation, we use Q and the '*' multiplication between two GF elements (or matrices) when the right operand is a logarithm. This yields the first B-field, which, following Figure 1, is encoded as B = "c 18 0…". Likewise, the second B-field is "d2 76 e2…" and the third one is finally "d0 93 ff…".

Parity Updating
The application updates an LH* RS file one data record at a time. An insert, update or delete of a record modifies the parity records of a record group. We call this parity updating. It is the actual calculus for the parity encoding in the LH* RS files. We now introduce these principles.
We formally treat an insert and a deletion as special cases of the update. In particular, if b_j is the old symbol, then we calculate the new symbol b'_j in parity record j as

$$b'_j = b_j \oplus \Delta_i \cdot p_{i,j}, \tag{3.3}$$

where ∆_i is the difference between the new and the old symbol in the updated data record i, and p_{i,j} is the coefficient of P located in the i th row and j th column.
The ∆-record is the string obtained as the XOR of the new and the old symbols with the same offset within the non-key field of the updated record. For an insert or a delete, the ∆-record is the non-key data. We implement the parity updating operation resulting from an update of a data record with key c and rank r as follows. The LH* RS data bucket computes the ∆-record and sends it, together with c and r, to all the parity buckets of the record group. Each bucket sets the B field value according to (3.3). It then either updates the existing parity record r or creates it. Likewise, the data record deletion updates the B field of the parity record r or removes all records r in each parity bucket. We discuss these operations more in depth in Section 5.6 and 5.7.
As we have seen, the l th parity bucket in a bucket group only needs the column p l of P for the encoding. A parity bucket stores therefore basically only this column. Obviously, the first parity bucket does not have to store p 1 if it is the column of ones as above.
Notice that our parity updating needs only one data record in the group, i.e., the updated one. This property is crucial to the efficiency of the encoding scheme. In particular, update speed is independent of m. Its theoretical basis is that our coding scheme is systematic. We elaborate more on it while discussing alternate codes in Section 7.5.1.
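The parity updating rule can be checked with a small sketch: applying the ∆-record to an existing B-field gives the same result as re-encoding the whole record group, although it touches only the updated record. The code below is an illustration under assumptions: GF(2^8) with polynomial 0x11D, a bitwise multiplication instead of the log tables, and a hypothetical parity column whose last two coefficients are invented (only the leading 1 and the 1a for D1 come from the text).

```python
# Sketch only: incremental parity update via the delta-record, per (3.3).
# Assumptions: GF(2^8), polynomial 0x11D; helper names are hypothetical.

def gf_mul(a, b, poly=0x11D):
    """Shift-and-add multiplication in GF(2^8)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return r

def encode(records, column):
    """Full encoding of one B-field; needs all m data records of the group."""
    b = bytearray(max(len(rec) for rec in records))
    for rec, p in zip(records, column):
        for off, sym in enumerate(rec):
            b[off] ^= gf_mul(sym, p)
    return bytes(b)

def delta_record(old, new):
    """XOR of new and old non-key data, offset by offset (zero padded)."""
    n = max(len(old), len(new))
    return bytes(x ^ y for x, y in zip(old.ljust(n, b"\0"), new.ljust(n, b"\0")))

def apply_delta(b_field, delta, p):
    """b' = b XOR delta * p, using only the delta-record and one coefficient."""
    n = max(len(b_field), len(delta))
    out = bytearray(b_field.ljust(n, b"\0"))
    for off, d in enumerate(delta):
        out[off] ^= gf_mul(d, p)
    return bytes(out)

# Hypothetical column of P for one parity bucket (rows D0..D3).
COLUMN = [0x01, 0x1A, 0x3B, 0x37]
```

An insert is the special case where the old record is the zero string, so the ∆-record equals the inserted non-key data.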

Example 3
We continue with the running example. We consider a file of four data buckets D0, D1, D2, and D3 forming the bucket group of size m = 4. We also consider three parity buckets P0, P1, P2 corresponding to the columns of matrices P and Q in Figure 5. We now insert one by one the records from Example 2. We assume they end up in successive buckets and form a record group. At the end, we also update the record in D1. Figure 6 shows vertically each non-key field of a data record in the group, and the evolution of the B-fields, also represented vertically. It thus also illustrates the matrices A, A' and B, B' for each parity updating operation we perform.

Figure 6a shows the insert of the first record into D0, the ∆-record being the difference between this string and the previous non-key data string, which is here the zero string. Since the first row of P consists of ones, we calculate the content of each parity bucket by XORing the ∆-record to its previous content. As there were no parity records for our group yet, each B-field gets the ∆-record and we create all three records. Figure 6b shows the evolution after the insert of "In principio …" into D1. The ∆-record is again identical to the data record. At P0, the existing parity record is XORed with the ∆-record. At P1, we multiply the ∆-record by '1a' and we XOR the result with the existing string. The '1a' is the P-coefficient located in Figure 5 in the second row (corresponding to D1) and the second column (corresponding to P1). We update the parity data in P2 similarly, except that we multiply by '1c'.

Using Generator Matrix
The decoding calculus uses the concept of a generator matrix. Let I be an m × m identity matrix and P a parity matrix. The generator matrix G for P is the concatenation I|P. We recall from Section 3.2.1 that we organize the data records in a matrix A. Let U denote the matrix A⋅G. U is the concatenation (A|B) of matrix A and matrix B from the previous section. We refer to each line u = (a_1, a_2, ..., a_m, a_{m+1}, ..., a_n) of U as a code word. The first m coordinates of u are the coordinates of the corresponding line vector a of A. We recall that these are the data symbols with the same offset in all the data records in the record group. The remaining k coordinates of u are the newly generated parity codes. A column u' of U corresponds to an entire data or parity record.
A crucial property of G is that any m by m square submatrix H is invertible. (See Section 4.2 for the proof.) We use this property for reconstructing up to k unavailable data or parity records. Consider first that we wish to recover only data records. We form a matrix H from any m columns of G that do not correspond to the unavailable records. Let S be A⋅H. The columns of S are the m available data and parity records we picked in order to form H. Using any matrix inversion algorithm, we compute H^-1. Since A⋅H = S, we have A = S⋅H^-1. We thus can decode all the data records in the record group. Hence, we can decode in particular our k data records. In contrast, we cannot perform the decoding if more than k data or parity records are unavailable. We would not be able to form any square matrix H of size m.

In general, if there are unavailable parity records, we can decode the data records first and then re-encode the unavailable parity records. Alternatively, we may recover these records in a single pass. We form the recovery matrix R = H^-1⋅G. Since S = A⋅H, we have A = S⋅H^-1, hence U = A⋅G = S⋅H^-1⋅G = S⋅R. Although the recovery matrix has m rows and n columns, we only need the columns of the unavailable data and parity records.
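The decoding step A = S⋅H^-1 can be sketched for one symbol offset as follows. The sketch is not the paper's construction: as a stand-in for P it uses a Cauchy matrix, which also guarantees that any m columns of G = I|P are invertible, and it assumes GF(2^8) with polynomial 0x11D. All function names are hypothetical. The inversion uses Gauss-Jordan elimination over the field.

```python
# Sketch only: decode the m data symbols of one offset from any m survivors.
# Assumptions: GF(2^8) with polynomial 0x11D; P is a Cauchy matrix used as a
# stand-in for the paper's parity matrix (any m columns of I|P are invertible).

def gf_mul(a, b, poly=0x11D):
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return r

def gf_inv(a):
    """a^254 = a^-1 for nonzero a in GF(2^8)."""
    r = 1
    for _ in range(254):
        r = gf_mul(r, a)
    return r

def mat_inv(h):
    """Gauss-Jordan inversion of a square matrix over GF(2^8)."""
    n = len(h)
    aug = [row[:] + [int(i == j) for j in range(n)] for i, row in enumerate(h)]
    for col in range(n):
        piv = next(r for r in range(col, n) if aug[r][col])
        aug[col], aug[piv] = aug[piv], aug[col]
        inv = gf_inv(aug[col][col])
        aug[col] = [gf_mul(inv, v) for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col]:
                f = aug[r][col]
                aug[r] = [v ^ gf_mul(f, w) for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def vec_mat(v, mat):
    """Row vector times matrix; addition in the field is XOR."""
    out = [0] * len(mat[0])
    for x, row in zip(v, mat):
        for j, c in enumerate(row):
            out[j] ^= gf_mul(x, c)
    return out

M, K = 4, 3
P = [[gf_inv(i ^ (M + j)) for j in range(K)] for i in range(M)]   # Cauchy stand-in
G = [[int(i == j) for j in range(M)] + P[i] for i in range(M)]    # G = I | P

def decode(available, symbols):
    """available: m column indices of G; symbols: the surviving symbols s."""
    h = [[G[i][c] for c in available] for i in range(M)]
    return vec_mat(symbols, mat_inv(h))                            # a = s * H^-1
```

For example, with the first three data columns lost, `decode([3, 4, 5, 6], …)` rebuilds all four data symbols from the last data symbol and the three parity symbols.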
Our basic scheme in the prototype uses Gaussian elimination to compute H^-1. It also decodes data buckets before recovering parity buckets. Our generic matrix P' has 128 rows and 129 columns for GF(256). As we said, to encode a group of size m < 128, we use only the top left corner of P'. Formally, a row of data symbols of such a group is a vector a = (b | o) of m' coordinates, where b holds the m actual data symbols and o is the (m' - m)-dimensional zero vector. We split G' similarly by writing:

$$G' = \begin{pmatrix} G_0 \\ G_1 \end{pmatrix}.$$

Here G_0 is a matrix with m rows and G_1 is a matrix with m' - m rows. We have u = a⋅G' = b⋅G_0 + o⋅G_1 = b⋅G_0. Thus, we only use the first m coefficients of each column for encoding.
Assume now that some data records are unavailable in a record group, but m records among m + k data and parity records in the group remain available. We can now decode all the m data records of the group as follows. We assemble the symbols with offset l from the m available records in a vector b_l. The order of the coordinates of b_l is the order of columns in G. Similarly, let x_l denote the word consisting of m data symbols with the same offset l from m data records, in the same order. Some of the values in x_l are from the unavailable buckets and thus unknown. Our goal is to calculate x from b.
To achieve this, we form an m' by m' matrix H' with at the left the m columns of G' corresponding to the available data or parity records and then the m'-m unit vectors formed by the column from the I portion of G' corresponding to the dummy data buckets.
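This gives H' a specific form:

$$H' = \begin{pmatrix} H & O \\ Y & I \end{pmatrix}.$$

Here H consists of the first m rows of the m selected columns of G', Y consists of their remaining m' - m rows, O is a zero matrix, and I is the identity matrix of size m' - m. That is, (x | o)⋅H' = (b | o). According to a well-known theorem of Linear Algebra, for matrices of this form

$$H'^{-1} = \begin{pmatrix} H^{-1} & O \\ -Y \cdot H^{-1} & I \end{pmatrix},$$

hence x = b⋅H^{-1}. The last equation tells us that we only need to invert the m-by-m matrix H. This is precisely the desired submatrix H cut out from the generic one. This concludes our proof.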

Example 4
Consider the situation where the first three data buckets in Example 3 are unavailable.
We collect the columns of G corresponding to the remaining four buckets in matrix: We invert H to obtain:

Constructing a Generic Generator Matrix
We now show the construction of our generic generator matrix G', illustrated in Figure 9.
Matrix G used in Example 4 above is derived from G'. The construction also provides our matrix P' as a byproduct. Let a_0, a_1, …, a_{l-1} be l elements of any field. It is well known, see, e.g., [MS97], that the determinant of the l-by-l matrix that has the i th power of element a_j in row i and column j is:

$$\det\bigl(a_j^{\,i}\bigr)_{0 \le i,j \le l-1} = \prod_{0 \le i < j \le l-1} (a_j - a_i).$$

If the elements a_j are all different, then the determinant is not zero and the matrix is invertible.
We start constructing G' by forming a matrix V with n + 1 columns and m' rows. We recall that the record group size for LH* RS is a power of 2. For the reasons already discussed for P', for GF(256), our I' matrix is 128 by 128 and P' is 128 by 129.
Hence, our G' is 128 by 257. Notice that there is no need to store I' or even I. We recall also that Figure 17 shows the leftmost 20 columns of P' produced by the algorithm using the above calculus. Likewise, Figure 14 shows a fragment of P' computed for GF(2^16).
Finally, notice that there is no need to store even these columns. At recovery time, they can obviously be reconstructed dynamically from the columns of Q' or Q, which are available for the encoding anyway.

This may speed up a key search by recovering the single data record or by finding that it is not in the file. If a bucket recovery is already in progress, then the coordinator waits until it finishes or starts the record recovery anyway. Once the record recovery alone or the entire bucket recovery terminates successfully, the coordinator completes the requested operation.
We first present the bucket and record recovery operations. Next, we describe the application interface. The operations are file creation and removal, key search, non-key search (scan), and record insert, update or delete. Except for the file creation and removal, any of these operations may enter the degraded mode thus triggering at least a bucket recovery. Finally, we discuss the bucket split and merge that adds or removes buckets from the file, invisibly to the application. These can also enter degraded mode.

Bucket Recovery
The coordinator starts the bucket recovery by probing the m data and k parity buckets of the bucket group of bucket a for availability. Typically, k = K, unless the last change of K was not yet posted to all the groups or the group is a transitional one. The group availability level may then still be k = K - 1. The probe may find several unavailable buckets. If the coordinator finds l ≤ k unavailable buckets, then the failure is not catastrophic. Otherwise, the coordinator halts and reports the catastrophic failure to the user. It may still be possible to recover some records, but the case is beyond our scheme.
If the failure is not catastrophic, the coordinator starts the bucket group recovery. If l = 1, the group recovery reduces to recovering bucket a. This should be the most frequent case. Otherwise, the operation also recovers all other unavailable data or parity buckets of the group. The manager first recreates at each spare the complete, although still empty, structure of the bucket to be recovered there. Next, it collects the columns of P it needs for L_A, according to Section 4.1. It reads these from the parity buckets in L_A. It then forms matrix H. This may include dummy columns if the group is the last one in the file and not all m data buckets exist yet. Then, if the first parity bucket is not the sole parity bucket to use, the manager calculates H^-1. Next, it loops over the record group recovery that produces all the unavailable records of one group. It first produces all the data records and then the parity records, provided there is any parity bucket number in L_S. The loop is over all the ranks of the parity records in one of the parity buckets in L_A. The manager chooses the bucket and reads all its records one by one. If only one data bucket is unavailable and the first parity bucket is in L_A, then the manager skips the H^-1 calculation and the GF multiplications to recover the data bucket. We recall that the decoding is then faster, using XOR only.
During the loop, for each parity record encountered, the manager explores its group structure field C, Figure 1. For every non-null key c i , it requests the data record c i from its bucket, provided it is in L A (as already discussed, the bucket is also the i th in the group).
The manager decodes the non-key fields of unavailable data records in the group. It uses XORing for the first parity bucket, and/or H^-1. Next, using the C-field, it reconstructs the keys of the recovered records. Then, if there is any parity bucket to recover, it requests the missing Q columns from the coordinator and encodes the unavailable parity records using the m data records in the recovered group. Finally, the manager sends the recovered records (in bulks, as we discuss later) to the spares for insertion into the recovered buckets.
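The XOR-only shortcut for the most frequent case (one unavailable data bucket, first parity bucket available) can be illustrated on a single record group. This is a minimal sketch with hypothetical names; it relies only on the first parity column consisting of ones, so that the first B-field is the XOR of the non-key fields.

```python
# Sketch only: recover one lost data record from the first parity record
# and the surviving data records, using XOR alone.

def xor_bytes(*strings):
    """XOR byte strings offset by offset, zero padding the shorter ones."""
    out = bytearray(max(len(s) for s in strings))
    for s in strings:
        for i, b in enumerate(s):
            out[i] ^= b
    return bytes(out)

def recover_with_first_parity(parity0, surviving):
    """The lost non-key field, possibly followed by zero padding."""
    return xor_bytes(parity0, *surviving)

data = [b"En arche en o logos", b"In the beginning was the word",
        b"Au commencement etait le mot", b"Am Anfang war das Wort"]
parity0 = xor_bytes(*data)                 # first B-field: column of ones
lost = 2
surviving = [d for i, d in enumerate(data) if i != lost]
recovered = recover_with_first_parity(parity0, surviving)
```

The recovered string carries trailing zero symbols up to the length of the longest record in the group; a real bucket would trim them using the stored record length.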
Once the bucket group recovery ends, the manager pushes the addresses of the recovered buckets to the existing buckets in the group so that they can update their location tables. It finally successfully returns the control to the coordinator. The coordinator considers the new buckets as ready to use. It updates accordingly its server addresses. A client or server will get the new address of a recovered bucket when it finds the bucket displaced.
It is perhaps worth recalling here the alternate decoding algorithm: for a record group, multiplication with the appropriate submatrix of the recovery matrix R = H^-1⋅G can decode all unavailable data and parity records in the group at once. Since typically only one bucket is lost, our choice is however typically faster.
As mentioned above, our bucket group recovery moves records in bulks. Formerly, we transferred them individually using UDP. This turned out to be much less effective than the current approach with the TCP/IP in passive mode (Section 6.3).

Record Recovery
The operation results from the key search for record c (Section 5.4 below) that localizes it in an unavailable bucket a. The result for the application is nevertheless the delivery of record c alone, or the reply that it is not in the file. We determine the latter by consulting a parity bucket of the group.

Record recovery access time to an unavailable data record is typically much shorter than the time to recover the whole bucket group (which we do anyhow). See Sections 6.3.6 and 6.3.7 below. For typically even better record recovery times, we need to avoid the above-mentioned scan within a parity bucket, which at present searches for key c. This requires an index binding rank r to key c, or an algorithm making rank r a function of c, etc. The obvious trade-off is some storage and run-time overhead. This variation is a candidate for future work.

File Creation and Removal
The client creates an LH* RS file F as an empty data bucket 0. File creation sets the parameters m and K. The latter is typically set to K = 1. The SDDS manager at bucket 0 becomes the coordinator for F. The coordinator initializes the file state to (i = 1, n = 0).
The coordinator creates also K empty parity buckets, to be used by the first m data buckets, which will form the first bucket group. The coordinator stores column i of P with the i th parity bucket, with the exception of the first parity bucket (using P' to generate these columns). There is no degraded mode for the file creation operation.
Notice however that the operation fails if K + 1 available servers cannot be found.
If the application requests the removal of the file, the client sends the request to the coordinator. The coordinator acknowledges the operation to the client. It also forwards the removal message to all data and parity buckets. Every node acknowledges it. The unresponsive servers enter an error list to be dealt with beyond the scope of our scheme.

Key Search
In normal mode, LH* RS searches for a key with the LH* key search algorithm (Section 2.1.2). The client or the forwarding server triggers degraded mode if it encounters an unavailable bucket, called a_1. It then passes the control to the coordinator.
The coordinator starts the recovery of bucket a_1. It also uses the LH* file state parameters to calculate the address of the correct bucket for the record, call it bucket a_2. If a_2 = a_1, the coordinator also starts the record recovery. If a_2 ≠ a_1, and bucket a_2 was not found to be unavailable during the probing phase of the bucket a_1 recovery, then the coordinator forwards c to bucket a_2. If bucket a_2 is available, it replies to the LH* RS client as in the normal mode, including the IAM. If the coordinator finds it unavailable and the bucket is not yet being recovered, e.g., is in another group than bucket a_1, then the coordinator starts the recovery of bucket a_2 as well. It then also performs the record recovery.
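The coordinator's decisions in degraded key search can be sketched as below. The sketch assumes the usual linear hashing functions h_i(c) = c mod 2^i, which this section does not restate, and it returns a list of symbolic actions instead of sending messages. The function and action names are hypothetical.

```python
# Sketch only: coordinator-side decisions for a key search in degraded mode.
# Assumption: h_i(c) = c mod 2^i, as in basic linear hashing.

def lh_address(c, n, i):
    """Correct bucket for key c in file state (n, i)."""
    a = c % (2 ** i)
    if a < n:                       # bucket a has already split at this level
        a = c % (2 ** (i + 1))
    return a

def degraded_key_search(c, a1, n, i, unavailable):
    """a1: unavailable bucket that triggered degraded mode.

    unavailable: buckets found unavailable while probing for the recovery of a1.
    A real coordinator would also skip a recovery that is already in progress.
    """
    actions = [("recover_bucket", a1)]
    a2 = lh_address(c, n, i)
    if a2 == a1:
        actions.append(("recover_record", c))
    elif a2 not in unavailable:
        actions.append(("forward_search", a2))
    else:
        actions.append(("recover_bucket", a2))
        actions.append(("recover_record", c))
    return actions
```

For a file in state n = 4, i = 3, a client with an outdated image may address key 59 to bucket 3, whereas the correct bucket computed by the coordinator is 11.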

Scan
A scan returns all records in the file that satisfy a certain query Q in their non-key fields.
A client performing a scan sends Q to all buckets in the propagation phase. Each server executes Q and sends back the results during the termination phase. The termination can be probabilistic or deterministic, [LNS96]. The choice is up to the application.

Scan Propagation
The client sends Q to all the data buckets in its image, using unicast, or broadcast when possible. Unicast messages only reach the buckets in the client image. LH* RS then applies the following LH* scan propagation algorithm in the normal mode. The client sends Q with the message level j' attached. This is the presumed level j of the recipient bucket, according to the client image. Each recipient bucket executes Algorithm (A4) below. A4 forwards Q recursively to all the buckets that are beyond the client image. Each of these must result, perhaps recursively through parents that are also beyond the image, from a split of exactly one of the buckets in the image.

Algorithm A4: Scan Propagation
The client executes: In normal mode, Algorithm (A4) guarantees that the scan message arrives at every bucket exactly once [LNS96]. We detect unavailable buckets and enter degraded mode in the termination phase.

Example
Assume that the file consists of 12 buckets 0, 1, … 11. The file state is n = 4 and i = 3.
Assume also that the client has still the initial image (n', i') = (0,0). According to this image, only bucket 0 exists. The client sends only one message (Q,0) to bucket 0.
Bucket 3 receives (Q,2) and forwards with level 3 to bucket 7 and with level 4 to bucket 11. The remaining buckets receive messages with a message level equal to their own level and do not forward.
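The propagation in this example can be simulated with the short sketch below. It assumes the usual LH* forwarding rule, which is consistent with the messages quoted for bucket 3: a bucket of level j that receives message level j' < j repeatedly increments j' and forwards (Q, j') to bucket a + 2^(j'-1). The function names are hypothetical, and the simulation ignores unavailable buckets.

```python
# Sketch only: simulate scan propagation over an LH* file in state (n, i).
# Assumption: standard LH* forwarding rule, as described in the lead-in.

def bucket_level(a, n, i):
    """Level j of bucket a: already split buckets and their children are at i + 1."""
    return i + 1 if a < n or a >= 2 ** i else i

def client_messages(n_img, i_img):
    """(bucket, message level j') pairs sent by a client with image (n', i')."""
    return [(a, bucket_level(a, n_img, i_img)) for a in range(2 ** i_img + n_img)]

def propagate(n, i, n_img, i_img):
    """Return {bucket: [message levels received]} for the whole scan."""
    received = {}
    queue = client_messages(n_img, i_img)
    while queue:
        a, j_msg = queue.pop(0)
        received.setdefault(a, []).append(j_msg)
        j = bucket_level(a, n, i)
        while j_msg < j:            # forward to every child the client does not know
            j_msg += 1
            queue.append((a + 2 ** (j_msg - 1), j_msg))
    return received

received = propagate(n=4, i=3, n_img=0, i_img=0)
```

With the initial image (0, 0), the single client message reaches all 12 buckets, each exactly once, and bucket 3 receives level 2 before forwarding to buckets 7 and 11.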

Scan Termination
A bucket responds to a scan with probabilistic termination only if it has a relevant record.
The client assumes that the scan has successfully terminated if no message arrives after a timeout following the last reply. A scan with probabilistic termination does not have the degraded mode. The operation cannot always discover indeed the unavailable buckets.
In deterministic termination mode, every data bucket sends at least its level j.

Example
For a change, we now consider a file in state (n, i) = (0, 3), hence with 8 buckets 0, 1, …, 7.
The client issues a scan Q with the deterministic termination. It gets replies with bucket levels j from buckets 0, 1, 2, 4, 5, and 6. None of the termination conditions are met.
Condition (i) fails because, among others, j_0 > j_4. Likewise, condition (ii) cannot become true until bucket 3 replies. The client waits for further replies.
Consider now that no further bucket replies within the time-out. The client alerts the coordinator and sends Q and the addresses in list L = {0, 1, 2, 4, 5, 6}. Based on the file state and L, the coordinator determines that buckets 3 and 7 are unavailable. Since the loss is not catastrophic (for m = 4 and k = 1 in each group concerned), the coordinator now launches the recovery of buckets 3 and 7. Once this has succeeded, the coordinator sends the scan to these buckets. Each of them finally sends its reply with its j_a, perhaps some records, and its (new) address. The client adjusts its image to (n', i') = (0, 3) and refreshes the location data for buckets 3 and 7.
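The coordinator's reasoning in this example amounts to a set difference followed by a per-group count. The sketch below assumes that bucket group g holds data buckets g⋅m to g⋅m + m - 1 and, for brevity, counts only unavailable data buckets, not parity buckets. The function names are hypothetical.

```python
# Sketch only: find the buckets that did not reply to a scan and test
# whether the loss is catastrophic for any bucket group.

def unavailable_buckets(n, i, replied):
    """Data buckets of a file in state (n, i) missing from the reply list L."""
    return sorted(set(range(2 ** i + n)) - set(replied))

def is_catastrophic(missing, m, k):
    """True if some group of m data buckets has more than k missing buckets."""
    per_group = {}
    for a in missing:
        per_group[a // m] = per_group.get(a // m, 0) + 1
    return any(count > k for count in per_group.values())

missing = unavailable_buckets(n=0, i=3, replied=[0, 1, 2, 4, 5, 6])
```

Here the two missing buckets fall into different groups, so each group has lost a single bucket and remains recoverable with k = 1.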

Insert
In normal mode, an LH* RS client performs an insert like an LH* client. The client sends the insert request to the bucket determined by the data record key c and the client's image. The client waits for an acknowledgement to terminate. If it does not come within a timeout, then the client sends the insert to the coordinator and the operation enters degraded mode.
The receiving data bucket follows algorithm A2 (Section 2.1.2) by forwarding the request if necessary. If the correct data bucket receives the insert, it stores the record as for an LH* file. If the data bucket overflows, the bucket informs the coordinator. In addition, it assigns a rank r to the record. Next, it sends the ∆-record (with key) c and r to the k parity buckets. Recall that the ∆-record is essentially the inserted record. The data bucket then waits for the k acknowledgements from all parity buckets. Inserts (as well as updates and delete operations) need to maintain coherency between parity records and data records. That is, a data record should be changed with all its parity records or not at all. Otherwise, the k-availability of all records in the group is no longer guaranteed. For instance, assume that (i) we have a group with k = 2, (ii) a data record was inserted, but finally only one of the parity records was updated. If the data record and one of the parity records become unavailable, then we can retrieve the inserted record if the updated parity record is still available, otherwise not. As we will see, the situation is even more difficult for updates. As the general rule, in order to maintain k-availability, we need to perform any insert (update / delete) operation at the data bucket and all k parity buckets. In other words, a change should be committed simultaneously at all buckets involved.
The commit process between the data and the parity buckets differs for k = 1 and k > 1. In the former case, we use an implicit 1-phase commit (1PC). The parity bucket simply acknowledges the reception of the ∆-record and creates or updates its record r. The data bucket then acknowledges the insert to the client, which can finally notify the application. The server or the client enters the degraded mode if any of the expected messages does not arrive in time.
We now discuss the case k > 1. 1PC no longer guarantees that all parity records and the data record are updated. We therefore use a variant of 2PC that guarantees that the ∆-record c updates all or none of the k buckets. The data bucket sends the ∆-record (with key) c and r to all k parity buckets. Each parity bucket starts the commit process by acknowledging the reception of the message with ∆-record c and with r. This confirmation constitutes the "ready-to-commit" message of 2PC. Each parity bucket encodes the record as usual into parity record r, but it retains the ∆-record in a differential file (buffer) for a possible rollback. If the data bucket gets all k "ready-to-commit" messages, it acknowledges to the client after it sends out the "commit" message to the k buckets. Each bucket that receives the message discards the ∆-record. Notice that one could also use a more sophisticated scheme based on collective acknowledgements as in TCP/IP, but this variation is beyond our scope at present.
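The 2PC variant for k > 1 can be sketched with in-memory objects standing in for the parity buckets. This is a simplified illustration: message passing and timeouts are replaced by method calls, the field is assumed to be GF(2^8) with polynomial 0x11D, and all class and function names are hypothetical. A missing "ready-to-commit" only yields the status "degraded"; the retained ∆-records stay in place for the coordinator to use.

```python
# Sketch only: ready-to-commit / commit exchange between a data bucket
# and k parity buckets, with the delta-record kept for a possible rollback.

def gf_mul(a, b, poly=0x11D):
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return r

def apply_delta(b_field, delta, coeff):
    n = max(len(b_field), len(delta))
    out = bytearray(b_field.ljust(n, b"\0"))
    for off, d in enumerate(delta):
        out[off] ^= gf_mul(d, coeff)
    return bytes(out)

class ParityBucket:
    def __init__(self, coeff):
        self.coeff = coeff      # coefficient of P for the sending data bucket
        self.records = {}       # rank r -> B-field
        self.pending = {}       # differential file: rank r -> delta-record
        self.up = True

    def prepare(self, r, delta):
        """Encode the delta into record r, keep it, answer ready-to-commit."""
        if not self.up:
            return False
        self.records[r] = apply_delta(self.records.get(r, b""), delta, self.coeff)
        self.pending[r] = delta
        return True

    def commit(self, r):
        self.pending.pop(r, None)           # discard the delta-record

    def rollback(self, r):
        delta = self.pending.pop(r, None)
        if delta is not None:               # XOR is its own inverse
            self.records[r] = apply_delta(self.records[r], delta, self.coeff)

def insert(parity_buckets, r, delta):
    ready = [pb.prepare(r, delta) for pb in parity_buckets]
    if not all(ready):
        return "degraded"                   # alert the coordinator
    for pb in parity_buckets:
        pb.commit(r)
    return "committed"
```

Because XOR is self-inverse, applying the retained ∆-record a second time restores the previous B-field, which is what makes the differential file sufficient for a rollback.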
The degraded mode starts when any of the buckets involved cannot get a response it waits for. The data bucket enters the degraded mode if it lacks any of the acknowledgments from the parity buckets. It alerts the coordinator, transmitting r and the number p of the unavailable parity bucket (its column index in P). The coordinator probes the group for the availability of m buckets. It also probes all the parity buckets of the group, except bucket p, whether they have c or not. In the latter case, such a bucket either has c in the key-field C and the ∆-record in the differential file, or only has c, or lacks both. In every case, provided the coordinator finds m available buckets, it synchronizes all available parity buckets so that all reflect the insert. Then, the coordinator recovers the group and finally acknowledges to the client.
Another degraded case occurs when parity bucket p does not receive a message from the data bucket. The data bucket must then have just failed and bucket p must be in the "ready-to-commit" state and must have the ∆-record. It then alerts the coordinator, sending out ∆-record c and r. The coordinator probes the bucket group for the recoverability. If the probe is successful in finding the required number of available data buckets, the coordinator synchronizes all parity buckets so that all have processed the insert. The recovery process can now proceed. The recovered data bucket will contain the inserted record. Finally, the coordinator sends an acknowledgement to the client.
Next, the client might detect that the data bucket has failed because of a lacking acknowledgement. The client informs the coordinator. After the coordinator determines the availability of buckets in the group, it synchronizes the parity buckets with regard to the insert. It might find that the data bucket never sent any messages to a parity bucket, because it failed before receiving the original insert command from the client or because it failed before it could forward the ∆-record. Alternatively it might find that all available parity buckets have already committed the insert. Otherwise, a parity bucket would have informed the coordinator of the data bucket's unavailability. In all the cases, the coordinator can determine the state because either the record key c is in the parity record or not. After synchronization at the parity records, all unavailable buckets in the group are recovered and the insert is finally acknowledged to the client.
Finally, we have to deal with the simultaneous unavailability of client and data bucket. In this case, either all (available) parity buckets have not received the ∆-record or all have committed the insert. Only a later file operation will discover that the data bucket is unavailable. Depending on the shared state of the parity records in regard to the insert, the data bucket will be recovered without or with the inserted record.
An insert in a degraded mode where the correct data bucket was unavailable may generate an overflow at the recovered bucket. The new bucket itself alerts the coordinator to perform a split.

Delete
In the normal mode, the client performs the delete of record c as for LH*. In addition, the correct bucket sends the ∆-record, the rank r of the deleted record, and key c to the k parity buckets. Each bucket confirms the reception and removes key c. If c is the last actual key in the list, then the parity bucket deletes the entire parity record r. Otherwise, it adjusts the B-field of the parity record to reflect that there is no more record c in the record group.
The data bucket communicates with the parity buckets using the 1PC or the 2PC. The latter is as for an insert, except for the inverse result of the key c test. As for the insert for k > 1, the parity buckets keep also the ∆-record till the commit message. More generally, the degraded mode for a delete is analogous to that of an insert.

Update
An update operation of record c changes its non-key field. In the normal mode, the client performs the update as in LH*. The client sends the record with its key c and the new value of the non-key field. The data bucket uses c to look up the record, determines its rank r, calculates the ∆-record, and sends both to all k parity buckets. These recalculate the parity records. Finally, the data bucket commits the operation.
As for inserts and deletes, 1PC suffices only for k = 1. But for k > 1, 2PC as used for inserts and deletes is no longer sufficient. With that protocol, we cannot always determine whether a parity record has actually been updated.
Our basic 2PC version for updates works as follows. The data bucket sends the ∆-record with key c and rank r to the k > 1 parity buckets. But now the messaging follows the order in P. Each parity bucket starts the commit process by acknowledging c and r.
As for an insert, the acknowledgement constitutes the "ready-to-commit" message of 2PC. Each parity bucket encodes the record as usual but also keeps the ∆-record in its differential file. If the data bucket gets all k "ready-to-commit" messages, it acknowledges to the client. It also sends out "commit" messages to the k buckets.
However, it does so one at a time, waiting for each previous acknowledgement before sending to the next (in the order in P) parity bucket. Once a receiving bucket gets this message, it discards the ∆-record.
Assume now that the coordinator is alerted because of the loss of parity bucket p after p entered the commit phase. The coordinator can find that all parity buckets before p have committed and that the ones after p have not. Therefore, the coordinator can synchronize the parity buckets accordingly.
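The ordering argument above can be illustrated with a small simulation. It is a sketch only: the state names and the functions `send_commits` and `coordinator_view` are hypothetical, and it stops at classifying the buckets, since the synchronization step itself is not detailed here.

```python
def send_commits(states, fail_at=None):
    """Send 'commit' to the parity buckets one at a time, in the order in P.

    states: list of per-bucket states, all 'ready' after the first 2PC phase.
    fail_at: index of a bucket that does not acknowledge (None if all reply).
    Returns the index of the unavailable bucket, or None.
    """
    for idx in range(len(states)):
        if idx == fail_at:
            states[idx] = "unavailable"
            return idx          # the data bucket stops here and alerts
        states[idx] = "committed"  # this bucket discards its delta-record
    return None

def coordinator_view(states, failed):
    """Classify the available buckets as the coordinator would after probing."""
    committed = [i for i, s in enumerate(states) if s == "committed"]
    pending = [i for i, s in enumerate(states) if s == "ready"]
    # Invariant given by the ordered commit: committed before p, pending after p.
    assert all(i < failed for i in committed)
    assert all(i > failed for i in pending)
    return committed, pending

states = ["ready"] * 4
failed = send_commits(states, fail_at=2)
committed, pending = coordinator_view(states, failed)
```

Because the commits are serialized, the position of the lost bucket p alone tells which buckets still hold the ∆-record in their differential files.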
It is easy, but tedious, to prove the correct termination for an update with this protocol for all the other cases of the degraded mode. These are as for the insert. We avoid discussing them here. Let us say only that the algorithms are quite similar, although knowing the position of the alerter may make some faster.

Split
As in LH*, if an insert to an LH* RS data bucket a overflows a, then a alerts the coordinator. The coordinator starts the split operation of bucket n, identified by the split pointer. Typically, we have n ≠ a. In the normal mode, the coordinator first locates an available server and allocates there the new data bucket N, where N denotes the number of data buckets in the file before the split. Bucket N is usually in a bucket group different from that of bucket n, unless the file is small and N < m. If N is the first bucket in its group, then the coordinator allocates K new, empty parity buckets. If K exceeds the availability level k of the bucket group containing bucket n, then the coordinator also allocates an additional K-th parity bucket for that group. Provided all this proceeds normally, the coordinator sends the split message to bucket n with all the corresponding addresses. This hands control of the split to bucket n. The coordinator nevertheless waits for the final commit message. Bucket n sends to data bucket N all the data records whose address changes when rehashed using $h_{j+1}$. We recall from Section 2.1.1 that our implementation sends these records in bulk.
For each data record that moves, bucket n finds its rank r, produces a ∆-record that is actually identical to the record itself, and requests its deletion from the parity records r in all the k buckets of its group. It also assigns new successive ranks r', starting from r' = 1, to the remaining data records. Bucket n then sends both ranks with each ∆-record to the K parity buckets. At the k existing buckets, it requests the deletion of the ∆-record from parity record r and its insertion into parity record r'. A new K-th parity bucket, if there is one, disregards the delete requests.
At data bucket N, the bucket requests the inserts into its K parity buckets with the successive ranks it assigns. Once the split processing terminates at bucket N, N reports this to the waiting coordinator.
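The record movement and the rank reassignment at bucket n can be sketched as follows. The sketch assumes the usual linear hashing function h_j(c) = c mod 2^j with integer keys and a file starting from one bucket; the function names and the record layout are hypothetical.

```python
def h(level: int, key: int) -> int:
    """Assumed LH hash function: h_j(c) = c mod 2^j."""
    return key % (2 ** level)

def split_bucket(records: dict, n: int, j: int):
    """Split bucket n of level j.

    records: rank -> (key, data) for the records currently in bucket n.
    Returns (stay, moved, new_address) where
      stay:  new rank r' -> (key, data, old rank r), ranks restart at 1
      moved: list of (old rank r, key, data) to send to the new bucket
    """
    new_address = n + 2 ** j
    stay, moved = {}, []
    next_rank = 1
    for rank in sorted(records):
        key, data = records[rank]
        if h(j + 1, key) == n:
            stay[next_rank] = (key, data, rank)  # both ranks go to the parity buckets
            next_rank += 1
        else:
            moved.append((rank, key, data))      # deleted from parity record r
    return stay, moved, new_address

records = {1: (4, "a"), 2: (8, "b"), 3: (12, "c"), 4: (16, "d")}
stay, moved, new_address = split_bucket(records, n=0, j=2)
```

Each entry of `moved` corresponds to a delete request at the k existing parity buckets, and each entry of `stay` whose rank changed corresponds to a delete from parity record r followed by an insert into parity record r'.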
The operations on the parity buckets use 1PC for K = 1 and 2PC as described for the inserts and deletes otherwise. The degraded mode starts when a data or a parity bucket does not reply. The various cases are similar to those already discussed. Likewise, the 2PC termination algorithms are similar to those for an insert as well. We thus avoid the discussion of all these aspects of splitting here. Notice however that all unavailable buckets are reported, as the coordinator waits for the commit messages from both buckets n and N.
Once the split terminates successfully, the coordinator resets the value of n as already described in Section 2.1.1. Notice that bucket N then "officially" becomes bucket N − 1, since N := N + 1.

Merge
Deletions may decrease the number of records in a bucket under an optional threshold b' << b, e.g., 0.4 b. The bucket reports this to the coordinator. The coordinator may start a bucket merge operation. The merge removes the last data bucket in the file, provided the file has at least two data buckets. It moves the records in this bucket back to its parent bucket that has created it during its split. The operation increases the load of the file.
In the normal mode, for n > 0, the merge starts with setting the split pointer n to n := n − 1. For n = 0, it sets n := $2^{i-1} - 1$. Next, it moves the data records of bucket $n + 2^i$ (the last in the file) back into bucket n (the parent bucket). There, each record gets a new rank following consecutively the ranks of the records already in the bucket. The merge finally removes the last data bucket of the file, which is now empty. If n is set to 0, the merge may also decrease K by one. This happens if N decreases to a value that previously caused K to increase. Since merges are rare and merges that decrease K are even rarer, we omit discussion of the algorithm for this case.
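The pointer arithmetic of a merge can be sketched as below. The sketch assumes, as in standard linear hashing, that the file level i is decremented when the split pointer wraps around from n = 0; this is an assumption of the example, and `merge_step` is a hypothetical name.

```python
def merge_step(n: int, i: int):
    """Return (new n, new i, address of the bucket to remove) for one merge."""
    if n > 0:
        n -= 1
    else:
        if i == 0:
            raise ValueError("a file with a single data bucket cannot merge")
        i -= 1               # assumed: the file level drops when n wraps
        n = 2 ** i - 1       # i.e. 2^(i-1) - 1 in terms of the old level
    last = n + 2 ** i        # the last data bucket of the file
    return n, i, last

# File with 5 buckets (n = 1, i = 2): bucket 4 merges back into bucket 0.
step_a = merge_step(1, 2)
# File with 4 buckets (n = 0, i = 2): bucket 3 merges back into bucket 1.
step_b = merge_step(0, 2)
```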
The merge also updates the k parity buckets. This undoes the result of a split. The number of parity buckets in the bucket group can remain the same. If the removed data bucket was the only one in its group, then all the k parity buckets for this group are also deleted. The merge commits the parity updates using 1PC or 2PC, similarly to what we have discussed for splits.
As for the other operations, the degraded mode for a merge starts when any of the buckets involved does not reply. The sender, if other than the coordinator itself, alerts the latter. The various cases to be considered are similar to those already discussed. Likewise, the 2PC termination algorithms in the degraded mode are similar to those for an insert or a delete. As for the split, every bucket involved reports any unavailability. We omit the details.

PERFORMANCE ANALYSIS
We now discuss the storage, communication, and processing performance of the scheme.
As usual, we derive the formulae for the load factor, parity storage overhead, and the messaging costs. We discuss some design choices that appear. Next, we show the mostly experimental analysis of the processing times. A purely formal analysis of these did not seem useful, because of the practical complexity of our system. The response times also depend heavily on various implementation level choices, as we will show.

Storage Occupancy
The file load factor α is the ratio of the number of data records in the file over the capacity of the file buckets. The average load factor $\alpha_d$ of the LH* RS data buckets is that of LH*. Under the typical assumptions (uniform hashing, few overflow records…), we have $\alpha_d = \ln(2) \approx 0.7$. Data records in LH* RS may be slightly larger than in LH*, since it may be convenient to store the rank with them.
The parity overhead should be about k/m in practice. This is the minimal possible overhead for a k-available record or bucket group. Notice that parity records are slightly larger than data records, since they contain additional fields. If we neglect these aspects, then the load factor of a bucket group is typically $\alpha_g = \alpha_d / (1 + k/m)$.
The average load factor $\alpha_f$ of the file depends on its state. As long as the file availability level K' is the intended one K, we have $\alpha_f = \alpha_g$, provided N >> m so that the influence of the last group is negligible. The last group may contain fewer than m data buckets. If K' = K − 1, i.e., if the file is in the process of scaling to a higher availability level, then $\alpha_f$ depends on the split pointer n and file level i as follows:

$$\alpha_f \approx \alpha_d \left( \frac{2^i - n}{1 + (K-1)/m} + \frac{2n}{1 + K/m} \right) \Big/ \left( 2^i + n \right).$$

There are indeed 2n buckets in the groups with k = K and $(2^i - n)$ buckets in the groups with k = K'. Again, we neglect the possible impact of the last group. If $\alpha_g(k)$ denotes $\alpha_g$ for a given k, then $\alpha_f$ is slightly lower than $\alpha_g(K')$. It decreases progressively towards its lower bound for K', reaching it for n = $2^{i+1} - 1$. Then, if n = 0 again, K' increases to K, and $\alpha_f$ is $\alpha_g(K')$ again.
The increase in availability should in practice concern only relatively few N values of an LH* RS file. The practical choice of $N_1$ should indeed be $N_1$ >> 1. For any intended availability level K and group size m, the load factor of the scaling LH* RS file should therefore be in practice about constant and equal to $\alpha_g(K)$. That one is the highest possible load factor for the availability level K and $\alpha_d$. We thus achieve the highest possible $\alpha_f$ for any technique added upon an LH* file to make it K-available.
Our file availability scale-up to level K + 1 is incremental. Among the data buckets, it also accesses only the splitting bucket and the new one at any time. This strategy induces a storage occupancy penalty with respect to the best $\alpha_f(K)$, as long as the file does not reach the new level. The worst case for a K-available LH* RS file is then in practice $\alpha_f(K + 1)$. This value is in our case still close to the best for a (K + 1)-available file. It does not seem possible to achieve a better evolution of $\alpha_f$ for our type of incremental availability increase strategy.
The record group size m limits the record and bucket recovery times. If this time is of lesser concern than the storage occupancy, one can set m to a larger value, e.g., 64, 128, etc. Observe that for a given $\alpha_f$ and the resulting acceptable parity storage overhead, the choice of a larger m benefits the availability. While choosing for an $\alpha_f$ some $m_1$ and $k_1$ leads to a $k_1$-available file, the choice of $m_2 = l\,m_1$ allows for $k_2 = l\,k_1$, which provides an l times more available file. The penalty is however obviously an about l times greater messaging cost of bucket recovery, since m buckets have to be read. Fortunately, this does not mean that the recovery time also increases l times, as will appear. Hence, the trade-off can be worthwhile in practice.

Example
We now illustrate the practical consequences of the above analysis. Consider m = 8. The parity overhead is then (only) about 12.5 % for the 1-availability of the group, 25 % for its 2-availability etc.
We also choose uncontrolled scalable availability with $N_1$ = 16. We thus have a 1-available file, up to N = 16 buckets. We can expect $\alpha_f = \alpha_g(1) \approx 0.62$, which is the best for this availability level, given the load factor $\alpha_d$ of the data buckets. When N := 16, we set K := 2. The file still remains only 1-available until it scales to N = 32 buckets. In the meantime, $\alpha_f$ decreases monotonically to ≈ 0.56. At N = 32, K' reaches K and the file becomes 2-available. Then, $\alpha_f$ again becomes the best for the availability level and remains so until the file reaches N = 256. It thus stays optimal for a period fourteen times longer than the one during which the availability transition was in progress and the file load was below the optimal one of $\alpha_g(1)$. Then, we have K := 3, etc.
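The load factors quoted in this example can be checked numerically. The sketch below assumes the group load factor α_g(k) = α_d / (1 + k/m), which is consistent with the values 0.62 and 0.56 above, and uses the rounded α_d = 0.7; the function names are ours.

```python
def alpha_g(k: int, m: int, alpha_d: float = 0.7) -> float:
    """Load factor of a k-available bucket group of size m (assumed formula)."""
    return alpha_d / (1 + k / m)

def alpha_f(n: int, i: int, K: int, m: int, alpha_d: float = 0.7) -> float:
    """File load factor while the file scales from availability K-1 to K.

    n is the split pointer and i the file level; the last group is neglected.
    """
    old_groups = (2 ** i - n) / (1 + (K - 1) / m)  # buckets still at level K-1
    new_groups = 2 * n / (1 + K / m)               # buckets already at level K
    return alpha_d * (old_groups + new_groups) / (2 ** i + n)

m = 8
best_1 = alpha_g(1, m)             # about 0.62
best_2 = alpha_g(2, m)             # 0.56
start = alpha_f(0, 4, 2, m)        # N = 16: equals alpha_g(1)
near_end = alpha_f(15, 4, 2, m)    # N = 31: close to alpha_g(2)
```

Since the two weights (2^i − n) and 2n sum to 2^i + n, α_f is a weighted average of α_g(K − 1) and α_g(K), which is why it moves monotonically between the two values as n grows.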
Assume now a file that has currently N = 32 buckets and is growing up to N = 256, hence it is 2-available. The file tolerates the unavailability of buckets 8 and 9, and, separately, that of bucket 10. But the unavailability of buckets 8-10 is catastrophic.
Consider then rather the choice of m = $N_1$ = 16 for the file starting with K = 2. The storage overhead remains the same, hence so does $\alpha_f$. But now the file tolerates that unavailability as well, even that of up to any four buckets among 1 to 16.
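The tolerance conditions of this example amount to counting the unavailable buckets per bucket group. A minimal sketch follows, assuming that group g consists of the consecutive buckets gm, …, gm + m − 1 and that all groups have the same availability level k; `tolerates` is a hypothetical helper.

```python
from collections import Counter

def tolerates(unavailable, m: int, k: int) -> bool:
    """True if no bucket group of size m has more than k unavailable buckets."""
    per_group = Counter(bucket // m for bucket in unavailable)
    return all(count <= k for count in per_group.values())

# m = 8, 2-available file: buckets 8..15 form one group.
two_ok = tolerates({8, 9}, m=8, k=2)
three_bad = tolerates({8, 9, 10}, m=8, k=2)
# m = 16 with four parity buckets per group: same overhead k/m = 25 %.
three_ok = tolerates({8, 9, 10}, m=16, k=4)
```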

Messaging
We calculate the messaging cost of a record manipulation as the number of (logical) messages exchanged between the SDDS clients and servers, to accomplish the operation.
This performance measure has the advantage of being independent of various practical factors such as network and CPU performance, communication protocol, flow control strategy, bulk messaging policy, etc. We count one message per record sent or received, per request for a service, and per reply carrying no record. We assume reliable messaging. In particular, we consider that the network level handles message acknowledgments, unless this is part of the SDDS layer, e.g., for the synchronous update of the parity buckets. The sender considers a node unavailable if it cannot deliver its message.
For the record recovery, the coordinator forwards the client request to an available parity bucket. That bucket looks for the rank of the record. If the record does not exist, two messages follow, to the coordinator and to the client. Otherwise, 2(m − 1) messages are typically, and at most, necessary to recover the record.
[…] proportionally affects the recovery. To offset the incidence at B, one may possibly decrease b accordingly. This increases C N for the same records, since N increases accordingly. This does not mean however that the scan time increases as well. In practice, it should even often decrease.

Experimental Results
We have prototyped LH* RS to study the timing of various operations and prove the viability of the scheme. The prototype was a many-year effort. The earliest implementation is presented in [Lj00]. It put into practice the parity calculus defined in [LS00]. It also reused an LH* LH implementation for the data bucket management, [B02].
Experiments with the next version of the LH* RS prototype were presented in [ML02]. The current version used for the experiments below builds upon that one. We present the prototype itself in more detail in [LMS04]. Further details of the prototype, as well as a deeper discussion of the experiments reported below, are in [M03].
The prototype consists of the LH* RS client and server nodes. These are C++ programs running under Windows 2000 Server. Internally, each client and server processes the queries and data using threads. The threads communicate through queues and other data structures and synchronize on events. There are basically two kinds of threads. The listening threads manage the communications at each node. There is one thread for UDP, one for TCP/IP and one for multicast messaging. Next, four working threads simultaneously process the queries and data, received or sent out.
We have designed the prototype to experimentally measure the speed of the operations using the parity calculus, depending on design choices. Most experiments compared the use of GF($2^8$) and of GF($2^{16}$). We ourselves assumed that using the latter was faster, but could neither quantify this assumption nor validate it. Experiments confirmed that using the latter was indeed usually faster, but not always. The most noticeable speed-up occurred for the decoding. We could also confirm the utility and measure the benefit of using our newest logarithmic matrix Q, derived from our also newest matrix P, with a first column and first row of ones.
We then measured the speed of the operations involving the parity updating, namely the inserts, file creation with splits, and updates, as well as the bucket and record recovery. The study used various availability levels, namely k = 0…3. We left the study of deletes, of merges and of scans for the future. The first two operations are of lesser practical interest. The last one is outside our goal here, as it is normally independent of the parity calculus. We have nevertheless measured the key search speed, as the reference for the time to operate over a data record. Each measure was averaged over several experiments.
Practical considerations led to a simplified implementation of some operations compared to their description in the previous sections. Also, the experiments modified our own ideas on the best design of some operations. We discuss the differences in the respective sections.
The configuration for our experiments included five P4 PCs with 1.8 GHz clock rate and 512 MB memory, and a 2.6 GHz, 512 MB P4 machine. The latter was used as a client. Others were data or parity servers. Sometimes, we also used additional client machines (733 MHz, P3). Our network was a 1 Gbps Ethernet.

Parity Calculus Optimization
To test the efficacy of using Q, we conducted experiments creating parity records in a bucket with a logarithmic Q column, versus its original P column. We used the group of m = 4 data buckets and created a parity bucket using the second or third or fourth parity column of each matrix (the first column of P was that of ones). […] Using Q with the first column and first row of zeros thus makes the encoding the fastest. Such a Q was therefore our final choice for the scheme and for all the experiments we report below. We attribute the better savings for GF($2^8$), with respect to the above percentages for GF($2^{16}$), to the higher efficacy of the XORing for byte-sized symbols.

Key Search
The key search time is the basic reference for the access performance of the prototype, since it does not involve the parity calculus for k > 0. We have measured the time to perform random individual (synchronous) and bulk (asynchronous) successful key searches. As for the experiments with the updates, we report on the unreliable messaging performance assumption as well.
As for the experiment, we have timed the series of 10 000 inserts into an initially empty bucket of b = 10 000. We avoided any split in this way, unlike for the measure of the file creation time in Section 6.3.4 below. A record has again the 4 B key and 100 B of non-key data. The average times were in practice identical for both GFs used. We recorded 0.29 ms for k = 0, 0.33 ms for k = 1 and 0.36 ms for k = 2. The average bulk insert times were seven to nine times faster reaching 0.04 ms. These times were the same as for the updates, discussed in Section 6.3.5 below. They were measured in the same way, and are similarly independent of k.
The figures above show that adding the first parity bucket to a 0-available file slows down an insert on average by 0.04 ms or 14 %. Adding one more parity bucket costs slightly less, 0.03 ms or 10 %, despite the XOR-only calculus on the first bucket. The reason is that most of the operations at the data bucket are in common, and the operations at the parity buckets proceed in parallel. All this appears to be quite efficient behavior.
Finally, the measured times are about 30 to 250 times faster than accesses to local disks (assuming 10 ms per access). As for a key search, the individual insert time was bound mainly by the server speed, while the bulk insert time was bound by the maximal client speed.
Figure 10 shows the average file creation time, by inserts with splits this time, for a bucket group of m = 4 data buckets and k = 0, 1, 2 parity buckets. The inserts are individual ones. We did not experiment with the bulk inserts, as they need a more complex design of splits, left for future work, to prevent side effects resulting from the concurrent processing of splits and of inserts. Besides, the average time to create a file using l-record bulk inserts would be at the client simply 0.04 l ms, given the bulk insert time above. At a server, the time could be somewhat longer, to complete the last inserts (see the discussion of the bulk updates below). For the experiments with inserts above, during the file creation the data bucket sends the acknowledgement to the client after sending the messages to the k parity buckets, but without waiting for the acknowledgements from these buckets. The results we measured were practically the same for GF($2^{16}$) and GF($2^8$). Hence, the charts shown apply to both fields, although the numerical values shown are for GF($2^8$). We inserted a series of 25 000 records, again with a 4 B key and 100 B of non-key data per record. The bucket size was b = 10 000. A point of the chart corresponding to l inserts shows the total time to perform these inserts.

File Creation
The inserts caused the file to split thrice. The split of bucket 0 occurred naturally after insert 10 000. A temporary slowdown of the insert times resulted, greater for greater k. The next inserts went uniformly into buckets 0 and 1. After slightly more than 10 000 further inserts, both buckets split almost concurrently. That is why the chart seems to show only two splits. For instance, less than a minute should thus suffice for a 1 M record file.
We also timed the use of our former Q matrix, without the first column and row of ones. The creation time for k = 1 was 10.011 sec. Thus, our new Q effectively speeds up the encoding time, by almost 2 % here. We recall that this acceleration, although slight here, is at no other cost.

Update
To determine the update performance, we generated series of 500, 1000, 5000, and 8000 blind updates to the records in our LH* RS file (the same as for the insert experiments). We updated different records, to prevent caching effects. The results are in Table 3 for bulk and individual updates. All updates are sent using UDP and 1PC. As before, we neglect the rare case of double unavailability of the data bucket and of the client. This time, however, the data bucket waits for the acknowledgements before sending the commitment to the […] Notice that the bulk insert time is independent of the GF used.

Compared also to the insert times, the bulk times do not change, as the client processes inserts and updates at the same possible speed. The update processing in contrast takes longer per record at the servers, with perhaps longer Listen Queues. This follows from Table 3, as the individual update time for k = 1 is already almost 45 % longer than the time to insert. The individual insert time for k = 0 is in contrast about 15 % longer than that of an individual update. This is due to the internal LH splits within the bucket.

Bucket Recovery
As described in Section 5.1, the recovery manager organizes bucket recovery. For implementation related reasons, our prototype locates the recovery manager at a parity bucket and not at a spare. To measure the performance, we simulated the creation of an LH* RS group with 4 data buckets and 1, 2, or 3 parity buckets. The group contained 125 000 = 4 * 31 250 data records consisting again of a 4 B key and 100 B non-key data.
We then reconstructed 1, 2, and 3 "unavailable" buckets. The recovery manager loops conceptually over all the existing record groups, i.e., over all the parity records in the parity bucket (Section 5.1). In fact, it recovers records by slices of a given size s. It requests s successive records from each of the m data/parity buckets, and recovers the s record groups. Then, it requests the next s records from each bucket. While waiting, it sends the recovered slice to the spare(s).
Figure 11 presents the effect of slice size on the recovery of a data bucket in the sample case of using the first parity bucket with 1's only and GF($2^{16}$). We measured the total recovery time T, the processing time P, and the communication time C. The basic finding is that the recovery time greatly decreases for a larger s. For s = 1, we have C = 149 s, P = 1.735 s and T = 165 s. Figure 11 does not show these values since they are so large, but rather displays values only for s ≥ 100. Once s is above 1000, T drops under 1 s, and P and C under 0.5 s. All the times decrease slightly for larger s and become constant when we choose s over 3000. This is a consequence of our latest communication architecture that uses the passive TCP connections we already spoke about. The result also means that a server may efficiently work with buffers much smaller than the bucket capacity b, e.g., 10 times smaller. The experiments with our earlier architectures are in [M03]. They prove the great superiority of the current one.
[…] Figure 11 by listing the T, P, C times for the s values minimizing T and k = 1, 2, 3. We used GF($2^8$) and GF($2^{16}$). The difference between a T value and the related P + C is the thread synchronization and switching time. We have also measured all these times for the other s values marked in Figure 11. For s ≥ 1250, the differences to the times listed here were under 15 % for 1-DB recovery, 5 % for 2-DB recovery and 2 % for 3-DB recovery.
The first line of the table presents the recovery of a single data bucket (1-DB), using the XOR decoding only, as in Figure 11. The second line of the table shows 1-DB recovery using the RS decoding (with the XORing and multiplications). We used another parity bucket or the first one in our initial scheme, [LS00], but not that with ones only. The XOR calculus proves notably faster for both GFs used. The gain was expected, but not its actual magnitude. P indeed becomes almost three times smaller for GF($2^8$), and almost 1.5 times smaller for GF($2^{16}$). T decreases less, given the incidence of the C value. That value is naturally rather stable and proves relatively important with respect to P, despite our fast 1 Gbps network. For the RS decoding we have C > 0.5P at least. Even more interestingly, we reach C > P for the XOR decoding.
Altogether, our numbers prove the efficiency of the LH* RS bucket recovery mechanism. It takes only 0.667 s to recover 1 DB in our experiments, and less than 1.5 s to recover 3 DBs, i.e., 9.375 MB of data in three buckets. Notice that the growth of T appears sub-linear with respect to the number of buckets recovered. This is the consequence of the parallelism at the implementation level, and of the recovery of the bucket group as a whole, at the conceptual level. The numbers also greatly contribute to the advantage of using GF($2^{16}$). It halves P of any recovery measured, except the one using XOR only. This was the rationale for our choice of this field for the basic LH* RS scheme, given also its behavior for the encoding, in practice as good as that of GF($2^8$). Notice that C in Table 4 increases more moderately than T as the function of the number of DBs recovered.
The flat character of the charts in Figure 11 for larger values of s confirms the scalability of the scheme. It also allows us to estimate the recovery times for larger buckets. We can infer from the above numbers that we recover a data bucket group of size m = 4 from 1-unavailability at the rate (speed) of 4.68 MB/s of data. Next, we recover 2 data buckets of the group at the rate of 5.74 MB/s. Finally, we recover the group from 3-unavailability at the rate of 6.38 MB/s. If we thus have 1 GB of data per bucket, the figures imply T of about 3.5, 5.5 and 8 minutes, respectively. If we choose the group size m = 8, to halve the storage overhead, the recovery rates will halve as well, while the recovery time will double, etc.
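The way such rates and extrapolations follow from the measurements can be sketched as below. The sketch counts only the 100 B of non-key data per record, as the 9.375 MB figure does, and takes 1 GB as 1000 MB; both are assumptions of the example, and the function names are ours.

```python
RECORDS_PER_BUCKET = 31_250   # 125 000 records over m = 4 data buckets
NON_KEY_BYTES = 100           # the 4 B key is not counted here

def bucket_megabytes() -> float:
    """Data volume of one data bucket in MB (10^6 bytes)."""
    return RECORDS_PER_BUCKET * NON_KEY_BYTES / 1e6

def recovery_rate(buckets_recovered: int, total_seconds: float) -> float:
    """Recovery rate in MB/s from a measured total recovery time T."""
    return buckets_recovered * bucket_megabytes() / total_seconds

def extrapolated_minutes(buckets_recovered: int, bucket_mb: float,
                         rate_mb_per_s: float) -> float:
    """Estimated T in minutes for larger buckets, assuming a constant rate."""
    return buckets_recovered * bucket_mb / rate_mb_per_s / 60

rate_1db = recovery_rate(1, 0.667)                     # from T = 0.667 s
minutes_1db = extrapolated_minutes(1, 1000.0, 4.68)    # 1 GB per bucket
```

The extrapolation relies on the rate staying constant for larger buckets, which is what the flat charts for large s suggest.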

Record Recovery
The record recovery manager is in our prototype located at one of the parity buckets. It acts as described in Section 5.2. Table 6 shows the average total record recovery time T we have measured. The bucket size was b = 50 000. The group size was m = 4. The times are measured at the parity bucket, from the moment the bucket gets the message from the coordinator until the recovery of the record.
The times for GF(2 16 ) are slightly higher. The reason is that we convert 1B characters to 2B symbols and back. In any case, the average scan time of our parity bucket to locate the key c of the data record, as described in Section 5.2, was measured to be 0.822 ms. This is the dominant part of the total time as it represents 62% and 64% respectively.
The results match the intuition and the experimental key search time. They confirm that the basic record recovery capability should often be sufficient in practice. If one seeks faster record recovery, or buckets are much larger, the additional already […]

VARIANTS
There are several ways to enhance the basic scheme with additional capabilities, or to amend the design choices, so as to favor specific capabilities at the expense of others. We now discuss a few such variations, potentially attractive to some applications. We show the advantages, but also the price to pay for them, with respect to the basic scheme. First, we address the messaging of the parity records. Next, we discuss the on-demand tuning of the availability level, and of the group size. We also discuss a variant where the data bucket nodes share the load of the parity records. We recall that in the basic scheme, the parity and data records are at separate nodes. The sharing decreases substantially the total number of nodes necessary for a larger file. Finally, we consider alternate coding schemes.

Parity Messaging
Often, an update changes only a small part of the existing data record. This is for instance the case of a relational database, where an update concerns usually one or a few attributes among many. For such applications, the ∆-record would consist mainly of zeros, except for a few symbols. If we compress the ∆-record and no longer have to transmit these zeroes explicitly, our messages should be noticeably smaller.
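One possible way to avoid transmitting the zeros is to send only the non-zero symbols with their offsets. The sketch below is merely an illustration of this idea: the encoding as (offset, symbol) pairs and the function names are our assumptions, not a format defined by the scheme.

```python
def delta_record(old: bytes, new: bytes) -> bytes:
    """Delta-record of an update: symbol-wise XOR of the old and new non-key field."""
    n = max(len(old), len(new))
    old, new = old.ljust(n, b"\0"), new.ljust(n, b"\0")
    return bytes(a ^ b for a, b in zip(old, new))

def compress(delta: bytes):
    """Keep only the non-zero symbols, each with its offset."""
    return [(offset, symbol) for offset, symbol in enumerate(delta) if symbol]

def decompress(pairs, length: int) -> bytes:
    """Rebuild the full delta-record at the parity bucket."""
    out = bytearray(length)          # zero-filled
    for offset, symbol in pairs:
        out[offset] = symbol
    return bytes(out)

old = b"A" * 100
new = old[:40] + b"B" + old[41:]     # the update changes a single byte
delta = delta_record(old, new)
pairs = compress(delta)              # one pair instead of 100 symbols
restored = decompress(pairs, len(delta))
```

The benefit depends on how sparse the update is; for an update rewriting most of the record, the pairs would take more space than the plain ∆-record.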
Furthermore, in the basic scheme the data bucket manages its messaging to every parity bucket. It also manages the rank that it sends along with the ∆-record. An alternate design is to send the ∆-record only to the first parity bucket, and without the rank. The first parity bucket assigns the rank. It is also in charge of the updates to the k − 1 other parity buckets, if there are any, using 1PC or 2PC. The drawback of the variant is that updating now needs two rounds of messages. The advantage is simpler parity management at the data buckets. 1PC suffices for the dialog between the data bucket and the first parity bucket. The management of the ranks also becomes transparent to the data buckets, as does the scalable availability. The parity subsystem is more autonomous. An arbitrary 0-available SDDS scheme can be more easily generalized to a highly-available scheme.
Finally, it is also possible to avoid the commit ordering during 2PC for updates. It suffices to add to each parity record the commit state field, which we call S. The field has the binary value $s_l$ per l-th data bucket in the group. When a parity bucket p gets the commit message from this bucket, it sets $s_l$ to $s_l$ := $s_l$ XOR 1. If bucket p alerts the coordinator because of the lack of the commit message, the coordinator probes each other available parity bucket for its $s_l$. The parity update was done iff any bucket p' probed had […] Recall that the update had to be posted to all or none of the available parity buckets that were not in the ready-to-commit state during the probing. The coordinator synchronizes the parity buckets accordingly, using the ∆-record in the differential file of bucket p. The advantage is a faster commit process, as the data bucket may send messages in parallel. The disadvantage is an additional field to manage, necessary for updates only.

Availability Tuning
We can add to the basic data record manipulations the operations over the parity management. First, we may wish to be able to decrease or increase the availability level K of the file. Such availability tuning could perhaps reflect the past experience. It differs from scalable availability, where splits change k incrementally. To decrease K, we drop, in one operation, the last parity bucket(s) of every bucket group. Vice versa, to increase the availability, we add the parity bucket(s) and records to every group. The parity overhead decreases or increases accordingly, as well as the cost of updates.
More precisely, to decrease the availability of a group from k > 1 to k-1, it suffices to delete the k th parity bucket in the group. The parity records in the remaining buckets do not need to be recomputed. Notice that this is not true for every alternate coding scheme we discuss below. This reorganization may be trivially set up in parallel for the entire file. As the client might not have all the data buckets in its image, it may use as the basis the scan operation discussed previously. Alternatively, it may simply send the query to the coordinator. The need being rare, there is no danger of a hot spot.
Vice versa, adding a parity bucket to a group requires a new node for it, with the (k + 1)-th column of Q (or P). Next, one should read all the data records in the group and calculate the new parity records, as if each data record was an insert. Various strategies exist to efficiently read the data buckets in parallel. Their efficiency remains to be studied. As above, it is easy to set up the operation in parallel for all the groups in the file. Also as above, the existing parity records do not need recalculation, unlike for the other candidate coding schemes for LH* RS we investigate below.
Adding a parity bucket can be concurrent with normal data bucket updates. Some synchronization is however necessary over the new bucket. For instance, the data buckets may be made aware of the existence of this bucket before it requests the first data records. As a result, they will start sending it the ∆-record for each update coming afterwards. Next, the new bucket may create its parity records in their rank order. The bucket then encodes any incoming ∆-record it did not request, provided it has already created the parity record, i.e., it has already processed that rank. It disregards any other ∆-record. In both cases, it commits the ∆-record. The parity record will include the disregarded ∆-record once the bucket encodes the data records with that rank, requesting then also the ∆-record.
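The synchronization rule above can be illustrated with a minimal Python sketch. It is not the prototype's code: the class `NewParityBucket` and its methods `build_rank` and `on_delta` are hypothetical names, the parity is plain XOR (as for a coefficient of 1) instead of the GF multiplication a (k+1)-th bucket would need, and we assume that the data records read for a rank already reflect every update committed before the read.

```python
from functools import reduce

def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Symbol-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

class NewParityBucket:
    """Hypothetical model of a parity bucket being added to a live group."""

    def __init__(self, record_len: int):
        self.record_len = record_len
        self.parity = {}  # rank -> parity bytes, created rank by rank

    def build_rank(self, rank: int, data_records: list) -> None:
        """Encode the data records requested for one rank (XOR only, an assumption)."""
        zero = bytes(self.record_len)
        self.parity[rank] = reduce(xor_bytes, data_records, zero)

    def on_delta(self, rank: int, delta: bytes) -> bool:
        """Handle an unsolicited delta-record.

        The delta is encoded only if the rank was already processed;
        otherwise it is disregarded. Returning at all models the commit,
        which happens in both cases. The result tells whether it was encoded.
        """
        encoded = rank in self.parity
        if encoded:
            self.parity[rank] = xor_bytes(self.parity[rank], delta)
        return encoded
```

A disregarded ∆-record is not lost in this model, since the later `build_rank` call for that rank reads the already updated data record.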

Group Size Tuning
We recall that the group size m for LH* RS is basically a power of two. Group size tuning may double or halve m synchronously for the entire file, one or more times. Doubling merges two successive groups, which we call left and right, into a single group of 2m buckets. The first left group starts with bucket 0. Typically, each of the merged groups has k parity buckets. Occasionally, if the split pointer is in the left group and the file is changing its availability level, the right group may have an availability level of k-1. We discuss the former case only. The generalization to the latter and to the entire file is trivial.
The operation reuses the k parity buckets of the left group as the parity buckets for the new group. Each of the k-1 columns of the parity matrices P and Q for the parity buckets other than the first one is however now provided with 2m elements, instead of the top m only previously. The parity for the new group is computed in these buckets as if all the data records in the right group were reinserted into the file. There are a number of ways to perform this operation efficiently that remain for further study. It is easy to see however that for the first new parity bucket, a faster computation may consist simply in XORing rank-wise the B-field of each record with that of the parity record in the first parity bucket of the right group, and unioning their key lists. Once the merge ends, the former parity buckets of the right group are discarded.
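The shortcut for the first parity bucket can be sketched in Python as follows. The representation is an assumption made for illustration only: a parity bucket is a dictionary mapping a rank to a pair (key list, B-field), and the helper name `merge_first_parity` is hypothetical.

```python
def merge_first_parity(left: dict, right: dict) -> dict:
    """Merge the first parity buckets of the left and right groups.

    Each bucket maps rank -> (key_list, b_field). For every rank, the
    B-fields are XORed and the key lists are concatenated (left keys first).
    A rank missing on one side contributes an empty key list and zeros.
    """
    merged = {}
    for rank in left.keys() | right.keys():
        l_keys, l_b = left.get(rank, ([], b""))
        r_keys, r_b = right.get(rank, ([], b""))
        size = max(len(l_b), len(r_b))
        l_b = l_b.ljust(size, b"\0")  # pad the shorter B-field with zeros
        r_b = r_b.ljust(size, b"\0")
        merged[rank] = (l_keys + r_keys, bytes(x ^ y for x, y in zip(l_b, r_b)))
    return merged
```

The other k-1 parity buckets cannot use this shortcut, because the right group's data records get new matrix coefficients in the merged group.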
The group size halving, in contrast, splits each group into two. The existing k parity buckets become those of the new left group. The right group gets k new empty parity buckets. In both sets of parity buckets, the columns of P or Q need only the top m elements. Afterwards, each record of the right group is read. It is then encoded into the existing buckets as if it was deleted, i.e., its key is removed from the key list of its parity records and its non-key data are XORed to the B-fields of these records. At the same time, it is encoded into the new parity buckets as if it was just inserted into the file. Again, there are a number of ways to implement the group size halving efficiently that remain open for study.
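For the first parity bucket, where encoding is XOR only, the halving step may be sketched as below. This is an illustration under stated assumptions: the dictionary layout rank -> (key list, B-field), the function name `halve_group`, and the equal length of all records are ours, and the other parity buckets would additionally need the GF coefficients.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Symbol-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

def halve_group(parity: dict, right_records: list):
    """Split the first parity bucket of a 2m group into left and right ones.

    parity: rank -> (key_list, b_field) for the group before halving.
    right_records: (rank, key, data) for every data record of the right half.
    """
    left = {rank: (list(keys), b) for rank, (keys, b) in parity.items()}
    right = {}
    for rank, key, data in right_records:
        keys, b = left[rank]
        keys.remove(key)                          # encode as a delete ...
        left[rank] = (keys, xor_bytes(b, data))   # ... in the existing bucket
        r_keys, r_b = right.get(rank, ([], bytes(len(data))))
        right[rank] = (r_keys + [key], xor_bytes(r_b, data))  # encode as an insert
    return left, right
```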

Parity Load Balancing
In the basic scheme, the data and parity buckets are at separate nodes. A parity bucket also sustains an update processing load up to m times that of a data bucket, as all the data buckets in the group may get updated simultaneously. The scheme requires about Nk/m nodes for the parity buckets, in addition to N data bucket nodes. This number scales out with the file. In practice, for a larger file, e.g., on N = 1K data nodes, with m = 16 and K = 2, this leads to 128 parity nodes. These parity nodes do not carry any load for queries. On the other hand, the update load on a parity bucket is about 16 times that of a data bucket. If there are intensive bursts of updates, the parity nodes could form a bottleneck that slows down commits. This argues against using a larger m. Besides, some users may be troubled by the sheer number of additional nodes.
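The figures of this example follow from simple arithmetic, which the short Python sketch below reproduces. The function names are hypothetical, and the load ratio is the worst case in which all m data buckets of a group are updated at once.

```python
def parity_nodes(n_data: int, k: int, m: int) -> int:
    """Number of parity bucket nodes in the basic scheme: k per group of m."""
    groups = -(-n_data // m)  # ceiling division, the last group may be partial
    return groups * k

def worst_case_update_ratio(m: int) -> int:
    """Update load of a parity bucket relative to one data bucket, at most m."""
    return m

print(parity_nodes(1024, 2, 16))    # N = 1K, K = 2, m = 16
print(worst_case_update_ratio(16))
```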
The following variant decreases the storage and processing load of the parity records on the node supporting them. This happens provided that k ≤ m which seems a practical assumption. It also balances the load so that the parity records are located mostly on data bucket nodes. This reduces the number of additional nodes needed for the parity records to m at most. The variant works as follows.
Consider the i-th parity record in the record group with rank r, i = 0,1…k-1. Assume that for each (data) bucket group there is a parity bucket group of m buckets, numbered 0,1…m-1, of capacity kb/m records each. Store each parity record in parity bucket j = (r + i) mod m. Store it as a primary record, or as an overflow one if needed, as usual.
Place the m parity buckets of the first group, i.e., the one containing data buckets 0,…,m-1, on the nodes of the data buckets of the group immediately to its right, i.e., with data buckets m,…,2m-1. Place the parity records of this group on the nodes of the group immediately to its left. Repeat for any further groups while the file scales out.
The result is that each parity record of a record group is in a different parity bucket. Thus, if we can no longer access a parity bucket, then we lose access to a single parity record per group. This is the key requirement for k-availability, as for the basic scheme. The LH* RS file consequently remains K-available. The parity storage overhead, i.e., the parity bucket size at a node, now decreases uniformly by the factor m/k. In our example, it divides by 8. The update load on a parity bucket also becomes only twice that of a data bucket. In general, the total processing and storage load is about balanced over the data nodes, for both the updates and searches.
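The placement rule j = (r + i) mod m and its consequences can be checked with a small Python sketch. The bucket capacity b = 160 is an arbitrary value chosen for the illustration, and the function names are hypothetical.

```python
from collections import Counter

def parity_bucket(r: int, i: int, m: int) -> int:
    """Parity bucket holding the i-th parity record of the record group of rank r."""
    return (r + i) % m

def placement(b: int, k: int, m: int) -> Counter:
    """Count the parity records each of the m parity buckets receives for ranks 0..b-1."""
    load = Counter()
    for r in range(b):
        buckets = {parity_bucket(r, i, m) for i in range(k)}
        # With k <= m, the k parity records of one rank land in k distinct buckets.
        assert len(buckets) == k
        load.update(buckets)
    return load

load = placement(b=160, k=2, m=16)
print(sorted(load.values()))  # every bucket holds k*b/m records
```

With m = 16 and k = 2, every parity bucket holds kb/m = 20 records instead of b = 160, which is the factor m/k = 8 quoted above.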
Notice finally that if n > 1 nodes, possibly spares, may participate in the recovery calculus, then the idea, described above, of partitioning a parity bucket over the n nodes may be usefully applied to speed up the recovery phase. The partitioning would dynamically become the first step of the recovery process. As discussed, this would decrease the calculus time by a factor possibly approaching 1/n. The overall recovery time possibly improves as well. The gain may be substantial for large buckets and n >> 1.

Alternative Erasure Correcting Codes
In principle, we can retain the basic LH* RS architecture with a different erasure correcting code. The interest in these codes stems first from the interest in higher-availability RAID. We will therefore now discuss replacing our code with other erasure correcting codes, within the scope of our scheme. Certain codes allow one to trade off performance factors.
Typically, a variant can offer faster calculus than our scheme at the expense of parity storage overhead or limitations on the maximum value of k. For the sake of comparison, we first list a number of necessary and desirable properties for a code. Next, we discuss how our code fits them. Finally, we use the framework for the analysis.

Design Properties of an Erasure Correcting Code for LH* RS
1. Systematic code. The code words consist of data symbols concatenated with parity symbols. This means that the application data remains unchanged and that the parity symbols are stored separately.
2. Linear code. We can use ∆-records when we update, insert, or delete a single data record. Otherwise, after a change we would have to access all data records and recalculate all parity from them.
5. Constant bucket group size, independent of the availability level.
Notice that it is (2) that also allows us to compress the delta record by only transmitting non-zero symbols and their location within the delta record.
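As an illustration of this compression, the Python sketch below transmits only the non-zero symbols of a ∆-record together with their positions. It assumes one-byte symbols and the XOR-only encoding of the first parity bucket; the function names are hypothetical.

```python
def compress_delta(old: bytes, new: bytes) -> list:
    """Return the non-zero symbols of the delta-record as (position, symbol) pairs."""
    return [(pos, o ^ n) for pos, (o, n) in enumerate(zip(old, new)) if o != n]

def apply_compressed(parity: bytes, compressed: list) -> bytes:
    """XOR the transmitted symbols into the parity record at their positions."""
    out = bytearray(parity)
    for pos, symbol in compressed:
        out[pos] ^= symbol
    return bytes(out)
```

Linearity is what makes this possible: a zero symbol of the ∆-record contributes nothing to any parity symbol, so it need not be sent.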
Our codes (as defined in Section 3) fulfill all these properties. They are systematic and linear. They have minimal possible overhead for parity data within a group of any size. This is a consequence of being Maximum Distance Separable (MDS). Since the parity matrix contains a column of ones, record reconstruction in the most important case (a single data record failure) proceeds at the highest speed possible. As long as k=1, any update incurs the minimal parity update cost for the same reason. In addition, for any k, updates to a group's first data bucket result also in XORing because of the row of ones in the parity matrix. Finally, we can use the logarithmic matrices.
Our performance results (Section 6.3) show that the update performance at the second, third, etc. parity bucket is therefore adequate. We recall that for GF(2^16), the slowdown was 10% for the 2nd parity bucket and an additional 7% for the 3rd one, with respect to the 1st bucket only (Table 3). It is moreover impossible to improve the parity matrix further by introducing additional one-coefficients to avoid GF multiplication (we omit the proof of this statement). Next, a bucket group can be extended to a total of n = 257 or n = 65,537, depending on whether we use the Galois field with 2^8 or 2^16 elements.
Up to these bounds, we can freely choose m and k subject to m + k = n, in particular, we can keep m constant. An additional nice property is that small changes in a data record result in small changes in the parity records. In particular, if a single bit is changed, then a single parity symbol only in each parity record changes, (except for the first parity record where only a single bit changes).
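The locality property can be illustrated with a small Python sketch of a parity update through logarithm tables. The sketch works in GF(2^8) for brevity and uses the primitive polynomial 0x11D, which is a common choice but an assumption here, as are the function names. The coefficient is passed as its logarithm, in the spirit of the logarithmic matrices, and is assumed to be non-zero.

```python
# Logarithm and antilogarithm tables for GF(2^8), generator 2, polynomial 0x11D.
EXP = [0] * 510
LOG = [0] * 256
x = 1
for i in range(255):
    EXP[i] = x
    LOG[x] = i
    x <<= 1
    if x & 0x100:
        x ^= 0x11D
for i in range(255, 510):
    EXP[i] = EXP[i - 255]  # duplicate the table to avoid a modulo in the multiplication

def gf_mul_log(log_coef: int, symbol: int) -> int:
    """Multiply a symbol by a non-zero coefficient given by its logarithm."""
    return 0 if symbol == 0 else EXP[log_coef + LOG[symbol]]

def update_parity(parity: bytes, delta: bytes, log_coef: int) -> bytes:
    """Apply a delta-record to a parity record, symbol by symbol."""
    return bytes(p ^ gf_mul_log(log_coef, d) for p, d in zip(parity, delta))
```

A ∆-record with a single non-zero symbol changes exactly one symbol of the parity record, and a coefficient of 1 (logarithm 0) reduces the update to plain XORing.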

Candidate Codes
Array Codes. These are two-dimensional codes in which the parity symbols are the XOR of symbols in lines in one or more directions. One type is the convolutional array codes that we discuss now. We address some others later in this section. The convolutional codes were developed originally for tapes, adding parity tracks with parity records to the data tracks with data records [PB76], [Pat85], [FHB89]. The attractive property of a convolutional code in our context is its updating and decoding speed. During an update, we change all parity records only by XORing them with the ∆-record. We start for that at different positions in each parity record, Figure 12. The updates proceed at the fastest possible speed for all data and parity buckets, unlike in our case, where this is true only for the first parity bucket and the first data bucket.
Likewise, the decoding iterates by XORing and shifting of records. This should be faster than our GF multiplications. Notice however that writing a generic decoding algorithm for any m and k is more difficult than for the RS code.
All things considered, these codes can replace RS codes in the LH* RS framework, offering faster performance at the cost of larger parity overhead. Notice that we can reduce the parity overhead by also using negative slopes, at the added expense of decoding complexity (inversion of a matrix in the field of Laurent series over GF(2)).
Figure 12: Convolutional array code.
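A minimal Python sketch may help to visualize the XOR-and-shift update and the overhang that causes the larger parity overhead. The layout is an assumption and not necessarily that of Figure 12: the i-th parity record XORs the l-th data record starting at offset i·l, i.e., along lines of slope i, so that it is i(m-1) symbols longer than a data record. The function names are hypothetical.

```python
def encode(data_records: list, k: int) -> list:
    """Build k parity records; the i-th one XORs data record l at offset i*l."""
    m = len(data_records)
    length = len(data_records[0])
    parities = []
    for i in range(k):
        p = bytearray(length + i * (m - 1))  # overhang grows with the slope i
        for l, record in enumerate(data_records):
            for t, symbol in enumerate(record):
                p[t + i * l] ^= symbol
        parities.append(p)
    return parities

def apply_delta(parities: list, l: int, delta: bytes) -> None:
    """Update all parity records in place by XORing the delta-record of data record l."""
    for i, p in enumerate(parities):
        offset = i * l  # a different starting position in each parity record
        for t, d in enumerate(delta):
            p[offset + t] ^= d
```

Under this layout, no multiplication is involved for any data or parity bucket; only the starting offset differs from one parity record to the next.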
Block array codes are another type of codes that are MDS. They indeed avoid the overhang in the parity records. As an example, we sketch the code family B_k(p), [BFT98], where k is the availability level and p is a prime, corresponding to our m, i.e., p ≥ k + m. Prime p is not a restriction, since we may introduce dummy symbols and data records.
In Figure 12 for instance, a_i, b_i, c_i with i > 5 are dummy symbols. Next, in Figure 13 we have chosen k = 2 and m = 3, hence p = 5. We first encode four symbols from each of the three data records a_0, a_1,..., b_0, b_1,..., and c_0, c_1,... The pattern repeats for the following symbols in groups of four symbols. We arrange the data and parity records as the columns of a 4 by 5 matrix. For ease of presentation, and because slopes are generally defined for square matrices, we added a fictional row of zeroes (which are not stored). We now require that the five symbols in all rows and all lines of slope -1 in the resulting 5 by 5 matrix have parity zero. The line in parentheses in Figure 13 is the third such line.
Block array codes are linear and systematic. As for our code, we update the parity records using ∆-records. As the figure illustrates, we only use XORing. In contrast to our code however, and to the convolutional array code, the calculus of most parity symbols involves more than one ∆-record symbol. For example, the updating of the 1st parity symbol in Figure 13 requires XORing of two symbols of any ∆-record, for instance, the first and second symbol of the ∆-record if record a_0, a_1,… changes. This results in between one and two times more XORing. Decoding turns out to have about the same complexity as encoding for k = 2. All this should translate to faster processing than for our code.
For k ≥ 3, we generalize by using k parity columns, increasing p if needed, and requiring parity zero along additional slopes -2, -3, etc. In our example, increasing k to 3 involves setting p to the next prime, which is 7, to accommodate the additional parity column, and adding a dummy data record to each record group. We could use p = 7 also for k = 2, but this choice slows down the encoding by adding terms to XOR in the parity expressions. The main problem with B_k(p) for k > 2 is that the decoding algorithm becomes fundamentally more complicated than for k = 2. Judging from the available literature, an implementation is not trivial, and we can guess that even an optimized decoder should perform slower than our RS decoder, [BFT98]. All things considered, using B_k(p) does not seem a good choice for k > 2.
The EvenOdd code, [BBM93,BBBM95,BFT98], is a variant of B_2(p) that improves encoding and decoding. The idea is that the 1st parity column is the usual parity, and the 2nd parity column is either the parity or its binary complement of all the diagonals of the data columns, with the exception of a special diagonal whose parity decides on the alternative used. The experimental analysis in [S03] showed that both encoding and decoding of EvenOdd are faster than for our fastest RS code. In the experiment, EvenOdd repaired a double record erasure four times faster. The experiment did not measure the network delay, so the actual performance advantage is less pronounced.
It is therefore attractive to consider a variant of LH* RS using EvenOdd for k = 2. An alternative to EvenOdd is the Row-Diagonal Parity code presented in [C&al04].
EvenOdd can be generalized to k > 2, [BFT98]. For k = 3, one obtains an MDS code with the same difficulties of decoding as for B_3(p). For k > 3 the result is known not to be MDS.
A final block-array code for k = 2 is the X-code [XB99]. X-codes have zero parity only along the lines with slopes 1 and -1 and, like all block-array codes, use only XORing for encoding and decoding. They too seem to be faster than our code, but they cannot be generalized to higher values of k.
If the application record is in Kbytes, then a larger m allows for a few chunks per record or a single one. If the record size is not a chunk multiple, then we pad the last bytes with zeros. As LDPC codes are linear, one can use ∆-records calculated over the chunk(s) of the updated data record to send updates to the parity buckets.
If application data records consist of hundreds of bytes or are smaller, then it seems best to pack several records into a chunk. As typical updates address only a single record at a time, we should use compressed ∆-records. Unlike in our code however, an update will then usually change more parity symbols than there are symbols in the compressed ∆-record. This obviously affects the encoding speed in comparison.
In both cases, the parity records would consist of full parity chunks of size M/m+ε, where ε reflects the deviation from MDS, e.g., the 11% quoted above. The padding, if any, introduces some additional overhead. The incidence of all the discussed details on the performance of the LDPC coding within LH* RS, as well as further related design issues, are open research problems. At this stage, all things considered, the attractiveness of LDPC codes is their encoding and decoding speed, close to the fastest possible, i.e., that of the symbol-wise XORing of the data and parity symbols, as for the first parity record of our coding scheme, [BLM99]. Notice however that encoding and decoding are only part of the processing cost in LH* RS parity management. The figures in Section 6.3 show that the difference in processing between using only the 1st parity bucket and using the others is by far not that pronounced. Thus, the speed-up resulting from replacing RS with a potentially faster code is limited. Notice also that finding good LDPC codes for smaller M is an active research area.

RAID Codes
The interest in RAID generated specialized erasure correcting codes. One approach uses XOR operations only, generating parity data for a k-available disk array with typical values of k = 2 and k = 3, e.g., [H&al94], [CCL00], [CC01]. For a larger k, the only RAID code known to us is based on the k-dimensional cube. RAID codes are designed for a relatively large number of disks, e.g., more than 20 in the array. Each time we scale from k = 2 to k = 3 and beyond, we change the number of data disks. Implementing these changing group sizes would destroy the LH* RS architecture, but could result in some interesting scalable availability variant of LH*.
For the sake of completeness, we finally mention other flavors of generalized RS codes used in erasure correction, but not suited for LH* RS. The Digital Fountain project used a non-systematic RS code in order to speed up the matrix inversion during decoding. The Kohinoor project, [M02], developed a specialized RS code for group size n = 257 and k = 3 for a large disk array to support an email server. [P97] seemed to give a simpler and longer (and hence better) generator matrix for an RS code, but [PD03] retracts this statement.

RELATED WORK
Traditionally, in both the centralized and the distributed environment, high availability was not part of a (key-based) data structure. If needed, a lower storage level provided it, such as mirroring or a RAID-like technique. This approach simplifies the design of a data structure. It can in contrast deteriorate access times in the distributed environment. For example, a dictionary data structure using hashing could place a data unit at some particular node. But the underlying RAID system could place the data at a different node or even distribute it over several nodes. This lower-level interference would result in additional messaging that an integration of the parity data management into the hashing structure could avoid.
The problem is more acute for a scalable distributed storage environment with a large number of nodes. The elementary reliability calculus shows that higher levels of availability are often necessary for a data structure stored on many nodes. One approach provides the high level at each node. This approach fails if the storage nodes are standard PCs or workstations, especially in a P2P network where nodes may have low availability [WK02]. In addition, files in the same environment may require different availability levels just because of their different sizes. The alternative is to integrate high availability into scalable distributed data structures and let the availability level itself scale.
In response to the need of integrating high-availability and SDDS the concept of a high-availability data structure appeared, [LN96]. The first high-availability SDDS was LH* M , where high-availability results from mirroring two LH* files. The files contain exactly the same records. They may however differ by the internal structures, e.g., the bucket size. In any case, the two files in LH* M are more strongly coupled than usual mirrors. This resolves some double and more failures that would be otherwise catastrophic.
[L&al97] proposed another 1-availability SDDS called LH* S . Here, one partitions a record into n segments, each stored at a different site. There is also the (n+1)-th segment, the XOR parity segment, at some other site. Compared with LH* M , the parity overhead is much smaller, namely close to 1/n. The operations require in contrast more messages. An LH* S key search in normal mode needs n messages, even though the messages are shorter.
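The segmentation principle can be sketched in a few lines of Python. The sketch is ours and not the LH* S implementation: the function names are hypothetical, the segments are contiguous slices padded with zeros, and a non-empty record is assumed.

```python
from functools import reduce

def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Symbol-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

def split_with_parity(record: bytes, n: int):
    """Cut a non-empty record into n equal segments and compute their XOR parity segment."""
    seg_len = -(-len(record) // n)                  # ceiling division
    padded = record.ljust(seg_len * n, b"\0")       # zero padding of the last segment
    segments = [padded[i * seg_len:(i + 1) * seg_len] for i in range(n)]
    return segments, reduce(xor_bytes, segments)

def recover_segment(segments: list, parity: bytes) -> bytes:
    """Rebuild the single missing segment (marked None) from the others and the parity."""
    present = [s for s in segments if s is not None]
    return reduce(xor_bytes, present, parity)
```

The parity segment has the size of one data segment, hence the storage overhead close to 1/n, while reading a whole record touches all n segment sites.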
Another 1-available SDDS, LH* g , [LR97, L97, LR01] keeps records intact. It introduces the concept of record groups used by LH* RS . Retrospectively, the LH* RS parity calculus could generalize LH* g to higher availability. As for LH* RS , an LH* g record enters a record group when it is created. The group members are always on different servers and the group contains an additional parity record of the same structure as a LH* RS parity record. The initial record group is the same for an LH* g record as for an LH* RS record. However, an LH* g record keeps its initial record group membership, regardless of its moves caused by splits. In comparison to LH* RS , LH* g splits are then faster. In contrast, a data bucket recovery processing is more costly. In particular, one always scans all the parity buckets, instead of usually one only for LH* RS . Notice that the recovery is not then necessarily longer than for LH* RS , as the scans can be parallel. If the communication is slow with respect to the processing time, it can be even faster.
LH* SA was the first SDDS to achieve scalable availability, [LMR98], [LMRS99]. To achieve k-availability, LH* SA places each record in k or k+1 different record groups that only intersect in this one record. Each record group has an additional parity record, basically consisting of the XOR of the other records. LH* SA places the buckets conceptually into a high-dimensional cube with n buckets in the first k or k+1 dimensions. Just as for LH* RS , a controlled or an uncontrolled strategy adds parity buckets. A small LH* SA file with k > 1 has a larger storage overhead than a corresponding LH* RS file. This advantage of LH* RS dissipates however for larger files.
LH* SA parity calculations use only XORing, which gives it an advantage over k-available LH* RS files for k > 1. However, if there is more than one unavailable bucket, recovering a lost record can involve additional recovery steps. Deeper comparison of trade-offs between LH* SA and LH* RS remains to be explored.
Outside the domain of SDDSs, research has addressed high-availability needs for distributed flat files for many years. The dominant approach was replication, [H96]. The major issue was replica consistency, [P93]. Disk arrays in a centralized environment historically needed high availability with less storage overhead [BM93], [H&al94]. The arrays typically have a fixed number of disks, so the proposed high-availability schemes were static. The aspects under investigation were mainly the parity update mechanisms (e.g., parity logging), and the parity placement providing 1-availability through XORing. These were the performance determinants of a disk array.
Next, parity placement schemes appeared intended for larger, but still static, arrays, e.g., [ABC97]. Current research increasingly focuses on very large storage systems, using an expandable number of storage units, whether disks or entire servers. Recent proposals for k-available erasure correcting codes with k > 1, already discussed in Section 7.5, came in this context.
High-availability is also a general goal for a DBMS. Nevertheless, our facet of this concept, concerning the unavailability of a part of data storage, received relatively little attention. The general assumption seems to be the use of a high-availability storage or file system underneath. Typically, it should be a software or hardware RAID storage. For a parallel DBMS, this should concern each DBMS node. At the database layer, the replication seems the only technique used. The DBMS is then typically 1-available, with respect to storage node unavailability.
The Clustra DBMS, now a commercial product, proposes a DBMS-level structure that some claim to be the most efficient in the domain, [S99]. It hash-partitions a table into fragments, each located at a different node. The nodes communicate using a dedicated high-speed switch. The Clustra hashing is static, hence with limited scalability compared to LH* RS . The practical limit is 24 nodes at present. Each fragment is replicated on two nodes, using the primary copy approach. If a fragment is unavailable (detected basically by the lack of a heartbeat), its available copy, possibly the primary one, is copied to a spare. The partitioning typically limits the recovery to a single fragment. The whole scheme makes Clustra tables only 1-available and limits its scalability compared to our scheme.
The conclusion holds for other prominent DBMSs, whether they use for the parallel table partitioning the (static) hashing (DB2), or range partitioning (SQL Server) or both (Oracle).
Research also starts addressing the high-availability needs of scalable disk farms, [X&al03], [X&al04]. These should soon be necessary for grid computing and for very large Internet databases. Some simple techniques are already in everyday use. They are apparently replication based, but covered by corporate secrecy. The prominent example is Google. The gray literature estimates its farm as already spreading over more than 10 000 Linux nodes, perhaps as many as 54 000, [D03], [E03]. There are also open research proposals for high-availability distributed data structures over large clusters specifically intended for Internet access. One is a distributed hash table scheme with built-in specific replication, [G00]. An on-going research project follows up with the goal of a scalable distributed highly-available linked B-tree, [B03].
Emerging P2P applications, including the Wi-Fi ones, also lead to compelling high-availability storage needs, [AK02], [K03], [D00]. In this new environment, the availability of the nodes should be more "chaotic" than one typically supposed in the past. Their number and geographical spread should often also be orders of magnitude larger, possibly easily running in the near future into hundreds of thousands and soon reaching millions, spread worldwide. This thinking clearly shares some rationales with LH* RS . Our scheme could thus prove useful for these new applications as well.

CONCLUSION
LH* RS is a high-availability scalable distributed data structure. It scales up to any size and any availability level k one can reasonably foresee for an application these days. The file scalability is transparent to the application, as for any SDDS. The k-availability may scale transparently as well, or may be adjusted by the application on demand.
The scheme matured in many aspects with respect to our initial proposals, [LS00]. The evolution concerned the parity calculus, and various algorithmic issues to make the file always at least (K-1)-available, and the parity calculus the fastest. We have thus increased the Galois Field size to GF(2^16). We have evolved the parity matrix P so that it has a 1st column and row of 1s. We have also improved the calculus so as to take advantage of the logarithmic parity. We have built a prototype implementation, proving the feasibility of the scheme. We have experimented on this basis with the new, and the former, parity calculus, as well as with the above-mentioned algorithmic issues. Performance analysis proved substantial speed-up of various operations.
At present, for the most frequent case of k = 1, the scheme performs as well as any popular 1-available RAID scheme using XORing only. For k > 1, it appears more effective in practice than if we used any alternate parity code or scheme we are aware of. This concerns our own earlier approach, as we just mentioned. The yet unique presence of the row of 1's contributes to this performance. In particular, while the parity storage and communication overhead increase substantially with the k used, they globally always remain close to the optimal bounds. Another known high-availability SDDS scheme may nevertheless eventually outperform LH* RS on a selected feature. The diversity should profit the applications.
"Au finale", the experimental performance analysis has shown very fast access and recovery performance. Our testbed files with 125K records recovered in less than a second from a single unavailability and in about two seconds from a triple one. Individual search, insert and update times were at most 0.5 msec for a 3-available file. Bulk operations were several times faster. This performance is also due to the data processing in the distributed RAM. Altogether, the capabilities of our scheme should attract numerous applications, including the exciting new ones in the domains of grid computing and of P2P. In particular, they should be useful for DBMSs. Those still use for high availability the more limited static and 1-available replication or RAID storage schemes.
Future work should concern experiments with applications of our scheme. One should also port the parity subsystem to other known 0-available SDDS schemes. The range partitioning schemes appear to be the preferred candidates. One should also add the capabilities of concurrent and transactional access to LH* RS . Notice that the data records of a record group conflict on the parity records. One should finally study the outlined variants in more depth.

Term Description Typical value
Addressing an LH* RS file
file level
split pointer
file state
internal file state in a data bucket
client's view of i
client's view of n
current number of data buckets in the file
logical address of a data bucket server
physical address of a data bucket server
initial physical address of the file (server of data bucket 0)
data bucket level
(primary) key of a data record
series of hash functions
load factor
Parity calculus