The relative ease of collaborative data science and analysis has led to a proliferation of many thousands or millions of versions of the same datasets in many scientific and commercial domains, acquired or constructed at various stages of data analysis across many users, and often over long periods of time. Managing, storing, and recreating these dataset versions is a nontrivial task. The fundamental challenge here is the storage-recreation tradeoff: the more storage we use, the faster it is to recreate or retrieve versions, while the less storage we use, the slower it is to recreate or retrieve versions. Despite the fundamental nature of this problem, there has been surprisingly little work on it. In this paper, we study this tradeoff in a principled manner: we formulate six problems under various settings, trading off these quantities in various ways, demonstrate that most of the problems are intractable, and propose a suite of inexpensive heuristics drawing from techniques in the delay-constrained scheduling and spanning tree literature to solve these problems. We have built a prototype version management system that aims to serve as a foundation for our DataHub system for facilitating collaborative data science. We demonstrate, via extensive experiments, that our proposed heuristics provide efficient solutions in practical dataset versioning scenarios.
The massive quantities of data being generated every day, and the ease of collaborative data analysis and data science have led to severe issues in management and retrieval of datasets. We motivate our work with two concrete example scenarios.
In such scenarios and many others, it is essential to keep track of versions of datasets and be able to recreate them on demand; and at the same time, it is essential to minimize the storage costs by reducing redundancy and duplication. The ability to manage a large number of datasets, their versions, and derived datasets, is a key foundational piece of a system we are building for facilitating collaborative data science, called DataHub [12]. DataHub enables users to keep track of datasets and their versions, represented in the form of a directed version graph that encodes derivation relationships, and to retrieve one or more of the versions for analysis.
In this paper, we focus on the problem of trading off storage costs and recreation costs in a principled fashion. Specifically, the problem we address in this paper is: given a collection of datasets as well as (possibly) a directed version graph connecting them, minimize the overall storage for storing the datasets and the recreation costs for retrieving them. The two goals conflict with each other — minimizing storage cost typically leads to increased recreation costs and vice versa. We illustrate this tradeoff via an example.
Figure 1(i) displays a version graph, indicating the derivation relationships among 5 versions. Let V_{1} be the original dataset. Say there are two teams collaborating on this dataset: team 1 modifies V_{1} to derive V_{2}, while team 2 modifies V_{1} to derive V_{3}. Then, V_{2} and V_{3} are merged to give V_{5}. As presented in Figure 1, V_{1} is associated with ⟨10000, 10000⟩, indicating that V_{1}’s storage cost and recreation cost are both 10000 when stored in its entirety (we note that these two are typically measured in different units – see the second challenge below); the edge (V_{1} → V_{3}) is annotated with ⟨1000, 3000⟩, where 1000 is the storage cost for V_{3} when stored as the modification from V_{1} (we call this the delta of V_{3} from V_{1}) and 3000 is the recreation cost for V_{3} given V_{1}, i.e., the time taken to recreate V_{3} given that V_{1} has already been recreated.
One naive solution to store these datasets would be to store all of them in their entirety (Figure 1 (ii)). In this case, each version can be retrieved directly but the total storage cost is rather large, i.e., 10000 + 10100 + 9700 + 9800 + 10120 = 49720. At the other extreme, only one version is stored in its entirety while other versions are stored as modifications or deltas to that version, as shown in Figure 1 (iii). The total storage cost here is much smaller (10000 + 200 + 1000 + 50 + 200 = 11450), but the recreation cost is large for V_{2}, V_{3}, V_{4} and V_{5}. For instance, the path {(V_{1} → V_{3} → V_{5})} needs to be accessed in order to retrieve V_{5} and the recreation cost is 10000 + 3000 + 550 = 13550 > 10120.
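The totals above can be verified directly from the numbers given for Figure 1; a quick Python check:

```python
# Storage costs of V1..V5 when each version is stored in its entirety
# (solution (ii) of Figure 1).
full = [10000, 10100, 9700, 9800, 10120]
print(sum(full))            # 49720

# Solution (iii): V1 stored fully, the other versions as deltas.
storage = [10000, 200, 1000, 50, 200]
print(sum(storage))         # 11450

# Recreating V5 in solution (iii) follows the path V1 -> V3 -> V5.
print(10000 + 3000 + 550)   # 13550, larger than V5's direct cost 10120
```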
Figure 1 (iv) shows an intermediate solution that trades off increased storage for reduced recreation costs for some versions. Here we store versions V_{1} and V_{3} in their entirety and store modifications to the other versions. This solution exhibits higher storage cost than solution (iii) but lower than (ii), and still results in significantly reduced retrieval costs for versions V_{3} and V_{5} over (iii).
Despite the fundamental nature of the storage-retrieval problem, there is surprisingly little prior work on formally analyzing this tradeoff and on designing techniques for identifying effective storage solutions for a given collection of datasets. Version Control Systems (VCS) like Git, SVN, or Mercurial, despite their popularity, use fairly simple algorithms underneath, and are known to have significant limitations when managing large datasets [1, 2]. Much of the prior work in the literature focuses on a linear chain of versions, or on minimizing the storage cost while ignoring the recreation cost (we discuss the related work in more detail in Section 6).
In this paper, we initiate a formal study of the problem of deciding how to jointly store a collection of dataset versions, provided along with a version or derivation graph. Aside from being able to handle the scale, both in terms of dataset sizes and the number of versions, there are several other considerations that make this problem challenging.
We note that the problem described thus far is inherently “online” in that new datasets and versions are typically being created continuously and are being added to the system. In this paper, we focus on the static, offline version of this problem and focus on formally and completely understanding that version. We plan to address the online version of the problem in the future. The key contributions of this work are as follows.
In this section, we first introduce essential notations and then present the various problem formulations. We then present a mapping of the basic problem to a graph-theoretic problem, and also describe an integer linear program to solve the problem optimally.
We let 𝒱 = {V_{i}}, i = 1, …, n, be a collection of versions. The derivation relationships between versions are captured in the form of a version graph 𝒢(𝒱, ℰ). A directed edge from V_{i} to V_{j} in 𝒢(𝒱, ℰ) represents that V_{j} was derived from V_{i} (either through an update operation, or through an explicit transformation). Since branching and merging are permitted in DataHub (admitting collaborative data science), 𝒢 is a DAG (directed acyclic graph) instead of a linear chain. For example, Figure 1 represents a version graph 𝒢, where V_{2} and V_{3} are derived from V_{1} separately, and then merged to form V_{5}.
Given a collection of versions 𝒱, we need to reason about the storage cost, i.e., the space required to store the versions, and the recreation cost, i.e., the time taken to recreate or retrieve the versions. For a version V_{i}, we can either:
Thus the storage and recreation costs can be represented using two matrices Δ and Φ: the entries along the diagonal represent the costs for the materialized versions, while the off-diagonal entries represent the costs for deltas. From this point forward, we focus our attention on these matrices: they capture all the relevant information about the versions for managing and retrieving them.
Notice that by changing the differencing algorithm, we can produce deltas of various types:
Furthermore, the deltas could be stored compressed or uncompressed. These delta variants lead to the various dimensions of the problem that we describe subsequently.
The reader may be wondering why we need to reason about two matrices Δ and Φ. In some cases, the two may be proportional to each other (e.g., if we are using uncompressed UNIX-style diffs). But in many cases, the storage cost of a delta and the recreation cost of applying that delta can be very different, especially if the deltas are stored in a compressed fashion. Furthermore, while the storage cost is straightforward to account for, in that it is proportional to the bytes required to store the deltas between versions, the recreation cost is more complicated: it can depend on the network bandwidth (if versions or deltas are stored remotely), the I/O bandwidth, and the computation costs (e.g., if decompression or running of a script is needed).
Figure 2 shows the matrices Δ and Φ based on the version graph in Figure 1. The annotation associated with the edge (V_{i}, V_{j}) in Figure 1 is essentially ⟨Δ_{i,j}, Φ_{i,j}⟩, whereas the vertex annotation for V_{i} is ⟨Δ_{i,i}, Φ_{i,i}⟩. If there is no edge from V_{i} to V_{j} in the version graph, we have two choices: we can either set the corresponding Δ and Φ entries to “−” (unknown) (as shown in the figure), or we can explicitly compute the values of those entries (by running a differencing algorithm). For instance, Δ_{3,2} = 1100 and Φ_{3,2} = 3200 are computed explicitly in the figure (the specific numbers reported here are fictitious and not the result of running any specific algorithm).
Before moving on to formally defining the basic optimization problem, we note several complications that present unique challenges in this scenario.
The storage cost matrix Δ may be symmetric or asymmetric depending on the specific differencing mechanism used for constructing deltas. For example, the XOR differencing function results in a symmetric Δ matrix, since the delta from a version V_{i} to V_{j} is identical to the delta from V_{j} to V_{i}. UNIX-style diffs, where line-by-line modifications are listed, can be either two-way (symmetric) or one-way (asymmetric). The asymmetry may be quite large. For instance, it may be possible to represent the delta from V_{i} to V_{j} very compactly, using a command like: delete all tuples with age > 60. However, the reverse delta from V_{j} to V_{i} is likely to be quite large, since all the tuples that were deleted from V_{i} would be part of that delta. In this paper, we consider both these scenarios. We refer to the scenarios where Δ is symmetric and where Δ is asymmetric as the undirected case and the directed case, respectively.
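The symmetry of XOR deltas is easy to see concretely; a small sketch (toy byte strings, not actual dataset versions):

```python
# XOR deltas are symmetric: the delta from a to b equals the delta
# from b to a, and XOR-ing it into either side recovers the other.
a = bytes([0b1010, 0b1100, 0b1111])
b = bytes([0b1010, 0b0100, 0b0111])

delta_ab = bytes(x ^ y for x, y in zip(a, b))
delta_ba = bytes(y ^ x for x, y in zip(b, a))

assert delta_ab == delta_ba                               # symmetric
assert bytes(x ^ d for x, d in zip(a, delta_ab)) == b     # recreate b from a
assert bytes(y ^ d for y, d in zip(b, delta_ab)) == a     # recreate a from b
```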
A second issue is the relationship between Φ and Δ. In many scenarios, it may be reasonable to assume that Φ is proportional to Δ. This is generally true for deltas that contain detailed line-by-line or cell-by-cell differences. It is also true if the system bottleneck is network communication or I/O cost. In a large number of cases, however, it may be more appropriate to treat them as independent quantities with no overt or known relationship. For the proportional case, we assume that the proportionality constant is 1 (i.e., Φ = Δ); the problem statements, algorithms and guarantees are unaffected by having a different constant proportionality factor. The other case is denoted by Φ ≠ Δ.
This leads us to identify three distinct cases with significantly diverse properties: (1) Scenario 1: Undirected case, Φ = Δ; (2) Scenario 2: Directed case, Φ = Δ; and (3) Scenario 3: Directed case, Φ ≠ Δ.
Given Δ, Φ, our goal is to find a good storage solution, i.e., we need to decide which versions to materialize and which versions to store as deltas from other versions. Let 𝒮 = {(i_{1}, j_{1}), (i_{2}, j_{2}), …} denote a storage solution. A pair with i_{k} = j_{k} indicates that the version V_{i_k} is materialized (i.e., stored explicitly in its entirety), whereas a pair (i_{k}, j_{k}), i_{k} ≠ j_{k}, indicates that we store a delta from V_{i_k} to V_{j_k}.
We require any solution we consider to be a valid solution, i.e., one from which it is possible to reconstruct any of the original versions. More formally, 𝒮 is a valid solution if and only if for every version V_{i}, there exists a sequence of distinct versions V_{l_1}, …, V_{l_k} = V_{i} such that (l_{1}, l_{1}), (l_{1}, l_{2}), (l_{2}, l_{3}), …, (l_{k−1}, l_{k}) are contained in 𝒮 (in other words, there is a version V_{l_1} that is materialized and can be used to recreate V_{i} through a chain of deltas).
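The validity condition can be checked mechanically: a solution is valid iff every version is reachable from some materialized version along stored deltas. A sketch (the delta edges chosen for V_4 below are our assumption, consistent with the recreation path given for V_5 in Figure 1(iii)):

```python
from collections import defaultdict, deque

def is_valid(solution, versions):
    """Return True iff every version can be recreated.

    `solution` is a set of pairs (i, j): (i, i) means version i is
    materialized; (i, j) with i != j means a delta from i to j is stored.
    """
    materialized = {i for (i, j) in solution if i == j}
    children = defaultdict(list)
    for i, j in solution:
        if i != j:
            children[i].append(j)
    # BFS from all materialized versions along stored deltas.
    reachable, queue = set(materialized), deque(materialized)
    while queue:
        u = queue.popleft()
        for v in children[u]:
            if v not in reachable:
                reachable.add(v)
                queue.append(v)
    return reachable == set(versions)

# V1 materialized; V2, V3 as deltas from V1; V4, V5 as deltas from V3.
sol = {(1, 1), (1, 2), (1, 3), (3, 4), (3, 5)}
print(is_valid(sol, [1, 2, 3, 4, 5]))   # True
print(is_valid({(1, 2)}, [1, 2]))       # False: nothing is materialized
```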
We can now formally define the optimization goals:
We now state the problem formulations that we consider in this paper, starting with two base cases that represent two extreme points in the spectrum of possible problems.
Given Δ, Φ, find a valid solution 𝒮 such that the total storage cost is minimized.
Given Δ, Φ, identify a valid solution 𝒮 such that ∀i, the recreation cost R_{i} is minimized.
The above two formulations minimize either the storage cost or the recreation cost, without worrying about the other. It may appear that the second formulation is not well-defined and we should instead aim to minimize the average recreation cost across all versions. However, the (simple) solution that minimizes the average recreation cost also naturally minimizes R_{i} for each version.
In the next two formulations, we want to minimize (a) the sum of recreation costs over all versions (Σ_{i} R_{i}), or (b) the maximum recreation cost across all versions (max_{i} R_{i}), under the constraint that the total storage cost is smaller than some threshold β. These problems are relevant when the storage budget is limited.
Given Δ, Φ and a threshold β, identify a valid solution 𝒮 such that the total storage cost is at most β, and Σ_{i} R_{i} is minimized.
Given Δ, Φ and a threshold β, identify a valid solution 𝒮 such that the total storage cost is at most β, and max_{i} R_{i} is minimized.
The next two formulations seek to instead minimize the total storage cost given a constraint on the sum of recreation costs or max recreation cost. These problems are relevant when we want to reduce the storage cost, but must satisfy some constraints on the recreation costs.
Given Δ, Φ and a threshold θ, identify a valid solution 𝒮 such that Σ_{i} R_{i} ≤ θ, and the total storage cost is minimized.
Given Δ, Φ and a threshold θ, identify a valid solution 𝒮 such that max_{i} R_{i} ≤ θ, and the total storage cost is minimized.
In this section, we map our problem to a graph problem; this will help us adapt and modify algorithms from well-studied problems such as minimum spanning tree construction and delay-constrained scheduling. Given the matrices Δ and Φ, we construct a directed, edge-weighted graph G = (V, E) representing the relationships among the different versions as follows. For each version V_{i}, we create a vertex V_{i} in G. In addition, we create a dummy vertex V_{0} in G. For each V_{i}, we add an edge V_{0} → V_{i} and assign it the edge weight ⟨Δ_{i,i}, Φ_{i,i}⟩. Next, for each Δ_{i,j} ≠ ∞, we add an edge V_{i} → V_{j} with edge weight ⟨Δ_{i,j}, Φ_{i,j}⟩.
The resulting graph G is similar to the original version graph, but with several important differences. An edge in the version graph indicates a derivation relationship, whereas an edge in G simply indicates that it is possible to recreate the target version using the source version and the associated edge delta (in fact, ideally G is a complete graph). Unlike the version graph, G may contain cycles, and it also contains the special dummy vertex V_{0}. Additionally, in the version graph, if a version V_{i} has multiple in-edges, it is the result of a user/application merging changes from multiple versions into V_{i}. In contrast, multiple in-edges in G capture the multiple choices that we have in recreating V_{i} from other versions.
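The construction of G from Δ and Φ can be sketched as follows (function and variable names are ours, and ∞ marks missing deltas):

```python
import math

def build_graph(delta, phi):
    """Build the directed graph G of Section 2.

    delta[i][i] / phi[i][i] hold materialization costs; off-diagonal
    entries hold delta costs, with math.inf where no delta exists.
    Version i (0-based in the matrices) becomes vertex i + 1; vertex 0
    is the dummy root. Each edge carries (storage cost, recreation cost).
    """
    n = len(delta)
    edges = {}
    for i in range(n):
        edges[(0, i + 1)] = (delta[i][i], phi[i][i])   # materialize V_{i+1}
        for j in range(n):
            if i != j and delta[i][j] != math.inf:
                edges[(i + 1, j + 1)] = (delta[i][j], phi[i][j])
    return edges

# Toy 2-version example (illustrative numbers, not the paper's):
E = build_graph([[10, 2], [3, 12]], [[10, 4], [5, 12]])
print(E[(0, 1)])   # (10, 10): store V_1 in its entirety
print(E[(1, 2)])   # (2, 4): delta from V_1 to V_2
```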
Given graph G = (V, E), the goal of each of our problems is to identify a storage graph G_{s} = (V_{s}, E_{s}), a subset of G, favorably balancing total storage cost and the recreation cost for each version. Implicitly, we will store all versions and deltas corresponding to edges in this storage graph. (We explain this in the context of the example below.) We say a storage graph G_{s} is feasible for a given problem if (a) each version can be recreated based on the information contained or stored in G_{s}, (b) the recreation cost or the total storage cost meets the constraint listed in each problem.
Given the matrices Δ and Φ in Figures 2(i) and 2(ii), the corresponding graph G is shown in Figure 3. Every version is reachable from V_{0}. For example, edge (V_{0}, V_{1}) is weighted with ⟨Δ_{1,1}, Φ_{1,1}⟩ = ⟨10000, 10000⟩; edge (V_{3}, V_{5}) is weighted with ⟨Δ_{3,5}, Φ_{3,5}⟩ = ⟨800, 2500⟩. Figure 4 shows a feasible storage graph for the G in Figure 3, where V_{1} and V_{3} are materialized (since the edges from V_{0} to V_{1} and V_{3} are present) while V_{2}, V_{4} and V_{5} are stored as modifications from other versions.
After mapping our problem into a graph setting, we have the following lemma.
The optimal storage graph G_{s} = (V_{s}, E_{s}) for all 6 problems listed above must be a spanning tree T rooted at dummy vertex V_{0} in graph G.^{1}
Recall that a spanning tree is a subgraph in which every vertex is connected and reachable, and which has no cycles. For Problems 1 and 2, we have the following observations. A shortest path tree is a spanning tree in which the path from the root to each vertex is a shortest path between those two vertices in the original graph: it simply consists of the edges explored in an execution of Dijkstra’s shortest path algorithm.
The optimal storage graph G_{s} for Problem 1 is a minimum spanning tree of G rooted at V_{0}, considering only the weights Δ_{i,j}.
The optimal storage graph G_{s} for Problem 2 is a shortest path tree of G rooted at V_{0}, considering only weights Φ_{i,j}.
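To illustrate the second observation, a Dijkstra-based sketch that computes the shortest path tree over the recreation costs Φ (vertex 0 is the dummy root; the toy weights below are ours):

```python
import heapq
import math

def shortest_path_tree(phi_edges, n):
    """Dijkstra from vertex 0; phi_edges[(u, v)] is the recreation cost.

    Returns parent pointers (the SPT edges) and the recreation cost of
    each version under that tree.
    """
    adj = {u: [] for u in range(n + 1)}
    for (u, v), w in phi_edges.items():
        adj[u].append((v, w))
    dist = {u: math.inf for u in range(n + 1)}
    parent = {}
    dist[0] = 0
    pq = [(0, 0)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist[u]:
            continue          # stale queue entry
        for v, w in adj[u]:
            if d + w < dist[v]:
                dist[v] = d + w
                parent[v] = u
                heapq.heappush(pq, (dist[v], v))
    return parent, dist

# Toy instance: recreating V_2 via V_1 (5 + 3) beats materializing it (9).
phi = {(0, 1): 5, (0, 2): 9, (0, 3): 4, (1, 2): 3, (3, 2): 6}
parent, dist = shortest_path_tree(phi, 3)
print(parent)   # {1: 0, 2: 1, 3: 0}
print(dist[2])  # 8
```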
We present an ILP formulation of the optimization problems described above. Here, we take Problem 6 as an example; the other problems are similar. Let x_{i,j} be a binary variable for each edge (V_{i}, V_{j}) ∈ E, indicating whether edge (V_{i}, V_{j}) is in the storage graph or not. Specifically, x_{0,j} = 1 indicates that version V_{j} is materialized, while x_{i,j} = 1, i ≠ 0, indicates that the modification from version i to version j is stored. Let r_{i} be a continuous variable for each vertex V_{i} ∈ V, where r_{0} = 0; r_{i} captures the recreation cost for version i (and must be ≤ θ).
minimize Σ_{(V_i, V_j) ∈ E} x_{i,j} × Δ_{i,j}, subject to:

(1) Σ_{i : (V_i, V_j) ∈ E} x_{i,j} = 1, ∀j ≠ 0 (every version is stored exactly once, either materialized or as a delta);
(2) if x_{i,j} = 1 then r_{j} ≥ r_{i} + Φ_{i,j} (recreation costs accumulate along stored deltas);
(3) r_{0} = 0 and r_{i} ≤ θ, ∀i (every recreation cost is within the threshold).
Problem 6 is equivalent to the optimization problem described above.
Note, however, that the general form of an ILP does not permit an if-then statement (as in (2) above). Instead, we can transform it to the general form with the aid of a large constant C; constraint (2) can be expressed as follows:

Φ_{i,j} + r_{i} − r_{j} ≤ C × (1 − x_{i,j})

where C is a “sufficiently large” constant such that no additional constraint is added to the model; for instance, C here can be set to 2θ. On one hand, if x_{i,j} = 1, then Φ_{i,j} + r_{i} − r_{j} ≤ 0. On the other hand, if x_{i,j} = 0, then Φ_{i,j} + r_{i} − r_{j} ≤ C. Since C is “sufficiently large”, no additional constraint is imposed.
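The equivalence of the if-then constraint and its big-C linearization can be checked exhaustively on small integer grids; a sketch (assuming Φ_{i,j}, r_i, r_j are all bounded by θ, so C = 2θ suffices):

```python
def if_then_holds(x, phi, r_i, r_j):
    """Constraint (2): x = 1 implies r_j >= r_i + phi."""
    return x == 0 or phi + r_i - r_j <= 0

def linearized_holds(x, phi, r_i, r_j, C):
    """Big-C form: phi + r_i - r_j <= C * (1 - x)."""
    return phi + r_i - r_j <= C * (1 - x)

theta = 10
C = 2 * theta
# The two formulations accept exactly the same assignments on the grid.
for x in (0, 1):
    for phi in range(theta + 1):
        for r_i in range(theta + 1):
            for r_j in range(theta + 1):
                assert if_then_holds(x, phi, r_i, r_j) == \
                       linearized_holds(x, phi, r_i, r_j, C)
print("equivalent on the whole grid")
```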
In this section, we study the complexity of the problems listed in Table 1 under different application scenarios.
As discussed in Section 2, Problems 1 and 2 can be solved in polynomial time by directly applying a minimum spanning tree algorithm (Kruskal’s algorithm or Prim’s algorithm for undirected graphs; Edmonds’ algorithm [35] for directed graphs) and Dijkstra’s shortest path algorithm, respectively. Kruskal’s algorithm has time complexity O(E log V), while Prim’s algorithm has time complexity O(E log V) when a binary heap is used to implement the priority queue, and O(E + V log V) with a Fibonacci heap. The running time of Edmonds’ algorithm is O(EV) and can be reduced to O(E + V log V) with a faster implementation. Similarly, Dijkstra’s algorithm for constructing the shortest path tree from the root runs in O(E log V) with a binary-heap-based priority queue and in O(E + V log V) with a Fibonacci-heap-based priority queue.
Next, we show that Problems 5 and 6 are NP-hard even for the special case where Δ = Φ and Φ is symmetric. This will lead to hardness proofs for the other variants.
The primary challenge that we encounter while demonstrating hardness is that our deltas must obey the triangle inequality: unlike settings where edge weights are arbitrary, in our case deltas represent actual modifications that can be stored, and so they must obey additional realistic constraints. This causes severe complications in proving hardness, often transforming the proofs from very simple to fairly challenging.
Consider the scenario where Δ = Φ and Φ is symmetric. We take Δ as an example. The triangle inequality can be stated as follows:

Δ_{p,q} ≤ Δ_{p,w} + Δ_{w,q}
Δ_{p,q} ≥ Δ_{p,p} − Δ_{q,q}

where p, q, w ∈ V and p ≠ q ≠ w. The first inequality states that the “delta” between two versions cannot exceed the total “deltas” along any two-hop path with the same start and end vertices, while the second inequality indicates that the “delta” between two versions must be at least one version’s full storage cost minus the other version’s full storage cost. Since each tuple and modification is recorded explicitly when Φ is symmetric, it is natural that these two inequalities hold.
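These two conditions are easy to check mechanically for any candidate Δ matrix; a small sketch (0-based indices; diagonal entries are full storage costs; the toy matrices are ours):

```python
import itertools

def obeys_triangle_inequality(delta):
    """Check both inequalities for a symmetric delta matrix."""
    n = len(delta)
    for p, q in itertools.permutations(range(n), 2):
        # delta between two versions is at least the difference of
        # their full storage costs
        if delta[p][q] < delta[p][p] - delta[q][q]:
            return False
        # a direct delta can never beat going through an intermediate version
        for w in range(n):
            if w != p and w != q and delta[p][q] > delta[p][w] + delta[w][q]:
                return False
    return True

ok  = [[10, 3], [3, 9]]    # 3 >= 10 - 9: fine
bad = [[10, 1], [1, 5]]    # 1 <  10 - 5: violates the second inequality
print(obeys_triangle_inequality(ok))    # True
print(obeys_triangle_inequality(bad))   # False
```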
We now demonstrate hardness.
Problem 6 is NP-hard when Δ = Φ and Φ is symmetric.
Here we prove NP-hardness using a reduction from the set cover problem. Recall that in the set cover problem, we are given m sets S = {s_{1}, s_{2}, …, s_{m}} and n items T = {t_{1}, t_{2}, …, t_{n}}, where each set s_{i} covers some items, and the goal is to pick a collection 𝔽 ⊆ S of sets such that ∪_{F ∈ 𝔽} F = T while minimizing k = |𝔽|.
Given a set cover instance, we now construct an instance of Problem 6 that will provide a solution to the original set cover problem. The threshold we will use in Problem 6 will be (β + 1)α, where β, α are constants that are each greater than 2(m + n). (This is just to ensure that they are “large”.) We now construct the graph G(V, E) in the following way; we display the constructed graph in Figure 5. Our vertex set V is as follows:
We add the two dummy vertices simply to ensure that v_{0} is materialized, as we will see later. We now define the storage cost for materializing each vertex in V in the following way:
(These are the numbers colored blue in the tree of Figure 5(b).) As we can see above, we have set the costs in such a way that the vertex v_{0} and the vertices corresponding to sets in S have low materialization cost, while the other vertices have high materialization cost: this is by design, so that we only end up materializing the low-cost vertices. Our edge set E is now as follows.
It is easy to show that our constructed graph G obeys the triangle inequality.
Consider a solution to Problem 6 on the constructed graph G. We now demonstrate that this solution leads to a solution of the original set cover problem. Our proof proceeds in four key steps:
Thus, minimizing the total storage cost is equivalent to minimizing k in set cover problem.
We now show that Problem 5 is NP-hard as well. The general philosophy is similar to the proof of Lemma 5, except that we create c dummy vertices instead of the two dummy vertices v_{1}, v_{2} in Lemma 5, where c is sufficiently large; this is once again to ensure that v_{0} is materialized. The detailed proof can be found in the extended technical report [13].
Problem 5 is NP-hard when Δ = Φ and Φ is symmetric.
Since Problem 4 swaps the constraint and the objective of Problem 6, it is similarly NP-hard. (Note that the decision versions of the two problems are in fact identical, and therefore the proof still applies.) Similarly, Problem 3 is also NP-hard. Now that we have proved NP-hardness even in the special case where Δ = Φ and Φ is symmetric, we can conclude that Problems 3, 4, 5, and 6 are NP-hard in the more general settings where Φ is not symmetric and Δ ≠ Φ, as listed in Table 1.
In the extended technical report, we also consider a variant of the problem where Δ ≠ Φ but the recreation cost Φ_{ij} is the same for all pairs of versions, so that a version’s recreation cost is simply the number of hops or delta operations needed to reconstruct it. This hop-based variant is interesting because it is related to a special case of the d-Minimum-Steiner-Tree problem, namely the d-Minimum-Spanning-Tree problem, i.e., identifying the smallest spanning tree whose diameter is bounded by d. There has been some work on the d-Minimum-Spanning-Tree problem [11, 17, 24], including demonstrating hardness for d-Minimum-Spanning-Tree (using a reduction from SAT), and also demonstrating hardness of approximation.
Since the hop-based variant is a special case of the last column of Table 1, this indicates that Problem 6 in the most general case is similarly hard to approximate; we suspect similar results hold for the other problems as well. It remains to be seen whether hardness of approximation can be demonstrated for the variants in the second- and third-to-last columns.
As discussed in Section 2, our different application scenarios lead to different problem formulations, spanning different constraints and objectives, and different assumptions about the nature of Φ, Δ.
Given that we demonstrated in the previous section that all the problems are NP-hard, we focus on developing efficient heuristics. In this section, we present two novel heuristics: first, in Section 4.1, we present LMG, the Local Move Greedy algorithm, tailored to the case when there is a bound or objective on the average recreation cost; thus, it applies to Problems 3 and 5. Second, in Section 4.2, we present MP, the Modified Prim’s algorithm, tailored to the case when there is a bound or objective on the maximum recreation cost; thus, it applies to Problems 4 and 6. We present two variants of the MP algorithm tailored to two different settings.
Then, we present two further algorithms: in Section 4.3, an approximation algorithm called LAST, and in Section 4.4, an algorithm called GitH, which is based on Git repack. Both are adapted from the literature to fit our problems, and we compare them against our algorithms in Section 5. Note that LAST and GitH do not explicitly optimize any objectives or constraints in the manner of LMG or MP, and thus the four algorithms are applicable under different settings: LMG and MP are applicable when there is a bound or constraint on the average or maximum recreation cost, while LAST and GitH are applicable when a “good enough” solution is needed. Furthermore, note that all these algorithms apply to both the directed and undirected versions of the problems, and to the symmetric and asymmetric cases.
The pseudocodes for the algorithms can be found in our extended technical report [13].
The LMG algorithm is applicable when we have a bound or constraint on the average case recreation cost. We focus on the case where there is a constraint on the storage cost (Problem 3); the case when there is no such constraint (Problem 5) can be solved by repeated iterations and binary search on the previous problem.
At a high level, the algorithm starts with the Minimum Spanning Tree (MST) as G_{S}, and then greedily adds edges from the Shortest Path Tree (SPT) that are not present in G_{S}, while G_{S} respects the bound on storage cost.
The algorithm starts off with G_{S} equal to the MST. The SPT naturally contains all the edges corresponding to complete versions. The basic idea of the algorithm is to replace deltas in G_{S} with versions from the SPT that maximize the following ratio:

ρ = (reduction in total recreation cost) / (increase in total storage cost)

That is, ρ is simply the reduction in total recreation cost per unit addition of weight to the storage graph G_{S}.
Let ξ consist of the edges in the SPT that are not present in G_{S} (these precisely correspond to the versions that are not explicitly stored in the MST and are instead computed via deltas). At each “round”, we pick the edge e_{uv} ∈ ξ that maximizes ρ, and use it to replace the previous edge e_{u′v} into v. The reduction in the sum of the recreation costs is computed by adding up the reductions in the recreation costs of all w ∈ G_{S} that are descendants of v in the storage graph (including v itself). On the other hand, the increase in storage cost is simply the weight of e_{uv} minus the weight of e_{u′v}. This process is repeated as long as the storage budget is not violated. We explain this by means of an example.
Figure 6(a) depicts the current G_{S}; node 0 corresponds to the dummy node. Suppose we are considering replacing edge e_{14} with edge e_{04}, that is, replacing the delta to version 4 with version 4 itself. The denominator of ρ is then simply Δ_{04} − Δ_{14}, and the numerator is the change in the recreation costs of versions 4, 5, and 6 (notice that 5 and 6 lie below 4 in the tree). This is simple to compute: it is three times the change in the recreation cost of version 4, since the swap affects all three versions equally. Thus, the numerator of ρ is simply 3 × (Φ_{01} + Φ_{14} − Φ_{04}).
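With hypothetical numbers (Figure 6’s actual values are not given in the text), the ratio for this swap would be computed as:

```python
# Hypothetical costs for swapping e_14 for e_04 (assumed values,
# purely for illustration).
phi_01, phi_14, phi_04 = 100, 40, 110   # recreation costs
delta_04, delta_14 = 900, 300           # storage costs
affected = 3                            # versions 4, 5 and 6

numerator = affected * (phi_01 + phi_14 - phi_04)   # recreation saved
denominator = delta_04 - delta_14                   # storage added
rho = numerator / denominator
print(rho)   # 3 * 30 / 600 = 0.15
```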
Our overall complexity is O(V ^{2}). We provide details in the technical report.
Note that the algorithm can easily take into account access frequencies of different versions and instead optimize for the total weighted recreation cost (weighted by access frequencies). The algorithm is similar, except that the numerator of ρ will capture the reduction in weighted recreation cost.
Next, we introduce a heuristic algorithm based on Prim’s algorithm for minimum spanning trees, targeting Problem 6, where the goal is to reduce the total storage cost while keeping the recreation cost of each version within the threshold θ; the solution for Problem 4 is similar.
At a high level, the algorithm is a variant of Prim’s algorithm, greedily adding the version with smallest storage cost and the corresponding edge to form a spanning tree T. Unlike Prim’s algorithm where the spanning tree simply grows, in this case, even if an edge is present in T, it could be removed in future iterations. At all stages, the algorithm maintains the invariant that the recreation cost of all versions in T is bounded within θ.
At each iteration, the algorithm picks the version V_{i} with the smallest storage cost to be added to the tree. Once this version V_{i} is added, we consider adding all deltas to all other versions V_{j} such that their recreation cost through V_{i} is within the constraint θ, and the storage cost does not increase. Each version maintains a pair l(V_{i}) and d(V_{i}): l(V_{i}) denotes the marginal storage cost of V_{i}, while d(V_{i}) denotes the total recreation cost of V_{i}. At the start, l(V_{i}) is simply the storage cost of V_{i} in its entirety.
We now describe the algorithm in detail. Set X represents the version set of the current spanning tree T. Initially X = ∅. In each iteration, the version V_{i} with the smallest storage cost l(V_{i}) in the priority queue PQ is picked and added to the spanning tree T. When V_{i} is added to T, we need to update the storage cost and recreation cost for all V_{j} that are neighbors of V_{i}. Notice that in Prim’s algorithm, we do not need to consider neighbors that are already in T. However, in our scenario a better path to such a neighbor may be found, and this may result in an update. For instance, if edge (V_{i}, V_{j}) can make V_{j}’s storage cost smaller while the recreation cost for V_{j} does not increase, we update p(V_{j}) = V_{i} as well as d(V_{j}), l(V_{j}) and T. For neighbors V_{j} ∉ T, we update d(V_{j}), l(V_{j}), p(V_{j}) if edge (V_{i}, V_{j}) can make V_{j}’s storage cost smaller and the recreation cost for V_{j} is no bigger than θ.
Say we operate on the graph G given by Figure 7, and let the threshold θ be 6. Each version V_{i} is associated with a pair ⟨l(V_{i}), d(V_{i})⟩. Initially, version V_{0} is pushed into the priority queue. When V_{0} is dequeued, each neighbor V_{j} updates ⟨l(V_{j}), d(V_{j})⟩ as shown in Figure 9(a). Notice that l(V_{i}), for all i ≠ 0, is simply the storage cost for that version. For example, when considering edge (V_{0}, V_{1}), we set l(V_{1}) = 3 and d(V_{1}) = 3, since the recreation cost (if V_{1} is stored in its entirety) is smaller than the threshold θ, i.e., 3 < 6. Afterwards, versions V_{1}, V_{2} and V_{3} are inserted into the priority queue. Next, we dequeue V_{1}, since l(V_{1}) is smallest among the versions in the priority queue, and add V_{1} to the spanning tree. We then update ⟨l(V_{j}), d(V_{j})⟩ for all neighbors of V_{1}; e.g., the recreation cost for version V_{2} would be 6 and the storage cost 2 when considering edge (V_{1}, V_{2}). Since 6 ≤ 6, ⟨l(V_{2}), d(V_{2})⟩ is updated to ⟨2, 6⟩ as shown in Figure 9(b); however, ⟨l(V_{3}), d(V_{3})⟩ is not updated, since the recreation cost would be 3 + 4 > 6 when considering edge (V_{1}, V_{3}). Subsequently, version V_{2} is dequeued because it has the lowest l(V_{2}), and is added to the tree, giving Figure 9(b). Version V_{3} is dequeued next. When V_{3} is dequeued from PQ, ⟨l(V_{2}), d(V_{2})⟩ is updated again: the storage cost for V_{2} can be reduced to 1 while the recreation cost stays 6 when considering edge (V_{3}, V_{2}), even though V_{2} is already in T, as shown in Figure 9(c). Eventually, we get the final answer in Figure 9(d).
The complexity of the algorithm is the same as that of Prim’s algorithm, i.e., O(|E| log |V|).
Here, we sketch an algorithm from previous work [21] that enables us to find a tree with a good balance of storage and recreation costs, under the assumptions that Δ = Φ and Φ is symmetric.
The algorithm, which takes a parameter α as input, starts with a minimum spanning tree and does a depth-first traversal (DFS) on it. When visiting V_{i} during the traversal, if it finds that the recreation cost for V_{i} exceeds α times the cost of the shortest path from V_{0} to V_{i}, then the current path is replaced with that shortest path. It can be shown that the total cost of the resulting spanning tree is within (1 + 2/(α − 1)) times the cost of the minimum spanning tree of G. Even though the algorithm was proposed for undirected graphs, it can be applied to the directed graph case, but without any comparable guarantees. We refer the reader to the full version for more details and pseudocode [13].
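A simplified sketch of this traversal is given below, under stated assumptions: we graft an over-budget node directly onto its shortest-path-tree parent, and we omit the back-edge relaxation step of the published algorithm; the input encoding (children lists, weight map) is ours.

```python
def last_tree(n, mst_children, mst_w, spt_parent, sp, alpha, root=0):
    """Simplified LAST-style sketch: walk the MST depth-first; whenever a
    node's accumulated recreation cost exceeds alpha times its shortest-
    path cost from the root, graft the node onto its shortest-path-tree
    parent instead.

    mst_children[u] : children of u in the MST
    mst_w[(u, v)]   : weight of MST edge (u, v)
    spt_parent[v]   : parent of v in the shortest-path tree
    sp[v]           : shortest-path distance from the root to v
    """
    parent = [None] * n
    d = [0.0] * n                 # recreation cost under the current tree
    stack = [root]
    while stack:
        u = stack.pop()
        for v in mst_children[u]:
            parent[v] = u
            d[v] = d[u] + mst_w[(u, v)]
            if d[v] > alpha * sp[v]:
                # recreation cost too high: switch to the shortest path
                parent[v] = spt_parent[v]
                d[v] = sp[v]
            stack.append(v)       # descendants see the updated d[v]
    return parent, d
```

With a large α the MST is returned unchanged; as α approaches 1 the result degenerates toward the shortest-path tree.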
Figure 10(a) is the minimum spanning tree (MST) rooted at node V_{0} of the graph G in Figure 8. The approximation threshold α is set to 2. The algorithm starts with the MST and conducts a depth-first traversal of it from the root V_{0}. When visiting node V_{2}, d(V_{2}) = 3 and the shortest path to node V_{2} has cost 3; thus 3 < 2 × 3 and no change is made. We continue the traversal and visit V_{3}. When visiting V_{3}, d(V_{3}) = 8 > 2 × 3, where 3 is the cost of the shortest path to V_{3} in G. Thus, d(V_{3}) is set to 3 and p(V_{3}) is set to V_{0}, replacing the tree path with the shortest path (V_{0}, V_{3}), as shown in Figure 10(b). Afterwards, the back-edge (V_{3}, V_{1}) is traversed in the MST. Since 3 + 2 < 6, where 3 is the current value of d(V_{3}), 2 is the weight of edge (V_{3}, V_{1}), and 6 is the current value of d(V_{1}), d(V_{1}) is updated to 5 and p(V_{1}) is updated to V_{3}. Lastly, node V_{4} is visited and d(V_{4}) is first updated to 7. Since 7 < 2 × 4, lines 9–11 of the pseudocode are not executed. Figure 10(c) shows the resulting spanning tree, in which the recreation cost of every node satisfies the constraint and the total storage cost is 3 + 3 + 2 + 2 = 10.
The complexity of the algorithm is O(|E| log |V|). Further details can be found in the technical report.
This heuristic is an adaptation of the heuristic currently used by Git, and we refer to it as GitH. We sketch the algorithm here and refer the reader to the extended version for more details [13]. GitH uses two parameters: w (window size) and d (max depth).
We consider the versions in nonincreasing order of their sizes. The first version in this ordering is chosen as the root of the storage graph and has depth 0 (i.e., it is materialized). At all times, we maintain a sliding window containing at most w versions. For each version V_{i} after the first one, let V_{l} denote a version in the current window. We compute ${\Delta}_{l,i}^{\prime} = {\Delta}_{l,i}/(d - {d}_{l})$, where d_{l} is the depth of V_{l} (thus deltas from shallow parents are preferred over slightly smaller deltas from deeper parents). We find the version V_{j} in the window with the lowest value of this quantity and choose it as V_{i}’s parent (as long as d_{j} < d). The depth of V_{i} is then set to d_{j} + 1. The sliding window is then modified: V_{j} is moved to the end of the window (so it will stay in the window longer), V_{i} is added to the window, and the version at the beginning of the window is dropped.
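The steps above can be sketched as follows. This is our reading of the description, not Git's actual implementation: `delta(a, b)` is an assumed callable returning the delta size from a to b, and the exact window bookkeeping (what is moved, what is evicted) follows the text as we interpret it.

```python
from collections import deque

def gith(versions, delta, w, dmax):
    """Sliding-window parent selection in the style of GitH.
    `versions` must already be sorted in nonincreasing size order."""
    root = versions[0]
    parent, depth = {root: None}, {root: 0}   # the root is materialized
    window = deque([root])
    for vi in versions[1:]:
        best, best_score = None, float('inf')
        for vl in window:
            if depth[vl] >= dmax:
                continue                      # parent would exceed max depth
            # normalized delta: prefer shallow parents over slightly
            # smaller deltas from deeper parents
            score = delta(vl, vi) / (dmax - depth[vl])
            if score < best_score:
                best, best_score = vl, score
        if best is None:                      # no eligible parent: materialize
            parent[vi], depth[vi] = None, 0
        else:
            parent[vi], depth[vi] = best, depth[best] + 1
            window.remove(best)               # move chosen parent to the end
            window.append(best)
        window.append(vi)
        while len(window) > w:
            window.popleft()
    return parent, depth
```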
The running time of the heuristic is O(|V| log |V| + w|V|), excluding the time to construct deltas.
We have built a prototype version management system that will serve as a foundation for DataHub [12]. The system provides a subset of a Git/SVN-like interface for dataset versioning. Users interact with the version management system in a client-server model over HTTP. The server is implemented in Java and is responsible for storing the version history of the repository as well as the actual files in it. The client is implemented in Python and provides functionality to create (commit) and check out versions of datasets, and to create and merge branches. Note that, unlike traditional VCSs, which make a best effort to perform automatic merges, our system lets the user perform the merge and notify the system by creating a version with more than one parent.
In the following sections, we present an extensive evaluation of our algorithms using a combination of synthetic and derived real-world datasets. Apart from the algorithms described above, LMG and LAST require both an SPT and an MST as input. For both directed and undirected graphs, we use Dijkstra’s algorithm to find the single-source shortest-path tree (SPT). We use Prim’s algorithm to find the minimum spanning tree for undirected graphs. For directed graphs, we use an implementation [3] of Edmonds’ algorithm [35] for computing the min-cost arborescence (MCA). We ran all our experiments on a 2.2 GHz Intel Xeon CPU E5-2430 server with 64 GB of memory, running 64-bit Red Hat Enterprise Linux 6.5.
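For concreteness, the SPT input can be computed with a standard heap-based Dijkstra; the sketch below (names and adjacency format ours) returns the parent-pointer tree that LMG and LAST consume.

```python
import heapq

def shortest_path_tree(n, adj, root=0):
    """Single-source shortest-path tree via Dijkstra's algorithm.
    adj[u] = list of (v, w) edges with nonnegative weights w."""
    INF = float('inf')
    dist, parent = [INF] * n, [None] * n
    dist[root] = 0
    pq = [(0, root)]
    while pq:
        du, u = heapq.heappop(pq)
        if du > dist[u]:
            continue                  # stale entry; a shorter path was found
        for v, w in adj[u]:
            if du + w < dist[v]:
                dist[v], parent[v] = du + w, u
                heapq.heappush(pq, (dist[v], v))
    return parent, dist
```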
We use four datasets: two synthetic and two derived from real-world source code repositories. Although there are many publicly available source code repositories with large numbers of commits (e.g., on GitHub), those repositories typically contain fairly small (source code) files, and further, the changes between versions tend to be localized and typically very small; we expect dataset versions generated during collaborative data analysis to contain much larger datasets and to exhibit larger changes between versions. We were unable to find any realistic workloads of that kind.
Hence, we generated realistic dataset versioning workloads as follows. First, we wrote a synthetic version generator suite, driven by a small set of parameters, that is able to generate a variety of version histories and corresponding datasets. Second, we created two real-world datasets using publicly available forks of popular repositories on GitHub. We describe each of the two below.
Our synthetic dataset generation suite^{2} takes a two-step approach to generating a dataset, which we sketch below. The first step is to generate a version graph with the desired structure, controlled by the following parameters:
Once a version graph is generated, the second step is to generate the appropriate versions and compute the deltas. The files in our synthetic dataset are ordered CSV files (containing tabular data), and we use deltas based on UNIX-style diffs. The previous step also annotates each edge (u, v) in the version graph with edit commands that can be used to produce v from u. Edit commands are combinations of the following six instructions: add/delete a set of consecutive rows, add/remove a column, and modify a subset of rows/columns.
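As a rough illustration of how a line-based delta size can be estimated between two such CSV versions, the sketch below uses Python's `difflib` as a stand-in for UNIX diff; the per-command overhead constant is an arbitrary assumption, not a value from our suite.

```python
import difflib

def delta_size(old_lines, new_lines, cmd_overhead=8):
    """Estimate the size of a UNIX-style diff delta between two ordered
    files: the delta must record every inserted or replacement line in
    full, plus a small fixed cost per edit command (assumed constant)."""
    sm = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    size = 0
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op == 'equal':
            continue                                   # nothing to record
        size += cmd_overhead                           # one edit command
        size += sum(len(l) for l in new_lines[j1:j2])  # payload lines
    return size
```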
Using this, we generated two synthetic datasets (Figure 11):
We use 986 forks of the Twitter Bootstrap repository and 100 forks of the Linux repository to derive our real-world workloads. For each repository, we check out the latest version in each fork and concatenate all files in it (by traversing the directory structure in lexicographic order). Thereafter, we compute deltas between all pairs of versions in a repository, provided the size difference between the versions under consideration is less than a threshold. We set this threshold to 100 KB for the Twitter Bootstrap repository and 10 MB for the Linux repository. This gives us two real-world datasets, Bootstrap Forks (BF) and Linux Forks (LF), with properties shown in Figure 11.
We begin by evaluating the performance of two popular version control systems, SVN (v1.8.8) and Git (v1.7.1), using the LF dataset. We create an FSFS-type repository in SVN, which is more space-efficient than a Berkeley DB-based repository [4]. We then import the entire LF dataset into the repository in a single commit. The amount of space occupied by the db/revs/ directory is around 8.5 GB, and it takes around 48 minutes to complete the import. We contrast this with the naive approach of applying gzip to the files, which results in a total compressed storage of 10.2 GB. In the case of Git, we add and commit the files in the repository and then run git repack -a -d --depth=50 --window=50 on the repository^{3}. The size of the Git pack file is 202 MB, although the repack consumes 55 GB of memory and takes 114 minutes (for higher window sizes, Git fails to complete the repack as it runs out of memory).
In comparison, the solution found by the MCA algorithm occupies 516 MB of compressed storage (2.24 GB when uncompressed) when using UNIX diff for computing the deltas. To make a fair comparison with Git, we use xdiff from the LibXDiff library [7] for computing the deltas; it forms the basis of Git’s delta computing routine. Using xdiff brings the total storage cost down to just 159 MB. The total time taken is around 102 minutes; this includes the time taken to compute the deltas and then to find the MCA of the corresponding graph.
The main reason behind SVN’s poor performance is its use of “skip-deltas” to ensure that at most O(log n) deltas are needed for reconstructing any version [8]; this leads it to repeatedly store redundant delta information, as a result of which the total space requirement increases significantly. The heuristic used by Git is much better than SVN’s (Section 4.4). However, as we show later (Fig. 12), our implementation of that heuristic (GitH) requires more storage than LMG for guaranteeing similar recreation costs.
We begin with a comprehensive evaluation of the three algorithms, LMG, MP, and LAST, on directed datasets. Given that all of these algorithms have parameters that can be used to trade off the storage cost and the total recreation cost, we compare them by plotting the different solutions they are able to find for the different values of their respective input parameters. Figure 12(a–d) show four such plots; we run each of the algorithms with a range of different values for its input parameter and plot the storage cost and the total (sum) recreation cost for each of the solutions found. We also show the minimum possible values for these two costs: the vertical dashed red line indicates the minimum storage cost required for storing the versions in the dataset as found by MCA, and the horizontal one indicates the minimum total recreation cost as found by SPT (equal to the sum of all version sizes).
The first key observation we make is that the total recreation cost decreases drastically if we allow a small increase in the storage budget over MCA. For example, for the DC dataset, the sum of recreation costs for MCA is over 11 PB (see Table 11), compared to just 34 TB for the SPT solution (which is the minimum possible). As we can see from Figure 12(a), a space budget of 1.1× the MCA storage cost reduces the sum of recreation costs by three orders of magnitude. Similar trends can be observed for the remaining datasets and across all the algorithms. We observe that LMG results in the best tradeoff between the sum of recreation costs and the storage cost, with LAST performing fairly closely. An important takeaway here, especially given the amount of prior work that has focused purely on storage cost minimization (Section 6), is that it is possible to construct balanced trees where the sum of recreation costs is reduced and brought close to that of SPT while using only a fraction of the space that SPT needs.
We also ran the GitH heuristic on all four datasets with varying window and depth settings. For BF, we ran the algorithm with four different window sizes (50, 25, 20, 10) for a fixed depth of 10 and provided the GitH algorithm with all the deltas it requested. For all other datasets, we ran GitH with an infinite window size but restricted it to choose from deltas that were available to the other algorithms (i.e., only deltas with sizes below a threshold). As we can see, the solutions found by GitH exhibited very good total recreation cost but required significantly more storage than the other algorithms. This is not surprising, given that GitH is a greedy heuristic that makes choices in a somewhat arbitrary order.
In Figures 13(a–b), we plot the maximum recreation cost, instead of the sum of recreation costs, across all versions for two of the datasets (the other two datasets exhibited similar behavior). The MP algorithm found the best solutions here for all datasets; we also observed that LMG and LAST both show plateaus for some datasets, where the maximum recreation cost did not change as the storage budget was increased. This is not surprising, given that the basic MP algorithm tries to optimize the storage cost given a bound on the maximum recreation cost, whereas both LMG and LAST focus on minimization of the storage cost, and one version with a high recreation cost is unlikely to affect that significantly.
We test the three algorithms on the undirected versions of three of the datasets (Figure 14). For DC and LC, undirected deltas between pairs of versions were obtained by concatenating the two directional deltas; for the BF dataset, we use UNIX diff itself to produce undirected deltas. Here again we observe that LMG consistently outperforms the other algorithms in terms of finding a good balance between the storage cost and the sum of recreation costs. MP again shows the best results when trying to balance the maximum recreation cost and the total storage cost. Similar results were observed for other datasets but are omitted due to space limitations.
In many cases, we may be able to estimate access frequencies for the various versions (from historical access patterns), and if available, we may want to take those into account when constructing the storage graph. The LMG algorithm can easily be adapted to take such information into account, whereas it is not clear how to adapt either LAST or MP in a similar fashion. In this experiment, we use LMG to compute a storage graph such that the sum of recreation costs is minimized given a space budget, while taking workload information into account. The workload here assigns a frequency of access to each version in the repository using a Zipfian distribution (with exponent 2); real-world access frequencies are known to follow such distributions. Given the workload information, the algorithm should find a storage graph whose frequency-weighted sum of recreation costs is lower than when the workload information is not taken into account (i.e., when all versions are assumed to be accessed equally frequently). Figure 15 shows the results of this experiment. As we can see, for the DC dataset, taking the access frequencies into account during optimization led to much better solutions than ignoring them. On the other hand, for the LF dataset, we did not observe a large difference.
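The workload-aware objective can be sketched as follows; the normalization of the Zipfian weights and the function names are ours, and the rank order assigned to versions is assumed arbitrary.

```python
def zipf_weights(n, s=2.0):
    """Normalized Zipfian access frequencies (exponent s) over n versions:
    the k-th most popular version is accessed proportionally to 1/k^s."""
    raw = [1.0 / (k ** s) for k in range(1, n + 1)]
    total = sum(raw)
    return [x / total for x in raw]

def weighted_recreation_cost(d, freq):
    """Objective a workload-aware LMG would minimize: the sum over
    versions of (access frequency x recreation cost d[i])."""
    return sum(f * c for f, c in zip(freq, d))
```

Under such skewed weights, materializing (or keeping short delta chains to) the few hot versions dominates the objective, which is why workload-aware optimization helps on DC.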
Here we evaluate the running time of the LMG algorithm. Recall that LMG takes the MST (or MCA) and SPT as inputs. In Fig. 16, we report the total running time as well as the time taken by LMG itself. We generated a set of version graphs as subsets of the graphs for the LC and DC datasets as follows: for a given number of versions n, we randomly choose a node and traverse the graph starting at that node in a breadth-first manner until we construct a subgraph with n versions. We generate 5 such subgraphs for increasing values of n and report the average running time of LMG; the storage budget for LMG is set to three times the space required by the MST (all our reported experiments with LMG use a smaller storage budget than that). The time taken by LMG on the DC dataset is higher than on LC for the same number of versions; this is because DC has lower delta values than LC (see Fig. 11) and thus requires more edges from the SPT to satisfy the storage budget.
On the other hand, MP takes between 1 and 8 seconds on those datasets when the recreation cost threshold is set to the maximum. Like LMG, LAST requires the MST/MCA and SPT as inputs; however, the running time of LAST itself is linear, and it takes less than 1 second in all cases. Finally, the time taken by GitH on the LC and DC datasets, for varying window sizes, ranges from 35 seconds (window = 1000) to a little more than 120 minutes (window = 100000); note that this excludes the time for constructing the deltas.
In summary, although LMG is inherently a more expensive algorithm than MP or LAST, it runs in reasonable time on large input sizes; we note that all of these times are likely to be dwarfed by the time it takes to construct deltas, even for moderately sized datasets.
Finally, we compare the quality of the solutions found by MP with the optimal solutions found using the Gurobi Optimizer for Problem 6. We use the ILP formulation from Section 2.3 with a constraint on the maximum recreation cost (θ), and compare the optimal storage cost with that of the MP algorithm (which produced the solutions with the lowest maximum recreation costs in our evaluation). We use our synthetic dataset generation suite to generate three small datasets, with 15, 25, and 50 versions, denoted v15, v25, and v50, respectively, and compute deltas between all pairs of versions. Table 2 reports the results of this experiment across five θ values. The ILP turned out to be very difficult to solve even for these very small problem sizes; in many cases, the optimizer did not finish, and the reported numbers are the best solutions it found.
As we can see, the solutions found by MP are quite close to the ILP solutions for the small problem sizes for which we could get any solutions out of the optimizer. However, extrapolating from the (admittedly limited) data points, we expect that on large problem sizes MP may be significantly worse than optimal for some variations of the problems (we note that the optimization formulations involving the max recreation cost are likely to be harder than those that focus on the average recreation cost). Development of better heuristics and approximation algorithms with provable guarantees for the various problems that we introduce is a rich area for further research.
Perhaps the most closely related prior work is source code version control systems like Git, Mercurial, SVN, and others, which are widely used for managing source code repositories. Despite their popularity, these systems largely use fairly simple algorithms underneath that are optimized to work with modest-sized source code files, and their on-disk structures are optimized to work with line-based diffs. These systems are known to have significant limitations when handling large files and large numbers of versions [2]. As a result, a variety of extensions like git-annex [9], git-bigfiles [10], etc., have been developed to make them work reasonably well with large files.
There is much prior work in the temporal databases literature [14, 31, 26, 34] on managing a linear chain of versions and retrieving a version as of a specific time point (called snapshot queries) [29]. [15] proposed an archiving technique where all versions of the data are merged into one hierarchy. It was not, however, a full-fledged version control system representing an arbitrary graph of versions. Snapshot queries have recently also been studied in the context of array databases [32, 30] and graph databases [22]. Seering et al. [30] proposed an MST-like technique for storing an arbitrary tree of versions in the context of scientific databases. They also proposed several heuristics for choosing which versions to materialize given the distribution of access frequencies to historical versions. Several databases support “time travel” features (e.g., Oracle Flashback, Postgres [33]). However, those do not allow for branching trees of versions. [19] articulates a similar vision to our overall DataHub vision; however, they do not propose formalisms or algorithms to solve the underlying data management challenges.
There is also much prior work on compactly encoding differences between two files or strings in order to reduce communication or storage costs. In addition to standard utilities like UNIX diff, many sophisticated techniques have been proposed for computing differences or edit sequences between two files (e.g., xdelta [25], vdelta [20], vcdiff [23], zdelta [36]). That work is largely orthogonal and complementary to our work.
Many prior efforts have looked at the problem of minimizing the total storage cost for storing a collection of related files (i.e., Problem 1). These works do not typically consider the recreation cost or the tradeoff between the two. Quinlan et al. [28] propose an archival “deduplication” storage system that identifies duplicate blocks across files and stores them only once to reduce storage requirements. Zhu et al. [37] present several optimizations on the basic theme. Douglis et al. [18] present several techniques to identify pairs of files that could be efficiently stored using delta compression even if no explicit derivation information is known about the two files; similar techniques could be used to better identify which entries of the matrices Δ and Φ to reveal in our scenario. Burns and Long [16] present a technique for in-place reconstruction of delta-compressed files using a graph-theoretic approach; that work could be incorporated into our overall framework to reduce the memory requirements during reconstruction. We refer the reader to a recent survey [27] for a more comprehensive coverage of this line of work.
Large datasets and collaborative and iterative analysis are becoming the norm in many application domains; however, we lack the data management infrastructure to efficiently manage such datasets, their versions over time, and derived data products. Given the high overlap and duplication among the datasets, it is attractive to consider using delta compression to store the datasets in a compact manner, where some datasets or versions are stored as modifications of other datasets; such delta compression, however, leads to higher latencies while retrieving specific datasets. In this paper, we studied the tradeoff between the storage and recreation costs in a principled manner, by formulating several optimization problems that trade off these two costs in different ways and showing that most variations are NP-Hard. We also presented several efficient algorithms that are effective at exploring this tradeoff, and we presented an extensive experimental evaluation using a prototype version management system that we have built. There are many interesting and rich avenues for future work that we plan to pursue. In particular, we plan to develop online algorithms for making the optimization decisions as new datasets or versions are created, as well as adaptive algorithms that re-evaluate the optimization decisions based on changing workload information. We also plan to explore the challenges in extending our work to a distributed and decentralized setting.
This research was supported by NSF Grants IIS-1319432 and IIS-1513407, grant 1U54GM114838 awarded by NIGMS through funds provided by the trans-NIH Big Data to Knowledge (BD2K) initiative, a Google Faculty Research Award, and an IBM Faculty Award.
^{1}We refer the reader to the extended version for proofs.
^{2}Our synthetic dataset generator may be of independent interest to researchers working on version management.
^{3}Unlike git repack, svnadmin pack has a negligible effect on the storage cost, as it primarily aims to reduce disk seeks and the per-version disk usage penalty by concatenating files into a single “pack” [5, 6].