Persistent data structure

In computing, a persistent data structure or non-ephemeral data structure is a data structure that always preserves the previous version of itself when it is modified. Such data structures are effectively immutable, as their operations do not (visibly) update the structure in place, but instead always yield a new updated structure. The term was introduced in Driscoll, Sarnak, Sleator, and Tarjan's 1986 article.

A data structure is partially persistent if all versions can be accessed but only the newest version can be modified. The data structure is fully persistent if every version can be both accessed and modified. If there is also a meld or merge operation that can create a new version from two previous versions, the data structure is called confluently persistent. Structures that are not persistent are called ephemeral.
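The difference between partial and full persistence can be illustrated with a minimal sketch. The class name `VersionedList` and its methods are hypothetical, and each version is stored as a complete tuple purely for simplicity; practical implementations share memory between versions.

```python
class VersionedList:
    """Illustrative only: keeps every version of a sequence as an immutable tuple."""

    def __init__(self, fully_persistent):
        self.fully_persistent = fully_persistent
        self.versions = [()]  # version 0 is the empty sequence

    def read(self, version):
        # Every version can be accessed in both models.
        return self.versions[version]

    def append(self, version, item):
        """Derive a new version from `version` and return its number."""
        newest = len(self.versions) - 1
        if not self.fully_persistent and version != newest:
            raise ValueError("partially persistent: only the newest version can be modified")
        self.versions.append(self.versions[version] + (item,))
        return len(self.versions) - 1


full = VersionedList(fully_persistent=True)
v1 = full.append(0, "a")
v2 = full.append(v1, "b")
v3 = full.append(v1, "c")   # branches off an older version
print(full.read(v2))        # ('a', 'b')
print(full.read(v3))        # ('a', 'c')
```

With `fully_persistent=False`, the last `append` would raise an error, because version `v1` is no longer the newest one.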

These types of data structures are particularly common in logic and functional programming. In the fully persistent model, both updates and queries are allowed on any version of the data structure. In some cases the performance characteristics of querying or updating older versions of a data structure may be allowed to degrade, as is true with the rope data structure. A data structure can also be referred to as confluently persistent if, in addition to being fully persistent, two versions of the same data structure can be combined to form a new version which is still fully persistent.

Techniques for preserving previous versions

Copy-on-write

One method for creating a persistent data structure is to store the data in a platform-provided ephemeral data structure, such as an array, and to copy the entirety of that structure on every update. This technique is inefficient because the entire backing data structure must be copied for each write, leading to worst-case <math>O(n\cdot m)</math> total cost for m modifications of an array of size n.
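The whole-structure copy described above can be sketched as follows. The class `CowArray` is a hypothetical name used only for illustration; it wraps a built-in Python list as the backing store.

```python
class CowArray:
    """Illustrative persistent array that copies its whole backing list on each write."""

    def __init__(self, items=()):
        self._items = list(items)

    def __len__(self):
        return len(self._items)

    def get(self, index):
        return self._items[index]

    def set(self, index, value):
        # Copy all n elements, then change one of them: Theta(n) work per write.
        copy = CowArray(self._items)
        copy._items[index] = value
        return copy


v0 = CowArray([0, 0, 0, 0])
v1 = v0.set(2, 7)
print(v0.get(2), v1.get(2))  # 0 7
```

Every call to `set` leaves the receiver untouched, which is what makes old versions remain readable, at the price of copying all n elements each time.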

Copy-on-write memory management can reduce the price for an update from <math>\Theta(n)</math> to <math>O(Bu)</math>, where B is the memory block size and u the number of pages updated in an operation.

Fat node

The fat node method is to record all changes made to node fields in the nodes themselves, without erasing old values of the fields. This requires that nodes be allowed to become arbitrarily “fat”. In other words, each fat node contains the same information and pointer fields as an ephemeral node, along with space for an arbitrary number of extra field values. Each extra field value has an associated field name and a version stamp which indicates the version in which the named field was changed to have the specified value. In addition, each fat node has its own version stamp, indicating the version in which the node was created. The only purpose of nodes having version stamps is to make sure that each node only contains one value per field name per version. In order to navigate through the structure, each original field value in a node has a version stamp of zero.

Complexity of fat node

The fat node method requires O(1) space for every modification: just store the new data. Each modification takes O(1) additional time to store the modification at the end of the modification history. This is an amortized time bound, assuming the modification history is stored in a growable array. At access time, the right version at each node must be found as the structure is traversed. If m modifications were made, each access operation would incur an <math>O(\log m)</math> slowdown from the cost of finding the nearest modification in the array. Alternatively, one can employ the van Emde Boas tree at each node (possibly the space-efficient version using hashing) to reduce the time for an access to <math>O(\log\log m)</math> at the cost of increasing update time to <math>O(\log\log m)</math>. If only partial persistence is required, the time for an update can be kept at its original order of magnitude, modulo randomization and amortization (since the time for a single update to the fat node can be amortized expected <math>O(1)</math>).
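A minimal sketch of a single fat node is given below, assuming partial persistence (version stamps only grow) and a growable array per field. The class `FatNode` and its method names are hypothetical. Lookups use binary search over the version stamps, which corresponds to the <math>O(\log m)</math> access slowdown mentioned above.

```python
import bisect


class FatNode:
    """Illustrative fat node: each field keeps a version-stamped history of values."""

    def __init__(self, version, **fields):
        self.created = version  # version in which this node was created
        # Per field: parallel lists of stamps and values, sorted by stamp.
        # Original field values carry the version stamp zero.
        self._stamps = {name: [0] for name in fields}
        self._values = {name: [value] for name, value in fields.items()}

    def set(self, name, version, value):
        stamps = self._stamps[name]
        if version < stamps[-1]:
            raise ValueError("partial persistence: only the newest version can be modified")
        if version == stamps[-1]:
            # Keep at most one value per field name per version.
            self._values[name][-1] = value
        else:
            # Amortized O(1): append to the end of the modification history.
            stamps.append(version)
            self._values[name].append(value)

    def get(self, name, version):
        # Binary search for the latest stamp that is <= version: O(log m).
        stamps = self._stamps[name]
        i = bisect.bisect_right(stamps, version) - 1
        return self._values[name][i]


node = FatNode(0, key=5)
node.set("key", 3, 9)
node.set("key", 7, 11)
print(node.get("key", 2), node.get("key", 6), node.get("key", 7))  # 5 9 11
```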

Path copying

This method assumes that the data structure is a linked graph of nodes.

On update, a copy is made of all nodes on the path to any node which is about to be modified. These changes must then be cascaded back through the data structure: all nodes that pointed to the old node must be modified to point to the new node instead. These modifications cause more cascading changes, and so on, until the root node is reached.

Complexity of path copying

With m modifications, this costs O(log m) additive lookup time. Modification time and space are bounded by the maximal number of ancestors for any node in the data structure times the cost of the update in the ephemeral data structure. In a balanced binary search tree without parent pointers, the worst-case modification time complexity is O(log n + update cost). In a linked list, however, the worst-case modification time complexity is O(n + update cost).

A combination

Driscoll, Sarnak, Sleator, and Tarjan came up with a way to combine the techniques of fat nodes and path copying.

Some data structures, such as stacks and treaps, can easily be adapted to create a persistent version. Some others need slightly more effort, for example: queues, deques, and extensions including min-deques (which have an additional O(1) operation min returning the minimal element) and random-access deques (which have an additional operation of random access with sub-linear, most often logarithmic, complexity).
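As an example of a structure that needs slightly more effort, a persistent queue can be sketched with two immutable singly linked lists, one for the front and one (reversed) for the back. This is a simplified illustration with hypothetical names; cells are plain `(head, tail)` tuples and `None` stands for the empty list. The reversal step is O(1) amortized only when each version is used once, so this sketch makes no claim about bounds under arbitrary persistent use.

```python
from typing import NamedTuple

EMPTY = None  # the empty linked list


def cons(head, tail):
    return (head, tail)


def reverse(lst):
    out = EMPTY
    while lst is not EMPTY:
        out = cons(lst[0], out)
        lst = lst[1]
    return out


class Queue(NamedTuple):
    front: object = EMPTY  # next items to dequeue, oldest first
    back: object = EMPTY   # newly enqueued items, newest first

    def enqueue(self, item):
        # A new version that shares both existing lists.
        return Queue(self.front, cons(item, self.back))

    def dequeue(self):
        """Return (item, new_queue); raise IndexError if the queue is empty."""
        front, back = self.front, self.back
        if front is EMPTY:
            front, back = reverse(back), EMPTY
        if front is EMPTY:
            raise IndexError("dequeue from empty queue")
        return front[0], Queue(front[1], back)


q2 = Queue().enqueue(1).enqueue(2)
q3 = q2.enqueue(3)
item, q4 = q3.dequeue()
print(item)             # 1
print(q2.dequeue()[0])  # 1: the older version q2 is still intact
```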

Persistent data structures which are based on immutable ("pure functional") structures should be contrasted with structures that use destructive updates (mutation) and are made persistent using the fat node or path copying techniques described above.

Linked lists

Singly linked lists are the bread-and-butter data structure in functional languages. Some ML-derived languages, like Haskell, are purely functional: once a node in the list has been allocated, it cannot be modified, only copied, referenced, or destroyed by the garbage collector when nothing refers to it. (Note that ML itself is not purely functional, but it supports a subset of non-destructive list operations; the same is true of Lisp (LISt Processing) dialects such as Scheme and Racket.)

Consider the two lists:

xs = [0, 1, 2]

ys = [3, 4, 5]

These would be represented in memory by:

File:Purely_functional_list_before.svg

where a circle indicates a node in the list (the arrow out representing the second element of the node which is a pointer to another node).

Now concatenating the two lists:

zs = xs ++ ys

results in the following memory structure:

File:Purely_functional_list_after.svg

Notice that the nodes in list <code>xs</code> have been copied, but the nodes in <code>ys</code> are shared. As a result, the original lists (<code>xs</code> and <code>ys</code>) persist and have not been modified.

The reason for the copy is that the last node in <code>xs</code> (the node containing the original value <code>2</code>) cannot be modified to point to the start of <code>ys</code>, because that would change the value of <code>xs</code>.
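The same behaviour can be reproduced outside a functional language. The following sketch models list nodes as an immutable `Cell` type; `Cell`, `from_list`, `to_list`, and `concat` are hypothetical names chosen for this illustration. Concatenation copies each node of the first list and reuses the second list as the tail.

```python
from typing import NamedTuple, Optional


class Cell(NamedTuple):
    value: int
    next: Optional["Cell"] = None


def from_list(items):
    node = None
    for item in reversed(items):
        node = Cell(item, node)
    return node


def to_list(node):
    out = []
    while node is not None:
        out.append(node.value)
        node = node.next
    return out


def concat(xs, ys):
    """Copy every node of xs; share all nodes of ys."""
    if xs is None:
        return ys
    return Cell(xs.value, concat(xs.next, ys))


xs = from_list([0, 1, 2])
ys = from_list([3, 4, 5])
zs = concat(xs, ys)
print(to_list(zs))               # [0, 1, 2, 3, 4, 5]
print(to_list(xs))               # [0, 1, 2]: xs is unchanged
print(zs.next.next.next is ys)   # True: the nodes of ys are shared, not copied
```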

Trees

Consider a binary search tree,

Hash array mapped tries were originally described in a 2001 paper by Phil Bagwell entitled "Ideal Hash Trees". This paper presented a mutable hash table where "Insert, search and delete times are small and constant, independent of key set size, operations are O(1). Small worst-case times for insert, search and removal operations can be guaranteed and misses cost less than successful searches". This data structure was then modified by Rich Hickey to be fully persistent for use in the Clojure programming language.

Conceptually, hash array mapped tries work similarly to any generic tree in that they store nodes hierarchically and retrieve them by following a path down to a particular element. The key difference is that hash array mapped tries first use a hash function to transform their lookup key into a (usually 32- or 64-bit) integer. The path down the tree is then determined by using slices of the binary representation of that integer to index into a sparse array at each level of the tree. The leaf nodes of the tree behave similarly to the buckets used to construct hash tables and may or may not contain multiple candidates depending on hash collisions.
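The indexing step can be sketched as follows. The 5-bit slice width, the 32-bit hash, and the bitmap-plus-packed-list representation of the sparse array are assumptions made for this illustration, and `slices` and `sparse_lookup` are hypothetical helper names rather than part of any particular library.

```python
BITS = 5                 # assumed slice width: 2**5 = 32 slots per level
MASK = (1 << BITS) - 1


def slices(hash32, levels=7):
    """Split a 32-bit hash into per-level slot indices, lowest bits first."""
    return [(hash32 >> (BITS * level)) & MASK for level in range(levels)]


def sparse_lookup(bitmap, entries, slot):
    """Look up `slot` in a sparse 32-slot array stored as a bitmap plus a packed list."""
    bit = 1 << slot
    if not bitmap & bit:
        return None  # slot is empty
    # The number of occupied slots below `slot` is the position in the packed list.
    return entries[bin(bitmap & (bit - 1)).count("1")]


print(slices(0b00001_00010_00011))   # [3, 2, 1, 0, 0, 0, 0]

bitmap = (1 << 1) | (1 << 4) | (1 << 9)      # slots 1, 4 and 9 are occupied
entries = ["a", "b", "c"]
print(sparse_lookup(bitmap, entries, 4))     # b
print(sparse_lookup(bitmap, entries, 3))     # None
```

Storing only the occupied slots keeps each node small, while the bitmap still allows a slot to be located without scanning.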

Usage in programming languages

Haskell

Haskell is a pure functional language and therefore does not allow for mutation. As a result, all data structures in the language are persistent, as it is impossible not to preserve the previous state of a data structure with functional semantics. This is because any change to a data structure that would render previous versions of a data structure invalid would violate referential transparency.

In its standard library Haskell has efficient persistent implementations for linked lists, Maps (implemented as size balanced trees), and Sets among others.

Clojure

Like many programming languages in the Lisp family, Clojure contains an implementation of a linked list, but unlike other dialects its implementation of a linked list has enforced persistence instead of being persistent by convention. Clojure also has efficient implementations of persistent vectors, maps, and sets based on persistent hash array mapped tries. These data structures implement the mandatory read-only parts of the Java collections framework.

The designers of the Clojure language advocate the use of persistent data structures over mutable data structures because they have value semantics, which makes them freely shareable between threads with cheap aliases, easy to fabricate, and language independent.

These data structures form the basis of Clojure's support for parallel computing, since they allow for easy retries of operations to sidestep data races and for atomic compare-and-swap semantics.

Elm

The Elm programming language is purely functional like Haskell, which makes all of its data structures persistent by necessity. It contains persistent implementations of linked lists as well as persistent arrays, dictionaries, and sets.

Elm uses a custom virtual DOM implementation that takes advantage of the persistent nature of Elm data. As of 2016 it was reported by the developers of Elm that this virtual DOM allows the Elm language to render HTML faster than the popular JavaScript frameworks React, Ember, and Angular.

Java

The Java programming language is not particularly functional. Despite this, the core JDK package java.util.concurrent includes CopyOnWriteArrayList and CopyOnWriteArraySet, which are persistent structures implemented using copy-on-write techniques. The usual concurrent map implementation in Java, ConcurrentHashMap, is not persistent, however. Fully persistent collections are available in third-party libraries or in other JVM languages.

JavaScript

The popular JavaScript frontend framework React is frequently used along with a state management system that implements the Flux architecture, a popular implementation of which is the JavaScript library Redux. The Redux library is inspired by the state management pattern used in the Elm programming language, meaning that it mandates that users treat all data as persistent. As a result, the Redux project recommends that in certain cases users make use of libraries for enforced and efficient persistent data structures. This reportedly allows for greater performance than when comparing or making copies of regular JavaScript objects.

One such library of persistent data structures, Immutable.js, is based on the data structures made available and popularized by Clojure and Scala. The Redux documentation mentions it as one of the libraries that can provide enforced immutability. Immer.js takes a different approach in which one "creates the next immutable state by mutating the current one".

Immer.js uses native JavaScript objects rather than efficient persistent data structures, which might cause performance issues when the data is large.

Prolog

Prolog terms are naturally immutable and therefore data structures are typically persistent data structures. Their performance depends on sharing and garbage collection offered by the Prolog system. Extensions to non-ground Prolog terms are not always feasible because of search space explosion. Delayed goals might mitigate the problem.

Some Prolog systems nevertheless do provide destructive operations like setarg/3, which might come in different flavors: with or without copying, and with or without backtracking of the state change. There are cases where setarg/3 is put to good use in providing a new declarative layer, such as a constraint solver.

Scala

The Scala programming language promotes the use of persistent data structures for implementing programs using "Object-Functional Style". Scala contains implementations of many persistent data structures including linked lists, red–black trees, as well as persistent hash array mapped tries as introduced in Clojure.

Garbage collection

Because persistent data structures are often implemented in such a way that successive versions of a data structure share underlying memory, ergonomic use of such data structures generally requires some form of automatic garbage collection, such as reference counting or mark and sweep. On some platforms where persistent data structures are used, it is possible to forgo garbage collection; doing so can lead to memory leaks, but in some cases it can have a positive impact on the overall performance of an application.
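The interaction between memory sharing and garbage collection can be observed with a small experiment. The sketch below uses Python's `weakref` and `gc` modules; the `Cell` class and the `demo` function are hypothetical names. It builds two list versions that share a tail, then checks which nodes are reclaimed as versions are dropped. The results shown assume a collector that reclaims unreachable objects by the time `gc.collect()` returns, as CPython does.

```python
import gc
import weakref


class Cell:
    __slots__ = ("value", "next", "__weakref__")

    def __init__(self, value, next=None):
        self.value = value
        self.next = next


def demo():
    shared = Cell(2, Cell(3))   # tail shared by both versions
    old = Cell(0, shared)       # version 1: 0 -> 2 -> 3
    new = Cell(1, shared)       # version 2: 1 -> 2 -> 3
    old_head = weakref.ref(old)
    tail = weakref.ref(shared)
    del shared                  # now only the two versions reference the tail
    del old                     # drop version 1
    gc.collect()
    # The unshared head of version 1 is gone; the shared tail must survive.
    after_old = (old_head() is None, tail() is None)
    del new                     # drop version 2 as well
    gc.collect()
    after_new = tail() is None  # nothing references the tail any more
    return after_old, after_new


print(demo())  # ((True, False), True)
```

Without automatic collection, the program would have to track manually that the tail is still in use by version 2 before freeing anything that version 1 referenced.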
