# Storage Engines

At the very bottom of the ArangoDB database lies the storage engine.
The storage engine is responsible for persisting documents on disk,
holding copies in memory, and providing indexes and caches to speed
up queries.

Up to version 3.1, ArangoDB supported memory-mapped files (MMFiles)
as its sole storage engine. Beginning with version 3.2, ArangoDB
supports pluggable storage engines. The second supported engine is
RocksDB from Facebook.

| MMFiles | RocksDB |
|---|---|
| default | optional |
| dataset needs to fit into memory | works with as much data as fits on disk |
| indexes in memory | hot set in memory, data and indexes on disk |
| slow restart due to index rebuilding | fast startup (no rebuilding of indexes) |
| volatile collections (in memory only, optional) | collection data always persisted |
| collection-level locking (writes block reads) | concurrent reads and writes |

*Blog article: [Comparing new RocksDB and MMFiles storage engines](https://www.arangodb.com/why-arangodb/comparing-rocksdb-mmfiles-storage-engines/)*

RocksDB is an embeddable persistent key-value store. It is a
log-structured database and is optimized for fast storage.

The MMFiles engine is optimized for the use case where the data fits
into main memory. It allows for very fast concurrent reads. However,
writes block reads, and locking is on collection level. Indexes are
always in memory and are rebuilt on startup. This gives better
performance but imposes a longer startup time.

The RocksDB engine is optimized for large datasets and allows for
steady insert performance even if the dataset is much larger than
main memory. Indexes are always stored on disk, but caches are used
to speed up performance. RocksDB uses document-level locks, allowing
for concurrent writes. Writes do not block reads, and reads do not
block writes.

The engine must be selected for the whole server or cluster; it is
not possible to mix engines. The transaction handling and
write-ahead log format of the individual engines are very different
and therefore cannot be mixed.
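
The engine is chosen when the server is first started on an empty
database directory. A minimal sketch; the `--server.storage-engine`
option accepts `mmfiles`, `rocksdb`, or `auto` in version 3.2 and
later:

```
# Start the server with the RocksDB storage engine. The choice is
# persisted in the database directory on first start.
arangod --server.storage-engine rocksdb
```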

## RocksDB

### Advantages

RocksDB is a very flexible engine that can be configured for various
use cases.

The main advantages of RocksDB are:

- document-level locks
- support for large datasets
- persistent indexes

### Caveats

RocksDB allows concurrent writes. However, when two operations touch
the same document at the same time, a write conflict is raised. This
cannot happen with the MMFiles engine, so applications that switch
to RocksDB need to be prepared for such exceptions. It is possible
to exclusively lock collections when executing AQL. This avoids
write conflicts but also inhibits concurrent writes.
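
A common way to cope with this is to simply retry the offending
operation. A minimal sketch using the arangojs driver (the
`accounts` collection and the retry policy are made up for
illustration; error number 1200 is ArangoDB's
ERROR_ARANGO_CONFLICT):

```ts
import { Database, aql } from "arangojs";

const db = new Database({ url: "http://localhost:8529" });

// Retry an update a few times when RocksDB reports a write conflict
// (ArangoDB error number 1200, ERROR_ARANGO_CONFLICT).
async function incrementWithRetry(key: string, attempts = 3): Promise<void> {
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      await db.query(aql`
        UPDATE ${key} WITH { counter: OLD.counter + 1 } IN accounts
      `);
      return; // success
    } catch (err: any) {
      const isConflict = err.errorNum === 1200;
      if (!isConflict || attempt === attempts) throw err;
      // write conflict: fall through and try again
    }
  }
}
```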

Another restriction currently stems from the transaction handling in
RocksDB: transactions are limited in total size. If a statement
modifies a lot of documents, it is necessary to commit data in
between. For AQL this is done automatically by default. Transactions
that grow too big (in terms of the number of operations involved or
the total size of the data modified) are committed automatically.
Effectively this means that big user transactions are split into
multiple smaller RocksDB transactions that are committed
individually. The entire user transaction will not necessarily have
ACID properties in this case.

The threshold values for transaction sizes can be configured
globally using the startup options

* [`--rocksdb.intermediate-commit-size`](../Administration/Configuration/RocksDB.md#non-pass-through-options)
* [`--rocksdb.intermediate-commit-count`](../Administration/Configuration/RocksDB.md#non-pass-through-options)
* [`--rocksdb.max-transaction-size`](../Administration/Configuration/RocksDB.md#non-pass-through-options)

It is also possible to override these thresholds per transaction.
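
For a single AQL query, this can look as follows with arangojs
(assuming the driver passes these thresholds through as query
options; the collection name and values are illustrative):

```ts
import { Database, aql } from "arangojs";

const db = new Database({ url: "http://localhost:8529" });

// Run a large modification query with custom intermediate-commit
// thresholds for this transaction only.
await db.query(
  aql`FOR doc IN bigCollection
        UPDATE doc WITH { migrated: true } IN bigCollection`,
  {
    intermediateCommitCount: 10000,        // commit every 10k operations
    maxTransactionSize: 128 * 1024 * 1024, // cap at 128 MiB
  }
);
```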

### Performance

RocksDB is based on a log-structured merge tree (LSM tree). A good
introduction can be found in:

- http://www.benstopford.com/2015/02/14/log-structured-merge-trees/
- https://blog.acolyer.org/2014/11/26/the-log-structured-merge-tree-lsm-tree/

The basic idea is that data is organized in levels, where each level
is a factor larger than the previous one. New data resides in the
smaller levels, while old data is moved down to the larger levels.
This allows sustaining a high rate of inserts over an extended
period. In principle, the different levels can reside on different
storage media: the smaller ones on fast SSDs, the larger ones on
bigger spinning disks.

RocksDB itself provides a lot of different knobs to fine-tune the
storage engine according to your use case. ArangoDB supports the
most common ones using the options below.

Performance reports for the storage engine can be found here:

- https://github.com/facebook/rocksdb/wiki/performance-benchmarks
- https://github.com/facebook/rocksdb/wiki/RocksDB-Tuning-Guide

### ArangoDB options

ArangoDB has a cache for the persistent indexes in RocksDB. The
total size of this cache is controlled by the option

    --cache.size

RocksDB also has a cache for the blocks stored on disk. The size of
this cache is controlled by the option

    --rocksdb.block-cache-size

By default, ArangoDB distributes the available memory equally
between the two caches.
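
To size the two caches explicitly, both options can be set at
startup, for example in a configuration file. A sketch in
`arangod.conf` syntax (the values are illustrative, not
recommendations):

```
[cache]
size = 1073741824              # 1 GiB for the index cache

[rocksdb]
block-cache-size = 1073741824  # 1 GiB for the RocksDB block cache
```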

ArangoDB chooses a size for the various levels in RocksDB that is
suitable for general-purpose applications.

RocksDB's log-structured data levels have increasing size:

    MEM: --
    L0:  --
    L1:  -- --
    L2:  -- -- -- --
    ...

New or updated documents are first stored in memory. If this
memtable reaches the limit given by

    --rocksdb.write-buffer-size

it will be converted to an SST file and inserted at level 0.

The following options control the size of each level and the depth.

    --rocksdb.num-levels N

Limits the number of levels to N. By default it is 7, and there is
seldom a reason to change this. A new level is only opened if there
is too much data in the previous one.

    --rocksdb.max-bytes-for-level-base B

L1 will hold at most B bytes.

    --rocksdb.max-bytes-for-level-multiplier M

Each level may hold at most M times as many bytes as the previous
one. Therefore the maximum number of bytes for level L can be
calculated as

    max-bytes-for-level-base * (max-bytes-for-level-multiplier ^ (L-1))
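
As a quick sanity check of this formula, the sketch below computes
the maximum size of the first few levels, assuming an illustrative
base of 256 MiB and a multiplier of 10:

```ts
// Sanity check of the level-size formula with illustrative values:
// base = 256 MiB, multiplier = 10.
const base = 256 * 1024 * 1024; // --rocksdb.max-bytes-for-level-base
const multiplier = 10;          // --rocksdb.max-bytes-for-level-multiplier

for (let level = 1; level <= 6; level++) {
  const maxBytes = base * Math.pow(multiplier, level - 1);
  console.log(`L${level}: ${(maxBytes / 1024 ** 3).toFixed(2)} GiB`);
}
// Prints: L1: 0.25 GiB, L2: 2.50 GiB, L3: 25.00 GiB, L4: 250.00 GiB, ...
```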

## Future

RocksDB imposes a limit on the transaction size. It is optimized to
handle small transactions very efficiently, but effectively limits
the total size of transactions.

ArangoDB currently uses RocksDB's transactions to implement its own
transaction handling. Therefore, the same restrictions apply to
ArangoDB transactions when using the RocksDB engine.

We will improve this by introducing distributed transactions in a
future version of ArangoDB. This will allow handling large
transactions as a series of small RocksDB transactions and hence
remove the size restriction.