arangodb/Documentation/Books/Manual/Indexing/IndexBasics.md

Index basics
============

Indexes allow fast access to documents, provided the indexed attribute(s)
are used in a query. While ArangoDB automatically indexes some system
attributes, users are free to create extra indexes on non-system attributes
of documents.

User-defined indexes can be created on collection level. Most user-defined indexes
can be created by specifying the names of the index attributes.
Some index types allow indexing just one attribute (e.g. *fulltext* index) whereas
other index types allow indexing multiple attributes at the same time.

Learn how to use different indexes efficiently by going through the
[ArangoDB Performance Course](https://www.arangodb.com/arangodb-performance-course/).

The system attributes `_id`, `_key`, `_from` and `_to` are automatically indexed
by ArangoDB, without the user being required to create extra indexes for them.
`_id` and `_key` are covered by a collection's primary key, and `_from` and `_to`
are covered by an edge collection's edge index automatically.

Using the system attribute `_id` in user-defined indexes is not possible, but
indexing `_key`, `_rev`, `_from`, and `_to` is.

Creating new indexes is by default done under an exclusive collection lock. The collection is not
available while the index is being created.  This "foreground" index creation can be undesirable,
if you have to perform it on a live system without a dedicated maintenance window.

For potentially long running index  creation operations the _rocksdb_ storage-engine also supports
creating indexes in "background". The collection remains (mostly) available during the index creation,
see the section [Creating Indexes in Background](#creating-indexes-in-background) for more information.


ArangoDB provides the following index types:

Primary Index
-------------

For each collection there will always be a *primary index* which is a hash index
for the [document keys](../Appendix/Glossary.md#document-key) (`_key` attribute)
of all documents in the collection. The primary index allows quick selection
of documents in the collection using either the `_key` or `_id` attributes. It will
be used from within AQL queries automatically when performing equality lookups on
`_key` or `_id`.

There are also dedicated functions to find a document given its `_key` or `_id`
that will always make use of the primary index:

```js
db.collection.document("<document-key>");
db._document("<document-id>");
```

As the primary index is an unsorted hash index, it cannot be used for non-equality
range queries or for sorting.

The primary index of a collection cannot be dropped or changed, and there is no
mechanism to create user-defined primary indexes.


Edge Index
----------

Every [edge collection](../Appendix/Glossary.md#edge-collection) also has an
automatically created *edge index*. The edge index provides quick access to
documents by either their `_from` or `_to` attributes. It can therefore be
used to quickly find connections between vertex documents and is invoked when
the connecting edges of a vertex are queried.

Edge indexes are used from within AQL when performing equality lookups on `_from`
or `_to` values in an edge collections. There are also dedicated functions to
find edges given their `_from` or `_to` values that will always make use of the
edge index:

```js
db.collection.edges("<from-value>");
db.collection.edges("<to-value>");
db.collection.outEdges("<from-value>");
db.collection.outEdges("<to-value>");
db.collection.inEdges("<from-value>");
db.collection.inEdges("<to-value>");
```

Internally, the edge index is implemented as a hash index, which stores the union
of all `_from` and `_to` attributes. It can be used for equality
lookups, but not for range queries or for sorting. Edge indexes are automatically
created for edge collections. It is not possible to create user-defined edge indexes.
However, it is possible to freely use the `_from` and `_to` attributes in user-defined
indexes.

An edge index cannot be dropped or changed.


Hash Index
----------

A hash index can be used to quickly find documents with specific attribute values.
The hash index is unsorted, so it supports equality lookups but no range queries or sorting.

A hash index can be created on one or multiple document attributes. A hash index will
only be used by a query if all index attributes are present in the search condition,
and if all attributes are compared using the equality (`==`) operator. Hash indexes are
used from within AQL and several query functions, e.g. `byExample`, `firstExample` etc.

Hash indexes can optionally be declared unique, then disallowing saving the same
value(s) in the indexed attribute(s). Hash indexes can optionally be sparse.

The different types of hash indexes have the following characteristics:

- **unique hash index**: all documents in the collection must have different values for
  the attributes covered by the unique index. Trying to insert a document with the same
  key value as an already existing document will lead to a unique constraint
  violation.

  This type of index is not sparse. Documents that do not contain the index attributes or
  that have a value of `null` in the index attribute(s) will still be indexed.
  A key value of `null` may only occur once in the index, so this type of index cannot
  be used for optional attributes.

  The unique option can also be used to ensure that
  [no duplicate edges](Hash.md#ensure-uniqueness-of-relations-in-edge-collections) are
  created, by adding a combined index for the fields `_from` and `_to` to an edge collection.

- **unique, sparse hash index**: all documents in the collection must have different
  values for the attributes covered by the unique index. Documents in which at least one
  of the index attributes is not set or has a value of `null` are not included in the
  index. This type of index can be used to ensure that there are no duplicate keys in
  the collection for documents which have the indexed attributes set. As the index will
  exclude documents for which the indexed attributes are `null` or not set, it can be
  used for optional attributes.

- **non-unique hash index**: all documents in the collection will be indexed. This type
  of index is not sparse. Documents that do not contain the index attributes or that have
  a value of `null` in the index attribute(s) will still be indexed. Duplicate key values
  can occur and do not lead to unique constraint violations.

- **non-unique, sparse hash index**: only those documents will be indexed that have all
  the indexed attributes set to a value other than `null`. It can be used for optional
  attributes.

The amortized complexity of lookup, insert, update, and removal operations in unique hash
indexes is O(1).

Non-unique hash indexes have an amortized complexity of O(1) for insert, update, and
removal operations. That means non-unique hash indexes can be used on attributes with
low cardinality.

If a hash index is created on an attribute that is missing in all or many of the documents,
the behavior is as follows:

- if the index is sparse, the documents missing the attribute will not be indexed and not
  use index memory. These documents will not influence the update or removal performance
  for the index.

- if the index is non-sparse, the documents missing the attribute will be contained in the
  index with a key value of `null`.

Hash indexes support [indexing array values](#indexing-array-values) if the index
attribute name is extended with a <i>[\*]</i>.


Skiplist Index
--------------

A skiplist is a sorted index structure. It can be used to quickly find documents
with specific attribute values, for range queries and for returning documents from
the index in sorted order. Skiplists will be used from within AQL and several query
functions, e.g. `byExample`, `firstExample` etc.

Skiplist indexes will be used for lookups, range queries and sorting only if either all
index attributes are provided in a query, or if a leftmost prefix of the index attributes
is specified.

For example, if a skiplist index is created on attributes `value1` and `value2`, the
following filter conditions can use the index (note: the `<=` and `>=` operators are
intentionally omitted here for the sake of brevity):

```js
FILTER doc.value1 == ...
FILTER doc.value1 < ...
FILTER doc.value1 > ...
FILTER doc.value1 > ... && doc.value1 < ...

FILTER doc.value1 == ... && doc.value2 == ...
FILTER doc.value1 == ... && doc.value2 > ...
FILTER doc.value1 == ... && doc.value2 > ... && doc.value2 < ...
```

In order to use a skiplist index for sorting, the index attributes must be specified in
the `SORT` clause of the query in the same order as they appear in the index definition.
Skiplist indexes are always created in ascending order, but they can be used to access
the indexed elements in both ascending or descending order. However, for a combined index
(an index on multiple attributes) this requires that the sort orders in a single query
as specified in the `SORT` clause must be either all ascending (optionally omitted
as ascending is the default) or all descending.

For example, if the skiplist index is created on attributes `value1` and `value2`
(in this order), then the following sorts clauses can use the index for sorting:

- `SORT value1 ASC, value2 ASC` (and its equivalent `SORT value1, value2`)
- `SORT value1 DESC, value2 DESC`
- `SORT value1 ASC` (and its equivalent `SORT value1`)
- `SORT value1 DESC`

The following sort clauses cannot make use of the index order, and require an extra
sort step:

- `SORT value1 ASC, value2 DESC`
- `SORT value1 DESC, value2 ASC`
- `SORT value2` (and its equivalent `SORT value2 ASC`)
- `SORT value2 DESC` (because first indexed attribute `value1` is not used in sort clause)

Note: the latter two sort clauses cannot use the index because the sort clause does not
refer to a leftmost prefix of the index attributes.

Skiplists can optionally be declared unique, disallowing saving the same value in the indexed
attribute. They can be sparse or non-sparse.

The different types of skiplist indexes have the following characteristics:

- **unique skiplist index**: all documents in the collection must have different values for
  the attributes covered by the unique index. Trying to insert a document with the same
  key value as an already existing document will lead to a unique constraint
  violation.

  This type of index is not sparse. Documents that do not contain the index attributes or
  that have a value of `null` in the index attribute(s) will still be indexed.
  A key value of `null` may only occur once in the index, so this type of index cannot
  be used for optional attributes.

- **unique, sparse skiplist index**: all documents in the collection must have different
  values for the attributes covered by the unique index. Documents in which at least one
  of the index attributes is not set or has a value of `null` are not included in the
  index. This type of index can be used to ensure that there are no duplicate keys in
  the collection for documents which have the indexed attributes set. As the index will
  exclude documents for which the indexed attributes are `null` or not set, it can be
  used for optional attributes.

- **non-unique skiplist index**: all documents in the collection will be indexed. This type
  of index is not sparse. Documents that do not contain the index attributes or that have
  a value of `null` in the index attribute(s) will still be indexed. Duplicate key values
  can occur and do not lead to unique constraint violations.

- **non-unique, sparse skiplist index**: only those documents will be indexed that have all
  the indexed attributes set to a value other than `null`. It can be used for optional
  attributes.

The operational amortized complexity for skiplist indexes is logarithmically correlated
with the number of documents in the index.

Skiplist indexes support [indexing array values](#indexing-array-values) if the index
attribute name is extended with a <i>[\*]</i>`.


Geo Index
---------

Users can create additional geo indexes on one or multiple attributes in collections.
A geo index is used to find places on the surface of the earth fast.

The geo index stores two-dimensional coordinates. It can be created on either two
separate document attributes (latitude and longitude) or a single array attribute that
contains both latitude and longitude. Latitude and longitude must be numeric values.

The geo index provides operations to find documents with coordinates nearest to a given
comparison coordinate, and to find documents with coordinates that are within a specifiable
radius around a comparison coordinate.

The geo index is used via dedicated functions in AQL, the simple queries
functions and it is implicitly applied when in AQL a SORT or FILTER is used with
the distance function. Otherwise it will not be used for other types of queries
or conditions.


Fulltext Index
--------------

A fulltext index can be used to find words, or prefixes of words inside documents.
A fulltext index can be created on a single attribute only, and will index all words
contained in documents that have a textual value in that attribute. Only words with a (specifiable)
minimum length are indexed. Word tokenization is done using the word boundary analysis
provided by libicu, which is taking into account the selected language provided at
server start. Words are indexed in their lower-cased form. The index supports complete
match queries (full words) and prefix queries, plus basic logical operations such as
`and`, `or` and `not` for combining partial results.

The fulltext index is sparse, meaning it will only index documents for which the index
attribute is set and contains a string value. Additionally, only words with a configurable
minimum length will be included in the index.

The fulltext index is used via dedicated functions in AQL or the simple queries, but will
not be enabled for other types of queries or conditions.


Persistent Index
----------------

{% hint 'warning' %}
this index should not be used anymore, instead use the rocksdb storage engine
with either the *skiplist* or *hash* index.
{% endhint %}

The persistent index is a sorted index with persistence. The index entries are written to
disk when documents are stored or updated. That means the index entries do not need to be
rebuilt from the collection data when the server is restarted or the indexed collection
is initially loaded. Thus using persistent indexes may reduce collection loading times.

The persistent index type can be used for secondary indexes at the moment. That means the
persistent index currently cannot be made the only index for a collection, because there
will always be the in-memory primary index for the collection in addition, and potentially
more indexes (such as the edges index for an edge collection).

The index implementation is using the RocksDB engine, and it provides logarithmic complexity
for insert, update, and remove operations. As the persistent index is not an in-memory
index, it does not store pointers into the primary index as all the in-memory indexes do,
but instead it stores a document's primary key. To retrieve a document via a persistent
index via an index value lookup, there will therefore be an additional O(1) lookup into
the primary index to fetch the actual document.

As the persistent index is sorted, it can be used for point lookups, range queries and sorting
operations, but only if either all index attributes are provided in a query, or if a leftmost
prefix of the index attributes is specified.


Indexing attributes and sub-attributes
--------------------------------------

Top-level as well as nested attributes can be indexed. For attributes at the top level,
the attribute names alone are required. To index a single field, pass an array with a
single element (string of the attribute key) to the *fields* parameter of the
[ensureIndex() method](WorkingWithIndexes.md#creating-an-index). To create a
combined index over multiple fields, simply add more members to the *fields* array:

```js
// { name: "Smith", age: 35 }
db.posts.ensureIndex({ type: "hash", fields: [ "name" ] })
db.posts.ensureIndex({ type: "hash", fields: [ "name", "age" ] })
```

To index sub-attributes, specify the attribute path using the dot notation:

```js
// { name: {last: "Smith", first: "John" } }
db.posts.ensureIndex({ type: "hash", fields: [ "name.last" ] })
db.posts.ensureIndex({ type: "hash", fields: [ "name.last", "name.first" ] })
```

Indexing array values
---------------------

If an index attribute contains an array, ArangoDB will store the entire array as the index value
by default. Accessing individual members of the array via the index is not possible this
way.

To make an index insert the individual array members into the index instead of the entire array
value, a special array index needs to be created for the attribute. Array indexes can be set up
like regular hash or skiplist indexes using the `collection.ensureIndex()` function. To make a
hash or skiplist index an array index, the index attribute name needs to be extended with <i>[\*]</i>
when creating the index and when filtering in an AQL query using the `IN` operator.

The following example creates an array hash index on the `tags` attribute in a collection named
`posts`:

```js
db.posts.ensureIndex({ type: "hash", fields: [ "tags[*]" ] });
db.posts.insert({ tags: [ "foobar", "baz", "quux" ] });
```

This array index can then be used for looking up individual `tags` values from AQL queries via
the `IN` operator:

```js
FOR doc IN posts
  FILTER 'foobar' IN doc.tags
  RETURN doc
```

It is possible to add the [array expansion operator](../../AQL/Advanced/ArrayOperators.html#array-expansion)
<i>[\*]</i>, but it is not mandatory. You may use it to indicate that an array index is used,
it is purely cosmetic however:

```js
FOR doc IN posts
  FILTER 'foobar' IN doc.tags[*]
  RETURN doc
```

The following FILTER conditions will **not use** the array index:

```js
FILTER doc.tags ANY == 'foobar'
FILTER doc.tags ANY IN 'foobar'
FILTER doc.tags IN 'foobar'
FILTER doc.tags == 'foobar'
FILTER 'foobar' == doc.tags
```

It is also possible to create an index on subattributes of array values. This makes sense
if the index attribute is an array of objects, e.g.

```js
db.posts.ensureIndex({ type: "hash", fields: [ "tags[*].name" ] });
db.posts.insert({ tags: [ { name: "foobar" }, { name: "baz" }, { name: "quux" } ] });
```

The following query will then use the array index (this does require the
[array expansion operator](../../AQL/Advanced/ArrayOperators.html#array-expansion)):

```js
FOR doc IN posts
  FILTER 'foobar' IN doc.tags[*].name
  RETURN doc
```

If you store a document having the array which does contain elements not having
the subattributes this document will also be indexed with the value `null`, which
in ArangoDB is equal to attribute not existing.

ArangoDB supports creating array indexes with a single <i>[\*]</i> operator per index
attribute. For example, creating an index as follows is **not supported**:

```js
db.posts.ensureIndex({ type: "hash", fields: [ "tags[*].name[*].value" ] });
```

Array values will automatically be de-duplicated before being inserted into an array index.
For example, if the following document is inserted into the collection, the duplicate array
value `bar` will be inserted only once:

```js
db.posts.insert({ tags: [ "foobar", "bar", "bar" ] });
```

This is done to avoid redundant storage of the same index value for the same document, which
would not provide any benefit.

If an array index is declared **unique**, the de-duplication of array values will happen before
inserting the values into the index, so the above insert operation with two identical values
`bar` will not necessarily fail

It will always fail if the index already contains an instance of the `bar` value. However, if
the value `bar` is not already present in the index, then the de-duplication of the array values will
effectively lead to `bar` being inserted only once.

To turn off the deduplication of array values, it is possible to set the **deduplicate** attribute
on the array index to `false`. The default value for **deduplicate** is `true` however, so
de-duplication will take place if not explicitly turned off.

```js
db.posts.ensureIndex({ type: "hash", fields: [ "tags[*]" ], deduplicate: false });

// will fail now
db.posts.insert({ tags: [ "foobar", "bar", "bar" ] });
```

If an array index is declared and you store documents that do not have an array at the specified attribute
this document will not be inserted in the index. Hence the following objects will not be indexed:

```js
db.posts.ensureIndex({ type: "hash", fields: [ "tags[*]" ] });
db.posts.insert({ something: "else" });
db.posts.insert({ tags: null });
db.posts.insert({ tags: "this is no array" });
db.posts.insert({ tags: { content: [1, 2, 3] } });
```

An array index is able to index explicit `null` values. When queried for `null`values, it
will only return those documents having explicitly `null` stored in the array, it will not
return any documents that do not have the array at all.

```js
db.posts.ensureIndex({ type: "hash", fields: [ "tags[*]" ] });
db.posts.insert({tags: null}) // Will not be indexed
db.posts.insert({tags: []})  // Will not be indexed
db.posts.insert({tags: [null]}); // Will be indexed for null
db.posts.insert({tags: [null, 1, 2]}); // Will be indexed for null, 1 and 2
```

Declaring an array index as **sparse** does not have an effect on the array part of the index,
this in particular means that explicit `null` values are also indexed in the **sparse** version.
If an index is combined from an array and a normal attribute the sparsity will apply for the attribute e.g.:

```js
db.posts.ensureIndex({ type: "hash", fields: [ "tags[*]", "name" ], sparse: true });
db.posts.insert({tags: null, name: "alice"}) // Will not be indexed
db.posts.insert({tags: [], name: "alice"}) // Will not be indexed
db.posts.insert({tags: [1, 2, 3]}) // Will not be indexed
db.posts.insert({tags: [1, 2, 3], name: null}) // Will not be indexed
db.posts.insert({tags: [1, 2, 3], name: "alice"})
// Will be indexed for [1, "alice"], [2, "alice"], [3, "alice"]
db.posts.insert({tags: [null], name: "bob"})
// Will be indexed for [null, "bob"]
```

Please note that filtering using array indexes only works from within AQL queries and
only if the query filters on the indexed attribute using the `IN` operator. The other
comparison operators (`==`, `!=`, `>`, `>=`, `<`, `<=`, `ANY`, `ALL`, `NONE`) currently
cannot use array indexes.

Vertex centric indexes
----------------------

As mentioned above, the most important indexes for graphs are the edge
indexes, indexing the `_from` and `_to` attributes of edge collections.
They provide very quick access to all edges originating in or arriving
at a given vertex, which allows to quickly find all neighbors of a vertex
in a graph.

In many cases one would like to run more specific queries, for example
finding amongst the edges originating from a given vertex only those
with a timestamp greater than or equal to some date and time. Exactly this
is achieved with "vertex centric indexes". In a sense these are localized
indexes for an edge collection, which sit at every single vertex.

Technically, they are implemented in ArangoDB as indexes, which sort the
complete edge collection first by `_from` and then by other attributes
for _OUTBOUND_ traversals, or first by `_to` and then by other attributes
for _INBOUND_ traversals. For traversals in _ANY_ direction two indexes
are needed, one with `_from` and the other with `_to` as first indexed field.

If we for example have a skiplist index on the attributes `_from` and
`timestamp` of an edge collection, we can answer the above question
very quickly with a single range lookup in the index.

Since ArangoDB 3.0 one can create sorted indexes (type "skiplist" and
"persistent") that index the special edge attributes `_from` or `_to`
and additionally other attributes. Since ArangoDB 3.1, these are used
in graph traversals, when appropriate `FILTER` statements are found
by the optimizer.

For example, to create a vertex centric index of the above type, you
would simply do

```js
db.edges.ensureIndex({"type":"skiplist", "fields": ["_from", "timestamp"]});
```

in arangosh. Then, queries like

```js
FOR v, e, p IN 1..1 OUTBOUND "V/1" edges
  FILTER e.timestamp >= "2018-07-09"
  RETURN p
```

will be considerably faster in case there are many edges originating
from vertex `"V/1"` but only few with a recent time stamp. Note that the
optimizer may prefer the default edge index over vertex centric indexes
based on the costs it estimates, even if a vertex centric index might
in fact be faster. Vertex centric indexes are more likely to be chosen
for highly connected graphs and with RocksDB storage engine.


Creating Indexes in Background
------------------------------

{% hint 'info' %}
This section only applies to the *rocksdb* storage engine
{% endhint %}

Creating new indexes is by default done under an exclusive collection lock. This means
that the collection (or the respective shards) are not available as long as the index
is created. This "foreground" index creation can be undesirable, if you have to perform it
on a live system without a dedicated maintenance window.

**STARTING FROM VERSION vX.Y.Z**, indexes can also be created in "background", not using an exclusive lock during the creation.
The collection remains available, other CRUD operations can run on the collection while the index is created.
This can be achieved by using the *inBackground* option.

To create a indexes in the background in *arangosh* just specify `inBackground: true`,
like in the following examples:

```js
// create the hash index in the background
db.collection.ensureIndex({ type: "hash", fields: [ "value" ], unique: false, inBackground: true });
db.collection.ensureIndex({ type: "hash", fields: [ "email" ], unique: true, inBackground: true });

// skiplist indexes work also of course
db.collection.ensureIndex({ type :"skiplist", fields: ["abc", "cdef"], unique: true, inBackground: true });
db.collection.ensureIndex({ type :"skiplist", fields: ["abc", "cdef"], sparse: true, inBackground: true });

// also supported on fulltext indexes
db.collection.ensureIndex({ type: "geo", fields: [ "latitude", "longitude"], inBackground: true });
db.collection.ensureIndex({ type: "geo", fields: [ "latitude", "longitude"], inBackground: true });
db.collection.ensureIndex({ type: "fulltext", fields: [ "text" ], minLength: 4, inBackground: true })
```

### Behavior

Indexes that are still in the build process will not be visible via the ArangoDB API. Nevertheless it is not
possible to create the same index twice via the *ensureIndex* API. AQL Queries will not use these indexes either
until the indexes report back as finished. Note that the initial *ensureIndex* call or HTTP request will block until the index is completely ready. Existing single-threaded client programs can safely specify the
*inBackground* option as *true* and continue to work as before.

{% hint 'info' %}
Should you be building an index in the background you cannot rename or drop the collection.
These operations will block until the index creation is finished.
{% endhint %}

Interrupted index build (i.e. due to a server crash) will remove the partially build index.
In the ArangoDB cluster the index might then be automatically recreated on affected shards.

### Performance

The background index creation might be slower than the "foreground" index creation and require more RAM.
Under a write heavy load (specifically many remove, update or replace) operations,
the background index creation needs to keep a list of removed documents in RAM. This might become unsustainable
if this list grows to tens of millions of entries.

Building an index is always a write heavy operation (internally), it is always a good idea to build indexes
during times with less load.