arangodb/Documentation/Books/AQL/Views/ArangoSearch/README.md

ArangoSearch Views in AQL
=========================

Views of type **arangosearch** are an integration layer meant to seamlessly
integrate with and natively expose the full power of the
[IResearch library](https://github.com/iresearch-toolkit/iresearch)
to the ArangoDB user.

They provide the capability to:
* evaluate together documents located in different collections
* filter documents based on AQL boolean expressions and functions
* sort the result set based on how closely each document matched the filter

ArangoSearch value analysis
---------------------------

A concept of value 'analysis' that is meant to break up a given value into
a set of sub-values internally tied together by metadata which influences both
the filter and sort stages to provide the most appropriate match for the
specified conditions, similar to queries to web search engines.

In plain terms this means a user can for example:
* request documents where the 'body' attribute best matches 'a quick brown fox'
* request documents where the 'dna' attribute best matches a DNA sub sequence
* request documents where the 'name' attribute best matches gender
* etc... (via custom analyzers described in the next section)

To a limited degree the concept of 'analysis' is even available in
non-ArangoSearch AQL, e.g. the TOKENS(...) function will utilize the power of
IResearch to break up a value into an AQL array that can be used anywhere in the
AQL query.

In plain terms this means a user can match a document attribute when its
value matches at least one entry from a set,
e.g. to match docs with 'word == quick' OR 'word == brown' OR 'word == fox'

    FOR doc IN someCollection
      FILTER doc.word IN TOKENS('a quick brown fox', 'text_en')
      RETRUN doc

ArangoSearch filters
--------------------

The basic ArangoSearch functionality can be accessed via common AQL filters and
operators, e.g.:

- *AND*
- *OR*
- *NOT*
- *==*
- *<=*
- *>=*
- *<*
- *>*
- *!=*
- *IN <ARRAY>*
- *IN <RANGE>*

However, the full power of ArangoSearch is harnessed and exposed via functions,
during both the filter and sort stages.

The supported AQL context functions are:

### ANALYZER()

`ANALYZER(filter-expression, analyzer)`

Override analyzer in a context of **filter-expression** with another one, denoted
by a specified **analyzer** argument, making it available for filter functions.

- *filter-expression* - any valid filter expression
- *analyzer* - string with the analyzer to imbue, i.e. *"text_en"* or one of the other
  [available string analyzers](../../../Manual/Views/ArangoSearch/Analyzers.html)

By default, context contains `Identity` analyzer.

### BOOST()

`BOOST(filter-expression, boost)`

Override boost in a context of **filter-expression** with a specified value,
making it available for scorer funtions.

- *filter-expression* - any valid filter expression
- *boost* - numeric boost value

By default, context contains boost value equal to `1.0`.

The supported filter functions are:

### EXISTS()

`EXISTS(attribute-name)`

Match documents where the attribute **attribute-name** exists in the document.

`EXISTS(attribute-name, "analyzer" [, analyzer])`

Match documents where the **attribute-name** exists in the document and
was indexed by the specified **analyzer**.
In case if **analyzer** isn't specified, current context analyzer (e.g. specified by
`ANALYZER` function) will be used.

`EXISTS(attribute-name, type)`

Match documents where the **attribute-name** exists in the document
 and is of the specified type.

- *attribute-name* - the path of the attribute to exist in the document
- *analyzer* - string with the analyzer used, i.e. *"text_en"* or one of the other
  [available string analyzers](../../../Manual/Views/ArangoSearch/Analyzers.html)
- *type* - data type as string; one of:
    - **bool**
    - **boolean**
    - **numeric**
    - **null**
    - **string**

In case if **analyzer** isn't specified, current context analyzer (e.g. specified by
`ANALYZER` function) will be used.

### PHRASE()

```
PHRASE(attribute-name,
       phrasePart [, skipTokens, phrasePart [, ... skipTokens, phrasePart]]
       [, analyzer])
```

Search for a phrase in the referenced attributes.

The phrase can be expressed as an arbitrary number of *phraseParts* separated by *skipToken* number of tokens.

- *attribute-name* - the path of the attribute to compare against in the document
- *phrasePart* - a string to search in the token stream; may consist of several words; will be split using the specified *analyzer*
- *skipTokens* number of words or tokens to treat as wildcards
- *analyzer* - string with the analyzer used, i.e. *"text_en"* or one of the other
  [available string analyzers](../../../Manual/Views/ArangoSearch/Analyzers.html)

### STARTS_WITH()

`STARTS_WITH(attribute-name, prefix)`

Match the value of the **attribute-name** that starts with **prefix**

- *attribute-name* - the path of the attribute to compare against in the document
- *prefix* - a string to search at the start of the text

### TOKENS()

`TOKENS(input, analyzer)`

Split the **input** string with the help of the specified **analyzer** into an Array.
The resulting Array can i.e. be used in subsequent `FILTER` statements with the **IN** operator.
This can be used to better understand how the specific analyzer is going to behave.

- *input* string to tokenize
- *analyzer* one of the [available string analyzers](../../../Manual/Views/ArangoSearch/Analyzers.html)

### MIN_MATCH()

`MIN_MATCH(filter-expression, [..., filter-expression], min-match-count)`

Match documents where at least **min-match-count** of the specified **filter-expression**s
are satisfied.

- *filter-expression* - any valid filter expression
- *min-match-count* - minimum number of filter-expression that should be satisfied

#### Filtering examples

to match documents which have a 'name' attribute

    FOR doc IN VIEW someView
      FILTER EXISTS(doc.name)
      RETURN doc

or

    FOR doc IN VIEW someView
      FILTER EXISTS(doc['name'])
      RETURN doc

to match documents where 'body' was analyzed via the 'text_en' analyzer

    FOR doc IN VIEW someView
      FILTER EXISTS(doc.body, 'analyzer', 'text_en')
      RETURN doc

or

    FOR doc IN VIEW someView
      FILTER EXISTS(doc['body'], 'analyzer', 'text_en')
      RETURN doc

or

    FOR doc IN VIEW someView
      FILTER ANALYZER(EXISTS(doc['body'], 'analyzer'), 'text_en')
      RETURN doc

to match documents which have an 'age' attribute of type number

    FOR doc IN VIEW someView
      FILTER EXISTS(doc.age, 'numeric')
      RETURN doc

or

    FOR doc IN VIEW someView
      FILTER EXISTS(doc['age'], 'numeric')
      RETURN doc

to match documents where 'description' contains word 'quick' or word
'brown' and has been analyzed with 'text_en' analyzer

    FOR doc IN VIEW someView
      FILTER ANALYZER(doc.description == 'quick' OR doc.description == 'brown', 'text_en')
      RETURN doc

to match documents where 'description' contains at least 2 of 3 words 'quick',
'brown', 'fox' and has been analyzed with 'text_en' analyzer

    FOR doc IN VIEW someView
      FILTER ANALYZER(
        MIN_MATCH(doc.description == 'quick', doc.description == 'brown', doc.description == 'fox', 2),
        'text_en'
      )
      RETURN doc

to match documents where 'description' contains a phrase 'quick brown'

    FOR doc IN VIEW someView
      FILTER PHRASE(doc.description, [ 'quick brown' ], 'text_en')
      RETURN doc

or

    FOR doc IN VIEW someView
      FILTER PHRASE(doc['description'], [ 'quick brown' ], 'text_en')
      RETURN doc

or

    FOR doc IN VIEW someView
      FILTER ANALYZER(PHRASE(doc['description'], [ 'quick brown' ]), 'text_en')
      RETURN doc

to match documents where 'body' contains the phrase consisting of a sequence
like this:
'quick' * 'fox jumps' (where the asterisk can be any single word)

    FOR doc IN VIEW someView
      FILTER PHRASE(doc.body, [ 'quick', 1, 'fox jumps' ], 'text_en')
      RETURN doc

or

    FOR doc IN VIEW someView
      FILTER PHRASE(doc['body'], [ 'quick', 1, 'fox jumps' ], 'text_en')
      RETURN doc

or

    FOR doc IN VIEW someView
      FILTER ANALYZER(PHRASE(doc['body'], [ 'quick', 1, 'fox jumps' ]), 'text_en')
      RETURN doc

to match documents where 'story' starts with 'In the beginning'

    FOR doc IN VIEW someView
      FILTER STARTS_WITH(doc.story, 'In the beginning')
      RETURN DOC

or

    FOR doc IN VIEW someView
      FILTER STARTS_WITH(doc['story'], 'In the beginning')
      RETURN DOC

to watch the analyzer doing its work

    RETURN TOKENS('a quick brown fox', 'text_en')

to match documents where 'description' best matches 'a quick brown fox'

    FOR doc IN VIEW someView
      FILTER ANALYZER(doc.description IN TOKENS('a quick brown fox', 'text_en'), 'text_en')
      RETURN doc

ArangoSearch sort
-----------------

A major feature of ArangoSearch views is their capability of sorting results
based on the creation-time filter conditions and zero or more sorting functions.
The sorting functions are meant to be user-defined.

Note: Similar to other sorting functions on regular collections the first
  argument to any sorting function is _always_ either the document emmited by
  the `FOR` statement, or some sub-attribute of it.

The sorting functions are meant to be user-defined. The following functions are already built in:

### Literal sorting
You can sort documents by simply specifying the *attribute-name* directly, as you do using indices in other places.

### Best Matching 25 Algorithm

`BM25(attribute-name, [k, [b]])`

Sorts documents using the [**Best Matching 25** algorithm](https://en.wikipedia.org/wiki/Okapi_BM25).

Optionally the term frequency **k** and coefficient **b** of the algorithm can be specified as floating point numbers:

- *k* defaults to `1.2`; *k* calibrates the text term frequency scaling.
  A *k* value of *0* corresponds to a binary model (no term frequency),
  and a large value corresponds to using raw term frequency.

- *b* defaults to `0.75`; *b* determines the scaling by the total text length.
  - b = 1 corresponds to fully scaling the term weight by the total text length
  - b = 0 corresponds to no length normalization.

At the extreme values of the coefficient *b*, BM25 turns into ranking functions known as BM11 (for b = 1) and BM15 (for b = 0).

### Term Frequency – Inverse Document Frequency Algorithm

`TFIDF(attribute-name, [with-norms])`

Sorts documents using the [**term frequency–inverse document frequency** algorithm](https://en.wikipedia.org/wiki/TF-IDF).

  optionally specifying that norms should be used via **with-norms**

### Sorting examples

to sort documents by the value of the 'name' attribute

    FOR doc IN VIEW someView
      SORT doc.name
      RETURN doc

or

    FOR doc IN VIEW someView
      SORT doc['name']
      RETURN doc

to sort documents via the
[BM25 algorithm](https://en.wikipedia.org/wiki/Okapi_BM25)

    FOR doc IN VIEW someView
      SORT BM25(doc)
      RETURN doc

to sort documents via the
[BM25 algorithm](https://en.wikipedia.org/wiki/Okapi_BM25)
with 'k' = 1.2 and 'b' = 0.75

    FOR doc IN VIEW someView
      SORT BM25(doc, 1.2, 0.75)
      RETURN doc

to sort documents via the
[TFIDF algorithm](https://en.wikipedia.org/wiki/TF-IDF)

    FOR doc IN VIEW someView
      SORT TFIDF(doc)
      RETURN doc

to sort documents via the
[TFIDF algorithm](https://en.wikipedia.org/wiki/TF-IDF) with norms

    FOR doc IN VIEW someView
      SORT TFIDF(doc, true)
      RETURN doc

to sort documents by value of 'name' and then by the
[TFIDF algorithm](https://en.wikipedia.org/wiki/TF-IDF) where 'name' values are
equivalent

    FOR doc IN VIEW someView
      SORT doc.name, TFIDF(doc)
      RETURN doc


Use cases
---------

### Prefix search

The data contained in our view looks like that:

```json
{ "id": 1, "body": "ThisIsAVeryLongWord" }
{ "id": 2, "body": "ThisIsNotSoLong" }
{ "id": 3, "body": "ThisIsShorter" }
{ "id": 4, "body": "ThisIs" }
{ "id": 5, "body": "ButNotThis" }
```

We now want to search for documents where the attribute `body` starts with "ThisIs",

A simple AQL query executing this prefix search:

    FOR doc IN VIEW someView
      FILTER STARTS_WITH(doc.body, 'ThisIs')
      RETURN doc

It will find the documents with the ids `1`, `2`, `3`, `4`, but not `5`.