ArangoSearch Views in AQL
Views of type arangosearch are an integration layer meant to seamlessly integrate with and natively expose the full power of the IResearch library to the ArangoDB user.
They provide the capability to:
- evaluate documents located in different collections together
- filter documents based on AQL boolean expressions and functions
- sort the result set based on how closely each document matched the filter
ArangoSearch value analysis
ArangoSearch introduces the concept of value 'analysis': a given value is broken up into a set of sub-values that are internally tied together by metadata. This metadata influences both the filter and sort stages so that they provide the most appropriate match for the specified conditions, similar to queries to web search engines.
In plain terms this means a user can for example:
- request documents where the 'body' attribute best matches 'a quick brown fox'
- request documents where the 'dna' attribute best matches a DNA subsequence
- request documents where the 'name' attribute best matches gender
- etc... (via custom analyzers described in the next section)
To a limited degree the concept of 'analysis' is even available in non-ArangoSearch AQL, e.g. the TOKENS(...) function will utilize the power of IResearch to break up a value into an AQL array that can be used anywhere in the AQL query.
In plain terms this means a user can match a document attribute when its value matches at least one entry from a set, e.g. to match docs with 'word == quick' OR 'word == brown' OR 'word == fox':
FOR doc IN someCollection
FILTER doc.word IN TOKENS('a quick brown fox', 'text_en')
RETURN doc
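For comparison, the same match can be spelled out with explicit equality checks. This is only a sketch: it assumes that the 'text_en' analyzer produces the tokens 'quick', 'brown' and 'fox' for this input (it may also emit a token for 'a'), and it reuses the placeholder names someCollection and word from the query above.

```aql
// Hand-expanded form of the TOKENS() filter.
// Assumption: the analyzer yields exactly these three tokens.
FOR doc IN someCollection
  FILTER doc.word == 'quick' OR doc.word == 'brown' OR doc.word == 'fox'
  RETURN doc
```

The TOKENS() form is usually preferable, because the set of tokens then stays in sync with whatever the analyzer actually produces.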
ArangoSearch filters
The basic ArangoSearch functionality can be accessed via common AQL filters and operators, e.g.:
- AND
- OR
- NOT
- ==
- <=
- >=
- <
- >
- !=
- IN (array)
- IN (range)
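As a rough illustration, several of these operators can be combined in a single filter on a view. The view name someView follows the examples later in this section; the attributes age and name are hypothetical.

```aql
// Sketch only: 'age' and 'name' are hypothetical attributes.
FOR doc IN VIEW someView
  FILTER doc.age >= 18 AND doc.age < 65 AND doc.name != 'unknown'
  RETURN doc
```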
However, the full power of ArangoSearch is harnessed and exposed via functions, during both the filter and sort stages.
The supported filter functions are:
EXISTS()
EXISTS(attribute-name)
Match documents where the attribute attribute-name exists in the document.
EXISTS(attribute-name, "analyzer", analyzer)
Match documents where the attribute-name exists in the document and was indexed by the specified analyzer.
EXISTS(attribute-name, "type", type)
Match documents where the attribute-name exists in the document and is of the specified type.
- attribute-name - the path of the attribute to exist in the document
- analyzer - string with the analyzer used, e.g. "text_en" or one of the other available string analyzers
- type - data type as string; one of:
  - bool
  - boolean
  - numeric
  - null
  - string
PHRASE()
PHRASE(attribute-name,
phrasePart [, skipTokens, phrasePart [, ... skipTokens, phrasePart]],
analyzer)
Search for a phrase in the referenced attributes.
The phrase can be expressed as an arbitrary number of phraseParts, each separated from the next by skipTokens tokens.
- attribute-name - the path of the attribute to compare against in the document
- phrasePart - a string to search in the token stream; may consist of several words; will be split using the specified analyzer
- skipTokens - number of words or tokens to treat as wildcards
- analyzer - string with the analyzer used, e.g. "text_en" or one of the other available string analyzers
STARTS_WITH()
STARTS_WITH(attribute-name, prefix)
Match documents where the value of attribute-name starts with prefix.
- attribute-name - the path of the attribute to compare against in the document
- prefix - a string to search at the start of the text
TOKENS()
TOKENS(input, analyzer)
Split the input string into an array with the help of the specified analyzer.
The resulting array can, for example, be used in subsequent FILTER statements with the IN operator.
This can also be used to better understand how a specific analyzer is going to behave.
- input - string to tokenize
- analyzer - one of the available string analyzers
Filtering examples
to match documents which have a 'name' attribute
FOR doc IN VIEW someView
FILTER EXISTS(doc.name)
RETURN doc
or
FOR doc IN VIEW someView
FILTER EXISTS(doc['name'])
RETURN doc
to match documents where 'body' was analyzed via the 'text_en' analyzer
FOR doc IN VIEW someView
FILTER EXISTS(doc.body, 'analyzer', 'text_en')
RETURN doc
or
FOR doc IN VIEW someView
FILTER EXISTS(doc['body'], 'analyzer', 'text_en')
RETURN doc
to match documents which have an 'age' attribute of type number
FOR doc IN VIEW someView
FILTER EXISTS(doc.age, 'type', 'numeric')
RETURN doc
or
FOR doc IN VIEW someView
FILTER EXISTS(doc['age'], 'type', 'numeric')
RETURN doc
to match documents where 'description' contains a phrase 'quick brown'
FOR doc IN VIEW someView
FILTER PHRASE(doc.description, [ 'quick brown' ], 'text_en')
RETURN doc
or
FOR doc IN VIEW someView
FILTER PHRASE(doc['description'], [ 'quick brown' ], 'text_en')
RETURN doc
to match documents where 'body' contains the phrase consisting of a sequence like this: 'quick' * 'fox jumps' (where the asterisk can be any single word)
FOR doc IN VIEW someView
FILTER PHRASE(doc.body, [ 'quick', 1, 'fox jumps' ], 'text_en')
RETURN doc
or
FOR doc IN VIEW someView
FILTER PHRASE(doc['body'], [ 'quick', 1, 'fox jumps' ], 'text_en')
RETURN doc
to match documents where 'story' starts with 'In the beginning'
FOR doc IN VIEW someView
FILTER STARTS_WITH(doc.story, 'In the beginning')
RETURN doc
or
FOR doc IN VIEW someView
FILTER STARTS_WITH(doc['story'], 'In the beginning')
RETURN doc
to watch the analyzer doing its work
RETURN TOKENS('a quick brown fox', 'text_en')
to match documents where 'description' best matches 'a quick brown fox'
FOR doc IN VIEW someView
FILTER doc.description IN TOKENS('a quick brown fox', 'text_en')
RETURN doc
ArangoSearch sort
A major feature of ArangoSearch views is their capability of sorting results based on the creation-time filter conditions and zero or more sorting functions. The sorting functions are meant to be user-defined.
Note: Similar to other sorting functions on regular collections, the first argument to any sorting function is always either the document emitted by the FOR statement, or some sub-attribute of it.
The following functions are already built in:
Literal sorting
You can sort documents by simply specifying the attribute-name directly, as you do using indices in other places.
Best Matching 25 Algorithm
BM25(attribute-name, [k, [b]])
Sorts documents using the Best Matching 25 algorithm.
Optionally the term frequency k and coefficient b of the algorithm can be specified as floating point numbers:
- k defaults to 1.2; k calibrates the text term frequency scaling. A k value of 0 corresponds to a binary model (no term frequency), and a large value corresponds to using raw term frequency.
- b defaults to 0.75; b determines the scaling by the total text length.
  - b = 1 corresponds to fully scaling the term weight by the total text length
  - b = 0 corresponds to no length normalization.
At the extreme values of the coefficient b, BM25 turns into ranking functions known as BM11 (for b = 1) and BM15 (for b = 0).
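For reference, the textbook form of the BM25 score is sketched below; how IResearch computes it internally may differ in detail. The symbols are introduced here only for illustration: f(q_i, D) is the frequency of query term q_i in document D, |D| is the length of D, avgdl is the average document length, and IDF is the inverse document frequency.

```latex
\mathrm{score}(D, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i) \cdot
  \frac{f(q_i, D) \, (k + 1)}
       {f(q_i, D) + k \left( 1 - b + b \, \frac{|D|}{\mathrm{avgdl}} \right)}
```

In this form, k = 0 reduces the fraction to 1 for every term that occurs (the binary model), b = 0 removes the document length from the score, and b = 1 applies it in full.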
Term Frequency – Inverse Document Frequency Algorithm
TFIDF(attribute-name, [with-norms])
Sorts documents using the term frequency–inverse document frequency algorithm, optionally specifying via with-norms that norms should be used.
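A common textbook formulation of TF-IDF is sketched here; it is one of several variants and not necessarily the exact one used by IResearch. The symbols are introduced only for illustration: tf(t, d) is the number of occurrences of term t in document d, N is the total number of documents, and df(t) is the number of documents that contain t.

```latex
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \log \frac{N}{\mathrm{df}(t)}
```

In search libraries, norms usually refer to an additional length normalization factor applied to this score; whether that is exactly what with-norms enables here is an assumption.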
Sorting examples
to sort documents by the value of the 'name' attribute
FOR doc IN VIEW someView
SORT doc.name
RETURN doc
or
FOR doc IN VIEW someView
SORT doc['name']
RETURN doc
to sort documents via the BM25 algorithm
FOR doc IN VIEW someView
SORT BM25(doc)
RETURN doc
to sort documents via the BM25 algorithm with 'k' = 1.2 and 'b' = 0.75
FOR doc IN VIEW someView
SORT BM25(doc, 1.2, 0.75)
RETURN doc
to sort documents via the TFIDF algorithm
FOR doc IN VIEW someView
SORT TFIDF(doc)
RETURN doc
to sort documents via the TFIDF algorithm with norms
FOR doc IN VIEW someView
SORT TFIDF(doc, true)
RETURN doc
to sort documents by value of 'name' and then by the TFIDF algorithm where 'name' values are equivalent
FOR doc IN VIEW someView
SORT doc.name, TFIDF(doc)
RETURN doc
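Filtering and sorting can also be combined in one query. The sketch below assumes that a higher score indicates a better match, so the results are ordered with DESC; it reuses the placeholder names someView and description from the filtering examples.

```aql
// Sketch: filter by analyzed tokens, then rank the best matches first.
FOR doc IN VIEW someView
  FILTER doc.description IN TOKENS('a quick brown fox', 'text_en')
  SORT BM25(doc) DESC
  RETURN doc
```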
Use cases
Prefix search
The data contained in our view looks like this:
{ "id": 1, "body": "ThisIsAVeryLongWord" }
{ "id": 2, "body": "ThisIsNotSoLong" }
{ "id": 3, "body": "ThisIsShorter" }
{ "id": 4, "body": "ThisIs" }
{ "id": 5, "body": "ButNotThis" }
We now want to search for documents where the attribute body starts with "ThisIs".
A simple AQL query executing this prefix search:
FOR doc IN VIEW someView
FILTER STARTS_WITH(doc.body, 'ThisIs')
RETURN doc
It will find the documents with the ids 1, 2, 3, 4, but not 5.
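If only the matching ids are of interest, the query can be narrowed as in the following sketch. It assumes the sample data shown above and sorts by the id attribute so that the order of the result is predictable.

```aql
// Sketch: return only the ids of the prefix matches, in ascending order.
FOR doc IN VIEW someView
  FILTER STARTS_WITH(doc.body, 'ThisIs')
  SORT doc.id
  RETURN doc.id
```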