ArangoSearch Views in AQL

Views of type arangosearch are an integration layer that natively exposes the full power of the IResearch library to the ArangoDB user.

They provide the capability to:

  • evaluate together documents located in different collections
  • filter documents based on AQL boolean expressions and functions
  • sort the result set based on how closely each document matched the filter

ArangoSearch value analysis

ArangoSearch introduces the concept of value 'analysis': a given value is broken up into a set of sub-values that are internally tied together by metadata. This metadata influences both the filter and sort stages so that they provide the most appropriate match for the specified conditions, similar to queries to web search engines.

In plain terms this means a user can for example:

  • request documents where the 'body' attribute best matches 'a quick brown fox'
  • request documents where the 'dna' attribute best matches a DNA subsequence
  • request documents where the 'name' attribute best matches gender
  • etc... (via custom analyzers described in the next section)

To a limited degree the concept of 'analysis' is even available in non-ArangoSearch AQL. For example, the TOKENS(...) function uses IResearch to break up a value into an AQL array that can be used anywhere in the AQL query.

In plain terms this means a user can match a document attribute when its value matches at least one entry from a set, e.g. to match documents with 'word == quick' OR 'word == brown' OR 'word == fox':

FOR doc IN someCollection
  FILTER doc.word IN TOKENS('a quick brown fox', 'text_en')
  RETURN doc
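
The membership test behind this query can be modelled outside the database. The following Python sketch is only an illustration of the idea: `toy_tokens` is a hypothetical stand-in for TOKENS() that lowercases and splits on whitespace, whereas a real analyzer such as 'text_en' may additionally stem words or drop stop words. The sample documents are invented for the example.

```python
def toy_tokens(text):
    """Toy stand-in for TOKENS(text, analyzer): lowercase, split on whitespace.

    Assumption: no stemming and no stop-word removal, unlike a real analyzer.
    """
    return text.lower().split()


# Hypothetical sample documents with a single-word 'word' attribute.
some_collection = [
    {"word": "quick"},
    {"word": "lazy"},
    {"word": "fox"},
]

tokens = toy_tokens("a quick brown fox")

# Rough equivalent of: FILTER doc.word IN TOKENS('a quick brown fox', ...)
matches = [doc for doc in some_collection if doc["word"] in tokens]
print(matches)
```

A document is kept as soon as its attribute equals any one of the tokens, which is the OR behaviour described above.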

ArangoSearch filters

The basic ArangoSearch functionality can be accessed via common AQL filters and operators, e.g.:

  • AND
  • OR
  • NOT
  • ==
  • <=
  • >=
  • <
  • >
  • !=
  • IN

However, the full power of ArangoSearch is harnessed and exposed via functions, during both the filter and sort stages.

The supported filter functions are:

EXISTS()

EXISTS(attribute-name)

Match documents where the attribute attribute-name exists in the document.

EXISTS(attribute-name, "analyzer", analyzer)

Match documents where the attribute-name exists in the document and was indexed by the specified analyzer.

EXISTS(attribute-name, "type", type)

Match documents where the attribute-name exists in the document and is of the specified type.

  • attribute-name - the path of the attribute to exist in the document
  • analyzer - string with the analyzer used, e.g. "text_en" or one of the other available string analyzers
  • type - data type as string; one of:
    • bool
    • boolean
    • numeric
    • null
    • string
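
The plain and "type" variants of EXISTS() can be pictured with a small Python model. This is a sketch under assumptions, not the ArangoSearch implementation: the helper names `lookup` and `exists` are hypothetical, attribute paths are written as dotted strings, and an attribute that is present with a null value is assumed to count as existing. The "analyzer" variant depends on index metadata and is not modelled.

```python
_MISSING = object()  # sentinel: distinguishes "absent" from a stored null


def lookup(doc, path):
    """Follow a dotted attribute path; return _MISSING if a step is absent."""
    current = doc
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return _MISSING
        current = current[key]
    return current


def exists(doc, path, type_name=None):
    """Model of EXISTS(path) and EXISTS(path, "type", type_name)."""
    value = lookup(doc, path)
    if value is _MISSING:
        return False
    if type_name is None:
        return True
    if type_name in ("bool", "boolean"):
        return isinstance(value, bool)
    if type_name == "numeric":
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    if type_name == "null":
        return value is None
    if type_name == "string":
        return isinstance(value, str)
    raise ValueError("unknown type name: " + repr(type_name))


print(exists({"age": 42}, "age", "numeric"))
print(exists({"age": "42"}, "age", "numeric"))
```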

PHRASE()

PHRASE(attribute-name, 
       phrasePart [, skipTokens, phrasePart [, ... skipTokens, phrasePart]],
       analyzer)

Search for a phrase in the referenced attributes.

The phrase can be expressed as an arbitrary number of phraseParts, each separated from the next by skipTokens tokens.

  • attribute-name - the path of the attribute to compare against in the document
  • phrasePart - a string to search in the token stream; may consist of several words; will be split using the specified analyzer
  • skipTokens - number of words or tokens to treat as wildcards
  • analyzer - string with the analyzer used, e.g. "text_en" or one of the other available string analyzers
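
How phrase parts and skipTokens combine can be illustrated with a short Python model. It is a sketch under stated assumptions rather than the IResearch algorithm: `toy_tokens` is a hypothetical whitespace tokenizer standing in for the analyzer argument, and `phrase_matches` is an invented helper that expands each skip count into that many wildcard positions.

```python
def toy_tokens(text):
    """Hypothetical analyzer stand-in: lowercase and split on whitespace."""
    return text.lower().split()


def phrase_matches(text, parts):
    """Return True if `text` contains the phrase described by `parts`.

    `parts` alternates phrase strings and integer skip counts,
    e.g. ['quick', 1, 'fox jumps'].
    """
    tokens = toy_tokens(text)

    # Expand the parts into one flat pattern; None marks a wildcard slot.
    pattern = []
    for part in parts:
        if isinstance(part, int):
            pattern.extend([None] * part)
        else:
            pattern.extend(toy_tokens(part))

    # Slide the pattern over the token stream and compare slot by slot.
    width = len(pattern)
    for start in range(len(tokens) - width + 1):
        window = tokens[start:start + width]
        if all(p is None or p == t for p, t in zip(pattern, window)):
            return True
    return False


print(phrase_matches("The quick brown fox jumps over", ["quick", 1, "fox jumps"]))
print(phrase_matches("The quick fox jumps over", ["quick", 1, "fox jumps"]))
```

In this model the second call fails because exactly one token must sit between 'quick' and 'fox'.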

STARTS_WITH()

STARTS_WITH(attribute-name, prefix)

Match documents where the value of attribute-name starts with prefix.

  • attribute-name - the path of the attribute to compare against in the document
  • prefix - a string to search at the start of the text

TOKENS()

TOKENS(input, analyzer)

Split the input string with the help of the specified analyzer into an array. The resulting array can, for example, be used in subsequent FILTER statements with the IN operator. This can also be used to better understand how a specific analyzer is going to behave.

Filtering examples

to match documents which have a 'name' attribute

FOR doc IN VIEW someView
  FILTER EXISTS(doc.name)
  RETURN doc

or

FOR doc IN VIEW someView
  FILTER EXISTS(doc['name'])
  RETURN doc

to match documents where 'body' was analyzed via the 'text_en' analyzer

FOR doc IN VIEW someView
  FILTER EXISTS(doc.body, 'analyzer', 'text_en')
  RETURN doc

or

FOR doc IN VIEW someView
  FILTER EXISTS(doc['body'], 'analyzer', 'text_en')
  RETURN doc

to match documents which have an 'age' attribute of type number

FOR doc IN VIEW someView
  FILTER EXISTS(doc.age, 'type', 'numeric')
  RETURN doc

or

FOR doc IN VIEW someView
  FILTER EXISTS(doc['age'], 'type', 'numeric')
  RETURN doc

to match documents where 'description' contains a phrase 'quick brown'

FOR doc IN VIEW someView
  FILTER PHRASE(doc.description, [ 'quick brown' ], 'text_en')
  RETURN doc

or

FOR doc IN VIEW someView
  FILTER PHRASE(doc['description'], [ 'quick brown' ], 'text_en')
  RETURN doc

to match documents where 'body' contains the phrase consisting of a sequence like this: 'quick' * 'fox jumps' (where the asterisk can be any single word)

FOR doc IN VIEW someView
  FILTER PHRASE(doc.body, [ 'quick', 1, 'fox jumps' ], 'text_en')
  RETURN doc

or

FOR doc IN VIEW someView
  FILTER PHRASE(doc['body'], [ 'quick', 1, 'fox jumps' ], 'text_en')
  RETURN doc

to match documents where 'story' starts with 'In the beginning'

FOR doc IN VIEW someView
  FILTER STARTS_WITH(doc.story, 'In the beginning')
  RETURN doc

or

FOR doc IN VIEW someView
  FILTER STARTS_WITH(doc['story'], 'In the beginning')
  RETURN doc

to watch the analyzer doing its work

RETURN TOKENS('a quick brown fox', 'text_en')

to match documents where 'description' best matches 'a quick brown fox'

FOR doc IN VIEW someView
  FILTER doc.description IN TOKENS('a quick brown fox', 'text_en')
  RETURN doc

ArangoSearch sort

A major feature of ArangoSearch views is their capability of sorting results based on the creation-time filter conditions and zero or more sorting functions. The sorting functions are meant to be user-defined.

Note: Similar to other sorting functions on regular collections, the first argument to any sorting function is always either the document emitted by the FOR statement or some sub-attribute of it.

The following sorting functions are already built in:

Literal sorting

You can sort documents by simply specifying the attribute-name directly, as you would when sorting on an attribute elsewhere in AQL.

Best Matching 25 Algorithm

BM25(attribute-name, [k, [b]])

Sorts documents using the Best Matching 25 algorithm.

Optionally the term frequency k and coefficient b of the algorithm can be specified as floating point numbers:

  • k defaults to 1.2; k calibrates the text term frequency scaling. A k value of 0 corresponds to a binary model (no term frequency), and a large value corresponds to using raw term frequency.

  • b defaults to 0.75; b determines the scaling by the total text length.

    • b = 1 corresponds to fully scaling the term weight by the total text length
    • b = 0 corresponds to no length normalization.

At the extreme values of the coefficient b, BM25 turns into ranking functions known as BM11 (for b = 1) and BM15 (for b = 0).
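
The effect of k and b can be checked numerically with the textbook BM25 formula for a single term. The Python sketch below is an assumption-laden illustration, not the scorer used by ArangoSearch: the function name, its parameters, and the particular inverse document frequency variant are chosen for the example.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k=1.2, b=0.75):
    """Textbook BM25 score of one term in one document (tf must be > 0).

    tf          - occurrences of the term in the document
    doc_len     - length of the document in tokens
    avg_doc_len - average document length in the corpus
    n_docs      - number of documents in the corpus
    doc_freq    - number of documents containing the term
    """
    # Assumed idf variant: the +1 inside the log keeps the value non-negative.
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # b = 0 gives a factor of 1 (no length normalization),
    # b = 1 scales fully by doc_len / avg_doc_len.
    length_norm = 1 - b + b * (doc_len / avg_doc_len)
    return idf * (tf * (k + 1)) / (tf + k * length_norm)


# k = 0: binary model, the term frequency no longer matters.
print(bm25_term_score(1, 100, 100, 1000, 10, k=0)
      == bm25_term_score(5, 100, 100, 1000, 10, k=0))

# b = 0 (BM15): the document length no longer matters.
print(bm25_term_score(3, 10, 100, 1000, 10, b=0)
      == bm25_term_score(3, 1000, 100, 1000, 10, b=0))
```

With the defaults, a longer document containing the same number of term occurrences receives a lower score than a shorter one.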

Term Frequency Inverse Document Frequency Algorithm

TFIDF(attribute-name, [with-norms])

Sorts documents using the term frequency-inverse document frequency algorithm.

Optionally, with-norms specifies that norms should be used.
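
As a rough illustration, the textbook form of this score can be written in a few lines of Python. Everything here is an assumption made for the example: the function name and parameters are hypothetical, the formula is the classic tf * log(N / df), and "norms" are modelled as a Lucene-style division by the square root of the document length, which may differ from what ArangoSearch does internally.

```python
import math


def tfidf_score(tf, n_docs, doc_freq, doc_len=None, with_norms=False):
    """Textbook TF-IDF score of one term in one document.

    tf       - occurrences of the term in the document
    n_docs   - number of documents in the corpus
    doc_freq - number of documents containing the term
    doc_len  - document length in tokens, required when with_norms is True
    """
    score = tf * math.log(n_docs / doc_freq)
    if with_norms:
        if doc_len is None:
            raise ValueError("doc_len is required when with_norms is True")
        # Assumed norm: longer documents are damped by 1 / sqrt(length).
        score /= math.sqrt(doc_len)
    return score


plain = tfidf_score(2, 100, 10)
normed = tfidf_score(2, 100, 10, doc_len=4, with_norms=True)
print(plain, normed)
```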

Sorting examples

to sort documents by the value of the 'name' attribute

FOR doc IN VIEW someView
  SORT doc.name
  RETURN doc

or

FOR doc IN VIEW someView
  SORT doc['name']
  RETURN doc

to sort documents via the BM25 algorithm

FOR doc IN VIEW someView
  SORT BM25(doc)
  RETURN doc

to sort documents via the BM25 algorithm with 'k' = 1.2 and 'b' = 0.75

FOR doc IN VIEW someView
  SORT BM25(doc, 1.2, 0.75)
  RETURN doc

to sort documents via the TFIDF algorithm

FOR doc IN VIEW someView
  SORT TFIDF(doc)
  RETURN doc

to sort documents via the TFIDF algorithm with norms

FOR doc IN VIEW someView
  SORT TFIDF(doc, true)
  RETURN doc

to sort documents by value of 'name' and then by the TFIDF algorithm where 'name' values are equivalent

FOR doc IN VIEW someView
  SORT doc.name, TFIDF(doc)
  RETURN doc

Use cases

The data contained in our view looks like this:

{ "id": 1, "body": "ThisIsAVeryLongWord" }
{ "id": 2, "body": "ThisIsNotSoLong" }
{ "id": 3, "body": "ThisIsShorter" }
{ "id": 4, "body": "ThisIs" }
{ "id": 5, "body": "ButNotThis" }

We now want to search for documents where the attribute body starts with "ThisIs".

A simple AQL query executing this prefix search:

FOR doc IN VIEW someView
  FILTER STARTS_WITH(doc.body, 'ThisIs')
  RETURN doc

It will find the documents with the ids 1, 2, 3, 4, but not 5.
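
The same result can be reproduced outside the database with a plain string prefix test. This Python sketch only mirrors the expected outcome on the sample data; `starts_with` is a hypothetical helper and says nothing about how the view evaluates the filter internally.

```python
# The five sample documents from the view.
view_documents = [
    {"id": 1, "body": "ThisIsAVeryLongWord"},
    {"id": 2, "body": "ThisIsNotSoLong"},
    {"id": 3, "body": "ThisIsShorter"},
    {"id": 4, "body": "ThisIs"},
    {"id": 5, "body": "ButNotThis"},
]


def starts_with(value, prefix):
    """Model of STARTS_WITH(): True if value is a string beginning with prefix."""
    return isinstance(value, str) and value.startswith(prefix)


matching_ids = [doc["id"] for doc in view_documents
                if starts_with(doc["body"], "ThisIs")]
print(matching_ids)  # [1, 2, 3, 4]
```

Document 5 is excluded because "This" appears at the end of its body rather than at the start.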