# Bringing the power of IResearch to ArangoDB
## What is ArangoSearch

ArangoSearch is a natively integrated AQL extension that makes use of the IResearch library.

ArangoSearch allows one to:

* join documents located in different collections into a single result list
* filter documents based on AQL boolean expressions and functions
* sort the result set based on how closely each document matches the filter

ArangoSearch introduces a concept of value 'analysis': a given value is broken up into
a set of sub-values that are internally tied together by metadata. This metadata influences
both the filter and sort stages so that the most appropriate matches for the
specified conditions can be provided, similar to queries against web search engines.

In plain terms this means a user can, for example:

* request documents where the 'body' attribute best matches 'a quick brown fox'
* request documents where the 'dna' attribute best matches a DNA sub-sequence
* request documents where the 'name' attribute best matches gender
* etc. (via custom analyzers described in the next section)
### The IResearch Library
IResearch is a cross-platform open-source indexing and searching engine written in C++,
optimized for speed and memory footprint, with source available from:
https://github.com/iresearch-toolkit/iresearch

IResearch is a framework for indexing, filtering and sorting of data. The indexing stage can
treat each data item as an atom or use custom 'analyzers' to break the data item
into sub-atomic pieces tied together with internally tracked metadata.

The IResearch framework in general can be further extended at runtime with
custom implementations of analyzers (used during the indexing and filtering
stages) and scorers (used during the sorting stage), allowing full control over
the behaviour of the engine.
### ArangoSearch Scorers
ArangoSearch accesses scorers directly by their internal names. The
name (in upper-case) of the scorer is the function name to be used in the
['SORT' section](../../AQL/Views/ArangoSearch.html#arangosearch-sort).
Function arguments (excluding the first argument) are serialized as a
string representation of a JSON array and passed directly to the corresponding
scorer. The first argument to any scorer function is the reference to the
current document emitted by the `FOR` statement, i.e. it would be `doc` for this
statement:

    FOR doc IN VIEW someView

IResearch provides a 'bm25' scorer implementing the
[BM25 algorithm](https://en.wikipedia.org/wiki/Okapi_BM25). This scorer
optionally takes 'k' and 'b' positional parameters.

The user is able to run queries with the said scorer, e.g.

    SORT BM25(doc, 1.2, 0.75)

The function arguments will then be serialized into a JSON representation:

```json
[ 1.2, 0.75 ]
```

and passed to the scorer implementation.
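Putting the pieces together, a complete query using the BM25 scorer could be sketched in arangosh as follows; the view name `someView` and the `LIMIT` value are illustrative assumptions:

```js
// arangosh sketch: query an ArangoSearch view and rank matches with the
// BM25 scorer (k = 1.2, b = 0.75); 'someView' is a placeholder name.
var results = db._query(
  "FOR doc IN VIEW someView " +
  "  SORT BM25(doc, 1.2, 0.75) DESC " +
  "  LIMIT 10 " +
  "  RETURN doc"
).toArray();

results.forEach(function (doc) {
  print(doc._key);
});
```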
Similarly an administrator may choose to deploy a custom DNA-ranking scorer 'DnaRank'.

The user is then immediately able to run queries with the said scorer, e.g.

    SORT DNARANK(doc, 123, 456, "abc", { "def": "ghi" })

The function arguments will then be serialized into a JSON representation:

```json
[ 123, 456, "abc", { "def": "ghi" } ]
```

and passed to the scorer implementation.

Runtime-plugging functionality for scorers is not available in ArangoDB at this
point in time, so ArangoDB comes with a few default-initialized scorers:

- *attribute-name*
  order results based on the value of **attribute-name**

- BM25
  order results based on the
  [BM25 algorithm](https://en.wikipedia.org/wiki/Okapi_BM25)

- TFIDF
  order results based on the
  [TFIDF algorithm](https://en.wikipedia.org/wiki/TF-IDF)
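As a rough sketch (the view and attribute names are made up), scorers and plain attribute values can be combined within a single `SORT`:

```js
// arangosh sketch: order primarily by TFIDF relevance, then by a 'title'
// attribute; 'someView' and 'title' are hypothetical names.
db._query(
  "FOR doc IN VIEW someView " +
  "  SORT TFIDF(doc) DESC, doc.title ASC " +
  "  RETURN doc.title"
).toArray();
```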
### ArangoSearch is much more than a fulltext search
Fulltext searching is only a subset of its available functionality. It is supported via
the 'text' analyzer and the 'tfidf'/'bm25' scorers, and works without a performance
penalty when matching documents from different collections or filtering on multiple
document attributes.
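For example, a fulltext-style query could be sketched as below; it assumes a view `someView` whose links index the `body` attribute with the built-in `text_en` analyzer:

```js
// arangosh sketch: fulltext matching plus relevance ranking.
// Assumes 'doc.body' is indexed via the 'text_en' analyzer in the view's links.
db._query(
  "FOR doc IN VIEW someView " +
  "  FILTER PHRASE(doc.body, 'quick brown fox', 'text_en') " +
  "  SORT BM25(doc) DESC " +
  "  RETURN doc"
).toArray();
```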
### View datasource
The IResearch functionality is exposed to ArangoDB via the ArangoSearch view
API: an ArangoSearch view is merely an identity transformation applied to
documents stored in linked collections of the same ArangoDB database.
In plain terms, an ArangoSearch view only allows filtering and sorting of documents
located in collections of the same database.
The matching documents themselves are returned as-is from their corresponding collections.
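A minimal sketch of this behaviour (the view and attribute names are assumptions):

```js
// arangosh sketch: matching documents are returned exactly as stored in
// their source collections; 'someView' and 'name' are placeholder names,
// and 'name' must be covered by the view's links.
db._query(
  "FOR doc IN VIEW someView " +
  "  FILTER doc.name == 'test' " +
  "  RETURN doc"
).toArray();
```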
### Links to ArangoDB collections
A concept of an ArangoDB collection 'link' is introduced to allow specifying
which ArangoDB collections a given ArangoSearch view should query for documents
and how these documents should be queried.

An ArangoSearch link is a uni-directional connection from an ArangoDB collection
to an ArangoSearch view describing how data coming from the said collection should
be made available in the given view. Each ArangoSearch link in an ArangoSearch view is
uniquely identified by the name of the ArangoDB collection it links to. An
ArangoSearch view may have zero or more links, each to a distinct ArangoDB
collection. Similarly an ArangoDB collection may be referenced via links by zero
or more distinct ArangoSearch views. In plain terms, any given ArangoSearch view may be
linked to any given ArangoDB collection of the same database with zero or at
most one link. However, any ArangoSearch view may be linked to multiple distinct
ArangoDB collections and, similarly, any ArangoDB collection may be referenced by
multiple ArangoSearch views.

To configure an ArangoSearch view for consideration of documents from a given
ArangoDB collection, a link definition must be added to the properties of the
said ArangoSearch view, defining the link parameters as per the section
[View definition/modification](#view-definitionmodification).
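A minimal arangosh sketch of adding and then removing such a link; the view and collection names are placeholders:

```js
// arangosh sketch: link 'someCollection' to the view 'someView', then
// drop the link again by nullifying it.
var view = db._view("someView");

// create (or update) the link: index all attributes with default settings
view.properties({
  links: {
    someCollection: { includeAllFields: true }
  }
});

// remove the link again
view.properties({
  links: {
    someCollection: null
  }
});
```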
### Analyzers
To simplify query syntax ArangoSearch provides a concept of
[named analyzers](ArangoSearch/Analyzers.md), which are merely aliases for a
type and configuration of IResearch analyzers. Management of named analyzers is
exposed via the REST, GUI and JavaScript APIs, for example as sketched below.
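As a hedged sketch of the JavaScript side only, assuming the `@arangodb/analyzers` arangosh module is available; the analyzer name and its configuration are purely illustrative:

```js
// arangosh sketch (assumes the '@arangodb/analyzers' module): create a
// named 'text' analyzer and list the analyzers known to the server.
var analyzers = require("@arangodb/analyzers");

analyzers.save("my_text_en", "text", {
  locale: "en",
  stemming: true
}, ["frequency", "norm", "position"]);

analyzers.toArray().forEach(function (a) {
  print(a.name() + " : " + a.type());
});
```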
### View definition/modification
An ArangoSearch view is configured via an object containing a set of
view-specific configuration directives and a map of link-specific configuration
directives (see the arangosh sketch below).

During view creation the following directives apply:

* id: (optional) the desired view identifier
* name: (required) the view name
* type: (required) the value "arangosearch"
* any of the directives from the section [View properties (modifiable)](#view-properties-modifiable)

During view modification the following directives apply:

* links: (optional)
  a mapping of collection-name/collection-identifier to one of:
  * link creation - link definition as per the section [Link properties](#link-properties)
  * link removal - the JSON keyword *null* (i.e. nullify a link if present)
* any of the directives from the section [View properties (modifiable)](#view-properties-modifiable)
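For instance, creating a view and then adding a link to it could be sketched in arangosh as follows (the view and collection names are assumptions):

```js
// arangosh sketch: create an ArangoSearch view, then modify its
// properties to link the pre-existing collection 'someCollection'.
var view = db._createView("someView", "arangosearch", {});

view.properties({
  links: {
    someCollection: {
      includeAllFields: true,
      analyzers: ["identity"]
    }
  }
});
```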
### View properties (modifiable)
* commit: (optional; default: use defaults for all values)
  configure the ArangoSearch view commit policy for single-item inserts/removals,
  e.g. when adding/removing documents from a linked ArangoDB collection

  * cleanupIntervalStep: (optional; default: `10`; to disable use: `0`)
    wait at least this many commits between removing unused files in the
    ArangoSearch data directory
    for the case where the consolidation policies merge segments often (i.e. a
    lot of commit+consolidate), a lower value will cause a lot of disk space to
    be wasted
    for the case where the consolidation policies rarely merge segments (i.e.
    few inserts/deletes), a higher value will impact performance without any
    added benefits

  * commitIntervalMsec: (optional; default: `60000`; to disable use: `0`)
    wait at least *count* milliseconds between committing view data store
    changes and making documents visible to queries
    for the case where there are a lot of inserts/updates, a lower value will
    cause the view not to account for them (until commit), and memory usage
    would continue to grow
    for the case where there are a few inserts/updates, a higher value will
    impact performance and waste disk space for each commit call without any
    added benefits

  * commitTimeoutMsec: (optional; default: `5000`; to disable use: `0`)
    try to commit as much as possible before *count* milliseconds have elapsed
    for the case where there are a lot of inserts/updates, a lower value will
    cause a delay in the view accounting for them, due to skipping of some
    commits
    for the case where there are a lot of inserts/updates, a higher value will
    cause higher memory consumption between commits due to accumulation of
    document modifications while a commit is in progress

  * consolidate: (optional; default: `none`)
    a per-policy mapping of thresholds in the range `[0.0, 1.0]` to determine
    data store segment merge candidates; if specified then only the listed
    policies are used, keys are any of:

    * bytes: (optional; for default values use an empty object: `{}`)

      * segmentThreshold: (optional; default: `300`; to disable use: `0`)
        apply consolidation policy IFF {segmentThreshold} >= #segments

      * threshold: (optional; default: `0.85`)
        consolidate `IFF {threshold} > segment_bytes / (all_segment_bytes / #segments)`

    * bytes_accum: (optional; for default values use an empty object: `{}`)

      * segmentThreshold: (optional; default: `300`; to disable use: `0`)
        apply consolidation policy IFF {segmentThreshold} >= #segments

      * threshold: (optional; default: `0.85`)
        consolidate `IFF {threshold} > (segment_bytes + sum_of_merge_candidate_segment_bytes) / all_segment_bytes`

    * count: (optional; for default values use an empty object: `{}`)

      * segmentThreshold: (optional; default: `300`; to disable use: `0`)
        apply consolidation policy IFF {segmentThreshold} >= #segments

      * threshold: (optional; default: `0.85`)
        consolidate `IFF {threshold} > segment_docs{valid} / (all_segment_docs{valid} / #segments)`

    * fill: (optional; for default values use an empty object: `{}`)

      * segmentThreshold: (optional; default: `300`; to disable use: `0`)
        apply consolidation policy IFF {segmentThreshold} >= #segments

      * threshold: (optional; default: `0.85`)
        consolidate `IFF {threshold} > #segment_docs{valid} / (#segment_docs{valid} + #segment_docs{removed})`

* locale: (optional; default: `C`)
  the default locale used for ordering processed attribute names

* threadsMaxIdle: (optional; default: `5`)
  maximum number of idle threads to keep for single-run tasks
  for the case where there are a lot of short-lived asynchronous tasks, a lower
  value will cause a lot of thread creation/deletion calls
  for the case where there are no short-lived asynchronous tasks, a higher
  value will only waste memory

* threadsMaxTotal: (optional; default: `5`)
  maximum total number of threads (>0) for single-run tasks
  for the case where there are a lot of parallelizable tasks and an abundance
  of resources, a lower value would limit performance
  for the case where there are limited resources (CPU/memory), a higher value
  will negatively impact performance
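A combined sketch of a properties update using several of the directives above; it assumes the nested `commit` object layout described in this section, and every value is purely illustrative rather than a recommendation:

```js
// arangosh sketch: tune the commit and consolidation behaviour of a view.
// The view name and all numbers are illustrative only.
db._view("someView").properties({
  locale: "C",
  threadsMaxIdle: 5,
  threadsMaxTotal: 5,
  commit: {
    cleanupIntervalStep: 10,
    commitIntervalMsec: 60000,
    commitTimeoutMsec: 5000,
    consolidate: {
      // only the listed policies are applied
      bytes_accum: { segmentThreshold: 300, threshold: 0.85 },
      count: {}   // empty object: use the defaults for this policy
    }
  }
});
```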
### View properties (unmodifiable)
* collections:
  an internally tracked list of collection identifiers which were explicitly
  added to the current view by the user via view 'link' property modification
  the list may contain no-longer-valid identifiers if the user did not
  explicitly drop the link for the said collection identifier from the current
  view; invalid collection identifiers are removed during view property
  modification
  among other things, this list is used for acquiring collection locks in
  transactions (i.e. during a view query no documents will be returned for
  collections not in this list) and for generating the view properties 'links'
  list
### Link properties
* analyzers: (optional; default: `[ 'identity' ]`)
  a list of analyzers, by name as defined via [Analyzers](ArangoSearch/Analyzers.md),
  that should be applied to values of processed document attributes

* fields: (optional; default: `{}`)
  an object `{attribute-name: [Link properties]}` of fields that should be
  processed at each level of the document
  each key specifies the document attribute to be processed; the value of
  *includeAllFields* is also consulted when selecting fields to be processed
  each value specifies the [Link properties](#link-properties) directives to be
  used when processing the specified field; a Link properties value of `{}`
  denotes inheritance of all (except *fields*) directives from the current
  level (a combined example is sketched after this list)

* includeAllFields: (optional; default: `false`)
  if true then process all document attributes (if not explicitly specified
  then process the fields with the default Link properties directives, i.e.
  `{}`), otherwise only consider attributes mentioned in *fields*

* trackListPositions: (optional; default: `false`)
  if true then for array values track the value position in the array, e.g.
  when querying for the input `{ attr: [ 'valueX', 'valueY', 'valueZ' ] }`
  the user must specify: `doc.attr[1] == 'valueY'`
  otherwise all values in an array are treated as equal alternatives, e.g.
  when querying for the input `{ attr: [ 'valueX', 'valueY', 'valueZ' ] }`
  the user must specify: `doc.attr == 'valueY'`