diff --git a/Documentation/Books/Manual/SUMMARY.md b/Documentation/Books/Manual/SUMMARY.md index 777006e143..3e300fb2a5 100644 --- a/Documentation/Books/Manual/SUMMARY.md +++ b/Documentation/Books/Manual/SUMMARY.md @@ -123,7 +123,10 @@ * [Working with Edges](Graphs/Edges/README.md) * [Pregel](Graphs/Pregel/README.md) * [ArangoSearch Views](Views/ArangoSearch/README.md) + * [Getting Started](Views/ArangoSearch/GettingStarted.md) + * [Detailed Overview](Views/ArangoSearch/DetailedOverview.md) * [Analyzers](Views/ArangoSearch/Analyzers.md) + * [Scorers](Views/ArangoSearch/Scorers.md) ## ADVANCED TOPICS diff --git a/Documentation/Books/Manual/Views/ArangoSearch/Analyzers.md b/Documentation/Books/Manual/Views/ArangoSearch/Analyzers.md index de2bf54325..e9186873f7 100644 --- a/Documentation/Books/Manual/Views/ArangoSearch/Analyzers.md +++ b/Documentation/Books/Manual/Views/ArangoSearch/Analyzers.md @@ -1,13 +1,14 @@ -### Analyzers: +ArangoSearch Analyzers +====================== To simplify query syntax ArangoSearch provides a concept of named analyzers which are merely aliases for type+configuration of IResearch analyzers. Management of -named analyzers is exposed via both REST, GUI and JavaScript APIs, e.g. +named analyzers is exposed via REST, GUI and JavaScript APIs e.g. `db._globalSettings("iresearch.analyzers")` A user then merely uses these analyzer names in ArangoSearch view configurations -and AQL queries, e.g. +and AQL queries. ArangoSearch provides a 'text' analyzer to analyze human readable text. A required configuration parameter for this type of analyzer is 'locale' used to specify @@ -27,7 +28,7 @@ The ArangoDB administrator may then set up a named analyzer 'text_des': The user is then immediately able to run queries with the said analyzer, e.g. 
-`FILTER doc.description IN TOKENS('Ein brauner Fuchs springt', 'text_des')` +`SEARCH doc.description IN TOKENS('Ein brauner Fuchs springt', 'text_des')` Similarly an administrator may choose to deploy a custom DNA analyzer 'DnaSeq': @@ -41,7 +42,7 @@ Similarly an administrator may choose to deploy a custom DNA analyzer 'DnaSeq': The user is then immediately able to run queries with the said analyzer, e.g. -`FILTER doc.dna IN TOKENS('ACGTCGTATGCACTGA', 'DnaSeq')` +`SEARCH doc.dna IN TOKENS('ACGTCGTATGCACTGA', 'DnaSeq')` To a limited degree the concept of 'analysis' is even available in non-IResearch AQL, e.g. the `TOKENS(...)` function will utilize the power of IResearch to break @@ -53,9 +54,9 @@ e.g. to match docs with 'word == quick' OR 'word == brown' OR 'word == fox' FOR doc IN someCollection FILTER doc.word IN TOKENS('a quick brown fox', 'text_en') - RETRUN doc + RETURN doc -Runtime-plugging functionality for analyzers is not avaiable in ArangoDB at this +Runtime-plugging functionality for analyzers is not available in ArangoDB at this point in time, so ArangoDB comes with a few default-initialized analyzers: * `identity` diff --git a/Documentation/Books/Manual/Views/ArangoSearch/DetailedOverview.md b/Documentation/Books/Manual/Views/ArangoSearch/DetailedOverview.md new file mode 100644 index 0000000000..099ec07bea --- /dev/null +++ b/Documentation/Books/Manual/Views/ArangoSearch/DetailedOverview.md @@ -0,0 +1,169 @@ +# Detailed overview of ArangoSearch views + +ArangoSearch is a powerful fulltext search component with additional functionality, +supported via the 'text' analyzer and 'tfidf'/'bm25' [scorers](Scorers.md), +without impact on performance when specifying documents from different collections or +filtering on multiple document attributes. 
+ +## View datasource + +The IResearch functionality is exposed to ArangoDB via the ArangoSearch view +API because the ArangoSearch view is merely an identity transformation applied +onto documents stored in linked collections of the same ArangoDB database. +In plain terms an ArangoSearch view only allows filtering and sorting of documents +located in collections of the same database. The matching documents themselves +are returned as-is from their corresponding collections. + +## Links to ArangoDB collections + +A concept of an ArangoDB collection 'link' is introduced to allow specifying +which ArangoDB collections a given ArangoSearch View should query for documents +and how these documents should be queried. + +An ArangoSearch Link is a uni-directional connection from an ArangoDB collection +to an ArangoSearch view describing how data coming from the said collection should +be made available in the given view. Each ArangoSearch Link in an ArangoSearch +view is uniquely identified by the name of the ArangoDB collection it links to. +An ArangoSearch view may have zero or more links, each to a distinct ArangoDB +collection. Similarly an ArangoDB collection may be referenced via links by zero +or more distinct ArangoSearch views. In plain terms any given ArangoSearch view +may be linked to any given ArangoDB collection of the same database with zero or +at most one link. However, any ArangoSearch view may be linked to multiple +distinct ArangoDB collections and similarly any ArangoDB collection may be +referenced by multiple ArangoSearch views. + +To configure an ArangoSearch view for consideration of documents from a given +ArangoDB collection a link definition must be added to the properties of the +said ArangoSearch view defining the link parameters as per the section +[View definition/modification](#view-definitionmodification). 
+ +## Analyzers + +To simplify query syntax ArangoSearch provides a concept of +[named analyzers](Analyzers.md) which are merely aliases for +type+configuration of IResearch analyzers. Management of named analyzers +is exposed via REST, GUI and JavaScript APIs. + +## View definition/modification + +An ArangoSearch view is configured via an object containing a set of +view-specific configuration directives and a map of link-specific configuration +directives. + +During view creation the following directives apply: + +* id _(optional)_: the desired view identifier +* name _(required)_: the view name +* type _(required)_: the value "arangosearch" + any of the directives from the section [View properties](#view-properties-updatable) + +During view modification the following directives apply: + +* links _(optional)_: + a mapping of collection-name/collection-identifier to one of: + * link creation - link definition as per the section [Link properties](#link-properties) + * link removal - JSON keyword *null* (i.e. nullify a link if present) + any of the directives from the section [modifiable view properties](#view-properties-updatable) + +## View properties (non-updatable) + +* **locale** (_optional_, default: `C`)
+ the default locale used for ordering processed attribute names + +## View properties (updatable) + +* **commit** (_optional_, default: use defaults for all values)
+ configure ArangoSearch View commit policy for single-item inserts/removals, + e.g. when adding or removing documents from a linked ArangoDB collection + + * **cleanupIntervalStep** (_optional_, default: `10`; to disable use: `0`)
+ wait at least this many commits between removing unused files in the + ArangoSearch data directory + for the case where the consolidation policies merge segments often (i.e. a + lot of commit+consolidate), a lower value will cause a lot of disk space to + be wasted + for the case where the consolidation policies rarely merge segments (i.e. + few inserts/deletes), a higher value will impact performance without any + added benefits + + * **commitIntervalMsec** (_optional_, default: `60000`; to disable use: `0`)
+ wait at least *count* milliseconds between committing view data store + changes and making documents visible to queries + for the case where there are a lot of inserts/updates, a lower value will + cause the view not to account for them (until commit), and memory usage + would continue to grow + for the case where there are a few inserts/updates, a higher value will + impact performance and waste disk space for each commit call without any + added benefits + + * **consolidate** (_optional_, default: `none`)
+ a per-policy mapping of thresholds in the range `[0.0, 1.0]` to determine data + store segment merge candidates, if specified then only the listed policies + are used, keys are any of: + + * **bytes** (_optional_, for default values use an empty object: `{}`) + + * **segmentThreshold** (_optional_, default: `300`; to disable use: `0`)
+ apply consolidation policy IFF {segmentThreshold} >= #segments + + * **threshold** (_optional_, default: `0.85`)
+ consolidate `IFF {threshold} > segment_bytes / (all_segment_bytes / #segments)` + + * **bytes_accum** (_optional_, for default values use: `{}`)
+ + * **segmentThreshold** (_optional_, default: `300`; to disable use: `0`)
+ apply consolidation policy IFF {segmentThreshold} >= #segments + + * **threshold** (_optional_, default: `0.85`)
+ consolidate `IFF {threshold} > (segment_bytes + sum_of_merge_candidate_segment_bytes) / all_segment_bytes` + + * **count** (_optional_, for default values use: `{}`) + + * **segmentThreshold** (_optional_, default: `300`; to disable use: `0`)
+ apply consolidation policy IFF {segmentThreshold} >= #segments + + * **threshold** (_optional_, default: `0.85`)
+ consolidate `IFF {threshold} > segment_docs{valid} / (all_segment_docs{valid} / #segments)` + + * **fill** (_optional_) + if specified, use empty object for default values, i.e. `{}` + + * **segmentThreshold** (_optional_, default: `300`; to disable use: `0`)
+ apply consolidation policy IFF {segmentThreshold} >= #segments + + * **threshold** (_optional_, default: `0.85`)
+ consolidate `IFF {threshold} > #segment_docs{valid} / (#segment_docs{valid} + #segment_docs{removed})` + +## Link properties + +* **analyzers** (_optional_, default: `[ 'identity' ]`)
+ a list of analyzers, by name as defined via the [Analyzers](Analyzers.md), that + should be applied to values of processed document attributes + +* **fields** (_optional_, default: `{}`)
+ an object `{attribute-name: [Link properties]}` of fields that should be + processed at each level of the document + each key specifies the document attribute to be processed, the value of + *includeAllFields* is also consulted when selecting fields to be processed + each value specifies the [Link properties](#link-properties) directives to be used when + processing the specified field, a Link properties value of `{}` denotes + inheritance of all (except *fields*) directives from the current level + +* **includeAllFields** (_optional_, default: `false`)
+ if true then process all document attributes (if not explicitly specified + then process the fields with default Link properties directives, i.e. `{}`), + otherwise only consider attributes mentioned in *fields* + +* **trackListPositions** (_optional_, default: `false`)
+ if true then for array values track the value position in the array, e.g. when + querying for the input: `{ attr: [ 'valueX', 'valueY', 'valueZ' ] }` + the user must specify: `doc.attr[1] == 'valueY'` + otherwise all values in an array are treated as equal alternatives, e.g. when + querying for the input: `{ attr: [ 'valueX', 'valueY', 'valueZ' ] }` + the user must specify: `doc.attr == 'valueY'` + +* **storeValues** (_optional_, default: `"none"`)
+ how should the view track the attribute values, this setting allows for + additional value retrieval optimizations, one of: + * none: Do not store values by the view + * id: Store only information about value presence, to allow use of the EXISTS() function diff --git a/Documentation/Books/Manual/Views/ArangoSearch/GettingStarted.md b/Documentation/Books/Manual/Views/ArangoSearch/GettingStarted.md new file mode 100644 index 0000000000..9b6d03c00f --- /dev/null +++ b/Documentation/Books/Manual/Views/ArangoSearch/GettingStarted.md @@ -0,0 +1,126 @@ +# Getting started with ArangoSearch views + +## The DDL configuration + +[DDL](https://en.wikipedia.org/wiki/Data_definition_language) is a data +definition language or data description language for defining data structures, +especially database schemas. + +All DDL operations on Views can be done via JavaScript or REST calls. The DDL +syntax follows the well established ArangoDB guidelines and thus is very +similar between JavaScript and REST. This article uses the JavaScript syntax. 
+ +Assume the following collections were initially defined in a database using +the following commands: + +```js +c0 = db._create("ExampleCollection0"); +c1 = db._create("ExampleCollection1"); + +c0.save({ i: 0, name: "full", text: "是一个 多模 型数 据库" }); +c0.save({ i: 1, name: "half", text: "是一个 多模" }); +c0.save({ i: 2, name: "other half", text: "型数 据库" }); +c0.save({ i: 3, name: "quarter", text: "是一" }); + +c1.save({ a: "foo", b: "bar", i: 4 }); +c1.save({ a: "foo", b: "baz", i: 5 }); +c1.save({ a: "bar", b: "foo", i: 6 }); +c1.save({ a: "baz", b: "foo", i: 7 }); +``` + +## Creating a View (with default parameters) + +```js +v0 = db._createView("ExampleView", "arangosearch", {}); +``` + +## Linking created View with a collection and adding indexing parameters + +```js +v0 = db._view("ExampleView"); +v0.properties({ + links: { + 'ExampleCollection0': /* collection Link 0 with additional custom configuration */ + { + includeAllFields: true, /* examine fields of all linked collections using default configuration */ + fields: + { + name: /* a field to apply custom configuration that will index English text */ + { + analyzers: ["text_en"] + }, + text: /* another field to apply custom configuration that will index Chinese text */ + { + analyzers: ["text_zh"] + } + } + }, + 'ExampleCollection1': /* collection Link 1 with custom configuration */ + { + includeAllFields: true, /* examine all fields using default configuration */ + fields: + { + a: + { + analyzers: ["text_en"] /* a field to apply custom configuration that will index English text */ + } + } + } + } + } +); +``` + +## Query data using created View with linked collections + +```js +db._query(`FOR doc IN ExampleView + SEARCH PHRASE(doc.text, '型数 据库', 'text_zh') OR STARTS_WITH(doc.b, 'ba') + SORT TFIDF(doc) DESC + RETURN doc`); +``` + +## Examine query result + +Result of the latter query will include all documents from both linked +collections that include `型数 据库` phrase in Chinese at any part of `text` +property or `b` property 
+ in English that starts with `ba`. Additionally, +descending sorting using the [TFIDF algorithm](https://en.wikipedia.org/wiki/TF-IDF) +will be applied during a search: + +```json +[ + { + "_key" : "120", + "_id" : "ExampleCollection0/120", + "_rev" : "_XPoMzCi--_", + "i" : 0, + "name" : "full", + "text" : "是一个 多模 型数 据库" + }, + { + "_key" : "124", + "_id" : "ExampleCollection0/124", + "_rev" : "_XPoMzCq--_", + "i" : 2, + "name" : "other half", + "text" : "型数 据库" + }, + { + "_key" : "128", + "_id" : "ExampleCollection1/128", + "_rev" : "_XPoMzCu--_", + "a" : "foo", + "b" : "bar", + "i" : 4 + }, + { + "_key" : "130", + "_id" : "ExampleCollection1/130", + "_rev" : "_XPoMzCy--_", + "a" : "foo", + "b" : "baz", + "i" : 5 + } +] +``` diff --git a/Documentation/Books/Manual/Views/ArangoSearch/README.md b/Documentation/Books/Manual/Views/ArangoSearch/README.md index e5c211c8d2..1c9c799285 100644 --- a/Documentation/Books/Manual/Views/ArangoSearch/README.md +++ b/Documentation/Books/Manual/Views/ArangoSearch/README.md @@ -1,259 +1,45 @@ -# Bringing the power of IResearch to ArangoDB +# ArangoSearch views powered by IResearch -## What is ArangoSearch +ArangoSearch is a natively integrated AQL extension making use of the +IResearch library. -ArangoSearch is a natively integrated AQL extension making use of the IResearch library. 
+ArangoSearch allows one to: -Arangosearch allows one to: * join documents located in different collections to one result list -* search documents based on AQL boolean expressions and functions -* sort the result set based on how closely each document matched the search condition +* filter documents based on AQL boolean expressions and functions +* sort the result set based on how closely each document matched the filter A concept of value 'analysis' that is meant to break up a given value into a set of sub-values internally tied together by metadata which influences both -the search and sort stages to provide the most appropriate match for the +the filter and sort stages to provide the most appropriate match for the specified conditions, similar to queries to web search engines. In plain terms this means a user can for example: + * request documents where the 'body' attribute best matches 'a quick brown fox' * request documents where the 'dna' attribute best matches a DNA sub sequence * request documents where the 'name' attribute best matches gender -* etc... (via custom analyzers described in the next section) +* etc. (via custom analyzers) -### The IResearch Library +## The IResearch Library -IResearch s a cross-platform open source indexing and searching engine written in C++, -optimized for speed and memory footprint, with source available from: -https://github.com/iresearch-toolkit/iresearch +IResearch is a cross-platform open source indexing and searching engine written +in modern C++, optimized for speed and memory footprint, with source available +from https://github.com/iresearch-toolkit/iresearch -IResearch is a framework for indexing, searching and sorting of data. The indexing stage can -treat each data item as an atom or use custom 'analyzers' to break the data item -into sub-atomic pieces tied together with internally tracked metadata. +IResearch is the framework for indexing, filtering and sorting of data. 
+The indexing stage can treat each data item as an atom or use custom 'analyzers' +to break the data item into sub-atomic pieces tied together with internally +tracked metadata. The IResearch framework in general can be further extended at runtime with -custom implementations of analyzers (used during the indexing and searching +custom implementations of analyzers (used during the indexing and filtering stages) and scorers (used during the sorting stage) allowing full control over the behavior of the engine. +## Using ArangoSearch views -### ArangoSearch Scorers - -ArangoSearch accesses scorers directly by their internal names. The -name (in upper-case) of the scorer is the function name to be used in the -['SORT' section](../../../AQL/Views/ArangoSearch/index.html#arangosearch-sort). -Function arguments, (excluding the first argument), are serialized as a -string representation of a JSON array and passed directly to the corresponding -scorer. The first argument to any scorer function is the reference to the -current document emitted by the `FOR` statement, i.e. it would be 'doc' for this -statement: - - FOR doc IN someView - -IResearch provides a 'bm25' scorer implementing the -[BM25 algorithm](https://en.wikipedia.org/wiki/Okapi_BM25). This scorer -optionally takes 'k' and 'b' positional parameters. - -The user is able to run queries with the said scorer, e.g. - - SORT BM25(doc, 1.2, 0.75) - -The function arguments will then be serialized into a JSON representation: - -```json -[ 1.2, 0.75 ] -``` - -and passed to the scorer implementation. - -Similarly an administrator may choose to deploy a custom DNA analyzer 'DnaRank'. - -The user is then immediately able to run queries with the said scorer, e.g. - - SORT DNARANK(doc, 123, 456, "abc", { "def", "ghi" }) - -The function arguments will then be serialized into a JSON representation: - -```json -[ 123, 456, "abc", { "def", "ghi" } ] -``` - -and passed to the scorer implementation. 
- -Runtime-plugging functionality for scores is not avaiable in ArangoDB at this -point in time, so ArangoDB comes with a few default-initialized scores: - -- *attribute-name* - order results based on the value of **attribute-name** - -- BM25 - order results based on the - [BM25 algorithm](https://en.wikipedia.org/wiki/Okapi_BM25) - -- TFIDF - order results based on the - [TFIDF algorithm](https://en.wikipedia.org/wiki/TF-IDF) - -### ArangoSearch is much more than a fulltext search - -But fulltext searching is a subset of its available functionality, supported via -the 'text' analyzer and 'tfidf'/'bm25' scorers, without impact to performance -when specifying documents from different collections or searching on multiple -document attributes. - -### View datasource - -The IResearch functionality is exposed to ArangoDB via the the ArangoSearch view -API because the ArangoSearch view is merely an identity transformation applied -onto documents stored in linked collections of the same ArangoDB database. -In plain terms an ArangoSearch view only allows searching and sorting of documents -located in collections of the same database. -The matching documents themselves are returned as-is from their corresponding collections. - -### Links to ArangoDB collections - -A concept of an ArangoDB collection 'link' is introduced to allow specifying -which ArangoDB collections a given ArangoSearch View should query for documents -and how these documents should be queried. - -An ArangoSearch Link is a uni-directional connection from an ArangoDB collection -to an ArangoSearch view describing how data coming from the said collection should -be made available in the given view. Each ArangoSearch Link in an ArangoSearch view is -uniquely identified by the name of the ArangoDB collection it links to. An -ArangoSearch view may have zero or more links, each to a distinct ArangoDB -collection. 
Similarly an ArangoDB collection may be referenced via links by zero -or more distinct ArangoSearch views. In plain terms any given ArangoSearch view may be -linked to any given ArangoDB collection of the same database with zero or at -most one link. However, any ArangoSearch view may be linked to multiple distinct -ArangoDB collections and similarly any ArangoDB collection may be referenced by -multiple ArangoSearch views. - -To configure an ArangoSearch view for consideration of documents from a given -ArangoDB collection a link definition must be added to the properties of the -said ArangoSearch view defining the link parameters as per the section -[View definition/modification](#view-definitionmodification). - -### Analyzers - -To simplify query syntax ArangoSearch provides a concept of -[named analyzers](Analyzers.md) which -are merely aliases for type+configuration of IResearch analyzers. Management of -named analyzers is exposed via both REST, GUI and JavaScript APIs, e.g. - - -### View definition/modification - -An ArangoSearch view is configured via an object containing a set of -view-specific configuration directives and a map of link-specific configuration -directives. - -During view creation the following directives apply: -* id: (optional) the desired view identifier -* name: (required) the view name -* type: \ the value "arangosearch" - any of the directives from the section [View properties](#view-properties-updatable) - -During view modification the following directives apply: -* links: (optional) - a mapping of collection-name/collection-identifier to one of: - * link creation - link definition as per the section [Link properties](#link-properties) - * link removal - JSON keyword *null* (i.e. 
nullify a link if present) - any of the directives from the section [modifiable view properties](#view-properties-updatable) - -### View properties (non-updatable) - -* locale: (optional; default: `C`) - the default locale used for ordering processed attribute names - -### View properties (updatable) - -* cleanupIntervalStep: (optional; default: `10`; to disable use: `0`) - wait at least this many commits between removing unused files in the - ArangoSearch data directory - for the case where the consolidation policies merge segments often (i.e. a - lot of commit+consolidate), a lower value will cause a lot of disk space to - be wasted - for the case where the consolidation policies rarely merge segments (i.e. - few inserts/deletes), a higher value will impact performance without any - added benefits - -* commitIntervalMsec: (optional; default: `60000`; to disable use: `0`) - wait at least *count* milliseconds between committing view data store - changes and making documents visible to queries - for the case where there are a lot of inserts/updates, a lower value will - cause the view not to account for them, (unlit commit), and memory usage - would continue to grow - for the case where there are a few inserts/updates, a higher value will - impact performance and waste disk space for each commit call without any - added benefits - -* consolidate: (optional; default: `none`) - a per-policy mapping of thresholds in the range `[0.0, 1.0]` to determine data - store segment merge candidates, if specified then only the listed policies - are used, keys are any of: - - * bytes: (optional; for default values use an empty object: `{}`) - - * segmentThreshold: (optional, default: `300`; to disable use: `0`) - apply consolidation policy IFF {segmentThreshold} >= #segments - - * threshold: (optional; default: `0.85`) - consolidate `IFF {threshold} > segment_bytes / (all_segment_bytes / #segments)` - - * bytes_accum: (optional; for default values use: `{}`) - - * 
segmentThreshold: (optional; default: `300`; to disable use: `0`) - apply consolidation policy IFF {segmentThreshold} >= #segments - - * threshold: (optional; default: `0.85`) - consolidate `IFF {threshold} > (segment_bytes + sum_of_merge_candidate_segment_bytes) / all_segment_bytes` - - * count: (optional; for default values use: `{}`) - - * segmentThreshold: (optional; default: `300`; to disable use: `0`) - apply consolidation policy IFF {segmentThreshold} >= #segments - - * threshold: (optional; default: `0.85`) - consolidate `IFF {threshold} > segment_docs{valid} / (all_segment_docs{valid} / #segments)` - - * fill: (optional) - if specified, use empty object for default values, i.e. `{}` - - * segmentThreshold: (optional; default: `300`; to disable use: `0`) - apply consolidation policy IFF {segmentThreshold} >= #segments - - * threshold: (optional; default: `0.85`) - consolidate `IFF {threshold} > #segment_docs{valid} / (#segment_docs{valid} + #segment_docs{removed})` - -### Link properties - -* analyzers: (optional; default: `[ 'identity' ]`) - a list of analyzers, by name as defined via the [Analyzers](Analyzers.md), that - should be applied to values of processed document attributes - -* fields: (optional; default: `{}`) - an object `{attribute-name: [Link properties]}` of fields that should be - processed at each level of the document - each key specifies the document attribute to be processed, the value of - *includeAllFields* is also consulted when selecting fields to be processed - each value specifies the [Link properties](#link-properties) directives to be used when - processing the specified field, a Link properties value of `{}` denotes - inheritance of all (except *fields*) directives from the current level - -* includeAllFields: (optional; default: `false`) - if true then process all document attributes (if not explicitly specified - then process the fields with default Link properties directives, i.e. 
`{}`), - otherwise only consider attributes mentioned in *fields* - -* trackListPositions: (optional; default: false) - if true then for array values track the value position in the array, e.g. when - querying for the input: `{ attr: [ 'valueX', 'valueY', 'valueZ' ] }` - the user must specify: `doc.attr[1] == 'valueY'` - otherwise all values in an array are treated as equal alternatives, e.g. when - querying for the input: `{ attr: [ 'valueX', 'valueY', 'valueZ' ] }` - the user must specify: `doc.attr == 'valueY'` - -* storeValues: (optional; default: "none") - how should the view track the attribute values, this setting allows for - additional value retrieval optimizations, one of: - * none: Do not store values by the view - * id: Store only information about value presence, to allow use of the EXISTS() function +To get more familiar with ArangoSearch usage, you may start with the simple [Getting Started](GettingStarted.md) guide and then explore the details of ArangoSearch in the + [Detailed Overview](DetailedOverview.md), + [Analyzers](Analyzers.md) + and [Scorers](Scorers.md) topics. diff --git a/Documentation/Books/Manual/Views/ArangoSearch/Scorers.md b/Documentation/Books/Manual/Views/ArangoSearch/Scorers.md new file mode 100644 index 0000000000..12bfc51794 --- /dev/null +++ b/Documentation/Books/Manual/Views/ArangoSearch/Scorers.md @@ -0,0 +1,58 @@ +ArangoSearch Scorers +==================== + +ArangoSearch accesses scorers directly by their internal names. The +name (in upper-case) of the scorer is the function name to be used in the +['SORT' section](../../../AQL/Views/ArangoSearch/index.html#arangosearch-sort). +Function arguments (excluding the first argument) are serialized as a +string representation of a JSON array and passed directly to the corresponding +scorer. The first argument to any scorer function is the reference to the +current document emitted by the `FOR` statement, i.e. 
it would be 'doc' for this +statement: + +```js +FOR doc IN someView +``` + +IResearch provides a 'bm25' scorer implementing the +[BM25 algorithm](https://en.wikipedia.org/wiki/Okapi_BM25). This scorer +optionally takes 'k' and 'b' positional parameters. + +The user is able to run queries with the said scorer, e.g. + +```js +SORT BM25(doc, 1.2, 0.75) +``` + +The function arguments will then be serialized into a JSON representation: + +```json +[ 1.2, 0.75 ] +``` + +and passed to the scorer implementation. + +Similarly an administrator may choose to deploy a custom DNA scorer 'DnaRank'. + +The user is then immediately able to run queries with the said scorer, e.g. + +```js +SORT DNARANK(doc, 123, 456, "abc", { "def": "ghi" }) +``` + +The function arguments will then be serialized into a JSON representation: + +```json +[ 123, 456, "abc", { "def": "ghi" } ] +``` + +and passed to the scorer implementation. + +Runtime-plugging functionality for scorers is not available in ArangoDB at this +point in time, so ArangoDB comes with a few default-initialized scorers: + +- *attribute-name*: order results based on the value of **attribute-name** + +- BM25: order results based on the [BM25 algorithm](https://en.wikipedia.org/wiki/Okapi_BM25) + +- TFIDF: order results based on the [TFIDF algorithm](https://en.wikipedia.org/wiki/TF-IDF)
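
The argument handling described in Scorers.md (every argument after the document reference is serialized as a JSON array and handed to the scorer) can be illustrated with a minimal JavaScript sketch. `serializeScorerArgs` is a hypothetical helper introduced only for illustration; it is not an ArangoDB or IResearch API, and the actual serialization inside ArangoDB may differ, for example in whitespace.

```javascript
// Hypothetical helper (an assumption, not part of ArangoDB): mimics the
// documented behavior for a scorer call such as BM25(doc, 1.2, 0.75).
function serializeScorerArgs(docRef, ...scorerArgs) {
  // The first argument is the document reference emitted by FOR and is
  // excluded; the remaining arguments become a JSON array string.
  return JSON.stringify(scorerArgs);
}

// Arguments taken from the BM25 and DNARANK examples in Scorers.md.
const bm25Args = serializeScorerArgs("doc", 1.2, 0.75);
const dnaRankArgs = serializeScorerArgs("doc", 123, 456, "abc", { def: "ghi" });

console.log(bm25Args);    // [1.2,0.75]
console.log(dnaRankArgs); // [123,456,"abc",{"def":"ghi"}]
```

A scorer called with only the document reference, e.g. `TFIDF(doc)`, would under this sketch receive an empty array (`[]`).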