# Analyzers powered by IResearch

## Background
The concept of value "analysis" refers to the process of breaking up a given value into a set of sub-values, which are internally tied together by metadata that influences both the search and sort stages, so as to provide the most appropriate match for the specified conditions, similar to queries to web search engines.
In plain terms this means a user can, for example:

- request documents where the `body` attribute best matches `a quick brown fox`
- request documents where the `dna` attribute best matches a DNA sub-sequence
- request documents where the `name` attribute best matches a gender
- etc. (via custom Analyzers)
## What are Analyzers
Analyzers are helpers that allow a user to parse and transform an arbitrary value (currently only string values are supported) into zero or more resulting values. The parsing and transformation applied are directed by the Analyzer type and the Analyzer properties.
The Analyzer implementations themselves are provided by the underlying IResearch library. Therefore their most common use case, filter condition matching, is with ArangoSearch Views. However, Analyzers can also be used as standalone helpers via the `TOKENS(...)` function, allowing a user to leverage the value transformation power of an Analyzer in any context where an AQL function can be used.
A user-visible Analyzer is simply an alias for an underlying implementation type, a set of configuration properties, and a set of features. The features dictate what term matching capabilities are available, and as such are only applicable in the context of ArangoSearch Views.
The aforementioned three configuration attributes (type, properties and features) are given a simple name that can be used to reference the Analyzer. Thus an Analyzer definition is composed of the following attributes:
- name: the analyzer name
- type: the analyzer type
- properties: the properties used to configure the specified type
- features: the set of features to set on the analyzer generated fields
The valid value for type is any available Analyzer type.
The valid values for the properties are dependent on the type used. For example, for the `text` type the properties may simply be an object with the value `"locale": "en"`, whereas for the `"delimited"` type the properties may simply be the delimiter `,`.
The valid values for the features are dependent on both the capabilities of the underlying type and the query filtering and sorting functions that the result can be used with. For example, the `text` type will produce `frequency` + `norm` + `position`, and the `PHRASE(...)` function requires `frequency` + `position` to be available.
Currently the following features are supported:

- frequency: how often a term is seen, required for `PHRASE(...)`
- norm: the field normalization factor
- position: sequentially increasing term position, required for `PHRASE(...)`; if present, then the frequency feature is also required
## Analyzer usage

For Analyzer usage in the context of ArangoSearch Views please see the section ArangoSearch Views.
The value transformation capabilities of a given Analyzer can be invoked via the `TOKENS(...)` function to, for example:

- break up a string of words into individual words, while also optionally filtering out stopwords, applying case conversion and extracting word stems
- parse CSV/TSV or other delimiter-encoded string values into individual fields
The signature of the `TOKENS(...)` function is: `TOKENS(<value-to-parse>, <analyzer-name-to-apply>)`
It currently accepts any string value and an Analyzer name, and will produce an array of zero or more tokens generated by the specified Analyzer transformation.
## Analyzer management
The following operations are exposed via JavaScript and REST APIs for analyzer management:
- create: creation of a new analyzer definition
- get: retrieve an existing analyzer definition
- list: retrieve a listing of all available analyzer definitions
- remove: remove an analyzer definition
### JavaScript

The JavaScript API is accessible via the `@arangodb/analyzers` endpoint from both server-side and client-side code, e.g. `var analyzers = require("@arangodb/analyzers");`
The create operation is accessible via `analyzers.save(<name>, <type>[, <properties>[, <features>]])`, where properties can be represented either as a string, an object or a null value, and features is an array of string-encoded feature names.
The get operation is accessible via `analyzers.analyzer(<name>)`.

The list operation is accessible via `analyzers.toArray()`.

The remove operation is accessible via `analyzers.remove(<name>[, <force>])`.
Additionally, individual Analyzer instances expose getter accessors for the aforementioned definition attributes:

- `analyzer.name()`
- `analyzer.type()`
- `analyzer.properties()`
- `analyzer.features()`
### RESTful API

The create operation is accessible via the POST method on the URL `/_api/analyzer`, with the Analyzer configuration passed via the body as an object with attributes:

- name: string (required)
- type: string (required)
- properties: string, object or null (optional; default: `null`)
- features: array of strings (optional; default: empty array)
The get operation is accessible via the GET method on the URL `/_api/analyzer/{analyzer-name}`. A successful result will be an object with the fields:

- name
- type
- properties
- features
The list operation is accessible via the GET method on the URL `/_api/analyzer`. A successful result will be an array of objects, each with the fields:

- name
- type
- properties
- features
The remove operation is accessible via the DELETE method on the URL `/_api/analyzer/{analyzer-name}[?force=true]`.
Also see Analyzers in the HTTP book including a list of available Analyzer Types.