# Analyzers powered by IResearch

## Background

The concept of value "analysis" refers to the process of breaking up a given value into a set of sub-values, which are internally tied together by metadata. This metadata influences both the search and sort stages to provide the most appropriate match for the specified conditions, similar to queries to web search engines.

In plain terms this means a user can, for example:

- request documents where the `body` attribute best matches `a quick brown fox`
- request documents where the `dna` attribute best matches a DNA sub sequence
- request documents where the `name` attribute best matches gender
- etc. (via custom analyzers)

## What are Analyzers

Analyzers are helpers that allow a user to parse and transform an arbitrary value (currently only string values are supported) into zero or more resulting values. The parsing and transformation applied is directed by the analyzer *type* and the analyzer *properties*.

The Analyzer implementations themselves are provided by the underlying [IResearch library](https://github.com/iresearch-toolkit/iresearch). Therefore their most common use case for filter condition matching is with [ArangoSearch Views](../Views/ArangoSearch/README.md). However, Analyzers can also be used as standalone helpers via the `TOKENS(...)` function, allowing a user to leverage the value transformation power of the Analyzer in any context where an AQL function can be used.

A user-visible Analyzer is simply an alias for an underlying implementation *type* + configuration *properties* and a set of *features*. The *features* dictate what term matching capabilities are available and as such are only applicable in the context of ArangoSearch Views. The aforementioned three configuration attributes that an Analyzer is composed of are given a simple *name* that can be used to reference the said Analyzer.
Thus an analyzer definition is composed of the following attributes:

- *name*: the analyzer name
- *type*: the analyzer type
- *properties*: the properties used to configure the specified type
- *features*: the set of features to set on the analyzer generated fields

Valid values for *type* are any of the available Analyzer types.

The valid values for the *properties* are dependent on what *type* is used. For example, for the *text* type its property may simply be an object with the value `"locale": "en"`, whereas for the *delimited* type its property may simply be the delimiter `,`.

The valid values for the *features* are dependent on both the capabilities of the underlying *type* and the query filtering and sorting functions that the result can be used with. For example, the *text* type will produce *frequency* + *norm* + *position*, and the `PHRASE(...)` function requires *frequency* + *position* to be available.

Currently the following *features* are supported:

- *frequency*: how often a term is seen, required for `PHRASE(...)`
- *norm*: the field normalization factor
- *position*: sequentially increasing term position, required for `PHRASE(...)`; if present then the *frequency* feature is also required

## Analyzer usage

For Analyzer usage in the context of ArangoSearch Views please see the section [ArangoSearch Views](../Views/ArangoSearch/README.md).

The value transformation capabilities of a given analyzer can be invoked via the `TOKENS(...)` function to, for example:

- break up a string of words into individual words, while also optionally filtering out stopwords, applying case conversion and extracting word stems
- parse CSV/TSV or other delimiter encoded string values into individual fields

The signature of the `TOKENS(...)` function is:

    TOKENS(<value>, <analyzer-name>)

It currently accepts any string value and an analyzer name, and will produce an array of zero or more tokens generated by the specified analyzer transformation.
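The feature rules described earlier (*position* requires *frequency*, and `PHRASE(...)` needs both) can be checked before a definition is submitted. The following is a minimal sketch in plain JavaScript, not part of the `@arangodb/analyzers` API: the helper names `checkFeatures` and `supportsPhrase` and the analyzer name `my_text` are hypothetical and chosen only for illustration.

```js
// Hypothetical helpers, NOT part of any ArangoDB API.
// The three feature names are the ones listed in this document.
const KNOWN_FEATURES = ["frequency", "norm", "position"];

// Returns a list of human-readable problems; an empty list means
// the feature set is consistent with the rules stated in the text.
function checkFeatures(features) {
  const problems = [];
  for (const feature of features) {
    if (!KNOWN_FEATURES.includes(feature)) {
      problems.push(`unknown feature: ${feature}`);
    }
  }
  // Rule from the text: if "position" is present, "frequency" is required.
  if (features.includes("position") && !features.includes("frequency")) {
    problems.push("position requires frequency");
  }
  return problems;
}

// PHRASE(...) requires both "frequency" and "position".
function supportsPhrase(features) {
  return features.includes("frequency") && features.includes("position");
}

// An analyzer definition with the four attributes described above.
const definition = {
  name: "my_text",               // hypothetical analyzer name
  type: "text",
  properties: { locale: "en" },
  features: ["frequency", "norm", "position"]
};

console.log(checkFeatures(definition.features));   // no problems reported
console.log(supportsPhrase(definition.features));  // usable with PHRASE(...)
console.log(checkFeatures(["position"]));          // reports the missing "frequency"
```

A check like this only mirrors the rules as stated here; the server remains the authority on which *features* a given *type* actually supports.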
## Analyzer management

The following operations are exposed via JavaScript and REST APIs for analyzer management:

- *create*: creation of a new analyzer definition
- *get*: retrieve an existing analyzer definition
- *list*: retrieve a listing of all available analyzer definitions
- *remove*: remove an analyzer definition

### JavaScript

The JavaScript API is accessible via the `@arangodb/analyzers` module from both server-side and client-side code, e.g.

```js
var analyzers = require("@arangodb/analyzers");
```

The *create* operation is accessible via:

```js
analyzers.save(<name>, <type>[, <properties>[, <features>]])
```

… where *properties* can be represented either as a string, an object or a null value and *features* is an array of string encoded feature names.

The *get* operation is accessible via:

```js
analyzers.analyzer(<name>)
```

The *list* operation is accessible via:

```js
analyzers.toArray()
```

The *remove* operation is accessible via:

```js
analyzers.remove(<name> [, <force>])
```

Additionally, individual analyzer instances expose getter accessors for the aforementioned definition attributes:

```js
analyzer.name()
analyzer.type()
analyzer.properties()
analyzer.features()
```

### RESTful API

The *create* operation is accessible via the *POST* method on the URL:

    /_api/analyzer

With the Analyzer configuration passed via the body as an object with attributes:

- *name*: string (required)
- *type*: string (required)
- *properties*: string or object or null (optional) default: `null`
- *features*: array of strings (optional) default: empty array

The *get* operation is accessible via the *GET* method on the URL:

    /_api/analyzer/{analyzer-name}

A successful result will be an object with the fields:

- *name*
- *type*
- *properties*
- *features*

The *list* operation is accessible via the *GET* method on the URL:

    /_api/analyzer

A successful result will be an array of objects with the fields:

- *name*
- *type*
- *properties*
- *features*

The *remove* operation is accessible via the *DELETE* method on the URL:

    /_api/analyzer/{analyzer-name}[?force=true]

Also see [Analyzers](../../HTTP/Analyzers/index.html) in the HTTP book, including a list of available [Analyzer Types](../../HTTP/Analyzers/index.html#analyzer-types).
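
To make the REST operations above more concrete, the following sketch assembles the method, path and body for the *create* and *remove* requests without sending anything over the network. Only the URL paths, attribute names and default values are taken from this document. The helper names `buildCreateRequest` and `buildRemoveRequest`, the analyzer name `my_csv`, and the percent-encoding of the name in the path are assumptions made for illustration.

```js
// Hypothetical helpers, NOT part of any ArangoDB driver or API.
// They only describe the request; sending it is left to the HTTP
// client of your choice.

// Create: POST /_api/analyzer with the documented defaults
// (properties: null, features: empty array).
function buildCreateRequest(name, type, properties = null, features = []) {
  if (typeof name !== "string" || typeof type !== "string") {
    throw new TypeError("name and type are required strings");
  }
  return {
    method: "POST",
    path: "/_api/analyzer",
    body: { name, type, properties, features }
  };
}

// Remove: DELETE /_api/analyzer/{analyzer-name}[?force=true]
function buildRemoveRequest(name, force = false) {
  // Assumption: the name is percent-encoded like any other path segment.
  const path = "/_api/analyzer/" + encodeURIComponent(name);
  return {
    method: "DELETE",
    path: force ? path + "?force=true" : path
  };
}

// A "delimited" analyzer whose property is simply the delimiter ",".
const createRequest = buildCreateRequest("my_csv", "delimited", ",");
console.log(JSON.stringify(createRequest.body));

const removeRequest = buildRemoveRequest("my_csv", true);
console.log(removeRequest.method, removeRequest.path);
```

Keeping request construction separate from transport makes the defaults visible and easy to test; the response format and error handling are not covered by this sketch.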