1
0
Fork 0
arangodb/Documentation/Books/Manual/Programs/Arangodump/Maskings.md

729 lines
17 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

Arangodump Data Maskings
========================
*--maskings path-of-config*
This feature allows you to define how sensitive data shall be dumped.
It is possible to exclude collections entirely, limit the dump to the
structural information of a collection (name, indexes, sharding etc.)
or to obfuscate certain fields for a dump. A JSON configuration file is
used to define which collections and fields to mask and how.
The general structure of the configuration file looks like this:
```json
{
"collection-name": {
"type": MASKING_TYPE,
"maskings": [
MASKING1,
MASKING2,
...
]
},
...
}
```
At the top level, there is an object with collection names and the masking
settings to be applied to them. Using `"*"` as collection name defines a
default behavior for collections not listed explicitly.
Masking Types
-------------
`type` is a string describing how to mask the given collection.
Possible values are:
- `"exclude"`: the collection is ignored completely and not even the
structure data is dumped.
- `"structure"`: only the collection structure is dumped, but no data at all
- `"masked"`: the collection structure and all data is dumped. However, the data
is subject to obfuscation defined in the attribute `maskings`. It is an array
of objects, with one object per field to mask. Each object needs at least a
`path` and a `type` attribute to [define which field to mask](#path) and which
[masking function](#masking-functions) to apply. Depending on the
masking type, there may exist additional attributes.
- `"full"`: the collection structure and all data is dumped. No masking is
applied to this collection at all.
**Example**
```json
{
"private": {
"type": "exclude"
},
"log": {
"type": "structure"
},
"person": {
"type": "masked",
"maskings": [
{
"path": "name",
"type": "xifyFront",
"unmaskedLength": 2
},
{
"path": ".security_id",
"type": "xifyFront",
"unmaskedLength": 2
}
]
}
}
```
- The collection called _private_ is completely ignored.
- Only the structure of the collection _log_ is dumped, but not the data itself.
- The collection _person_ is dumped completely but with maskings applied:
- The _name_ field is masked if it occurs on the top-level.
- It also masks fields with the name _security_id_ anywhere in the document.
- The masking function is of type [_xifyFront_](#xify-front) in both cases.
The additional setting `unmaskedLength` is specific so _xifyFront_.
### Masking vs. dump-data option
*arangodump* also supports a very coarse masking with the option
`--dump-data false`. This basically removes all data from the dump.
You can either use `--masking` or `--dump-data false`, but not both.
### Masking vs. include-collection option
*arangodump* also supports a very coarse masking with the option
`--include-collection`. This will restrict the collections that are
dumped to the ones explicitly listed.
It is possible to combine `--masking` and `--include-collection`.
This will take the intersection of exportable collections.
Path
----
`path` defines which field to obfuscate. There can only be a single
path per masking, but an unlimited amount of maskings per collection.
To mask a top-level attribute value, the path is simply the attribute
name, for instance `"name"` to mask the value `"foobar"`:
```json
{
"_key": "1234",
"name": "foobar"
}
```
The path to a nested attribute `name` with a top-level attribute `person`
as its parent is `"person.name"`:
```json
{
"_key": "1234",
"person": {
"name": "foobar"
}
}
```
If the path starts with a `.` then it matches any path ending in `name`.
For example, `.name` will match the field `name` of all leaf attributes
in the document. Leaf attributes are attributes whose value is `null`,
`true`, `false`, or of data type `string`, `number` or `array`.
That means, it matches `name` at the top level
as well as at any nested level (e.g. `foo.bar.name`), but not nested
objects themselves.
On the other hand, `name` will only match leaf attributes
at top level. `person.name` will match the attribute `name` of a leaf
in the top-level object `person`. If `person` was itself an object,
then the masking settings for this path would be ignored, because it
is not a leaf attribute.
If the attribute value is an **array** then the masking is applied to
**all array elements individually**.
If you have an attribute name that contains a dot, you need to quote the
name with either a tick or a backtick. For example:
"path": "´name.with.dots´"
or
"path": "`name.with.dots`"
**Example**
The following configuration will replace the value of the `name`
attribute with an "xxxx"-masked string:
```json
{
"type": "xifyFront",
"path": ".name",
"unmaskedLength": 2
}
```
The document:
```json
{
"name": "top-level-name",
"age": 42,
"nicknames" : [ { "name": "hugo" }, "egon" ],
"other": {
"name": [ "emil", { "secret": "superman" } ]
}
}
```
… will be changed as follows:
```json
{
"name": "xxxxxxxxxxxxme",
"age": 42,
"nicknames" : [ { "name": "xxgo" }, "egon" ],
"other": {
"name": [ "xxil", { "secret": "superman" } ]
}
}
```
The values `"egon"` and `"superman"` are not replaced, because they
are not contained in an attribute value of which the attribute name is
`name`.
### Nested objects and arrays
If you specify a path and the attribute value is an array then the
masking decision is applied to each element of the array as if this
was the value of the attribute. This applies to arrays inside the array too.
If the attribute value is an object, then it is ignored and the attribute
does not get masked. To mask nested fields, specify the full path for each
leaf attribute.
{% hint 'tip' %}
If some documents have an attribute `email` with a string as value, but other
documents store a nested object under the same attribute name, then make sure
to set up proper masking for the latter case, in which sub-attributes will not
get masked if there is only a masking configured for the attribute `email`
but not its nested attributes.
{% endhint %}
**Examples**
Masking `email` with the _Xify Front_ function will convert:
```json
{
"email" : "email address"
}
```
… into:
```json
{
"email" : "xxil xxxxxxss"
}
```
because `email` is a leaf attribute. The document:
```json
{
"email" : [
"address one",
"address two",
[
"address three"
]
]
}
```
… will be converted into:
```json
{
"email" : [
"xxxxxss xne",
"xxxxxss xwo",
[
"xxxxxss xxxee"
]
]
}
```
… because the masking is applied to each array element individually
including the elements of the sub-array. The document:
```json
{
"email" : {
"address" : "email address"
}
}
```
… will not be changed because `email` is not a leaf attribute.
To mask the email address, you could use the paths `email.address`
or `.address`.
Masking Functions
-----------------
{% hint 'info' %}
The following masking functions are only available in the
[**Enterprise Edition**](https://www.arangodb.com/why-arangodb/arangodb-enterprise/).
{% endhint %}
- [Xify Front](#xify-front)
- [Zip](#zip)
- [Datetime](#datetime)
- [Integral Number](#integral-number)
- [Decimal Number](#decimal-number)
- [Credit Card Number](#credit-card-number)
- [Phone Number](#phone-number)
- [Email Address](#email-address)
The masking function:
- [Random String](#random-string)
… is available in the Community Edition as well as the Enterprise Edition.
### Random String
This masking type will replace all values of attributes with key
`name` with an anonymized string. It is not guaranteed that the string
will be of the same length.
A hash of the original string is computed. If the original string is
shorter then the hash will be used. This will result in a longer
replacement string. If the string is longer than the hash then
characters will be repeated as many times as needed to reach the full
original string length.
Masking settings:
- `path` (string): which field to mask
- `type` (string): masking function name `"randomString"`
**Example**
```json
{
"path": ".name",
"type": "randomString"
}
```
Above masking setting applies to all leaf attributes with name `.name`.
A document like:
```json
{
"_key" : "1234",
"name" : [
"My Name",
{
"other" : "Hallo Name"
},
[
"Name One",
"Name Two"
],
true,
false,
null,
1.0,
1234,
"This is a very long name"
],
"deeply": {
"nested": {
"name": "John Doe",
"not-a-name": "Pizza"
}
}
}
```
… will be converted to:
```json
{
"_key": "1234",
"name": [
"+y5OQiYmp/o=",
{
"other": "Hallo Name"
},
[
"ihCTrlsKKdk=",
"yo/55hfla0U="
],
true,
false,
null,
1.0,
1234,
"hwjAfNe5BGw=hwjAfNe5BGw="
],
"deeply": {
"nested": {
"name": "55fHctEM/wY=",
"not-a-name": "Pizza"
}
}
}
```
### Xify Front
This masking type replaces the front characters with `x` and
blanks. Alphanumeric characters, `_` and `-` are replaced by `x`,
everything else is replaced by a blank.
Masking settings:
- `path` (string): which field to mask
- `type` (string): masking function name `"xifyFront"`
- `unmaskedLength` (number, _default: `2`_): how many characters to
leave as-is on the right-hand side of each word as integer value
- `hash` (bool, _default: `false`_): whether to append a hash value to the
masked string to avoid possible unique constraint violations caused by
the obfuscation
- `seed` (integer, _default: `0`_): used as secret for computing the hash.
A value of `0` means a random seed
**Examples**
```json
{
"path": ".name",
"type": "xifyFront",
"unmaskedLength": 2
}
```
This will mask all alphanumeric characters of a word except the last
two characters. Words of length 1 and 2 are unmasked. If the
attribute value is not a string the result will be `xxxx`.
"This is a test!Do you agree?"
… will become:
"xxis is a xxst Do xou xxxee "
There is a catch. If you have an index on the attribute the masking
might distort the index efficiency or even cause errors in case of a
unique index.
```json
{
"type": "xifyFront",
"path": ".name",
"unmaskedLength": 2,
"hash": true
}
```
This will add a hash at the end of the string.
"This is a test!Do you agree?"
… will become
"xxis is a xxst Do xou xxxee NAATm8c9hVQ="
Note that the hash is based on a random secret that is different for
each run. This avoids dictionary attacks which can be used to guess
values based pre-computations on dictionaries.
If you need reproducible results, i.e. hashes that do not change between
different runs of *arangodump*, you need to specify a secret as seed,
a number which must not be `0`.
```json
{
"type": "xifyFront",
"path": ".name",
"unmaskedLength": 2,
"hash": true,
"seed": 246781478647
}
```
### Zip
This masking type replaces a zip code with a random one.
It uses the following rules:
- If a character of the original zip code is a digit it will be replaced
by a random digit.
- If a character of the original zip code is a letter it
will be replaced by a random letter keeping the case.
- If the attribute value is not a string then the default value is used.
Note that this will generate random zip codes. Therefore there is a
chance that the same zip code value is generated multiple times, which can
cause unique constraint violations if a unique index is or will be
used on the zip code attribute.
Masking settings:
- `path` (string): which field to mask
- `type` (string): masking function name `"zip"`
- `default` (string, _default: `"12345"`_): if the input field is not of
data type `string`, then this value is used
**Examples**
```json
{
"path": ".code",
"type": "zip",
}
```
This replaces real zip codes stored in fields called `code` at any level
with random ones. `"12345"` is used as fallback value.
```json
{
"path": ".code",
"type": "zip",
"default": "abcdef"
}
```
If the original zip code is:
50674
… it will be replaced by e.g.:
98146
If the original zip code is:
SA34-EA
… it will be replaced by e.g.:
OW91-JI
If the original zip code is `null`, `true`, `false` or a number, then the
user-defined default value of `"abcdef"` will be used.
### Datetime
This masking type replaces the value of the attribute with a random
date between two configured dates in a customizable format.
Masking settings:
- `path` (string): which field to mask
- `type` (string): masking function name `"datetime"`
- `begin` (string, _default: `"1970-01-01T00:00:00.000"`_):
earliest point in time to return. Date time string in ISO 8601 format.
- `end` (string, _default: now_):
latest point in time to return. Date time string in ISO 8601 format.
In case a partial date time string is provided (e.g. `2010-06` without day
and time) the earliest date and time is assumed (`2010-06-01T00:00:00.000`).
The default value is the current system date and time.
- `format` (string, _default: `""`_): the formatting string format is
described in [DATE_FORMAT()](../../../AQL/Functions/Date.html#dateformat).
If no format is specified, then the result will be an empty string.
**Example**
```json
{
"path": "eventDate",
"type": "datetime",
"begin" : "2019-01-01",
"end": "2019-12-31",
"format": "%yyyy-%mm-%dd",
}
```
Above example masks the field `eventDate` by returning a random date time
string in the range of January 1st and December 31st in 2019 using a format
like `2019-06-17`.
### Integral Number
This masking type replaces the value of the attribute with a random
integral number. It will replace the value even if it is a string,
Boolean, or `null`.
Masking settings:
- `path` (string): which field to mask
- `type` (string): masking function name `"integer"`
- `lower` (number, _default: `-100`_): smallest integer value to return
- `upper` (number, _default: `100`_): largest integer value to return
**Example**
```json
{
"path": "count",
"type": "integer",
"lower" : -100,
"upper": 100
}
```
This masks the field `count` with a random number between
-100 and 100 (inclusive).
### Decimal Number
This masking type replaces the value of the attribute with a random
floating point number. It will replace the value even if it is a string,
Boolean, or `null`.
Masking settings:
- `path` (string): which field to mask
- `type` (string): masking function name `"decimal"`
- `lower` (number, _default: `-1`_): smallest floating point value to return
- `upper` (number, _default: `1`_): largest floating point value to return
- `scale` (number, _default: `2`_): maximal amount of digits in the
decimal fraction part
**Examples**
```json
{
"path": "rating",
"type": "decimal",
"lower" : -0.3,
"upper": 0.3
}
```
This masks the field `rating` with a random floating point number between
-0.3 and +0.3 (inclusive). By default, the decimal has a scale of 2.
That means, it has at most 2 digits after the dot.
The configuration:
```json
{
"path": "rating",
"type": "decimal",
"lower" : -0.3,
"upper": 0.3,
"scale": 3
}
```
… will generate numbers with at most 3 decimal digits.
### Credit Card Number
This masking type replaces the value of the attribute with a random
credit card number (as integer number).
See [Luhn algorithm](https://en.wikipedia.org/wiki/Luhn_algorithm)
for details.
Masking settings:
- `path` (string): which field to mask
- `type` (string): masking function name `"creditCard"`
**Example**
```json
{
"path": "ccNumber",
"type": "creditCard"
}
```
This generates a random credit card number to mask field `ccNumber`,
e.g. `4111111414443302`.
### Phone Number
This masking type replaces a phone number with a random one.
It uses the following rule:
- If a character of the original number is a digit
it will be replaced by a random digit.
- If it is a letter it is replaced by a random letter.
- All other characters are left unchanged.
- If the attribute value is not a string it is replaced by the
default value.
Masking settings:
- `path` (string): which field to mask
- `type` (string): masking function name `"phone"`
- `default` (string, _default: `"+1234567890"`_): if the input field
is not of data type `string`, then this value is used
**Examples**
```json
{
"path": "phone.landline",
"type": "phone"
}
```
This will replace an existing phone number with a random one, for instance
`"+31 66-77-88-xx"` might get substituted by `"+75 10-79-52-sb"`.
```json
{
"path": "phone.landline",
"type": "phone",
"default": "+49 12345 123456789"
}
```
This masks a phone number as before, but falls back to a different default
phone number in case the input value is not a string.
### Email Address
This masking type takes an email address, computes a hash value and
splits it into three equal parts `AAAA`, `BBBB`, and `CCCC`. The
resulting email address is in the format `AAAA.BBBB@CCCC.invalid`.
The hash is based on a random secret that is different for each run.
Masking settings:
- `path` (string): which field to mask
- `type` (string): masking function name `"email"`
**Example**
```json
{
"path": ".email",
"type": "email"
}
```
This masks every leaf attribute `email` with a random email address
similar to `"EHwG.3AOg@hGU=.invalid"`.