MMFiles Compaction options
==========================
The ArangoDB MMFiles storage engine periodically runs a compaction over its data files.

ArangoDB first writes documents into the WAL (write-ahead log). Once a WAL file has been sealed,
the collector may copy its documents into a per-collection journal file.

Once journal files fill up, they are sealed and become data files.

As a consequence, one collection may have documents in the WAL logs, in its journal file, and in an
arbitrary number of data files.
If a collection is loaded, each of these files is opened (thus using a file handle) and
is memory-mapped. Since file handles and memory-mapped files are scarce resources,
their number should be kept low.
Once you update or remove documents in a data file (or already did so while it was still the journal file),
these documents are marked as 'dead' with a deletion marker.

Over time the number of dead documents may rise. They keep occupying the previously mentioned
resources, and their disk space should be given back to the system.
Thus several data files can be combined into one, omitting the dead documents.
Combining several of these data files into one is called compaction. The compaction process reads
the alive documents from the original data files and writes them into a new data file.

Once that is done, the memory mappings to the old data files are released, and the files are erased.

Since the compaction locks the collection and also uses I/O resources, it is carefully configurable
under which conditions the system should perform which amount of these compaction jobs:
ArangoDB spawns one compactor thread per database. The settings below vary in scope.
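All of the options discussed below are arangod startup options. As a minimal sketch of how they are
supplied (the value shown is purely illustrative, not the built-in default), an option can be given
on the command line or, following the usual prefix-to-section mapping, in the server configuration file:

```
# on the command line (illustrative value):
arangod --compaction.db-sleep-time 1 ...

# or in the arangod configuration file:
[compaction]
db-sleep-time = 1
```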
# Activity control

The activity control parameters alter the behaviour in terms of scan / execution frequency
of the compaction.

Sleep interval between two compaction runs (in seconds):
`--compaction.db-sleep-time`
The number of seconds the compactor thread will wait between two attempts to search for
compactable data files of collections in one database. If the compactor has actually executed work,
a subsequent lookup is done. Scope: database.

Minimum sleep time between two compaction runs (in seconds):
`--compaction.min-interval`
When an actual compaction was executed for one collection, we wait for this time
before we execute the compaction on this collection again.
This gives any user load that may have piled up in the meantime a chance to be worked off. Scope: collection.
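For example, a deployment could reduce the scan frequency like this; this is only a sketch, and the
values are illustrative rather than the shipped defaults:

```
# Scan each database for compactable data files every 2 seconds, but wait at
# least 60 seconds before compacting the same collection again.
# (Illustrative values, not the built-in defaults.)
arangod --compaction.db-sleep-time 2 --compaction.min-interval 60 ...
```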
# Source data files

These parameters control which data files are taken into account for a compaction run.
You can specify several criteria, each of which may be sufficient on its own.

The scan over the data files belonging to one collection is executed from the oldest
data file to the newest; if files qualify for a compaction, they may be merged with newer files
(containing younger documents).

Scope: collection level; some parameters are influenced by collection settings.
Minimal filesize threshold original data files have to be below for a compaction:
`--compaction.min-small-data-file-size`
This is the threshold below which a data file's total size will always make it
eligible for compaction.

Minimum unused count of documents in a data file:
`--compaction.dead-documents-threshold`
Data files will often contain dead documents. This parameter specifies the maximum acceptable
number of dead documents; once it is exceeded, the data file qualifies for compaction.

How many bytes of the source data file are allowed to be unused at most:
`--compaction.dead-size-threshold`
The dead data size varies along with the size of your documents.
If you have many big documents, this threshold may be hit before the document count threshold.

What percentage of the source data file must at least be unused:
`--compaction.dead-size-percent-threshold`
Since the size of the documents may vary, this threshold works on the percentage of the
dead documents' size. Thus, if you have many huge dead documents, this threshold kicks in earlier.

To give an example with numbers: if the data file contains 800 kbytes of alive
and 400 kbytes of dead documents, the share of the dead documents is:

`400 / (400 + 800) = 33 %`.

If this value is higher than the specified threshold, the data file will be compacted.
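To make the interplay of the four criteria concrete, here is a hedged sketch; all numbers are invented
for illustration (they are not the shipped defaults), and the exact format expected for the percent
threshold (fraction vs. percentage) should be verified with `arangod --help`:

```
# A data file qualifies for compaction as soon as ANY single criterion is met:
#  - its total size is below min-small-data-file-size (here: 128 kB), or
#  - it holds more dead documents than dead-documents-threshold (here: 16384), or
#  - its dead documents occupy more than dead-size-threshold bytes (here: 32 MB), or
#  - the dead share exceeds dead-size-percent-threshold (here: 30 %; the 33 % from
#    the example above would therefore qualify).
arangod --compaction.min-small-data-file-size 131072 \
        --compaction.dead-documents-threshold 16384 \
        --compaction.dead-size-threshold 33554432 \
        --compaction.dead-size-percent-threshold 0.3 ...
```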
# Compacted target files

Once data files of a collection have qualified for a compaction run, these parameters control
how many data files are merged into one (even a single source data file may be compacted into
one smaller target data file).

Scope: collection level; some parameters are influenced by collection settings.

Maximum number of files to merge into one file:
`--compaction.dest-max-files`
How many data files (at most) may be merged into one resulting data file during one compaction run.

How large the resulting file may be in comparison to the collection's `database.maximal-journal-size` setting:
`--compaction.dest-max-file-size-factor`
In ArangoDB you can configure a default journal filesize globally and override it on a per-collection level.
This value controls the size of compacted data files relative to the configured journal file size of the collection in question.

A factor of 3 means that the maximum filesize of the compacted file is
3 times the size of the maximum collection journal file size.

How large the compaction result file may become:
`--compaction.dest-max-result-file-size`
In addition to the factor above, an absolute maximum allowed filesize in bytes may be specified.
This will overrule all previous parameters.
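Putting the three target-side limits together, a sketch with illustrative values (not the built-in
defaults):

```
# Merge at most 4 source data files per compaction run. With a collection journal
# size of 32 MB (database.maximal-journal-size or the per-collection override), a
# factor of 3 would allow compacted files of up to 96 MB, but the absolute limit of
# 64 MB below overrules that, so the effective cap is 64 MB.
arangod --compaction.dest-max-files 4 \
        --compaction.dest-max-file-size-factor 3 \
        --compaction.dest-max-result-file-size 67108864 ...
```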