mirror of https://gitee.com/bigwinds/arangodb
Operational factors [3.4] (#7426)
parent 085e6706ea
commit cfe7c02a05

@@ -0,0 +1,241 @@
Data Modeling and Operational Factors
=====================================

Designing the data model of your application is a crucial task that can make
or break its performance. A well-designed data model allows you to write
efficient AQL queries, increases the throughput of CRUD operations and
makes sure your data is distributed in the most effective way.

Whether you design a new application with ArangoDB or port an existing one to
use ArangoDB, you should always analyze the (expected) data access patterns of
your application in conjunction with the following factors:

Operation Atomicity
-------------------

All insert / update / replace / remove operations in ArangoDB are atomic on a
_single_ document. Using a single instance of ArangoDB, multi-document /
multi-collection queries are guaranteed to be fully ACID. In cluster mode,
however, only single-document operations are fully ACID. This has
implications if you try to ensure consistency across multiple operations.

### Denormalizing Data

In traditional _SQL_ databases it is considered good practice to normalize
all your data across multiple tables to avoid duplicated data and to ensure
consistency.

ArangoDB is a schema-less _NoSQL_ multi-model database, so a good data model
is not necessarily normalized. On the contrary, to avoid extra joins it is
often an advantage to deliberately _denormalize_ your data model.

To denormalize your data model, you essentially combine all related entities
into a single document instead of spreading them over multiple documents and
collections. The advantage is that it allows you to atomically update all of
your connected data; the downside is that your documents become larger
(see below for more considerations on
[large documents](#document-and-transaction-sizes)).

As a simple example, let's say you want to maintain the total amount of a
shopping basket (from an online shop) together with a list of all included
items and prices. Since the total balance of all items in the shopping basket
should stay in sync with the contained items, you may put all contained items
inside the shopping basket document and only update them together:

```json
{
  "_id": "basket/123",
  "_key": "123",
  "_rev": "_Xv0TA0O--_",
  "user": "some_user",
  "balance": 100,
  "items": [
    { "price": 10, "title": "Harry Potter and the Philosopher’s Stone" },
    { "price": 90, "title": "Vacuum XYZ" }
  ]
}
```

This allows you to avoid making lookups via the document keys in
multiple collections.
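
Because the items and the running balance live in the same document, they can
be changed together in one atomic operation. A minimal arangosh sketch,
assuming the `basket` collection from the JSON example above (the added item
is made up for illustration):

```js
// Read the basket, then write items and balance back in a single update.
var doc = db.basket.document("123");
var items = doc.items.concat([ { price: 25, title: "Another Book" } ]);
db.basket.update("123", { items: items, balance: doc.balance + 25 });
```

Note that this read-modify-write sequence can still race with other writers;
the next section shows how to guard it with the document's revision ID.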

### Ensuring Consistent Atomic Updates

There are ways to ensure atomicity and consistency when performing updates in
your application. ArangoDB allows you to specify the revision ID (`_rev`) value
of the existing document you want to update. The update or replace operation
only succeeds if the values match. This way you can ensure that if your
application has read a document with a certain `_rev` value, modifications
to it are only applied if the document has not been changed by someone else
in the meantime. By specifying a document's previous revision ID you can
avoid silently losing updates to these documents.

You can specify the revision via the `_rev` field inside the document or via
the `If-Match: <revision>` HTTP header in the document REST API.
In _arangosh_ you can perform such an operation like this:

```js
db.basketCollection.update({"_key": "123", "_rev": "_Xv0TA0O--_"}, data)
// or replace
db.basketCollection.replace({"_key": "123", "_rev": "_Xv0TA0O--_"}, data)
```
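
If the stored revision no longer matches, these calls fail with a conflict
error that the application can catch and handle, for example by re-reading the
document and retrying. A sketch of what that might look like in arangosh (the
retry strategy is up to the application):

```js
try {
  db.basketCollection.update({"_key": "123", "_rev": "_Xv0TA0O--_"}, data)
} catch (err) {
  // errorNum 1200 (ERROR_ARANGO_CONFLICT) indicates the document was
  // modified in the meantime; re-read it and retry if appropriate.
}
```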

An AQL query with the same effect can be written by using the _ignoreRevs_
option together with a modification operation. Either let ArangoDB compare
the `_rev` values and only succeed if they still match, or let ArangoDB
ignore them (the default):

```js
FOR i IN 1..1000
  UPDATE { _key: CONCAT('test', i), _rev: "1287623" }
  WITH { foobar: true } IN users
  OPTIONS { ignoreRevs: false }
```

Indexes
-------

Indexes can improve the performance of AQL queries drastically. Queries that
frequently filter on one or more fields can be made faster by creating an index
(in arangosh via the _ensureIndex_ command, the Web UI or your specific
client driver). Every collection already has an automatic (and non-deletable)
primary index on the `_key` and `_id` fields, and edge collections additionally
have an edge index on `_from` and `_to`.
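
For illustration, this is what creating indexes with _ensureIndex_ looks like
in arangosh (the `users` collection and the field names are made up for this
example):

```js
// Hash index for exact lookups on a single attribute, enforcing uniqueness.
db.users.ensureIndex({ type: "hash", fields: ["email"], unique: true });

// Skiplist index for range queries, e.g. FILTER doc.age >= 18 in AQL.
db.users.ensureIndex({ type: "skiplist", fields: ["age"] });
```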

Should you decide to create an index, consider a few things:

- Indexes are a trade-off between storage space, maintenance cost and query speed.
- Each new index will increase the amount of RAM and (for the RocksDB storage
  engine) the amount of disk space needed.
- Indexes with [indexed array values](../Indexing/IndexBasics.md#indexing-array-values)
  need an extra index entry per array entry.
- Adding indexes increases the write amplification, i.e. it negatively affects
  the write performance (how much depends on the storage engine).
- Each index needs to add at least one index entry per document. You can use
  _sparse indexes_ to avoid adding _null_ index entries for rarely used attributes.
- Sparse indexes can be smaller than non-sparse indexes, but they can only be
  used if the optimizer determines that the _null_ value cannot be in the
  result range, e.g. by an explicit `FILTER doc.attribute != null` in AQL
  (also see [Type and value order](../../AQL/Fundamentals/TypeValueOrder.html)
  and the sketch after this list).
- Collections that are more frequently read benefit the most from added indexes,
  provided the indexes can actually be utilized.
- Indexes on collections with a high rate of inserts or updates compared to
  reads may hurt overall performance.
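
The sketch referenced above: a sparse index only contains documents that
actually have the indexed attribute, and a query has to rule out _null_
explicitly for the optimizer to use it (`users` and `lastLogin` are
illustrative names):

```js
// Documents without a "lastLogin" attribute get no index entry at all.
db.users.ensureIndex({ type: "skiplist", fields: ["lastLogin"], sparse: true });

// The explicit != null filter guarantees that null cannot be in the result
// range, which allows the optimizer to use the sparse index:
db._query(`
  FOR u IN users
    FILTER u.lastLogin != null AND u.lastLogin < "2018-01-01"
    RETURN u._key
`);
```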

Generally it is best to design your indexes with your queries in mind.
Use the [query profiler](../../AQL/ExecutionAndPerformance/QueryProfiler.html)
to understand the bottlenecks in your queries.

Always consider the additional space requirements of extra indexes when
planning server capacities. For more information on indexes see
[Index Basics](../Indexing/IndexBasics.md).

<!-- TODO eventually add a page on capacity planning -->

Number of Databases and Collections
-----------------------------------

Sometimes it makes sense to split up your data over multiple collections.
For example, one could create a new set of collections for each new customer
instead of having a customer field on each document. Having a few thousand
collections incurs no significant performance penalty for most operations and
results in good performance.

Grouping documents into collections by type (e.g. session collections
'sessions_dev' and 'sessions_prod') allows you to avoid an extra index on a
_type_ field. Similarly, you may consider
[splitting edge collections](../Graphs/README.md#multiple-edge-collections-vs-filters-on-edge-document-attributes)
instead of specifying the type of the connection inside the edge document.
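
A small sketch of the two modeling options in arangosh (the collection names
and attributes are made up):

```js
// Option 1: one collection per type, no extra "type" attribute or index needed.
db._create("sessions_prod");
db.sessions_prod.insert({ user: "jdoe", loginAt: "2018-10-01T08:00:00Z" });

// Option 2: a single collection with a type attribute; queries filtering on
// the type will typically want an additional index on it.
db._create("sessions");
db.sessions.ensureIndex({ type: "hash", fields: ["type"] });
db.sessions.insert({ type: "prod", user: "jdoe", loginAt: "2018-10-01T08:00:00Z" });
```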

A few things to consider:
- Adding an extra collection always incurs a small amount of overhead for the
  collection metadata and indexes.
- You cannot use more than _2048_ collections per AQL query.
- Uniqueness constraints on certain attributes (via a unique index) can only
  be enforced by ArangoDB within a single collection.
- Only with the _MMFiles storage engine_: creating extra databases will require
  two compaction and cleanup threads per database. This might lead to
  undesirable effects should you decide to create many databases compared to
  the number of available CPU cores.

Cluster Sharding
----------------

The ArangoDB cluster _partitions_ your collections into one or more _shards_
across multiple _DBServers_. This enables efficient _horizontal scaling_:
it allows you to store much more data, since ArangoDB distributes the data
automatically to the different servers. In many situations one can also reap
a benefit in data throughput, again because the load can be distributed to
multiple machines.

ArangoDB uses the specified _shard keys_ to determine in which shard a given
document is stored. Choosing the right shard key can have a significant impact:
it can reduce network traffic and increase performance.

ArangoDB uses consistent hashing to compute the target shard from the given
values (as specified via `shardKeys`). The ideal set of shard keys allows
ArangoDB to distribute documents evenly across your shards and your _DBServers_.
By default ArangoDB uses the `_key` field as a shard key. For a custom shard key
you should consider a few different properties:

- **Cardinality**: The cardinality of a set is the number of distinct values
  that it contains. A shard key with only _N_ distinct values cannot be hashed
  onto more than _N_ shards. Consider using multiple shard keys, if one of your
  values has a low cardinality.
- **Frequency**: Consider how often a given shard key value may appear in
  your data. Having a lot of documents with identical shard keys will lead
  to unevenly distributed data.

See [Sharding](../Architecture/DeploymentModes/Cluster/Architecture.md#sharding)
for more information.
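
As a sketch, the number of shards and the shard keys are specified when
creating a collection in a cluster (the collection name, shard count and shard
key below are illustrative):

```js
// Hash documents onto 8 shards based on their customerId attribute.
db._create("orders", { numberOfShards: 8, shardKeys: ["customerId"] });

// All documents with the same customerId end up in the same shard.
db.orders.insert({ customerId: "c42", total: 99.90 });
```

Note that the shard keys of a collection cannot be changed after the
collection has been created.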

### Smart Graphs

Smart Graphs are an Enterprise Edition feature of ArangoDB. They enable you to
manage graphs at scale and give a vast performance benefit for graphs sharded
in an ArangoDB cluster.

To add a Smart Graph you need a smart graph attribute that partitions your
graph into several smaller sub-graphs. Ideally these sub-graphs follow a
"natural" structure in your data: they have a large number of edges that only
connect vertices in the same sub-graph and only few edges connecting vertices
of other sub-graphs.

All the usual considerations for shard keys also apply to smart graph
attributes. For more information see
[SmartGraphs](../Graphs/SmartGraphs/README.md).
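
A hedged sketch of creating a SmartGraph from arangosh in the Enterprise
Edition, assuming a `region` attribute as the smart graph attribute (the graph
and collection names are made up):

```js
// Requires the Enterprise Edition and a cluster deployment.
var smartGraphs = require("@arangodb/smart-graph");
smartGraphs._create(
  "socialGraph",
  [ smartGraphs._relation("knows", "people", "people") ],
  [],
  { smartGraphAttribute: "region", numberOfShards: 9 }
);
```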

Document and Transaction Sizes
------------------------------

When designing your data model you should keep in mind that the size of
documents affects the performance and storage requirements of your system.
Very large numbers of very small documents may have an unexpectedly big overhead:
each document carries a certain amount of extra storage overhead, depending on
the storage engine and the indexes you added to the collection. This overhead
may become significant if you store a large number of very small documents.

Very large documents may reduce your write throughput:
this is due to the extra time needed to send larger documents over the
network as well as more copying work required inside the storage engines.

Consider some ways to minimize the required amount of storage space:

- Explicitly set the `_key` field to a custom unique value.
  This enables you to store information in the `_key` field instead of another
  field inside the document. The `_key` value is always indexed; setting a
  custom value means you can use a shorter value than what would have been
  generated automatically.
- Shorter field names will reduce the amount of space needed to store documents
  (this has no effect on index size). ArangoDB is schemaless and needs to store
  the document structure inside each document. Usually this is a small overhead
  compared to the overall document size.
- Combining many small related documents into one larger one can also
  reduce overhead. Common fields can be stored once and indexes just need to
  store one entry. This will only be beneficial if the combined documents are
  regularly retrieved together and not just subsets.
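
For example, a lookup value that is unique anyway can double as the document
key (a sketch; the collection and values are illustrative):

```js
// The ISO country code serves as the document key, so there is no need for
// a separate "code" attribute plus an extra unique index on it.
db.countries.insert({ _key: "DE", name: "Germany" });
db.countries.document("DE"); // fast primary-index lookup by key
```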

Especially with the RocksDB storage engine, large documents and transactions
may negatively impact the write performance:
- Consider a maximum size of 50-75 kB _per document_ as a good rule of thumb.
  This will allow you to maintain steady write throughput even under very high load.
- Transactions are held in-memory before they are committed.
  This means that transactions have to be split if they become too big, see the
  [limitations section](../Transactions/Limitations.md#with-rocksdb-storage-engine).

@@ -1107,3 +1107,16 @@ often undesired in logs anyway.
Another positive side effect of turning off the escaping is that it will slightly
reduce the CPU overhead for logging. However, this will only be noticeable when the
logging is set to a very verbose level (e.g. log levels debug or trace).

### Active Failover

The _Active Failover_ mode is now officially supported for multiple slaves.

Additionally you can now send read-only requests to followers, so you can
use them for read scaling. To make sure only requests that are intended for
this use case are served by the follower, you need to add an
`X-Arango-Allow-Dirty-Read: true` header to HTTP requests.

For more information see
[Active Failover Architecture](../Architecture/DeploymentModes/ActiveFailover/Architecture.md).

@@ -118,6 +118,7 @@
* [Collection and View Names](DataModeling/NamingConventions/CollectionAndViewNames.md)
* [Document Keys](DataModeling/NamingConventions/DocumentKeys.md)
* [Attribute Names](DataModeling/NamingConventions/AttributeNames.md)
* [Operational Factors](DataModeling/OperationalFactors.md)
* [Indexing](Indexing/README.md)
* [Index Basics](Indexing/IndexBasics.md)
* [Which index to use when](Indexing/WhichIndex.md)

@@ -70,7 +70,7 @@ fully ACID as well.
With RocksDB storage engine
---------------------------

Data of ongoing transactions is stored in RAM. Query-Transactions that get too big
Data of ongoing transactions is stored in RAM. Transactions that get too big
(in terms of number of operations involved or the total size of data created or
modified by the transaction) will be committed automatically. Effectively this
means that big user transactions are split into multiple smaller RocksDB