# Datacenter to datacenter replication.

## About 

At some point in the grows of a database, there comes a need for 
replicating it across multiple datacenters.

Reasons for that can be:
- Fallback in case of a disaster in one datacenter.
- Regional availability 
- Separation of concerns 

And many more.

This tutorial describes what the ArangoSync datacenter to datacenter 
replication solution (ArangoSync from now on) offers,
when to use it, when not to use it and  how to configure,
operate, troubleshoot it & keep it safe.

### What is it

ArangoSync is a solution that enables you to asynchronously replicate 
the entire structure and content in an ArangoDB cluster in one place to a cluster 
in another place. Typically it is used from one datacenter to another.
<br/>It is not a solution for replicating single server instances.

The replication done by ArangoSync in **asynchronous**. That means that when 
a client is writing data into the source datacenter, it will consider the 
request finished before the data has been replicated to the other datacenter.
The time needed to completely replicate changes to the other datacenter is
typically in the order of seconds, but this can vary significantly depending on 
load, network & computer capacity.

ArangoSync performs replication in a **single direction** only. That means that 
you can replicate data from cluster A to cluster B or from cluster B to cluster A,
but never at the same time. 
<br/>Data modified in the destination cluster **will be lost!**

Replication is a completely **autonomous** process. Once it is configured it is 
designed to run 24/7 without frequent manual intervention.
<br/>This does not mean that it requires no maintenance or attention at all.
<br/>As with any distributed system some attention is needed to monitor its operation 
and keep it secure (e.g. certificate & password rotation).

Once configured, ArangoSync will replicate both **structure and data** of an 
**entire cluster**. This means that there is no need to make additional configuration 
changes when adding/removing databases or collections.
<br/>Also meta data such as users, foxx application & jobs are automatically replicated.

### When to use it... and when not

ArangoSync is a good solution in all cases where you want to replicate 
data from one cluster to another without the requirement that the data 
is available immediately in the other cluster.

ArangoSync is not a good solution when one of the following applies:
- You want to replicate data from cluster A to cluster B and from cluster B
  to cluster A at the same time. 
- You need synchronous replication between 2 clusters.
- There is no network connection betwee cluster A and B.
- You want complete control over which database, collection & documents are replicate and which not.

## Requirements 

To use ArangoSync you need the following:

- Two datacenters, each running an ArangoDB Enterprise cluster, version 3.3 or higher.
- A network connection between both datacenters with accessible endpoints
  for several components (see individual components for details).
- TLS certificates for ArangoSync master instances (can be self-signed).
- TLS certificates for Kafka brokers (can be self-signed).
- Optional (but recommended) TLS certificates for ArangoDB clusters (can be self-signed).
- Client certificates CA for ArangoSync masters (typically self-signed).
- Client certificates for ArangoSync masters (typically self-signed).
- At least 2 instances of the ArangoSync master in each datacenter.
- One instances of the ArangoSync worker on every machine in each datacenter.

Note: In several places you will need a (x509) certificate. 
<br/>The [certificates](#certificates) section below provides more guidance for creating 
and renewing these certificates.

Besides the above list, you probably want to use the following:

- An orchestrator to keep all components running. In this tutorial we will use `systemd` as an example.
- A log file collector for centralized collection & access to the logs of all components.
- A metrics collector & viewing solution such as Prometheus + Grafana.

## Deployment 

In the following paragraphs you'll learn how to deploy all the components 
needed for datacenter to datacenter replication.

### ArangoDB cluster 

There are several ways to start an ArangoDB cluster. In this tutorial we will focus on our recommended
way to start ArangoDB: the ArangoDB starter.

Datacenter to datacenter replication requires the `rocksdb` storage engine. In this tutorial the
example setup will have `rocksdb` enabled. If you choose to deploy with a different strategy keep
in mind to set the storage engine.

For the other possibilities to deploy an ArangoDB cluster please refer to the Deployment chapter:

**TODO ONCE WE MERGE WITH THE STANDARD DOCUMENTATION: INLINE LINK**

The starter simplifies things for the operator and will coordinate a distributed cluster startup
across several machines and assign cluster roles automatically.

When started on several machines and enough machines have joined, the starters will start agents,
coordinators and dbservers on these machines.

When running the starter will supervise its child tasks (namely coordinators, dbservers and agents)
and restart them in case of failure.

To start the cluster using a systemd unit file use the following:

```
[Unit]
Description=Run the ArangoDB Starter 
After=network.target

[Service]
Restart=on-failure
EnvironmentFile=/etc/arangodb.env
EnvironmentFile=/etc/arangodb.env.local
Environment=DATADIR=/var/lib/arangodb/cluster
ExecStartPre=/usr/bin/sh -c "mkdir -p ${DATADIR}"
ExecStart=/usr/bin/arangodb \
    --starter.address=${PRIVATEIP} \
    --starter.data-dir=${DATADIR} \
    --starter.join=${STARTERENDPOINTS} \
    --server.storage-engine=rocksdb \
    --auth.jwt-secret=${CLUSTERSECRETPATH}
TimeoutStopSec=60

[Install]
WantedBy=multi-user.target
```

Note that we set `rocksdb` in the unit service file.

#### Cluster authentication

The communication between the cluster nodes use a token (JWT) to authenticate. This must
be shared between cluster nodes.

Sharing secrets is obviously a very delicate topic. The above workflow assumes that the operator
will put a secret in a file named `${CLUSTERSECRETPATH}`.

We recommend to use a dedicated system for managing secrets like HashiCorps' `Vault` or the
secret management of `DC/OS`.

#### Required ports

As soon as enough machines have joined, the starter will begin starting agents, coordinators
and dbservers.

Each of these tasks needs a port to communicate. Please make sure that the following ports
are available on all machines:

- `8529` for coordinators
- `8530` for dbservers
- `8531` for agents

The starter itself will use port `8528`.

### Kafka & Zookeeper 

- How to deploy zookeeper 
- How to deploy kafka 
- Accessible ports

### Sync Master 

The Sync Master is responsible for managing all synchronization, creating tasks and assigning 
those to workers.
<br/> At least 2 instances muts be deployed in each datacenter.
One instance will be the "leader", the other will be an inactive slave. When the leader 
is gone for a short while, one of the other instances will take over.

With clusters of a significant size, the sync master will require a significant set of resources.
Therefore it is recommended to deploy sync masters on their own servers, equiped with sufficient
CPU power and memory capacity.

ArangoSync will transport messages from one datacenter to another in "transport topics".
This is a limited set of well know topics. The list should be large enough to avoid 
performance bottlenecks. Typically 10-20 topics will do.
<br/> The list of transport topics cannot change over time without re-configuration.
To change the list of transport topics, first stop synchronization completely, then restart the 
sync masters with the new list of topics and re-configure synchronization.

To start a sync master using a `systemd` service, use a unit like this:

```
[Unit]
Description=Run ArangoSync in master mode 
After=network.target

[Service]
Restart=on-failure
EnvironmentFile=/etc/arangodb.env 
EnvironmentFile=/etc/arangodb.env.local
ExecStart=/usr/sbin/arangosync run master \
    --log.level=debug \
    --cluster.endpoint=${CLUSTERENDPOINTS} \
    --cluster.jwtSecret=${CLUSTERSECRET} \
    --server.keyfile=${CERTIFICATEDIR}/tls.keyfile \
    --server.client-cafile=${CERTIFICATEDIR}/client-auth-ca.crt \
    --server.endpoint=https://${PUBLICIP}:${MASTERPORT} \
    --server.port=${MASTERPORT} \
    --master.jwtSecret=${MASTERSECRET} \
    --mq.type=kafka \
    --mq.kafka-addr=${KAFKAENDPOINTS} \
    --mq.kafka-client-keyfile=${CERTIFICATEDIR}/kafka-client.key \
    --mq.kafka-cacert=${CERTIFICATEDIR}/tls-ca.crt \
    --mq.transport-topic=${CLUSTERNAME}-1 \
    --mq.transport-topic=${CLUSTERNAME}-2 \
    --mq.transport-topic=${CLUSTERNAME}-3
TimeoutStopSec=60

[Install]
WantedBy=multi-user.target
```

The `--mq.transport-topic` arguments can be replaced by a list of transport topic name in 
a file. Pass an `--mq.transport-topicsfile=filename` in that case.

The sync master needs a TLS server certificate and a 
If you want the service to create a TLS certificate & client authentication 
certificate, for authenticating with sync masters in another datacenter, for every start, 
add this to the `Service` section.

```
ExecStartPre=/usr/bin/sh -c "mkdir -p ${CERTIFICATEDIR}"
ExecStartPre=/usr/sbin/arangosync create tls keyfile \
    --cacert=${CERTIFICATEDIR}/tls-ca.crt \
    --cakey=${CERTIFICATEDIR}/tls-ca.key \
    --keyfile=${CERTIFICATEDIR}/tls.keyfile \
    --host=${PUBLICIP} \
    --host=${PRIVATEIP} \
    --host=${HOST}
ExecStartPre=/usr/sbin/arangosync create client-auth keyfile \
    --cacert=${CERTIFICATEDIR}/tls-ca.crt \
    --cakey=${CERTIFICATEDIR}/tls-ca.key \
    --keyfile=${CERTIFICATEDIR}/kafka-client.key \
    --host=${PUBLICIP} \
    --host=${PRIVATEIP} \
    --host=${HOST}
```

The sync master must be reachable on a TCP port `${MASTERPORT}` (used with `--server.port` option).
This port must be reachable from inside the datacenter (by sync workers and operations) 
and from inside of the other datacenter (by sync masters in the other datacenter).

### Sync Workers 

The Sync Worker is responsible for executing synchronization tasks.
<br/> For optimal performance at least 1 worker instance must be placed on 
every machine that has an ArangoDB `dbserver` running. This ensures that tasks 
can be executed with minimal network traffic outside of the machine.

Since sync workers will automatically stop once their TLS server certificate expires 
(which is set to 2 years by default),
it is recommended to run at least 2 instances of a worker on every machine in the datacenter.
That way, tasks can still be assigned in the most optimal way, even when a worker in temporarily 
down for a restart.

To start a sync worker using a `systemd` service, use a unit like this:

```
[Unit]
Description=Run ArangoSync in worker mode 
After=network.target

[Service]
Restart=on-failure
EnvironmentFile=/etc/arangodb.env 
EnvironmentFile=/etc/arangodb.env.local
Environment=PORT=8729
ExecStart=/usr/sbin/arangosync run worker \
    --log.level=debug \
    --server.port=${PORT} \
    --server.endpoint=https://${PRIVATEIP}:${PORT} \
    --master.endpoint=${MASTERENDPOINTS} \
    --master.jwtSecret=${MASTERSECRET}
TimeoutStopSec=60

[Install]
WantedBy=multi-user.target
```

The sync worker must be reachable on a TCP port `${PORT}` (used with `--server.port` option).
This port must be reachable from inside the datacenter (by sync masters).

### Prometheus & Grafana (optional)

ArangoSync provides metrics in a format supported by [Prometheus](https://prometheus.io).
We also provide a standard set of dashboards for viewing those metrics in [Grafana](https://grafana.org).

If you want to use these tools, go to their websites for instructions on how to deploy them.

After deployment, you must configure prometheus using a configuration file that instructs 
it about which targets to scrape. For ArangoSync you should configure scrape targets for 
all sync masters and all sync workers. To do so, you can use a configuration such as this:

```
global:
  scrape_interval:     10s # scrape targets every 10 seconds.

scrape_configs:
  # Scrap sync masters
  - job_name: 'sync_master'
    scheme: 'https'
    bearer_token: "${MONITORINGTOKEN}"
    tls_config:
      insecure_skip_verify: true
    static_configs:
      - targets:
        - "${IPMASTERA1}:8629"
        - "${IPMASTERA2}:8629"
        - "${IPMASTERB1}:8629"
        - "${IPMASTERB2}:8629"
        labels:
          type: "master"
    relabel_configs:
      - source_labels: [__address__]
        regex:         ${IPMASTERA1}\:8629|${IPMASTERA2}\:8629
        target_label:  dc
        replacement:   A
      - source_labels: [__address__]
        regex:         ${IPMASTERB1}\:8629|${IPMASTERB2}\:8629
        target_label:  dc
        replacement:   B
      - source_labels: [__address__]
        regex:         ${IPMASTERA1}\:8629|${IPMASTERB1}\:8629
        target_label:  instance
        replacement:   1
      - source_labels: [__address__]
        regex:         ${IPMASTERA2}\:8629|${IPMASTERB2}\:8629
        target_label:  instance
        replacement:   2

  # Scrap sync workers
  - job_name: 'sync_worker'
    scheme: 'https'
    bearer_token: "${MONITORINGTOKEN}"
    tls_config:
      insecure_skip_verify: true
    static_configs:
      - targets: 
        - "${IPWORKERA1}:8729"
        - "${IPWORKERA2}:8729"
        - "${IPWORKERB1}:8729"
        - "${IPWORKERB2}:8729"
        labels:
          type: "worker"
    relabel_configs:
      - source_labels: [__address__]
        regex:         ${IPWORKERA1}\:8729|${IPWORKERA2}\:8729
        target_label:  dc
        replacement:   A
      - source_labels: [__address__]
        regex:         ${IPWORKERB1}\:8729|${IPWORKERB2}\:8729
        target_label:  dc
        replacement:   B
      - source_labels: [__address__]
        regex:         ${IPWORKERA1}\:8729|${IPWORKERB1}\:8729
        target_label:  instance
        replacement:   1
      - source_labels: [__address__]
        regex:         ${IPWORKERA2}\:8729|${IPWORKERB2}\:8729
        target_label:  instance
        replacement:   2
```

Note: The above example assumes 2 datacenters, with 2 sync masters & 2 sync workers 
per datacenter. You have to replace all `${...}` variables in the above configuration 
with applicable values from your environment.

## Configuration 

Once all components of the ArangoSync solution have been deployed and are 
running properly, ArangoSync will not automatically replicate database structure 
and content. For that, it is is needed to configure synchronization.

To configure synchronization, you need the following:
- The endpoint of the sync master in the target datacenter.
- The endpoint of the sync master in the source datacenter.
- A certificate (in keyfile format) used for client authentication of the sync master 
  (with the sync master in the source datacenter).
- A CA certificate (public key only) for verifying the integrity of the sync masters.
- A username+password pair (or client certificate) for authenticating the configure 
  require with the sync master (in the target datacenter)

With that information, run:
```
arangosync configure sync \
  --master.endpoint=<endpoints of sync masters in target datacenter> \
  --master.keyfile=<keyfile of of sync masters in target datacenter> \
  --source.endpoint=<endpoints of sync masters in source datacenter> \
  --source.cacert=<public key of CA certificate used to verify sync master in source datacenter> \
  --auth.user=<username used for authentication of this command> \
  --auth.password=<password of auth.user>
```

The command will finish quickly. Afterwards it will take some time until 
the clusters in both datacenters are in sync.

Use the following command to inspect the status of the synchronization of a datacenter:
```
arangosync get status \
  --master.endpoint=<endpoints of sync masters in datacenter of interest> \
  --auth.user=<username used for authentication of this command> \
  --auth.password=<password of auth.user> \
  -v
```
Note: Invoking this command on the target datacenter will return different results from
invoking it on the source datacenter. You need insight in both results to get a "complete picture".

Where the `get status` command gives insight in the status of synchronization, there 
are more detailed commands to give insight in tasks & registered workers.

Use the following command to get a list of all synchronization tasks in a datacenter:
```
arangosync get tasks \
  --master.endpoint=<endpoints of sync masters in datacenter of interest> \
  --auth.user=<username used for authentication of this command> \
  --auth.password=<password of auth.user> \
  -v
```

Use the following command to get a list of all workers in a datacenter:
```
arangosync get workers \
  --master.endpoint=<endpoints of sync masters in datacenter of interest> \
  --auth.user=<username used for authentication of this command> \
  --auth.password=<password of auth.user> \
  -v
```

### Stop synchronization

If you no longer want to synchronize data from a source to a target datacenter 
you must stop it. To do so, run the following command:

```
arangosync stop sync \
  --master.endpoint=<endpoints of sync masters in target datacenter> \
  --auth.user=<username used for authentication of this command> \
  --auth.password=<password of auth.user>
```

The command will wait until synchronization has completely stopped before returning.
If the synchronization is not completely stopped within a reasonable period (2 minutes by default)
the command will fail.

If the source datacenter is no longer available it is not possible to stop synchronization in 
a graceful manner. If that happens abort the synchronization with the following command:
```
arangosync abort sync \
  --master.endpoint=<endpoints of sync masters in target datacenter> \
  --auth.user=<username used for authentication of this command> \
  --auth.password=<password of auth.user>
```
If the source datacenter recovers after an `abort sync` has been executed, it is 
needed to "cleanup" ArangoSync in the source datacenter. 
To do so, execute the following command:
```
arangosync abort outgoing sync \
  --master.endpoint=<endpoints of sync masters in source datacenter> \
  --auth.user=<username used for authentication of this command> \
  --auth.password=<password of auth.user>
```

### Reversing synchronization direction 

If you want to reverse the direction of synchronization (e.g. after a failure
in datacenter A and you switched to the datacenter B for fallback), you 
must first stop (or abort) the original synchronization.

Once that is finished (and cleanup has been applied in case of abort),
you must now configure the synchronization again, but with swapped 
source & target settings.

## Operations & Maintenance 

ArangoSync is a distributed system with a lot different components.
As with any such system, it requires some, but not a lot, of operational 
support. 

### What means are available to monitor status 

All of the components of ArangoSync provide means to monitor their status.
Below you'll find an overview per component.

- Sync master & workers: The `arangosync` servers running as either master
  or worker, provide:
  - A status API, see `arangosync get status`. Make sure that all statuses report `running`.
    <br/>For even more detail the following commands are also available:
    `arangosync get tasks` & `arangosync get workers`.
  - A log on the standard output. Log levels can be configured using `--log.level` settings.
  - A metrics API `GET /metrics`. This API is compatible with Prometheus.
    Sample Grafana dashboards for inspecting these metrics are available.

- ArangoDB cluster: The `arangod` servers that make up the ArangoDB cluster 
  provide:
  - A log file. This is configurable with settings with a `log.` prefix.
  E.g. `--log.output=file://myLogFile` or `--log.level=info`.
  - A statistics API `GET /_admin/statistics`

- Kafka cluster: The kafka brokers provide:
  - A log file, see settings with `log.` prefix in its `server.properties` configuration file.

- Zookeeper: The zookeeper agents provide:
  - A log on standard output.

### What to look for while monitoring status

The very first thing to do when monitoring the status of ArangoSync is to 
look into the status provided by `arangosync get status ... -v`.
When not everything is in the `running` state (on both datacenters), this is an 
indication that something may be wrong. In case that happens, give it some time 
(incremental synchronization may take quite some time for large collections)
and look at the status again. If the statuses do not change (or change, but not reach `running`)
it is time to inspects the metrics & log files.
<br/> When the metrics or logs seem to indicate a problem in a sync master or worker, it is 
safe to restart it, as long as only 1 instance is restarted at a time.
Give restarted instances some time to "catch up".

### What to do when problems remain 

When a problem remains and restarting masters/workers does not solve the problem,
contact support. Make sure to include provide support with the following information:

- Output of `arangosync get version ...` on both datacenters.
- Output of `arangosync get status ... -v` on both datacenters.
- Output of `arangosync get tasks ... -v` on both datacenters.
- Output of `arangosync get workers ... -v` on both datacenters.
- Log files of all components
- A complete description of the problem you observed and what you did to resolve it.

- How to monitor status of ArangoSync
- How to keep it alive
- What to do in case of failures or bugs

### What to do when a source datacenter is down

When you use ArangoSync for backup of your cluster from one datacenter
to another and the source datacenter has a complete outage, you may consider 
switching your applications to the target (backup) datacenter.

This is what you must do in that case.

1. [Stop configuration](#stop-synchronization) using `arangosync stop sync ...`.
   <br/>When the source datacenter is completely unresponsive this will not 
   succeed. In that case use `arangosync abort sync ...`. 
   <br/>See [Configuration](#stop-synchronization) for how to cleanup the source datacenter when 
   it becomes available again.
2. Verify that configuration has completely stopped using `arangosync get status ... -v`.
3. Reconfigure your applications to use the target (backup) datacenter.

When the original source datacenter is restored, you may switch roles and 
make it the target datacenter. To do so, use `arangosync configure sync ...` 
as described in [Configuration](#configuration).

### What to do in case of a planned network outage.

All ArangoSync tasks send out heartbeat messages out to the other datacenter 
to indicate "it is still alive". The other datacenter assumes the connection is 
"out of sync" when it does not receive any messages for a certain period of time.

If you're planning some sort of maintenance where you know the connectivity 
will be lost for some time (e.g. 3 hours), you can prepare ArangoSync for that 
such that it will hold of re-synchronization for a given period of time.

To do so, on both datacenters, run:
```
arangosync set message timeout \
    --master.endpoint=<endpoints of sync masters in the datacenter> \
    --auth.user=<username used for authentication of this command> \
    --auth.password=<password of auth.user> \
    3h
```
The last argument is the period that ArangoSync should hold-of resynchronization for.
This can be minutes (e.g. `15m`) or hours (e.g. `3h`).

If maintenance is taking longer than expected, you can use the same command the extend
the hold of period (e.g. to `4h`).

After the maintenance, use the same command restore the hold of period to its default of `1h`.

### What to do in case of a document that exceeds the message queue limits.

If you insert/update a document in a collection and the size of that document
is larger than the maximum message size of your message queue, the collection
will no longer be able to synchronize. It will go into a `failed` state.

To recover from that, first remove the document from the ArangoDB cluster
in the source datacenter. After that, for each failed shard, run:
```
arangosync reset failed shard \
    --master.endpoint=<endpoints of sync masters in the datacenter> \
    --auth.user=<username used for authentication of this command> \
    --auth.password=<password of auth.user> \
    --database=<name of the database> \
    --collection=<name of the collection> \
    --shard=<index of the shard (starting at 0)>
```

After this command, a new set of tasks will be started to synchronize the shard.
It can take some time for the shard to reach `running` state.

### Metrics

ArangoSync (master & worker) provide metrics that can be used for monitoring the ArangoSync 
solution. These metrics are available using the following HTTPS endpoints:

- GET `/metrics`: Provides metrics in a format supported by Prometheus.
- GET `/metrics.json`: Provides the same metrics in JSON format.

Both endpoints include help information per metrics.

Note: Both endpoints require authentication. Besides the usual authentication methods 
these endpoints are also accessible using a special bearer token specified using the `--monitoring.token`
command line option.

The Prometheus output (`/metrics`) looks like this:
```
...
# HELP arangosync_master_worker_registrations Total number of registrations
# TYPE arangosync_master_worker_registrations counter
arangosync_master_worker_registrations 2
# HELP arangosync_master_worker_storage Number of times worker info is stored, loaded
# TYPE arangosync_master_worker_storage counter
arangosync_master_worker_storage{kind="",op="save",result="success"} 20
arangosync_master_worker_storage{kind="empty",op="load",result="success"} 1
...
```

The JSON output (`/metrics.json`) looks like this:
```
{
  ...
  "arangosync_master_worker_registrations": {
    "help": "Total number of registrations",
    "type": "counter",
    "samples": [
      {
        "value": 2
      }
    ]
  },
  "arangosync_master_worker_storage": {
    "help": "Number of times worker info is stored, loaded",
    "type": "counter",
    "samples": [
      {
        "value": 8,
        "labels": {
          "kind": "",
          "op": "save",
          "result": "success"
        }
      },
      {
        "value": 1,
        "labels": {
          "kind": "empty",
          "op": "load",
          "result": "success"
        }
      }
    ]
  }
  ...
}
```

Hint: To get a list of a metrics and their help information, run:
```
alias jq='docker run --rm -i realguess/jq jq'
curl -sk -u "<user>:<password>" https://<syncmaster-IP>:8629/metrics.json | \
  jq 'with_entries({key: .key, value:.value.help})'
```

## Security

### Firewall settings

The components of ArangoSync use (TCP) network connections to communicate with each other.
Below you'll find an overview of these connections and the TCP ports that should be accessible.

1.  The sync masters must be allowed to connect to the following components
    within the same datacenter:

    - ArangoDB agents and coordinators (default ports: `8531` and `8529`)
    - Kafka brokers (default port `9092`)
    - Sync workers (default port `8729`)

    Additionally the sync masters must be allowed to connect to the sync masters in the other datacenter.

    By default the sync masters will operate on port `8629`.

2.  The sync workers must be allowed to connect to the following components within the same datacenter:

    - ArangoDB coordinators (default port `8529`)
    - Kafka brokers (default port `9092`)
    - Sync masters (default port `8629`)

    By default the sync workers will operate on port `8729`.

    Additionally the sync workers must be allowed to connect to the Kafka brokers in the other datacenter.

3.  Kafka

    The kafka brokers must be allowed to connect to the following components within the same datacenter:

    - Other kafka brokers (default port `9092`)
    - Zookeeper (default ports `2181`, `2888` and `3888`)

    The default port for kafka is `9092`. The default kafka installation will also expose some prometheus
    metrics on port `7071`. To gain more insight into kafka open this port for your prometheus
    installation.

4.  Zookeeper

    The zookeeper agents must be allowed to connect to the following components within the same datacenter:

    - Other zookeeper agents

    The setup here is a bit special as zookeeper uses 3 ports for different operations. All agents need to
    be able to connect to all of these ports.

    By default Zookeeper uses:
    
    - port `2181` for client communication
    - port `2888` for follower communication
    - port `3888` for leader elections

### Certificates 

Digital certificates are used in many places in ArangoSync for both encryption 
and authentication.

<br/> In ArangoSync all network connections are using Transport Layer Security (TLS),
a set of protocols that ensure that all network traffic is encrypted.
For this TLS certificates are used. The server side of the network connection 
offers a TLS certificate. This certificate is (often) verified by the client side of the network
connection, to ensure that the certificate is signed by a trusted Certificate Authority (CA). 
This ensures the integrity of the server.
<br/> In several places additional certificates are used for authentication. In those cases 
the client side of the connection offers a client certificate (on top of an existing TLS connection).
The server side of the connection uses the client certificate to authenticate 
the client and (optionally) decides which rights should be assigned to the client.

Note: ArangoSync does allow the use of certificates signed by a well know CA (eg. verisign)
however it is more convenient (and common) to use your own CA.

#### Formats

All certificates are x509 certificates with a public key, a private key and 
an optional chain of certificates used to sign the certificate (this chain is 
typically provided by the Certificate Authority (CA)).
<br/>Depending on their use, certificates stored in a different format.

The following formats are used:

- Public key only (`.crt`): A file that contains only the public key of 
  a certificate with an optional chain of parent certificates (public keys of certificates
  used to signed the certificate).
  <br/>Since this format contains only public keys, it is not a problem if its contents 
  are exposed. It must still be store it in a safe place to avoid losing it.
- Private key only (`.key`): A file that contains only the private key of a certificate.
  <br/>It is vital to protect these files and store them in a safe place.
- Keyfile with public & private key (`.keyfile`): A file that contains the public key of 
  a certificate, an optional chain of parent certificates and a private key.
  <br/>Since this format also contains a private key, it is vital to protect these files 
  and store them in a safe place.
- Java keystore (`.jks`): A file containing a set of public and private keys.
  <br/>It is possible to protect access to the content of this file using a keystore password.
  <br/>Since this format can contain private keys, it is vital to protect these files 
  and store them in a safe place (even when its content is protected with a keystore password).

#### Creating certificates

ArangoSync provides commands to create all certificates needed.

##### TLS server certificates

To create a certificate used for TLS servers in the **keyfile** format,
you need the public key of the CA (`--cacert`), the private key of 
the CA (`--cakey`) and one or more hostnames (or IP addresses).
Then run:
```
arangosync create tls keyfile \
    --cacert=my-tls-ca.crt --cakey=my-tls-ca.key \
    --host=<hostname> \
    --keyfile=my-tls-cert.keyfile
```
Make sure to store the generated keyfile (`my-tls-cert.keyfile`) in a safe place.

To create a certificate used for TLS servers in the **crt** & **key** format,
you need the public key of the CA (`--cacert`), the private key of 
the CA (`--cakey`) and one or more hostnames (or IP addresses).
Then run:
```
arangosync create tls certificate \
    --cacert=my-tls-ca.crt --cakey=my-tls-ca.key \
    --host=<hostname> \
    --cert=my-tls-cert.crt \
    --key=my-tls-cert.key \
```
Make sure to protect and store the generated files (`my-tls-cert.crt` & `my-tls-cert.key`) in a safe place.

##### Client authentication certificates

To create a certificate used for client authentication in the **keyfile** format,
you need the public key of the CA (`--cacert`), the private key of 
the CA (`--cakey`) and one or more hostnames (or IP addresses) or email addresses.
Then run:
```
arangosync create client-auth keyfile \
    --cacert=my-client-auth-ca.crt --cakey=my-client-auth-ca.key \
    [--host=<hostname> | --email=<emailaddress>] \
    --keyfile=my-client-auth-cert.keyfile 
```
Make sure to protect and store the generated keyfile (`my-client-auth-cert.keyfile`) in a safe place.

##### CA certificates

To create a CA certificate used to **sign TLS certificates**, run:
```
arangosync create tls ca \
    --cert=my-tls-ca.crt --key=my-tls-ca.key 
```
Make sure to protect and store both generated files (`my-tls-ca.crt` & `my-tls-ca.key`) in a safe place.
<br/>Note: CA certificates have a much longer lifetime than normal certificates. 
Therefore even more care is needed to store them safely.

To create a CA certificate used to **sign client authentication certificates**, run:
```
arangosync create client-auth ca \
    --cert=my-client-auth-ca.crt --key=my-client-auth-ca.key 
```
Make sure to protect and store both generated files (`my-client-auth-ca.crt` & `my-client-auth-ca.key`) 
in a safe place.
<br/>Note: CA certificates have a much longer lifetime than normal certificates. 
Therefore even more care is needed to store them safely.

#### Renewing certificates

All certificates have meta information in them the limit their use in function,
target & lifetime.
<br/> A certificate created for client authentication (function) cannot be used as a TLS server certificate
(same is true for the reverse).
<br/> A certificate for host `myserver` (target) cannot be used for host `anotherserver`.
<br/> A certficiate that is valid until October 2017 (limetime) cannot be used after October 2017.

If anything changes in function, target or lifetime you need a new certificate.

The procedure for creating a renewed certificate is the same as for creating a "first" certificate.
<br/> After creating the renewed certificate the process(es) using them have to be updated.
This mean restarting them. All ArangoSync components are designed to support stopping and starting 
single instances, but do not restart more than 1 instance at the same time. 
As soon as 1 instance has been restarted, give it some time to "catch up" before restarting 
the next instance.