
Bug fix/failover with min replication factor (#9486)

* Improve collection time of IResearchQueryOptimizationTest

* Added a minReplicationFactor field in Collections. It is not possible to modify it yet and no one cares about it yet

* Added some assertions on minReplicationFactor

* Transaction API will now reject writes as soon as minimal replication factor is NOT fulfilled

* added minReplicationFactor to the user interface, preparation for the collection api changes

* added minReplicationFactor to VocBaseCollection, RestReplicationHandler, RestCollectionHandler, ClusterMethods, ClusterInfo and ClusterCollectionCreationInfo

* added minReplicationFactor usage to tests

* TODO TEMPORARY COMMIT FOR TESTING PLEASE REVERT ME

* minReplicationFactor now able to change via collection  properties route

* fixed a wrong assert

* added minReplicationFactor to the graph management ui

* added minReplicationFactor to the gharial api

* Fixed off-by-one error in minReplicationFactor. We actually enforced one more.

* adjusted description of minReplicationFactor

* FollowerInfo Refactoring

* added gharial api graph creation tests with minimal replication factor

* proper cleanup of shell collection tests, removed lots of duplicate code, preparation for some new tests

* added collection create tests using invalid/valid names, replicationFactor and minReplicationFactor

* Debug logging

* MORE Debug logging

* Included replication fast lane

* Use correct minReplicationFactor

* modified debug logging

* Fixed compile issues

* MORE Debug logging

* MORE Debug logging

* MORE Debug logging

* MORE Debug logging

* MORE Debug logging

* MORE Debug logging

* MORE Debug logging

* Revert "MORE Debug logging"

This reverts commit dab5af28c0.

* Revert "MORE Debug logging"

This reverts commit 6134b664bd.

* Revert "MORE Debug logging"

This reverts commit 80160bdf3b.

* Revert "MORE Debug logging"

This reverts commit 06aabcdfe1.

* Removed debug output

* Added replication fast lane. Also refactored the commands as I cannot take it any more...

* Put some requests of RocksDBReplication onto CATCHUP Lane.

* Put some requests of MMFilesReplication onto CATCHUP Lane.

* Adjusted Fast and MED lane usage in Supervised scheduler

* Added changelog entry

* Added new features entry

* A new leader will now keep old followers in case of failover

* Update arangod/Cluster/ClusterCollectionCreationInfo.cpp

Co-Authored-By: Tobias Gödderz <tobias@arangodb.com>

* Fixed JSLINT

* Unified lane handling of replication handlers

* Sorry forgotten in last commit

* replaced strings with static strings

* more use of static strings

* optimized min repl description in the ui

* decr initial loop variable

* clean up of the createWithId test

* more use of static strings

* Update js/apps/system/_admin/aardvark/APP/frontend/js/views/collectionsView.js

Co-Authored-By: Tobias Gödderz <tobias@arangodb.com>

* Added some comments on condition, renamed variable as suggested in review

* Added check for min replicationFactor to be non-zero

* Added assertion

* Added function to modify min and max replication factor in one go

* added missing semicolon

* rm log devel

* Added a second piece of information to FollowerInfo that keeps track of followers that were in sync before a failover took place

* Maintenance now reports the previous version to FollowerInfo instead of lying by itself. FollowerInfo now gets a failover-safe mode to report in-sync followers

* check replicationFactor against the number of DBServers

* Add lie reporting in CURRENT

* Reverted most of my recent commits about Failover situation. The intended plan simply does not work out

* move replication checks from logical collection to rest collection handler

* added more replication tests

* Include assert only if we are not in gtest

* jslint

* set min repl factor to zero if satellite collection

* check replication attributes in v8 collection

* Initial commit, old plan, does not yet work

* fixed ires tests

* Included FailoverCandidates key. Not fully implemented

* fixed wrong assert

* unified in sync follower reporting

* fixed compiler errors

* Cleanup locking, and fixed potential deadlocks

* Comments about locking order in FollowerInfo.

* properly check uint

* Keep old leader as potential failover candidate

* Transaction methods now use followerInfo to check if the leader can write; this might have the side effect that 'failoverCandidates' are updated

* Let agency check failoverCandidates if possible

* Initialize member variables

* Use unified follower reporting in DBServerAgencySync

* Removed obsolete variable, collecting it somewhere else

* repl factor attr check

* Reimplemented previous followers, second attempt now. PhaseOne and PhaseTwo can now synchronize on current.

* Fixed assertion, forgot an off-by-one

* adjusted test to be more precise now

* Fixed failover candidates list

* Disable write on dropping too many followers

* Allow running updateFailoverCandidates multiple times with the same leader.

* Final fixes, resilience tests now green, crossing fingers for jenkins

* Fixed race on atomics comparison

* Fixed invalid number type

* added nullptr handling

* added nullptr handling

* Removed invalid assert

* Make takeover of leadership an atomic operation

* Update tests/js/common/shell/shell-cluster-collection.js

Co-Authored-By: Tobias Gödderz <tobias@arangodb.com>

* Review fixes

* Fixed creation code to use takeoverLeadership

* Update arangod/Cluster/FollowerInfo.h

Co-Authored-By: Tobias Gödderz <tobias@arangodb.com>

* Applied review fixes

* There is no timeout

* Moved AQL + Pregel to INTERNAL_AQL lane, which is medium priority, to avoid deadlocks with Sync replication

* More review fixes

* Use difference if you want to compare two vectors...

* Use std::string ...

* Now check if we are in recovery mode

* Added documentation for minReplicationFactor

* Added README update in the documentation as well
Michael Hackstein 2019-07-19 15:00:30 +02:00 committed by GitHub
parent c922c5f133
commit 36b1d290a9
27 changed files with 722 additions and 310 deletions

View File

@ -38,6 +38,10 @@ factor. The number of _followers_ can be controlled using the
`replicationFactor` parameter is the total number of copies being
kept, that is, it is one plus the number of _followers_.
In addition to the `replicationFactor` there is a `minReplicationFactor`
that locks down a collection as soon as too many followers have been lost.
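
For illustration, a minimal arangosh sketch of this behavior (collection name, factor values, and the failure scenario are hypothetical and assume a running cluster):

```js
// Create a collection with 3 copies per shard, of which at least 2 must be in sync.
db._create("orders", { numberOfShards: 3, replicationFactor: 3, minReplicationFactor: 2 });

// Writes succeed while at least minReplicationFactor copies of the target shard are in sync.
db.orders.insert({ value: 1 });

// If failures leave a shard with fewer than 2 in-sync copies, further writes to
// that shard are rejected until enough followers have caught up again.
try {
  db.orders.insert({ value: 2 });
} catch (err) {
  print("write rejected: " + err.errorMessage);
}
```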
Asynchronous replication
------------------------

View File

@ -164,6 +164,15 @@ to the [naming conventions](../NamingConventions/README.md).
dramatically when using joins in AQL at the costs of reduced write
performance on these collections.
- *minReplicationFactor* (optional, default is 1): in a cluster, this
attribute determines how many copies of each shard are required
to be in sync on the different DBServers. If fewer than this many
copies are in sync in the cluster, a shard will refuse writes. The
minReplicationFactor cannot be larger than replicationFactor.
Please note: during server failures this might lead to writes
not being possible until the failover is sorted out, and it might cause
write slowdowns in exchange for data durability (see the sketch after this list).
- *distributeShardsLike*: distribute the shards of this collection
cloning the shard distribution of another. If this value is set,
it will copy the attributes *replicationFactor*, *numberOfShards* and
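
A hedged sketch of the creation options described above (collection names are illustrative):

```js
// Valid: minReplicationFactor defaults to 1 and may not exceed replicationFactor.
db._create("items", { replicationFactor: 3, minReplicationFactor: 2 });

// Invalid: minReplicationFactor larger than replicationFactor is rejected.
try {
  db._create("badItems", { replicationFactor: 2, minReplicationFactor: 3 });
} catch (err) {
  print(err.errorMessage);
}
```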

View File

@ -43,6 +43,10 @@ determine the target shard for documents; *Cluster specific attribute.*
@RESTSTRUCT{replicationFactor,collection_info,integer,optional,}
contains how many copies of each shard are kept on different DBServers.; *Cluster specific attribute.*
@RESTSTRUCT{minReplicationFactor,collection_info,integer,optional,}
contains the minimum number of copies of each shard that have to be kept on different DBServers.
A shard will refuse writes if fewer than this many copies are in sync.; *Cluster specific attribute.*
@RESTSTRUCT{shardingStrategy,collection_info,string,optional,}
the sharding strategy selected for the collection; *Cluster specific attribute.*
One of 'hash' or 'enterprise-hash-smart-edge'
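
As an illustration of these response attributes, a hypothetical arangosh call (the collection name is made up):

```js
// Fetch the cluster properties of a collection via the REST API.
var info = arango.GET("/_api/collection/items/properties");

// In a cluster the response includes, among other attributes:
//   info.replicationFactor    -> total number of copies kept per shard
//   info.minReplicationFactor -> minimum number of in-sync copies required for writes
print(info.replicationFactor, info.minReplicationFactor);
```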

View File

@ -25,6 +25,11 @@ concurrent modifications to this graph.
@RESTSTRUCT{replicationFactor,graph_representation,integer,required,}
The replication factor used for every new collection in the graph.
@RESTSTRUCT{minReplicationFactor,graph_representation,integer,optional,}
The minimal replication factor used for every new collection in the graph.
If one shard has fewer than this many in-sync copies, writes to that shard are
rejected, while all other shards remain writable.
@RESTSTRUCT{isSmart,graph_representation,boolean,required,}
Flag if the graph is a SmartGraph (Enterprise Edition only) or not.

View File

@ -42,6 +42,11 @@ Cannot be modified later.
@RESTSTRUCT{replicationFactor,post_api_gharial_create_opts,integer,required,}
The replication factor used when initially creating collections for this graph.
@RESTSTRUCT{minReplicationFactor,post_api_gharial_create_opts,integer,optional,}
The minimal replication factor used for every new collection in the graph.
If one shard has fewer than this many in-sync copies, writes to that shard are
rejected, while all other shards remain writable.
@RESTRETURNCODES
@RESTRETURNCODE{201}
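
A hedged example of passing these options when creating a graph (graph and collection names are illustrative):

```js
// Create a graph whose collections are created with replicationFactor 3
// and minReplicationFactor 2.
var body = {
  name: "routeGraph",
  edgeDefinitions: [
    { collection: "routes", from: ["airports"], to: ["airports"] }
  ],
  options: { replicationFactor: 3, minReplicationFactor: 2 }
};
var result = arango.POST("/_api/gharial", JSON.stringify(body));
print(result.code);  // expect a 2xx code on success
```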

View File

@ -52,7 +52,11 @@ In a cluster setup, the result will also contain the following attributes:
determine the target shard for documents.
* *replicationFactor*: determines how many copies of each shard are kept
on different DBServers.
on different DBServers. Has to be in the range of 1-10 *(Cluster only)*
* *minReplicationFactor*: determines the minimum number of shard copies kept on
different DBServers; a shard will refuse writes if fewer than this many
copies are in sync. Has to be in the range of 1-replicationFactor *(Cluster only)*
* *shardingStrategy*: the sharding strategy selected for the collection.
This attribute will only be populated in cluster mode and is not populated
@ -77,6 +81,10 @@ one or more of the following attribute(s):
different DBServers, valid values are integer numbers
in the range of 1-10 *(Cluster only)*
* *minReplicationFactor*: Change the minimum number of shard copies kept on
different DBServers; a shard will refuse writes if fewer than this many
copies are in sync. Has to be in the range of 1-replicationFactor *(Cluster only)*
**Note**: some other collection properties, such as *type*, *isVolatile*,
*keyOptions*, *numberOfShards* or *shardingStrategy* cannot be changed once
the collection is created.
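
A hedged sketch of changing this property on an existing collection (names are illustrative):

```js
// Raise the requirement: at least 2 copies of every shard must be in sync for writes.
db.items.properties({ minReplicationFactor: 2 });

// Relax it again to a single copy (the leader alone suffices).
db.items.properties({ minReplicationFactor: 1 });
```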

View File

@ -95,21 +95,22 @@ bool Job::finish(std::string const& server, std::string const& shard,
try {
jobType = pending.slice()[0].get("type").copyString();
} catch (std::exception const&) {
LOG_TOPIC("76352", WARN, Logger::AGENCY) << "Failed to obtain type of job " << _jobId;
LOG_TOPIC("76352", WARN, Logger::AGENCY)
<< "Failed to obtain type of job " << _jobId;
}
// Additional payload, which is to be executed in the finish transaction
Slice operations = Slice::emptyObjectSlice();
Slice preconditions = Slice::emptyObjectSlice();
Slice preconditions = Slice::emptyObjectSlice();
if (payload != nullptr) {
Slice slice = payload->slice();
TRI_ASSERT(slice.isObject() || slice.isArray());
if (slice.isObject()) { // opers only
if (slice.isObject()) { // opers only
operations = slice;
TRI_ASSERT(operations.isObject());
} else {
TRI_ASSERT(slice.length() < 3); // opers + precs only
TRI_ASSERT(slice.length() < 3); // opers + precs only
if (slice.length() > 0) {
operations = slice[0];
TRI_ASSERT(operations.isObject());
@ -125,7 +126,7 @@ bool Job::finish(std::string const& server, std::string const& shard,
{
VPackArrayBuilder guard(&finished);
{ // operations --
{ // operations --
VPackObjectBuilder operguard(&finished);
addPutJobIntoSomewhere(finished, success ? "Finished" : "Failed",
@ -148,15 +149,14 @@ bool Job::finish(std::string const& server, std::string const& shard,
addReleaseShard(finished, shard);
}
} // -- operations
} // -- operations
if (preconditions.isObject() && preconditions.length() > 0) { // preconditions --
if (preconditions.isObject() && preconditions.length() > 0) { // preconditions --
VPackObjectBuilder precguard(&finished);
for (auto const& prec : VPackObjectIterator(preconditions)) {
finished.add(prec.key.copyString(), prec.value);
}
} // -- preconditions
} // -- preconditions
}
write_ret_t res = singleWriteTransaction(_agent, finished, false);
@ -168,16 +168,16 @@ bool Job::finish(std::string const& server, std::string const& shard,
}
} catch (std::exception const& e) {
LOG_TOPIC("1fead", WARN, Logger::AGENCY)
<< "Caught exception in finish, message: " << e.what();
<< "Caught exception in finish, message: " << e.what();
} catch (...) {
LOG_TOPIC("7762f", WARN, Logger::AGENCY)
<< "Caught unspecified exception in finish.";
<< "Caught unspecified exception in finish.";
}
return false;
}
std::string Job::randomIdleAvailableServer(Node const& snap,
std::vector<std::string> const& exclude) {
std::vector<std::string> const& exclude) {
std::vector<std::string> as = availableServers(snap);
std::string ret;
@ -189,11 +189,11 @@ std::string Job::randomIdleAvailableServer(Node const& snap,
for (auto const& srv : snap.hasAsChildren(healthPrefix).first) {
// ignore excluded servers
if (std::find(std::begin(exclude), std::end(exclude), srv.first) != std::end(exclude)) {
continue ;
continue;
}
// ignore servers not in availableServers above:
if (std::find(std::begin(as), std::end(as), srv.first) == std::end(as)) {
continue ;
continue;
}
std::string const& status = (*srv.second).hasAsString("Status").first;
@ -242,7 +242,7 @@ size_t Job::countGoodOrBadServersInList(Node const& snap, VPackSlice const& serv
auto const& health = snap.hasAsChildren(healthPrefix);
// Do we have a Health substructure?
if (health.second) {
Node::Children const& healthData = health.first; // List of servers in Health
Node::Children const& healthData = health.first; // List of servers in Health
for (VPackSlice const serverName : VPackArrayIterator(serverList)) {
if (serverName.isString()) {
// serverName not a string? Then don't count
@ -269,13 +269,14 @@ size_t Job::countGoodOrBadServersInList(Node const& snap, VPackSlice const& serv
}
// The following counts in a given server list how many of the servers are
// in Status "GOOD" or "BAD".
size_t Job::countGoodOrBadServersInList(Node const& snap, std::vector<std::string> const& serverList) {
// in Status "GOOD" or "BAD".
size_t Job::countGoodOrBadServersInList(Node const& snap,
std::vector<std::string> const& serverList) {
size_t count = 0;
auto const& health = snap.hasAsChildren(healthPrefix);
// Do we have a Health substructure?
if (health.second) {
Node::Children const& healthData = health.first; // List of servers in Health
Node::Children const& healthData = health.first; // List of servers in Health
for (auto& serverStr : serverList) {
// Now look up this server:
auto it = healthData.find(serverStr);
@ -294,7 +295,8 @@ size_t Job::countGoodOrBadServersInList(Node const& snap, std::vector<std::strin
}
/// @brief Check if a server is cleaned or to be cleaned out:
bool Job::isInServerList(Node const& snap, std::string const& prefix, std::string const& server, bool isArray) {
bool Job::isInServerList(Node const& snap, std::string const& prefix,
std::string const& server, bool isArray) {
VPackSlice slice;
bool found = false;
if (isArray) {
@ -309,7 +311,7 @@ bool Job::isInServerList(Node const& snap, std::string const& prefix, std::strin
}
}
} else { // an object
auto const& children = snap.hasAsChildren(prefix);
auto const& children = snap.hasAsChildren(prefix);
if (children.second) {
for (auto const& srv : children.first) {
if (srv.first == server) {
@ -418,16 +420,15 @@ std::vector<Job::shard_t> Job::clones(Node const& snapshot, std::string const& d
for (const auto& colptr : snapshot.hasAsChildren(databasePath).first) { // collections
auto const &col = *colptr.second;
auto const &otherCollection = colptr.first;
auto const& col = *colptr.second;
auto const& otherCollection = colptr.first;
if (otherCollection != collection && col.has("distributeShardsLike") && // use .has() form to prevent logging of missing
col.hasAsSlice("distributeShardsLike").first.copyString() == collection) {
auto const& theirshards = sortedShardList(col.hasAsNode("shards").first);
if (theirshards.size() > 0) { // do not care about virtual collections
if (theirshards.size() == myshards.size()) {
ret.emplace_back(otherCollection,
theirshards[steps]);
ret.emplace_back(otherCollection, theirshards[steps]);
} else {
LOG_TOPIC("3092e", ERR, Logger::SUPERVISION)
<< "Shard distribution of clone(" << otherCollection
@ -452,10 +453,11 @@ std::string Job::findNonblockedCommonHealthyInSyncFollower( // Which is in "GOO
std::unordered_map<std::string, size_t> currentServers;
for (const auto& clone : cs) {
auto currentShardPath = curColPrefix + db + "/" + clone.collection + "/" +
clone.shard + "/servers";
auto plannedShardPath =
planColPrefix + db + "/" + clone.collection + "/shards/" + clone.shard;
auto sharedPath = db + "/" + clone.collection + "/";
auto currentShardPath = curColPrefix + sharedPath + clone.shard + "/servers";
auto currentFailoverCandidatesPath =
curColPrefix + sharedPath + clone.shard + "/servers";
auto plannedShardPath = planColPrefix + sharedPath + "shards/" + clone.shard;
size_t i = 0;
// start up race condition ... current might not have everything in plan
@ -464,13 +466,30 @@ std::string Job::findNonblockedCommonHealthyInSyncFollower( // Which is in "GOO
continue;
} // if
for (const auto& server :
VPackArrayIterator(snap.hasAsArray(currentShardPath).first)) {
auto id = server.copyString();
bool isArray = false;
VPackSlice serverList;
// If we do have failover candidates, we should use them
std::tie(serverList, isArray) = snap.hasAsArray(currentFailoverCandidatesPath);
if (!isArray) {
// We have old DBServers that do not report failover candidates,
// Need to rely on current
std::tie(serverList, isArray) = snap.hasAsArray(currentShardPath);
TRI_ASSERT(isArray);
if (!isArray) {
THROW_ARANGO_EXCEPTION_MESSAGE(
TRI_ERROR_SUPERVISION_GENERAL_FAILURE,
"Could not find common insync server for: " + currentShardPath +
", value is not an array.");
}
}
// Guaranteed by the if above
TRI_ASSERT(serverList.isArray());
for (const auto& server : VPackArrayIterator(serverList)) {
if (i++ == 0) {
// Skip leader
continue;
}
auto id = server.copyString();
if (!good[id]) {
// Skip unhealthy servers
@ -550,9 +569,9 @@ bool Job::abortable(Node const& snapshot, std::string const& jobId) {
return false;
}
void Job::doForAllShards(Node const& snapshot, std::string& database,
std::vector<shard_t>& shards,
std::function<void(Slice plan, Slice current, std::string& planPath, std::string& curPath)> worker) {
void Job::doForAllShards(
Node const& snapshot, std::string& database, std::vector<shard_t>& shards,
std::function<void(Slice plan, Slice current, std::string& planPath, std::string& curPath)> worker) {
for (auto const& collShard : shards) {
std::string shard = collShard.shard;
std::string collection = collShard.collection;

View File

@ -49,9 +49,7 @@ class RestAqlHandler : public RestVocbaseBaseHandler {
public:
char const* name() const override final { return "RestAqlHandler"; }
RequestLane lane() const override final {
return RequestLane::CLUSTER_INTERNAL;
}
RequestLane lane() const override final { return RequestLane::CLUSTER_AQL; }
RestStatus execute() override;
RestStatus continueExecute() override;

View File

@ -165,6 +165,30 @@ class CollectionInfoCurrent {
return v;
}
//////////////////////////////////////////////////////////////////////////////
/// @brief returns the current failover candidates for the given shard
//////////////////////////////////////////////////////////////////////////////
TEST_VIRTUAL std::vector<ServerID> failoverCandidates(ShardID const& shardID) const {
std::vector<ServerID> v;
auto it = _vpacks.find(shardID);
if (it != _vpacks.end()) {
VPackSlice slice = it->second->slice();
VPackSlice servers = slice.get(StaticStrings::FailoverCandidates);
if (servers.isArray()) {
for (auto const& server : VPackArrayIterator(servers)) {
TRI_ASSERT(server.isString());
if (server.isString()) {
v.push_back(server.copyString());
}
}
}
}
return v;
}
//////////////////////////////////////////////////////////////////////////////
/// @brief returns the errorMessage entry for one shardID
//////////////////////////////////////////////////////////////////////////////

View File

@ -93,7 +93,8 @@ CreateCollection::CreateCollection(MaintenanceFeature& feature, ActionDescriptio
TRI_ASSERT(type == TRI_COL_TYPE_DOCUMENT || type == TRI_COL_TYPE_EDGE);
if (!error.str().empty()) {
LOG_TOPIC("7c60f", ERR, Logger::MAINTENANCE) << "CreateCollection: " << error.str();
LOG_TOPIC("7c60f", ERR, Logger::MAINTENANCE)
<< "CreateCollection: " << error.str();
_result.reset(TRI_ERROR_INTERNAL, error.str());
setState(FAILED);
}
@ -156,10 +157,12 @@ bool CreateCollection::first() {
LOG_TOPIC("9db9a", DEBUG, Logger::MAINTENANCE)
<< "local collection " << database << "/"
<< shard << " successfully created";
col->followers()->setTheLeader(leader);
if (leader.empty()) {
col->followers()->clear();
std::vector<std::string> noFollowers;
col->followers()->takeOverLeadership(noFollowers);
} else {
col->followers()->setTheLeader(leader);
}
});
@ -186,7 +189,7 @@ bool CreateCollection::first() {
}
LOG_TOPIC("4562c", DEBUG, Logger::MAINTENANCE)
<< "Create collection done, notifying Maintenance";
<< "Create collection done, notifying Maintenance";
notify();

View File

@ -70,7 +70,8 @@ Result DBServerAgencySync::getLocalCollections(VPackBuilder& collections) {
}
if (dbfeature == nullptr) {
LOG_TOPIC("d0ef2", ERR, Logger::HEARTBEAT) << "Failed to get feature database";
LOG_TOPIC("d0ef2", ERR, Logger::HEARTBEAT)
<< "Failed to get feature database";
return Result(TRI_ERROR_INTERNAL, "Failed to get feature database");
}
@ -80,9 +81,7 @@ Result DBServerAgencySync::getLocalCollections(VPackBuilder& collections) {
if (!vocbase.use()) {
return;
}
auto unuse = scopeGuard([&vocbase] {
vocbase.release();
});
auto unuse = scopeGuard([&vocbase] { vocbase.release(); });
collections.add(VPackValue(vocbase.name()));
@ -100,9 +99,8 @@ Result DBServerAgencySync::getLocalCollections(VPackBuilder& collections) {
// generate a collection definition identical to that which would be
// persisted in the case of SingleServer
collection->properties(collections,
LogicalDataSource::makeFlags(
LogicalDataSource::Serialize::Detailed,
LogicalDataSource::Serialize::ForPersistence));
LogicalDataSource::makeFlags(LogicalDataSource::Serialize::Detailed,
LogicalDataSource::Serialize::ForPersistence));
auto const& folls = collection->followers();
std::string const theLeader = folls->getLeader();
@ -119,19 +117,7 @@ Result DBServerAgencySync::getLocalCollections(VPackBuilder& collections) {
// we are the leader ourselves
// In this case we report our in-sync followers here in the format
// of the agency: [ leader, follower1, follower2, ... ]
collections.add(VPackValue("servers"));
{
VPackArrayBuilder guard(&collections);
collections.add(VPackValue(arangodb::ServerState::instance()->getId()));
std::shared_ptr<std::vector<ServerID> const> srvs = folls->get();
for (auto const& s : *srvs) {
collections.add(VPackValue(s));
}
}
folls->injectFollowerInfo(collections);
}
}
}
@ -151,14 +137,20 @@ DBServerAgencySyncResult DBServerAgencySync::execute() {
LOG_TOPIC("62fd8", DEBUG, Logger::MAINTENANCE)
<< "DBServerAgencySync::execute starting";
DBServerAgencySyncResult result;
auto* sysDbFeature =
application_features::ApplicationServer::lookupFeature<SystemDatabaseFeature>();
MaintenanceFeature* mfeature =
ApplicationServer::getFeature<MaintenanceFeature>("Maintenance");
if (mfeature == nullptr) {
LOG_TOPIC("3a1f7", ERR, Logger::MAINTENANCE)
<< "Could not load maintenance feature, can happen during shutdown.";
result.success = false;
result.errorMessage = "Could not load maintenance feature";
return result;
}
arangodb::SystemDatabaseFeature::ptr vocbase =
sysDbFeature ? sysDbFeature->use() : nullptr;
DBServerAgencySyncResult result;
if (vocbase == nullptr) {
LOG_TOPIC("18d67", DEBUG, Logger::MAINTENANCE)
@ -196,20 +188,21 @@ DBServerAgencySyncResult DBServerAgencySync::execute() {
VPackObjectBuilder o(&rb);
auto startTimePhaseOne = std::chrono::steady_clock::now();
LOG_TOPIC("19aaf", DEBUG, Logger::MAINTENANCE) << "DBServerAgencySync::phaseOne";
LOG_TOPIC("19aaf", DEBUG, Logger::MAINTENANCE)
<< "DBServerAgencySync::phaseOne";
tmp = arangodb::maintenance::phaseOne(plan->slice(), local.slice(),
serverId, *mfeature, rb);
auto endTimePhaseOne = std::chrono::steady_clock::now();
LOG_TOPIC("93f83", DEBUG, Logger::MAINTENANCE)
<< "DBServerAgencySync::phaseOne done";
if (endTimePhaseOne - startTimePhaseOne >
std::chrono::milliseconds(200)) {
if (endTimePhaseOne - startTimePhaseOne > std::chrono::milliseconds(200)) {
// We take this as indication that many shards are in the system,
// in this case: give some asynchronous jobs created in phaseOne a
// chance to complete before we collect data for phaseTwo:
LOG_TOPIC("ef730", DEBUG, Logger::MAINTENANCE)
<< "DBServerAgencySync::hesitating between phases 1 and 2 for 0.1s...";
<< "DBServerAgencySync::hesitating between phases 1 and 2 for "
"0.1s...";
std::this_thread::sleep_for(std::chrono::milliseconds(100));
}
@ -224,6 +217,8 @@ DBServerAgencySyncResult DBServerAgencySync::execute() {
LOG_TOPIC("675fd", TRACE, Logger::MAINTENANCE)
<< "DBServerAgencySync::phaseTwo - current state: " << current->toJson();
mfeature->increaseCurrentCounter();
local.clear();
glc = getLocalCollections(local);
// We intentionally refetch local collections here, such that phase 2
@ -237,7 +232,8 @@ DBServerAgencySyncResult DBServerAgencySync::execute() {
return result;
}
LOG_TOPIC("652ff", DEBUG, Logger::MAINTENANCE) << "DBServerAgencySync::phaseTwo";
LOG_TOPIC("652ff", DEBUG, Logger::MAINTENANCE)
<< "DBServerAgencySync::phaseTwo";
tmp = arangodb::maintenance::phaseTwo(plan->slice(), current->slice(),
local.slice(), serverId, *mfeature, rb);
@ -246,7 +242,8 @@ DBServerAgencySyncResult DBServerAgencySync::execute() {
<< "DBServerAgencySync::phaseTwo done";
} catch (std::exception const& e) {
LOG_TOPIC("cd308", ERR, Logger::MAINTENANCE) << "Failed to handle plan change: " << e.what();
LOG_TOPIC("cd308", ERR, Logger::MAINTENANCE)
<< "Failed to handle plan change: " << e.what();
}
if (rb.isClosed()) {
@ -268,9 +265,9 @@ DBServerAgencySyncResult DBServerAgencySync::execute() {
if (ao.value.hasKey("precondition")) {
auto const precondition = ao.value.get("precondition");
preconditions.push_back(
AgencyPrecondition(
precondition.keyAt(0).copyString(), AgencyPrecondition::Type::VALUE, precondition.valueAt(0)));
preconditions.push_back(AgencyPrecondition(precondition.keyAt(0).copyString(),
AgencyPrecondition::Type::VALUE,
precondition.valueAt(0)));
}
if (op == "set") {
@ -279,7 +276,6 @@ DBServerAgencySyncResult DBServerAgencySync::execute() {
} else if (op == "delete") {
operations.push_back(AgencyOperation(key, AgencySimpleOperationType::DELETE_OP));
}
}
operations.push_back(AgencyOperation("Current/Version",
AgencySimpleOperationType::INCREMENT_OP));
@ -288,9 +284,8 @@ DBServerAgencySyncResult DBServerAgencySync::execute() {
AgencyCommResult r = comm.sendTransactionWithFailover(currentTransaction);
if (!r.successful()) {
LOG_TOPIC("d73b8", INFO, Logger::MAINTENANCE)
<< "Error reporting to agency: _statusCode: " << r.errorCode()
<< " message: " << r.errorMessage()
<< ". This can be ignored, since it will be retried automatically.";
<< "Error reporting to agency: _statusCode: " << r.errorCode()
<< " message: " << r.errorMessage() << ". This can be ignored, since it will be retried automatically.";
} else {
LOG_TOPIC("9b0b3", DEBUG, Logger::MAINTENANCE)
<< "Invalidating current in ClusterInfo";
@ -317,22 +312,23 @@ DBServerAgencySyncResult DBServerAgencySync::execute() {
result.errorMessage = "Report from phase 1 and 2 was no object.";
try {
std::string json = report.toJson();
LOG_TOPIC("65fde", WARN, Logger::MAINTENANCE) << "Report from phase 1 and 2 was: " << json;
} catch(std::exception const& exc) {
LOG_TOPIC("65fde", WARN, Logger::MAINTENANCE)
<< "Report from phase 1 and 2 was: " << json;
} catch (std::exception const& exc) {
LOG_TOPIC("54de2", WARN, Logger::MAINTENANCE)
<< "Report from phase 1 and 2 could not be dumped to JSON, error: "
<< exc.what() << ", head byte:" << report.head();
<< "Report from phase 1 and 2 could not be dumped to JSON, error: "
<< exc.what() << ", head byte:" << report.head();
uint64_t l = 0;
try {
l = report.byteSize();
LOG_TOPIC("54dda", WARN, Logger::MAINTENANCE)
<< "Report from phase 1 and 2, byte size: " << l;
<< "Report from phase 1 and 2, byte size: " << l;
LOG_TOPIC("67421", WARN, Logger::MAINTENANCE)
<< "Bytes: "
<< arangodb::basics::StringUtils::encodeHex((char const*) report.start(), l);
} catch(...) {
<< "Bytes: "
<< arangodb::basics::StringUtils::encodeHex((char const*)report.start(), l);
} catch (...) {
LOG_TOPIC("76124", WARN, Logger::MAINTENANCE)
<< "Report from phase 1 and 2, byte size throws.";
<< "Report from phase 1 and 2, byte size throws.";
}
}
}
@ -342,9 +338,10 @@ DBServerAgencySyncResult DBServerAgencySync::execute() {
auto took = duration<double>(clock::now() - start).count();
if (took > 30.0) {
LOG_TOPIC("83cb8", WARN, Logger::MAINTENANCE) << "DBServerAgencySync::execute "
"took "
<< took << " s to execute handlePlanChange";
LOG_TOPIC("83cb8", WARN, Logger::MAINTENANCE)
<< "DBServerAgencySync::execute "
"took "
<< took << " s to execute handlePlanChange";
}
return result;

View File

@ -25,6 +25,7 @@
#include "FollowerInfo.h"
#include "ApplicationFeatures/ApplicationServer.h"
#include "Cluster/MaintenanceStrings.h"
#include "Cluster/ServerState.h"
#include "VocBase/LogicalCollection.h"
@ -32,56 +33,12 @@
using namespace arangodb;
////////////////////////////////////////////////////////////////////////////////
/// @brief change JSON under
/// Current/Collection/<DB-name>/<Collection-ID>/<shard-ID>
/// to add or remove a serverID, if add flag is true, the entry is added
/// (if it is not yet there), otherwise the entry is removed (if it was
/// there).
////////////////////////////////////////////////////////////////////////////////
static VPackBuilder newShardEntry(VPackSlice oldValue, ServerID const& sid, bool add) {
VPackBuilder newValue;
VPackSlice servers;
{
VPackObjectBuilder b(&newValue);
// Now need to find the `servers` attribute, which is a list:
for (auto const& it : VPackObjectIterator(oldValue)) {
if (it.key.isEqualString("servers")) {
servers = it.value;
} else {
newValue.add(it.key);
newValue.add(it.value);
}
}
newValue.add(VPackValue("servers"));
if (servers.isArray() && servers.length() > 0) {
VPackArrayBuilder bb(&newValue);
newValue.add(servers[0]);
VPackArrayIterator it(servers);
bool done = false;
for (++it; it.valid(); ++it) {
if ((*it).isEqualString(sid)) {
if (add) {
newValue.add(*it);
done = true;
}
} else {
newValue.add(*it);
}
}
if (add && !done) {
newValue.add(VPackValue(sid));
}
} else {
VPackArrayBuilder bb(&newValue);
newValue.add(VPackValue(ServerState::instance()->getId()));
if (add) {
newValue.add(VPackValue(sid));
}
}
static std::string const inline reportName(bool isRemove) {
if (isRemove) {
return "FollowerInfo::remove";
} else {
return "FollowerInfo::add";
}
return newValue;
}
static std::string CurrentShardPath(arangodb::LogicalCollection& col) {
@ -136,6 +93,15 @@ Result FollowerInfo::add(ServerID const& sid) {
v = std::make_shared<std::vector<ServerID>>(*_followers);
v->push_back(sid); // add a single entry
_followers = v; // will cast to std::vector<ServerID> const
{
// insertIntoCandidates
if (std::find(_failoverCandidates->begin(), _failoverCandidates->end(), sid) ==
_failoverCandidates->end()) {
auto nextCandidates = std::make_shared<std::vector<ServerID>>(*_failoverCandidates);
nextCandidates->push_back(sid); // add a single entry
_failoverCandidates = nextCandidates; // will cast to std::vector<ServerID> const
}
}
#ifdef DEBUG_SYNC_REPLICATION
if (!AgencyCommManager::MANAGER) {
return {TRI_ERROR_NO_ERROR};
@ -144,23 +110,15 @@ Result FollowerInfo::add(ServerID const& sid) {
}
// Now tell the agency
TRI_ASSERT(_docColl != nullptr);
std::string curPath = CurrentShardPath(*_docColl);
std::string planPath = PlanShardPath(*_docColl);
AgencyComm ac;
double startTime = TRI_microtime();
do {
AgencyReadTransaction trx(std::vector<std::string>(
{AgencyCommManager::path(planPath), AgencyCommManager::path(curPath)}));
AgencyCommResult res = ac.sendTransactionWithFailover(trx);
if (res.successful()) {
TRI_ASSERT(res.slice().isArray() && res.slice().length() == 1);
VPackSlice resSlice = res.slice()[0];
// Let's look at the results, note that both can be None!
velocypack::Slice planEntry = PlanShardEntry(*_docColl, resSlice);
velocypack::Slice currentEntry = CurrentShardEntry(*_docColl, resSlice);
auto agencyRes = persistInAgency(false);
if (agencyRes.ok() || agencyRes.is(TRI_ERROR_CLUSTER_NOT_LEADER)) {
// Not a leader is expected
return agencyRes;
}
// Real error, report
<<<<<<< HEAD
=======
if (!currentEntry.isObject()) {
LOG_TOPIC("b753d", ERR, Logger::CLUSTER)
<< "FollowerInfo::add, did not find object in " << curPath;
@ -210,14 +168,15 @@ Result FollowerInfo::add(ServerID const& sid) {
int errorCode = (application_features::ApplicationServer::isStopping())
? TRI_ERROR_SHUTTING_DOWN
: TRI_ERROR_CLUSTER_AGENCY_COMMUNICATION_FAILED;
>>>>>>> c922c5f1332482ef29dff794d8af394d31c1b737
std::string errorMessage =
"unable to add follower in agency, timeout in agency CAS operation for "
"key " +
_docColl->vocbase().name() + "/" + std::to_string(_docColl->planId()) +
": " + TRI_errno_string(errorCode);
": " + TRI_errno_string(agencyRes.errorNumber());
LOG_TOPIC("6295b", ERR, Logger::CLUSTER) << errorMessage;
return {errorCode, std::move(errorMessage)};
agencyRes.reset(agencyRes.errorNumber(), std::move(errorMessage));
return agencyRes;
}
////////////////////////////////////////////////////////////////////////////////
@ -246,46 +205,192 @@ Result FollowerInfo::remove(ServerID const& sid) {
<< "Removing follower " << sid << " from " << _docColl->name();
MUTEX_LOCKER(locker, _agencyMutex);
WRITE_LOCKER(canWriteLocker, _canWriteLock);
WRITE_LOCKER(writeLocker, _dataLock); // the data lock has to be locked until this function completes
// because if the agency communication does not work
// local data is modified again.
// First check if there is anything to do:
bool found = false;
for (auto const& s : *_followers) {
if (s == sid) {
found = true;
break;
}
}
if (!found) {
if (std::find(_followers->begin(), _followers->end(), sid) == _followers->end()) {
TRI_ASSERT(std::find(_failoverCandidates->begin(), _failoverCandidates->end(),
sid) == _failoverCandidates->end());
return {TRI_ERROR_NO_ERROR}; // nothing to do
}
auto v = std::make_shared<std::vector<ServerID>>();
if (_followers->size() > 0) {
// Both lists have to be in sync at any time!
TRI_ASSERT(std::find(_failoverCandidates->begin(), _failoverCandidates->end(),
sid) != _failoverCandidates->end());
auto oldFollowers = _followers;
auto oldFailovers = _failoverCandidates;
{
auto v = std::make_shared<std::vector<ServerID>>();
TRI_ASSERT(!_followers->empty()); // well we found the element above \o/
v->reserve(_followers->size() - 1);
for (auto const& i : *_followers) {
if (i != sid) {
v->push_back(i);
}
}
std::remove_copy(_followers->begin(), _followers->end(),
std::back_inserter(*v.get()), sid);
_followers = v; // will cast to std::vector<ServerID> const
}
{
auto v = std::make_shared<std::vector<ServerID>>();
TRI_ASSERT(!_failoverCandidates->empty()); // well we found the element above \o/
v->reserve(_failoverCandidates->size() - 1);
std::remove_copy(_failoverCandidates->begin(), _failoverCandidates->end(),
std::back_inserter(*v.get()), sid);
_failoverCandidates = v; // will cast to std::vector<ServerID> const
}
auto _oldFollowers = _followers;
_followers = v; // will cast to std::vector<ServerID> const
#ifdef DEBUG_SYNC_REPLICATION
if (!AgencyCommManager::MANAGER) {
return {TRI_ERROR_NO_ERROR};
}
#endif
Result agencyRes = persistInAgency(true);
if (agencyRes.ok()) {
// +1 for the leader (me)
if (_followers->size() + 1 < _docColl->minReplicationFactor()) {
_canWrite = false;
}
// we are finished
LOG_TOPIC("be0cb", DEBUG, Logger::CLUSTER)
<< "Removing follower " << sid << " from " << _docColl->name() << "succeeded";
return agencyRes;
}
if (agencyRes.is(TRI_ERROR_CLUSTER_NOT_LEADER)) {
// Next run in Maintenance will fix this.
return agencyRes;
}
// rollback:
_followers = oldFollowers;
_failoverCandidates = oldFailovers;
std::string errorMessage =
"unable to remove follower from agency, timeout in agency CAS operation "
"for key " +
_docColl->vocbase().name() + "/" + std::to_string(_docColl->planId()) +
": " + TRI_errno_string(agencyRes.errorNumber());
LOG_TOPIC("a0dcc", ERR, Logger::CLUSTER) << errorMessage;
agencyRes.resetErrorMessage<std::string>(std::move(errorMessage));
return agencyRes;
}
//////////////////////////////////////////////////////////////////////////////
/// @brief clear follower list, no changes in agency necessary
//////////////////////////////////////////////////////////////////////////////
void FollowerInfo::clear() {
WRITE_LOCKER(canWriteLocker, _canWriteLock);
WRITE_LOCKER(writeLocker, _dataLock);
_followers = std::make_shared<std::vector<ServerID>>();
_failoverCandidates = std::make_shared<std::vector<ServerID>>();
_canWrite = false;
}
//////////////////////////////////////////////////////////////////////////////
/// @brief check whether the given server is a follower
//////////////////////////////////////////////////////////////////////////////
bool FollowerInfo::contains(ServerID const& sid) const {
READ_LOCKER(readLocker, _dataLock);
auto const& f = *_followers;
return std::find(f.begin(), f.end(), sid) != f.end();
}
////////////////////////////////////////////////////////////////////////////////
/// @brief Take over leadership for this shard.
/// Also inject information about in-sync followers that we knew about
/// before a failover to this server happened
////////////////////////////////////////////////////////////////////////////////
void FollowerInfo::takeOverLeadership(std::vector<std::string> const& previousInsyncFollowers) {
// This function copies over the information taken from the last CURRENT into a local vector.
// Where we remove the old leader and ourself from the list of followers
WRITE_LOCKER(canWriteLocker, _canWriteLock);
WRITE_LOCKER(writeLocker, _dataLock);
// Reset local structures, if we take over leadership we do not know anything!
_followers = std::make_shared<std::vector<ServerID>>();
_failoverCandidates = std::make_shared<std::vector<ServerID>>();
// We disallow writes until the first write.
_canWrite = false;
// Take over leadership
_theLeader = "";
_theLeaderTouched = true;
TRI_ASSERT(_failoverCandidates != nullptr && _failoverCandidates->empty());
if (previousInsyncFollowers.size() > 1) {
auto ourselves = arangodb::ServerState::instance()->getId();
auto failoverCandidates =
std::make_shared<std::vector<ServerID>>(previousInsyncFollowers);
auto myEntry =
std::find(failoverCandidates->begin(), failoverCandidates->end(), ourselves);
// We are a valid failover follower
TRI_ASSERT(myEntry != failoverCandidates->end());
// The first server is a different leader! (For some reason the job can be
// triggered twice) TRI_ASSERT(myEntry != failoverCandidates->begin());
failoverCandidates->erase(myEntry);
// Put us in front, put old leader somewhere, we do not really care
_failoverCandidates = failoverCandidates;
}
}
////////////////////////////////////////////////////////////////////////////////
/// @brief Update the current information in the Agency. We update the failover-
/// list with the newest values, after this the guarantee is that
/// _followers == _failoverCandidates
////////////////////////////////////////////////////////////////////////////////
bool FollowerInfo::updateFailoverCandidates() {
MUTEX_LOCKER(agencyLocker, _agencyMutex);
// Acquire _canWriteLock first
WRITE_LOCKER(canWriteLocker, _canWriteLock);
// Next acquire _dataLock
WRITE_LOCKER(dataLocker, _dataLock);
if (_canWrite) {
// Short circuit, we have multiple writes in the above write lock
// The first needs to do things and flips _canWrite
// All followers can return as soon as the lock is released
#ifdef ARANGODB_ENABLE_MAINTAINER_MODE
TRI_ASSERT(_failoverCandidates->size() == _followers->size());
std::vector<std::string> diff;
std::set_symmetric_difference(_failoverCandidates->begin(),
_failoverCandidates->end(), _followers->begin(),
_followers->end(), std::back_inserter(diff));
TRI_ASSERT(diff.empty());
#endif
return _canWrite;
}
TRI_ASSERT(_followers->size() + 1 >= _docColl->minReplicationFactor());
// Update both lists (we use a copy here, as we are modifying them in other places individually!)
_failoverCandidates = std::make_shared<std::vector<ServerID> const>(*_followers);
// Just be sure
TRI_ASSERT(_failoverCandidates.get() != _followers.get());
TRI_ASSERT(_failoverCandidates->size() == _followers->size());
#ifdef ARANGODB_ENABLE_MAINTAINER_MODE
std::vector<std::string> diff;
std::set_symmetric_difference(_failoverCandidates->begin(),
_failoverCandidates->end(), _followers->begin(),
_followers->end(), std::back_inserter(diff));
TRI_ASSERT(diff.empty());
#endif
Result res = persistInAgency(true);
if (!res.ok()) {
// We could not persist the update in the agency.
// Collection left in RO mode.
LOG_TOPIC("7af00", INFO, Logger::CLUSTER)
<< "Could not persist insync follower for " << _docColl->vocbase().name()
<< "/" << std::to_string(_docColl->planId())
<< " keep RO-mode for now, next write will retry.";
TRI_ASSERT(!_canWrite);
} else {
_canWrite = true;
}
return _canWrite;
}
////////////////////////////////////////////////////////////////////////////////
/// @brief Persist information in Current
////////////////////////////////////////////////////////////////////////////////
Result FollowerInfo::persistInAgency(bool isRemove) const {
// Now tell the agency
TRI_ASSERT(_docColl != nullptr);
std::string curPath = CurrentShardPath(*_docColl);
std::string planPath = PlanShardPath(*_docColl);
AgencyComm ac;
double startTime = TRI_microtime();
do {
AgencyReadTransaction trx(std::vector<std::string>(
{AgencyCommManager::path(planPath), AgencyCommManager::path(curPath)}));
@ -299,7 +404,7 @@ Result FollowerInfo::remove(ServerID const& sid) {
if (!currentEntry.isObject()) {
LOG_TOPIC("01896", ERR, Logger::CLUSTER)
<< "FollowerInfo::remove, did not find object in " << curPath;
<< reportName(isRemove) << ", did not find object in " << curPath;
if (!currentEntry.isNone()) {
LOG_TOPIC("57c84", ERR, Logger::CLUSTER) << "Found: " << currentEntry.toJson();
}
@ -307,16 +412,16 @@ Result FollowerInfo::remove(ServerID const& sid) {
if (!planEntry.isArray() || planEntry.length() == 0 || !planEntry[0].isString() ||
!planEntry[0].isEqualString(ServerState::instance()->getId())) {
LOG_TOPIC("42231", INFO, Logger::CLUSTER)
<< "FollowerInfo::remove, did not find myself in Plan: "
<< _docColl->vocbase().name() << "/"
<< std::to_string(_docColl->planId())
<< reportName(isRemove)
<< ", did not find myself in Plan: " << _docColl->vocbase().name()
<< "/" << std::to_string(_docColl->planId())
<< " (can happen when the leader changed recently).";
if (!planEntry.isNone()) {
LOG_TOPIC("ffede", INFO, Logger::CLUSTER) << "Found: " << planEntry.toJson();
}
return {TRI_ERROR_CLUSTER_NOT_LEADER};
} else {
auto newValue = newShardEntry(currentEntry, sid, false);
auto newValue = newShardEntry(currentEntry);
AgencyWriteTransaction trx;
trx.preconditions.push_back(
AgencyPrecondition(curPath, AgencyPrecondition::Type::VALUE, currentEntry));
@ -328,19 +433,21 @@ Result FollowerInfo::remove(ServerID const& sid) {
AgencyOperation("Current/Version", AgencySimpleOperationType::INCREMENT_OP));
AgencyCommResult res2 = ac.sendTransactionWithFailover(trx);
if (res2.successful()) {
// we are finished
LOG_TOPIC("be0cb", DEBUG, Logger::CLUSTER)
<< "Removing follower " << sid << " from " << _docColl->name()
<< "succeeded";
return {TRI_ERROR_NO_ERROR};
}
}
}
} else {
LOG_TOPIC("b7333", WARN, Logger::CLUSTER)
<< "FollowerInfo::remove, could not read " << planPath << " and "
<< reportName(isRemove) << ", could not read " << planPath << " and "
<< curPath << " in agency.";
}
<<<<<<< HEAD
using namespace std::chrono_literals;
std::this_thread::sleep_for(500ms);
} while (!application_features::ApplicationServer::isStopping());
return TRI_ERROR_SHUTTING_DOWN;
=======
std::this_thread::sleep_for(std::chrono::milliseconds(500));
} while (TRI_microtime() < startTime + 7200 &&
!application_features::ApplicationServer::isStopping());
@ -367,23 +474,58 @@ Result FollowerInfo::remove(ServerID const& sid) {
LOG_TOPIC("a0dcc", ERR, Logger::CLUSTER) << errorMessage;
return {errorCode, std::move(errorMessage)};
>>>>>>> c922c5f1332482ef29dff794d8af394d31c1b737
}
//////////////////////////////////////////////////////////////////////////////
/// @brief clear follower list, no changes in agency necessary
//////////////////////////////////////////////////////////////////////////////
////////////////////////////////////////////////////////////////////////////////
/// @brief inject the information about "servers" and "failoverCandidates"
////////////////////////////////////////////////////////////////////////////////
void FollowerInfo::clear() {
WRITE_LOCKER(writeLocker, _dataLock);
_followers = std::make_shared<std::vector<ServerID>>();
void FollowerInfo::injectFollowerInfoInternal(VPackBuilder& builder) const {
auto ourselves = arangodb::ServerState::instance()->getId();
TRI_ASSERT(builder.isOpenObject());
builder.add(VPackValue(maintenance::SERVERS));
{
VPackArrayBuilder bb(&builder);
builder.add(VPackValue(ourselves));
for (auto const& f : *_followers) {
builder.add(VPackValue(f));
}
}
builder.add(VPackValue(StaticStrings::FailoverCandidates));
{
VPackArrayBuilder bb(&builder);
builder.add(VPackValue(ourselves));
for (auto const& f : *_failoverCandidates) {
builder.add(VPackValue(f));
}
}
TRI_ASSERT(builder.isOpenObject());
}
//////////////////////////////////////////////////////////////////////////////
/// @brief check whether the given server is a follower
//////////////////////////////////////////////////////////////////////////////
////////////////////////////////////////////////////////////////////////////////
/// @brief change JSON under
/// Current/Collection/<DB-name>/<Collection-ID>/<shard-ID>
/// to add or remove a serverID, if add flag is true, the entry is added
/// (if it is not yet there), otherwise the entry is removed (if it was
/// there).
////////////////////////////////////////////////////////////////////////////////
bool FollowerInfo::contains(ServerID const& sid) const {
READ_LOCKER(readLocker, _dataLock);
auto const& f = *_followers;
return std::find(f.begin(), f.end(), sid) != f.end();
VPackBuilder FollowerInfo::newShardEntry(VPackSlice oldValue) const {
VPackBuilder newValue;
TRI_ASSERT(oldValue.isObject());
{
VPackObjectBuilder b(&newValue);
// Copy all but SERVERS and FailoverCandidates.
// They will be injected later.
for (auto const& it : VPackObjectIterator(oldValue)) {
if (!it.key.isEqualString(maintenance::SERVERS) &&
!it.key.isEqualString(StaticStrings::FailoverCandidates)) {
newValue.add(it.key);
newValue.add(it.value);
}
}
injectFollowerInfoInternal(newValue);
}
return newValue;
}

View File

@ -25,11 +25,15 @@
#ifndef ARANGOD_CLUSTER_FOLLOWER_INFO_H
#define ARANGOD_CLUSTER_FOLLOWER_INFO_H 1
#include "ClusterInfo.h"
#include "Basics/Mutex.h"
#include "Basics/ReadWriteLock.h"
#include "Basics/Result.h"
#include "Basics/WriteLocker.h"
#include "ClusterInfo.h"
#include "StorageEngine/EngineSelectorFeature.h"
#include "StorageEngine/StorageEngine.h"
#include "VocBase/LogicalCollection.h"
namespace arangodb {
@ -44,22 +48,41 @@ class Slice;
class FollowerInfo {
// This is the list of real local followers
std::shared_ptr<std::vector<ServerID> const> _followers;
// This is the list of followers that have been insync BEFORE we
// triggered a failover to this server.
// The list is filled only temporarily, and will be deleted as
// soon as we can guarantee at least so many followers locally.
std::shared_ptr<std::vector<ServerID> const> _failoverCandidates;
// The agencyMutex is used to synchronise access to the agency.
// the _dataLock is used to sync the access to local data.
// The agencyMutex is always locked before the _dataLock is locked.
// The _canWriteLock is used to protect flag if we do have enough followers
// The locking ordering to avoid dead locks has to be as follows:
// 1.) _agencyMutex
// 2.) _canWriteLock
// 3.) _dataLock
mutable Mutex _agencyMutex;
mutable arangodb::basics::ReadWriteLock _canWriteLock;
mutable arangodb::basics::ReadWriteLock _dataLock;
arangodb::LogicalCollection* _docColl;
std::string _theLeader;
// if the latter is empty, then we are leading
std::string _theLeader;
bool _theLeaderTouched;
// flag if we have enough in-sync followers and can pass through writes
bool _canWrite;
public:
explicit FollowerInfo(arangodb::LogicalCollection* d)
: _followers(std::make_shared<std::vector<ServerID>>()),
_failoverCandidates(std::make_shared<std::vector<ServerID>>()),
_docColl(d),
_theLeaderTouched(false) {}
_theLeader(""),
_theLeaderTouched(false),
_canWrite(_docColl->replicationFactor() <= 1) {
// On replicationfactor 1 we do not have any failover servers to maintain.
// This should also disable satellite tracking.
}
////////////////////////////////////////////////////////////////////////////////
/// @brief get information about current followers of a shard.
@ -70,6 +93,23 @@ class FollowerInfo {
return _followers;
}
////////////////////////////////////////////////////////////////////////////////
/// @brief get information about current followers of a shard.
////////////////////////////////////////////////////////////////////////////////
std::shared_ptr<std::vector<ServerID> const> getFailoverCandidates() const {
READ_LOCKER(readLocker, _dataLock);
return _failoverCandidates;
}
////////////////////////////////////////////////////////////////////////////////
/// @brief Take over leadership for this shard.
/// Also inject information about in-sync followers that we knew about
/// before a failover to this server happened
////////////////////////////////////////////////////////////////////////////////
void takeOverLeadership(std::vector<std::string> const& previousInsyncFollowers);
//////////////////////////////////////////////////////////////////////////////
/// @brief add a follower to a shard, this is only done by the server side
/// of the "get-in-sync" capabilities. This reports to the agency under
@ -106,6 +146,9 @@ class FollowerInfo {
//////////////////////////////////////////////////////////////////////////////
void setTheLeader(std::string const& who) {
// Empty leader => we are now new leader.
// This needs to be handled with takeOverLeadership
TRI_ASSERT(!who.empty());
WRITE_LOCKER(writeLocker, _dataLock);
_theLeader = who;
_theLeaderTouched = true;
@ -128,6 +171,54 @@ class FollowerInfo {
READ_LOCKER(readLocker, _dataLock);
return _theLeaderTouched;
}
bool allowedToWrite() {
{
auto engine = arangodb::EngineSelectorFeature::ENGINE;
TRI_ASSERT(engine != nullptr);
if (engine->inRecovery()) {
return true;
}
READ_LOCKER(readLocker, _canWriteLock);
if (_canWrite) {
// Someone has decided we can write, fastPath!
#ifdef ARANGODB_USE_MAINTAINER_MODE
// Invariant, we can only WRITE if we do not have other failover candidates
READ_LOCKER(readLockerData, _dataLock);
TRI_ASSERT(_followers->size() == _failoverCandidates->size());
TRI_ASSERT(_followers->size() > _docColl->minReplicationFactor());
#endif
return _canWrite;
}
READ_LOCKER(readLockerData, _dataLock);
TRI_ASSERT(_docColl != nullptr);
if (_followers->size() + 1 < _docColl->minReplicationFactor()) {
// We know that we still do not have enough followers
return false;
}
}
return updateFailoverCandidates();
}
//////////////////////////////////////////////////////////////////////////////
/// @brief Inject the information about followers into the builder.
/// Builder needs to be an open object and is not allowed to contain
/// the keys "servers" and "failoverCandidates".
//////////////////////////////////////////////////////////////////////////////
void injectFollowerInfo(arangodb::velocypack::Builder& builder) const {
READ_LOCKER(readLockerData, _dataLock);
injectFollowerInfoInternal(builder);
}
private:
void injectFollowerInfoInternal(arangodb::velocypack::Builder& builder) const;
bool updateFailoverCandidates();
Result persistInAgency(bool isRemove) const;
arangodb::velocypack::Builder newShardEntry(arangodb::velocypack::Slice oldValue) const;
};
} // end namespace arangodb

View File

@ -245,7 +245,8 @@ void handlePlanShard(VPackSlice const& cprops, VPackSlice const& ldb,
{THE_LEADER, shouldBeLeading ? std::string() : leaderId},
{SERVER_ID, serverId},
{LOCAL_LEADER, lcol.get(THE_LEADER).copyString()},
{FOLLOWERS_TO_DROP, followersToDropString}},
{FOLLOWERS_TO_DROP, followersToDropString},
{OLD_CURRENT_COUNTER, std::to_string(feature.getCurrentCounter())}},
HIGHER_PRIORITY, properties));
} else {
LOG_TOPIC("0285b", DEBUG, Logger::MAINTENANCE)
@ -726,26 +727,14 @@ static VPackBuilder assembleLocalCollectionInfo(
}
}
}
ret.add(VPackValue(SERVERS));
{
VPackArrayBuilder a(&ret);
ret.add(VPackValue(ourselves));
// planServers may be `none` in the case that the shard is not
// contained in Plan, but in local.
if (planServers.isArray()) {
std::shared_ptr<std::vector<std::string> const> current =
collection->followers()->get();
for (auto const& server : *current) {
ret.add(VPackValue(server));
}
}
}
collection->followers()->injectFollowerInfo(ret);
}
return ret;
} catch (std::exception const& e) {
ret.clear();
std::string errorMsg(
"Maintenance::assembleLocalCollectionInfo: Failed to lookup database ");
"Maintenance::assembleLocalCollectionInfo: Failed to lookup "
"database ");
errorMsg += database;
errorMsg += ", exception: ";
errorMsg += e.what();
@ -852,8 +841,10 @@ arangodb::Result arangodb::maintenance::reportInCurrent(
auto const planPath = std::vector<std::string>{dbName, colName, "shards", shName};
if (!pdbs.hasKey(planPath)) {
LOG_TOPIC("43242", DEBUG, Logger::MAINTENANCE)
<< "Ooops, we have a shard for which we believe to be the leader,"
" but the Plan does not have it any more, we do not report in "
<< "Ooops, we have a shard for which we believe to be the "
"leader,"
" but the Plan does not have it any more, we do not report "
"in "
"Current about this, database: "
<< dbName << ", shard: " << shName;
continue;
@ -863,7 +854,8 @@ arangodb::Result arangodb::maintenance::reportInCurrent(
if (!thePlanList.isArray() || thePlanList.length() == 0 ||
!thePlanList[0].isString() || !thePlanList[0].isEqualStringUnchecked(serverId)) {
LOG_TOPIC("87776", DEBUG, Logger::MAINTENANCE)
<< "Ooops, we have a shard for which we believe to be the leader,"
<< "Ooops, we have a shard for which we believe to be the "
"leader,"
" but the Plan says otherwise, we do not report in Current "
"about this, database: "
<< dbName << ", shard: " << shName;
@ -923,7 +915,8 @@ arangodb::Result arangodb::maintenance::reportInCurrent(
if (!pdbs.hasKey(planPath)) {
LOG_TOPIC("65432", DEBUG, Logger::MAINTENANCE)
<< "Ooops, we have a shard for which we believe that we "
"just resigned, but the Plan does not have it any more,"
"just resigned, but the Plan does not have it any "
"more,"
" we do not report in Current about this, database: "
<< dbName << ", shard: " << shName;
continue;

View File

@ -57,7 +57,8 @@ bool findNotDoneActions(std::shared_ptr<maintenance::Action> const& action) {
MaintenanceFeature::MaintenanceFeature(application_features::ApplicationServer& server)
: ApplicationFeature(server, "Maintenance"),
_forceActivation(false),
_maintenanceThreadsMax(2) {
_maintenanceThreadsMax(2),
_currentCounter(0) {
// the number of threads will be adjusted later. it's just that we want to
// initialize all members properly
@ -116,7 +117,8 @@ void MaintenanceFeature::validateOptions(std::shared_ptr<ProgramOptions> options
<< "Need at least" << minThreadLimit << "maintenance-threads";
_maintenanceThreadsMax = minThreadLimit;
} else if (_maintenanceThreadsMax >= maxThreadLimit) {
LOG_TOPIC("8fb0e", WARN, Logger::MAINTENANCE) << "maintenance-threads limited to " << maxThreadLimit;
LOG_TOPIC("8fb0e", WARN, Logger::MAINTENANCE)
<< "maintenance-threads limited to " << maxThreadLimit;
_maintenanceThreadsMax = maxThreadLimit;
}
}
@ -129,8 +131,9 @@ void MaintenanceFeature::start() {
// _forceActivation is set by the catch tests
if (!_forceActivation && (serverState->isAgent() || serverState->isSingleServer())) {
LOG_TOPIC("deb1a", TRACE, Logger::MAINTENANCE) << "Disable maintenance-threads"
<< " for single-server or agents.";
LOG_TOPIC("deb1a", TRACE, Logger::MAINTENANCE)
<< "Disable maintenance-threads"
<< " for single-server or agents.";
return;
}
@ -442,7 +445,7 @@ std::shared_ptr<Action> MaintenanceFeature::findReadyAction(std::unordered_set<s
} // else
} // for
}
} // WRITE
} // WRITE
// no pointer ... wait 0.1 seconds unless woken up
if (!_isShuttingDown) {
@ -733,3 +736,41 @@ void MaintenanceFeature::delShardVersion(std::string const& shname) {
_shardVersion.erase(it);
}
}
uint64_t MaintenanceFeature::getCurrentCounter() const {
// It is guaranteed that getCurrentCounter is not executed
// concurrently with increase / wait.
// This guarantee is created by the following:
// 1) There is one infinite loop that calls
// PhaseOne and PhaseTwo in exactly this order.
// It is guaranteed that only one thread at a time is
// in this loop.
// Between PhaseOne and PhaseTwo, increaseCurrentCounter is called.
// Within PhaseOne, getCurrentCounter is called, but never after,
// so getCurrentCounter and increaseCurrentCounter are strictly serialized.
// 2) waitForLargerCurrentCounter can be called from concurrent threads at any time.
// It is read-only, so it is safe to run it concurrently with getCurrentCounter
// without any locking.
// However, we need locking for increase and waitFor in order to guarantee
// their functionality.
// For now we do not actually need this guard, but as this is NOT performance
// critical we simply take it, to be safe for later use.
std::unique_lock<std::mutex> guard(_currentCounterLock);
return _currentCounter;
}
void MaintenanceFeature::increaseCurrentCounter() {
std::unique_lock<std::mutex> guard(_currentCounterLock);
_currentCounter++;
_currentCounterCondition.notify_all();
}
void MaintenanceFeature::waitForLargerCurrentCounter(uint64_t old) {
std::unique_lock<std::mutex> guard(_currentCounterLock);
// Wait with a predicate so that a spurious wakeup cannot return before
// the counter has actually been increased past `old`.
_currentCounterCondition.wait(guard, [this, old]() { return _currentCounter > old; });
}
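For illustration, a minimal self-contained sketch of the counter handshake these three functions implement, assuming a simplified maintenance loop; CounterSketch and the phase wording are placeholders, not ArangoDB API:

// Minimal model of getCurrentCounter / increaseCurrentCounter /
// waitForLargerCurrentCounter using only the standard library.
#include <condition_variable>
#include <cstdint>
#include <iostream>
#include <mutex>
#include <thread>

class CounterSketch {
 public:
  // Read the counter; only the single maintenance loop calls this (PhaseOne).
  uint64_t get() const {
    std::unique_lock<std::mutex> guard(_lock);
    return _counter;
  }
  // Bump the counter after a fresh Current has been loaded (between phases).
  void increase() {
    std::unique_lock<std::mutex> guard(_lock);
    ++_counter;
    _cv.notify_all();
  }
  // Block until the counter is larger than the value observed earlier.
  void waitForLarger(uint64_t old) {
    std::unique_lock<std::mutex> guard(_lock);
    _cv.wait(guard, [&] { return _counter > old; });
  }

 private:
  mutable std::mutex _lock;
  std::condition_variable _cv;
  uint64_t _counter = 0;
};

int main() {
  CounterSketch feature;
  // An action records the counter value it saw when it was created ...
  uint64_t old = feature.get();
  std::thread action([&] {
    // ... and later blocks until a newer Current has been fetched.
    feature.waitForLarger(old);
    std::cout << "action resumed after Current was refreshed\n";
  });
  // The maintenance loop refreshes Current and bumps the counter.
  feature.increase();
  action.join();
  return 0;
}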


@ -36,7 +36,7 @@
namespace arangodb {
template<typename T>
template <typename T>
struct SharedPtrComparer {
bool operator()(std::shared_ptr<T> const& a, std::shared_ptr<T> const& b) {
if (a == nullptr || b == nullptr) {
@ -50,8 +50,6 @@ class MaintenanceFeature : public application_features::ApplicationFeature {
public:
explicit MaintenanceFeature(application_features::ApplicationServer&);
MaintenanceFeature();
virtual ~MaintenanceFeature() {}
struct errors_t {
@ -156,7 +154,8 @@ class MaintenanceFeature : public application_features::ApplicationFeature {
* @brief Find and return first found not-done action or nullptr
* @param desc Description of sought action
*/
std::shared_ptr<maintenance::Action> findFirstNotDoneAction(std::shared_ptr<maintenance::ActionDescription> const& desc);
std::shared_ptr<maintenance::Action> findFirstNotDoneAction(
std::shared_ptr<maintenance::ActionDescription> const& desc);
/**
* @brief add index error to bucket
@ -298,19 +297,44 @@ class MaintenanceFeature : public application_features::ApplicationFeature {
*/
void delShardVersion(std::string const& shardId);
/**
* @brief Get the number of loadCurrent operations.
* NOTE: The counter functions can be removed
* as soon as we use a push-based approach on Plan and Current.
* @return The most recent count of getCurrent calls
*/
uint64_t getCurrentCounter() const;
/**
* @brief Increase the counter for loadCurrent operations triggered
* during maintenance. This is used to delay some Actions that
* require a recent Current to continue.
*/
void increaseCurrentCounter();
/**
* @brief Wait until the current counter is larger than the given old one.
* The idea is to first call getCurrentCounter() and later wait here.
* @param old The last value returned by getCurrentCounter(). This function
* returns only once the current counter is larger than old.
*/
void waitForLargerCurrentCounter(uint64_t old);
private:
/// @brief common code used by multiple constructors
void init();
/// @brief Search for first action matching hash and predicate
/// @return shared pointer to action object if exists, empty shared_ptr if not
std::shared_ptr<maintenance::Action> findFirstActionHash(size_t hash,
std::function<bool(std::shared_ptr<maintenance::Action> const&)> const& predicate);
std::shared_ptr<maintenance::Action> findFirstActionHash(
size_t hash,
std::function<bool(std::shared_ptr<maintenance::Action> const&)> const& predicate);
/// @brief Search for first action matching hash and predicate (with lock already held by caller)
/// @return shared pointer to action object if exists, empty shared_ptr if not
std::shared_ptr<maintenance::Action> findFirstActionHashNoLock(size_t hash,
std::function<bool(std::shared_ptr<maintenance::Action> const&)> const& predicate);
std::shared_ptr<maintenance::Action> findFirstActionHashNoLock(
size_t hash,
std::function<bool(std::shared_ptr<maintenance::Action> const&)> const& predicate);
/// @brief Search for action by Id
/// @return shared pointer to action object if exists, nullptr if not
@ -321,7 +345,6 @@ class MaintenanceFeature : public application_features::ApplicationFeature {
std::shared_ptr<maintenance::Action> findActionIdNoLock(uint64_t hash);
protected:
/// @brief option for forcing this feature to always be enabled - used by the catch tests
bool _forceActivation;
@ -365,8 +388,8 @@ class MaintenanceFeature : public application_features::ApplicationFeature {
// we need to leave the action in _prioQueue (since we cannot remove anything
// but the top from it), and simply put it into a different state.
std::priority_queue<std::shared_ptr<maintenance::Action>,
std::vector<std::shared_ptr<maintenance::Action>>,
SharedPtrComparer<maintenance::Action>> _prioQueue;
std::vector<std::shared_ptr<maintenance::Action>>, SharedPtrComparer<maintenance::Action>>
_prioQueue;
/// @brief lock to protect _actionRegistry and state changes to MaintenanceActions within
mutable arangodb::basics::ReadWriteLock _actionRegistryLock;
@ -404,6 +427,15 @@ class MaintenanceFeature : public application_features::ApplicationFeature {
/// @brief shards have versions in order to be able to distinguish between
/// independent actions
std::unordered_map<std::string, size_t> _shardVersion;
/// @brief Mutex for the current counter condition variable
mutable std::mutex _currentCounterLock;
/// @brief Condition variable on which Actions can wait until _currentCounter has increased
std::condition_variable _currentCounterCondition;
/// @brief counter for load_current requests.
uint64_t _currentCounter;
};
} // namespace arangodb


@ -69,6 +69,7 @@ constexpr char const* THE_LEADER = "theLeader";
constexpr char const* UNDERSCORE = "_";
constexpr char const* UPDATE_COLLECTION = "UpdateCollection";
constexpr char const* WAIT_FOR_SYNC = "waitForSync";
constexpr char const* OLD_CURRENT_COUNTER = "oldCurrentCounter";
} // namespace maintenance
} // namespace arangodb


@ -77,21 +77,36 @@ UpdateCollection::UpdateCollection(MaintenanceFeature& feature, ActionDescriptio
}
TRI_ASSERT(desc.has(FOLLOWERS_TO_DROP));
TRI_ASSERT(desc.has(OLD_CURRENT_COUNTER));
if (!error.str().empty()) {
LOG_TOPIC("a6e4c", ERR, Logger::MAINTENANCE) << "UpdateCollection: " << error.str();
LOG_TOPIC("a6e4c", ERR, Logger::MAINTENANCE)
<< "UpdateCollection: " << error.str();
_result.reset(TRI_ERROR_INTERNAL, error.str());
setState(FAILED);
}
}
void handleLeadership(LogicalCollection& collection, std::string const& localLeader,
std::string const& plannedLeader, std::string const& followersToDrop) {
std::string const& plannedLeader,
std::string const& followersToDrop, std::string const& databaseName,
uint64_t oldCounter, MaintenanceFeature& feature) {
auto& followers = collection.followers();
if (plannedLeader.empty()) { // Planned to lead
if (!localLeader.empty()) { // We were not leader, assume leadership
followers->setTheLeader(std::string());
followers->clear();
// This will block the thread until we have fetched a new Current version
// in the maintenance main thread.
feature.waitForLargerCurrentCounter(oldCounter);
auto currentInfo = ClusterInfo::instance()->getCollectionCurrent(
databaseName, std::to_string(collection.planId()));
if (currentInfo == nullptr) {
// Collection has been dropped, we cannot continue here.
return;
}
TRI_ASSERT(currentInfo != nullptr);
auto failoverCandidates = currentInfo->failoverCandidates(collection.name());
followers->takeOverLeadership(failoverCandidates);
transaction::cluster::abortFollowerTransactionsOnShard(collection.id());
} else {
// If someone (the Supervision most likely) has thrown
@ -138,6 +153,8 @@ bool UpdateCollection::first() {
auto const& localLeader = _description.get(LOCAL_LEADER);
auto const& followersToDrop = _description.get(FOLLOWERS_TO_DROP);
auto const& props = properties();
auto const& oldCounterString = _description.get(OLD_CURRENT_COUNTER);
uint64_t oldCounter = basics::StringUtils::uint64(oldCounterString);
try {
DatabaseGuard guard(database);
@ -152,7 +169,8 @@ bool UpdateCollection::first() {
// resignation case is not handled here, since then
// ourselves does not appear in shards[shard] but only
// "_" + ourselves.
handleLeadership(*coll, localLeader, plannedLeader, followersToDrop);
handleLeadership(*coll, localLeader, plannedLeader, followersToDrop,
vocbase.name(), oldCounter, feature());
_result = Collections::updateProperties(*coll, props, false); // always a full-update
if (!_result.ok()) {
@ -173,7 +191,8 @@ bool UpdateCollection::first() {
std::stringstream error;
error << "action " << _description << " failed with exception " << e.what();
LOG_TOPIC("79442", WARN, Logger::MAINTENANCE) << "UpdateCollection: " << error.str();
LOG_TOPIC("79442", WARN, Logger::MAINTENANCE)
<< "UpdateCollection: " << error.str();
_result.reset(TRI_ERROR_INTERNAL, error.str());
}


@ -65,6 +65,11 @@ enum class RequestLane {
// V8 or having high priority.
CLUSTER_INTERNAL,
// For requests from the DBserver to the Coordinator or
// from the Coordinator to the DBserver using AQL.
// These have medium priority.
CLUSTER_AQL,
// For requests from the Coordinator to the
// DBserver using V8.
CLUSTER_V8,
@ -115,6 +120,8 @@ inline RequestPriority PriorityRequestLane(RequestLane lane) {
return RequestPriority::LOW;
case RequestLane::CLUSTER_INTERNAL:
return RequestPriority::HIGH;
case RequestLane::CLUSTER_AQL:
return RequestPriority::MED;
case RequestLane::CLUSTER_V8:
return RequestPriority::LOW;
case RequestLane::CLUSTER_ADMIN:


@ -41,9 +41,7 @@ class InternalRestTraverserHandler : public RestVocbaseBaseHandler {
char const* name() const override final {
return "InternalRestTraverserHandler";
}
RequestLane lane() const override final {
return RequestLane::CLUSTER_INTERNAL;
}
RequestLane lane() const override final { return RequestLane::CLUSTER_AQL; }
private:
// @brief create a new Traverser Engine.


@ -82,7 +82,7 @@ Conductor::Conductor(uint64_t executionNumber, TRI_vocbase_t& vocbase,
if (_asyncMode) {
LOG_TOPIC("1b1c2", DEBUG, Logger::PREGEL) << "Running in async mode";
}
VPackSlice lazy = _userParams.slice().get( Utils::lazyLoadingKey);
VPackSlice lazy = _userParams.slice().get(Utils::lazyLoadingKey);
_lazyLoading = _algorithm->supportsLazyLoading();
_lazyLoading = _lazyLoading && (lazy.isNone() || lazy.getBoolean());
if (_lazyLoading) {
@ -98,8 +98,7 @@ Conductor::Conductor(uint64_t executionNumber, TRI_vocbase_t& vocbase,
}
Conductor::~Conductor() {
if (_state != ExecutionState::CANCELED &&
_state != ExecutionState::DEFAULT) {
if (_state != ExecutionState::CANCELED && _state != ExecutionState::DEFAULT) {
try {
this->cancel();
} catch (...) {
@ -120,11 +119,13 @@ void Conductor::start() {
_globalSuperstep = 0;
_state = ExecutionState::RUNNING;
LOG_TOPIC("3a255", DEBUG, Logger::PREGEL) << "Telling workers to load the data";
LOG_TOPIC("3a255", DEBUG, Logger::PREGEL)
<< "Telling workers to load the data";
int res = _initializeWorkers(Utils::startExecutionPath, VPackSlice());
if (res != TRI_ERROR_NO_ERROR) {
_state = ExecutionState::CANCELED;
LOG_TOPIC("30171", ERR, Logger::PREGEL) << "Not all DBServers started the execution";
LOG_TOPIC("30171", ERR, Logger::PREGEL)
<< "Not all DBServers started the execution";
}
}
@ -170,7 +171,8 @@ bool Conductor::_startGlobalStep() {
_masterContext->_enterNextGSS = false;
proceed = _masterContext->postGlobalSuperstep();
if (!proceed) {
LOG_TOPIC("0aa8e", DEBUG, Logger::PREGEL) << "Master context ended execution";
LOG_TOPIC("0aa8e", DEBUG, Logger::PREGEL)
<< "Master context ended execution";
}
}
@ -212,7 +214,8 @@ bool Conductor::_startGlobalStep() {
res = _sendToAllDBServers(Utils::startGSSPath, b); // call me maybe
if (res != TRI_ERROR_NO_ERROR) {
_state = ExecutionState::IN_ERROR;
LOG_TOPIC("f34bb", ERR, Logger::PREGEL) << "Conductor could not start GSS " << _globalSuperstep;
LOG_TOPIC("f34bb", ERR, Logger::PREGEL)
<< "Conductor could not start GSS " << _globalSuperstep;
// the recovery mechanisms should take care of this
} else {
LOG_TOPIC("411a5", DEBUG, Logger::PREGEL) << "Conductor started new gss " << _globalSuperstep;
@ -236,8 +239,9 @@ void Conductor::finishedWorkerStartup(VPackSlice const& data) {
return;
}
LOG_TOPIC("76631", INFO, Logger::PREGEL) << "Running pregel with " << _totalVerticesCount
<< " vertices, " << _totalEdgesCount << " edges";
LOG_TOPIC("76631", INFO, Logger::PREGEL)
<< "Running pregel with " << _totalVerticesCount << " vertices, "
<< _totalEdgesCount << " edges";
if (_masterContext) {
_masterContext->_globalSuperstep = 0;
_masterContext->_vertexCount = _totalVerticesCount;
@ -356,7 +360,8 @@ void Conductor::finishedRecoveryStep(VPackSlice const& data) {
res = _sendToAllDBServers(Utils::continueRecoveryPath, b);
} else {
LOG_TOPIC("6ecf2", INFO, Logger::PREGEL) << "Recovery finished. Proceeding normally";
LOG_TOPIC("6ecf2", INFO, Logger::PREGEL)
<< "Recovery finished. Proceeding normally";
// build the message, works for all cases
VPackBuilder b;
@ -393,7 +398,8 @@ void Conductor::startRecovery() {
if (_state != ExecutionState::RUNNING && _state != ExecutionState::IN_ERROR) {
return; // maybe we are already in recovery mode
} else if (_algorithm->supportsCompensation() == false) {
LOG_TOPIC("12e0e", ERR, Logger::PREGEL) << "Algorithm does not support recovery";
LOG_TOPIC("12e0e", ERR, Logger::PREGEL)
<< "Algorithm does not support recovery";
cancelNoLock();
return;
}
@ -407,14 +413,15 @@ void Conductor::startRecovery() {
// let's wait for a final state in the cluster
_workHandle = SchedulerFeature::SCHEDULER->queueDelay(
RequestLane::CLUSTER_INTERNAL, std::chrono::seconds(2), [this](bool cancelled) {
RequestLane::CLUSTER_AQL, std::chrono::seconds(2), [this](bool cancelled) {
if (cancelled || _state != ExecutionState::RECOVERING) {
return; // seems like we are canceled
}
std::vector<ServerID> goodServers;
int res = PregelFeature::instance()->recoveryManager()->filterGoodServers(_dbServers, goodServers);
if (res != TRI_ERROR_NO_ERROR) {
LOG_TOPIC("3d08b", ERR, Logger::PREGEL) << "Recovery proceedings failed";
LOG_TOPIC("3d08b", ERR, Logger::PREGEL)
<< "Recovery proceedings failed";
cancelNoLock();
return;
}
@ -614,15 +621,15 @@ int Conductor::_initializeWorkers(std::string const& suffix, VPackSlice addition
}
std::shared_ptr<ClusterComm> cc = ClusterComm::instance();
size_t nrGood = cc->performRequests(requests, 5.0 * 60.0,
LogTopic("Pregel Conductor"), false);
size_t nrGood =
cc->performRequests(requests, 5.0 * 60.0, LogTopic("Pregel Conductor"), false);
Utils::printResponses(requests);
return nrGood == requests.size() ? TRI_ERROR_NO_ERROR : TRI_ERROR_FAILED;
}
int Conductor::_finalizeWorkers() {
_callbackMutex.assertLockedByCurrentThread();
_finalizationStartTimeSecs = TRI_microtime();
_finalizationStartTimeSecs = TRI_microtime();
bool store = _state == ExecutionState::DONE;
store = store && _storeResults;
@ -651,7 +658,6 @@ int Conductor::_finalizeWorkers() {
}
void Conductor::finishedWorkerFinalize(VPackSlice data) {
MUTEX_LOCKER(guard, _callbackMutex);
_ensureUniqueResponse(data);
if (_respondedServers.size() != _dbServers.size()) {
@ -674,9 +680,8 @@ void Conductor::finishedWorkerFinalize(VPackSlice data) {
LOG_TOPIC("063b5", INFO, Logger::PREGEL) << "Done. We did " << _globalSuperstep << " rounds";
LOG_TOPIC("3cfa8", INFO, Logger::PREGEL)
<< "Startup Time: " << _computationStartTimeSecs - _startTimeSecs << "s";
LOG_TOPIC("d43cb", INFO, Logger::PREGEL)
<< "Computation Time: " << compTime << "s";
<< "Startup Time: " << _computationStartTimeSecs - _startTimeSecs << "s";
LOG_TOPIC("d43cb", INFO, Logger::PREGEL) << "Computation Time: " << compTime << "s";
LOG_TOPIC("74e05", INFO, Logger::PREGEL) << "Storage Time: " << storeTime << "s";
LOG_TOPIC("06f03", INFO, Logger::PREGEL) << "Overall: " << totalRuntimeSecs() << "s";
LOG_TOPIC("03f2e", DEBUG, Logger::PREGEL) << "Stats: " << debugOut.toString();
@ -686,7 +691,7 @@ void Conductor::finishedWorkerFinalize(VPackSlice data) {
auto* scheduler = SchedulerFeature::SCHEDULER;
if (scheduler) {
uint64_t exe = _executionNumber;
scheduler->queue(RequestLane::CLUSTER_INTERNAL, [exe] {
scheduler->queue(RequestLane::CLUSTER_AQL, [exe] {
auto pf = PregelFeature::instance();
if (pf) {
pf->cleanupConductor(exe);
@ -770,8 +775,7 @@ int Conductor::_sendToAllDBServers(std::string const& path, VPackBuilder const&
if (conductor) {
TRI_vocbase_t& vocbase = conductor->_vocbaseGuard.database();
VPackBuilder response;
PregelFeature::handleWorkerRequest(vocbase, path,
message.slice(), response);
PregelFeature::handleWorkerRequest(vocbase, path, message.slice(), response);
}
});
}
@ -794,9 +798,10 @@ int Conductor::_sendToAllDBServers(std::string const& path, VPackBuilder const&
requests.emplace_back("server:" + server, rest::RequestType::POST, base + path, body);
}
size_t nrGood = cc->performRequests(requests, 5.0 * 60.0,
LogTopic("Pregel Conductor"), false);
LOG_TOPIC("9de62", TRACE, Logger::PREGEL) << "Send " << path << " to " << nrGood << " servers";
size_t nrGood =
cc->performRequests(requests, 5.0 * 60.0, LogTopic("Pregel Conductor"), false);
LOG_TOPIC("9de62", TRACE, Logger::PREGEL)
<< "Send " << path << " to " << nrGood << " servers";
Utils::printResponses(requests);
if (handle && nrGood == requests.size()) {
for (ClusterCommRequest const& req : requests) {


@ -1147,7 +1147,8 @@ Result RestReplicationHandler::processRestoreCollectionCoordinator(
}
if (!isValidMinReplFactorSlice) {
if (replFactorSlice.isString() && replFactorSlice.isEqualString("satellite")) {
if (replFactorSlice.isString() &&
replFactorSlice.isEqualString("satellite")) {
minReplicationFactor = 0;
} else if (minReplicationFactor <= 0) {
minReplicationFactor = 1;


@ -53,9 +53,9 @@ bool isDirectDeadlockLane(RequestLane lane) {
// Those tasks can not be executed directly.
return lane == RequestLane::TASK_V8 || lane == RequestLane::CLIENT_V8 ||
lane == RequestLane::CLUSTER_V8 || lane == RequestLane::INTERNAL_LOW ||
lane == RequestLane::SERVER_REPLICATION ||
lane == RequestLane::CLUSTER_ADMIN || lane == RequestLane::CLUSTER_INTERNAL ||
lane == RequestLane::AGENCY_CLUSTER || lane == RequestLane::CLIENT_AQL;
lane == RequestLane::SERVER_REPLICATION || lane == RequestLane::CLUSTER_ADMIN ||
lane == RequestLane::CLUSTER_INTERNAL || lane == RequestLane::AGENCY_CLUSTER ||
lane == RequestLane::CLIENT_AQL || lane == RequestLane::CLUSTER_AQL;
}
} // namespace


@ -1720,7 +1720,7 @@ OperationResult transaction::Methods::insertLocal(std::string const& collectionN
if (!options.isSynchronousReplicationFrom.empty()) {
return OperationResult(TRI_ERROR_CLUSTER_SHARD_LEADER_REFUSES_REPLICATION, options);
}
if (followerInfo->get()->size() + 1 < collection->minReplicationFactor()) {
if (!followerInfo->allowedToWrite()) {
// We cannot fulfill minimum replication Factor.
// Reject write.
LOG_TOPIC("d7306", ERR, Logger::REPLICATION)
@ -2044,7 +2044,7 @@ OperationResult transaction::Methods::modifyLocal(std::string const& collectionN
if (!options.isSynchronousReplicationFrom.empty()) {
return OperationResult(TRI_ERROR_CLUSTER_SHARD_LEADER_REFUSES_REPLICATION);
}
if (followerInfo->get()->size() + 1 < collection->minReplicationFactor()) {
if (!followerInfo->allowedToWrite()) {
// We cannot fulfill minimum replication Factor.
// Reject write.
LOG_TOPIC("2e35a", ERR, Logger::REPLICATION)
@ -2324,7 +2324,7 @@ OperationResult transaction::Methods::removeLocal(std::string const& collectionN
if (!options.isSynchronousReplicationFrom.empty()) {
return OperationResult(TRI_ERROR_CLUSTER_SHARD_LEADER_REFUSES_REPLICATION);
}
if (followerInfo->get()->size() + 1 < collection->minReplicationFactor()) {
if (!followerInfo->allowedToWrite()) {
// We cannot fulfill minimum replication Factor.
// Reject write.
LOG_TOPIC("f1f8e", ERR, Logger::REPLICATION)
@ -2559,7 +2559,7 @@ OperationResult transaction::Methods::truncateLocal(std::string const& collectio
if (!options.isSynchronousReplicationFrom.empty()) {
return OperationResult(TRI_ERROR_CLUSTER_SHARD_LEADER_REFUSES_REPLICATION);
}
if (followerInfo->get()->size() + 1 < collection->minReplicationFactor()) {
if (!followerInfo->allowedToWrite()) {
// We cannot fulfill minimum replication Factor.
// Reject write.
LOG_TOPIC("7c1d4", ERR, Logger::REPLICATION)


@ -228,6 +228,7 @@ std::string const StaticStrings::GraphCreateCollection("createCollection");
// Replication
std::string const StaticStrings::ReplicationSoftLockOnly("doSoftLockOnly");
std::string const StaticStrings::FailoverCandidates("failoverCandidates");
// misc strings
std::string const StaticStrings::LastValue("lastValue");


@ -210,6 +210,7 @@ class StaticStrings {
// Replication
static std::string const ReplicationSoftLockOnly;
static std::string const FailoverCandidates;
// misc strings
static std::string const LastValue;


@ -252,13 +252,39 @@ class IResearchQueryOptimizationTest : public ::testing::Test {
NS_END
static std::vector<std::string> const EMPTY;
// -----------------------------------------------------------------------------
// --SECTION-- test suite
// -----------------------------------------------------------------------------
void addLinkToCollection(std::shared_ptr<arangodb::iresearch::IResearchView>& view) {
auto updateJson = VPackParser::fromJson(
"{ \"links\" : {"
"\"collection_1\" : { \"includeAllFields\" : true }"
"}}");
EXPECT_TRUE((view->properties(updateJson->slice(), true).ok()));
arangodb::velocypack::Builder builder;
builder.openObject();
view->properties(builder, arangodb::LogicalDataSource::makeFlags(
arangodb::LogicalDataSource::Serialize::Detailed));
builder.close();
auto slice = builder.slice();
EXPECT_TRUE(slice.isObject());
EXPECT_TRUE(slice.get("name").copyString() == "testView");
EXPECT_TRUE(slice.get("type").copyString() ==
arangodb::iresearch::DATA_SOURCE_TYPE.name());
EXPECT_TRUE(slice.get("deleted").isNone()); // no system properties
auto tmpSlice = slice.get("links");
EXPECT_TRUE((true == tmpSlice.isObject() && 1 == tmpSlice.length()));
}
// dedicated to https://github.com/arangodb/arangodb/issues/8294
TEST_F(IResearchQueryOptimizationTest, test) {
static std::vector<std::string> const EMPTY;
auto createJson = VPackParser::fromJson(
"{ \
@ -285,29 +311,7 @@ TEST_F(IResearchQueryOptimizationTest, test) {
ASSERT_TRUE((false == !view));
// add link to collection
{
auto updateJson = VPackParser::fromJson(
"{ \"links\" : {"
"\"collection_1\" : { \"includeAllFields\" : true }"
"}}");
EXPECT_TRUE((view->properties(updateJson->slice(), true).ok()));
arangodb::velocypack::Builder builder;
builder.openObject();
view->properties(builder, arangodb::LogicalDataSource::makeFlags(
arangodb::LogicalDataSource::Serialize::Detailed));
builder.close();
auto slice = builder.slice();
EXPECT_TRUE(slice.isObject());
EXPECT_TRUE(slice.get("name").copyString() == "testView");
EXPECT_TRUE(slice.get("type").copyString() ==
arangodb::iresearch::DATA_SOURCE_TYPE.name());
EXPECT_TRUE(slice.get("deleted").isNone()); // no system properties
auto tmpSlice = slice.get("links");
EXPECT_TRUE((true == tmpSlice.isObject() && 1 == tmpSlice.length()));
}
addLinkToCollection(view);
std::deque<arangodb::ManagedDocumentResult> insertedDocs;