237 Commits

Author SHA1 Message Date
Lars Maier 51af263960 Added precondition to ensure that server is still as seen before. (#10468) 2019-11-21 09:21:36 +01:00
Jan 46e98d7110 avoid string copies in several cases (#10317) 2019-10-25 10:47:04 +02:00
Dan Larkin-York a83c2323c9 Refactor ApplicationServer stack (#9965) 2019-09-25 17:31:59 +02:00
jsteemann 8a812ec8c0 use StaticString 2019-09-23 18:17:37 +02:00
Markus Pfeiffer 753ff4aa67 Feature/atomic database creation 2 (#9826) 2019-09-05 12:38:07 +02:00
Jan 7220af9602 cover more cases of "unique constraint violated" issues during replication (#9830) 2019-08-30 10:37:32 +02:00
Frank Celler aa3d3f8e40 Feature/cleanup ccpcheck (#9665) 2019-08-12 11:11:49 +02:00
Jan 1a58cc2213 add VelocyPackHelper::equal method (#9389) 2019-07-03 12:15:11 +02:00
Jan 9cb08ded92 make the comparison functions unambiguous (#9349)
* make the comparison functions unambiguous

* added @kaveh's suggestion
2019-07-01 16:35:28 +02:00
Lars Maier 1e94ecf414 Bug fix/supervision fixes4 (#9016)
* Try to fix agency problems with snapshots.

* Abort MoveShards jobs that have the failed server as fromServer.

* Report aborts.

* CHANGELOG.
2019-05-31 17:20:06 +02:00
Kaveh Vahedipour 773f3c8422 [devel] fix state clientlookuptable (#9066) 2019-05-30 04:24:46 +02:00
Max Neunhöffer 80bfb85695 Port agency performance tuning for many shards to devel. (#8647)
* Port agency performance tuning for many shards to devel.
* Add more IDs to LOG_TOPIC calls.
* Even more IDs for LOG_TOPIC.
* Fix a duplicate LOG_TOPIC ID.
* Fix an old merging bug in devel.
* Don't hesitate between phases one and two for small clusters.
2019-04-11 11:14:56 +02:00
Max Neunhöffer 02281d3be4 Handle InitDone correctly. (#8552)
* precondition plan / version in compaction / store TTL removal independent of local _ttl set
* Agency init loops break when shutting down.
* assertion failures in store on restarting following agents
* Minor porting fixes from 3.4
2019-04-01 17:01:05 +02:00
Jan d6d3e3daa4 initialize some member variables, added TODOs (#8545) 2019-03-26 12:57:32 +01:00
Jan Christoph Uhde c3f7961b88 apply unique log ids (#8561) 2019-03-25 20:26:51 +01:00
Max Neunhöffer 55706e3c74 Make addfollower jobs less aggressive. (#8490)
* Make addfollower jobs less aggressive.
* CHANGELOG.
2019-03-21 15:24:31 +01:00
Kaveh Vahedipour 5038dfe685 supervision must not copy snapshots into jobs (#8425)
* supervision must not copy snapshots into jobs
* CHANGELOG.
2019-03-20 17:07:54 +01:00
Kaveh Vahedipour 237e079614 leader check needs to sit inside waitfor loop (#8445)
* leader check needs to sit inside waitfor loop
* Do not wait in Supervision for commits of new writes.
* CHANGELOG.
2019-03-20 16:34:54 +01:00
Simon 49cc3bcd1e Refactorings from cluster trx improvement branch (#8391) 2019-03-14 23:13:17 +01:00
Kaveh Vahedipour fa98e94d23 Supervision must not waitfor if no longer leading (#8403)
* Supervision must not waitfor if no longer leading

* Supervision must not waitfor if no longer leading
2019-03-13 13:18:10 +01:00
Max Neunhöffer 2a4f606df2 Various agency improvements. (#8380)
* Ignore satellite collections in shrinkCluster in agency.
* Abort RemoveFollower job if not enough in-sync followers or leader failure.
* Break quick wait loop in supervision if leadership is lost.
* In case of resigned leader, set isReady=false in clusterInventory.
* Fix catch tests.
2019-03-12 15:25:16 +01:00
Kaveh Vahedipour ee751e8ba3 [devel] clear compilation warnings (#8345) 2019-03-08 10:35:09 +01:00
Kaveh Vahedipour 4b464aeb97 oversight (#8324)
* oversight of an abort
* fix waitFor trap in supervision
2019-03-05 23:31:18 +01:00
Kaveh Vahedipour 68178ba165 [devel] supervision bug fix backports (#8314)
* back ports for supervision fixes from 3.4 part 1

* back ports for supervision fixes from 3.4 part 2
2019-03-04 19:27:24 +01:00
Manuel Pöter ecf4d9d62a Fix race conditions in thread management. (#8032) 2019-01-28 15:44:46 +01:00
Frank Celler ac9f375fb5 big reformat 2018-12-26 00:54:03 +01:00
Simon a2a0b03f43 Rdb index background (preliminary) (#7644) 2018-12-21 19:24:10 +01:00
Lars Maier 908df47cd7 [devel] Bug fix/cluster health ui timestamp (#7562) 2018-11-30 16:26:21 +01:00
Lars Maier 52cff7ad55 Feature/engine version added to agent configuration (#7481) (#7524)
* agents' is obtained from leader's configuration
* corrections in Supervision for advertised endpoints
* change log
* Updated Documentation for cluster/health.
* Unified naming convention.
* Fixed missing update of volatile fields.
* Set version in right order.
* Removed debug output.
* Fixed jslint - missing ;
2018-11-29 14:25:40 +01:00
Lars Maier f3ade0f860 Version/Engine Cluster Health (#7474)
* Export Version and Engine in Cluster Health. Additionally export `versionString` in registered Servers.

* Updated Changelog.
2018-11-27 14:56:00 +01:00
Kaveh Vahedipour 9ec6619b84 Bug fix/index readiness (#6541)
* indexes are marked  while still missing in Current
* index handling getCollection
* supervision gets indexes from isbuilding, when coordinator is gone before finishing
* seems right now
* fixed broken views
* remove junk comments
* cleanup
* node / supervision adjustements
* supervision fixes
* neunhoef remarks part i
* neunhoef remarks part ii
* neunhoef remarks part ii
* neunhoef remarks part iiI
* collection's current version please
* no need to wait for current once again
* no longer necessary code
* clear comments
* delete left overs
* dead code revived
2018-11-21 14:42:58 +01:00
Jan 7306cdaa03 try not to throw so many exceptions from Supervision (#7227) 2018-11-07 15:36:41 +01:00
Max Neunhöffer 2452dcc5d0 Remove a relic from early days in /Target/FailedServers. (#6690)
* Remove a relic from early days in /Target/FailedServers.
* Fix a test.
2018-10-09 13:52:32 +02:00
Lars Maier 6546b908be Bug fix/cleanup lost collection inc plan v (#6720)
* Increase the current version rather than the plan version.
2018-10-04 15:38:41 +02:00
Lars Maier 14d1487710 Catch all exceptions to prevent maintenance workers from crashing. (#6645)
* Catch all exceptions to prevent maintenance workers from crashing.
* Please don't free this.
* Unified code paths.
* Remove dub comment.
* Removed debug output.
* Deleted unneeded constructors.
* Assignment operator deleted.
2018-09-28 17:10:44 +02:00
Lars Maier 3dbb0558f3 Clean lost collections in supervision (#6592)
* Working draft: clean lost collections in supervision.
* Added early exit as in spec.
* Finished test. Fixed logging.
2018-09-26 16:54:29 +02:00
Simon 0a9afccde5 Fix crash on Agency / DBserver with user JWT tokens (#6594) 2018-09-26 14:26:35 +02:00
Max Neunhöffer 84735955ea Add advertised endpoints. (#6104) 2018-09-13 16:30:55 +02:00
Kaveh Vahedipour 28754cbf15 Feature/schmutz plus plus (#5972)
- Schmutz now called "Maintenance" and completely implemented in C++
 - Fix index locking bug in mmfiles
 - Fix a bug in mmfiles with silent option and repsert
 - Slightly increase supervision okperiod and graceperiod
2018-08-24 12:15:35 +02:00
Simon 468231efc5 AQL Profiling code (#5165)
* initial start of profiling

* adding profiling code

* Fixing remote block tracing, fixing width and units

* Fixing some tests

* Various fixes

* adressing review comments
2018-04-24 16:17:30 +02:00
Matthew Von-Maszewski a84f7805ad Feature/mv thread death logging (#5111)
* Initial low level interface for thread crash reporting (and management).
* Add a member version of isClusterRole()
* isolate heartbeat thread creation to new StartHeartbeatThread().  create heartbeat thread even if not a cluster or if an agent.
* update runDBServer() and runCoordinator() to shutdown more quickly by polling isStopping() at additional locations.
* copying updates from different branch / PR
* basic thread crash logging.  Not yet tied into Agency arangod or have any specific threads posting crashes
* make Supervision thread a CriticalThread
* sandwich CriticalThread between Thread and other classes to create long term, repeating thread crash reporting.
* restore code lost upon branch update relating to new startHeartbeatThread() function
* add CriticalThread.cpp to build
* add new runAgentServer() function to loop for Agents.  Make Heartbeat thread derive from CriticalThread.
* remove debug line
2018-04-23 15:50:14 +02:00
Simon 45fbed497b Supervision Job for Active Failover (#5066) 2018-04-23 12:49:41 +02:00
Kaveh Vahedipour 3d043b35a3 Feature/supervsion maintenance mode (#5108)
* Supervision goes to Maintenance mode, when /arango/Supervision/Maintenance exists
* coordinator route stands
* stop updates in transient, when supervision off
2018-04-20 13:23:22 +02:00
Matthew Von-Maszewski c0c149cf5b Create non-throwing wrappers for Node access in Agency (#4598)
* safety checkin of Node throw reduction.
* final round of Node throw protection.  Common accessors now protected to force code to hasAsXXX() functions.
2018-04-17 10:21:14 +02:00
Kaveh Vahedipour f4edcc7ba8 Bug fix/supervision engine starting early on leadership change (#5062)
* supervision must not work as long as agent is still preparing
* leadersince atomic and pushed to end of leader preparation
* More consistent use of integer types.
* Slightly change order of events in Supervision loop.
2018-04-10 15:28:26 +02:00
Kaveh Vahedipour 7f9786eb27 builder fixed for agency transaction. worked only for a single server. (#4436) 2018-02-06 23:14:53 +01:00
Kaveh Vahedipour 7715c75c59 let's not miss failedserver removal (#4208)
* let's not miss failedserver removal
* remove resetting of FailedServers in test code
* Only call abortRequestsToFailedServers at most every 3 seconds.
2018-01-03 21:55:40 +01:00
Matthew Von-Maszewski ae77ff80c2 create independent executeLockedRead and executeLockedWrite to speed performance (#4177) 2017-12-29 12:02:27 +01:00
Max Neunhöffer 7bae6980e8 Bug fix/agent lead hanger (#4147)
* Really enforce the hidden option --server.maximal-threads if given.
* Switch off --log.force-direct in scripts/startStandAloneAgency.sh
* Lower the timeout for sending AppendEntriesRPC to 150s.
* Erase _earliestPackage when becoming a leader.
* Challenge leadership in agent main loop.
* Use steady_clock for _earliestPackage.
* Change _lastAcked and _leaderSince to steady_clock as well.
* time difference calculations based on old readSystemClock to steadyClockToDouble
* All system_clock transitioned to steady_clock in Agent. Remaining system_clock are user input / output or timestamps
* Inception system_clock to steady_clock
2017-12-27 16:45:39 +01:00
Matthew Von-Maszewski 8723df7681 Fix supervisor thread crash (#4083)
* Server short name could arrive too late for first health check.  Would lead to supervisor thread crash.  Add test for this condition and defense against other unknown throws in health check.

* Correct capitalization of ShortName.  Add spaces to two Log lines.
2017-12-27 16:10:47 +01:00