arangodb

Commit Graph

Author	SHA1	Message	Date
Max Neunhöffer	54f84cab92	Performance tuning for many shards. (#8577)	2019-03-29 21:34:45 +01:00
Max Neunhöffer	46e479376d	Further supervision fixes. (#8259) * Do not schedule Coordinators in Plan. * Finish failed server when server is no longer in health. * Fix removeServer checks. Check that server is no longer in use before removing it. Give 60s waiting time for condition to be met. Also observer agency lock. * Finish FailedFollower job if server no longer follower. This can happen because RemoveFollower was faster. * Only use GOOD servers as replacement followers. * Fix AddFollower for satellite collections. * Fix RemoveServer for satellite collections. * MoveShard handles moves from leader to followers * Prepare CleanoutServer and FailedServer for satellite collections. * More sorting out of AddFollower and RemoveFollower. * Fix RemoveFollower job w.r.t. choice of follower to remove. * Fix message. * kill you own sub jobs, please * Added preconditions to payloads for supervision's job finishers * Improve logging. * Add agency diagnostics to failed move shard test, start. * Add coordinator agency diagnostics. * Remove warning. * Add changelog entry. * Add agency diagnostics if things go sour with move shard. * Add agency diags when things go wrong 2. * API /_api/agency/state: back to old format. * Fix Windows compilation. * handle aborts in supervision and wait for the last Raft log to be committed * tests compiling, 2 failing for valid reasons * Correctly report TRI_ERROR_CLUSTER_CONNECTION_LOST as 503. * FailedLeader /FailedFollower cannot continue, when aborting blocks	2019-03-04 11:43:35 +01:00
Frank Celler	9477af198b	big reformat	2018-12-26 00:57:05 +01:00
Kaveh Vahedipour	860fa21219	Bug fix 3.4/index readiness (#6716) * backport of test data generation for maintenance from devel * 3.4 working * fixing index use in cluster while still being built * fixed broken views * correct 200 for ensureIndex * merge with 3.4 * agency comm to handle replace in array * supervision changes * cluster info's exsureIndex * 3.4 ready * timeout * missing files from origin * neunhoef complaints * bogus entry * no need to wait for current once again * no longer necessary. done in IndexFactory now * correct comments * left overs * dead code revived * Move CHANGELOG entry to the right place.	2018-11-21 14:41:36 +01:00
Lars Maier	0e9aa10c2a	Feature 3.4/cleanup lost collections (#6627) * Working draft: clean lost collections in supervision. * Added early exit as in spec. * Finished test. Fixed logging.	2018-09-27 10:35:39 +02:00
Kaveh Vahedipour	28754cbf15	Feature/schmutz plus plus (#5972) - Schmutz now called "Maintenance" and completely implemented in C++ - Fix index locking bug in mmfiles - Fix a bug in mmfiles with silent option and repsert - Slightly increase supervision okperiod and graceperiod	2018-08-24 12:15:35 +02:00
Matthew Von-Maszewski	a84f7805ad	Feature/mv thread death logging (#5111) * Initial low level interface for thread crash reporting (and management). * Add a member version of isClusterRole() * isolate heartbeat thread creation to new StartHeartbeatThread(). create heartbeat thread even if not a cluster or if an agent. * update runDBServer() and runCoordinator() to shutdown more quickly by polling isStopping() at additional locations. * copying updates from different branch / PR * basic thread crash logging. Not yet tied into Agency arangod or have any specific threads posting crashes * make Supervision thread a CriticalThread * sandwich CriticalThread between Thread and other classes to create long term, repeating thread crash reporting. * restore code lost upon branch update relating to new startHeartbeatThread() function * add CriticalThread.cpp to build * add new runAgentServer() function to loop for Agents. Make Heartbeat thread derive from CriticalThread. * remove debug line	2018-04-23 15:50:14 +02:00
Kaveh Vahedipour	3d043b35a3	Feature/supervsion maintenance mode (#5108) * Supervision goes to Maintenance mode, when /arango/Supervision/Maintenance exists * coordinator route stands * stop updates in transient, when supervision off	2018-04-20 13:23:22 +02:00
Kaveh Vahedipour	7f9786eb27	builder fixed for agency transaction. worked only for a single server. (#4436)	2018-02-06 23:14:53 +01:00
Kaveh Vahedipour	255d90d26a	cherry pick from 3.2 pull request for bug-fix/supervision-thread-exists-on-pre3.2-agency (#3709) This is the HealthRecord upgrade patch.	2017-11-17 10:14:14 +01:00
Jan	bef52d7dc3	Bug fix/cleanup after cppcheck (#3639)	2017-11-10 13:53:28 +01:00
Kaveh Vahedipour	627f344266	fixed a bug, where when servers failed, when also agency leadership c… (#3189) * fixed a bug, where when servers failed, when also agency leadership changes * redid entire design of checkDBServers/checkCoordinators. * comparison in supervision must be between oldPersisted and newHealth * UI stuff * UI stuff * FailedServer test needed adjustment * Hopefully final round * fixed supervision failure detection * FailedServer tests back to origin devel * oldNot documented among preconditions in Agency HTTP API docs * changed only look for status updated * non action line in api-cluster	2017-09-07 16:10:23 +02:00
Kaveh Vahedipour	00650e6a3f	Bug fix/agency mt fixes (#3158) * added debugging methods * try to fix invalid access in case of error * remove unused members * bugfixes and comments * all agency fixes in * merge bug * partially unguarded Agent::lead fixed * all agency fixes in * added nrBlocked to thread startup eval * added nrBlocked to thread startup eval * recombination of cases in State::get * some maps replaced with unordered_maps * optimized maps some	2017-08-30 10:43:51 +02:00
Andreas Streichardt	fe59502848	Fix server health	2017-05-11 12:20:15 +02:00
Kaveh Vahedipour	68efba18e8	keep agencyPrefix, when non set	2017-04-26 15:32:26 +02:00
Kaveh Vahedipour	1f81ce28b0	merge in cpp & js from 3.1.18 yet to do tests	2017-04-21 15:41:05 +02:00
Kaveh Vahedipour	8d66d69f83	supervision handles coordinator demise correctly	2017-02-07 11:29:37 +01:00
Kaveh Vahedipour	aaee2f9e61	transient heartbeats	2017-01-18 13:43:33 +01:00
Kaveh Vahedipour	55985ed5de	missing prototypes	2017-01-09 10:38:34 +01:00
jsteemann	7359ac44b2	more style cleanup	2017-01-05 10:52:03 +01:00
Kaveh Vahedipour	12e54902df	agency's supervision must wait grace period after becoming leader before acting on db server failure	2016-12-21 11:17:41 +01:00
Max Neunhoeffer	985ccaeb70	Get rid of Supervision::wakeUp().	2016-12-20 10:19:24 +01:00
Kaveh Vahedipour	51b279346b	redirects to myelf should be hinstory	2016-12-06 17:10:15 +01:00
Andreas Streichardt	63a173f002	Delete all shard move jobs when server is healthy again	2016-11-22 14:13:09 +01:00
Kaveh Vahedipour	9a6f605f2f	fixed small double / long conversion	2016-10-31 17:00:55 +01:00
Kaveh Vahedipour	f8235b9c63	agency locks code review	2016-10-25 15:07:57 +02:00
Max Neunhoeffer	3a76784af4	Protect memory accesses to _snapshot in Supervision.	2016-10-12 10:23:21 +00:00
Kaveh Vahedipour	1f4abf3c36	upgrade 3.0 agency to 3.1	2016-10-06 17:04:29 +02:00
jsteemann	f5a595f464	Merge branch 'devel' of https://github.com/arangodb/arangodb into generic-col-types	2016-09-07 08:52:07 +02:00
Andreas Streichardt	6396ac4dc7	Implement removeServer job	2016-09-06 16:49:25 +02:00
jsteemann	6ddf8bab54	Merge branch 'devel' of https://github.com/arangodb/arangodb into generic-col-types	2016-09-06 11:22:14 +02:00
Kaveh Vahedipour	85ea1d5ff9	clang-format	2016-09-06 10:01:33 +02:00
Andreas Streichardt	f9fea70c3e	readd method	2016-09-05 15:50:41 +02:00
Kaveh Vahedipour	9808a55a33	some cleaning up	2016-09-05 15:12:46 +02:00
jsteemann	c6efe26198	cppcheck	2016-08-25 14:04:23 +02:00
Andreas Streichardt	89ebeefbb9	Proper shutdown	2016-08-24 13:51:23 +02:00
Andreas Streichardt	47a0f8602a	Better shutdown handling	2016-08-23 12:51:38 +02:00
Andreas Streichardt	03b9d97e2f	Implement proper cluster shutdown	2016-08-18 11:23:23 +02:00
Andreas Streichardt	3f412debf0	Revert futile attempts to implement client resilience tests	2016-08-17 18:12:40 +02:00
Andreas Streichardt	70af1e3647	Implement proper cluster shutdown	2016-08-17 17:25:39 +02:00
Andreas Streichardt	526c8f42c2	Fix foxx issues in cluster Bootstrap will now be done on the bootstrap coordinator. queues will now be executed by the "foxxmaster"	2016-07-29 16:06:31 +02:00
jsteemann	f21561b25f	use nullptr, don't include Thread.h when unnecessary	2016-06-15 19:21:53 +02:00
Kaveh Vahedipour	beba4887a3	shrink cluster in supervision	2016-06-10 18:10:37 +02:00
Kaveh Vahedipour	00d6111a3e	server health for aardvark	2016-06-03 14:27:04 +02:00
Kaveh Vahedipour	427453bcc7	server health for aardvark	2016-06-03 12:19:39 +02:00
Kaveh Vahedipour	9957270df6	hunting down exceptions in agency supervision	2016-05-31 21:42:41 +02:00
Max Neunhoeffer	b600ddbeb4	Fix getUniqueIds and updateAgencyPrefix in Supervision. This prevents some race conditions at cluster startup that crashed the agency.	2016-05-31 12:38:17 -06:00
Kaveh Vahedipour	7b440f94dc	Moving Job classes out of Supervision	2016-05-31 16:28:54 +02:00
Kaveh Vahedipour	bad7a6a35a	leader fail seems good	2016-05-31 15:21:42 +02:00
Kaveh Vahedipour	68478f530d	visual studio warning	2016-05-30 15:47:08 +02:00

75 Commits