357 Commits

Author SHA1 Message Date
Jan 2c5f79c9fb Make scheduler enforce queue limits (#10026)
* initial commit

* fix typo

* honor @mpoeter 's comments. Thanks!

* honor @mpoeter 's comment

* adjust scheduler queue sizes

* apply suggestion

* adjust the PR for 3.5: do not use bounded_push
2019-10-16 17:43:04 +03:00
Jan 4a749a66b3 Bug fix 3.5/multi bugs (#9792)
* add missing whitespace to make error message readable

* try to continue running scheduler threads even when there are exceptions

* give up trying to persist follower info in agency for already dropped collections.

* updated CHANGELOG
2019-08-26 15:51:14 +03:00
Jan 5a20828210 fix potential spurious wakeups in scheduler code (#9777)
* fix potential spurious wakeups in scheduler code

* added CHANGELOG entry
2019-08-21 13:10:05 +03:00
Dan Larkin-York 663212ba19 [3.5] Check scheduler queue return value (#9759)
* Add to cmake.

* Backport changes from devel.

* added CHANGELOG entry, port adjustments from devel
2019-08-20 12:57:51 +03:00
KVS85 e64080e207 Merge 3.5.1 back to 3.5 (#9713)
* Bug fix 3.5/make arangosh reconnect (#9615)

* make arangosh reconnect

* added CHANGELOG entry

* fix lagging AgencyCallbacks (#9620)

* fix lagging AgencyCallbacks

* optimizations, discussed with @mchacki

* fix wording

* updated CHANGELOG

* fix yet another undefined behavior (#9629)

* [3.5.1] Fail the FailedLeader Job if the new leader fails. (#9628)

* Fail the FailedLeader Job if the new leader fails.

* Updated changelog.

* In case of timeout do not rollback.

* Fixed catch tests.

* Changed wording.

* DELETED rollback.

* reduce wait timeouts as a mitigation for notifying waiters without ho… (#9619)

* reduce wait timeouts as a mitigation for notifying waiters without holding the required mutex

this is a quick mitigation only, which reduces maximum wait time from 1
second to 100 milliseconds without changing other behavior.

the main problem of notifying pending writers without successfully
acquiring the required mutex still needs proper addressing.

* adjust timing-dependent test

* [3.5.1] Fast Controlled Leaderchange (#9634)

* First draft of keeping in sync during controlled leader change.

* Test if server is actually the leader in plan.

* Updated changelog.

* Added oldLeader check for set-the-leader request.

* Small fixes.

* Removed LOG_DEVEL.

* less copying, more moving! 🚚 (#9645)

* attempt to fix load_balancing tests in slow test environments (#9626)

* Bug fix/fix swagger datatype (#9045) (#9602)

* Bug fix/fix swagger datatype (#9045)

* remove http so https arangos will work

* verify that query parameters are proper swagger data types, fix offending documentation files

* return the actual type - not the list of available ones

* check formats

* there is no uint64 in swagger

* Fresh Swagger

* Port TakeoverShardLeadership from devel to 3.5.1 (#9659)

* Create TakeoverShardLeader job.
* Add TakeoverShardLeadership to Action factory.
* Add log message at level debug.
* Sort out LOG_TOPIC ids.
* Fix unit tests.
* CHANGELOG.

* Bug fix 3.5/hide mmfiles specific info in web ui (#9668)

* attempt to fix load_balancing tests in slow test environments (#9626)

* Bug fix/fix swagger datatype (#9045) (#9602)

* Bug fix/fix swagger datatype (#9045)

* remove http so https arangos will work

* verify that query parameters are proper swagger data types, fix offending documentation files

* return the actual type - not the list of available ones

* check formats

* there is no uint64 in swagger

* Fresh Swagger

* hide MMFiles-specific information when we don't need it

* Ported ResignLeadership to 3.5 (#9656)

* attempt to fix load_balancing tests in slow test environments (#9626)

* Bug fix/fix swagger datatype (#9045) (#9602)

* Bug fix/fix swagger datatype (#9045)

* remove http so https arangos will work

* verify that query parameters are proper swagger data types, fix offending documentation files

* return the actual type - not the list of available ones

* check formats

* there is no uint64 in swagger

* Fresh Swagger

* Ported ResignLeadership to 3.5

* Add the actual http route.

* Aardvark: Add k Shortest Paths example graph to UI (#9491) (#9661)

* Aardvark: Add k Shortest Paths example graph to UI (#9491)

* Add example graph to UI

* Add kShortestPathsGraph to examples.js

* Update example-graph.js

* Update aardvark.js

* Regenerate UI

* add the ability to have cluster special examples (#9613) (#9663)

* add the ability to have cluster special examples

* Update get_cluster_health.md

* fix abort condition, fix negative filtering for cluster tests

* Test if job fails with unmet assertion

* Remove cluster test example

* germanize

* better skip reasons

* removing superfluous semicolons

* Revert skip reasons, too noisy

* various replication improvements: (#9675)

* attempt to fix load_balancing tests in slow test environments (#9626)

* Bug fix/fix swagger datatype (#9045) (#9602)

* Bug fix/fix swagger datatype (#9045)

* remove http so https arangos will work

* verify that query parameters are proper swagger data types, fix offending documentation files

* return the actual type - not the list of available ones

* check formats

* there is no uint64 in swagger

* Fresh Swagger

* various replication improvements:

- better debuggability (more log details)
- shorter minimum wait delay in active failover
- fixed too early pruning of WAL files on leaders

* Bug fix 3.5/fix rocksdb return code (#9692)

* attempt to fix load_balancing tests in slow test environments (#9626)

* Bug fix/fix swagger datatype (#9045) (#9602)

* Bug fix/fix swagger datatype (#9045)

* remove http so https arangos will work

* verify that query parameters are proper swagger data types, fix offending documentation files

* return the actual type - not the list of available ones

* check formats

* there is no uint64 in swagger

* Fresh Swagger

* fix return codes for concurrent writes to same documents

* [3.5] Feature/rebootid notice changes, backport of #9523 (#9684)

* Feature/rebootid notice changes, backport of #9523

* Fixed error code to not re-use an old one

* Bug fix 3.5/issue 9679 (#9682)

* attempt to fix load_balancing tests in slow test environments (#9626)

* Bug fix/fix swagger datatype (#9045) (#9602)

* Bug fix/fix swagger datatype (#9045)

* remove http so https arangos will work

* verify that query parameters are proper swagger data types, fix offending documentation files

* return the actual type - not the list of available ones

* check formats

* there is no uint64 in swagger

* Fresh Swagger

* fixed issue #9679

* bug-fix/issue-#9660 (#9704) (#9707)

* bug-fix/issue-#9660 (#9704)

* fix issue

* Update tests/js/common/aql/aql-view-arangosearch-cluster.inc

Co-Authored-By: Jan <jsteemann@users.noreply.github.com>

* Update tests/js/common/aql/aql-view-arangosearch-noncluster.js

Co-Authored-By: Jan <jsteemann@users.noreply.github.com>

* fix cluster tests

* Update CHANGELOG

* [3.5] agency node fixes (#9698)

* node fixes port from 3.4
* fixed change log

* update rocksdb statistics to deliver sums from column family instead of single value from default family. (#9706)

* Feature 3.5/geo functions (#9710)

* Add support for WGS84 on distances (#9672)

* Add area calculations (#9693)

* Update CHANGELOG
2019-08-14 20:24:47 +03:00
Jan 2209d568d0 fix undefined behavior in Scheduler calculations (#9538) 2019-07-22 19:14:16 +03:00
Michael Hackstein d5840c125a Bug fix 3.5/min replication factor (#9524)
* Cherry-pick minReplicationFactor

* Bug fix/failover with min replication factor (#9486)

* Improve collection time of IResearchQueryOptimizationTest

* Added a minReplicationFactor field in Collections. It is not possible to modify it yet and no one cares for it

* Added some assertions on minReplicationFactor

* Transaction API will now reject writes as soon as minimal replication factor is NOT fulfilled

* added minReplicationFactor to the user interface, preparation for the collection api changes

* added minReplicationFactor to VocBaseCollection, RestReplicationHandler, RestCollectionHandler, ClusterMethods, ClusterInfo and ClusterCollectionCreationInfo

* added minReplicationFactor usage to tests

* TODO TEMPORARY COMMIT FOR TESTING PLEASE REVERT ME

* minReplicationFactor can now be changed via the collection properties route

* fixed wrong assert

* added minReplicationFactor to the graph management ui

* added minReplicationFactor to the gharial api

* Fixed off-by-one error in minReplicationFactor. We actually enforced one more.

* adjusted description of minReplicationFactor

* FollowerInfo Refactoring

* added gharial api graph creation tests with minimal replication factor

* proper cleanup of shell collection tests, removed lots of duplicate code, preparation for some new tests

* added collection create tests using invalid/valid names, replicationFactor and minReplicationFactor

* Debug logging

* MORE Debug logging

* Included replication fast lane

* Use correct minreplicationfactor

* modified debug logging

* Fixed compile issues

* MORE Debug logging

* MORE Debug logging

* MORE Debug logging

* MORE Debug logging

* MORE Debug logging

* MORE Debug logging

* MORE Debug logging

* Revert "MORE Debug logging"

This reverts commit dab5af28c0.

* Revert "MORE Debug logging"

This reverts commit 6134b664bd.

* Revert "MORE Debug logging"

This reverts commit 80160bdf3b.

* Revert "MORE Debug logging"

This reverts commit 06aabcdfe1.

* Removed debug output

* Added replication fast lane. Also refactored the commands as i cannot take it any more...

* Put some requests of RocksDBReplication onto CATCHUP Lane.

* Put some requests of MMFilesReplication onto CATCHUP Lane.

* Adjusted Fast and MED lane usage in Supervised scheduler

* Added changelog entry

* Added new features entry

* A new leader will now keep old followers in case of failover

* Update arangod/Cluster/ClusterCollectionCreationInfo.cpp

Co-Authored-By: Tobias Gödderz <tobias@arangodb.com>

* Fixed JSLINT

* Unified lane handling of replication handlers

* Sorry forgotten in last commit

* replaced strings with static strings

* more use of static strings

* optimized min repl description in the ui

* decr initial loop variable

* clean up of the createWithId test

* more use of static strings

* Update js/apps/system/_admin/aardvark/APP/frontend/js/views/collectionsView.js

Co-Authored-By: Tobias Gödderz <tobias@arangodb.com>

* Added some comments on condition, renamed variable as suggested in review

* Added check for min replicationFactor to be non-zero

* Added assertion

* Added function to modify min and max replication factor in one go

* added missing semicolon

* rm log devel

* Added a second information to follower info that can keep track of followers that have been in sync before a failover has taken place

* Maintenance now reports the previous version to follower info instead of lying by itself. The Follower Info now gets a failover-safe mode to report in-sync followers

* check replFactor against nr dbservers

* Add lie reporting in CURRENT

* Reverted most of my recent commits about Failover situation. The intended plan simply does not work out

* move replication checks from logical collection to rest collection handler

* added more replication tests

* Include assert only if we are not in gtest

* jslint

* set min repl factor to zero if satellite collection

* check replication attributes in v8 collection

* Initial commit, old plan, does not yet work

* fixed ires tests

* Included FailoverCandidates key. Not fully implemented

* fixed wrong assert

* unified in sync follower reporting

* fixed compiler errors

* Cleanup locking, and fixed potential deadlocks

* Comments about locking order in FollowerInfo.

* properly check uint

* Keep old leader as potential failover candidate

* Transaction methods now use followerInfo to check if the leader can write; this might have the side effect that 'failoverCandidates' are updated

* Let agency check failoverCandidates if possible

* Initialize member variables

* Use unified follower reporting in DBServerAgencySync

* Removed obsolete variable, collecting it somewhere else

* repl factor attr check

* Reimplemented previous followers, second attempt now. PhaseOne and PhaseTwo can now synchronize on current.

* Fixed assertion, forgot an off-by-one

* adjusted test to be more precise now

* Fixed failover candidates list

* Disable write on dropping too many followers

* Allow running updateFailoverCandidates multiple times with the same leader.

* Final fixes, resilience tests now green, crossing fingers for jenkins

* Fixed race on atomics comparison

* Fixed invalid number type

* added nullptr handling

* added nullptr handling

* Removed invalid assert

* Make takeover of leadership an atomic operation

* Update tests/js/common/shell/shell-cluster-collection.js

Co-Authored-By: Tobias Gödderz <tobias@arangodb.com>

* Review fixes

* Fixed creation code to use takeoverLeadership

* Update arangod/Cluster/FollowerInfo.h

Co-Authored-By: Tobias Gödderz <tobias@arangodb.com>

* Applied review fixes

* There is no timeout

* Moved AQL + Pregel to INTERNAL_AQL lane, which is medium priority, to avoid deadlocks with Sync replication

* More review fixes

* Use difference if you want to compare two vectors...

* Use std::string ...

* Now check if we are in recovery mode

* Added documentation for minReplicationFactor

* Added readme update as well in documentation

* Removed merge conflict leftovers 0o, i should not trust the IDE

* Update js/apps/system/_admin/aardvark/APP/frontend/js/views/collectionsView.js

Co-Authored-By: Jan <jsteemann@users.noreply.github.com>

* Update js/apps/system/_admin/aardvark/APP/frontend/js/views/collectionsView.js

Co-Authored-By: Jan <jsteemann@users.noreply.github.com>

* Update Documentation/Books/Manual/Architecture/Replication/README.md

Co-Authored-By: Jan <jsteemann@users.noreply.github.com>

* Update CHANGELOG

Co-Authored-By: Jan <jsteemann@users.noreply.github.com>

* Update Documentation/Books/Manual/DataModeling/Collections/DatabaseMethods.md

Co-Authored-By: Jan <jsteemann@users.noreply.github.com>

* Update Documentation/Books/Manual/ReleaseNotes/NewFeatures35.md

Co-Authored-By: Jan <jsteemann@users.noreply.github.com>

* Update Documentation/DocuBlocks/Rest/Collections/1_structs.md

Co-Authored-By: Jan <jsteemann@users.noreply.github.com>

* Update js/apps/system/_admin/aardvark/APP/frontend/js/views/graphManagementView.js

Co-Authored-By: Jan <jsteemann@users.noreply.github.com>

* Update js/apps/system/_admin/aardvark/APP/frontend/js/views/graphManagementView.js

Co-Authored-By: Jan <jsteemann@users.noreply.github.com>

* Update Documentation/DocuBlocks/Rest/Graph/1_structs.md

Co-Authored-By: Jan <jsteemann@users.noreply.github.com>

* Apply suggestions from code review

Co-Authored-By: Jan <jsteemann@users.noreply.github.com>

* Adapted review requests, thanks for finding!

* Removed unnecessary const

* Apply suggestions from code review

Co-Authored-By: Jan <jsteemann@users.noreply.github.com>

* Moved initialization of variable more downwards

* Apply lock before notify_all()

* Remove documentation except DocuBlocks, covered by PR in docs repo

* Remove accidental indent
2019-07-22 17:48:34 +03:00
Jan 47d7f9e3e2 Bug fix 3.5/fix uninitialized value (#9515)
* fix uninitialized time point value, which led to potential bogus error messages

* fix typo in info message
2019-07-18 23:26:51 +03:00
Lars Maier c4ff5ccce9 [Devel] Queue-Full-Logging (#9388) 2019-07-03 13:07:24 +02:00
Jan 727bb39f1a fix JSON statistics (#9385) 2019-07-02 18:24:04 +02:00
Tobias Gödderz f4baf0c436 Specialize scheduler pointer to allow devirtualization of method calls (#9225) 2019-06-21 11:15:18 +02:00
Wilfried Goesgens 6fd3dd024c windows warning as errors, masquerade remaining unfixable errors (#9249) 2019-06-17 16:21:29 +02:00
Matthew Von-Maszewski 7119ccfdf1 copy logic of 3.4 scheduler that leaves threads available for FAST lane work (#9235) 2019-06-11 11:06:42 -04:00
Jan 6a07476c41 don't include the Logger in header files if it's not necessary (#9216) 2019-06-07 10:08:03 +02:00
Tobias Gödderz 4fbff3280f Added missing initialization; use alignas instead of padding; mark move constructor noexcept (#9111) 2019-05-27 18:50:20 +02:00
Jan 3d79491f36 exclude VST requests from direct execution (#9075) 2019-05-24 09:53:15 +02:00
Jan 79258e072a Bug fix/remove io task (#9056) 2019-05-22 14:34:49 +02:00
jsteemann 0e24c18253 re-add optimization for single servers 2019-05-22 14:29:12 +02:00
jsteemann d0868e76f8 fix "random" scheduler hangs 2019-05-22 14:01:03 +02:00
Lars Maier 4fc2790863 [devel] Direct Exec Scheduler (#9004) 2019-05-20 11:38:57 +02:00
Jan 976dc2b726 Bug fix/issues 2019 05 06 (#8913) 2019-05-07 12:17:16 +02:00
Jan 1408654d2c fix a few issues in BackupNoAuthSysTest (#8868) 2019-04-29 19:13:39 +02:00
Jan 449ab1ed8e Bug fix/cppcheck 13042019 (#8752) 2019-04-15 10:13:56 +02:00
Jan e36f7d429e Bug fix/fix scheduler shutdown task assertion (#8727) 2019-04-10 19:51:45 +02:00
Jan Christoph Uhde c3f7961b88 apply unique log ids (#8561) 2019-03-25 20:26:51 +01:00
Jan 23f7fc1368 fix some memleaks (#8432) 2019-03-18 18:11:37 +01:00
Kaveh Vahedipour ee751e8ba3 [devel] clear compilation warnings (#8345) 2019-03-08 10:35:09 +01:00
Wilfried Goesgens 492d05c1f1 Feature/upgrade v8 7.1.302.28 (#8088) 2019-02-19 11:15:34 +01:00
jsteemann 004c1cd642 fix warnings 2019-02-07 13:20:27 +01:00
Jan 1a1f6935c6 🚨 Fix clang warnings (#8123) 2019-02-07 13:06:03 +01:00
Simon a392bbe224 Fix race in supervised scheduler (#8117) 2019-02-06 15:56:22 +01:00
Manuel Pöter ecf4d9d62a Fix race conditions in thread management. (#8032) 2019-01-28 15:44:46 +01:00
Simon f748aee240 Added collectAll, updated fuerte (#7949) 2019-01-16 11:31:08 +01:00
Jan fa7de56cf8 upgrade to boost 1.69.0 (#7910) 2019-01-09 17:17:33 +01:00
Lars Maier 423cf7a8d4 cppcheck/Scheduler (#7909) 2019-01-08 16:39:56 +01:00
jsteemann 14faf75f16 fix compile warnings 2019-01-08 11:44:13 +01:00
Lars Maier 12eebb15fe Feature/new server infra (#7733)
* Decoupled IO handling from Scheduler.

* Fixed SSL start up bug.

* Replaced Scheduler with new worker farm implementation.

* Added minimal statistics and info string for Scheduler.

* Added support for timed submissions.

* Updated delayed submission api. Updated code that used timers.

* Extracted new Scheduler into a virtual parent class. The implementation can now depend on the usecase.

* Signal handler now working.

* Changed threads names, `_stop` is atomic, check for failure during thread start + exception handling like old scheduler did.

* Commented on source code and added TODOs.

* Played around with start-stop-conditions

* Play around with start stop condition.

* start stop cond

* Start Stop Conditions

* Removed bad cv_status check.

* Bug fix: now compare the actual objects instead of pointer values. Setup t1 and t2 depending on the thread id.

* Moved most of the stuff now unrelated to the Scheduler to GeneralServer. Got rid of JobGuard.

* Instead of waiting for a thread to terminate, put it on a clean up list and check for its termination in each supervisor run.

* Allow detaching long running threads.

* Fixed test mock.

* Updated the WorkHandle logic. Removed post functions.

* Fixed crash when obtaining shared_ptr from this in destructor.

* Added lost mutex.

* Fixed memory leak.

* Fixed merge bug.

* Changed a lot of code to optimize the scheduler.

* Fixed bug of invalidated iterator. Don't remove task on shutdown at different places. Let scheduler threads run until queue is empty.

* Only by value calls to queue.

* Added options again.

* Clean up of code.

* UI Request Lane added.

* Bug fixes in Scheduler.

* Applied reformat.

* Use sigaction.
2019-01-08 10:12:02 +01:00
Frank Celler ac9f375fb5 big reformat 2018-12-26 00:54:03 +01:00
Jan 5bae3742e5 Feature/internal 3306 (#7683) 2018-12-06 16:19:28 +01:00
jsteemann b4e888ef13 revert Scheduler changes 2018-11-26 09:57:53 +01:00
Jan c7869f1c46 Bug fix/remove shutdown assertion (#7388) 2018-11-22 15:35:55 +01:00
Matthew Von-Maszewski 0d39ff66f5 Bugfix: backport defensive Communicator change and revert constant change in Scheduler (#7214)
* revert accidental change to MIN_SECONDS

* from bugfix-3.4/mv-communicator-defensive:  simplify lambda usage to static functions.
2018-11-05 15:18:31 -06:00
Simon c72818a9dc Make ensureIndexOnCoordinator more robust (#7110) 2018-10-29 17:45:46 +01:00
Matthew Von-Maszewski 97ba8ca2be Bugfix: More 3.4 scheduler changes backported (#7091) 2018-10-26 17:09:20 +02:00
jsteemann ac19d0a627 pass by reference 2018-10-12 18:21:50 +02:00
Lars Maier fac7b48c74 [3.5] Feature/decoupled io (#6281)
* Decoupled IO from Scheduler.
* Fixed SSL start up bug.
* Updated messages and thread names. Fixed missing code from cherry-pick.
* Reintroduced checks for executing thread to be correct. Modified default value for io-context depending on cores.
* Fixed memory leak caused by cyclic references.
* Actually distribute endpoints. Move handlers into function and do not copy them for each encapsulation.
* Inserted debug output.
* BUG FIXED! One has to call drain() on every queue as a temporary workaround.
* Added some flags and output for testing.
* More debug output!!!
* Manuel is right.
* Removed debug output.
2018-10-08 13:05:12 +02:00
Simon 5837291495 Debug logs for ActiveFailover (#6684) 2018-10-02 15:10:50 +02:00
Simon 0ef43eefa3 close socket when shutting down (#6650) 2018-09-28 17:50:39 +02:00
Simon 912f109968 Add simple Future library (#6464) 2018-09-21 16:14:17 +02:00
Jan 8b26c9db3c Bug fix/fix ssl vst (#6547) 2018-09-19 23:21:46 +02:00