https://forge.softwareheritage.org/source/swh-scheduler.git

Revision Author Date Message Commit Date
a298a20 Merge branch 'generated-differential-D6587-source' into 'generated-differential-D6587-target' Improve docs rendering for recurrent visits scheduler See merge request swh/devel/swh-scheduler!262 06 January 2023, 22:12:34 UTC
7f434c3 Improve docs rendering for recurrent visits scheduler 29 October 2021, 13:44:56 UTC
50d7fd7 Add a new cli endpoint to schedule recurrent visits in Celery For each known visit type, we run a loop which: - monitors the size of the relevant celery queue - schedules more visits of the relevant type once the number of available slots goes over a given threshold (currently set to 5% of the max queue size). The scheduling of visits combines multiple scheduling policies, for now using static ratios set in the `POLICY_RATIOS` dict. We emit a warning if the ratio of origins fetched for each policy is skewed with respect to the original request (allowing, for now, manual adjustment of the ratios). The CLI endpoint spawns one thread for each visit type, which all handle connections to RabbitMQ and the scheduler backend separately. For now, we handle exceptions in the visit scheduling threads by (stupidly) respawning the relevant thread directly. We should probably improve this to give up after a specific number of tries. Co-authored-by: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org> 28 October 2021, 11:06:56 UTC
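The loop described in 50d7fd7 can be sketched roughly as follows. This is only an illustration under stated assumptions: the function names and the example ratios are hypothetical, and only the 5% threshold and the existence of a `POLICY_RATIOS` dict come from the commit message.

```python
from typing import Dict

# Hypothetical ratios for illustration only; the real values are whatever
# the POLICY_RATIOS dict mentioned in the commit message contains.
EXAMPLE_POLICY_RATIOS: Dict[str, float] = {
    "policy_a": 0.5,
    "policy_b": 0.25,
    "policy_c": 0.25,
}


def slots_to_fill(
    queue_length: int, max_queue_length: int, threshold: float = 0.05
) -> int:
    """Return how many visits to schedule, or 0 while the number of free
    slots has not gone over the threshold (5% of the max queue size)."""
    available = max(0, max_queue_length - queue_length)
    if available > max_queue_length * threshold:
        return available
    return 0


def split_by_policy(num_visits: int, ratios: Dict[str, float]) -> Dict[str, int]:
    """Split a number of visits between policies using static ratios."""
    return {policy: int(num_visits * ratio) for policy, ratio in ratios.items()}
```

For example, a queue capped at 1000 entries and currently holding 900 has 100 free slots, which is over the threshold of 50, so 100 visits would be split between the policies.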
0c7ef27 grab_next_visits: avoid time interval calculations in PostgreSQL When the database is in a non-UTC timezone with DST, and a `timestamptz - interval` calculation crosses a DST change, the result of the calculation can be one hour off from the expected value: PostgreSQL will vary the timestamp by the amount of days in the interval, and will keep the same (local) time, which will be offset by an hour because of the DST change. Doing the datetime +- timedelta calculations in Python instead of PostgreSQL avoids this caveat altogether. 27 October 2021, 13:45:09 UTC
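A minimal sketch of the approach taken in 0c7ef27, assuming a hypothetical helper named `visit_cutoff`: the subtraction is done in Python on the UTC timeline, and the resulting timestamp would then be passed to the SQL query as a plain parameter instead of letting PostgreSQL evaluate `timestamptz - interval`.

```python
from datetime import datetime, timedelta, timezone


def visit_cutoff(now: datetime, cooldown: timedelta) -> datetime:
    """Return `now - cooldown`, computed on the UTC timeline.

    Hypothetical helper: converting to UTC first means the result is always
    exactly `cooldown` earlier in elapsed time, whatever the local timezone
    and whether or not a DST change happened in between.
    """
    if now.tzinfo is None:
        raise ValueError("now must be a timezone-aware datetime")
    return now.astimezone(timezone.utc) - cooldown
```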
ecc0e28 Restrict the click version to avoid a version conflict with celery's Otherwise, in some edge cases, such as running in docker, the install fails on a conflict. Related to P1205#8092 22 October 2021, 09:21:36 UTC
243a69f Add docstring to runner and listener modules Related to T3667 20 October 2021, 09:25:38 UTC
5b53196 Drop deprecated listener module It's been deprecated for enough time. Related to T3667 20 October 2021, 09:02:11 UTC
f15c510 scheduler: Deprecate unused main celery runner 20 October 2021, 08:31:28 UTC
3aed688 Use swh_storage fixture for cli tests This actually fixes the debian build failure. Related to T3666 18 October 2021, 12:16:38 UTC
3aed7bf Return 0 slot if no more slots available in the queues This scenario happens with the loader oneshot for example. This loader deals with more than 1 type of origins to ingest in the same queue, so the computation of that function returned a negative value, which ultimately cannot be executed in SQL [1]. This commit fixes that behavior. It also makes explicit in the docstring that the function must return non-negative values. [1] ``` ... psycopg2.errors.InvalidRowCountInLimitClause: LIMIT must not be negative ``` 15 October 2021, 13:22:52 UTC
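The fix in 3aed7bf amounts to clamping the computed number of slots at zero. A minimal sketch, assuming a hypothetical standalone function (the real utility's signature is not shown in this log):

```python
def available_slots(queue_length: int, max_queue_length: int) -> int:
    """Return the number of free slots in a queue, never less than 0.

    Hypothetical sketch: a queue shared by several origin types can hold
    more messages than the configured maximum, in which case the plain
    subtraction would be negative and unusable as a SQL LIMIT.
    """
    return max(0, max_queue_length - queue_length)
```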
ecc1400 runner: Improve help message on the task types flag. 02 September 2021, 09:15:36 UTC
63fdda0 send-to-celery: Add more options to allow scheduling of edge cases In non-optimal cases, we may want to trigger specific cases (not-yet-enabled origins, origins from a specific lister...). Related to T3350 27 August 2021, 11:26:38 UTC
7cc37fa Refine scheduling policy for origins with no known last update For origins that have never been visited, and for which we don't have a queue position yet, we want to visit them in the order they've been added. 26 August 2021, 14:49:37 UTC
2efad28 Add a swh scheduler origin send-to-celery subcommand The subcommand bypasses the legacy task-based mechanism to directly send new origin visits to celery 26 August 2021, 14:48:46 UTC
5e8007f Add table sampling option to grab_next_visits Running common operations on all git origins is pretty intense. Using table sampling gives us the opportunity to at least schedule some jobs in (decently small) time. 26 August 2021, 14:47:52 UTC
cc76a57 journal_client: Only upsert if we have something to upsert 26 August 2021, 09:44:14 UTC
506f78c journal_client: Ensure queue position does not overflow Queue positions are dates, and the next_position_offset used to compute the new queue position was not bounded, which could cause overflow errors. This commit adapts the journal client computations to cap next_position_offset at 10. This value was chosen because above that exponent the dates overflow (and we are way in the future already). Related to T3502 26 August 2021, 08:24:11 UTC
28ae1d8 Replace index-fossology-license-for-range with index-fossology-license-for-partition We changed the task name/interface a while ago 18 August 2021, 09:20:25 UTC
8281e35 journal_client: Disable origins when too many visit attempts failed This disables origins after 3 failed or not found visit attempts in a row. It's not definitive though, as it's the lister's responsibility to reactivate origins if they get listed again. Related to T2345 03 August 2021, 11:56:32 UTC
1bcf84d Add a successive_visits counter to origin visit stats This maintains the number of successive visits resulting in the same status. This will help implementing disabling of too many successive failed or not_found visits for a given origin. Related to T2345 03 August 2021, 10:49:45 UTC
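The successive_visits counter from 1bcf84d and the disabling rule from 8281e35 can be sketched together as below. The function name and signature are hypothetical; only the "same status in a row" counting, the `failed` and `not_found` statuses, and the limit of 3 come from the commit messages.

```python
from typing import Optional, Tuple

MAX_SUCCESSIVE_FAILURES = 3  # limit stated in 8281e35


def update_successive_visits(
    previous_status: Optional[str], previous_count: int, new_status: str
) -> Tuple[int, bool]:
    """Return the new successive_visits count and whether to disable the origin.

    Hypothetical sketch: the counter grows while visits keep ending with the
    same status and restarts at 1 as soon as the status changes.
    """
    count = previous_count + 1 if new_status == previous_status else 1
    disable = (
        new_status in ("failed", "not_found") and count >= MAX_SUCCESSIVE_FAILURES
    )
    return count, disable
```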
4fa29fe journal_client: Update get_last_status docstring Related to T2345 30 July 2021, 13:35:17 UTC
3b929d0 journal_client: Refactor by inlining the update_position_offset This is no longer required as it's called once. Related to T2345 30 July 2021, 13:23:14 UTC
87e66fa Only record last_visited and last_successful in origin_visit_stats After using this schema for a while, all queries can be implemented in terms of these two timestamps, instead of the four original last_eventful, last_uneventful, last_failed and last_notfound timestamps. This ends up simplifying the logic within the journal client, as well as that of the grab_next_visits query builder. To make this change work, we also stop considering out of order messages altogether in journal_client. This welcome simplification is an accuracy tradeoff that is explained in the updated documentation of the journal client: .. [1] Ignoring out of order messages makes the initialization of the origin_visit_status table (from a full journal) less deterministic: only the `last_visit`, `last_visit_state` and `last_successful` fields are guaranteed to be exact, the `next_position_offset` field is a best effort estimate (which should converge once the client has run for a while on in-order messages). 23 July 2021, 09:56:32 UTC
3ca0d65 test_journal_client: Unify test assertion like the rest Related to D5917 23 July 2021, 07:22:46 UTC
8cf2238 test: Refactor assert_visit_stats_ok to ignore_fields This simplifies and unifies properly the utility test function to compare visit stats. 23 July 2021, 07:18:20 UTC
d58776a Introduce new scheduling policy to grab origins without last update This is in charge of scheduling origins without last update. This also updates the global queue position so the journal client can initialize correctly the next position per origin and visit type. Related to T2345 22 July 2021, 10:23:44 UTC
825e8cf grab_next_visits: make the handling of CTEs more modular This allows us to insert extra CTEs if a scheduling policy needs it. 22 July 2021, 10:19:42 UTC
8c4ae9f journal_client: Compute next position for origin visit For origins without any last_update information [1], the journal client is now also in charge of moving their next position in the queue for rescheduling. Depending on their status, the next position offset and next_visit_queue_position are updated after each visit completes: - if the visit has failed, increase the next visit target by the minimal visit interval (to take into account transient loading issues) - if the visit is successful, and records some changes, decrease the visit interval index by 2 (visit the origin *way* more often). - if the visit is successful, and records no changes, increase the visit interval index by 1 (visit the origin less often). We then set the next visit target to its current value + the new visit interval multiplied by a random fudge factor (picked in the -/+ 10% range). The fudge factor allows the visits to spread out, avoiding "bursts" of loaded origins e.g. when a number of origins from a single hoster are processed at once. Note that the computations happen for all origins for simplicity and code maintenance, but they will only be used by an upcoming scheduling policy. [1] Lister cannot provide it for some reason. 06 July 2021, 12:35:13 UTC
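The update rules listed in 8c4ae9f can be sketched as follows. Several details are assumptions made for the sake of a runnable example and are not stated in the commit message: the one-day minimal interval, the mapping from interval index to interval (here a power of two), the lower bound of 0 on the index, and the status name `"failed"`. The upper bound of 10 is the cap introduced later by 506f78c.

```python
import random
from datetime import datetime, timedelta
from typing import Tuple

MIN_INTERVAL = timedelta(days=1)  # assumption: minimal visit interval
MAX_OFFSET = 10  # cap on the interval index, as introduced by 506f78c


def next_position(
    current: datetime, offset: int, status: str, eventful: bool, rng=random
) -> Tuple[datetime, int]:
    """Return the new queue position and interval index after a visit."""
    if status == "failed":
        # Failed visit: retry after the minimal interval, index unchanged.
        interval = MIN_INTERVAL
    else:
        if eventful:
            offset = max(0, offset - 2)  # changes seen: visit way more often
        else:
            offset = min(MAX_OFFSET, offset + 1)  # no change: visit less often
        interval = MIN_INTERVAL * 2**offset  # assumption: exponential intervals
    # Random fudge factor in the -/+ 10% range, to spread visits out.
    fudge = rng.uniform(0.9, 1.1)
    return current + interval * fudge, offset
```

Passing the random source as a parameter is a choice made here so that the computation can be checked with a fixed fudge factor.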
cb1edf1 Introduce storage for the recurrent visit scheduler queue position 01 July 2021, 08:36:44 UTC
ec6e69f Start handling of recurrent loading tasks in scheduler This deals first and foremost with the next_position_offset update done by the scheduler journal client. 01 July 2021, 08:36:44 UTC
c486b28 journal_client: Explicit docstring 29 June 2021, 13:16:15 UTC
98f99b9 journal_client: Only check last_* fields for some permutation tests In a future commit, we will add new fields whose values will be permutation dependent. 23 June 2021, 15:02:34 UTC
1006f0a journal_client: Auto-generate the empty object from model fields This will help us when adding new fields to the table. 23 June 2021, 14:54:34 UTC
6400cc2 backend: Auto-generate origin visit stats upsert query This will help us when adding new fields to the table. 23 June 2021, 14:54:34 UTC
3762c34 cli/task: Ensure cli output is always in the same order 23 June 2021, 14:54:34 UTC
ed81870 Add a specific cooldown for notfound origins This allows us to avoid repeating visits on them, until a next pass of the lister can mark them as disabled. 23 June 2021, 09:13:00 UTC
651ddcc Add a (longer) specific cooldown for failed origin visits 23 June 2021, 09:13:00 UTC
ce8608d Make the origin visit scheduling cooldown configurable 23 June 2021, 09:13:00 UTC
7f51f27 interface: Add get_listers method Add new method to scheduler interface returning the full list of listers registered in the database. Related to T3127 22 June 2021, 12:36:08 UTC
9e1b414 Drop duplicate docstring from backend 21 June 2021, 13:46:12 UTC
c7707b5 runner: Separate scheduling tasks with and without priority concern In effect, this will allow running 2 runners: - one for recurring tasks - one for save code now tasks This should decrease the probability of save code now scheduling tasks getting stuck behind the main scheduler runner. Related to T3367 10 June 2021, 12:55:04 UTC
21c4279 Refactor and extract a get_available_slots utility This adds coverage as well. This will be needed for subsidiary diffs. Related to T3367 10 June 2021, 10:15:22 UTC
9d2618d Add typing stubs dependencies for mypy>0.900 This also makes missing dependencies explicit 09 June 2021, 12:13:36 UTC
9f7ab8f pytest_plugin: Explicitly set hostname in broker_url for celery TestApp Since the release of kombu 5.1.0, a warning is now issued when a hostname is not set in the broker_url config value of a celery app. That change makes the test_celery_monitor_ping test fail due to that new unexpected warning. So explicitly add the localhost hostname in the broker_url value of the celery TestApp config. 25 May 2021, 11:43:03 UTC
fe9d949 Fix flaky test_grab_ready_* tests 06 May 2021, 14:20:57 UTC
8a892e2 Use swh.core 0.14 It renamed db_name to dbname, which is a breaking change. 06 May 2021, 13:49:47 UTC
bab557e Remove row locking from SQL queries This would only be useful if we had multiple runners running concurrently, but that's not the case. 30 April 2021, 18:13:38 UTC
feff179 tox: Add sphinx environments to check sane doc build This makes it possible to check that the package documentation can be built without producing sphinx warnings. The sphinx environment is designed to be used in continuous integration in order to prevent breaking the documentation build when committing changes. The sphinx-dev environment is designed to be used inside a full swh development environment. Related to T3258 26 April 2021, 16:01:59 UTC
f186910 Add default index to task(type, next_run) in schema The staging scheduler runner was slow when fetching tasks due to that missing index. Related to T3271#63831 20 April 2021, 13:50:19 UTC
f33f743 Simplify priority computation in tests + improve exhaustivity We no longer need to deal with ratios, so let's count the objects directly instead. Plus, the existing tests did not check tasks with None priority (because they did not have access to it when ratios were given by the backend), so they do now. 20 April 2021, 11:01:33 UTC
f4e6292 sql/updates/27: Fix sql upgrade script Related to T3271 20 April 2021, 10:18:23 UTC
befccb9 scheduler: Clean up priority/ratio task dead code Since [1], tasks with priority are routed to dedicated queues (see tasks for more details). The tasks with priority to be scheduled have their own dedicated endpoints to be called. [1] Related to T3084 Related to T3271 20 April 2021, 09:27:18 UTC
4e06bcd Parse task_ids before calling set_status_tasks. So errors on the CLI side do not trigger an exception on the server 20 April 2021, 09:19:52 UTC
974c0c2 tests: Complete checks on message with priority consumption Related to T3084 15 April 2021, 12:57:25 UTC
17052c4 Route priority tasks to dedicated save code now queues This splits the calls to read tasks into 2 calls, one for tasks with no priority (standard), another call for tasks with priority. If any tasks with priority are detected, they are routed to dedicated `save_code_now:` prefixed named queues (per task type). Related to T3084 15 April 2021, 11:24:13 UTC
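The routing rule from 17052c4 can be illustrated with a small sketch. The function name and the example queue name are hypothetical; the `save_code_now:` prefix is the one given in the commit message.

```python
from typing import Optional


def target_queue(task_queue: str, priority: Optional[str]) -> str:
    """Return the queue a task should be sent to.

    Hypothetical sketch: tasks without priority keep their standard queue,
    tasks with any priority go to the `save_code_now:`-prefixed queue of
    the same task type.
    """
    if priority is None:
        return task_queue
    return f"save_code_now:{task_queue}"
```

With an illustrative queue named `example.tasks.LoadGit`, a task with priority `"high"` would be routed to `save_code_now:example.tasks.LoadGit`.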
bfc1a87 Fix various Sphinx warnings 15 April 2021, 08:19:50 UTC
3e2ae3d backend: Open endpoints to peek/grab tasks with any priority The notion of priority becomes blurred: any task with a non-null priority is considered for reading or grabbing. In a future commit, this should allow the runner to evolve to reroute tasks with priority to other queues. Related to T3084 13 April 2021, 16:05:29 UTC
ecab745 Make origin_visit_stats_get return results from all pages psycopg2.extras.execute_values executes queries in batches of 100 by default. At the end of execute_values, only the last batch of results is available in the cursor. To fetch all results, one needs to set fetch=True instead of using the cursor. 11 February 2021, 18:39:29 UTC
86ada44 journal client: Filter out status messages without type This allows us to support reading the journal from the beginning, ignoring messages with the old schema. 11 February 2021, 18:38:44 UTC
cdb1775 Simplify max_date() The built-in `max` function can take an iterable directly, no need to reimplement it. 11 February 2021, 18:24:01 UTC
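A sketch of what the simplification in cdb1775 could look like, assuming `max_date()` takes optional datetimes and ignores the missing ones (the actual signature is not shown in this log):

```python
from datetime import datetime
from typing import Optional


def max_date(*dates: Optional[datetime]) -> datetime:
    """Return the most recent of the given dates, ignoring None values.

    Assumed behavior for illustration: the built-in max() is applied to
    the filtered list directly, with no hand-written comparison loop.
    """
    known_dates = [d for d in dates if d is not None]
    if not known_dates:
        raise ValueError("At least one date should be a valid datetime")
    return max(known_dates)
```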
cf32e37 journal_client: Fix date computations for (un)eventful visits Fix a wrong computation when several messages (>=3) for the same snapshot are received in the wrong order. For example, before the fix, the following occurs:
```
| date | snapshot |     | last_ev  | last_unev | Snap |
| ---- | -------- | --- | -------- | --------- | ---- |
| 2022 | S2       |     | 2022     |           | S2   |
| 2020 | S2       |     | 2020     | 2022      | S2   |
| 2021 | S2       |     | **2021** | **2020**  | S2   |
```
as it should be:
```
| date | snapshot |     | last_ev  | last_unev | Snap |
| ---- | -------- | --- | -------- | --------- | ---- |
| 2022 | S2       |     | 2022     |           | S2   |
| 2020 | S2       |     | 2020     | 2022      | S2   |
| 2021 | S2       |     | **2020** | **2022**  | S2   |
```
Related to T3000 09 February 2021, 17:10:46 UTC
aa507ac journal_client: Deal with failed status message As loaders will start to create failed status messages, deal with them if any. Related to T3030 05 February 2021, 14:06:48 UTC
14feab9 celery: acknowledge tasks as soon as they're received With late acknowledgements, RabbitMQ will re-send tasks to clients even if they can't ever complete the task (e.g. when the task gets killed because the machine is out of memory). This problem only increases over time, leading to complete starvation of the ingestion system. Now that we have multiple mechanisms to issue retries of tasks, we can use early acknowledgements for tasks instead, which should mitigate the ongoing starvation, at the expense of having to retry tasks externally. 03 February 2021, 19:10:26 UTC
aaffff2 Simulator: allow to export results in a csv file 01 February 2021, 14:37:31 UTC
9fce3f6 Add minimal tests for the SimulationReport.format() method 01 February 2021, 14:37:31 UTC
aaf7dd6 Make plottings optional in simulator cli output 29 January 2021, 15:00:36 UTC
cf0583b simulator: stop validating the scheduling policy in the CLI We already do that in the scheduler backend function 26 January 2021, 12:33:16 UTC
ebb5847 Run simulator tests on all known scheduling policies 26 January 2021, 12:33:05 UTC
1f77521 simulator: record visit metrics alongside scheduler metrics This allows us to check the behavior of the archive over time in terms of number of visits. 26 January 2021, 12:32:54 UTC
8898394 simulator: stop using the database as a cache for origin data This was a significant bottleneck of the simulator. To work around this, we: - Generate snapshot ids consistently in the OriginModel - Cache the origin data locally in the simulator, to compute the eventfulness of visits - Cache the last visit time for all origins to compute the estimated run time of visit tasks. 26 January 2021, 12:31:57 UTC
c92ead5 grab_next_visits: don't re-schedule visits too fast The earlier implementation would just schedule new visits for origins forever, regardless of whether they were already scheduled or not. 26 January 2021, 12:20:39 UTC
2b39cbc Allow overriding the timestamp of grab_next_visits This makes the simulator behavior more consistent with reality. 26 January 2021, 12:20:39 UTC
7ffbdd1 Construct grab_next_visits query arguments incrementally 26 January 2021, 12:20:39 UTC
ea068b4 simulator: add simple lister simulation 26 January 2021, 12:20:39 UTC
7af98e2 Factor out ListedOrigin generation to use the OriginModel This generates consistent last_update values according to the model and simulated time. 25 January 2021, 13:39:30 UTC
2906b4e model/ListedOrigin: Set extra_loader_arguments type to Dict[str, Any] Some loaders, for instance the debian one, can have non string arguments so change the extra_loader_arguments type of the ListedOrigin model to something more generic. Related to T2979 25 January 2021, 13:10:25 UTC
3d13cda Solve uneventful/eventful with unordered messages with snapshots Fix the case: m1: date2/snapshot1 m2: date1/snapshot1 which results in: last_eventful = date2 last_uneventful = date2 The upsert was always keeping the most recent date when the eventful/uneventful dates were switched. Related to T2978 23 January 2021, 18:57:17 UTC
d528998 Do not consider duplicated messages as uneventful event Avoid copying the eventful date to the uneventful date when a duplicated message (same date/same snapshot) is received. Related to T2978 23 January 2021, 18:57:17 UTC
86b2555 Add a --num-origins option to the fill-test-data cli command 22 January 2021, 13:10:59 UTC
abb513c Simulation: log at info level recorded metrics This makes it possible to follow what the simulation is doing. 22 January 2021, 13:08:30 UTC
b93aa5b Make PaginatedListedOriginList a concretization of PagedResult 1. consistent with swh-storage and swh-indexer-storage 2. we can use swh.core.api.classes.stream_results on scheduler.get_listed_origins. 21 January 2021, 13:26:39 UTC
2f47936 Add scheduling policy for already visited origins with known last update This policy schedules origins by decreasing order of "visit lag" (that is, origins with the most lag are scheduled first). 21 January 2021, 12:02:39 UTC
acad712 Add scheduling policy for never visited origins This policy orders never visited origins by increasing date of last update (scheduling the "oldest" never visited origins first). 21 January 2021, 12:02:39 UTC
0346020 Reorganize grab_next_visits tests to better check sorting behavior - factor out test setup and results checking - properly exercise corner cases of the oldest_scheduled_first policy 21 January 2021, 12:02:39 UTC
af37898 Run Black. It wasn't run on d464b4cc1f9ae6a5c5c94a534826eff5cc27f12f. 21 January 2021, 11:04:55 UTC
b641ac8 Make the grab_next_visits sql query modular This will allow us to easily plug new scheduling policies in that function. 21 January 2021, 10:32:33 UTC
9fb0dd6 journal_client: Read visit_stats entries by batch out of the loop Related to T2967 21 January 2021, 09:53:48 UTC
d464b4c scheduler: Make origin_visit_stats_get read multiple entries Related to T2967 21 January 2021, 09:53:46 UTC
ffe2aed Simplify journal client tests - sort visits by default (there is a test dedicated to dealing with unsorted messages from the journal), - remove "intermediate checks" in several tests: these do not help much but make the code more difficult to read and maintain, - rename VISIT_STATUSES1 as VISIT_STATUSES_1 to make it less prone to being confused with VISIT_STATUSES (which also exists). 20 January 2021, 17:02:57 UTC
c7b740c Revert "Make sure swh.scheduler.cli.journal is loaded in test_cli_journal.py" This reverts commit b03d978241a67e741e0f62696a0bbca17d768271. It's actually not needed, after all... 20 January 2021, 17:01:51 UTC
898820f simulator: collect and plot scheduler metrics over time For now, only plot the known_origins and origins_never_visited metrics. 20 January 2021, 16:37:44 UTC
9ce68f8 simulator: stop using get_scheduler directly This reuses the scheduler instantiated by the cli instead of hardcoding our own using the PG* variables. 20 January 2021, 16:37:44 UTC
88e0b42 simulator: Add documentation. 20 January 2021, 16:37:44 UTC
62c6d90 simulator: Make min_batch_size a parameter defined in the setup. 20 January 2021, 16:37:44 UTC
9468bb9 simulator: add basic tests for fill_test_data and run 20 January 2021, 16:37:44 UTC
ead7b34 simulator: implement a simulator for the "old" task-based scheduler We extend the Task object with an autogenerated uuid allowing us to track the task lifetime between its creation and the generation of visit statuses, as the task-based scheduler does. 20 January 2021, 16:37:44 UTC
aecd27e Move the simulator cli to the main cli module 20 January 2021, 16:37:44 UTC
05067e3 simulator: Replace attrs with dataclasses for consistency 20 January 2021, 16:37:44 UTC
24922fe simulator: wrap tasks and task events in typechecked objects This allows us to extend these objects without redefining a bunch of type annotations. 20 January 2021, 16:37:44 UTC
d5318ae simulator: also fill data for the task-based scheduler 20 January 2021, 16:37:44 UTC