9c1f05e | Antoine R. Dumont (@ardumont) | 08 February 2024, 09:08:23 UTC | cli/origin: Make utils function compute only the required data And let the caller function do the actual display. This also makes the utility function compute only the data required for display (the listing data is optionally output). This also renames the 'list' bool to 'with_listing'. 'list' is a Python builtin name (it's ok to reuse but editors usually color it differently than plain function names and that can be confusing). | 08 February 2024, 09:08:23 UTC |
fe72a50 | Antoine R. Dumont (@ardumont) | 07 February 2024, 11:16:07 UTC | cli/origin: Unify check-ingested-origins cli summary This adapts the print output to expose the success rate even when --watch is activated. This also changes the formatting of the 2-digit float percentage. Refs. swh/infra/sysadm-environment#5230 | 07 February 2024, 11:16:07 UTC |
863d931 | Antoine R. Dumont (@ardumont) | 07 February 2024, 10:43:44 UTC | cli/origin: Bump default watch period to 30 minutes The previous default of 10 minutes is too short a period. Refs. swh/infra/sysadm-environment#5230 | 07 February 2024, 10:43:44 UTC |
d9dcef4 | Antoine R. Dumont (@ardumont) | 07 February 2024, 10:41:31 UTC | cli/origin: Add --watch-period flag to specify ingestion check period Refs. swh/infra/sysadm-environment#5230 | 07 February 2024, 10:41:31 UTC |
a9e5043 | Antoine Lambert | 05 February 2024, 15:14:30 UTC | tox: Bump mypy to 1.8.0 Related to swh/meta#5075. | 05 February 2024, 15:14:30 UTC |
3e10fdb | Nicolas Dandrimont | 30 January 2024, 17:21:45 UTC | Update for pytest-postgresql >= 5 | 02 February 2024, 09:54:35 UTC |
a716cc7 | Nicolas Dandrimont | 30 January 2024, 17:22:11 UTC | tox.ini: keep pytest invocation at the beginning of the line This didn't actually fail in CI because the dash at the start of the line, when either slow or !slow were enabled, ignores failures. | 30 January 2024, 17:24:24 UTC |
869514d | Antoine Lambert | 06 December 2023, 15:48:17 UTC | requirements-test: Remove outdated types-* packages Typing stubs for click and flask are now maintained by upstreams so remove their outdated typing packages. | 06 December 2023, 15:48:20 UTC |
210eb6a | David Douard | 04 December 2023, 18:28:28 UTC | python: Fix black formatting after bump to 23.1.0 in pre-commit And replace comment type annotations by explicit ones. | 04 December 2023, 18:28:28 UTC |
23dc6fb | David Douard | 03 December 2023, 17:36:44 UTC | Apply swh-py-template 0.1.6 | 03 December 2023, 17:36:44 UTC |
88d6854 | David Douard | 22 November 2023, 17:23:14 UTC | Migrate to copier-based swh-py-template | 29 November 2023, 15:34:21 UTC |
7c775bb | David Douard | 16 November 2023, 17:01:25 UTC | docs: include the README file in the main index page Convert README from markdown to ReST to make it embeddable in docs/index.rst | 16 November 2023, 17:01:25 UTC |
aae9583 | David Douard | 18 October 2023, 14:49:18 UTC | sql: remove task type creation from the sql init scripts This is now handled by the worker plugin system. | 08 November 2023, 09:13:46 UTC |
6247685 | Antoine Lambert | 22 June 2023, 14:01:58 UTC | backend: Add a temporary postgresql scheduler backend When using that backend, a temporary scheduler database is spawned then removed when the backend gets destroyed. It can be used for testing SWH components that require a scheduler instance (listers for instance). | 07 November 2023, 16:21:29 UTC |
a726552 | David Douard | 18 October 2023, 15:05:22 UTC | Remove version restriction on importlib_metadata for python > 3.7 The issue has been solved for some time now. | 20 October 2023, 09:50:11 UTC |
2a64df0 | Nicolas Dandrimont | 06 October 2023, 13:53:20 UTC | parse_time_interval: Improve parsing to handle intervals down to seconds This also improves: - the spacing between time periods, making it optional. - the units, which can range from abbreviated to full forms (e.g. h, hr, hrs, hour, hours...) | 06 October 2023, 14:04:30 UTC |
c175460 | Antoine R. Dumont (@ardumont) | 06 October 2023, 13:36:20 UTC | cli.origin.send-to-celery: Allow providing cooldown flags This should ease rescheduling origins manually in staging for testing purposes. | 06 October 2023, 13:36:20 UTC |
5b13c1d | Antoine R. Dumont (@ardumont) | 06 October 2023, 13:14:26 UTC | Drop duplicated, desynchronized and misplaced test | 06 October 2023, 13:14:32 UTC |
0dba862 | Antoine R. Dumont (@ardumont) | 02 October 2023, 15:33:13 UTC | Provide a default max_queue_length value to task_type Because the register task type routine does not provide the value, it's left unchecked. Once a new lister starts listing origins, the scheduler keeps on scheduling new origins in the queue without limits. As the consumption may be slower than the production, that tends towards excessive resource usage in rabbitmq. This should limit the issue for new deployments. | 02 October 2023, 15:34:12 UTC |
9a91b8d | Antoine Lambert | 30 August 2023, 11:45:18 UTC | MANIFEST.in: Include missing tests datadir | 30 August 2023, 11:51:28 UTC |
c99be45 | Antoine Lambert | 30 August 2023, 09:01:07 UTC | requirements-test: Remove swh.lister dependency The swh.lister package was required as a testing dependency to check registration of celery tasks for listers through plugins declared in swh.workers entrypoints. However, it is easy to create a fake lister in tests data to check its correct registration through scheduler CLI so that swh.lister dependency is not really needed. As a consequence, worker plugins are now discovered in the function register_task_types from the swh.scheduler.cli.task_type module; previously it was done when importing the module. | 30 August 2023, 11:51:28 UTC |
f052c9b | Valentin Lorentz | 07 June 2023, 14:51:25 UTC | cli: Document configuration expected by every endpoint | 30 August 2023, 11:39:36 UTC |
d379ab6 | Guillaume Samson | 06 July 2023, 15:07:45 UTC | add-forge-now/cli: add check-ingested-origins command Related to swh/devel/swh-scheduler#4684 | 07 August 2023, 11:12:48 UTC |
8d543cf | Guillaume Samson | 05 July 2023, 15:48:47 UTC | add-forge-now/cli: add check-listed-origins command Related to swh/devel/swh-scheduler#4683 | 07 August 2023, 11:12:48 UTC |
4b316fa | David Douard | 07 July 2023, 14:51:48 UTC | Fix mypy/click: add swh.core[testing] in requirements-test.txt It now needs types-click which is indeed a dependency of swh.core[testing]. | 07 July 2023, 14:51:48 UTC |
9f849c2 | Antoine R. Dumont (@ardumont) | 04 July 2023, 15:57:33 UTC | scheduler: Update default policy to schedule origins without last update Prior to this, we considered not to do it. However, we do have some listers which are not able to list origins with a last update. And we still need to be able to schedule those origins nonetheless hence this change. Refs. swh/infra/sysadm-environment#4971 | 04 July 2023, 15:58:34 UTC |
836d9e2 | Valentin Lorentz | 07 June 2023, 14:10:43 UTC | cli: Fix docstring format | 20 June 2023, 12:51:41 UTC |
006d60c | Antoine Lambert | 13 June 2023, 11:07:58 UTC | sql: Fix task creation when providing a custom next_run value Previously, it was not possible to sequentially create two oneshot tasks that only differ by their next_run value. Related to swh/devel/swh-web#4548 | 13 June 2023, 11:07:58 UTC |
04a2207 | vlorentz | 07 June 2023, 14:43:47 UTC | cli: Use ctx.fail instead of raising an exception | 07 June 2023, 14:43:47 UTC |
cf2ca93 | Antoine R. Dumont (@ardumont) | 30 May 2023, 13:09:20 UTC | cli.add_forge_now: Allow queue name prefix override This will allow to schedule add-forge-now requests to different queues. For example [1] will send git tasks to the add_forge_now_slow:swh.loader.git... queue. [1] ``` swh ... add-forge-now ... \ --preset $ENVIRONMENT \ schedule-first-visits \ --type-name git \ --prefix-queue add_forge_now_slow ``` | 30 May 2023, 15:14:22 UTC |
6299df4 | Antoine R. Dumont (@ardumont) | 12 April 2023, 09:28:17 UTC | add-forge-now: Improve conditional so incremental listing is delayed Otherwise, listing types without a 'list-%-full', 'list-%-incremental' pattern (e.g. list-cgit, ...) are systematically delayed by 1 day the first time add-forge-now schedules them. Refs. swh/infra/sysadm-environment#4845 | 12 April 2023, 09:28:17 UTC |
ddcd7c8 | Antoine Lambert | 24 March 2023, 15:48:36 UTC | celery_backend/config: Enable to set Sentry DSN per task type Add a task_prerun celery signal handler in order to set Sentry DSN based on task name or package name. The mapping between a task/package name and its DSN must be stored in configuration under a "sentry_settings_for_celery_tasks" key. For this feature to work, no SWH_SENTRY_DSN and SWH_MAIN_PACKAGE environment variables should be defined as they override the sentry_dsn and main_package values passed to init_sentry function. Related to swh/meta#4949. | 28 March 2023, 15:52:15 UTC |
5936ae1 | Antoine R. Dumont (@ardumont) | 21 March 2023, 10:57:08 UTC | add-forge-now: Allow scheduling of cgit task type Refs. swh/infra/sysadm-environment#4813 | 21 March 2023, 12:00:04 UTC |
c24b0c8 | Antoine Lambert | 17 February 2023, 16:09:37 UTC | mypy: Bump to 1.0.1 and fix new typing errors Related to swh/meta#4960 | 17 February 2023, 16:59:03 UTC |
4cb605e | Jérémy Bobbio (Lunar) | 16 February 2023, 16:10:00 UTC | Update and clean tox configuration for version 4 Related to swh/meta#4959 | 16 February 2023, 16:10:00 UTC |
e33d0ad | Antoine Lambert | 02 February 2023, 10:07:36 UTC | pre-commit: Bump isort from 5.10.1 to 5.11.5 This fixes python 3.7 support due to poetry, a dependency of isort, that removed support for that Python version in a recent release. | 02 February 2023, 10:07:36 UTC |
9beef90 | Antoine R. Dumont (@ardumont) | 30 January 2023, 16:05:17 UTC | Configure logging from environment variable SWH_LOG_CONFIG When not provided, this uses the logging configuration coded in the scheduler (as before). Refs. swh/infra/sysadm-environment#4524 | 31 January 2023, 16:59:14 UTC |
bebf298 | Antoine R. Dumont (@ardumont) | 27 January 2023, 09:04:24 UTC | swh.scheduler.cli: Pass initialization exceptions to subcommands | 30 January 2023, 15:27:11 UTC |
a65c4ed | Antoine Lambert | 26 January 2023, 13:06:45 UTC | celery_backend/config: Fix missing comma in setup_log_handler Because of that missing comma, an exception was raised (tuple object is not callable) but it was caught and displayed by the _print_errors decorator so tests could not detect it. As a consequence, the logging configuration of celery workers was broken. Add a test to check if an exception was raised by the setup_log_handler function to avoid bad surprises when deploying to production or in docker. | 26 January 2023, 15:11:11 UTC |
7d3e9ae | Valentin Lorentz | 25 January 2023, 13:48:56 UTC | require pytest-postgresql < 4.0.0 | 25 January 2023, 13:48:56 UTC |
037946a | Valentin Lorentz | 25 January 2023, 13:37:18 UTC | Add missing dependency on pytest-postgresql It is used by the pytest plugin | 25 January 2023, 13:37:42 UTC |
8f0849a | Antoine R. Dumont (@ardumont) | 16 January 2023, 14:22:12 UTC | Allow logging configuration from configuration yaml file This will allow proper logging configuration for the services which are currently running in the dynamic infrastructure. Their logs are currently written in the wrong elasticsearch indices. Ref. swh/infra/sysadm-environment#4524 | 23 January 2023, 17:03:12 UTC |
fccf944 | Antoine R. Dumont (@ardumont) | 12 December 2022, 13:05:46 UTC | Add missing __init__.py so find_packages keeps finding sql modules Otherwise, at some point, this will get discarded as per the debian build warning [1] [1] https://jenkins.softwareheritage.org/view/swh-debian%20(draft)/job/debian/job/packages/job/DSCH/job/gbp-buildpackage/182/console | 02 January 2023, 09:21:57 UTC |
d521ab7 | Antoine Lambert | 19 December 2022, 14:10:54 UTC | docs: Include module indices only when building standalone package doc In order to remove warnings about /apidoc/*.rst files being included multiple times in toc when building full swh documentation, prefer to include module indices only when building standalone package documentation. Also include them the proper sphinx way. Related to T4496 | 19 December 2022, 14:10:54 UTC |
8e125f1 | Antoine R. Dumont (@ardumont) | 07 December 2022, 15:57:32 UTC | cli.add_forge_now: Open `register-lister` with sensible defaults This will ease scheduling of new add-forge-now requests, on: - staging: this will list a subset of disabled origins once - production: this will register recurring tasks (full, incremental if any) to list that new forge This also unifies the previous subcommand schedule-first-visits with the --preset flag. So, the following would be enough to list appropriately in staging/production: ``` swh scheduler add-forge-now \ ( --preset [production|staging] \ # to enable a pre-defined set of rules ) register-lister \ gitea \ url=https://git.afpy.org/api/v1/ ``` Related to https://gitlab.softwareheritage.org/infra/sysadm-environment/-/issues/4674 | 08 December 2022, 17:51:45 UTC |
1c34e98 | Antoine R. Dumont (@ardumont) | 07 December 2022, 14:14:32 UTC | cli.add_forge_now: Open `schedule-first-visits` with sensible defaults This should ease scheduling the first visits for add-forge-now request. The following would be enough to fetch and schedule the forge just listed (be it in production or staging): ``` swh scheduler add-forge-now \ schedule-first-visits \ --visit-type git \ (--visit-type svn \ # if a lister lists multiple kinds of visit, we can mention it ) --lister-name gitea \ --lister-instance-name git.afpy.org \ ( --production | --staging ) # to list only enabled | disabled origins ``` Related to https://gitlab.softwareheritage.org/infra/sysadm-environment/-/issues/4674 | 07 December 2022, 15:44:28 UTC |
e2878b5 | Antoine R. Dumont (@ardumont) | 07 December 2022, 11:38:58 UTC | task add: Ensure the provided task type exists and raise otherwise Related to https://gitlab.softwareheritage.org/infra/sysadm-environment/-/issues/4674 | 07 December 2022, 11:57:04 UTC |
cd16fce | Antoine R. Dumont (@ardumont) | 06 December 2022, 16:01:41 UTC | grab_next_visits: Open lister name and instance name filtering Related to https://gitlab.softwareheritage.org/infra/sysadm-environment/-/issues/4674 | 06 December 2022, 16:03:32 UTC |
a776963 | Antoine R. Dumont (@ardumont) | 06 December 2022, 11:24:33 UTC | send-to-celery: Adapt to schedule from lister name & instance_name This allows to bypass the lister id retrieval step using directly the name and instance name of the lister to discover the uuid. This also drops the --lister-uuid flag which is somewhat difficult to use. Related to https://gitlab.softwareheritage.org/infra/sysadm-environment/-/issues/4674 | 06 December 2022, 15:54:02 UTC |
ff75e74 | Nicolas Dandrimont | 25 October 2022, 13:48:55 UTC | Ensure origins are not visited faster than twice a day The scheduled_cooldown only applies to tasks that have not been executed yet. absolute_cooldown avoids archiving objects faster than that. | 25 October 2022, 14:48:51 UTC |
1f9109f | Nicolas Dandrimont | 25 October 2022, 13:47:37 UTC | Refresh task type data from the database every time recurrent tasks are run Avoids inconsistencies between the database state and an ongoing recurrent task scheduler. | 25 October 2022, 14:48:51 UTC |
bde27a9 | Nicolas Dandrimont | 25 October 2022, 13:46:26 UTC | Use json instead of msgpack for serializers Recent celery versions generate serialized messages with mime types incompatible with older versions when using msgpack | 25 October 2022, 13:51:01 UTC |
aeb870a | David Douard | 18 October 2022, 16:21:00 UTC | pre-commit, tox: Bump pre-commit, codespell, black and flake8 - pre-commit from 4.1.0 to 4.3.0, - codespell from 2.2.1 to 2.2.2, - black from 22.3.0 to 22.10.0 and - flake8 from 4.0.1 to 5.0.4. Also freeze flake8 dependencies. Also change flake8's repo config to github (the gitlab mirror being outdated). | 18 October 2022, 16:53:38 UTC |
17c6d48 | Antoine R. Dumont (@ardumont) | 03 October 2022, 11:43:32 UTC | Fix compatibility issue with latest dependency version This currently fails all swh related builds which depend on the celery/kombu stack due to that dependency's latest version release. | 03 October 2022, 11:58:46 UTC |
6d0b1d1 | Antoine R. Dumont (@ardumont) | 23 September 2022, 07:48:30 UTC | backend: Prevent query exception when lister ids is empty Related to T4545 | 23 September 2022, 07:49:04 UTC |
b1afdab | Antoine Lambert | 14 September 2022, 14:18:51 UTC | recurrent_visits: Allow to set no origins scheduled backoff in config The send_visits_for_visit_type function uses a default schedule backoff of 20 minutes when there are no origins to schedule for a given visit type. There are use cases where we would like that schedule backoff to be shorter in order to schedule listed origins for loading into the archive more rapidly, typically in the docker environment. So allow setting that backoff value through configuration. | 15 September 2022, 08:41:20 UTC |
7cfaa98 | Antoine Lambert | 22 August 2022, 13:19:50 UTC | sql/Makefile: Fix swh-scheduler SQL file paths Those files have been renamed so the database could not be filled. | 22 August 2022, 13:19:50 UTC |
fd6df6a | Antoine R. Dumont (@ardumont) | 29 July 2022, 08:12:23 UTC | api/server: Clarify load and check configuration backend This adds type to the function, update its docstring and clarify its associated tests as well. | 29 July 2022, 08:12:23 UTC |
d847448 | David Douard | 08 July 2022, 12:00:33 UTC | Fix the load_and_check_config() function to support the "postgresql" cls value and replace usage of the "local" scheduler cls with "postgresql" everywhere. | 08 July 2022, 12:23:46 UTC |
0496c39 | Antoine R. Dumont (@ardumont) | 03 June 2022, 12:41:51 UTC | Remove unused get_current_version method Attribute current_version is already set and directly used by swh db [version|init|upgrade] clis. Related to T4305 | 03 June 2022, 12:44:56 UTC |
ef15385 | David Douard | 31 May 2022, 12:21:31 UTC | tests: use stock pytest_postgresql factory function instead of (soon-to-be-deprecated) swh-core's postgresql_fact one. | 31 May 2022, 14:46:05 UTC |
e56fc4d | Antoine Lambert | 12 May 2022, 09:08:09 UTC | interface: Return enabled origins only by default in get_listed_origins Add a new enabled_only parameter set to True by default in get_listed_origins scheduler method. It enables to filter out by default disabled listed origins when requesting the result of a listing and avoid possible errors in listers implementation. | 12 May 2022, 10:07:17 UTC |
c7c53ea | Pratyush Desai | 09 May 2022, 10:13:54 UTC | add strict asyncio_mode in pytest.ini | 09 May 2022, 10:13:54 UTC |
1d50b2e | Antoine Lambert | 06 May 2022, 15:05:20 UTC | cli/task: Fix sphinx >= 4.4 warning Fix "more than one target found for cross-reference 'Origin'" sphinx warning. | 06 May 2022, 15:06:23 UTC |
881b521 | Benoit Chauvet | 28 April 2022, 13:56:01 UTC | Add missing sentry captures | 28 April 2022, 13:59:44 UTC |
82274c1 | Valentin Lorentz | 27 April 2022, 13:15:28 UTC | cli/utils: Fix parsing of empty strings | 27 April 2022, 13:15:28 UTC |
353cf2a | Valentin Lorentz | 26 April 2022, 11:05:15 UTC | Bump mypy to v0.942 | 26 April 2022, 11:05:15 UTC |
0365b85 | Valentin Lorentz | 21 April 2022, 16:40:55 UTC | Add a 'lister_instance_name' argument to all tasks created from ListedOrigin This will allow loaders to use the right API credentials to fetch extrinsic metadata for the origin from the forge. | 26 April 2022, 10:28:37 UTC |
42e362d | Valentin Lorentz | 21 April 2022, 10:22:03 UTC | Add a 'lister_name' argument to all tasks created from ListedOrigin This will allow loaders to guess the forge type, and use the right API to fetch extrinsic metadata for the origin from the forge. | 26 April 2022, 10:28:33 UTC |
3687931 | David Douard | 25 April 2022, 16:14:29 UTC | Update a bit the documentation for the new origin visit scheduler | 26 April 2022, 08:38:05 UTC |
9483493 | Valentin Lorentz | 21 April 2022, 09:22:48 UTC | Make create_origin_task_dict a standalone function It feels off as an object method; and I am going to make it use joins in a future commit, so it makes more sense this way. | 21 April 2022, 15:15:06 UTC |
5e9ee60 | Valentin Lorentz | 21 April 2022, 09:21:05 UTC | test_utils.py: Convert to pytest-style tests | 21 April 2022, 11:47:58 UTC |
9627e6d | Antoine Lambert | 21 April 2022, 11:39:49 UTC | pre-commit: Remove codespell commit-msg hook That hook can be frustrating as it can discard a long commit message if it finds a typo in it, so it is better to remove it. | 21 April 2022, 11:39:49 UTC |
a76bb02 | David Douard | 15 April 2022, 16:08:49 UTC | Make scheduling policy used in schedule_recurrent configurable Add support for a configuration option "scheduling_policy" in the config file loaded by the 'swh scheduler schedule-recurrent' command. This config entry allows to specify the scheduling policies used by the schedule-recurrent tool, instead of having them hardcoded in the source code. A visit type policy config entry should have at least a 'weight' value for each policy. Default values are unchanged. Eg.: scheduling_policy: git: - policy: already_visited_order_by_lag weight: 55 tablesample: 0.5 - policy: never_visited_oldest_update_first weight: 45 tablesample: 0.5 Note: there may not be configuration entries for all visit types, but if a visit type policy is configured, the config entry should be complete (in other words, the merging of the configuration with the default values is only done at first config level). | 20 April 2022, 14:34:23 UTC |
5302efd | Antoine Lambert | 08 April 2022, 13:15:35 UTC | Add .git-blame-ignore-revs file with automatic reformatting commits | 08 April 2022, 13:15:35 UTC |
3f0843b | Antoine Lambert | 08 April 2022, 13:15:09 UTC | python: Reformat code with black 22.3.0 Related to T3922 | 08 April 2022, 13:15:09 UTC |
d9a2512 | Antoine Lambert | 08 April 2022, 13:13:50 UTC | pre-commit, tox: Bump black from 19.10b0 to 22.3.0 black is considered stable since release 22.1.0 and the version we are currently using is quite outdated and not compatible with click 8.1.0, so it is time to bump it to its latest stable release. Please note that E501 pycodestyle warning related to line length is replaced by B950 one from flake8-bugbear as recommended by black. https://black.readthedocs.io/en/stable/the_black_code_style/current_style.html#line-length Related to T3922 | 08 April 2022, 13:13:50 UTC |
bafe03f | Antoine Lambert | 06 April 2022, 15:14:52 UTC | requirements-test: Remove pytest pinning to < 7 pytest-postgresql 3.1.3 and pytest-redis 2.4.0 added support for pytest >= 7 so we can now drop the pytest pinning. | 06 April 2022, 15:14:52 UTC |
78f5579 | Antoine Lambert | 22 March 2022, 10:58:10 UTC | pytest: Exclude build directory for tests discovery Due to test modules being copied in subdirectories of the build directory by setuptools, it makes pytest fail by raising ImportPathMismatchError exceptions when invoked from root directory of the module. So ignore the build folder to discover tests. | 22 March 2022, 10:58:10 UTC |
5cc62be | David Douard | 08 February 2022, 13:59:29 UTC | Adapt to swh.core 2.0.0 - add the `get_datastore` function in `swh.scheduler` - add the `get_current_version` method in `SchedulerBackend`, - remove dbversion management from sql init script - update tests accordingly | 24 February 2022, 14:51:44 UTC |
234e165 | Antoine Lambert | 10 February 2022, 16:23:34 UTC | pre-commit: Bump hooks and add new one to check commit message spelling To install the new hook: $ pre-commit install -t commit-msg | 10 February 2022, 16:23:34 UTC |
fddec02 | Antoine Lambert | 09 February 2022, 13:22:06 UTC | requirements: Remove click version pin Latest versions of celery and flask now support click >= 8.0 so we can remove the version pin. | 09 February 2022, 13:22:46 UTC |
c46ffad | David Douard | 08 February 2022, 16:26:17 UTC | Prefix task types used in tests with 'test-' so that tests do not depend on a lucky guess on what the scheduler db state actually is. DB initialization scripts do create task types for git, hg and svn (used in tests) but these tests depend on the fact that the db fixture has already been called once before, so tables are truncated (especially the task and task_type ones). For example running a single test involved in task-type creation was failing (e.g. 'pytest swh -k test_create_task_type_idempotence'). This commit makes tests not collide with any existing task or task type that initialization scripts may create. Note that this also means that there is actually no test dealing with the scheduler db state after initialization, which is not great and should be addressed. | 08 February 2022, 16:34:10 UTC |
9f601f5 | Antoine R. Dumont (@ardumont) | 07 February 2022, 15:46:47 UTC | requirements-test: Pin pytest to < 7.0.0 Related to T3916 | 07 February 2022, 15:47:00 UTC |
ce11283 | Valentin Lorentz | 21 January 2022, 10:10:48 UTC | Fix ReST syntax | 21 January 2022, 10:14:59 UTC |
b5477ea | Antoine R. Dumont (@ardumont) | 12 January 2022, 09:58:58 UTC | sql: Clean up task/task_run data model This archives current task and task_run tables, creating new ones filtering only necessary tasks (last 2 months' oneshot tasks plus some recurring tasks; lister, indexer, ...). Those filtered tasks are the ones scheduled by the runner and runner priority services. This archiving will allow those services to be faster (the corresponding queries will return results faster without the archived data). Related to T3837 | 12 January 2022, 10:30:36 UTC |
5c836d6 | Vincent SELLIER | 04 January 2022, 23:08:50 UTC | Allow to specify the visit grab parameters per visit type and policy Related to T3827 | 05 January 2022, 17:18:32 UTC |
559f345 | Antoine R. Dumont (@ardumont) | 16 December 2021, 14:47:56 UTC | Pin mypy and drop type annotations which make mypy unhappy This also drops spurious copyright headers from those files if present. Related to T3812 | 16 December 2021, 14:47:56 UTC |
e051b32 | Nicolas Dandrimont | 09 December 2021, 13:54:09 UTC | Use a temporary table to update scheduler metrics When using ``insert into <...> select <...>``, PostgreSQL disables parallel querying. Under some circumstances (in our large production database), this makes updating the scheduler metrics take a (very) long time. Parallel querying is allowed for ``create table <...> as select <...>``, and doing so restores the small(er) runtimes for this query (15 minutes instead of multiple hours). To use that, we have to turn the function into plpgsql instead of plain sql. | 09 December 2021, 14:16:06 UTC |
a8edbdb | Antoine R. Dumont (@ardumont) | 07 December 2021, 13:31:34 UTC | Clean up disabled scheduler archival task related services This is dead code now as this has long been stopped and disabled in production. Related to T3777 | 08 December 2021, 10:12:53 UTC |
5de8ba4 | Nicolas Dandrimont | 07 December 2021, 12:57:51 UTC | Make next_visit_queue_position an integer In visit types with small amounts of origins having no last_update field, we would end up overflowing Python datetimes (which only go up to 31 December 9999) pretty quickly. Making the queue position a 64-bit integer should give us some more leeway. The queue position now defaults to zero instead of an arbitrary point in time. Queue offsets are still commensurate with seconds, but that's mostly to give them some space to be splayed by the fudge factors. | 07 December 2021, 16:39:48 UTC |
0a6aac5 | Vincent SELLIER | 06 December 2021, 15:23:49 UTC | Ensure there are no duplicated origins in the insertion batches When a lister tries to insert duplicate origins in the same batch, the insertion fails because the "on conflict do update" instruction cannot handle duplicates in the same transaction. Related to T3769 | 06 December 2021, 20:11:40 UTC |
2abb393 | Valentin Lorentz | 22 November 2021, 12:32:20 UTC | Fix CardinalityViolation in grab_next_visits on duplicate origins grab_next_visits grabs from `listed_origins`, whose primary key is `(lister_id, url, visit_type)` and uses it to upsert in origin_visit_stats, whose primary key is `(url, visit_type)`. This causes the error `ON CONFLICT DO UPDATE command cannot affect row a second time` when the same (origin, type) pair is grabbed twice. This commit deduplicates the (origin, type) pairs before upserting. | 22 November 2021, 12:36:24 UTC |
00ff02e | Nicolas Dandrimont | 29 October 2021, 13:58:31 UTC | recurrent visits: use policy weights instead of ratios The ratios weren't checked for normalization; using relative weights explicitly ensures that the settings won't be misinterpreted. | 29 October 2021, 13:58:31 UTC |
7f434c3 | Nicolas Dandrimont | 29 October 2021, 13:44:56 UTC | Improve docs rendering for recurrent visits scheduler | 29 October 2021, 13:44:56 UTC |
50d7fd7 | Nicolas Dandrimont | 27 October 2021, 10:09:42 UTC | Add a new cli endpoint to schedule recurrent visits in Celery For each known visit type, we run a loop which: - monitors the size of the relevant celery queue - schedules more visits of the relevant type once the number of available slots goes over a given threshold (currently set to 5% of the max queue size). The scheduling of visits combines multiple scheduling policies, for now using static ratios set in the `POLICY_RATIOS` dict. We emit a warning if the ratio of origins fetched for each policy is skewed with respect to the original request (allowing, for now, manual adjustment of the ratios). The CLI endpoint spawns one thread for each visit type, which all handle connections to RabbitMQ and the scheduler backend separately. For now, we handle exceptions in the visit scheduling threads by (stupidly) respawning the relevant thread directly. We should probably improve this to give up after a specific number of tries. Co-authored-by: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org> | 28 October 2021, 11:06:56 UTC |
0c7ef27 | Nicolas Dandrimont | 27 October 2021, 13:45:09 UTC | grab_next_visits: avoid time interval calculations in PostgreSQL When the database is in a non-UTC timezone with DST, and a `timestamptz - interval` calculation crosses a DST change, the result of the calculation can be one hour off from the expected value: PostgreSQL will vary the timestamp by the number of days in the interval, and will keep the same (local) time, which will be offset by an hour because of the DST change. Doing the datetime +- timedelta calculations in Python instead of PostgreSQL avoids this caveat altogether. | 27 October 2021, 13:45:09 UTC |
ecc0e28 | Antoine R. Dumont (@ardumont) | 22 October 2021, 08:44:08 UTC | Restrict the click version to avoid a version conflict with celery's Otherwise, in some edge cases, like running in docker, the install fails on conflict. Related to P1205#8092 | 22 October 2021, 09:21:36 UTC |
243a69f | Antoine R. Dumont (@ardumont) | 20 October 2021, 08:42:34 UTC | Add docstring to runner and listener modules Related to T3667 | 20 October 2021, 09:25:38 UTC |
5b53196 | Antoine R. Dumont (@ardumont) | 20 October 2021, 08:44:34 UTC | Drop deprecated listener module It's been deprecated for enough time. Related to T3667 | 20 October 2021, 09:02:11 UTC |