Revision Author Date Message Commit Date
b4249ca Updated backport on buster-swh from debian/0.10.0-1_swh1 (unstable-swh) 03 February 2021, 22:12:07 UTC
f548822 Merge tag 'debian/0.10.0-1_swh1' into debian/buster-swh 03 February 2021, 22:12:07 UTC
a78d5b2 Updated debian changelog for version 0.10.0 03 February 2021, 22:10:13 UTC
97ca32a Update upstream source from tag 'debian/upstream/0.10.0' Update to upstream version '0.10.0' with Debian dir 17cb8a0e3def3b15efe5ce2e2ae36c621314e1f3 03 February 2021, 22:10:12 UTC
2cf46e3 New upstream version 0.10.0 03 February 2021, 22:10:12 UTC
14feab9 celery: acknowledge tasks as soon as they're received With late acknowledgements, RabbitMQ will re-send tasks to clients even if they can't ever complete the task (e.g. when the task gets killed because the machine is out of memory). This problem only increases over time, leading to complete starvation of the ingestion system. Now that we have multiple mechanisms to issue retries of tasks, we can use early acknowledgements for tasks instead, which should mitigate the ongoing starvation, at the expense of having to retry tasks externally. 03 February 2021, 19:10:26 UTC
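The switch from late to early acknowledgement described in this commit corresponds to a single Celery setting. A minimal sketch, assuming a `celeryconfig.py`-style settings module (the layout is an assumption; the commit's actual change is not shown in this log):

```python
# Sketch of a celeryconfig.py-style fragment (assumed layout, not the
# project's actual configuration file).

# Acknowledge each message as soon as the worker receives it (early ack)
# rather than after the task has finished (late ack). With early acks the
# broker does not redeliver a task whose worker was killed mid-run, so any
# retry has to be issued by an external mechanism.
task_acks_late = False
```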
aaffff2 Simulator: allow exporting results to a CSV file 01 February 2021, 14:37:31 UTC
9fce3f6 Add minimal tests for the SimulationReport.format() method 01 February 2021, 14:37:31 UTC
aaf7dd6 Make plotting optional in simulator CLI output 29 January 2021, 15:00:36 UTC
cf0583b simulator: stop validating the scheduling policy in the CLI We already do that in the scheduler backend function 26 January 2021, 12:33:16 UTC
ebb5847 Run simulator tests on all known scheduling policies 26 January 2021, 12:33:05 UTC
1f77521 simulator: record visit metrics alongside scheduler metrics This allows us to check the behavior of the archive over time in terms of number of visits. 26 January 2021, 12:32:54 UTC
8898394 simulator: stop using the database as a cache for origin data This was a significant bottleneck of the simulator. To work around this, we: - Generate snapshot ids consistently in the OriginModel - Cache the origin data locally in the simulator, to compute the eventfulness of visits - Cache the last visit time for all origins to compute the estimated run time of visit tasks. 26 January 2021, 12:31:57 UTC
c92ead5 grab_next_visits: don't re-schedule visits too fast The earlier implementation would just schedule new visits for origins forever, regardless of whether they were already scheduled or not. 26 January 2021, 12:20:39 UTC
2b39cbc Allow overriding the timestamp of grab_next_visits This makes the simulator behavior more consistent with reality. 26 January 2021, 12:20:39 UTC
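The two grab_next_visits changes above (an overridable timestamp, and not re-scheduling origins that were scheduled recently) can be illustrated with a small in-memory sketch. The `SketchOrigin` class, the `cooldown` parameter and its 7-day default are hypothetical; the real function works on database rows and its signature is not shown in this log.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional


@dataclass
class SketchOrigin:
    """Hypothetical stand-in for a listed origin."""
    url: str
    last_scheduled: Optional[datetime] = None


def grab_next_visits(
    origins: List[SketchOrigin],
    count: int,
    timestamp: Optional[datetime] = None,
    cooldown: timedelta = timedelta(days=7),  # assumed value
) -> List[SketchOrigin]:
    """Return up to `count` origins not scheduled within `cooldown`."""
    # A simulator can pass its own simulated time instead of the wall clock.
    now = timestamp if timestamp is not None else datetime.now(tz=timezone.utc)
    cutoff = now - cooldown
    eligible = [
        o for o in origins
        if o.last_scheduled is None or o.last_scheduled < cutoff
    ]
    selected = eligible[:count]
    for origin in selected:
        # Record the scheduling time so the origin is skipped next time.
        origin.last_scheduled = now
    return selected
```

Calling the function twice with timestamps one hour apart returns the origins only once; they become eligible again once the cooldown has elapsed.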
7ffbdd1 Construct grab_next_visits query arguments incrementally 26 January 2021, 12:20:39 UTC
ea068b4 simulator: add simple lister simulation 26 January 2021, 12:20:39 UTC
0e21ece Updated backport on buster-swh from debian/0.9.2-1_swh1 (unstable-swh) 25 January 2021, 15:33:19 UTC
80baa0f Merge tag 'debian/0.9.2-1_swh1' into debian/buster-swh 25 January 2021, 15:33:19 UTC
cfafc72 Updated debian changelog for version 0.9.2 25 January 2021, 15:31:22 UTC
ac65074 Update upstream source from tag 'debian/upstream/0.9.2' Update to upstream version '0.9.2' with Debian dir 0835a7d748607669851821974686036d543f5e79 25 January 2021, 15:31:21 UTC
db8fa8e New upstream version 0.9.2 25 January 2021, 15:31:20 UTC
7af98e2 Factor out ListedOrigin generation to use the OriginModel This generates consistent last_update values according to the model and simulated time. 25 January 2021, 13:39:30 UTC
2906b4e model/ListedOrigin: Set extra_loader_arguments type to Dict[str, Any] Some loaders, for instance the debian one, can have non string arguments so change the extra_loader_arguments type of the ListedOrigin model to something more generic. Related to T2979 25 January 2021, 13:10:25 UTC
3d13cda Solve uneventful/eventful with unordered messages with snapshots Fix the case: m1: date2/snapshot1, m2: date1/snapshot1, which results in: last_eventful = date2, last_uneventful = date2. The upsert was always keeping the most recent date when the eventful/uneventful dates were switched. Related to T2978 23 January 2021, 18:57:17 UTC
d528998 Do not consider duplicated messages as uneventful event Avoid copying the eventful date to the uneventful date when a duplicated message (same date/same snapshot) is received. Related to T2978 23 January 2021, 18:57:17 UTC
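The two fixes above concern how visit statistics are updated when messages arrive out of order or more than once. The following is a simplified sketch of that behavior, not the project's upsert code: `VisitStats` and `record_visit` are hypothetical names, and failed visits and the database layer are left out.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class VisitStats:
    """Hypothetical, reduced view of the per-origin visit statistics."""
    last_snapshot: Optional[str] = None
    last_eventful: Optional[datetime] = None
    last_uneventful: Optional[datetime] = None


def _latest(current: Optional[datetime], candidate: datetime) -> datetime:
    return candidate if current is None else max(current, candidate)


def record_visit(stats: VisitStats, date: datetime, snapshot: str) -> VisitStats:
    if stats.last_eventful is None:
        # First message seen for this origin: eventful by definition.
        stats.last_snapshot = snapshot
        stats.last_eventful = date
    elif snapshot != stats.last_snapshot:
        if date > stats.last_eventful:
            # A newer visit produced a new snapshot: eventful.
            stats.last_snapshot = snapshot
            stats.last_eventful = date
        # Older messages about another snapshot are ignored in this sketch.
    elif date == stats.last_eventful:
        # Duplicate message (same date, same snapshot): change nothing.
        pass
    elif date < stats.last_eventful:
        # An earlier visit already produced this snapshot, so the visit
        # previously recorded as eventful was in fact uneventful.
        stats.last_uneventful = _latest(stats.last_uneventful, stats.last_eventful)
        stats.last_eventful = date
    else:
        # Later visit, same snapshot: uneventful.
        stats.last_uneventful = _latest(stats.last_uneventful, date)
    return stats
```

With messages m1 (date2/snapshot1) then m2 (date1/snapshot1), this sketch ends with last_eventful = date1 and last_uneventful = date2, rather than both set to date2.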
86b2555 Add a --num-origins option to the fill-test-data cli command 22 January 2021, 13:10:59 UTC
abb513c Simulation: log at info level recorded metrics This allows us to follow what the simulation is doing. 22 January 2021, 13:08:30 UTC
f0f4541 Updated backport on buster-swh from debian/0.9.1-1_swh1 (unstable-swh) 21 January 2021, 18:29:57 UTC
75c7db4 Merge tag 'debian/0.9.1-1_swh1' into debian/buster-swh 21 January 2021, 18:29:57 UTC
835296a Updated debian changelog for version 0.9.1 21 January 2021, 18:28:00 UTC
f266da3 Update upstream source from tag 'debian/upstream/0.9.1' Update to upstream version '0.9.1' with Debian dir 83c90829de6abb48a74e477387ee087a4de998ee 21 January 2021, 18:28:00 UTC
70532dc New upstream version 0.9.1 21 January 2021, 18:27:59 UTC
82b7a8a Solve uneventful/eventful with unordered messages with snapshots Fix the case: m1: date2/snapshot1, m2: date1/snapshot1, which results in: last_eventful = date2, last_uneventful = date2. The upsert was always keeping the most recent date when the eventful/uneventful dates were switched. Related to T2978 21 January 2021, 18:15:05 UTC
25d036e Do not consider duplicated messages as uneventful event Avoid copying the eventful date to the uneventful date when a duplicated message (same date/same snapshot) is received. Related to T2978 21 January 2021, 18:15:04 UTC
e50e17e Updated backport on buster-swh from debian/0.9.0-1_swh2 (unstable-swh) 21 January 2021, 13:30:17 UTC
081148a Merge tag 'debian/0.9.0-1_swh2' into debian/buster-swh 21 January 2021, 13:30:17 UTC
b93aa5b Make PaginatedListedOriginList a concretization of PagedResult 1. consistent with swh-storage and swh-indexer-storage 2. we can use swh.core.api.classes.stream_results on scheduler.get_listed_origins. 21 January 2021, 13:26:39 UTC
58ec03f d/changelog: Bump new release Related to T2978 21 January 2021, 13:21:41 UTC
b0e941d d/control: Update dependencies This builds the debian package without the swh.scheduler.simulator module though. It is currently missing the plotille debian package. It will be dealt with later. Related to T2978 21 January 2021, 13:17:25 UTC
0346020 Reorganize grab_next_visits tests to better check sorting behavior - factor out test setup and results checking - properly exercise corner cases of the oldest_scheduled_first policy 21 January 2021, 12:02:39 UTC
2f47936 Add scheduling policy for already visited origins with known last update This policy schedules origins by decreasing order of "visit lag" (that is, origins with the most lag are scheduled first). 21 January 2021, 12:02:39 UTC
acad712 Add scheduling policy for never visited origins This policy orders never visited origins by increasing date of last update (scheduling the "oldest" never visited origins first). 21 January 2021, 12:02:39 UTC
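The two scheduling policies introduced above boil down to two orderings. A minimal in-memory sketch follows; the function names and the `PolicyOrigin` class are hypothetical, "visit lag" is assumed here to mean last_update minus the last visit date, and origins with an unknown last_update are assumed to be left out. The real policies are expressed in the grab_next_visits SQL query.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class PolicyOrigin:
    """Hypothetical stand-in for a listed origin with its visit stats."""
    url: str
    last_update: Optional[datetime]
    last_visit: Optional[datetime] = None


def never_visited_oldest_update_first(origins: List[PolicyOrigin]) -> List[PolicyOrigin]:
    """Never visited origins, by increasing date of last update."""
    candidates = [
        o for o in origins
        if o.last_visit is None and o.last_update is not None
    ]
    return sorted(candidates, key=lambda o: o.last_update)


def already_visited_order_by_lag(origins: List[PolicyOrigin]) -> List[PolicyOrigin]:
    """Already visited origins, by decreasing visit lag."""
    candidates = [
        o for o in origins
        if o.last_visit is not None and o.last_update is not None
    ]
    # Assumed definition of lag: how far the last update is ahead of the visit.
    return sorted(candidates, key=lambda o: o.last_update - o.last_visit, reverse=True)
```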
af37898 Run Black. It wasn't run on d464b4cc1f9ae6a5c5c94a534826eff5cc27f12f. 21 January 2021, 11:04:55 UTC
d2c4725 Updated debian changelog for version 0.9.0 21 January 2021, 11:00:29 UTC
5a9c2e7 Update upstream source from tag 'debian/upstream/0.9.0' Update to upstream version '0.9.0' with Debian dir 77a48e68dc2ef078b65dde6e263407b0ad75c59d 21 January 2021, 11:00:29 UTC
2fd414f New upstream version 0.9.0 21 January 2021, 11:00:28 UTC
b641ac8 Make the grab_next_visits sql query modular This will allow us to easily plug new scheduling policies in that function. 21 January 2021, 10:32:33 UTC
9fb0dd6 journal_client: Read visit_stats entries by batch out of the loop Related to T2967 21 January 2021, 09:53:48 UTC
d464b4c scheduler: Make origin_visit_stats_get read multiple entries Related to T2967 21 January 2021, 09:53:46 UTC
ffe2aed Simplify journal client tests - sort visits by default (there is a test dedicated to dealing with unsorted messages from the journal), - remove "intermediate checks" in several tests: these do not help much but make the code more difficult to read and maintain, - rename VISIT_STATUSES1 as VISIT_STATUSES_1 to make it less prone to being confused with VISIT_STATUSES (which also exists). 20 January 2021, 17:02:57 UTC
c7b740c Revert "Make sure swh.scheduler.cli.journal is loaded in test_cli_journal.py" This reverts commit b03d978241a67e741e0f62696a0bbca17d768271. It's actually not needed, after all... 20 January 2021, 17:01:51 UTC
898820f simulator: collect and plot scheduler metrics over time For now, only plot the known_origins and origins_never_visited metrics. 20 January 2021, 16:37:44 UTC
9ce68f8 simulator: stop using get_scheduler directly This reuses the scheduler instantiated by the cli instead of hardcoding our own using the PG* variables. 20 January 2021, 16:37:44 UTC
88e0b42 simulator: Add documentation. 20 January 2021, 16:37:44 UTC
62c6d90 simulator: Make min_batch_size a parameter defined in the setup. 20 January 2021, 16:37:44 UTC
9468bb9 simulator: add basic tests for fill_test_data and run 20 January 2021, 16:37:44 UTC
ead7b34 simulator: implement a simulator for the "old" task-based scheduler We extend the Task object with an autogenerated uuid allowing us to track the task lifetime between its creation and the generation of visit statuses, as the task-based scheduler does. 20 January 2021, 16:37:44 UTC
aecd27e Move the simulator cli to the main cli module 20 January 2021, 16:37:44 UTC
05067e3 simulator: Replace attrs with dataclasses for consistency 20 January 2021, 16:37:44 UTC
24922fe simulator: wrap tasks and task events in typechecked objects This allows us to extend these objects without redefining a bunch of type annotations. 20 January 2021, 16:37:44 UTC
d5318ae simulator: also fill data for the task-based scheduler 20 January 2021, 16:37:44 UTC
22ebb7a simulator: Split into smaller files in the same package 20 January 2021, 16:37:44 UTC
ad7bfbe simulator: Make the run time a CLI argument 20 January 2021, 16:37:44 UTC
df34db0 simulator: tweak simulation environment constants 20 January 2021, 16:37:44 UTC
21ce2c8 simulator: generate more origins in fill_data 20 January 2021, 16:37:44 UTC
2920419 simulator: add typing for Environment.scheduler 20 January 2021, 16:37:44 UTC
6433266 simulator: add support for a basic SimulationReport For now, this collects the runtime of tasks that have run, and gets printed at the end of the simulation. 20 January 2021, 16:37:44 UTC
c474a82 simulator: refine origin model to follow an exponential distribution This models origins using a consistent characteristic "time between commits" that follows an exponential distribution between 1 second and 10 years. From this characteristic time, and feedback from the OriginVisitStats, we can generate the expected run time and output status of the next visit of that origin. 20 January 2021, 16:37:44 UTC
2459bad simulator: Remove some debug statements and lower log level 20 January 2021, 16:37:44 UTC
cb12449 simulator: simulate the scheduler journal client 20 January 2021, 16:37:44 UTC
20b7f9c simulator: generate OriginVisitStatus objects in modeled visits To be able to generate uneventful visits, we would need to store the last snapshot seen for a given origin. Instead of storing this within the simulator, which would be a concern for large scale simulations, we use the scheduler visit cache directly. 20 January 2021, 16:37:44 UTC
39ad47d simulator: Move scheduler into the simulation environment object The scheduler is used by a lot of the simulated actors, it makes sense to share it all the time. 20 January 2021, 16:37:44 UTC
31967fa simulator: Use datetimes instead of a floating point simulated time 20 January 2021, 16:37:44 UTC
fc3f06b Introduce scaffolding for a scheduler simulator This simulator will allow us to compare the behavior of the old and new schedulers, as well as to test the impact of scheduler policies and their parameters on the performance of the Software Heritage archival infrastructure as a whole. 20 January 2021, 16:37:44 UTC
7905a6b Add a cli for the scheduler metrics update endpoint 20 January 2021, 16:35:05 UTC
c386fdf Make the max_date() helper function accept *dates as argument so it can be called with more than 2 dates. 20 January 2021, 11:28:02 UTC
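A variadic helper of this kind can be sketched as follows. Skipping `None` values and raising when no date is known are assumptions made for the illustration; the commit message only states that the function accepts `*dates`.

```python
from datetime import datetime
from typing import Optional


def max_date(*dates: Optional[datetime]) -> datetime:
    """Return the most recent of the given dates (sketch)."""
    # Assumption: unknown dates are passed as None and are skipped.
    known = [d for d in dates if d is not None]
    if not known:
        raise ValueError("at least one non-None date is required")
    return max(known)
```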
b03d978 Make sure swh.scheduler.cli.journal is loaded in test_cli_journal.py needed to make pytest able to run directly (without tox). 20 January 2021, 11:18:25 UTC
737d12e Introduce a new lister_get endpoint 20 January 2021, 10:02:21 UTC
114ed95 Implement some basic aggregated metrics on listed origins Metrics are computed and cached database-side by the `update_metrics` function. The `get_metrics` function only retrieves the cached data. The metrics are aggregated for each lister instance and visit type (allowing complete reaggregation by visit type for cross-cutting statistics). The following metrics have been implemented: - number of known origins overall - number of enabled origins (origins seen in the last listing) - number of enabled origins that have never been successfully visited - number of enabled origins with known activity since our last successful visit 20 January 2021, 09:54:27 UTC
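The four metrics listed in this commit, aggregated per lister instance and visit type, can be illustrated in plain Python. This is only a sketch of the aggregation: the commit computes and caches the metrics database-side, and the `MetricsOrigin` class, its field names and the metric keys are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, Iterable, Optional, Tuple


@dataclass
class MetricsOrigin:
    """Hypothetical stand-in for a listed origin joined with its visit stats."""
    lister_id: str
    visit_type: str
    enabled: bool  # seen in the last listing
    last_update: Optional[datetime] = None
    last_successful_visit: Optional[datetime] = None


def aggregate_metrics(
    origins: Iterable[MetricsOrigin],
) -> Dict[Tuple[str, str], Dict[str, int]]:
    """Count origins per (lister_id, visit_type) key."""
    metrics: Dict[Tuple[str, str], Dict[str, int]] = defaultdict(
        lambda: {"known": 0, "enabled": 0, "never_visited": 0, "with_known_activity": 0}
    )
    for o in origins:
        m = metrics[(o.lister_id, o.visit_type)]
        m["known"] += 1
        if not o.enabled:
            continue
        m["enabled"] += 1
        if o.last_successful_visit is None:
            m["never_visited"] += 1
        elif o.last_update is not None and o.last_update > o.last_successful_visit:
            # Known activity since our last successful visit.
            m["with_known_activity"] += 1
    return dict(metrics)
```

Because the key includes both the lister instance and the visit type, summing the counters over all keys that share a visit type gives the cross-cutting statistics mentioned in the commit message.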
9852653 Import the journal subcommand in the main swh.scheduler cli This issue was masked by tox.ini using pytest with --doctest-modules, which imports all modules during test collection, and therefore executing the side-effects of swh.scheduler.cli.journal. 20 January 2021, 09:35:09 UTC
f8627a9 Move the `last_scheduled` ts from ListedOrigin to OriginVisitStatus Since this timestamp is actually a loading-related value, it makes more sense to keep it in the OriginVisitStatus table. Related to T2444. 19 January 2021, 16:48:51 UTC
0a32a31 Make the journal-client cli subcommand automagically loaded Otherwise it won't be advertised as a `swh scheduler` subcommand by default. Also add a short docstring for better --help. 19 January 2021, 15:18:49 UTC
5e609d5 requirements: Make swh.journal an optional dependency This avoids pulling journal dependencies when modules only need the swh-scheduler dependency. 19 January 2021, 11:04:37 UTC
9395aa0 scheduler.cli.journal: Add `swh scheduler journal-client` cli This adds the cli entrypoint to actually process origin_visit_status topics and write to the origin_visit_stats db table. Related to T2967 19 January 2021, 10:10:41 UTC
58ca796 journal_client: Improve stats detection This adds an integration test which permutes the input to ensure that out-of-order messages yield the same result. This also improves the current algorithm, which revealed some hit-and-miss cases: - Initialization of the first visit detection (through the absence of the "last_snapshot" field; the previous implementation's check could fail otherwise). - The out-of-order policy (ignore old events) for supposedly "eventful" events was applied too early, which ignored too many messages (the new test cases failed in some permutations). This is now specifically checked for referenced snapshots, which can turn an eventful event into an uneventful one. For example, an anterior eventful event is now caught, which means that the current most up-to-date eventful event is actually an uneventful one. ... Related to T2967 19 January 2021, 09:17:05 UTC
d3afd14 Use the recorded task end time for the task scheduler feedback loop This allows us to run "time-warping" simulations without interference from the real wall clock time. 15 January 2021, 16:04:30 UTC
a5fb291 backend: Make origin_visit_stats_upsert a batch api Related to T2967 15 January 2021, 13:34:06 UTC
608aa20 Populate origin_visit_stats table out of the origin_visit_status topic The snapshot is used to determine the "eventful/uneventful" nature of the origin visit status. When no snapshot is provided, the visit is considered as failed so the last_failed column is updated. As there is no ordering guarantee when reading messages from the topic, the code tries to keep the data as chronologically ordered as possible. Only the most recent information is kept. Related to T2967 15 January 2021, 13:34:05 UTC
ca45d40 Filter origins by visit type when scheduling the next visits We have separate task queues and workers for each visit type, so it makes sense to split this endpoint along these lines too, at least for now. 14 January 2021, 12:53:31 UTC
59b4cb3 Reorganize ListedOrigin fixtures to generate multiple visit_types 14 January 2021, 12:53:31 UTC
4f5338f Introduce a `swh scheduler origin schedule-next` cli This creates one-shot tasks in the classic scheduler for the next visits to run according to the visit scheduling policy. 14 January 2021, 12:53:31 UTC
3dd1d5f Rename test task types to names that match real tasks The success of tests using these task types would depend on the test run order, because these task types are (currently) being created by swh/scheduler/sql/50-data.sql, but the table is truncated after the first test completes. 14 January 2021, 12:53:31 UTC
5d7b002 Introduce a `swh scheduler origin grab-next` cli This returns, as CSV, the next origins to be visited according to the passed scheduling policy. 14 January 2021, 12:53:31 UTC
a620033 Add a new origin visit info model object and related backend api Upsert and Read methods Related to T2443 12 January 2021, 13:47:49 UTC
b13cb1f Implement a basic endpoint for getting the next origins to visit The basic policy implemented is a FIFO, to get things going. 11 January 2021, 14:40:17 UTC
619100e Add a cli section to the doc 18 December 2020, 14:57:00 UTC
dd4b698 Updated backport on buster-swh from debian/0.8.2-1_swh2 (unstable-swh) 08 December 2020, 10:37:45 UTC
88ba897 Merge tag 'debian/0.8.2-1_swh2' into debian/buster-swh 08 December 2020, 10:37:45 UTC
42d50be d/changelog: Bump new release 08 December 2020, 10:32:58 UTC