https://github.com/SoftwareHeritage/swh-storage

Revision Author Date Message Commit Date
6ae5098 New upstream version 0.29.0 11 May 2021, 13:12:36 UTC
f328367 content_get: Add support for queries by sha1_git Before this commit, the only way to get Content objects from their sha1_git was to call content_find for each object. This was obviously neither convenient nor efficient. Using this endpoint to batch calls reduces the runtime of the git-bare vault cooker by 30%. 11 May 2021, 12:36:30 UTC
e3cbd5e Add endpoint directory_get_entries, to quickly list a directory's entries It spares a join with the content table, which should hopefully make the vault (and possibly other users) faster when they don't need this join. 11 May 2021, 10:00:27 UTC
f140f63 cassandra: Add tests checking directory_add and snapshot_add are atomic. 11 May 2021, 08:22:23 UTC
b487a21 Deprecate the "local" storage cls in favor of "postgresql" 10 May 2021, 12:56:44 UTC
9105253 Move all proxy storages in swh/storage/proxies/ to clean a bit the swh.storage namespace. 10 May 2021, 12:55:07 UTC
7617099 Make the TenaciousProxyStorage retry when a single object add fails. This gives one-object batches a chance to be ingested, and reduces the number of objects wrongly reported as non-ingested, e.g. during a replayer session, where this situation can occur. 07 May 2021, 11:46:00 UTC
455191b New upstream version 0.28.0 06 May 2021, 14:06:49 UTC
35ae94a Use swh.core 0.14 It renamed db_name to dbname, which is a breaking change. 06 May 2021, 12:23:09 UTC
652e3d5 tenacious: Document potential issues about objects being dropped 06 May 2021, 09:56:32 UTC
e170fb2 Stop storing authority/fetcher metadata. We still don't have a use for them, and they are causing issues; such as being unable to add an authority/fetcher based only on a REMD object, which is needed by the replayer. 05 May 2021, 10:54:04 UTC
77ef651 Make postgresql's origin_add not raise an error in case of conflict. There is no need for a URL insertion in the origin table to result in a uniqueness error. Conflicting insertions of the same URL in this table may happen with concurrent processes (loading or a replayer session). 05 May 2021, 10:18:44 UTC
ffb38f7 Add a new TenaciousProxyStorage This proxy storage attempts to add buckets of objects, but in case of failure, it splits the bucket in parts so every valid object in the bucket gets a chance to be inserted. Also provides an error rate-limiting feature. This proxy storage is mainly dedicated to helping mirror an archive using the replayer stack. 05 May 2021, 09:57:58 UTC
051b771 cassandra: Add a test of a 'complex' migration, with a PK update 03 May 2021, 15:40:37 UTC
f233461 cassandra: Add 'check_missing' option, to allow updating objects as part of a migration. Also write a first test that simulates how a simple migration would go. 03 May 2021, 15:40:36 UTC
2b20af5 New upstream version 0.27.4 29 April 2021, 13:04:39 UTC
92d551a Normalize all Storage.xxx_add() methods to return a summary, except origin_visit_add(), which requires more work to do so. Note that this will change the way 'raw_extrinsic_metadata_add()' reports statsd metrics: the 'method_name' tag will now remain 'raw_extrinsic_metadata_add' instead of a forged '<type_name>_metadata_add'. 29 April 2021, 10:40:33 UTC
ff7ecb4 Properly annotate output of Storage.xxx_add() methods as Dict[str, int] when applicable. 29 April 2021, 10:03:10 UTC
98804f9 Add a fixer for ExtrinsicRawMetadata. The 'type' attribute has been removed in swh.model v1.0.0 in favor of an ExtendedSWHID 'target'. 28 April 2021, 12:12:22 UTC
615d719 tox: Add sphinx environments to check sane doc build This makes it possible to check that the package documentation can be built without producing sphinx warnings. The sphinx environment is designed to be used in continuous integration in order to prevent breaking the documentation build when committing changes. The sphinx-dev environment is designed to be used inside a full swh development environment. Related to T3258 27 April 2021, 11:57:23 UTC
2c477ec Fix storage_data hardcoded id values and add a test to check this stays accurate, so that these objects can pass through the validate proxy storage, for example. 23 April 2021, 13:45:35 UTC
eb8c147 cassandra: Deduplicate table names This removes all table names from cassandra/cql.py, and gets them from cassandra/schema.py instead. When possible, this uses existing constants (BaseRow.TABLE), otherwise it uses a function to compute these names. This is needed to support schema migrations, as updating a table's primary key requires creating a new table with a different name. 22 April 2021, 15:22:18 UTC
a1fc5fb cassandra: Use prepared statements in extid_index_* All other statements are, and there is no reason for them not to be too 15 April 2021, 13:56:32 UTC
3b00e3a Fix various Sphinx warnings 15 April 2021, 08:19:23 UTC
b999952 sql/Makefile: Also call dropdb prior to createdb when using pifpaf Now that the PGDATABASE value from pifpaf is used, that call is needed; otherwise the overall swh doc build in development mode fails. 14 April 2021, 16:41:20 UTC
1bacea5 docs: Fix db-schema.svg generation to use pifpaf-created database This makes 'tox -e sphinx-dev' not rely on the existence of the database on the system. 13 April 2021, 15:12:13 UTC
c96942b Cassandra: Deduplicate lists passed to *_add endpoints Previously only release_add supported deduplication. This commit aligns other _add endpoints with it 12 April 2021, 11:27:22 UTC
933289e Remove last references to no longer used SQLAlchemy package 09 April 2021, 13:07:35 UTC
a5342f9 New upstream version 0.27.3 09 April 2021, 13:06:55 UTC
50becef docs: Fix db-schema.svg inclusion when building full swh documentation The image was correctly included when building standalone swh-storage documentation but was not when building the full swh one. Closes T3227 09 April 2021, 11:37:53 UTC
1562a78 New upstream version 0.27.2 08 April 2021, 08:05:41 UTC
ccaac11 migrate_extrinsic_metadata: Allow 'atom:title' as alternative to 'title' Some revisions use it instead. 07 April 2021, 12:20:19 UTC
39507b2 Make the replayer drop the Revision.metadata. This attribute is deprecated and on the verge of being replaced by RawExtrinsicMetadata objects, and the kafka journal currently in production contains a few invalid metadata entries that make the replayer unhappy. Closes T3201. 06 April 2021, 14:31:49 UTC
84dcbe3 Merge test_replay's _check_replayed and check_replayed in a single function 06 April 2021, 14:01:37 UTC
36a7fd3 Fix pg Storage.extid_add(): write ExtID objects to the journal and explicitly check for extid objects in the journal in TestStorage. 06 April 2021, 14:01:01 UTC
491e920 New upstream version 0.27.1 30 March 2021, 15:58:58 UTC
0a270d1 migrate_extrinsic_metadata: Filter out git revisions They can't have any extrinsic metadata, so fetching git revisions wastes a lot of time. 30 March 2021, 15:51:55 UTC
3309765 buffer: Add support for 'extid' Will be used by the extid migration script, and loaders can probably use it too. 30 March 2021, 15:33:00 UTC
29e04cf New upstream version 0.27.0 29 March 2021, 12:44:11 UTC
cfb2417 extid: remove unicity on (extid_type, extid) and (target_type, target) It did not make sense for multiple reasons: 1. two extids can point to the same target (eg. extids with type git and git-sha256; or two package managers with different checksums) 2. inserting two objects with the same target or extid in a single call actually wrote both, but would crash when reading 3. inserting extid1 then extid2 would write both to Kafka, but only extid1 would be inserted. When replaying on a new DB, extid2 may be inserted and extid1 ignored Points 2 and 3 are simply fixable bugs, but 1 is an issue by design, and this commit fixes all of them at once. 26 March 2021, 15:08:13 UTC
ac6f642 origin_visit_status_add: Fix inconsistent/incorrect errors when type is None and visit is missing. 26 March 2021, 14:30:43 UTC
9a0834b New upstream version 0.26.0 22 March 2021, 21:53:37 UTC
eff2383 raw_extrinsic_metadata: Make (target, authority_id, discovery_date, fetcher_id) non-unique Uniqueness is only based on the id from now on. Also adds the 'id' column to the Cassandra schema (it was already present in postgresql's schema) 22 March 2021, 11:42:46 UTC
2d540b0 Add raw_extrinsic_metadata.id column in postgresql. For now, this has absolutely no effect on the API users, as rows are already deduplicated based on a subset of the fields hashed by the id. 22 March 2021, 08:53:16 UTC
7e25bb8 New upstream version 0.25.0 18 March 2021, 13:02:00 UTC
8dd9f7b Document the existing metadata formats 15 March 2021, 14:59:07 UTC
ffc0841 content_add: Write to the objstorage before the DB or Kafka Must add to the objstorage before the DB and journal. Otherwise: 1. in case of a crash the DB may "believe" we have the content, but we didn't have time to write to the objstorage before the crash 2. the objstorage mirroring, which reads from the journal, may attempt to read from the objstorage before we finished writing it This is already done in the postgresql backend unintentionally since 209de5dbaa127dacd114fbbd084f22632982eb77. This commit documents it, makes the cassandra backend behave that way too, and adds a test. 15 March 2021, 11:55:29 UTC
b565201 storage: Allow to filter out branches by prefix when counting them Add an optional branch_name_exclude_prefix parameter to the snapshot_count_branches method of the Storage interface. It enables to filter out branches whose name starts with a given prefix when counting. The purpose is to get accurate counters in swh-web as pull request branches will be filtered out by default. Related to T2782 12 March 2021, 14:23:54 UTC
93301a1 storage: Add branch names filtering support in snapshot_get_branches Add optional branch_name_include_substring parameter to snapshot_get_branches, if provided only branches whose name contains the given substring will be returned. Add optional branch_name_exclude_prefix parameter to snapshot_get_branches, if provided branches whose name starts with the given prefix will not be returned. Purpose of these new features: add a search form in the branches view of swh-web and filter out pull request branches (whose names start with "refs/pull/") by default. Related to T2782 12 March 2021, 14:23:28 UTC
b8e10f0 Add ExtID query support to the Storage These endpoints allow to add and query the storage for known ExtID from SWHID (typically get original VCS' revision intrinsic identifier from SWHID). The underlying data structure is to be filled typically by loaders using the `extid_add()` endpoint. This only provides the Postgresql implementation. Related to T2849. 11 March 2021, 13:20:18 UTC
6a77732 Add hg revisions to the test data set 10 March 2021, 15:25:00 UTC
e83452b Import TEST_OBJECTS from swh.model instead of swh.journal. The latter has been deprecated for a while now. 10 March 2021, 15:25:00 UTC
82ce7bf Make sure test_backfill does not depend on 2 dict keys being miraculously listed the same. 10 March 2021, 14:49:48 UTC
c4fdd6d Add support for raw_extrinsic_metadata in the replayer This also checks the basic raw_extrinsic_metadata codepaths in the backfiller tests. 10 March 2021, 13:07:11 UTC
53a58fa Add basic support for raw_extrinsic_metadata in the backfiller 10 March 2021, 13:00:05 UTC
89ae0a1 Add simple unit test for the backfill.byte_ranges function 10 March 2021, 08:34:27 UTC
0d785d2 Add support for reading RawExtrinsicMetadata with raw URL targets We convert the target attribute to a hashed ExtendedSWHID before returning the object. 10 March 2021, 08:33:53 UTC
b4574cb New upstream version 0.24.1 04 March 2021, 22:39:01 UTC
88ff2c2 postgresql: Ensure a minimum limit for the snapshot branches query With small limits (< 10), the snapshot branches query can degenerate into using the deduplication index on snapshot_branch (name, target, target_type), and the postgresql planner happily scans several hundred million rows. So ensure a minimum limit value of 10 before executing the query for optimal performances when a small branches_count value is provided to the snapshot_get_branches method of the Storage interface. Related to P966 03 March 2021, 16:49:20 UTC
ce8335d Remove the remaining references to the deprecated SWHID class 03 March 2021, 16:46:50 UTC
f46244b tests: Drop hypothesis < 6 requirement Ensure tests can be executed using hypothesis >= 6 by suppressing the function_scoped_fixture health check on tests that use a function scope fixture in combination with @given that does not need to be reset between individual hypothesis examples. 03 March 2021, 10:53:08 UTC
fd0efad New upstream version 0.24.0 02 March 2021, 09:11:13 UTC
14739c5 RawExtrinsicMetadata: update to use the API in swh-model 1.0.0 01 March 2021, 16:38:44 UTC
2388748 storage_tests: recompute ids when evolving RawExtrinsicMetadata objects. For now this does nothing as RawExtrinsicMetadata has no 'id' field, but the equality assertions will become errors when the next version of swh.model is released. 25 February 2021, 15:33:40 UTC
f56267f New upstream version 0.23.2 19 February 2021, 10:58:48 UTC
f3ef6e6 storage: Implement visit types filtering in origin_search method Enable filtering searched origins by visit types. Add a new optional visit_types parameter to the origin_search method in StorageInterface. Implement visit types filtering in storage backends: an origin will be returned if it has any of the requested visit types. This is clearly not designed to be used in production due to performance issues but rather in testing environments with a small archive dataset. Related to T2869 19 February 2021, 10:36:29 UTC
7b4c124 167: Make the migration script unblocking 17 February 2021, 09:18:26 UTC
f7f161d New upstream version 0.23.1 16 February 2021, 16:28:23 UTC
cc3eb4b Switch anonymized replayer test to use pytest parametrization This allows us to only read the kafka topics once instead of twice in the same tests, which is apparently a hard thing to do in a way compatible with both confluent-kafka 1.5 and 1.6. 16 February 2021, 16:09:03 UTC
5c6b53c New upstream version 0.23.0 15 February 2021, 14:39:02 UTC
e0e88b2 storage: Refactor OriginVisitStatus instantiation 09 February 2021, 16:01:26 UTC
d30ca93 db: Unify sql joins on origin_visit_status using "USING" 09 February 2021, 16:01:26 UTC
046fe57 storage.postgresql: Use origin_visit_status.type value as source This stops using the origin_visit.type as fallback values as now, the database has been migrated. So this makes the origin_visit_status.type a not nullable column. This also drops now redundant join instructions on origin_visit table when reading. Related to T2968 09 February 2021, 16:01:25 UTC
51df58e test_replay: Fix hang since confluent-kafka 1.6 release Side effect of the following commit in librdkafka 1.6: https://github.com/edenhill/librdkafka/commit/f418e0f721518d71ff533759698b647cb2e89b80 Tests were relying on a buggy behavior of the mocked kafka cluster: two subsequent consumers set up with the same group id should receive a different set of messages, rather than the same set of messages. Also explicitly commit messages once consumed. 09 February 2021, 14:56:15 UTC
b038383 postgresql: Fix dbversion() to return the max version instead of a random one. 08 February 2021, 11:13:03 UTC
efd8815 buffer: ensure objects are flushed in topological order This new integration test checks that, when flushing the buffer storage, the addition functions of the underlying storage backend are called in topological order (content, directory, revision, release then snapshot). This reduces the probability of "data consistency" regressions caused by the use of the buffering storage proxy alone. 04 February 2021, 18:17:11 UTC
1526107 Return an accurate summary from buffer's flush() method The earlier implementation would only return summary data from keys that existed in the last `_add` backend method run, rather than collating all the results. 04 February 2021, 18:14:03 UTC
5b3e6c9 buffer: add support for snapshots This is mostly a consistency addition, considering that most (if not all) loaders will only add a single snapshot. The common pattern of loading objects in topological order (content > directory > revision > release > snapshot), then flushing the storage, is now fully consistent; Without this addition, the snapshot addition would reach the backend storage before all other objects are added, leading to potential inconsistencies if the flush of other object types fails. 04 February 2021, 13:37:12 UTC
18967ed buffer: add type annotations for tests 04 February 2021, 09:19:34 UTC
f1e523e New upstream version 0.22.0 03 February 2021, 11:15:26 UTC
9a9f234 storage: Make origin_get_latest_visit_status return OriginVisitStatus This returned a Tuple[OriginVisit, OriginVisitStatus]. This was required to have the missing information "type" for visit-status. This is no longer needed as now OriginVisitStatus holds the type information. 01 February 2021, 11:06:35 UTC
626b0bf Change origin_visit_status_get_random interface to return visit_status This returned a Tuple[OriginVisit, OriginVisitStatus], which is no longer needed now that OriginVisitStatus holds the type information. 01 February 2021, 11:06:34 UTC
f6ae8a0 Write introduction to swh-storage. Explains: * when to use swh-web instead * that `get_storage` should always be used to instantiate the storage * `StorageInterface` * model objects * pagination * backends 01 February 2021, 11:03:02 UTC
57d3066 New upstream version 0.21.1 28 January 2021, 13:19:21 UTC
76de53c Correctly return origin_visit_status.type value everywhere If the type is not present on an origin_visit_status, it should be computed from the origin_visit. Some methods only returned the origin_visit_status value, which broke the webapp by mangling the type to an empty value on the search result page. Related to T3001 28 January 2021, 11:15:11 UTC
47e0a4c New upstream version 0.21.0 20 January 2021, 14:52:18 UTC
e433255 db: Allow new status values not_found, failed to OriginVisitStatus Related to T2961 20 January 2021, 14:36:12 UTC
45803cf New upstream version 0.20.0 20 January 2021, 09:29:52 UTC
d04165f Add type to the origin_visit_status topic useful when the type is not yet populated in the database Related to T2966 18 January 2021, 10:49:34 UTC
c24d35f Add persistence of the field OriginVisitStatus.type (!) A new database upgrade is needed (165.sql) for postgresql backend Related to T2964 15 January 2021, 11:38:38 UTC
da55308 Make test_content_add_race fail for the right reason. Since 209de5dbaa127dacd114fbbd084f22632982eb77, it was failing because of: TypeError("content_add() got an unexpected keyword argument 'db'") 15 January 2021, 10:30:36 UTC
2204346 New upstream version 0.19.0 14 January 2021, 10:18:30 UTC
0b44b37 Adapt cassandra storage to ignore the new OriginVisitStatus.type field Depends on D4848 Related to T2443 13 January 2021, 10:06:12 UTC
728c3ee Allow to use the JAVA_HOME environment for cassandra tests This allows to enforce a specific version of java to be used. For example, since cassandra seems not to support java 14 yet, this allows to run tests on bullseye: JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64/ pytest swh 13 January 2021, 09:56:07 UTC
30945a5 Enforce hypothesis <6 to prevent test breakage hypothesis 6 upgraded a warning into an error: now raises a FailedHealthCheck when using a pytest fixture with a @given generative test set. See https://hypothesis.readthedocs.io/en/latest/healthchecks.html 13 January 2021, 09:42:42 UTC
74e6f58 Make the CREATE_TABLES_QUERIES in cassandra/schema.py an explicit list, to prevent being fooled by a missing '\n'. 08 January 2021, 13:20:08 UTC
2b35198 Add a cli section in the doc 18 December 2020, 12:41:23 UTC
04ae89f storage.backfill: Allow cli run for origin_visit_status as well 24 November 2020, 17:21:21 UTC
64ee845 conftest: Reference swh.core.db.pytest_plugin As it's exposed through the swh.storage.pytest_plugin itself used by other swh modules, this needs to be declared to avoid other swh module build failures. Related to T2746 24 November 2020, 13:08:12 UTC
4c46835 New upstream version 0.18.0 23 November 2020, 13:52:31 UTC
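
The bucket-splitting behavior described in commit ffb38f7 (TenaciousProxyStorage) can be illustrated with a minimal sketch. This is not the swh-storage implementation: the function name `tenacious_add`, the `add_batch` callback, and the choice to split a failing bucket into two halves are all assumptions made for illustration, and the error rate-limiting feature is left out.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def tenacious_add(
    objects: Sequence[T],
    add_batch: Callable[[Sequence[T]], None],
) -> Tuple[List[T], List[T]]:
    """Hypothetical helper: try to add `objects` as one bucket; if the
    backend raises, split the bucket in two halves and retry each half,
    down to single objects. Returns (added, rejected)."""
    if not objects:
        return [], []
    try:
        add_batch(objects)
        return list(objects), []
    except Exception:
        if len(objects) == 1:
            # A one-object bucket failed: report that object as rejected.
            return [], list(objects)
    # The bucket failed and holds several objects: split and retry.
    mid = len(objects) // 2
    added_left, rejected_left = tenacious_add(objects[:mid], add_batch)
    added_right, rejected_right = tenacious_add(objects[mid:], add_batch)
    return added_left + added_right, rejected_left + rejected_right


def strict_add(batch: Sequence[int]) -> None:
    # Hypothetical backend: rejects any bucket containing a negative number.
    if any(x < 0 for x in batch):
        raise ValueError("invalid object in bucket")


added, rejected = tenacious_add([1, 2, -3, 4, 5], strict_add)
print(added, rejected)
```

With this toy backend, the single invalid object ends up isolated in its own one-object bucket while every valid object is still added, which is the property the commit message describes.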