https://github.com/facebook/rocksdb

sort by:
Revision Author Date Message Commit Date
2a42e3e Several small improvements Summary: This has several small improvements. benchmark.sh * add BYTES_PER_SYNC as an env variable * use --prepopulate_block_cache when O_DIRECT is used * use --undefok to list options that don't work for all 7.x releases * print "failure" in report.tsv when a benchmark fails * parse the slightly different throughput line used by db_bench for multireadrandom * remove the trailing comma for BlobDB size before printing it in report.tsv * use the last line of the output from /bin/time as there can be more than one line when db_bench has a non-zero exit * fix more bash lint warnings * add ",stats" to the --benchmark=... lines to get stats at the end of each benchmark benchmark_compare.sh * run revrange immediately after fillseq to let compaction debt get removed * add --multiread_batched when --benchmarks=multireadrandom is used * use --benchmarks=overwriteandwait when supported to get a more accurate measure of write-amp Test Plan: Run it for leveled, universal and BlobDB Reviewers: Subscribers: Tasks: Tags: 11 October 2022, 20:43:20 UTC
5a5f21c Allow the last level data moving up to penultimate level (#10782) Summary: Lock the penultimate level for the whole compaction inputs range, so any key in that compaction is safe to move up from the last level to penultimate level. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10782 Reviewed By: siying Differential Revision: D40231540 Pulled By: siying fbshipit-source-id: ca115cc8b4018b35d797329fa85a19b06cc8c13e 11 October 2022, 05:50:34 UTC
2d0380a Allow manifest fix-up without requiring prior state (#10796) Summary: This change is motivated by ensuring that `ldb update_manifest` or `UpdateManifestForFilesState` can run without expecting files to open when the old temperature is provided (in case the FileSystem strictly interprets non-kUnknown), but ended up fixing a problem in `OfflineManifestWriter` (used by `ldb unsafe_remove_sst_file`) where it would open some SST files during recovery and expect them to match the prior manifest state, even if not required by the intended new state. Also update BackupEngine to retry with Temperature kUnknown when reading file with potentially "wrong" temperature. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10796 Test Plan: tests added/updated, that fail before the change(s) and now pass Reviewed By: jay-zhuang Differential Revision: D40232645 Pulled By: jay-zhuang fbshipit-source-id: b5aa2688aecfe0c320b80a7da689b315414c20be 11 October 2022, 00:59:17 UTC
f6a0065 Allow Flush(sync=true) not supported in DB::Open() and db_stress (#10784) Summary: **Context:** https://github.com/facebook/rocksdb/pull/10698 made `Flush(sync=true)` required for` DB::Open()` (to pass the original but now deleted assertion `impl->TEST_WALBufferIsEmpty()` under `manual_wal_flush=true`, see https://github.com/facebook/rocksdb/pull/10698 summary for more ) as well as db_stress to pass. However RocksDB users may not implement SyncWAL() (used inFlush(sync=true)). Therefore we replace such in DB::Open and db_stress in this PR and align with https://github.com/facebook/rocksdb/blob/main/db/db_impl/db_impl_open.cc#L1883-L1887 and https://github.com/facebook/rocksdb/blob/main/db_stress_tool/db_stress_test_base.cc#L847-L849 Pull Request resolved: https://github.com/facebook/rocksdb/pull/10784 Test Plan: make check Reviewed By: anand1976 Differential Revision: D40193354 Pulled By: anand1976 fbshipit-source-id: e80d53880799ae01bdd717641d07997d3bfe2b54 10 October 2022, 22:52:10 UTC
ebf8c45 Provide support for async_io with tailing iterators (#10781) Summary: Provide support for async_io if ReadOptions.tailing is set true. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10781 Test Plan: - Update unit tests - Ran db_bench: ./db_bench --benchmarks="readrandom" --use_existing_db --use_tailing_iterator=1 --async_io=1 Reviewed By: anand1976 Differential Revision: D40128882 Pulled By: anand1976 fbshipit-source-id: 55e17855536871a5c47e2de92d238ae005c32d01 10 October 2022, 22:48:48 UTC
5182bf3 Skip column validation for non-value types when iter_start_ts is set (#10799) Summary: When the `iter_start_ts` read option is set, iterator exposes internal keys. This also includes tombstones, which by definition do not have a value (or columns). The patch makes sure we skip the wide-column consistency check in this case. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10799 Test Plan: Tested using a simple blackbox crash test with timestamps enabled. Reviewed By: jay-zhuang, riversand963 Differential Revision: D40235628 fbshipit-source-id: 49519fb55d8fe2bb9249ced809f7a81bff2b9df2 10 October 2022, 22:07:07 UTC
a6ce195 Fix flaky test ShuttingDownNotBlockStalledWrites (#10800) Summary: DBTest::ShuttingDownNotBlockStalledWrites is flaky, added new sync point dependency to fix it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10800 Test Plan: gtest-parallel --repeat=1000 ./db_test --gtest_filter="*ShuttingDownNotBlockStalledWrites" Reviewed By: jay-zhuang Differential Revision: D40239116 Pulled By: jay-zhuang fbshipit-source-id: 8c2d7e7df58f202d287bd9f5c9b60b7eff270d0c 10 October 2022, 20:58:55 UTC
62ba5c8 Deflake DBBloomFilterTest.OptimizeFiltersForHits (#10792) Summary: The test may fail because the L5 files may only cover small portion of the whole key range. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10792 Test Plan: ``` gtest-parallel ./db_bloom_filter_test --gtest_filter=DBBloomFilterTest.OptimizeFiltersForHits -r 1000 -w 100 ``` Reviewed By: siying Differential Revision: D40217600 Pulled By: siying fbshipit-source-id: 18db549184bccf5e513eaa7e31ab17385b71ef71 10 October 2022, 19:34:25 UTC
fac7a31 Fix a few errors in async IO blog post (#10795) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10795 Reviewed By: jay-zhuang, akankshamahajan15 Differential Revision: D40229329 fbshipit-source-id: 7ec5347e0a8a52f80a0a9cc2a0c17b094736d6d9 10 October 2022, 17:47:07 UTC
a45e687 fix issue 10751 (#10765) Summary: Fix https://github.com/facebook/rocksdb/issues/10751 where a stalled write could be blocked forever when DB shutdown. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10765 Reviewed By: ajkr Differential Revision: D40110069 Pulled By: ajkr fbshipit-source-id: 598c05777db9be85913a0a85e421b3295ecdff5e 10 October 2022, 16:46:09 UTC
c401f28 Add option `preserve_internal_time_seconds` to preserve the time info (#10747) Summary: Add option `preserve_internal_time_seconds` to preserve the internal time information. It's mostly for the migration of the existing data to tiered storage ( `preclude_last_level_data_seconds`). When the tiering feature is just enabled, the existing data won't have the time information to decide if it's hot or cold. Enabling this feature will start collect and preserve the time information for the new data. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10747 Reviewed By: siying Differential Revision: D39910141 Pulled By: siying fbshipit-source-id: 25c21638e37b1a7c44006f636b7d714fe7242138 08 October 2022, 01:49:40 UTC
f366f90 Blog post for asynchronous IO (#10789) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10789 Reviewed By: akankshamahajan15 Differential Revision: D40198988 Pulled By: akankshamahajan15 fbshipit-source-id: 5db74f12dd8854f6288fbbf8775c8e759778c307 08 October 2022, 00:42:48 UTC
11943e8 Exclude timestamp when checking compaction boundaries (#10787) Summary: When checking if a range [start, end) overlaps with a compaction whose range is [start1, end1), always exclude timestamp from start, end, start1 and end1, otherwise some versions of one user key may be compacted to bottommost layer while others remain in the original level. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10787 Test Plan: make check Reviewed By: ltamasi Differential Revision: D40187672 Pulled By: ltamasi fbshipit-source-id: 81226267fd3e33ffa79665c62abadf2ebec45496 07 October 2022, 21:11:23 UTC
7af47c5 Verify wide columns during prefix scan in stress tests (#10786) Summary: The patch adds checks to the `{NonBatchedOps,BatchedOps,CfConsistency}StressTest::TestPrefixScan` methods to make sure the wide columns exposed by the iterators are as expected (based on the value base encoded into the iterator value). It also makes some code hygiene improvements in these methods. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10786 Test Plan: Ran some simple blackbox tests in the various modes (non-batched, batched, CF consistency). Reviewed By: riversand963 Differential Revision: D40163623 Pulled By: riversand963 fbshipit-source-id: 72f4c3b51063e48c15f974c4ec64d751d3ed0a83 07 October 2022, 18:17:57 UTC
943247b Expand stress test coverage for min_write_buffer_number_to_merge (#10785) Summary: As title. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10785 Test Plan: CI Reviewed By: ltamasi Differential Revision: D40162583 Pulled By: ltamasi fbshipit-source-id: 4e01f9b682f397130e286cf5d82190b7973fa3c1 07 October 2022, 01:08:19 UTC
23fa5b7 Use `sstableKeyCompare()` for compaction output boundary check (#10763) Summary: To make it consistent with the compaction picker which uses the `sstableKeyCompare()` to pick the overlap files. For example, without this change, it may cut L1 files like: ``` L1: [2-21] [22-30] L2: [1-10] [21-30] ``` Because "21" on L1 is smaller than "21" on L2. But for compaction, these 2 files are overlapped. `sstableKeyCompare()` also take range delete into consideration which may cut file for the same key. It also makes the `max_compaction_bytes` calculation more accurate for cases like above, the overlapped bytes was under estimated. Also make sure the 2 keys won't be splitted to 2 files because of reaching `max_compaction_bytes`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10763 Reviewed By: cbi42 Differential Revision: D39971904 Pulled By: cbi42 fbshipit-source-id: bcc309e9c3dc61a8f50667a6f633e6132c0154a8 06 October 2022, 22:54:58 UTC
d6d8c00 Verify columns in NonBatchedOpsStressTest::VerifyDb (#10783) Summary: As the first step of covering the wide-column functionality of iterators in our stress tests, the patch adds verification logic to `NonBatchedOpsStressTest::VerifyDb` that checks whether the iterator's value and columns are in sync. Note: I plan to update the other types of stress tests and add similar verification for prefix scans etc. in separate PRs. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10783 Test Plan: Ran some simple blackbox crash tests. Reviewed By: riversand963 Differential Revision: D40152370 Pulled By: riversand963 fbshipit-source-id: 8f9d17d7af5da58ccf1bd2057cab53cc9645ac35 06 October 2022, 22:07:16 UTC
b205c6d Fix bug in HyperClockCache ApplyToEntries; cleanup (#10768) Summary: We have seen some rare crash test failures in HyperClockCache, and the source could certainly be a bug fixed in this change, in ClockHandleTable::ConstApplyToEntriesRange. It wasn't properly accounting for the fact that incrementing the acquire counter could be ineffective, due to parallel updates. (When incrementing the acquire counter is ineffective, it is incorrect to then decrement it.) This change includes some other minor clean-up in HyperClockCache, and adds stats_dump_period_sec with a much lower period to the crash test. This should be the primary caller of ApplyToEntries, in collecting cache entry stats. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10768 Test Plan: haven't been able to reproduce the failure, but should be in a better state (bug fix and improved crash test) Reviewed By: anand1976 Differential Revision: D40034747 Pulled By: anand1976 fbshipit-source-id: a06fcefe146e17ee35001984445cedcf3b63eb68 06 October 2022, 21:54:21 UTC
f461e06 Address feedback on recent recovery testing blog post (#10780) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10780 Reviewed By: hx235 Differential Revision: D40120327 Pulled By: hx235 fbshipit-source-id: 08b43a11cee11743b4428dd2a9aff44270668e05 05 October 2022, 22:31:04 UTC
4d82b94 Sanitize min_write_buffer_number_to_merge to 1 with atomic_flush (#10773) Summary: With current implementation, within the same RocksDB instance, all column families with non-empty memtables will be scheduled for flush if RocksDB determines that any column family needs to be flushed, e.g. memtable full, write buffer manager, etc., if atomic flush is enabled. Not doing so can lead to data loss and inconsistency when WAL is disabled, which is a common setting when atomic flush is enabled. Therefore, setting a per-column-family knob, min_write_buffer_number_to_merge to a value greater than 1 is not compatible with atomic flush, and should be sanitized during column family creation and db open. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10773 Test Plan: Reproduce: D39993203 has detailed steps. Run the test with and without the fix. Reviewed By: cbi42 Differential Revision: D40077955 Pulled By: cbi42 fbshipit-source-id: 451a9179eb531ac42eaccf40b451b9dec4085240 05 October 2022, 19:24:39 UTC
eca47fb Ignore kBottommostFiles compaction logic when allow_ingest_behind (#10767) Summary: fix for https://github.com/facebook/rocksdb/issues/10752 where RocksDB could be in an infinite compaction loop (with compaction reason kBottommostFiles) if allow_ingest_behind is enabled and the bottommost level is unfilled. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10767 Test Plan: Added a unit test to reproduce the compaction loop. Reviewed By: ajkr Differential Revision: D40031861 Pulled By: ajkr fbshipit-source-id: 71c4b02931fbe507a847632905404c9b8fa8c96b 05 October 2022, 16:27:14 UTC
00d697b blog post: Verifying crash-recovery with lost buffered writes (#10775) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10775 Reviewed By: hx235 Differential Revision: D40090300 Pulled By: hx235 fbshipit-source-id: 1358f0a4a1583b49548305cfd1477e520c8985ba 05 October 2022, 06:24:54 UTC
ffde463 Cleanup SuperVersion in Iterator::Refresh() (#10770) Summary: Fix a bug in Iterator::Refresh() where the local SV it obtained could be obsolete upon return, and should be cleaned up. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10770 Test Plan: added a unit test to reproduce the issue. Reviewed By: ajkr Differential Revision: D40063809 Pulled By: ajkr fbshipit-source-id: 619e728eb0f1ac9540b4d0ad38e43acc37a514b2 05 October 2022, 05:23:24 UTC
edda219 Manual flush with `wait=false` should not stall when writes stopped (#10001) Summary: When `FlushOptions::wait` is set to false, manual flush should not stall forever. If the database has already stopped writes, then the thread calling `DB::Flush()` with `FlushOptions::wait=false` should not enter the `DBImpl::write_thread_`. To prevent this, we should do a check at the beginning and return `TryAgain()` Resolves: https://github.com/facebook/rocksdb/issues/9892 Pull Request resolved: https://github.com/facebook/rocksdb/pull/10001 Reviewed By: siying Differential Revision: D36422303 Pulled By: siying fbshipit-source-id: 723bd3065e8edc4f17c82449d0d6b95a2381ac0a 04 October 2022, 23:43:01 UTC
f007ad8 RoundRobin TTL compaction (#10725) Summary: For RoundRobin compaction, the data should be mostly sorted per level and within level. Use normal compaction picker for RR until all expired data is compacted. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10725 Reviewed By: ajkr Differential Revision: D39771069 Pulled By: jay-zhuang fbshipit-source-id: 7ccf88d7c093fad5673bda73a7b08cc4757780cd 04 October 2022, 21:53:32 UTC
626eaa4 ci: add GitHub token permissions for workflow (#10549) Summary: This PR adds minimum token permissions for the GITHUB_TOKEN in GitHub Actions workflows using https://github.com/step-security/secure-workflows. GitHub recommends defining minimum GITHUB_TOKEN permissions for securing GitHub Actions workflows - https://github.blog/changelog/2021-04-20-github-actions-control-permissions-for-github_token/ - https://docs.github.com/en/actions/security-guides/automatic-token-authentication#modifying-the-permissions-for-the-github_token - The Open Source Security Foundation (OpenSSF) [Scorecards](https://github.com/ossf/scorecard) treats not setting token permissions as a high-risk issue This project is part of the top 100 critical projects as per OpenSSF (https://github.com/ossf/wg-securing-critical-projects), so fixing the token permissions to improve security. Before the change: `GITHUB_TOKEN` has `write` permissions for multiple scopes, e.g. https://github.com/facebook/rocksdb/runs/7936368166?check_suite_focus=true#step:1:19 After the change: `GITHUB_TOKEN` will have minimum permissions needed for the jobs. Signed-off-by: Varun Sharma <varunsh@stepsecurity.io> Pull Request resolved: https://github.com/facebook/rocksdb/pull/10549 Reviewed By: ajkr Differential Revision: D38923184 Pulled By: jay-zhuang fbshipit-source-id: 0c48f98fe90665e53724f57a7d3b01dd80f34a93 04 October 2022, 19:10:30 UTC
5f4391d Some clean-up of secondary cache (#10730) Summary: This is intended as a step toward possibly separating secondary cache integration from the Cache implementation as much as possible, to (hopefully) minimize code duplication in adding secondary cache support to HyperClockCache. * Major clarifications to API docs of secondary cache compatible parts of Cache. For example, previously the docs seemed to suggest that Wait() was not needed if IsReady()==true. And it wasn't clear what operations were actually supported on pending handles. * Add some assertions related to these requirements, such as that we don't Release() before Wait() (which would leak a secondary cache handle). * Fix a leaky abstraction with dummy handles, which are supposed to be internal to the Cache. Previously, these just used value=nullptr to indicate dummy handle, which meant that they could be confused with legitimate value=nullptr cases like cache reservations. Also fixed blob_source_test which was relying on this leaky abstraction. * Drop "incomplete" terminology, which was another name for "pending". * Split handle flags into "mutable" ones requiring mutex and "immutable" ones which do not. Because of single-threaded access to pending handles, the "Is Pending" flag can be in the "immutable" set. This allows removal of a TSAN work-around and removing a mutex acquire-release in IsReady(). * Remove some unnecessary handling of charges on handles of failed lookups. Keeping total_charge=0 means no special handling needed. (Removed one unnecessary mutex acquire/release.) * Simplify handling of dummy handle in Lookup(). There is no need to explicitly Ref & Release w/Erase if we generally overwrite the dummy anyway. (Removed one mutex acquire/release, a call to Release().) Intended follow-up: * Clarify APIs in secondary_cache.h * Doesn't SecondaryCacheResultHandle transfer ownership of the Value() on success (implementations should not release the value in destructor)? * Does Wait() need to be called if IsReady() == true? (This would be different from Cache.) * Do Value() and Size() have undefined behavior if IsReady() == false? * Why have a custom API for what is essentially a std::future<std::pair<void*, size_t>>? * Improve unit testing of standalone handle case * Apparent null `e` bug in `free_standalone_handle` case * Clean up secondary cache testing in lru_cache_test * Why does TestSecondaryCacheResultHandle hold on to a Cache::Handle? * Why does TestSecondaryCacheResultHandle::Wait() do nothing? Shouldn't it establish the post-condition IsReady() == true? * (Assuming that is sorted out...) Shouldn't TestSecondaryCache::WaitAll simply wait on each handle in order (no casting required)? How about making that the default implementation? * Why does TestSecondaryCacheResultHandle::Size() check Value() first? If the API is intended to be returning 0 before IsReady(), then that is weird but should at least be documented. Otherwise, if it's intended to be undefined behavior, we should assert IsReady(). * Consider replacing "standalone" and "dummy" entries with a single kind of "weak" entry that deletes its value when it reaches zero refs. Suppose you are using compressed secondary cache and have two iterators at similar places. It will probably common for one iterator to have standalone results pinned (out of cache) when the second iterator needs those same blocks and has to re-load them from secondary cache and duplicate the memory. Combining the dummy and the standalone should fix this. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10730 Test Plan: existing tests (minor update), and crash test with sanitizers and secondary cache Performance test for any regressions in LRUCache (primary only): Create DB with ``` TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16 ``` Test before & after (run at same time) with ``` TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom[-X100] -readonly -num=30000000 -bloom_bits=16 -cache_index_and_filter_blocks=1 -cache_size=233000000 -duration 30 -threads=16 ``` Before: readrandom [AVG 100 runs] : 22234 (± 63) ops/sec; 1.6 (± 0.0) MB/sec After: readrandom [AVG 100 runs] : 22197 (± 64) ops/sec; 1.6 (± 0.0) MB/sec That's within 0.2%, which is not significant by the confidence intervals. Reviewed By: anand1976 Differential Revision: D39826010 Pulled By: anand1976 fbshipit-source-id: 3202b4a91f673231c97648ae070e502ae16b0f44 04 October 2022, 05:23:38 UTC
3ae00de Disable ingestion in stress tests when PutEntity is used (#10769) Summary: `SstFileWriter` currently does not support the `PutEntity` API, so in `TestIngestExternalFile` all key-values are written using regular `Put`s. This violates the assumption that whether or not a key corresponds to a plain old key-value or a wide-column entity can be determined by solely looking at the "value base" used when generating the value. The patch fixes this issue by disabling ingestion when `PutEntity` is enabled in the stress tests. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10769 Test Plan: Ran a simple blackbox stress test. Reviewed By: akankshamahajan15 Differential Revision: D40042132 Pulled By: ltamasi fbshipit-source-id: 93e75ff55545b7b69fa4ddef1d96093c961158a0 04 October 2022, 01:09:56 UTC
8b430e0 Add iterator refresh to stress test (#10766) Summary: added calls to `Iterator::Refresh()` in `NonBatchedOpsStressTest::TestIterateAgainstExpected()`. The testing key range is locked in `TestIterateAgainstExpected` so I do not expect this change to provide thorough stress test to `Iterator::Refresh()`. However, it can still be helpful for catching bugs like https://github.com/facebook/rocksdb/issues/10739. Will add calls to refresh in `TestIterate` once we support iterator refresh with snapshots. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10766 Test Plan: `python3 tools/db_crashtest.py whitebox --simple --verify_iterator_with_expected_state_one_in=2` Reviewed By: ajkr Differential Revision: D40008320 Pulled By: ajkr fbshipit-source-id: cec93b07f915ef6476d41c1fee9b23c115188085 03 October 2022, 23:22:39 UTC
ae0f9c3 Add new property in IOOptions to skip recursing through directories and list only files during GetChildren. (#10668) Summary: Add new property "do_not_recurse" in IOOptions for underlying file system to skip iteration of directories during DB::Open if there are no sub directories and list only files. By default this property is set to false. This property is set true currently in the code where RocksDB is sure only files are needed during DB::Open. Provided support in PosixFileSystem to use "do_not_recurse". TestPlan: - Existing tests Pull Request resolved: https://github.com/facebook/rocksdb/pull/10668 Reviewed By: anand1976 Differential Revision: D39471683 Pulled By: akankshamahajan15 fbshipit-source-id: 90e32f0b86d5346d53bc2714d3a0e7002590527f 03 October 2022, 17:59:45 UTC
9f2363f User-defined timestamp support for `DeleteRange()` (#10661) Summary: Add user-defined timestamp support for range deletion. The new API is `DeleteRange(opt, cf, begin_key, end_key, ts)`. Most of the change is to update the comparator to compare without timestamp. Other than that, major changes are - internal range tombstone data structures (`FragmentedRangeTombstoneList`, `RangeTombstone`, etc.) to store timestamps. - Garbage collection of range tombstones and range tombstone covered keys during compaction. - Get()/MultiGet() to return the timestamp of a range tombstone when needed. - Get/Iterator with range tombstones bounded by readoptions.timestamp. - timestamp crash test now issues DeleteRange by default. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10661 Test Plan: - Added unit test: `make check` - Stress test: `python3 tools/db_crashtest.py --enable_ts whitebox --readpercent=57 --prefixpercent=4 --writepercent=25 -delpercent=5 --iterpercent=5 --delrangepercent=4` - Ran `db_bench` to measure regression when timestamp is not enabled. The tests are for write (with some range deletion) and iterate with DB fitting in memory: `./db_bench--benchmarks=fillrandom,seekrandom --writes_per_range_tombstone=200 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --writes=500000 --reads=500000 --seek_nexts=10 --disable_auto_compactions -disable_wal=true --max_num_range_tombstones=1000`. Did not see consistent regression in no timestamp case. | micros/op | fillrandom | seekrandom | | --- | --- | --- | |main| 2.58 |10.96| |PR 10661| 2.68 |10.63| Reviewed By: riversand963 Differential Revision: D39441192 Pulled By: cbi42 fbshipit-source-id: f05aca3c41605caf110daf0ff405919f300ddec2 30 September 2022, 23:13:03 UTC
3b81649 Add manual_wal_flush, FlushWAL() to stress/crash test (#10698) Summary: **Context/Summary:** Introduce `manual_wal_flush_one_in` as titled. - When `manual_wal_flush_one_in > 0`, we also need tracing to correctly verify recovery because WAL data can be lost in this case when `FlushWAL()` is not explicitly called by users of RocksDB (in our case, db stress) and the recovery from such potential WAL data loss is a prefix recovery that requires tracing to verify. As another consequence, we need to disable features can't run under unsync data loss with `manual_wal_flush_one_in` Incompatibilities fixed along the way: ``` db_stress: db/db_impl/db_impl_open.cc:2063: static rocksdb::Status rocksdb::DBImpl::Open(const rocksdb::DBOptions&, const string&, const std::vector<rocksdb::ColumnFamilyDescriptor>&, std::vector<rocksdb::ColumnFamilyHandle*>*, rocksdb::DB**, bool, bool): Assertion `impl->TEST_WALBufferIsEmpty()' failed. ``` - It turns out that `Writer::AddCompressionTypeRecord` before this assertion `EmitPhysicalRecord(kSetCompressionType, encode.data(), encode.size());` but do not trigger flush if `manual_wal_flush` is set . This leads to `impl->TEST_WALBufferIsEmpty()' is false. - As suggested, assertion is removed and violation case is handled by `FlushWAL(sync=true)` along with refactoring `TEST_WALBufferIsEmpty()` to be `WALBufferIsEmpty()` since it is used in prod code now. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10698 Test Plan: - Locally running `python3 tools/db_crashtest.py blackbox --manual_wal_flush_one_in=1 --manual_wal_flush=1 --sync_wal_one_in=100 --atomic_flush=1 --flush_one_in=100 --column_families=3` - Joined https://github.com/facebook/rocksdb/pull/10624 in auto CI testings with all RocksDB stress/crash test jobs Reviewed By: ajkr Differential Revision: D39593752 Pulled By: ajkr fbshipit-source-id: 3a2135bb792c52d2ffa60257d4fbc557fb04d2ce 30 September 2022, 22:48:33 UTC
793fd09 Track expected state only if expected values dir is non-empty (#10764) Summary: If the `-expected_values_dir` argument to db_stress is empty, then verification against expected state is effectively disabled. But `RunStressTest` still calls `TrackExpectedState`, which returns `NotSupported` causing a the crash test to fail with a false alarm. Fix it by only calling `TrackExpectedState` if necessary. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10764 Reviewed By: ajkr Differential Revision: D39980129 Pulled By: anand1976 fbshipit-source-id: d02651746fe3a297877a4b2b2fbcb7274860f49c 30 September 2022, 20:37:05 UTC
9078fcc Add the PutEntity API to the stress/crash tests (#10760) Summary: The patch adds the `PutEntity` API to the non-batched, batched, and CF consistency stress tests. Namely, when the new `db_stress` command line parameter `use_put_entity_one_in` is greater than zero, one in N writes on average is performed using `PutEntity` rather than `Put`. The wide-column entity written has the generated value in its default column; in addition, it contains up to three additional columns where the original generated value is divided up between the column name and the column value (with the column name containing the first k characters of the generated value, and the column value containing the rest). Whether `PutEntity` is used (and if so, how many columns the entity has) is completely determined by the "value base" used to generate the value (that is, there is no randomness involved). Assuming the same `use_put_entity_one_in` setting is used across `db_stress` invocations, this enables us to reconstruct and validate the entity during subsequent `db_stress` runs. Note that `PutEntity` is currently incompatible with `Merge`, transactions, and user-defined timestamps; these combinations are currently disabled/disallowed. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10760 Test Plan: Ran some batched, non-batched, and CF consistency stress tests using the script. Reviewed By: riversand963 Differential Revision: D39939032 Pulled By: ltamasi fbshipit-source-id: eafdf124e95993fb7d73158e3b006d11819f7fa9 30 September 2022, 18:11:07 UTC
fd71a82 Use actual file size when checking max_compaction_size (#10728) Summary: currently, there are places in compaction_picker where we add up `compensated_file_size` of files being compacted and limit the sum to be under `max_compaction_bytes`. `compensated_file_size` contains booster for point tombstones and should be used only for determining file's compaction priority. This PR replaces `compensated_file_size` with actual file size in such places. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10728 Test Plan: CI Reviewed By: ajkr Differential Revision: D39789427 Pulled By: cbi42 fbshipit-source-id: 1f89fb6c0159c53bf01d8dc783f465959f442c81 30 September 2022, 17:50:44 UTC
f3cc666 Align compaction output file boundaries to the next level ones (#10655) Summary: Try to align the compaction output file boundaries to the next level ones (grandparent level), to reduce the level compaction write-amplification. In level compaction, there are "wasted" data at the beginning and end of the output level files. Align the file boundary can avoid such "wasted" compaction. With this PR, it tries to align the non-bottommost level file boundaries to its next level ones. It may cut file when the file size is large enough (at least 50% of target_file_size) and not too large (2x target_file_size). db_bench shows about 12.56% compaction reduction: ``` TEST_TMPDIR=/data/dbbench2 ./db_bench --benchmarks=fillrandom,readrandom -max_background_jobs=12 -num=400000000 -target_file_size_base=33554432 # baseline: Flush(GB): cumulative 25.882, interval 7.216 Cumulative compaction: 285.90 GB write, 162.36 MB/s write, 269.68 GB read, 153.15 MB/s read, 2926.7 seconds # with this change: Flush(GB): cumulative 25.882, interval 7.753 Cumulative compaction: 249.97 GB write, 141.96 MB/s write, 233.74 GB read, 132.74 MB/s read, 2534.9 seconds ``` The compaction simulator shows a similar result (14% with 100G random data). As a side effect, with this PR, the SST file size can exceed the target_file_size, but is capped at 2x target_file_size. And there will be smaller files. Here are file size statistics when loading 100GB with the target file size 32MB: ``` baseline this_PR count 1.656000e+03 1.705000e+03 mean 3.116062e+07 3.028076e+07 std 7.145242e+06 8.046139e+06 ``` The feature is enabled by default, to revert to the old behavior disable it with `AdvancedColumnFamilyOptions.level_compaction_dynamic_file_size = false` Also includes https://github.com/facebook/rocksdb/issues/1963 to cut file before skippable grandparent file. Which is for use case like user adding 2 or more non-overlapping data range at the same time, it can reduce the overlapping of 2 datasets in the lower levels. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10655 Reviewed By: cbi42 Differential Revision: D39552321 Pulled By: jay-zhuang fbshipit-source-id: 640d15f159ab0cd973f2426cfc3af266fc8bdde2 30 September 2022, 02:43:55 UTC
47b57a3 add SetCapacity and GetCapacity for secondary cache (#10712) Summary: To support tuning secondary cache dynamically, add `SetCapacity()` and `GetCapacity()` for CompressedSecondaryCache. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10712 Test Plan: Unit Tests Reviewed By: anand1976 Differential Revision: D39685212 Pulled By: gitbw95 fbshipit-source-id: 19573c67237011927320207732b5de083cb87240 30 September 2022, 02:15:04 UTC
aa71464 Remove and recreate expected values dir in white-box testing 2nd half (#10743) Summary: **Context:** https://github.com/facebook/rocksdb/pull/10732#pullrequestreview-1121076205 Pull Request resolved: https://github.com/facebook/rocksdb/pull/10743 Test Plan: - Locally run `python3 ./tools/db_crashtest.py whitebox --simple -max_key=1000000 -value_size_mult=33 -write_buffer_size=524288 -target_file_size_base=524288 -max_bytes_for_level_base=2097152 --duration=120 --interval=10 --ops_per_thread=1000 --random_kill_odd=887` - CI jobs testing Reviewed By: ajkr Differential Revision: D39838733 Pulled By: ajkr fbshipit-source-id: 9e819b66b0293dfc7a31a908a9d42c6baca4aeaa 29 September 2022, 23:29:51 UTC
5f4b736 cmake : Add ALL plugin LIBS to THIRD_PARTYLIBS (#10727) Summary: Bringing in multiple libraries failed as they were not considered as separate arguments. In this commit we make sure to add *all* the libraries to THIRD_PARTYLIBS. Additionally we add more informative status messages for when the plugins get added. Signed-off-by: Joel Granados <joel.granados@gmail.com> Pull Request resolved: https://github.com/facebook/rocksdb/pull/10727 Reviewed By: riversand963 Differential Revision: D39778566 Pulled By: ajkr fbshipit-source-id: 34306b26ab4c726d17353ddd765f368967a1b59f 29 September 2022, 19:42:52 UTC
dc9f499 db_stress TestIngestExternalFile avoid empty files (#10754) Summary: If all the keys in range [key_base, shared->GetMaxKey()) are non-overwritable `TestIngestExternalFile()` would attempt to ingest a file with zero keys, leading to the following error: "Cannot create sst file with no entries". This PR changes `TestIngestExternalFile()` to return early in that case instead of going through with the ingestion attempt. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10754 Reviewed By: hx235 Differential Revision: D39909195 Pulled By: ajkr fbshipit-source-id: e06e6b9cc24826fbd450e5130885e6f07164badd 28 September 2022, 23:21:43 UTC
b0d8ccb db_stress print TestMultiGet error value in hex (#10753) Summary: Without this fix, db_crashtest.py could fail with useless output such as: `UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 267: invalid start byte` Pull Request resolved: https://github.com/facebook/rocksdb/pull/10753 Reviewed By: hx235 Differential Revision: D39905809 Pulled By: ajkr fbshipit-source-id: 50ba2cf20d206eeb168309cec137e827a34c8f0b 28 September 2022, 22:17:12 UTC
d2578ab Add DECLARE_uint32 to gflags compatibility (#10729) Summary: Older versions of gflags do not have `DEFINE_uint32` and `DECLARE_uint32`. In util/gflag_compat.h, we already add a hack for `DEFINE_uint32`. This PR adds a hack for `DECLARE_uint32`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10729 Test Plan: ROCKSDB_NO_FBCODE=1 make V=1 -j16 db_stress make check Resolves https://github.com/facebook/rocksdb/issues/10704 Reviewed By: pdillinger Differential Revision: D39789183 Pulled By: riversand963 fbshipit-source-id: a58747e0163dcf55dd762733aa5c40d8f0ae70a6 28 September 2022, 03:12:13 UTC
f3b359a Set options.num_levels in db_stress_test_base (#10732) Summary: An add-on to https://github.com/facebook/rocksdb/pull/6818 to complete adding single-level universal compaction to stress/crash testing. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10732 Test Plan: - Locally run for 10 min `python3 ./tools/db_crashtest.py whitebox --simple --compaction_style=1 --num_levels=1 -max_key=1000000 -value_size_mult=33 -write_buffer_size=524288 -target_file_size_base=524288 -max_bytes_for_level_base=2097152 --duration=120 --interval=10 --ops_per_thread=1000 --random_kill_odd=887` - Check LOG to confirm single-level universal compaction is called - Manual testing and log checking to ensure destroy_db_initially=1 is correctly set across runs with different compaction styles (i.e, in the second half of whitebox testing). - [ongoing]CI jobs stress test Reviewed By: ajkr Differential Revision: D39797612 Pulled By: ajkr fbshipit-source-id: 16f5c40c3464c57360c06c8305f92118e426149c 27 September 2022, 19:18:28 UTC
7045b74 Remove timestamp before inserting to WBWI's index (#10742) Summary: Currently, this original behavior should not lead to incorrect result, but will violate the contract of CompareWithTimestamp() that when a_has_ts or b_has_ts is false, the slice does not include timestamp. Resolves https://github.com/facebook/rocksdb/issues/10709 Pull Request resolved: https://github.com/facebook/rocksdb/pull/10742 Test Plan: make check Reviewed By: ltamasi Differential Revision: D39834096 Pulled By: riversand963 fbshipit-source-id: c597600f5a7820734f07d0926cdc224cea5eabe1 27 September 2022, 16:04:57 UTC
df49279 Fix segfault in Iterator::Refresh() (#10739) Summary: when a new internal iterator is constructed during iterator refresh, pointer to the previous memtable range tombstone iterator was not cleared. This could cause segfault for future `Refresh()` calls when they try to free the memtable range tombstones. This PR fixes this issue. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10739 Test Plan: added a unit test in db_range_del_test.cc to reproduce this issue. Reviewed By: ajkr, riversand963 Differential Revision: D39825283 Pulled By: cbi42 fbshipit-source-id: 3b59a2b73865aed39e28cdd5c1b57eed7991b94c 27 September 2022, 01:57:23 UTC
aed30dd Support WriteCommit policy with sync_fault_injection=1 (#10624) Summary: **Context:** Prior to this PR, correctness testing with un-sync data loss [disabled](https://github.com/facebook/rocksdb/pull/10605) transaction (`use_txn=1`) thus all of the `txn_write_policy` . This PR improved that by adding support for one policy - WriteCommit (`txn_write_policy=0`). **Summary:** They key to this support is (a) handle Mark{Begin, End}Prepare/MarkCommit/MarkRollback in constructing ExpectedState under WriteCommit policy correctly and (b) monitor CI jobs and solve any test incompatibility issue till jobs are stable. (b) will be part of the test plan. For (a) - During prepare (i.e, between `MarkBeginPrepare()` and `MarkEndPrepare(xid)`), `ExpectedStateTraceRecordHandler` will buffer all writes by adding all writes to an internal `WriteBatch`. - On `MarkEndPrepare()`, that `WriteBatch` will be associated with the transaction's `xid`. - During the commit (i.e, on `MarkCommit(xid)`), `ExpectedStateTraceRecordHandler` will retrieve and iterate the internal `WriteBatch` and finally apply those writes to `ExpectedState` - During the rollback (i.e, on `MarkRollback(xid)`), `ExpectedStateTraceRecordHandler` will erase the internal `WriteBatch` from the map. For (b) - one major issue described below: - TransactionsDB in db stress recovers prepared-but-not-committed txns from the previous crashed run by randomly committing or rolling back it at the start of the current run, see a historical [PR](https://github.com/facebook/rocksdb/commit/6d06be22c083ccf185fd38dba49fde73b644b4c1) predated correctness testing. - And we will verify those processed keys in a recovered db against their expected state. - However since now we turn on `sync_fault_injection=1` where the expected state is constructed from the trace instead of using the LATEST.state from previous run. The expected state now used to verify those processed keys won't contain UNKNOWN_SENTINEL as they should - see test 1 for a failed case. - Therefore, we decided to manually update its expected state to be UNKNOWN_SENTINEL as part of the processing. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10624 Test Plan: 1. Test exposed the major issue described above. This test will fail without setting UNKNOWN_SENTINEL in expected state during the processing and pass after ``` db=/dev/shm/rocksdb_crashtest_blackbox exp=/dev/shm/rocksdb_crashtest_expected dbt=$db.tmp expt=$exp.tmp rm -rf $db $exp mkdir -p $exp echo "RUN 1" ./db_stress \ --clear_column_family_one_in=0 --column_families=1 --db=$db --delpercent=10 --delrangepercent=0 --destroy_db_initially=0 --expected_values_dir=$exp --iterpercent=0 --key_len_percent_dist=1,30,69 --max_key=1000000 --max_key_len=3 --prefixpercent=0 --readpercent=0 --reopen=0 --ops_per_thread=100000000 --test_batches_snapshots=0 --value_size_mult=32 --writepercent=90 \ --use_txn=1 --txn_write_policy=0 --sync_fault_injection=1 & pid=$! sleep 0.2 sleep 20 kill $pid sleep 0.2 echo "RUN 2" ./db_stress \ --clear_column_family_one_in=0 --column_families=1 --db=$db --delpercent=10 --delrangepercent=0 --destroy_db_initially=0 --expected_values_dir=$exp --iterpercent=0 --key_len_percent_dist=1,30,69 --max_key=1000000 --max_key_len=3 --prefixpercent=0 --readpercent=0 --reopen=0 --ops_per_thread=100000000 --test_batches_snapshots=0 --value_size_mult=32 --writepercent=90 \ --use_txn=1 --txn_write_policy=0 --sync_fault_injection=1 & pid=$! sleep 0.2 sleep 20 kill $pid sleep 0.2 echo "RUN 3" ./db_stress \ --clear_column_family_one_in=0 --column_families=1 --db=$db --delpercent=10 --delrangepercent=0 --destroy_db_initially=0 --expected_values_dir=$exp --iterpercent=0 --key_len_percent_dist=1,30,69 --max_key=1000000 --max_key_len=3 --prefixpercent=0 --readpercent=0 --reopen=0 --ops_per_thread=100000000 --test_batches_snapshots=0 --value_size_mult=32 --writepercent=90 \ --use_txn=1 --txn_write_policy=0 --sync_fault_injection=1 ``` 2. Manual testing to ensure ExpectedState is constructed correctly during recovery by verifying it against previously crashed TransactionDB's WAL. - Run the following command to crash a TransactionDB with WriteCommit policy. Then `./ldb dump_wal` on its WAL file ``` db=/dev/shm/rocksdb_crashtest_blackbox exp=/dev/shm/rocksdb_crashtest_expected rm -rf $db $exp mkdir -p $exp ./db_stress \ --clear_column_family_one_in=0 --column_families=1 --db=$db --delpercent=10 --delrangepercent=0 --destroy_db_initially=0 --expected_values_dir=$exp --iterpercent=0 --key_len_percent_dist=1,30,69 --max_key=1000000 --max_key_len=3 --prefixpercent=0 --readpercent=0 --reopen=0 --ops_per_thread=100000000 --test_batches_snapshots=0 --value_size_mult=32 --writepercent=90 \ --use_txn=1 --txn_write_policy=0 --sync_fault_injection=1 & pid=$! sleep 30 kill $pid sleep 1 ``` - Run the following command to verify recovery of the crashed db under debugger. Compare the step-wise result with WAL records (e.g, WriteBatch content, xid, prepare/commit/rollback marker) ``` ./db_stress \ --clear_column_family_one_in=0 --column_families=1 --db=$db --delpercent=10 --delrangepercent=0 --destroy_db_initially=0 --expected_values_dir=$exp --iterpercent=0 --key_len_percent_dist=1,30,69 --max_key=1000000 --max_key_len=3 --prefixpercent=0 --readpercent=0 --reopen=0 --ops_per_thread=100000000 --test_batches_snapshots=0 --value_size_mult=32 --writepercent=90 \ --use_txn=1 --txn_write_policy=0 --sync_fault_injection=1 ``` 3. Automatic testing by triggering all RocksDB stress/crash test jobs for 3 rounds with no failure. Reviewed By: ajkr, riversand963 Differential Revision: D39199373 Pulled By: hx235 fbshipit-source-id: 7a1dec0e3e2ee6ea86ddf5dd19ceb5543a3d6f0c 27 September 2022, 01:01:59 UTC
5d7cf31 Add OpenSSL to docker image (#10741) Summary: Update the docker image with OpenSSL, required by the folly build. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10741 Reviewed By: jay-zhuang Differential Revision: D39831081 Pulled By: anand1976 fbshipit-source-id: 900154f70a456d1b6f9e384b8bdbcc227af4adbc 27 September 2022, 00:36:57 UTC
52f2411 Update HISTORY to mention PR #10724 (#10737) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10737 Reviewed By: cbi42 Differential Revision: D39825386 Pulled By: riversand963 fbshipit-source-id: a3c55f2777e034d6ae6ff44ef0219d9fbbf1cc96 26 September 2022, 22:59:30 UTC
2280b26 Small cleanup in NonBatchedOpsStressTest::VerifyDb (#10740) Summary: The PR cleans up the logic in `NonBatchedOpsStressTest::VerifyDb` so that the verification method is picked using a single random number generation. It also eliminates some repeated key comparisons and makes some small code hygiene improvements. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10740 Test Plan: Ran a simple blackbox crash test. Reviewed By: riversand963 Differential Revision: D39828646 Pulled By: ltamasi fbshipit-source-id: 60ee5a3bb1851278f62c7d83b0c93b902ed9702e 26 September 2022, 22:33:36 UTC
07249fe Fix DBImpl::GetLatestSequenceForKey() for Merge (#10724) Summary: Currently, without this fix, DBImpl::GetLatestSequenceForKey() may not return the latest sequence number for merge operands of the key. This can cause conflict checking during optimistic transaction commit phase to fail. Fix it by always returning the latest sequence number of the key, also considering range tombstones. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10724 Test Plan: make check Reviewed By: cbi42 Differential Revision: D39756847 Pulled By: riversand963 fbshipit-source-id: 0764c3dd4cb24960b37e18adccc6e7feed0e6876 24 September 2022, 00:29:05 UTC
c76a90c CI benchmarks return NUM_KEYS to previous size (#10649) Summary: Larger size is necessary to stress levels 2, 3 of LSM tree Pull Request resolved: https://github.com/facebook/rocksdb/pull/10649 Reviewed By: ajkr Differential Revision: D39744515 Pulled By: jay-zhuang fbshipit-source-id: 62ff097bfbfdfc26ff1e6290e1e3b71506b7042c 23 September 2022, 16:39:40 UTC
6d2a983 Clarify API comments for blob_cache/prepopulate_blob_cache (#10723) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10723 Reviewed By: riversand963 Differential Revision: D39749277 Pulled By: ltamasi fbshipit-source-id: 4bda94b4620a0db1fcd4309c7ad03fc23e8718cb 23 September 2022, 15:27:41 UTC
1b351fd Add C API to set avoid_unnecessary_blocking_io option (#10693) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10693 Reviewed By: riversand963 Differential Revision: D39668399 Pulled By: ajkr fbshipit-source-id: 66d46e5104c49d235a14e8df8eec3af285ab9752 23 September 2022, 01:41:06 UTC
4a83b16 Use grep instead of obsolete egrep (#10701) Summary: It fixes "egrep: warning: egrep is obsolescent; using grep -E" warning at the systems with newer gnu grep. https://www.phoronix.com/news/GNU-Grep-3.8-Stop-egrep-fgrep Pull Request resolved: https://github.com/facebook/rocksdb/pull/10701 Reviewed By: ajkr Differential Revision: D39737908 Pulled By: cbi42 fbshipit-source-id: 13f3ccfc1a37d18541d156a6b4c2ba24f6c66589 22 September 2022, 23:58:21 UTC
80d010a Bump commonmarker from 0.23.4 to 0.23.6 in /docs (#10722) Summary: Bumps [commonmarker](https://github.com/gjtorikian/commonmarker) from 0.23.4 to 0.23.6. <details> <summary>Release notes</summary> <p><em>Sourced from <a href="https://github.com/gjtorikian/commonmarker/releases">commonmarker's releases</a>.</em></p> <blockquote> <h2>v0.23.6</h2> <h2>What's Changed</h2> <p>This release includes two updates from the upstream <code>cmark-gfm</code> library, namely:</p> <ul> <li><a href="https://github.com/github/cmark-gfm/releases">DoS vulnerability in autolink extension</a> per <a href="https://github.com/github/cmark-gfm/security/advisories/GHSA-cgh3-p57x-9q7q">GHSA-cgh3-p57x-9q7q</a></li> <li><a href="https://github.com/github/cmark-gfm/releases/tag/0.29.0.gfm.5">Added <code>xmpp:</code> and <code>mailto:</code> support to the autolink extension</a></li> </ul> </blockquote> </details> <details> <summary>Changelog</summary> <p><em>Sourced from <a href="https://github.com/gjtorikian/commonmarker/blob/main/CHANGELOG.md">commonmarker's changelog</a>.</em></p> <blockquote> <h1>Changelog</h1> </blockquote> </details> <details> <summary>Commits</summary> <ul> <li><a href="https://github.com/gjtorikian/commonmarker/commit/a8f8d76fbc8c92ddb2e539a06bd93c5f8326705e"><code>a8f8d76</code></a> Merge pull request <a href="https://github-redirect.dependabot.com/gjtorikian/commonmarker/issues/190">https://github.com/facebook/rocksdb/issues/190</a> from anticomputer/main</li> <li><a href="https://github.com/gjtorikian/commonmarker/commit/ac916346314aef3015a713f92f7b46a8c34e98ed"><code>ac91634</code></a> :gem: release 0.23.6</li> <li><a href="https://github.com/gjtorikian/commonmarker/commit/777fd3054be0e0ba18f2fda18ccc7eeeee82db92"><code>777fd30</code></a> Update cmark-upstream to <a href="https://github.com/github/cmark-gfm/commit/9d57d8a23">https://github.com/github/cmark-gfm/commit/9d57d8a23</a>...</li> <li><a href="https://github.com/gjtorikian/commonmarker/commit/7aaeb37e9780e87a9e9fbdddfe1feba1c9f360f4"><code>7aaeb37</code></a> Merge pull request <a href="https://github-redirect.dependabot.com/gjtorikian/commonmarker/issues/188">https://github.com/facebook/rocksdb/issues/188</a> from stevenlaidlaw/update-to-0290gfm5</li> <li><a href="https://github.com/gjtorikian/commonmarker/commit/795e628a406ec67f169440ed3aa84ba9483e2700"><code>795e628</code></a> Update cmark-upstream to <a href="https://github.com/github/cmark-gfm/commit/0578e1e4f">https://github.com/github/cmark-gfm/commit/0578e1e4f</a>...</li> <li><a href="https://github.com/gjtorikian/commonmarker/commit/39d19d65300c5735efcc77b2a57b65c207c013e7"><code>39d19d6</code></a> Update cmark-upstream to <a href="https://github.com/github/cmark-gfm/commit/766f161ef">https://github.com/github/cmark-gfm/commit/766f161ef</a>...</li> <li><a href="https://github.com/gjtorikian/commonmarker/commit/63b7bf89ee1be857c5a5757d6ea678c5c759b8b9"><code>63b7bf8</code></a> Update FUNDING.yml</li> <li><a href="https://github.com/gjtorikian/commonmarker/commit/558c7275b18a7ae16136c0fc55f444458dd8cc58"><code>558c727</code></a> Bump to 0.23.5</li> <li><a href="https://github.com/gjtorikian/commonmarker/commit/41eee7265f501305834d80e3d045ed6c9df77de2"><code>41eee72</code></a> lint</li> <li><a href="https://github.com/gjtorikian/commonmarker/commit/897e8ed07d04b902a5cc3c928852bafdab4468aa"><code>897e8ed</code></a> Merge pull request <a href="https://github-redirect.dependabot.com/gjtorikian/commonmarker/issues/180">https://github.com/facebook/rocksdb/issues/180</a> from lumaxis/main</li> <li>Additional commits viewable in <a href="https://github.com/gjtorikian/commonmarker/compare/v0.23.4...v0.23.6">compare view</a></li> </ul> </details> <br /> [![Dependabot compatibility score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=commonmarker&package-manager=bundler&previous-version=0.23.4&new-version=0.23.6)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores) Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `dependabot rebase`. [//]: # (dependabot-automerge-start) [//]: # (dependabot-automerge-end) --- <details> <summary>Dependabot commands and options</summary> <br /> You can trigger Dependabot actions by commenting on this PR: - `dependabot rebase` will rebase this PR - `dependabot recreate` will recreate this PR, overwriting any edits that have been made to it - `dependabot merge` will merge this PR after your CI passes on it - `dependabot squash and merge` will squash and merge this PR after your CI passes on it - `dependabot cancel merge` will cancel a previously requested merge and block automerging - `dependabot reopen` will reopen this PR if it is closed - `dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually - `dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself) - `dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself) - `dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself) - `dependabot use these labels` will set the current labels as the default for future PRs for this repo and language - `dependabot use these reviewers` will set the current reviewers as the default for future PRs for this repo and language - `dependabot use these assignees` will set the current assignees as the default for future PRs for this repo and language - `dependabot use this milestone` will set the current milestone as the default for future PRs for this repo and language You can disable automated security fix PRs for this repo from the [Security Alerts page](https://github.com/facebook/rocksdb/network/alerts). </details> Pull Request resolved: https://github.com/facebook/rocksdb/pull/10722 Reviewed By: riversand963 Differential Revision: D39735507 Pulled By: ajkr fbshipit-source-id: 3a54f9542a3d2b88c2993dd8a125b0e7f990a886 22 September 2022, 23:20:08 UTC
ef443ce Refactor to avoid confusing "raw block" (#10408) Summary: We have a lot of confusing code because of mixed, sometimes completely opposite uses of of the term "raw block" or "raw contents", sometimes within the same source file. For example, in `BlockBasedTableBuilder`, `raw_block_contents` and `raw_size` generally referred to uncompressed block contents and size, while `WriteRawBlock` referred to writing a block that is already compressed if it is going to be. Meanwhile, in `BlockBasedTable`, `raw_block_contents` either referred to a (maybe compressed) block with trailer, or a maybe compressed block maybe without trailer. (Note: left as follow-up work to use C++ typing to better sort out the various kinds of BlockContents.) This change primarily tries to apply some consistent terminology around the kinds of block representations, avoiding the unclear "raw". (Any meaning of "raw" assumes some bias toward the storage layer or toward the logical data layer.) Preferred terminology: * **Serialized block** - bytes that go into storage. For block-based table (usually the case) this includes the block trailer. WART: block `size` may or may not include the trailer; need to be clear about whether it does or not. * **Maybe compressed block** - like a serialized block, but without the trailer (or no promise of including a trailer). Must be accompanied by a CompressionType. * **Uncompressed block** - "payload" bytes that are either stored with no compression, used as input to compression function, or result of decompression function. * **Parsed block** - an in-memory form of a block in block cache, as it is used by the table reader. Different C++ types are used depending on the block type (see block_like_traits.h). Other refactorings: * Misc corrections/improvements of internal API comments * Remove a few misleading / unhelpful / redundant comments. * Use move semantics in some places to simplify contracts * Use better parameter names to indicate which parameters are used for outputs * Remove some extraneous `extern` * Various clean-ups to `CacheDumperImpl` (mostly unnecessary code) Pull Request resolved: https://github.com/facebook/rocksdb/pull/10408 Test Plan: existing tests Reviewed By: akankshamahajan15 Differential Revision: D38172617 Pulled By: pdillinger fbshipit-source-id: ccb99299f324ac5ca46996d34c5089621a4f260c 22 September 2022, 18:25:32 UTC
12f5a1e Clarify comments for cache priorities and pool options (#10718) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10718 Reviewed By: riversand963 Differential Revision: D39707115 Pulled By: ltamasi fbshipit-source-id: 59aec8c732482f063d0abaad4d9200ba57ebf437 21 September 2022, 23:02:08 UTC
93f46da Mention in HISTORY.md the fix in #10705 (#10720) Summary: Mention in HISTORY.md the fix in https://github.com/facebook/rocksdb/issues/10705. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10720 Reviewed By: riversand963 Differential Revision: D39709455 Pulled By: cbi42 fbshipit-source-id: 1a2c9dd8425c73f7095ddb1d0b1cca8ed35b7ef2 21 September 2022, 22:08:43 UTC
fb9a025 Fix platform 10 build with folly (#10708) Summary: Change the library order in PLATFORM_LDFLAGS to enable fbcode platform 10 build with folly. This PR also has a few fixes for platform 10 compiler errors. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10708 Test Plan: ROCKSDB_FBCODE_BUILD_WITH_PLATFORM010=1 USE_COROUTINES=1 make -j64 check ROCKSDB_FBCODE_BUILD_WITH_PLATFORM010=1 USE_FOLLY=1 make -j64 check Reviewed By: ajkr Differential Revision: D39666590 Pulled By: anand1976 fbshipit-source-id: 256a1127ef561399cd6299a6a392ca29bd68ca44 21 September 2022, 21:43:44 UTC
0f4978d Fix sqe->addr passed in cancel request in io_uring (#10644) Summary: Update io_uring_prep_cancel as it is now backward compatible. Also, io_uring_prep_cancel expects sqe->addr to match with read request submitted. It's being set wrong which is now fixed in this PR. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10644 Test Plan: - Ran internally with lastest liburing package and on RocksDB github repo with older version. - Ran seekrandom regression to confirm there is no regression. Reviewed By: anand1976 Differential Revision: D39284229 Pulled By: akankshamahajan15 fbshipit-source-id: fd52cdf23d676da114896163626b75c8ae09c980 21 September 2022, 21:21:59 UTC
013305a Fix potential memory leak in ArenaWrappedDBIter::Refresh() (#10716) Summary: Fix potential memory leak in ArenaWrappedDBIter::Refresh() introduced in https://github.com/facebook/rocksdb/issues/10705. See https://github.com/facebook/rocksdb/pull/10705#discussion_r976765905 for detail. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10716 Test Plan: make check Reviewed By: ajkr Differential Revision: D39698561 Pulled By: cbi42 fbshipit-source-id: dc0d0c6e3878eaa84f87623fbe4916b9b08b077a 21 September 2022, 21:08:10 UTC
dd40f83 Fix lint issues after enable BLACK (#10717) Summary: As title Pull Request resolved: https://github.com/facebook/rocksdb/pull/10717 Test Plan: Unit Tests CI Reviewed By: riversand963 Differential Revision: D39700707 Pulled By: gitbw95 fbshipit-source-id: 54de27e695535a50159f5f6467da36aaf21bebae 21 September 2022, 20:37:51 UTC
749b849 Fix memtable-only iterator regression (#10705) Summary: when there is a single memtable without range tombstones and no SST files in the database, DBIter should wrap memtable iterator directly. Currently we create a merging iterator on top of the memtable iterator, and have DBIter wrap around it. This causes iterator regression and this PR fixes this issue. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10705 Test Plan: - `make check` - Performance: - Set up: `./db_bench -benchmarks=filluniquerandom -write_buffer_size=$((1 << 30)) -num=10000` - Benchmark: `./db_bench -benchmarks=seekrandom -use_existing_db=true -avoid_flush_during_recovery=true -write_buffer_size=$((1 << 30)) -num=10000 -threads=16 -duration=60 -seek_nexts=$seek_nexts` ``` seek_nexts main op/sec https://github.com/facebook/rocksdb/issues/10705 RocksDB v7.6 0 5746568 5749033 5786180 30 2411690 3006466 2837699 1000 102556 128902 124667 ``` Reviewed By: ajkr Differential Revision: D39644221 Pulled By: cbi42 fbshipit-source-id: 8063ff611ba31b0e5670041da3927c8c54b2097d 21 September 2022, 16:49:31 UTC
9e01de9 Enable BLACK for internal_repo_rocksdb (#10710) Summary: Enable BLACK for internal_repo_rocksdb. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10710 Reviewed By: riversand963, zsol Differential Revision: D39666245 Pulled By: gitbw95 fbshipit-source-id: ef364318d2bbba66e96f3211dd6a975174d52c21 21 September 2022, 00:47:52 UTC
00050d4 Disable tiered storage + BlobDB stress test (#10699) Summary: There're 2 knobs to disable blobdb, adding that. Also print call stack when there's assert failure. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10699 Reviewed By: gitbw95 Differential Revision: D39596448 Pulled By: jay-zhuang fbshipit-source-id: 5ce9fd0630d8b6ff1e157a2685a1e80a99997098 19 September 2022, 22:39:31 UTC
92df369 Deflake CompactionServiceTest.BasicCompactions (#10697) Summary: The background compaction may still running while the test end, which would cause ASAN stack-use-after-scope error. Explicitly close the DB before test end. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10697 Test Plan: able to reproduce with: ``` gtest-parallel ./compaction_service_test --gtest_filter=CompactionServiceTest.BasicCompactions -r 10000 -w 100 ``` Reviewed By: gitbw95 Differential Revision: D39590974 Pulled By: jay-zhuang fbshipit-source-id: da264b2e6a276afbda7d5ff7adb9d7b8d4213d90 19 September 2022, 21:10:05 UTC
2b6f351 Update version number and HISTORY in main branch (#10694) Summary: This PR bumps up version number from 7.7 to 7.8 in main branch, indicating that next release will be 7.8. We are going to release 7.7 soon. Since 7.7.fb branch has been created, we can land this to main. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10694 Reviewed By: gitbw95 Differential Revision: D39581577 Pulled By: riversand963 fbshipit-source-id: 84f3fecf25fd9ac96e46b4cd6d50ddb6edc89427 19 September 2022, 19:25:44 UTC
01ebe8a Fix invalid reference in MultiGet due to vector resizing (#10702) Summary: Fix invalid reference in MultiGet due to resizing of the ```batches``` autovector. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10702 Test Plan: Run asan crash test Reviewed By: riversand963 Differential Revision: D39608753 Pulled By: anand1976 fbshipit-source-id: 7a9e7fc6f436f08eb22003d0e6b0e1e4dcdc1a2a 19 September 2022, 02:00:48 UTC
2cc5b39 Add enable_split_merge option for CompressedSecondaryCache (#10690) Summary: `enable_custom_split_merge` is added for enabling the custom split and merge feature, which split the compressed value into chunks so that they may better fit jemalloc bins. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10690 Test Plan: Unit Tests Stress Tests Reviewed By: anand1976 Differential Revision: D39567604 Pulled By: gitbw95 fbshipit-source-id: f6d1d46200f365220055f793514601dcb0edc4b7 16 September 2022, 22:41:49 UTC
e053ccd Fix an incorrect MultiGet assertion (#10695) Summary: The assertion in ```FilePickerMultiGet::ReplaceRange()``` was incorrect. The function should only be called to replace the range after finishing the search in the current level, which is indicated by ```hit_file_ == nullptr``` i.e no more overlapping files in this level. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10695 Reviewed By: gitbw95 Differential Revision: D39583217 Pulled By: anand1976 fbshipit-source-id: d4cedfb2b62fb9f3a083e9848a403ae6342f0519 16 September 2022, 20:18:42 UTC
0f91c72 Call experimental new clock cache HyperClockCache (#10684) Summary: This change establishes a distinctive name for the experimental new lock-free clock cache (originally developed by guidotag and revamped in PR https://github.com/facebook/rocksdb/issues/10626). A few reasons: * We want to make it clear that this is a fundamentally different implementation vs. the old clock cache, to avoid people saying "I already tried clock cache." * We want to highlight the key feature: it's fast (especially under parallel load) * Because it requires an estimated charge per entry, it is not drop-in API compatible with old clock cache. This estimate might always be required for highest performance, and giving it a distinct name should reduce confusion about the distinct API requirements. * We might develop a variant requiring the same estimate parameter but with LRU eviction. In that case, using the name HyperLRUCache should make things more clear. (FastLRUCache is just a prototype that might soon be removed.) Some API detail: * To reduce copy-pasting parameter lists, etc. as in LRUCache construction, I have a `MakeSharedCache()` function on `HyperClockCacheOptions` instead of `NewHyperClockCache()`. * Changes -cache_type=clock_cache to -cache_type=hyper_clock_cache for applicable tools. I think this is more consistent / sustainable for reasons already stated. For performance tests see https://github.com/facebook/rocksdb/pull/10626 Pull Request resolved: https://github.com/facebook/rocksdb/pull/10684 Test Plan: no interesting functional changes; tests updated Reviewed By: anand1976 Differential Revision: D39547800 Pulled By: pdillinger fbshipit-source-id: 5c0fe1b5cf3cb680ab369b928c8569682b9795bf 16 September 2022, 19:47:29 UTC
5724348 Revamp, optimize new experimental clock cache (#10626) Summary: * Consolidates most metadata into a single word per slot so that more can be accomplished with a single atomic update. In the common case, Lookup was previously about 4 atomic updates, now just 1 atomic update. Common case Release was previously 1 atomic read + 1 atomic update, now just 1 atomic update. * Eliminate spins / waits / yields, which likely threaten some "lock free" benefits. Compare-exchange loops are only used in explicit Erase, and strict_capacity_limit=true Insert. Eviction uses opportunistic compare- exchange. * Relaxes some aggressiveness and guarantees. For example, * Duplicate Inserts will sometimes go undetected and the shadow duplicate will age out with eviction. * In many cases, the older Inserted value for a given cache key will be kept (i.e. Insert does not support overwrite). * Entries explicitly erased (rather than evicted) might not be freed immediately in some rare cases. * With strict_capacity_limit=false, capacity limit is not tracked/enforced as precisely as LRUCache, but is self-correcting and should only deviate by a very small number of extra or fewer entries. * Use smaller "computed default" number of cache shards in many cases, because benefits to larger usage tracking / eviction pools outweigh the small cost of more lock-free atomic contention. The improvement in CPU and I/O is dramatic in some limit-memory cases. * Even without the sharding change, the eviction algorithm is likely more effective than LRU overall because it's more stateful, even though the "hot path" state tracking for it is essentially free with ref counting. It is like a generalized CLOCK with aging (see code comments). I don't have performance numbers showing a specific improvement, but in theory, for a Poisson access pattern to each block, keeping some state allows better estimation of time to next access (Poisson interval) than strict LRU. The bounded randomness in CLOCK can also reduce "cliff" effect for repeated range scans approaching and exceeding cache size. ## Hot path algorithm comparison Rough descriptions, focusing on number and kind of atomic operations: * Old `Lookup()` (2-5 atomic updates per probe): ``` Loop: Increment internal ref count at slot If possible hit: Check flags atomic (and non-atomic fields) If cache hit: Three distinct updates to 'flags' atomic Increment refs for internal-to-external Return Decrement internal ref count while atomic read 'displacements' > 0 ``` * New `Lookup()` (1-2 atomic updates per probe): ``` Loop: Increment acquire counter in meta word (optimistic) If visible entry (already read meta word): If match (read non-atomic fields): Return Else: Decrement acquire counter in meta word Else if invisible entry (rare, already read meta word): Decrement acquire counter in meta word while atomic read 'displacements' > 0 ``` * Old `Release()` (1 atomic update, conditional on atomic read, rarely more): ``` Read atomic ref count If last reference and invisible (rare): Use CAS etc. to remove Return Else: Decrement ref count ``` * New `Release()` (1 unconditional atomic update, rarely more): ``` Increment release counter in meta word If last reference and invisible (rare): Use CAS etc. to remove Return ``` ## Performance test setup Build DB with ``` TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16 ``` Test with ``` TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom -readonly -num=30000000 -bloom_bits=16 -cache_index_and_filter_blocks=1 -cache_size=${CACHE_MB}000000 -duration 60 -threads=$THREADS -statistics ``` Numbers on a single socket Skylake Xeon system with 48 hardware threads, DEBUG_LEVEL=0 PORTABLE=0. Very similar story on a dual socket system with 80 hardware threads. Using (every 2nd) Fibonacci MB cache sizes to sample the territory between powers of two. Configurations: base: LRUCache before this change, but with db_bench change to default cache_numshardbits=-1 (instead of fixed at 6) folly: LRUCache before this change, with folly enabled (distributed mutex) but on an old compiler (sorry) gt_clock: experimental ClockCache before this change new_clock: experimental ClockCache with this change ## Performance test results First test "hot path" read performance, with block cache large enough for whole DB: 4181MB 1thread base -> kops/s: 47.761 4181MB 1thread folly -> kops/s: 45.877 4181MB 1thread gt_clock -> kops/s: 51.092 4181MB 1thread new_clock -> kops/s: 53.944 4181MB 16thread base -> kops/s: 284.567 4181MB 16thread folly -> kops/s: 249.015 4181MB 16thread gt_clock -> kops/s: 743.762 4181MB 16thread new_clock -> kops/s: 861.821 4181MB 24thread base -> kops/s: 303.415 4181MB 24thread folly -> kops/s: 266.548 4181MB 24thread gt_clock -> kops/s: 975.706 4181MB 24thread new_clock -> kops/s: 1205.64 (~= 24 * 53.944) 4181MB 32thread base -> kops/s: 311.251 4181MB 32thread folly -> kops/s: 274.952 4181MB 32thread gt_clock -> kops/s: 1045.98 4181MB 32thread new_clock -> kops/s: 1370.38 4181MB 48thread base -> kops/s: 310.504 4181MB 48thread folly -> kops/s: 268.322 4181MB 48thread gt_clock -> kops/s: 1195.65 4181MB 48thread new_clock -> kops/s: 1604.85 (~= 24 * 1.25 * 53.944) 4181MB 64thread base -> kops/s: 307.839 4181MB 64thread folly -> kops/s: 272.172 4181MB 64thread gt_clock -> kops/s: 1204.47 4181MB 64thread new_clock -> kops/s: 1615.37 4181MB 128thread base -> kops/s: 310.934 4181MB 128thread folly -> kops/s: 267.468 4181MB 128thread gt_clock -> kops/s: 1188.75 4181MB 128thread new_clock -> kops/s: 1595.46 Whether we have just one thread on a quiet system or an overload of threads, the new version wins every time in thousand-ops per second, sometimes dramatically so. Mutex-based implementation quickly becomes contention-limited. New clock cache shows essentially perfect scaling up to number of physical cores (24), and then each hyperthreaded core adding about 1/4 the throughput of an additional physical core (see 48 thread case). Block cache miss rates (omitted above) are negligible across the board. With partitioned instead of full filters, the maximum speed-up vs. base is more like 2.5x rather than 5x. Now test a large block cache with low miss ratio, but some eviction is required: 1597MB 1thread base -> kops/s: 46.603 io_bytes/op: 1584.63 miss_ratio: 0.0201066 max_rss_mb: 1589.23 1597MB 1thread folly -> kops/s: 45.079 io_bytes/op: 1530.03 miss_ratio: 0.019872 max_rss_mb: 1550.43 1597MB 1thread gt_clock -> kops/s: 48.711 io_bytes/op: 1566.63 miss_ratio: 0.0198923 max_rss_mb: 1691.4 1597MB 1thread new_clock -> kops/s: 51.531 io_bytes/op: 1589.07 miss_ratio: 0.0201969 max_rss_mb: 1583.56 1597MB 32thread base -> kops/s: 301.174 io_bytes/op: 1439.52 miss_ratio: 0.0184218 max_rss_mb: 1656.59 1597MB 32thread folly -> kops/s: 273.09 io_bytes/op: 1375.12 miss_ratio: 0.0180002 max_rss_mb: 1586.8 1597MB 32thread gt_clock -> kops/s: 904.497 io_bytes/op: 1411.29 miss_ratio: 0.0179934 max_rss_mb: 1775.89 1597MB 32thread new_clock -> kops/s: 1182.59 io_bytes/op: 1440.77 miss_ratio: 0.0185449 max_rss_mb: 1636.45 1597MB 128thread base -> kops/s: 309.91 io_bytes/op: 1438.25 miss_ratio: 0.018399 max_rss_mb: 1689.98 1597MB 128thread folly -> kops/s: 267.605 io_bytes/op: 1394.16 miss_ratio: 0.0180286 max_rss_mb: 1631.91 1597MB 128thread gt_clock -> kops/s: 691.518 io_bytes/op: 9056.73 miss_ratio: 0.0186572 max_rss_mb: 1982.26 1597MB 128thread new_clock -> kops/s: 1406.12 io_bytes/op: 1440.82 miss_ratio: 0.0185463 max_rss_mb: 1685.63 610MB 1thread base -> kops/s: 45.511 io_bytes/op: 2279.61 miss_ratio: 0.0290528 max_rss_mb: 615.137 610MB 1thread folly -> kops/s: 43.386 io_bytes/op: 2217.29 miss_ratio: 0.0289282 max_rss_mb: 600.996 610MB 1thread gt_clock -> kops/s: 46.207 io_bytes/op: 2275.51 miss_ratio: 0.0290057 max_rss_mb: 637.934 610MB 1thread new_clock -> kops/s: 48.879 io_bytes/op: 2283.1 miss_ratio: 0.0291253 max_rss_mb: 613.5 610MB 32thread base -> kops/s: 306.59 io_bytes/op: 2250 miss_ratio: 0.0288721 max_rss_mb: 683.402 610MB 32thread folly -> kops/s: 269.176 io_bytes/op: 2187.86 miss_ratio: 0.0286938 max_rss_mb: 628.742 610MB 32thread gt_clock -> kops/s: 855.097 io_bytes/op: 2279.26 miss_ratio: 0.0288009 max_rss_mb: 733.062 610MB 32thread new_clock -> kops/s: 1121.47 io_bytes/op: 2244.29 miss_ratio: 0.0289046 max_rss_mb: 666.453 610MB 128thread base -> kops/s: 305.079 io_bytes/op: 2252.43 miss_ratio: 0.0288884 max_rss_mb: 723.457 610MB 128thread folly -> kops/s: 269.583 io_bytes/op: 2204.58 miss_ratio: 0.0287001 max_rss_mb: 676.426 610MB 128thread gt_clock -> kops/s: 53.298 io_bytes/op: 8128.98 miss_ratio: 0.0292452 max_rss_mb: 956.273 610MB 128thread new_clock -> kops/s: 1301.09 io_bytes/op: 2246.04 miss_ratio: 0.0289171 max_rss_mb: 788.812 The new version is still winning every time, sometimes dramatically so, and we can tell from the maximum resident memory numbers (which contain some noise, by the way) that the new cache is not cheating on memory usage. IMPORTANT: The previous generation experimental clock cache appears to hit a serious bottleneck in the higher thread count configurations, presumably due to some of its waiting functionality. (The same bottleneck is not seen with partitioned index+filters.) Now we consider even smaller cache sizes, with higher miss ratios, eviction work, etc. 233MB 1thread base -> kops/s: 10.557 io_bytes/op: 227040 miss_ratio: 0.0403105 max_rss_mb: 247.371 233MB 1thread folly -> kops/s: 15.348 io_bytes/op: 112007 miss_ratio: 0.0372238 max_rss_mb: 245.293 233MB 1thread gt_clock -> kops/s: 6.365 io_bytes/op: 244854 miss_ratio: 0.0413873 max_rss_mb: 259.844 233MB 1thread new_clock -> kops/s: 47.501 io_bytes/op: 2591.93 miss_ratio: 0.0330989 max_rss_mb: 242.461 233MB 32thread base -> kops/s: 96.498 io_bytes/op: 363379 miss_ratio: 0.0459966 max_rss_mb: 479.227 233MB 32thread folly -> kops/s: 109.95 io_bytes/op: 314799 miss_ratio: 0.0450032 max_rss_mb: 400.738 233MB 32thread gt_clock -> kops/s: 2.353 io_bytes/op: 385397 miss_ratio: 0.048445 max_rss_mb: 500.688 233MB 32thread new_clock -> kops/s: 1088.95 io_bytes/op: 2567.02 miss_ratio: 0.0330593 max_rss_mb: 303.402 233MB 128thread base -> kops/s: 84.302 io_bytes/op: 378020 miss_ratio: 0.0466558 max_rss_mb: 1051.84 233MB 128thread folly -> kops/s: 89.921 io_bytes/op: 338242 miss_ratio: 0.0460309 max_rss_mb: 812.785 233MB 128thread gt_clock -> kops/s: 2.588 io_bytes/op: 462833 miss_ratio: 0.0509158 max_rss_mb: 1109.94 233MB 128thread new_clock -> kops/s: 1299.26 io_bytes/op: 2565.94 miss_ratio: 0.0330531 max_rss_mb: 361.016 89MB 1thread base -> kops/s: 0.574 io_bytes/op: 5.35977e+06 miss_ratio: 0.274427 max_rss_mb: 91.3086 89MB 1thread folly -> kops/s: 0.578 io_bytes/op: 5.16549e+06 miss_ratio: 0.27276 max_rss_mb: 96.8984 89MB 1thread gt_clock -> kops/s: 0.512 io_bytes/op: 4.13111e+06 miss_ratio: 0.242817 max_rss_mb: 119.441 89MB 1thread new_clock -> kops/s: 48.172 io_bytes/op: 2709.76 miss_ratio: 0.0346162 max_rss_mb: 100.754 89MB 32thread base -> kops/s: 5.779 io_bytes/op: 6.14192e+06 miss_ratio: 0.320399 max_rss_mb: 311.812 89MB 32thread folly -> kops/s: 5.601 io_bytes/op: 5.83838e+06 miss_ratio: 0.313123 max_rss_mb: 252.418 89MB 32thread gt_clock -> kops/s: 0.77 io_bytes/op: 3.99236e+06 miss_ratio: 0.236296 max_rss_mb: 396.422 89MB 32thread new_clock -> kops/s: 1064.97 io_bytes/op: 2687.23 miss_ratio: 0.0346134 max_rss_mb: 155.293 89MB 128thread base -> kops/s: 4.959 io_bytes/op: 6.20297e+06 miss_ratio: 0.323945 max_rss_mb: 823.43 89MB 128thread folly -> kops/s: 4.962 io_bytes/op: 5.9601e+06 miss_ratio: 0.319857 max_rss_mb: 626.824 89MB 128thread gt_clock -> kops/s: 1.009 io_bytes/op: 4.1083e+06 miss_ratio: 0.242512 max_rss_mb: 1095.32 89MB 128thread new_clock -> kops/s: 1224.39 io_bytes/op: 2688.2 miss_ratio: 0.0346207 max_rss_mb: 218.223 ^ Now something interesting has happened: the new clock cache has gained a dramatic lead in the single-threaded case, and this is because the cache is so small, and full filters are so big, that dividing the cache into 64 shards leads to significant (random) imbalances in cache shards and excessive churn in imbalanced shards. This new clock cache only uses two shards for this configuration, and that helps to ensure that entries are part of a sufficiently big pool that their eviction order resembles the single-shard order. (This effect is not seen with partitioned index+filters.) Even smaller cache size: 34MB 1thread base -> kops/s: 0.198 io_bytes/op: 1.65342e+07 miss_ratio: 0.939466 max_rss_mb: 48.6914 34MB 1thread folly -> kops/s: 0.201 io_bytes/op: 1.63416e+07 miss_ratio: 0.939081 max_rss_mb: 45.3281 34MB 1thread gt_clock -> kops/s: 0.448 io_bytes/op: 4.43957e+06 miss_ratio: 0.266749 max_rss_mb: 100.523 34MB 1thread new_clock -> kops/s: 1.055 io_bytes/op: 1.85439e+06 miss_ratio: 0.107512 max_rss_mb: 75.3125 34MB 32thread base -> kops/s: 3.346 io_bytes/op: 1.64852e+07 miss_ratio: 0.93596 max_rss_mb: 180.48 34MB 32thread folly -> kops/s: 3.431 io_bytes/op: 1.62857e+07 miss_ratio: 0.935693 max_rss_mb: 137.531 34MB 32thread gt_clock -> kops/s: 1.47 io_bytes/op: 4.89704e+06 miss_ratio: 0.295081 max_rss_mb: 392.465 34MB 32thread new_clock -> kops/s: 8.19 io_bytes/op: 3.70456e+06 miss_ratio: 0.20826 max_rss_mb: 519.793 34MB 128thread base -> kops/s: 2.293 io_bytes/op: 1.64351e+07 miss_ratio: 0.931866 max_rss_mb: 449.484 34MB 128thread folly -> kops/s: 2.34 io_bytes/op: 1.6219e+07 miss_ratio: 0.932023 max_rss_mb: 396.457 34MB 128thread gt_clock -> kops/s: 1.798 io_bytes/op: 5.4241e+06 miss_ratio: 0.324881 max_rss_mb: 1104.41 34MB 128thread new_clock -> kops/s: 10.519 io_bytes/op: 2.39354e+06 miss_ratio: 0.136147 max_rss_mb: 1050.52 As the miss ratio gets higher (say, above 10%), the CPU time spent in eviction starts to erode the advantage of using fewer shards (13% miss rate much lower than 94%). LRU's O(1) eviction time can eventually pay off when there's enough block cache churn: 13MB 1thread base -> kops/s: 0.195 io_bytes/op: 1.65732e+07 miss_ratio: 0.946604 max_rss_mb: 45.6328 13MB 1thread folly -> kops/s: 0.197 io_bytes/op: 1.63793e+07 miss_ratio: 0.94661 max_rss_mb: 33.8633 13MB 1thread gt_clock -> kops/s: 0.519 io_bytes/op: 4.43316e+06 miss_ratio: 0.269379 max_rss_mb: 100.684 13MB 1thread new_clock -> kops/s: 0.176 io_bytes/op: 1.54148e+07 miss_ratio: 0.91545 max_rss_mb: 66.2383 13MB 32thread base -> kops/s: 3.266 io_bytes/op: 1.65544e+07 miss_ratio: 0.943386 max_rss_mb: 132.492 13MB 32thread folly -> kops/s: 3.396 io_bytes/op: 1.63142e+07 miss_ratio: 0.943243 max_rss_mb: 101.863 13MB 32thread gt_clock -> kops/s: 2.758 io_bytes/op: 5.13714e+06 miss_ratio: 0.310652 max_rss_mb: 396.121 13MB 32thread new_clock -> kops/s: 3.11 io_bytes/op: 1.23419e+07 miss_ratio: 0.708425 max_rss_mb: 321.758 13MB 128thread base -> kops/s: 2.31 io_bytes/op: 1.64823e+07 miss_ratio: 0.939543 max_rss_mb: 425.539 13MB 128thread folly -> kops/s: 2.339 io_bytes/op: 1.6242e+07 miss_ratio: 0.939966 max_rss_mb: 346.098 13MB 128thread gt_clock -> kops/s: 3.223 io_bytes/op: 5.76928e+06 miss_ratio: 0.345899 max_rss_mb: 1087.77 13MB 128thread new_clock -> kops/s: 2.984 io_bytes/op: 1.05341e+07 miss_ratio: 0.606198 max_rss_mb: 898.27 gt_clock is clearly blowing way past its memory budget for lower miss rates and best throughput. new_clock also seems to be exceeding budgets, and this warrants more investigation but is not the use case we are targeting with the new cache. With partitioned index+filter, the miss ratio is much better, and although still high enough that the eviction CPU time is definitely offsetting mutex contention: 13MB 1thread base -> kops/s: 16.326 io_bytes/op: 23743.9 miss_ratio: 0.205362 max_rss_mb: 65.2852 13MB 1thread folly -> kops/s: 15.574 io_bytes/op: 19415 miss_ratio: 0.184157 max_rss_mb: 56.3516 13MB 1thread gt_clock -> kops/s: 14.459 io_bytes/op: 22873 miss_ratio: 0.198355 max_rss_mb: 63.9688 13MB 1thread new_clock -> kops/s: 16.34 io_bytes/op: 24386.5 miss_ratio: 0.210512 max_rss_mb: 61.707 13MB 128thread base -> kops/s: 289.786 io_bytes/op: 23710.9 miss_ratio: 0.205056 max_rss_mb: 103.57 13MB 128thread folly -> kops/s: 185.282 io_bytes/op: 19433.1 miss_ratio: 0.184275 max_rss_mb: 116.219 13MB 128thread gt_clock -> kops/s: 354.451 io_bytes/op: 23150.6 miss_ratio: 0.200495 max_rss_mb: 102.871 13MB 128thread new_clock -> kops/s: 295.359 io_bytes/op: 24626.4 miss_ratio: 0.212452 max_rss_mb: 121.109 Pull Request resolved: https://github.com/facebook/rocksdb/pull/10626 Test Plan: updated unit tests, stress/crash test runs including with TSAN, ASAN, UBSAN Reviewed By: anand1976 Differential Revision: D39368406 Pulled By: pdillinger fbshipit-source-id: 5afc44da4c656f8f751b44552bbf27bd3ca6fef9 16 September 2022, 07:24:11 UTC
37b75e1 Fix some MultiGet stats (#10673) Summary: The stats were not accurate for the coroutine version of MultiGet. This PR fixes it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10673 Reviewed By: akankshamahajan15 Differential Revision: D39492615 Pulled By: anand1976 fbshipit-source-id: b46c04e15ea27e66f4c31f00c66497aa283bf9d3 16 September 2022, 05:48:06 UTC
088b984 Re-enable user-defined timestamp and subcompactions (#10689) Summary: Hopefully, we can re-enable the combination of user-defined timestamp and subcompactions after https://github.com/facebook/rocksdb/issues/10658. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10689 Test Plan: Make sure the following succeeds on devserver. make crash_test_with_ts Reviewed By: ltamasi Differential Revision: D39556558 Pulled By: riversand963 fbshipit-source-id: 4695f420b1bc9ebf3b24640b693746f4db82c149 16 September 2022, 03:21:07 UTC
c206aeb Fix a MultiGet crash (#10688) Summary: Fix a bug in the async IO/coroutine version of MultiGet that may cause a segfault or assertion failure due to accessing an invalid file index in a LevelFilesBrief. The bug is that when a MultiGetRange is split into two, we may re-process keys in the original range that were already marked to be skipped (in ```current_level_range_```) due to not overlapping the level. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10688 Reviewed By: gitbw95 Differential Revision: D39556131 Pulled By: anand1976 fbshipit-source-id: 65e79438508a283cb19e64eca5c91d0714b81458 16 September 2022, 02:18:52 UTC
6ce782b move db_stress locking to `StressTest::Test*()` functions (#10678) Summary: One problem of the previous strategy was `NonBatchedOpsStressTest::TestIngestExternalFile()` could release the lock for `rand_keys[0]` in `rand_column_families[0]`, and then subsequent operations in the same loop iteration (e.g., `TestPut()`) would run without locking. This PR changes the strategy so each `Test*()` function is responsible for acquiring and releasing its own locks. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10678 Reviewed By: hx235 Differential Revision: D39516401 Pulled By: ajkr fbshipit-source-id: bf67f12ebbd293ba8c24fdf8754ff28737bcd758 15 September 2022, 22:55:37 UTC
7dad485 Support JemallocNodumpAllocator for the block/blob cache in db_bench (#10685) Summary: The patch makes it possible to use the `JemallocNodumpAllocator` with the block/blob caches in `db_bench`. In addition to its stated purpose of excluding cache contents from core dumps, `JemallocNodumpAllocator` also uses a dedicated arena and jemalloc tcaches for cache allocations, which can reduce fragmentation and thus memory usage. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10685 Reviewed By: riversand963 Differential Revision: D39552261 Pulled By: ltamasi fbshipit-source-id: b5c58eab6b7c1baa9a307d9f1248df1d7a77d2b5 15 September 2022, 20:44:46 UTC
b418ace Disable PersistentCacheTierTest.BasicTest (#10683) Summary: Disable this flaky test since PersistentCache is not used. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10683 Test Plan: Unit Tests Reviewed By: cbi42 Differential Revision: D39545974 Pulled By: gitbw95 fbshipit-source-id: ac53e96f6ba880e7612e325eb5ff22ee2799efed 15 September 2022, 18:14:48 UTC
1cdc841 Tiered Storage feature doesn't support BlobDB yet (#10681) Summary: Disable the tiered storage + BlobDB test. Also enable different hot data setting for Tiered compaction Pull Request resolved: https://github.com/facebook/rocksdb/pull/10681 Reviewed By: ajkr Differential Revision: D39531941 Pulled By: jay-zhuang fbshipit-source-id: aa0595eb38d03f17638d300d2e4cc9061429bf61 15 September 2022, 15:17:16 UTC
849cf1b Refactor Compaction file cut `ShouldStopBefore()` (#10629) Summary: Consolidate compaction output cut logic to `ShouldStopBefore()` and move it inside of CompactionOutputs class. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10629 Reviewed By: cbi42 Differential Revision: D39315536 Pulled By: jay-zhuang fbshipit-source-id: 7d81037babbd35c276bbaad02dbc2bb555fdac18 15 September 2022, 05:09:12 UTC
ce2c11d Fix a bug by setting up subcompaction bounds properly (#10658) Summary: When user-defined timestamp is enabled, subcompaction bounds should be set up properly. When creating InputIterator for the compaction, the `start` and `end` should have their timestamp portions set to kMaxTimestamp, which is the highest possible timestamp. This is similar to what we do with setting up their sequence numbers to `kMaxSequenceNumber`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10658 Test Plan: ```bash make check rm -rf /dev/shm/rocksdb/* && mkdir /dev/shm/rocksdb/rocksdb_crashtest_expected && ./db_stress --allow_data_in_errors=True --clear_column_family_one_in=0 --continuous_verification_interval=0 --data_block_index_type=1 --db=/dev/shm/rocksdb//rocksdb_crashtest_blackbox --delpercent=5 --delrangepercent=0 --expected_values_dir=/dev/shm/rocksdb//rocksdb_crashtest_expected --iterpercent=0 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=25000000 --max_write_batch_group_size_bytes=1048576 --nooverwritepercent=1 --ops_per_thread=300000 --paranoid_file_checks=1 --partition_filters=0 --prefix_size=8 --prefixpercent=5 --readpercent=30 --reopen=0 --snapshot_hold_ops=100000 --subcompactions=4 --target_file_size_base=65536 --target_file_size_multiplier=2 --test_batches_snapshots=0 --test_cf_consistency=0 --use_multiget=1 --user_timestamp_size=8 --value_size_mult=32 --verify_checksum=1 --write_buffer_size=65536 --writepercent=60 -disable_wal=1 -column_families=1 ``` Reviewed By: akankshamahajan15 Differential Revision: D39393402 Pulled By: riversand963 fbshipit-source-id: f276e35b19fce51a175c368a502fb0718d1f3871 15 September 2022, 04:59:56 UTC
be04a3b Fix data race in accessing `cached_range_tombstone_` (#10680) Summary: fix a data race introduced in https://github.com/facebook/rocksdb/issues/10547 (P5295241720), first reported by pdillinger. The race is between the `std::atomic_load_explicit` in NewRangeTombstoneIteratorInternal and the `std::atomic_store_explicit` in MemTable::Add() that operate on `cached_range_tombstone_`. P5295241720 shows that `atomic_store_explicit` initializes some mutex which `atomic_load_explicit` could be trying to call `lock()` on at the same time. This fix moves the initialization to memtable constructor. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10680 Test Plan: `USE_CLANG=1 COMPILE_WITH_TSAN=1 make -j24 whitebox_crash_test` Reviewed By: ajkr Differential Revision: D39528696 Pulled By: cbi42 fbshipit-source-id: ee740841044438e18ad2b8ea567444dd542dd8e2 15 September 2022, 03:50:10 UTC
832fd64 Reset pessimistic transaction's read/commit timestamps during Initialize() (#10677) Summary: RocksDB allows reusing old `Transaction` objects when creating new ones. Therefore, we need to reset the transaction's read and commit timestamps back to default values `kMaxTxnTimestamp`. Otherwise, `CommitAndTryCreateSnapshot()` may fail with "Status::InvalidArgument("Different commit ts specified")". Pull Request resolved: https://github.com/facebook/rocksdb/pull/10677 Test Plan: make check Reviewed By: ltamasi Differential Revision: D39513543 Pulled By: riversand963 fbshipit-source-id: bea01cac149bff3a23a2978fc0c3b198243a6291 15 September 2022, 01:28:21 UTC
87c8bb4 Add comments describing {Put,Get}Entity, update/clarify comment for Get and iterator (#10676) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10676 Reviewed By: riversand963 Differential Revision: D39512081 Pulled By: ltamasi fbshipit-source-id: 55704478ceb8081003eceeb0c5a3875cb806587e 14 September 2022, 21:33:05 UTC
bb9a6d4 Bypass a MultiGet test when async_io is used (#10669) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10669 Reviewed By: akankshamahajan15 Differential Revision: D39492658 Pulled By: anand1976 fbshipit-source-id: abef79808e30762654680f7dd7e46487c631febc 14 September 2022, 16:59:54 UTC
7b11d48 Change MultiGet multi-level optimization to default on (#10671) Summary: Change the ```ReadOptions.optimize_multiget_for_io``` flag to default on. It doesn't impact regular MultiGet users as its only applicable when ```ReadOptions.async_io``` is also set to true. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10671 Reviewed By: akankshamahajan15 Differential Revision: D39477439 Pulled By: anand1976 fbshipit-source-id: 47abcdbfa69f9bc60422ab68a238b232e085d4ba 14 September 2022, 15:51:16 UTC
06ab0a8 Add wide-column support to iterators (#10670) Summary: The patch extends the iterator API with a new `columns` method which can be used to retrieve all wide columns for the current key. Similarly to the `Get` and `GetEntity` APIs, the classic `value` API returns the value of the default (anonymous) column for wide-column entities, and `columns` returns an entity with a single default column for plain old key-values. (The goal here is to maintain the invariant that `value()` is the same as the value of the default column in `columns()`.) The patch also involves a smaller refactoring: historically, `value()` was implemented using a bunch of conditions, that is, the `Slice` to be returned was decided based on the direction of the iteration, whether a merge had been done etc. when the method was called; with the patch, the value to be exposed is stored in a member `Slice value_` when the iterator lands on a new key, and `value()` simply returns this `Slice`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10670 Test Plan: Ran `make check` and a simple blackbox crash test. Reviewed By: riversand963 Differential Revision: D39475551 Pulled By: ltamasi fbshipit-source-id: 29e7a6ed9ef340841aab36803b832b7c8f668b0b 14 September 2022, 04:01:36 UTC
f291eef Cache fragmented range tombstone list for mutable memtables (#10547) Summary: Each read from memtable used to read and fragment all the range tombstones into a `FragmentedRangeTombstoneList`. https://github.com/facebook/rocksdb/issues/10380 improved the inefficient here by caching a `FragmentedRangeTombstoneList` with each immutable memtable. This PR extends the caching to mutable memtables. The fragmented range tombstone can be constructed in either read (This PR) or write path (https://github.com/facebook/rocksdb/issues/10584). With both implementation, each `DeleteRange()` will invalidate the cache, and the difference is where the cache is re-constructed.`CoreLocalArray` is used to store the cache with each memtable so that multi-threaded reads can be efficient. More specifically, each core will have a shared_ptr to a shared_ptr pointing to the current cache. Each read thread will only update the reference count in its core-local shared_ptr, and this is only needed when reading from mutable memtables. The choice between write path and read path is not an easy one: they are both improvement compared to no caching in the current implementation, but they favor different operations and could cause regression in the other operation (read vs write). The write path caching in (https://github.com/facebook/rocksdb/issues/10584) leads to a cleaner implementation, but I chose the read path caching here to avoid significant regression in write performance when there is a considerable amount of range tombstones in a single memtable (the number from the benchmark below suggests >1000 with concurrent writers). Note that even though the fragmented range tombstone list is only constructed in `DeleteRange()` operations, it could block other writes from proceeding, and hence affects overall write performance. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10547 Test Plan: - TestGet() in stress test is updated in https://github.com/facebook/rocksdb/issues/10553 to compare Get() result against expected state: `./db_stress_branch --readpercent=57 --prefixpercent=4 --writepercent=25 -delpercent=5 --iterpercent=5 --delrangepercent=4` - Perf benchmark: tested read and write performance where a memtable has 0, 1, 10, 100 and 1000 range tombstones. ``` ./db_bench --benchmarks=fillrandom,readrandom --writes_per_range_tombstone=200 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --writes=200000 --reads=100000 --disable_auto_compactions --max_num_range_tombstones=1000 ``` Write perf regressed since the cost of constructing fragmented range tombstone list is shifted from every read to a single write. 6cbe5d8e172dc5f1ef65c9d0a6eedbd9987b2c72 is included in the last column as a reference to see performance impact on multi-thread reads if `CoreLocalArray` is not used. micros/op averaged over 5 runs: first 4 columns are for fillrandom, last 4 columns are for readrandom. | |fillrandom main | write path caching | read path caching |memtable V3 (https://github.com/facebook/rocksdb/issues/10308) | readrandom main | write path caching | read path caching |memtable V3 | |--- |--- |--- |--- |--- | --- | --- | --- | --- | | 0 |6.35 |6.15 |5.82 |6.12 |2.24 |2.26 |2.03 |2.07 | | 1 |5.99 |5.88 |5.77 |6.28 |2.65 |2.27 |2.24 |2.5 | | 10 |6.15 |6.02 |5.92 |5.95 |5.15 |2.61 |2.31 |2.53 | | 100 |5.95 |5.78 |5.88 |6.23 |28.31 |2.34 |2.45 |2.94 | | 100 25 threads |52.01 |45.85 |46.18 |47.52 |35.97 |3.34 |3.34 |3.56 | | 1000 |6.0 |7.07 |5.98 |6.08 |333.18 |2.86 |2.7 |3.6 | | 1000 25 threads |52.6 |148.86 |79.06 |45.52 |473.49 |3.66 |3.48 |4.38 | - Benchmark performance of`readwhilewriting` from https://github.com/facebook/rocksdb/issues/10552, 100 range tombstones are written: `./db_bench --benchmarks=readwhilewriting --writes_per_range_tombstone=500 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --writes=100000 --reads=500000 --disable_auto_compactions --max_num_range_tombstones=10000 --finish_after_writes` readrandom micros/op: | |main |write path caching |read path caching |memtable V3 | |---|---|---|---|---| | single thread |48.28 |1.55 |1.52 |1.96 | | 25 threads |64.3 |2.55 |2.67 |2.64 | Reviewed By: ajkr Differential Revision: D38895410 Pulled By: cbi42 fbshipit-source-id: 930bfc309dd1b2f4e8e9042f5126785bba577559 14 September 2022, 03:07:28 UTC
03fc439 Async optimization in scan path (#10602) Summary: Optimizations 1. In FilePrefetchBuffer, when data is overlapping between two buffers, it copies the data from first to third buffer, then from second to third buffer to return continuous buffer. This optimization will call ReadAsync on first buffer as soon as buffer is empty instead of getting blocked by second buffer to copy the data. 2. For fixed size readahead_size, FilePrefetchBuffer will issues two async read calls. One with length + readahead_size_/2 on first buffer(if buffer is empty) and readahead_size_/2 on second buffer during seek. - Add readahead_size to db_stress for stress testing these changes in https://github.com/facebook/rocksdb/pull/10632 Pull Request resolved: https://github.com/facebook/rocksdb/pull/10602 Test Plan: - CircleCI tests - stress_test completed successfully export CRASH_TEST_EXT_ARGS="--async_io=1" make crash_test -j32 - db_bench showed no regression With this PR: ``` ./db_bench -use_existing_db=true -db=/tmp/prefix_scan_prefetch_main1 -benchmarks="seekrandom" -key_size=32 -value_size=512 -num=50000000 -use_direct_reads=false -seek_nexts=327680 -duration=30 -ops_between_duration_checks=1 -async_io=1 Set seed to 1661876074584472 because --seed was 0 Initializing RocksDB Options from the specified file Initializing RocksDB Options from command-line flags Integrated BlobDB: blob cache disabled RocksDB: version 7.7.0 Date: Tue Aug 30 09:14:34 2022 CPU: 32 * Intel Xeon Processor (Skylake) CPUCache: 16384 KB Keys: 32 bytes each (+ 0 bytes user-defined timestamp) Values: 512 bytes each (256 bytes after compression) Entries: 50000000 Prefix: 0 bytes Keys per prefix: 0 RawSize: 25939.9 MB (estimated) FileSize: 13732.9 MB (estimated) Write rate: 0 bytes/second Read rate: 0 ops/second Compression: Snappy Compression sampling rate: 0 Memtablerep: SkipListFactory Perf Level: 1 ------------------------------------------------ DB path: [/tmp/prefix_scan_prefetch_main1] seekrandom : 270878.018 micros/op 3 ops/sec 30.068 seconds 111 operations; 618.7 MB/s (111 of 111 found) ./db_bench -use_existing_db=true -db=/tmp/prefix_scan_prefetch_main1 -benchmarks="seekrandom" -key_size=32 -value_size=512 -num=50000000 -use_direct_reads=true -seek_nexts=327680 -duration=30 -ops_between_duration_checks=1 -async_io=1 Set seed to 1661875332862922 because --seed was 0 Initializing RocksDB Options from the specified file Initializing RocksDB Options from command-line flags Integrated BlobDB: blob cache disabled RocksDB: version 7.7.0 Date: Tue Aug 30 09:02:12 2022 CPU: 32 * Intel Xeon Processor (Skylake) CPUCache: 16384 KB Keys: 32 bytes each (+ 0 bytes user-defined timestamp) Values: 512 bytes each (256 bytes after compression) Entries: 50000000 Prefix: 0 bytes Keys per prefix: 0 RawSize: 25939.9 MB (estimated) FileSize: 13732.9 MB (estimated) Write rate: 0 bytes/second Read rate: 0 ops/second Compression: Snappy Compression sampling rate: 0 Memtablerep: SkipListFactory Perf Level: 1 WARNING: Assertions are enabled; benchmarks unnecessarily slow ------------------------------------------------ DB path: [/tmp/prefix_scan_prefetch_main1] seekrandom : 358352.488 micros/op 2 ops/sec 30.102 seconds 84 operations; 474.4 MB/s (84 of 84 found) ``` Without PR in main: ``` ./db_bench -use_existing_db=true -db=/tmp/prefix_scan_prefetch_main1 -benchmarks="seekrandom" -key_size=32 -value_size=512 -num=50000000 -use_direct_reads=false -seek_nexts=327680 -duration=30 -ops_between_duration_checks=1 -async_io=1 Set seed to 1661876425983045 because --seed was 0 Initializing RocksDB Options from the specified file Initializing RocksDB Options from command-line flags Integrated BlobDB: blob cache disabled RocksDB: version 7.7.0 Date: Tue Aug 30 09:20:26 2022 CPU: 32 * Intel Xeon Processor (Skylake) CPUCache: 16384 KB Keys: 32 bytes each (+ 0 bytes user-defined timestamp) Values: 512 bytes each (256 bytes after compression) Entries: 50000000 Prefix: 0 bytes Keys per prefix: 0 RawSize: 25939.9 MB (estimated) FileSize: 13732.9 MB (estimated) Write rate: 0 bytes/second Read rate: 0 ops/second Compression: Snappy Compression sampling rate: 0 Memtablerep: SkipListFactory Perf Level: 1 ------------------------------------------------ DB path: [/tmp/prefix_scan_prefetch_main1] seekrandom : 280881.953 micros/op 3 ops/sec 30.054 seconds 107 operations; 605.2 MB/s (107 of 107 found) ./db_bench -use_existing_db=true -db=/tmp/prefix_scan_prefetch_main1 -benchmarks="seekrandom" -key_size=32 -value_size=512 -num=50000000 -use_direct_reads=false -seek_nexts=327680 -duration=30 -ops_between_duration_checks=1 -async_io=0 Set seed to 1661876475267771 because --seed was 0 Initializing RocksDB Options from the specified file Initializing RocksDB Options from command-line flags Integrated BlobDB: blob cache disabled RocksDB: version 7.7.0 Date: Tue Aug 30 09:21:15 2022 CPU: 32 * Intel Xeon Processor (Skylake) CPUCache: 16384 KB Keys: 32 bytes each (+ 0 bytes user-defined timestamp) Values: 512 bytes each (256 bytes after compression) Entries: 50000000 Prefix: 0 bytes Keys per prefix: 0 RawSize: 25939.9 MB (estimated) FileSize: 13732.9 MB (estimated) Write rate: 0 bytes/second Read rate: 0 ops/second Compression: Snappy Compression sampling rate: 0 Memtablerep: SkipListFactory Perf Level: 1 ------------------------------------------------ DB path: [/tmp/prefix_scan_prefetch_main1] seekrandom : 363155.084 micros/op 2 ops/sec 30.142 seconds 83 operations; 468.1 MB/s (83 of 83 found) ``` Reviewed By: anand1976 Differential Revision: D39141328 Pulled By: akankshamahajan15 fbshipit-source-id: 560655922c1a437a8569c228abb31b8c0b413120 13 September 2022, 00:42:01 UTC
03c4ea2 db_stress option to preserve all files until verification success (#10659) Summary: In `db_stress`, DB and expected state files containing changes leading up to a verification failure are often deleted, which makes debugging such failures difficult. On the DB side, flushed WAL files and compacted SST files are marked obsolete and then deleted. Without those files, we cannot pinpoint where a key that failed verification changed unexpectedly. On the expected state side, files for verifying prefix-recoverability in the presence of unsynced data loss are deleted before verification. These include a baseline state file containing the expected state at the time of the last successful verification, and a trace file containing all operations since then. Without those files, we cannot know the sequence of DB operations expected to be recovered. This PR attempts to address this gap with a new `db_stress` flag: `preserve_unverified_changes`. Setting `preserve_unverified_changes=1` has two effects. First, prior to startup verification, `db_stress` hardlinks all DB and expected state files in "unverified/" subdirectories of `FLAGS_db` and `FLAGS_expected_values_dir`. The separate directories are needed because the pre-verification opening process deletes files written by the previous `db_stress` run as described above. These "unverified/" subdirectories are cleaned up following startup verification success. I considered other approaches for preserving DB files through startup verification, like using a read-only DB or preventing deletion of DB files externally, e.g., in the `Env` layer. However, I decided against it since such an approach would not work for expected state files, and I did not want to change the DB management logic. If there were a way to disable DB file deletions before regular DB open, I would have preferred to use that. Second, `db_stress` attempts to keep all DB and expected state files that were live at some point since the start of the `db_stress` run. This is a bit tricky and involves the following changes. - Open the DB with `disable_auto_compactions=1` and `avoid_flush_during_recovery=1` - DisableFileDeletions() - EnableAutoCompactions() For this part, too, I would have preferred to use a hypothetical API that disables DB file deletion before regular DB open. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10659 Reviewed By: hx235 Differential Revision: D39407454 Pulled By: ajkr fbshipit-source-id: 6e981025c7dce147649d2e770728471395a7fa53 12 September 2022, 21:49:38 UTC
bd2ad2f Fix stress test failure for async_io (#10660) Summary: Sanitize initial_auto_readahead_size if its greater than max_auto_readahead_size in case of async_io Pull Request resolved: https://github.com/facebook/rocksdb/pull/10660 Test Plan: Ran db_stress with intitial_auto_readahead_size greater than max_auto_readahead_size. Reviewed By: anand1976 Differential Revision: D39408095 Pulled By: akankshamahajan15 fbshipit-source-id: 07f933242f636cfbc7ccf042e0c8b959a8ec5f3a 12 September 2022, 21:48:06 UTC
f79b3d1 Inject spurious wakeup and sleep before acquiring db mutex to expose race condition (#10291) Summary: **Context/Summary:** Previous experience with bugs and flaky tests taught us there exist features in RocksDB vulnerable to race condition caused by acquiring db mutex at a particular timing. This PR aggressively exposes those vulnerable features by injecting spurious wakeup and sleep to cause acquiring db mutex at various timing in order to expose such race condition **Testing:** - `COERCE_CONTEXT_SWITCH=1 make -j56 check / make -j56 db_stress` should reveal - flaky tests caused by db mutex related race condition - Reverted https://github.com/facebook/rocksdb/pull/9528 - A/B testing on `COMPILE_WITH_TSAN=1 make -j56 listener_test` w/ and w/o `COERCE_CONTEXT_SWITCH=1` followed by `./listener_test --gtest_filter=EventListenerTest.MultiCF --gtest_repeat=10` - `COERCE_CONTEXT_SWITCH=1` can cause expected test failure (i.e, expose target TSAN data race error) within 10 run while the other couldn't. - This proves our injection can expose flaky tests caused by db mutex related race condition faster. - known or new race-condition-type of internal bug by continuously running this PR - Performance - High ops-threads time: COERCE_CONTEXT_SWITCH=1 regressed by 4 times slower (2:01.16 vs 0:22.10 elapsed ). This PR will be run as a separate CI job so this regression won't affect any existing job. ``` TEST_TMPDIR=$db /usr/bin/time ./db_stress \ --ops_per_thread=100000 --expected_values_dir=$exp --clear_column_family_one_in=0 \ --write_buffer_size=524288 —target_file_size_base=524288 —ingest_external_file_one_in=100 —compact_files_one_in=1000 —compact_range_one_in=1000 ``` - Start-up time: COERCE_CONTEXT_SWITCH=1 didn't regress by 25% (0:01.51 vs 0:01.29 elapsed) ``` TEST_TMPDIR=$db ./db_stress -ops_per_thread=100000000 -expected_values_dir=$exp --clear_column_family_one_in=0 & sleep 120; pkill -9 db_stress TEST_TMPDIR=$db /usr/bin/time ./db_stress \ --ops_per_thread=1 -reopen=0 --expected_values_dir=$exp --clear_column_family_one_in=0 --destroy_db_initially=0 ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/10291 Reviewed By: ajkr Differential Revision: D39231182 Pulled By: hx235 fbshipit-source-id: 7ab6695430460e0826727fd8c66679b32b3e44b6 12 September 2022, 20:55:23 UTC
be09943 Build and link libfolly with RocksDB (#10103) Summary: The current integration with folly requires cherry-picking folly source files to include in RocksDB for external CI builds. Its not scaleable as we depend on more features in folly, such as coroutines. This PR adds a dependency from RocksDB to the folly library when ```USE_FOLLY``` or ```USE_COROUTINES``` are set. We build folly using the build scripts in ```third-party/folly```, relying on it to download and build its dependencies. A new ```Makefile``` target, ```build_folly```, is provided to make building folly easier. A new option, ```USE_FOLLY_LITE``` is added to retain the old model of compiling selected folly sources with RocksDB. This might be useful for short-term development. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10103 Reviewed By: pdillinger Differential Revision: D38426787 Pulled By: anand1976 fbshipit-source-id: 33bc84abd9fdc7e2567749f02aa1b2494eb62b2f 12 September 2022, 04:40:11 UTC
7a9ecda Add auto prefetching parameters to db_bench and db_stress (#10632) Summary: Same as title Pull Request resolved: https://github.com/facebook/rocksdb/pull/10632 Test Plan: make crash_test -j32 Reviewed By: anand1976 Differential Revision: D39241479 Pulled By: akankshamahajan15 fbshipit-source-id: 5db5b0c007da786bacc1b30d8926d36d6d029b87 09 September 2022, 19:52:27 UTC
dc7d155 Mention some recent blob caching related changes in HISTORY.md (#10653) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10653 Reviewed By: riversand963 Differential Revision: D39368165 Pulled By: ltamasi fbshipit-source-id: 06cfd3c87ca90b9d07c082d5e307c0dc6a16840c 09 September 2022, 16:56:10 UTC
4100eb3 minor cleanups to db_crashtest.py (#10654) Summary: Expanded `all_params` to include all parameters crash test may set. Previously, `atomic_flush` was not included in `all_params` and thus was not visible to `finalize_and_sanitize()`. The consequence was manual crash test runs could provide unsafe combinations of parameters to `db_stress`. For example, running `db_crashtest.py` with `-atomic_flush=0` could cause `db_stress` to run with `-atomic_flush=0 -disable_wal=1`, which is known to produce inconsistencies across column families. While expanding `all_params`, I found we cannot have an entry in it for both `db_stress` and `db_crashtest.py`. So I renamed `enable_tiered_storage` to `test_tiered_storage` for `db_crashtest.py`, which appears more conventional anyways. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10654 Reviewed By: hx235 Differential Revision: D39369349 Pulled By: ajkr fbshipit-source-id: 31d9010c760c868b20d5e9bd78ba75c8ff3ce348 09 September 2022, 00:39:22 UTC
0148c49 Add PerfContext counters for CompressedSecondaryCache (#10650) Summary: Add PerfContext counters for CompressedSecondaryCache. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10650 Test Plan: Unit Tests. Reviewed By: anand1976 Differential Revision: D39354712 Pulled By: gitbw95 fbshipit-source-id: 1b90d3df99d08ddecd351edfd48d1e3723fdbc15 08 September 2022, 23:35:57 UTC
3d67d79 Fix overlapping check by excluding timestamp (#10615) Summary: With user-defined timestamp, checking overlapping should exclude timestamp part from key. This has already been done for range checking for files in sstableKeyCompare(), but not yet done when checking with concurrent compactions. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10615 Test Plan: (Will add more tests) make check (Repro seems easier with this commit sha: git checkout 78bbdef530bd36fa299d496bd1013cf39d8e203a) rm -rf /dev/shm/rocksdb/* && mkdir /dev/shm/rocksdb/rocksdb_crashtest_expected && ./db_stress --allow_data_in_errors=True --clear_column_family_one_in=0 --continuous_verification_interval=0 --data_block_index_type=1 --db=/dev/shm/rocksdb//rocksdb_crashtest_blackbox --delpercent=5 --delrangepercent=0 --expected_values_dir=/dev/shm/rocksdb//rocksdb_crashtest_expected --iterpercent=0 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=25000000 --max_write_batch_group_size_bytes=1048576 --nooverwritepercent=1 --ops_per_thread=1000000 --paranoid_file_checks=1 --partition_filters=0 --prefix_size=8 --prefixpercent=5 --readpercent=30 --reopen=0 --snapshot_hold_ops=100000 --subcompactions=1 --compaction_pri=3 --target_file_size_base=65536 --target_file_size_multiplier=2 --test_batches_snapshots=0 --test_cf_consistency=0 --use_multiget=1 --user_timestamp_size=8 --value_size_mult=32 --verify_checksum=1 --write_buffer_size=65536 --writepercent=60 -disable_wal=1 Reviewed By: akankshamahajan15 Differential Revision: D39146797 Pulled By: riversand963 fbshipit-source-id: 7fca800026ca6219220100b8b6cf84d907828163 08 September 2022, 20:03:07 UTC
fe56cb9 Eliminate some allocations/copies around the blob cache (#10647) Summary: Historically, `BlobFileReader` has returned the blob(s) read from the file in the `PinnableSlice` provided by the client. This interface was preserved when caching was implemented for blobs, which meant that the blob data was copied multiple times when caching was in use: first, into the client-provided `PinnableSlice` (by `BlobFileReader::SaveValue`), and then, into the object stored in the cache (by `BlobSource::PutBlobIntoCache`). The patch eliminates these copies and the related allocations by changing `BlobFileReader` so it returns its results in the form of heap-allocated `BlobContents` objects that can be directly inserted into the cache. The allocations backing these `BlobContents` objects are made using the blob cache's allocator if the blobs are to be inserted into the cache (i.e. if a cache is configured and the `fill_cache` read option is set). Note: this PR focuses on the common case when blobs are compressed; some further small optimizations are possible for uncompressed blobs. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10647 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D39335185 Pulled By: ltamasi fbshipit-source-id: 464503d60a5520d654c8273ffb8efd5d1bcd7b36 08 September 2022, 19:40:18 UTC
6de7081 Always verify SST unique IDs on SST file open (#10532) Summary: Although we've been tracking SST unique IDs in the DB manifest unconditionally, checking has been opt-in and with an extra pass at DB::Open time. This changes the behavior of `verify_sst_unique_id_in_manifest` to check unique ID against manifest every time an SST file is opened through table cache (normal DB operations), replacing the explicit pass over files at DB::Open time. This change also enables the option by default and removes the "EXPERIMENTAL" designation. One possible criticism is that the option no longer ensures the integrity of a DB at Open time. This is far from an all-or-nothing issue. Verifying the IDs of all SST files hardly ensures all the data in the DB is readable. (VerifyChecksum is supposed to do that.) Also, with max_open_files=-1 (default, extremely common), all SST files are opened at DB::Open time anyway. Implementation details: * `VerifySstUniqueIdInManifest()` functions are the extra/explicit pass that is now removed. * Unit tests that manipulate/corrupt table properties have to opt out of this check, because that corrupts the "actual" unique id. (And even for testing we don't currently have a mechanism to set "no unique id" in the in-memory file metadata for new files.) * A lot of other unit test churn relates to (a) default checking on, and (b) checking on SST open even without DB::Open (e.g. on flush) * Use `FileMetaData` for more `TableCache` operations (in place of `FileDescriptor`) so that we have access to the unique_id whenever we might need to open an SST file. **There is the possibility of performance impact because we can no longer use the more localized `fd` part of an `FdWithKeyRange` but instead follow the `file_metadata` pointer. However, this change (possible regression) is only done for `GetMemoryUsageByTableReaders`.** * Removed a completely unnecessary constructor overload of `TableReaderOptions` Possible follow-up: * Verification only happens when opening through table cache. Are there more places where this should happen? * Improve error message when there is a file size mismatch vs. manifest (FIXME added in the appropriate place). * I'm not sure there's a justification for `FileDescriptor` to be distinct from `FileMetaData`. * I'm skeptical that `FdWithKeyRange` really still makes sense for optimizing some data locality by duplicating some data in memory, but I could be wrong. * An unnecessary overload of NewTableReader was recently added, in the public API nonetheless (though unusable there). It should be cleaned up to put most things under `TableReaderOptions`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10532 Test Plan: updated unit tests Performance test showing no significant difference (just noise I think): `./db_bench -benchmarks=readwhilewriting[-X10] -num=3000000 -disable_wal=1 -bloom_bits=8 -write_buffer_size=1000000 -target_file_size_base=1000000` Before: readwhilewriting [AVG 10 runs] : 68702 (± 6932) ops/sec After: readwhilewriting [AVG 10 runs] : 68239 (± 7198) ops/sec Reviewed By: jay-zhuang Differential Revision: D38765551 Pulled By: pdillinger fbshipit-source-id: a827a708155f12344ab2a5c16e7701c7636da4c2 08 September 2022, 05:52:42 UTC
back to top