https://github.com/facebook/rocksdb

sort by:
Revision Author Date Message Commit Date
e072c12 Update HISTORY and version.h for 7.4.5 02 August 2022, 19:56:55 UTC
ed05774 Fix serious FSDirectory use-after-Close bug (missing fsync) (#10460) Summary: TL;DR: due to a recent change, if you drop a column family, often that DB will no longer fsync after writing new SST files to remaining or new column families, which could lead to data loss on power loss. More bug detail: The intent of https://github.com/facebook/rocksdb/issues/10049 was to Close FSDirectory objects at DB::Close time rather than waiting for DB object destruction. Unfortunately, it also closes shared FSDirectory objects on DropColumnFamily (& destroy remaining handles), which can lead to use-after-Close on FSDirectory shared with remaining column families. Those "uses" are only Fsyncs (or redundant Closes). In the default Posix filesystem, an Fsync on a closed FSDirectory is a quiet no-op. Consequently (under most configurations), if you drop a column family, that DB will no longer fsync after writing new SST files to column families sharing the same directory (true under most configurations). More fix detail: Basically, this removes unnecessary Close ops on destroying ColumnFamilyData. We let `shared_ptr` take care of calling the destructor at the right time. If the intent was to require Close be called before destroying FSDirectory, that was not made clear by the author of FileSystem and was not at all enforced by https://github.com/facebook/rocksdb/issues/10049, which could have added `assert(fd_ == -1)` to `~PosixDirectory()` but did not. To keep this fix simple, we relax the unit test for https://github.com/facebook/rocksdb/issues/10049 to allow timely destruction of FSDirectory to suffice as Close (in CountedFileSystem). Added a TODO to revisit that. Also in this PR: * Added a TODO to share FSDirectory instances between DB and its column families. (Already shared among column families.) * Made DB::Close attempt to close all its open FSDirectory objects even if there is a failure in closing one. Also code clean-up around this logic. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10460 Test Plan: add an assert to check for use-after-Close. With that existing tests can detect the misuse. With fix, tests pass (except noted relaxing of unit test for https://github.com/facebook/rocksdb/issues/10049) Reviewed By: ajkr Differential Revision: D38357922 Pulled By: pdillinger fbshipit-source-id: d42079cadbedf0a969f03389bf586b3b4e1f9137 02 August 2022, 19:45:05 UTC
e656fa3 update HISTORY.md and version.h for 7.4.4 19 July 2022, 15:49:59 UTC
66f369f Make RateLimiter not Customizable (#10378) Summary: (PR created for informational/testing purposes only.) - Fixes lost dynamic updates to GenericRateLimiter bandwidth using `SetBytesPerSecond()` - Benefit over #10374 is eliminating race conditions with Configurable framework. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10378 Reviewed By: pdillinger Differential Revision: D37914865 fbshipit-source-id: d4f566d60ec9726d26932388c61671adf0ee0f30 19 July 2022, 15:48:58 UTC
050027c Update version 13 July 2022, 19:01:38 UTC
41be9ec Stop tracking syncing live WAL for performance (#10330) Summary: With https://github.com/facebook/rocksdb/issues/10087, applications calling `SyncWAL()` or writing with `WriteOptions::sync=true` can suffer from performance regression. This PR reverts to original behavior of tracking the syncing of closed WALs. After we revert back to old behavior, recovery, whether kPointInTime or kAbsoluteConsistency, may fail to detect corruption in synced WALs if the corruption is in the live WAL. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10330 Test Plan: make check Before https://github.com/facebook/rocksdb/issues/10087 ```bash fillsync : 750.269 micros/op 1332 ops/sec 75.027 seconds 100000 operations; 0.1 MB/s (100 ops) fillsync : 776.492 micros/op 1287 ops/sec 77.649 seconds 100000 operations; 0.1 MB/s (100 ops) fillsync [AVG 2 runs] : 1310 (± 44) ops/sec; 0.1 (± 0.0) MB/sec fillsync : 805.625 micros/op 1241 ops/sec 80.563 seconds 100000 operations; 0.1 MB/s (100 ops) fillsync [AVG 3 runs] : 1287 (± 51) ops/sec; 0.1 (± 0.0) MB/sec fillsync [AVG 3 runs] : 1287 (± 51) ops/sec; 0.1 (± 0.0) MB/sec fillsync [MEDIAN 3 runs] : 1287 ops/sec; 0.1 MB/sec ``` Before this PR and after https://github.com/facebook/rocksdb/issues/10087 ```bash fillsync : 1479.601 micros/op 675 ops/sec 147.960 seconds 100000 operations; 0.1 MB/s (100 ops) fillsync : 1626.080 micros/op 614 ops/sec 162.608 seconds 100000 operations; 0.1 MB/s (100 ops) fillsync [AVG 2 runs] : 645 (± 59) ops/sec; 0.1 (± 0.0) MB/sec fillsync : 1588.402 micros/op 629 ops/sec 158.840 seconds 100000 operations; 0.1 MB/s (100 ops) fillsync [AVG 3 runs] : 640 (± 35) ops/sec; 0.1 (± 0.0) MB/sec fillsync [AVG 3 runs] : 640 (± 35) ops/sec; 0.1 (± 0.0) MB/sec fillsync [MEDIAN 3 runs] : 629 ops/sec; 0.1 MB/sec ``` After this PR ```bash fillsync : 749.621 micros/op 1334 ops/sec 74.962 seconds 100000 operations; 0.1 MB/s (100 ops) fillsync : 865.577 micros/op 1155 ops/sec 86.558 seconds 100000 operations; 0.1 MB/s (100 ops) fillsync [AVG 2 runs] : 1244 (± 175) ops/sec; 0.1 (± 0.0) MB/sec fillsync : 845.837 micros/op 1182 ops/sec 84.584 seconds 100000 operations; 0.1 MB/s (100 ops) fillsync [AVG 3 runs] : 1223 (± 109) ops/sec; 0.1 (± 0.0) MB/sec fillsync [AVG 3 runs] : 1223 (± 109) ops/sec; 0.1 (± 0.0) MB/sec fillsync [MEDIAN 3 runs] : 1182 ops/sec; 0.1 MB/sec ``` Reviewed By: ajkr Differential Revision: D37725212 Pulled By: riversand963 fbshipit-source-id: 8fa7d13b3c7662be5d56351c42caf3266af937ae 13 July 2022, 19:01:38 UTC
4fd1285 Update version to 7.4.2 and update HISTORY 01 July 2022, 02:16:59 UTC
d74333c Fix bug in Logger creation if dbname and db_log_dir are on different filesystem (#10292) Summary: If dbname and db_log_dir are at different filesystems (one local and one remote), creation of dbname will fail because that path doesn't exist wrt to db_log_dir. This patch will ignore the error returned on creation of dbname. If they are on same filesystem, db_log_dir creation will automatically return the error in case there is any error in creation of dbname. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10292 Test Plan: Existing unit tests Reviewed By: riversand963 Differential Revision: D37567773 Pulled By: akankshamahajan15 fbshipit-source-id: 005d28c536208d4c126c8cb8e196d1d85b881100 01 July 2022, 02:12:53 UTC
57adbf0 Update version to 7.4.1 and update HISTORY 28 June 2022, 16:36:10 UTC
0b8c8cf Pass rate_limiter_priority through filter block reader functions to FS (#10251) Summary: With https://github.com/facebook/rocksdb/pull/9996 , we can pass the rate_limiter_priority to FS for most cases. This PR is to update the code path for filter block reader. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10251 Test Plan: Current unit tests should pass. Reviewed By: pdillinger Differential Revision: D37427667 Pulled By: gitbw95 fbshipit-source-id: 1ce5b759b136efe4cfa48a6b97e2f837ff087433 28 June 2022, 16:28:09 UTC
1bee5e0 Expose the initial logger creation error (#10223) Summary: https://github.com/facebook/rocksdb/issues/9984 changes the behavior of RocksDB: if logger creation failed during `SanitizeOptions()`, `DB::Open()` will fail. However, since `SanitizeOptions()` is called in `DBImpl::DBImpl()`, we cannot directly expose the error to caller without some additional work. This is a first version proposal which: - Adds a new member `init_logger_creation_s` to `DBImpl` to store the result of init logger creation - Checks the error during `DB::Open()` and return it to caller if non-ok This is not very ideal. We can alternatively move the logger creation logic out of the `SanitizeOptions()`. Since `SanitizeOptions()` is used in other places, we need to check whether this change breaks anything in case other callers of `SanitizeOptions()` assumes that a logger should be created. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10223 Test Plan: make check Reviewed By: pdillinger Differential Revision: D37321717 Pulled By: riversand963 fbshipit-source-id: 58042358a86369d606549dd9938933dd47591c4b 22 June 2022, 15:36:05 UTC
c79e43d Add data block hash index to crash test, fix MultiGet issue (#10220) Summary: There was a bug in the MultiGet enhancement in https://github.com/facebook/rocksdb/issues/9899 with data block hash index, which was not caught because data block hash index was never added to stress tests. This change fixes both issues. Fixes https://github.com/facebook/rocksdb/issues/10186 I intend to pick this into the 7.4.0 release candidate Pull Request resolved: https://github.com/facebook/rocksdb/pull/10220 Test Plan: Failure quickly reproduces in crash test with kDataBlockBinaryAndHash, and does not seem to with the fix. Reproducing the failure with a unit test I believe would be too tricky and fragile to be worthwhile. Reviewed By: anand1976 Differential Revision: D37315647 Pulled By: pdillinger fbshipit-source-id: 9f648265bba867275edc752f7a56611a59401cba 22 June 2022, 15:35:44 UTC
c26abd4 Fix bad include (#10213) Summary: include "include/rocksdb/blah.h" is messing up some internal builds vs. include "rocksdb/blah." This fixes the bad case and adds a check for future instances. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10213 Test Plan: back-port to 7.4 release candidate and watch internal build Reviewed By: hx235 Differential Revision: D37296202 Pulled By: pdillinger fbshipit-source-id: d7cc6b2c57d858dff0444f19320d83c8b4f9b185 21 June 2022, 00:54:08 UTC
fac7a23 Update HISTORY for 7.4.0 release freeze (#10196) Summary: Planned for Sunday 6/19 Pull Request resolved: https://github.com/facebook/rocksdb/pull/10196 Test Plan: no code Reviewed By: akankshamahajan15 Differential Revision: D37244857 Pulled By: pdillinger fbshipit-source-id: afbf4aa201983b3c01c16b5f55c68f2325d17421 19 June 2022, 23:31:16 UTC
0e0a198 Fix a bug in WriteBatchInternal::Append when write batch KV protection is turned on (#10201) Summary: This bug was discovered after write batch checksum verification before WAL is added (https://github.com/facebook/rocksdb/issues/10114) and stress test with write batch checksum protection is turned on (https://github.com/facebook/rocksdb/issues/10037). In this [line](https://github.com/facebook/rocksdb/blob/d5d8920f2cfd06d1803b0976acbe8b564b88b6b1/db/write_batch.cc#L2887), the number of checksums may not be consistent with `batch->Count()`. This PR fixes this issue. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10201 Test Plan: ``` ./db_stress --batch_protection_bytes_per_key=8 --destroy_db_initially=1 --max_key=100000 --use_txn=1 ``` Reviewed By: ajkr Differential Revision: D37260799 Pulled By: cbi42 fbshipit-source-id: ff8dce7dcce295d689333bc9d892d17a843bf0ea 18 June 2022, 22:12:17 UTC
d5d8920 Fix race condition with WAL tracking and `FlushWAL(true /* sync */)` (#10185) Summary: `FlushWAL(true /* sync */)` is used internally and for manual WAL sync. It had a bug when used together with `track_and_verify_wals_in_manifest` where the synced size tracked in MANIFEST was larger than the number of bytes actually synced. The bug could be repro'd almost immediately with the following crash test command: `python3 tools/db_crashtest.py blackbox --simple --write_buffer_size=524288 --max_bytes_for_level_base=2097152 --target_file_size_base=524288 --duration=3600 --interval=10 --sync_fault_injection=1 --disable_wal=0 --checkpoint_one_in=1000 --max_key=10000 --value_size_mult=33`. An example error message produced by the above command is shown below. The error sometimes arose from the checkpoint and other times arose from the main stress test DB. ``` Corruption: Size mismatch: WAL (log number: 119) in MANIFEST is 27938 bytes , but actually is 27859 bytes on disk. ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/10185 Test Plan: - repro unit test - the above crash test command no longer finds the error. It does find a different error after a while longer such as "Corruption: WAL file 481 required by manifest but not in directory list" Reviewed By: riversand963 Differential Revision: D37200993 Pulled By: ajkr fbshipit-source-id: 98e0071c1a89f4d009888512ed89f9219779ae5f 17 June 2022, 23:45:28 UTC
a5d773e Add rate-limiting support to batched MultiGet() (#10159) Summary: **Context/Summary:** https://github.com/facebook/rocksdb/pull/9424 added rate-limiting support for user reads, which does not include batched `MultiGet()`s that call `RandomAccessFileReader::MultiRead()`. The reason is that it's harder (compared with RandomAccessFileReader::Read()) to implement the ideal rate-limiting where we first call `RateLimiter::RequestToken()` for allowed bytes to multi-read and then consume those bytes by satisfying as many requests in `MultiRead()` as possible. For example, it can be tricky to decide whether we want partially fulfilled requests within one `MultiRead()` or not. However, due to a recent urgent user request, we decide to pursue an elementary (but a conditionally ineffective) solution where we accumulate enough rate limiter requests toward the total bytes needed by one `MultiRead()` before doing that `MultiRead()`. This is not ideal when the total bytes are huge as we will actually consume a huge bandwidth from rate-limiter causing a burst on disk. This is not what we ultimately want with rate limiter. Therefore a follow-up work is noted through TODO comments. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10159 Test Plan: - Modified existing unit test `DBRateLimiterOnReadTest/DBRateLimiterOnReadTest.NewMultiGet` - Traced the underlying system calls `io_uring_enter` and verified they are 10 seconds apart from each other correctly under the setting of `strace -ftt -e trace=io_uring_enter ./db_bench -benchmarks=multireadrandom -db=/dev/shm/testdb2 -readonly -num=50 -threads=1 -multiread_batched=1 -batch_size=100 -duration=10 -rate_limiter_bytes_per_sec=200 -rate_limiter_refill_period_us=1000000 -rate_limit_bg_reads=1 -disable_auto_compactions=1 -rate_limit_user_ops=1` where each `MultiRead()` read about 2000 bytes (inspected by debugger) and the rate limiter grants 200 bytes per seconds. - Stress test: - Verified `./db_stress (-test_cf_consistency=1/test_batches_snapshots=1) -use_multiget=1 -cache_size=1048576 -rate_limiter_bytes_per_sec=10241024 -rate_limit_bg_reads=1 -rate_limit_user_ops=1` work Reviewed By: ajkr, anand1976 Differential Revision: D37135172 Pulled By: hx235 fbshipit-source-id: 73b8e8f14761e5d4b77235dfe5d41f4eea968bcd 17 June 2022, 23:40:47 UTC
c965c9e Read blob from blob cache if exists when GetBlob() (#10178) Summary: There is currently no caching mechanism for blobs, which is not ideal especially when the database resides on remote storage (where we cannot rely on the OS page cache). As part of this task, we would like to make it possible for the application to configure a blob cache. In this task, we added a new abstraction layer `BlobSource` to retrieve blobs from either blob cache or raw blob file. Note: For simplicity, the current PR only includes `GetBlob()`. `MultiGetBlob()` will be included in the next PR. This PR is a part of https://github.com/facebook/rocksdb/issues/10156 Pull Request resolved: https://github.com/facebook/rocksdb/pull/10178 Reviewed By: ltamasi Differential Revision: D37250507 Pulled By: gangliao fbshipit-source-id: 3fc4a55a0cea955a3147bdc7dba06430e377259b 17 June 2022, 22:22:59 UTC
1aac814 Use optimized folly DistributedMutex in LRUCache when available (#10179) Summary: folly DistributedMutex is faster than standard mutexes though imposes some static obligations on usage. See https://github.com/facebook/folly/blob/main/folly/synchronization/DistributedMutex.h for details. Here we use this alternative for our Cache implementations (especially LRUCache) for better locking performance, when RocksDB is compiled with folly. Also added information about which distributed mutex implementation is being used to cache_bench output and to DB LOG. Intended follow-up: * Use DMutex in more places, perhaps improving API to support non-scoped locking * Fix linking with fbcode compiler (needs ROCKSDB_NO_FBCODE=1 currently) Credit: Thanks Siying for reminding me about this line of work that was previously left unfinished. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10179 Test Plan: for correctness, existing tests. CircleCI config updated. Also Meta-internal buck build updated. For performance, ran simultaneous before & after cache_bench. Out of three comparison runs, the middle improvement to ops/sec was +21%: Baseline: USE_CLANG=1 DEBUG_LEVEL=0 make -j24 cache_bench (fbcode compiler) ``` Complete in 20.201 s; Rough parallel ops/sec = 1584062 Thread ops/sec = 107176 Operation latency (ns): Count: 32000000 Average: 9257.9421 StdDev: 122412.04 Min: 134 Median: 3623.0493 Max: 56918500 Percentiles: P50: 3623.05 P75: 10288.02 P99: 30219.35 P99.9: 683522.04 P99.99: 7302791.63 ``` New: (add USE_FOLLY=1) ``` Complete in 16.674 s; Rough parallel ops/sec = 1919135 (+21%) Thread ops/sec = 135487 Operation latency (ns): Count: 32000000 Average: 7304.9294 StdDev: 108530.28 Min: 132 Median: 3777.6012 Max: 91030902 Percentiles: P50: 3777.60 P75: 10169.89 P99: 24504.51 P99.9: 59721.59 P99.99: 1861151.83 ``` Reviewed By: anand1976 Differential Revision: D37182983 Pulled By: pdillinger fbshipit-source-id: a17eb05f25b832b6a2c1356f5c657e831a5af8d1 17 June 2022, 20:08:45 UTC
f87adcf Fix overflow in ribbon_bench after #10184 (#10195) Summary: Ribbon micro-bench needs updating after re-numbering `BloomLikeFilterPolicy::GetAllFixedImpls()` entries. (CircleCI nightly failure.) Also fixed memory leaks while using ASAN to validate my fix. (I assume the leaks weren't intentional for some performance characteristic.) Pull Request resolved: https://github.com/facebook/rocksdb/pull/10195 Test Plan: run with ASAN Reviewed By: jay-zhuang Differential Revision: D37244459 Pulled By: pdillinger fbshipit-source-id: 5a363e10de3c4c9c88099c937e3dc3b4cf24fd30 17 June 2022, 19:53:57 UTC
5d6005c Add WriteOptions::protection_bytes_per_key (#10037) Summary: Added an option, `WriteOptions::protection_bytes_per_key`, that controls how many bytes per key we use for integrity protection in `WriteBatch`. It takes effect when `WriteBatch::GetProtectionBytesPerKey() == 0`. Currently the only supported value is eight. Invoking a user API with it set to any other nonzero value will result in `Status::NotSupported` returned to the user. There is also a bug fix for integrity protection with `inplace_callback`, where we forgot to take into account the possible change in varint length when calculating KV checksum for the final encoded buffer. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10037 Test Plan: - Manual - Set default value of `WriteOptions::protection_bytes_per_key` to eight and ran `make check -j24` - Enabled in MyShadow for 1+ week - Automated - Unit tests have a `WriteMode` that enables the integrity protection via `WriteOptions` - Crash test - in most cases, use `WriteOptions::protection_bytes_per_key` to enable integrity protection Reviewed By: cbi42 Differential Revision: D36614569 Pulled By: ajkr fbshipit-source-id: 8650087ceac9b61b560f1e5fafe5e1baf9c725fb 17 June 2022, 06:10:07 UTC
f62c1e1 Fix a false negative merge conflict (#10192) Summary: .. between https://github.com/facebook/rocksdb/issues/10184 and https://github.com/facebook/rocksdb/issues/10122 not detected by source control, leading to non-compiling code. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10192 Test Plan: updated test Reviewed By: hx235 Differential Revision: D37231921 Pulled By: pdillinger fbshipit-source-id: fa21488716f4c006b111b8c4127d71c757c935c3 17 June 2022, 04:14:10 UTC
8cf8625 Update HISTORY.md for #10114 (#10189) Summary: Update HISTORY.md for https://github.com/facebook/rocksdb/issues/10114: write batch checksum verification before writing to WAL. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10189 Reviewed By: ajkr Differential Revision: D37226366 Pulled By: cbi42 fbshipit-source-id: cd2f076961abc35f35783e0f2cc3beda68cdb446 17 June 2022, 02:59:26 UTC
fff302d More testing w/prefix extractor, small refactor (#10122) Summary: There was an interesting code path not covered by testing that is difficult to replicate in a unit test, which is now covered using a sync point. Specifically, the case of table_prefix_extractor == null and !need_upper_bound_check in `BlockBasedTable::PrefixMayMatch`, which can happen if table reader is open before extractor is registered with global object registry, but is later registered and re-set with SetOptions. (We don't have sufficient testing control over object registry to set that up repeatedly.) Also, this function has been renamed to `PrefixRangeMayMatch` for clarity vs. other functions that are not the same. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10122 Test Plan: unit tests expanded Reviewed By: siying Differential Revision: D36944834 Pulled By: pdillinger fbshipit-source-id: 9e52d9da1929a3e42bbc230fcdc3599949de7bdb 16 June 2022, 23:41:25 UTC
126c223 Remove deprecated block-based filter (#10184) Summary: In https://github.com/facebook/rocksdb/issues/9535, release 7.0, we hid the old block-based filter from being created using the public API, because of its inefficiency. Although we normally maintain read compatibility on old DBs forever, filters are not required for reading a DB, only for optimizing read performance. Thus, it should be acceptable to remove this code and the substantial maintenance burden it carries as useful features are developed and validated (such as user timestamp). This change completely removes the code for reading and writing the old block-based filters, net removing about 1370 lines of code no longer needed. Options removed from testing / benchmarking tools. The prior existence is only evident in a couple of places: * `CacheEntryRole::kDeprecatedFilterBlock` - We can update this public API enum in a major release to minimize source code incompatibilities. * A warning is logged when an old table file is opened that used the old block-based filter. This is provided as a courtesy, and would be a pain to unit test, so manual testing should suffice. Unfortunately, sst_dump does not tell you whether a file uses block-based filter, and the structure of the code makes it very difficult to fix. * To detect that case, `kObsoleteFilterBlockPrefix` (renamed from `kFilterBlockPrefix`) for metaindex is maintained (for now). Other notes: * In some cases where numbers are associated with filter configurations, we have had to update the assigned numbers so that they all correspond to something that exists. * Fixed potential stat counting bug by assuming `filter_checked = false` for cases like `filter == nullptr` rather than assuming `filter_checked = true` * Removed obsolete `block_offset` and `prefix_extractor` parameters from several functions. * Removed some unnecessary checks `if (!table_prefix_extractor() && !prefix_extractor)` because the caller guarantees the prefix extractor exists and is compatible Pull Request resolved: https://github.com/facebook/rocksdb/pull/10184 Test Plan: tests updated, manually test new warning in LOG using base version to generate a DB Reviewed By: riversand963 Differential Revision: D37212647 Pulled By: pdillinger fbshipit-source-id: 06ee020d8de3b81260ffc36ad0c1202cbf463a80 16 June 2022, 22:51:33 UTC
a6691d0 Update stats to help users estimate MultiGet async IO impact (#10182) Summary: Add a couple of stats to help users estimate the impact of potential MultiGet perf improvements - 1. NUM_LEVEL_READ_PER_MULTIGET - A histogram stat for number of levels that required MultiGet to read from a file 2. MULTIGET_COROUTINE_COUNT - A ticker stat to count the number of times the coroutine version of MultiGetFromSST was used The NUM_DATA_BLOCKS_READ_PER_LEVEL stat is obsoleted as it doesn't provide useful information for MultiGet optimization. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10182 Reviewed By: akankshamahajan15 Differential Revision: D37213296 Pulled By: anand1976 fbshipit-source-id: 5d2b7708017c0e278578ae4bffac3926f6530efb 16 June 2022, 19:12:43 UTC
4d31d3c Abort in dbg mode after logging (#10183) Summary: In CompactionIterator code, there are multiple places where the process will abort in dbg mode before logging the error message describing the cause. This PR changes only the logging behavior for compaction iterator so that error message is written to LOG before the process aborts in debug mode. Also updated the triggering condition for an assertion for single delete with user-defined timestamp. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10183 Test Plan: make check Reviewed By: akankshamahajan15 Differential Revision: D37190218 Pulled By: riversand963 fbshipit-source-id: 741bb007067be7cfbe94ac9e530ad4b2b339c009 16 June 2022, 05:00:24 UTC
8353ae8 Add few optimizations in async_io for short scans (#10140) Summary: This PR adds few optimizations for async_io for shorter scans. 1. If async_io is enabled, seek would create FilePrefetchBuffer object to fetch the data asynchronously. However `FilePrefetchbuffer::num_file_reads_` wasn't taken into consideration if it calls Next after Seek and would go for Prefetching. This PR fixes that and Next will go for prefetching only if `FilePrefetchbuffer::num_file_reads_` is greater than 2 along with if blocks are sequential. This scenario is only for implicit auto readahead. 2. For seek, when it calls TryReadFromCacheAsync to poll it makes async call as well because TryReadFromCacheAsync flow wasn't changed. So I updated to return after poll instead of further prefetching any data. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10140 Test Plan: 1. Added a unit test 2. Ran crash_test with async_io = 1 to make sure nothing crashes. Reviewed By: anand1976 Differential Revision: D37042242 Pulled By: akankshamahajan15 fbshipit-source-id: b8e6b7cb2ee0886f37a8f53951948b9084e8ffda 16 June 2022, 03:17:35 UTC
3d358a7 Fix handling of accidental truncation of IDENTITY file (#10173) Summary: A consequence of https://github.com/facebook/rocksdb/issues/9990 was requiring a non-empty DB ID to generate new SST files. But if the DB ID is not tracked in the manifest and the IDENTITY file is somehow truncated to 0 bytes, then an empty DB ID would be assigned, leading to crash. This change ensures a non-empty DB ID is assigned and set in the IDENTITY file. Also, * Some light refactoring to clean up the logic * (I/O efficiency) If the ID is tracked in the manifest and already matches the IDENTITY file, don't needlessly overwrite the file. * (Debugging) Log the DB ID to info log on open, because sometimes IDENTITY can change if DB is moved around (though it would be unusual for info log to be copied/moved without IDENTITY file) Pull Request resolved: https://github.com/facebook/rocksdb/pull/10173 Test Plan: unit tests expanded/updated Reviewed By: ajkr Differential Revision: D37176545 Pulled By: pdillinger fbshipit-source-id: a9b414cd35bfa33de48af322a36c24538d50bef1 15 June 2022, 22:39:49 UTC
94329ae Use only ASCII in source files (#10164) Summary: Fix existing usage of non-ASCII and add a check to prevent future use. Added `-n` option to greps to provide line numbers. Alternative to https://github.com/facebook/rocksdb/issues/10147 Pull Request resolved: https://github.com/facebook/rocksdb/pull/10164 Test Plan: used new checker to find & fix cases, manually check db_bench output is preserved Reviewed By: akankshamahajan15 Differential Revision: D37148792 Pulled By: pdillinger fbshipit-source-id: 68c8b57e7ab829369540d532590bf756938855c7 15 June 2022, 21:44:43 UTC
9882652 Verify write batch checksum before WAL (#10114) Summary: Context: WriteBatch can have key-value checksums when it was created `with protection_bytes_per_key > 0`. This PR added checksum verification for write batches before they are written to WAL. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10114 Test Plan: - Added new unit tests to db_kv_checksum_test.cc: `make check -j32` - benchmark on performance regression: `./db_bench --benchmarks=fillrandom[-X20] -db=/dev/shm/test_rocksdb -write_batch_protection_bytes_per_key=8` - Pre-PR: ` fillrandom [AVG 20 runs] : 198875 (± 3006) ops/sec; 22.0 (± 0.3) MB/sec ` - Post-PR: ` fillrandom [AVG 20 runs] : 196487 (± 2279) ops/sec; 21.7 (± 0.3) MB/sec ` Mean regressed about 1% (198875 -> 196487 ops/sec). Reviewed By: ajkr Differential Revision: D36917464 Pulled By: cbi42 fbshipit-source-id: 29beb74edf65f04b1a890b4f650d873dc7ed790d 15 June 2022, 20:43:58 UTC
2e5a323 Change the instruction used for a pause on arm64 (#10118) Summary: While the yield instruction conseptually sounds correct on most platforms it is a simple nop that doesn't delay the execution anywhere close to what an x86 pause instruction does. In other projects with spin-wait loops an isb has been observed to be much closer to the x86 behavior. On a Graviton3 system the following test improves on average by 2x with this change averaged over 20 runs: ``` ./db_bench -benchmarks=fillrandom -threads=64 -batch_size=1 -memtablerep=skip_list -value_size=100 --num=100000 level0_slowdown_writes_trigger=9999 -level0_stop_writes_trigger=9999 -disable_auto_compactions --max_write_buffer_number=8 -max_background_flushes=8 --disable_wal --write_buffer_size=160000000 --block_size=16384 --allow_concurrent_memtable_write -compression_type none ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/10118 Reviewed By: jay-zhuang Differential Revision: D37120578 fbshipit-source-id: c20bde4298222edfab7ff7cb6d42497e7012400d 15 June 2022, 20:08:11 UTC
69a32ee Use madvise() for mmaped file advise (#10170) Summary: A recent PR https://github.com/facebook/rocksdb/pull/10142 enabled fadvise for mmaped file. However, we were told that it might not take effective and madvise() should be used. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10170 Test Plan: Run existing tests Run a benchmark using mmap with advise random and see I/O size is indeed small. Reviewed By: anand1976 Differential Revision: D37158582 fbshipit-source-id: 8b3a74f0e89d2e16aac78ee4124c05841d4135c3 15 June 2022, 20:05:58 UTC
ce419c0 Allow db_bench and db_stress to set `allow_data_in_errors` (#10171) Summary: There is `Options::allow_data_in_errors` that controls whether RocksDB is allowed to log data, e.g. key, value, etc in LOG files. It is false by default. However, in db_bench and db_stress, it is often ok to log data because there is no concern about privacy. This PR allows db_stress and db_bench to set this option on the command line, while it remains false by default. Furthermore, make crash/recovery test driven by db_crashtest.py to opt-in. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10171 Test Plan: Stress test and db_bench Reviewed By: hx235 Differential Revision: D37163787 Pulled By: riversand963 fbshipit-source-id: 0242f24d292ba15b6faf8ff903963b85d3e011f8 15 June 2022, 19:38:04 UTC
19345de fix cancel argument for latest liburing (#10168) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10168 the arg changed to u64 Reviewed By: ajkr Differential Revision: D37155407 fbshipit-source-id: 464eab2806675f148fce075a6fea369fa3d7a9bb 15 June 2022, 16:10:19 UTC
40dfa26 Fix C4702 on windows (#10146) Summary: This code is unreachable when `ROCKSDB_LITE` not defined. And it cause build fail on my environment VS2019 16.11.15. ``` -- Selecting Windows SDK version 10.0.19041.0 to target Windows 10.0.19044. -- The CXX compiler identification is MSVC 19.29.30145.0 -- The C compiler identification is MSVC 19.29.30145.0 -- The ASM compiler identification is MSVC ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/10146 Reviewed By: akankshamahajan15 Differential Revision: D37112916 Pulled By: ajkr fbshipit-source-id: e0b2bf3055d6fac1b3fb40b9f02c4cbae3f82757 15 June 2022, 04:32:10 UTC
77f4799 Fix potential leak when reusing PinnableSlice instances. (#10166) Summary: `PinnableSlice` may hold a handle to a cache value which must be released to correctly decrement the ref-counter. However, when `PinnableSlice` variables are reused, e.g. like this: ``` PinnableSlice pin_slice; db.Get("foo", &pin_slice); db.Get("foo", &pin_slice); ``` then the second `Get` simply overwrites the old value in `pin_slice` and the handle returned by the first `Get` is _not_ released. This PR adds `Reset` calls to the `Get`/`MultiGet` calls that accept `PinnableSlice` arguments to ensure proper cleanup of old values. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10166 Reviewed By: hx235 Differential Revision: D37151632 Pulled By: ajkr fbshipit-source-id: 9dd3c3288300f560531b843f67db11aeb569a9ff 15 June 2022, 04:29:52 UTC
b550fc0 Modify the instructions emited for PREFETCH on arm64 (#10117) Summary: __builtin_prefetch(...., 1) prefetches into the L2 cache on x86 while the same emits a pldl3keep instruction on arm64 which doesn't seem to be close enough. Testing on a Graviton3, and M1 system with memtablerep_bench fillrandom and skiplist througpuh increased as follows adjusting the 1 to 2 or 3: ``` 1 -> 2 1 -> 3 ---------------------------- Graviton3 +10% +15% M1 +10% +10% ``` Given that prefetching into the L1 cache seems to help, I chose that conversion Pull Request resolved: https://github.com/facebook/rocksdb/pull/10117 Reviewed By: pdillinger Differential Revision: D37120475 fbshipit-source-id: db1ef43f941445019c68316500a2250acc643d5e 15 June 2022, 00:58:44 UTC
751d1a3 mingw: remove no-asynchronous-unwind-tables (#9963) Summary: This default is generally incompatible with other parts of mingw, and can be applied by outside users as-needed. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9963 Reviewed By: akankshamahajan15 Differential Revision: D36302813 Pulled By: ajkr fbshipit-source-id: 9456b41a96bde302bacbc39e092ccecfcb42f34f 15 June 2022, 00:42:55 UTC
cba398d Add blob cache option in the column family options (#10155) Summary: There is currently no caching mechanism for blobs, which is not ideal especially when the database resides on remote storage (where we cannot rely on the OS page cache). As part of this task, we would like to make it possible for the application to configure a blob cache. This PR is a part of https://github.com/facebook/rocksdb/issues/10156 Pull Request resolved: https://github.com/facebook/rocksdb/pull/10155 Reviewed By: ltamasi Differential Revision: D37150819 Pulled By: gangliao fbshipit-source-id: b807c7916ea5d411588128f8e22a49f171388fe2 14 June 2022, 21:19:26 UTC
1d2950b fix a false positive case of parsing table factory from options file (#10094) Summary: During options file parsing, reset table factory before attempting to parse it from string. This avoids mistakenly treating the default table factory as a newly created one. Signed-off-by: tabokie <xy.tao@outlook.com> Pull Request resolved: https://github.com/facebook/rocksdb/pull/10094 Reviewed By: akankshamahajan15 Differential Revision: D36945378 Pulled By: ajkr fbshipit-source-id: 94b2604e5e87682063b4b78f6370f3e8f101dc44 14 June 2022, 20:20:54 UTC
d665afd Account memory of FileMetaData in global memory limit (#9924) Summary: **Context/Summary:** As revealed by heap profiling, allocation of `FileMetaData` for [newly created file added to a Version](https://github.com/facebook/rocksdb/pull/9924/files#diff-a6aa385940793f95a2c5b39cc670bd440c4547fa54fd44622f756382d5e47e43R774) can consume significant heap memory. This PR is to account that toward our global memory limit based on block cache capacity. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9924 Test Plan: - Previous `make check` verified there are only 2 places where the memory of the allocated `FileMetaData` can be released - New unit test `TEST_P(ChargeFileMetadataTestWithParam, Basic)` - db bench (CPU cost of `charge_file_metadata` in write and compact) - **write micros/op: -0.24%** : `TEST_TMPDIR=/dev/shm/testdb ./db_bench -benchmarks=fillseq -db=$TEST_TMPDIR -charge_file_metadata=1 (remove this option for pre-PR) -disable_auto_compactions=1 -write_buffer_size=100000 -num=4000000 | egrep 'fillseq'` - **compact micros/op -0.87%** : `TEST_TMPDIR=/dev/shm/testdb ./db_bench -benchmarks=fillseq -db=$TEST_TMPDIR -charge_file_metadata=1 -disable_auto_compactions=1 -write_buffer_size=100000 -num=4000000 -numdistinct=1000 && ./db_bench -benchmarks=compact -db=$TEST_TMPDIR -use_existing_db=1 -charge_file_metadata=1 -disable_auto_compactions=1 | egrep 'compact'` table 1 - write #-run | (pre-PR) avg micros/op | std micros/op | (post-PR) micros/op | std micros/op | change (%) -- | -- | -- | -- | -- | -- 10 | 3.9711 | 0.264408 | 3.9914 | 0.254563 | 0.5111933721 20 | 3.83905 | 0.0664488 | 3.8251 | 0.0695456 | -0.3633711465 40 | 3.86625 | 0.136669 | 3.8867 | 0.143765 | 0.5289363078 80 | 3.87828 | 0.119007 | 3.86791 | 0.115674 | **-0.2673865734** 160 | 3.87677 | 0.162231 | 3.86739 | 0.16663 | **-0.2419539978** table 2 - compact #-run | (pre-PR) avg micros/op | std micros/op | (post-PR) micros/op | std micros/op | change (%) -- | -- | -- | -- | -- | -- 10 | 2,399,650.00 | 96,375.80 | 2,359,537.00 | 53,243.60 | -1.67 20 | 2,410,480.00 | 89,988.00 | 2,433,580.00 | 91,121.20 | 0.96 40 | 2.41E+06 | 121811 | 2.39E+06 | 131525 | **-0.96** 80 | 2.40E+06 | 134503 | 2.39E+06 | 108799 | **-0.78** - stress test: `python3 tools/db_crashtest.py blackbox --charge_file_metadata=1 --cache_size=1` killed as normal Reviewed By: ajkr Differential Revision: D36055583 Pulled By: hx235 fbshipit-source-id: b60eab94707103cb1322cf815f05810ef0232625 14 June 2022, 20:06:40 UTC
40d19bc Fix the failure related to io_uring_prep_cancel (#10165) Summary: Fix for Internal jobs are failing with ``` error: no matching function for call to 'io_uring_prep_cancel' io_uring_prep_cancel(sqe, posix_handle, 0); ^~~~~~~~~~~~~~~~~~~~ note: candidate function not viable: no known conversion from 'rocksdb::Posix_IOHandle *' to '__u64' (aka 'unsigned long long') for 2nd argument static inline void io_uring_prep_cancel(struct io_uring_sqe *sqe, ``` User data is set using `io_uring_set_data` API so no need to pass posix_handle here. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10165 Test Plan: CircleCI jobs Reviewed By: jay-zhuang Differential Revision: D37145233 Pulled By: akankshamahajan15 fbshipit-source-id: 05da650e1240e9c6fcc8aed5f0067308dccb164a 14 June 2022, 19:35:11 UTC
f105e1a Make the per-shard hash table fixed-size. (#10154) Summary: We make the size of the per-shard hash table fixed. The base level of the hash table is now preallocated with the required capacity. The user must provide an estimate of the size of the values. Notice that even though the base level becomes fixed, the chains are still dynamic. Overall, the shard capacity mechanisms haven't changed, so we don't need to test this. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10154 Test Plan: `make -j24 check` Reviewed By: pdillinger Differential Revision: D37124451 Pulled By: guidotag fbshipit-source-id: cba6ac76052fe0ec60b8ff4211b3de7650e80d0c 14 June 2022, 03:29:00 UTC
bfaf829 Fix a race condition in transaction stress test (#10157) Summary: Before this PR, there can be a race condition between the thread calling `StressTest::Open()` and a background compaction thread calling `MultiOpsTxnsStressTest::VerifyPkSkFast()`. ``` Time thread1 bg_compact_thr | TransactionDB::Open(..., &txn_db_) | db_ is still nullptr | db_->GetSnapshot() // segfault | db_ = txn_db_ V ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/10157 Test Plan: CI Reviewed By: akankshamahajan15 Differential Revision: D37121653 Pulled By: riversand963 fbshipit-source-id: 6a53117f958e9ee86f77297fdeb843e5160a9331 14 June 2022, 01:54:38 UTC
c0e0f30 Implement AbortIO using io_uring (#10125) Summary: Implement AbortIO in posix using io_uring to cancel any pending read requests submitted. Its cancelled using io_uring_prep_cancel which sets the IORING_OP_ASYNC_CANCEL flag. To cancel a request, the sqe must have ->addr set to the user_data of the request it wishes to cancel. If the request is cancelled successfully, the original request is completed with -ECANCELED and the cancel request is completed with a result of 0. If the request was already running, the original may or may not complete in error. The cancel request will complete with -EALREADY for that case. And finally, if the request to cancel wasn't found, the cancel request is completed with -ENOENT. Reference: https://kernel.dk/io_uring-whatsnew.pdf, https://lore.kernel.org/io-uring/d9a8d76d23690842f666c326631ecc2d85b6c1bc.1615566409.git.asml.silence@gmail.com/ Pull Request resolved: https://github.com/facebook/rocksdb/pull/10125 Test Plan: Existing Posix tests. Reviewed By: anand1976 Differential Revision: D36946970 Pulled By: akankshamahajan15 fbshipit-source-id: 3bc1f1521b3151d01a348fc6431eb3fc85db3a14 14 June 2022, 01:07:24 UTC
04bd347 Increase num_levels for universal from 8 to 40 (#10158) Summary: See https://github.com/facebook/rocksdb/issues/10082 for more details. Trivial move isn't done for universal when compaction is from L0 into L0. So a too small value for num_levels with db_bench means there will be fewer trivial moves with universal and that means that write-amp will increase. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10158 Test Plan: run it Reviewed By: siying Differential Revision: D37122519 Pulled By: mdcallag fbshipit-source-id: 1cb39049676f68a6cc3ea8d105a9965f89d4d09e 13 June 2022, 23:24:32 UTC
ad135f3 Document design/specification bugs with auto_prefix_mode (#10144) Summary: auto_prefix_mode is designed to use prefix filtering in a particular "safe" set of cases where the upper bound and the seek key have different prefixes: where the upper bound is the "same length immediate successor". These conditions are not sufficient to guarantee the same iteration results as total_order_seek if the DB contains "short" keys, less than the "full" (maximum) prefix length. We are not simply disabling the optimization in these successor cases because it is likely that users are essentially getting what they want out of existing usage. Especially if users are constructing successor bounds with the intention of doing a prefix-bounded seek, the existing behavior is more expected than the total_order_seek behavior. Consequently, for now we reconcile the bad specification of behavior by documenting the existing mismatch with total_order_seek. A closely related issue affects hypothetical comparators like ReverseBytewiseComparator: if they "correctly" implement IsSameLengthImmediateSuccessor, auto_prefix_mode could omit more entries (other than "short" keys noted above). Luckily, the built-in ReverseBytewiseComparator has an "incorrect" implementation of IsSameLengthImmediateSuccessor that effectively prevents prefix optimization and, thus, the bug. This is now documented as a new constraint on IsSameLengthImmediateSuccessor, and the implementation tweaked to be simply "safe" rather than "incorrect". This change also includes unit test updates to demonstrate the above issues. (Test was cleaned up for readability and simplicity.) Intended follow-up: * Tweak documented axioms for prefix_extractor (more details then) * Consider some sort of fix for this case. I don't know what that would look like without breaking the performance of existing code. Perhaps if all keys in an SST file have prefixes that are "full length," we can track that fact and use it to allow optimization with the "same length immediate successor", but that would only apply to new files. * Consider a better system of specifying prefix bounds Pull Request resolved: https://github.com/facebook/rocksdb/pull/10144 Test Plan: test updates included Reviewed By: siying Differential Revision: D37052710 Pulled By: pdillinger fbshipit-source-id: 5f63b7d65f3f214e4b143e0f9aa1749527c587db 13 June 2022, 18:08:50 UTC
8273435 Bypass tests instead of skipping to resolve internal failure (#10148) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10148 Reviewed By: hx235 Differential Revision: D37092202 Pulled By: akankshamahajan15 fbshipit-source-id: 12fae5641a1c4ab584e586db95f4044273aba23a 12 June 2022, 19:05:11 UTC
415200d Assume fixed size key (#10137) Summary: FastLRUCache now only supports 16B keys. The tests have changed to reflect this. Because the unit tests were designed for caches that accept any string as keys, some tests are no longer compatible with FastLRUCache. We have disabled those for runs with FastLRUCache. (We could potentially change all tests to use 16B keys, but we don't because the cache public API does not require this.) Pull Request resolved: https://github.com/facebook/rocksdb/pull/10137 Test Plan: make -j24 check Reviewed By: gitbw95 Differential Revision: D37083934 Pulled By: guidotag fbshipit-source-id: be1719cf5f8364a9a32bc4555bce1a0de3833b0d 11 June 2022, 02:12:18 UTC
80afa77 Run fadvise with mmap file (#10142) Summary: Right now with mmap file, we don't run fadvise following users' requests. There is no reason for that so this diff does that. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10142 Test Plan: A simple readrandom against files with page cache dropped shows latency improvement from 7.8 us to 2.8: ./db_bench -use_existing_db --benchmarks=readrandom --num=100 Reviewed By: anand1976 Differential Revision: D37074975 fbshipit-source-id: ccc72bcac1b5fd634eb8fa2b6a5d9afe332e0bf6 10 June 2022, 23:34:01 UTC
1777e5f Snapshots with user-specified timestamps (#9879) Summary: In RocksDB, keys are associated with (internal) sequence numbers which denote when the keys are written to the database. Sequence numbers in different RocksDB instances are unrelated, thus not comparable. It is nice if we can associate sequence numbers with their corresponding actual timestamps. One thing we can do is to support user-defined timestamp, which allows the applications to specify the format of custom timestamps and encode a timestamp with each key. More details can be found at https://github.com/facebook/rocksdb/wiki/User-defined-Timestamp-%28Experimental%29. This PR provides a different but complementary approach. We can associate rocksdb snapshots (defined in https://github.com/facebook/rocksdb/blob/7.2.fb/include/rocksdb/snapshot.h#L20) with **user-specified** timestamps. Since a snapshot is essentially an object representing a sequence number, this PR establishes a bi-directional mapping between sequence numbers and timestamps. In the past, snapshots are usually taken by readers. The current super-version is grabbed, and a `rocksdb::Snapshot` object is created with the last published sequence number of the super-version. You can see that the reader actually has no good idea of what timestamp to assign to this snapshot, because by the time the `GetSnapshot()` is called, an arbitrarily long period of time may have already elapsed since the last write, which is when the last published sequence number is written. This observation motivates the creation of "timestamped" snapshots on the write path. Currently, this functionality is exposed only to the layer of `TransactionDB`. Application can tell RocksDB to create a snapshot when a transaction commits, effectively associating the last sequence number with a timestamp. It is also assumed that application will ensure any two snapshots with timestamps should satisfy the following: ``` snapshot1.seq < snapshot2.seq iff. snapshot1.ts < snapshot2.ts ``` If the application can guarantee that when a reader takes a timestamped snapshot, there is no active writes going on in the database, then we also allow the user to use a new API `TransactionDB::CreateTimestampedSnapshot()` to create a snapshot with associated timestamp. Code example ```cpp // Create a timestamped snapshot when committing transaction. txn->SetCommitTimestamp(100); txn->SetSnapshotOnNextOperation(); txn->Commit(); // A wrapper API for convenience Status Transaction::CommitAndTryCreateSnapshot( std::shared_ptr<TransactionNotifier> notifier, TxnTimestamp ts, std::shared_ptr<const Snapshot>* ret); // Create a timestamped snapshot if caller guarantees no concurrent writes std::pair<Status, std::shared_ptr<const Snapshot>> snapshot = txn_db->CreateTimestampedSnapshot(100); ``` The snapshots created in this way will be managed by RocksDB with ref-counting and potentially shared with other readers. We provide the following APIs for readers to retrieve a snapshot given a timestamp. ```cpp // Return the timestamped snapshot correponding to given timestamp. If ts is // kMaxTxnTimestamp, then we return the latest timestamped snapshot if present. // Othersise, we return the snapshot whose timestamp is equal to `ts`. If no // such snapshot exists, then we return null. std::shared_ptr<const Snapshot> TransactionDB::GetTimestampedSnapshot(TxnTimestamp ts) const; // Return the latest timestamped snapshot if present. std::shared_ptr<const Snapshot> TransactionDB::GetLatestTimestampedSnapshot() const; ``` We also provide two additional APIs for stats collection and reporting purposes. ```cpp Status TransactionDB::GetAllTimestampedSnapshots( std::vector<std::shared_ptr<const Snapshot>>& snapshots) const; // Return timestamped snapshots whose timestamps fall in [ts_lb, ts_ub) and store them in `snapshots`. Status TransactionDB::GetTimestampedSnapshots( TxnTimestamp ts_lb, TxnTimestamp ts_ub, std::vector<std::shared_ptr<const Snapshot>>& snapshots) const; ``` To prevent the number of timestamped snapshots from growing infinitely, we provide the following API to release timestamped snapshots whose timestamps are older than or equal to a given threshold. ```cpp void TransactionDB::ReleaseTimestampedSnapshotsOlderThan(TxnTimestamp ts); ``` Before shutdown, RocksDB will release all timestamped snapshots. Comparison with user-defined timestamp and how they can be combined: User-defined timestamp persists every key with a timestamp, while timestamped snapshots maintain a volatile mapping between snapshots (sequence numbers) and timestamps. Different internal keys with the same user key but different timestamps will be treated as different by compaction, thus a newer version will not hide older versions (with smaller timestamps) unless they are eligible for garbage collection. In contrast, taking a timestamped snapshot at a certain sequence number and timestamp prevents all the keys visible in this snapshot from been dropped by compaction. Here, visible means (seq < snapshot and most recent). The timestamped snapshot supports the semantics of reading at an exact point in time. Timestamped snapshots can also be used with user-defined timestamp. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9879 Test Plan: ``` make check TEST_TMPDIR=/dev/shm make crash_test_with_txn ``` Reviewed By: siying Differential Revision: D35783919 Pulled By: riversand963 fbshipit-source-id: 586ad905e169189e19d3bfc0cb0177a7239d1bd4 10 June 2022, 23:07:03 UTC
f4052d1 Enable SecondaryCache::CreateFromString to create sec cache based on the uri for CompressedSecondaryCache (#10132) Summary: Update SecondaryCache::CreateFromString and enable it to create sec cache based on the uri for CompressedSecondaryCache. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10132 Test Plan: Add unit tests. Reviewed By: anand1976 Differential Revision: D36996997 Pulled By: gitbw95 fbshipit-source-id: 882ad563cff6d38b306a53426ad7e47273f34edc 10 June 2022, 19:23:10 UTC
d3a3b02 Fix bug with kHashSearch and changing prefix_extractor with SetOptions (#10128) Summary: When opening an SST file created using index_type=kHashSearch, the *current* prefix_extractor would be saved, and used with hash index if the *new current* prefix_extractor at query time is compatible with the SST file. This is a problem if the prefix_extractor at SST open time is not compatible but SetOptions later changes (back) to one that is compatible. This change fixes that by using the known compatible (or missing) prefix extractor we save for use with prefix filtering. Detail: I have moved the InternalKeySliceTransform wrapper to avoid some indirection and remove unnecessary fields. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10128 Test Plan: expanded unit test (using some logic from https://github.com/facebook/rocksdb/issues/10122) that fails before fix and probably covers some other previously uncovered cases. Reviewed By: siying Differential Revision: D36955738 Pulled By: pdillinger fbshipit-source-id: 0c78a6b0d24054ef2f3cb237bf010c1c5589fb10 10 June 2022, 15:51:45 UTC
693dffd Return try again when full_history_ts_low is higher than requested ts (#10126) Summary: This PR helps handle the race condition mentioned in this comment thread: https://github.com/facebook/rocksdb/pull/7884#discussion_r572402281 In case where actual full_history_ts_low is higher than the user's requested ts, return a try again message so they don't have the misconception that data between [ts, full_history_ts_low) is kept. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10126 Test Plan: ``` $COMPILE_WITH_ASAN=1 make -j24 all $./db_with_timestamp_basic_test --gtest_filter=UpdateFullHistoryTsLowTest.ConcurrentUpdate $ make -j24 check ``` Reviewed By: riversand963 Differential Revision: D37055368 Pulled By: jowlyzhang fbshipit-source-id: 787fd0984a246540fa03ac227b1d232590d27828 10 June 2022, 15:21:08 UTC
5fa6ef7 Fix fragile CacheTest::ApplyToAllEntriesDuringResize (#10145) Summary: As seen in https://github.com/facebook/rocksdb/issues/10137, simply churning the cache key hashes (e.g. by changing the raw cache keys) could trigger failure in this test, due to possibility of some cache shard exceeding its portion of capacity and evicting entries. Updated the test to be less fragile by using greater margins, and added a pre-check for evictions, which doesn't manifest as a race condition, before the main check that can race. Also added stack trace handler to cache_test for debugging. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10145 Test Plan: test thousands of iterations with gtest-parallel, including with changes in https://github.com/facebook/rocksdb/issues/10137 that were surfacing the problem. Pre-check without the fix would always fail with https://github.com/facebook/rocksdb/issues/10137 Reviewed By: guidotag Differential Revision: D37058771 Pulled By: pdillinger fbshipit-source-id: a7cf137967aef49c07ae9602d8523c63e7388fab 10 June 2022, 02:43:19 UTC
1a3e23a Update jemalloc version for platform009 (#10143) Summary: Update jemalloc version for platform009. Current one is a bit old and the new one can bring some quick wins (e.g. new heap profiling features on devserver). Pull Request resolved: https://github.com/facebook/rocksdb/pull/10143 Test Plan: 1. The building and testing on devserver should work. 2. `db_bench` with `--dump_malloc_stats` `./db_bench --benchmarks=fillrandom --num=10000000 -db=/db_bench_1 ` `./db_bench --benchmarks=overwrite,stats --num=10000000 -use_existing_db -duration=10 --benchmark_write_rate_limit=2000000 -db=/db_bench_1 ` `./db_bench --benchmarks=seekrandom,stats --threads=16 --num=10000000 -use_existing_db -duration=120 --benchmark_write_rate_limit=52000000 -use_direct_reads --cache_size=520000000 --statistics -db=/db_bench_1 --dump_malloc_stats=true` Before this PR: jemalloc Version: "5.2.1-1303-g73b8faa7149e452f93e52005c89459da08343570" After this PR: jemalloc Version: Reviewed By: anand1976 Differential Revision: D37049347 Pulled By: gitbw95 fbshipit-source-id: 3fcd82cca989047b4bbdfdebe5beba2c4c255ed8 10 June 2022, 00:13:13 UTC
ecfd4ae Enable wal_compression in crash_tests (#10141) Summary: Same as title Pull Request resolved: https://github.com/facebook/rocksdb/pull/10141 Test Plan: ``` export CRASH_TEST_EXT_ARGS=" --wal_compression=zstd" make crash_test -j ``` Reviewed By: riversand963 Differential Revision: D37042810 Pulled By: akankshamahajan15 fbshipit-source-id: 53f0793d78241f1b5c954dcc808cb4c0a3e9172a 09 June 2022, 19:08:01 UTC
f85b31a Fix bug for WalManager with compressed WAL (#10130) Summary: RocksDB uses WalManager to manage WAL files. In WalManager::ReadFirstLine(), the assumption is that reading the first record of a valid WAL file will return OK status and set the output sequence to non-zero value. This assumption has been broken by WAL compression which writes a `kSetCompressionType` record which is not associated with any sequence number. Consequently, WalManager::GetSortedWalsOfType() will skip these WALs and not return them to caller, e.g. Checkpoint, Backup, causing the operations to fail. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10130 Test Plan: - Newly Added test Reviewed By: riversand963 Differential Revision: D36985744 Pulled By: akankshamahajan15 fbshipit-source-id: dfde7b3be68b6a30b75b49479779748eedf29f7f 08 June 2022, 21:16:43 UTC
9efae14 Fix parsing of db_bench output (#10124) Summary: A recent diff add a few more fields to one of the db_bench output lines that gets parsed. This diff updates tools/benchmark.sh to handle that. overwrite : 7.939 micros/op 125963 ops/sec; 50.5 MB/s overwrite : 7.854 micros/op 127320 ops/sec 1800.001 seconds 229176999 operations; 51.0 MB/s Pull Request resolved: https://github.com/facebook/rocksdb/pull/10124 Test Plan: Run it Reviewed By: jay-zhuang Differential Revision: D36945137 Pulled By: mdcallag fbshipit-source-id: 9c96f79491411da997e369a3be9c6b921a21d0fa 08 June 2022, 16:23:36 UTC
f890527 Update test for secondary instance in stress test (#10121) Summary: This PR updates secondary instance testing in stress test by default. A background thread will be started (disabled by default), running a secondary instance tailing the logs of the primary. Periodically (every 1 sec), this thread calls `TryCatchUpWithPrimary()` and uses point lookup or range scan to read some random keys with only very basic verification to make sure no assertion failure is triggered. Thanks to https://github.com/facebook/rocksdb/issues/10061 , we can enable secondary instance when user-defined timestamp is enabled. Also removed a less useful test configuration, `secondary_catch_up_one_in`. This is very similar to the periodic catch-up. In the last commit, I decided not to enable it now, but just update the tests, since secondary instance does not work well when the underlying file is renamed by primary, e.g. SstFileManager. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10121 Test Plan: ``` TEST_TMPDIR=/dev/shm/rocksdb make crash_test TEST_TMPDIR=/dev/shm/rocksdb make crash_test_with_ts TEST_TMPDIR=/dev/shm/rocksdb make crash_test_with_atomic_flush ``` Reviewed By: ajkr Differential Revision: D36939458 Pulled By: riversand963 fbshipit-source-id: 1c065b7efc3690fc341569b9d369a5cbd8ef6b3e 08 June 2022, 04:07:47 UTC
ff32346 Set db_stress defaults for TSAN deadlock detector (#10131) Summary: After https://github.com/facebook/rocksdb/issues/9357 we began seeing the following error attempting to acquire locks for file ingestion: ``` FATAL: ThreadSanitizer CHECK failed: /home/engshare/third-party2/llvm-fb/12/src/llvm/compiler-rt/lib/sanitizer_common/sanitizer_deadlock_detector.h:67 "((n_all_locks_)) < (((sizeof(all_locks_with_contexts_)/sizeof((all_locks_with_contexts_)[0]))))" (0x40, 0x40) ``` The command was using default values for `ingest_external_file_width` (1000) and `log2_keys_per_lock` (2). The expected number of locks needed to update those keys is then (1000 / 2^2) = 250, which is above the 0x40 (64) limit. This PR reduces the default value of `ingest_external_file_width` to 100 so the expected number of locks is 25, which is within the limit. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10131 Reviewed By: ltamasi Differential Revision: D36986307 Pulled By: ajkr fbshipit-source-id: e918cdb2fcc39517d585f1e5fd2539e185ada7c1 07 June 2022, 22:15:09 UTC
5cbee1f Add unit test to verify that the dynamic priority can be passed from compaction to FS (#10088) Summary: **Summary:** Add unit tests to verify that the dynamic priority can be passed from compaction to FS. Compaction reads&writes and other DB reads&writes share the same read&write paths to FSRandomAccessFile or FSWritableFile, so a MockTestFileSystem is added to replace the default filesystem from Env to intercept and verify the io_priority. To prepare the compaction input files, use the default filesystem from Env. To test the io priority of the compaction reads and writes, db_options_.fs is set as MockTestFileSystem. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10088 Test Plan: Add unit tests. Reviewed By: anand1976 Differential Revision: D36882528 Pulled By: gitbw95 fbshipit-source-id: 120adc15801966f2b8c9fc45285f590a3fff96d1 07 June 2022, 18:57:12 UTC
b6de139 Handle "NotSupported" status by default implementation of Close() in … (#10127) Summary: The default implementation of Close() function in Directory/FSDirectory classes returns `NotSupported` status. However, we don't want operations that worked in older versions to begin failing after upgrading when run on FileSystems that have not implemented Directory::Close() yet. So we require the upper level that calls Close() function should properly handle "NotSupported" status instead of treating it as an error status. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10127 Reviewed By: ajkr Differential Revision: D36971112 Pulled By: littlepig2013 fbshipit-source-id: 100f0e6ad1191e1acc1ba6458c566a11724cf466 07 June 2022, 16:49:31 UTC
3ee6c9b Consolidate manual_compaction_paused_ check (#10070) Summary: As pointed out by [https://github.com/facebook/rocksdb/pull/8351#discussion_r645765422](https://github.com/facebook/rocksdb/pull/8351#discussion_r645765422), check `manual_compaction_paused` and `manual_compaction_canceled` can be reduced by setting `*canceled` to be true in `DisableManualCompaction()` and `*canceled` to be false in the last time calling `EnableManualCompaction()`. Changed Tests: The origin `DBTest2.PausingManualCompaction1` uses a callback function to increase `manual_compaction_paused` and the origin CompactionJob/CompactionIterator with `manual_compaction_paused` can detect this. I changed the callback function so that it sets `*canceled` as true if `canceled` is not `nullptr` (to notify CompactionJob/CompactionIterator the compaction has been canceled). Pull Request resolved: https://github.com/facebook/rocksdb/pull/10070 Test Plan: This change does not introduce new features, but some slight difference in compaction implementation. Run the same manual compaction unit tests as before (e.g., PausingManualCompaction[1-4], CancelManualCompaction[1-2], CancelManualCompactionWithListener in db_test2, and db_compaction_test). Reviewed By: ajkr Differential Revision: D36949133 Pulled By: littlepig2013 fbshipit-source-id: c5dc4c956fbf8f624003a0f5ad2690240063a821 07 June 2022, 01:32:26 UTC
a101c9d Return "invalid argument" when read timestamp is too old (#10109) Summary: With this change, when a given read timestamp is smaller than the column-family's full_history_ts_low, Get(), MultiGet() and iterators APIs will return Status::InValidArgument(). Test plan ``` $COMPILE_WITH_ASAN=1 make -j24 all $./db_with_timestamp_basic_test --gtest_filter=DBBasicTestWithTimestamp.UpdateFullHistoryTsLow $ make -j24 check ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/10109 Reviewed By: riversand963 Differential Revision: D36901126 Pulled By: jowlyzhang fbshipit-source-id: 255feb1a66195351f06c1d0e42acb1ff74527f86 06 June 2022, 21:36:22 UTC
9f244b2 Fix default implementaton of close() function for Directory/FSDirecto… (#10123) Summary: As pointed by anand1976 in his [comment](https://github.com/facebook/rocksdb/pull/10049#pullrequestreview-994255819), previous implementation (adding Close() function in Directory/FSDirectory class) is not backward-compatible. And we mistakenly added the default implementation `return Status::NotSupported("Close")` or `return IOStatus::NotSupported("Close")` in WritableFile class in this [pull request](https://github.com/facebook/rocksdb/pull/10101). This pull request fixes the above issue. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10123 Reviewed By: ajkr Differential Revision: D36943661 Pulled By: littlepig2013 fbshipit-source-id: 9dc45f4d2ab3a9d51c30bdfde679f1d13c4d5509 06 June 2022, 21:27:31 UTC
2af132c Fix overflow bug in standard deviation computation. (#10100) Summary: There was an overflow bug when computing the variance in the HistogramStat class. This manifests, for instance, when running cache_bench with default arguments. This executes 32M lookups/inserts/deletes in a block cache, and then computes (among other things) the variance of the latencies. The variance is computed as ``variance = (cur_sum_squares * cur_num - cur_sum * cur_sum) / (cur_num * cur_num)``, where ``cum_sum_squares`` is the sum of the squares of the samples, ``cur_num`` is the number of samples, and ``cur_sum`` is the sum of the samples. Because the median latency in a typical run is around 3800 nanoseconds, both the ``cur_sum_squares * cur_num`` and ``cur_sum * cur_sum`` terms overflow as uint64_t. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10100 Test Plan: Added a unit test. Run ``make -j24 histogram_test && ./histogram_test``. Reviewed By: pdillinger Differential Revision: D36942738 Pulled By: guidotag fbshipit-source-id: 0af5fb9e2a297a284e8e74c24e604d302906006e 06 June 2022, 20:53:47 UTC
4f78f96 Refactor: Add BlockTypes to make them imply C++ type in block cache (#10098) Summary: We have three related concepts: * BlockType: an internal enum conceptually indicating a type of SST file block * CacheEntryRole: a user-facing enum for categorizing block cache entries, which is also involved in associated cache entries with an appropriate deleter. Can include categories for non-block cache entries (e.g. memory reservations). * TBlocklike: a C++ type for the actual type behind a void* cache entry. We had some existing code ugliness because BlockType did not imply TBlocklike, because of various kinds of "filter" block. This refactoring fixes that with new BlockTypes. More clean-up can come in later work. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10098 Test Plan: existing tests Reviewed By: akankshamahajan15 Differential Revision: D36897945 Pulled By: pdillinger fbshipit-source-id: 3ae496b5caa81e0a0ed85e873eb5b525e2d9a295 06 June 2022, 18:16:12 UTC
e36008d Disable CI benchmark from #9723 (#10119) Summary: The script is broken. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10119 Reviewed By: ltamasi Differential Revision: D36928076 Pulled By: jay-zhuang fbshipit-source-id: f325cfd00869c506c64573fe8192cb5b561825d6 06 June 2022, 05:28:50 UTC
2f4a0ff CI Benchmarking with CircleCI Runner and OpenSearch Dashboard (EB 1088) (#9723) Summary: CircleCI runner based benchmarking. A runner is a dedicate machine configured for CircleCI to perform work on. Our work is a repeatable benchmark, the `benchmark-linux` job in `config.yml` A runner, in CircleCI terminology, is a machine that is managed by the client (us) rather than running on CircleCI resources in the cloud. This means that we define and configure the iron, and that therefore the performance is repeatable and predictable. Which is what we need for performance regression benchmarking. On a time schedule (or on commit, during branch development) benchmarks are set off on the runner, and then a script is run `benchmark_log_tool.py` which parses the benchmark output and pushes it into a pre-configured OpenSearch document connected to an OpenSearch dashboard. Members of the team can examine benchmark performance changes on the dashboard. As time progresses we can add different benchmarks to the suite which gets run. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9723 Reviewed By: pdillinger Differential Revision: D35555626 Pulled By: jay-zhuang fbshipit-source-id: c6a905ca04494495c3784cfbb991f5ab90c807ee 04 June 2022, 16:31:47 UTC
560906a Add a simple example of backup and restore (#10054) Summary: Add a simple example of backup and restore Signed-off-by: YiteGu <ess_gyt@qq.com> Pull Request resolved: https://github.com/facebook/rocksdb/pull/10054 Reviewed By: jay-zhuang Differential Revision: D36678141 Pulled By: ajkr fbshipit-source-id: 43545356baddb4c2c76c62cd63d7a3238d1f8a00 04 June 2022, 06:25:31 UTC
e9c74bc Add wide column serialization primitives (#9915) Summary: The patch adds some low-level logic that can be used to serialize/deserialize a sorted vector of wide columns to/from a simple binary searchable string representation. Currently, there is no user-facing API; this will be implemented in subsequent stages. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9915 Test Plan: `make check` Reviewed By: siying Differential Revision: D35978076 Pulled By: ltamasi fbshipit-source-id: 33f5f6628ec3bcd8c8beab363b1978ac047a8788 04 June 2022, 03:54:48 UTC
3e02c6e Point-lookup returns timestamps of Delete and SingleDelete (#10056) Summary: If caller specifies a non-null `timestamp` argument in `DB::Get()` or a non-null `timestamps` in `DB::MultiGet()`, RocksDB will return the timestamps of the point tombstones. Note: DeleteRange is still unsupported. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10056 Test Plan: make check Reviewed By: ltamasi Differential Revision: D36677956 Pulled By: riversand963 fbshipit-source-id: 2d7af02cc7237b1829cd269086ea895a49d501ae 04 June 2022, 03:00:42 UTC
4bdcc80 Increase ChargeTableReaderTest/ChargeTableReaderTest.Basic error tolerance rate from 1% to 5% (#10113) Summary: **Context:** https://github.com/facebook/rocksdb/pull/9748 added support to charge table reader memory to block cache. In the test `ChargeTableReaderTest/ChargeTableReaderTest.Basic`, it estimated the table reader memory, calculated the expected number of table reader opened based on this estimation and asserted this number with actual number. The expected number of table reader opened calculated based on estimated table reader memory will not be 100% accurate and should have tolerance for error. It was previously set to 1% and recently encountered an assertion failure that `(opened_table_reader_num) <= (max_table_reader_num_capped_upper_bound), actual: 375 or 376 vs 374` where `opened_table_reader_num` is the actual opened one and `max_table_reader_num_capped_upper_bound` is the estimated opened one (=371 * 1.01). I believe it's safe to increase error tolerance from 1% to 5% hence there is this PR. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10113 Test Plan: - CI again succeeds. Reviewed By: ajkr Differential Revision: D36911556 Pulled By: hx235 fbshipit-source-id: 259687dd77b450fea0f5658a5b567a1d31d4b1f7 04 June 2022, 02:42:22 UTC
c1018b7 cmake: add an option to skip thirdparty.inc on Windows (#10110) Summary: When building RocksDB with getdeps on Windows, `thirdparty.inc` get in the way since `FindXXXX.cmake` are working properly now. This PR adds an option to skip that file when building RocksDB so we can disable it. FB: see [D36905191](https://www.internalfb.com/diff/D36905191). Pull Request resolved: https://github.com/facebook/rocksdb/pull/10110 Reviewed By: siying Differential Revision: D36913882 Pulled By: fanzeyi fbshipit-source-id: 33d36841dc0d4fe87f51e1d9fd2b158a3adab88f 04 June 2022, 02:20:34 UTC
7d36bc4 Fix some bugs in verify_random_db.sh (#10112) Summary: The patch attempts to fix three bugs in `verify_random_db.sh`: 1) https://github.com/facebook/rocksdb/pull/9937 changed the default for `--try_load_options` to true in the script's use case, so we have to explicitly set it to false if the corresponding argument of the script is 0. This should fix the issue we've been seeing with our forward compatibility tests where 7.3 is unable to open a database created by the version on main after adding a new configuration option. 2) The script seems to support two "extra parameters"; however, in practice, if the second one was set, only that one was passed on to `ldb`. Now both get forwarded. 3) When running the `diff` command, the base DB directory was passed as the second argument instead of the file containing the `ldb` output (this actually seems to work, probably accidentally though). Pull Request resolved: https://github.com/facebook/rocksdb/pull/10112 Reviewed By: pdillinger Differential Revision: D36911363 Pulled By: ltamasi fbshipit-source-id: fe29db4e28d373cee51a12322c59050fc50e926d 03 June 2022, 23:35:13 UTC
d739de6 Fix a bug in WAL tracking (#10087) Summary: Closing https://github.com/facebook/rocksdb/issues/10080 When `SyncWAL()` calls `MarkLogsSynced()`, even if there is only one active WAL file, this event should still be added to the MANIFEST. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10087 Test Plan: make check Reviewed By: ajkr Differential Revision: D36797580 Pulled By: riversand963 fbshipit-source-id: 24184c9dd606b3939a454ed41de6e868d1519999 03 June 2022, 23:33:00 UTC
eb99e08 Add support for FastLRUCache in cache_bench (#10095) Summary: cache_bench can now run with FastLRUCache. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10095 Test Plan: - Temporarily add an ``assert(false)`` in the execution path that sets up the FastLRUCache. Run ``make -j24 cache_bench``. Then test the appropriate code is used by running ``./cache_bench -cache_type=fast_lru_cache`` and checking that the assert is called. Repeat for LRUCache. - Verify that FastLRUCache (currently a clone of LRUCache) has similar latency distribution than LRUCache, by comparing the outputs of ``./cache_bench -cache_type=fast_lru_cache`` and ``./cache_bench -cache_type=lru_cache``. Reviewed By: pdillinger Differential Revision: D36875834 Pulled By: guidotag fbshipit-source-id: eb2ad0bb32c2717a258a6ac66ed736e06f826cd8 03 June 2022, 20:40:09 UTC
21906d6 Add default impl to dir close (#10101) Summary: As pointed by anand1976 in his [comment](https://github.com/facebook/rocksdb/pull/10049#pullrequestreview-994255819), previous implementation is not backward-compatible. In this implementation, the default implementation `return Status::NotSupported("Close")` or `return IOStatus::NotSupported("Close")` is added for `Close()` function for `*Directory` classes. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10101 Test Plan: DBBasicTest.DBCloseAllDirectoryFDs Reviewed By: anand1976 Differential Revision: D36899346 Pulled By: littlepig2013 fbshipit-source-id: 430624793362f330cbb8837960f0e8712a944ab9 03 June 2022, 19:53:28 UTC
cf85607 Add support for FastLRUCache in db_bench. (#10096) Summary: db_bench can now run with FastLRUCache. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10096 Test Plan: - Temporarily add an ``assert(false)`` in the execution path that sets up the FastLRUCache. Run ``make -j24 db_bench``. Then test the appropriate code is used by running ``./db_bench -cache_type=fast_lru_cache`` and checking that the assert is called. Repeat for LRUCache. - Verify that FastLRUCache (currently a clone of LRUCache) produces similar benchmark data than LRUCache, by comparing the outputs of ``./db_bench -benchmarks=fillseq,fillrandom,readseq,readrandom -cache_type=fast_lru_cache`` and ``./db_bench -benchmarks=fillseq,fillrandom,readseq,readrandom -cache_type=lru_cache``. Reviewed By: gitbw95 Differential Revision: D36898774 Pulled By: guidotag fbshipit-source-id: f9f6b6f6da124f88b21b3c8dee742fbb04eff773 03 June 2022, 18:16:49 UTC
2b3c50c Temporarily disable wal compression (#10108) Summary: Will re-enable after fixing the bug in https://github.com/facebook/rocksdb/issues/10099 and https://github.com/facebook/rocksdb/issues/10097. Right now, the priority is https://github.com/facebook/rocksdb/issues/10087, but the bug in WAL compression prevents the mini crash test from passing. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10108 Reviewed By: pdillinger Differential Revision: D36897214 Pulled By: riversand963 fbshipit-source-id: d64dc52738222d5f66003f7731dc46eaeed812be 03 June 2022, 17:22:52 UTC
5506954 Enhance to support more tuning options, and universal and integrated… (#9704) Summary: … BlobDB for all tests This does two big things: * provides more tuning options * supports universal and integrated BlobDB for all of the benchmarks that are leveled-only It does several smaller things, and I will list a few * sets l0_slowdown_writes_trigger which wasn't set before this diff. * improves readability in report.tsv by using smaller field names in the header * adds more columns to report.tsv report.tsv before this diff: ``` ops_sec mb_sec total_size_gb level0_size_gb sum_gb write_amplification write_mbps usec_op percentile_50 percentile_75 percentile_99 percentile_99.9 percentile_99.99 uptime stall_time stall_percent test_name test_date rocksdb_version job_id 823294 329.8 0.0 21.5 21.5 1.0 183.4 1.2 1.0 1.0 3 6 14 120 00:00:0.000 0.0 fillseq.wal_disabled.v400 2022-03-16T15:46:45.000-07:00 7.0 326520 130.8 0.0 0.0 0.0 0.0 0 12.2 139.8 155.1 170 234 250 60 00:00:0.000 0.0 multireadrandom.t4 2022-03-16T15:48:47.000-07:00 7.0 86313 345.7 0.0 0.0 0.0 0.0 0 46.3 44.8 50.6 75 84 108 60 00:00:0.000 0.0 revrangewhilewriting.t4 2022-03-16T15:50:48.000-07:00 7.0 101294 405.7 0.0 0.1 0.1 1.0 1.6 39.5 40.4 45.9 64 75 103 62 00:00:0.000 0.0 fwdrangewhilewriting.t4 2022-03-16T15:52:50.000-07:00 7.0 258141 103.4 0.0 0.1 1.2 18.2 19.8 15.5 14.3 18.1 28 34 48 62 00:00:0.000 0.0 readwhilewriting.t4 2022-03-16T15:54:51.000-07:00 7.0 334690 134.1 0.0 7.6 18.7 4.2 308.8 12.0 11.8 13.7 21 30 62 62 00:00:0.000 0.0 overwrite.t4.s0 2022-03-16T15:56:53.000-07:00 7.0 ``` report.tsv with this diff: ``` ops_sec mb_sec lsm_sz blob_sz c_wgb w_amp c_mbps c_wsecs c_csecs b_rgb b_wgb usec_op p50 p99 p99.9 p99.99 pmax uptime stall% Nstall u_cpu s_cpu rss test date version job_id 831144 332.9 22GB 0.0GB, 21.7 1.0 185.1 264 262 0 0 1.2 1.0 3 6 14 9198 120 0.0 0 0.4 0.0 0.7 fillseq.wal_disabled.v400 2022-03-16T16:21:23 7.0 325229 130.3 22GB 0.0GB, 0.0 0.0 0 0 0 0 12.3 139.8 170 237 249 572 60 0.0 0 0.4 0.1 1.2 multireadrandom.t4 2022-03-16T16:23:25 7.0 312920 125.3 26GB 0.0GB, 11.1 2.6 189.3 115 113 0 0 12.8 11.8 21 34 1255 6442 60 0.2 1 0.7 0.1 0.6 overwritesome.t4.s0 2022-03-16T16:25:27 7.0 81698 327.2 25GB 0.0GB, 0.0 0.0 0 0 0 0 48.9 46.2 79 246 369 9445 60 0.0 0 0.4 0.1 1.4 revrangewhilewriting.t4 2022-03-16T16:30:21 7.0 92484 370.4 25GB 0.0GB, 0.1 1.5 1.1 1 0 0 0 43.2 42.3 75 103 110 9512 62 0.0 0 0.4 0.1 1.4 fwdrangewhilewriting.t4 2022-03-16T16:32:24 7.0 241661 96.8 25GB 0.0GB, 0.1 1.5 1.1 1 0 0 0 16.5 17.1 30 34 49 9092 62 0.0 0 0.4 0.1 1.4 readwhilewriting.t4 2022-03-16T16:34:27 7.0 305234 122.3 30GB 0.0GB, 12.1 2.7 201.7 127 124 0 0 13.1 11.8 21 128 1934 6339 62 0.0 0 0.7 0.1 0.7 overwrite.t4.s0 2022-03-16T16:36:30 7.0 ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/9704 Test Plan: run it Reviewed By: jay-zhuang Differential Revision: D36864627 Pulled By: mdcallag fbshipit-source-id: d5af1cfc258a16865210163fa6fd1b803ab1a7d3 03 June 2022, 15:20:10 UTC
7b2c014 Fix Java build (#10105) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10105 Reviewed By: cbi42 Differential Revision: D36891073 Pulled By: ltamasi fbshipit-source-id: 16487ec708fc96add2a1ebc2d98f6439dfc852ca 03 June 2022, 15:11:31 UTC
b8fe7df Fix LITE build (#10106) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10106 Reviewed By: cbi42 Differential Revision: D36891284 Pulled By: ltamasi fbshipit-source-id: 304ffa84549201659feb0b74d6ba54a83f08906b 03 June 2022, 06:42:41 UTC
e88d893 Add comments/permit unchecked error to close_db_dir pull requests (#10093) Summary: In [close_db_dir](https://github.com/facebook/rocksdb/pull/10049) pull request, some merging conflicts occurred (some comments and one line `s.PermitUncheckedError()` are missing). This pull request aims to put them back. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10093 Reviewed By: ajkr Differential Revision: D36884117 Pulled By: littlepig2013 fbshipit-source-id: 8c0e2a8793fc52804067c511843bd1ff4912c1c3 03 June 2022, 04:52:35 UTC
ed50ccd Install zstd on CircleCI linux (#10102) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10102 Reviewed By: jay-zhuang Differential Revision: D36885468 Pulled By: riversand963 fbshipit-source-id: 6ed5b62dda8fe0f4be4b66d09bdec0134cf4500c 03 June 2022, 04:38:29 UTC
e6432df Make it possible to enable blob files starting from a certain LSM tree level (#10077) Summary: Currently, if blob files are enabled (i.e. `enable_blob_files` is true), large values are extracted both during flush/recovery (when SST files are written into level 0 of the LSM tree) and during compaction into any LSM tree level. For certain use cases that have a mix of short-lived and long-lived values, it might make sense to support extracting large values only during compactions whose output level is greater than or equal to a specified LSM tree level (e.g. compactions into L1/L2/... or above). This could reduce the space amplification caused by large values that are turned into garbage shortly after being written at the price of some write amplification incurred by long-lived values whose extraction to blob files is delayed. In order to achieve this, we would like to do the following: - Add a new configuration option `blob_file_starting_level` (default: 0) to `AdvancedColumnFamilyOptions` (and `MutableCFOptions` and extend the related logic) - Instantiate `BlobFileBuilder` in `BuildTable` (used during flush and recovery, where the LSM tree level is L0) and `CompactionJob` iff `enable_blob_files` is set and the LSM tree level is `>= blob_file_starting_level` - Add unit tests for the new functionality, and add the new option to our stress tests (`db_stress` and `db_crashtest.py` ) - Add the new option to our benchmarking tool `db_bench` and the BlobDB benchmark script `run_blob_bench.sh` - Add the new option to the `ldb` tool (see https://github.com/facebook/rocksdb/wiki/Administration-and-Data-Access-Tool) - Ideally extend the C and Java bindings with the new option - Update the BlobDB wiki to document the new option. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10077 Reviewed By: ltamasi Differential Revision: D36884156 Pulled By: gangliao fbshipit-source-id: 942bab025f04633edca8564ed64791cb5e31627d 03 June 2022, 03:04:33 UTC
a020031 Add kLastTemperature as temperature high bound (#10044) Summary: Only used as temperature high bound for current code, may increase with more temperatures added. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10044 Test Plan: ci Reviewed By: siying Differential Revision: D36633410 Pulled By: jay-zhuang fbshipit-source-id: eecdfa7623c31778c31d789902eacf78aad7b482 02 June 2022, 20:10:49 UTC
3dc6eba Support specifying blob garbage collection parameters when CompactRange() (#10073) Summary: Garbage collection is generally controlled by the BlobDB configuration options `enable_blob_garbage_collection` and `blob_garbage_collection_age_cutoff`. However, there might be use cases where we would want to temporarily override these options while performing a manual compaction. (One use case would be doing a full key-space manual compaction with full=100% garbage collection age cutoff in order to minimize the space occupied by the database.) Our goal here is to make it possible to override the configured GC parameters when using the `CompactRange` API to perform manual compactions. This PR would involve: - Extending the `CompactRangeOptions` structure so clients can both force-enable and force-disable GC, as well as use a different cutoff than what's currently configured - Storing whether blob GC should actually be enabled during a certain manual compaction and the cutoff to use in the `Compaction` object (considering the above overrides) and passing it to `CompactionIterator` via `CompactionProxy` - Updating the BlobDB wiki to document the new options. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10073 Test Plan: Adding unit tests and adding the new options to the stress test tool. Reviewed By: ltamasi Differential Revision: D36848700 Pulled By: gangliao fbshipit-source-id: c878ef101d1c612429999f513453c319f75d78e9 02 June 2022, 02:40:26 UTC
65893ad Explicitly closing all directory file descriptors (#10049) Summary: Currently, the DB directory file descriptor is left open until the deconstruction process (`DB::Close()` does not close the file descriptor). To verify this, comment out the lines between `db_ = nullptr` and `db_->Close()` (line 512, 513, 514, 515 in ldb_cmd.cc) to leak the ``db_'' object, build `ldb` tool and run ``` strace --trace=open,openat,close ./ldb --db=$TEST_TMPDIR --ignore_unknown_options put K1 V1 --create_if_missing ``` There is one directory file descriptor that is not closed in the strace log. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10049 Test Plan: Add a new unit test DBBasicTest.DBCloseAllDirectoryFDs: Open a database with different WAL directory and three different data directories, and all directory file descriptors should be closed after calling Close(). Explicitly call Close() after a directory file descriptor is not used so that the counter of directory open and close should be equivalent. Reviewed By: ajkr, hx235 Differential Revision: D36722135 Pulled By: littlepig2013 fbshipit-source-id: 07bdc2abc417c6b30997b9bbef1f79aa757b21ff 02 June 2022, 01:03:34 UTC
b4d0e04 Add support for FastLRUCache in stress and crash tests. (#10081) Summary: Stress tests can run with the experimental FastLRUCache. Crash tests randomly choose between LRUCache and FastLRUCache. Since only LRUCache supports a secondary cache, we validate the `--secondary_cache_uri` and `--cache_type` flags---when `--secondary_cache_uri` is set, the `--cache_type` is set to `lru_cache`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10081 Test Plan: - To test that the FastLRUCache is used and the stress test runs successfully, run `make -j24 CRASH_TEST_EXT_ARGS=—duration=960 blackbox_crash_test_with_atomic_flush`. The cache type should sometimes be `fast_lru_cache`. - To test the flag validation, run `make -j24 CRASH_TEST_EXT_ARGS="--duration=960 --secondary_cache_uri=x" blackbox_crash_test_with_atomic_flush` multiple times. The test will always be aborted (which is okay). Check that the cache type is always `lru_cache`. Reviewed By: anand1976 Differential Revision: D36839908 Pulled By: guidotag fbshipit-source-id: ebcdfdcd12ec04c96c09ae5b9c9d1e613bdd1725 02 June 2022, 01:00:28 UTC
45b1c78 Update History.md for #9922 (#10092) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10092 Reviewed By: riversand963 Differential Revision: D36832311 Pulled By: akankshamahajan15 fbshipit-source-id: 8fb1cf90b1d4dddebbfbeebeddb15f6905968e9b 01 June 2022, 22:36:55 UTC
5864900 Get current LogFileNumberSize the same as log_writer (#10086) Summary: `db_impl.alive_log_files_` is used to track the WAL size in `db_impl.logs_`. Get the `LogFileNumberSize` obj in `alive_log_files_` the same time as `log_writer` to keep them consistent. For this issue, it's not safe to do `deque::reverse_iterator::operator*` and `deque::pop_front()` concurrently, so remove the tail cache. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10086 Test Plan: ``` # on Windows gtest-parallel ./db_test --gtest_filter=DBTest.FileCreationRandomFailure -r 1000 -w 100 ``` Reviewed By: riversand963 Differential Revision: D36822373 Pulled By: jay-zhuang fbshipit-source-id: 5e738051dfc7bcf6a15d85ba25e6365df6b6a6af 01 June 2022, 22:33:22 UTC
463873f Add bug fix to HISTORY.md (#10091) Summary: Add to HISTORY.md the bug fixed in https://github.com/facebook/rocksdb/issues/10051 Pull Request resolved: https://github.com/facebook/rocksdb/pull/10091 Reviewed By: ajkr Differential Revision: D36821861 Pulled By: cbi42 fbshipit-source-id: 598812fab88f65c0147ece53cff55cf4ea73aac6 01 June 2022, 21:07:13 UTC
a00cffa Reduce risk of backup or checkpoint missing a WAL file (#10083) Summary: We recently saw a case in crash test in which a WAL file in the middle of the list of live WALs was not included in the backup, so the DB was not openable due to missing WAL. We are not sure why, but this change should at least turn that into a backup-time failure by ensuring all the WAL files expected by the manifest (according to VersionSet) are included in `GetSortedWalFiles()` (used by `GetLiveFilesStorageInfo()`, `BackupEngine`, and `Checkpoint`) Related: to maximize the effectiveness of track_and_verify_wals_in_manifest with GetSortedWalFiles() during checkpoint/backup, we will now sync WAL in GetLiveFilesStorageInfo() when track_and_verify_wals_in_manifest=true. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10083 Test Plan: added new unit test for the check in GetSortedWalFiles() Reviewed By: ajkr Differential Revision: D36791608 Pulled By: pdillinger fbshipit-source-id: a27bcf0213fc7ab177760fede50d4375d579afa6 01 June 2022, 18:02:27 UTC
d04df27 Persist the new MANIFEST after successfully syncing the new WAL during recovery (#9922) Summary: In case of non-TransactionDB and avoid_flush_during_recovery = true, RocksDB won't flush the data from WAL to L0 for all column families if possible. As a result, not all column families can increase their log_numbers, and min_log_number_to_keep won't change. For transaction DB (.allow_2pc), even with the flush, there may be old WAL files that it must not delete because they can contain data of uncommitted transactions and min_log_number_to_keep won't change. If we persist a new MANIFEST with advanced log_numbers for some column families, then during a second crash after persisting the MANIFEST, RocksDB will see some column families' log_numbers larger than the corrupted wal, and the "column family inconsistency" error will be hit, causing recovery to fail. As a solution, RocksDB will persist the new MANIFEST after successfully syncing the new WAL. If a future recovery starts from the new MANIFEST, then it means the new WAL is successfully synced. Due to the sentinel empty write batch at the beginning, kPointInTimeRecovery of WAL is guaranteed to go after this point. If future recovery starts from the old MANIFEST, it means the writing the new MANIFEST failed. We won't have the "SST ahead of WAL" error. Currently, RocksDB DB::Open() may creates and writes to two new MANIFEST files even before recovery succeeds. This PR buffers the edits in a structure and writes to a new MANIFEST after recovery is successful Pull Request resolved: https://github.com/facebook/rocksdb/pull/9922 Test Plan: 1. Update unit tests to fail without this change 2. make crast_test -j Branch with unit test and no fix https://github.com/facebook/rocksdb/pull/9942 to keep track of unit test (without fix) Reviewed By: riversand963 Differential Revision: D36043701 Pulled By: akankshamahajan15 fbshipit-source-id: 5760970db0a0920fb73d3c054a4155733500acd9 01 June 2022, 17:52:26 UTC
7c8c803 Remove unused variable `single_column_family_mode_` (#10078) Summary: This variable is actually not being used for anything meaningful, thus remove it. This can make https://github.com/facebook/rocksdb/issues/7516 slightly simpler by reducing the amount of state that must be made lock-free. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10078 Test Plan: make check Reviewed By: ajkr Differential Revision: D36779817 Pulled By: riversand963 fbshipit-source-id: ffb0d9ad6149616917ae5e02bb28102cb90fc406 31 May 2022, 20:03:37 UTC
151dc00 Bypass tests instead of skipping (#10076) Summary: Make fb test infra happy, more details: https://github.com/facebook/rocksdb/issues/8048 Pull Request resolved: https://github.com/facebook/rocksdb/pull/10076 Test Plan: CI Reviewed By: ajkr Differential Revision: D36768766 Pulled By: jay-zhuang fbshipit-source-id: 4f039a5c623abb6d4a7d09bbf97077618e7ec2c8 31 May 2022, 20:02:50 UTC
5ab5537 Deflake unit test BackupEngineTest.Concurrency (#10069) Summary: After https://github.com/facebook/rocksdb/issues/9984, BackupEngineTest.Concurrency becomes flaky. During DB::Open(), someone else can rename/remove the LOG file, causing this thread's `CreateLoggerFromOptions()` to fail. The reason is that the operation sequence of "FileExists -> Rename" is not atomic. It's possible that a FileExists() returns OK, but the file gets deleted before Rename(), causing the latter to return IOError with PathNotFound subcode. Although it's not encouraged to concurrently modify the contents of the directories managed by the database instance in this case, we can still perform some simple handling to make DB::Open() more robust. In this case, we can check if a racing thread has deleted the original LOG file, we can allow this thread to continue creating a new LOG file. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10069 Test Plan: ~/gtest-parallel/gtest-parallel -r 100 ./backup_engine_test --gtest_filter=BackupEngineTest.Concurrency Reviewed By: jay-zhuang Differential Revision: D36736913 Pulled By: riversand963 fbshipit-source-id: 3cbe92d77ca175e55e586bdb1a32ac8107217ae6 31 May 2022, 16:36:32 UTC
back to top