27ec3b3 Sanitize input in DB::MultiGet() API (#6054) Summary: The new DB::MultiGet() doesn't validate input for num_keys > 1 and GCC-9 complains about it. Fix it by returning directly when num_keys == 0. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6054 Test Plan: Build with GCC-9 and see it pass. Differential Revision: D18608958 fbshipit-source-id: 1c279aff3c7fe6e9d5a6d085ed02550ecea4fdb2 20 November 2019, 18:38:01 UTC
0306e01 Fixes for g++ 4.9.2 compatibility (#6053) Summary: Taken from merryChris in https://github.com/facebook/rocksdb/issues/6043 Stackoverflow ref on {{}} vs. {}: https://stackoverflow.com/questions/26947704/implicit-conversion-failure-from-initializer-list Note to reader: .clear() does not empty out an ostringstream, but .str("") suffices because we don't have to worry about clearing error flags. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6053 Test Plan: make check, manual run of filter_bench Differential Revision: D18602259 Pulled By: pdillinger fbshipit-source-id: f6190f83b8eab4e80e7c107348839edabe727841 19 November 2019, 23:43:37 UTC
ec3e3c3 Fix corruption with intra-L0 on ingested files (#5958) Summary: ## Problem Description Our process aborted when it called `CheckConsistency`, and the information in `stderr` showed "`L0 files seqno 3001491972 3004797440 vs. 3002875611 3004524421`". Here are the causes of the accident I investigated:
* RocksDB calls `CheckConsistency` whenever the `MANIFEST` file is updated. It checks the sequence number interval of every file, except files that were ingested.
* When a file is ingested into RocksDB, it is assigned the global sequence number, so the minimum and maximum seqno of the file are equal (both equal to the global sequence number).
* `CheckConsistency` determines whether a file was ingested by checking whether the smallest and largest seqno of the sstable file are equal.
* If IntraL0Compaction picks an sst that was just ingested and compacts it into another sst, the `smallest_seqno` of the new file will be smaller than its `largest_seqno`.
* If more than one file is ingested before the memtable is scheduled to flush, and they are all compacted into one new sstable file by `IntraL0Compaction`, the sequence interval of the new file will be included in the interval of the memtable, so `CheckConsistency` will return a `Corruption`.
* If an sstable is ingested after the memtable is scheduled to flush, it is assigned a larger seqno than the memtable. Suppose the file is then compacted with other files (all of which were flushed before the memtable) in L0 into one file, and this compaction starts before the flush job of the memtable starts but completes after the flush job finishes. The new file produced by the compaction (call it s1) then has a larger sequence number interval than the file produced by the flush (call it s2). **But some data in s1 was written into RocksDB before s2, so it is possible that some data in s2 is covered by old data in s1.** Of course, this also causes a `Corruption` because of the seqno overlap. The relationship of the files is: > s1.smallest_seqno < s2.smallest_seqno < s2.largest_seqno < s1.largest_seqno So the fix skips sst files that were ingested in `FindIntraL0Compaction`.
## Reason Here is my bug report: https://github.com/facebook/rocksdb/issues/5913 There are two situations that can cause the check to fail.
### First situation
- First we ingest five external ssts into RocksDB, and they happen to land in L0. There is already some data in the memtable, which makes the smallest sequence number of the memtable less than that of the ssts we ingest.
- If a compaction job compacts ssts from L0 to L1, `LevelCompactionPicker` triggers an `IntraL0Compaction` that compacts these five ssts from L0 to L0. We call the resulting sst A, merged from the five ingested ssts.
- Then some data is put into the memtable, and the memtable is flushed to L0. We call this sst B.
- RocksDB checks consistency, finds that the `smallest_seqno` of B is less than that of A, and crashes. Because A was merged from five ssts, its smallest sequence number is less than its own largest sequence number, so RocksDB cannot tell that A was produced from ingested files.
### Second situation
- First we have flushed many ssts in L0; we call them [s1, s2, s3].
- An immutable memtable is requested to be flushed, but because the flush threads are busy, it has not been picked yet. We call it m1. At that moment, one sst is ingested into L0. We call it s4. Because s4 is ingested after m1 became an immutable memtable, it has a larger log sequence number than m1.
- m1 is flushed to L0. Because it is small, the flush job finishes quickly. We call the result s5.
- [s1, s2, s3, s4] are compacted into one sst in L0 by IntraL0Compaction. We call it s6.
- compacted 4@0 files to L0
- When s6 is added into the manifest, the corruption happens: the largest sequence number of s6 equals that of s4, and both are larger than that of s5. But because s1 is older than m1, the smallest sequence number of s6 is smaller than that of s5.
- s6.smallest_seqno < s5.smallest_seqno < s5.largest_seqno < s6.largest_seqno Pull Request resolved: https://github.com/facebook/rocksdb/pull/5958 Differential Revision: D18601316 fbshipit-source-id: 5fe54b3c9af52a2e1400728f565e895cde1c7267 19 November 2019, 23:09:11 UTC
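The seqno-interval consistency rule described above can be sketched as a toy model (hypothetical C++, not RocksDB's actual `CheckConsistency` code; the struct and function names are made up):

```cpp
#include <cassert>
#include <cstdint>

// Toy model of the L0 rule from the commit message: when comparing a
// newer L0 file against an older one, an ingested file is recognized
// by smallest_seqno == largest_seqno and is skipped by the check.
struct FileMeta {
  uint64_t smallest_seqno;
  uint64_t largest_seqno;
};

bool L0PairConsistent(const FileMeta& newer, const FileMeta& older) {
  bool newer_ingested = newer.smallest_seqno == newer.largest_seqno;
  bool older_ingested = older.smallest_seqno == older.largest_seqno;
  if (newer_ingested || older_ingested) {
    return true;  // overlap check skipped for (apparently) ingested files
  }
  // A non-ingested newer file must sit strictly above the older one.
  return newer.smallest_seqno > older.largest_seqno;
}
```

In the s5/s6 scenario, s6's interval contains s5's, so the pair fails this check and a `Corruption` is reported.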
019eb1f Disable blob iterator test with max_sequential_skip_in_iterations==0 in LITE mode (#6052) Summary: The SetOptions API used by the test is not supported in LITE mode, so we should skip the new chunk in this case. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6052 Test Plan: Ran the unit tests both in regular and LITE mode. Differential Revision: D18601763 Pulled By: ltamasi fbshipit-source-id: 883d6882771e0fb4aae72bb77ba4e63d9febec04 19 November 2019, 23:02:41 UTC
4e0dcd3 db_stress sometimes generates keys close to SST file boundaries (#6037) Summary: Recently, a bug was found related to a seek key that is close to an SST file boundary. However, it occurs only rarely in db_stress, because the chance that a random key hits SST file boundaries is small. To boost the chance, with 1/16 probability, we pick keys that are close to SST file boundaries. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6037 Test Plan: Did some manual printing and hacked the code to verify the key generation logic is correct. Differential Revision: D18598476 fbshipit-source-id: 13b76687d106c5be4e3e02a0c77fa5578105a071 19 November 2019, 21:17:03 UTC
20b48c6 Fix blob context when db_iter uses seek (#6051) Summary: Fix: when `db_iter` falls back to using seek by `FindValueForCurrentKeyUsingSeek`, `is_blob_` flag is not properly set on encountering BlobIndex. Also patch existing test for the mentioned code path. Signed-off-by: tabokie <xy.tao@outlook.com> Pull Request resolved: https://github.com/facebook/rocksdb/pull/6051 Differential Revision: D18596274 Pulled By: ltamasi fbshipit-source-id: 8e4714af263b99dc2c379707d50db88fe6799278 19 November 2019, 19:39:02 UTC
38cc611 Fix test failure in LITE mode (#6050) Summary: GetSupportedCompressions() is not available in LITE build, so check and use Snappy compression in db_basic_test.cc. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6050 Test Plan: make LITE=1 check make check Differential Revision: D18588114 Pulled By: anand1976 fbshipit-source-id: a193de58c44f91bcc237107f25dbc1b9458eef3d 19 November 2019, 18:13:24 UTC
ac498cd Remove a few unnecessary includes Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/6046 Test Plan: make check, manual inspection Differential Revision: D18573044 Pulled By: pdillinger fbshipit-source-id: 7a5999fc08d798ce3157b56d4b36d24027409fc3 19 November 2019, 16:20:42 UTC
279c488 Mark blob files not needed by any memtables/SSTs obsolete (#6032) Summary: The patch adds logic to mark no longer needed blob files obsolete upon database open and whenever a flush or compaction completes. Unneeded blob files are detected by iterating through live immutable non-TTL blob files starting from the lowest-numbered one, and stopping when a blob file used by any SSTs or potentially used by memtables is found. (The latter is determined by comparing the sequence number at which the blob file became immutable with the largest sequence number received in flush notifications.) In addition, the patch cleans up the logic around closing and obsoleting blob files and enforces invariants around this area (blob files are now guaranteed to go through the stages mutable-non-obsolete, immutable-non-obsolete, and immutable-obsolete in this order). Pull Request resolved: https://github.com/facebook/rocksdb/pull/6032 Test Plan: Extended unit tests and tested using the BlobDB mode of `db_bench`. Differential Revision: D18495610 Pulled By: ltamasi fbshipit-source-id: 11825b84af74f3f4abfd9bcae04e80870ae58961 19 November 2019, 00:30:06 UTC
a150604 db_stress to cover total order seek (#6039) Summary: Right now in db_stress, as long as a prefix extractor is defined, TestIterator always uses prefix seek. There is value in covering total_order_seek = true even when a prefix extractor is defined. Add a small chance that this flag is turned on. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6039 Test Plan: Run the test for a while. Differential Revision: D18539689 fbshipit-source-id: 568790dd7789c9986b83764b870df0423a122d99 18 November 2019, 23:01:38 UTC
5b9233b Fix a test failure on systems that don't have Snappy compression libraries (#6038) Summary: The ParallelIO/DBBasicTestWithParallelIO.MultiGet/11 test fails if Snappy compression library is not installed, since RocksDB defaults to Snappy if none is specified. So dynamically determine the supported compression types and pick the first one. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6038 Differential Revision: D18532370 Pulled By: anand1976 fbshipit-source-id: a0a735114d1f8892ea09f7c4af8688d7bcc5b075 18 November 2019, 17:37:18 UTC
f65ec09 Fix IngestExternalFile's bug with two_write_queue (#5976) Summary: When two_write_queue is enabled, IngestExternalFile performs EnterUnbatched on both write queues. SwitchMemtable also performs EnterUnbatched on the 2nd write queue when this option is enabled. When the call stack includes IngestExternalFile -> FlushMemTable -> SwitchMemtable, this results in a deadlock. The implemented solution is to pass on the existing writes_stopped argument in FlushMemTable to skip EnterUnbatched in SwitchMemtable. Fixes https://github.com/facebook/rocksdb/issues/5974 Pull Request resolved: https://github.com/facebook/rocksdb/pull/5976 Differential Revision: D18535943 Pulled By: maysamyabandeh fbshipit-source-id: a4f9d4964c10d4a7ca06b1e0102ca2ec395512bc 15 November 2019, 22:00:37 UTC
0058dae Disable SmallestUnCommittedSeq in Valgrind run (#6035) Summary: SmallestUnCommittedSeq sometimes takes too long when run under Valgrind. The patch disables it when the tests are run under Valgrind. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6035 Differential Revision: D18509198 Pulled By: maysamyabandeh fbshipit-source-id: 1191443b9fedb6b9c50d6b76f5c92371f5030230 14 November 2019, 22:41:52 UTC
00d58a3 Abandon use of folly::Optional (#6036) Summary: Had complications with LITE build and valgrind test. Reverts/fixes small parts of PR https://github.com/facebook/rocksdb/issues/6007 Pull Request resolved: https://github.com/facebook/rocksdb/pull/6036 Test Plan: make LITE=1 all check and ROCKSDB_VALGRIND_RUN=1 DISABLE_JEMALLOC=1 make -j24 db_bloom_filter_test && ROCKSDB_VALGRIND_RUN=1 DISABLE_JEMALLOC=1 ./db_bloom_filter_test Differential Revision: D18512238 Pulled By: pdillinger fbshipit-source-id: 37213cf0d309edf11c483fb4b2fb6c02c2cf2b28 14 November 2019, 22:04:15 UTC
6123611 crash_test: use large max_manifest_file_size most of the time. (#6034) Summary: Right now, crash_test always uses a 16KB max_manifest_file_size value. That is good for covering the logic of manifest file switching. However, information stored in manifest files might be useful in debugging failures. Switch to using the small manifest file size only 1/15 of the time. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6034 Test Plan: Observe the command generated by db_crash_test.py multiple times and see the --max_manifest_file_size value distribution. Differential Revision: D18513824 fbshipit-source-id: 7b3ae6dbe521a0918df41064e3fa5ecbf2466e04 14 November 2019, 22:01:06 UTC
e8e7fb1 More fixes to auto-GarbageCollect in BackupEngine (#6023) Summary: Production: * Fixes GarbageCollect (and auto-GC triggered by PurgeOldBackups, DeleteBackup, or CreateNewBackup) to clean up backup directory independent of current settings (except max_valid_backups_to_open; see issue https://github.com/facebook/rocksdb/issues/4997) and prior settings used with same backup directory. * Fixes GarbageCollect (and auto-GC) not to attempt to remove "." and ".." entries from directories. * Clarifies contract with users in modifying BackupEngine operations. In short, leftovers from any incomplete operation are cleaned up by any subsequent call to that same kind of operation (PurgeOldBackups and DeleteBackup considered the same kind of operation). GarbageCollect is available to clean up after all kinds. (NB: right now PurgeOldBackups and DeleteBackup will clean up after incomplete CreateNewBackup, but we aren't promising to continue that behavior.) Pull Request resolved: https://github.com/facebook/rocksdb/pull/6023 Test Plan: * Refactors open parameters to use an option enum, for readability, etc. (Also fixes an unused parameter bug in the redundant OpenDBAndBackupEngineShareWithChecksum.) * Fixes an apparent bug in ShareTableFilesWithChecksumsTransition in which old backup data was destroyed in the transition to be tested. That test is now augmented to ensure GarbageCollect (or auto-GC) does not remove shared files when BackupEngine is opened with share_table_files=false. * Augments DeleteTmpFiles test to ensure that CreateNewBackup does auto-GC when an incompletely created backup is detected. Differential Revision: D18453559 Pulled By: pdillinger fbshipit-source-id: 5e54e7b08d711b161bc9c656181012b69a8feac4 14 November 2019, 14:20:18 UTC
f059c7d New Bloom filter implementation for full and partitioned filters (#6007) Summary: Adds an improved, replacement Bloom filter implementation (FastLocalBloom) for full and partitioned filters in the block-based table. This replacement is faster and more accurate, especially for high bits per key or millions of keys in a single filter. Speed The improved speed, at least on recent x86_64, comes from
* Using fastrange instead of modulo (%)
* Using our new hash function (XXH3 preview, added in a previous commit), which is much faster for large keys and only *slightly* slower on keys around 12 bytes if hashing the same size many thousands of times in a row.
* Optimizing the Bloom filter queries with AVX2 SIMD operations. (Added AVX2 to the USE_SSE=1 build.) Careful design was required to support (a) SIMD-optimized queries, (b) compatible non-SIMD code that's simple and efficient, (c) flexible choice of number of probes, and (d) essentially maximized accuracy for a cache-local Bloom filter. Probes are made eight at a time, so any number of probes up to 8 is the same speed, then up to 16, etc.
* Prefetching cache lines when building the filter. Although this optimization could be applied to the old structure as well, it seems to balance out the small added cost of accumulating 64 bit hashes for adding to the filter rather than 32 bit hashes.
Here's nominal speed data from filter_bench (200MB in filters, about 10k keys each, 10 bits filter data / key, 6 probes, avg key size 24 bytes, includes hashing time) on Skylake DE (relatively low clock speed):
$ ./filter_bench -quick -impl=2 -net_includes_hashing # New Bloom filter
Build avg ns/key: 47.7135
Mixed inside/outside queries...
Single filter net ns/op: 26.2825
Random filter net ns/op: 150.459
Average FP rate %: 0.954651
$ ./filter_bench -quick -impl=0 -net_includes_hashing # Old Bloom filter
Build avg ns/key: 47.2245
Mixed inside/outside queries...
Single filter net ns/op: 63.2978
Random filter net ns/op: 188.038
Average FP rate %: 1.13823
Similar build time but dramatically faster query times on hot data (63 ns to 26 ns), and somewhat faster on stale data (188 ns to 150 ns). Performance differences on batched and skewed query loads are between these extremes as expected. The only other interesting thing about speed is "inside" (query key was added to filter) vs. "outside" (query key was not added to filter) query times. The non-SIMD implementations are substantially slower when most queries are "outside" vs. "inside". This goes against what one might expect or would have observed years ago, as "outside" queries only need about two probes on average, due to short-circuiting, while "inside" always needs num_probes (say 6). The problem is probably the nastily unpredictable branch. The SIMD implementation has few branches (very predictable) and has pretty consistent running time regardless of query outcome. Accuracy The generally improved accuracy (re: Issue https://github.com/facebook/rocksdb/issues/5857) comes from a better design for probing indices within a cache line (re: Issue https://github.com/facebook/rocksdb/issues/4120) and improved accuracy for millions of keys in a single filter from using a 64-bit hash function (XXH3p). Design details in code comments. Accuracy data (generalizes, except old impl gets worse with millions of keys):
Memory bits per key: FP rate percent old impl -> FP rate percent new impl
6: 5.70953 -> 5.69888
8: 2.45766 -> 2.29709
10: 1.13977 -> 0.959254
12: 0.662498 -> 0.411593
16: 0.353023 -> 0.0873754
24: 0.261552 -> 0.0060971
50: 0.225453 -> ~0.00003 (less than 1 in a million queries are FP)
Fixes https://github.com/facebook/rocksdb/issues/5857 Fixes https://github.com/facebook/rocksdb/issues/4120 Unlike the old implementation, this implementation has a fixed cache line size (64 bytes). At 10 bits per key, the accuracy of this new implementation is very close to the old implementation with 128-byte cache line size. If there's sufficient demand, this implementation could be generalized. Compatibility Although old releases would see the new structure as corrupt filter data and read the table as if there's no filter, we've decided only to enable the new Bloom filter with new format_version=5. This provides a smooth path for automatic adoption over time, with an option for early opt-in. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6007 Test Plan: filter_bench has been used thoroughly to validate speed, accuracy, and correctness. Unit tests have been carefully updated to exercise new and old implementations, as well as the logic to select an implementation based on context (format_version). Differential Revision: D18294749 Pulled By: pdillinger fbshipit-source-id: d44c9db3696e4d0a17caaec47075b7755c262c5f 14 November 2019, 00:44:01 UTC
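The "fastrange instead of modulo" point above can be illustrated with a small sketch (illustrative only; the exact RocksDB helper may differ):

```cpp
#include <cassert>
#include <cstdint>

// Map a 32-bit hash to [0, range) with one multiply and a shift.
// Unlike hash % range, this involves no division, and the result is
// roughly uniform when the hash itself is uniform.
inline uint32_t FastRange32(uint32_t hash, uint32_t range) {
  return static_cast<uint32_t>(
      (static_cast<uint64_t>(hash) * static_cast<uint64_t>(range)) >> 32);
}
```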
f382f44 fix typo (#6025) Summary: fix a typo on the Java README page Pull Request resolved: https://github.com/facebook/rocksdb/pull/6025 Differential Revision: D18481232 fbshipit-source-id: 1c70c2435bcd4b02f25e28cd7e35c42273e07be0 13 November 2019, 19:02:28 UTC
bb23bfe Fix a regression bug on total order seek with prefix enabled and range delete (#6028) Summary: Recent change https://github.com/facebook/rocksdb/pull/5861 mistakenly uses "prefix_extractor_ != nullptr" as the condition to determine whether the prefix bloom filter is used. It fails to consider read_options.total_order_seek, so it is wrong. The result is that an optimization for non-total-order seek is mistakenly applied to total order seek, which introduces a bug in the following corner case: Because of RangeDelete(), a file's largest key is extended. The seek key falls into the range-deleted file, so the level iterator seeks into the previous file without getting any key. The correct behavior is to place the iterator at the first key of the next file. However, an optimization is triggered and invalidates the iterator because it is out of the prefix range, causing wrong results. This behavior is reproduced in the unit test added. Fix the bug by setting prefix_extractor to null if total order seek is used. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6028 Test Plan: Add a unit test which fails without the fix. Differential Revision: D18479063 fbshipit-source-id: ac075f013029fcf69eb3a598f14c98cce3e810b3 13 November 2019, 18:11:34 UTC
42b5494 Fix BloomFilterPolicy changes for unsigned char (ARM) (#6024) Summary: Bug in PR https://github.com/facebook/rocksdb/issues/5941 when char is unsigned that should only affect assertion on unused/invalid filter metadata. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6024 Test Plan: on ARM: ./bloom_test && ./db_bloom_filter_test && ./block_based_filter_block_test && ./full_filter_block_test && ./partitioned_filter_block_test Differential Revision: D18461206 Pulled By: pdillinger fbshipit-source-id: 68a7c813a0b5791c05265edc03cdf52c78880e9a 12 November 2019, 23:29:15 UTC
6c7b1a0 Batched MultiGet API for multiple column families (#5816) Summary: Add a new API that allows a user to call MultiGet specifying multiple keys belonging to different column families. This is mainly useful for users who want to do a consistent read of keys across column families, with the added performance benefits of batching and returning values using PinnableSlice. As part of this change, the code in the original multi-column family MultiGet for acquiring the super versions has been refactored into a separate function that can be used by both, the batching and the non-batching versions of MultiGet. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5816 Test Plan: make check make asan_check asan_crash_test Differential Revision: D18408676 Pulled By: anand1976 fbshipit-source-id: 933e7bec91dd70e7b633be4ff623a1116cc28c8d 12 November 2019, 21:52:55 UTC
a19de78 db_stress to cover SeekForPrev() (#6022) Summary: Right now, db_stress doesn't cover SeekForPrev(). Add the coverage, which mirrors what we do for Seek(). Pull Request resolved: https://github.com/facebook/rocksdb/pull/6022 Test Plan: Run "make crash_test". Did some manual source code hacking to simulate wrong iterator results and saw them caught. Differential Revision: D18442193 fbshipit-source-id: 879b79000d5e33c625c7e970636de191ccd7776c 12 November 2019, 01:33:54 UTC
03ce7fb Fix a buffer overrun problem in BlockBasedTable::MultiGet (#6014) Summary: The calculation in BlockBasedTable::MultiGet for the required buffer length for reading in compressed blocks is incorrect. It needs to take the 5-byte block trailer into account. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6014 Test Plan: Add a unit test DBBasicTest.MultiGetBufferOverrun that fails in asan_check before the fix, and passes after. Differential Revision: D18412753 Pulled By: anand1976 fbshipit-source-id: 754dfb66be1d5f161a7efdf87be872198c7e3b72 12 November 2019, 00:59:15 UTC
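The trailer accounting described above can be sketched as follows (a hypothetical helper, not the actual patch; the constant matches RocksDB's 5-byte block trailer of one compression-type byte plus a 4-byte checksum):

```cpp
#include <cassert>
#include <cstddef>

// A block on disk is followed by a 5-byte trailer:
// 1 byte compression type + 4 bytes checksum.
constexpr size_t kBlockTrailerSize = 5;

// The buffer for reading a compressed block must cover the trailer too;
// omitting it was the overrun described in the commit message.
size_t RequiredReadBufferLen(size_t compressed_block_size) {
  return compressed_block_size + kBlockTrailerSize;
}
```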
f29e6b3 bugfix: MemTableList::RemoveOldMemTables invalid iterator after remov… (#6013) Summary: Fix issue https://github.com/facebook/rocksdb/issues/6012. I found that it may be caused by the following code in function _RemoveOldMemTables()_ in **db/memtable_list.cc**:
```
for (auto it = memlist.rbegin(); it != memlist.rend(); ++it) {
  MemTable* mem = *it;
  if (mem->GetNextLogNumber() > log_number) {
    break;
  }
  current_->Remove(mem, to_delete);
```
The iterator **it** turns invalid after `current_->Remove(mem, to_delete);` Pull Request resolved: https://github.com/facebook/rocksdb/pull/6013 Test Plan: `make check` Differential Revision: D18401107 Pulled By: riversand963 fbshipit-source-id: bf0da3b868ed70f7aff24cf7b3e2049c0c5c7a4e 11 November 2019, 23:57:38 UTC
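The hazard here is the classic erase-while-iterating bug. A generic C++ illustration of the safe pattern (not the actual RocksDB patch):

```cpp
#include <cassert>
#include <list>

// Remove leading elements <= limit. Reassigning from erase() keeps
// the loop iterator valid; doing ++it after an erase would advance a
// dangling iterator, like the bug described in the commit message.
int RemoveUpTo(std::list<int>& vals, int limit) {
  int removed = 0;
  for (auto it = vals.begin(); it != vals.end();) {
    if (*it > limit) {
      break;
    }
    it = vals.erase(it);  // erase() returns the next valid iterator
    ++removed;
  }
  return removed;
}
```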
c17384f Cascade TTL Compactions to move expired key ranges to bottom levels faster (#5992) Summary: When users use Level-Compaction-with-TTL by setting `cf_options.ttl`, ttl-expired data could take n*ttl time to reach the bottom level (where n is the number of levels) due to how the `creation_time` table property was calculated for newly created files during compaction. The creation time of new files was set to the max of all compaction-input-files' creation times, which essentially reset the ttl as the key range moved across levels. This behavior is now fixed by basing the `creation_time` on the minimum of all compaction-input-files' creation times; this causes cascading compactions across levels for the ttl-expired data to move to the bottom level, getting rid of tombstones/deleted data faster. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5992 Test Plan: `make check` Differential Revision: D18257883 Pulled By: sagar0 fbshipit-source-id: 00df0bb8d0b7e14d9fc239df2cba8559f3e54cbc 11 November 2019, 22:09:01 UTC
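The essence of the fix (min instead of max of the input creation times) can be sketched as (assumed shape, not the actual RocksDB code):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// The compaction output inherits the OLDEST input file's creation
// time, so TTL-expired data keeps aging as it moves down levels.
// Previously the max was used, which reset the TTL clock each level.
uint64_t OutputCreationTime(const std::vector<uint64_t>& input_times) {
  return *std::min_element(input_times.begin(), input_times.end());
}
```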
8e7aa62 BlobDB: Maintain mapping between blob files and SSTs (#6020) Summary: The patch adds logic to BlobDB to maintain the mapping between blob files and SSTs for which the blob file in question is the oldest blob file referenced by the SST file. The mapping is initialized during database open based on the information retrieved using `GetLiveFilesMetaData`, and updated after flushes/compactions based on the information received through the `EventListener` interface (or, in the case of manual compactions issued through the `CompactFiles` API, the `CompactionJobInfo` object). Pull Request resolved: https://github.com/facebook/rocksdb/pull/6020 Test Plan: Added a unit test; also tested using the BlobDB mode of `db_bench`. Differential Revision: D18410508 Pulled By: ltamasi fbshipit-source-id: dd9e778af781cfdb0d7056298c54ba9cebdd54a5 11 November 2019, 22:01:34 UTC
aa63abf Auto-GarbageCollect on PurgeOldBackups and DeleteBackup (#6015) Summary: Only if there is a crash, power failure, or I/O error in DeleteBackup, shared or private files from the backup might be left behind that are not cleaned up by PurgeOldBackups or DeleteBackup-- only by GarbageCollect. This makes the BackupEngine API "leaky by default." Even if it means a modest performance hit, I think we should make Delete and Purge do as they say, with ongoing best effort: i.e. future calls will attempt to finish any incomplete work from earlier calls. This change does that by having DeleteBackup and PurgeOldBackups do a GarbageCollect, unless (to minimize performance hit) this BackupEngine has already done a GarbageCollect and there have been no deletion-related I/O errors in that GarbageCollect or since then. Rejected alternative 1: remove meta file last instead of first. This would in theory turn partially deleted backups into corrupted backups, but code changes would be needed to allow the missing files and consider it acceptably corrupt, rather than failing to open the BackupEngine. This might be a reasonable choice, but I mostly rejected it because it doesn't solve the legacy problem of cleaning up existing lingering files. Rejected alternative 2: use a deletion marker file. If deletion started with creating a file that marks a backup as flagged for deletion, then we could reliably detect partially deleted backups and efficiently finish removing them. In addition to not solving the legacy problem, this could be precarious if there's a disk full situation, and we try to create a new file in order to delete some files. Ugh. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6015 Test Plan: Updated unit tests Differential Revision: D18401333 Pulled By: pdillinger fbshipit-source-id: 12944e372ce6809f3f5a4c416c3b321a8927d925 09 November 2019, 03:15:35 UTC
72de842 Fix DBFlushTest::FireOnFlushCompletedAfterCommittedResult hang (#6018) Summary: The test fires two flushes to let them run in parallel. Previously it waited for the first job to be scheduled before firing the second. It is possible the first job has not started before the second job is scheduled, making the two jobs combine into one. Change to wait for the first job to start. Fixes https://github.com/facebook/rocksdb/issues/6017 Pull Request resolved: https://github.com/facebook/rocksdb/pull/6018 Test Plan: ``` while ./db_flush_test --gtest_filter=*FireOnFlushCompletedAfterCommittedResult*; do :; done ``` and let it run for a while. Signed-off-by: Yi Wu <yiwu@pingcap.com> Differential Revision: D18405576 Pulled By: riversand963 fbshipit-source-id: 6ebb6262e033d5dc2ef81cb3eb410b314f2de4c9 08 November 2019, 21:47:29 UTC
f80050f Add file number/oldest referenced blob file number to {Sst,Live}FileMetaData (#6011) Summary: The patch exposes the file numbers of the SSTs as well as the oldest blob files they contain a reference to through the GetColumnFamilyMetaData/ GetLiveFilesMetaData interface. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6011 Test Plan: Fixed and extended the existing unit tests. (The earlier ColumnFamilyMetaDataTest wasn't really testing anything because the generated memtables were never flushed, so the metadata structure was essentially empty.) Differential Revision: D18361697 Pulled By: ltamasi fbshipit-source-id: d5ed1d94ac70858b84393c48711441ddfe1251e9 07 November 2019, 22:04:16 UTC
07a0ad3 Download bzip2 packages from sourceforge (#5995) Summary: From bzip2's official [download page](http://www.bzip.org/downloads.html), we could download it from sourceforge. This source would be more credible than previous web archive. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5995 Differential Revision: D18377662 fbshipit-source-id: e8353f83d5d6ea6067f78208b7bfb7f0d5b49c05 07 November 2019, 20:51:06 UTC
9836a1f Fix MultiGet crash when no_block_cache is set (#5991) Summary: This PR fixes https://github.com/facebook/rocksdb/issues/5975. In ```BlockBasedTable::RetrieveMultipleBlocks()```, we were calling ```MaybeReadBlocksAndLoadToCache()```, which is a no-op if neither uncompressed nor compressed block cache are configured. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5991 Test Plan: 1. Add unit tests that fail with the old code and pass with the new 2. make check and asan_check Cc spetrunia Differential Revision: D18272744 Pulled By: anand1976 fbshipit-source-id: e62fa6090d1a6adf84fcd51dfd6859b03c6aebfe 07 November 2019, 20:02:21 UTC
1da1f04 Stress test to relax the iterator verification case for lower bound (#5869) Summary: In the stress test, all iterator verification is turned off if lower bound is enabled. This might be stricter than needed. This PR relaxes the condition and includes the case where the lower bound is lower than both the seek key and the upper bound. It seems to work mostly fine when I run the crash test locally. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5869 Test Plan: Run crash_test Differential Revision: D18363578 fbshipit-source-id: 23d57e11ea507949b8100f4190ddfbe8db052d5a 07 November 2019, 19:16:59 UTC
982a753 Add two test cases for single sorted universal periodic compaction (#6002) Summary: It's useful to add test coverage for universal compaction's periodic compaction. Add two tests. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6002 Test Plan: Run the two tests Differential Revision: D18363544 fbshipit-source-id: bbd04b54057315f64f959709006412db1f76d170 07 November 2019, 19:14:14 UTC
f0b469e Turn on periodic compaction in universal by default if compaction filter is used. (#5994) Summary: Recently, periodic compaction was turned on by default for leveled compaction if a compaction filter is used. Since periodic compaction is now supported in universal compaction too, we apply the same default to universal now. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5994 Test Plan: Add a new unit test. Differential Revision: D18363744 fbshipit-source-id: 5093288ce990ee3cab0e44ffd92d8489fbcd6a48 07 November 2019, 18:58:10 UTC
7b3222e Partial rebalance of TEST_GROUPs for Travis (#6010) Summary: TEST_GROUP=1 has sometimes been timing out but generally taking 45-50 minutes vs. 20-25 for groups 2-4. Beyond the compilation time, tests in group 1 consist of about 19 minutes of db_test, and 7 minutes of everything else. This change moves most of that "everything else" to group 2. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6010 Test Plan: Travis for this PR, oncall watch Travis Differential Revision: D18373536 Pulled By: pdillinger fbshipit-source-id: 0b3af004c71e4fd6bc01a94dac34cc3079fc9ce1 07 November 2019, 17:50:59 UTC
111ebf3 db_stress: improve TestGet() failure printing (#5989) Summary: Right now, in db_stress's CF consistency test's TestGet case, if a failure happens, we do normal string printing rather than hex printing, so some bytes are not printed out, which makes debugging harder. Fix it by printing hex instead. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5989 Test Plan: Build db_stress and see it passes. Differential Revision: D18363552 fbshipit-source-id: 09d1b8f6fbff37441cbe7e63a1aef27551226cec 07 November 2019, 01:38:25 UTC
8ea087a Workload generator (Mixgraph) based on prefix hotness (#5953) Summary: In the previous PR https://github.com/facebook/rocksdb/issues/4788, user can use the db_bench mix_graph option to generate a workload that is from the social graph. The key is generated based on the key access hotness. In this PR, user can further model the key-range hotness and fit it to a two-term exponential distribution. First, user cuts the whole key space into small key ranges (e.g., key-ranges are the same size and the key-range number is the number of SST files). Then, user calculates the average access count per key of each key-range as the key-range hotness. Next, user fits the key-range hotness to a two-term exponential distribution (f(x) = a*exp(b*x) + c*exp(d*x)) and generates the values of a, b, c, and d. They are the parameters in db_bench: prefix_dist_a, prefix_dist_b, prefix_dist_c, and prefix_dist_d. Finally, user can run db_bench by specifying the parameters. For example: `./db_bench --benchmarks="mixgraph" -use_direct_io_for_flush_and_compaction=true -use_direct_reads=true -cache_size=268435456 -key_dist_a=0.002312 -key_dist_b=0.3467 -keyrange_dist_a=14.18 -keyrange_dist_b=-2.917 -keyrange_dist_c=0.0164 -keyrange_dist_d=-0.08082 -keyrange_num=30 -value_k=0.2615 -value_sigma=25.45 -iter_k=2.517 -iter_sigma=14.236 -mix_get_ratio=0.85 -mix_put_ratio=0.14 -mix_seek_ratio=0.01 -sine_mix_rate_interval_milliseconds=5000 -sine_a=350 -sine_b=0.0105 -sine_d=50000 --perf_level=2 -reads=1000000 -num=5000000 -key_size=48` Pull Request resolved: https://github.com/facebook/rocksdb/pull/5953 Test Plan: run db_bench with different parameters and checked the results. Differential Revision: D18053527 Pulled By: zhichao-cao fbshipit-source-id: 171f8b3142bd76462f1967c58345ad7e4f84bab7 06 November 2019, 21:02:20 UTC
5080465 Enable write-conflict snapshot in stress tests (#5897) Summary: DBImpl extends the public GetSnapshot() with GetSnapshotForWriteConflictBoundary() method that takes snapshots specially for write-write conflict checking. Compaction treats such snapshots differently to avoid GCing a value written after that, so that the write conflict remains visible even after the compaction. The patch extends stress tests with such snapshots. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5897 Differential Revision: D17937476 Pulled By: maysamyabandeh fbshipit-source-id: bd8b0c578827990302194f63ae0181e15752951d 06 November 2019, 19:13:22 UTC
834feaf Add clarifying/instructive header to TARGETS and defs.bzl Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/6008 Differential Revision: D18343273 Pulled By: pdillinger fbshipit-source-id: f7d1c78d711bbfb0deea9ec88212c19ab2ec91b8 06 November 2019, 04:20:33 UTC
67e735d Rename BlockBasedTable::ReadMetaBlock (#6009) Summary: According to https://github.com/facebook/rocksdb/wiki/Rocksdb-BlockBasedTable-Format, the block read by BlockBasedTable::ReadMetaBlock is actually the meta index block. Therefore, it is better to rename the function to ReadMetaIndexBlock. This PR also applies some format change to existing code. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6009 Test Plan: make check Differential Revision: D18333238 Pulled By: riversand963 fbshipit-source-id: 2c4340a29b3edba53d19c132cbfd04caf6242aed 06 November 2019, 01:19:11 UTC
230bcae Add a limited support for iteration bounds into BaseDeltaIterator (#5403) Summary: For MDEV-19670: MyRocks: key lookups into deleted data are very slow BaseDeltaIterator remembers iterate_upper_bound and will not let delta_iterator_ walk above the iterate_upper_bound if base_iterator_ is not valid anymore. == Rationale == The most straightforward way would be to make the delta_iterator (which is a rocksdb::WBWIIterator) support iterator bounds. But checking for bounds has an extra CPU overhead. So we put the check into BaseDeltaIterator, and only make it when base_iterator_ is not valid. (note: We could take it even further, and move the check a few lines down, and only check iterator bounds ourselves if base_iterator_ is not valid AND delta_iterator_ hit a tombstone). Pull Request resolved: https://github.com/facebook/rocksdb/pull/5403 Differential Revision: D15863092 Pulled By: maysamyabandeh fbshipit-source-id: 8da458e7b9af95ff49356666f69664b4a6ccf49b 05 November 2019, 19:39:36 UTC
52733b4 WritePrepared: Fix flaky test MaxCatchupWithNewSnapshot (#5850) Summary: MaxCatchupWithNewSnapshot tests that the snapshot sequence number will be larger than the max sequence number when the snapshot was taken. However since the test does not have access to the max sequence number when the snapshot was taken, it uses the max sequence number taken after that, which could have advanced beyond the snapshot by then, thus making the test flaky. The fix is to compare with the max sequence number before the snapshot was taken, which is a lower bound for the value when the snapshot was taken. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5850 Test Plan: ~/gtest-parallel/gtest-parallel --repeat=12800 ./write_prepared_transaction_test --gtest_filter="*MaxCatchupWithNewSnapshot*" Differential Revision: D17608926 Pulled By: maysamyabandeh fbshipit-source-id: b122ae5a27f982b290bd60da852e28d3c5eb0136 05 November 2019, 00:23:57 UTC
0d91a98 Fix assertion in universal compaction periodic compaction (#6000) Summary: We recently added periodic compaction to universal compaction. An old assertion that we never compact only the last sorted run was triggered. However, with periodic compaction, it is possible that we only compact the last sorted run, so the assertion is now stricter than needed. Relax this assertion. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6000 Test Plan: This should be a low risk change. Will observe whether stress test will pass after it. Differential Revision: D18285396 fbshipit-source-id: 9a6863debdf104c40a7f6c46ab62d84cdf5d8592 02 November 2019, 01:33:12 UTC
e4e1d35 Revert "Disable pre-5.5 versions in the format compatibility test (#5990)" (#5999) Summary: This reverts commit 351e25401b4ca1c9a607a65dc9dcae7da2d418b6. All branches have been fixed to buildable on FB environments, so we can revert it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5999 Differential Revision: D18281947 fbshipit-source-id: 6deaaf1b5df2349eee5d6ed9b91208cd7e23ec8e 01 November 2019, 22:57:15 UTC
a44670e Use aggregate initialization for FlushJobInfo/CompactionJobInfo (#5997) Summary: FlushJobInfo and CompactionJobInfo are aggregates; we should use the aggregate initialization syntax to ensure members (specifically those of built-in types) are value-initialized. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5997 Test Plan: make check Differential Revision: D18273398 Pulled By: ltamasi fbshipit-source-id: 35b1a63ad9ca01605d288329858af72fffd7f392 01 November 2019, 18:46:19 UTC
5b65658 crash_test: disable periodic compaction in FIFO compaction. (#5993) Summary: A recent commit made the periodic compaction option valid in FIFO compaction, where it means TTL. But we failed to disable it in the crash test, causing an assert failure. Fix it by disabling it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5993 Test Plan: Restart "make crash_test" many times and make sure --periodic_compaction_seconds=0 is always the case when --compaction_style=2 Differential Revision: D18263223 fbshipit-source-id: c91a802017d83ae89ac43827d1b0012861933814 01 November 2019, 00:28:03 UTC
18f57f5 Add new persistent 64-bit hash (#5984) Summary: For upcoming new SST filter implementations, we will use a new 64-bit hash function (XXH3 preview, slightly modified). This change updates hash.{h,cc} for that change, adds unit tests, and out-of-lines the implementations to keep hash.h as clean/small as possible. In developing the unit tests, I discovered that the XXH3 preview always returns zero for the empty string. Zero is problematic for some algorithms (including an upcoming SST filter implementation) if it occurs more often than at the "natural" rate, so it should not be returned from trivial values using trivial seeds. I modified our fork of XXH3 to return a modest hash of the seed for the empty string. With hash function details out-of-lined from hash.h, it makes sense to enable XXH_INLINE_ALL, so that direct calls to XXH64/XXH32/XXH3p are inlined. To fix array-bounds warnings on some inline calls, I injected some casts to uintptr_t in xxhash.cc. (Issue reported to Yann.) Revised: Reverted using XXH_INLINE_ALL for now. Some Facebook checks are unhappy about #include'ing the xxhash.cc file. I would fix that by renaming it to xxhash_cc.h, but to best preserve history I want to do that in a separate commit (PR) from the uintptr casts. Also updated filter_bench for this change, improving the performance predictability of dry run hashing and adding support for 64-bit hash (for upcoming new SST filter implementations, minor dead code in the tool for now). Pull Request resolved: https://github.com/facebook/rocksdb/pull/5984 Differential Revision: D18246567 Pulled By: pdillinger fbshipit-source-id: 6162fbf6381d63c8cc611dd7ec70e1ddc883fbb8 31 October 2019, 23:36:35 UTC
685e895 Prepare filter tests for more implementations (#5967) Summary: This change sets up for alternate implementations underlying BloomFilterPolicy: * Refactor BloomFilterPolicy and expose in internal .h file so that it's easy to iterate over / select implementations for testing, regardless of what the best public interface will look like. Most notably updated db_bloom_filter_test to use this. * Hide FullFilterBitsBuilder from unit tests (alternate derived classes planned); expose the part important for testing (CalculateSpace), as abstract class BuiltinFilterBitsBuilder. (Also cleaned up internally exposed interface to CalculateSpace.) * Rename BloomTest -> BlockBasedBloomTest for clarity (despite ongoing confusion between block-based table and block-based filter) * Assert that block-based filter construction interface is only used on BloomFilterPolicy appropriately constructed. (A couple of tests updated to add ", true".) Pull Request resolved: https://github.com/facebook/rocksdb/pull/5967 Test Plan: make check Differential Revision: D18138704 Pulled By: pdillinger fbshipit-source-id: 55ef9273423b0696309e251f50b8c1b5e9ec7597 31 October 2019, 21:12:33 UTC
351e254 Disable pre-5.5 versions in the format compatibility test (#5990) Summary: We have updated earlier release branches going back to 5.5 so they are built using gcc7 by default. Disabling ancient versions before that until we figure out a plan for them. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5990 Test Plan: Ran the script locally. Differential Revision: D18252386 Pulled By: ltamasi fbshipit-source-id: a7bbb30dc52ff2eaaf31a29ecc79f7cf4e2834dc 31 October 2019, 20:45:02 UTC
aa6f7d0 Support periodic compaction in universal compaction (#5970) Summary: Previously, periodic compaction was not supported in universal compaction. Add the support using the following approach: if any file is marked as qualified for periodic compaction, trigger a full compaction. If a full compaction is prevented by files being compacted, try to compact levels higher than the files currently being compacted. If in this way we can only compact the last sorted run and none of the files to be compacted qualifies for periodic compaction, skip the compaction. This is to prevent the same single-level compaction from being executed again and again. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5970 Test Plan: Add several test cases. Differential Revision: D18147097 fbshipit-source-id: 8ecc308154d9aca96fb192c51fbceba3947550c1 31 October 2019, 18:31:37 UTC
2a9e5ca Make FIFO compaction take a 30-day TTL by default (#5987) Summary: Right now, by default FIFO compaction has no TTL. We believe that a default TTL of 30 days will be better. With this patch, the default is changed to 30 days. A default Options.periodic_compaction_seconds will mean the same as options.ttl. If Options.ttl and Options.periodic_compaction_seconds are left at their defaults, a 30-day TTL will be used. If both options are set, the stricter value of the two will be used. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5987 Test Plan: Add an option sanitize test to cover the case. Differential Revision: D18237935 fbshipit-source-id: a6dcea1f36c3849e13c0a69e413d73ad8eab58c9 31 October 2019, 18:13:12 UTC
dccaf9f Turn compaction asserts to runtime check (#5935) Summary: Compaction iterator has many assert statements that are active only during test runs. Some rare bugs would show up only at runtime and could violate the assert condition, but go unnoticed since assert statements are not compiled in release mode. Turning the assert statements into runtime checks has some pros and cons: Pros: - A bug that would result in incorrect data would be detected early, before the incorrect data is written to the disk. Cons: - Runtime overhead, which should be negligible since compaction is a minority of the overall CPU usage - The assert statements might already be violated at runtime, and turning them into runtime failures might result in reliability issues. The patch takes a conservative step in this direction by logging the assert violations at runtime. If we see any violation reported in logs, we investigate. Otherwise, we can go ahead and turn them into runtime errors. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5935 Differential Revision: D18229697 Pulled By: maysamyabandeh fbshipit-source-id: f1890eca80ccd7cca29737f1825badb9aa8038a8 30 October 2019, 20:48:38 UTC
0337d87 crash_test: disable atomic flush with pipelined write (#5986) Summary: Recently, pipelined write was enabled even if atomic flush is enabled, causing a sanitization failure in db_stress. Revert that change. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5986 Test Plan: Run "make crash_test_with_atomic_flush" and see it run for some while so that the old sanitization error (which showed up quickly) doesn't show up. Differential Revision: D18228278 fbshipit-source-id: 27fdf2f8e3e77068c9725a838b9bef4ab25a2553 30 October 2019, 18:36:55 UTC
15119f0 Add more release branches to tools/check_format_compatible.sh (#5985) Summary: More release branches are created. We should include them in continuous format compatibility checks. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5985 Test Plan: Let's see whether it passes. Differential Revision: D18226532 fbshipit-source-id: 75d8cad5b03ccea4ce16f00cea1f8b7893b0c0c8 30 October 2019, 18:20:49 UTC
a3960fc Move pipeline write waiting logic into WaitForPendingWrites() (#5716) Summary: In pipeline writing mode, memtable switching needs to wait for memtable writing to finish to make sure that when memtables are made immutable, inserts are not going to them. This is currently done in DBImpl::SwitchMemtable(). This is done after flush_scheduler_.TakeNextColumnFamily() is called to fetch the list of column families to switch. The function flush_scheduler_.TakeNextColumnFamily() itself, however, is not thread-safe when being called together with flush_scheduler_.ScheduleFlush(). This change provides a fix, which moves the waiting logic before flush_scheduler_.TakeNextColumnFamily(). WaitForPendingWrites() is a natural place where the logic can happen. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5716 Test Plan: Run all tests with ASAN and TSAN. Differential Revision: D18217658 fbshipit-source-id: b9c5e765c9989645bf10afda7c5c726c3f82f6c3 30 October 2019, 01:16:36 UTC
f22aaf8 db_stress: CF Consistency check to use random CF to validate iterator results (#5983) Summary: Right now, in db_stress's iterator tests, we always use the same CF to validate iterator results. This commit changes it so that a randomized CF is used in the CF consistency test, where every CF should have exactly the same data. This would help catch more bugs. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5983 Test Plan: Run "make crash_test_with_atomic_flush". Differential Revision: D18217643 fbshipit-source-id: 3ac998852a0378bb59790b20c5f236f6a5d681fe 30 October 2019, 01:16:35 UTC
4c9aa30 Auto enable Periodic Compactions if a Compaction Filter is used (#5865) Summary: - Periodic compactions are auto-enabled if a compaction filter or a compaction filter factory is set, in Level Compaction. - The default value of `periodic_compaction_seconds` is changed to UINT64_MAX, which lets RocksDB auto-tune periodic compactions as needed. An explicit value of 0 will still work as before i.e. to disable periodic compactions completely. For now, on seeing a compaction filter along with a UINT64_MAX value for `periodic_compaction_seconds`, RocksDB will make SST files older than 30 days go through periodic compactions. Some RocksDB users make use of compaction filters to control when their data can be deleted, usually with a custom TTL logic. But it is occasionally possible that the compactions get delayed by considerable time due to factors like low writes to a key range, data reaching bottom level, etc before the TTL expiry. Periodic Compactions feature was originally built to help such cases. Now periodic compactions are auto enabled by default when compaction filters or compaction filter factories are used, as it is generally helpful to all cases to collect garbage. `periodic_compaction_seconds` is set to a large value, 30 days, in `SanitizeOptions` when RocksDB sees that a `compaction_filter` or `compaction_filter_factory` is used. This is done only for Level Compaction style. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5865 Test Plan: - Added a new test `DBCompactionTest.LevelPeriodicCompactionWithCompactionFilters` to make sure that `periodic_compaction_seconds` is set if either `compaction_filter` or `compaction_filter_factory` options are set. - `COMPILE_WITH_ASAN=1 make check` Differential Revision: D17659180 Pulled By: sagar0 fbshipit-source-id: 4887b9cf2e53cf2dc93a7b658c6b15e1181217ee 29 October 2019, 22:05:51 UTC
26dc296 filter_bench not needed for ROCKSDB_LITE (#5978) Summary: filter_bench is a specialized micro-benchmarking tool that should not be needed with ROCKSDB_LITE. This should fix the LITE build. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5978 Test Plan: make LITE=1 check Differential Revision: D18177941 Pulled By: pdillinger fbshipit-source-id: b73a171404661e09e018bc99afcf8d4bf1e2949c 28 October 2019, 21:12:36 UTC
79018ba Upgrading version to 6.6.0 on Master (#5965) Summary: Upgrading version to 6.6.0 on Master. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5965 Differential Revision: D18119839 Pulled By: vjnadimpalli fbshipit-source-id: 4adbcbb82b108d2f626e88c786453baad8455f4e 28 October 2019, 20:14:45 UTC
1075c37 Fix for lite build (#5971) Summary: Fix for lite build Pull Request resolved: https://github.com/facebook/rocksdb/pull/5971 Test Plan: make J=1 -j64 LITE=1 all check Differential Revision: D18148306 Pulled By: vjnadimpalli fbshipit-source-id: 5b9a3edc3e73e054fee6b96e6f6e583cecc898f3 26 October 2019, 01:22:24 UTC
3f891c4 More improvements to filter_bench (#5968) Summary: * Adds support for plain table filter. This is not critical right now, but does add a -impl flag that will be useful for new filter implementations initially targeted at block-based table (and maybe later ported to plain table) * Better mixing of inside vs. outside queries, for more realism * A -best_case option, handy for tuning an implementation's inner loop * Option for whether to include hashing time in dry run / net timings No modifications to production code, just filter_bench. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5968 Differential Revision: D18139872 Pulled By: pdillinger fbshipit-source-id: 5b09eba963111b48f9e0525a706e9921070990e8 25 October 2019, 20:27:07 UTC
b3dc2f3 Update xxhash.cc to allow combined compilation (#5969) Summary: To fix unity_test Pull Request resolved: https://github.com/facebook/rocksdb/pull/5969 Test Plan: make unity_test Differential Revision: D18140426 Pulled By: pdillinger fbshipit-source-id: d5516e6d665f57e3706b9f9b965b0c458e58ccef 25 October 2019, 19:54:41 UTC
ec88043 API to get file_creation_time of the oldest file in the DB (#5948) Summary: Adding a new API to db.h that allows users to get file_creation_time of the oldest file in the DB. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5948 Test Plan: Added unit test. Differential Revision: D18056151 Pulled By: vjnadimpalli fbshipit-source-id: 448ec9d34cb6772e1e5a62db399ace00dcbfbb5d 25 October 2019, 18:53:57 UTC
013babc Clean up some filter tests and comments (#5960) Summary: Some filtering tests were unfriendly to new implementations of FilterBitsBuilder because of dynamic_cast to FullFilterBitsBuilder. Most of those have now been cleaned up, worked around, or at least changed from crash on dynamic_cast failure to individual test failure. Also put some clarifying comments on filter-related APIs. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5960 Test Plan: make check Differential Revision: D18121223 Pulled By: pdillinger fbshipit-source-id: e83827d9d5d96315d96f8e25a99cd70f497d802c 25 October 2019, 01:48:16 UTC
2309fd6 Update column families' log number altogether after flushing during recovery (#5856) Summary: A bug occasionally shows up in crash test, and https://github.com/facebook/rocksdb/issues/5851 reproduces it. The bug can surface in the following way. 1. Database has multiple column families. 2. Between two DB restarts, the last log file is corrupted in the middle (not the tail). 3. During restart, DB crashes after flushing one column family but before flushing another. Then DB will fail to be opened again with error "SST file is ahead of WALs". The solution is to update the log number associated with each column family altogether after flushing all column families' memtables. The version edits should be written to a new MANIFEST. Only after all these version edits are written successfully does RocksDB (atomically) point the CURRENT file to the new MANIFEST. Test plan (on devserver): ``` $make all && make check ``` Specifically ``` $make db_test2 $./db_test2 --gtest_filter=DBTest2.CrashInRecoveryMultipleCF ``` Also checked for compatibility as follows. Use this branch, run DBTest2.CrashInRecoveryMultipleCF and preserve the db directory. Then checkout 5.4, build ldb, and dump the MANIFEST. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5856 Differential Revision: D17620818 Pulled By: riversand963 fbshipit-source-id: b52ce5969c9a8052cacec2bd805fcfb373589039 25 October 2019, 01:29:30 UTC
ca7ccbe Misc hashing updates / upgrades (#5909) Summary: - Updated our included xxhash implementation to version 0.7.2 (== the latest dev version as of 2019-10-09). - Using XXH_NAMESPACE (like other fb projects) to avoid potential name collisions. - Added fastrange64, and unit tests for it and fastrange32. These are faster alternatives to hash % range. - Use preview version of XXH3 instead of MurmurHash64A for NPHash64 -- Had to update cache_test to increase probability of passing for any given hash function. - Use fastrange64 instead of % with uses of NPHash64 -- Had to fix WritePreparedTransactionTest.CommitOfDelayedPrepared to avoid deadlock apparently caused by new hash collision. - Set default seed for NPHash64 because specifying a seed rarely makes sense for it. - Removed unnecessary include xxhash.h in a popular .h file - Rename preview version of XXH3 to XXH3p for clarity and to ease backward compatibility in case final version of XXH3 is integrated. Relying on existing unit tests for NPHash64-related changes. Each new implementation of fastrange64 passed unit tests when manipulating my local build to select it. I haven't done any integration performance tests, but I consider the improved performance of the pieces being swapped in to be well established. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5909 Differential Revision: D18125196 Pulled By: pdillinger fbshipit-source-id: f6bf83d49d20cbb2549926adf454fd035f0ecc0d 25 October 2019, 00:16:46 UTC
ec11eff FilterPolicy consolidation, part 2/2 (#5966) Summary: The parts that are used to implement FilterPolicy / NewBloomFilterPolicy and not used other than for the block-based table should be consolidated under table/block_based/filter_policy*. This change is step 2 of 2: mv util/bloom.cc table/block_based/filter_policy.cc This gets its own PR so that git has the best chance of following the rename for blame purposes. Note that low-level shared implementation details of Bloom filters remain in util/bloom_impl.h, and util/bloom_test.cc remains where it is for now. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5966 Test Plan: make check Differential Revision: D18124930 Pulled By: pdillinger fbshipit-source-id: 823bc09025b3395f092ef46a46aa5ba92a914d84 24 October 2019, 22:44:51 UTC
f7e7b34 Propagate SST and blob file numbers through the EventListener interface (#5962) Summary: This patch adds a number of new information elements to the FlushJobInfo and CompactionJobInfo structures that are passed to EventListeners via the OnFlush{Begin, Completed} and OnCompaction{Begin, Completed} callbacks. Namely, for flushes, the file numbers of the new SST and the oldest blob file it references are propagated. For compactions, the new pieces of information are the file number, level, and the oldest blob file referenced by each compaction input and output file. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5962 Test Plan: Extended the EventListener unit tests with logic that checks that these information elements are correctly propagated from the corresponding FileMetaData. Differential Revision: D18095568 Pulled By: ltamasi fbshipit-source-id: 6874359a6aadb53366b5fe87adcb2f9bd27a0a56 24 October 2019, 21:44:15 UTC
dd19014 FilterPolicy consolidation, part 1/2 (#5963) Summary: The parts that are used to implement FilterPolicy / NewBloomFilterPolicy and not used other than for the block-based table should be consolidated under table/block_based/filter_policy*. I don't foresee sharing these APIs with e.g. the Plain Table because they don't expose hashes for reuse in indexing. This change is step 1 of 2: (a) mv table/full_filter_bits_builder.h to table/block_based/filter_policy_internal.h which I expect to expand soon to internally reveal more implementation details for testing. (b) consolidate eventual contents of table/block_based/filter_policy.cc in util/bloom.cc, which has the most elaborate revision history (see step 2 ...) Step 2 soon to follow: mv util/bloom.cc table/block_based/filter_policy.cc This gets its own PR so that git has the best chance of following the rename for blame purposes. Note that low-level shared implementation details of Bloom filters are in util/bloom_impl.h. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5963 Test Plan: make check Differential Revision: D18121199 Pulled By: pdillinger fbshipit-source-id: 8f21732c3d8909777e3240e4ac3123d73140326a 24 October 2019, 20:20:35 UTC
2837008 Vary key size and alignment in filter_bench (#5933) Summary: The first version of filter_bench has selectable key size but that size does not vary throughout a test run. This artificially favors "branchy" hash functions like the existing BloomHash, MurmurHash1, probably because of optimal return for branch prediction. This change primarily varies those key sizes from -2 to +2 bytes vs. the average selected size. We also set the default key size at 24 to better reflect our best guess of typical key size. But steadily random key sizes may not be realistic either. So this change introduces a new filter_bench option: -vary_key_size_log2_interval=n where the same key size is used 2^n times and then changes to another size. I've set the default at 5 (32 times same size) as a compromise between deployments with rather consistent vs. rather variable key sizes. On my Skylake system, the performance boost to MurmurHash1 largely lies between n=10 and n=15. Also added -vary_key_alignment (bool, now default=true), though this doesn't currently seem to matter in hash functions under consideration. This change also does a "dry run" for each testing scenario, to improve the accuracy of those numbers, as there was more difference between scenarios than expected. Subtracting gross test run times from dry run times is now also embedded in the output, because these "net" times are generally the most useful. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5933 Differential Revision: D18121683 Pulled By: pdillinger fbshipit-source-id: 3c7efee1c5661a5fe43de555e786754ddf80dc1e 24 October 2019, 20:08:30 UTC
2509531 Add test showing range tombstones can create excessively large compactions (#5956) Summary: For more information on the original problem see this [link](https://github.com/facebook/rocksdb/issues/3977). This change adds two new tests. They are identical other than one uses range tombstones and the other does not. Each test generates sub files at L2 which overlap with keys in L3. The test that uses range tombstones generates a single file at L2. This single file will generate a very large range overlap that will in turn create an excessively large compaction. 1: T001 - T005 2: 000 - 005 In contrast, the test that uses key ranges generates 3 files at L2. As a single file is compacted at a time, those 3 files will generate less work per compaction iteration. 1: 001 - 002 1: 003 - 004 1: 005 2: 000 - 005 Pull Request resolved: https://github.com/facebook/rocksdb/pull/5956 Differential Revision: D18071631 Pulled By: dlambrig fbshipit-source-id: 12abae75fb3e0b022d228c6371698aa5e53385df 24 October 2019, 18:08:44 UTC
9f1e5a0 CfConsistencyStressTest to validate key consistent across CFs in TestGet() (#5863) Summary: Right now in the CF consistency stress test's TestGet(), keys are just fetched without validation. With this change, half of the time we verify that all the CFs share the same value for the same key. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5863 Test Plan: Run "make crash_test_with_atomic_flush" and see tests pass. Hack the code to generate some inconsistency and observe the test fails as expected. Differential Revision: D17934206 fbshipit-source-id: 00ba1a130391f28785737b677f80f366fb83cced 23 October 2019, 23:57:16 UTC
6a32e3b Remove unused BloomFilterPolicy::hash_func_ (#5961) Summary: This is an internal, file-local "feature" that is not used and potentially confusing. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5961 Test Plan: make check Differential Revision: D18099018 Pulled By: pdillinger fbshipit-source-id: 7870627eeed09941d12538ec55d10d2e164fc716 23 October 2019, 22:47:17 UTC
b4ebda7 Make buckifier python3 compatible (#5922) Summary: Make buckifier/buckify_rocksdb.py run on both Python 3 and 2 Pull Request resolved: https://github.com/facebook/rocksdb/pull/5922 Test Plan: ``` $python3 buckifier/buckify_rocksdb.py $python3 buckifier/buckify_rocksdb.py '{"fake": {"extra_deps": [":test_dep", "//fakes/module:mock1"], "extra_compiler_flags": ["-DROCKSDB_LITE", "-Os"]}}' $python2 buckifier/buckify_rocksdb.py $python2 buckifier/buckify_rocksdb.py '{"fake": {"extra_deps": [":test_dep", "//fakes/module:mock1"], "extra_compiler_flags": ["-DROCKSDB_LITE", "-Os"]}}' ``` Differential Revision: D17920611 Pulled By: riversand963 fbshipit-source-id: cc6e2f36013a88a710d96098f6ca18cbe85e3f62 23 October 2019, 20:52:27 UTC
0933360 Fix the potential memory leak in trace_replay (#5955) Summary: In the previous PR https://github.com/facebook/rocksdb/issues/5934 , in the while loop, if/else if was used without a final else to free the object referenced by ra, which might cause a memory leak (flagged as a warning during compilation). Fix it by changing the last "else if" to "else". Pull Request resolved: https://github.com/facebook/rocksdb/pull/5955 Test Plan: pass make asan check, pass the USE_CLANG=1 TEST_TMPDIR=/dev/shm/rocksdb OPT=-g make -j64 analyze. Differential Revision: D18071612 Pulled By: zhichao-cao fbshipit-source-id: 51c00023d0c97c2921507254329aed55d56e1786 22 October 2019, 23:39:46 UTC
c0abc6b Use FLAGS_env for certain operations in db_bench (#5943) Summary: Since we already parse env_uri from the command line and create a custom Env accordingly, we should invoke the methods of such Envs instead of using Env::Default(). Test Plan (on devserver): ``` $make db_bench db_stress $./db_bench -benchmarks=fillseq ./db_stress ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/5943 Differential Revision: D18018550 Pulled By: riversand963 fbshipit-source-id: 03b61329aaae0dfd914a0b902cc677f570f102e3 22 October 2019, 18:43:21 UTC
925250f Include db_stress_tool in rocksdb tools lib (#5950) Summary: include db_stress_tool in rocksdb tools lib Test Plan (on devserver): ``` $make db_stress $./db_stress $make all && make check ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/5950 Differential Revision: D18044399 Pulled By: riversand963 fbshipit-source-id: 895585abbbdfd8b954965921dba4b1400b7af1b1 22 October 2019, 02:40:35 UTC
5677f4f Using clang for internal ubsan tests (#5952) Summary: Using clang for internal ubsan tests. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5952 Differential Revision: D18048810 Pulled By: vjnadimpalli fbshipit-source-id: ae55677a1928397b067e972d0ecb4ac1b7e2c8dc 22 October 2019, 02:37:00 UTC
27a1245 Fix memory leak on error opening PlainTable (#5951) Summary: Several error paths in opening of a plain table would leak memory. PR https://github.com/facebook/rocksdb/issues/5940 opened the leak to one more error path, which happens to have been (mistakenly) exercised by CuckooTableDBTest.AdaptiveTable. That test has been fixed, and the exercising of plain table error cases (more than before) has been added as BadOptions1 and BadOptions2 to PlainTableDBTest. This effectively moved the memory leak to plain_table_db_test. Also here is a cheap fix for the memory leak, without (yet?) changing the signature of ReadTableProperties. This fixes ASAN on unit tests. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5951 Test Plan: make COMPILE_WITH_ASAN=1 check Differential Revision: D18051940 Pulled By: pdillinger fbshipit-source-id: e2952930c09a2b46c4f1ff09818c5090426929de 21 October 2019, 23:53:06 UTC
7245fb5 Fix the potential memory leak of ReplayMultiThread (#5949) Summary: The pointer ra needs to be freed if the status s is not OK. In the previous PR https://github.com/facebook/rocksdb/issues/5934 , ra was not freed, which might cause a potential memory leak. Fix this issue by moving the declaration of ra inside the while loop and freeing it as desired. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5949 Test Plan: pass make asan check. Differential Revision: D18045726 Pulled By: zhichao-cao fbshipit-source-id: d5445b7b832c8bb1dafe008bafea7bfe9eb0b1ce 21 October 2019, 22:05:01 UTC
2ce6aa5 Making platform 007 (gcc 7) default in build_detect_platform.sh (#5947) Summary: Making platform 007 (gcc 7) default in build_detect_platform.sh. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5947 Differential Revision: D18038837 Pulled By: vjnadimpalli fbshipit-source-id: 9ac2ddaa93bf328a416faec028970e039886378e 21 October 2019, 19:09:29 UTC
a0cd920 LevelIterator to avoid gap after prefix bloom filters out a file (#5861) Summary: Right now, when LevelIterator::Seek() is called and a file is filtered out by the prefix bloom filter, the position is put at the beginning of the next file. This is a confusing internal interface because many keys in the levels are skipped. Avoid this behavior by checking the key of the next file against the seek key, and invalidating the whole iterator if the prefix doesn't match. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5861 Test Plan: Add a new unit test to validate the behavior; run all existing tests; run crash_test Differential Revision: D17918213 fbshipit-source-id: f06b47d937c7cc8919001f18dcc3af5b28c9cdac 21 October 2019, 18:40:57 UTC
30e2dc0 Fix VerifyChecksum readahead with mmap mode (#5945) Summary: A recent change introduced readahead inside VerifyChecksum(). However it is not compatible with mmap mode and generated wrong checksum verification failure. Fix it by not enabling readahead in mmap mode. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5945 Test Plan: Add a unit test that used to fail. Differential Revision: D18021443 fbshipit-source-id: 6f2eb600f81b26edb02222563a4006869d576bff 21 October 2019, 18:38:30 UTC
1a21afa Fix some dependency paths (#5946) Summary: Some dependency path is not correct so that ASAN cannot run with CLANG. Fix it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5946 Test Plan: Run ASAN with CLANG Differential Revision: D18040933 fbshipit-source-id: 1d82be9d350485cf1df1c792dad765188958641f 21 October 2019, 17:41:47 UTC
29ccf20 Store the filter bits reader alongside the filter block contents (#5936) Summary: Amongst other things, PR https://github.com/facebook/rocksdb/issues/5504 refactored the filter block readers so that only the filter block contents are stored in the block cache (as opposed to the earlier design where the cache stored the filter block reader itself, leading to potentially dangling pointers and concurrency bugs). However, this change introduced a performance hit since with the new code, the metadata fields are re-parsed upon every access. This patch reunites the block contents with the filter bits reader to eliminate this overhead; since this is still a self-contained pure data object, it is safe to store it in the cache. (Note: this is similar to how the zstd digest is handled.) Pull Request resolved: https://github.com/facebook/rocksdb/pull/5936 Test Plan: make asan_check filter_bench results for the old code: ``` $ ./filter_bench -quick WARNING: Assertions are enabled; benchmarks unnecessarily slow Building... Build avg ns/key: 26.7153 Number of filters: 16669 Total memory (MB): 200.009 Bits/key actual: 10.0647 ---------------------------- Inside queries... Dry run (46b) ns/op: 33.4258 Single filter ns/op: 42.5974 Random filter ns/op: 217.861 ---------------------------- Outside queries... Dry run (25d) ns/op: 32.4217 Single filter ns/op: 50.9855 Random filter ns/op: 219.167 Average FP rate %: 1.13993 ---------------------------- Done. (For more info, run with -legend or -help.) $ ./filter_bench -quick -use_full_block_reader WARNING: Assertions are enabled; benchmarks unnecessarily slow Building... Build avg ns/key: 26.5172 Number of filters: 16669 Total memory (MB): 200.009 Bits/key actual: 10.0647 ---------------------------- Inside queries... Dry run (46b) ns/op: 32.3556 Single filter ns/op: 83.2239 Random filter ns/op: 370.676 ---------------------------- Outside queries... 
Dry run (25d) ns/op: 32.2265 Single filter ns/op: 93.5651 Random filter ns/op: 408.393 Average FP rate %: 1.13993 ---------------------------- Done. (For more info, run with -legend or -help.) ``` With the new code: ``` $ ./filter_bench -quick WARNING: Assertions are enabled; benchmarks unnecessarily slow Building... Build avg ns/key: 25.4285 Number of filters: 16669 Total memory (MB): 200.009 Bits/key actual: 10.0647 ---------------------------- Inside queries... Dry run (46b) ns/op: 31.0594 Single filter ns/op: 43.8974 Random filter ns/op: 226.075 ---------------------------- Outside queries... Dry run (25d) ns/op: 31.0295 Single filter ns/op: 50.3824 Random filter ns/op: 226.805 Average FP rate %: 1.13993 ---------------------------- Done. (For more info, run with -legend or -help.) $ ./filter_bench -quick -use_full_block_reader WARNING: Assertions are enabled; benchmarks unnecessarily slow Building... Build avg ns/key: 26.5308 Number of filters: 16669 Total memory (MB): 200.009 Bits/key actual: 10.0647 ---------------------------- Inside queries... Dry run (46b) ns/op: 33.2968 Single filter ns/op: 58.6163 Random filter ns/op: 291.434 ---------------------------- Outside queries... Dry run (25d) ns/op: 32.1839 Single filter ns/op: 66.9039 Random filter ns/op: 292.828 Average FP rate %: 1.13993 ---------------------------- Done. (For more info, run with -legend or -help.) ``` Differential Revision: D17991712 Pulled By: ltamasi fbshipit-source-id: 7ea205550217bfaaa1d5158ebd658e5832e60f29 19 October 2019, 02:32:59 UTC
c53db17 Fix TestIterate for HashSkipList in db_stress (#5942) Summary: Since SeekForPrev (used by Prev) is not supported by HashSkipList when prefix is used, we disable it when stress testing HashSkipList. - Change the default memtablerep to skip list. - Avoid Prev() when memtablerep is HashSkipList and prefix is used. Test Plan (on devserver): ``` $make db_stress $./db_stress -ops_per_thread=10000 -reopen=1 -destroy_db_initially=true -column_families=1 -threads=1 -column_families=1 -memtablerep=prefix_hash $# or simply $./db_stress $./db_stress -memtablerep=prefix_hash ``` Results must print "Verification successful". Pull Request resolved: https://github.com/facebook/rocksdb/pull/5942 Differential Revision: D18017062 Pulled By: riversand963 fbshipit-source-id: af867e59aa9e6f533143c984d7d529febf232fd7 18 October 2019, 22:49:12 UTC
5f8f2fd Refactor / clean up / optimize FullFilterBitsReader (#5941) Summary: FullFilterBitsReader, after creating in BloomFilterPolicy, was responsible for decoding metadata bits. This meant that FullFilterBitsReader::MayMatch had some metadata checks in order to implement "always true" or "always false" functionality in the case of inconsistent or trivial metadata. This made for ugly mixing-of-concerns code and probably had some runtime cost. It also didn't really support plugging in alternative filter implementations with extensions to the existing metadata schema. BloomFilterPolicy::GetFilterBitsReader is now (exclusively) responsible for decoding filter metadata bits and constructing appropriate instances deriving from FilterBitsReader. "Always false" and "always true" derived classes allow FullFilterBitsReader not to be concerned with handling of trivial or inconsistent metadata. This also makes for easy expansion to alternative filter implementations in new, alternative derived classes. This change makes calls to FilterBitsReader::MayMatch *necessarily* virtual because there's now more than one built-in implementation. Compared with the previous implementation's extra 'if' checks in MayMatch, there's no consistent performance difference, measured by (an older revision of) filter_bench (differences here seem to be within noise): Inside queries... - Dry run (407) ns/op: 35.9996 + Dry run (407) ns/op: 35.2034 - Single filter ns/op: 47.5483 + Single filter ns/op: 47.4034 - Batched, prepared ns/op: 43.1559 + Batched, prepared ns/op: 42.2923 ... - Random filter ns/op: 150.697 + Random filter ns/op: 149.403 ---------------------------- Outside queries... 
- Dry run (980) ns/op: 34.6114 + Dry run (980) ns/op: 34.0405 - Single filter ns/op: 56.8326 + Single filter ns/op: 55.8414 - Batched, prepared ns/op: 48.2346 + Batched, prepared ns/op: 47.5667 - Random filter ns/op: 155.377 + Random filter ns/op: 153.942 Average FP rate %: 1.1386 Also, the FullFilterBitsReader ctor was responsible for a surprising amount of CPU in production, due in part to inefficient determination of the CACHE_LINE_SIZE used to construct the filter being read. The overwhelming common case (same as my CACHE_LINE_SIZE) is now substantially optimized, as shown with filter_bench with -new_reader_every=1 (old option - see below) (repeatable result): Inside queries... - Dry run (453) ns/op: 118.799 + Dry run (453) ns/op: 105.869 - Single filter ns/op: 82.5831 + Single filter ns/op: 74.2509 ... - Random filter ns/op: 224.936 + Random filter ns/op: 194.833 ---------------------------- Outside queries... - Dry run (aa1) ns/op: 118.503 + Dry run (aa1) ns/op: 104.925 - Single filter ns/op: 90.3023 + Single filter ns/op: 83.425 ... - Random filter ns/op: 220.455 + Random filter ns/op: 175.7 Average FP rate %: 1.13886 However PR#5936 has/will reclaim most of this cost. After that PR, the optimization of this code path is likely negligible, but nonetheless it's clear we aren't making performance any worse. Also fixed inadequate check of consistency between filter data size and num_lines. (Unit test updated.) Pull Request resolved: https://github.com/facebook/rocksdb/pull/5941 Test Plan: previously added unit tests FullBloomTest.CorruptFilters and FullBloomTest.RawSchema Differential Revision: D18018353 Pulled By: pdillinger fbshipit-source-id: 8e04c2b4a7d93223f49a237fd52ef2483929ed9c 18 October 2019, 21:50:52 UTC
fe464bc Fix PlainTableReader not to crash sst_dump (#5940) Summary: Plain table SSTs could crash sst_dump because of a bug in PlainTableReader that can leave table_properties_ as null. Even if it was intended not to keep the table properties in some cases, they were leaked on the offending code path. Steps to reproduce: $ db_bench --benchmarks=fillrandom --num=2000000 --use_plain_table --prefix-size=12 $ sst_dump --file=0000xx.sst --show_properties from [] to [] Process /dev/shm/dbbench/000014.sst Sst file format: plain table Raw user collected properties ------------------------------ Segmentation fault (core dumped) Also added missing unit testing of plain table full_scan_mode, and an assertion in NewIterator to check for regression. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5940 Test Plan: new unit test, manual, make check Differential Revision: D18018145 Pulled By: pdillinger fbshipit-source-id: 4310c755e824c4cd6f3f86a3abc20dfa417c5e07 18 October 2019, 21:44:42 UTC
526e3b9 Enable trace_replay with multi-threads (#5934) Summary: In the current trace replay, all the queries are serialized and issued by a single thread, which may not closely simulate the query pattern of the original application. Multi-threaded replay is implemented in this PR. Users can set the number of threads to replay the trace. The queries generated according to the trace records are scheduled in the thread pool job queue. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5934 Test Plan: test with make check and real trace replay. Differential Revision: D17998098 Pulled By: zhichao-cao fbshipit-source-id: 87eecf6f7c17a9dc9d7ab29dd2af74f6f60212c8 18 October 2019, 21:13:50 UTC
69bd8a2 Update HISTORY.md with recent BlobDB adjacent changes Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5939 Differential Revision: D18009096 Pulled By: ltamasi fbshipit-source-id: 032a48a302f9da38aecf4055b5a8d4e1dffd9dc7 18 October 2019, 17:24:23 UTC
e60cc09 Expose db stress tests (#5937) Summary: expose db stress test by providing db_stress_tool.h in public header. This PR does the following: - adds a new header, db_stress_tool.h, in include/rocksdb/ - renames db_stress.cc to db_stress_tool.cc - adds a db_stress.cc which simply invokes a test function. - update Makefile accordingly. Test Plan (dev server): ``` make db_stress ./db_stress ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/5937 Differential Revision: D17997647 Pulled By: riversand963 fbshipit-source-id: 1a8d9994f89ce198935566756947c518f0052410 18 October 2019, 16:46:44 UTC
fdc1cb4 Support decoding blob indexes in sst_dump (#5926) Summary: The patch adds a new command line parameter --decode_blob_index to sst_dump. If this switch is specified, sst_dump prints blob indexes in a human readable format, printing the blob file number, offset, size, and expiration (if applicable) for blob references, and the blob value (and expiration) for inlined blobs. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5926 Test Plan: Used db_bench's BlobDB mode to generate SST files containing blob references with and without expiration, as well as inlined blobs with and without expiration (note: the latter are stored as plain values), and confirmed sst_dump correctly prints all four types of records. Differential Revision: D17939077 Pulled By: ltamasi fbshipit-source-id: edc5f58fee94ba35f6699c6a042d5758f5b3963d 18 October 2019, 02:36:54 UTC
1f9d7c0 Fix OnFlushCompleted fired before flush result write to MANIFEST (#5908) Summary: When there are concurrent flush jobs on the same CF, `OnFlushCompleted` can be called before the flush result is installed in the LSM tree. Fix the issue by passing `FlushJobInfo` through `MemTable`, so that the thread that commits the flush result can fetch the `FlushJobInfo` and fire `OnFlushCompleted` on behalf of the thread actually writing the SST. Fix https://github.com/facebook/rocksdb/issues/5892 Pull Request resolved: https://github.com/facebook/rocksdb/pull/5908 Test Plan: Add new test. The test will fail without the fix. Differential Revision: D17916144 Pulled By: riversand963 fbshipit-source-id: e18df67d9533b5baee52ae3605026cdeb05cbe10 16 October 2019, 17:40:23 UTC
2c9e9f2 Update HISTORY for SeekForPrev bug fix (#5925) Summary: Update history for the bug fix in https://github.com/facebook/rocksdb/pull/5907 Pull Request resolved: https://github.com/facebook/rocksdb/pull/5925 Differential Revision: D17952605 Pulled By: maysamyabandeh fbshipit-source-id: 609afcbb2e4087f9153822c4d11193a75a7b0e7a 16 October 2019, 14:59:26 UTC
5ef27de Fix clang analyzer error (#5924) Summary: Without this PR, clang analyzer complains. ``` $USE_CLANG=1 make analyze db/compaction/compaction_job_test.cc:161:20: warning: The left operand of '==' is a garbage value if (key.type == kTypeBlobIndex) { ~~~~~~~~ ^ 1 warning generated. ``` Test Plan (on devserver) ``` $USE_CLANG=1 make analyze ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/5924 Differential Revision: D17923226 Pulled By: riversand963 fbshipit-source-id: 9d1eb769b5e0de7cb3d89dc90d1cfa895db7fdc8 15 October 2019, 05:14:24 UTC
78b28d8 Support non-TTL Puts for BlobDB in db_bench (#5921) Summary: Currently, db_bench only supports PutWithTTL operations for BlobDB but not regular Puts. The patch adds support for regular (non-TTL) Puts and also changes the default for blob_db_max_ttl_range to zero, which corresponds to no TTL. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5921 Test Plan: make check ./db_bench -benchmarks=fillrandom -statistics -stats_interval_seconds=1 -duration=90 -num=500000 -use_blob_db=1 -blob_db_file_size=1000000 -target_file_size_base=1000000 (issues Put operations with no TTL) ./db_bench -benchmarks=fillrandom -statistics -stats_interval_seconds=1 -duration=90 -num=500000 -use_blob_db=1 -blob_db_file_size=1000000 -target_file_size_base=1000000 -blob_db_max_ttl_range=86400 (issues PutWithTTL operations with random TTLs in the [0, blob_db_max_ttl_range) interval, as before) Differential Revision: D17919798 Pulled By: ltamasi fbshipit-source-id: b946c3522b836b92b4c157ffbad24f92ba2b0a16 15 October 2019, 00:49:20 UTC
93edd51 bloom_test.cc: include <array> (#5920) Summary: Fix build failure on some platforms, reported in issue https://github.com/facebook/rocksdb/issues/5914 Pull Request resolved: https://github.com/facebook/rocksdb/pull/5920 Test Plan: make bloom_test && ./bloom_test Differential Revision: D17918328 Pulled By: pdillinger fbshipit-source-id: b822004d4442de0171db2aeff433677783f7b94e 14 October 2019, 22:38:31 UTC
5f025ea BlobDB GC: add SST <-> oldest blob file referenced mapping (#5903) Summary: This is groundwork for adding garbage collection support to BlobDB. The patch adds logic that keeps track of the oldest blob file referred to by each SST file. The oldest blob file is identified during flush/ compaction (similarly to how the range of keys covered by the SST is identified), and persisted in the manifest as a custom field of the new file edit record. Blob indexes with TTL are ignored for the purposes of identifying the oldest blob file (since such blob files are cleaned up by the TTL logic in BlobDB). Pull Request resolved: https://github.com/facebook/rocksdb/pull/5903 Test Plan: Added new unit tests; also ran db_bench in BlobDB mode, inspected the manifest using ldb, and confirmed (by scanning the SST files using sst_dump) that the value of the oldest blob file number field matches the contents of the file for each SST. Differential Revision: D17859997 Pulled By: ltamasi fbshipit-source-id: 21662c137c6259a6af70446faaf3a9912c550e90 14 October 2019, 22:21:01 UTC
a59dc84 Move blob_index.h to db/ (#5919) Summary: Extracted from PR https://github.com/facebook/rocksdb/issues/5903 for technical reasons. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5919 Test Plan: make check Differential Revision: D17910132 Pulled By: ltamasi fbshipit-source-id: 6ecbb8d6e84b2a1d1f28575ad48ac3cc65833eb5 14 October 2019, 19:54:05 UTC
231fffd Add Env::SanitizeEnvOptions (#5885) Summary: Add Env::SanitizeEnvOptions to allow underlying environments properly configure env options. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5885 Test Plan: ``` make check ``` Differential Revision: D17910327 Pulled By: riversand963 fbshipit-source-id: 86a1ac616e485742c35c4a9cc9f1227c529fc00f 14 October 2019, 19:25:00 UTC