sort by:
Revision Author Date Message Commit Date
db36f22 Allow options file in db_stress and db_crashtest Summary: - When options file is provided to db_stress, take supported options from the file instead of from flags - Call `BuildOptionsTable` after `Open` so it can use `options_` once it has been populated either from flags or from file - Allow options filename to be passed via `db_crashtest.py` Closes https://github.com/facebook/rocksdb/pull/3768 Differential Revision: D7755331 Pulled By: ajkr fbshipit-source-id: 5205cc5deb0d74d677b9832174153812bab9a60a 27 April 2018, 01:42:07 UTC
7004e45 Remove block-based table assertion for non-empty filter block Summary: 7a6353bd1c516fe3f7118248c77035697c5ac247 prevents empty filter blocks from being written for SST files containing range deletions only. However the assertion this PR removes is still a problem as we could be reading from a DB generated by a RocksDB build without the 7a6353bd1c516fe3f7118248c77035697c5ac247 patch. So remove the assertion. We already don't do this check when `cache_index_and_filter_blocks=false`, so it should be safe. Closes https://github.com/facebook/rocksdb/pull/3773 Differential Revision: D7769964 Pulled By: ajkr fbshipit-source-id: 7285762446f2cd2ccf16efd7a988a106fbb0d8d3 26 April 2018, 21:43:11 UTC
63c965c Sync parent directory after deleting a file in delete scheduler Summary: sync parent directory after deleting a file in delete scheduler. Otherwise, trim speed may not be as smooth as what we want. Closes https://github.com/facebook/rocksdb/pull/3767 Differential Revision: D7760136 Pulled By: siying fbshipit-source-id: ec131d53b61953f09c60d67e901e5eeb2716b05f 26 April 2018, 20:58:20 UTC
7e4e381 Fix the bloom filter skipping empty prefixes Summary: bc0da4b5125ac4f43c88879522013814355338e7 optimized bloom filters by skipping duplicate entires when the whole key and prefixes are both added to the bloom. It however used empty string as the initial value of the last entry added to the bloom. This is incorrect since empty key/prefix are valid entires by themselves. This patch fixes that. Closes https://github.com/facebook/rocksdb/pull/3776 Differential Revision: D7778803 Pulled By: maysamyabandeh fbshipit-source-id: d5a065daebee17f9403cac51e9d5626aac87bfbc 26 April 2018, 20:28:31 UTC
e5a4dac WritePrepared Txn: disable rollback in stress test Summary: WritePrepared rollback implementation is not ready to be invoked in the middle of workload. This is due the lack of synchronization to obtain the cf handle from db. Temporarily disabling this until the problem with rollback is fixed. Closes https://github.com/facebook/rocksdb/pull/3772 Differential Revision: D7769041 Pulled By: maysamyabandeh fbshipit-source-id: 0e3b0ce679bc2afba82e653a40afa3f045722754 26 April 2018, 16:27:55 UTC
7c9f23e Rate limiter should be allowed to share between different rocksdb instances in C API Summary: Currently, the `rocksdb_options_set_ratelimiter` in `c.cc` will change the input to nil, which make it is not possible to use the shared rate limiter create by `rocksdb_ratelimiter_create` in different rocksdb option. In this pr, I changed it to shared ptr. Closes https://github.com/facebook/rocksdb/pull/3758 Differential Revision: D7749740 Pulled By: ajkr fbshipit-source-id: c6121f8ca75402afdb4b295ce63c2338d253a1b5 25 April 2018, 22:57:48 UTC
406b951 Fix clang build failure with -Wgnu-redeclared-enum Summary: In include/rocksdb/db.h, enum EntryType is redeclared even though original declaration in types.h in included. Closes https://github.com/facebook/rocksdb/pull/3766 Differential Revision: D7765504 Pulled By: anand1976 fbshipit-source-id: 622a8ecb306993915be1b9dd5cdd79dbc6a4ea05 25 April 2018, 22:42:46 UTC
13a0bd9 cmake: add options for enabling TBB and NUMA support Summary: see also https://github.com/facebook/rocksdb/issues/3036 Signed-off-by: Kefu Chai <tchaikov@gmail.com> Closes https://github.com/facebook/rocksdb/pull/3750 Differential Revision: D7765170 Pulled By: ajkr fbshipit-source-id: 455788b3131bf62a4987a65684b757e68473eed9 25 April 2018, 21:26:55 UTC
dfc61e7 initialize local variable for UBSAN in PosixEnv function Summary: It seems clear to me that the variable is initialized before line 492, but it wasn't clear to UBSAN. The failure was: ``` In file included from ./env/io_posix.h:14:0, from env/env_posix.cc:44: ./include/rocksdb/env.h: In member function ‘virtual rocksdb::Status rocksdb::{anonymous}::PosixEnv::NewMemoryMappedFileBuffer(const string&, std::unique_ptr<rocksdb::MemoryMappedFileBuffer>*)’: ./include/rocksdb/env.h:822:36: error: ‘base’ may be used uninitialized in this function [-Werror=maybe-uninitialized] : base(_base), length(_length) {} ^ env/env_posix.cc:482:11: note: ‘base’ was declared here void* base; ``` We can just initialize to nullptr to keep UBSAN happy. Closes https://github.com/facebook/rocksdb/pull/3770 Differential Revision: D7756287 Pulled By: ajkr fbshipit-source-id: 0f2efb9594e2d3a30706a4ca7e1d4a6328031bf2 25 April 2018, 20:42:02 UTC
9b89479 Pass -latomic to linker when using clang Summary: clang compilation is failing due to a4fb1f8c049ee9d61a9da8cf23b64d2c7d36a33f. In that commit I added a call to `std::atomic::is_lock_free` which was evidently relying on a compiler builtin only present in gcc. Drawbacks to this fix are: - users may need to install libatomic - there might be cases where clang is used even though USE_CLANG is unset (e.g., when clang is the only available compiler). I didn't figure out how to add -latomic in those cases... An alternative fix mentioned in http://lists.llvm.org/pipermail/llvm-bugs/2017-August/057263.html is using -stdlib=libc++ with clang. Closes https://github.com/facebook/rocksdb/pull/3769 Differential Revision: D7756261 Pulled By: ajkr fbshipit-source-id: 26888300683fa9970ab5950239d1aa217e8efd49 25 April 2018, 19:13:41 UTC
a4fb1f8 Add crash-recovery correctness check to db_stress Summary: Previously, our `db_stress` tool held the expected state of the DB in-memory, so after crash-recovery, there was no way to verify data correctness. This PR adds an option, `--expected_values_file`, which specifies a file holding the expected values. In black-box testing, the `db_stress` process can be killed arbitrarily, so updates to the `--expected_values_file` must be atomic. We achieve this by `mmap`ing the file and relying on `std::atomic<uint32_t>` for atomicity. Actually this doesn't provide a total guarantee on what we want as `std::atomic<uint32_t>` could, in theory, be translated into multiple stores surrounded by a mutex. We can verify our assumption by looking at `std::atomic::is_always_lock_free`. For the `mmap`'d file, we didn't have an existing way to expose its contents as a raw memory buffer. This PR adds it in the `Env::NewMemoryMappedFileBuffer` function, and `MemoryMappedFileBuffer` class. `db_crashtest.py` is updated to use an expected values file for black-box testing. On the first iteration (when the DB is created), an empty file is provided as `db_stress` will populate it when it runs. On subsequent iterations, that same filename is provided so `db_stress` can check the data is as expected on startup. Closes https://github.com/facebook/rocksdb/pull/3629 Differential Revision: D7463144 Pulled By: ajkr fbshipit-source-id: c8f3e82c93e045a90055e2468316be155633bd8b 24 April 2018, 22:58:22 UTC
bc0da4b Skip duplicate bloom keys when whole_key and prefix are mixed Summary: Currently we rely on FilterBitsBuilder to skip the duplicate keys. It does that by comparing that hash of the key to the hash of the last added entry. This logic breaks however when we have whole_key_filtering mixed with prefix blooms as their addition to FilterBitsBuilder will be interleaved. The patch fixes that by comparing the last whole key and last prefix with the whole key and prefix of the new key respectively and skip the call to FilterBitsBuilder if it is a duplicate. Closes https://github.com/facebook/rocksdb/pull/3764 Differential Revision: D7744413 Pulled By: maysamyabandeh fbshipit-source-id: 15df73bbbafdfd754d4e1f42ea07f47b03bc5eb8 24 April 2018, 17:58:16 UTC
090c78a Support lowering CPU priority of background threads Summary: Background activities like compaction can negatively affect latency of higher-priority tasks like request processing. To avoid this, rocksdb already lowers the IO priority of background threads on Linux systems. While this takes care of typical IO-bound systems, it does not help much when CPU (temporarily) becomes the bottleneck. This is especially likely when using more expensive compression settings. This patch adds an API to allow for lowering the CPU priority of background threads, modeled on the IO priority API. Benchmarks (see below) show significant latency and throughput improvements when CPU bound. As a result, workloads with some CPU usage bursts should benefit from lower latencies at a given utilization, or should be able to push utilization higher at a given request latency target. A useful side effect is that compaction CPU usage is now easily visible in common tools, allowing for an easier estimation of the contribution of compaction vs. request processing threads. As with IO priority, the implementation is limited to Linux, degrading to a no-op on other systems. Closes https://github.com/facebook/rocksdb/pull/3763 Differential Revision: D7740096 Pulled By: gwicke fbshipit-source-id: e5d32373e8dc403a7b0c2227023f9ce4f22b413c 24 April 2018, 15:41:51 UTC
affe01b Improve write time breakdown stats Summary: There's a group of stats in PerfContext for profiling the write path. They break down the write time into WAL write, memtable insert, throttling, and everything else. We use these stats a lot for figuring out the cause of slow writes. These stats got a bit out of date and are now categorizing some interesting things as "everything else", and also do some double counting. This PR fixes it and adds two new stats: time spent waiting for other threads of the batch group, and time spent waiting for scheduling flushes/compactions. Probably these will be enough to explain all the occasional abnormally slow (multiple seconds) writes that we're seeing. Closes https://github.com/facebook/rocksdb/pull/3602 Differential Revision: D7251562 Pulled By: al13n321 fbshipit-source-id: 0a2d0f5a4fa5677455e1f566da931cb46efe2a0d 24 April 2018, 00:58:54 UTC
d5afa73 Revert "Skip deleted WALs during recovery" Summary: This reverts commit 73f21a7b2177aeb82b9f518222e2b9ea8fbb7c4f. It breaks compatibility. When created a DB using a build with this new change, opening the DB and reading the data will fail with this error: "Corruption: Can't access /000000.sst: IO error: while stat a file for size: /tmp/xxxx/000000.sst: No such file or directory" This is because the dummy AddFile4 entry generated by the new code will be treated as a real entry by an older build. The older build will think there is a real file with number 0, but there isn't such a file. Closes https://github.com/facebook/rocksdb/pull/3762 Differential Revision: D7730035 Pulled By: siying fbshipit-source-id: f2051859eff20ef1837575ecb1e1bb96b3751e77 23 April 2018, 19:01:26 UTC
a8a28da Avoid directory renames in BackupEngine Summary: We used to name private directories like "1.tmp" while BackupEngine populated them, and then rename without the ".tmp" suffix (i.e., rename "1.tmp" to "1") after all files were copied. On glusterfs, directory renames like this require operations across many hosts, and partial failures have caused operational problems. Fortunately we don't need to rename private directories. We already have a meta-file that uses the tempfile-rename pattern to commit a backup atomically after all its files have been successfully copied. So we can copy private files directly to their final location, so now there's no directory rename. Closes https://github.com/facebook/rocksdb/pull/3749 Differential Revision: D7705610 Pulled By: ajkr fbshipit-source-id: fd724a28dd2bf993ce323a5f2cb7e7d6980cc346 21 April 2018, 00:28:33 UTC
2e72a58 Disable EnvPosixTest::FilePermission Summary: The test is flaky in our CI but could not be reproduce manually on the same CI host. Disabling it. Closes https://github.com/facebook/rocksdb/pull/3753 Differential Revision: D7716320 Pulled By: yiwu-arbug fbshipit-source-id: 6bed3b05880c1d24e8dc86bc970e5181bc98fb45 20 April 2018, 22:42:42 UTC
bb2a2ec WritePrepared Txn: rollback via commit Summary: Currently WritePrepared rolls back a transaction with prepare sequence number prepare_seq by i) write a single rollback batch with rollback_seq, ii) add <rollback_seq, rollback_seq> to commit cache, iii) remove prepare_seq from PrepareHeap. This is correct assuming that there is no snapshot taken when a transaction is rolled back. This is the case the way MySQL does rollback which is after recovery. Otherwise if max_evicted_seq advances the prepare_seq, the live snapshot might assume data as committed since it does not find them in CommitCache. The change is to simply add <prepare_seq. rollback_seq> to commit cache before removing prepare_seq from PrepareHeap. In this way if max_evicted_seq advances prpeare_seq, the existing mechanism that we have to check evicted entries against live snapshots will make sure that the live snapshot will not see the data of rolled back transaction. Closes https://github.com/facebook/rocksdb/pull/3745 Differential Revision: D7696193 Pulled By: maysamyabandeh fbshipit-source-id: c9a2d46341ddc03554dded1303520a1cab74ef9c 20 April 2018, 22:28:19 UTC
dbdaa46 Add a stat for MultiGet keys found, update memtable hit/miss stats Summary: 1. Add a new ticker stat rocksdb.number.multiget.keys.found to track the number of keys successfully read 2. Update rocksdb.memtable.hit/miss in DBImpl::MultiGet(). It was being done in DBImpl::GetImpl(), but not MultiGet Closes https://github.com/facebook/rocksdb/pull/3730 Differential Revision: D7677364 Pulled By: anand1976 fbshipit-source-id: af22bd0ef8ddc5cf2b4244b0a024e539fe48bca5 20 April 2018, 22:28:19 UTC
c3d1e36 WritePrepared Txn: enable TryAgain for duplicates at the end of the batch Summary: The WriteBatch::Iterate will try with a larger sequence number if the memtable reports a duplicate. This status is specified with TryAgain status. So far the assumption was that the last entry in the batch will never return TryAgain, which is correct when WAL is created via WritePrepared since it always appends a batch separator if a natural one does not exist. However when reading a WAL generated by WriteCommitted this batch separator might not exist. Although WritePrepared is not supposed to be able to read the WAL generated by WriteCommitted we should avoid confusing scenarios in which the behavior becomes unpredictable. The path fixes that by allowing TryAgain even for the last entry of the write batch. Closes https://github.com/facebook/rocksdb/pull/3747 Differential Revision: D7708391 Pulled By: maysamyabandeh fbshipit-source-id: bfaddaa9b14a4cdaff6977f6f63c789a6ab1ee0d 20 April 2018, 22:28:19 UTC
17e0403 Propagate fill_cache config to partitioned index iterator Summary: Currently the partitioned index iterator creates a new ReadOptions which ignores the fill_cache config set to ReadOptions passed by the user. The patch propagates fill_cache from the user's ReadOptions to that of partition index iterator. Also it clarifies the contract of fill_cache that i) it does not apply to filters, ii) it still charges block cache for the size of the data block, it still pin the block if it is already in the block cache. Closes https://github.com/facebook/rocksdb/pull/3739 Differential Revision: D7678308 Pulled By: maysamyabandeh fbshipit-source-id: 53ed96424ae922e499e2d4e3580ddc3f0db893da 20 April 2018, 22:13:05 UTC
dee95a1 Fix GitHub issue #3716: gcc-8 warnings Summary: Fix the following gcc-8 warnings: - conflicting C language linkage declaration [-Werror] - writing to an object with no trivial copy-assignment [-Werror=class-memaccess] - array subscript -1 is below array bounds [-Werror=array-bounds] Solves https://github.com/facebook/rocksdb/issues/3716 Closes https://github.com/facebook/rocksdb/pull/3736 Differential Revision: D7684161 Pulled By: yiwu-arbug fbshipit-source-id: 47c0423d26b74add251f1d3595211eee1e41e54a 20 April 2018, 20:42:47 UTC
8a9c7f7 fix compilation error: implicit conversion loses integer precision Summary: Fix compilation error with clang: > tools/db_stress.cc:2598:21: error: implicit conversion loses integer precision: 'gflags::uint64' (aka 'unsigned long') to 'uint32_t' (aka 'unsigned int') [-Werror,-Wshorten-64-to-32] Random rand(FLAGS_seed); ~~~~ ^~~~~~~~~~ Closes https://github.com/facebook/rocksdb/pull/3746 Differential Revision: D7703209 Pulled By: miasantreble fbshipit-source-id: 18c56a5138a2f308e4213594bc82e8e64bc21570 20 April 2018, 01:57:43 UTC
69faddb CMake: Read rocksdb version from version.h header file Summary: This replaces reading the rocksdb version by external shell script. This does not work reliably on Windows (I wander how it works on AppVeyor). Closes https://github.com/facebook/rocksdb/pull/3737 Differential Revision: D7703106 Pulled By: ajkr fbshipit-source-id: 4079c7c77431757e9ddc801363ed896b18fdbf23 20 April 2018, 00:42:11 UTC
e1e826b check return status for Sync() and Append() calls to avoid corruption Summary: Right now in `SyncClosedLogs`, `CopyFile`, and `AddRecord`, where `Sync` and `Append` are invoked in a loop, the error status are not checked. This could lead to potential corruption as later calls will overwrite the error status. Closes https://github.com/facebook/rocksdb/pull/3740 Differential Revision: D7678848 Pulled By: miasantreble fbshipit-source-id: 4b0b412975989dfe80348f73217b9c4122a4bd77 19 April 2018, 21:13:46 UTC
ad51168 Add block cache related DB properties Summary: Add DB properties "rocksdb.block-cache-capacity", "rocksdb.block-cache-usage", "rocksdb.block-cache-pinned-usage" to show block cache usage. Closes https://github.com/facebook/rocksdb/pull/3734 Differential Revision: D7657180 Pulled By: yiwu-arbug fbshipit-source-id: dd34a019d5878dab539c51ee82669e97b2b745fd 19 April 2018, 04:42:25 UTC
3cea613 include thread-pool priority in thread names Summary: Previously threads were named "rocksdb:bg\<index in thread pool\>", so the first thread in all thread pools would be named "rocksdb:bg0". Users want to be able to distinguish threads used for flush (high-pri) vs regular compaction (low-pri) vs compaction to bottom-level (bottom-pri). So I changed the thread naming convention to include the thread-pool priority. Closes https://github.com/facebook/rocksdb/pull/3702 Differential Revision: D7581415 Pulled By: ajkr fbshipit-source-id: ce04482b6acd956a401ef22dc168b84f76f7d7c1 19 April 2018, 00:27:56 UTC
6d06be2 Improve db_stress with transactions Summary: db_stress was already capable running transactions by setting use_txn. Running it under stress showed a couple of problems fixed in this patch. - The uncommitted transaction must be either rolled back or commit after recovery. - Current implementation of WritePrepared transaction cannot handle cf drop before crash. Clarified that in the comments and added safety checks. When running with use_txn, clear_column_family_one_in must be set to 0. Closes https://github.com/facebook/rocksdb/pull/3733 Differential Revision: D7654419 Pulled By: maysamyabandeh fbshipit-source-id: a024bad80a9dc99677398c00d29ff17d4436b7f3 18 April 2018, 23:32:35 UTC
2ee1496 Add missing whitespace. Summary: Closes https://github.com/facebook/rocksdb/pull/3729 Differential Revision: D7645465 Pulled By: riversand963 fbshipit-source-id: a64da0960fe6c39847ef848b8888fe9a9c1df25d 17 April 2018, 16:57:40 UTC
2c2f388 db_bench fillXXXdeterministic should respect compression type Summary: db_bench fillXXXdeterministic should respect compression type when calling CompactFiles(). Closes https://github.com/facebook/rocksdb/pull/3731 Differential Revision: D7647761 Pulled By: yiwu-arbug fbshipit-source-id: 15e12429e0dd93ece2231b015f2e26c2d94781e6 17 April 2018, 01:01:47 UTC
b4f3339 Improve the comment on TableFactory::NewTableReader() Summary: `DBImpl::AddFile()` has been replaced by `DBImpl::IngestExternalFile()`. Closes https://github.com/facebook/rocksdb/pull/3726 Differential Revision: D7646875 Pulled By: ajkr fbshipit-source-id: 241eb7a8d88527fdc5c26b0c3f6faec3296451f8 16 April 2018, 23:58:20 UTC
5e48811 Initialize a boolean member variable of a struct. Summary: The reason for this initialization is that LLVM UBSAN check will fail due to uninitialized bool. [StackOverflow post](https://stackoverflow.com/questions/31420154/runtime-error-load-of-value-127-which-is-not-a-valid-value-for-type-bool). UBSAN log: > ===== Running external_sst_file_basic_test [==========] Running 7 tests from 1 test case. [----------] Global test environment set-up. [----------] 7 tests from ExternalSSTFileBasicTest [ RUN ] ExternalSSTFileBasicTest.Basic [ OK ] ExternalSSTFileBasicTest.Basic (6 ms) [ RUN ] ExternalSSTFileBasicTest.NoCopy db/external_sst_file_ingestion_job.h:23:8: runtime error: load of value 253, which is not a valid value for type 'bool' miasantreble I've tested this locally using the following command. ``` TEST_TMPDIR=/dev/shm/rocksdb COMPILE_WITH_UBSAN=1 OPT=-g make J=1 -j8 ubsan_check ``` ajkr This PR is related to your review comment in [PR](https://github.com/facebook/rocksdb/pull/3713/). It turns out that, with UBSAN enabled, we must provide a default value for boolean member variables. Closes https://github.com/facebook/rocksdb/pull/3728 Differential Revision: D7642476 Pulled By: riversand963 fbshipit-source-id: 4c09a4b8d271151cb99ae7393db9e4ad9f29762e 16 April 2018, 21:28:01 UTC
af95aec use delete[] to dealloc an array Summary: fix a bug in `db_stress` where an int array was incorrectly deallocated using delete instead of delete[] Closes https://github.com/facebook/rocksdb/pull/3725 Differential Revision: D7634749 Pulled By: miasantreble fbshipit-source-id: 489b776f5f4c03de1824edac5495787ec19cc910 16 April 2018, 06:56:39 UTC
954b496 fix memory leak in two_level_iterator Summary: this PR fixes a few failed contbuild: 1. ASAN memory leak in Block::NewIterator (table/block.cc:429). the proper destruction of first_level_iter_ and second_level_iter_ of two_level_iterator.cc is missing from the code after the refactoring in https://github.com/facebook/rocksdb/pull/3406 2. various unused param errors introduced by https://github.com/facebook/rocksdb/pull/3662 3. updated comment for `ForceReleaseCachedEntry` to emphasize the use of `force_erase` flag. Closes https://github.com/facebook/rocksdb/pull/3718 Reviewed By: maysamyabandeh Differential Revision: D7621192 Pulled By: miasantreble fbshipit-source-id: 476c94264083a0730ded957c29de7807e4f5b146 16 April 2018, 00:26:26 UTC
9fcd82e cmake: append rados to THIRDPARTY_LIBS before appending it to LIBS Summary: otherwise the env_librados_test executable will fail to link against librados. Signed-off-by: Kefu Chai <tchaikov@gmail.com> Closes https://github.com/facebook/rocksdb/pull/3724 Differential Revision: D7631542 Pulled By: ajkr fbshipit-source-id: 38afbf21f9aeb7dedfb840aba8b2f8b421f9edb0 15 April 2018, 20:27:54 UTC
81d44f2 fix-typo: add missing periods Summary: Closes https://github.com/facebook/rocksdb/pull/3720 Differential Revision: D7631525 Pulled By: ajkr fbshipit-source-id: 50cf4dc363b0d32b150d963011171a8a6f53a384 15 April 2018, 20:12:23 UTC
28087ac Implemented Knuth shuffle to construct permutation for selecting no_o… Summary: …verwrite_keys. Also changed each no_overwrite_key set to an unordered set, otherwise Knuth shuffle only gets you 2x time improvement, because insertion (and subsequent internal sorting) into an ordered set is the bottleneck. With this change, each iteration of permutation construction and prefix selection takes around 40 secs, as opposed to 360 secs previously. However, this still means that with the default 10 CF per blackbox test case, the test is going to time out given the default interval of 200 secs. Also, there is currently an assertion error affecting all blackbox tests in db_crashtest.py; this assertion error will be fixed in a future PR. Closes https://github.com/facebook/rocksdb/pull/3699 Differential Revision: D7624616 Pulled By: amytai fbshipit-source-id: ea64fbe83407ff96c1c0ecabbc6c830576939393 14 April 2018, 05:13:13 UTC
a0102aa Make database files' permissions configurable Summary: Closes https://github.com/facebook/rocksdb/pull/3709 Differential Revision: D7610227 Pulled By: xiaofeidu008 fbshipit-source-id: 88a52f0f9f96e2195fccde995cf9760b785e9f07 13 April 2018, 20:13:04 UTC
31ee4bf add kEntryRangeDeletion Summary: When there are many range deletions in a range, we want to trigger manual compaction on this range to reclaim disk space as soon as possible and speed up read. After this change, we can collect informations of range deletions and store them into user properties which can guide our manual compaction. Closes https://github.com/facebook/rocksdb/pull/3695 Differential Revision: D7570322 Pulled By: ajkr fbshipit-source-id: c358fa43b0aac6cc954d2eadc7d3bd8015373369 13 April 2018, 18:27:17 UTC
1f5457e Merge raw and shared pointer log method impls Summary: Calling rocksdb::Log, rocksdb::Info, etc with a `shared_ptr<Logger>` should behave the same as calling those functions with a `Logger *`. This PR achieves it by making the `shared_ptr<Logger>` versions delegate to the `Logger *` versions. Closes #3689 Closes https://github.com/facebook/rocksdb/pull/3710 Differential Revision: D7595557 Pulled By: ajkr fbshipit-source-id: 64dd7f20fd42dc821bac7b8032705c35b483e00d 13 April 2018, 18:12:54 UTC
c81b0ab Improve accuracy of I/O stats collection of external SST ingestion. Summary: RocksDB supports ingestion of external ssts. If ingestion_options.move_files is true, when performing ingestion, RocksDB first tries to link external ssts. If external SST file resides on a different FS, or the underlying FS does not support hard link, then RocksDB performs actual file copy. However, no matter which choice is made, current code increase bytes-written when updating compaction stats, which is inaccurate when RocksDB does NOT copy file. Rename a sync point. Closes https://github.com/facebook/rocksdb/pull/3713 Differential Revision: D7604151 Pulled By: riversand963 fbshipit-source-id: dd0c0d9b9a69c7d9ffceafc3d9c23371aa413586 13 April 2018, 17:58:42 UTC
3be9b36 comment unused parameters to turn on -Wunused-parameter flag Summary: This PR comments out the rest of the unused arguments which allow us to turn on the -Wunused-parameter flag. This is the second part of a codemod relating to https://github.com/facebook/rocksdb/pull/3557. Closes https://github.com/facebook/rocksdb/pull/3662 Differential Revision: D7426121 Pulled By: Dayvedde fbshipit-source-id: 223994923b42bd4953eb016a0129e47560f7e352 13 April 2018, 00:59:16 UTC
d15397b WritePrepared Txn: rollback_merge_operands hack Summary: This is a hack as temporary fix of MyRocks with rollbacking the merge operands. The way MyRocks uses merge operands is without protection of locks, which violates the assumption behind the rollback algorithm. They are ok with not being rolled back as it would just create a gap in the autoincrement column. The hack add an option to disable the rollback of merge operands by default and only enables it to let the unit test pass. Closes https://github.com/facebook/rocksdb/pull/3711 Differential Revision: D7597177 Pulled By: maysamyabandeh fbshipit-source-id: 544be0f666c7e7abb7f651ec8b23124e05056728 12 April 2018, 18:58:11 UTC
6f5e644 WritePrepared Txn: fix smallest_prep atomicity issue Summary: We introduced smallest_prep optimization in this commit b225de7e10f02be6d00e96b9fb86dfef880babdf, which enables storing the smallest uncommitted sequence number along with the snapshot. This enables the readers that read from the snapshot to skip further checks and safely assumed the data is committed if its sequence number is less than smallest uncommitted when the snapshot was taken. The problem was that smallest uncommitted and the snapshot must be taken atomically, and the lack of atomicity had led to readers using a smallest uncommitted after the snapshot was taken and hence mistakenly skipping some data. This patch fixes the problem by i) separating the process of removing of prepare entries from the AddCommitted function, ii) removing the prepare entires AFTER the committed sequence number is published, iii) getting smallest uncommitted (from the prepare list) BEFORE taking a snapshot. This guarantees that the smallest uncommitted that is accompanied with a snapshot is less than or equal of such number if it was obtained atomically. Tested by running MySQLStyleTransactionTest/MySQLStyleTransactionTest.TransactionStressTest that was failing sporadically. Closes https://github.com/facebook/rocksdb/pull/3703 Differential Revision: D7581934 Pulled By: maysamyabandeh fbshipit-source-id: dc9d6f4fb477eba75d4d5927326905b548a96a32 12 April 2018, 03:11:51 UTC
d42bd04 Improve visibility into the reasons for compaction. Summary: Add `compaction_reason` as part of event log for event `compaction started`. Add counters for each `CompactionReason`. Closes https://github.com/facebook/rocksdb/pull/3679 Differential Revision: D7550348 Pulled By: riversand963 fbshipit-source-id: a19cff3a678c785aa5ef41aac78b9a5968fcc34d 11 April 2018, 17:58:44 UTC
019d789 fix calling SetOptions on deprecated options Summary: In `cf_options_type_info`, the deprecated options are all considered to have offset zero in the `MutableCFOptions` struct. Previously we weren't checking in `GetMutableOptionsFromStrings` whether the provided option was deprecated or not and simply writing the provided value to the offset specified by `cf_options_type_info`. That meant setting any deprecated option would overwrite the first element in the struct, which is `write_buffer_size`. `db_stress` hit this often since it calls `SetOptions` with `soft_rate_limit=0` and `hard_rate_limit=0`, which are both deprecated so cause `write_buffer_size` to be set to zero, which causes it to crash on the following assertion: ``` db_stress: db/memtable.cc:106: rocksdb::MemTable::MemTable(const rocksdb::InternalKeyComparator&, const rocksdb::ImmutableCFOptions&, const rocksdb::MutableCFOptions&, rocksdb::WriteBufferManager*, rocksdb::SequenceNumber, uint32_t): Assertion `!ShouldScheduleFlush()' failed. ``` We fix it by skipping deprecated options (and logging a warning) when users provide them to `SetOptions`. I didn't want to fail the call for compatibility reasons. Closes https://github.com/facebook/rocksdb/pull/3700 Differential Revision: D7572596 Pulled By: ajkr fbshipit-source-id: bd5d84e14c0c39f30c5d4c6df7c1503d2c28ecf1 11 April 2018, 02:02:09 UTC
d95014b fix some text in comments. Summary: 1. Remove redundant text. 2. Make terminology consistent across all comments and doc of RocksDB. Also do our best to conform to conventions. Specifically, use 'callback' instead of 'call-back' [wikipedia](https://en.wikipedia.org/wiki/Callback_(computer_programming)). Closes https://github.com/facebook/rocksdb/pull/3693 Differential Revision: D7560396 Pulled By: riversand963 fbshipit-source-id: ba8c251c487f4e7d1872a1a8dc680f9e35a6ffb8 10 April 2018, 22:59:24 UTC
2770a94 make MockTimeEnv::current_time_ atomic to fix data race Summary: fix a new TSAN failure https://gist.github.com/miasantreble/7599c33f4e17da1024c67d4540dbe397 Closes https://github.com/facebook/rocksdb/pull/3694 Differential Revision: D7565310 Pulled By: miasantreble fbshipit-source-id: f672c96e925797b34dec6e20b59527e8eebaa825 10 April 2018, 21:13:18 UTC
5ec382b Fix up backupable_db stack corruption. Summary: Fix up OACR(Lint) warnings. Closes https://github.com/facebook/rocksdb/pull/3674 Differential Revision: D7563869 Pulled By: ajkr fbshipit-source-id: 8c1e5045c8a6a2d85b2933fdbc60fde93bf0c9de 10 April 2018, 02:27:24 UTC
d2bcd76 Fix the memory leak with pinned partitioned filters Summary: The existing unit test did not set the level so the check for pinned partitioned filter/index being properly released from the block cache was not properly exercised as they only take effect in level 0. As a result a memory leak in pinned partitioned filters was hidden. The patch fix the test as well as the bug. Closes https://github.com/facebook/rocksdb/pull/3692 Differential Revision: D7559763 Pulled By: maysamyabandeh fbshipit-source-id: 55eff274945838af983c764a7d71e8daff092e4a 09 April 2018, 23:28:19 UTC
65fe8d6 Change a comment Summary: In this case, we add input files of compaction, not outputs. Closes https://github.com/facebook/rocksdb/pull/3686 Differential Revision: D7556781 Pulled By: ajkr fbshipit-source-id: ae135bb6eda60db8f275a9ba2d21c18aaadef5b7 09 April 2018, 20:42:31 UTC
1c27cbf fix intra-L0 FIFO for uncompressed use case Summary: - inflate the argument passed as `max_compact_bytes_per_del_file` by a bit (10%). The intent of this argument is prevent L0 files from being intra-L0 compacted multiple times. Without compression, some intra-L0 compactions exceed this limit (and thus aren't executed), even though none of their files have gone through intra-L0 before. - fix `FindIntraL0Compaction` as it was rejecting some valid intra-L0 compactions. In particular, `compact_bytes_per_del_file` is the work-per-deleted-file for the span [0, span_len), whereas `new_compact_bytes_per_del_file` is the work-per-deleted-file for the span [0, span_len+1). The former is more correct for checking whether we've found an eligible span. Closes https://github.com/facebook/rocksdb/pull/3684 Differential Revision: D7530396 Pulled By: ajkr fbshipit-source-id: cad4f50902bdc428ac9ff6fffb13eb288648d85e 09 April 2018, 20:42:31 UTC
f3a1d9e fix data race Summary: Fix a TSAN failure in `DBRangeDelTest.ValidLevelSubcompactionBoundaries`: https://gist.github.com/miasantreble/712e04b4de2ff7f193c98b1acf07e899 Closes https://github.com/facebook/rocksdb/pull/3691 Differential Revision: D7541400 Pulled By: miasantreble fbshipit-source-id: b0b4538980bce7febd0385e61d6e046580bcaefb 09 April 2018, 19:28:28 UTC
bde1c1a WritePrepared Txn: add stats Summary: Adding some stats that would be helpful to monitor if the DB has gone to unlikely stats that would hurt the performance. These are mostly when we end up needing to acquire a mutex. Closes https://github.com/facebook/rocksdb/pull/3683 Differential Revision: D7529393 Pulled By: maysamyabandeh fbshipit-source-id: f7d36279a8f39bd84d8ddbf64b5c97f670c5d6d9 08 April 2018, 04:56:42 UTC
eb5a295 WritePrepared Txn: add write_committed option to dump_wal Summary: Currently dump_wal cannot print the prepared records from the WAL that is generated by WRITE_PREPARED write policy since the default reaction of the handler is to return NotSupported if markers of WRITE_PREPARED are encountered. This patch enables the admin to pass --write_committed=false option, which will be accordingly passed to the handler. Note that DBFileDumperCommand and DBDumperCommand are still not updated by this patch but firstly they are not urgent and secondly we need to revise this approach later when we also add WRITE_UNPREPARED markers so I leave it for future work. Tested by running it on a WAL generated by WRITE_PREPARED: $ ./ldb dump_wal --walfile=/dev/shm/dbbench/000003.log | grep BEGIN_PREARE | head -1 1,2,70,0,BEGIN_PREARE $ ./ldb dump_wal --walfile=/dev/shm/dbbench/000003.log --write_committed=false | grep BEGIN_PREARE | head -1 1,2,70,0,BEGIN_PREARE PUT(0) : 0x30303031313330313938 PUT(0) : 0x30303032353732313935 END_PREPARE(0x74786E31313535383434323738303738363938313335312D30) Closes https://github.com/facebook/rocksdb/pull/3682 Differential Revision: D7522090 Pulled By: maysamyabandeh fbshipit-source-id: a0332207261c61e18b2f9dfbe9feecd9a1339aca 08 April 2018, 04:56:42 UTC
ca87aef Added support for SstFileManager to RocksJava Summary: Closes https://github.com/facebook/rocksdb/pull/3666 Differential Revision: D7457634 Pulled By: sagar0 fbshipit-source-id: 47741e2ee66e9255c580f4e38cfb86b284c27c2f 07 April 2018, 04:26:32 UTC
74767de Fix typo Summary: regrad -> regard Closes https://github.com/facebook/rocksdb/pull/3685 Differential Revision: D7540952 Pulled By: miasantreble fbshipit-source-id: e08c9389f7fccf401c962a4441b62cd5e73a33ad 06 April 2018, 22:42:50 UTC
faba3fb protect valid backup files when max_valid_backups_to_open is set Summary: When `max_valid_backups_to_open` is set, the `BackupEngine` doesn't know about the files referenced by existing backups. This PR prevents us from deleting valid files when that option is set, in cases where we are unable to accurately determine refcount. There are warnings logged when we may miss deleting unreferenced files, and a recommendation in the header for users to periodically unset this option and run a full `GarbageCollect`. Closes https://github.com/facebook/rocksdb/pull/3518 Differential Revision: D7008331 Pulled By: ajkr fbshipit-source-id: 87907f964dc9716e229d08636a895d2fc7b72305 06 April 2018, 04:13:21 UTC
6571770 fix shared libary compile on ppc Summary: shared-ppc-objects is missed in $(SHARED4) target Closes https://github.com/facebook/rocksdb/pull/3619 Differential Revision: D7475767 Pulled By: ajkr fbshipit-source-id: d957ac7290bab3cd542af504405fb5ff912bfbf1 06 April 2018, 02:58:20 UTC
446b32c Support for Column family specific paths. Summary: In this change, an option to set different paths for different column families is added. This option is set via cf_paths setting of ColumnFamilyOptions. This option will work in a similar fashion to db_paths setting. Cf_paths is a vector of Dbpath values which contains a pair of the absolute path and target size. Multiple levels in a Column family can go to different paths if cf_paths has more than one path. To maintain backward compatibility, if cf_paths is not specified for a column family, db_paths setting will be used. Note that, if db_paths setting is also not specified, RocksDB already has code to use db_name as the only path. Changes : 1) A new member "cf_paths" is added to ImmutableCfOptions. This is set, based on cf_paths setting of ColumnFamilyOptions and db_paths setting of ImmutableDbOptions. This member is used to identify the path information whenever files are accessed. 2) Validation checks are added for cf_paths setting based on existing checks for db_paths setting. 3) DestroyDB, PurgeObsoleteFiles etc. are edited to support multiple cf_paths. 4) Unit tests are added appropriately. Closes https://github.com/facebook/rocksdb/pull/3102 Differential Revision: D6951697 Pulled By: ajkr fbshipit-source-id: 60d2262862b0a8fd6605b09ccb0da32bb331787d 06 April 2018, 02:58:20 UTC
6718267 Stats for false positive rate of full filtesr Summary: Adds two stats to allow us measuring the false positive rate of full filters: - The total count of positives: rocksdb.bloom.filter.full.positive - The total count of true positives: rocksdb.bloom.filter.full.true.positive Not the term "full" in the stat name to indicate that they are meaningful in full filters. block-based filters are to be deprecated soon and supporting it is not worth the the additional cost of if-then-else branches. Closes #3680 Tested by: $ ./db_bench -benchmarks=fillrandom -db /dev/shm/rocksdb-tmpdb --num=1000000 -bloom_bits=10 $ ./db_bench -benchmarks="readwhilewriting" -db /dev/shm/rocksdb-tmpdb --statistics -bloom_bits=10 --duration=60 --num=2000000 --use_existing_db 2>&1 > /tmp/full.log $ grep filter.full /tmp/full.log rocksdb.bloom.filter.full.positive COUNT : 3628593 rocksdb.bloom.filter.full.true.positive COUNT : 3536026 which gives the false positive rate of 2.5% Closes https://github.com/facebook/rocksdb/pull/3681 Differential Revision: D7517570 Pulled By: maysamyabandeh fbshipit-source-id: 630ab1a473afdce404916d297035b6318de4c052 05 April 2018, 22:58:48 UTC
685912d Clock cache should check if deleter is nullptr before calling it Summary: Clock cache should check if deleter is nullptr before calling it. Closes https://github.com/facebook/rocksdb/pull/3677 Differential Revision: D7493602 Pulled By: yiwu-arbug fbshipit-source-id: 4f94b188d2baf2cbc7c0d5da30fea1215a683de4 05 April 2018, 18:57:53 UTC
147dfc7 Fix pre_release callback argument list. Summary: Primitive types constness does not affect the signature of the method and has no influence on whether the overriding method would actually have that const bool instead of just bool. In addition, it is rarely useful but does produce a compatibility warnings in VS 2015 compiler. Closes https://github.com/facebook/rocksdb/pull/3663 Differential Revision: D7475739 Pulled By: ajkr fbshipit-source-id: fb275378b5acc397399420ae6abb4b6bfe5bd32f 05 April 2018, 18:12:16 UTC
36a9f22 Blob DB: blob_dump to show uncompressed values Summary: Make blob_dump tool able to show uncompressed values if the blob file is compressed. Also show total compressed vs. raw size at the end if --show_summary is provided. Closes https://github.com/facebook/rocksdb/pull/3633 Differential Revision: D7348926 Pulled By: yiwu-arbug fbshipit-source-id: ca709cb4ed5cf6a550ff2987df8033df81516f8e 05 April 2018, 18:12:16 UTC
c827b2d fix build for rocksdb lite Summary: currently rocksdb lite build fails due to the following errors: > db/db_sst_test.cc:29:51: error: ‘FlushJobInfo’ does not name a type virtual void OnFlushCompleted(DB* /*db*/, const FlushJobInfo& info) override { ^ db/db_sst_test.cc:29:16: error: ‘virtual void rocksdb::FlushedFileCollector::OnFlushCompleted(rocksdb::DB*, const int&)’ marked ‘override’, but does not override virtual void OnFlushCompleted(DB* /*db*/, const FlushJobInfo& info) override { ^ db/db_sst_test.cc:24:7: error: ‘class rocksdb::FlushedFileCollector’ has virtual functions and accessible non-virtual destructor [-Werror=non-virtual-dtor] class FlushedFileCollector : public EventListener { ^ db/db_sst_test.cc: In member function ‘virtual void rocksdb::FlushedFileCollector::OnFlushCompleted(rocksdb::DB*, const int&)’: db/db_sst_test.cc:31:35: error: request for member ‘file_path’ in ‘info’, which is of non-class type ‘const int’ flushed_files_.push_back(info.file_path); ^ cc1plus: all warnings being treated as errors make: *** [db/db_sst_test.o] Error 1 Closes https://github.com/facebook/rocksdb/pull/3676 Differential Revision: D7493006 Pulled By: miasantreble fbshipit-source-id: 77dff0a5b23e27db51be9b9798e3744e6fdec64f 05 April 2018, 16:11:36 UTC
7d90679 Ttl-triggered and snapshot-release-triggered compactions should not be manual compactions Summary: Ttl-triggered and snapshot-release-triggered compactions should not be considered as manual compactions. This is a bug. Closes https://github.com/facebook/rocksdb/pull/3678 Differential Revision: D7498151 Pulled By: sagar0 fbshipit-source-id: a2d5bed05268a4dc93d54ea97a9ae44b366df15d 05 April 2018, 13:41:52 UTC
2a62ca1 Make Optimistic Tx database stackable Summary: This change models Optimistic Tx db after Pessimistic TX db. The motivation for this change is to make the ptr polymorphic so it can be held by the same raw or smart ptr. Currently, due to the inheritance of the Opt Tx db not being rooted in the manner of Pess Tx from a single DB root it is more difficult to write clean code and have clear ownership of the database in cases when options dictate instantiate of plan DB, Pess Tx DB or Opt tx db. Closes https://github.com/facebook/rocksdb/pull/3566 Differential Revision: D7184502 Pulled By: yiwu-arbug fbshipit-source-id: 31d06efafd79497bb0c230e971857dba3bd962c3 03 April 2018, 22:28:40 UTC
b058a33 Reduce default --nooverwritepercent in black-box crash tests Summary: Previously `python tools/db_crashtest.py blackbox` would do no useful work as the crash interval (two minutes) was shorter than the preparation phase. The preparation phase is slow because of the ridiculously inefficient way it computes which keys should not be overwritten. It was doing this for 60M keys since default values were `FLAGS_nooverwritepercent == 60` and `FLAGS_max_key == 100000000`. Move the "nooverwritepercent" override from whitebox-specific to the general options so it also applies to blackbox test runs. Now preparation phase takes a few seconds. Closes https://github.com/facebook/rocksdb/pull/3671 Differential Revision: D7457732 Pulled By: ajkr fbshipit-source-id: 601f4461a6a7e49e50449dcf15aebc9b8a98d6f0 03 April 2018, 22:28:40 UTC
12b400e Some small improvements to the build_tools Summary: Closes https://github.com/facebook/rocksdb/pull/3664 Differential Revision: D7459433 Pulled By: sagar0 fbshipit-source-id: 3817e5d45fc70e83cb26f9800eaa0f4566c8dc0e 03 April 2018, 06:57:41 UTC
04c11b8 Level Compaction with TTL Summary: Level Compaction with TTL. As of today, a file could exist in the LSM tree without going through the compaction process for a really long time if there are no updates to the data in the file's key range. For example, in certain use cases, the keys are not actually "deleted"; instead they are just set to empty values. There might not be any more writes to this "deleted" key range, and if so, such data could remain in the LSM for a really long time resulting in wasted space. Introducing a TTL could solve this problem. Files (and, in turn, data) older than TTL will be scheduled for compaction when there is no other background work. This will make the data go through the regular compaction process and get rid of old unwanted data. This also has the (good) side-effect of all the data in the non-bottommost level being newer than ttl, and all data in the bottommost level older than ttl. It could lead to more writes while reducing space. This functionality can be controlled by the newly introduced column family option -- ttl. TODO for later: - Make ttl mutable - Extend TTL to Universal compaction as well? (TTL is already supported in FIFO) - Maybe deprecate CompactionOptionsFIFO.ttl in favor of this new ttl option. Closes https://github.com/facebook/rocksdb/pull/3591 Differential Revision: D7275442 Pulled By: sagar0 fbshipit-source-id: dcba484717341200d419b0953dafcdf9eb2f0267 03 April 2018, 05:14:28 UTC
df14424 Fix 3-way SSE4.2 crc32c usage in MSVC with CMake Summary: The introduction of the 3-way SSE4.2 optimized crc32c implementation in commit f54d7f5feaff09b0e2bf749ecd77043bcd4141b2 added the `HAVE_PCLMUL` definition when the compiler supports intrinsics for that instruction, but did not modify CMakeLists.txt to set that definition on MSVC when appropriate. As a result, 3-way SSE4.2 is not used in MSVC builds with CMake although it could be. Since the existing test program in CMakeLists.txt for `HAVE_SSE42` already uses `_mm_clmulepi64_si128` which is a PCLMUL instruction, this PR sets `HAVE_PCLMUL` as well if that program builds successfully, fixing the problem. Closes https://github.com/facebook/rocksdb/pull/3673 Differential Revision: D7473975 Pulled By: miasantreble fbshipit-source-id: bc346b9eb38920e427aa1a253e6dd9811efa269e 03 April 2018, 03:42:26 UTC
b225de7 WritePrepared Txn: smallest_prepare optimization Summary: The is an optimization to reduce lookup in the CommitCache when querying IsInSnapshot. The optimization takes the smallest uncommitted data at the time that the snapshot was taken and if the sequence number of the read data is lower than that number it assumes the data as committed. To implement this optimization two changes are required: i) The AddPrepared function must be called sequentially to avoid out of order insertion in the PrepareHeap (otherwise the top of the heap does not indicate the smallest prepare in future too), ii) non-2PC transactions also call AddPrepared if they do not commit in one step. Closes https://github.com/facebook/rocksdb/pull/3649 Differential Revision: D7388630 Pulled By: maysamyabandeh fbshipit-source-id: b79506238c17467d590763582960d4d90181c600 03 April 2018, 03:27:41 UTC
1579626 Enable cancelling manual compactions if they hit the sfm size limit Summary: Manual compactions should be cancelled, just like scheduled compactions are cancelled, if sfm->EnoughRoomForCompaction is not true. Closes https://github.com/facebook/rocksdb/pull/3670 Differential Revision: D7457683 Pulled By: amytai fbshipit-source-id: 669b02fdb707f75db576d03d2c818fb98d1876f5 03 April 2018, 02:58:04 UTC
44653c7 Revert "Avoid adding tombstones of the same file to RangeDelAggregato… Summary: …r multiple times" This reverts commit e80709a33a2fc05cf85a89ade1f17944197e5451. lingbin PR https://github.com/facebook/rocksdb/pull/3635 is causing some performance regression for seekrandom workloads I'm reverting the commit for now but feel free to submit new patches :smiley: To reproduce the regression, you can run the following db_bench command > ./db_bench --benchmarks=fillrandom,seekrandomwhilewriting --threads=1 --num=1000000 --reads=150000 --key_size=66 --value_size=1262 --statistics=0 --compression_ratio=0.5 --histogram=1 --seek_nexts=1 --stats_per_interval=1 --stats_interval_seconds=600 --max_background_flushes=4 --num_multi_db=1 --max_background_compactions=16 --seed=1522388277 -write_buffer_size=1048576 --level0_file_num_compaction_trigger=10000 --compression_type=none write stats printed by db_bench: Table | | | | | | | | | | | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- revert commit | Percentiles: | P50: | 80.77 | P75: |102.94 |P99: | 1786.44 | P99.9: | 1892.39 |P99.99: 2645.10 | keep commit | Percentiles: | P50: | 221.72 | P75: | 686.62 | P99: | 1842.57 | P99.9: | 1899.70| P99.99: 2814.29| Closes https://github.com/facebook/rocksdb/pull/3672 Differential Revision: D7463315 Pulled By: miasantreble fbshipit-source-id: 8e779c87591127f2c3694b91a56d9b459011959d 03 April 2018, 02:58:04 UTC
8917eee Fixed small typos Summary: Closes https://github.com/facebook/rocksdb/pull/3667 Differential Revision: D7470060 Pulled By: miasantreble fbshipit-source-id: 8e8545cda38f0805f35ccdb8841666a2d7a965f5 02 April 2018, 00:14:46 UTC
d12112d Throw NoSpace instead of IOError when out of space. Summary: Replaces #1702 and is updated from feedback. Closes https://github.com/facebook/rocksdb/pull/3531 Differential Revision: D7457395 Pulled By: gfosco fbshipit-source-id: 25a21dd8cfa5a6e42e024208b444d9379d920c82 30 March 2018, 22:27:18 UTC
d9bfb35 Update buckifier and TARGETS Summary: Some flags used via make were not applied in the buckifier/targets file, causing some failures to be missed by testing infra ( ie the one fixed by #3434 ) Closes https://github.com/facebook/rocksdb/pull/3452 Differential Revision: D7457419 Pulled By: gfosco fbshipit-source-id: e4aed2915ca3038c1485bbdeebedfc33d5704a49 30 March 2018, 21:26:53 UTC
c3eb762 Update 64-bit shift in compression.h Summary: This was failing the build on windows with zstd, warning treated as an error, 32-bit shift implicitly converted to 64-bit. Closes https://github.com/facebook/rocksdb/pull/3624 Differential Revision: D7307883 Pulled By: gfosco fbshipit-source-id: 68110e9b5b1b59b668dec6cf86b67556402574e7 30 March 2018, 18:28:05 UTC
73f21a7 Skip deleted WALs during recovery Summary: This patch record the deleted WAL numbers in the manifest to ignore them and any WAL older than them during recovery. This is to avoid scenarios when we have a gap between the WAL files are fed to the recovery procedure. The gap could happen by for example out-of-order WAL deletion. Such gap could cause problems in 2PC recovery where the prepared and commit entry are placed into two separate WAL and gap in the WALs could result into not processing the WAL with the commit entry and hence breaking the 2PC recovery logic. Closes https://github.com/facebook/rocksdb/pull/3488 Differential Revision: D6967893 Pulled By: maysamyabandeh fbshipit-source-id: 13119feb155a08ab6d4909f437c7a750480dc8a1 30 March 2018, 18:28:05 UTC
89d989e WritePrepared Txn: fix a bug in publishing recoverable state seq Summary: When using two_write_queue, the published seq and the last allocated sequence could be ahead of the LastSequence, even if both write queues are stopped as in WriteRecoverableState. The patch fixes a bug in WriteRecoverableState in which LastSequence was used as a reference but the result was applied to last fetched sequence and last published seq. Closes https://github.com/facebook/rocksdb/pull/3665 Differential Revision: D7446099 Pulled By: maysamyabandeh fbshipit-source-id: 1449bed9aed8e9db6af85946efd347cb8efd3c0b 29 March 2018, 21:46:41 UTC
3cb5919 Allow rocksdbjavastatic to also be built as debug build Summary: Closes https://github.com/facebook/rocksdb/pull/3654 Differential Revision: D7417948 Pulled By: sagar0 fbshipit-source-id: 9514df9328181e54a6384764444c0c7ce66e7f5f 28 March 2018, 23:30:36 UTC
0377ff9 WritePrepared Txn: make recoverable state visible after flush Summary: Currently if the CommitTimeWriteBatch is set to be used only as a state that is required only for recovery , the user cannot see that in DB until it is restarted. This while the state is already inserted into the DB after the memtable flush. It would be useful for debugging if make this state visible to the user after the flush by committing it. The patch does it by a invoking a callback that does the commit on the recoverable state. Closes https://github.com/facebook/rocksdb/pull/3661 Differential Revision: D7424577 Pulled By: maysamyabandeh fbshipit-source-id: 137f9408662f0853938b33fa440f27f04c1bbf5c 28 March 2018, 19:12:08 UTC
1f5def1 Fix race condition causing double deletion of ssts Summary: Possible interleaved execution of background compaction thread calling `FindObsoleteFiles (no full scan) / PurgeObsoleteFiles` and user thread calling `FindObsoleteFiles (full scan) / PurgeObsoleteFiles` can lead to race condition on which RocksDB attempts to delete a file twice. The second attempt will fail and return `IO error`. This may occur to other files, but this PR targets sst. Also add a unit test to verify that this PR fixes the issue. The newly added unit test `obsolete_files_test` has a test case for this scenario, implemented in `ObsoleteFilesTest#RaceForObsoleteFileDeletion`. `TestSyncPoint`s are used to coordinate the interleaving the `user_thread` and background compaction thread. They execute as follows ``` timeline user_thread background_compaction thread t1 | FindObsoleteFiles(full_scan=false) t2 | FindObsoleteFiles(full_scan=true) t3 | PurgeObsoleteFiles t4 | PurgeObsoleteFiles V ``` When `user_thread` invokes `FindObsoleteFiles` with full scan, it collects ALL files in RocksDB directory, including the ones that background compaction thread have collected in its job context. Then `user_thread` will see an IO error when trying to delete these files in `PurgeObsoleteFiles` because background compaction thread has already deleted the file in `PurgeObsoleteFiles`. To fix this, we make RocksDB remember which (SST) files have been found by threads after calling `FindObsoleteFiles` (see `DBImpl#files_grabbed_for_purge_`). Therefore, when another thread calls `FindObsoleteFiles` with full scan, it will not collect such files. ajkr could you take a look and comment? Thanks! Closes https://github.com/facebook/rocksdb/pull/3638 Differential Revision: D7384372 Pulled By: riversand963 fbshipit-source-id: 01489516d60012e722ee65a80e1449e589ce26d3 28 March 2018, 17:29:59 UTC
90c5423 Update comments about MergeOperator::AllowSingleOperand Summary: Updated comments around AllowSingleOperand. Reason: A couple of users were confused and encountered issues due to no overriding PartialMerge with AllowSingleOperand=true. I'll also look into modifying the default merge operator implementation so that overriding PartialMerge is not mandatory when AllowSingleOp=true. Closes https://github.com/facebook/rocksdb/pull/3659 Differential Revision: D7422691 Pulled By: sagar0 fbshipit-source-id: 3d075a6ced0120f5d65cb7ae5412936f1862f342 28 March 2018, 00:13:53 UTC
d687670 Fix a leak in FilterBlockBuilder when adding prefix Summary: Our valgrind continuous test found an interesting leak which got introduced in #3614. We were adding the prefix key before saving the previous prefix start offset, due to which previous prefix offset is always incorrect. Fixed it by saving the the previous sate before adding the key. Closes https://github.com/facebook/rocksdb/pull/3660 Differential Revision: D7418698 Pulled By: sagar0 fbshipit-source-id: 9933685f943cf2547ed5c553f490035a2fa785cf 27 March 2018, 22:13:56 UTC
f9f4d40 Align SST file data blocks to avoid spanning multiple pages Summary: Provide a block_align option in BlockBasedTableOptions to allow alignment of SST file data blocks. This will avoid higher IOPS/throughput load due to < 4KB data blocks spanning 2 4KB pages. When this option is set to true, the block alignment is set to lower of block size and 4KB. Closes https://github.com/facebook/rocksdb/pull/3502 Differential Revision: D7400897 Pulled By: anand1976 fbshipit-source-id: 04cc3bd144e88e3431a4f97604e63ad7a0f06d44 27 March 2018, 03:26:10 UTC
0999e9b WritePrepared Txn: Increase commit cache size to 2^23 Summary: Current commit cache size is 2^21. This was due to a type. With 2^23 commit entries we can have transactions as long as 64s without incurring the cost of having them evicted from the commit cache before their commit. Here is the math: 2^23 / 2 (one out of two seq numbers are for commit) / 2^16 TPS = 2^6 = 64s Closes https://github.com/facebook/rocksdb/pull/3657 Differential Revision: D7411211 Pulled By: maysamyabandeh fbshipit-source-id: e7cacf40579f3acf940643d8a1cfe5dd201caa35 27 March 2018, 02:45:17 UTC
35a4469 Fix race condition via concurrent FlushWAL Summary: Currently log_writer->AddRecord in WriteImpl is protected from concurrent calls via FlushWAL only if two_write_queues_ option is set. The patch fixes the problem by i) skip log_writer->AddRecord in FlushWAL if manual_wal_flush is not set, ii) protects log_writer->AddRecord in WriteImpl via log_write_mutex_ if manual_wal_flush_ is set but two_write_queues_ is not. Fixes #3599 Closes https://github.com/facebook/rocksdb/pull/3656 Differential Revision: D7405608 Pulled By: maysamyabandeh fbshipit-source-id: d6cc265051c77ae49c7c6df4f427350baaf46934 26 March 2018, 23:29:56 UTC
23f9d93 Exclude MySQLStyleTransactionTest.TransactionStressTest* from valgrind Summary: I found that each instance of MySQLStyleTransactionTest.TransactionStressTest/x is taking more than 10 hours to complete on our continuous testing environment, causing the whole valgrind run to timeout after a day. So excluding these tests. Closes https://github.com/facebook/rocksdb/pull/3652 Differential Revision: D7400332 Pulled By: sagar0 fbshipit-source-id: 987810574506d01487adf7c2de84d4817ec3d22d 26 March 2018, 17:27:47 UTC
3e417a6 WritePrepared Txn: AddPrepared for all sub-batches Summary: Currently AddPrepared is performed only on the first sub-batch if there are duplicate keys in the write batch. This could cause a problem if the transaction takes too long to commit and the seq number of the first sub-patch moved to old_prepared_ but not the seq of the later ones. The patch fixes this by calling AddPrepared for all sub-patches. Closes https://github.com/facebook/rocksdb/pull/3651 Differential Revision: D7388635 Pulled By: maysamyabandeh fbshipit-source-id: 0ccd80c150d9bc42fe955e49ddb9d7ca353067b4 24 March 2018, 00:30:04 UTC
d382ae7 Imporve perf of random read and insert compare by suggesting inlining to the compiler Summary: Results from 2015 compiler. This improve sequential insert. Random Read results are inconclusive but I hope 2017 will do a better job at inlining. Before: fillseq : **3.638 micros/op 274866 ops/sec; 213.9 MB/s** After: fillseq : **3.379 micros/op 295979 ops/sec; 230.3 MB/s** Closes https://github.com/facebook/rocksdb/pull/3645 Differential Revision: D7382711 Pulled By: siying fbshipit-source-id: 092a07ffe8a6e598d1226ceff0f11b35e6c5c8e4 23 March 2018, 20:26:55 UTC
53d66df Refactor sync_point to make implementation either customizable or replaceable Summary: Closes https://github.com/facebook/rocksdb/pull/3637 Differential Revision: D7354373 Pulled By: ajkr fbshipit-source-id: 6816c7bbc192ed0fb944942b11c7074bf24eddf1 23 March 2018, 19:56:52 UTC
a993c01 Add 5.11 and 5.12 to tools/check_format_compatible.sh Summary: Closes https://github.com/facebook/rocksdb/pull/3646 Differential Revision: D7384727 Pulled By: sagar0 fbshipit-source-id: f713af7adb2ffea5303bbf0fac8a8a1630af7b38 23 March 2018, 19:43:06 UTC
e80709a Avoid adding tombstones of the same file to RangeDelAggregator multiple times Summary: RangeDelAggregator will remember the files whose range tombstones have been added, so the caller can check whether the file has been added before call AddTombstones. Closes https://github.com/facebook/rocksdb/pull/3635 Differential Revision: D7354604 Pulled By: ajkr fbshipit-source-id: 9b9f7ec130556028df417e650711554b46d8d107 23 March 2018, 19:43:06 UTC
7ffce28 Add Java-API-Changes section to History Summary: We have not been updating our HISTORY.md change log with the RocksJava changes. Going forward, lets add Java changes also to HISTORY.md. There is an old java/HISTORY-JAVA.md, but it hasn't been updated in years. It is much easier to remember to update the change log in a single file, HISTORY.md. I added information about shared block cache here, which was introduced in #3623. Closes https://github.com/facebook/rocksdb/pull/3647 Differential Revision: D7384448 Pulled By: sagar0 fbshipit-source-id: 9b6e569f44e6df5cb7ba06413d9975df0b517d20 23 March 2018, 19:28:51 UTC
09b6bf8 InlineSkiplist: don't decode keys unnecessarily during comparisons Summary: Summary ======== `InlineSkipList<>::Insert` takes the `key` parameter as a C-string. Then, it performs multiple comparisons with it requiring the `GetLengthPrefixedSlice()` to be spawn in `MemTable::KeyComparator::operator()(const char* prefix_len_key1, const char* prefix_len_key2)` on the same data over and over. The patch tries to optimize that. Rough performance comparison ===== Big keys, no compression. ``` $ ./db_bench --writes 20000000 --benchmarks="fillrandom" --compression_type none -key_size 256 (...) fillrandom : 4.222 micros/op 236836 ops/sec; 80.4 MB/s ``` ``` $ ./db_bench --writes 20000000 --benchmarks="fillrandom" --compression_type none -key_size 256 (...) fillrandom : 4.064 micros/op 246059 ops/sec; 83.5 MB/s ``` TODO ====== In ~~a separated~~ this PR: - [x] Go outside the write path. Maybe even eradicate the C-string-taking variant of `KeyIsAfterNode` entirely. - [x] Try to cache the transformations applied by `KeyComparator` & friends in situations where we havy many comparisons with the same key. Closes https://github.com/facebook/rocksdb/pull/3516 Differential Revision: D7059300 Pulled By: ajkr fbshipit-source-id: 6f027dbb619a488129f79f79b5f7dbe566fb2dbb 23 March 2018, 19:14:30 UTC
1cbc96d FlushReason improvement Summary: Right now flush reason "SuperVersion Change" covers a few different scenarios which is a bit vague. For example, the following db_bench job should trigger "Write Buffer Full" > $ TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -write_buffer_size=1048576 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304 $ grep 'flush_reason' /dev/shm/dbbench/LOG ... 2018/03/06-17:30:42.543638 7f2773b99700 EVENT_LOG_v1 {"time_micros": 1520386242543634, "job": 192, "event": "flush_started", "num_memtables": 1, "num_entries": 7006, "num_deletes": 0, "memory_usage": 1018024, "flush_reason": "SuperVersion Change"} 2018/03/06-17:30:42.569541 7f2773b99700 EVENT_LOG_v1 {"time_micros": 1520386242569536, "job": 193, "event": "flush_started", "num_memtables": 1, "num_entries": 7006, "num_deletes": 0, "memory_usage": 1018104, "flush_reason": "SuperVersion Change"} 2018/03/06-17:30:42.596396 7f2773b99700 EVENT_LOG_v1 {"time_micros": 1520386242596392, "job": 194, "event": "flush_started", "num_memtables": 1, "num_entries": 7008, "num_deletes": 0, "memory_usage": 1018048, "flush_reason": "SuperVersion Change"} 2018/03/06-17:30:42.622444 7f2773b99700 EVENT_LOG_v1 {"time_micros": 1520386242622440, "job": 195, "event": "flush_started", "num_memtables": 1, "num_entries": 7006, "num_deletes": 0, "memory_usage": 1018104, "flush_reason": "SuperVersion Change"} With the fix: > 2018/03/19-14:40:02.341451 7f11dc257700 EVENT_LOG_v1 {"time_micros": 1521495602341444, "job": 98, "event": "flush_started", "num_memtables": 1, "num_entries": 7009, "num_deletes": 0, "memory_usage": 1018008, "flush_reason": "Write Buffer Full"} 2018/03/19-14:40:02.379655 7f11dc257700 EVENT_LOG_v1 {"time_micros": 1521495602379642, "job": 100, "event": "flush_started", "num_memtables": 1, "num_entries": 7006, "num_deletes": 0, "memory_usage": 1018016, "flush_reason": "Write Buffer Full"} 2018/03/19-14:40:02.418479 7f11dc257700 EVENT_LOG_v1 {"time_micros": 1521495602418474, "job": 101, "event": "flush_started", "num_memtables": 1, "num_entries": 7009, "num_deletes": 0, "memory_usage": 1018104, "flush_reason": "Write Buffer Full"} 2018/03/19-14:40:02.455084 7f11dc257700 EVENT_LOG_v1 {"time_micros": 1521495602455079, "job": 102, "event": "flush_started", "num_memtables": 1, "num_entries": 7009, "num_deletes": 0, "memory_usage": 1018048, "flush_reason": "Write Buffer Full"} 2018/03/19-14:40:02.492293 7f11dc257700 EVENT_LOG_v1 {"time_micros": 1521495602492288, "job": 104, "event": "flush_started", "num_memtables": 1, "num_entries": 7007, "num_deletes": 0, "memory_usage": 1018056, "flush_reason": "Write Buffer Full"} 2018/03/19-14:40:02.528720 7f11dc257700 EVENT_LOG_v1 {"time_micros": 1521495602528715, "job": 105, "event": "flush_started", "num_memtables": 1, "num_entries": 7006, "num_deletes": 0, "memory_usage": 1018104, "flush_reason": "Write Buffer Full"} 2018/03/19-14:40:02.566255 7f11dc257700 EVENT_LOG_v1 {"time_micros": 1521495602566238, "job": 107, "event": "flush_started", "num_memtables": 1, "num_entries": 7009, "num_deletes": 0, "memory_usage": 1018112, "flush_reason": "Write Buffer Full"} Closes https://github.com/facebook/rocksdb/pull/3627 Differential Revision: D7328772 Pulled By: miasantreble fbshipit-source-id: 67c94065fbdd36930f09930aad0aaa6d2c152bb8 23 March 2018, 01:42:18 UTC
82137f0 Add unit test for WAL corruption Summary: Closes https://github.com/facebook/rocksdb/pull/3618 Differential Revision: D7301053 Pulled By: ajkr fbshipit-source-id: a9dde90caa548c294d03d6386f78428c8536ca14 23 March 2018, 01:28:01 UTC
2e3d407 Fsync after writing global seq number in ExternalSstFileIngestionJob Summary: Fsync after writing global sequence number to the ingestion file in ExternalSstFileIngestionJob. Otherwise the file metadata could be incorrect. Closes https://github.com/facebook/rocksdb/pull/3644 Differential Revision: D7373813 Pulled By: sagar0 fbshipit-source-id: 4da2c9e71a8beb5c08b4ac955f288ee1576358b8 23 March 2018, 00:42:56 UTC
4d51fea Rename function for handling WAL write error Summary: It was misnamed. It actually updates `bg_error_` if `PreprocessWrite()` or `WriteToWAL()` fail, not related to the user callback. Closes https://github.com/facebook/rocksdb/pull/3485 Differential Revision: D6955787 Pulled By: ajkr fbshipit-source-id: bd7afc3fdb7a52830c021cbfc25fcbc3ab7d5e10 22 March 2018, 22:58:39 UTC
back to top