795e663 | Zhongyi Xie | 22 June 2018, 04:16:49 UTC | option for timing measurement of non-blocking ops during compaction (#4029) Summary: For example calling CompactionFilter is always timed and gives the user no way to disable. This PR will disable the timer if `Statistics::stats_level_` (which is part of DBOptions) is `kExceptDetailedTimers` Closes https://github.com/facebook/rocksdb/pull/4029 Differential Revision: D8583670 Pulled By: miasantreble fbshipit-source-id: 913be9fe433ae0c06e88193b59d41920a532307f | 22 June 2018, 04:28:05 UTC |
0a5b16c | Andrew Kryczka | 21 June 2018, 23:22:57 UTC | Cleanup staging directory at start of checkpoint (#4035) Summary: - Attempt to clean the checkpoint staging directory before starting a checkpoint. It was already cleaned up at the end of checkpoint. But it wasn't cleaned up in the edge case where the process crashed while staging checkpoint files. - Attempt to clean the checkpoint directory before calling `Checkpoint::Create` in `db_stress`. This handles the case where checkpoint directory was created by a previous `db_stress` run but the process crashed before cleaning it up. - Use `DestroyDB` for cleaning checkpoint directory since a checkpoint is a DB. Closes https://github.com/facebook/rocksdb/pull/4035 Reviewed By: yiwu-arbug Differential Revision: D8580223 Pulled By: ajkr fbshipit-source-id: 28c667400e249fad0fdedc664b349031b7b61599 | 21 June 2018, 23:27:12 UTC |
645e57c | Sagar Vemuri | 21 June 2018, 21:45:49 UTC | Assert for Direct IO at the beginning in PositionedRead (#3891) Summary: Moved the direct-IO assertion to the top in `PosixSequentialFile::PositionedRead`, as it doesn't make sense to check for sector alignments before checking for direct IO. Closes https://github.com/facebook/rocksdb/pull/3891 Differential Revision: D8267972 Pulled By: sagar0 fbshipit-source-id: 0ecf77c0fb5c35747a4ddbc15e278918c0849af7 | 21 June 2018, 21:58:01 UTC |
58c2214 | Yi Wu | 21 June 2018, 21:30:00 UTC | Update TARGETS file (#4028) Summary: -Wshorten-64-to-32 is invalid flag in fbcode. Changing it to -Warrowing. Closes https://github.com/facebook/rocksdb/pull/4028 Differential Revision: D8553694 Pulled By: yiwu-arbug fbshipit-source-id: 1523cbcb4c76cf1d2b10a4d28b5f58c78e6cb876 | 21 June 2018, 21:42:39 UTC |
3974959 | Yanqin Jin | 21 June 2018, 18:09:09 UTC | Fix a warning (treated as error) caused by type mismatch. Summary: Closes https://github.com/facebook/rocksdb/pull/4032 Differential Revision: D8573061 Pulled By: riversand963 fbshipit-source-id: 112324dcb35956d6b3ec891073f4f21493933c8b | 21 June 2018, 18:13:09 UTC |
7103559 | Sagar Vemuri | 21 June 2018, 18:02:49 UTC | Improve direct IO range scan performance with readahead (#3884) Summary: This PR extends the improvements in #3282 to also work when using Direct IO. We see **4.5X performance improvement** in seekrandom benchmark doing long range scans, when using direct reads, on flash. **Description:** This change improves the performance of iterators doing long range scans (e.g. big/full index or table scans in MyRocks) by using readahead and prefetching additional data on each disk IO, and storing in a local buffer. This prefetching is automatically enabled on noticing more than 2 IOs for the same table file during iteration. The readahead size starts with 8KB and is exponentially increased on each additional sequential IO, up to a max of 256 KB. This helps in cutting down the number of IOs needed to complete the range scan. **Implementation Details:** - Used `FilePrefetchBuffer` as the underlying buffer to store the readahead data. `FilePrefetchBuffer` can now take file_reader, readahead_size and max_readahead_size as input to the constructor, and automatically do readahead. - `FilePrefetchBuffer::TryReadFromCache` can now call `FilePrefetchBuffer::Prefetch` if readahead is enabled. - `AlignedBuffer` (which is the underlying store for `FilePrefetchBuffer`) now takes a few additional args in `AlignedBuffer::AllocateNewBuffer` to allow copying data from the old buffer. - Made sure not to re-read partial chunks of data that were already available in the buffer, from device again. - Fixed a couple of cases where `AlignedBuffer::cursize_` was not being properly kept up-to-date. **Constraints:** - Similar to #3282, this gets currently enabled only when ReadOptions.readahead_size = 0 (which is the default value). - Since the prefetched data is stored in a temporary buffer allocated on heap, this could increase the memory usage if you have many iterators doing long range scans simultaneously. - Enabled only for user reads, and disabled for compactions. Compaction reads are controlled by the options `use_direct_io_for_flush_and_compaction` and `compaction_readahead_size`, and the current feature takes precautions not to mess with them. **Benchmarks:** I used the same benchmark as used in #3282. Data fill: ``` TEST_TMPDIR=/data/users/$USER/benchmarks/iter ./db_bench -benchmarks=fillrandom -num=1000000000 -compression_type="none" -level_compaction_dynamic_level_bytes ``` Do a long range scan: Seekrandom with large number of nexts ``` TEST_TMPDIR=/data/users/$USER/benchmarks/iter ./db_bench -benchmarks=seekrandom -use_direct_reads -duration=60 -num=1000000000 -use_existing_db -seek_nexts=10000 -statistics -histogram ``` ``` Before: seekrandom : 37939.906 micros/op 26 ops/sec; 29.2 MB/s (1636 of 1999 found) With this change: seekrandom : 8527.720 micros/op 117 ops/sec; 129.7 MB/s (6530 of 7999 found) ``` ~4.5X perf improvement. Taken on an average of 3 runs. Closes https://github.com/facebook/rocksdb/pull/3884 Differential Revision: D8082143 Pulled By: sagar0 fbshipit-source-id: 4d7a8561cbac03478663713df4d31ad2620253bb | 21 June 2018, 18:13:08 UTC |
524c6e6 | Yanqin Jin | 21 June 2018, 15:34:24 UTC | Add file name info to SequentialFileReader. (#4026) Summary: We potentially need this information for tracing, profiling and diagnosis. Closes https://github.com/facebook/rocksdb/pull/4026 Differential Revision: D8555214 Pulled By: riversand963 fbshipit-source-id: 4263e06c00b6d5410b46aa46eb4e358ff2161dd2 | 21 June 2018, 15:42:24 UTC |
14cee19 | Andrew Kryczka | 21 June 2018, 05:14:12 UTC | Support file ingestion in stress test (#4018) Summary: Once per `ingest_external_file_one_in` operations, uses SstFileWriter to create a file containing `ingest_external_file_width` consecutive keys. The file is named containing the thread ID to avoid clashes. The file is then added to the DB using `IngestExternalFile`. We can't enable it by default in crash test because `nooverwritepercent` and `test_batches_snapshot` both must be zero for the DB's whole lifetime. Perhaps we should setup a separate test with that config as range deletion also requires it. Closes https://github.com/facebook/rocksdb/pull/4018 Differential Revision: D8507698 Pulled By: ajkr fbshipit-source-id: 1437ea26fd989349a9ce8b94117241c65e40f10f | 21 June 2018, 05:27:45 UTC |
61d69d4 | Dmitri Smirnov | 20 June 2018, 00:07:49 UTC | Hide jemalloc aligned allocation functions into .cc (#4025) Summary: so they could be overriden Closes https://github.com/facebook/rocksdb/pull/4025 Differential Revision: D8526287 Pulled By: siying fbshipit-source-id: 9537b299dc907b4d1eeaf77a8784b13cb058280d | 20 June 2018, 00:12:06 UTC |
28a9d89 | Maysam Yabandeh | 19 June 2018, 21:06:17 UTC | Fix the bug with duplicate prefix in partition filters (#4024) Summary: https://github.com/facebook/rocksdb/pull/3764 introduced an optimization feature to skip duplicate prefix entires in full bloom filters. Unfortunately it also introduces a bug in partitioned full filters, where the duplicate prefix should still be inserted if it is in a new partition. The patch fixes the bug by resetting the duplicate detection logic each time a partition is cut. This bug could result into false negatives, which means that DB could skip an existing key. Closes https://github.com/facebook/rocksdb/pull/4024 Differential Revision: D8518866 Pulled By: maysamyabandeh fbshipit-source-id: 044f4d988e606a330ecafd8c79daceb68b8796bf | 19 June 2018, 21:12:46 UTC |
92ee335 | Siying Dong | 19 June 2018, 16:42:40 UTC | BlockBasedTableIterator to keep BlockIter after out of upper bound (#4004) Summary: b555ed30a4a93b80a3ac4781c6721ab988e03b5b makes the BlockBasedTableIterator to be invalidated if the current position if over the upper bound. However, this can bring performance regression to the case of multiple Seek()s hitting the same data block but all out of upper bound. For example, if an SST file has a data block containing following keys : {a, z} The user sets the upper bound to be "x", and it executed following queries: Seek("b") Seek("c") Seek("d") Before the upper bound optimization, these queries always come to this same current data block of the iterator, but now inside each Seek() the data block is read from the block cache but is returned again. To prevent this regression case, we keep the current data block iterator if it is upper bound. Closes https://github.com/facebook/rocksdb/pull/4004 Differential Revision: D8463192 Pulled By: siying fbshipit-source-id: 8710628b30acde7063a097c3184d6c4333a8ef81 | 19 June 2018, 16:57:11 UTC |
7f3a634 | Andrew Kryczka | 19 June 2018, 15:56:43 UTC | Support pipelined write in stress/crash tests Summary: Closes https://github.com/facebook/rocksdb/pull/4019 Differential Revision: D8508681 Pulled By: ajkr fbshipit-source-id: 23a3c07d642386446e322b02e69cdf70d12ef009 | 19 June 2018, 16:14:12 UTC |
8585059 | Andrew Kryczka | 19 June 2018, 02:23:42 UTC | Support backup and checkpoint in db_stress (#4005) Summary: Add the `backup_one_in` and `checkpoint_one_in` options to periodically trigger backups and checkpoints. The directory names contain thread ID to avoid clashing with parallel backups/checkpoints. Enable checkpoint in crash test so our CI runs will use it. Didn't enable backup in crash test since it copies all the files which is too slow. Closes https://github.com/facebook/rocksdb/pull/4005 Differential Revision: D8472275 Pulled By: ajkr fbshipit-source-id: ff91bdc37caac4ffd97aea8df96b3983313ac1d5 | 19 June 2018, 02:28:18 UTC |
de2c6fb | Andrew Kryczka | 19 June 2018, 00:42:48 UTC | Fix stderr processing in crash test (#4006) Summary: Fixed bug where `db_stress` output a line with a warning followed by a line with an error, and `db_crashtest.py` considered that a success. For example: ``` WARNING: prefix_size is non-zero but memtablerep != prefix_hash open error: Corruption: SST file is ahead of WALs ``` Closes https://github.com/facebook/rocksdb/pull/4006 Differential Revision: D8473463 Pulled By: ajkr fbshipit-source-id: 60461bdd7491d9d26c63f7d4ee522a0f88ba3de7 | 19 June 2018, 00:58:13 UTC |
c766887 | Tomas Kolda | 18 June 2018, 21:38:50 UTC | Fix ExternalSSTFileTest::OverlappingRanges test on Solaris Sparc (#4012) Summary: Fix of #4011 Closes https://github.com/facebook/rocksdb/pull/4012 Differential Revision: D8499173 Pulled By: sagar0 fbshipit-source-id: cbb2b90c544ed364a3640ea65835d577b2dbc5df | 18 June 2018, 21:57:37 UTC |
7b4b43f | Tomas Kolda | 18 June 2018, 20:52:46 UTC | zLinux build error with gcc and IBM Java headers (#4013) Summary: `SetByteArrayRegion` does not have const source buffer thus compilation error. I have made that same as in other JNI files (const_cast). It was missing for new transaction functionality added recently. Closes https://github.com/facebook/rocksdb/pull/4013 Differential Revision: D8493290 Pulled By: sagar0 fbshipit-source-id: 14afedf365b111121bd11e68a8d546a1cae68b26 | 18 June 2018, 20:58:28 UTC |
e5bee40 | Tomas Kolda | 18 June 2018, 16:42:29 UTC | zLinux s390x support in JNI (#4009) Summary: Adding support for zLinux on s390x architecture in JNI. Closes https://github.com/facebook/rocksdb/pull/4009 Differential Revision: D8483750 Pulled By: siying fbshipit-source-id: e681657c27e7a28f1731e08e8570382de5deff44 | 18 June 2018, 16:57:02 UTC |
e750dac | Tomas Kolda | 18 June 2018, 03:54:22 UTC | Crash on Windows, because of shared_ptr reinterpret cast (#3999) Summary: For more details see #3998 Closes https://github.com/facebook/rocksdb/pull/3999 Differential Revision: D8458905 Pulled By: sagar0 fbshipit-source-id: d6e09182933253a08eaf81ac7cfe50ed3b6576c5 | 18 June 2018, 03:56:33 UTC |
80bc359 | Zhongyi Xie | 16 June 2018, 02:24:21 UTC | Should only decode restart points for uncompressed blocks (#3996) Summary: The Block object assumes contents are uncompressed. Block's constructor tries to read the number of restarts, but does not get an accurate number when its contents are compressed, which is causing issues like https://github.com/facebook/rocksdb/issues/3843. This PR address this issue by skipping reconstruction of restart points when blocks are known to be compressed. Somehow the restart points can be read directly when Snappy is used and some tests (for example https://github.com/facebook/rocksdb/blob/master/db/db_block_cache_test.cc#L196) expects blocks to be fully constructed even when Snappy compression is used, so here we keep the restart point logic for Snappy. Closes https://github.com/facebook/rocksdb/pull/3996 Differential Revision: D8416186 Pulled By: miasantreble fbshipit-source-id: 002c0b62b9e5d89fb7736563d354ce0023c8cb28 | 16 June 2018, 02:26:58 UTC |
c48764b | Anand Ananthabhotla | 16 June 2018, 00:53:12 UTC | Don't generate a notification for a 0 size SST (#4003) Summary: Don't call the OnTableFileCreated listener callback when a 0 size SST file gets created by Flush. Doing so causes an assertion failure in db_stress. It is also not correct behavior as we call env->DeleteFile() for such files right before the notification. Closes https://github.com/facebook/rocksdb/pull/4003 Differential Revision: D8461385 Pulled By: anand1976 fbshipit-source-id: ae92d4f921c2e2cff981ad58f4929ed8b609f35d | 16 June 2018, 00:57:24 UTC |
3fbc865 | zhichao-cao | 16 June 2018, 00:23:08 UTC | Add kOptionsStatistics to GetProperty() (#3966) Summary: Add a new DB property to DB::GetProperty(), which returns the option.statistics. Test is updated to pass. Closes https://github.com/facebook/rocksdb/pull/3966 Differential Revision: D8311139 Pulled By: zhichao-cao fbshipit-source-id: ea78f4727358c807b0e5a0ea62e09defb10ad9ac | 16 June 2018, 00:28:01 UTC |
7b5f7ff | Daniel Black | 15 June 2018, 23:42:35 UTC | travis: osx install zstd lz4 snappy xz (#3893) Summary: test osx against the brew libraries zstd, lz4, snappy, xz. Closes https://github.com/facebook/rocksdb/pull/3893 Differential Revision: D8461988 Pulled By: siying fbshipit-source-id: cc2a8487bcb1e98ca05bddd3a509a6896258ccf8 | 15 June 2018, 23:57:30 UTC |
906a602 | Tomas Kolda | 15 June 2018, 19:39:16 UTC | Build and tests fixes for Solaris Sparc (#4000) Summary: Here are some fixes for build on Solaris Sparc. It is also fixing CRC test on BigEndian platforms. Closes https://github.com/facebook/rocksdb/pull/4000 Differential Revision: D8455394 Pulled By: ajkr fbshipit-source-id: c9289a7b541a5628139c6b77e84368e14dc3d174 | 15 June 2018, 19:42:53 UTC |
f23fed1 | 奏之章 | 15 June 2018, 19:28:06 UTC | Delay verify compaction output table (#3979) Summary: Verify table will load SST into `TableCache` it occupy memory & `TableCache`‘s capacity ... but no logic use them it's unnecessary ... so , we verify them after all sub compact finished Closes https://github.com/facebook/rocksdb/pull/3979 Differential Revision: D8389946 Pulled By: ajkr fbshipit-source-id: 54bd4f474f9e7b3accf39c3068b1f36a27ec4c49 | 15 June 2018, 19:42:53 UTC |
4faaab7 | Hans-Wilhelm Warlo | 15 June 2018, 19:07:38 UTC | Benchmark sine wave write rate limit (#3914) Summary: As mentioned at the [dev forum.](https://www.facebook.com/groups/rocksdb.dev/1693425187422655/) Let me know if you would like me to do any changes! Closes https://github.com/facebook/rocksdb/pull/3914 Differential Revision: D8452824 Pulled By: siying fbshipit-source-id: 56439b3228ecdcc5a199d5198eff2fab553be961 | 15 June 2018, 19:12:03 UTC |
f5281a5 | Siying Dong | 15 June 2018, 17:59:40 UTC | tools/check_format_compatible.sh to cover forward option reading too (#3994) Summary: Make sure that some recent releases can read master's option files while ignoring unknown options. Also add two more recent release branches. Closes https://github.com/facebook/rocksdb/pull/3994 Differential Revision: D8409499 Pulled By: siying fbshipit-source-id: 1b025f19ba288da0517f6b4572797573e23e23c2 | 15 June 2018, 18:12:29 UTC |
fbe3b9e | Fenggang Wu | 15 June 2018, 17:32:20 UTC | Udpate db_universal_compaction_test according to PR #3970 (#3995) Summary: The SST file sizes changed slightly after the improvement of PR #3970 which reduces the size of the properties block. Before PR #3970 a size ratio compaction included all of the first four flushed files but it only includes two files after. We increase the size_ratio universal compaction option to make that compaction include all four files again. Closes https://github.com/facebook/rocksdb/pull/3995 Differential Revision: D8426925 Pulled By: fgwu fbshipit-source-id: 1429c38672e9f4fb4d4881fd4b06db45c4861d62 | 15 June 2018, 17:42:21 UTC |
1f32dc7 | Andrew Kryczka | 14 June 2018, 00:28:31 UTC | Check with PosixEnv before opening LOCK file (#3993) Summary: Rebased and resubmitting #1831 on behalf of stevelittle. The problem is when a single process attempts to open the same DB twice, the second attempt fails due to LOCK file held. If the second attempt had opened the LOCK file, it'll now need to close it, and closing causes the file to be unlocked. Then, any subsequent attempt to open the DB will succeed, which is the wrong behavior. The solution was to track which files a process has locked in PosixEnv, and check those before opening a LOCK file. Fixes #1780. Closes https://github.com/facebook/rocksdb/pull/3993 Differential Revision: D8398984 Pulled By: ajkr fbshipit-source-id: 2755fe66950a0c9de63075f932f9e15768041918 | 14 June 2018, 00:32:04 UTC |
7497f99 | Andrew Kryczka | 13 June 2018, 23:39:03 UTC | Run manual compaction in stress/crash tests (#3936) Summary: - Add support to `db_stress` for `CompactRange` - Enable `CompactRange` and `CompactFiles` in crash tests Closes https://github.com/facebook/rocksdb/pull/3936 Differential Revision: D8230953 Pulled By: ajkr fbshipit-source-id: 208f9980b5bc8c204b1fa726e83791ad674e21e8 | 13 June 2018, 23:45:28 UTC |
dd216dd | Andrew Kryczka | 13 June 2018, 20:37:21 UTC | Choose unique keys faster in db_stress (#3990) Summary: db_stress initialization randomly chooses a set of keys to not overwrite. It was doing it separately for each column family. That caused 30+ second initialization times for the non-simple crash tests, which have 10 CFs. This PR: - reuses the same set of randomly chosen no-overwrite keys across all CFs - logs a couple more timestamps so we can more easily see initialization time Closes https://github.com/facebook/rocksdb/pull/3990 Differential Revision: D8393821 Pulled By: ajkr fbshipit-source-id: d0b263a298df607285ffdd8b0983ff6575cc6c34 | 13 June 2018, 20:43:23 UTC |
a720401 | Andrew Kryczka | 13 June 2018, 20:03:09 UTC | Avoid acquiring SyncPoint mutex when it is disabled (#3991) Summary: In `db_stress` profile the vast majority of CPU time is spent acquiring the `SyncPoint` mutex. I mistakenly assumed #3939 had fixed this mutex contention problem by disabling `SyncPoint` processing. But actually the lock was still being acquired just to check whether processing is enabled. We can avoid that overhead by using an atomic to track whether it's enabled. Closes https://github.com/facebook/rocksdb/pull/3991 Differential Revision: D8393825 Pulled By: ajkr fbshipit-source-id: 5bc4e3c722ee7304e7a9c2439998c456b05a6897 | 13 June 2018, 20:13:18 UTC |
d82f142 | Siying Dong | 12 June 2018, 23:43:44 UTC | Fix regression bug of Prev() with upper bound (#3989) Summary: A recent change pushed down the upper bound checking to child iterators. However, this causes the logic of following sequence wrong: Seek(key); if (!Valid()) SeekToLast(); Because !Valid() may be caused by upper bounds, rather than the end of the iterator. In this case SeekToLast() points to totally wrong places. This can cause wrong results, infinite loops, or segfault in some cases. This sequence is called when changing direction from forward to backward. And this by itself also implicitly happen during reseeking optimization in Prev(). Fix this bug by using SeekForPrev() rather than this sequuence, as what is already done in prefix extrator case. Closes https://github.com/facebook/rocksdb/pull/3989 Differential Revision: D8385422 Pulled By: siying fbshipit-source-id: 429e869990cfd2dc389421e0836fc496bed67bb4 | 12 June 2018, 23:57:36 UTC |
9d34733 | Andrew Kryczka | 12 June 2018, 20:38:09 UTC | Fix argument mismatch in BlockBasedTableBuilder (#3974) Summary: The sixth argument should be `key_includes_seq` bool, the seventh a `GetContext*`. We were mistakenly passing the `GetContext*` as the sixth argument and relying on the default (nullptr) for the seventh. This would make statistics inaccurate, at least. Blame: 402b7aa0 Closes https://github.com/facebook/rocksdb/pull/3974 Differential Revision: D8344907 Pulled By: ajkr fbshipit-source-id: 3ad865a0541d6d30f75dfc726352788118cfe12e | 12 June 2018, 20:57:44 UTC |
9c7da96 | shpala | 12 June 2018, 20:36:05 UTC | Fix a crash in WinEnvIO::GetSectorSize (#3975) Summary: Fix a crash in `WinEnvIO::GetSectorSize` that happens on old Windows systems (e.g Windows 7). On old Windows systems that don't support querying StorageAccessAlignmentProperty using IOCTL_STORAGE_QUERY_PROPERTY, the flow calls a different DeviceIoControl with nullptr as lpBytesReturned. When the code reaches this point, we get an access violation. Closes https://github.com/facebook/rocksdb/pull/3975 Differential Revision: D8385186 Pulled By: ajkr fbshipit-source-id: fae4c9b4b0a52c8a10182e1b35bcaa30dc393bbb | 12 June 2018, 20:45:18 UTC |
3593275 | Fenggang Wu | 12 June 2018, 19:46:48 UTC | Remove restart point from the properties_block (#3970) Summary: Property block will be read sequentially and cached in a heap located object, so there's no need for restart points. Thus we set the restart interval to infinity to save space. Closes https://github.com/facebook/rocksdb/pull/3970 Differential Revision: D8332586 Pulled By: fgwu fbshipit-source-id: 899c3267832a81d0f084ec2db6b387332f461134 | 12 June 2018, 19:57:37 UTC |
f450294 | Fenggang Wu | 08 June 2018, 19:53:23 UTC | Change db path for BlockBasedTableTest.BadOptions (#3965) Summary: BadOptions test creates a temporary db path changed to table_block_based_bad_options_test to avoid collide with that created by the PrefixAndWholeKeyTest Closes https://github.com/facebook/rocksdb/pull/3965 Differential Revision: D8316080 Pulled By: fgwu fbshipit-source-id: bb8e0fdfdb9abf0e5ce94494b4388cd1622ee032 | 08 June 2018, 19:57:14 UTC |
3470c75 | Yanqin Jin | 07 June 2018, 22:34:42 UTC | Fix build errors. Summary: Closes https://github.com/facebook/rocksdb/pull/3967 Differential Revision: D8322775 Pulled By: riversand963 fbshipit-source-id: bd73067bd5d3ed4627348f0685bc499359ad6442 | 07 June 2018, 22:43:09 UTC |
23e1d23 | Zhichao Cao | 07 June 2018, 18:34:52 UTC | Fixed the fprintf of uint64_t by using PRIu64 (#3963) Summary: Fixed the fprintf format of uint64_t by using PRIu64 in file tools/ldb_cmd.cc Closes https://github.com/facebook/rocksdb/pull/3963 Differential Revision: D8306179 Pulled By: zhichao-cao fbshipit-source-id: 597dcd55321576801bbf2cf4714736ebc4750a0c | 07 June 2018, 18:44:48 UTC |
0a0860a | Yanqin Jin | 07 June 2018, 17:34:56 UTC | Refactoring db_stress.cc (#3902) Summary: We use `db_stress.cc` intensively to test and verify the behavior of RocksDB. Sometimes we need to add new tests for recently added features. Original `StressTest` class provides many general functionality that can be leveraged by other tests. Therefore, in this refactoring PR, I try to identify the general operations as well as operations that future tests most likely want to customize. Future tests can inherit `StressTest` and overriding the virtual functions to test custom logic. Closes https://github.com/facebook/rocksdb/pull/3902 Differential Revision: D8284607 Pulled By: riversand963 fbshipit-source-id: 019302d04665a2b18334b6d05d04a477168c8ea4 | 07 June 2018, 17:43:00 UTC |
45b6bcc | Zhongyi Xie | 07 June 2018, 06:30:26 UTC | ZSTD compression: should also expect type = kZSTDNotFinalCompression (#3964) Summary: Depending on the compression type, `CompressBlock` calls the compress method for each compression type. It calls ZSTD_Compress for both kZSTD and kZSTDNotFinalCompression (https://github.com/facebook/rocksdb/blob/master/table/block_based_table_builder.cc#L169). However currently ZSTD_Compress only expects the type to be kZSTD and this is causing assert failures and crashes. The same also applies to ZSTD_Uncompress. Closes https://github.com/facebook/rocksdb/pull/3964 Differential Revision: D8308715 Pulled By: miasantreble fbshipit-source-id: e5125f53edb829c9c33733167bec74e4793d0782 | 07 June 2018, 06:42:29 UTC |
b736521 | Maysam Yabandeh | 06 June 2018, 23:44:52 UTC | Extend format 3 to partitioned index/filters (#3958) Summary: format_version 3 changes the format of index blocks by storing user keys instead of the internal keys, which saves 8-bytes per key. This patch extends the format to top-level indexes in partitioned index/filters. Closes https://github.com/facebook/rocksdb/pull/3958 Differential Revision: D8294615 Pulled By: maysamyabandeh fbshipit-source-id: 17666cc16b8076c363972e2308e31547e835f0fe | 06 June 2018, 23:58:16 UTC |
5504a05 | Pooja Malik | 06 June 2018, 21:28:51 UTC | Adding advisor Rules and parser scripts with unit tests. (#3934) Summary: This adds some rules in the tools/advisor/advisor/rules.ini (refer this for more information) file and corresponding python parser scripts for parsing the rules file and the rocksdb LOG and OPTIONS files. This is WIP for adding rules depending on ODS. The starting point of the script is the rocksdb/tools/advisor/advisor/rule_parser.py file. Closes https://github.com/facebook/rocksdb/pull/3934 Reviewed By: maysamyabandeh Differential Revision: D8304059 Pulled By: poojam23 fbshipit-source-id: 47f2a50f04d46d40e225dd1cbf58ba490f79e239 | 06 June 2018, 21:42:59 UTC |
4420df4 | Andrew Kryczka | 05 June 2018, 21:03:42 UTC | Check conflict at output level in CompactFiles (#3926) Summary: CompactFiles checked whether the existing files conflicted with the chosen compaction. But it missed checking whether future files would conflict, i.e., when another compaction was simultaneously writing new files to the same range at the same output level. Closes https://github.com/facebook/rocksdb/pull/3926 Differential Revision: D8218996 Pulled By: ajkr fbshipit-source-id: 21cb00a6fed4c8c62d3ed2ff810962e6bdc2fdfb | 05 June 2018, 21:14:05 UTC |
f1592a0 | Zhongyi Xie | 05 June 2018, 19:51:05 UTC | run make format for PR 3838 (#3954) Summary: PR https://github.com/facebook/rocksdb/pull/3838 made some changes that triggers lint warnings. Run `make format` to fix formatting as suggested by siying . Also piggyback two changes: 1) fix singleton destruction order for windows and posix env 2) fix two clang warnings Closes https://github.com/facebook/rocksdb/pull/3954 Differential Revision: D8272041 Pulled By: miasantreble fbshipit-source-id: 7c4fd12bd17aac13534520de0c733328aa3c6c9f | 05 June 2018, 19:58:02 UTC |
812c737 | Mike Kolupaev | 05 June 2018, 18:32:07 UTC | Fix performance regression in Get() for block-based tables (#3953) Summary: This fixes a regression in one of myrocks regression tests (readwhilewriting), introduced in https://github.com/facebook/rocksdb/commit/8bf555f487d1de84a4fb19cb97b9ae1a8dbebc60 This PR changes two lines of code: one of them actually fixes the observed regression, the other is a mostly unrelated small fix that I'm piggy-backing here. EDIT: Nevermind, it fixes one line. More details in inline comments. Closes https://github.com/facebook/rocksdb/pull/3953 Differential Revision: D8270664 Pulled By: al13n321 fbshipit-source-id: a7d91e196807d1e816551591257c700f70e4ccac | 05 June 2018, 18:43:16 UTC |
d0c38c0 | Maysam Yabandeh | 05 June 2018, 02:59:44 UTC | Extend some tests to format_version=3 (#3942) Summary: format_version=3 changes the format of SST index. This is however not being tested currently since tests only work with the default format_version which is currently 2. The patch extends the most related tests to also test for format_version=3. Closes https://github.com/facebook/rocksdb/pull/3942 Differential Revision: D8238413 Pulled By: maysamyabandeh fbshipit-source-id: 915725f55753dd8e9188e802bf471c23645ad035 | 05 June 2018, 03:13:00 UTC |
2210152 | Andrew Kryczka | 04 June 2018, 22:52:31 UTC | Fix singleton destruction order of PosixEnv and SyncPoint (#3951) Summary: Ensure the PosixEnv singleton is destroyed first since its destructor waits for background threads to all complete. This ensures background threads cannot hit sync points after the SyncPoint singleton is destroyed, which was previously possible. Closes https://github.com/facebook/rocksdb/pull/3951 Differential Revision: D8265295 Pulled By: ajkr fbshipit-source-id: 7738dd458c5d993a78377dd0420e82badada81ab | 04 June 2018, 22:58:46 UTC |
ab2254b | Manuel Ung | 04 June 2018, 21:39:45 UTC | Fix clang analyze Summary: This fixes the errors as reported here: https://github.com/facebook/rocksdb/pull/3941#issuecomment-394424043 Closes https://github.com/facebook/rocksdb/pull/3950 Differential Revision: D8263086 Pulled By: lth fbshipit-source-id: 5e148d489cab2153e5846d16979a0a1f2d677d57 | 04 June 2018, 21:44:23 UTC |
f4b72d7 | Dmitri Smirnov | 04 June 2018, 19:04:52 UTC | Provide a way to override windows memory allocator with jemalloc for ZSTD Summary: Windows does not have LD_PRELOAD mechanism to override all memory allocation functions and ZSTD makes use of C-tuntime calloc. During flushes and compactions default system allocator fragments and the system slows down considerably. For builds with jemalloc we employ an advanced ZSTD context creation API that re-directs memory allocation to jemalloc. To reduce the cost of context creation on each block we cache ZSTD context within the block based table builder while a new SST file is being built, this will help all platform builds including those w/o jemalloc. This avoids system allocator fragmentation and improves the performance. The change does not address random reads and currently on Windows reads with ZSTD regress as compared with SNAPPY compression. Closes https://github.com/facebook/rocksdb/pull/3838 Differential Revision: D8229794 Pulled By: miasantreble fbshipit-source-id: 719b622ab7bf4109819bc44f45ec66f0dd3ee80d | 04 June 2018, 19:12:48 UTC |
4f297ad | Andrew Kryczka | 04 June 2018, 04:28:41 UTC | Fix crash test check for direct I/O Summary: We need to keep the DB directory around since the direct IO check in "db_crashtest.py" relies on it existing. This PR fixes an issue where it was removed after each stress test run during the second half of whitebox crash testing. Closes https://github.com/facebook/rocksdb/pull/3946 Differential Revision: D8247998 Pulled By: ajkr fbshipit-source-id: 4e7cffbdab9b40df125e7842d0d59916e76261d3 | 04 June 2018, 04:42:12 UTC |
50d7ac0 | Zhongyi Xie | 02 June 2018, 03:28:19 UTC | Fix test for rocksdb_lite: hide incompatible option kDirectIO Summary: Previous commit https://github.com/facebook/rocksdb/pull/3935 unhide a few test options which includes kDirectIO. However it's not supported by RocksDB lite. Need to hide this option from the lite build. Closes https://github.com/facebook/rocksdb/pull/3943 Differential Revision: D8242757 Pulled By: miasantreble fbshipit-source-id: 1edfad3a5d01a46bfb7eedee765981ebe02c500a | 02 June 2018, 03:42:36 UTC |
fea2b1d | Andrew Kryczka | 01 June 2018, 23:46:32 UTC | Copy Get() result when file reads use mmap Summary: For iterator reads, a `SuperVersion` is pinned to preserve a snapshot of SST files, and `Block`s are pinned to allow `key()` and `value()` to return pointers directly into a RocksDB memory region. This works for both non-mmap reads, where the block owns the memory region, and mmap reads, where the file owns the memory region. For point reads with `PinnableSlice`, only the `Block` object is pinned. This works for non-mmap reads because the block owns the memory region, so even if the file is deleted after compaction, the memory region survives. However, for mmap reads, file deletion causes the memory region to which the `PinnableSlice` refers to be unmapped. The result is usually a segfault upon accessing the `PinnableSlice`, although sometimes it returned wrong results (I repro'd this a bunch of times with `db_stress`). This PR copies the value into the `PinnableSlice` when it comes from mmap'd memory. We can tell whether the `Block` owns its memory using `Block::cachable()`, which is unset when reads do not use the provided buffer as is the case with mmap file reads. When that is false we ensure the result of `Get()` is copied. This feels like a short-term solution as ideally we'd have the `PinnableSlice` pin the mmap'd memory so we can do zero-copy reads. It seemed hard so I chose this approach to fix correctness in the meantime. Closes https://github.com/facebook/rocksdb/pull/3881 Differential Revision: D8076288 Pulled By: ajkr fbshipit-source-id: 31d78ec010198723522323dbc6ea325122a46b08 | 01 June 2018, 23:57:58 UTC |
88c3ee2 | Andrew Kryczka | 01 June 2018, 23:33:07 UTC | Configure direct I/O statically in db_stress Summary: Previously `db_stress` attempted to configure direct I/O dynamically in `SetOptions()` which had multiple problems (ummm must've never been tested): - It's a DB option so SetDBOptions should've been called instead - It's not a dynamic option so even SetDBOptions would fail - It required enabling SyncPoint to mask O_DIRECT since it had no way to detect whether the DB directory was in tmpfs or not. This required locking that consumed ~80% of db_stress CPU. In this PR I delete the broken dynamic config and instead configure it statically, only enabling it if the DB directory truly supports O_DIRECT. Closes https://github.com/facebook/rocksdb/pull/3939 Differential Revision: D8238120 Pulled By: ajkr fbshipit-source-id: 60bb2deebe6c9b54a3f788079261715b4a229279 | 01 June 2018, 23:42:34 UTC |
01e3c30 | Manuel Ung | 01 June 2018, 21:57:55 UTC | Extend existing unit tests to run with WriteUnprepared as well Summary: As titled. I have not extended the Compatibility tests because the new WAL markers are still unimplemented. Closes https://github.com/facebook/rocksdb/pull/3941 Differential Revision: D8238394 Pulled By: lth fbshipit-source-id: 980e3d44837bbf2cfa64047f9738f559dfac4b1d | 01 June 2018, 21:58:41 UTC |
89b3708 | straw | 01 June 2018, 16:33:10 UTC | add c api rocksdb_sstfilewriter_file_size Summary: Closes https://github.com/facebook/rocksdb/pull/3922 Differential Revision: D8208528 Pulled By: ajkr fbshipit-source-id: d384fe53cf526f2aadc7b79a423ce36dbd3ff224 | 01 June 2018, 16:43:59 UTC |
2a0dfaa | Zhongyi Xie | 01 June 2018, 04:27:08 UTC | fix PrefixExtractorChanged: pass raw pointer instead shared_ptr Summary: This should resolve the performance regression caused by the unnecessary copying of the shared_ptr. Closes https://github.com/facebook/rocksdb/pull/3937 Differential Revision: D8232330 Pulled By: miasantreble fbshipit-source-id: 7885bf7cd190b6f87164c52d6edd328298c13f97 | 01 June 2018, 04:42:50 UTC |
44cf849 | Maysam Yabandeh | 01 June 2018, 02:16:11 UTC | Fix the bug of some test scenarios being put after kEnd Summary: DBTestBase::OptionConfig includes the scenarios that unit tests could iterate over them by calling ChangeOptions(). Some of the options have been mistakenly put after kEnd which makes them essentially invisible to ChangeOptions() caller. This patch fixes it except for kUniversalSubcompactions which is left as TODO since it would break some unit tests. Closes https://github.com/facebook/rocksdb/pull/3935 Differential Revision: D8230748 Pulled By: maysamyabandeh fbshipit-source-id: edddb8fffcd161af1809fef24798ce118f8593db | 01 June 2018, 02:28:00 UTC |
2807678 | QingpingWang | 01 June 2018, 00:20:56 UTC | c api set bottommost level compaction Summary: Closes https://github.com/facebook/rocksdb/pull/3928 Differential Revision: D8224962 Pulled By: ajkr fbshipit-source-id: 3caf463509a935bff46530f27232a85ae7e4e484 | 01 June 2018, 00:30:50 UTC |
82089d5 | Siying Dong | 31 May 2018, 19:53:43 UTC | DBImpl::FindObsoleteFiles() not to call GetChildren() on the same path Summary: DBImpl::FindObsoleteFiles() may call GetChildren() multiple times if different CFs are on the same path. Fix it. Closes https://github.com/facebook/rocksdb/pull/3885 Differential Revision: D8084634 Pulled By: siying fbshipit-source-id: b471fbc251f6a05e9243304dc14c0831060cc0b0 | 31 May 2018, 19:58:33 UTC |
a35451e | maoyouxiang | 31 May 2018, 18:00:23 UTC | fix deadlock with enable_pipelined_write=true and max_successive_merges > 0 Summary: fix this https://github.com/facebook/rocksdb/issues/3916 Closes https://github.com/facebook/rocksdb/pull/3923 Differential Revision: D8215192 Pulled By: yiwu-arbug fbshipit-source-id: a4c2f839a91d92dc70906d2b7c6de0fe014a2422 | 31 May 2018, 18:13:14 UTC |
aaac6cd | Manuel Ung | 31 May 2018, 17:42:44 UTC | Add write unprepared classes by inheriting from write prepared Summary: Closes https://github.com/facebook/rocksdb/pull/3907 Differential Revision: D8218325 Pulled By: lth fbshipit-source-id: ff32d8dab4a159cd2762876cba4b15e3dc51ff3b | 31 May 2018, 17:47:42 UTC |
727eb88 | Jacquin Mininger | 31 May 2018, 00:45:02 UTC | Compile error in db bench tool Summary: Small format error below causes build to fail. I believe that this : ``` fprintf(stderr, "num reads to do %lu\n", reads_); ``` Can be changed to this: ``` fprintf(stderr, "num reads to do %" PRIu64 "\n", reads_); ``` Successful build ``` CC utilities/blob_db/blob_dump_tool.o AR librocksdb_debug.a ar: creating archive librocksdb_debug.a /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/ranlib: file: librocksdb_debug.a(rocks_lua_compaction_filter.o) has no symbols CC tools/db_bench.o CC tools/db_bench_tool.o tools/db_bench_tool.cc:4532:46: error: format specifies type 'unsigned long' but the argument has type 'int64_t' (aka 'long long') [-Werror,-Wformat] fprintf(stderr, "num reads to do %lu\n", reads_); ~~~ ^~~~~~ %lld 1 error generated. make: *** [tools/db_bench_tool.o] Error 1 ``` ``` $ cd rocksdb $ make all $ g++ --version Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/usr/include/c++/4.2.1 Apple LLVM version 9.1.0 (clang-902.0.39.1) Target: x86_64-apple-darwin17.5.0 Thread model: posix InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin ``` Closes https://github.com/facebook/rocksdb/pull/3909 Differential Revision: D8215710 Pulled By: siying fbshipit-source-id: 15e49fb02a818fec846e9f9b2a50e372b6b67751 | 31 May 2018, 01:01:36 UTC |
4dd80de | Siying Dong | 30 May 2018, 22:33:19 UTC | Remove tests from ROCKSDB_VALGRIND_RUN Summary: In order to make valgrind check test to pass in a day, remove some tests that run prohibitively slow under valgrind. Closes https://github.com/facebook/rocksdb/pull/3924 Differential Revision: D8210184 Pulled By: siying fbshipit-source-id: 5b06fb08f3cf57571d422d05a0dbddc9f9376f7a | 30 May 2018, 23:15:16 UTC |
a736255 | Anand Ananthabhotla | 29 May 2018, 22:42:14 UTC | Delete triggered compaction for universal style Summary: This is still WIP, but I'm hoping for early feedback on the overall approach. This patch implements deletion triggered compaction, which till now only worked for leveled, for universal style. SST files are marked for compaction by the CompactOnDeletionCollertor table property. This is expected to be used when free disk space is low and the user wants to reclaim space by deleting a bunch of keys. The deletions are expected to be dense. In such a situation, we want to avoid a full compaction due to its space overhead. The strategy used in this case is similar to leveled. We pick one file from the set of files marked for compaction. We then expand the inputs to a clean cut on the same level, and then pick overlapping files from the next non-mepty level. Picking files from the next level can cause the key range to expand, and we opportunistically expand inputs in the source level to include files wholly in this key range. The main side effect of this is that it breaks the property of no time range overlap between levels. This shouldn't break any functionality. Closes https://github.com/facebook/rocksdb/pull/3860 Differential Revision: D8124397 Pulled By: anand1976 fbshipit-source-id: bfa2a9dd6817930e991b35d3a8e7e61304ed3dcf | 29 May 2018, 22:44:34 UTC |
724855c | Yi Wu | 29 May 2018, 22:11:22 UTC | Fix LRUCache missing null check on destruct Summary: Fix LRUCache missing null check on destruct. The check is needed if LRUCache::DisownData is called. Closes https://github.com/facebook/rocksdb/pull/3920 Differential Revision: D8191631 Pulled By: yiwu-arbug fbshipit-source-id: d5014f6e49b51692c18a25fb55ece935f5a023c4 | 29 May 2018, 22:13:09 UTC |
cf826de | Yanqin Jin | 29 May 2018, 19:26:57 UTC | Fix compilation error when OPT="-DROCKSDB_LITE". Summary: Closes https://github.com/facebook/rocksdb/pull/3917 Differential Revision: D8187733 Pulled By: riversand963 fbshipit-source-id: e4aa179cd0791ca77167e357f99de9afd4aef910 | 29 May 2018, 19:28:59 UTC |
03cda53 | Maysam Yabandeh | 29 May 2018, 19:09:01 UTC | Check for rep_->table_properties being nullptr Summary: The very old sst formats do not have table_properties and rep_->table_properties is thus nullptr. The recent patch in https://github.com/facebook/rocksdb/pull/3894 does not check for nullptr and hence makes it backward incompatible. This patch adds the check. Closes https://github.com/facebook/rocksdb/pull/3918 Differential Revision: D8188638 Pulled By: maysamyabandeh fbshipit-source-id: b1d986665ecf0b4d1c442adfa8a193b97707d47b | 29 May 2018, 19:13:55 UTC |
1c1bafa | 奏之章 | 28 May 2018, 18:13:17 UTC | Fix VersionStorageInfo::EstimateLiveDataSize seg fault Summary: `HandleEstimateLiveDataSize`'s `need_out_of_mutex` is true https://github.com/facebook/rocksdb/blob/402b7aa07f0e6da4c1f0216ff2b2e50fd2e5eaac/db/internal_stats.cc#L412-L413 so , is will ref a `SuperVersion` https://github.com/facebook/rocksdb/blob/402b7aa07f0e6da4c1f0216ff2b2e50fd2e5eaac/db/db_impl.cc#L1896-L1908 so , the param `version` of `InternalStats::HandleEstimateLiveDataSize` is safe , but `cfd_->current()` is not safe ! https://github.com/facebook/rocksdb/blob/402b7aa07f0e6da4c1f0216ff2b2e50fd2e5eaac/db/internal_stats.cc#L790-L795 the `cfd_->current()` maybe invalid ... here's mongo-rocks crash backtrace ``` mongod(mongo::printStackTrace(std::basic_ostream<char, std::char_traits<char> >&)+0x41) [0x7fe3a3137c51] mongod(+0x2152E89) [0x7fe3a3136e89] mongod(+0x21534F6) [0x7fe3a31374f6] libpthread.so.0(+0xF5E0) [0x7fe39f5e45e0] mongod(rocksdb::InternalKeyComparator::Compare(rocksdb::Slice const&, rocksdb::Slice const&) const+0x17) [0x7fe3a22375a7] mongod(rocksdb::VersionStorageInfo::EstimateLiveDataSize() const+0x3AA) [0x7fe3a228daba] mongod(rocksdb::InternalStats::HandleEstimateLiveDataSize(unsigned long*, rocksdb::DBImpl*, rocksdb::Version*)+0x20) [0x7fe3a2250d70] mongod(rocksdb::DBImpl::GetIntPropertyInternal(rocksdb::ColumnFamilyData*, rocksdb::DBPropertyInfo const&, bool, unsigned long*)+0xEF) [0x7fe3a21e3dbf] ``` Closes https://github.com/facebook/rocksdb/pull/3912 Differential Revision: D8179944 Pulled By: yiwu-arbug fbshipit-source-id: 26f314a8f98f4c2dc4348745d759f26f0e8d95e1 | 28 May 2018, 18:27:08 UTC |
402b7aa | Maysam Yabandeh | 26 May 2018, 01:41:31 UTC | Exclude seq from index keys Summary: Index blocks have the same format as data blocks. The keys therefore similarly to the keys in the data blocks are internal keys, which means that in addition to the user key it also has 8 bytes that encodes sequence number and value type. This extra 8 bytes however is not necessary in index blocks since the index keys act as an separator between two data blocks. The only exception is when the last key of a block and the first key of the next block share the same user key, in which the sequence number is required to act as a separator. The patch excludes the sequence from index keys only if the above special case does not happen for any of the index keys. It then records that in the property block. The reader looks at the property block to see if it should expect sequence numbers in the keys of the index block.s Closes https://github.com/facebook/rocksdb/pull/3894 Differential Revision: D8118775 Pulled By: maysamyabandeh fbshipit-source-id: 915479f028b5799ca91671d67455ecdefbd873bd | 26 May 2018, 01:42:43 UTC |
8c3bf08 | Nathan VanBenschoten | 26 May 2018, 00:35:36 UTC | Check status when reading HashIndexPrefixesMetadataBlock Summary: This was missed in a refactor of `ReadBlockContents` (2f1a3a4). Closes https://github.com/facebook/rocksdb/pull/3906 Differential Revision: D8172648 Pulled By: ajkr fbshipit-source-id: 27e453b19795fea974bfed4721105be6f3a12090 | 26 May 2018, 00:42:51 UTC |
4543417 | Adam Retter | 25 May 2018, 22:06:47 UTC | Fix an issue with unnecessary capture in lambda expressions Summary: Closes https://github.com/facebook/rocksdb/issues/3900 Replaces https://github.com/facebook/rocksdb/pull/3901 I needed this to build v5.12.4 on Mac OS X (10.13.3). Closes https://github.com/facebook/rocksdb/pull/3904 Differential Revision: D8169357 Pulled By: sagar0 fbshipit-source-id: 85faac42168796e7def9250d0c221a9a03b84476 | 25 May 2018, 22:12:44 UTC |
aa53579 | Yanqin Jin | 25 May 2018, 18:45:12 UTC | Fix segfault caused by object premature destruction Summary: Please refer to earlier discussion in [issue 3609](https://github.com/facebook/rocksdb/issues/3609). There was also an alternative fix in [PR 3888](https://github.com/facebook/rocksdb/pull/3888), but the proposed solution requires complex change. To summarize the cause of the problem. Upon creation of a column family, a `BlockBasedTableFactory` object is `new`ed and encapsulated by a `std::shared_ptr`. Since there is no other `std::shared_ptr` pointing to this `BlockBasedTableFactory`, when the column family is dropped, the `ColumnFamilyData` is `delete`d, causing the destructor of `std::shared_ptr`. Since there is no other `std::shared_ptr`, the underlying memory is also freed. Later when the db exits, it releases all the table readers, including the table readers that have been operating on the dropped column family. This needs to access the `table_options` owned by `BlockBasedTableFactory` that has already been deleted. Therefore, a segfault is raised. Previous workaround is to purge all obsolete files upon `ColumnFamilyData` destruction, which leads to a force release of table readers of the dropped column family. However this does not work when the user disables file deletion. Our solution in this PR is making a copy of `table_options` in `BlockBasedTable::Rep`. This solution increases memory copy and usage, but is much simpler. Test plan ``` $ make -j16 $ ./column_family_test --gtest_filter=ColumnFamilyTest.CreateDropAndDestroy:ColumnFamilyTest.CreateDropAndDestroyWithoutFileDeletion ``` Expected behavior: All tests should pass. Closes https://github.com/facebook/rocksdb/pull/3898 Differential Revision: D8149421 Pulled By: riversand963 fbshipit-source-id: eaecc2e064057ef607fbdd4cc275874f866c3438 | 25 May 2018, 18:57:51 UTC |
6e08916 | 奏之章 | 25 May 2018, 17:47:56 UTC | Fix Fadvise on closed file when reads use mmap Summary: ```PosixMmapReadableFile::fd_``` is closed after created, but needs to remain open for the lifetime of `PosixMmapReadableFile` since it is used whenever `InvalidateCache` is called. Closes https://github.com/facebook/rocksdb/pull/2764 Differential Revision: D8152515 Pulled By: ajkr fbshipit-source-id: b738a6a55ba4e392f9b0f374ff396a1e61c64f65 | 25 May 2018, 17:57:57 UTC |
070319f | QingpingWang | 25 May 2018, 05:26:09 UTC | add flush_before_backup parameter to c api rocksdb_backup_engine_create_new_backup Summary: Add flush_before_backup to rocksdb_backup_engine_create_new_backup. make c api able to control the flush before backup behavior. Closes https://github.com/facebook/rocksdb/pull/3897 Differential Revision: D8157676 Pulled By: ajkr fbshipit-source-id: 88998c62f89f087bf8672398fd7ddafabbada505 | 25 May 2018, 05:28:52 UTC |
bc7e8d4 | Yi Wu | 24 May 2018, 22:45:49 UTC | LRUCache midpoint insertion Summary: Implement midpoint insertion strategy where new blocks will be insert to the middle of LRU list, then move the head on the first hit in cache. Closes https://github.com/facebook/rocksdb/pull/3877 Differential Revision: D8100895 Pulled By: yiwu-arbug fbshipit-source-id: f4bd83cb8be469e5d02072cfc8bd66011391f3da | 24 May 2018, 22:57:33 UTC |
3db8504 | Dmitri Smirnov | 24 May 2018, 22:05:00 UTC | Catchup with posix features Summary: Catch up with Posix features NewWritableRWFile must fail when file does not exists Implement Env::Truncate() Adjust Env options optimization functions Implement MemoryMappedBuffer on Windows. Closes https://github.com/facebook/rocksdb/pull/3857 Differential Revision: D8053610 Pulled By: ajkr fbshipit-source-id: ccd0d46c29648a9f6f496873bc1c9d6c5547487e | 24 May 2018, 22:13:04 UTC |
c465509 | Kefu Chai | 24 May 2018, 18:59:27 UTC | port_posix: use posix_memalign() for aligned_alloc Summary: to workaround issue of http://tracker.ceph.com/issues/21422 . and in tcmalloc aligned_alloc and posix_memalign() are basically the same thing. the same applies to GNU glibc. fixes #3175 Signed-off-by: Kefu Chai <tchaikov@gmail.com> Closes https://github.com/facebook/rocksdb/pull/3862 Differential Revision: D8147930 Pulled By: yiwu-arbug fbshipit-source-id: 355afe93c4dd0a96a0d711ef190e8b86fbe8d11d | 24 May 2018, 19:13:16 UTC |
7a99c04 | Yi Wu | 24 May 2018, 01:53:17 UTC | refactor constructor of LRUCacheShard Summary: Update LRUCacheShard constructor so that adding new params to it don't need to add extra SetXXX() methods. Closes https://github.com/facebook/rocksdb/pull/3896 Differential Revision: D8128618 Pulled By: yiwu-arbug fbshipit-source-id: 6afa715de1493a50de413678761a765e3af9b83b | 24 May 2018, 01:57:42 UTC |
01bcc34 | Andrew Kryczka | 24 May 2018, 01:33:00 UTC | Introduce library-independent default compression level Summary: Previously we were using -1 as the default for every library, which was legacy from our zlib options. That worked for a while, but after zstd introduced https://github.com/facebook/zstd/commit/a146ee04ae5866b948be0c1911418e0436d80cb4, it started giving poor compression ratios by default in zstd. This PR adds a constant to RocksDB public API, `CompressionOptions::kDefaultCompressionLevel`, which will get translated to the default value specific to the compression library being used in "util/compression.h". The constant uses a number that appears to be larger than any library's maximum compression level. Closes https://github.com/facebook/rocksdb/pull/3895 Differential Revision: D8125780 Pulled By: ajkr fbshipit-source-id: 2db157a89118cd4f94577c2f4a0a5ff31c8391c6 | 24 May 2018, 01:42:08 UTC |
4011012 | Yanqin Jin | 23 May 2018, 22:59:16 UTC | Specify the underlying type of enums. Summary: Explicitly specify the underlying type of enums help developers understand the physical storage. Closes https://github.com/facebook/rocksdb/pull/3892 Differential Revision: D8107027 Pulled By: riversand963 fbshipit-source-id: a00efecbba46df4a3c8eed0994a2d4972ad1a1d3 | 23 May 2018, 23:12:59 UTC |
6c73a46 | Zhongyi Xie | 22 May 2018, 20:51:21 UTC | Fix a backward compatibility problem with table_properties being nullptr Summary: Currently when ldb built from master tries to open a DB from version 2.2, there will be a segfault because table_properties didn't exist back then. Closes https://github.com/facebook/rocksdb/pull/3890 Differential Revision: D8100914 Pulled By: miasantreble fbshipit-source-id: b255e8aedc54695432be2e704839c857dabdd65a | 22 May 2018, 20:57:17 UTC |
4420cb4 | Jacquin Mininger | 22 May 2018, 20:34:26 UTC | Fix Issue #3771: Slice ctor checks for nullptr and creates empty string Summary: Fix Issue #3771 : Check for nullptr in Slice constructor Slice ctor checks for nullptr and creates empty string if the string does not exist Closes https://github.com/facebook/rocksdb/pull/3887 Differential Revision: D8098852 Pulled By: ajkr fbshipit-source-id: 04471077defa9776ce7b8c389a61312ce31002fb | 22 May 2018, 20:41:56 UTC |
7db721b | Andrew Kryczka | 22 May 2018, 18:58:51 UTC | Avoid sleep in DBTest.GroupCommitTest to fix flakiness Summary: DBTest.GroupCommitTest would often fail when run under valgrind because its sleeps were insufficient to guarantee a group commit had multiple entries. Instead we can use sync point to force a leader to wait until a non-leader thread has enqueued its work, thus guaranteeing a leader can do group commit work for multiple threads. Closes https://github.com/facebook/rocksdb/pull/3883 Differential Revision: D8079429 Pulled By: ajkr fbshipit-source-id: 61dc50fad29d2c85547842f681288de60fa29049 | 22 May 2018, 19:16:25 UTC |
fcb3101 | Andrew Kryczka | 22 May 2018, 17:50:09 UTC | Avoid single-deleting merge operands in db_stress Summary: I repro'd some of the "unexpected value" failures showing up in our CI lately and they always happened on keys that have a mix of single deletes and merge operands. The `SingleDelete()` API comment mentions it's incompatible with `Merge()`, so this PR prevents `db_stress` from mixing them. Closes https://github.com/facebook/rocksdb/pull/3878 Differential Revision: D8097346 Pulled By: ajkr fbshipit-source-id: 357a48c6a31156f4f8db3ce565638ad924c437a1 | 22 May 2018, 17:58:36 UTC |
3db1ada | Siying Dong | 21 May 2018, 23:37:47 UTC | PersistRocksDBOptions() to use WritableFileWriter Summary: By using WritableFileWriter rather than WritableFile directly, we can buffer multiple Append() calls to one write() file system call, which will be expensive to underlying Env without its own write buffering. Closes https://github.com/facebook/rocksdb/pull/3882 Differential Revision: D8080673 Pulled By: siying fbshipit-source-id: e0db900cb3c178166aa738f3985db65e3ae2cf1b | 21 May 2018, 23:42:22 UTC |
c3ebc75 | Zhongyi Xie | 21 May 2018, 21:33:55 UTC | Move prefix_extractor to MutableCFOptions Summary: Currently it is not possible to change bloom filter config without restart the db, which is causing a lot of operational complexity for users. This PR aims to make it possible to dynamically change bloom filter config. Closes https://github.com/facebook/rocksdb/pull/3601 Differential Revision: D7253114 Pulled By: miasantreble fbshipit-source-id: f22595437d3e0b86c95918c484502de2ceca120c | 21 May 2018, 21:43:11 UTC |
263ef52 | Yanqin Jin | 21 May 2018, 18:52:31 UTC | Update ColumnFamilyTest for multi-CF verification Summary: Change `keys_` from `set<string>` to `vector<set<string>>` so that each column family's keys are stored in one set. ajkr When you have a chance, can you PTAL? Thanks! Closes https://github.com/facebook/rocksdb/pull/3871 Differential Revision: D8056447 Pulled By: riversand963 fbshipit-source-id: 650d0f9cad02b1bc005fc329ad76edbf053e6386 | 21 May 2018, 18:57:42 UTC |
508a09f | Andrew Kryczka | 21 May 2018, 18:09:00 UTC | Print histogram count and sum in statistics string Summary: Previously it only printed percentiles, even though our histogram keeps track of count and sum (and more). There have been many times we want to know more than the percentiles. For example, we currently want sum of "rocksdb.compression.times.nanos" and sum of "rocksdb.decompression.times.nanos", which would allow us to know the relative cost of compression vs decompression. This PR adds count and sum to the string printed by `StatisticsImpl::ToString`. This is a bit risky as there are definitely parsers assuming the old format. I will mention it in HISTORY.md and hope for the best... Closes https://github.com/facebook/rocksdb/pull/3863 Differential Revision: D8038831 Pulled By: ajkr fbshipit-source-id: 0465b72e4b0cbf18ef965f4efe402601d16d5b5c | 21 May 2018, 18:12:47 UTC |
7b65521 | Andrew Kryczka | 21 May 2018, 16:42:49 UTC | Assert keys/values pinned by range deletion meta-block iterators Summary: `RangeDelAggregator` holds the pointers returned by `BlockIter::key()` and `BlockIter::value()` so requires the data to which they point is pinned. `BlockIter::key()` points into block memory and is guaranteed to be pinned if and only if prefix encoding is disabled (or, equivalently, restart interval is set to one). I think `BlockIter::value()` is always pinned. Added an assert for these and removed the wrong TODO about increasing restart interval, which would enable key prefix encoding and break the assertion. Closes https://github.com/facebook/rocksdb/pull/3875 Differential Revision: D8063667 Pulled By: ajkr fbshipit-source-id: 60b5ebcc0cdd610dd6aad9e74a23378793672c41 | 21 May 2018, 16:57:00 UTC |
e410501 | Andrew Kryczka | 21 May 2018, 16:35:46 UTC | Add missing test files to src.mk Summary: We only generate the header dependency (".cc.d") files for files mentioned in "src.mk". When we don't generate them, changes to header dependencies do not cause `make` to recompile the dependent ".o". Then it takes a while for developers (or maybe just me) to realize `make clean` is necessary. Closes https://github.com/facebook/rocksdb/pull/3876 Differential Revision: D8065389 Pulled By: ajkr fbshipit-source-id: 0f62eee7bcab15b0215791564e6ab3775d46996b | 21 May 2018, 16:43:29 UTC |
ed4d339 | Zhongyi Xie | 19 May 2018, 04:44:07 UTC | fix a division by zero bug Summary: fixes the failing clang_analyze contrun test Closes https://github.com/facebook/rocksdb/pull/3872 Differential Revision: D8059241 Pulled By: miasantreble fbshipit-source-id: e8fc1838004fe16a823456188386b8b39429803b | 19 May 2018, 04:57:24 UTC |
26da367 | Siying Dong | 18 May 2018, 19:54:09 UTC | class Block to store num_restarts_ Summary: Right now, every Block::NewIterator() reads num_restarts_ from the block, which is already read in Block::Block(). This sometimes cause a CPU cache miss. Although fetching this cacheline can usually benefit follow-up block restart offset reading, as they are close to each other, it's almost free to get ride of this read by storing it in the Block class. Closes https://github.com/facebook/rocksdb/pull/3869 Differential Revision: D8052493 Pulled By: siying fbshipit-source-id: 9c72360f0c2d7329f3c198ce4eaedd2bc14b87c1 | 18 May 2018, 19:56:55 UTC |
a0c7b4d | Yanqin Jin | 18 May 2018, 14:58:59 UTC | Set the default value of max_manifest_file_size. Summary: In the past, the default value of max_manifest_file_size is uint64_t::MAX, allowing a long running RocksDB process to grow its MANIFEST file to take up the entire disk, as reported in [issue 3851](https://github.com/facebook/rocksdb/issues/3851). It is reasonable and common to provide a default non-max value for this option. Therefore, I set the value to 1GB. siying miasantreble Please let me know whether this looks good to you. Thanks! Closes https://github.com/facebook/rocksdb/pull/3867 Differential Revision: D8051524 Pulled By: riversand963 fbshipit-source-id: 50251f0804b1fa933a19a30d19d261ea8b9d2b72 | 18 May 2018, 15:11:55 UTC |
17af09f | Siying Dong | 18 May 2018, 01:24:20 UTC | Implement key shortening functions in ReverseBytewiseComparator Summary: Right now ReverseBytewiseComparator::FindShortestSeparator() doesn't really shorten key, and ReverseBytewiseComparator::FindShortestSuccessor() seems to return wrong results. The code is confusing too as it uses BytewiseComparatorImpl::FindShortestSeparator() but the function actually won't do anything if the the first key is larger than the second. Implement ReverseBytewiseComparator::FindShortestSeparator() and override ReverseBytewiseComparator::FindShortestSuccessor() to be empty. Closes https://github.com/facebook/rocksdb/pull/3836 Differential Revision: D7959762 Pulled By: siying fbshipit-source-id: 93acb621c16ce6f23e087ae4e19f7d84d1254683 | 18 May 2018, 01:27:16 UTC |
1d7ca20 | Zhongyi Xie | 18 May 2018, 00:51:46 UTC | add override to virtual functions Summary: this will fix the failing clang_check test Closes https://github.com/facebook/rocksdb/pull/3868 Differential Revision: D8050880 Pulled By: miasantreble fbshipit-source-id: 749932e2e4025f835c961c068d601e522a126da6 | 18 May 2018, 00:57:48 UTC |
aed7abb | Xin Tong | 18 May 2018, 00:49:06 UTC | Reorder field based on esan data Summary: Running. TEST_TMPDIR=/dev/shm ./buck-out/gen/rocks/tools/rocks_db_bench --benchmarks=readwhilewriting --num=5000000 -benchmark_write_rate_limit=2000000 --threads=32 Collected esan data and reorder field. Accesses to 4th and 6th fields take majority of the access. Group them. Overall, this struct takes 10%+ of the total accesses in the program. (637773011/6107964986) ==2433831== class rocksdb::InlineSkipList ==2433831== size = 48, count = 637773011, ratio = 112412, array access = 0 ==2433831== # 0: offset = 0, size = 2, count = 455137, type = i16 ==2433831== # 1: offset = 2, size = 2, count = 6, type = i16 ==2433831== # 2: offset = 4, size = 4, count = 182303, type = i32 ==2433831== # 3: offset = 8, size = 8, count = 263953900, type = %"class.rocksdb::MemTableRep::KeyComparator"* ==2433831== # 4: offset = 16, size = 8, count = 136409, type = %"class.rocksdb::Allocator"* ==2433831== # 5: offset = 24, size = 8, count = 366628820, type = %"struct.rocksdb::InlineSkipList<const rocksdb::MemTableRep::KeyComparator &>::Node"* ==2433831== # 6: offset = 32, size = 4, count = 6280031, type = %"struct.std::atomic" = type { %"struct.std::__atomic_base" } ==2433831== # 7: offset = 40, size = 8, count = 136405, type = %"struct.rocksdb::InlineSkipList<const rocksdb::MemTableRep::KeyComparator &>::Splice"* ==2433831==EfficiencySanitizer: total struct field access count = 6107964986 Before re-ordering [trentxintong@devbig460.frc2 ~/fbsource/fbcode]$ fgrep readwhilewriting without-ro.log readwhilewriting : 0.036 micros/op 27545605 ops/sec; 26.8 MB/s (45954 of 5000000 found) readwhilewriting : 0.036 micros/op 28024240 ops/sec; 27.2 MB/s (43158 of 5000000 found) readwhilewriting : 0.037 micros/op 27345145 ops/sec; 27.1 MB/s (46725 of 5000000 found) readwhilewriting : 0.037 micros/op 27072588 ops/sec; 27.3 MB/s (42605 of 5000000 found) readwhilewriting : 0.034 micros/op 29578781 ops/sec; 28.3 MB/s (44294 of 5000000 found) readwhilewriting : 0.035 micros/op 28528304 ops/sec; 27.7 MB/s (44176 of 5000000 found) readwhilewriting : 0.037 micros/op 27075497 ops/sec; 26.5 MB/s (43763 of 5000000 found) readwhilewriting : 0.036 micros/op 28024117 ops/sec; 27.1 MB/s (40622 of 5000000 found) readwhilewriting : 0.037 micros/op 27078709 ops/sec; 27.6 MB/s (47774 of 5000000 found) readwhilewriting : 0.034 micros/op 29020689 ops/sec; 28.1 MB/s (45066 of 5000000 found) AVERAGE()=27.37 MB/s After re-ordering [trentxintong@devbig460.frc2 ~/fbsource/fbcode]$ fgrep readwhilewriting ro.log readwhilewriting : 0.036 micros/op 27542409 ops/sec; 27.7 MB/s (46163 of 5000000 found) readwhilewriting : 0.036 micros/op 28021148 ops/sec; 28.2 MB/s (46155 of 5000000 found) readwhilewriting : 0.036 micros/op 28021035 ops/sec; 27.3 MB/s (44039 of 5000000 found) readwhilewriting : 0.036 micros/op 27538659 ops/sec; 27.5 MB/s (46781 of 5000000 found) readwhilewriting : 0.036 micros/op 28028604 ops/sec; 27.6 MB/s (44689 of 5000000 found) readwhilewriting : 0.036 micros/op 27541452 ops/sec; 27.3 MB/s (43156 of 5000000 found) readwhilewriting : 0.034 micros/op 29041338 ops/sec; 28.8 MB/s (44895 of 5000000 found) readwhilewriting : 0.036 micros/op 27784974 ops/sec; 26.3 MB/s (39963 of 5000000 found) readwhilewriting : 0.036 micros/op 27538892 ops/sec; 28.1 MB/s (46570 of 5000000 found) readwhilewriting : 0.038 micros/op 26622473 ops/sec; 27.0 MB/s (43236 of 5000000 found) AVERAGE()=27.58 MB/s Closes https://github.com/facebook/rocksdb/pull/3855 Reviewed By: siying Differential Revision: D8048781 Pulled By: trentxintong fbshipit-source-id: bc9807a9845e2a92cb171ce1ecb5a2c8a51f1481 | 18 May 2018, 00:57:48 UTC |
fa43948 | Fosco Marotto | 17 May 2018, 21:20:54 UTC | Update HISTORY and version for upcoming 5.14 Summary: Closes https://github.com/facebook/rocksdb/pull/3866 Differential Revision: D8043563 Pulled By: gfosco fbshipit-source-id: da4af20e604534602ac0e07943135513fd9a9f53 | 17 May 2018, 21:27:17 UTC |
7ccb35f | Siying Dong | 17 May 2018, 19:46:23 UTC | In instrumented mutex, take timing once for both of perf_context and statistics Summary: Closes https://github.com/facebook/rocksdb/pull/3427 Differential Revision: D6827236 Pulled By: siying fbshipit-source-id: d8a2cc525c90df625510565669f2659014259a8a | 17 May 2018, 19:56:53 UTC |
8bf555f | Mike Kolupaev | 17 May 2018, 09:44:14 UTC | Change and clarify the relationship between Valid(), status() and Seek*() for all iterators. Also fix some bugs Summary: Before this PR, Iterator/InternalIterator may simultaneously have non-ok status() and Valid() = true. That state means that the last operation failed, but the iterator is nevertheless positioned on some unspecified record. Likely intended uses of that are: * If some sst files are corrupted, a normal iterator can be used to read the data from files that are not corrupted. * When using read_tier = kBlockCacheTier, read the data that's in block cache, skipping over the data that is not. However, this behavior wasn't documented well (and until recently the wiki on github had misleading incorrect information). In the code there's a lot of confusion about the relationship between status() and Valid(), and about whether Seek()/SeekToLast()/etc reset the status or not. There were a number of bugs caused by this confusion, both inside rocksdb and in the code that uses rocksdb (including ours). This PR changes the convention to: * If status() is not ok, Valid() always returns false. * Any seek operation resets status. (Before the PR, it depended on iterator type and on particular error.) This does sacrifice the two use cases listed above, but siying said it's ok. Overview of the changes: * A commit that adds missing status checks in MergingIterator. This fixes a bug that actually affects us, and we need it fixed. `DBIteratorTest.NonBlockingIterationBugRepro` explains the scenario. * Changes to lots of iterator types to make all of them conform to the new convention. Some bug fixes along the way. By far the biggest changes are in DBIter, which is a big messy piece of code; I tried to make it less big and messy but mostly failed. * A stress-test for DBIter, to gain some confidence that I didn't break it. It does a few million random operations on the iterator, while occasionally modifying the underlying data (like ForwardIterator does) and occasionally returning non-ok status from internal iterator. To find the iterator types that needed changes I searched for "public .*Iterator" in the code. Here's an overview of all 27 iterator types: Iterators that didn't need changes: * status() is always ok(), or Valid() is always false: MemTableIterator, ModelIter, TestIterator, KVIter (2 classes with this name anonymous namespaces), LoggingForwardVectorIterator, VectorIterator, MockTableIterator, EmptyIterator, EmptyInternalIterator. * Thin wrappers that always pass through Valid() and status(): ArenaWrappedDBIter, TtlIterator, InternalIteratorFromIterator. Iterators with changes (see inline comments for details): * DBIter - an overhaul: - It used to silently skip corrupted keys (`FindParseableKey()`), which seems dangerous. This PR makes it just stop immediately after encountering a corrupted key, just like it would for other kinds of corruption. Let me know if there was actually some deeper meaning in this behavior and I should put it back. - It had a few code paths silently discarding subiterator's status. The stress test caught a few. - The backwards iteration code path was expecting the internal iterator's set of keys to be immutable. It's probably always true in practice at the moment, since ForwardIterator doesn't support backwards iteration, but this PR fixes it anyway. See added DBIteratorTest.ReverseToForwardBug for an example. - Some parts of backwards iteration code path even did things like `assert(iter_->Valid())` after a seek, which is never a safe assumption. - It used to not reset status on seek for some types of errors. - Some simplifications and better comments. - Some things got more complicated from the added error handling. I'm open to ideas for how to make it nicer. * MergingIterator - check status after every operation on every subiterator, and in some places assert that valid subiterators have ok status. * ForwardIterator - changed to the new convention, also slightly simplified. * ForwardLevelIterator - fixed some bugs and simplified. * LevelIterator - simplified. * TwoLevelIterator - changed to the new convention. Also fixed a bug that would make SeekForPrev() sometimes silently ignore errors from first_level_iter_. * BlockBasedTableIterator - minor changes. * BlockIter - replaced `SetStatus()` with `Invalidate()` to make sure non-ok BlockIter is always invalid. * PlainTableIterator - some seeks used to not reset status. * CuckooTableIterator - tiny code cleanup. * ManagedIterator - fixed some bugs. * BaseDeltaIterator - changed to the new convention and fixed a bug. * BlobDBIterator - seeks used to not reset status. * KeyConvertingIterator - some small change. Closes https://github.com/facebook/rocksdb/pull/3810 Differential Revision: D7888019 Pulled By: al13n321 fbshipit-source-id: 4aaf6d3421c545d16722a815b2fa2e7912bc851d | 17 May 2018, 09:56:56 UTC |
46fde6b | Maysam Yabandeh | 16 May 2018, 19:51:31 UTC | Fix race condition between log_.erase and log_.back Summary: log_ contract specifies that it should not be modified unless both mutex_ and log_write_mutex_ are held. log_.erase however does that with only holding mutex_. This causes a race condition with two_write_queues since logs_.back is read with holding only log_write_mutex_ (which is correct according to logs_ contract) but logs_.erase is called concurrently. This is probably the cause of logs_.back returning nullptr in https://github.com/facebook/rocksdb/issues/3852 although I could not reproduce it. Fixes https://github.com/facebook/rocksdb/issues/3852 Closes https://github.com/facebook/rocksdb/pull/3859 Differential Revision: D8026103 Pulled By: maysamyabandeh fbshipit-source-id: ee394e00fe4aa520d884c5ef87981e9d6b5ccb28 | 16 May 2018, 20:01:33 UTC |