Revision 7103559f49b46b3287973045f741c0679e3e9e44 authored by Sagar Vemuri on 21 June 2018, 18:02:49 UTC, committed by Facebook Github Bot on 21 June 2018, 18:13:08 UTC
Summary:
This PR extends the improvements in #3282 to also work when using Direct IO.
We see a **4.5X performance improvement** in the seekrandom benchmark doing long range scans, when using direct reads, on flash.

**Description:**
This change improves the performance of iterators doing long range scans (e.g. big/full index or table scans in MyRocks) by using readahead: additional data is prefetched on each disk IO and stored in a local buffer. Prefetching is enabled automatically once more than 2 IOs are noticed for the same table file during iteration. The readahead size starts at 8 KB and is increased exponentially on each additional sequential IO, up to a maximum of 256 KB. This cuts down the number of IOs needed to complete the range scan.

**Implementation Details:**
- Used `FilePrefetchBuffer` as the underlying buffer to store the readahead data. `FilePrefetchBuffer` can now take file_reader, readahead_size and max_readahead_size as input to the constructor, and automatically do readahead.
- `FilePrefetchBuffer::TryReadFromCache` can now call `FilePrefetchBuffer::Prefetch` if readahead is enabled.
- `AlignedBuffer` (which is the underlying store for `FilePrefetchBuffer`) now takes a few additional args in `AlignedBuffer::AllocateNewBuffer` to allow copying data from the old buffer.
- Made sure that partial chunks of data already available in the buffer are not re-read from the device.
- Fixed a couple of cases where `AlignedBuffer::cursize_` was not being properly kept up-to-date.

**Constraints:**
- Similar to #3282, this is currently enabled only when `ReadOptions.readahead_size` = 0 (which is the default value).
- Since the prefetched data is stored in a temporary buffer allocated on heap, this could increase the memory usage if you have many iterators doing long range scans simultaneously.
- Enabled only for user reads, and disabled for compactions. Compaction reads are controlled by the options `use_direct_io_for_flush_and_compaction` and `compaction_readahead_size`, and the current feature takes precautions not to mess with them.

**Benchmarks:**
I used the same benchmark as used in #3282.
Data fill:
```
TEST_TMPDIR=/data/users/$USER/benchmarks/iter ./db_bench -benchmarks=fillrandom -num=1000000000 -compression_type="none" -level_compaction_dynamic_level_bytes
```

Do a long range scan: seekrandom with a large number of nexts
```
TEST_TMPDIR=/data/users/$USER/benchmarks/iter ./db_bench -benchmarks=seekrandom -use_direct_reads -duration=60 -num=1000000000 -use_existing_db -seek_nexts=10000 -statistics -histogram
```

```
Before:
seekrandom   :   37939.906 micros/op 26 ops/sec;   29.2 MB/s (1636 of 1999 found)
With this change:
seekrandom   :   8527.720 micros/op 117 ops/sec;  129.7 MB/s (6530 of 7999 found)
```
~4.5X perf improvement, averaged over 3 runs.
Closes https://github.com/facebook/rocksdb/pull/3884

Differential Revision: D8082143

Pulled By: sagar0

fbshipit-source-id: 4d7a8561cbac03478663713df4d31ad2620253bb
1 parent 524c6e6
File Mode Size
advisor
dump
rdb
CMakeLists.txt -rw-r--r-- 544 bytes
Dockerfile -rw-r--r-- 81 bytes
auto_sanity_test.sh -rwxr-xr-x 2.7 KB
benchmark.sh -rwxr-xr-x 16.4 KB
benchmark_leveldb.sh -rwxr-xr-x 5.1 KB
blob_dump.cc -rw-r--r-- 3.3 KB
check_format_compatible.sh -rwxr-xr-x 4.5 KB
db_bench.cc -rw-r--r-- 813 bytes
db_bench_tool.cc -rw-r--r-- 197.2 KB
db_bench_tool_test.cc -rw-r--r-- 9.5 KB
db_crashtest.py -rw-r--r-- 13.2 KB
db_repl_stress.cc -rw-r--r-- 4.5 KB
db_sanity_test.cc -rw-r--r-- 8.4 KB
db_stress.cc -rw-r--r-- 114.7 KB
dbench_monitor -rwxr-xr-x 2.6 KB
generate_random_db.sh -rwxr-xr-x 734 bytes
ldb.cc -rw-r--r-- 572 bytes
ldb_cmd.cc -rw-r--r-- 96.3 KB
ldb_cmd_impl.h -rw-r--r-- 14.1 KB
ldb_cmd_test.cc -rw-r--r-- 1.8 KB
ldb_test.py -rw-r--r-- 24.4 KB
ldb_tool.cc -rw-r--r-- 4.6 KB
pflag -rwxr-xr-x 4.0 KB
reduce_levels_test.cc -rw-r--r-- 5.2 KB
regression_test.sh -rwxr-xr-x 15.8 KB
report_lite_binary_size.sh -rwxr-xr-x 1.2 KB
rocksdb_dump_test.sh -rwxr-xr-x 364 bytes
run_flash_bench.sh -rwxr-xr-x 13.5 KB
run_leveldb.sh -rwxr-xr-x 6.2 KB
sample-dump.dmp -rw-r--r-- 100 bytes
sst_dump.cc -rw-r--r-- 581 bytes
sst_dump_test.cc -rw-r--r-- 6.4 KB
sst_dump_tool.cc -rw-r--r-- 23.8 KB
sst_dump_tool_imp.h -rw-r--r-- 2.8 KB
verify_random_db.sh -rwxr-xr-x 1.0 KB
write_stress.cc -rw-r--r-- 10.9 KB
write_stress_runner.py -rw-r--r-- 2.3 KB
