Revision - a742994 - Btrfs: don't remove extents and xattrs when logging new names

Revision a742994aa2e271eb8cd8e043d276515ec858ed73 authored by Filipe Manana on 13 February 2015, 16:56:14 UTC, committed by Chris Mason on 14 February 2015, 16:22:49 UTC

Btrfs: don't remove extents and xattrs when logging new names

If we are recording in the tree log that an inode has new names (new hard
links were added), we would drop items, belonging to the inode, that we
shouldn't:

1) When the flag BTRFS_INODE_COPY_EVERYTHING is set in the inode's runtime
   flags, we ended up dropping all the extent and xattr items that were
   previously logged. This was done only in memory, since logging a new
   name doesn't imply syncing the log;

2) When the flag BTRFS_INODE_COPY_EVERYTHING is set in the inode's runtime
   flags, we ended up dropping all the xattr items that were previously
   logged. Like the case before, this was done only in memory because
   logging a new name doesn't imply syncing the log.

This led to some surprises in scenarios such as the following:

1) write some extents to an inode;
2) fsync the inode;
3) truncate the inode or delete/modify some of its xattrs
4) add a new hard link for that inode
5) fsync some other file, to force the log tree to be durably persisted
6) power failure happens

The next time the fs is mounted, the fsync log replay code is executed,
and the resulting file doesn't have the content it had when the last fsync
against it was performed, instead if has a content matching what it had
when the last transaction commit happened.

So change the behaviour such that when a new name is logged, only the inode
item and reference items are processed.

This is easy to reproduce with the test I just made for xfstests, whose
main body is:

  _scratch_mkfs >> $seqres.full 2>&1
  _init_flakey
  _mount_flakey

  # Create our test file with some data.
  $XFS_IO_PROG -f -c "pwrite -S 0xaa -b 8K 0 8K" \
      $SCRATCH_MNT/foo | _filter_xfs_io

  # Make sure the file is durably persisted.
  sync

  # Append some data to our file, to increase its size.
  $XFS_IO_PROG -f -c "pwrite -S 0xcc -b 4K 8K 4K" \
      $SCRATCH_MNT/foo | _filter_xfs_io

  # Fsync the file, so from this point on if a crash/power failure happens, our
  # new data is guaranteed to be there next time the fs is mounted.
  $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo

  # Now shrink our file to 5000 bytes.
  $XFS_IO_PROG -c "truncate 5000" $SCRATCH_MNT/foo

  # Now do an expanding truncate to a size larger than what we had when we last
  # fsync'ed our file. This is just to verify that after power failure and
  # replaying the fsync log, our file matches what it was when we last fsync'ed
  # it - 12Kb size, first 8Kb of data had a value of 0xaa and the last 4Kb of
  # data had a value of 0xcc.
  $XFS_IO_PROG -c "truncate 32K" $SCRATCH_MNT/foo

  # Add one hard link to our file. This made btrfs drop all of our file's
  # metadata from the fsync log, including the metadata relative to the
  # extent we just wrote and fsync'ed. This change was made only to the fsync
  # log in memory, so adding the hard link alone doesn't change the persisted
  # fsync log. This happened because the previous truncates set the runtime
  # flag BTRFS_INODE_NEEDS_FULL_SYNC in the btrfs inode structure.
  ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link

  # Now make sure the in memory fsync log is durably persisted.
  # Creating and fsync'ing another file will do it.
  # After this our persisted fsync log will no longer have metadata for our file
  # foo that points to the extent we wrote and fsync'ed before.
  touch $SCRATCH_MNT/bar
  $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/bar

  # As expected, before the crash/power failure, we should be able to see a file
  # with a size of 32Kb, with its first 5000 bytes having the value 0xaa and all
  # the remaining bytes with value 0x00.
  echo "File content before:"
  od -t x1 $SCRATCH_MNT/foo

  # Simulate a crash/power loss.
  _load_flakey_table $FLAKEY_DROP_WRITES
  _unmount_flakey

  _load_flakey_table $FLAKEY_ALLOW_WRITES
  _mount_flakey

  # After mounting the fs again, the fsync log was replayed.
  # The expected result is to see a file with a size of 12Kb, with its first 8Kb
  # of data having the value 0xaa and its last 4Kb of data having a value of 0xcc.
  # The btrfs bug used to leave the file as it used te be as of the last
  # transaction commit - that is, with a size of 8Kb with all bytes having a
  # value of 0xaa.
  echo "File content after:"
  od -t x1 $SCRATCH_MNT/foo

The test case for xfstests follows soon.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>

1 parent 1a4bcf4

Files
Changes

Permalinks

mmc-async-req.txt

Rationale
=========

How significant is the cache maintenance overhead?
It depends. Fast eMMC and multiple cache levels with speculative cache
pre-fetch makes the cache overhead relatively significant. If the DMA
preparations for the next request are done in parallel with the current
transfer, the DMA preparation overhead would not affect the MMC performance.
The intention of non-blocking (asynchronous) MMC requests is to minimize the
time between when an MMC request ends and another MMC request begins.
Using mmc_wait_for_req(), the MMC controller is idle while dma_map_sg and
dma_unmap_sg are processing. Using non-blocking MMC requests makes it
possible to prepare the caches for next job in parallel with an active
MMC request.

MMC block driver
================

The mmc_blk_issue_rw_rq() in the MMC block driver is made non-blocking.
The increase in throughput is proportional to the time it takes to
prepare (major part of preparations are dma_map_sg() and dma_unmap_sg())
a request and how fast the memory is. The faster the MMC/SD is the
more significant the prepare request time becomes. Roughly the expected
performance gain is 5% for large writes and 10% on large reads on a L2 cache
platform. In power save mode, when clocks run on a lower frequency, the DMA
preparation may cost even more. As long as these slower preparations are run
in parallel with the transfer performance won't be affected.

Details on measurements from IOZone and mmc_test
================================================

https://wiki.linaro.org/WorkingGroups/Kernel/Specs/StoragePerfMMC-async-req

MMC core API extension
======================

There is one new public function mmc_start_req().
It starts a new MMC command request for a host. The function isn't
truly non-blocking. If there is an ongoing async request it waits
for completion of that request and starts the new one and returns. It
doesn't wait for the new request to complete. If there is no ongoing
request it starts the new request and returns immediately.

MMC host extensions
===================

There are two optional members in the mmc_host_ops -- pre_req() and
post_req() -- that the host driver may implement in order to move work
to before and after the actual mmc_host_ops.request() function is called.
In the DMA case pre_req() may do dma_map_sg() and prepare the DMA
descriptor, and post_req() runs the dma_unmap_sg().

Optimize for the first request
==============================

The first request in a series of requests can't be prepared in parallel
with the previous transfer, since there is no previous request.
The argument is_first_req in pre_req() indicates that there is no previous
request. The host driver may optimize for this scenario to minimize
the performance loss. A way to optimize for this is to split the current
request in two chunks, prepare the first chunk and start the request,
and finally prepare the second chunk and start the transfer.

Pseudocode to handle is_first_req scenario with minimal prepare overhead:

if (is_first_req && req->size > threshold)
   /* start MMC transfer for the complete transfer size */
   mmc_start_command(MMC_CMD_TRANSFER_FULL_SIZE);

   /*
    * Begin to prepare DMA while cmd is being processed by MMC.
    * The first chunk of the request should take the same time
    * to prepare as the "MMC process command time".
    * If prepare time exceeds MMC cmd time
    * the transfer is delayed, guesstimate max 4k as first chunk size.
    */
    prepare_1st_chunk_for_dma(req);
    /* flush pending desc to the DMAC (dmaengine.h) */
    dma_issue_pending(req->dma_desc);

    prepare_2nd_chunk_for_dma(req);
    /*
     * The second issue_pending should be called before MMC runs out
     * of the first chunk. If the MMC runs out of the first data chunk
     * before this call, the transfer is delayed.
     */
    dma_issue_pending(req->dma_desc);

Showing with 0 additions and 0 deletions (0 / 0 diffs computed)

Computing file changes ...