Notice
This document is for a development version of Ceph.
Backfill Reservation
When a new OSD joins a cluster, every PG that has it in its acting set must eventually backfill. If all of these backfills happen simultaneously, they place excessive load on the OSD: the “thundering herd” effect.
The osd_max_backfills tunable limits the number of outgoing or
incoming backfills that are active on a given OSD. Note that this limit is
applied separately to incoming and to outgoing backfill operations.
Thus there can be as many as osd_max_backfills * 2 backfill operations
in flight on each OSD.  This subtlety is often missed, and Ceph
operators can be puzzled as to why more ops are observed than expected.
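The arithmetic behind that doubling can be written out as a small sketch. The function name `max_backfills_in_flight` is hypothetical and exists only for illustration; it is not part of Ceph.

```python
def max_backfills_in_flight(osd_max_backfills: int) -> int:
    """Upper bound on backfills touching one OSD at the same time.

    The limit is applied separately to incoming and to outgoing
    backfills, so the two budgets add up.
    """
    if osd_max_backfills < 0:
        raise ValueError("osd_max_backfills must not be negative")
    incoming = osd_max_backfills  # backfills going to this OSD
    outgoing = osd_max_backfills  # backfills going from this OSD
    return incoming + outgoing


for limit in (1, 2, 4):
    print(f"osd_max_backfills={limit}: up to {max_backfills_in_flight(limit)} in flight")
```

With a limit of 1, for example, an operator may still observe two backfills on one OSD: one incoming and one outgoing.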
Each OSDService now has two AsyncReserver instances: one for backfills going
from the OSD (local_reserver) and one for backfills going to the OSD
(remote_reserver).  An AsyncReserver (common/AsyncReserver.h)
manages a priority-ordered queue of waiting items and a set of current
reservation holders.  When a slot frees up, the AsyncReserver takes the next
item from the highest-priority queue and queues its associated Context* in
the finisher provided to the constructor.
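The queueing behaviour described above can be illustrated with a toy model. This is a minimal sketch, not the code in common/AsyncReserver.h: the class name `ReserverSketch` and its methods are hypothetical, and the callback is invoked directly rather than being queued in a finisher.

```python
import heapq
import itertools
from typing import Callable, List, Set, Tuple


class ReserverSketch:
    """Toy model of a priority-ordered reservation queue (not Ceph's code)."""

    def __init__(self, max_allowed: int) -> None:
        self.max_allowed = max_allowed
        # Min-heap of (-priority, seq, item, callback); negating the
        # priority makes the highest priority pop first.
        self._waiting: List[Tuple[int, int, str, Callable[[str], None]]] = []
        self._seq = itertools.count()      # FIFO order within one priority
        self.in_progress: Set[str] = set()  # current reservation holders

    def request(self, item: str, priority: int,
                on_grant: Callable[[str], None]) -> None:
        """Queue a request; it is granted at once if a slot is free."""
        heapq.heappush(self._waiting,
                       (-priority, next(self._seq), item, on_grant))
        self._grant()

    def release(self, item: str) -> None:
        """Drop a held reservation and hand the slot to the next waiter."""
        self.in_progress.discard(item)
        self._grant()

    def _grant(self) -> None:
        while self._waiting and len(self.in_progress) < self.max_allowed:
            _, _, item, on_grant = heapq.heappop(self._waiting)
            self.in_progress.add(item)
            on_grant(item)  # the real class queues this in a finisher


granted: List[str] = []
reserver = ReserverSketch(max_allowed=1)
reserver.request("pg 1.a", 100, granted.append)  # slot free, granted now
reserver.request("pg 1.b", 140, granted.append)  # waits
reserver.request("pg 1.c", 220, granted.append)  # waits, higher priority
reserver.release("pg 1.a")                       # slot goes to pg 1.c
print(granted)  # ['pg 1.a', 'pg 1.c']
```

Note that the first request is granted regardless of its priority because a slot is free when it arrives; priority only decides the order among waiters.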
For a primary to initiate a backfill it must first obtain a reservation from
its own local_reserver.  Then it must obtain a reservation from the backfill
target’s remote_reserver via a MBackfillReserve message. This process is
managed by sub-states of Active and ReplicaActive (see the sub-states
of Active in PG.h).  The reservations are dropped either on the Backfilled
event (which is sent on the primary before calling recovery_complete
and on the replica on receipt of the BackfillComplete progress message)
or upon leaving Active or ReplicaActive.
It’s important to always grab the local reservation before the remote reservation in order to prevent a circular dependency.
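The local-then-remote ordering can be sketched as follows. Everything here is hypothetical scaffolding: `Slots` is a plain counting limiter standing in for a reserver, the remote request is a direct call rather than an MBackfillReserve message, and releasing the local slot when the remote side refuses is an assumption of this sketch rather than a statement about Ceph's state machine.

```python
from typing import Set


class Slots:
    """Counting limiter standing in for one reserver (hypothetical helper)."""

    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.held: Set[str] = set()

    def try_reserve(self, pg: str) -> bool:
        if len(self.held) >= self.limit:
            return False
        self.held.add(pg)
        return True

    def release(self, pg: str) -> None:
        self.held.discard(pg)


def try_start_backfill(pg: str, local: Slots, remote: Slots) -> str:
    """Reserve locally first, and only then ask the backfill target."""
    # Step 1: the primary's own local slot.
    if not local.try_reserve(pg):
        return "waiting for local slot"
    # Step 2: the target's remote slot, requested only while holding step 1.
    if not remote.try_reserve(pg):
        local.release(pg)  # sketch assumption: back off and retry later
        return "remote refused, local slot released"
    return "backfilling"


def finish_backfill(pg: str, local: Slots, remote: Slots) -> None:
    """Drop both reservations once the backfill is done."""
    remote.release(pg)
    local.release(pg)


primary_local = Slots(limit=1)
target_remote = Slots(limit=1)
print(try_start_backfill("1.a", primary_local, target_remote))  # backfilling
print(try_start_backfill("1.b", primary_local, target_remote))  # waiting for local slot
finish_backfill("1.a", primary_local, target_remote)
print(try_start_backfill("1.b", primary_local, target_remote))  # backfilling
```

Because no PG ever holds a remote slot without first holding its local one, every primary acquires the two resources in the same order.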
We minimize the risk of data loss by prioritizing the order in
which PGs are recovered.  Admins can override the default order by using
force-recovery or force-backfill. A force-recovery with op
priority 255 will start before a force-backfill op at priority 254.
If recovery is needed because a PG is below min_size, a base priority of
220 is used. This is incremented by the number of OSDs short of the pool’s
min_size as well as a value relative to the pool’s recovery_priority.
The resultant priority is capped at 253 so that it does not confound forced
ops as described above. Under ordinary circumstances a recovery op is
prioritized at 180 plus a value relative to the pool’s recovery_priority.
The resultant priority is capped at 219.
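The recovery rules above can be expressed as a short function. This is a sketch under stated assumptions: the function and parameter names are hypothetical, and `pool_adjustment` is an opaque integer standing in for the "value relative to the pool’s recovery_priority", since the text does not say how that value is derived.

```python
FORCE_RECOVERY = 255
INACTIVE_BASE, INACTIVE_MAX = 220, 253  # PG is below min_size
RECOVERY_BASE, RECOVERY_MAX = 180, 219  # ordinary recovery


def recovery_priority(forced: bool = False,
                      osds_below_min_size: int = 0,
                      pool_adjustment: int = 0) -> int:
    """Return a recovery op priority following the rules in the text."""
    if forced:
        return FORCE_RECOVERY
    if osds_below_min_size > 0:
        # Base 220, plus the shortfall, plus the pool adjustment, capped
        # at 253 so it never reaches the forced priorities.
        raw = INACTIVE_BASE + osds_below_min_size + pool_adjustment
        return min(raw, INACTIVE_MAX)
    # Ordinary recovery: base 180 plus the pool adjustment, capped at 219.
    return min(RECOVERY_BASE + pool_adjustment, RECOVERY_MAX)


print(recovery_priority())                                          # 180
print(recovery_priority(osds_below_min_size=2, pool_adjustment=5))  # 227
print(recovery_priority(pool_adjustment=100))                       # 219
print(recovery_priority(forced=True))                               # 255
```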
If backfill is needed because the number of acting OSDs is less than
the pool’s min_size, a priority of 220 is used.  The number of OSDs
short of the pool’s min_size is added as well as a value relative to
the pool’s recovery_priority.  The total priority is limited to 253.
If backfill is needed because a PG is undersized,
a priority of 140 is used.  The number of OSDs below the size of the pool is
added as well as a value relative to the pool’s recovery_priority.  The
resultant priority is capped at 179.  If a backfill op is
needed because a PG is degraded, a priority of 140 is used.  A value
relative to the pool’s recovery_priority is added.  The resultant priority
is capped at 179.  Under ordinary circumstances a
backfill op priority of 100 is used.  A value relative to the pool’s
recovery_priority is added.  The total priority is capped at 139.
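The backfill rules can be sketched the same way. The names below are hypothetical, `pool_adjustment` again stands in for the unspecified "value relative to the pool’s recovery_priority", and the order in which the cases are checked (forced, then below min_size, then undersized, then degraded) is an assumption that follows the descending priorities.

```python
FORCE_BACKFILL = 254


def backfill_priority(forced: bool = False,
                      osds_below_min_size: int = 0,
                      osds_below_size: int = 0,
                      degraded: bool = False,
                      pool_adjustment: int = 0) -> int:
    """Return a backfill op priority following the rules in the text."""
    if forced:
        return FORCE_BACKFILL
    if osds_below_min_size > 0:
        # Acting set smaller than min_size: base 220, capped at 253.
        return min(220 + osds_below_min_size + pool_adjustment, 253)
    if osds_below_size > 0:
        # Undersized PG: base 140 plus the shortfall, capped at 179.
        return min(140 + osds_below_size + pool_adjustment, 179)
    if degraded:
        # Degraded PG: base 140, capped at 179.
        return min(140 + pool_adjustment, 179)
    # Ordinary backfill: base 100, capped at 139.
    return min(100 + pool_adjustment, 139)


print(backfill_priority())                                           # 100
print(backfill_priority(osds_below_size=1))                          # 141
print(backfill_priority(degraded=True))                              # 140
print(backfill_priority(osds_below_min_size=1, pool_adjustment=40))  # 253
```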
| Description | Base priority | Maximum priority | 
|---|---|---|
| Backfill | 100 | 139 | 
| Degraded Backfill | 140 | 179 | 
| Recovery | 180 | 219 | 
| Inactive Recovery | 220 | 253 | 
| Inactive Backfill | 220 | 253 | 
| force-backfill | 254 | |
| force-recovery | 255 | |