Ceph

Ceph is a distributed storage system providing Object, Block and Filesystem Storage.

Concepts

  • Monitors: A Ceph Monitor (ceph-mon) maintains maps of the cluster state, including the monitor map, manager map, the OSD map, the MDS map, and the CRUSH map. These maps are critical cluster state required for Ceph daemons to coordinate with each other. Monitors are also responsible for managing authentication between daemons and clients. At least three monitors are normally required for redundancy and high availability.
  • Managers: A Ceph Manager daemon (ceph-mgr) is responsible for keeping track of runtime metrics and the current state of the Ceph cluster, including storage utilization, current performance metrics, and system load. The Ceph Manager daemons also host python-based modules to manage and expose Ceph cluster information, including a web-based Ceph Dashboard and REST API. At least two managers are normally required for high availability.
  • Ceph OSDs: An Object Storage Daemon (Ceph OSD, ceph-osd) stores data, handles data replication, recovery, rebalancing, and provides some monitoring information to Ceph Monitors and Managers by checking other Ceph OSD Daemons for a heartbeat. At least three Ceph OSDs are normally required for redundancy and high availability.
  • MDSs: A Ceph Metadata Server (MDS, ceph-mds) stores metadata for the Ceph File System. Ceph Metadata Servers allow CephFS users to run basic commands (like ls, find, etc.) without placing a burden on the Ceph Storage Cluster.
  • CRUSH Algorithm: The CRUSH (Controlled Replication Under Scalable Hashing) algorithm is responsible for determining where to store objects within the Ceph cluster. It maps data to placement groups (PGs) and from PGs to Object Storage Daemons (OSDs) in a way that is both scalable and efficient. CRUSH enables Ceph to dynamically rebalance, handle recovery, and scale horizontally without needing a centralized metadata store. It also helps in minimizing the amount of data moved during rebalancing or recovery, improving cluster efficiency.
  • Placement Groups (PGs): Placement Groups are logical collections of objects within a Ceph cluster that help distribute data across OSDs. Each object in Ceph is stored in a PG, and each PG is mapped to one or more OSDs for redundancy. Because placement is tracked per PG rather than per object, Ceph can scale out while keeping placement bookkeeping small and balancing the load across OSDs. The number of PGs affects the performance and balance of the cluster, so it is a critical configuration parameter.
  • Ceph Pools: A Ceph pool is a logical partition in the Ceph cluster that holds objects. Pools are used to separate different types of data, such as data for Ceph Block Storage (RBD), Ceph File System (CephFS), and Ceph Object Storage (RGW). Each pool can have different replication or erasure coding configurations for redundancy and durability. Pools enable better organization of data within the cluster and allow for fine-tuned control over data placement, replication, and recovery.
  • Ceph Block Devices (RBD): Ceph Block Devices, also known as RADOS Block Devices (RBD), are a feature of Ceph that allows Ceph to provide block-level storage to virtual machines (VMs) or applications. RBD images are objects stored in Ceph pools, and they offer scalable, highly available, and durable block storage that can be used as a replacement for traditional SAN or local disks.
  • Ceph Object Gateway (RGW): The Ceph Object Gateway (RGW) is a service that provides object storage via the S3 and Swift APIs, allowing Ceph to function as a cloud object store. RGW exposes the Ceph storage cluster to applications and users that use object-based storage, such as backup systems or web-scale applications. It handles object storage interactions, metadata, and user management for the cloud environment.
  • Ceph File System (CephFS): CephFS is a distributed file system that provides scalable file-based access to data within a Ceph cluster. CephFS is designed for workloads that require POSIX file system semantics and includes features like file locking, hierarchical directories, and snapshot support. MDSs are responsible for managing the metadata.
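
The two-step mapping described in the CRUSH and Placement Groups entries (object to PG, then PG to OSDs) can be pictured with a toy sketch. This is only an illustration under loud assumptions: Ceph uses its own hash function and the CRUSH algorithm, not cksum and a round-robin lookup, and the function names object_to_pg and pg_to_osds are made up for this example.

```shell
#!/bin/sh
# Toy illustration only: real Ceph uses its own hash and CRUSH,
# not cksum and a round-robin lookup.

# object_to_pg NAME PG_NUM -> prints a PG index in [0, PG_NUM)
object_to_pg() {
    h=$(printf '%s' "$1" | cksum | cut -d ' ' -f 1)
    echo $(( h % $2 ))
}

# pg_to_osds PG OSD_COUNT REPLICAS -> prints REPLICAS distinct OSD ids
# (distinct only while REPLICAS <= OSD_COUNT)
pg_to_osds() {
    pg=$1; osds=$2; replicas=$3
    i=0
    out=""
    while [ "$i" -lt "$replicas" ]; do
        out="$out$(( (pg + i) % osds )) "
        i=$(( i + 1 ))
    done
    echo "${out% }"
}

# The same object name always lands in the same PG, without any lookup table.
pg=$(object_to_pg "my-object" 128)
echo "my-object -> pg $pg -> osds $(pg_to_osds "$pg" 6 3)"
```

The point of the sketch is that placement is computed from the object name and the cluster layout, which is why no centralized metadata store is needed.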

Setup

Cephadm creates a new Ceph cluster by bootstrapping a single host, expanding the cluster to encompass any additional hosts, and then deploying the needed services.

Run the cephadm bootstrap command with the IP address of the first cluster host:

cephadm bootstrap --mon-ip <mon-ip>

This command will:

  • Create a Monitor and a Manager daemon for the new cluster on the local host.
  • Generate a new SSH key for the Ceph cluster and add it to the root user’s /root/.ssh/authorized_keys file.
  • Write a copy of the public key to /etc/ceph/ceph.pub.
  • Write a minimal configuration file to /etc/ceph/ceph.conf. This file is needed to communicate with Ceph daemons.
  • Write a copy of the client.admin administrative (privileged!) secret key to /etc/ceph/ceph.client.admin.keyring.
  • Add the _admin label to the bootstrap host. By default, any host with this label will (also) get a copy of /etc/ceph/ceph.conf and /etc/ceph/ceph.client.admin.keyring.
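
As a quick sanity check after bootstrapping, the files listed above can be verified with a small helper. This is a minimal sketch: the function name check_bootstrap_files is hypothetical, and it only tests that the files exist, not that their contents are valid.

```shell
#!/bin/sh
# check_bootstrap_files [DIR]
# Reports each file that bootstrap is expected to write (default DIR: /etc/ceph)
# and returns non-zero if any of them is missing.
check_bootstrap_files() {
    dir=${1:-/etc/ceph}
    missing=0
    for f in ceph.conf ceph.pub ceph.client.admin.keyring; do
        if [ -f "$dir/$f" ]; then
            echo "ok       $dir/$f"
        else
            echo "missing  $dir/$f"
            missing=1
        fi
    done
    return "$missing"
}
```

Run check_bootstrap_files without arguments on the bootstrap host, or on any host that carries the _admin label.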

Ceph CLI

The cephadm shell command launches a bash shell in a container with all of the Ceph packages installed. By default, if configuration and keyring files are found in /etc/ceph on the host, they are passed into the container environment so that the shell is fully functional. Note that when executed on a MON host, cephadm shell will infer the config from the MON container instead of using the default configuration. If --mount <path> is given, then the host <path> (file or directory) will appear under /mnt inside the container:

cephadm shell

To execute ceph commands, you can also run commands like this:

cephadm shell -- ceph -s

You can install the ceph-common package, which contains all of the ceph commands, including ceph, rbd, mount.ceph (for mounting CephFS file systems), etc.:

cephadm add-repo --release reef
cephadm install ceph-common

Confirm that the ceph command is accessible with:

ceph -v
ceph status

Host Management

List hosts:

ceph orch host ls [--detail]

Add a new host:

# Copy ceph ssh key
ssh-copy-id -f -i /etc/ceph/ceph.pub root@<new-host>
 
# Add the new host with the _admin and osd labels
ceph orch host add <hostname> <host_ip> --labels _admin,osd

Remove a host:

# Drain all daemons from host first
ceph orch host drain <host>
 
# Check still running daemons
ceph orch ps <host>
 
# Remove the host
ceph orch host rm <host>
 
# Force remove
ceph orch host rm <host> --offline --force

OSD Management

Listing:

# List storage devices
ceph device ls
 
ceph osd tree

Add a new OSD: ceph orch daemon add osd host:device

Removing an OSD:

# Mark out of the cluster
ceph osd out <osd-id>
 
# Stop OSD
ceph orch daemon stop osd.<osd-id>
 
# Purge OSD
ceph osd purge <osd-id> --yes-i-really-mean-it
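
Purging an OSD before its data has been re-replicated elsewhere risks data loss, so it can help to wait between marking the OSD out and purging it. The sketch below wraps the three steps above and adds a wait on ceph osd safe-to-destroy, a command that is not part of the steps above and that exits non-zero while the OSD still holds needed data. The function name purge_osd_when_safe and the polling interval are assumptions for illustration.

```shell
#!/bin/sh
# purge_osd_when_safe ID [INTERVAL_SECONDS]
# Marks the OSD out, waits until Ceph reports it safe to destroy,
# then stops the daemon and purges the OSD.
purge_osd_when_safe() {
    id=$1
    interval=${2:-10}

    ceph osd out "$id"

    # safe-to-destroy fails while placement groups still depend on this OSD.
    while ! ceph osd safe-to-destroy "osd.$id"; do
        sleep "$interval"
    done

    ceph orch daemon stop "osd.$id"
    ceph osd purge "$id" --yes-i-really-mean-it
}
```

Usage: purge_osd_when_safe 7 waits in 10-second steps. On a large OSD the wait can take a long time, because it covers the whole rebalancing.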

Disk Monitoring

Ceph can also monitor the health metrics associated with your device. For example, SATA drives implement a standard called SMART that provides a wide range of internal metrics about the device’s usage and health (for example: the number of hours powered on, the number of power cycles, the number of unrecoverable read errors). Other device types such as SAS and NVMe present a similar set of metrics (via slightly different standards). All of these metrics can be collected by Ceph via the smartctl tool.

You can enable or disable health monitoring by running one of the following commands:

ceph device monitoring on
ceph device monitoring off

If monitoring is enabled, device metrics will be scraped automatically at regular intervals. To configure that interval, run a command of the following form:

ceph config set mgr mgr/devicehealth/scrape_frequency <seconds>

By default, device metrics are scraped once every 24 hours.
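
Since the interval is given in seconds, a tiny conversion helper avoids arithmetic slips. This is a sketch only: hours_to_seconds is a hypothetical helper, and the 12-hour value is an arbitrary example. The command is printed rather than executed.

```shell
#!/bin/sh
# hours_to_seconds HOURS -> prints the equivalent number of seconds
hours_to_seconds() {
    echo $(( $1 * 3600 ))
}

# Print (rather than run) the command for a 12-hour scrape interval.
echo "ceph config set mgr mgr/devicehealth/scrape_frequency $(hours_to_seconds 12)"
```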

To manually scrape all devices, run the following command:

ceph device scrape-health-metrics

To scrape a single device, run a command of the following form:

ceph device scrape-health-metrics <device-id>

To scrape a single daemon’s devices, run a command of the following form:

ceph device scrape-daemon-health-metrics <who>

To retrieve the stored health metrics for a device (optionally for a specific timestamp), run a command of the following form:

ceph device get-health-metrics <devid> [sample-timestamp]

Disk Prediction

The diskprediction module uses the device health metrics that Ceph collects to predict disk failures. Its internal predictor runs inside the cluster and reports its predictions back to Ceph, so no external server is needed for data analysis or for outputting results. The internal predictor’s accuracy is around 70%.

Run the following command to enable the diskprediction_local module in the Ceph environment:

ceph mgr module enable diskprediction_local

Run the following command to enable the local predictor:

ceph config set global device_failure_prediction_mode local

Run the following command to disable prediction:

ceph config set global device_failure_prediction_mode none

diskprediction_local requires at least six datasets of device health metrics to predict a device’s life expectancy. These health metrics are collected only if health monitoring is enabled.

Run the following command to retrieve the life expectancy of a given device:

ceph device predict-life-expectancy <device id>

User Management

To list the users in your cluster, run the following command: ceph auth ls

To retrieve a specific user, key, and capabilities, run the following command: ceph auth get {TYPE.ID}

For example: ceph auth get client.admin

Create a user:

ceph auth add client.john mon 'allow r' osd 'allow rw pool=liverpool'
ceph auth get-or-create client.paul mon 'allow r' osd 'allow rw pool=liverpool'

Set capabilities (this replaces the user's existing capabilities, so list every capability the user should keep): ceph auth caps client.john mon 'allow r' osd 'allow rw pool=liverpool'

To delete a user, run a command of the following form: ceph auth del {TYPE}.{ID}

Pools

Pools hold the actual data. They consist of PGs (Placement Groups) mapped to OSDs.

List pools:

ceph osd pool ls [detail]

Creating a pool:

# Replicated Pool
ceph osd pool create <name> <pg_num> <pgp_num> replicated <crush_rule>
 
# Erasure Coded Pool
ceph osd pool create <name> <pg_num> <pgp_num> erasure <erasure_profile> <crush_rule>
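
Choosing pg_num is the part of pool creation that usually needs thought. A commonly cited rule of thumb, which is an assumption here and not something this page states, is to aim for roughly 100 PGs per OSD divided by the replica count, rounded up to a power of two, for a cluster dominated by a single pool. The sketch below implements that rule; suggest_pg_num is a hypothetical helper, and recent Ceph releases can also adjust pg_num automatically through the PG autoscaler.

```shell
#!/bin/sh
# suggest_pg_num OSD_COUNT REPLICA_SIZE [PGS_PER_OSD]
# Rule-of-thumb estimate: OSD_COUNT * PGS_PER_OSD / REPLICA_SIZE,
# rounded up to the next power of two. PGS_PER_OSD defaults to 100.
suggest_pg_num() {
    osds=$1; size=$2; per_osd=${3:-100}
    target=$(( osds * per_osd / size ))
    pg=1
    while [ "$pg" -lt "$target" ]; do
        pg=$(( pg * 2 ))
    done
    echo "$pg"
}

# Example: 9 OSDs with 3 replicas gives a target of 300, rounded up to 512.
echo "suggested pg_num: $(suggest_pg_num 9 3)"
```

pgp_num is normally set to the same value as pg_num.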

Set/Get pool values:

ceph osd pool set {pool-name} {key} {value}
 
ceph osd pool get {pool-name} {key}

Delete a pool:

ceph osd pool delete {pool-name} [{pool-name} --yes-i-really-really-mean-it]

CRUSH Maps

The CRUSH Map is responsible for determining where data gets placed.

The CRUSH map defines a series of types that are used as failure domains (data redundancy is distributed across the chosen domain, e.g. across OSDs, hosts, etc.). Default types include:

  • osd (or device)
  • host
  • chassis
  • rack
  • row
  • pdu
  • pod
  • room
  • datacenter
  • zone
  • region
  • root

To see the hierarchy: ceph osd tree

To see the rules that are defined for the cluster, run the following command: ceph osd crush rule ls

To view the contents of the rules, run the following command: ceph osd crush rule dump

Each device can optionally have a class assigned. By default, OSDs automatically set their class at startup to hdd, ssd, or nvme in accordance with the type of device they are backed by.

To explicitly set the device class of one or more OSDs, run a command of the following form: ceph osd crush set-device-class <class> <osd-name> [...]

Once a device class has been set, it cannot be changed to another class until the old class is unset. To remove the old class of one or more OSDs, run a command of the following form: ceph osd crush rm-device-class <osd-name> [...]

To apply a CRUSH rule to a specific pool, run a command of the following form: ceph osd pool set <pool-name> crush_rule <rule-name>

Raw CRUSH Map

You can edit the raw crushmap:

# Dump and edit CRUSH map
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt
# Edit crushmap.txt
 
# Compile and upload new CRUSH map
crushtool -c crushmap.txt -o newcrushmap.bin
ceph osd setcrushmap -i newcrushmap.bin

Replicated

Replicated pools ensure redundancy by storing n copies of the data.

To create a rule for a replicated pool, run a command of the following form:

ceph osd crush rule create-replicated {name} {root} {failure-domain-type} [{class}]

To set the number of replicas:

ceph osd pool set <pool-name> size <replica-count>

Erasure Coding

Erasure coding can decrease the storage overhead for redundancy.

Create an erasure profile:

ceph osd erasure-code-profile set myprofile \
    k=3 \
    m=2 \
    crush-failure-domain=osd

Each object is divided into three data chunks (k=3), and two additional coding chunks are created (m=2). The value of m defines how many OSDs can be lost simultaneously without losing any data. Keep in mind that each chunk goes to a distinct failure domain, so the example above needs at least 5 (k+m) OSDs.

Use Case                          k (data shards)   m (coding shards)   Total (k+m)   Overhead
General purpose (balanced)        4                 2                   6             1G → 1.50G stored
High durability, less overhead    8                 3                   11            1G → 1.38G stored
Fast recovery & repair            6                 3                   9             1G → 1.50G stored
Low overhead, medium durability   6                 2                   8             1G → 1.33G stored
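
The overhead column follows directly from (k+m)/k: every k data chunks are stored together with m coding chunks. A minimal sketch that recomputes the factor for the profiles in the table above (ec_overhead is a hypothetical helper name; awk is used only for the floating-point division):

```shell
#!/bin/sh
# ec_overhead K M -> prints raw data stored per unit of user data, 2 decimals
ec_overhead() {
    awk -v k="$1" -v m="$2" 'BEGIN { printf "%.2f\n", (k + m) / k }'
}

# Recompute the overhead for each k/m pair listed in the table.
while read -r k m; do
    echo "k=$k m=$m total=$(( k + m )) overhead=$(ec_overhead "$k" "$m")"
done <<EOF
4 2
8 3
6 3
6 2
EOF
```

For comparison, a replicated pool with three replicas stores three units of raw data per unit of user data.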

CephFS

Creation

A Ceph file system consists of a metadata pool and a data pool. For best performance, the metadata pool should reside on fast, low-latency OSDs.

# Pool Creation
ceph osd pool create cephfs_data
ceph osd pool create cephfs_metadata
 
# Create FS
ceph fs new <fs_name> <metadata> <data> [--force] [--allow-dangerous-metadata-overlay] [<fscid:int>] [--recover]

You can also create a CephFS with an erasure-coded data pool, but you need to enable EC overwrites on that pool first:

ceph osd pool set my_ec_pool allow_ec_overwrites true

Mount

mount -t ceph <ceph_user>@<ceph_fsid>.<ceph_fs>=<ceph_subpath> /mountpoint -o mon_addr=<ceph_monitors_list>,name=<ceph_user>,secret=<ceph_secret>,noatime

Example: mount -t ceph admin@7136f2d8-75d2-442f-a9bb-0eac81d774cf.mycephfs=/subpath /mnt -o mon_addr=10.2.3.4,name=admin,secret=secretvalue

The secret can be found in the keyring at /etc/ceph or in a separate user keyring.
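
To pull the secret out of a keyring file for the mount options, something like the following may be enough. It is a sketch that assumes the usual keyring layout, with a line of the form "key = <base64 value>" under the user's section, and it returns the first key in the file. keyring_secret is a hypothetical helper name.

```shell
#!/bin/sh
# keyring_secret FILE -> prints the value of the first "key = ..." line
keyring_secret() {
    awk '$1 == "key" && $2 == "=" { print $3; exit }' "$1"
}

# Usage (prints the admin secret if the keyring is readable):
#   keyring_secret /etc/ceph/ceph.client.admin.keyring
```

On a host with a working ceph CLI, ceph auth get-key client.admin prints the same value without parsing the file.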

Snapshots

Ceph can take directory-scoped snapshots of the filesystem.

Snapshots of the directory will be stored in .snap.

Create a new snapshot: mkdir .snap/snap_name

Remove a snapshot: rmdir .snap/snap_name
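
Because snapshots are plain mkdir and rmdir calls inside .snap, a simple rotation can be scripted. The sketch below is one possible approach, not an official tool: the function names, the "auto" prefix, and the timestamp naming scheme are all assumptions. It only touches snapshots whose names start with the given prefix.

```shell
#!/bin/sh
# snap_create DIR [PREFIX] -> creates DIR/.snap/PREFIX-YYYYmmdd-HHMMSS and prints its name
snap_create() {
    name="${2:-auto}-$(date +%Y%m%d-%H%M%S)"
    mkdir "$1/.snap/$name" && echo "$name"
}

# snap_prune DIR KEEP [PREFIX] -> removes all but the KEEP newest PREFIX-* snapshots
# (names sort chronologically because of the timestamp suffix)
snap_prune() {
    dir=$1; keep=$2; prefix=${3:-auto}
    total=$(ls -1 "$dir/.snap" | grep -c "^$prefix-" || true)
    excess=$(( total - keep ))
    [ "$excess" -gt 0 ] || return 0
    ls -1 "$dir/.snap" | grep "^$prefix-" | sort | head -n "$excess" |
    while read -r snap; do
        rmdir "$dir/.snap/$snap"
    done
}

# Usage on a mounted CephFS directory (hypothetical path):
#   snap_create /mnt/mycephfs/projects
#   snap_prune  /mnt/mycephfs/projects 7
```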

Block Device (RBD)

Use the rbd tool to initialize the pool for use by RBD:

rbd pool init <pool-name>

Create RBD image

rbd create <pool>/<image_name> --size <size_in_MB_or_GB> [--image-feature <features>]

Example:

rbd create mypool/myimage --size 10240  # 10 GB image

List images in a pool

rbd ls <pool>

Show info about an image

rbd info <pool>/<image_name>

Map RBD image to a block device

rbd map <pool>/<image_name>
  • This will create a device like /dev/rbd0
  • To unmap:
rbd unmap /dev/rbd0

Resize an RBD image

  • Increase image size:
rbd resize --size <new_size_in_MB_or_GB> <pool>/<image_name>
  • After resize, if mapped, you may need to grow the filesystem on the block device.

Snapshots

  • Create snapshot:
rbd snap create <pool>/<image_name>@<snap_name>
  • List snapshots:
rbd snap ls <pool>/<image_name>
  • Rollback to snapshot:
rbd snap rollback <pool>/<image_name>@<snap_name>
  • Remove snapshot:
rbd snap rm <pool>/<image_name>@<snap_name>

Cloning

  • Create a clone from snapshot:
rbd clone <pool>/<image_name>@<snap_name> <pool>/<clone_name>

Export and Import images (backup/migration)

  • Export:
rbd export <pool>/<image_name> <file>
  • Import:
rbd import <file> <pool>/<image_name>

Object Gateway (S3)

Ceph can expose an S3 compatible object gateway for object storage.

# Add RGW daemon on a host
ceph orch daemon add rgw.<host> --placement="host=<host>"
 
# Configure RGW frontend port(s)
# Default frontend is "beast port=7480"
ceph config set client.rgw.<host> rgw_frontends 'beast port=7480'
 
# Create an RGW user (radosgw-admin is available inside the cephadm shell container)
cephadm shell -- radosgw-admin user create --uid="myuser" --display-name="My User"
 
# Use S3 client (e.g., AWS CLI) to interact with RGW
aws --endpoint-url http://<host>:7480 s3 ls

Prometheus

The Manager prometheus module implements a Prometheus exporter to expose Ceph performance counters from the collection point in the Manager.

Enable the prometheus module by running the following command:

ceph mgr module enable prometheus

By default the module will accept HTTP requests on port 9283 on all IPv4 and IPv6 addresses on the host. The port and listen address are configurable with ceph config set, with keys mgr/prometheus/server_addr and mgr/prometheus/server_port. This port is registered with Prometheus’s registry.

ceph config set mgr mgr/prometheus/server_addr 0.0.0.0
ceph config set mgr mgr/prometheus/server_port 9283

You can then add the endpoint to your Prometheus configuration.
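
A scrape job for the endpoint could look like the fragment generated below. This is a minimal sketch: prom_scrape_config is a hypothetical helper, the job name "ceph" is arbitrary, and honor_labels is included on the assumption that you want to keep the labels Ceph attaches to its metrics. Merge the output into the scrape_configs section of your existing prometheus.yml instead of replacing the file.

```shell
#!/bin/sh
# prom_scrape_config HOST [PORT] -> prints a scrape_configs fragment on stdout
# PORT defaults to 9283, the module's default port.
prom_scrape_config() {
    cat <<EOF
scrape_configs:
  - job_name: 'ceph'
    honor_labels: true
    static_configs:
      - targets: ['$1:${2:-9283}']
EOF
}

# Usage (hypothetical manager host name):
#   prom_scrape_config ceph-mgr-1
```

To check the endpoint by hand, request http://<host>:9283/metrics from the manager host.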