
Remove creation of unused partitions, implement recommendation #1188

Closed

Conversation

swiftgist
Contributor

The bluestore partition process has been working from its original
assumptions. The refactor of the bluestore args created a disconnect
between the two steps: partitions are getting created, but never
used or removed.

This also implements the recommendation to use a single DB partition
since the WAL will naturally use an existing DB. This reduces the
number of managed partitions from two to one for each OSD. Note
that osd.report is intentionally not changed to give the admin an
indication that the configuration can be improved.

Signed-off-by: Eric Jackson [email protected]

@swiftgist swiftgist requested review from tserong and jschmid1 June 19, 2018 19:37
@swiftgist
Contributor Author

This PR is a prerequisite for the wip-migrate branch. The remove/replace smoketests can fail since the repetitive nature destroys and creates OSDs a few times. The accumulation of orphaned partitions causes the smoketest to fail.

I will ask Igor to review this change as well, since the changes are based on my discussions with him about the preferred method of configuring bluestore on separate devices.

@susebot
Collaborator

susebot commented Jun 19, 2018

PR is OK http://storage-ci.suse.de:8080/job/deepsea-pr/72/
teuthology run results for suse:smoke suite


  • PASSED - suse:smoke/nfs-ganesha/{cluster/2nodes3disks.yaml distros/sle_15.yaml exec/cephfs.yaml tasks.yaml} 👍
  • PASSED - suse:smoke/rgw/{cluster/1node3disks.yaml distros/sle_15.yaml exec/health-rgw.yaml tasks.yaml} 👍
  • PASSED - suse:smoke/fs/{cluster/2nodes3disks.yaml distros/sle_15.yaml exec/health-mds.yaml tasks.yaml} 👍
  • PASSED - suse:smoke/health-ok/{cluster/1node3disks.yaml distros/sle_15.yaml exec/health-ok.yaml tasks.yaml} 👍
  • PASSED - suse:smoke/nfs-ganesha/{cluster/2nodes3disks.yaml distros/sle_15.yaml exec/both.yaml tasks.yaml} 👍
  • PASSED - suse:smoke/rgw/{cluster/1node3disks.yaml distros/sle_15.yaml exec/health-rgw-ssl.yaml tasks.yaml} 👍
  • PASSED - suse:smoke/nfs-ganesha/{cluster/2nodes3disks.yaml distros/sle_15.yaml exec/rgw.yaml tasks.yaml} 👍

Member

@tserong tserong left a comment


Just to make sure I understand, it's now no longer possible to have separate data, wal and db. It's either data, or data with separate wal, or data with separate db?

@susebot
Collaborator

susebot commented Jun 20, 2018

PR is OK http://storage-ci.suse.de:8080/job/deepsea-pr/73/
Test results: http://storage-ci.suse.de:8080/job/deepsea-pr/73//testReport
teuthology run results for suse:smoke:health-ok suite


  • PASSED - suse:smoke:health-ok/{cluster/1node3disks.yaml distros/sle_15.yaml exec/health-ok.yaml tasks.yaml} 👍

@SUSE SUSE deleted a comment from susebot Jun 20, 2018
@susebot
Collaborator

susebot commented Jun 20, 2018

PR is OK http://storage-ci.suse.de:8080/job/deepsea-pr/75/
Test results: http://storage-ci.suse.de:8080/job/deepsea-pr/75/testReport
teuthology run results for suse:smoke suite


  • PASSED - suse:smoke/nfs-ganesha/{cluster/2nodes3disks.yaml distros/sle_15.yaml exec/cephfs.yaml tasks.yaml} 👍
  • PASSED - suse:smoke/rgw/{cluster/1node3disks.yaml distros/sle_15.yaml exec/health-rgw.yaml tasks.yaml} 👍
  • PASSED - suse:smoke/fs/{cluster/2nodes3disks.yaml distros/sle_15.yaml exec/health-mds.yaml tasks.yaml} 👍
  • PASSED - suse:smoke/health-ok/{cluster/1node3disks.yaml distros/sle_15.yaml exec/health-ok.yaml tasks.yaml} 👍
  • PASSED - suse:smoke/nfs-ganesha/{cluster/2nodes3disks.yaml distros/sle_15.yaml exec/both.yaml tasks.yaml} 👍
  • PASSED - suse:smoke/rgw/{cluster/1node3disks.yaml distros/sle_15.yaml exec/health-rgw-ssl.yaml tasks.yaml} 👍
  • PASSED - suse:smoke/nfs-ganesha/{cluster/2nodes3disks.yaml distros/sle_15.yaml exec/rgw.yaml tasks.yaml} 👍

@swiftgist
Contributor Author

Does susebot run every time somebody makes a comment?

@tserong You can still have data, separate wal, separate db (i.e. 3 devices), but if only using two devices with wal and db on the same device, we have only been making our lives harder. There's no performance difference: if the wal ran out of space, it still spilled over to the db, and we had two partitions to manage per OSD.

This gives a management model similar to filestore: is my journal on the same device or a different one? Specifying just the db is apparently enough for the wal to use the db as well, so conversations are easier. One partition per OSD on a shared device should be simpler.

We still keep the wal-by-itself option on the odd chance somebody has really small SSDs and is willing to keep the db on the main disk for a specific hardware setup.
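
A minimal sketch of the rule described above, purely for illustration (this is not the actual DeepSea code, and the osd attribute names are assumptions based on the diff quoted later in this review):

    # Illustrative sketch only, not the DeepSea module. Attribute names
    # (osd.db, osd.db_size, osd.wal, osd.wal_size) are assumed from the
    # snippets quoted in this review.
    def partitions_to_create(osd):
        """Return the partitions DeepSea would manage for one OSD."""
        partitions = []
        if osd.db and osd.db_size:
            # One DB partition on the shared device; the WAL lives inside
            # the DB, so no separate WAL partition is needed.
            partitions.append((osd.db, 'db', osd.db_size))
        elif osd.wal and osd.wal_size:
            # The wal-by-itself case is kept for very small SSDs where the
            # DB stays on the data disk.
            partitions.append((osd.wal, 'wal', osd.wal_size))
        return partitions

With a shared device this leaves exactly one managed partition per OSD, which is the two-to-one reduction mentioned in the commit message.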

@@ -898,8 +898,10 @@ def _bluestore_partitions(self):
            log.warning(("WAL size is unsupported for same device of "
                         "{}".format(self.osd.device)))
        else:
            # Create wal of wal_size on wal device
            self.create(self.osd.wal, [('wal', self.osd.wal_size)])
            log.info(("WAL will reside on same device {} as db - "

Perhaps I missed something, but I don't see the difference in the handling of the following two cases:

  1. wal == db
  2. wal != db

IMO both end up here, but there is no wal creation for case 2.

Contributor Author

Right, the wal would not be created in either case, but the reason is different for the admin. The first message is that ceph-disk does not support specifying a wal on the same device as the OSD. The second is that it's unnecessary.

log.info(("WAL will reside on same device {} as db - "
"recommend removing the WAL entry from the "
"configuration for device "
"{}").format(self.osd.db, self.osd.device))
else:
# pylint: disable=line-too-long
log.warning("No size specified for wal {}. Using default sizes.".format(self.osd.wal))

Unrelated to this PR, but just in case: is this line and the similar handling for an absent db_size relevant? It looks like the respective devices wouldn't be created at all rather than "using default sizes".

Contributor Author

If a size is not specified, we cannot create a partition. However, ceph-disk will accept a device without the partition and create the partitions itself. So the "default sizes" are whatever ceph-disk will use.

@kshtsk

kshtsk commented Jun 21, 2018

@swiftgist No, susebot does not run on each comment; I triggered the test job for this PR again on Jenkins.

jschmid1 previously approved these changes Jun 22, 2018
@jschmid1
Contributor

@jan--f @tserong please also review this PR. It's the prerequisite of #1216

@tserong
Member

tserong commented Jul 20, 2018

> You can still have data, separate wal, separate db (i.e. 3 devices), but if only using two devices with wal and db on the same device, we have only been making our lives harder.

It reads to me as if you can have three separate devices, but only if you don't specify wal_size or db_size.

If either wal_size or db_size are specified, it seems to be forcing the wal onto the same device as the db.

Is that correct?

@tserong
Member

tserong commented Jul 20, 2018

Also, see

    if self.osd.wal == self.osd.device:

Judging from the shape of the code around this, I think that's intended to be if self.osd.db == self.osd.device, not if self.osd.wal == self.osd.device.

@swiftgist swiftgist force-pushed the fix-bluestore-partitions branch from 2e465f0 to a2de6c8 Compare July 23, 2018 14:37
@swiftgist
Contributor Author

Fixed line 910. Also fixed the indent, removed old comments, and removed the wal creation when no wal is defined without sizes.

@swiftgist
Contributor Author

With respect to no sizes, the _bluestore_partitions does nothing. The _bluestore_args will still pass the data, wal and db as three separate arguments and let ceph-disk create the partitions itself.
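
As a rough illustration of what gets handed to ceph-disk in the no-sizes case (device paths here are hypothetical, and this is not the exact argument list DeepSea builds):

    # Rough illustration only: whole devices are passed and ceph-disk
    # creates the partitions itself. Device paths are hypothetical.
    no_size_args = [
        'ceph-disk', '-v', 'prepare', '--bluestore',
        '--block.wal', '/dev/sdc',   # whole device for the WAL
        '--block.db', '/dev/sdd',    # whole device for the DB
        '/dev/sdb',                  # data device
    ]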

@swiftgist swiftgist force-pushed the fix-bluestore-partitions branch from a2de6c8 to 45e16bd Compare July 23, 2018 14:59
@tserong
Member

tserong commented Jul 24, 2018

What about if someone wants to specify sizes for both db and wal, but wants to have three separate devices? That seems to be not possible. Or is that not a sensible thing for someone to want? Or, do I still not quite understand the logic? :-)

@swiftgist
Contributor Author

No, that would work too. With sizes, the _bluestore_partitions creates a WAL partition and a DB partition on their respective devices and those specific devices are then passed to ceph-disk. The big difference is whether we pass /dev/sdb or /dev/sdb3 to ceph-disk for a WAL or DB. When we create the partition (since we have a size), we pass the latter.

BTW, osd.prepare and osd.activate are non-destructive ways to experiment with the configuration and see the generated ceph-disk commands.
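
For contrast with the no-sizes case, a hedged sketch of what the with-sizes invocation looks like (paths hypothetical; the real command can be inspected non-destructively with osd.prepare as noted above):

    # Hedged sketch only: with wal_size/db_size set, DeepSea creates the
    # partitions first and passes them to ceph-disk, not the whole devices.
    with_sizes_args = [
        'ceph-disk', '-v', 'prepare', '--bluestore',
        '--block.wal', '/dev/sdc1',  # WAL partition created from wal_size
        '--block.db', '/dev/sdd1',   # DB partition created from db_size
        '/dev/sdb',                  # data device
    ]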

@jschmid1
Contributor

@tserong are we good to merge?

@tserong
Member

tserong commented Jul 25, 2018

> @tserong are we good to merge?

With the disclaimer that I haven't yet been able to test multi-device OSDs myself, yes, I'm happy this has received enough scrutiny (this one just hurt my brain trying to hold it in my head, for some reason :-/)

@jschmid1
Contributor

jschmid1 commented Jul 25, 2018

I just tried the following setup:

ceph:
  storage:
    osds:
      /dev/vdb:
        format: bluestore
        wal: /dev/vde
        db: /dev/vdf
        db_size: 1G
        wal_size: 2G
      /dev/vdc:
        format: bluestore
        wal: /dev/vde
        db: /dev/vdf
        db_size: 4G
        wal_size: 3G
      /dev/vdd:
        format: bluestore
        wal: /dev/vde
        db: /dev/vdf

which results in:

vda    253:0    0   40G  0 disk
└─vda1 253:1    0   40G  0 part /
vdb    253:16   0   20G  0 disk
├─vdb1 253:17   0  100M  0 part
└─vdb2 253:18   0 19.9G  0 part
vdc    253:32   0   20G  0 disk
├─vdc1 253:33   0  100M  0 part /var/lib/ceph/osd/ceph-14
└─vdc2 253:34   0 19.9G  0 part
vdd    253:48   0   20G  0 disk
└─vdd1 253:49   0  100M  0 part
vde    253:64   0   20G  0 disk
vdf    253:80   0   20G  0 disk
├─vdf1 253:81   0    1G  0 part
├─vdf2 253:82   0    4G  0 part
└─vdf3 253:83   0    1G  0 part

And the following logs:

data1:~ # salt-call osd.deploy
[ERROR   ] Partition type lockbox not found on /dev/vdb
[ERROR   ] Partition type lockbox not found on /dev/vdc
[ERROR   ] Partition type lockbox not found on /dev/vdd
[ERROR   ] Partition type osd not found on /dev/vdd
[WARNING ] No size specified for wal /dev/vde. Using default sizes.
[WARNING ] No size specified for db /dev/vdf. Using default sizes

EDIT: ( also from the logs )
Jul 25 09:55:14 data1 salt-minion[1466]: [WARNING ] Ignoring WAL setting - No need for two partitions, WAL will use same device as DB /dev/vdf
Jul 25 09:55:26 data1 salt-minion[1466]: [ERROR   ] Partition type lockbox not found on /dev/vdc
Jul 25 09:55:26 data1 salt-minion[1466]: [ERROR   ] Partition type osd not found on /dev/vdc
Jul 25 09:55:27 data1 salt-minion[1466]: [WARNING ] Ignoring WAL setting - No need for two partitions, WAL will use same device as DB /dev/vdf
Jul 25 09:55:42 data1 salt-minion[1466]: [ERROR   ] Partition type lockbox not found on /dev/vdd
Jul 25 09:55:42 data1 salt-minion[1466]: [ERROR   ] Partition type osd not found on /dev/vdd
Jul 25 09:55:42 data1 salt-minion[1466]: [WARNING ] No size specified for wal /dev/vde. Using default sizes.
Jul 25 09:55:42 data1 salt-minion[1466]: [WARNING ] No size specified for db /dev/vdf. Using default sizes

Whilst osd.report still complains:

Do NOT use this appliance for production deployments!
data1:~ # salt-call osd.report
local:
    No OSD configured for 
    /dev/vdb
    /dev/vdd
    Different configuration for 
    /dev/vdc

I'm not too convinced that this is the right way to go. We more or less silently forbid certain configurations that most users would anticipate to be correct.

@jschmid1 jschmid1 dismissed their stale review July 25, 2018 10:04

not ready yet

@jschmid1
Contributor

jschmid1 commented Jul 26, 2018

This comment references my last comment above:

I managed to get /dev/vdb and /dev/vdc to deploy as expected (only one partition each gets created on the db device /dev/vdf).

EDIT: /dev/vdd did not work as expected.

Potential issues:

  1. I set db_size: 1G, which is probably fine for only the DB, but as we are now colocating the WAL and the DB on the same partition, 1G might not be enough.
    (Assuming the resulting wal actually points to the DB, although I can't find a link to the wal/db in /var/lib/ceph/ceph-osd/$ID.)

  2. I set /dev/vdd to not have specific sizes, which resulted in:
    ceph-disk -v prepare --bluestore --data-dev --journal-dev --cluster ceph --cluster-uuid dd52dd7b-17e7-44f1-9739-c043d15b4481 --block.wal /dev/vde -- block.db /dev/vdf /dev/vdd
    (note the --block-wal)
    That also failed with a ValueError in ceph-disk: http://paste.opensuse.org/70180931

  3. @tserong was asking:

    What about if someone wants to specify sizes for both db and wal, but wants to have three separate
    devices? That seems to be not possible. Or is that not a sensible thing for someone to want? Or, do I
    still not quite understand the logic? :-)

    That statement seems to hold if I look at the osd.prepare output of (2).


This PR is considered to be a prerequisite for #1216 with the argument:

> This PR is a prerequisite for the wip-migrate branch. The remove/replace smoketests can fail since the repetitive nature destroys and creates OSDs a few times. The accumulation of orphaned partitions causes the smoketest to fail.

IIRC the smoketests use remove.osd, which should take care of orphaned partitions? If we have orphaned partitions, there is a bug in the removal.

If we don't have enough space on the partitions in the first place, we should think about increasing the size of the disks on our testing machines. (qcow2 is a sparse format and does not take up space anyway.)


I'm still not convinced that we should try to mangle a user-created configuration. IMO we should do exactly what the user told DeepSea to do: if he went for it, he shall have it. I'm not against a warning message telling him that this is not the most efficient configuration, but we do what he asked for.

I'd like to hear your thoughts on that @tserong @swiftgist @smithfarm

@swiftgist
Contributor Author

@jschmid1 The bug isn't in the removal. The removal cannot wipe all partitions on a secondary device and can only remove the ones referenced. The bug is in the partition creation currently because it creates an unreferenced partition.

The rest was to bring the behavior expected/desired by recent talks on the mailing lists.

I would have to look closer, but the spacing between -- and block.db looks off in the 3 device command with no sizes. That behavior should happen with master as well then since the no size conditions are the same.

@jschmid1
Contributor

jschmid1 commented Jul 27, 2018

> The bug isn't in the removal. The removal cannot wipe all partitions on a secondary device and can only remove the ones referenced.

Forgive my naivety, but isn't that exactly what we need? We need to remove the partitions that the passed OSD points to. Why do we need to remove all partitions on a secondary device?

> The bug is in the partition creation currently because it creates an unreferenced partition.

In which circumstances?

> The rest was to bring the behavior expected/desired by recent talks on the mailing lists.

Right, I see that. I'd still say we either stick to the old behavior or don't 'advise' anything at all.

> I would have to look closer, but the spacing between -- and block.db looks off in the 3 device command with no sizes.

The spaces may have sneaked in during copy/paste.

> That behavior should happen with master as well then since the no size conditions are the same.

With #1244 we will get a better picture of what currently works and what doesn't.

@jschmid1
Contributor

Any progress @swiftgist?


@swiftgist swiftgist closed this Jul 31, 2018
@jschmid1
Contributor

jschmid1 commented Aug 1, 2018

@swiftgist is that merged into some existing/open PR?
