
Remove creation of unused partitions, implement recommendation #1188

Closed

Conversation

swiftgist
Contributor

The bluestore partition process has been working from its original
assumptions. The refactor of the bluestore args created a disconnect
between the two steps: partitions are getting created, but never
used or removed.

This also implements the recommendation to use a single DB partition
since the WAL will naturally use an existing DB. This reduces the
number of managed partitions from two to one for each OSD. Note
that osd.report is intentionally not changed to give the admin an
indication that the configuration can be improved.

Signed-off-by: Eric Jackson [email protected]

@swiftgist swiftgist requested review from tserong and jschmid1 June 19, 2018 19:37
@swiftgist
Contributor Author

This PR is a prerequisite for the wip-migrate branch. The remove/replace smoketests can fail since the repetitive nature destroys and creates OSDs a few times. The accumulation of orphaned partitions causes the smoketest to fail.

I will ask Igor to review this change as well, since the changes are based on my discussions with him about the preferred method of configuring bluestore on separate devices.

@susebot
Collaborator

susebot commented Jun 19, 2018

PR is OK http://storage-ci.suse.de:8080/job/deepsea-pr/72/
teuthology run results for suse:smoke suite


  • PASSED - suse:smoke/nfs-ganesha/{cluster/2nodes3disks.yaml distros/sle_15.yaml exec/cephfs.yaml tasks.yaml} 👍
  • PASSED - suse:smoke/rgw/{cluster/1node3disks.yaml distros/sle_15.yaml exec/health-rgw.yaml tasks.yaml} 👍
  • PASSED - suse:smoke/fs/{cluster/2nodes3disks.yaml distros/sle_15.yaml exec/health-mds.yaml tasks.yaml} 👍
  • PASSED - suse:smoke/health-ok/{cluster/1node3disks.yaml distros/sle_15.yaml exec/health-ok.yaml tasks.yaml} 👍
  • PASSED - suse:smoke/nfs-ganesha/{cluster/2nodes3disks.yaml distros/sle_15.yaml exec/both.yaml tasks.yaml} 👍
  • PASSED - suse:smoke/rgw/{cluster/1node3disks.yaml distros/sle_15.yaml exec/health-rgw-ssl.yaml tasks.yaml} 👍
  • PASSED - suse:smoke/nfs-ganesha/{cluster/2nodes3disks.yaml distros/sle_15.yaml exec/rgw.yaml tasks.yaml} 👍

Member

@tserong tserong left a comment


Just to make sure I understand, it's now no longer possible to have separate data, wal and db. It's either data, or data with separate wal, or data with separate db?

@susebot
Collaborator

susebot commented Jun 20, 2018

PR is OK http://storage-ci.suse.de:8080/job/deepsea-pr/73/
Test results: http://storage-ci.suse.de:8080/job/deepsea-pr/73//testReport
teuthology run results for suse:smoke:health-ok suite


  • PASSED - suse:smoke:health-ok/{cluster/1node3disks.yaml distros/sle_15.yaml exec/health-ok.yaml tasks.yaml} 👍

@SUSE SUSE deleted a comment from susebot Jun 20, 2018
@susebot
Collaborator

susebot commented Jun 20, 2018

PR is OK http://storage-ci.suse.de:8080/job/deepsea-pr/75/
Test results: http://storage-ci.suse.de:8080/job/deepsea-pr/75/testReport
teuthology run results for suse:smoke suite


  • PASSED - suse:smoke/nfs-ganesha/{cluster/2nodes3disks.yaml distros/sle_15.yaml exec/cephfs.yaml tasks.yaml} 👍
  • PASSED - suse:smoke/rgw/{cluster/1node3disks.yaml distros/sle_15.yaml exec/health-rgw.yaml tasks.yaml} 👍
  • PASSED - suse:smoke/fs/{cluster/2nodes3disks.yaml distros/sle_15.yaml exec/health-mds.yaml tasks.yaml} 👍
  • PASSED - suse:smoke/health-ok/{cluster/1node3disks.yaml distros/sle_15.yaml exec/health-ok.yaml tasks.yaml} 👍
  • PASSED - suse:smoke/nfs-ganesha/{cluster/2nodes3disks.yaml distros/sle_15.yaml exec/both.yaml tasks.yaml} 👍
  • PASSED - suse:smoke/rgw/{cluster/1node3disks.yaml distros/sle_15.yaml exec/health-rgw-ssl.yaml tasks.yaml} 👍
  • PASSED - suse:smoke/nfs-ganesha/{cluster/2nodes3disks.yaml distros/sle_15.yaml exec/rgw.yaml tasks.yaml} 👍

@swiftgist
Contributor Author

Does susebot run every time somebody makes a comment?

@tserong You can still have data, separate wal, separate db (i.e. 3 devices), but if only using two devices with wal and db on the same device, we have only been making our lives harder. There's no performance difference: if the wal ran out of space, it still spilled over to the db, and we had two partitions to manage per OSD.

This gives a management model similar to filestore: is my journal on the same device or a different one? Specifying just the db is apparently enough for the wal to use the db as well, so conversations are easier. One partition per OSD on a shared device should be simpler.

We still keep the wal-by-itself option on the odd chance somebody has really small SSDs and is willing to keep the db on the main disk for a specific hardware setup.
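
A minimal sketch of the rule described above, purely for illustration (this is not the actual DeepSea code, and the osd attribute names are assumptions based on the diff quoted later in this review):

    # Illustrative sketch only, not the DeepSea module. Attribute names
    # (osd.db, osd.db_size, osd.wal, osd.wal_size) are assumed from the
    # snippets quoted in this review.
    def partitions_to_create(osd):
        """Return the partitions DeepSea would manage for one OSD."""
        partitions = []
        if osd.db and osd.db_size:
            # One DB partition on the shared device; the WAL lives inside
            # the DB, so no separate WAL partition is needed.
            partitions.append((osd.db, 'db', osd.db_size))
        elif osd.wal and osd.wal_size:
            # The wal-by-itself case is kept for very small SSDs where the
            # DB stays on the data disk.
            partitions.append((osd.wal, 'wal', osd.wal_size))
        return partitions

With a shared device this leaves exactly one managed partition per OSD, which is the two-to-one reduction mentioned in the commit message.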

@@ -898,8 +898,10 @@ def _bluestore_partitions(self):
            log.warning(("WAL size is unsupported for same device of "
                         "{}".format(self.osd.device)))
        else:
            # Create wal of wal_size on wal device
            self.create(self.osd.wal, [('wal', self.osd.wal_size)])
            log.info(("WAL will reside on same device {} as db - "

Perhaps I missed something, but I don't see the difference in the handling of the following two cases:

  1. wal == db
  2. wal != db

IMO both end up here, but there is no wal creation for case 2.

Contributor Author

Right, the wal would not be created in either case, but the reason is different for the admin. The first message is that ceph-disk does not support specifying a wal on the same device as the OSD. The second is that it's unnecessary.

log.info(("WAL will reside on same device {} as db - "
"recommend removing the WAL entry from the "
"configuration for device "
"{}").format(self.osd.db, self.osd.device))
else:
# pylint: disable=line-too-long
log.warning("No size specified for wal {}. Using default sizes.".format(self.osd.wal))

Unrelated to this PR, but just in case: is this line and the similar handling for an absent db_size relevant? It looks like the respective devices wouldn't be created at all rather than "using default sizes".

Contributor Author

If a size is not specified, we cannot create a partition. However, ceph-disk will accept a device without the partition and create the partitions itself. So the "default sizes" are whatever ceph-disk will use.

@kshtsk

kshtsk commented Jun 21, 2018

@swiftgist No, susebot does not run on each comment; I triggered the test job for this PR again on Jenkins.

jschmid1 previously approved these changes Jun 22, 2018
@jschmid1
Contributor

@jan--f @tserong please also review this PR. It's the prerequisite of #1216

@tserong
Member

tserong commented Jul 20, 2018

> You can still have data, separate wal, separate db (i.e. 3 devices), but if only using two devices with wal and db on the same device, we have only been making our lives harder.

It reads to me as if you can have three separate devices, but only if you don't specify wal_size or db_size.

If either wal_size or db_size are specified, it seems to be forcing the wal onto the same device as the db.

Is that correct?

@tserong
Member

tserong commented Jul 20, 2018

Also, see

    if self.osd.wal == self.osd.device:

Judging from the shape of the code around this, I think that's intended to be if self.osd.db == self.osd.device, not if self.osd.wal == self.osd.device.

@swiftgist swiftgist force-pushed the fix-bluestore-partitions branch from 2e465f0 to a2de6c8 Compare July 23, 2018 14:37
@swiftgist
Contributor Author

Fixed line 910. Also fixed the indent, removed old comments, and removed the wal creation when no wal is defined without sizes.

@swiftgist
Contributor Author

With respect to no sizes, the _bluestore_partitions does nothing. The _bluestore_args will still pass the data, wal and db as three separate arguments and let ceph-disk create the partitions itself.
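
As a rough illustration of what gets handed to ceph-disk in the no-sizes case (device paths here are hypothetical, and this is not the exact argument list DeepSea builds):

    # Rough illustration only: whole devices are passed and ceph-disk
    # creates the partitions itself. Device paths are hypothetical.
    no_size_args = [
        'ceph-disk', '-v', 'prepare', '--bluestore',
        '--block.wal', '/dev/sdc',   # whole device for the WAL
        '--block.db', '/dev/sdd',    # whole device for the DB
        '/dev/sdb',                  # data device
    ]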

@swiftgist swiftgist force-pushed the fix-bluestore-partitions branch from a2de6c8 to 45e16bd Compare July 23, 2018 14:59
@tserong
Member

tserong commented Jul 24, 2018

What about if someone wants to specify sizes for both db and wal, but wants to have three separate devices? That seems to be not possible. Or is that not a sensible thing for someone to want? Or, do I still not quite understand the logic? :-)

@swiftgist
Contributor Author

No, that would work too. With sizes, the _bluestore_partitions creates a WAL partition and a DB partition on their respective devices and those specific devices are then passed to ceph-disk. The big difference is whether we pass /dev/sdb or /dev/sdb3 to ceph-disk for a WAL or DB. When we create the partition (since we have a size), we pass the latter.

BTW, osd.prepare and osd.activate are non-destructive ways to experiment with the configuration and see the generated ceph-disk commands.
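
For contrast with the no-sizes case, a hedged sketch of what the with-sizes invocation looks like (paths hypothetical; the real command can be inspected non-destructively with osd.prepare as noted above):

    # Hedged sketch only: with wal_size/db_size set, DeepSea creates the
    # partitions first and passes them to ceph-disk, not the whole devices.
    with_sizes_args = [
        'ceph-disk', '-v', 'prepare', '--bluestore',
        '--block.wal', '/dev/sdc1',  # WAL partition created from wal_size
        '--block.db', '/dev/sdd1',   # DB partition created from db_size
        '/dev/sdb',                  # data device
    ]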

@jschmid1
Contributor

@tserong are we good to merge?

@tserong
Member

tserong commented Jul 25, 2018

> @tserong are we good to merge?

With the disclaimer that I haven't yet been able to test multi-device OSDs myself, yes, I'm happy this has received enough scrutiny (this one just hurt my brain trying to hold it in my head, for some reason :-/)

@jschmid1
Contributor

jschmid1 commented Jul 25, 2018

I just tried the following setup:

ceph:
  storage:
    osds:
      /dev/vdb:
        format: bluestore
        wal: /dev/vde
        db: /dev/vdf
        db_size: 1G
        wal_size: 2G
      /dev/vdc:
        format: bluestore
        wal: /dev/vde
        db: /dev/vdf
        db_size: 4G
        wal_size: 3G
      /dev/vdd:
        format: bluestore
        wal: /dev/vde
        db: /dev/vdf

which results in:

vda    253:0    0   40G  0 disk
└─vda1 253:1    0   40G  0 part /
vdb    253:16   0   20G  0 disk
├─vdb1 253:17   0  100M  0 part
└─vdb2 253:18   0 19.9G  0 part
vdc    253:32   0   20G  0 disk
├─vdc1 253:33   0  100M  0 part /var/lib/ceph/osd/ceph-14
└─vdc2 253:34   0 19.9G  0 part
vdd    253:48   0   20G  0 disk
└─vdd1 253:49   0  100M  0 part
vde    253:64   0   20G  0 disk
vdf    253:80   0   20G  0 disk
├─vdf1 253:81   0    1G  0 part
├─vdf2 253:82   0    4G  0 part
└─vdf3 253:83   0    1G  0 part

And the following logs:

data1:~ # salt-call osd.deploy
[ERROR   ] Partition type lockbox not found on /dev/vdb
[ERROR   ] Partition type lockbox not found on /dev/vdc
[ERROR   ] Partition type lockbox not found on /dev/vdd
[ERROR   ] Partition type osd not found on /dev/vdd
[WARNING ] No size specified for wal /dev/vde. Using default sizes.
[WARNING ] No size specified for db /dev/vdf. Using default sizes

EDIT: ( also from the logs )
Jul 25 09:55:14 data1 salt-minion[1466]: [WARNING ] Ignoring WAL setting - No need for two partitions, WAL will use same device as DB /dev/vdf
Jul 25 09:55:26 data1 salt-minion[1466]: [ERROR   ] Partition type lockbox not found on /dev/vdc
Jul 25 09:55:26 data1 salt-minion[1466]: [ERROR   ] Partition type osd not found on /dev/vdc
Jul 25 09:55:27 data1 salt-minion[1466]: [WARNING ] Ignoring WAL setting - No need for two partitions, WAL will use same device as DB /dev/vdf
Jul 25 09:55:42 data1 salt-minion[1466]: [ERROR   ] Partition type lockbox not found on /dev/vdd
Jul 25 09:55:42 data1 salt-minion[1466]: [ERROR   ] Partition type osd not found on /dev/vdd
Jul 25 09:55:42 data1 salt-minion[1466]: [WARNING ] No size specified for wal /dev/vde. Using default sizes.
Jul 25 09:55:42 data1 salt-minion[1466]: [WARNING ] No size specified for db /dev/vdf. Using default sizes

Whilst osd.report still complains:

Do NOT use this appliance for production deployments!
data1:~ # salt-call osd.report
local:
    No OSD configured for 
    /dev/vdb
    /dev/vdd
    Different configuration for 
    /dev/vdc

I'm not too convinced that this is the right way to go. We more or less silently forbid certain configurations that most users would anticipate to be correct.

@jschmid1 jschmid1 dismissed their stale review July 25, 2018 10:04

not ready yet

@jschmid1
Contributor

jschmid1 commented Jul 26, 2018

This comment references my last comment above:

I managed to get /dev/vdb and /dev/vdc to deploy as expected (only one partition each gets created on the db device /dev/vdf).

EDIT: /dev/vdd did not work as expected.

Potential issues:

  1. I set db_size: 1G, which is probably fine for only the DB, but as we are now colocating the WAL and the DB on the same partition, 1G might not be enough.
    (Assuming the resulting wal actually points to the DB, although I can't find a link to the wal/db in /var/lib/ceph/ceph-osd/$ID.)

  2. I set /dev/vdd to not have specific sizes, which resulted in:
    ceph-disk -v prepare --bluestore --data-dev --journal-dev --cluster ceph --cluster-uuid dd52dd7b-17e7-44f1-9739-c043d15b4481 --block.wal /dev/vde -- block.db /dev/vdf /dev/vdd
    (note the --block-wal)
    That also failed with a ValueError in ceph-disk: http://paste.opensuse.org/70180931

  3. @tserong was asking:

    What about if someone wants to specify sizes for both db and wal, but wants to have three separate
    devices? That seems to be not possible. Or is that not a sensible thing for someone to want? Or, do I
    still not quite understand the logic? :-)

    That statement seems to hold if I look at the osd.prepare output of (2).


This PR is considered to be a prerequisite for #1216 with the argument:

> This PR is a prerequisite for the wip-migrate branch. The remove/replace smoketests can fail since the repetitive nature destroys and creates OSDs a few times. The accumulation of orphaned partitions causes the smoketest to fail.

IIRC the smoketests use remove.osd, which should take care of orphaned partitions? If we have orphaned partitions, there is a bug in the removal.

If we don't have enough space on the partitions in the first place, we should think about increasing the size of the disks on our testing machines. (qcow2 is a sparse format and does not take up space anyway.)


I'm still not convinced that we should try to mangle a user-created configuration. IMO we should do exactly what the user told DeepSea to do: if he went for it, he shall have it. I'm not against a warning message telling him that this is not the most efficient configuration, but we do what he asked for.

I'd like to hear your thoughts on that @tserong @swiftgist @smithfarm

@swiftgist
Contributor Author

@jschmid1 The bug isn't in the removal. The removal cannot wipe all partitions on a secondary device and can only remove the ones referenced. The bug is in the partition creation currently because it creates an unreferenced partition.

The rest was to bring the behavior expected/desired by recent talks on the mailing lists.

I would have to look closer, but the spacing between -- and block.db looks off in the 3 device command with no sizes. That behavior should happen with master as well then since the no size conditions are the same.

@jschmid1
Contributor

jschmid1 commented Jul 27, 2018

> The bug isn't in the removal. The removal cannot wipe all partitions on a secondary device and can only remove the ones referenced.

Forgive my naivety, but isn't that exactly what we need? We need to remove the partitions that the passed OSD points to. Why do we need to remove all partitions on a secondary device?

> The bug is in the partition creation currently because it creates an unreferenced partition.

In which circumstances?

> The rest was to bring the behavior expected/desired by recent talks on the mailing lists.

Right, I see that. I'd still say we either stick to the old behavior or don't 'advise' anything at all.

> I would have to look closer, but the spacing between -- and block.db looks off in the 3 device command with no sizes.

The spaces may have sneaked in during copy/paste.

> That behavior should happen with master as well then since the no size conditions are the same.

With #1244 we will get a better picture of what currently works and what doesn't.

@jschmid1
Contributor

Any progress @swiftgist?


@swiftgist swiftgist closed this Jul 31, 2018
@jschmid1
Contributor

jschmid1 commented Aug 1, 2018

@swiftgist is that merged into some existing/open PR?
