Reuse OSD ID in migrations, support replace.osds #1216
Conversation
You are not allowed to trigger tests. Please ask to get whitelisted.
Looking at
Now, I know the smoketests destroy all the OSDs on one node, then run migrations involving only two OSDs on that node, but while this is all running, the OSDs appear in the "localhost" bucket, not in the "data1" bucket. Is that indicative of a problem at all?
...actually, my environment somehow got broken partway through this:
Rebooting seems to have fixed it.
""" | ||
# Parameters for osd.remove module | ||
supported = ['force', 'timeout', 'delay'] | ||
passed = ["{}={}".format(k, v) for k, v in kwargs.items() if k in supported] |
Would a notification that someone passed unsupported parameters be helpful here? That'd also help people who made a typo.
The underscore variables get passed from Salt when called via a state file, iirc. Without extra processing to separate out Salt's internal variables, the warning would be a wall of text. I thought the debug message would be enough.
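For what it's worth, a rough sketch of what such a warning could look like, assuming Salt's internal kwargs all start with a double underscore (e.g. __pub_fun, __pub_jid); the helper name _filter_kwargs is made up for illustration:

import logging

log = logging.getLogger(__name__)

def _filter_kwargs(**kwargs):
    """Keep only supported parameters and warn about anything else that is
    not a Salt-internal double-underscore variable."""
    supported = ['force', 'timeout', 'delay']
    passed = ["{}={}".format(k, v) for k, v in kwargs.items() if k in supported]
    unknown = [k for k in kwargs
               if k not in supported and not k.startswith('__')]
    if unknown:
        log.warning("Ignoring unsupported parameter(s): {}".format(", ".join(unknown)))
    return passed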
srv/modules/runners/replace.py
Outdated
""" | ||
Usage | ||
""" | ||
usage = ('salt-run replace.osd id [id ...][force=True]:\n\n' |
That should reflect the supported args of the respective module: ['force', 'timeout', 'delay'].
Hmm, the timeout and delay are passed through to the osd.remove module, so the defaults aren't set here. Also, we only use those two in the smoketests currently. I will have to think about what to put here.
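Just as an illustration (the exact wording is an assumption, since the timeout and delay defaults live in osd.remove, not here), the usage text could mention all three supported keyword arguments:

usage = ('salt-run replace.osd <id> [<id> ...] [force=True] [timeout=<seconds>] [delay=<seconds>]:\n\n'
         '    Removes the listed OSDs while keeping their IDs reserved for the\n'
         '    replacement disks. timeout and delay are passed through to osd.remove.\n')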
if len(osds) > 1:
    # Pause for a moment, let the admin see what they passed
    print("Removing osds {} from minions\nPress Ctrl-C to abort".format(", ".join(osds)))
Maybe tell the admin how long they'll be able to hit Ctrl-C.
Let me take a look.
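A rough sketch of what that could look like; the 5-second window and the helper name are assumptions, not the actual implementation:

import time

PAUSE_SECONDS = 5  # assumed value - long enough to read the list and abort

def _pause_before_removal(osds):
    """Give the admin a visible window to abort before any OSD is touched."""
    print("Removing osds {} from minions\n"
          "Press Ctrl-C within {} seconds to abort".format(", ".join(osds), PAUSE_SECONDS))
    time.sleep(PAUSE_SECONDS)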
completed = osds
for osd_id in osds:
    host = _find_host(osd_id, host_osds)
    if host:
We should be verbose in case no host was found. Currently it will just not do anything and bail out silently.
The remove.osd command also calls this module. If remove.osd has already been called and the OSD is no longer present on the minions, then the OSD will be removed from Ceph (a message is printed for that). Currently, the behavior prints messages for whatever actions it takes.
I could add a verbose option, but I would prefer the default to not be verbose. If an admin is using a wildcard for dozens of OSDs and rerunning the command multiple times, then the output gets shorter as the Ceph cluster reaches the desired state.
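If we ever want a middle ground, something like the sketch below could print one summary line for OSDs that no longer map to any minion without making the normal output noisier; _report_missing is a hypothetical helper built on the existing _find_host:

def _report_missing(osds, host_osds):
    """Collect OSD IDs that no longer map to any minion and say so once,
    instead of silently skipping them."""
    missing = [osd_id for osd_id in osds if not _find_host(osd_id, host_osds)]
    if missing:
        print("No minion hosts OSD(s) {}; removing them from Ceph only".format(
            ", ".join(missing)))
    return missing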
srv/modules/runners/replace.py
Outdated
return False


def _remove_minion(local, master_minion, osd_id, passed, host):
The name of the function should be _remove_osd_from_minion or something along those lines.
Maybe just '_remove_osd'.
return "" | ||
|
||
|
||
def help_(): |
Seems like there is something missing here?
This is a policy question: the smoketest runner lives in the qa package. Which way do we want to go with help messages? Do they follow all runners or only those in the main package?
Has anyone tried this with a custom crushmap, to verify the OSDs remain where they should during migration?
I did a manual test with a custom crushmap and osd crush update on start = false in ceph.conf:
# ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-12 1.00000 root alt-root
-11 1.00000 host data1-fake
2 hdd 1.00000 osd.2 up 1.00000 1.00000
-1 0.42195 root default
-3 0.07286 host data1
6 hdd 0.01457 osd.6 up 1.00000 1.00000
10 hdd 0.01457 osd.10 up 1.00000 1.00000
14 hdd 0.01457 osd.14 up 1.00000 1.00000
18 hdd 0.01457 osd.18 up 1.00000 1.00000
22 hdd 0.01457 osd.22 up 1.00000 1.00000
[...]
This was originally deployed with filestore. I flipped to a bluestore profile for the data1 host, ran stage 2, then ceph.migrate.osds; the migration succeeded and everything remained where it was meant to be in the crushmap.
return "" | ||
|
||
|
||
def minion_profile(minion): |
I'd suggest calling this _rename_minion_profile.
This is the public module name, so the runner call is "salt-run replace.minion_profile".
Ah, my bad.
srv/modules/runners/select.py
Outdated
@@ -51,7 +51,7 @@ def _grain_host(client, minion):
     """
     Return the host grain for a given minion, for use a short hostname
     """
-    return list(client.cmd(minion, 'grains.item', ['host']).values())[0]['host']
+    return list(client.cmd(minion, 'grains.item', ['nodename']).values())[0]['nodename']
Why nodename? We use grains['host'] fairly extensively elsewhere.
This will take me a bit to remember. I believe I noticed a difference in default behavior between Salt versions.
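One quick way to see whether the two grains actually differ on a given setup (a sketch following the same LocalClient pattern as select.py; the function name is made up):

import salt.client

def _compare_host_grains(target='*'):
    """Print every minion whose 'host' and 'nodename' grains disagree."""
    local = salt.client.LocalClient()
    for minion, grains in local.cmd(target, 'grains.item', ['host', 'nodename']).items():
        if grains.get('host') != grains.get('nodename'):
            print("{}: host={} nodename={}".format(
                minion, grains.get('host'), grains.get('nodename')))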
@@ -797,6 +797,12 @@ def clean(self):

        Note: expected to only run inside of "not is_prepared"
        """
        if (self.osd.disk_format != 'filestore' and
                self.osd.disk_format != 'bluestore'):
In which circumstances did you see that happening?
I did this primarily for the smoketests. Overriding Salt dictionaries had some issues, so I set a disk_format of 'none'.
On the chance that an admin does mistype a format, it's better not to wipe all the partitions on that drive until it's corrected.
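The same guard can also be expressed as a membership test, which reads a bit more clearly; this is only a sketch, and the helper name and log message are not from the PR:

import logging

log = logging.getLogger(__name__)

def _safe_to_clean(disk_format):
    """Refuse to wipe partitions unless the format is one we know how to
    redeploy; a typo (or the smoketests' 'none') then costs nothing."""
    if disk_format not in ('filestore', 'bluestore'):
        log.error("Unknown disk format '{}' - not cleaning the device".format(disk_format))
        return False
    return True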
srv/salt/_modules/osd.py
Outdated
@@ -1244,6 +1250,9 @@ def prepare(self):
        if self.osd.device:
            cmd = "PYTHONWARNINGS=ignore ceph-disk -v prepare "

            # OSD ID
Maybe a bit more descriptive comment. The other option would be to extend the docstring and explain what the osd_id is used for.
Updating
srv/salt/_modules/osd.py
Outdated
self._weight = weight
self._grains = grains
self.force = force
self.keyring = None
self.client = None
if 'keyring' in kwargs:
I know we have that in a lot of places. But this notation is easier to read and allows assigning default values:
self.keyring = kwargs.get('keyring', 'a default value')
Changing
srv/salt/_modules/osd.py
Outdated
if msg:
    log.error(msg)
    return msg
log.debug("OSD {} marked and recorded".format(self.osd_id))
That could be a log.info/warning (something that a user sees on their screen).
We are in a module, so the message will not come back to the admin on the Salt master. I'll set it to info on the off chance that somebody runs this as a salt-call.
log.error(msg)
return msg

# Remove grain
Lately we ran into a situation where the remove.osd process was supposedly successful but did not actually run through completely, which resulted in stale grains. We should somehow make people more aware that this command needs to be executed until it succeeds, or bad things will happen :/
The new behavior of remove.osd and replace.osd will keep retrying on timeouts. The other error messages should be much better now and actually reach the admin, so the admin will be aware that the OSD is not removed.
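Roughly, the retry behavior could look like the sketch below; the helper name, the attempt count and the delay are placeholders, not the actual implementation:

import time

def _retry_until_removed(remove_step, osd_id, attempts=10, delay=6):
    """Keep retrying a removal step that timed out, so a slow rebalance does
    not leave stale grains behind. remove_step is any callable that returns
    True on success or an error message otherwise."""
    for attempt in range(1, attempts + 1):
        result = remove_step(osd_id)
        if result is True:
            return True
        print("OSD {} not removed yet (attempt {}/{}): {}".format(
            osd_id, attempt, attempts, result))
        time.sleep(delay)
    return False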
srv/salt/_modules/osd.py
Outdated
log.debug("content: {} {}".format(type(content), content)) | ||
|
||
if device in content: | ||
# Exit early, no by-pass equivalent from previous run |
by-pass → by-path?
changed
srv/salt/_modules/osd.py
Outdated
_rc, _stdout, _stderr = __salt__['helper.run'](cmd)
if _stdout:
    _devices = _stdout.split()
    if _devices[0]:
Even the conditional check on a potentially empty array will raise an IndexError. I'd rather:
if devices:
    return devices[0]
This could still produce an IndexError. if devices and devices[0] should do the trick. What are the repercussions though if we don't throw something here? Might be a good idea to fail loudly here?
This could still produce an IndexError.

The if devices check only returns True if there is at least one item in the list, and that's what we are asking for here: the first element. Since Python does not allow setting custom indices (or at least we don't do that here), we are fine with that check, aren't we?
In [1]: foo = ['bar']
In [2]: if foo:
...: print(foo[0])
...:
bar
In [3]: foo = []
In [4]: if foo:
...: print(foo[0])
...:
In [5]:
Yes, you're right... we can rely on having a list here.
changed
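For reference, the agreed-upon shape of the check applied to the snippet above (a sketch; __salt__ is only available inside a Salt execution module, and the function name is made up):

def _first_device(cmd):
    """Return the first device the command reports, or '' if there is none."""
    _rc, _stdout, _stderr = __salt__['helper.run'](cmd)
    if _stdout:
        _devices = _stdout.split()
        if _devices:  # an empty list is falsy, so no IndexError here
            return _devices[0]
    return ""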
A comment on the smoketests: I just tried blue->file migration and it removed 3/5 of my disks on the first osd node and recreated 2/5. I don't think that this is intentional.
It's intentional. The smoke tests remove all the OSDs on one storage node, then run a whole bunch of different migrations by first creating two osds in one format or another, then migrating them to another.
In regards to @jschmid1's comment about a failed OSD removal: would it make sense to have a function that makes sure Ceph and DeepSea agree on which OSDs are currently deployed? I'm not sure of all the failure modes an OSD removal can produce, but the theory is that if Ceph doesn't know it, we shouldn't either. Not sure, though, whether a running OSD process can re-add itself to Ceph or something fun like that.
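A very rough sketch of what such a consistency check could look like; the grain name ('ceph') and its structure are assumptions here, as is everything apart from 'ceph osd ls':

import json
import subprocess

import salt.client

def _osd_mismatch():
    """Compare the OSD IDs Ceph reports with the OSD-related grains DeepSea
    keeps on the minions, and return whatever only one side knows about."""
    ceph_ids = set(json.loads(
        subprocess.check_output(['ceph', 'osd', 'ls', '--format=json']).decode()))
    local = salt.client.LocalClient()
    minions = local.cmd('*', 'grains.get', ['ceph'])
    # Assumption: the 'ceph' grain on a storage minion is keyed by OSD ID.
    grain_ids = {int(osd) for osds in minions.values() for osd in (osds or {})}
    return {'only_in_ceph': sorted(ceph_ids - grain_ids),
            'only_in_grains': sorted(grain_ids - ceph_ids)}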
ok, must have missed that.
Smoke tests went fine for me initially. Had several repeated runs and everything was green. I just did another smoketest run with one pool and ~23GB of data in it. This one did not finish and my cluster is still in warn with 3 PGs that can't seem to repair themselves (2 peering, 1 remapped and peering). There are two factors that might play a role here:
In any case, we should absolutely test this and make sure the migrations work and behave correctly with a somewhat filled-up cluster.
rebased to capture additional tests.
@susebot run teuthology
@jschmid1 I only just now re-enabled the migrate and replace functional tests, so you might need to re-run teuthology.
Commit 2c32311 is NOT OK.
@jschmid1 In this PR, Stage 3 is failing with:
All the tests fail in the same way. To me, it looks like something in this PR is causing it.
These functions are essentially equivalent and bundled together.
Signed-off-by: Eric Jackson <[email protected]>
I did a clean install with 8e2592c on my vagrant setup, and stages 0-3 plus ceph.functests.1node.migrate seem to work fine (this cluster has no data in it, though).
@susebot run teuthology
All tests are still failing with
I also just deployed this branch locally and could not reproduce this issue. @smithfarm
Commit 8e2592c is NOT OK.
@jschmid1 That would indicate some incompatibility of this PR with the teuthology environment (host name, etc.). Preparing a debugging env for you to SSH into.
I suspect that
Signed-off-by: Joshua Schmid <[email protected]>
That's what I thought too. On my VMs nodename == host, but on teuthology it's not. Assuming that it's the same for real hardware.
Manually triggered
Results are in, and
@smithfarm
migrate functests orchestration failed again:
In response to #1216 (comment): The root cause for this is, imo, that we are running these tests on a single node. That triggers rebalancing of data from one OSD to another on a single host, which, in this case, resulted in PG inconsistency. Therefore the removal process timed out and left traces of a $wrong_deployment, which triggered osd.correct to fail. When I tried to reproduce this on my cluster (5 OSD nodes x 6 OSDs), the rebalance happened quickly and the test advanced. Our take on this would be to increase the number of nodes we require for testing the migration to at least 2 (3 to be on the safe side).
In light of the recent findings [0], I think it's time to merge this beast of a PR. Migration tests are failing for a couple of reasons:
The first two points are being fixed by adapting the way we execute tests (more nodes, fewer tests on one node). The last two points will be fixed shortly after this is merged (pending commit 8d1b303).
NOTE: for all of us the tests pass locally. All the points mentioned above are only valid for OVH.
@smithfarm @tserong @jan--f Please speak up if you see any problem with this.
[0] https://etherpad.nue.suse.com/p/Deepsea_standup_2018_08_21
[1] target147135132110:~ # l /dev/vdd*
brw-rw---- 1 root disk 254, 48 Aug 22 14:33 /dev/vdd
brw-rw---- 1 ceph ceph 254, 58 Aug 22 14:33 /dev/vdd10
-rw-r--r-- 1 root root 0 Aug 22 13:54 /dev/vdd2
-rw-r--r-- 1 root root 0 Aug 22 14:30 /dev/vdd8
brw-rw---- 1 ceph ceph 254, 57 Aug 22 14:33 /dev/vdd9
[2] vdd 254:48 0 10G 0 disk
├─vdd9 254:57 0 100M 0 part
└─vdd10 254:58 0 100M 0 part
@jschmid1 I don't have any reservations about merging this. The migrate functional test orchestration still fails in OVH even with your most recent fixes (wip-adapt-migrate-functests), but it almost passes (179 states succeeded, 1 failed). I'm confident in your ability to get it completely fixed in a follow-up PR 👍
These functions are essentially equivalent and bundled together. See #1146 for a detailed description.
Also, this PR should not be merged until a decision about #1215 happens.
#1188 #1259 is a prerequisite for this PR. Otherwise, the smoketests may fail.
Signed-off-by: Eric Jackson [email protected]