[bug:1728183] SMBD thread panics on file operations from Windows, OS X and Linux when using vfs_glusterfs #898

Closed
gluster-ant opened this issue Mar 12, 2020 · 25 comments
Labels: Migrated, Type:Bug, wontfix (Managed by stale[bot])

@gluster-ant
Collaborator

URL: https://bugzilla.redhat.com/1728183
Creator: ryan at 7fivefive
Time: 20190709T09:14:27

Created attachment 1588661
Windows error 01

Description of problem:
The SMBD thread panics when a file operation is performed from a Windows, Linux or OS X client while the share is using the glusterfs VFS module, either on its own or in conjunction with others, e.g.:

vfs objects = catia fruit streams_xattr glusterfs

Gluster volume info:
Volume Name: mcv01
Type: Distributed-Replicate
Volume ID: 1580ab45-0a14-4f2f-8958-b55b435cdc47
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: mcn01:/mnt/h1a/mcv01_data
Brick2: mcn02:/mnt/h1b/mcv01_data
Brick3: mcn01:/mnt/h2a/mcv01_data
Brick4: mcn02:/mnt/h2b/mcv01_data
Options Reconfigured:
features.quota-deem-statfs: on
nfs.disable: on
features.inode-quota: on
features.quota: on
cluster.brick-multiplex: off
cluster.server-quorum-ratio: 50%

Version-Release number of selected component (if applicable):
Gluster 6.3
Samba 4.10.6-5

How reproducible:
Every time

Steps to Reproduce:

  1. Mount share as mapped drive
  2. Write to share or read from share

Actual results:
Multiple error messages, attached to bug
In OS X or Linux, running 'dd if=/dev/zero of=/mnt/share/test.dat bs=1M count=100' results in a hang. Tailing OS X console logs reveals that the share is timing out.
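For reference, the Linux-side reproduction is roughly the following (server name, share name and credentials are placeholders; adjust to your environment):

  # mount the SMB share via the kernel CIFS client (placeholder host/share/user)
  mount -t cifs //magentanas01/QC /mnt/share -o username=editor01,vers=3.0
  # any sizeable write to the mounted share then hangs
  dd if=/dev/zero of=/mnt/share/test.dat bs=1M count=100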

Expected results:
File operation is successful

Additional info:
Gluster client logs, and SMB debug 10 logs attached
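Equivalent debug 10 logs can be captured either by setting 'log level = 10' in smb.conf or, assuming a standard Samba install, by raising the level at runtime:

  # bump all running smbd processes to debug level 10 (sketch; revert with 'debug 2' afterwards)
  smbcontrol all debug 10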

@gluster-ant
Collaborator Author

Time: 20190709T09:15:08
ryan at 7fivefive commented:
Created attachment 1588662
Windows error 02

@gluster-ant
Collaborator Author

Time: 20190709T09:15:48
ryan at 7fivefive commented:
Created attachment 1588663
Samba Debug 10 logs

@gluster-ant
Collaborator Author

Time: 20190709T09:16:25
ryan at 7fivefive commented:
Created attachment 1588664
Gluster client logs

@gluster-ant
Collaborator Author

Time: 20190709T11:47:56
ryan at 7fivefive commented:
Tested on Gluster 6.1 with the same issue.
Gluster 5.6 works fine.

@gluster-ant
Collaborator Author

Time: 20190719T10:19:41
anoopcs at redhat commented:
(In reply to ryan from comment #0)

This is weird. Can you post your smb.conf?

@gluster-ant
Collaborator Author

Time: 20190719T10:29:03
ryan at 7fivefive commented:
Hi Anoop,

It's very odd, I've got a feeling it's something related to the upgrade/downgrade process I've been using to test different versions of Gluster for the different bug tickets I've got open.

Currently I'm using the following script to upgrade/downgrade (This one's to upgrade to 6):
yum remove centos-release-gluster* -y
yum install centos-release-gluster6 -y
yum remove glusterfs* -y
yum install glusterfs-server* -y
yum install sernet-samba-vfs-glusterfs -y
systemctl stop glusterd
systemctl stop glusterfsd
sed -i 's/operating-version=.*/operating-version=60000/gi' /var/lib/glusterd/glusterd.info
systemctl stop glusterfsd
systemctl restart glusterd
gluster volume set all cluster.op-version 60000

Could you please flag any issues with this, or recommend a way of downgrading in particular?
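For reference, I check the current and maximum supported op-version before and after running the script with the following (these volume-get queries should be available on Gluster >= 3.10):

gluster volume get all cluster.op-version      # op-version the cluster is currently running at
gluster volume get all cluster.max-op-version  # highest op-version the installed binaries support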

SMB config:
[global]
security = ADS
workgroup = MAGENTA
realm = MAGENTA.LOCAL
netbios name = MAGENTANAS01
max protocol = SMB3
min protocol = SMB2
ea support = yes
clustering = yes
server signing = no
max log size = 10000
glusterfs:loglevel = 7
log file = /var/log/samba/log-%M.smbd
logging = file
log level = 2
template shell = /sbin/nologin
winbind offline logon = false
winbind refresh tickets = yes
winbind enum users = Yes
winbind enum groups = Yes
allow trusted domains = yes
passdb backend = tdbsam
idmap cache time = 604800
idmap negative cache time = 300
winbind cache time = 604800
idmap config magenta:backend = rid
idmap config magenta:range = 10000-999999
idmap config * : backend = tdb
idmap config * : range = 3000-7999
guest account = nobody
map to guest = bad user
force directory mode = 0777
force create mode = 0777
create mask = 0777
directory mask = 0777
hide unreadable = no
store dos attributes = no
unix extensions = no
load printers = no
printing = bsd
printcap name = /dev/null
disable spoolss = yes
glusterfs:volfile_server = localhost
kernel share modes = No
strict locking = auto
oplocks = yes
durable handles = yes
kernel oplocks = no
posix locking = no
level2 oplocks = no
readdir_attr:aapl_rsize = yes
readdir_attr:aapl_finder_info = no
readdir_attr:aapl_max_access = no
fruit:aapl = yes

[QC]
guest ok = no
read only = no
vfs objects = glusterfs
glusterfs:volume = mcv01
path = "/data/qc_only"
valid users = @"QC_ops"
recycle:repository = .recycle
recycle:keeptree = yes
recycle:versions = yes
recycle:directory_mode = 0770
recycle:subdir_mode = 0777
glusterfs:logfile = /var/log/samba/glusterfs-mcv01.%M.log

[QC-GlusterFuse]
guest ok = no
read only = no
vfs objects = glusterfs_fuse
path = "/mnt/mcv01/data/qc_only"
valid users = @"QC_ops"
recycle:repository = .recycle
recycle:keeptree = yes
recycle:versions = yes
recycle:directory_mode = 0770
recycle:subdir_mode = 0777
glusterfs:logfile = /var/log/samba/glusterfs-mcv01.%M.log

[QC-FUSE]
guest ok = no
read only = no
path = "/mnt/mcv01/data/qc_only"
valid users = @"QC_ops"
recycle:repository = .recycle
recycle:keeptree = yes
recycle:versions = yes
recycle:directory_mode = 0770
recycle:subdir_mode = 0777
glusterfs:logfile = /var/log/samba/glusterfs-mcv01-fuse.%M.log


Many thanks,
Ryan

@gluster-ant
Collaborator Author

Time: 20190902T13:54:01
ryan at 7fivefive commented:
Anyone able to offer some assistance with this?
We're still seeing the issue on two of our servers after upgrading to Gluster 6.5 and Samba 4.10.7.

@gluster-ant
Collaborator Author

Time: 20191017T14:46:40
ryan at 7fivefive commented:
Trying to copy a file with a Windows 10 client results in the transfer failing with an error (see screenshot).
Looking through the smb logs shows this:

mag-desktop-01 (ipv4:10.0.3.12:57488) connect to service Grading initially as user editor01 (uid=2000, gid=2900) (pid 296596)
[2019/10/17 14:09:35.784481, 2] ../../source3/smbd/smbXsrv_open.c:675(smbXsrv_open_global_verify_record)
smbXsrv_open_global_verify_record: key 'FA7F6275' server_id 296320 does not exist.
[2019/10/17 14:09:35.784509, 1] ../../librpc/ndr/ndr.c:422(ndr_print_debug)
&global_blob: struct smbXsrv_open_globalB
version : SMBXSRV_VERSION_0 (0)
seqnum : 0x00000002 (2)
info : union smbXsrv_open_globalU(case 0)
info0 : *
info0: struct smbXsrv_open_global0
db_rec : NULL
server_id: struct server_id
pid : 0x0000000000048580 (296320)
task_id : 0x00000000 (0)
vnn : 0xffffffff (4294967295)
unique_id : 0x3f2d4bc50a3ad530 (4552378107993707824)
open_global_id : 0xfa7f6275 (4202652277)
open_persistent_id : 0x00000000fa7f6275 (4202652277)
open_volatile_id : 0x0000000037cfd301 (936366849)
open_owner : S-1-5-21-3658843901-2482107748-408451428-1000
open_time : Thu Oct 17 14:09:36 2019 BST
create_guid : aea7fead-f0de-11e9-b036-b88584997125
client_guid : aea7fb79-f0de-11e9-b036-b88584997125
app_instance_id : 00000000-0000-0000-0000-000000000000
disconnect_time : NTTIME(0)
durable_timeout_msec : 0x0000ea60 (60000)
durable : 0x01 (1)
backend_cookie : DATA_BLOB length=452
[0000] 56 46 53 5F 44 45 46 41 55 4C 54 5F 44 55 52 41 VFS_DEFA ULT_DURA
[0010] 42 4C 45 5F 43 4F 4F 4B 49 45 5F 4D 41 47 49 43 BLE_COOK IE_MAGIC
[0020] 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
[0030] 00 00 00 00 00 00 00 00 96 89 8E 03 00 00 00 00 ........ ........
[0040] 39 E0 DF F2 31 0E 74 B0 00 00 00 00 00 00 00 00 9...1.t. ........
[0050] 00 00 02 00 04 00 02 00 00 00 10 37 00 00 00 00 ........ ...7....
skipping zero buffer bytes
[0080] 96 89 8E 03 00 00 00 00 39 E0 DF F2 31 0E 74 B0 ........ 9...1.t.
[0090] FF 81 00 00 00 00 00 00 01 00 00 00 00 00 00 00 ........ ........
[00A0] D0 07 00 00 00 00 00 00 54 0B 00 00 00 00 00 00 ........ T.......
[00B0] 00 00 00 00 00 00 00 00 7A E3 01 37 00 00 00 00 ........ z..7....
[00C0] F3 67 A8 5D 00 00 00 00 00 00 00 00 00 00 00 00 .g.].... ........
[00D0] F3 67 A8 5D 00 00 00 00 00 00 00 00 00 00 00 00 .g.].... ........
[00E0] F3 67 A8 5D 00 00 00 00 00 00 00 00 00 00 00 00 .g.].... ........
[00F0] F3 67 A8 5D 00 00 00 00 00 00 00 00 00 00 00 00 .g.].... ........
[0100] 99 7F 00 00 00 00 00 00 08 00 00 00 00 00 00 00 ........ ........
[0110] 00 00 00 00 00 00 00 00 06 00 00 00 00 00 00 00 ........ ........
[0120] 06 00 00 00 2F 64 61 74 61 00 00 00 3C 00 00 00 ..../dat a...<...
[0130] 00 00 00 00 3C 00 00 00 6E 65 77 20 66 6F 6C 64 ....<... new fold
[0140] 65 72 20 66 72 6F 6D 20 6D 61 63 2F 77 38 6B 76 er from mac/w8kv
[0150] 76 2D 73 74 68 2D 34 38 66 70 73 2D 31 30 74 6F v-sth-48 fps-10to
[0160] 31 72 65 64 63 6F 64 65 5F 46 46 2E 52 44 43 2E 1redcode _FF.RDC.
[0170] 7A 69 70 00 00 00 00 00 00 00 00 00 00 00 00 00 zip..... ........
[0180] 00 00 00 00 00 00 00 00 F3 67 A8 5D 00 00 00 00 ........ .g.]....
[0190] 00 00 00 00 00 00 00 00 F3 67 A8 5D 00 00 00 00 ........ .g.]....
[01A0] 00 00 00 00 00 00 00 00 F3 67 A8 5D 00 00 00 00 ........ .g.]....
[01B0] 00 00 00 00 00 00 00 00 F3 67 A8 5D 00 00 00 00 ........ .g.]....
[01C0] 00 00 00 00 ....
channel_sequence : 0x0000 (0)
channel_generation : 0x0000000000000000 (0)
[2019/10/17 14:09:35.785374, 3] ../../source3/smbd/smb2_create.c:800(smbd_smb2_create_send)

@gluster-ant
Collaborator Author

Time: 20191017T14:47:11
ryan at 7fivefive commented:
Created attachment 1626823
Screenshot of Windows 10 error

@gluster-ant
Collaborator Author

Time: 20191022T10:58:09
anoopcs at redhat commented:
What are your current Samba and GlusterFS versions?

(In reply to ryan from comment #8)

> Trying to copy a file with a Windows 10 client results in the transfer failing with an error (see screenshot).

  • Does it happen every time you attempt a copy of the same file?
  • Is it something specific to a file/directory type?

> mag-desktop-01 (ipv4:10.0.3.12:57488) connect to service Grading initially as user editor01 (uid=2000, gid=2900) (pid 296596)

I don't see a share named [Grading] in the smb.conf from comment #6. If that's newly added, were there any changes to the global parameters?

@gluster-ant
Collaborator Author

Time: 20191022T15:32:19
ryan at 7fivefive commented:
Hi Anoop,

Versions:
Gluster = 6.5
Samba = 4.10.8

> Does it happen every time you attempt a copy of the same file?
> Is it something specific to a file/directory type?
Yes, this happens with any file being copied, or a new file being created (write fails). Happens in multiple directories 100% of the time.

In an effort to reduce the variables in play, I'd changed the config. Complete config below:

[global]
security = user
username map script = /bin/echo
max protocol = SMB3
min protocol = SMB2
ea support = yes
clustering = no
server signing = no
max log size = 10000
glusterfs:loglevel = 5
log file = /var/log/samba/log-%M.smbd
logging = file
log level = 3
template shell = /sbin/nologin
passdb backend = tdbsam
guest account = nobody
map to guest = bad user
force directory mode = 0777
force create mode = 0777
create mask = 0777
directory mask = 0777
hide unreadable = no
unix extensions = no
load printers = no
printing = bsd
printcap name = /dev/null
disable spoolss = yes
glusterfs:volfile_server = localhost
kernel share modes = No

[Grading]
read only = no
guest ok = yes
vfs objects = catia fruit streams_xattr glusterfs
glusterfs:volume = mcv01
path = "/data"
valid users = "nobody" @"audio" @"QC_ops" @"MAGENTA\domain admins" @"MAGENTA\domain users" @"nas_users"
glusterfs:logfile = /var/log/samba/glusterfs-mcv01.%M.log

Best,
Ryan

@gluster-ant
Collaborator Author

Time: 20191030T13:57:23
ryan at 7fivefive commented:
Hi Anoop,

Did you get a chance to look into this?
Can I assist in any way?

@gluster-ant
Collaborator Author

Time: 20191105T10:04:27
ryan at 7fivefive commented:
Hi Anoop,

I believe we have found the issue; however, we require some assistance with the workaround.
When running the op-version at 40100 with Gluster 6.5 we don't have any issues.
However, when running at the maximum cluster op-version of 60000 we get lots of panics in the SMB logs.
I contacted Sernet about this, and it seems the issue is because they still compile the VFS against Gluster 3.12.
We're going to try testing with a package compiled against 6.5 to see if the issue goes away.

In the meantime, is it possible to downgrade the op-version?

Many thanks,
Ryan

@gluster-ant
Collaborator Author

Time: 20191118T10:50:08
anoopcs at redhat commented:
(In reply to ryan from comment #13)

> and it seems the issue is because they still compile the VFS against Gluster 3.12.

GFAPI uses symbol versions. Unless some API got removed (zero chance of that happening), every old version of a modified API must still be present in newer GlusterFS. Assuming the Samba version is maintained, I am curious how such an incompatibility can lead to panics.

> We're going to try testing with a package compiled against 6.5 to see if the issue goes away.

How did it go?

> In the meantime, is it possible to downgrade the op-version?

I would suggest staying at the maximum available op-version, to make use of the latest features in the updated GlusterFS.

@gluster-ant
Collaborator Author

Time: 20191118T10:58:22
ryan at 7fivefive commented:
Hi Anoop,

Below were the test versions and results

Gluster 4.1 (op-version 40100) + Sernet Samba Gluster VFS (Built against Gluster 3.12) = PASS
Gluster 6.5 (op-version 60000) + Sernet Samba Gluster VFS (Built against Gluster 3.12) = FAIL
Gluster 6.5 (op-version 40100) + Sernet Samba Gluster VFS (Built against Gluster 3.12) = PASS
Gluster 6.5 (op-version 60000) + Sernet Samba Gluster VFS (Built against Gluster 6.5) = PASS

The VFS package Sernet compiled for us against Gluster 6.5 has resolved this issue for us.
I also downgraded the op-version by modifying the vol config files, which allowed the VFS built against Gluster 3.12 to work, likewise fixing the issue.
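Concretely, the manual downgrade followed the same pattern as the script in comment #6; a rough sketch (note that editing glusterd.info by hand is not an officially supported path, so take backups first):

systemctl stop glusterd
# repeat the edit on every node in the cluster
sed -i 's/operating-version=.*/operating-version=40100/gi' /var/lib/glusterd/glusterd.info
systemctl restart glusterd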

Please let me know if you need any more info/data.
Best regards,
Ryan

@gluster-ant
Collaborator Author

Time: 20191118T12:01:14
anoopcs at redhat commented:
(In reply to ryan from comment #15)

> Hi Anoop,
>
> Below were the test versions and results
>
> Gluster 4.1 (op-version 40100) + Sernet Samba Gluster VFS (Built against Gluster 3.12) = PASS

Expected..

> Gluster 6.5 (op-version 60000) + Sernet Samba Gluster VFS (Built against Gluster 3.12) = FAIL
> Gluster 6.5 (op-version 40100) + Sernet Samba Gluster VFS (Built against Gluster 3.12) = PASS

Just like the GlusterFS VFS module built against v3.12 works fine with op-version 40100, I would expect it to work with op-version 60000 too. Otherwise it needs some investigation.

> Gluster 6.5 (op-version 60000) + Sernet Samba Gluster VFS (Built against Gluster 6.5) = PASS

Fine.

> The VFS package Sernet compiled for us against Gluster 6.5 has resolved this issue for us.
> I also downgraded the op-version by modifying the vol config files, which allowed the VFS built against Gluster 3.12 to work, likewise fixing the issue.

Good.

> Please let me know if you need any more info/data.

I remember that you were blocked in testing bz #1680085 due to this bug. Can you re-visit bz #1680085 now?

@gluster-ant
Collaborator Author

Time: 20191119T14:54:47
kkeithle at redhat commented:
IMO the results you see are consistent with the design of the versioned symbols in gfapi; i.e. that old programs (and other consumers of gfapi such as the Samba glusterfs VFS) that were originally compiled and linked with old libraries can be used with newer versions of gfapi without having to rebuild and relink.

For now this does imply that gluster needs to run at the same op-version as 3.12 (or one close to it) if you're using a gluster VFS that was linked against the 3.12 libgfapi.
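As a quick sanity check, the gfapi symbol versions a given vfs_glusterfs build references can be compared against what the installed libgfapi provides (the module and library paths below are assumptions for a typical RPM layout; adjust as needed):

# symbol versions the Samba VFS module was linked against
objdump -T /usr/lib64/samba/vfs/glusterfs.so | grep GFAPI_
# symbol versions exported by the installed libgfapi
objdump -T /usr/lib64/libgfapi.so.0 | grep GFAPI_ | sort -u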

@gluster-ant
Collaborator Author

Time: 20191120T11:40:20
ryan at 7fivefive commented:
Hi Kaleb,

Thanks for confirming.
Is there a recommended way of downgrading the op-version, other than editing the vol file?

Best,
Ryan

@gluster-ant
Collaborator Author

Time: 20191121T09:31:13
ndevos at redhat commented:
I'm missing a little detail in this bug report. Compiling the vfs_gluster Samba module against glusterfs-3.12 results in a binary that can be used with glusterfs-6.x (on the Gluster client, i.e. the Samba server). It is not clear to me what version of the Gluster client was used in the test of comment #15. Did it match the version of the Gluster server, or was it kept at 3.12?

@gluster-ant
Collaborator Author

Time: 20191121T09:40:25
ryan at 7fivefive commented:
Hi Niels,

Please see revised comment, does this answer your question?

Gluster Server 4.1 (op-version 40100) + Sernet Samba Gluster VFS (Built against Gluster Client 3.12) = PASS
Gluster Server 6.5 (op-version 60000) + Sernet Samba Gluster VFS (Built against Gluster Client 3.12) = FAIL
Gluster Server 6.5 (op-version 40100) + Sernet Samba Gluster VFS (Built against Gluster Client 3.12) = PASS
Gluster Server 6.5 (op-version 60000) + Sernet Samba Gluster VFS (Built against Gluster Client 6.5) = PASS

Best,
Ryan

@gluster-ant
Collaborator Author

Time: 20191121T10:14:01
ndevos at redhat commented:
Does that also mean the Gluster client packages on the Samba server are kept at the "Built against Gluster Client" version?

This is not a requirement from a libgfapi gluster-bindings perspective. It is expected to work correctly when Samba is compiled against glusterfs-3.12 but the resulting vfs_gluster module (built against Gluster client 3.12) is run on a system that has only the glusterfs-6.x versions installed. The built Samba/vfs_gluster binary should be compatible with glusterfs-6.x. It is recommended that Gluster clients and Gluster servers run the same Gluster version (even when Samba/vfs_gluster is built with an older version of Gluster).
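One way to confirm which Gluster client bits are actually installed on the Samba node (package names here assume the CentOS + Sernet layout mentioned earlier in this report):

# installed gluster and samba packages on the Samba/Gluster node
rpm -qa | grep -Ei 'glusterfs|sernet-samba'
# versioned GFAPI symbol dependencies declared by the VFS package
rpm -q --requires sernet-samba-vfs-glusterfs | grep -i gfapi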

@gluster-ant
Collaborator Author

Time: 20191121T10:20:42
ryan at 7fivefive commented:
In our test case, the Gluster client & Samba server are on the same nodes as the Gluster server, so they would all be on the same version as the server.

Best,
Ryan

@gluster-ant
Collaborator Author

Time: 20191121T11:04:52
ndevos at redhat commented:
Thanks!

In that case I'm really surprised to hear that different op-versions can cause a panic in Samba... Anoop would be the best person to help with this.

@stale

stale bot commented Oct 8, 2020

Thank you for your contributions.
Noticed that this issue is not having any activity in last ~6 months! We are marking this issue as stale because it has not had recent activity.
It will be closed in 2 weeks if no one responds with a comment here.

@stale stale bot added the wontfix Managed by stale[bot] label Oct 8, 2020
@stale

stale bot commented Oct 24, 2020

Closing this issue as there was no update since my last update on issue. If this is an issue which is still valid, feel free to open it.

@stale stale bot closed this as completed Oct 24, 2020