Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Very specific set of circumstances leads to zero-byte (empty) file being created #1015

Closed
2 of 3 tasks
pedantic-git opened this issue May 21, 2020 · 43 comments
Closed
2 of 3 tasks

Comments

@pedantic-git
Copy link

  • This is a bug report
  • This is a feature request
  • I searched existing issues before opening this one

The story

A few weeks ago, a bunch of tests in Rails code I have started failing in my Docker development environment. My dev environment is Arch so the actual set of libraries and things installed is very much a moving target.

Lots of digging around led me to create a repo that can reproduce the issue using only Ruby (no gems at all), at the lowest level possible. And it's really really weird. More information below.

Expected behavior

When copying a file in Ruby using the IO.copy_stream method, the file should be an identical copy.

Actual behavior

Under a very specific set of circumstances, the resulting file is 0 bytes.

I'm reasonably certain this is a Docker issue, because one of the conditions that needs to be true is that the source file is located on a mounted volume.

The conditions that need to be true for the IO.copy_stream operation to fail are:

  1. The source file is located on a Docker mounted volume and not the Docker container's own filesystem.
  2. The file is first copied into a Tempfile using the low-level Tempfile library.
  3. The tempfile is not touched (e.g. by utime or chmod) after the copy happens.
  4. That tempfile is then copied to another location using IO.copy_stream. Copying it using File.write and tempfile.read does not cause the issue.

Steps to reproduce the behavior

Please check out this repo: https://github.com/fishpercolator/reproduce-copystream and run the command in the README.

The output of this command on my env is:

open, local, File.write: 6744
open, local, IO.copy_stream: 6744

open, mounted, File.write: 6744
open, mounted, IO.copy_stream: 6744

tempfile + copy_file, local, File.write: 6744
tempfile + copy_file, local, IO.copy_stream: 6744

tempfile + copy_file, mounted, File.write: 6744
tempfile + copy_file, mounted, IO.copy_stream: 0

tempfile + copy_file + noop utime, mounted, File.write: 6744
tempfile + copy_file + noop utime, mounted, IO.copy_stream: 6744

tempfile + copy_file + noop chmod, mounted, File.write: 6744
tempfile + copy_file + noop chmod, mounted, IO.copy_stream: 6744

As you can see, the file is copied successfully if it is:

  1. copied directly from the mounted filesystem (not via a tempfile)
  2. copied via a tempfile from the container's own filesystem
  3. copied via a tempfile that has been touched with utime or chmod after the contents were inserted
  4. I do this same operation on an Ubuntu machine running Docker 19.03.9 ... i.e. it seems to be specific to something about my development machine

If any of the 4 statements above are true, the file copies fine. If all 4 are false, the file is created but it has a 0-byte size.

I've checked to see if the source filesystem of the mounted volume makes any difference but I get the same effect with eCryptfs, ext4 and tmpfs mountpoints.

Likewise I've tried different versions of Ruby to see if it was a regression in Ruby, but I can still reproduce the issue with the official Docker images of much older versions where I was certain it worked.

Does anyone have any ideas? Can anyone else reproduce this using my repo?

Output of docker version:

Client:
 Version:           19.03.9-ce
 API version:       1.40
 Go version:        go1.14.3
 Git commit:        9d988398e7
 Built:             Tue May 19 22:11:18 2020
 OS/Arch:           linux/amd64
 Experimental:      false

Server:
 Engine:
  Version:          19.03.9-ce
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.14.3
  Git commit:       9d988398e7
  Built:            Tue May 19 22:11:01 2020
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          v1.3.4.m
  GitCommit:        d76c121f76a5fc8a462dc64594aea72fe18e1178.m
 runc:
  Version:          1.0.0-rc10
  GitCommit:        dc9208a3303feef5b3839f4323d9beb36df0a9dd
 docker-init:
  Version:          0.18.0
  GitCommit:        fec3683

Output of docker info:

Client:
 Debug Mode: false

Server:
 Containers: 317
  Running: 0
  Paused: 0
  Stopped: 317
 Images: 415
 Server Version: 19.03.9-ce
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: false
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: d76c121f76a5fc8a462dc64594aea72fe18e1178.m
 runc version: dc9208a3303feef5b3839f4323d9beb36df0a9dd
 init version: fec3683
 Security Options:
  seccomp
   Profile: default
 Kernel Version: 5.6.13-arch1-1
 Operating System: Arch Linux
 OSType: linux
 Architecture: x86_64
 CPUs: 4
 Total Memory: 15.53GiB
 Name: hulat
 ID: G7VK:VZOA:T4LI:ATNZ:G55B:TDEA:23J2:YM2A:MOLK:U5QY:5MVT:CU7P
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Username: pedanticgit
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

Additional environment details (AWS, VirtualBox, physical, etc.)

It's a physical Dell XPS-13 9360 running Arch Linux. Arch doesn't have version numbers but all packages are up to date. Docker was installed from the Arch package with is built using this PKGBUILD script from the official Docker sources.

@sarna
Copy link

sarna commented May 22, 2020

I can reproduce this on current Manjaro (5.6.12-1-MANJARO), getting the exact same output.
docker version output:

Client:
 Version:           19.03.8-ce
 API version:       1.40
 Go version:        go1.14.1
 Git commit:        afacb8b7f0
 Built:             Thu Apr  2 00:04:36 2020
 OS/Arch:           linux/amd64
 Experimental:      false

Server:
 Engine:
  Version:          19.03.8-ce
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.14.1
  Git commit:       afacb8b7f0
  Built:            Thu Apr  2 00:04:16 2020
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          v1.3.4.m
  GitCommit:        d76c121f76a5fc8a462dc64594aea72fe18e1178.m
 runc:
  Version:          1.0.0-rc10
  GitCommit:        dc9208a3303feef5b3839f4323d9beb36df0a9dd
 docker-init:
  Version:          0.18.0
  GitCommit:        fec3683

docker info output:

Client:
 Debug Mode: false

Server:
 Containers: 6
  Running: 0
  Paused: 0
  Stopped: 6
 Images: 37
 Server Version: 19.03.8-ce
 Storage Driver: overlay2
  Backing Filesystem: <unknown>
  Supports d_type: true
  Native Overlay Diff: false
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: d76c121f76a5fc8a462dc64594aea72fe18e1178.m
 runc version: dc9208a3303feef5b3839f4323d9beb36df0a9dd
 init version: fec3683
 Security Options:
  apparmor
  seccomp
   Profile: default
 Kernel Version: 5.6.12-1-MANJARO
 Operating System: Manjaro Linux
 OSType: linux
 Architecture: x86_64
 CPUs: 4
 Total Memory: 15.53GiB
 Name: mainframe
 ID: 2Y5C:436L:YDBV:YUH6:AW2M:EJ56:TIUI:I4ZO:ACZF:M64N:OQAG:LQBQ
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

@thaJeztah
Copy link
Member

Both Arch linux and Manjaro are distributions that are not officially supported. Packages distributed by those distributions are maintained by those distributions and use differences in both the build process, build-time dependencies, as well as different versions of (e.g.) containerd and runc.

I'd recommend opening a ticket in the corresponding distribution's issue tracker for that reason.

I'm closing this issue because of the above, but feel free to continue the conversation

@kthibodeaux
Copy link

@pedantic-git have you resolved this? did you create an issue elsewhere that I can follow? I'm seeing the same issue with the official ruby 2.6 docker image on an up to date manjaro install (updating an existing install doesnt seem to trigger the error)

@pedantic-git
Copy link
Author

@kthibodeaux I didn't create an issue anywhere else because I didn't have time to test it with the official build of Docker - have you been able to do that? It seems like even though it's happening to everyone on Arch/Manjaro it won't be supported here if it's only the Arch build of Docker.

@ttrmw
Copy link

ttrmw commented Dec 2, 2020

Did either of you get anywhere with this @pedantic-git @kthibodeaux? Got another example here, also on Arch.

@pedantic-git
Copy link
Author

Sadly not. I never got past the "Arch isn't supported" message above. I guess taking it up with the maintainers of the Arch package is the way to go.

Looks like that's Morten Linderud [email protected] if you'd like to raise a case and copy me in (quinn at fishpercolator.co.uk) but I'm not really even sure how one goes about doing that.

@ahwatts
Copy link

ahwatts commented Jan 21, 2021

I'm able to reproduce this on Fedora 33 using the package moby-engine-19.03.13-1.ce.git4484c46.fc33.x86_64.

I get the exact same output as @pedantic-git

My docker info:

awatts@ironic-mullet reproduce-copystream [master=] $ docker info
Client:
 Debug Mode: false

Server:
 Containers: 23
  Running: 21
  Paused: 0
  Stopped: 2
 Images: 64
 Server Version: 19.03.13
 Storage Driver: overlay2
  Backing Filesystem: btrfs
  Supports d_type: true
  Native Overlay Diff: true
 Logging Driver: journald
 Cgroup Driver: systemd
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: runc
 Default Runtime: runc
 Init Binary: /usr/libexec/docker/docker-init
 containerd version: 
 runc version: c9a9ce0286785bef3f3c3c87cd1232e535a03e15
 init version: 
 Security Options:
  seccomp
   Profile: default
 Kernel Version: 5.10.8-200.fc33.x86_64
 Operating System: Fedora 33 (Workstation Edition)
 OSType: linux
 Architecture: x86_64
 CPUs: 8
 Total Memory: 31.09GiB
 Name: ironic-mullet.eminorinc.com
 ID: SHFE:UOOW:MKUA:56MY:UHUL:QTQP:YZJX:C43V:XR4R:67RF:BWKS:22EM
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: true

@pedantic-git
Copy link
Author

@thaJeztah Now that this issue can be reproduced in an official build (albeit of Moby Engine rather than CE) is it possible this issue could warrant reopening?

@AlexVKO
Copy link

AlexVKO commented Feb 16, 2021

@thaJeztah any chance to reopen this issue?

Same issue here(on ubunto 20.04), running images Ubuntu:18:04, Ubuntu:20.04, and Ruby:3.0.0.

Client:
 Context:    default
 Debug Mode: false
 Plugins:
  app: Docker App (Docker Inc., v0.9.1-beta3)
  buildx: Build with BuildKit (Docker Inc., v0.5.0-docker)

Server:
 Containers: 40
  Running: 6
  Paused: 0
  Stopped: 34
 Images: 485
 Server Version: 20.10.1
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 269548fa27e0089a8b8278fc4fc781d7f65a939b
 runc version: ff819c7e9184c13b7c2607fe6c30ae19403a7aff
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: default
 Kernel Version: 5.8.0-43-generic
 Operating System: Ubuntu 20.04.1 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 128
 Total Memory: 125.7GiB
 Name: alexvko
 ID: N5L3:NVEW:VSYS:OSK6:JWAO:T6ZE:APR5:5XVM:OTMQ:BXSZ:DNBV:STUU
 Docker Root Dir: /mnt/md0/docker
 Debug Mode: false
 Username: alexvko
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

WARNING: No blkio weight support
WARNING: No blkio weight_device support

@bmhughes
Copy link

bmhughes commented Mar 4, 2021

Same here on Fedora 33

@makmic
Copy link

makmic commented Mar 16, 2021

I just encountered the same issue and hope that it will be reopened and addressed here as it was far from easy to debug.

In the meantime I'll have to add this (Ruby) monkey patch to my docker setup:

module FileUtils
  class Entry_
    def copy_file(dest)
      Open3.capture2('cp', path(), dest)
    end
  end
end

@pedantic-git
Copy link
Author

@makmic There's actually no need to shell out to a separate and heavyweight cp binary to work around this issue. Even doing a no-op chmod or utime on the file before it is copied as in my example prevents this issue from occurring. Why, I have no idea, but it reliably does!

@ngoan98tv
Copy link

I'm facing to the same issue on CentOS.

Client:
 Context:    default
 Debug Mode: false
 Plugins:
  app: Docker App (Docker Inc., v0.9.1-beta3)
  buildx: Build with BuildKit (Docker Inc., v0.5.1-docker)

Server:
 Containers: 24
  Running: 22
  Paused: 0
  Stopped: 2
 Images: 51
 Server Version: 20.10.5
 Storage Driver: overlay2
  Backing Filesystem: xfs
  Supports d_type: true
  Native Overlay Diff: true
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 05f951a3781f4f2c1911b05e61c160e9c30eaa8e
 runc version: 12644e614e25b05da6fd08a38ffa0cfe1903fdec
 init version: de40ad0
 Security Options:
  seccomp
   Profile: default
 Kernel Version: 4.18.0-240.15.1.el8_3.x86_64
 Operating System: CentOS Linux 8
 OSType: linux
 Architecture: x86_64
 CPUs: 4
 Total Memory: 15.65GiB
 Name: LMS-Server
 ID: DRNO:6ONC:XO6P:6ZAK:W6NI:R764:GT5Q:YTZD:QOEB:Y63W:JD6N:TW4Y
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

@dwarburt
Copy link

@makmic There's actually no need to shell out to a separate and heavyweight cp binary to work around this issue. Even doing a no-op chmod or utime on the file before it is copied as in my example prevents this issue from occurring. Why, I have no idea, but it reliably does!

@pedantic-git I don't suppose you have an example no-op chmod or utime that's effective? and do you think they'd be faster than my current workaround?

module FileUtilsPatch
  def copy_file(dest)
    FileUtils.touch(path())
    super
  end
end

module FileUtils
  class Entry_
    prepend FileUtilsPatch
  end
end

Could this be an interaction with LUKS Full Disk Encryption? That's the main disk related difference on the affected system I have, when the same project is fine on all others.

@pedantic-git
Copy link
Author

pedantic-git commented Mar 31, 2021

@dwarburt Have you seen the example repo in the original post? The last two tests are the no-op ones. They're basically the same as your touch example apart from that they set the times/mode back to their original values rather than updating them to new values. Hope that makes sense!
https://www.github.com/fishpercolator/reproduce-copystream/tree/master/repro.rb

@pimpin
Copy link

pimpin commented Apr 29, 2021

I tried the reproduct-copystream in my old laptop and could not reproduce the bug.
But I ran the same in my new laptop and then bug appears.

Working config :

docker info
Client:
 Debug Mode: false
Server:
 Containers: 23
  Running: 6
  Paused: 0
  Stopped: 17
 Images: 400
 Server Version: 19.03.1
 Storage Driver: aufs
  Root Dir: /var/lib/docker/aufs
  Backing Filesystem: extfs
  Dirs: 693
  Dirperm1 Supported: true
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 894b81a4b802e4eb2a91d1ce216b8817763c29fb
 runc version: 425e105d5a03fabd737a126ad93d62a9eeede87f
 init version: fec3683
 Security Options:
  apparmor
  seccomp
   Profile: default
 Kernel Version: 4.15.0-126-generic
 Operating System: Ubuntu 18.04.5 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 4
 Total Memory: 7.676GiB
 Name: tim-XPS-13
 ID: 5MUF:2FRA:A7CV:Z4QN:5TYH:PB3S:U4S7:SHP6:ARHK:UXPL:HBQ6:HJM7
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Username: tzmachine
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false
WARNING: the aufs storage-driver is deprecated, and will be removed in a future release.

Failing config :

docker info
Client:
 Debug Mode: false

Server:
 Containers: 7
  Running: 7
  Paused: 0
  Stopped: 0
 Images: 20
 Server Version: 19.03.13
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 8fba4e9a7d01810a393d5d25a3621dc101981175
 runc version: 
 init version: fec3683
 Security Options:
  apparmor
  seccomp
   Profile: default
 Kernel Version: 5.8.0-50-generic
 Operating System: Ubuntu Core 18
 OSType: linux
 Architecture: x86_64
 CPUs: 12
 Total Memory: 15.27GiB
 Name: pierrem-XPS-15-9500
 ID: 6CXC:YEKH:QWH2:NJ7Z:A566:IVCN:HSEQ:4KTV:MSCN:2NJU:U2GU:ZWDX
 Docker Root Dir: /var/snap/docker/common/var-lib-docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

This was also failing with Docker version 20.10.6, build 370c289 in my new laptop.

@pimpin
Copy link

pimpin commented Apr 30, 2021

So I thought it might be a regression appeared after 19.13.1, but a team mate made the test and does not have the bug with 20.10.6. So it's not only related to a docker version.

working config :

reproduce-copystream ➜   docker info                                                                                                                                      (master)
Client:
 Context:    default
 Debug Mode: false
 Plugins:
  app: Docker App (Docker Inc., v0.9.1-beta3)
  buildx: Build with BuildKit (Docker Inc., v0.5.1-docker)
Server:
 Containers: 15
  Running: 0
  Paused: 0
  Stopped: 15
 Images: 47
 Server Version: 20.10.6
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runtime.v1.linux runc io.containerd.runc.v2
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 05f951a3781f4f2c1911b05e61c160e9c30eaa8e
 runc version: 12644e614e25b05da6fd08a38ffa0cfe1903fdec
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: default
 Kernel Version: 4.15.0-29-generic
 Operating System: Linux Mint 18.3
 OSType: linux
 Architecture: x86_64
 CPUs: 8
 Total Memory: 31.18GiB
 Name: T480
 ID: U56P:745X:PPSO:XKPL:W5SN:T3EC:FZVJ:JQX6:DFEQ:OO5M:4LTG:A7CC
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false
WARNING: No swap limit support

@mkleint
Copy link

mkleint commented May 12, 2021

Several our Bitbucket Pipelines customers are facing the same issue after our FlatCarOS upgrade 2605.12.0 to 2765.2.2 (kernel move from 5.4.92 directly to 5.10.25)
it mostly appears as ActiveStorage::IntegrityError while running rspec tests

docker info
Client:
 Debug Mode: false
Server:
 Containers: 0
  Running: 0
  Paused: 0
  Stopped: 0
 Images: 0
 Server Version: 20.10.5
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Authorization: pipelines
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux runc
 Default Runtime: runc
WARNING: API is accessible on http://0.0.0.0:2375 without encryption.
         Access to the remote API is equivalent to root access on the host. Refer
         to the 'Docker daemon attack surface' section in the documentation for
         more information: https://docs.docker.com/engine/security/security/#docker-daemon-attack-surface
 Init Binary: docker-init
 containerd version: 269548fa27e0089a8b8278fc4fc781d7f65a939b
 runc version: 12644e614e25b05da6fd08a38ffa0cfe1903fdec
 init version: de40ad0
 Security Options:
  seccomp
   Profile: default
  userns
 Kernel Version: 5.10.25-flatcar
 Operating System: Alpine Linux v3.13 (containerized)
 OSType: linux
 Architecture: x86_64
 CPUs: 8
 Total Memory: 30.91GiB
 Name: 44212645-d3ba-4afa-b2d9-8a1f81e7f1ee-wrmh2
 ID: KPMN:TGRY:QL5G:YYZT:4ZHE:XLQD:BCPI:UEUZ:6KD7:TGB4:GUVN:KAIB
 Docker Root Dir: /var/lib/docker/165536.165536
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Registry Mirrors:
  http://10.203.197.49:5000/
 Live Restore Enabled: false
 Product License: Community Engine

@Joanna-Zhang
Copy link

Joanna-Zhang commented May 12, 2021

I ran into this same problem with ruby's FileUtil.cp on MacOS AMD64 using ruby:2.5.3 as the base image for the container.

touching the file after it was copied worked to get around it.

I was also able to work around it without changing source or spec code by mounting the /tmp directory (the destination of the file copied from a mounted directory) as a tmpfs upon container startup (could also be set through the tmpfs key in a docker-compose file):

--mount type=tmpfs,destination=/tmp

@stephen-turner
Copy link

Reopening as this is being seen in supported configurations.

@bdwyertech
Copy link

Seeing this on docker for Mac as well. Very strange. I can reproduce on one image, but not another.

@ZimbiX
Copy link

ZimbiX commented May 17, 2021

I first came across this issue on Arch on 2020-05-28 in a failing RSpec test using ActiveStorage, returning an ActiveStorage::IntegrityError.

@maddymarkovitz and I did much investigation on this in the following months. With strace, we discovered it was related to the copy_file_range syscall, as recently noted by @amanfredi in the Docker for Mac issue. By testing many different kernel versions in Qemu, we determined it to be a bug in the Linux kernel that was introduced in Linux 5.6.0-arch1-1.

The bug probably resides in the overlay2 subsystem, as it was changed in that release, and from my results of trying out different Docker storage drivers:

Enabling/disabling overlay2's 'Native Overlay Diff' didn't change the outcome.

We put together a minimal C file reproducing the bug by executing the copy_file_range syscall, and a test script that compiles and runs it, copying from a mounted file (resulting in failed copying due to the bug) and copying from a previously copied-in file (resulting in successful copying).

Source:

#include <unistd.h>
#include <sys/syscall.h>
#include <fcntl.h>

int main(int argc, char *argv[]) {
  int input = open(argv[1], O_RDONLY);
  int temp1 = open("/tmp/copy_file_range_test", O_WRONLY|O_CREAT|O_TRUNC, 0100644);

  syscall(SYS_copy_file_range, input, NULL, temp1, NULL, 6, 0);

  close(input);
  close(temp1);

  int temp2 = open("/tmp/copy_file_range_test", O_RDONLY);
  int dest = open("destination", O_WRONLY|O_CREAT|O_TRUNC, 0100644);

  syscall(SYS_copy_file_range, temp2, NULL, dest, NULL, 6, 0);

  close(temp2);
  close(dest);
}

Test script:

#!/bin/bash

set -Eeuo pipefail

source=${1-Gemfile}

check() {
  if wc -c destination | grep -E '^0 ' > /dev/null; then
    echo 'Copy failed'
  else
    echo OK
  fi
  echo
}

printf "FROM debian:10.8-slim\nRUN apt update && apt install -y gcc strace" > /tmp/strace_Dockerfile
docker build -f /tmp/strace_Dockerfile -t strace .
echo

echo Local:
(
  # set -x
  gcc test.c
  strace ./a.out "$source" &> test-strace
)
check

echo Docker - mounted:
docker run --rm -v "$(pwd):/work" -w /work strace bash -c "
  #set -x &&
  gcc test.c &&
  strace ./a.out '$source' &> test-strace-docker-mounted
"
check

echo Docker - copied:
docker run --rm -v "$(pwd):/work" -w /work strace bash -c "
  #set -x &&
  cp '$source' /x &&
  gcc test.c &&
  strace ./a.out /x &> test-strace-docker-copied
"
check

Bug output:

➜ ./ctest.sh                                 
[+] Building 2.3s (6/6) FINISHED                                                                                                                                                                                                                                                  
 => [internal] load build definition from strace_Dockerfile                                                                                                                                                                                                                  0.0s
 => => transferring dockerfile: 115B                                                                                                                                                                                                                                         0.0s
 => [internal] load .dockerignore                                                                                                                                                                                                                                            0.0s
 => => transferring context: 2B                                                                                                                                                                                                                                              0.0s
 => [internal] load metadata for docker.io/library/debian:10.8-slim                                                                                                                                                                                                          2.2s
 => [1/2] FROM docker.io/library/debian:10.8-slim@sha256:13f0764262a064b2dd9f8a828bbaab29bdb1a1a0ac6adc8610a0a5f37e514955                                                                                                                                                    0.0s
 => CACHED [2/2] RUN apt update && apt install -y gcc strace                                                                                                                                                                                                                 0.0s
 => exporting to image                                                                                                                                                                                                                                                       0.0s
 => => exporting layers                                                                                                                                                                                                                                                      0.0s
 => => writing image sha256:93c8b4395b5e2b3d5e0e7510d18201c4d94afab419415b5541b8852781ed0442                                                                                                                                                                                 0.0s
 => => naming to docker.io/library/strace                                                                                                                                                                                                                                    0.0s

Local:
OK

Docker - mounted:
Copy failed

Docker - copied:
OK

We had meant to report the bug to the Linux project, but never quite got around to it. Has anyone else done so already?

Running the script today while writing this (on 5.12.2-arch1-1), I am surprised to see that it actually passes! Maybe Linux fixed the bug recently...

Local:
OK

Docker - mounted:
OK

Docker - copied:
OK

Our colleagues using Mac were not seeing this issue at the time. Now however, Mac people would be seeing this bug too, due to Docker for Mac updating the Linux kernel version used in the VM (to a version above Linux 5.6.0). In Docker for Mac 3.0.0 the Linux kernel sees a massive upgrade, to 5.10.25.

@leomao10
Copy link

leomao10 commented May 19, 2021

Our team also found a comment in Rust stdlib about the issue of copy_file_range returns 0 and they mentioned overlay.

This is the related PR:
rust-lang/rust#75428

This is the related issue:
rust-lang/rust#75446

Maybe it would help to identify the impact of the issue

@esoterick
Copy link

Our team using both Mac and Linux have been seeing this issue while using test-kitchen inside docker containers to test our chef cookbooks. As @ZimbiX mentioned the bug seems to have been introduced in 5.6 kernel, it however did not seem to be present in kernel versions >= 5.11. Users originally saw the following errors which I am including below incase anyone else hits them as this issue was pretty rough to track down.

Chef::Exceptions::JSON::ParseError:
Top level JSON object must be a Hash or Array. (actual: NilClass)
[2021-05-05T19:35:36+00:00] INFO: HTTP Request Returned 404 Not Found: Object not found:
resolving cookbooks for run list: []
[2021-05-05T19:35:37+00:00] ERROR: #<TypeError: no implicit conversion of nil into Hash>
/opt/chef/embedded/lib/ruby/gems/2.3.0/gems/chef-zero-5.3.2/lib/chef_zero/chef_data/cookbook_data.rb:76:in `merge!'
/opt/chef/embedded/lib/ruby/gems/2.3.0/gems/chef-zero-5.3.2/lib/chef_zero/chef_data/cookbook_data.rb:76:in `from_json'
/opt/chef/embedded/lib/ruby/gems/2.3.0/gems/chef-zero-5.3.2/lib/chef_zero/chef_data/cookbook_data.rb:34:in `metadata_from'
/opt/chef/embedded/lib/ruby/gems/2.3.0/gems/chef-12.21.14/lib/chef/chef_fs/chef_fs_data_store.rb:828:in `get_single_cookbook_version'
/opt/chef/embedded/lib/ruby/gems/2.3.0/gems/chef-12.21.14/lib/chef/chef_fs/chef_fs_data_store.rb:550:in `list'
/opt/chef/embedded/lib/ruby/gems/2.3.0/gems/chef-zero-5.3.2/lib/chef_zero/data_store/v1_to_v2_adapter.rb:83:in `block in list'
/opt/chef/embedded/lib/ruby/gems/2.3.0/gems/chef-zero-5.3.2/lib/chef_zero/data_store/v1_to_v2_adapter.rb:113:in `fix_exceptions'
/opt/chef/embedded/lib/ruby/gems/2.3.0/gems/chef-zero-5.3.2/lib/chef_zero/data_store/v1_to_v2_adapter.rb:82:in `list'
/opt/chef/embedded/lib/ruby/gems/2.3.0/gems/chef-zero-5.3.2/lib/chef_zero/rest_base.rb:100:in `list_data'

While test-kitchen was staging the files to be sent to the remote vm, it created a bunch of 0-byte cookbook files and json files. Depending on what we we're testing we hit the above different errors.

We ended up using the following monkeypatch in our testing environment's docker container to workaround the issue.

# frozen_string_literal: true

require 'fileutils'

module FileUtils
  class Entry_
    def copy_file(dest)
      File.open(path()) do |s|
        File.open(dest, 'wb', s.stat.mode) do |f|
          IO.copy_stream(s, f)
          f.chmod f.lstat.mode
        end
      end
    end
  end
end

We also put guards around it so it would only be loaded on the problematic kernels, so in the future in case we forgot about it... it wouldn't bite us...

@stanhu
Copy link

stanhu commented Dec 3, 2021

It seems some distributions that run a 5.4 kernel, such as Ubuntu 18.04.6, are also affected because they have pulled in torvalds/linux@1a980b8:

stanhu@stanhu-ubuntu-18:/tmp$ grep ovl_splice linux-gcp-5.4-5.4.0/fs/overlayfs/file.c
static ssize_t ovl_splice_read(struct file *in, loff_t *ppos,
ovl_splice_write(struct pipe_inode_info *pipe, struct file *out,
	.splice_read    = ovl_splice_read,
	.splice_write   = ovl_splice_write,

I've reported this to Ubuntu in https://bugs.launchpad.net/ubuntu/+source/linux-base/+bug/1953199. I'd suggest others run the reproduction step in #1015 (comment) and report the bugs to the distribution maintainers.

tgrathwell added a commit to codeforamerica/vita-min that referenced this issue Dec 3, 2021
…cleCI issues

As seen in this issue, certain versions of Docker and certain versions of the Linux
kernel manifest a problem where creating tempfiles wiht `IO.copy_stream` doesn't work:
docker/for-linux#1015

CircleCI seems to have upgraded their linux kernels and now we're seeing this problem.
A temporary workaround is to change the ActiveStorage code to not use IO.copy_stream
by making a custom service (since the Disk service is only for test anyway so we can
do whatever we want)
tgrathwell added a commit to codeforamerica/vita-min that referenced this issue Dec 3, 2021
…cleCI issues

As seen in this issue, certain versions of Docker and certain versions of the Linux
kernel manifest a problem where creating tempfiles wiht `IO.copy_stream` doesn't work:
docker/for-linux#1015

CircleCI seems to have upgraded their linux kernels and now we're seeing this problem.
A temporary workaround is to change the ActiveStorage code to not use IO.copy_stream
by making a custom service (since the Disk service is only for test anyway so we can
do whatever we want)
@stanhu
Copy link

stanhu commented Dec 3, 2021

CentOS 8 also has this bug even though it's shipping with a 4.18 kernel. I reported it here: https://bugs.centos.org/view.php?id=18370

RedHat Enterprise Linux 8.3 also has this problem: https://bugzilla.redhat.com/show_bug.cgi?id=2028998

ollietreend added a commit to alphagov/govuk-docker that referenced this issue Dec 7, 2021
This commit is a workaround for the issue #537. It allows the Content Publisher and Whitehall test suites to pass in the GOV.UK Docker development environment.

Once docker/for-linux#1015 has been fixed, this workaround will no longer be needed.
@thaJeztah
Copy link
Member

Thanks for the update, @stanhu - we added an internal ticket to either patch the kernel version as used by Docker Desktop, or to wait for the upstream kernel to include the fix; having it merged in upstream definitely makes it easier 👍

/cc @fredericdalleau @djs55 FYI

@smartygus
Copy link

Hi @thaJeztah,

I see that an internal ticket was opened to fix this kernel issue. I just wanted to ask if this has been planned and if there's an ETA? I've been holding off on updating Docker for Mac for a long time now (due to this issue, which is caused by the underlying kernel problem), and would love to be able to update without having to find a workaround.

Thanks heaps for your help!

@emanuelhfarias
Copy link

@smartygus Kernel 5.10.105 has been released fixing this bug!
Docker for Mac released 4.6.0 version that includes Kernel 5.10.104.
I confirm this version solves the problem in Docker for Mac!

@smartygus
Copy link

@emanuelhfarias this is great news!! Thanks for the heads up! :)

@stanhu
Copy link

stanhu commented Mar 17, 2022

Yes, I think we can close this issue. I see that upgrading Docker for Mac to v4.6.0 upgraded the kernel to 5.10.104-linuxkit from 5.10.76-linuxkit.

@stanhu
Copy link

stanhu commented Mar 17, 2022

Other related distribution notes:

@stanhu
Copy link

stanhu commented May 3, 2022

According to https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8-beta/html/8.6_release_notes/new-features#enhancement_kernel, RedHat 8.6 will be released on May 22, 2022 with kernel-4.18.0-369, so this should fix it. RedHat 8.5 still has this issue.

tillprochaska added a commit to tactilenews/100eyes that referenced this issue May 21, 2022
For some reason, temporary files weren’t created correctly when running tests inside our Docker development environment. This seems to be a Docker issue that can be worked around by mounting the `/tmp` directory.

References:
alphagov/govuk-docker#539
docker/for-linux#1015
tillprochaska added a commit to tactilenews/100eyes that referenced this issue May 21, 2022
For some reason, temporary files weren’t created correctly when running tests inside our Docker development environment. This seems to be a Docker issue that can be worked around by mounting the `/tmp` directory.

References:
alphagov/govuk-docker#539
docker/for-linux#1015
tillprochaska added a commit to tactilenews/100eyes that referenced this issue May 21, 2022
For some reason, temporary files weren’t created correctly when running tests inside our Docker development environment. This seems to be a Docker issue that can be worked around by mounting the `/tmp` directory.

References:
alphagov/govuk-docker#539
docker/for-linux#1015
tillprochaska added a commit to tactilenews/100eyes that referenced this issue May 21, 2022
For some reason, temporary files weren’t created correctly when running tests inside our Docker development environment. This seems to be a Docker issue that can be worked around by mounting the `/tmp` directory.

References:
alphagov/govuk-docker#539
docker/for-linux#1015
tillprochaska added a commit to tactilenews/100eyes that referenced this issue May 21, 2022
For some reason, temporary files weren’t created correctly when running tests inside our Docker development environment. This seems to be a Docker issue that can be worked around by mounting the `/tmp` directory.

References:
alphagov/govuk-docker#539
docker/for-linux#1015
tillprochaska added a commit to tactilenews/100eyes that referenced this issue May 22, 2022
For some reason, temporary files weren’t created correctly when running tests inside our Docker development environment. This seems to be a Docker issue that can be worked around by mounting the `/tmp` directory.

References:
alphagov/govuk-docker#539
docker/for-linux#1015
tillprochaska added a commit to tactilenews/100eyes that referenced this issue Jul 12, 2022
For some reason, temporary files weren’t created correctly when running tests inside our Docker development environment. This seems to be a Docker issue that can be worked around by mounting the `/tmp` directory.

References:
alphagov/govuk-docker#539
docker/for-linux#1015
mattwr18 pushed a commit to tactilenews/100eyes that referenced this issue Jul 27, 2022
For some reason, temporary files weren’t created correctly when running tests inside our Docker development environment. This seems to be a Docker issue that can be worked around by mounting the `/tmp` directory.

References:
alphagov/govuk-docker#539
docker/for-linux#1015
binford2k pushed a commit to voxpupuli/container-puppetdb that referenced this issue Nov 1, 2022
FOSS port of pe-puppetdb's 0886afb401129df40da1d0965b91ee2c08c76e7a

There's a particularly gnarly bug in Linux kernels 5.6 to 5.10 that
can result in 0 byte files being written when copying files inside
containers in a specific workflow as described in:

docker/for-linux#1015

This impacts the packaging gem used by ezbake when it renders ERB
templates, resulting in 0 byte files critical to the execution of
the build process.

Using a different filesystem (like tmpfs) for /tmp as a workaround
didn't seem to work.

Since there is no way to explicitly control the kernel version in
environments like Travis, the best approach is to monkey-patch the
_Entry class in Ruby that supports FileUtils.cp, such that a no-op
mode change is performed on the source and destination files before
and after being written to prevent the 0 byte files from being
written. In local OSX testing, no-op modifying the source file prior
to copy is the solution, but other users reported that no-op
modifying the destination file worked for them -- both solutions are
therefore employed for completeness.

This is a really hacky solution, but it only impacts two specific
scenarios:

* Developer builds against any commit in a branch without using
  mismatched packages
* Travis CI PR testing
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests