
TraceHeader duplication is time consuming #423

Closed
seisgo opened this issue Nov 14, 2019 · 20 comments
seisgo commented Nov 14, 2019

segyio version: 1.8.6

I created a new SEG-Y file based on an existing one and duplicated the trace headers from it. However, this process is time-consuming, especially for large files. For example, the existing SEG-Y file is 5.8 GB; after about 4 hours, the new file had only reached about 1 GB. The time consumed depends on the trace count of the file; here the trace count is 641 × 1851 = 1,186,491.

Following is part of my code:

with segyio.open(args['input'], 'r') as iFile:
    spec = segyio.tools.metadata(iFile)

    # create output file
    with segyio.create(args['output'], spec) as oFile:
        for i in range(1+spec.ext_headers):
            oFile.text[i] = iFile.text[i]
        oFile.bin = iFile.bin
        oFile.header = iFile.header

This code is almost the same as the 'create' example in the segyio documentation.

So, is this time-consuming process normal?
If not, could anyone suggest a better way to copy trace headers when creating a new SEG-Y file?

Thanks!

jokva commented Nov 14, 2019

No, it's not normal at all, and your code looks perfectly fine. Does your system normally have this bad write throughput? Do any other segyio versions behave differently?

I've not observed header copies being this crazy slow, (un)fortunately. Does writing traces, or even smaller files, go as poorly?

That being said, if you're not planning on modifying the headers as you go, a better way of approaching this problem is to first do a regular file copy, e.g. through shutil.copyfile, and then open the newly-created file with segyio.open(mode='r+'). That way, segyio should at least not slow down on the header copy, and you get to see whether writing traces is as bad.
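A minimal, stdlib-only sketch of that copy-then-patch flow (file names and sizes are made up; in real use the reopen step would be segyio.open(dst, mode='r+') followed by header or trace writes):

```python
import os
import shutil
import tempfile

# Stand-in for an existing SEG-Y file (contents are arbitrary here).
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, 'input.sgy')
dst = os.path.join(workdir, 'output.sgy')
with open(src, 'wb') as f:
    f.write(b'\x00' * 3600)          # e.g. textual + binary header area

# 1. Dumb byte-for-byte copy: as fast as the disk allows, no sparse seeking.
shutil.copyfile(src, dst)

# 2. Reopen the copy and patch in place. With segyio this would be
#    segyio.open(dst, mode='r+') and then header/trace assignments.
with open(dst, 'r+b') as f:
    f.write(b'C 1 COPIED')           # hypothetical in-place edit at offset 0

assert os.path.getsize(dst) == 3600  # size unchanged: pure overwrite
```

The point is that the bulk copy runs at raw throughput, and the later writes only overwrite bytes that already exist, so no sparse holes are ever created.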

seisgo commented Nov 14, 2019

@jokva Thanks for your quick reply!

I have done the trace initialization as follows, and it's not time-consuming.

# Init trace data of output file object
for il in spec.ilines:
    oFile.iline[il] = np.zeros((nXline, nSample), dtype=np.single)

I will check some other versions of segyio.

Earlier, I used copyfile to duplicate a SEG-Y file, but at first glance it also seemed time-consuming for large files, so I then tried segyio.create. I haven't compared the time taken by the two operations, though. Thanks for your advice, I will check that as well.

jokva commented Nov 14, 2019

Earlier, I used copyfile to duplicate a SEG-Y file, but at first glance it also seemed time-consuming for large files

It is, but that's just the performance of your system: copyfile stupidly copies bytes, which is as fast as it can be. segyio can only beat that if you need to modify the data, because then things are only written once.

I have done the trace initialization as follows, and it's not time-consuming.

Ok, good. Batch inline writing like you do here is typically the fastest operation in segyio, so that's at least something.

Do you mind unrolling the header copy like this, and then getting a feeling for when or where it slows down?

for i in tqdm(range(src.tracecount)):
    dst.header[i] = src.header[i]

seisgo commented Nov 14, 2019

@jokva
Following your advice, I used tqdm to monitor the process. During the duplication, at about 3% progress, the speed drops from 300 it/s to 60 it/s, and it keeps decreasing. At about 5% progress, the speed is down to 30 it/s.

What's more, this phenomenon is not specific to a segyio version; I have checked 1.8.3, 1.8.6 and 1.8.8.

I then switched to shutil.copyfile, and the duplication takes only about 4 minutes.


To note, all of the above tests were on Linux (RHEL 6.5), in an Anaconda environment (conda/4.7.12 requests/2.22.0 CPython/3.6.9 Linux/2.6.32-431.el6.x86_64 rhel/6.5 glibc/2.12).

However, on my Windows 10 PC, the same trace header duplication takes about half an hour, which is acceptable.

In short, copyfile performs best.

Thanks for your reply.

jokva commented Dec 2, 2019

Hi,

I ran some experiments, but was unable to reproduce the poor performance on my (Debian 9) machine. I'll keep looking into it, but I found no obvious reason (in the code) why writing multiple headers should be slow.

For what it's worth, I used tqdm, and saw a throughput of ~5000 iterations per second.

doiko commented Dec 23, 2019

Hi @jokva, we can confirm this on Ubuntu 18.04 using segyio==1.9.0, running:

def create_empty_segy_(segy, initial_segy_filename):
    spec = segyio.tools.metadata(segy)
    spec.samples = spec.samples[:1]
    with segyio.create(initial_segy_filename, spec) as dst:
        dst.text[0] = segy.text[0]
        dst.bin = segy.bin
        dst.bin.update(hns=len(spec.samples))
        dst.header = segy.header
        dst.trace = segy.trace

It takes a long time to copy the headers. For a large SEG-Y file (700 GB+), it had not completed after 2 days. Reading throughput drops to a few Mbit/s.
We do not want to copy the whole SEG-Y file, just keep the trace header info for future use.

jokva commented Dec 24, 2019

Ok, interesting. There's obviously something going on, even though I haven't been able to reproduce it yet. I'll take a new look into it.

Out of curiosity, is it slow if you're just walking the headers too, and not writing them back to disk?

jokva commented Dec 26, 2019

@seisgo So I'm still not able to reproduce this, but I would like you to test something for me. What happens if you change your code to:

with segyio.open(args['input'], 'r') as iFile:
    spec = segyio.tools.metadata(iFile)

    # create output file
    with segyio.create(args['output'], spec) as oFile:
        for i in range(1+spec.ext_headers):
            oFile.text[i] = iFile.text[i]
        oFile.bin = iFile.bin
        oFile.header[-1] = {}
        oFile.header = iFile.header

Note the second-to-last line, i.e. write the last header first, then copy all of them. Is it as slow?

seisgo commented Jan 2, 2020

@jokva Sorry for the late answer.
I have tested as you advised.
To my surprise, this time the trace header copying runs at normal speed, with or without 'oFile.header[-1] = {}'. I am really confused.
The segyio version is 1.8.3, and the test is still on my RHEL machine. For a 5.8 GB seismic file, the copying process takes about 4 minutes. In addition, with tqdm monitoring, the iteration speed is about 4000+ it/s.

seisgo commented Jan 2, 2020

@jokva
I have found the cause of this anomaly. The copying speed depends on the output directory. If it is a subdirectory of /home, it works well. If it is a subdirectory of a mounted hard disk, the speed is slow and keeps decreasing.

Please check whether it is the same on your platform.

jokva commented Jan 2, 2020

What kind of mounted hard disk is it? Does it happen to be a network drive (maybe NFS)? Are there any particular mount opts?

What if you do the oFile.header[-1] trick?

seisgo commented Jan 2, 2020

@jokva
The mount type is NTFS-3G, without any particular options.

I have tested the oFile.header[-1] trick; it makes no difference.

jokva commented Jan 2, 2020

Is it mounted with sync, by any chance? Check with mount | grep path, where path is the drive root path, e.g. /media/external.

jokva commented Jan 2, 2020

A couple of other experiments would be interesting too. I'm afraid I can't debug this myself, as I don't have an NTFS drive around.

Try writing both headers and traces in tandem:

for i in range(src.tracecount):
    dst.header[i] = src.header[i]
    dst.trace[i] = src.trace[i] 

If this makes it fast again, it's probably seeking past unwritten data that tanks the performance.

edit: also, how is CPU usage? Is it high? Does ntfs-3g use a lot of CPU, or is it mostly idle?

edit2:
https://www.tuxera.com/community/ntfs-3g-faq/#slow

A highly sparse file is being regularly written or updated.

Which is exactly what you're doing.
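For illustration, a stdlib-only sketch of that access pattern (the 240-byte trace header size is from the SEG-Y standard; the trace payload size and record count are made up):

```python
import os
import tempfile

header_bytes = 240   # SEG-Y trace header size
trace_bytes = 4000   # assumed trace payload size, for illustration

# Writing each 240-byte header and then seeking past the (still
# unwritten) samples to the next header position leaves holes in the
# file: exactly the sparse access pattern ntfs-3g handles poorly.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, 'wb') as f:
    for i in range(100):
        f.seek(i * (header_bytes + trace_bytes))
        f.write(b'\x00' * header_bytes)   # header written, samples skipped

size = os.path.getsize(path)
os.unlink(path)

# The logical size spans all 100 records even though only headers were
# ever written; everything in between is holes.
assert size == 99 * (header_bytes + trace_bytes) + header_bytes
```

This is why interleaving header and trace writes, or doing a plain byte copy first, avoids the slowdown: the file is then filled front to back with no holes to update.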

seisgo commented Jan 3, 2020

@jokva
The slow copying operation writes to the directory /diska. The output of mount | grep diska is
/dev/sda1 on /diska type fuseblk (rw,allow_other,blksize=4096)

In comparison, the output of mount | grep home is
/dev/sdc2 on /home type ext4 (rw)

seisgo commented Jan 3, 2020

Yes, writing both headers and traces makes a difference.

When the output file is in an ntfs-3g mounted directory, the copying speed is about 1000 it/s, and CPU usage is about 50%.

When the output file is in an ext4 mounted directory (/home), the speed is a bit faster, about 2000 it/s with fluctuations, and CPU usage is above 50%, sometimes even approaching 100%.

jokva commented Jan 3, 2020

It's not a great sign that CPU utilisation is so high even on ext4, but having >50% better performance on ext4 over ntfs-3g is expected.

I don't think this is a problem with segyio, other than maybe documentation; it's a consequence of ntfs-3g. I'll get around to adding a FAQ entry about slow sparse writes, which are not obvious from the dst.header = src.header syntax.

seisgo commented Jan 3, 2020

Yes, from the above tests, it is not a problem with segyio.

Maybe I didn't make myself clear: the copying speed on ext4 is about twice that on ntfs-3g.

jokva commented Jan 6, 2020

I've added a proposed fix for this in the readme: #434

jokva commented Jan 13, 2020

I'll consider this resolved, or at least the most likely cause identified, and it is outside of segyio.

jokva closed this as completed Jan 13, 2020