TraceHeader duplication is time consuming #423
No, it's not normal at all, and your code looks perfectly fine. Does your system normally have this bad write throughput? Do any other segyio versions behave differently? I've not observed header copying being crazy slow like you describe, (un)fortunately. Does writing traces, or even smaller files, go as poorly? That being said, if you're not planning on modifying the headers as you go, a better way of approaching this problem is to first do a regular file copy, e.g. through shutil.copyfile, and then open the newly-created file with a regular open(mode='r+'). That way, segyio should at least not slow down on the header copy, and you get to see if writing traces is as bad.
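For example, something along these lines (just a sketch; the file names and the trace edit are placeholders for whatever you actually need to change):

```python
import shutil

import numpy as np
import segyio

# Raw byte copy first - headers and traces come along at filesystem speed.
shutil.copyfile('in.sgy', 'out.sgy')

# Open the copy read-write and only touch the parts that actually change.
with segyio.open('out.sgy', mode='r+') as dst:
    zeros = np.zeros(len(dst.samples), dtype=np.float32)
    for i in range(dst.tracecount):
        dst.trace[i] = zeros  # placeholder edit: blank out every trace
```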
@jokva Thanks for your quick reply! I have done the trace initialization as follows, and it's not time consuming.
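(Roughly the following sketch; file names are placeholders, and the output file is assumed to already exist with matching geometry.)

```python
import segyio

# Batch inline copy: write whole inlines at a time.
with segyio.open('in.sgy') as iFile, \
        segyio.open('out.sgy', mode='r+') as oFile:
    for ilno in iFile.ilines:
        oFile.iline[ilno] = iFile.iline[ilno]
```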
I will check some other versions of segyio. Initially I used copyfile to duplicate the segy file, but at first glance it also seemed time consuming for big files, so I tried segyio.create instead. I haven't compared the time consumed by the two approaches, though. Thanks for your advice, I will check that as well.
It is, but that's just the performance of your system - copyfile is stupidly copying bytes, which is the fastest it can be. segyio can only beat that if you need to modify, because that way, things are just written once.
Ok, good - inline writing in batches like you do here is typically the fastest operation in segyio, so that's at least something. Do you mind unrolling the trace copy like this, and then getting a feeling for when or where it slows down?
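Perhaps something like this sketch (file names are placeholders; tqdm is only there for the progress read-out):

```python
import segyio
from tqdm import tqdm

# Unroll the bulk header copy into an explicit per-trace loop so you can
# watch the throughput and see where it slows down.
with segyio.open('in.sgy') as src:
    spec = segyio.tools.metadata(src)
    with segyio.create('out.sgy', spec) as dst:
        dst.text[0] = src.text[0]
        dst.bin = src.bin
        for i in tqdm(range(src.tracecount)):
            dst.header[i] = src.header[i]
```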
@jokva What's more, this phenomenon does not depend on the segyio version; I have checked 1.8.3, 1.8.6 and 1.8.8. I then switched to shutil.copyfile, and the duplication takes only about 4 minutes. Note that all of the tests above were on Linux (RHEL 6.5) with an anaconda environment (conda/4.7.12 requests/2.22.0 CPython/3.6.9 Linux/2.6.32-431.el6.x86_64 rhel/6.5 glibc/2.12). However, on my Windows 10 PC, the time consumed for the same trace header duplication is about half an hour, which is acceptable. In a word, copyfile performs best. Thanks for your reply.
Hi, I ran some experiments, but was unable to reproduce the poor performance on my (Debian 9) machine. I'll keep looking into it, but I found no obvious reason (in code) why writing multiple headers should be slow. For what it's worth, I used tqdm and saw a throughput of ~5000 iterations per second.
Hi @jokva, we can confirm this on Ubuntu 18.04 with segyio==1.9.0, running:
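Roughly this kind of copy (a sketch with placeholder file names, not the exact script):

```python
import segyio

with segyio.open('big.sgy') as src:
    spec = segyio.tools.metadata(src)
    with segyio.create('big-copy.sgy', spec) as dst:
        dst.text[0] = src.text[0]
        dst.bin = src.bin
        # Per-trace header copy - this is the step that crawls.
        for i in range(src.tracecount):
            dst.header[i] = src.header[i]
        dst.trace = src.trace
```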
It takes a long time to copy the headers. For a large SEG-Y file (700 GB+), it had not completed after 2 days. Read throughput drops to a few Mbit/s.
Ok, interesting. There's obviously something going on, even though I haven't been able to reproduce it yet. I'll take a new look into it. Out of curiosity, is it also slow if you're just walking the headers, without writing them back to disk?
@seisgo So I'm still not able to reproduce it, but I would like you to test something for me. What happens if you change your code to:
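Something along these lines (a sketch with placeholder file names):

```python
import segyio

with segyio.open('in.sgy') as iFile:
    spec = segyio.tools.metadata(iFile)
    with segyio.create('out.sgy', spec) as oFile:
        oFile.text[0] = iFile.text[0]
        oFile.bin = iFile.bin
        # Write the last header first, so the file is extended to its
        # full size before the sequential header copy starts.
        oFile.header[-1] = iFile.header[-1]
        oFile.header = iFile.header
```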
Second-to-last line, i.e. write the last header first, then copy all of them. Is it as slow?
@jokva Sorry for the late reply.
@jokva Please check if it is the same on your platform.
What kind of mounted hard disk is it? Does it happen to be a network drive (maybe NFS)? Are there any particular mount options? What if you do the oFile.header[-1] trick there?
@jokva I have tested the oFile.header[-1] trick; it makes no difference.
Is it mounted with sync, by any chance? Check with mount.
A couple of other experiments would be interesting too - I'm afraid I can't debug it myself as I don't have an NTFS drive around. Try writing both headers and traces in tandem:
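For example (a sketch with placeholder file names):

```python
import segyio

# Write header and trace together per index, so the output file grows
# sequentially instead of seeking past unwritten (sparse) regions.
with segyio.open('in.sgy') as src:
    spec = segyio.tools.metadata(src)
    with segyio.create('out.sgy', spec) as dst:
        dst.text[0] = src.text[0]
        dst.bin = src.bin
        for i in range(src.tracecount):
            dst.header[i] = src.header[i]
            dst.trace[i] = src.trace[i]
```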
If this makes it fast again, it's probably the seeking past unwritten data that tanks the performance.
edit: also, how is CPU usage? Is it high? Does ntfs-3g use a lot of CPU, or is it mostly idle?
edit2: which is exactly what you're doing.
@jokva In comparison, the output of 'mount | grep home' is:
Yes, writing both headers and traces makes a difference. When the output file is on an ntfs-3g mounted directory, the copying speed is about 1000 it/s and CPU usage is about 50%. When the output file is on an ext4 mounted directory (/home), the speed is a bit faster, about 2000 it/s with fluctuation, and CPU usage is higher than 50%, sometimes even approaching 100%.
It's not a great sign that CPU utilisation is so high even on ext4, but having >50% better performance on ext4 over ntfs-3g is expected. I don't think this is a problem with segyio, other than maybe documentation; it's a consequence of ntfs-3g. I'll get around to adding a FAQ entry for slow sparse writes, since that behaviour is not obvious.
Yes, from the tests above, it is not a problem with segyio. Maybe I didn't make myself clear: the copying speed on ext4 is about twice that on ntfs-3g.
I've added a proposed fix for this to the readme: #434
I'll consider this resolved, or at least the most likely cause identified; it is outside of segyio.
segyio version: 1.8.6
I created a new segy file based on an existing one and duplicated the trace headers from the existing file. However, this process is time consuming, especially for big data. For example, the existing segy file is 5.8 GB; after about 4 hours, the new segy file was only about 1 GB. The time consumed actually depends on the trace count of the segy file; here the trace count is 641*1851 = 1186491.
The following is part of my code:
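(A reconstruction along the lines of the 'create' example in the segyio documentation; the file names are placeholders.)

```python
import segyio

with segyio.open('in.sgy') as iFile:
    spec = segyio.spec()
    spec.sorting = iFile.sorting
    spec.format = iFile.format
    spec.samples = iFile.samples
    spec.ilines = iFile.ilines
    spec.xlines = iFile.xlines

    with segyio.create('out.sgy', spec) as oFile:
        oFile.text[0] = iFile.text[0]
        oFile.bin = iFile.bin
        # Duplicate every trace header from the existing file - the step
        # that turns out to be slow.
        for i in range(iFile.tracecount):
            oFile.header[i] = iFile.header[i]
```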
This code is almost the same as the 'create' example in the segyio documentation.
So, is it normal for this process to be so time consuming?
If not, could anyone give a suggestion for copying trace headers when creating a new segy file?
Thanks!