nc_get_vars incredibly slow in Windows compared to Linux #2721

Open
abhibaruah opened this issue Jul 19, 2023 · 18 comments

@abhibaruah

OS: Windows 10
NetCDF version: 4.9.1

I am trying to read a 3D double variable (2000 x 512 x 512) from a netCDF4 file with the following parameters:
start: {0, 0, 0}
count: {1000, 256, 256}
stride: {2, 2, 2}
chunk size: {20, 10, 10}
shuffle: no
deflate: yes
deflate_level: 6

I time the call to nc_get_vars.
On Debian 11, it takes ~25 seconds.
On Windows 10, it takes ~130 seconds.

I would expect Windows to be slightly slower, but >5x slowdown is unexpected.
I see similar slowdowns with 'nc_get_vars_double'.

In contrast, using 'nc_get_var_double' or 'nc_get_var' to read the whole variable is significantly faster (~3 sec on Linux and ~1 sec on Windows).

  1. Is there a way to optimize the performance of 'nc_get_vars' or 'nc_get_vars_double' so that Windows performance is closer to Linux performance?

  2. Is reading the whole variable using 'nc_get_var' to memory and then slicing it later an option? I have seen that there were some discussions regarding this (Make netcdf-4 use the the stride > 1 facilities of hdf5 #908) and that a submission was made to make strided reads faster. But for my variable, reading the whole variable still seems to be significantly faster than strided reads (especially on Windows)

Please find the link to the nc file here.
Here is my code:

#include <stdio.h>
#include <string.h>
#include <netcdf.h>
#include <cstdlib>
#include <iostream>
#include <chrono>

int
main()
{
    int status;
    int ncid;
    int varid;

    // Size of the strided result; must match count[] below.
    size_t elems_x = 256;
    size_t elems_y = 256;
    size_t elems_z = 1000;
    double* outData = (double*)malloc(elems_x * elems_y * elems_z * sizeof(double));
    if (outData == NULL) {
        printf("Could not allocate output buffer.\n");
        return 1;
    }

    size_t start[] = {0, 0, 0};
    size_t count[] = {1000, 256, 256};
    ptrdiff_t stride[] = {2, 2, 2};

    // open the NetCDF-4 file; stop here if that fails
    status = nc_open("repro_nc4file.nc", NC_NOWRITE, &ncid);
    if (status != NC_NOERR) {
        printf("Could not open file: %s\n", nc_strerror(status));
        free(outData);
        return 1;
    }

    // get the varid
    status = nc_inq_varid(ncid, "my_var", &varid);
    printf("status after inq var = %d\n", status);
    printf("varid = %d\n", varid);

    // get the strided subset, timing only the nc_get_vars call
    auto timestart = std::chrono::high_resolution_clock::now();
    status = nc_get_vars(ncid, varid, start, count, stride, outData);
    auto timeend = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::seconds>(timeend - timestart);
    std::cout << "Execution time: " << duration.count() << " seconds" << std::endl;
    printf("status after getting strided subset = %d\n", status);

    // close the file
    status = nc_close(ncid);
    printf("status after close = %d\n", status);

    printf("End of test.\n\n");

    free(outData);
    return 0;
}

@edwardhartnett
Contributor

I would rewrite the code to try to use vara to see if the speed problem goes away.

@abhibaruah
Author

You mean use vara to read the values with stride 1 and then do the slicing later?

@edwardhartnett
Contributor

Use vara and jump around to get the slicing you need, so you are reading the exact same data, but without vars.

@abhibaruah
Author

abhibaruah commented Jul 19, 2023

Hello Ed,
I tried your recommendation. The issue is that to use 'nc_get_vara', I'll have to read twice as many elements along each dimension (since the stride in my original case is 2), which is 8x as many elements in total. So, instead of 1000 x 256 x 256 elements, I have to read 2000 x 512 x 512 elements.

Even with nc_get_vara, I still find that Windows is significantly slower:

Windows time: 102 seconds
Linux time: 19 seconds

The only change I made to the previous code is to replace
status = nc_get_vars(ncid, varid, start, count, stride, outData);
with
status = nc_get_vara(ncid, varid, start, count, outData);

And
int elems_x = 512;
int elems_y = 512;
int elems_z = 2000;

@WardF
Member

WardF commented Jul 20, 2023

I am taking a look at this to see if I can determine if the slowdown is in libnetcdf, or if it is something in libhdf5.

@abhibaruah a couple questions, if I may, to ensure I'm on the same page.

  1. When you say Windows, you mean Visual Studio, correct? Or a gcc variant on Windows?
  2. What version of libhdf5 are you linking against?

Since we're using libhdf5 for file access, my fear is that this is an issue in libhdf5; that may limit our ability to address this. But it's not necessarily the case. I'll start by reproducing the issue, and go from there :).

@abhibaruah
Author

Thanks @WardF for taking a look.

  1. Yes, I am using Visual Studio (VS2019 v16.11.7)
  2. I am linking against HDF5 v1.10.10

@DennisHeimbigner
Collaborator

I recall that this issue was raised some time ago. If memory serves, we proposed to convert vars code to use the corresponding HDF5 operations (I assume we are talking netcdf-4 and not netcdf-3). But apparently this proposal never got implemented.

@abhibaruah
Author

Was the proposed change to use the corresponding HDF5 operations only for Windows?
I ask because, for my use case, the Linux time is reasonable (~20 sec), whereas the Windows time is not (>100 sec).

@WardF
Member

WardF commented Aug 4, 2023

I'm making some progress on this; I haven't narrowed it down to a solution, yet, but I'm able to replicate the observed issue using netCDF v4.9.1 and HDF5 1.10.10. Testing with netCDF main and HDF5 1.14.1, I see performance in line with what's observed in your linux environment. I'm still trying to determine if the culprit is a change in the netCDF code, or if it's a change in the HDF5 code.

@WardF
Member

WardF commented Aug 11, 2023

@abhibaruah I'm seeing some mostly consistent results; out of curiosity, can you give it a try with v4.9.2?

@abhibaruah
Author

Hello @WardF ,
When you say 'consistent' results, you mean consistent with the slow speeds I saw or similar to the speed on Linux?

Currently, we do not have v4.9.2 in our harness, and hence it will be difficult for me to build v4.9.2 with HDF5 v1.10.10 (will have to go through legal and administrative hoops for that).

I can download the Windows binaries from here (https://downloads.unidata.ucar.edu/netcdf/) and give it a try but I am guessing that you must have already tried it.

@WardF
Member

WardF commented Aug 11, 2023

Let me clarify, thanks :). I'm seeing results consistent with what you've described, and I'm able to reproduce them reliably. I'm not certain what the underlying issue is, but I am seeing much faster speeds using netCDF-C v4.9.2: around 45 seconds instead of > 100. That is still slightly slower than on Linux, but that could be because of the VM I'm using, etc.

I'm at a loss as to why this is only happening on Windows, and will continue trying to figure that out. I've tested with HDF5 1.10.10 as well as HDF5 1.14.1; the results are the same when using v4.9.1 (> 100 seconds), and faster when using netCDF v4.9.2 (< 50 seconds), regardless of which version of HDF5 I'm using.

@WardF
Member

WardF commented Oct 19, 2023

Just a note to follow up, HDF5 1.14.2 is out, I'm going to try to test this on Windows. I understand there are hoops to jump through, but the issue does appear to be related to the underlying HDF5 library.

@abhibaruah
Author

@WardF
I tried the repro with netCDF 4.9.2 and HDF5 1.10.11. Unfortunately, I am still seeing the same performance difference between Windows 11 and Debian 11.
Windows 11 : ~130s
Debian 11: ~11s

I am not sure why I am still seeing the slowness on Windows.
I created an HDF5 script to mimic my repro above (but with an H5 file), and the reading of the dataset is much faster (~30s).

@WardF
Member

WardF commented Dec 1, 2023

@abhibaruah thank you, that is good to know at least. The HDF5 script does suggest it is something in netCDF, although why it would be Windows-specific is puzzling. I'll pop this back to the top of the stack and see what I can sort out.

@abhibaruah
Author

Hello @WardF
Hope you are doing well. I tried the repro steps for this issue with netCDF 4.9.2 + HDF5 1.14.4.3 and I could still see the slowdown.

Windows time - ~123 s
Debian 12 time - ~15 s

Let me know if you find any new information regarding the same.

Thanks,
Abhi

@abhibaruah
Author

I also tried the netCDF repro steps with older versions of netCDF. Here are the results (in seconds).

netCDF version    Windows (s)    Linux (s)
4.6.1             284.1          228
4.8.1             17.8           10.51
4.9.1             115.55         12.25
4.9.2             140            23

Looks like the Windows regression was introduced sometime between 4.8.1 and 4.9.1.

@WardF
Member

WardF commented Aug 19, 2024

Thank you @abhibaruah, that certainly narrows it down. Thank you for bringing this back to the top of the stack, I will see what I can do to dial it in. If I can come up with a test on Windows to replicate this (I should be able to), I can do a git bisect to narrow it down even further. To answer a question I see you asked separately (while I was out of the office on PTO last week), I'm hoping to have rc2 for 4.9.3 out by the end of next week, and then moving forward with the full release barring any feedback which would prevent that. Thanks!
