
Gather scatter #52 (Draft)
pmarguinaud wants to merge 58 commits into develop
Conversation

@pmarguinaud (Contributor)

This is a proposal for improving the current gather/scatter methods.

Some minor modifications have to be applied to Atlas:

  • fields and views of type char
  • atlas::vector dimensioned and indexed by size_t
  • a size method added to distribution

I have assumed that for a distributed field, jloc1 < jloc2 <=> jglo1 < jglo2, where jglo is the global index and jloc the local index (on the current task), but I can do without this assumption.

The proposal, although not finalized yet:

  • processes fields of any type and any rank
  • distributes fields with multiple dimensions (multiple levels or variables) evenly
  • could be extended to handle fields with NPROMA/NGPBLKS dimensions
  • is much faster than the current implementation: on my reference test case (2.5 km global grid, 300 fields), the gather takes 25 s versus 3 minutes on 4 Rome nodes

The proposal includes two new features (these should be implemented as new methods on views):

  • convert any view to a view of byte elements, simply by creating a view with an extra dimension
  • drop a dimension of a view by fixing its index, as in Fortran: if ZFLD is an array of rank 4, it is possible to derive rank-3 arrays (e.g. ZFLD(:,:,4,:)) simply by setting the value of one dimension

These templates are included in arrayViewHelpers.h; they make it possible to recursively build a list of atlas::array::ArrayView<byte,2> views from any view (any type, any rank). The first dimension of the ArrayView<byte,2> is the fastest varying dimension of the original view, and the second dimension has a length equal to the size of the element type.
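To make the byte-view idea concrete, here is a minimal sketch in plain C++ (hypothetical names, not the actual arrayViewHelpers.h code), assuming a contiguous source view: an array of n elements of type T becomes a 2D byte array of shape (n, sizeof(T)).

#include <cstddef>
#include <cstdio>

// Sketch: view a contiguous 1D array of T as a 2D byte array of shape
// (n, sizeof (T)); (i, b) addresses byte b of element i.
template <typename T>
struct ByteView2D
{
  std::byte * data;
  std::size_t n; // number of elements in the original view

  std::byte & operator() (std::size_t i, std::size_t b)
  {
    return data[i * sizeof (T) + b];
  }
};

int main ()
{
  double fld[4] = {1.0, 2.0, 3.0, 4.0};
  ByteView2D<double> bv{reinterpret_cast<std::byte *> (fld), 4};
  // Copy element 0 into element 3 byte by byte, with no knowledge of the
  // element type: this is what lets a gather/scatter handle any type.
  for (std::size_t b = 0; b < sizeof (double); b++)
    bv(3, b) = bv(0, b);
  std::printf ("%g\n", fld[3]); // prints 1
}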

The GatherScatter class handles only atlas::array::ArrayView<byte,2> views, which keeps the code quite simple, as all type and rank issues have already been resolved. It could be even simpler if we could assume that the fastest varying dimension of Atlas fields has a stride of 1 (this should always be the case); we could then restrict ourselves to atlas::array::ArrayView<byte,1> views.

Some further possible improvements:

  • overlap buffer packing/unpacking with communications
  • several dimensions could be parallelized over (the number of MPI tasks, the number of fields, the size of the global and distributed fields); choosing the best dimension for OpenMP depends on the test case; see the sketch below
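As a sketch of the trade-off in the second bullet (hypothetical layout: nfld byte fields of ngptot elements each, packed field-major into a send buffer), the question is which loop, or both via collapse, to hand to OpenMP:

#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical packing loop; parallelizing over jfld, over jloc, or over
// both (collapse) performs differently depending on the test case.
void pack (const std::vector<const std::byte *> & flds, std::size_t ngptot,
           std::size_t elsize, std::byte * buf)
{
  const std::size_t nfld = flds.size ();
  #pragma omp parallel for collapse(2)
  for (std::size_t jfld = 0; jfld < nfld; jfld++)
    for (std::size_t jloc = 0; jloc < ngptot; jloc++)
      std::memcpy (buf + (jfld * ngptot + jloc) * elsize,
                   flds[jfld] + jloc * elsize, elsize);
}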

@pmarguinaud (Contributor, Author) commented May 26, 2020

Hello Willem,

I have looked at the slicer function, which is nice for people working with views of a known rank, but I cannot see how I could use it to reduce a view with N dimensions to a list of views of dimension 1. I have to do that recursively, but I cannot write code such as:

auto slice = ZFLD.slice( Range::all(), Range::all(), 4, Range::all() );

because I need to handle a view of an arbitrary rank, and the code above deals with views of rank 4 only.

Hence I thought I could create a function which removes a dimension from a view, whatever its rank:

auto w = dropDimension (v, 4, 2);
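For illustration, the mechanism can be sketched with a hand-rolled runtime-rank view (this is only an illustration, not the real helper, which works on Atlas views; the argument order is my reading of the call above):

#include <cstddef>
#include <vector>

// A minimal strided view whose rank is a runtime quantity, so a dimension
// can be removed whatever the rank, unlike slice(), whose rank is fixed
// at compile time.
struct DynView
{
  double * data;
  std::vector<std::size_t> shape;
  std::vector<std::ptrdiff_t> strides;
};

// Remove dimension `dim` by pinning it at index `ind`: the result has
// rank N-1, and its origin is shifted by ind * strides[dim].
DynView dropDimension (const DynView & v, std::size_t dim, std::size_t ind)
{
  DynView w;
  w.data = v.data + static_cast<std::ptrdiff_t> (ind) * v.strides[dim];
  for (std::size_t d = 0; d < v.shape.size (); d++)
    if (d != dim)
      {
        w.shape.push_back (v.shape[d]);
        w.strides.push_back (v.strides[d]);
      }
  return w;
}

Applying this repeatedly until rank 1 yields the list of 1D views.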

Please look at dropDimension (arrayViewHelpers.h) and reconsider this point, or tell me how to address my basic need, which is to create a list of views of dimension 1 from a view of any rank.

Philippe

@pmarguinaud (Contributor, Author)

Hello again,

This is the code you wrote:

! Fortran
type(atlas_Field) :: field_GFL_4
field_GFL_4 = atlas_Field( GFL( 1:NPROMA, 1:NLEV, 4, 1:NBLK ) )

In this case, the fastest varying dimension is 1:NPROMA; it is indeed contiguous, with a stride of 1. This is what I would like to assume: that fields involved in I/O have a fastest varying dimension which is contiguous.
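In code, the assumption amounts to something like this (a sketch against a generic view type; the stride accessor mirrors what Atlas views expose, but this is not the literal Atlas API):

// The fastest varying (innermost) dimension must have stride 1;
// for the Fortran field above that dimension is 1:NPROMA.
template <typename ViewT>
bool innermostContiguous (const ViewT & v, int rank)
{
  return v.stride (rank - 1) == 1;
}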

Is it reasonable to make this assumption or not?

Philippe

@pmarguinaud (Contributor, Author) commented May 26, 2020

Hello again,

What you ask for:

FunctionSpace structured_columns = functionspace::StructuredColumns( grid ,
                                        option::levels(nlev));
Field wind = structured_columns.createField<double>("wind",option::variables(2));
    // shape={ ngp(partition), nlev=100, nvar=2} ,  rank=3
Field u_global = structured_columns.createField<double>("u",option::global());
   // shape={ ngp(GLOBAL), nlev=100 }  , rank=2

gather( wind.slice( Range::all(), Range::all(), 0 ).data(), .... , u_global.data(), ... );
    // ugly API that needs to manually provide strides and shapes etc...

This looks very feasible to me with what I propose. In fact, the ioFieldDescriptor lets the user distribute the fields in any possible way: it would be possible to choose a target processor for any of the 1:NGPTOT or 1:NPROMA/1:NGPBLKS fields.
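As a purely hypothetical illustration of that degree of freedom, the descriptor could carry an explicit owner task per global grid point, rather than deriving it from a fixed rule:

#include <cstddef>
#include <vector>

// Hypothetical: one owner task per global grid point; any mapping is
// possible, contiguous chunks being just one choice.
std::vector<int> owners (std::size_t ngptotg, int ntasks)
{
  std::vector<int> o (ngptotg);
  for (std::size_t j = 0; j < ngptotg; j++)
    o[j] = static_cast<int> (j * static_cast<std::size_t> (ntasks) / ngptotg);
  return o;
}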

Philippe

@wdeconinck (Member)

You are completely right about the slicing :) I was considering this as well after I wrote my reply, but still wondered whether it would be useful or whether you were already aware of it. It will be good to have.
Note however that the return type of a slice is LocalView. The semantic difference is that an ArrayView gets created from an Array, which contains either a NativeDataStore or a GridToolsDataStore. The latter is GPU-aware, and the ArrayView can then be either on host or device. A LocalView is more straightforward, created as a slice from an ArrayView, but it has exactly the same API.

This will perhaps mess up the API of gather/scatter, which expects an ArrayView. Perhaps it can be templated to accept both LocalView and ArrayView.
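Something like this could be the shape of it (a signature sketch only, not the actual Atlas API):

// Since LocalView and ArrayView expose the same accessors, the
// gather/scatter entry points could be templated on the view type
// instead of naming ArrayView explicitly.
template <typename LocalViewT, typename GlobalViewT>
void gather (const LocalViewT & local, GlobalViewT & global)
{
  // pack `local` into byte buffers, communicate, unpack into `global`;
  // only shape()/stride()/operator() are needed, and both types provide them.
}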

@pmarguinaud (Contributor, Author)

I have to admit that I did not have this in mind at all:

auto T_global = Field( "T", array::make_datatype<double>(), array::make_shape(nlev, ngp_global));
auto T_local  = Field( "T", array::make_datatype<double>(), array::make_shape(ngp_local, nlev));

scatter(T_global, T_local);
gather(T_local,T_global);

I do not think this happens in ARPEGE/IFS for grid-point fields; we only do that kind of thing for spectral data, and spectral data requires a slightly different treatment.

Achieving something like that is probably possible; I need to think about it. It is clear we would need to specify, for each field, which dimension is the grid index.

In any case, I think the assumption I suggested making (i.e. a contiguous fastest varying dimension) would not survive it.

Philippe

@pmarguinaud (Contributor, Author) commented May 27, 2020

Hello Willem,

I can make this code work:

auto T_global = Field( "T", array::make_datatype<double>(), array::make_shape(nlev, ngp_global));
auto T_local  = Field( "T", array::make_datatype<double>(), array::make_shape(ngp_local, nlev));

scatter(T_global, T_local);
gather(T_local,T_global);

It is sufficient that the user explicitly states which dimension is the grid index; I can then drop all the other dimensions while reducing the view to a list of 1D views. The same goes for the NPROMA/NGPBLKS dimensions; I can easily add a blocking dimension. Note that fields without NPROMA (that is, with a dimension 1:NGPTOT) are just a special case (a single block, with NPROMA=NGPTOT).
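A sketch of the index mapping (hypothetical names): a flat grid index jloc in 0:NGPTOT-1 maps to a (jlon, jblk) pair of an NPROMA x NGPBLKS layout, and the unblocked case falls out as a single block.

#include <cstddef>

// Hypothetical blocked-layout mapping.
struct Blocking
{
  std::size_t nproma;
  std::size_t jlon (std::size_t jloc) const { return jloc % nproma; }
  std::size_t jblk (std::size_t jloc) const { return jloc / nproma; }
};

// A field without NPROMA (dimension 1:NGPTOT) is the special case
// Blocking{ngptot}: a single block with NPROMA = NGPTOT.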

To be more explicit, an ioFieldDesc object is essentially this:

class ioFieldDesc
{
  atlas::array::ArrayView<byte,2> _v;    // view of a grid section: (1:NGPTOT) x (1:sizeof (double) or sizeof (long), etc.)
  const std::vector<atlas::idx_t> _ind;  // values of the dimensions we had to drop to build this view
  const atlas::Field & _f;               // reference to the original field (access to its metadata)
};

ioFieldDesc objects therefore hold a chunk of field data and associated metadata (for fields with a level, the _ind member would hold the level index).

I do not know a lot about the difference between ArrayView and LocalView, but I am pretty sure we can sort out the problem using templates.

I have also made this assumption:

I have assumed that for a distributed field, jloc1 < jloc2 <=> jglo1 < jglo2, where jglo is the global index and jloc the local index (on the current task), but I can do without this assumption.

This simplifies things (I do not need a functionspace anymore, but only the distribution object). I would like you to comment on that.

You also raise the issue of metadata. Metadata are small and of unpredictable size; it would therefore be better not to pack them with the field data, but to distribute them using an allgather instead. The situation would be different for an I/O server, but we can look at that later.

Please also keep in mind that we are breaking the MPI standard (packing double/long data as bytes), but this allows much more flexibility.
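Concretely, the pattern in question looks like this (sketch):

#include <mpi.h>

// Typed (double) data sent as raw bytes: MPI can no longer check or
// convert the representation, and the receiver must know the actual
// type out of band.
void send_as_bytes (const double * fld, int n, int dest, MPI_Comm comm)
{
  MPI_Send (fld, n * static_cast<int> (sizeof (double)), MPI_BYTE, dest,
            /* tag = */ 0, comm);
}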

@wdeconinck (Member)

  • Very nice that the transposition could work! A use case would be to distribute horizontal levels obtained e.g. from GRIB into a different (transposed or NPROMA-blocked) configuration.

  • The assumption that jloc1 < jloc2 <=> jglo1 < jglo2 is not guaranteed. Especially as I have added mesh renumbering algorithms (Hilbert space filling curve).

  • It would be nice if we did not need to pass a functionspace. Currently one GatherScatter object exists per functionspace. Could we not pass the extra information contained within the functionspace to the GatherScatter setup? We can assume that all fields will share the same functionspace.

  • Why is casting to MPI_CHAR breaking the standard? We are at least not doing any reductions, only message passing.

@pmarguinaud (Contributor, Author)

I will provide some tests/examples with NPROMA/NGPBLKS and NGPTOT/NFLEVG. It will take a few days.

@wdeconinck (Member)

> I will provide some tests/examples with NPROMA/NGPBLKS and NGPTOT/NFLEVG. It will take a few days.

Thank you Philippe, having this functionality will be really great.

@pmarguinaud (Contributor, Author)

Hello Willem,

I have added two new test cases (I use Fortran order to describe them):

  • (1:NFLEVG,1:NGPTOT,1:NFLDS) <-> list of (1:NGPTOTG) fields

  • (1:NPROMA,1:NFLEVG,1:NGPBLKS) <-> list of (1:NGPTOTG) fields

Both of them are included in test_gatherscatter.cc. Please look at them and see whether you find them sensible.

There are other points we should sort out:

  • the LocalView/ArrayView distinction; it is not 100% clear to me, but you said I have to do something about it
  • the assumption jloc1 < jloc2 <=> jglo1 < jglo2, and how I can do without it; maybe an example where it does not hold would help me

Regards,

Philippe

@wdeconinck wdeconinck added Status: In Progress This is currently being worked on Type: Enhancement New feature or request labels Jun 12, 2020
wdeconinck added a commit to wdeconinck/atlas that referenced this pull request Jan 8, 2021
…cal-64 to develop

* commit 'b26965ddfd3c31feecb07578de61ae20a2742d2d':
  ATLAS-317 Prevent integer overflows if ATLAS_BITS_LOCAL=64
  ATLAS-317 Support ATLAS_BITS_LOCAL=64 in tests