
Adjoint gradients in Near2Far change with the number of processes #1555

Closed
mawc2019 opened this issue Apr 24, 2021 · 16 comments · Fixed by #1895

@mawc2019
Contributor

The adjoint gradients in the Near2Far class change with the number of MPI processes. The test was based on the example code at Near2Far Optimization with Epigraph Formulation, with the objective function replaced by

def J(alpha):
    return npa.sum(npa.abs(alpha[0,:,2])**2)  # |Ez|^2 at the first far point, summed over frequencies

opt = mpa.OptimizationProblem(
    simulation = sim,
    objective_functions = J,
    objective_arguments = ob_list,
    design_regions = [design_region],
    frequencies=frequencies,
    decay_by = 1e-6,
    decay_fields=[mp.Ez],
    maximum_run_time = 2000
)

The adjoint gradients were calculated as

x0 = 0.5*np.ones(Nx*Ny)
opt.update_design([x0])
f0, dJ_deps = opt()
g_adjoint = np.sum(dJ_deps, axis=1)

I ran the code with 1, 16, and 32 MPI processes. Their relative differences are as follows. The horizontal axis labels different pixels in the design region, and the vertical axis represents the relative differences w.r.t. the 1-process adjoint gradients, namely, 16-process adjoint gradients / 1-process adjoint gradients − 1 and 32-process adjoint gradients / 1-process adjoint gradients − 1.
[figures: per-pixel relative differences of the 16-process and 32-process adjoint gradients w.r.t. the 1-process gradients]

If we take the absolute values and plot them on a log scale, we can see that the differences are ubiquitous rather than limited to those spikes.
[figures: the same relative differences in absolute value on a log scale, for 16 and 32 processes]
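
For reference, a minimal sketch of how this comparison can be made, assuming the gradients from the separate runs have been saved to .npy files (the file names here are hypothetical):

import numpy as np
from matplotlib import pyplot as plt

# gradients computed as above, saved from runs with 1, 16, and 32 MPI processes
g1 = np.load("grad_1proc.npy")    # hypothetical file names
g16 = np.load("grad_16proc.npy")
g32 = np.load("grad_32proc.npy")

# relative differences w.r.t. the 1-process gradients
rel16 = g16/g1 - 1
rel32 = g32/g1 - 1

# plot the absolute values on a log scale, per design pixel
plt.semilogy(np.abs(rel16), label="16 processes")
plt.semilogy(np.abs(rel32), label="32 processes")
plt.xlabel("design pixel")
plt.ylabel("|relative difference|")
plt.legend()
plt.show()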

I also tested EigenmodeCoefficient, but did not see any variation of the adjoint gradients with the number of processes.

@oskooi oskooi added the bug label Apr 25, 2021
@smartalecH
Collaborator

See #1547, which should hopefully fix this.

@mawc2019
Contributor Author

Thank you, Alec. Near2Far computes the far fields from near fields on the Yee lattice, so no voxel-center averaging is needed, whereas #1547 addresses the yee_grid=False case, which does involve voxel-center averaging.

@stevengj
Collaborator

Maybe add an assertion after this line to check that f->fc->gv.owns(ix0) is true; that would verify that the fix in #1547 is not needed here.

@stevengj
Collaborator

The other thing you could do is to add a printf statement after this line to print out the iloc (ix0) and the amplitude EH0 for all the adjoint source points.

The set of source locations and amplitudes should be the same regardless of the number of processes.

(You might want to first shrink the near-to-far region in order to reduce the number of source points you need to examine, while making sure you still reproduce the bug.)
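
For comparing the dumps from two runs, a minimal sketch (assuming the printed locations and amplitudes have been collected into plain-text files; the file names and column layout here are hypothetical):

import numpy as np

def load_source_points(fname):
    # hypothetical format: one adjoint source point per line,
    # columns = integer coordinates of iloc followed by Re(EH0) and Im(EH0)
    data = np.loadtxt(fname, ndmin=2)
    # sort the rows so the comparison does not depend on output order
    return data[np.lexsort(data.T[::-1])]

pts_1proc = load_source_points("adjoint_sources_1proc.txt")
pts_16proc = load_source_points("adjoint_sources_16proc.txt")

print(pts_1proc.shape == pts_16proc.shape and np.allclose(pts_1proc, pts_16proc))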

@mochen4
Collaborator

mochen4 commented Apr 29, 2021

Thanks Steven! I will follow your suggestions and do the checks.

@mochen4
Collaborator

mochen4 commented May 8, 2021

After some experiments, I found that the assertion f->fc->gv.owns(ix0) fails; there is no problem at all if I shrink the near-to-far region. As you might expect, the size of the near-to-far region at which the assertion starts to fail depends on the number of processes.

@stevengj
Collaborator

I find that the assertion f->fc->gv.owns(ix0) fails

That's not good — can you do some more digging to print out the coordinates of ix0 and see where it is in the near2far region?

@mawc2019
Contributor Author

Thank you, @mochen4. Here I am trying to illustrate this bug using an example with only one process. In the example below, the finite-difference gradients and the adjoint gradients are inconsistent, which happens when the monitor is at certain special positions, e.g., one pixel away from the lateral PML region. In typical setups the monitor would not be placed entirely at such a position; the purpose of this example is merely to demonstrate the bug. In addition, the inconsistency may arise at new positions when the number of processes increases, which suggests that the unusual case here might be related to the case described at the beginning of this issue.

The code is as follows.

import meep as mp
import meep.adjoint as mpa
import numpy as np
from autograd import numpy as npa
from autograd import tensor_jacobian_product, grad, jacobian
import nlopt
from matplotlib import pyplot as plt
SiO2 = mp.Medium(index=1.45)
SiN = mp.Medium(index=2.02)

resolution,design_region_resolution = 40,10

Lx,Ly = 4,4
dpml = 1
design_region_width,design_region_height = 2,0.5
monitor_height,monitor_length = 1,0.5
monitor_pml_distance = 1/resolution # distance from the lateral PML
pml_layers = [mp.PML(dpml)]
cell_size = mp.Vector3(Lx+2*dpml,Ly+2*dpml)

fcen,fwidth = 2,0.2
source_center,source_size = mp.Vector3(0,-Ly/4),mp.Vector3(Lx+2*dpml,0)
src = mp.GaussianSource(frequency=fcen,fwidth=fwidth,is_integrated=True)
source = [mp.Source(src,component=mp.Ez,center=source_center,size=source_size)]
frequencies = [fcen-fwidth/2,fcen,fcen+fwidth/2]

Nx,Ny = int(design_region_resolution*design_region_width),int(design_region_resolution*design_region_height)
design_variables = mp.MaterialGrid(mp.Vector3(Nx,Ny),SiO2,SiN,grid_type='U_MEAN')
design_region = mpa.DesignRegion(design_variables,volume=mp.Volume(center=mp.Vector3(0,design_region_height/2),size=mp.Vector3(design_region_width,design_region_height)))

geometry = [mp.Block(center=design_region.center, size=design_region.size, material=design_variables)]
sim = mp.Simulation(cell_size=cell_size,boundary_layers=pml_layers,geometry=geometry,sources=source,default_material=SiO2,resolution=resolution)

far_point = [mp.Vector3(10,10,0)]
NearRegions = [mp.Near2FarRegion(center=mp.Vector3(monitor_pml_distance-Lx/2,monitor_height), size=mp.Vector3(0,monitor_length), weight=+1)]
FarFields = mpa.Near2FarFields(sim, NearRegions, far_point)
ob_list = [FarFields]

def J(far_data):
    return npa.abs(far_data[0,1,2])**2  # |Ez|^2 at the far point, at the center frequency

opt = mpa.OptimizationProblem(
    simulation = sim,
    objective_functions = J,
    objective_arguments = ob_list,
    design_regions = [design_region],
    frequencies=frequencies,
    decay_by = 1e-5,
    decay_fields=[mp.Ez]
)

x0 = 0.5*np.ones(Nx*Ny)
opt.update_design([x0])
f0, dJ_deps = opt()

db = 1e-4
choose = 10
np.random.seed(2021)

g_fd_idx, idx = opt.calculate_fd_gradient(num_gradients=choose,db=db)
g_fd_idx = np.array(g_fd_idx).flatten()
g_ad_full = np.sum(dJ_deps, axis=1).flatten()

g_fd = np.multiply(g_fd_idx,10**4)
g_ad = np.multiply(g_ad_full[idx],10**4)

min_g = np.min(g_fd)
max_g = np.max(g_fd)

(m, b) = np.polyfit(np.squeeze(g_fd), g_ad, 1)
plt.plot(g_fd,g_ad,"o")
plt.plot([min_g, max_g],[min_g, max_g],label='y=x for comparison')
plt.plot([min_g, max_g],[m*min_g+b, m*max_g+b],'--',label='Linear fit')
plt.xlabel('Finite Difference Gradient × $10^4$')
plt.ylabel('Adjoint Gradient × $10^4$')
plt.legend()
plt.grid(True)

The result is as follows, which shows no linear correlation.
[figure: adjoint gradient vs. finite-difference gradient, showing no linear correlation]

The inconsistency is not restricted to the Ez component. For example, if the component in the source is replaced by mp.Hz and the function J is changed to npa.abs(far_data[0,1,5])**2 accordingly, the result is then as follows, showing only a weak linear correlation with an incorrect slope.
[figure: adjoint gradient vs. finite-difference gradient for the Hz case, showing only a weak linear correlation with an incorrect slope]

The inconsistency is absent when monitor_pml_distance takes most other values, such as 0 or 2/resolution. But as the number of processes increases, the inconsistency may appear at new positions; for example, with 16 processes it appears when monitor_pml_distance equals Lx/2.
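
In addition to the scatter plot, the mismatch can be summarized with a single number; a minimal sketch using the arrays defined in the script above:

# overall relative error between the adjoint and finite-difference gradients
rel_err = np.linalg.norm(g_ad - g_fd) / np.linalg.norm(g_fd)
print("relative error:", rel_err)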

@mochen4
Collaborator

mochen4 commented May 23, 2021

After some more digging, I found that in LOOP_OVER_IVECS, is and ie may or may not be owned by gv. The test case has a near2far region along the x direction. gv.little_corner().x() is strictly less than (f->is).x() when f->fc->gv.owns(f->is), and gv.big_corner().x() is strictly larger than (f->ie).x() when f->fc->gv.owns(f->ie). However, sometimes is (or ie) isn't owned, and its value equals little_corner() (or big_corner()).

I only used one process in this experiment, and is was unowned much more often than ie.

@stevengj
Copy link
Collaborator

Can you print out the weight w for those points? Maybe they have zero weight?

@stevengj
Collaborator

stevengj commented May 24, 2021

Also check that the is and ie give disjoint (non-intersecting) grid volumes when you look at all the chunks (for a given field component). (Technically, this is only true after applying the symmetry transformations, but in your test case you have no symmetries.)

@stevengj
Collaborator

stevengj commented May 24, 2021

It looks like loop_in_chunks does not actually guarantee that the points are owned, although they are in the chunk and they should be disjoint between chunks (after symmetry mapping). A first draft at fixing this is in https://github.com/NanoComp/meep/tree/loop_in_chunks_fixes

@smartalecH
Collaborator

A first draft at fixing this is in...

@stevengj do you have any updates on this fix? It might be worth opening it as a PR to see whether the checks pass (and to check it against other PRs).

@stevengj
Collaborator

stevengj commented Jan 5, 2022

@mochen4, weren't you working on this at some point?

@mochen4
Collaborator

mochen4 commented Jan 5, 2022

If I remember correctly, tests/symmetry.cpp and tests/near2far.cpp (which uses symmetry) were failing. When symmetry is used, the corners of the chunk might need to be shifted, but I couldn't find a way to do that which passed all the tests. One particular issue I vaguely remember is that, in test_1d_periodic_mirror, the point at one end of the geometry got updated when it wasn't supposed to. Then you decided to look into the issue yourself? @stevengj

@mawc2019
Contributor Author

mawc2019 commented Nov 23, 2022

A similar issue with EigenmodeCoefficient remains in the latest version of Meep, namely v1.25.0. Some 3d tests are demonstrated here. The relative differences between different numbers of processes range from a few percent to 80 percent.
