Skip to content

2019 02 05

Matthias Köfferlein edited this page Feb 6, 2019 · 8 revisions

Hierarchical merge and performance on a real testcase

I'm starting the make the hierarchical processing features available to the Region class (so far, only booleans were supported). First achievement is the hierarchical merge function. "Merge" is an operation in which all connected polygons are joined into a big single one. The purpose is simplification and analysis - for example selecting polygons by area can only happen after merging.

Hierarchical merge utilizes the net formation feature and connects polygons if they touch. Each such connected cluster forms one merged polygon. The merged polygon will appear at the lowest possible level in the hierarchy at which all constituents are known.

I added a small sample P&R block into the project. Thanks to Jean-Paul Chaput for allowing me to use it. This block has about 8k standard cells. The design rules are symbolic 1µm rules, but the hierarchy is realistic.

Place-and-route blocks have a hierarchical structure of standard cells, sometimes also via cells or tie-down diodes. The routine usually is within the top cell of the block and a flat arrangement of metal wires (paths, polygons) and vias rectangles (unless these are cells on their own).

The layer table is:

1     n well
3     n implant (source/drain)
4     p implant (source/drain)
5     n implant (tie-down)
6     p implant (tie-down)
7     poly
10    contact
11    metal 1
14    via 1
16    metal 2
18    via 2
19    metal 3
21    via 3
22    metal 4
25    via 4
26    metal 5

Hierarchical processing code

Here is some code that uses the merge feature to merge the shapes from these layout's layers. It also performs boolean operations for forming source/drain and gate areas and separating poly from diffusion contacts. Both operations happen in flat and deep mode (controlled by the $flat global variable). You can set this flag externally using

klayout -b -r this_script.rb -rd flat=1

All operations are repeated 10 times, because otherwise the timer resolution won't be sufficient to capture the

Here is the script:

dss = RBA::DeepShapeStore::new

ly = RBA::Layout::new(false)
# run from the repo's root for this file:
ly.read("testdata/algo/vexriscv_clocked_r.oas.gz")

flat = ($flat.to_i != 0)

class TimedSection

  def initialize(what, &block)

    $stdout.write(what + " ... ")
    $stdout.flush

    timer = RBA::Timer::new
    timer.start

    # 10 times for more precise measurements
    10.times do 
      yield 
    end
    
    timer.stop
    
    $stdout.write(timer.to_s + "\n")
    $stdout.flush

  end

end

puts "Running merge in #{flat ? %q/flat/ : %q/deep/} mode"

ly.layer_indexes.sort { |a,b| ly.get_info(a).layer <=> ly.get_info(b).layer }.each do |li|
  
  ld = ly.get_info(li)
  
  lres = ly.layer(1000 + ld.layer, ld.datatype)
  iter = ly.top_cell.begin_shapes_rec(li)
    
  merged = nil
  TimedSection::new((flat ? "Flat" : "Deep") + " merge of layer " + ld.to_s) do
    region = flat ? RBA::Region::new(iter) : RBA::Region::new(iter, dss)
    merged = region.merged
  end
  
  ly.insert(ly.top_cell.cell_index, lres, merged)
  
end

nactive, pactive, poly, cont = [ 3, 4, 7, 10 ].collect do |layer|

  iter = ly.top_cell.begin_shapes_rec(ly.layer(layer, 0))
  flat ? RBA::Region::new(iter) : RBA::Region::new(iter, dss)

end

nsd = nil
TimedSection::new((flat ? "Flat" : "Deep") + " nactive-poly ") do 
  nsd = nactive - poly
end
  
ly.insert(ly.top_cell.cell_index, ly.layer(1200, 0), nsd)

psd = nil
TimedSection::new((flat ? "Flat" : "Deep") + " pactive-poly ") do 
  psd = pactive - poly
end
  
ly.insert(ly.top_cell.cell_index, ly.layer(1201, 0), psd)

ngate = nil
TimedSection::new((flat ? "Flat" : "Deep") + " nactive&poly ") do 
  ngate = nactive & poly
end
  
ly.insert(ly.top_cell.cell_index, ly.layer(1202, 0), ngate)

pgate = nil
TimedSection::new((flat ? "Flat" : "Deep") + " pactive&poly ") do 
  pgate = pactive & poly
end
  
ly.insert(ly.top_cell.cell_index, ly.layer(1203, 0), pgate)

poly_cont = nil
TimedSection::new((flat ? "Flat" : "Deep") + " cont&poly ") do 
  poly_cont = cont & poly
end
  
ly.insert(ly.top_cell.cell_index, ly.layer(1204, 0), poly_cont)

diff_cont = nil
TimedSection::new((flat ? "Flat" : "Deep") + " cont-poly ") do 
  diff_cont = cont - poly
end
  
ly.insert(ly.top_cell.cell_index, ly.layer(1205, 0), diff_cont)

ly.write("merged_#{flat ? %q/flat/ : %q/deep/}.oas.gz")

Performance numbers

The performance numbers on my laptop (Ubuntu 16, 64 bit) are:

Flat:

Flat merge of layer 1/0 ... 0.13s (sys), 11.05s (user), 11.177s (wall)
Flat merge of layer 3/0 ... 0s (sys), 6.81s (user), 6.808s (wall)
Flat merge of layer 4/0 ... 0s (sys), 8.9s (user), 8.899s (wall)
Flat merge of layer 5/0 ... 0s (sys), 1.3s (user), 1.3s (wall)
Flat merge of layer 6/0 ... 0s (sys), 1.23s (user), 1.232s (wall)
Flat merge of layer 7/0 ... 0.15s (sys), 13.7s (user), 13.851s (wall)
Flat merge of layer 10/0 ... 0.15s (sys), 7.92s (user), 8.076s (wall)
Flat merge of layer 11/0 ... 0.27s (sys), 22.93s (user), 23.204s (wall)
Flat merge of layer 14/0 ... 0s (sys), 0.82s (user), 0.821s (wall)
Flat merge of layer 16/0 ... 0s (sys), 1.86s (user), 1.865s (wall)
Flat merge of layer 18/0 ... 0.01s (sys), 0.84s (user), 0.848s (wall)
Flat merge of layer 19/0 ... 0.01s (sys), 2.07s (user), 2.081s (wall)
Flat merge of layer 21/0 ... 0s (sys), 0.06s (user), 0.062s (wall)
Flat merge of layer 22/0 ... 0s (sys), 0.09s (user), 0.094s (wall)
Flat merge of layer 24/0 ... 0.02s (sys), 5.43s (user), 5.446s (wall)
Flat merge of layer 25/0 ... 0s (sys), 0.05s (user), 0.054s (wall)
Flat merge of layer 26/0 ... 0s (sys), 0.11s (user), 0.103s (wall)
Flat merge of layer 255/0 ... 0.02s (sys), 2.27s (user), 2.287s (wall)
Flat nactive-poly  ... 0.46s (sys), 21.98s (user), 22.44s (wall)
Flat pactive-poly  ... 0.48s (sys), 25.67s (user), 26.146s (wall)
Flat nactive&poly  ... 1.02s (sys), 22.93s (user), 23.941s (wall)
Flat pactive&poly  ... 0.58s (sys), 24.68s (user), 25.268s (wall)
Flat cont&poly  ... 0.26s (sys), 20.98s (user), 21.237s (wall)
Flat cont-poly  ... 0.26s (sys), 22.5s (user), 22.762s (wall)

Hierarchical:

Deep merge of layer 1/0 ... 0.01s (sys), 1.75s (user), 1.744s (wall)
Deep merge of layer 3/0 ... 0s (sys), 0.29s (user), 0.296s (wall)
Deep merge of layer 4/0 ... 0s (sys), 0.32s (user), 0.32s (wall)
Deep merge of layer 5/0 ... 0s (sys), 0.25s (user), 0.253s (wall)
Deep merge of layer 6/0 ... 0s (sys), 0.23s (user), 0.239s (wall)
Deep merge of layer 7/0 ... 0s (sys), 0.39s (user), 0.386s (wall)
Deep merge of layer 10/0 ... 0s (sys), 0.49s (user), 0.496s (wall)
Deep merge of layer 11/0 ... 0.03s (sys), 12.76s (user), 12.793s (wall)
Deep merge of layer 14/0 ... 0.02s (sys), 3.54s (user), 3.559s (wall)
Deep merge of layer 16/0 ... 0.01s (sys), 4.56s (user), 4.583s (wall)
Deep merge of layer 18/0 ... 0.03s (sys), 3.51s (user), 3.532s (wall)
Deep merge of layer 19/0 ... 0.01s (sys), 3.23s (user), 3.252s (wall)
Deep merge of layer 21/0 ... 0s (sys), 0.33s (user), 0.331s (wall)
Deep merge of layer 22/0 ... 0s (sys), 0.33s (user), 0.338s (wall)
Deep merge of layer 24/0 ... 0s (sys), 0.54s (user), 0.547s (wall)
Deep merge of layer 25/0 ... 0s (sys), 0.3s (user), 0.299s (wall)
Deep merge of layer 26/0 ... 0s (sys), 0.31s (user), 0.315s (wall)
Deep merge of layer 255/0 ... 0.01s (sys), 6.07s (user), 6.073s (wall)
Deep nactive-poly  ... 0s (sys), 1.26s (user), 1.275s (wall)
Deep pactive-poly  ... 0s (sys), 1.2s (user), 1.2s (wall)
Deep nactive&poly  ... 0s (sys), 1.18s (user), 1.183s (wall)
Deep pactive&poly  ... 0s (sys), 1.16s (user), 1.17s (wall)
Deep cont&poly  ... 0.01s (sys), 1.25s (user), 1.265s (wall)
Deep cont-poly  ... 0s (sys), 1.31s (user), 1.306s (wall)

So overall a great improvement in the hierarchical frontend layers (factor 5 to 20 in layers 1-10). Still about factor 2 in layer 11 (metal 1) which has both hierarchical parts in the local wiring of the standard cells, but also flat parts from the power routing and pins. There is a 2-3x disadvantage of the hierarchical implementation in the flat layers (14-26) because the hierarchical analysis demands some overhead and there is no gain for the merge computation itself.

But the booleans pay this effort back with a 8x performance boost. That's because they are mainly frontend-driven and benefit from the hierarchical nature of the standard cells.

Total runtime of the script on my system is

  • 46s (hierarchical)
  • 210s (flat)

Another remarkable fact is the reduction of the memory footprint (maxresident on process)

  • 518M (hierarchical)
  • 1.6G (flat)