
Frequently asked questions

Why are you releasing this?

For others to use in whatever way they please, and to contribute back to the project and improve the tool. One immediate use of MIAOW is in academic research, which critically lacks any low-level GPU implementation. Our long-term vision is to advance the Open Source Hardware movement by contributing an open-source GPU implementation to it.

Has this technical documentation been scientifically peer-reviewed?

NO. It has NOT. It is currently under review.

What evidence do you have that MIAOW has any resemblance to “actual” GPUs?

We have NO CONCRETE evidence. Actual GPU designs are proprietary, and the only information we have is what has been made publicly available. By looking at our architecture whitepaper, the RTL, and the description of tradeoffs, you will see that MIAOW is one reasonable way to build a GPU; the whitepaper in particular contains an analysis supporting this.

Do you have any affiliation with GPU manufacturers?

NO.

Do real GPU designers consider this design reasonable?

We cannot share any concrete information. We have received anecdotal and informal comments to the effect that nothing here looks unreasonable or vastly different from commercial designs.

Who should NOT use MIAOW?

Do not use MIAOW if any of the following apply to you.

  1. You do not know Verilog.
  2. You do not know logic design.
  3. You do not know what a GPU is.
  4. After reading the architecture whitepaper you are still unable to understand for yourself “the description of the implementation choices, and the relative strengths and weaknesses of MIAOW relative to real GPU implementations.”

What MIAOW is NOT meant for.

  1. Generating area or power estimates for GPUs for your .* SIGARCH conference/journal paper.

What MIAOW is meant for.

  1. Understanding the design complexity or timing impact of a microarchitectural design technique.
  2. Exploring research ideas that require an RTL or gate-level representation of a GPU - reliability studies, for example.
  3. Building your own GPU when you need the core compute unit.
  4. Generating long-running address traces or running BIG workloads that are infeasible on simulators.
  5. Other uses you may think of.

DISCLAIMER 1: Any quantitative result of ANY BEHAVIOR modeled on MIAOW does NOT mean similar quantitative results will be seen on a commercial GPU.

DISCLAIMER 2: One should NOT presume MIAOW to be identical with ANY commercial GPU microarchitecture.

What are some things to be cautious about when using MIAOW?

If possible, use a register file compiler/wrapper that you have available and replace our flip-flop based design. Because of licensing issues we cannot distribute anything better. Synopsys, for example, provides a register file/SRAM compiler for academic use.
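
As a rough illustration of what that swap looks like, here is a minimal sketch of a flip-flop based register file kept behind a single wrapper module; the module name, port list, and parameters are hypothetical and do not match MIAOW’s actual register file interface. The point is that, with the storage isolated like this, the body of the wrapper can later be replaced by a compiler-generated SRAM/register-file macro without touching the rest of the RTL.

```verilog
// Hypothetical sketch only -- not MIAOW's actual register file interface.
// Keeping the storage behind one wrapper lets a flip-flop array be swapped
// for a vendor-compiled SRAM/register-file macro later on.
module regfile_wrapper #(
    parameter WIDTH = 32,
    parameter DEPTH = 256,
    parameter ADDR  = 8
) (
    input                  clk,
    input                  wr_en,
    input      [ADDR-1:0]  wr_addr,
    input      [WIDTH-1:0] wr_data,
    input      [ADDR-1:0]  rd_addr,
    output reg [WIDTH-1:0] rd_data
);
    // Flip-flop based storage; replace this body with an instantiation of a
    // compiler-generated macro if one is available to you.
    reg [WIDTH-1:0] mem [0:DEPTH-1];

    always @(posedge clk) begin
        if (wr_en)
            mem[wr_addr] <= wr_data;
        rd_data <= mem[rd_addr];   // registered read, one-cycle latency
    end
endmodule
```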

What are known problems in MIAOW?

  1. The memory system design is known to be problematic. More details coming soon.

What can Neko - the FPGA version of MIAOW do?

Neko is able to run unit tests not related to memory operations. Work on Neko is what precipitated a redesign of the memory interface; until that rework is complete, Neko cannot run memory-related instructions or the more complex benchmarks that rely upon them.

Is MIAOW a complete GPU?

NO. MIAOW attempts to do just the programmable part of what modern GPUs do - in a sense it is purely a GPGPU.

What are the license terms for this?

MIAOW RTL and benchmarks are released under the relatively permissive 3-clause BSD license. See our LICENSE file for details. The multi2sim verification/reference simulator with our patches is released under the GPL.

### Questions frequently asked by users and reviewers

Q1: How do you inject confidence into potential users of this new simulator, given that it has no reference to compare with? RTL model -- not the best choice for architecture and microarchitecture research.

Yes, we agree that an RTL model is not the right choice for architecture/microarchitecture design space exploration, nor are we recommending that. Our Github has a more detailed discussion of when to use and when not to use MIAOW.
In particular, MIAOW should be used for: understanding the microarchitecture design and complexity effects of an idea (see case studies 1 and 2), which gets overlooked and is underappreciated without RTL; and understanding the physical design implications of mechanisms (pretty much any reliability study requires low-level RTL to be meaningful). Two recent papers have shown that fault injection in high-level simulators provides a poor understanding of the underlying behaviors; one example is the HPCA 2014 paper “Understanding the Impact of Gate-Level Physical Reliability Effects on Whole Program Execution.”

Regarding the “confidence,” we re-emphasize that MIAOW is NOT meant to reproduce any known microarchitecture. It is one design (in our opinion a reasonable and good one) of the Southern Islands ISA. All of the design decisions are documented in the paper (and in our github documents). Of course it is different from any GPU product. As far as we can tell, many of the design decisions we made are similar to real products and stand up to scrutiny when investigated in that light. We have had product teams at AMD look at it and no one has said any of our design decisions are “incorrect”; we have received comments to the effect of “this all looks reasonable.” The area estimates are similar, and the differences can be attributed to things like datapath modules rather than the organization of the hardware. In our opinion, a tool like MIAOW is far more credible than many architecture simulators, though reasonable people can disagree on this. For example, the WDDD 2014 paper “gem5, GPGPU-Sim, McPAT, GPUWattch, ‘Your favorite simulator here’ Considered Harmful” outlines the outright howlers present in today’s widely used architecture simulators, which in our opinion has created misplaced confidence in them.

Q2: Is there a thorough related work section? Why does the increase in the critical path not affect the overall performance in the Thread Block Compaction case study?

The related work section has been updated, although we found that not much work has been done on open-source RTL development, especially of a GPGPU. To keep the flow of the paper easy to read, we feel it is best to fold the related work into the introduction (we have included a “Related work” paragraph header). Our paper includes about 40 references, so we respectfully submit that it gives a scholarly and thorough treatment of the related work (albeit with an unconventional organization that eschews a named related work section).

Overall performance with TBC remained the same because while the critical paths increased, the number of cycles needed to complete workloads decreased. This has been clarified in the paper.

Q3: How is "similar range" defined for CPI comparison to NVIDIA GPUs?

Similar range here means that, for a given workload, the CPI numbers are not completely out of bounds. We are able to reason about why MIAOW’s CPI for some workloads is close to the NVIDIA GPU’s and for others about 2x slower.

Quantitatively, we defined “similar” as within a 2X factor, and our goal was to explain the differences. Two different microarchitectures will naturally be quite different; our goal is simply to make sure we are not totally off the rails.
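
Restated as a simple criterion (a paraphrase of the 2X factor above, with the NVIDIA GPU’s CPI taken as the baseline):

$$\mathrm{CPI}_{\mathrm{MIAOW}} \le 2 \times \mathrm{CPI}_{\mathrm{NVIDIA}}$$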

Q4: Power analysis breakdown.

A detailed power analysis of the MIAOW design is given in the Other Comments section below. The absolute power numbers for the FQDS, functional units, and regfile have been updated. The remaining details have been moved to the github wiki for users/readers.

Q5: Detailed physical design of MIAOW?

The “detailed physical design” aspect of this project underwent initial prototyping with the FPGA implementation. The primary issue with this implementation was one of density: the resources needed to create the logic of a full MIAOW compute unit were more than what was available on the FPGA. As FPGAs are inherently less efficient than ASICs at representing logic, we believe that being able to fit the functional logic on the Virtex-7 indicates that the basic design is physically feasible.

Q6: SRAM based register-file design?

The SRAM-based register file did not have any ancillary logic for mapping its single port to the multiple ports the MIAOW register interface uses. It was purely a drop-in used for power and area analysis as a proof of concept. It was NOT used for the performance comparisons.

Q7: Why does the paper lack area and power models for the PLI-based behavioral modules?

Our general position is that the specified components that are currently behavioral modules constitute only a small part of the overall chip’s power consumption compared to all of the functional units. As we continue verification of our dispatcher and work on other components this position may change, but at this point we have not seen anything that would cause us to drastically reevaluate our original conclusion. Note that we have another area comparison in table VII that actually compares MIAOW’s CU with that of Tahiti’s. While MIAOW’s is larger, it is not so much so that we feel there are any fundamental issues.

As also mentioned above, the dispatcher is currently undergoing verification and will be released once it is cleaned up. Besides the OCN (on-chip network), the remaining memory-related components tend to be specific to a technology process if/when someone attempts to actually fab MIAOW as an ASIC. For this reason, we do not feel their absence detracts from the concept of MIAOW as a GPU, especially as the dispatcher itself will be provided as RTL.

Q8: MIAOW’s support for 64-bit execution and software compatibility.

The importance of 64-bit floating point is not a cut and dried issue. While it is true that many GPGPU programs targeting workstation or compute cards such as in the Quadro, Tesla, or FirePro lines make heavy use of them, a significant percentage are also written targeting consumer grade cards that often have crippled 64-bit floating point support. In such cases excessive dependency on 64-bit operations severely reduces the benefits of having a GPGPU kernel in the first place. In addition, the majority of the examples provided by AMD in their APP SDK do not use 64-bit values and operations. Of those that do, most also have a 32-bit variant as well. We thus felt that only supporting 32-bit operations captured a large enough subset of programs to still make MIAOW useful and keep it energy-efficient.

While we understand that 64-bit support is highly desirable, it was ultimately considered a second-tier goal when implementing the initial iteration of MIAOW. A future version is highly likely to incorporate it, especially if third parties wish to provide a contributed module. The complexity of correctly implementing 64-bit IEEE floating point however makes this a major endeavor and will take time to achieve.

In our opinion, this is a minor issue and not an architecture/microarchitecture issue. It is a question of datapath module availability that can be addressed in many ways by individual researchers if necessary - for example, co-simulation with a PLI/behavioral 64-bit module and a black-boxed area estimate. This detail about software compatibility has been updated in section 2.1 of the paper.

Q9: Discrepancies need to be clarified?

Github source code and design discrepancies: we thank the reviewer for the detailed comments and have fixed these discrepancies to the best of our knowledge. Although our design decodes 64-bit instructions, they have not been released on github to avoid confusion (because there is no support for 64-bit floating-point execution). We have therefore updated section 2.1 so that 64-bit instructions are not included in the list of instructions. Eventually we want to add them back to the list, once the 64-bit execution modules are contributed by the community. The ultra-threaded dispatcher is undergoing verification and will be released soon. In a design of this size we are sure inconsistencies still remain; this is why our RTL is released OPEN SOURCE for others to verify and improve for themselves.

Yes, global memory and device memory are interchangeable in NVIDIA, but the point was to correlate each definition in OpenCL to NVIDIA’s CUDA terminology in Table III.

Regarding other modules of AMD Radeon HD GPUs, we don’t have information about their frequency in terms of FO4 delays, although FO4 can be estimated using the rule of thumb of 300 lambda or 500 lambda corresponding to the technology node of that product. We do, however, have MIAOW’s frequency in terms of FO4 delays: one FO4 delay for the 32nm technology node was found to be 112.162 ps, and for a clock period of 4.5 ns the estimate comes to approximately 40 FO4 delays.
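
For reference, the 40 FO4 figure is simply the cycle time divided by the per-FO4 delay quoted above:

$$\frac{4.5\ \text{ns}}{112.162\ \text{ps per FO4}} = \frac{4500\ \text{ps}}{112.162\ \text{ps}} \approx 40.1\ \text{FO4 delays}$$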

Q10: Why did the authors bank and double-pump the multi-ported register file for the FPGA when a design with a single-ported SRAM-based register file existed?

We refer you to UG473, “7 Series FPGAs Memory Resources,” for a listing of what memory resources are physically available in the Xilinx 7 series FPGA family; that document explains the constraint behind this choice.

Early on in the project we decided to make progress on both the FPGA design and the RTL design. Our “canonical” design was the multi-ported design, and hence that is the one mapped to the FPGA. Redoing the FPGA mapping for the single-ported SRAM would also be valuable, but simply from a project timing and resource standpoint we do not have the resources to maintain and support this additional variant and redo our FPGA. We are hopeful that, if the community finds MIAOW valuable, others will pick this up.

Q11: Rodinia OpenCL suite support?

Yes, the Rodinia suite does contain 64-bit floating-point operations, and that is one of the reasons for supporting only 4 out of 20 benchmarks. The other main reason is that some complex flow-control, memory, and synchronization instructions are currently not supported by MIAOW. The Rodinia applications were run only for the case studies and not with performance evaluation in mind, hence no mention of their CPIs. We do not have space to address this in the paper, so we will include a note about it on github. As stated in the intro, one of the non-goals for MIAOW was trying to match the performance of commercial GPUs; as such we have chosen not to perform performance analyses of the Rodinia runs. They were used primarily for functional verification.

Other Comments:

  • FQDS (Fetch Queue Decode Schedule) power number differences are due to MIAOW’s differing datapath. The breakdown of the power reports is as follows:

Frontend (FQDS)

	Fetch Unit         -   7.21 mW
	Wavepool           -  49.9  mW
	Decode             -   6.54 mW
	Issue              -  39.3  mW
	Instruction Queue  -   8.28 mW
	Total (FQDS)       - 111.23 mW

Functional Units

	SIMF (floating-point units) - 67.6 mW × 4 = 270.4 mW
	SIMD (integer units)        - 53.2 mW × 4 = 212.8 mW
	Scalar ALU                  -   2.32 mW
	Exec                        -  16.4  mW
	LSU                         -  91.6  mW
	Total (FU)                  - 593.52 mW

Register File

	RegFile                     - 144.3 mW

	Total CU                    - 849.09 mW
	CU + local cache modules    - 1102 mW ≈ 1.1 W

  • An explanation of how to load programs into MIAOW’s testing framework is given in more detail in the ‘how to use’ documentation on the wiki.

  • The limitation of not reporting the PLI modules’ area/power has been added in the respective sections. The design decisions in section 2.5 have been updated to reflect the limitations, and the software compatibility discussion has been expanded in section 2.1.

  • How scalar registers of multiple wavefronts are mapped to the banked scalar register file: since the issue bandwidth is one, each wavefront executing a scalar instruction accesses the banked scalar register file based on its bank address. The banking of the SGPR file was mainly done for ease of encoding for scalar instructions (see the illustrative sketch at the end of this page).

  • Epoch of wavefronts: generally a group of wavefronts larger than the workgroup size. Typically the workgroup size depends on the SIMD or SIMF (Single Instruction Multiple Floating-point) units in a Compute Unit, but an epoch generally refers to a sample of 100 wavefronts issued to a CU.

  • Section 7’s fault injection case study mainly focuses on whether an injected gate-level fault manifests as a real fault in the results, so each simulation is run to completion for the 2000 sampled runs. All of the AMD APP OpenCL benchmarks finish in reasonable time; Rodinia applications with small input sizes take a bit longer than on architectural simulators. But an architecture simulator cannot capture how gate-level bit faults would affect the application output.
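
As a purely illustrative aid for the banked scalar register file comment above: banking by bank address simply means that the low-order address bits select which bank’s storage array is accessed, and with a scalar issue bandwidth of one only a single bank is touched per cycle. The sketch below is hypothetical; the bank count, address widths, and port names are assumptions and this is not MIAOW’s actual SGPR RTL.

```verilog
// Hypothetical sketch only -- NOT MIAOW's actual SGPR implementation.
// Bank count, address widths, and port names are assumptions for illustration.
module sgpr_banked #(
    parameter NUM_BANKS = 4,
    parameter BANK_BITS = 2,               // log2(NUM_BANKS)
    parameter ADDR_BITS = 9,               // total SGPR address width (assumed)
    parameter WIDTH     = 32
) (
    input                      clk,
    input                      wr_en,
    input      [ADDR_BITS-1:0] wr_addr,
    input      [WIDTH-1:0]     wr_data,
    input      [ADDR_BITS-1:0] rd_addr,
    output reg [WIDTH-1:0]     rd_data
);
    // Low-order address bits pick the bank; the remaining bits index within it.
    wire [BANK_BITS-1:0] wr_bank = wr_addr[BANK_BITS-1:0];
    wire [BANK_BITS-1:0] rd_bank = rd_addr[BANK_BITS-1:0];

    // One storage array per bank (Verilog-2001 multidimensional array).
    reg [WIDTH-1:0] banks [0:NUM_BANKS-1][0:(1 << (ADDR_BITS - BANK_BITS)) - 1];

    always @(posedge clk) begin
        if (wr_en)
            banks[wr_bank][wr_addr[ADDR_BITS-1:BANK_BITS]] <= wr_data;
        rd_data <= banks[rd_bank][rd_addr[ADDR_BITS-1:BANK_BITS]];
    end
endmodule
```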