Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use connected lists for tape data structure #954

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

rohanjulka19
Copy link
Contributor

@rohanjulka19 rohanjulka19 commented Jun 20, 2024

Currently I have added a basic implementation of tape data structure with connected lists/slabs. It is more or less similar to current tape data strucutre I copied a lot of code just added blocks and only added the functionality which I feel is needed for reverse mode.

Right now NumericalDiff also uses tape so I defined two types one is "tape" which points to the new tape implementation and "old_tape" which points to current tape data strcuture being used.

Tests are running fine for the new implementation.

These are the benchmarks for old_tape and new_tape.

LastTest-new-tape.log
LastTest-tape.log

Fixes - #793

Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clang-tidy made some suggestions

There were too many comments to post at once. Showing the first 10 out of 16. Check the log or trigger a new build to see more.

//

#ifndef CLAD_NEWTAPE_H
#define CLAD_NEWTAPE_H
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning: header guard does not follow preferred style [llvm-header-guard]

Suggested change
#define CLAD_NEWTAPE_H
#ifndef CLAD_DIFFERENTIATOR_NEWTAPE_H
#define CLAD_DIFFERENTIATOR_NEWTAPE_H

include/clad/Differentiator/NewTape.h:87:

- #endif // CLAD_NEWTAPE_H
+ #endif // CLAD_DIFFERENTIATOR_NEWTAPE_H

include/clad/Differentiator/NewTape.h Show resolved Hide resolved
include/clad/Differentiator/NewTape.h Outdated Show resolved Hide resolved
include/clad/Differentiator/NewTape.h Outdated Show resolved Hide resolved
include/clad/Differentiator/NewTape.h Outdated Show resolved Hide resolved
include/clad/Differentiator/NewTape.h Outdated Show resolved Hide resolved
include/clad/Differentiator/NewTape.h Outdated Show resolved Hide resolved
include/clad/Differentiator/NewTape.h Outdated Show resolved Hide resolved
include/clad/Differentiator/NewTape.h Outdated Show resolved Hide resolved
include/clad/Differentiator/NewTape.h Outdated Show resolved Hide resolved
Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clang-tidy made some suggestions

There were too many comments to post at once. Showing the first 10 out of 12. Check the log or trigger a new build to see more.

@@ -0,0 +1,84 @@
#ifndef CLAD_NEWTAPE_H
#define CLAD_NEWTAPE_H
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning: header guard does not follow preferred style [llvm-header-guard]

Suggested change
#define CLAD_NEWTAPE_H
#ifndef CLAD_DIFFERENTIATOR_NEWTAPE_H
#define CLAD_DIFFERENTIATOR_NEWTAPE_H

include/clad/Differentiator/NewTape.h:83:

- #endif // CLAD_NEWTAPE_H
+ #endif // CLAD_DIFFERENTIATOR_NEWTAPE_H

include/clad/Differentiator/NewTape.h Outdated Show resolved Hide resolved
include/clad/Differentiator/NewTape.h Outdated Show resolved Hide resolved
include/clad/Differentiator/NewTape.h Outdated Show resolved Hide resolved
include/clad/Differentiator/NewTape.h Outdated Show resolved Hide resolved
include/clad/Differentiator/NewTape.h Outdated Show resolved Hide resolved
include/clad/Differentiator/NewTape.h Outdated Show resolved Hide resolved
void emplace_back(ArgsT&&... args) {
if (!cur_block || _size >= _capacity) {
Block<T>* prev_block = cur_block;
cur_block = new Block<T>(_capacity);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning: assigning newly created 'gsl::owner<>' to non-owner 'Block *' [cppcoreguidelines-owning-memory]

      cur_block = new Block<T>(_capacity);
      ^

_size = 0;
}
_size += 1;
::new (const_cast<void*>(static_cast<const volatile void*>(block_end())))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning: do not use const_cast [cppcoreguidelines-pro-type-const-cast]

    ::new (const_cast<void*>(static_cast<const volatile void*>(block_end())))
           ^

include/clad/Differentiator/NewTape.h Outdated Show resolved Hide resolved
@vgvassilev
Copy link
Owner

@rohanjulka19, do you mind addressing the clang-tidy complaints where relevant?

@rohanjulka19
Copy link
Contributor Author

rohanjulka19 commented Jul 1, 2024

hi, yeah will address these clang-tidy issues, actually i got stuck on the loop condition differentiation issue (once again) #818. But now I am fixing that one, once I push that fix will address these then.

Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clang-tidy made some suggestions

@@ -0,0 +1,87 @@
#ifndef CLAD_NEWTAPE_H
#define CLAD_NEWTAPE_H
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning: header guard does not follow preferred style [llvm-header-guard]

Suggested change
#define CLAD_NEWTAPE_H
#ifndef CLAD_DIFFERENTIATOR_NEWTAPE_H
#define CLAD_DIFFERENTIATOR_NEWTAPE_H

include/clad/Differentiator/NewTape.h:86:

- #endif // CLAD_NEWTAPE_H
+ #endif // CLAD_DIFFERENTIATOR_NEWTAPE_H

include/clad/Differentiator/NewTape.h Outdated Show resolved Hide resolved
include/clad/Differentiator/NewTape.h Outdated Show resolved Hide resolved
include/clad/Differentiator/NewTape.h Outdated Show resolved Hide resolved
void emplace_back(ArgsT&&... args) {
if (!m_cur_block || m_size >= _capacity) {
Block<T>* prev_block = m_cur_block;
m_cur_block = new Block<T>(_capacity);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning: assigning newly created 'gsl::owner<>' to non-owner 'Block *' [cppcoreguidelines-owning-memory]

      m_cur_block = new Block<T>(_capacity);
      ^

include/clad/Differentiator/NewTape.h Outdated Show resolved Hide resolved
include/clad/Differentiator/NewTape.h Outdated Show resolved Hide resolved
Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clang-tidy made some suggestions

include/clad/Differentiator/NewTape.h Outdated Show resolved Hide resolved
~Block() {
destroy(block_begin(), block_end());
::operator delete(
const_cast<void*>(static_cast<const volatile void*>(data)));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning: do not use const_cast [cppcoreguidelines-pro-type-const-cast]

        const_cast<void*>(static_cast<const volatile void*>(data)));
        ^

void emplace_back(ArgsT&&... args) {
if (!m_cur_block || m_size >= m_capacity) {
Block<T>* prev_block = m_cur_block;
m_cur_block = new Block<T>(m_capacity);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning: assigning newly created 'gsl::owner<>' to non-owner 'Block *' [cppcoreguidelines-owning-memory]

      m_cur_block = new Block<T>(m_capacity);
      ^

m_size = 0;
}
m_size += 1;
::new (const_cast<void*>(static_cast<const volatile void*>(end())))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning: do not use const_cast [cppcoreguidelines-pro-type-const-cast]

    ::new (const_cast<void*>(static_cast<const volatile void*>(end())))
           ^

@rohanjulka19
Copy link
Contributor Author

@vgvassilev I have fixed most of the suggestion, the above four I don't think needs to be addressed.

@vgvassilev
Copy link
Owner

@vgvassilev I have fixed most of the suggestion, the above four I don't think needs to be addressed.

The include protector suggestions seem good. Looks like the new tape has a problem with cuda.

Copy link

codecov bot commented Jul 4, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 94.37%. Comparing base (d82f7fd) to head (26fcee4).

Additional details and impacted files

Impacted file tree graph

@@           Coverage Diff           @@
##           master     #954   +/-   ##
=======================================
  Coverage   94.37%   94.37%           
=======================================
  Files          50       50           
  Lines        8366     8366           
=======================================
  Hits         7895     7895           
  Misses        471      471           

@rohanjulka19 rohanjulka19 force-pushed the tape-impl branch 2 times, most recently from 452b55d to 38fdfb3 Compare July 4, 2024 18:08
@rohanjulka19
Copy link
Contributor Author

The include protector suggestions seem good. Looks like the new tape has a problem with cuda.

Have changed the include protector to align with the suggestion. The cuda test is passing but I am not sure if it will work when this runs on more than one thread in parallel

@vgvassilev
Copy link
Owner

I am a little puzzled since the new tape implementation is supposed to be faster than the existing one but it seems consistently slower: https://github.com/vgvassilev/clad/actions/runs/9798455926/job/27056951548?pr=954#step:31:596

Do you have any clue what might be going wrong?

@rohanjulka19
Copy link
Contributor Author

I thought it was due to the small block size leading to more memory allocation but increasing the block size also didn't worked so will investigate this further

@vgvassilev
Copy link
Owner

@rohanjulka19, do we have a clue what might be still wrong for the benchmarks? Maybe we should profile it with something like heaptrack, kcachegrind or similar...

@rohanjulka19
Copy link
Contributor Author

rohanjulka19 commented Jul 19, 2024

oh yeah I tried with timer but i couldn't figure out but I had a intution that because I am assigning memory twice once for the block and once for the array inside the block maybe that's whats causing the issue. So I am trying to just assign memory once by taking a fixed array of size 32 that should allocate all the memory once, but I am not very good with cpp so there is some issue with either memory allocation or something else. I will investigate the code this weekend, what do you think can this be the performance issue ? Should I use heaptrack to check if this is true ?

Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clang-tidy made some suggestions


template <typename T> class Block {
public:
T data[capacity];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning: do not declare C-style arrays, use std::array<> instead [cppcoreguidelines-avoid-c-arrays]

  T data[capacity];
  ^

CUDA_HOST_DEVICE void emplace_back(ArgsT&&... args) {
if (!m_cur_block || m_size >= capacity) {
Block<NonConstT>* prev_block = m_cur_block;
m_cur_block = new Block<NonConstT>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning: assigning newly created 'gsl::owner<>' to non-owner 'Block *' (aka 'Block<typename std::remove_cv::type> *') [cppcoreguidelines-owning-memory]

      m_cur_block = new Block<NonConstT>();
      ^


~new_tape_impl() {
if (m_cur_block) {
// destroy();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @vgvassilev so when I comment this line the time reduces. This function calls the destructors for all the blocks. This should never be called as we are destructing the block when we pop the last element from it, but this is to be safe in case tape is terminated without popping all the values due to some error. Now even though I have put this function inside the if stmt still this function is affecting the performance even when this is not being called. I believe this is because of function inlining. Is there way to ensure this doesn't happen ?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably that's right. If you compile the code with clang you can add -Rpass-missed=.* to get more information about the missed optimization opportunities.

@rohanjulka19
Copy link
Contributor Author

Additionally on my local for BM_ReverseModeProductExecute I am getting this time

BM_ReverseModeProductExecute 0.151 ns 0.150 ns 1000000000

But the benchmark returns this result

BM_ReverseModeProductExecute 114 ns 114 ns 5956940

What must be the reason for this ? Also I am on M2 Mac (ARM) so like what tool should I use to actually understand what's the issue ?

if (m_size == 0) {
Block<NonConstT>* temp = m_cur_block;
m_cur_block = m_cur_block->prev;
temp->~Block<NonConstT>();
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why we do not delete the m_cur_block here?

Copy link
Contributor Author

@rohanjulka19 rohanjulka19 Jul 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we are deleting the current block at line 107.

I added the destroy function in the tape destructor to ensure that incase tape terminates without popping all the values, it should then call the destructor for all the pending blocks.

Copy link
Contributor Author

@rohanjulka19 rohanjulka19 Jul 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  ~new_tape_impl() {
    if (m_cur_block) {
      //       destroy();

In this line I don't want the compiler optimization and instead keep the code of destroy function separate because if it is added to the destructor it will slow down the destructor even though the destroy function won't likely be called. If we are sure that all the values will always be popped before the tape terminates then I can just remove this function from the tape destructor


~new_tape_impl() {
if (m_cur_block) {
// destroy();
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably that's right. If you compile the code with clang you can add -Rpass-missed=.* to get more information about the missed optimization opportunities.

@vgvassilev
Copy link
Owner

Additionally on my local for BM_ReverseModeProductExecute I am getting this time

BM_ReverseModeProductExecute 0.151 ns 0.150 ns 1000000000

But the benchmark returns this result

BM_ReverseModeProductExecute 114 ns 114 ns 5956940

What must be the reason for this ? Also I am on M2 Mac (ARM) so like what tool should I use to actually understand what's the issue ?

Do you mean that the bots give different benchmark results? If so, that's expected because they are different machines with different properties.

@rohanjulka19
Copy link
Contributor Author

rohanjulka19 commented Jul 24, 2024

Additionally on my local for BM_ReverseModeProductExecute I am getting this time
BM_ReverseModeProductExecute 0.151 ns 0.150 ns 1000000000
But the benchmark returns this result
BM_ReverseModeProductExecute 114 ns 114 ns 5956940
What must be the reason for this ? Also I am on M2 Mac (ARM) so like what tool should I use to actually understand what's the issue ?

Do you mean that the bots give different benchmark results? If so, that's expected because they are different machines with different properties.

Yeah but in my laptop the new tape is perfoming better than old tape and according to the benchmarks exectued by the pipeline the old tape is doing better than the new tape

@vgvassilev
Copy link
Owner

Additionally on my local for BM_ReverseModeProductExecute I am getting this time
BM_ReverseModeProductExecute 0.151 ns 0.150 ns 1000000000
But the benchmark returns this result
BM_ReverseModeProductExecute 114 ns 114 ns 5956940
What must be the reason for this ? Also I am on M2 Mac (ARM) so like what tool should I use to actually understand what's the issue ?

Do you mean that the bots give different benchmark results? If so, that's expected because they are different machines with different properties.

Yeah but in my laptop the new tape is perfoming better than old tape and according to the benchmarks exectued by the pipeline the old tape is doing better than the new tape

Do you run your benchmarks on release builds?

Could you log in the bots and run some profiling there? You can do that by following: https://clad.readthedocs.io/en/latest/user/DevelopersDocumentation.html#debugging-github-runners

Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clang-tidy made some suggestions

using iterator = pointer;

CUDA_HOST_DEVICE Block() {
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning: use '= default' to define a trivial default constructor [modernize-use-equals-default]

Suggested change
}
CUDA_HOST_DEVICE Block() = default;

public:
new_tape_impl() = default;

~new_tape_impl() { }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning: use '= default' to define a trivial destructor [modernize-use-equals-default]

Suggested change
~new_tape_impl() { }
~new_tape_impl() = default;

if (m_size == 0) {
Block<NonConstT>* temp = m_cur_block;
m_cur_block = m_cur_block->prev;
delete temp;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning: deleting a pointer through a type that is not marked 'gsl::owner<>'; consider using a smart pointer instead [cppcoreguidelines-owning-memory]

      delete temp;
      ^
Additional context

include/clad/Differentiator/NewTape.h:101: variable declared here

      Block<NonConstT>* temp = m_cur_block;
      ^

Instead of a single list use mutliple connected lists to store elements.
This allows to dynamically increase the size of tape without the need of
relocating elements.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants