Add paddle/memory/README.md #2552

Merged · 8 commits · Jun 26, 2017 · Showing changes from 2 commits
141 changes: 141 additions & 0 deletions paddle/README.md
@@ -0,0 +1,141 @@
In my mind, the memory package works like the following:

## Design

### Usage

To allocate 4KB CPU memory:

```cpp
p = memory::Alloc(platform::CPUPlace(), 4*1024);
```

To allocate 4KB memory on the 3rd GPU:

```cpp
p = memory::Alloc(platform::GPUPlace(2), 4*1024);
```

To free the memory and check the amount of memory used so far on a place:

```cpp
auto pl = platform::GPUPlace(0);
p = memory::Alloc(pl, 4*1024);
cout << memory::Used(pl);
memory::Free(pl, p);
```

### The API

In `paddle/memory/memory.h` we have:

```cpp
namespace memory {

template <typename Place> void* Alloc(Place, size_t);
template <typename Place> void Free(Place, void*);

}  // namespace memory
```

**gangliao** (Contributor, Jun 22, 2017): typeame -> typename

Since `Place` is a variant, why not directly use `typeid` to analyze the place type during compilation?

```cpp
void* Alloc(Place place, size_t size) {
  if (place.type() == typeid(GPUPlace)) {
    ...
  } else {
    ...
  }
}
```

**wangkuiyi** (author): Because we want the mismatched-type error reported as early as possible. Erring at compile time is earlier than at runtime.

**gangliao**: I got it, thanks. But `variant` also supports reporting the mismatched-type error at compile time.

**wangkuiyi**: Actually, I think that if we regulate the way we use `GPUPlace` and `CPUPlace` -- as here, with two and only two specializations of `Alloc<Place>` -- we might not need `typedef boost::variant<GPUPlace, CPUPlace> Place` at all. And this might help us remove the dependency on Boost.

**gangliao**: It's great to think about this. The hard part of removing `boost::variant` is `Dim`.

**wangkuiyi** (Jun 22, 2017): Yeah, agreed. And thanks for the reminder about `boost::variant`. Let's start by minimizing the dependencies of each piece of our work.

These function templates have specializations for either `platform::CPUPlace` or `platform::GPUPlace`:

```cpp
template<>
void* Alloc<CPUPlace>(CPUPlace p, size_t size) {
  return GetCPUBuddyAllocator()->Alloc(size);
}
```

and

```cpp
template<>
void* Alloc<GPUPlace>(GPUPlace p, size_t size) {
  return GetGPUBuddyAllocator(p.id)->Alloc(size);
}
```

**Collaborator**: `Alloc(GPUPlace)` -> `Alloc<GPUPlace>`

**wangkuiyi** (author): done

### The Implementation

`GetCPUBuddyAllocator` and `GetGPUBuddyAllocator` are singletons.

**gangliao** (Contributor, Jun 22, 2017): Singleton is a great fit here.

```cpp
BuddyAllocator* GetCPUBuddyAllocator() {
  static BuddyAllocator* a = NULL;
  if (a == NULL) {
    a = new BuddyAllocator(new CPUAllocator /*backup allocator*/, ...);
  }
  return a;
}

BuddyAllocator* GetGPUBuddyAllocator(int gpu_id) {
  static BuddyAllocator** as = NULL;
  if (as == NULL) {
    as = new BuddyAllocator*[platform::NumGPUs()];
    for (int gpu = 0; gpu < platform::NumGPUs(); gpu++) {
      as[gpu] = new BuddyAllocator(new GPUAllocator(gpu) /*backup allocator*/, ...);
    }
  }
  return as[gpu_id];
}
```

**gangliao** (Contributor): All member functions in `CPUAllocator` are static functions; I thought it would be better if the system allocator could be a template type of `BuddyAllocator`.

**wangkuiyi** (author, Jun 22, 2017): I don't think `GPUAllocator::Alloc` could be a static method, as it reads `GPUAllocator::place_` and writes `GPUAllocator::used_`.

**gangliao**: OK, I guess we do not need the member variables here. `GPUAllocator::place_` is unnecessary because you get the place from `memory::Alloc(place, ...)` each time, and `GPUAllocator::Alloc` is called inside `memory::Alloc`. `GPUAllocator::used_` is unnecessary because each time you query the used memory, the value is non-deterministic and likely to fluctuate.

**wangkuiyi** (author): I mean that, in this design, each `GPUAllocator` instance corresponds to a GPU device.

**wangkuiyi** (author): Why is `GPUAllocator::used_` unnecessary? I think memory profiling is very important for a DL system.

**gangliao** (Jun 22, 2017): Yeah, it's useful. I mean it's unnecessary to save it in `GPUAllocator::used_`, because its value always changes and fluctuates.

**wangkuiyi** (author, Jun 22, 2017): I see. I cannot tell how valuable `GPUAllocator::used_` could be, but I'd like to have it, as it enables us to profile memory usage. We may figure out good uses of this information later.

#### `BuddyAllocator`

`BuddyAllocator` implements the buddy allocation algorithm. Its constructor takes only parameters related to the algorithm:

```cpp
BuddyAllocator::BuddyAllocator(initial_pool_size, max_pool_size) {
...
}
```

Please be aware that **`BuddyAllocator` always allocates aligned memory**, aligned on 32 bytes, which is enough to hold a `BuddyAllocator::Block` object:

```cpp
class BuddyAllocator {
 private:
  struct Block {
    size_t size;
    Block* left;
    Block* right;
  };
  ...
};
```

**Collaborator**: Blobk -> Block.

**wangkuiyi** (author): done

#### System Allocators

`GPUAllocator` and `CPUAllocator` are called *system allocators*. They hold information about the device, including the amount of memory that has been allocated, so that we can call

- `GPUAllocator::Used` and
- `CPUAllocator::Used`

to get the amount of memory that has been allocated so far.

**reyoung** (Collaborator, Jun 22, 2017): It seems that system allocators are used when constructing a `BuddyAllocator` and are private data members inside it. How can we get `GPUAllocator::Used` for each device? Is our code like this?

```cpp
auto* buddyAllocator = GetGPUBuddyAllocator(0);
buddyAllocator->SystemAllocator()->Used();
```

**wangkuiyi** (author): or

```cpp
GetGPUAllocator(0)->Used();
```

**wangkuiyi** (author): Thanks for the reminder. Let me explain more here.

**wangkuiyi** (author): done


## Why Such a Design

I got inspiration from Majel and Caffe2, though the design above looks different from both.

### Caffe2

In Caffe2, `Tensor<Context>::mutable_data()` allocates the memory. In particular, [`Tensor<Context>::mutable_data`](https://github.com/caffe2/caffe2/blob/v0.7.0/caffe2/core/tensor.h#L523) calls [`Tensor<Context>::raw_mutable_data`](https://github.com/caffe2/caffe2/blob/v0.7.0/caffe2/core/tensor.h#L459), which in turn calls [`Context::New`](https://github.com/caffe2/caffe2/blob/v0.7.0/caffe2/core/tensor.h#L479).

There are two implementations of `Context`:

1. [`CPUContext`](https://github.com/caffe2/caffe2/blob/v0.7.0/caffe2/core/context.h#L105), whose [`New` method](https://github.com/caffe2/caffe2/blob/v0.7.0/caffe2/core/context.h#L131) calls [`g_cpu_allocator.get()->New(size_t)`](https://github.com/caffe2/caffe2/blob/v0.7.0/caffe2/core/context.cc#L15) to allocate the memory.

1. [`CUDAContext`](https://github.com/caffe2/caffe2/blob/v0.7.0/caffe2/core/context_gpu.h#L99), which has a data member [`int gpu_id_`](https://github.com/caffe2/caffe2/blob/v0.7.0/caffe2/core/context_gpu.h#L202). This looks very similar to class `majel::GPUPlace`, who also has an `int id_` data member. `CUDAContext::New(size_t)` calls [`g_cub_allocator->DeviceAllocate(&ptr, nbytes)`](https://github.com/caffe2/caffe2/blob/v0.7.0/caffe2/core/context_gpu.cu#L355) to allocate the memory.

### Majel

In Majel, there are basically two allocator types:

1. `cpu::SystemAllocator`, which has similar functionality to `caffe2::CPUContext::New/Delete`.
1. `gpu::SystemAllocator`, which has similar functionality to `caffe2::CUDAContext::New/Delete`.

However, programs do not allocate memory through these two allocators directly; they are defined in hidden namespaces.

In Majel there are hidden global variables like:

1. `cpu::SystemAllocator g_cpu_allocator`, and
1. `vector<gpu::SystemAllocator*> g_gpu_allocators(NUM_GPUS)`.

Programs allocate memory via a `BuddyAllocator`, which can take `g_cpu_allocator` or a `g_gpu_allocators[gpu_id]` as its *fallback allocator*: if the `BuddyAllocator` cannot find a block in its memory pool, it extends the pool by calling the fallback allocator's `New(size_t)`.
4 changes: 2 additions & 2 deletions python/paddle/trainer_config_helpers/networks.py
@@ -1381,7 +1381,7 @@ def inputs(layers, *args):
     if len(args) != 0:
         layers.extend(args)

-    Inputs(*[l.name for l in layers])
+    Inputs(* [l.name for l in layers])


def outputs(layers, *args):
@@ -1424,7 +1424,7 @@ def __dfs_travel__(layer,
     assert len(layers) > 0

     if HasInputsSet():  # input already set
-        Outputs(*[l.name for l in layers])
+        Outputs(* [l.name for l in layers])
         return  # just return outputs.

     if len(layers) != 1: