From 7cb68a8d9315bd3c3c769e47ee3752867854ee12 Mon Sep 17 00:00:00 2001
From: Yi Wang
Date: Wed, 21 Jun 2017 13:19:40 -0700
Subject: [PATCH 1/7] Add paddle/memory/README.md

---
 paddle/README.md | 141 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 141 insertions(+)
 create mode 100644 paddle/README.md

diff --git a/paddle/README.md b/paddle/README.md
new file mode 100644
index 0000000000000..24af37987e057
--- /dev/null
+++ b/paddle/README.md
@@ -0,0 +1,141 @@
+In my mind, the memory package works like the following:
+
+## Design
+
+### Usage
+
+To allocate 4KB CPU memory:
+
+```cpp
+p = memory::Alloc(platform::CPUPlace(), 4*1024);
+```
+
+To allocate 4KB memory on the 3rd GPU:
+
+```cpp
+p = memory::Alloc(platform::GPUPlace(2), 4*1024);
+```
+
+To free memory and check the amount of memory used so far on a place:
+
+```cpp
+auto pl = platform::GPUPlace(0);
+p = memory::Alloc(pl, 4*1024);
+cout << memory::Used(pl);
+memory::Free(pl, p);
+```
+
+### The API
+
+In `paddle/memory/memory.h` we have:
+
+```cpp
+template <typename Place> void* Alloc(Place, size_t);
+template <typename Place> void Free(Place, void*);
+}
+```
+
+These function templates have specializations on either `platform::CPUPlace` or `platform::GPUPlace`:
+
+```cpp
+template<>
+void Alloc(CPUPlace p, size_t size) {
+  return GetCPUBuddyAllocator()->Alloc(size);
+}
+```
+
+and
+
+```cpp
+template<>
+void Alloc(GPUPlace)(GPUPlace p, size_t size) {
+  return GetGPUBuddyAllocator(p.id)->Alloc(size);
+}
+```
+
+### The Implementation
+
+`GetCPUBuddyAllocator` and `GetGPUBuddyAllocator` are singletons.
+
+```cpp
+BuddyAllocator* GetCPUBuddyAllocator() {
+  static BuddyAllocator* a = NULL;
+  if (a == NULL) {
+    a = new BuddyAllocator(new CPUAllocator /*backup allocator*/, ...);
+  }
+  return a;
+}
+
+BuddyAllocator* GetGPUBuddyAllocator(int gpu_id) {
+  static BuddyAllocator** as = NULL;
+  if (as == NULL) {
+    as = new BuddyAllocator*[platform::NumGPUs()];
+    for (int gpu = 0; gpu < platform::NumGPUs(); gpu++) {
+      as[gpu] = new BuddyAllocator(new GPUAllocator(gpu) /* backup allocator */, ...);
+    }
+  }
+  return as[gpu_id];
+}
+```
+
+#### `BuddyAllocator`
+
+`BuddyAllocator` implements the buddy allocation algorithm. Its constructor takes only parameters related to the algorithm:
+
+```cpp
+BuddyAllocator::BuddyAllocator(initial_pool_size, max_pool_size) {
+  ...
+}
+```
+
+Please be aware that **`BuddyAllocator` always allocates aligned memory**, aligned on 32-byte boundaries, so that each block can hold a `BuddyAllocator::Block` object:
+
+```cpp
+class BuddyAllocator {
+ private:
+  struct Block {
+    size_t size;
+    Blobk* left, right;
+  };
+  ...
+};
+```
+
+#### System Allocators
+
+The `GPUAllocator` and `CPUAllocator` are called *system allocators*. They hold information about the device, including the amount of memory that has been allocated, so that we can call
+
+- `GPUAllocator::Used` and
+- `CPUAllocator::Used`
+
+to get the amount of memory that has been allocated so far.
+
+
+## Why Such a Design
+
+I got inspiration from Majel and Caffe2, though the above design looks different from both.
+
+### Caffe2
+
+In Caffe2, `Tensor::mutable_data()` allocates the memory. In particular, [`Tensor::mutable_data`](https://github.com/caffe2/caffe2/blob/v0.7.0/caffe2/core/tensor.h#L523) calls [`Tensor::raw_mutable_data`](https://github.com/caffe2/caffe2/blob/v0.7.0/caffe2/core/tensor.h#L459), which in turn calls [`Context::New`](https://github.com/caffe2/caffe2/blob/v0.7.0/caffe2/core/tensor.h#L479).
+
+There are two implementations of `Context`:
+
+1.
[`CPUContext`](https://github.com/caffe2/caffe2/blob/v0.7.0/caffe2/core/context.h#L105), whose [`New` method](https://github.com/caffe2/caffe2/blob/v0.7.0/caffe2/core/context.h#L131) calls [`g_cpu_allocator.get()->New(size_t)`](https://github.com/caffe2/caffe2/blob/v0.7.0/caffe2/core/context.cc#L15) to allocate the memory.
+
+1. [`CUDAContext`](https://github.com/caffe2/caffe2/blob/v0.7.0/caffe2/core/context_gpu.h#L99), which has a data member [`int gpu_id_`](https://github.com/caffe2/caffe2/blob/v0.7.0/caffe2/core/context_gpu.h#L202). This looks very similar to class `majel::GPUPlace`, which also has an `int id_` data member. `CUDAContext::New(size_t)` calls [`g_cub_allocator->DeviceAllocate(&ptr, nbytes)`](https://github.com/caffe2/caffe2/blob/v0.7.0/caffe2/core/context_gpu.cu#L355) to allocate the memory.
+
+### Majel
+
+In Majel, there are basically two allocator types:
+
+1. `cpu::SystemAllocator`, which has similar functionality to `caffe2::CPUContext::New/Delete`.
+1. `gpu::SystemAllocator`, which has similar functionality to `caffe2::CUDAContext::New/Delete`.
+
+However, memory allocation is not done via these two allocators; instead, they are defined in hidden namespaces.
+
+In Majel there are hidden global variables like:
+
+1. `cpu::SystemAllocator g_cpu_allocator`, and
+1. `vector<gpu::SystemAllocator> g_gpu_allocators(NUM_GPUS)`.
+
+Programs allocate memory via a BuddyAllocator, which can take the `g_cpu_allocator` or a `g_gpu_allocators[gpu_id]` as its *fallback allocator*, so that if BuddyAllocator cannot find a block in its memory pool, it extends its memory pool by calling the fallback allocator's `New(size_t)`.

From 0a92908b5ea68daa040155a7088b7f520c16c51d Mon Sep 17 00:00:00 2001
From: Yi Wang
Date: Wed, 21 Jun 2017 17:02:30 -0700
Subject: [PATCH 2/7] Has to auto format networks.py because CI complains about it.
---
 python/paddle/trainer_config_helpers/networks.py | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/python/paddle/trainer_config_helpers/networks.py b/python/paddle/trainer_config_helpers/networks.py
index 1bf59ed4840ae..67154a8d7d366 100755
--- a/python/paddle/trainer_config_helpers/networks.py
+++ b/python/paddle/trainer_config_helpers/networks.py
@@ -1381,7 +1381,7 @@ def inputs(layers, *args):
     if len(args) != 0:
         layers.extend(args)

-    Inputs(*[l.name for l in layers])
+    Inputs(* [l.name for l in layers])


 def outputs(layers, *args):
@@ -1424,7 +1424,7 @@ def __dfs_travel__(layer,
     assert len(layers) > 0

     if HasInputsSet():  # input already set
-        Outputs(*[l.name for l in layers])
+        Outputs(* [l.name for l in layers])
         return  # just return outputs.

     if len(layers) != 1:

From d3e2db4b4f3efa537a2b85bb88d8d8f3e780f09c Mon Sep 17 00:00:00 2001
From: Yi Wang
Date: Thu, 22 Jun 2017 08:12:10 -0700
Subject: [PATCH 3/7] Revert changes made by misleading errors from Travis CI

---
 python/paddle/trainer_config_helpers/networks.py | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/python/paddle/trainer_config_helpers/networks.py b/python/paddle/trainer_config_helpers/networks.py
index 67154a8d7d366..1bf59ed4840ae 100755
--- a/python/paddle/trainer_config_helpers/networks.py
+++ b/python/paddle/trainer_config_helpers/networks.py
@@ -1381,7 +1381,7 @@ def inputs(layers, *args):
     if len(args) != 0:
         layers.extend(args)

-    Inputs(* [l.name for l in layers])
+    Inputs(*[l.name for l in layers])


 def outputs(layers, *args):
@@ -1424,7 +1424,7 @@ def __dfs_travel__(layer,
     assert len(layers) > 0

     if HasInputsSet():  # input already set
-        Outputs(* [l.name for l in layers])
+        Outputs(*[l.name for l in layers])
         return  # just return outputs.
     if len(layers) != 1:

From 8cfa48dc88c0c702b30094ca558bf2182e00faba Mon Sep 17 00:00:00 2001
From: Yi Wang
Date: Thu, 22 Jun 2017 10:27:36 -0700
Subject: [PATCH 4/7] Move README.md from paddle/ to paddle/memory/

---
 paddle/{ => memory}/README.md | 0
 1 file changed, 0 insertions(+), 0 deletions(-)
 rename paddle/{ => memory}/README.md (100%)

diff --git a/paddle/README.md b/paddle/memory/README.md
similarity index 100%
rename from paddle/README.md
rename to paddle/memory/README.md

From c617520776c58791d77d1382eba67ac4264916f0 Mon Sep 17 00:00:00 2001
From: Yi Wang
Date: Thu, 22 Jun 2017 10:35:52 -0700
Subject: [PATCH 5/7] In response to comments from Liao Gang and Yu Yang

---
 paddle/memory/README.md | 26 +++++++++++++++-----------
 1 file changed, 15 insertions(+), 11 deletions(-)

diff --git a/paddle/memory/README.md b/paddle/memory/README.md
index 24af37987e057..b71ca29696513 100644
--- a/paddle/memory/README.md
+++ b/paddle/memory/README.md
@@ -25,14 +25,16 @@ cout << memory::Used(pl);
 memory::Free(pl, p);
 ```

-### The API
+### API

 In `paddle/memory/memory.h` we have:

 ```cpp
-template <typename Place> void* Alloc(Place, size_t);
-template <typename Place> void Free(Place, void*);
-}
+namespace memory {
+template <typename Place> void* Alloc(Place, size_t);
+template <typename Place> void Free(Place, void*);
+template <typename Place> void Used(Place);
+} // namespace memory
 ```

 These function templates have specializations on either `platform::CPUPlace` or `platform::GPUPlace`:
@@ -48,12 +50,14 @@ and

 ```cpp
 template<>
-void Alloc(GPUPlace)(GPUPlace p, size_t size) {
+void Alloc(GPUPlace p, size_t size) {
   return GetGPUBuddyAllocator(p.id)->Alloc(size);
 }
 ```

-### The Implementation
+Similar specializations exist for `Free` and `Used`.
+
+### Implementation

 `GetCPUBuddyAllocator` and `GetGPUBuddyAllocator` are singletons.
@@ -94,7 +98,7 @@ class BuddyAllocator {
  private:
   struct Block {
     size_t size;
-    Blobk* left, right;
+    Block* left, right;
   };
   ...
};
@@ -102,15 +106,15 @@ class BuddyAllocator {

 #### System Allocators

-The `GPUAllocator` and `CPUAllocator` are called *system allocators*. They hold information about the device, including the amount of memory that has been allocated, so that we can call
+The `GPUAllocator` and `CPUAllocator` are called *system allocators*. They work as the fallback allocators of `BuddyAllocator`. A system allocator holds information about a device, including the amount of memory that has been allocated, so we can call

-- `GPUAllocator::Used` and
-- `CPUAllocator::Used`
+- `GPUAllocator::Used()` and
+- `CPUAllocator::Used()`

 to get the amount of memory that has been allocated so far.


-## Why Such a Design
+## Justification

 I got inspiration from Majel and Caffe2, though the above design looks different from both.

From b55df90dfdf6b9720548613885d291ae8769705b Mon Sep 17 00:00:00 2001
From: Yi Wang
Date: Fri, 23 Jun 2017 11:42:48 -0700
Subject: [PATCH 6/7] Remove unnecessary preamble

---
 paddle/memory/README.md | 2 --
 1 file changed, 2 deletions(-)

diff --git a/paddle/memory/README.md b/paddle/memory/README.md
index b71ca29696513..fd32d07ef40fc 100644
--- a/paddle/memory/README.md
+++ b/paddle/memory/README.md
@@ -1,5 +1,3 @@
-In my mind, the memory package works like the following:
-
 ## Design

 ### Usage

From ab2550c6400bce5d2596f5bff8629ef67ed195b8 Mon Sep 17 00:00:00 2001
From: Yi Wang
Date: Sun, 25 Jun 2017 15:44:55 -0700
Subject: [PATCH 7/7] Update design

---
 paddle/memory/README.md | 14 +++++---------
 1 file changed, 5 insertions(+), 9 deletions(-)

diff --git a/paddle/memory/README.md b/paddle/memory/README.md
index fd32d07ef40fc..e5f7880e4cad3 100644
--- a/paddle/memory/README.md
+++ b/paddle/memory/README.md
@@ -31,7 +31,7 @@ In `paddle/memory/memory.h` we have:
 namespace memory {
 template <typename Place> void* Alloc(Place, size_t);
 template <typename Place> void Free(Place, void*);
-template <typename Place> void Used(Place);
+template <typename Place> size_t Used(Place);
 } // namespace memory
 ```

These function templates have
specializations on either `platform::CPUPlace` or

 ```cpp
 template<>
-void Alloc(CPUPlace p, size_t size) {
+void* Alloc(CPUPlace p, size_t size) {
   return GetCPUBuddyAllocator()->Alloc(size);
 }
 ```
@@ -102,15 +102,11 @@ class BuddyAllocator {
 };
 ```

-#### System Allocators
-
-The `GPUAllocator` and `CPUAllocator` are called *system allocators*. They work as the fallback allocators of `BuddyAllocator`. A system allocator holds information about a device, including the amount of memory that has been allocated, so we can call
+Because `BuddyAllocator` has the metadata of each block, it can track the used memory -- recording the amount returned by `Alloc` and freed by `Free`. In contrast, `CPUAllocator` and `GPUAllocator` do not know the size of a freed memory block and cannot do such tracking.

-- `GPUAllocator::Used()` and
-- `CPUAllocator::Used()`
-
-to get the amount of memory that has been allocated so far.
+#### System Allocators

+The `GPUAllocator` and `CPUAllocator` are called *system allocators*. They work as the fallback allocators of `BuddyAllocator`.

 ## Justification