From c6402faa28e733aa0874d0fe4f16a3219b55dd82 Mon Sep 17 00:00:00 2001 From: Talia Chopra Date: Fri, 1 Nov 2019 13:16:20 -0700 Subject: [PATCH 1/5] a new round of link fixes --- .../gluon_from_experiment_to_deployment.md | 5 +- .../logistic_regression_explained.md | 2 +- .../getting-started/to-mxnet/pytorch.md | 4 +- docs/python_docs/python/tutorials/index.rst | 10 ++-- .../gluon/blocks/activations/activations.md | 8 +-- .../python/tutorials/packages/gluon/index.rst | 2 +- .../tutorials/packages/optimizer/index.md | 56 +++++++++---------- .../tutorials/mxnet_cpp_inference_tutorial.md | 7 ++- docs/static_site/src/pages/api/faq/float16.md | 4 +- julia/docs/mkdocs.yml | 2 +- julia/docs/src/tutorial/char-lstm.md | 2 +- 11 files changed, 51 insertions(+), 51 deletions(-) diff --git a/docs/python_docs/python/tutorials/getting-started/gluon_from_experiment_to_deployment.md b/docs/python_docs/python/tutorials/getting-started/gluon_from_experiment_to_deployment.md index 47b629991650..20e9cabcdaf8 100644 --- a/docs/python_docs/python/tutorials/getting-started/gluon_from_experiment_to_deployment.md +++ b/docs/python_docs/python/tutorials/getting-started/gluon_from_experiment_to_deployment.md @@ -267,8 +267,7 @@ finetune_net.export("flower-recognition", epoch=epochs) MXNet provides various useful tools and interfaces for deploying your model for inference. For example, you can use [MXNet Model Server](https://github.com/awslabs/mxnet-model-server) to start a service and host your trained model easily. Besides that, you can also use MXNet's different language APIs to integrate your model with your existing service. We provide [Python](/api/python.html), [Java](/api/java.html), [Scala](/api/scala.html), and [C++](/api/cpp) APIs. -Here we will briefly introduce how to run inference using Module API in Python. There is more detailed explanation available in the [Predict Image Tutorial](https://mxnet.apache.org/tutorials/python/predict_image.html). -In general, prediction consists of the following steps: +Here we will briefly introduce how to run inference using Module API in Python. In general, prediction consists of the following steps: 1. Load the model architecture (symbol file) and trained parameter values (params file) 2. Load the synset file for label names 3. Load the image and apply the same transformation we did on validation dataset during training @@ -311,7 +310,7 @@ probability=9.798435, class=lotus ## What's next -You can continue to the [next tutorial](https://mxnet.apache.org/versions/master/tutorials/c++/mxnet_cpp_inference_tutorial.html) on how to load the model we just trained and run inference using MXNet C++ API. +You can continue to the [next tutorial](/api/cpp/docs/tutorials/cpp_inference) on how to load the model we just trained and run inference using MXNet C++ API. You can also find more ways to run inference and deploy your models here: 1. 
[Java Inference examples](https://github.com/apache/incubator-mxnet/tree/master/scala-package/examples/src/main/java/org/apache/mxnetexamples/javaapi/infer) diff --git a/docs/python_docs/python/tutorials/getting-started/logistic_regression_explained.md b/docs/python_docs/python/tutorials/getting-started/logistic_regression_explained.md index 36f2e5a68062..8cd29f2a32b3 100644 --- a/docs/python_docs/python/tutorials/getting-started/logistic_regression_explained.md +++ b/docs/python_docs/python/tutorials/getting-started/logistic_regression_explained.md @@ -55,7 +55,7 @@ batch_size = 10 ## Working with data -To work with data, Apache MXNet provides [Dataset](https://mxnet.apache.org/api/python/gluon/data.html#mxnet.gluon.data.Dataset) and [DataLoader](https://mxnet.apache.org/api/python/gluon/data.html#mxnet.gluon.data.DataLoader) classes. The former is used to provide an indexed access to the data, the latter is used to shuffle and batchify the data. To learn more about working with data in Gluon, please refer to [Gluon Datasets and Dataloaders](https://mxnet.apache.org/tutorials/gluon/datasets.html) tutorial. +To work with data, Apache MXNet provides [Dataset](https://mxnet.apache.org/api/python/gluon/data.html#mxnet.gluon.data.Dataset) and [DataLoader](https://mxnet.apache.org/api/python/gluon/data.html#mxnet.gluon.data.DataLoader) classes. The former is used to provide an indexed access to the data, the latter is used to shuffle and batchify the data. To learn more about working with data in Gluon, please refer to [Gluon Datasets and Dataloaders](/api/python/docs/api/gluon/data/index.html). Below we define training and validation datasets, which we are going to use in the tutorial. diff --git a/docs/python_docs/python/tutorials/getting-started/to-mxnet/pytorch.md b/docs/python_docs/python/tutorials/getting-started/to-mxnet/pytorch.md index 1ab490fbaa42..ec4bdfcdc77e 100644 --- a/docs/python_docs/python/tutorials/getting-started/to-mxnet/pytorch.md +++ b/docs/python_docs/python/tutorials/getting-started/to-mxnet/pytorch.md @@ -106,7 +106,7 @@ mx_train_data = gluon.data.DataLoader( Both frameworks allows you to download MNIST data set from their sources and specify that only training part of the data set is required. -The main difference between the code snippets is that MXNet uses [transform_first](https://mxnet.apache.org/api/python/docs/api/gluon/_autogen/mxnet.gluon.data.Dataset.html) method to indicate that the data transformation is done on the first element of the data batch, the MNIST picture, rather than the second element, the label. +The main difference between the code snippets is that MXNet uses [transform_first](/api/python/docs/api/gluon/data/index.html#mxnet.gluon.data.Dataset.transform_first) method to indicate that the data transformation is done on the first element of the data batch, the MNIST picture, rather than the second element, the label. ### 2. Creating the model @@ -143,7 +143,7 @@ We used the Sequential container to stack layers one after the other in order to * After the model structure is defined, Apache MXNet requires you to explicitly call the model initialization function. -With a Sequential block, layers are executed one after the other. To have a different execution model, with PyTorch you can inherit from `nn.Module` and then customize how the `.forward()` function is executed. Similarly, in Apache MXNet you can inherit from [nn.Block](https://mxnet.apache.org/api/python/docs/api/gluon/mxnet.gluon.nn.Block.html) to achieve similar results. 
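For illustration, here is a minimal sketch of what such a custom block might look like in Gluon; the class name, layer sizes, and activation below are placeholders for this example rather than part of the MNIST model built in this tutorial.

```python
from mxnet.gluon import nn

class CustomMLP(nn.Block):
    """Illustrative block whose execution path is defined in forward()."""
    def __init__(self, **kwargs):
        super(CustomMLP, self).__init__(**kwargs)
        # Child blocks assigned as attributes are registered automatically.
        self.hidden = nn.Dense(128, activation='relu')
        self.output = nn.Dense(10)

    def forward(self, x):
        # Customize how data flows through the layers, analogous to
        # overriding .forward() on nn.Module in PyTorch.
        return self.output(self.hidden(x))
```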
+With a Sequential block, layers are executed one after the other. To have a different execution model, with PyTorch you can inherit from `nn.Module` and then customize how the `.forward()` function is executed. Similarly, in Apache MXNet you can inherit from [nn.Block](/api/python/docs/api/gluon/nn/index.html#mxnet.gluon.nn.Block) to achieve similar results. ### 3. Loss function and optimization algorithm diff --git a/docs/python_docs/python/tutorials/index.rst b/docs/python_docs/python/tutorials/index.rst index af906deba22b..762430d40598 100644 --- a/docs/python_docs/python/tutorials/index.rst +++ b/docs/python_docs/python/tutorials/index.rst @@ -56,13 +56,13 @@ Packages & Modules .. card:: :title: Symbol API - :link: packages/symbol/index.html + :link: /api/python/docs/api/symbol/index.html - How to use MXNet's Symbol API. + MXNet Symbol API has been depricated. API documentation is still available for reference. .. card:: :title: Autograd API - :link: packages/autograd/autograd.html + :link: /api/python/docs/tutorials/packages/autograd/index.html How to use Automatic Differentiation with the Autograd API. @@ -86,13 +86,13 @@ Performance .. card:: :title: Compression: int8 - :link: performance/int8.html + :link: performance/compression/int8.html How to use int8 in your model to boost training speed. .. card:: :title: MKL-DNN - :link: performance/backend/mkl-dnn.html + :link: performance/backend/mkldnn/mkldnn_quantization How to get the most from your CPU by using Intel's MKL-DNN. diff --git a/docs/python_docs/python/tutorials/packages/gluon/blocks/activations/activations.md b/docs/python_docs/python/tutorials/packages/gluon/blocks/activations/activations.md index e33e94182156..755253708b43 100644 --- a/docs/python_docs/python/tutorials/packages/gluon/blocks/activations/activations.md +++ b/docs/python_docs/python/tutorials/packages/gluon/blocks/activations/activations.md @@ -270,7 +270,7 @@ visualize_activation(mx.gluon.nn.Swish()) ## Next Steps Activations are just one component of neural network architectures. Here are a few MXNet resources to learn more about activation functions and how they they combine with other components of neural nets. -* Learn how to create a Neural Network with these activation layers and other neural network layers in the [gluon crash course](http://beta.mxnet.io/guide/getting-started/crash-course/2-nn.html). -* Check out the guide to MXNet [gluon layers and blocks](http://beta.mxnet.io/guide/packages/gluon/nn.html) to learn about the other neural network layers in implemented in MXNet and how to create custom neural networks with these layers. -* Also check out the [guide to normalization layers](http://beta.mxnet.io/guide/packages/gluon/normalization/normalization.html) to learn about neural network layers that normalize their inputs. -* Finally take a look at the [Custom Layer guide](http://beta.mxnet.io/guide/packages/gluon/custom_layer_beginners.html) to learn how to implement your own custom activation layer. +* Learn how to create a Neural Network with these activation layers and other neural network layers in the [gluon crash course](/api/python/docs/tutorials/getting-started/crash-course/index.html). +* Check out the guide to MXNet [gluon layers and blocks](/api/python/docs/tutorials/packages/gluon/blocks/nn.html) to learn about the other neural network layers in implemented in MXNet and how to create custom neural networks with these layers. 
+* Also check out the [guide to normalization layers](/api/python/docs/tutorials/packages/gluon/training/normalization/index.html) to learn about neural network layers that normalize their inputs. +* Finally take a look at the [Custom Layer guide](/api/python/docs/tutorials/extend/custom_layer.html) to learn how to implement your own custom activation layer. diff --git a/docs/python_docs/python/tutorials/packages/gluon/index.rst b/docs/python_docs/python/tutorials/packages/gluon/index.rst index c41bf3d9d116..e2bdb1856953 100644 --- a/docs/python_docs/python/tutorials/packages/gluon/index.rst +++ b/docs/python_docs/python/tutorials/packages/gluon/index.rst @@ -167,7 +167,7 @@ Training .. card:: :title: Autograd API - :link: ../autograd/autograd.html + :link: /api/python/docs/tutorials/packages/autograd/index.html How to use Automatic Differentiation with the Autograd API. diff --git a/docs/python_docs/python/tutorials/packages/optimizer/index.md b/docs/python_docs/python/tutorials/packages/optimizer/index.md index b52f212b9a35..b7b6c7453a89 100644 --- a/docs/python_docs/python/tutorials/packages/optimizer/index.md +++ b/docs/python_docs/python/tutorials/packages/optimizer/index.md @@ -19,9 +19,9 @@ Deep learning models are comprised of a model architecture and the model parameters. The model architecture is chosen based on the task - for example Convolutional Neural Networks (CNNs) are very successful in handling image based tasks and Recurrent Neural Networks (RNNs) are better suited for sequential prediction tasks. However, the values of the model parameters are learned by solving an optimization problem during model training. -To learn the parameters, we start with an initialization scheme and iteratively refine the parameter initial values by moving them along a direction that is opposite to the (approximate) gradient of the loss function. The extent to which the parameters are updated in this direction is governed by a hyperparameter called the learning rate. This process, known as gradient descent, is the backbone of optimization algorithms in deep learning. In MXNet, this functionality is abstracted by the [Optimizer API](http://beta.mxnet.io/api/gluon-related/mxnet.optimizer.html). +To learn the parameters, we start with an initialization scheme and iteratively refine the parameter initial values by moving them along a direction that is opposite to the (approximate) gradient of the loss function. The extent to which the parameters are updated in this direction is governed by a hyperparameter called the learning rate. This process, known as gradient descent, is the backbone of optimization algorithms in deep learning. In MXNet, this functionality is abstracted by the [Optimizer API](/api/python/docs/api/optimizer/index.html#module-mxnet.optimizer). -When training a deep learning model using the MXNet [gluon API](http://beta.mxnet.io/guide/packages/gluon/index.html), a gluon [Trainer](http://beta.mxnet.io/guide/packages/gluon/trainer.html) is initialized with the all the learnable parameters and the optimizer to be used to learn those parameters. A single step of iterative refinement of model parameters in MXNet is achieved by calling [`trainer.step`](http://beta.mxnet.io/api/gluon/_autogen/mxnet.gluon.Trainer.step.html) which in turn uses the gradient (and perhaps some state information) to update the parameters by calling `optimizer.update`. 
+When training a deep learning model using the MXNet [gluon API](/api/python/docs/tutorials/packages/gluon/index.html), a gluon [Trainer](/api/python/docs/tutorials/packages/gluon/training/trainer.html) is initialized with the all the learnable parameters and the optimizer to be used to learn those parameters. A single step of iterative refinement of model parameters in MXNet is achieved by calling [`trainer.step`](/api/python/docs/api/gluon/trainer.html#mxnet.gluon.Trainer.step) which in turn uses the gradient (and perhaps some state information) to update the parameters by calling `optimizer.update`. Here is an example of how a trainer with an optimizer is created for, a simple Linear (Dense) Network. @@ -35,7 +35,7 @@ optim = optimizer.SGD(learning_rate=0.1) trainer = gluon.Trainer(net.collect_params(), optimizer=optim) ``` -In model training, the code snippet above would be followed by a training loop which, at every iteration performs a forward pass (to compute the loss), a backward pass (to compute the gradient of the loss with respect to the parameters) and a trainer step (which updates the parameters using the gradient). See the [gluon Trainer guide](http://beta.mxnet.io/guide/packages/gluon/trainer.html) for a complete example. +In model training, the code snippet above would be followed by a training loop which, at every iteration performs a forward pass (to compute the loss), a backward pass (to compute the gradient of the loss with respect to the parameters) and a trainer step (which updates the parameters using the gradient). See the [gluon Trainer guide](/api/python/docs/tutorials/packages/gluon/training/trainer.html) for a complete example. We can also create the trainer by passing in the optimizer name and optimizer params into the trainer constructor directly, as shown below. @@ -45,14 +45,14 @@ trainer = gluon.Trainer(net.collect_params(), optimizer='adam', optimizer_params ``` ### What should I use? -For many deep learning model architectures, the `sgd` and `adam` optimizers are a really good place to start. If you are implementing a deep learning model and trying to pick an optimizer, start with [`'sgd'`](http://beta.mxnet.io/api/gluon-related/_autogen/mxnet.optimizer.SGD.html#mxnet.optimizer.SGD) as you will often get good enough results as long as your learning problem is tractable. If you already have a trainable model and you want to improve the convergence then you can try [`'adam'`](http://beta.mxnet.io/api/gluon-related/_autogen/mxnet.optimizer.SGD.html#mxnet.optimizer.Adam). If you would like to improve your model training process further, there are a number of specialized optimizers out there with many of them already implemented in MXNet. This guide walks through these optimizers in some detail. +For many deep learning model architectures, the `sgd` and `adam` optimizers are a really good place to start. If you are implementing a deep learning model and trying to pick an optimizer, start with [`'sgd'`](/api/python/docs/api/optimizer/index.html#mxnet.optimizer.SGD) as you will often get good enough results as long as your learning problem is tractable. If you already have a trainable model and you want to improve the convergence then you can try [`'adam'`](/api/python/docs/api/optimizer/index.html#mxnet.optimizer.Adam). If you would like to improve your model training process further, there are a number of specialized optimizers out there with many of them already implemented in MXNet. This guide walks through these optimizers in some detail. 
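As a quick sketch of this advice, switching between the two changes only the optimizer name and hyperparameters handed to the `Trainer`; the learning rates and momentum below are common illustrative defaults, not tuned recommendations.

```python
from mxnet import gluon

net = gluon.nn.Dense(1)
net.initialize()

# A reasonable first attempt: SGD, optionally with momentum.
trainer = gluon.Trainer(net.collect_params(), 'sgd',
                        {'learning_rate': 0.1, 'momentum': 0.9})

# If convergence is unsatisfactory, swap in Adam by changing only these two arguments.
trainer = gluon.Trainer(net.collect_params(), 'adam',
                        {'learning_rate': 0.001})
```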
## Stochastic Gradient Descent -[Gradient descent](https://en.wikipedia.org/wiki/Gradient_descent) is a general purpose algorithm for minimizing a function using information from the gradient of the function with respect to its parameters. In deep learning, the function we are interested in minimizing is the [loss function](http://beta.mxnet.io/guide/packages/gluon/loss.html). Our model accepts training data as inputs and the loss function tells us how good our model predictions are. Since the training data can routinely consist of millions of examples, computing the loss gradient on the full batch of training data is very computationally expensive. Luckily, we can effectively approximate the full gradient with the gradient of the loss function on randomly chosen minibatches of our training data. This variant of gradient descent is [stochastic gradient descent](https://en.wikipedia.org/wiki/Stochastic_gradient_descent). +[Gradient descent](https://en.wikipedia.org/wiki/Gradient_descent) is a general purpose algorithm for minimizing a function using information from the gradient of the function with respect to its parameters. In deep learning, the function we are interested in minimizing is the [loss function](/api/python/docs/tutorials/packages/gluon/loss/loss.html). Our model accepts training data as inputs and the loss function tells us how good our model predictions are. Since the training data can routinely consist of millions of examples, computing the loss gradient on the full batch of training data is very computationally expensive. Luckily, we can effectively approximate the full gradient with the gradient of the loss function on randomly chosen minibatches of our training data. This variant of gradient descent is [stochastic gradient descent](https://en.wikipedia.org/wiki/Stochastic_gradient_descent). Technically, stochastic gradient descent (SGD) refers to an online approximation of the gradient descent algorithm that computes the gradient of the loss function applied to a *single datapoint*, instead of your entire dataset, and uses this approximate gradient to update the model parameter values. However, in MXNet, and other deep learning frameworks, the SGD optimizer is agnostic to how many datapoints the loss function is applied to, and it is more effective to use a mini-batch loss gradient, as described earlier, instead of a single datapoint loss gradient. -### [SGD optimizer](http://beta.mxnet.io/api/gluon-related/_autogen/mxnet.optimizer.SGD.html#mxnet.optimizer.SGD) +### [SGD optimizer](/api/python/docs/api/optimizer/index.html#mxnet.optimizer.SGD) For an SGD optimizer initialized with learning rate $lr$, the update function accepts parameters (weights) $w_i$, and their gradients $grad(w_i)$, and performs the single update step: @@ -101,9 +101,9 @@ To create an SGD optimizer with momentum $\gamma$ and weight decay in MXNet simp sgd_optimizer = optimizer.SGD(learning_rate=0.1, wd=0., momentum=0.8) ``` -### [Nesterov Accelerated Stochastic Gradient Descent](http://beta.mxnet.io/api/gluon-related/_autogen/mxnet.optimizer.NAG.html#mxnet.optimizer.NAG) +### [Nesterov Accelerated Stochastic Gradient Descent](/api/python/docs/api/optimizer/index.html#mxnet.optimizer.NAG) -The momentum method of [Nesterov](https://goo.gl/M5xbuX) is a modification to SGD with momentum that allows for even faster convergence in practice. With Nesterov accelerated gradient (NAG) descent, the update term is derived from the gradient of the loss function with respect to *refined parameter values*. 
These refined parameter values are computed by performing a SGD update step using the momentum history as the gradient term. +The momentum method of [Nesterov] is a modification to SGD with momentum that allows for even faster convergence in practice. With Nesterov accelerated gradient (NAG) descent, the update term is derived from the gradient of the loss function with respect to *refined parameter values*. These refined parameter values are computed by performing a SGD update step using the momentum history as the gradient term. Alternatively, you can think of the NAG optimizer as performing two update steps: * The first (internal) update step approximates uses the current momentum history $v_i$ to calculate the refined parameter values $(w_i + \gamma \cdot v_i)$. This is also known as the lookahead step. @@ -132,7 +132,7 @@ nag_optimizer = optimizer.NAG(learning_rate=0.1, momentum=0.8) The gradient methods implemented by the optimizers described above use a global learning rate hyperparameter for all parameter updates. This has a well-documented shortcoming in that it makes the training process and convergence of the optimization algorithm really sensitive to the choice of the global learning rate. Adaptive learning rate methods avoid this pitfall by incorporating some history of the gradients observed in earlier iterations to scale step sizes (learning rates) to each learnable parameter in the model. -### [AdaGrad](http://beta.mxnet.io/api/gluon-related/_autogen/mxnet.optimizer.AdaGrad.html) +### [AdaGrad](/api/python/docs/api/optimizer/index.html#mxnet.optimizer.AdaGrad) The AdaGrad optimizer, which implements the optimization method originally described by [Duchi et al](http://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf), multiplies the global learning rate by the $L_2$ norm of the preceeding gradient estimates for each paramater to obtain the per-parameter learning rate. To achieve this, AdaGrad introduces a new term which we'll denote as $g^2$ - the accumulated square of the gradient of the loss function with respect to the parameters. @@ -152,7 +152,7 @@ To instantiate the Adagrad optimizer in MXNet you can use the following line of adagrad_optimizer = optimizer.AdaGrad(learning_rate=0.1, eps=1e-07) ``` -### [RMSProp](http://beta.mxnet.io/api/gluon-related/_autogen/mxnet.optimizer.RMSProp.html) +### [RMSProp](/api/python/docs/api/optimizer/index.html#mxnet.optimizer.RMSProp) RMSProp, introduced by [Tielemen and Hinton](http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf), is similar to AdaGrad described above, but, instead of accumulating the sum of historical square gradients, maintains an exponential decaying average of the historical square gradients, in order to give more weighting to more recent gradients. @@ -186,7 +186,7 @@ rmsprop_optimizer = optimizer.RMSProp(learning_rate=0.001, gamma1=0.9, gamma2=0. In the code snippet above, `gamma1` is $\beta$ in the equations above and `gamma2` is $\gamma$, which is only used where `centered=True`. -### [AdaDelta](http://beta.mxnet.io/api/gluon-related/_autogen/mxnet.optimizer.AdaDelta.html) +### [AdaDelta](/api/python/docs/api/optimizer/index.html#mxnet.optimizer.AdaDelta) AdaDelta was introduced to address some remaining lingering issues with AdaGrad and RMSProp - the selection of a global learning rate. AdaGrad and RMSProp assign each parameter its own learning rate but the per-parameter learning rate are still calculated using the global learning rate. 
In contrast, AdaDelta does not require a global learning rate, instead, it tracks the square of previous update steps, represented below as $\mathbb{E}[\Delta w^2]$ and uses the root mean square of the previous update steps as an estimate of the learning rate. @@ -205,7 +205,7 @@ Here is the code snippet creating the AdaDelta optimizer in MXNet. The argument adadelta_optimizer = optimizer.AdaDelta(rho=0.9, epsilon=1e-07) ``` -### [Adam](http://beta.mxnet.io/api/gluon-related/_autogen/mxnet.optimizer.Adam.html) +### [Adam](/api/python/docs/api/optimizer/index.html#mxnet.optimizer.Adam) Adam, introduced by [Kingma and Ba](https://arxiv.org/abs/1412.6980), is one of the popular adaptive algorithms for deep learning. It combines elements of RMSProp with momentum SGD. Like RMSProp, Adam uses the RootMeanSquare of decaying average of historical gradients but also explicitly keeps track of a decaying average of momentum and uses that for the update step direction. Thus, Adam accepts two hyperparameters $\beta_1$ and $\beta_2$ for momentum weighting and gradient RMS weighting respectively. Adam also accepts a global learning rate that's adaptively tuned to each parameter with the gradient RootMeanSquare. Finally, Adam also includes bias correction steps within the update that transform the biased estimates of first and second order moments, $v_{i+1}$ and $\mathbb{E}[g^2]_{i+1}$ to their unbiased counterparts $\tilde{v}_{i+1}$ and $\tilde{\mathbb{E}[g^2]}_{i+1}$ The Adam optimizer performs the update step described the following equations: @@ -223,7 +223,7 @@ In MXNet, you can construct the Adam optimizer with the following line of code. adam_optimizer = optimizer.Adam(learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-08) ``` -### [Adamax](http://beta.mxnet.io/api/gluon-related/_autogen/mxnet.optimizer.Adamax.html) +### [Adamax](/api/python/docs/api/optimizer/index.html#mxnet.optimizer.Adamax) Adamax is a variant of Adam also included in the original paper by [Kingma and Ba](https://arxiv.org/abs/1412.6980). Like Adam, Adamax maintains a moving average for first and second moments but Adamax uses the $L_{\infty}$ norm for the exponentially weighted average of the gradients, instead of the $L_2$ norm used in Adam used to keep track of the gradient second moment. The $L_{\infty}$ norm of a vector is equivalent to take the maximum absolute value of elements in that vector. $$ v_{i+1} = \beta_1 \cdot v_{i} + (1 - \beta_1) \cdot grad(w_i) $$ @@ -238,7 +238,7 @@ See the code snippet below for how to construct Adamax in MXNet. adamax_optimizer = optimizer.Adamax(learning_rate=0.002, beta1=0.9, beta2=0.999) ``` -### [Nadam](http://beta.mxnet.io/api/gluon-related/_autogen/mxnet.optimizer.Nadam.html) +### [Nadam](/api/python/docs/api/optimizer/index.html#mxnet.optimizer.Nadam) Nadam is also a variant of Adam and draws from the perspective that Adam can be viewed as a combination of RMSProp and classical Momentum (or Polyak Momentum). Nadam replaces the classical Momentum component of Adam with Nesterov Momentum (See [paper](http://cs229.stanford.edu/proj2015/054_report.pdf) by Dozat). The consequence of this is that the gradient used to update the weighted average of the momentum term is a lookahead gradient as is the case with NAG. 
The Nadam optimizer performs the update step: @@ -262,7 +262,7 @@ Training very deep neural networks can be time consuming and as such it is very While all the preceding optimizers, from SGD to Adam, can be readily used in the distributed setting, the following optimizers in MXNet provide extra features targeted at alleviating some of the problems associated with distributed training. -### [Signum](http://beta.mxnet.io/api/gluon-related/_autogen/mxnet.optimizer.Signum.html) +### [Signum](/api/python/docs/api/optimizer/index.html#mxnet.optimizer.Signum) In distributed training, communicating gradients across multiple worker nodes can be expensive and create a performance bottleneck. The Signum optimizer addresses this problem by transmitting just the sign of each minibatch gradient instead of the full precision gradient. In MXNet, the signum optimizer implements two variants of compressed gradients described in the paper by [Bernstein et al](https://arxiv.org/pdf/1802.04434.pdf). The first variant, achieved by constructing the Signum optimizer with `momentum=0`, implements SignSGD update which performs the update below. @@ -281,7 +281,7 @@ Here is how to create the signum optimizer in MXNet. signum_optimizer = optimizer.Signum(learning_rate=0.01, momentum=0.9, wd_lh=0.0) ``` -### [LBSGD](http://beta.mxnet.io/api/gluon-related/_autogen/mxnet.optimizer.LBSGD.html) +### [LBSGD](/api/python/docs/api/optimizer/index.html#mxnet.optimizer.LBSGD) LBSGD stands for Large Batch Stochastic Gradient Descent and implements a technique where Layer-wise Adaptive Rate Scaling (LARS) is used to maintain a separate learning rate for each layer of the neural network. LBSGD has no additional modifications to SGD and performs the same parameter update steps as the SGD optimizer described above. LBSGD was introduced by [You et al](https://arxiv.org/pdf/1708.03888.pdf) for distributed training with data-parallel synchronous SGD across multiple worker nodes to overcome the issue of reduced model accuracy when the number of workers, and by extension effective batch size, is increased. @@ -308,7 +308,7 @@ LBSGD has a number of extra keyword arguments described below * `updates_per_epoch` - How many updates to the learning rate to perform every epoch. For example during warmup the warmup strategy is applied to increase the learning rate a total of `warmup_epochs*updates_per_epoch` number of times. * `begin_epoch` - The epoch at which to start warmup. -### [DCASGD](http://beta.mxnet.io/api/gluon-related/_autogen/mxnet.optimizer.DCASGD.html) +### [DCASGD](/api/python/docs/api/optimizer/index.html#mxnet.optimizer.DCASGD) The DCASGD optimizer implements Delay Compensated Asynchronous Stochastic Gradient Descent by [Zheng et al](https://arxiv.org/pdf/1609.08326.pdf). In asynchronous distributed SGD, it is possible that a training worker node add its gradients too late to the global (parameter) server resulting in a delayed gradient being used to update the current parameters. DCASGD addresses this issue of delayed gradients by compensating for this delay in the parameter update steps. @@ -328,7 +328,7 @@ Before deep neural networks became popular post 2012, people were already solvin The class of optimization algorithms designed to tackle online learning problems have also seen some success in offline training of deep neural models. The following optimizers are algorithms taken from online learning that have been implemented in MXNet. 
-### [FTRL](http://beta.mxnet.io/api/gluon-related/_autogen/mxnet.optimizer.Ftrl.html) +### [FTRL](/api/python/docs/api/optimizer/index.html#mxnet.optimizer.Ftrl) FTRL stands for Follow the Regularized Leader and describes a family of algorithms originally designed for online learning tasks. @@ -351,7 +351,7 @@ Here is how to initialize the FTRL optimizer in MXNet ftrl_optimizer = optimizer.Ftrl(lamda1=0.01, learning_rate=0.1, beta=1) ``` -### [FTML](http://beta.mxnet.io/api/gluon-related/_autogen/mxnet.optimizer.FTML.html) +### [FTML](/api/python/docs/api/optimizer/index.html#mxnet.optimizer.FTML) FTML stands for Follow the Moving Leader and is a variant of the FTRL family of algorithms adapted specifically to deep learning. Regular FTRL algorithms, described above, solve an optimization problem every update that involves the sum of all previous gradients. This is not well suited for the non-convex loss functions in deep learning. In the non-convex settings, older gradients are likely uninformative as the parameter updates can move to converge towards different local minima at different iterations. FTML addresses this problem by reweighing the learning subproblems in each iteration as shown below. @@ -379,7 +379,7 @@ Here `beta1` and `beta2` are similar to the arguments in the Adam optimizer. ## Bayesian SGD A notable shortcoming of deep learning is that the model parameters learned after training are only point estimates, therefore deep learning model predictions have no information about uncertainty or confidence bounds. This is in contrast to a fully Bayesian approach which incorporates prior distributions on the model parameters and estimates the model parameters as belonging to a posterior distribution. This approach allows the predictions of a bayesian model to have information about uncertainty, as you can sample different values from the posterior distribution to obtain different model parameters. One approach to close the bayesian gap in deep learning is to incorporate the gradient descent algorithm with properties that allow the model parameters to converge to a distribution instead of a single value or point estimate. -### [SGLD](http://beta.mxnet.io/api/gluon-related/_autogen/mxnet.optimizer.SGLD.html) +### [SGLD](/api/python/docs/api/optimizer/index.html#mxnet.optimizer.SGLD) Stochastic Gradient Langevin Dynamics or SGLD was introduced to allow uncertainties around model parameters to be captured directly during model training. With every update in SGLD, the learning rate decreases to zero and a gaussian noise of known variances is injected into the SGD step. This has the effect of having the training parameters converge to a sufficient statistic for a posterior distribution instead of simply a point estimate of the model parameters. SGLD performs the parameter update: @@ -401,14 +401,14 @@ If you would like to use a particular optimizer that is not yet implemented in M Step 1: First create a function that is able to perform your desired updates given the weights, gradients and other state information. 
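For example, a bare-bones sketch of such an update function for a plain SGD step could look like the following; the function name, signature, and weight decay term are hypothetical and only meant to show the general shape of Step 1.

```python
import mxnet as mx

def simple_sgd_update(weight, grad, lr, wd=0.0):
    """Hypothetical helper: apply one SGD step to `weight` in place.

    `weight` and `grad` are mx.nd.NDArray instances of the same shape;
    `lr` is the learning rate and `wd` an optional weight decay.
    """
    # Fold the weight decay contribution into the gradient, then take a step
    # in the negative gradient direction, writing the result back in place.
    grad = grad + wd * weight
    weight[:] = weight - lr * grad

# Tiny usage sketch with dummy values.
w = mx.nd.array([1.0, 2.0])
g = mx.nd.array([0.1, -0.2])
simple_sgd_update(w, g, lr=0.1)
```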
-Step 2: You will have to write your own optimizer class that extends the [base optimizer class](http://beta.mxnet.io/api/gluon-related/_autogen/mxnet.optimizer.Optimizer.html#mxnet.optimizer.Optimizer) and override the following functions +Step 2: You will have to write your own optimizer class that extends the [base optimizer class](/api/python/docs/api/optimizer/index.html#mxnet.optimizer.Optimizer) and override the following functions * `__init__`: accepts the parameters of your optimizer algorithm as inputs as saves them as member variables. * `create_state`: If your custom optimizer uses some additional state information besides the gradient, then you should implement a function that accepts the weights and returns the state. * `update`: Implement your optimizer update function using the function in Step 1 Step 3: Register your optimizer with `@register` decorator on your optimizer class. -See the [source code](http://beta.mxnet.io/_modules/mxnet/optimizer/optimizer.html#NAG) for the NAG optimizer for a concrete example. +See the [source code](/api/python/docs/api/optimizer/index.html#mxnet.optimizer.NAG) for the NAG optimizer for a concrete example. ## Summary * MXNet implements many state-of-the-art optimizers which can be passed directly into a gluon trainer object. Calling `trainer.step` during model training uses the optimizers to update the model parameters. @@ -421,9 +421,9 @@ See the [source code](http://beta.mxnet.io/_modules/mxnet/optimizer/optimizer.ht ## Next Steps While optimization and optimizers play a significant role in deep learning model training, there are still other important components to model training. Here are a few suggestions about where to look next. -* The [trainer API](http://beta.mxnet.io/api/gluon/mxnet.gluon.Trainer.html) and [guide](http://beta.mxnet.io/guide/packages/gluon/trainer.html) have information about how to construct the trainer that encapsulate the optimizers and will actually be used in your model training loop. -* Check out the guide to MXNet gluon [Loss functions](http://beta.mxnet.io/guide/packages/gluon/loss.html) and [custom losses](http://beta.mxnet.io/guide/packages/gluon/custom-loss/custom-loss.html) to learn about the loss functions optimized by these optimizers, see what loss functions are already implemented in MXNet and understand how to write your own custom loss functions. -* Take a look at the [guide to parameter initialization](http://beta.mxnet.io/guide/packages/gluon/init.html) in MXNet to learn about what initialization schemes are already implemented, and how to implement your custom initialization schemes. -* Also check out the [autograd guide](http://beta.mxnet.io/guide/packages/autograd/autograd.html) to learn about automatic differentiation and how gradients are automatically computed in MXNet. -* Make sure to take a look at the [guide to scheduling learning rates](https://mxnet.apache.org/versions/master/tutorials/gluon/learning_rate_schedules.html) to learn how to create learning rate schedules to supercharge the convergence of your optimizer. -* Finally take a look at the [KVStore API](http://beta.mxnet.io/api/gluon-related/mxnet.kvstore.KVStore.html#mxnet.kvstore.KVStore) to learn how parameter values are synchronized over multiple devices. 
+* The [trainer API](/api/python/docs/api/gluon/trainer.html) and [guide](/api/python/docs/tutorials/packages/gluon/training/trainer.html) have information about how to construct the trainer that encapsulate the optimizers and will actually be used in your model training loop. +* Check out the guide to MXNet gluon [Loss functions](/api/python/docs/tutorials/packages/gluon/loss/loss.html) and [custom losses](/api/python/docs/tutorials/packages/gluon/loss/custom-loss.html) to learn about the loss functions optimized by these optimizers, see what loss functions are already implemented in MXNet and understand how to write your own custom loss functions. +* Take a look at the [guide to parameter initialization](/api/python/docs/tutorials/packages/gluon/blocks/init.html) in MXNet to learn about what initialization schemes are already implemented, and how to implement your custom initialization schemes. +* Also check out the [autograd guide](/api/python/docs/tutorials/packages/autograd/index.html) to learn about automatic differentiation and how gradients are automatically computed in MXNet. +* Make sure to take a look at the [guide to scheduling learning rates](/api/python/docs/tutorials/packages/gluon/training/learning_rates/learning_rate_schedules.html) to learn how to create learning rate schedules to supercharge the convergence of your optimizer. +* Finally take a look at the [KVStore API](/api/python/docs/tutorials/packages/kvstore/index.html) to learn how parameter values are synchronized over multiple devices. diff --git a/docs/static_site/src/pages/api/cpp/docs/tutorials/mxnet_cpp_inference_tutorial.md b/docs/static_site/src/pages/api/cpp/docs/tutorials/mxnet_cpp_inference_tutorial.md index 6d9998d7a7a9..dcc96d4547ca 100644 --- a/docs/static_site/src/pages/api/cpp/docs/tutorials/mxnet_cpp_inference_tutorial.md +++ b/docs/static_site/src/pages/api/cpp/docs/tutorials/mxnet_cpp_inference_tutorial.md @@ -34,12 +34,13 @@ We will focus on the MXNet C++ API. We have slightly modified the code in [C++ I ## Prerequisites To complete this tutorial, you need to: -- Complete the training part of [Gluon end to end tutorial](/api/python/docs/tutorials/getting-started/gluon_from_experiment_to_deployment.html) -- Learn the basics about [MXNet C++ API](/api/cpp) +- Complete the training part of [Gluon end to end tutorial](/api/python/docs/tutorials/getting-started/gluon_from_experiment_to_deployment.html). +- Learn the basics about [MXNet C++ API](/api/cpp). ## Setup the MXNet C++ API -To use the C++ API in MXNet, you need to build MXNet from source with C++ package. Please follow the [built from source guide](/get_started/ubuntu_setup.html), and [C++ Package documentation](/api/cpp) + +To use the C++ API in MXNet, you need to build MXNet from source with C++ package. Please follow the [built from source guide](/get_started/ubuntu_setup.html), and [C++ Package documentation](/api/cpp). The summary of those two documents is that you need to build MXNet from source with `USE_CPP_PACKAGE` flag set to 1. For example: `make -j USE_CPP_PACKAGE=1`. 
## Load the model and run inference diff --git a/docs/static_site/src/pages/api/faq/float16.md b/docs/static_site/src/pages/api/faq/float16.md index 6ffb04054554..8a6d413449a6 100644 --- a/docs/static_site/src/pages/api/faq/float16.md +++ b/docs/static_site/src/pages/api/faq/float16.md @@ -59,7 +59,7 @@ net.cast('float16') data = data.astype('float16', copy=False) ``` -If you are using images and DataLoader, you can also use a [Cast transform]({{'/api/python/docs/api/gluon/_autogen/mxnet.gluon.data.vision.transforms.Cast.html#mxnet.gluon.data.vision.transforms.Cast'|relative_url}}). +If you are using images and DataLoader, you can also use a [Cast transform](/api/python/docs/api/gluon/data/vision/transforms/index.html#mxnet.gluon.data.vision.transforms.Cast). 3. It is preferable to use **multi_precision mode of optimizer** when training in float16. This mode of optimizer maintains a master copy of the weights in float32 even when the training (i.e. forward and backward pass) is in float16. This helps increase precision of the weight updates and can lead to faster convergence in some scenarios. @@ -67,7 +67,7 @@ If you are using images and DataLoader, you can also use a [Cast transform]({{'/ optimizer = mx.optimizer.create('sgd', multi_precision=True, lr=0.01) ``` -You can play around with mixed precision using the image classification [example](https://github.com/apache/incubator-mxnet/blob/master/example/gluon/image_classification.py). We suggest using the Caltech101 dataset option in that example and using a ResNet50V1 network so you can quickly see the performance improvement and how the accuracy is unaffected. Here's the starter command to run this example. +You can play around with mixed precision using the image classification [example](https://github.com/apache/incubator-mxnet/blob/master/example/image-classification/train_imagenet.py). We suggest using the Caltech101 dataset option in that example and using a ResNet50V1 network so you can quickly see the performance improvement and how the accuracy is unaffected. Here's the starter command to run this example. ```bash python image_classification.py --model resnet50_v1 --dataset caltech101 --gpus 0 --num-worker 30 --dtype float16 diff --git a/julia/docs/mkdocs.yml b/julia/docs/mkdocs.yml index 383505621540..880fad24d5b8 100644 --- a/julia/docs/mkdocs.yml +++ b/julia/docs/mkdocs.yml @@ -16,7 +16,7 @@ # under the License. site_name: MXNet.jl -repo_url: https://github.com/dmlc/MXNet.jl +repo_url: https://github.com/apache/incubator-mxnet/tree/master/julia#mxnet theme: material diff --git a/julia/docs/src/tutorial/char-lstm.md b/julia/docs/src/tutorial/char-lstm.md index 1109f3554c17..d1f8e43db7c8 100644 --- a/julia/docs/src/tutorial/char-lstm.md +++ b/julia/docs/src/tutorial/char-lstm.md @@ -31,7 +31,7 @@ networks yet, the example shown here is an implementation of LSTM by using the default FeedForward model via explicitly unfolding over time. We will be using fixed-length input sequence for training. The code is adapted from the [char-rnn example for MXNet's Python -binding](https://github.com/dmlc/mxnet-notebooks/blob/master/python/tutorials/char_lstm.ipynb), +binding](/api/r/docs/tutorials/char_rnn_model), which demonstrates how to use low-level [Symbolic API](@ref) to build customized neural network models directly. 
From de541cdd694ae7ce4b4b21231dae857cc505b319 Mon Sep 17 00:00:00 2001
From: Talia Chopra
Date: Fri, 8 Nov 2019 10:23:12 -0800
Subject: [PATCH 2/5] fixing merge conflict

---
 3rdparty/mkldnn | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/3rdparty/mkldnn b/3rdparty/mkldnn
index a0a87d662ede..9008d8ab096a 160000
--- a/3rdparty/mkldnn
+++ b/3rdparty/mkldnn
@@ -1 +1 @@
-Subproject commit a0a87d662edeef38d01db4ac5dd25f59a1f0881f
+Subproject commit 9008d8ab096ae29f158840231ff431aea8bf3467

From 0b2fa235acf0a254d75d4ad914bcb60d74a1b226 Mon Sep 17 00:00:00 2001
From: Talia <31782251+TEChopra1000@users.noreply.github.com>
Date: Fri, 8 Nov 2019 23:15:58 -0800
Subject: [PATCH 3/5] Nudging test.

---
 docs/python_docs/python/tutorials/index.rst | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/python_docs/python/tutorials/index.rst b/docs/python_docs/python/tutorials/index.rst
index 762430d40598..c130ce607ea6 100644
--- a/docs/python_docs/python/tutorials/index.rst
+++ b/docs/python_docs/python/tutorials/index.rst
@@ -58,7 +58,7 @@ Packages & Modules
       :title: Symbol API
       :link: /api/python/docs/api/symbol/index.html
 
-      MXNet Symbol API has been depricated. API documentation is still available for reference. 
+      The MXNet Symbol API has been deprecated. API documentation is still available for reference.
 
    .. card::
       :title: Autograd API
@@ -172,4 +172,4 @@ Next steps
    packages/index
    performance/index
    deploy/index
-   extend/index
\ No newline at end of file
+   extend/index

From d486e2dc158aba42824577eb66378e8b61daf9ee Mon Sep 17 00:00:00 2001
From: Talia <31782251+TEChopra1000@users.noreply.github.com>
Date: Sun, 10 Nov 2019 22:47:56 -0800
Subject: [PATCH 4/5] Update julia/docs/src/tutorial/char-lstm.md

---
 julia/docs/src/tutorial/char-lstm.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/julia/docs/src/tutorial/char-lstm.md b/julia/docs/src/tutorial/char-lstm.md
index d1f8e43db7c8..c7dc9d6c07db 100644
--- a/julia/docs/src/tutorial/char-lstm.md
+++ b/julia/docs/src/tutorial/char-lstm.md
@@ -31,7 +31,7 @@ networks yet, the example shown here is an implementation of LSTM by using the
 default FeedForward model via explicitly unfolding over time. We
 will be using fixed-length input sequence for training. The code is
 adapted from the [char-rnn example for MXNet's Python
-binding](/api/r/docs/tutorials/char_rnn_model),
+binding](https://github.com/apache/incubator-mxnet/blob/8004a027ad6a73f8f6eae102de8d249fbdfb9a2d/example/rnn/old/char-rnn.ipynb),
 which demonstrates how to use low-level [Symbolic API](@ref) to build
 customized neural network models directly.

From 40099d786c0e6ecf1c3648c2f28c28d9e5efff18 Mon Sep 17 00:00:00 2001
From: Talia Chopra
Date: Tue, 12 Nov 2019 13:28:39 -0800
Subject: [PATCH 5/5] fixing mkldnn version to match upstream/master

---
 3rdparty/mkldnn | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/3rdparty/mkldnn b/3rdparty/mkldnn
index 9008d8ab096a..a0a87d662ede 160000
--- a/3rdparty/mkldnn
+++ b/3rdparty/mkldnn
@@ -1 +1 @@
-Subproject commit 9008d8ab096ae29f158840231ff431aea8bf3467
+Subproject commit a0a87d662edeef38d01db4ac5dd25f59a1f0881f