Soohyun’s Machine-learning

Backpropagation and Neural Networks 본문

Lectures/CS231N (ML basic)

Backpropagation and Neural Networks

Alex_Rose 2017. 10. 15. 10:16


 



   

     ▼이게         ▼이렇게 되는 건 quotient rule을 사용해서 f(x)/g(x)의 x에 대한 미분은 ( f'(x)g(x) - f(x)g'(x) ) / g(x)^2 라서 





Always check : the gradient with respect (w.r.t) to a variable should have the same shape as the variable



Layer 를 체크할 때, input 은 layer로 안 친다. hidden layer 와 output layer 만 NN으로 쳐서 2 layer라고 하면 hidden layer 1개, output layer 1개라는 이야기. 





Table of Contents:

- Introduction

- Simple expressions, interpreting the gradient

- Compound expressions, chain rule, backpropagation

- Intuitive understanding of backpropagation

- Modularity : Sigmoid example

- Backprop in practice : Staged computation

- Patterns in backward flow

- Gradients for vectorized operations

- Summary



Introduction


Motivation.

In this section we will develop expertise with an intuitive understanding of backpropagation, which is a way of computing gradients of expressions through recursive application of chain rule. Understanding of this process and its subtleties is critical for you to understand, and effectively develop, design and debug Neural Networks.


Problem statement.

The core problem studied in this section is as follows: We are given some function f(x) where x is a vector of inputs and we are interested in computing the gradient of f at x (i.e. )


Motivation. 

Recall that the primary reason we are interested in this problem is that in the specific case of Neural Networks, f will correspond to the loss function (

) and the inputs x will consist of the training data and the neural network weights. For example, the loss could be the SVM loss function and the inputs are both the training data  and the weights and biases W, b. Note that (as is usually the case in Machine Learning) we think of the training data as given and fixed, and of the weights as variables we have control over. Hence, even though we can easily use backpropagation to compute the gradient on the input examples , in practice we usually only compute the gradient for the parameters (e.g. W,b) so that we can use it to perform a parameter update. However, as we will see later in the class the gradient on  can still be useful sometimes, for example for purposes of visualization and interpreting what the Neural Network might be doing. 


If you are coming to this class and you're comfortable with deriving gradients with chain rule, we would still like to encourage you to at least skim this section, since it presents a rarely developed view of backpropagation as backward flow in real-valued circuits and any insights you'll gain may help you throughout the class.



Simple expressions and interpretation of the gradient


Lets start simple so that we can develop the notation and conventions for more complex expressions. Consider a simple multiplication function of two numbers f(x, y) = xy. It is a matter of simple calculus to derive the partial derivative for either input:



interpretation.

Keep in mind what the derivatives tell you: They indicate the rate of change of a function with respect to that variable surrounding and infinitesimally small region near a particular point:




A technical note is that the division sign on the left-hand side is, unlike the division sign on the right-hand side, not a division. Instead, this notation indicates that the operator is being applied to the function , and returns a different function (the derivative). A nice way to think about the expression above is that when is very small, then the functions is well-approximated by a straight line, and the derivative is its slope. In other words, the derivative on each variable tells you the sensitivity of the whole expression on its value. For example, if  then and the derivative on . This tells us that if we were to increase the value of this variable by a tiny amount, the effect on the whole expression would be to decrease it (due to the negative sign), and by three times that amount. This can be seen by rearranging the above equation . Analogously, since , we expect that increasing the value of  by some very small amount  would also increase the output of the function (due to the positive sign), and by.



As mentioned, the gradient  is the vector of partial derivatives, so we have that . Even though the gradient is technically a vector, we will often use terms such as "the gradient on x" instead of the technically correct phrase "the partial derivative on x" for simplicity. 


We can also derive the derivatives for the addition operation:





that is, the derivative on both x, y is one regardless of what the values of x,y are. This makes sense, since increasing either x,y would increase the output of f, and the rate of that increase would be independent of what the actual values of x,y are (unlike the case of multiplication above). The last function we'll use quite a bit in the class is the max operation:





That is, the (sub) gradient is 1 on the input that was larger and 0 on the other input. Intuitively, if the inputs are x = 4, y = 2, then the max is 4, and the function is not sensitive to the setting of y. That is, if we were to increase it by a tiny amount h, the function would keep outputting 4, and therefore the gradient is zero: there is no effect. Of course, if we were to change y by a large amount (e.g. larger than 2), then the value of f would change, but the derivatives tell us nothing about the effect of such large changes on the inputs of a function; They are only informative for tiny, infinitesimally small changes on the inputs, as indicated by the in its definition. 




Compound expressions with chain rule


Lets now start to consider more complicated expressions that involve multiple composed functions, such as f(x,y,z) = (x + y)z. This expression is still simple enough to differentiate directly, but we'll take a particular approach to it that will be helpful with understanding the intuition behind backpropagation. In particular, note that this expression can be broken down into two expressions: q = x + y and f = qz. Moreover, we know how to compute the derivatives of both expressions separately, as seen in the previous section. 


 is just multiplication of   and  , so  and  is addition of   and   so . However, we don't necessarily care about the gradient on the intermediate value   the value of   is not useful. Instead, we are ultimately interested in the gradient of  with respect to its inputs  . The chain rule tells us that the correct way to "chain" these gradient expressions together is through multiplication. For example, . In practice this is simply a multiplication of the two numbers that hold the two gradients. Lets see this with an example:





# set some inputs

x = -2;

y = 5;

z = -4


# perform the forward pass

q = x + y                                    # q becomes 3

f = q * z                                     # f becomes -12


# perform the backward pass (backpropagation) in reverse order

# first backprop through f = q * z

dfdz = q                                     # df/dz = q, so gradient on z becomes 3

dfdq = z                                     # df/dq = z, so gradient on q becomes -4


# now backprop through q = x + y

dfdx = 1.0 * dfdq                          # dq/dx = 1. And the multiplication here is the chain rule!

dfdy = 1.0 * dfdq                          # dq/dy = 1





At the end we are left with the gradient in the variables [dfdx, dfdy, dfdz], which tell us the sensitivity of the variables x, y, z on f ! This is the simplest example of backpropagation. Going forward, we will want to use a more concise notation so that we don't have to keep writing the df part. That is, for example instead of dfdq we would simply write dq, and always assume that the gradient is with respect to the final output. 


This computation can also be nicely visualized with a circuit diagram:







Intuitive understanding of backpropagation


Notice that backpropagation is a beautifully local process. Every gate in a circuit diagram gets some inputs and can right away compute two things: 1. its output value and 2. the local gradient of its inputs with respect to its output value. Notice that the gates can do this completely independently without being aware of any of the details of the full circuit that they are embedded in. However, once the forward pass is over, during backpropagation the gate will eventually learn about the gradient of its output value on the final output of the entire circuit. Chain rule says that the gate should take that gradient and multiply it into every gradient it normally computes for all of its inputs. 



Lets get an intuition for how this works by referring again to the example. The add gate received inputs [-2, 5] and computed output 3. Since the gate is computing the addition operation, its local gradient for both of its inputs is +1. The rest of the circuit computed the final value, which is -12. During the backward pass in which the chain rule is applied recursively backwards through the circuit, the add gate (which is an input to the multiply gate) learns that the gradient for its output was -4. If we anthropomorphize the circuit as wanting to output a higher value (which can help with intuition), then we can think of the circuit as "wanting" the output of the add gate to be lower (due to negative sign), and with a force of 4. To continue the recurrence and to chain the gradient, the add gate takes that gradient and multiplies it to all of the local gradients for its inputs (making the gradient on both x and y 1 * -4 = -4). Notice that this has the desired effect: If x,y were to decrease (responding to their negative gradient) then the add gate's output would decrease, which in turn makes the multiply gate's output increase. 


Backpropagation can thus be thought of as gates communicating to each other (through the gradient signal) whether they want their outputs to increase or decrease (and how strongly), so as to make the final output value higher. 



Modularity : Sigmoid example


The gates we introduced above are relatively arbitrary. Any kind of differentiable function can act as a gate, and we can group multiple gates into a single gate, or decompose a function into multiple gates whenever it is convenient. Lets look at another expression that illustrates this point:




as we will see later in the class, this expression describes a 2-dimensional neuron (with inputs x and weights w) that uses the sigmoid activation function. But for now lets think of this very simply as just a function from inputs w, x to a single number. The function is made up of multiple gates. In addition to the ones described already above (add, mul, max), there are four more:



Where the functions   translate the input by a constant of c and scale the input by a constant of a, respectively. These are technically special cases of addition and multiplication, but we introduce them as (new) unary gates here since we do need the gradients for the constants. c, a. The full circuit then looks as follows:








In the example above, we see a long chain of function applications that operates on the results of the dot product between w, x. The function that these operations implement is called the sigmoid function . It turns out that the derivative of the sigmoid function with respect to its input simplifies if you perform the derivation (after a fun tricky part where we add and subtract  a 1 in the numerator):




As we see, the gradient turns out to simplify and becomes surprisingly simple. For example, the sigmoid expression receives the input 1.0 and computes the output 0.73 during the forward pass. The derivation above shows that the local gradient would simply be (1 - 0.73) * 0.73 ~= 0.2, as the circuit computed before (see the image above), except this way it would be done with single, simple and efficient expression (and with less numerical issues). Therefore, in any real practical application it would be very useful to group these operations into a single gate. Lets see the backprop for this neuron in code:




w = [2, -3, -3]             # assume some random weights and data

x = [-1, -2]


# forward pass

dot = w[0] * x[0] + w[1] * x[1] + w[2]

f = 1.0 / (1 + math.exp(-dot))        # sigmoid function


# backward pass through the neuron (backpropagation)

ddot = (1 - f) * f         # gradient on dot variable, using the sigmoid gradient derivation

dx = [w[0] * ddot, w[1] * ddot]     # backprop into x

dw = [x[0] * ddot, x[1] * ddot, 1.0 * ddot]   # backprop into w

# we're done ! we have the gradients on the inputs to the circuit


 


Implementation protip : staged backpropagation.

As shown in the code above, in practice it is always helpful to break down the forward pass into stages that are easily backpropped through. For example here we created an intermediate variable dot which holds the output of the dot product between W and x. During backward pass we then successively compute (in reverse order) the corresponding variables (e.g. ddot, and ultimately dw, dx) that hold the gradients of those variables. 


The point of this section is that the details of how the backpropagation is performed, and which parts of the forward function we think of as gates, is a matter of convenience. It helps to be aware of which parts of the expression have easy local gradients, so that they can be chained together with the least amount of code and effort. 



Backprop in practice: Staged computation


Lets see this with another example. Suppose that we have a function of the form:





To be clear, this function is completely useless and it's not clear why you would ever want to compute its gradient, except for the fact that it is a good example of backpropagation in practice. It is very important to stress that if you were to launch into performing the differentiation with respect to either x or y, you would end up with very large and complex expressions. However, it turns out that doing so is completely unnecessary because we don't need to have an explicit function written down that evaluates the gradient. We only have to know how to compute it. Here is how we would structure the forward pass of such expression:






















Comments