More formal treatment of multivariable chain rule

For those of you who want to see how the multivariable chain rule looks in the context of the limit definitions of various forms of the derivative. Created by Grant Sanderson.

  • jackeames
    Did Grant ever get around to making videos on the connection between Linear Algebra and MV Calc, like he mentioned at the end of the video? Or the more general version on the MV chain rule?
    (20 votes)
  • Taras.Pokalchuk
    I don't see how this is equal.
    (7 votes)
    • Pedro Vielman
      I'm assuming you are asking about the connection Grant mentions with the formal definition of the directional derivative. The whole limit that Grant writes in magenta is really the directional derivative, because if you go back to the video he mentions, the vector "a" is "v(t)" in this case, and the nudge "hv" is "h*dv/dt" (remember this is a vector, from Sal's videos in the previous section), so really it's all just the same! And that's where the directional derivative pops out.
      To be honest I wasn't seeing it at the beginning, but I do see it now and it's kinda amazing :)

      Cheers
      (19 votes)
  • tataganesh95
    At , how did Grant arrive at this expression (involving the error term) from the previous expression involving the limit (the formal definition of the derivative)?
    (7 votes)
  • ninextyuk
    It would be interesting to look at the formal proof of why, at , you can just cross out o(h).
    (4 votes)
  • Taras.Pokalchuk
    What if f(v(t)) is itself a vector-valued function? The gradient of that is not defined.
    (3 votes)
    • Sean
      Assuming that you are asking about the derivative of f(v(t)) with respect to t: if so, then you can write the function as a sum of scalar-valued functions that depend on v(t), each multiplied by the corresponding basis vector. Since the derivative operator is linear, it distributes over this sum, hence you would use the chain rule per component.
      More tangibly, suppose that you defined this function as <p(v(t)), q(v(t))> in two dimensions. Then the derivative will be d/dt(p(v(t))) i + d/dt(q(v(t))) j, and then we would need the chain rule at that point for each component.
      (2 votes)
  • Kumar
    At around minutes, we ignore o(h) because it's much smaller than h. I can understand o(h)/h -> 0, as h-> 0, but here o(h) is part of the input to a function f. It's not immediately clear how we can ignore this.
    (2 votes)
  • Caj M Norlén
    When you are re-writing v(t + h) from the above expression, shouldn't it be (-)o(h) or maybe it just doesn't matter?
    (2 votes)
  • Dania Zaheer
    Are there any videos on this with worked examples?
    (2 votes)
  • MustardManExtremeTurboEdition
    OK so if I am getting this right, d(f(v(t)))/dt is equal to the directional derivative of f at v(t) in the direction of v'(t) BECAUSE the directional derivative defines how f changes in relation to each component of v(t), and dotting that with v'(t) will tell us the total change in each component in v(t) while inside the function of f, based on a change of t. Basically it all just says how much f(v(t)) changes when t changes, but there are a bunch of hoops you jump through to get there because it's a lot of information in some very compact notation.

    I wanna know if I am correctly relating this to the limit definition. dv(t)*h/dt can really be thought of as an infinitesimal change in v(t), as h is effectively the change in t, and so f(v(t)+dv(t)) is really analogous to how we were saying f(t+h)-f(t), but with some negligible error, as we have the derivative of v(t) as part of the input.
    (2 votes)
  • Vikrant Jaltare
    At - formal definition of directional derivative part... we should take the unit vector along v' (v' cap) and not v'. I guess there should be a correction.
    (0 votes)
    • Malith Lakshan
      No, you don't. If you took the unit vector, the magnitude of the vector would play no role in the final derivative. The directional derivative is not exactly the same thing as the slope of a function. What we need here is the directional derivative of the function, as we are on the way to finding the derivative of f with respect to t. The said vector v is just an intermediate step.
      (4 votes)

Video transcript

- [Voiceover] Hello everyone. So this is what I might call a more optional video. In the last couple videos, I talked about this multivariable chain rule, and I gave some justification. And it might have been considered a little bit hand-wavy by some. I was doing a lot of things that looked kind of like taking a derivative with respect to t, and then multiplying that by an infinitesimal quantity, dt, and thinking of canceling those out. And some people might say, "Ah! But this isn't really a fraction. That's a derivative, that's a differential operator, and you're treating it incorrectly." And while that's true, the intuitions underlying a lot of this actually match the formal argument pretty well.
So what I wanna do here is just talk about what the formal argument behind the multivariable chain rule is, and just to remind ourselves of the setup of where we are: you're thinking of v as a vector-valued function, so this is something that takes as an input t, which lives on a number line, and then v maps this to some kind of high-dimensional space. In the simplest case, you might just think of that as a two-dimensional space, maybe it's three-dimensional space. Or it could be 100-dimensional. You don't have to literally be visualizing it. And then f, our function f, somehow takes that 100-dimensional space, or two-dimensional, or three-dimensional, whatever it is, and then maps it onto the number line. So the overall effect of the composition function is to just take a real number to a real number, so it's a single-variable function. So that's why we're taking this ordinary derivative, rather than a partial derivative, or gradient, or anything like that. But because it goes through a multi-dimensional space, and you have this intermediary, multivariable nature to it, that's why you have a gradient and a vector-valued derivative.
With the formal argument, the first thing you might do is just write out the formal definition of a derivative.
And in this case, it's a limit. Definitions of derivatives are always gonna be some kind of limit as a variable goes to zero. And here, you're loosely thinking about h as being dt. And you could write delta t, but it's common to use h just because that can be used for whatever your differential quantity is. So that's in the denominator, 'cause you're thinking of it as dt. And the top is whatever the change to this whole function is when you nudge that input by h. And what I mean by that is you'll take f of v, not of t, but of t plus h, that kind of nudged output value, and you're wondering how different that is from f of v of t, the original value. So this is what happens when you just apply the formal definition of the derivative, the ordinary derivative, to your composition function.
And now, what do you do as you're trying to reason about what this should equal? And a good place to start, actually, is to look back to the intuition that I was giving for the multivariable chain rule in the first place. You imagine nudging your input by some dt, some tiny change, and I was saying, oh, so that causes a change in the intermediary space of some kind, you know, you could call it dv, a change in the vector. And the way that you're thinking of that is that you take the vector-valued derivative and multiply it by dt; it's the proportionality constant between the size of your nudge and the resulting vector. And loosely, you might imagine those dt's crossing out as if they were fractions. It doesn't really matter. And then you say, "What change does this dv cause for f?" And by definition, the resulting nudge to the output space of f is the directional derivative of the function f in the direction of whatever your vector nudge is.
So this is the loose intuition, and where does that carry over to formality? You say, "Well, in this intermediary space, we have to deal with the vector-valued derivative of v."
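In symbols, the limit just described might be written as follows. The transcript only describes the board verbally, so the arrow notation for vectors is a notational assumption here, not a copy of what was written in the video.

```latex
% Ordinary derivative of the composition, with h playing the role of dt
\frac{d}{dt} f\big(\vec{v}(t)\big)
  = \lim_{h \to 0} \frac{f\big(\vec{v}(t+h)\big) - f\big(\vec{v}(t)\big)}{h}

% Loose intuition: a nudge dt in the input produces a vector nudge dv
d\vec{v} = \frac{d\vec{v}}{dt}\, dt
```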
So it might be a good thing to just write down that definition, right? Write down the definition for the vector-valued derivative of v. Again, it looks almost identical. All these derivative definitions really do look kind of the same, 'cause what you're doing is you're taking the limit as h goes to zero, and h we're still thinking of as being dt. So that kind of sits on the bottom. But here you're just wondering how your vector changes. And the difference, even though we're kind of writing this the same way, and it looks almost identical notationally, is that what's in the numerator here, this v of t plus h and this v of t, these are vectors. So this is kind of a vector minus a vector. When you take the limit, you're getting a limiting vector, something in your high-dimensional space. It's not just a number.
And now, another way to write this, one that's more helpful, more conducive to manipulation, is to say not that it equals the limit of this value, and I'm gonna go ahead and just copy this value here, kind of down here, but to say that the value of our derivative actually equals this, subject to some kind of error, which I'll just write as E of h, like an error function of h. And what you should be thinking is that that error function goes to zero as h goes to zero. This is just writing things so that we're able to manipulate it a little bit more easily. So I'll give ourselves some room here. And what you can do with this is multiply all sides by h. So this is our vector-valued derivative, just rewriting it, multiplied by h. And you're thinking of this h as a dt, so maybe in the back of your mind, you're kind of thinking of canceling this dt with the h. And what it equals is this top, this numerator here, which was v of t plus h minus v of t. And in the back of your mind, you might be thinking, this whole thing represents dv, a change in v. So the idea of canceling out that dt with the h really does kind of come through here.
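One way to write out the two steps just described is sketched below. The sign in front of the error term is an assumption (the transcript does not say which side of the equation it sits on), and it does not matter for the argument, since only the size of the error is used.

```latex
% Vector-valued derivative as a limit
\vec{v}\,'(t) = \lim_{h \to 0} \frac{\vec{v}(t+h) - \vec{v}(t)}{h}

% The same statement with an explicit error term E(h)
\vec{v}\,'(t) = \frac{\vec{v}(t+h) - \vec{v}(t)}{h} + \vec{E}(h),
\qquad \vec{E}(h) \to \vec{0} \ \text{as}\ h \to 0

% Multiplying all sides by h
h\, \vec{v}\,'(t) = \vec{v}(t+h) - \vec{v}(t) + h\, \vec{E}(h)
```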
But the difference between the more hand-waving argument before of canceling those out and what we're doing here is that now we're accounting for that error function. In this case it's now multiplied by h, 'cause everything was multiplied by h. And there's actually another way that I'm gonna write this. There's a very useful convention in analysis where I'll take something like this and instead I'll write it as little o of h. And this isn't literally a function. It's just a stand-in to say whatever this is, whatever function that represents, it satisfies the property that when we take that function and divide it by h, that will go to zero as h goes to zero, right? Which is true here because you imagine taking this and dividing by h, and this h cancels out and you just have your error function, which is gonna go to zero.
So now what I do is I use this entire expression to write this v of t plus h. And the reason I wanna do that, if we kind of scroll back up, is because we see v of t plus h showing up in the original definition we care about. So this is just a way of starting to get a grip on that a little bit more firmly. So what I'd write, I'd say that that v of t plus h, that nudged output value, is equal to the original value that I have, v of t, plus this derivative term, and you can kind of think that it's almost like a Taylor polynomial, where this is our first-order term. We're evaluating the derivative at whatever that t is, and we're multiplying it by the value of that nudge, h; that's the linear term. And then the rest of the stuff is just some little o of h. And maybe you'd say, "Shouldn't you be subtracting off that little o of h?" But it's not an actual function. It just represents anything that shrinks. And maybe I should say it's the absolute value, like the magnitude, 'cause in this case, this is a vector-valued quantity. You know, that error is a vector. So it's the size of that vector divided by the size of h that goes to zero. So this is the main tool that we're gonna end up using. This is the way to represent v of t plus h.
And now if we go back up to the original definition, and I'll go ahead and copy that, go ahead and copy that guy. Little bit of debris. So copy that original definition for the ordinary derivative of the composition function, and now when I write things in according to all the manipulations that we just did, it's still a limit as h goes to zero, but what we put on the inside here is f of, now instead of writing v of t plus h, I'm gonna use everything that I did up there. It's the value of v of t plus the derivative at our point times h. So again, it's kind of like a Taylor polynomial. This is your linear term, and then it's plus something that we don't care about, something that's gonna get really small as h gets small, and really small in comparison to h, more importantly. And from that you subtract off f of v of t. Kind of running off the edge. I always keep running off the edge. And all of that is divided by h.
Now, the point here is when you look at this limit, because we're taking it as h goes to zero, we'll basically be able to ignore this o of h component because as h goes to zero, this gets very, very small in comparison to h. So everything that's on the inside here is basically just v of t plus this vector value, right? And this is h times some kind of vector. But if you think back, I made a video on the formal definition of the directional derivative. And if you remember, or if you kind of go back and take a look now, this is exactly the formal definition of the directional derivative. We're taking h to go to zero, the thing we're multiplying it by is a certain vector quantity. That vector is the nudge to your original value, and then we're dividing everything by h.
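The substitution described above can be sketched in symbols as follows. The names a and w, and the symbol used for the directional derivative, are shorthand introduced here for illustration. The step that drops o(h) is justified under an added assumption that the video does not state: that f is Lipschitz near v(t) with some constant L, which holds, for example, when f is continuously differentiable.

```latex
% The main tool: first-order expansion of v with a little-o remainder
\vec{v}(t+h) = \vec{v}(t) + h\, \vec{v}\,'(t) + o(h),
\qquad \frac{\lVert o(h) \rVert}{|h|} \to 0 \ \text{as}\ h \to 0

% Substituting into the derivative of the composition
\frac{d}{dt} f\big(\vec{v}(t)\big)
  = \lim_{h \to 0}
    \frac{f\big(\vec{v}(t) + h\, \vec{v}\,'(t) + o(h)\big) - f\big(\vec{v}(t)\big)}{h}

% Why o(h) can be dropped (assumption: f is Lipschitz near a with constant L),
% writing a = v(t) and w = v'(t)
\left| \frac{f\big(\vec{a} + h \vec{w} + o(h)\big) - f\big(\vec{a} + h \vec{w}\big)}{h} \right|
  \le L\, \frac{\lVert o(h) \rVert}{|h|} \to 0

% What remains is the formal definition of the directional derivative
\nabla_{\vec{w}} f(\vec{a})
  = \lim_{h \to 0} \frac{f\big(\vec{a} + h \vec{w}\big) - f(\vec{a})}{h}
```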
So by definition, this entire thing is the directional derivative in the direction of the derivative of v. I'm writing v prime of t instead of getting the whole dv, dt down there. All of that of f evaluated at where? Well, the place that we're starting is just v of t, so that's v of t. And that's it, that's the answer. 'Cause when you evaluate the directional derivative, the way that you do that, you take the gradient of f, evaluate it at whatever point you're starting at, in this case it's the output of v of t, and you take the dot product between that and the vector-valued derivative. Well, I mean (chuckles), the dot product between that and whatever your vector is, which, in this case, is the vector-valued derivative of v, and that's the multivariable chain rule.
And if you look back through the line of reasoning, it all really did match the thoughts of kind of nudging, nudging, and seeing how that nudged, right? Because the reason we thought to use the vector-valued derivative was because of that intuition. And the reason for all the manipulation that I did is just because I wanted to be able to express what a nudge to the input of v looks like. And what that looks like is the original value plus a certain vector here. This was the resulting nudge in the intermediary space. I wanted to express that in a formal way. And sure, we have this kind of o of h term that expresses something that shrinks really fast, but once you express it like that, you just end up plopping out the definition of the directional derivative.
So I hope that gives kind of a satisfying reason for those of you who are a little bit more rigor-inclined for why the multivariable chain rule works. I should also maybe mention there's a more general multivariable chain rule for vector-valued functions. I'll get to that at another point when I talk about the connections between multivariable calculus and linear algebra.
But for now, that's pretty much all you need to know on the multivariable chain rule when the ultimate composition is, you know, just a real number to a real number. And I'll see you next video.
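As a rough numerical sanity check of the argument in this transcript, the sketch below compares the difference quotient of f(v(t)) with the dot product of the gradient of f and v'(t), and also checks that the remainder v(t+h) - v(t) - h v'(t) shrinks faster than h. The particular path v(t) = (cos t, sin t) and function f(x, y) = x^2 y are hypothetical examples chosen for illustration; they do not come from the video.

```python
import math


def v(t):
    """Hypothetical path (not from the video): the unit circle."""
    return (math.cos(t), math.sin(t))


def v_prime(t):
    """Exact vector-valued derivative of v."""
    return (-math.sin(t), math.cos(t))


def f(x, y):
    """Hypothetical scalar function (not from the video): f(x, y) = x^2 * y."""
    return x * x * y


def grad_f(x, y):
    """Exact gradient of f."""
    return (2.0 * x * y, x * x)


def chain_rule(t):
    """Gradient of f at v(t), dotted with v'(t)."""
    gx, gy = grad_f(*v(t))
    dx, dy = v_prime(t)
    return gx * dx + gy * dy


def difference_quotient(t, h):
    """(f(v(t + h)) - f(v(t))) / h, the quantity inside the limit."""
    return (f(*v(t + h)) - f(*v(t))) / h


def remainder_ratio(t, h):
    """||v(t + h) - v(t) - h v'(t)|| / |h|, which should go to 0 with h."""
    x0, y0 = v(t)
    x1, y1 = v(t + h)
    dx, dy = v_prime(t)
    return math.hypot(x1 - x0 - h * dx, y1 - y0 - h * dy) / abs(h)


if __name__ == "__main__":
    t0 = 0.7
    print("chain rule value:", chain_rule(t0))
    for h in (1e-1, 1e-2, 1e-3, 1e-4):
        print(h, difference_quotient(t0, h), remainder_ratio(t0, h))
```

As h shrinks, the difference quotient should approach the chain rule value and the remainder ratio should approach zero, which is the little-o property the argument relies on.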