- [Voiceover] Hello everyone. So this is what I might
call a more optional video. In the last couple of videos, I talked about this multivariable chain rule, and I gave some justification. And it might have been considered a little bit hand-wavy by some. I was doing a lot of things that looked kind of
like taking a derivative with respect to t, and
then multiplying that by an infinitesimal quantity, dt, and thinking of canceling those out. And some people might say, "Ah! But this isn't really a fraction. That's a derivative, that's
a differential operator, and you're treating it incorrectly." And while that's true, the intuitions underlying a lot of this actually match up with the
formal argument pretty well. So what I wanna do here is just talk about what the formal argument behind the multivariable chain rule is, and just to remind ourselves
of the setup of where we are. You're thinking of v as
a vector-valued function, so this is something that
takes as an input, t, that lives on a number line, and then v maps this to some
kind of high-dimensional space. In the simplest case, you
might just think of that as a two-dimensional space, maybe it's three-dimensional space. Or it could be 100-dimensional. You don't have to literally
be visualizing it. And then f, our function f, somehow takes that 100-dimensional space, or two-dimensional, or three-dimensional, whatever it is, and then maps it onto the number line. So the overall effect of
the composition function is to just take a real
number to a real number so it's a single-variable function. So that's where we're taking
this ordinary derivative, rather than a partial
derivative, or gradient, or anything like that. But because it goes through
a multi-dimensional space, and you have this intermediary,
multivariable nature to it, that's why you have a gradient and a vector-valued derivative. With the formal argument,
the first thing you might do is just write out the formal
definition of a derivative. And in this case, it's a limit. Definitions of derivatives
are always gonna be some kind of limit as a
variable goes to zero. And here, you're loosely
thinking about h as being dt. And you could write delta t, but it's common to use h
just because that can be used for whatever your
differential quantity is. So that's in the denominator, 'cause you're thinking of it as dt. And the top is whatever the change to this whole function is when you nudge the input t by h. And what I mean by that
is you'll take f of v, not of t, but of t plus h, that kind of nudged output value, and you're wondering how different that is from f of v of t, the
original value, v of t. So this is what happens
when you just apply the formal definition of the derivative, the ordinary derivative, to
your composition function. And now, what do you do as you're trying to reason
about what this should equal? And a good place to start, actually, is to look back to the
intuition that I was giving for the multivariable chain
rule in the first place. You imagine nudging your input by some dt, some tiny change, and I was saying, oh, so that causes a change in the intermediary space of some kind of, you know, you could call it dv,
a change in the vector. And the way that you're thinking about that is that you take the
vector-valued derivative and multiply it by dt, since the derivative is
the proportionality constant between the size of your nudge
and the resulting vector. And loosely, you might imagine
those dt's crossing out as if they were fractions. It doesn't really matter. And then you ask, "What change does this dv cause for f?" And by definition, the resulting nudge to the output of f is the directional
derivative of the function f in the direction of whatever your vector
nudge is. So this is the loose intuition, and where does that
carry over to formality? You say, "Well, in this
intermediary space, we have to deal with the
vector-valued derivative of v." So it might be a good thing to just write down that definition, right? The definition for the
vector-valued derivative of v, again, looks almost identical. All these derivative definitions really do look kind of the same 'cause what you're doing
is you're taking the limit as h goes to zero, h we're still thinking of as being dt. So that kind of sits on the bottom. But here you're just wondering
how your vector changes. And the difference is that, even though
we're kind of writing this the same way, and it looks
almost identical notationally, what's in the numerator
here, this v of t plus h and this v of t, are vectors. So this is kind of a
vector minus a vector. When you take the limit, you're getting a limiting vector, something in your high-dimensional space. It's not just a number. And now, another way to write this, one that's more helpful, more conducive to manipulation, is to say not that it equals
the limit of this value, and I'm gonna go ahead and
just copy this value here, kind of down here, and say, the value of our derivative actually equals this, subject
to some kind of error, which I'll just write as E of h, like an error function of h. And what you should be thinking is that that error function goes
to zero as h goes to zero. This is just writing
things so that we're able to manipulate them a little bit more easily. So I'll give ourselves some room here. And what you can do with this is multiply both sides by h. So this is our vector-valued derivative, just rewriting it, multiplied by h. And you're thinking of this h as a dt, so maybe in the back of your mind, you're kind of thinking of
canceling this dt with the h. And what it equals is this
top, this numerator here, which was v of t plus h minus v of t. And in the back of your mind, you might be thinking,
this whole thing represents dv, a change in v. So the idea of canceling
out that dt with the h really does kind of come through here. But the difference between the more hand-wavy argument before
of canceling those out and what we're doing here is that now we're accounting
for that error function. In this case it's now the error function multiplied by h, 'cause everything was
multiplied by h. And there's actually another
way that I'm gonna write this. There's a very useful
convention in analysis where I'll take something like this and instead I'll write it as little o of h. And this isn't literally a function. It's just a stand-in to
say whatever this is, whatever function that represents, it satisfies the property that
when we take that function and divide it by h, that will go to zero as
h goes to zero, right? Which is true here because
if you imagine taking this and dividing by h, this h cancels out and you
just have your error function, which is gonna go to zero. So now what I do is I use
this entire expression to write this v of t plus h. And the reason I wanna do that if we kind of scroll back up is because we see v of t plus h showing up in the original definition we care about. So this is just a way of
starting to get a grip on that a little bit more firmly. So what I'd write, I'd say
that v of t plus h, that nudged output value, is equal to the original
value that I have, v of t, plus
this derivative term, and you can kind of think that it's almost like a Taylor polynomial, where this is our first-order term. We're evaluating the derivative at whatever that t is, but we're multiplying it
by the value of that nudge, h. That's the linear term. And then the rest of the stuff
is just some little o of h. And maybe you'd say,
"Shouldn't you be subtracting off that little o of h?" But it's not an actual function. It just represents anything that shrinks faster than h, so the sign doesn't matter. And maybe I should say
it's the absolute value, like the magnitude, 'cause in this case, this is a vector-valued quantity. You know, that error is a vector. So it's the size of that vector divided by the size of h that goes to zero. So this is the main tool that
we're gonna end up using. This is the way to represent v of t plus h. And now if we go back up
to the original definition, I'll go ahead and copy that guy. Little bit of debris. So copy that original definition for the ordinary derivative
of the composition function, and now when I write things in according to all the manipulations that we just did, it's still a limit as h goes to zero, but what we put on the inside here is f of, now instead of writing v of t plus h, I'm gonna use everything
that I did up there. It's the value of v of t plus the derivative at our point times h. So again, it's kind of
like a Taylor polynomial. This is your linear term, and then it's plus something
that we don't care about, something that's gonna get really small as h gets small, and really small in comparison
to h, more importantly. And from that you subtract off f of v of t. Kind of running off the edge. I always keep running off the edge. And all of that is divided by h. Now, the point here is when you look at this limit,
because we're taking it as h goes to zero, we'll basically be able to
ignore this o of h component because as h goes to zero, this gets very, very
small in comparison to h. So everything that's on the inside here is basically just the v of t plus this vector value, right? And this is h times some kind of vector. But if you think back, I made a video on the formal definition of
the directional derivative. And if you remember,
or if you kind of go back and take a look now, this is
exactly the formal definition of the directional derivative. We're taking h to go to zero, the thing we're multiplying it by is a certain vector quantity. That vector times h is the nudge
to your original value, and then we're dividing everything by h. So by definition, this entire thing is the directional derivative of f
in the direction of v prime of t, the derivative of v. I'm writing v prime of t
instead of fitting the whole dv/dt down there. And it's evaluated where? Well, the place that we're starting is just v of t. And that's it, that's the answer. 'Cause when you evaluate
the directional derivative, the way that you do that,
you take the gradient of f, evaluate it at whatever
point you're starting at, in this case it's the output v of t, and you take the dot product between that and the vector-valued derivative. Well, I mean (chuckles), the dot product between that
and whatever your vector is, which, in this case, is
the vector-valued derivative of v, and that's the
multivariable chain rule. And if you look back through
the line of reasoning, it all really did match the thoughts of kind of nudging the input and seeing how that nudge carries through, right? Because the reason we thought to use the vector-valued derivative was that intuition. And the reason for all the
manipulation that I did is just because I wanted
to be able to express what a nudge to the input of v looks like. And what that looks like
is the original value plus a certain vector here. This was the resulting nudge
in the intermediary space. I wanted to express that in a formal way. And sure, we have this kind of o of h term that expresses something
that shrinks really fast, but once you express it like that, you just end up plopping out the definition of the
directional derivative. So I hope that gives kind
of a satisfying reason for those of you who are a
little bit more rigor-inclined for why the multivariable
chain rule works. I should also maybe mention
there's a more general multivariable chain rule
for vector-valued functions. I'll get to that at another point when I talk about the connections between multivariable
calculus and linear algebra. But for now, that's pretty
much all you need to know on the multivariable chain rule when the ultimate composition is, you know, just a real
number to a real number. And I'll see you next video.
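
For reference, the steps described in this video can be written out symbolically. What follows is a sketch reconstructed from the narration, not a copy of what was drawn on screen. The names E(h) and o(h) follow the spoken ones; the dimension n, the arrow notation, and the differentiability assumptions on v and f are added here as assumptions, since the video does not spell them out.

```latex
% Requires amsmath. Setup: \vec{v} : \mathbb{R} \to \mathbb{R}^n and f : \mathbb{R}^n \to \mathbb{R}.
% Assumption (not stated in the video): \vec{v} is differentiable at t, f is differentiable at \vec{v}(t).
\begin{align*}
% Step 1: formal definition of the ordinary derivative of the composition.
\frac{d}{dt} f(\vec{v}(t))
  &= \lim_{h \to 0} \frac{f(\vec{v}(t+h)) - f(\vec{v}(t))}{h} \\
% Step 2: the vector-valued derivative, written with an explicit error term.
\vec{v}\,'(t)
  &= \frac{\vec{v}(t+h) - \vec{v}(t)}{h} + \vec{E}(h),
  \qquad \vec{E}(h) \to \vec{0} \text{ as } h \to 0 \\
% Step 3: multiply by h and solve for \vec{v}(t+h); the term -h\vec{E}(h) is absorbed into o(h).
\vec{v}(t+h)
  &= \vec{v}(t) + h\,\vec{v}\,'(t) + o(h),
  \qquad \frac{\lVert o(h) \rVert}{\lvert h \rvert} \to 0 \text{ as } h \to 0 \\
% Step 4: substitute step 3 into step 1.
\frac{d}{dt} f(\vec{v}(t))
  &= \lim_{h \to 0} \frac{f\bigl(\vec{v}(t) + h\,\vec{v}\,'(t) + o(h)\bigr) - f(\vec{v}(t))}{h} \\
% Step 5: the o(h) term does not affect the limit, leaving the directional derivative.
  &= \lim_{h \to 0} \frac{f\bigl(\vec{v}(t) + h\,\vec{v}\,'(t)\bigr) - f(\vec{v}(t))}{h}
   = \nabla_{\vec{v}\,'(t)} f(\vec{v}(t)) \\
% Step 6: evaluate the directional derivative as a dot product with the gradient.
  &= \nabla f(\vec{v}(t)) \cdot \vec{v}\,'(t)
\end{align*}
```

The passage from step 4 to step 5 is the one the video treats most quickly. It relies on f being differentiable at v(t), so that a perturbation of size o(h) in the input changes the output by an amount that is also o(h) and therefore vanishes after dividing by h.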