The function h is called a hypothesis. (We have not yet said just what it means for a hypothesis to be good or bad.)

Gradient descent update: θj := θj − α ∂J(θ)/∂θj. (This update is simultaneously performed for all values of j = 0, ..., d.) Each step sums over every training example, i.e., the sum in the definition of J.

Armed with the tools of matrix derivatives, let us now proceed to find in closed form the value of θ that minimizes J(θ), by explicitly taking its derivatives with respect to the θj's and setting them to zero. In the derivation we used the facts ∇x bᵀx = b and ∇x xᵀAx = 2Ax for a symmetric matrix A (for more details, see Section 4.3 of "Linear Algebra Review and Reference").

Locally weighted regression: intuitively, if w(i) is large for a particular i, then in picking θ we try hard to fit that example well; if |x(i) − x| is large, then w(i) is small and that example is pretty much ignored in the fit.

Logistic regression: intuitively, it also doesn't make sense for hθ(x) to take values larger than 1 or smaller than 0 when we know that y ∈ {0, 1}. So, given the logistic regression model, how do we fit θ for it? The probability of the data is typically viewed as a function of y (and perhaps X), for a fixed value of θ. Instead of maximizing L(θ), we can also maximize any strictly increasing function of L(θ); in particular, the derivations are a bit simpler if we instead maximize the log likelihood ℓ(θ). The resulting update is just like the regression update; we will see why when we get to GLM models.

Both linear regression and logistic regression are special cases of a broader family of models, called Generalized Linear Models (GLMs), which can be derived and applied to other classification and regression problems.

Exponential family: η is called the natural parameter (also called the canonical parameter) of the distribution; T(y) is the sufficient statistic (for the distributions we consider, it will often be the case that T(y) = y); and a(η) is the log partition function.

1 Neural Networks. In this set of notes, we give an overview of neural networks, discuss vectorization, and discuss training neural networks with backpropagation. We will start small and slowly build up a neural network, step by step.
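The exponential-family pieces named above (natural parameter η, sufficient statistic T(y), log partition function a(η)) can be checked numerically. Here is a minimal sketch (code and names mine, not from the notes) writing the Bernoulli distribution as p(y; η) = b(y) exp(η T(y) − a(η)) and comparing it with the direct pmf φ^y (1 − φ)^(1−y):

```python
import math

def bernoulli_direct(y, phi):
    # Direct Bernoulli pmf: phi^y * (1 - phi)^(1 - y)
    return phi ** y * (1 - phi) ** (1 - y)

def bernoulli_exp_family(y, phi):
    # Exponential-family form p(y; eta) = b(y) * exp(eta * T(y) - a(eta))
    eta = math.log(phi / (1 - phi))    # natural parameter
    T = y                              # sufficient statistic, T(y) = y
    a = math.log(1 + math.exp(eta))    # log partition function
    b = 1.0
    return b * math.exp(eta * T - a)

# The two parameterizations agree for every phi and y in {0, 1}.
for phi in (0.2, 0.5, 0.9):
    for y in (0, 1):
        assert abs(bernoulli_direct(y, phi) - bernoulli_exp_family(y, phi)) < 1e-12
```

Note that a(η) = log(1 + e^η) = −log(1 − φ), which is exactly what makes e^{−a(η)} act as the normalization constant.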
CS229 Lecture notes, Andrew Ng, Part IV: Generative Learning algorithms. So far, we've mainly been talking about learning algorithms that model p(y|x; θ), the conditional distribution of y given x.

CS229 Lecture notes, Andrew Ng, Part IX: The EM algorithm. Suppose that we are given a training set {x(1), ..., x(m)} as usual.

Suppose we have a dataset giving the living areas and prices of 47 houses from Portland, Oregon. Our goal is, given a training set, to learn a function h : X → Y so that h(x) is a "good" predictor for the corresponding value of y. The superscript "(i)" in the notation is simply an index into the training set, and has nothing to do with exponentiation.

In contrast to the assignment "a := b", we will write "a = b" when we are asserting a statement of fact. Gradient descent starts with some initial θ and repeatedly changes θ to make J(θ) smaller, until hopefully we converge to a value of θ that minimizes J(θ). When each update instead uses a single training example, the algorithm is called stochastic gradient descent (also incremental gradient descent).

For a function f : R^{n×d} → R mapping from n-by-d matrices to the real numbers, we can define the derivative of f with respect to the matrix. If the number of linearly independent examples is fewer than the number of features, then XᵀX is not invertible.

In logistic regression, P(y = 0|x; θ) = 1 − hθ(x); note that this can be written more compactly as p(y|x; θ) = (hθ(x))^y (1 − hθ(x))^{1−y}. The probability of the data is given by p(y|X; θ), the distribution of the y(i)'s given the x(i)'s and parameterized by θ. Assuming that the n training examples were generated independently, we can then write down the likelihood of the parameters as L(θ) = ∏i p(y(i)|x(i); θ). As in the case of linear regression, we can use gradient ascent, since we are maximizing rather than minimizing a function now.

In Newton's method, ∇θℓ(θ) is the vector of partial derivatives of ℓ(θ) with respect to the θi's, and H is a d-by-d matrix (actually, d+1-by-d+1, assuming we include the intercept term) called the Hessian.

To work our way up to GLMs, we will begin by defining exponential family distributions. The quantity e^{−a(η)} essentially plays the role of a normalization constant that makes the distribution p(y; η) sum or integrate over y to 1.

In locally weighted regression, one can take w(i) = exp(−(x(i) − x)ᵀ(x(i) − x)/(2τ²)), or w(i) = exp(−(x(i) − x)ᵀΣ⁻¹(x(i) − x)/2), for an appropriate choice of τ or Σ; τ controls how quickly the weight of a training example falls off with distance from the query point x.
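The stochastic (incremental) gradient descent idea above can be sketched as follows, assuming the LMS-style per-example update θ := θ + α (y(i) − θᵀx(i)) x(i); the data, step size, and function name here are illustrative, not from the notes:

```python
import numpy as np

def sgd_linear_regression(X, y, alpha=0.01, epochs=200):
    """Stochastic gradient descent for linear regression.

    Each step uses a single training example:
        theta := theta + alpha * (y_i - theta^T x_i) * x_i
    """
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(epochs):
        for i in range(n):
            err = y[i] - X[i] @ theta
            theta += alpha * err * X[i]
    return theta

# Toy noiseless data: y = 1 + 2x, with an intercept column of ones.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=50)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x
theta = sgd_linear_regression(X, y)
# theta should approach [1, 2] as the updates converge
```

With a constant step size the parameters would in general keep oscillating around the minimum on noisy data; here the data is noiseless, so the updates settle down.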
To enable us to do this without having to write reams of algebra and pages full of matrices of derivatives, let's introduce some notation for doing calculus with matrices.

CS229 Lecture Notes, Andrew Ng, slightly updated by TM on June 28, 2019: Supervised learning. Let's start by talking about a few examples of supervised learning problems. For instance, if we are trying to build a spam classifier for email, then x(i) may be some features of a piece of email, and y may be 1 if it is a piece of spam mail, and 0 otherwise.

CS229 Lecture Notes, Andrew Ng, updated by Tengyu Ma on April 21, 2019, Part V: Kernel Methods. 1.1 Feature maps. Recall that in our discussion about linear regression, we considered the problem of predicting the price of a house (denoted by y) from the living area of the house (denoted by x), and we fit a linear function of x to the training data. If the number of bedrooms were included as one of the input features as well, x would be two-dimensional.

Given a training set, define the design matrix X to be the n-by-d matrix (actually n-by-(d+1), if we include the intercept term) containing the training examples' input values in its rows. Running on the housing data (living area and number of bedrooms), we obtain θ0 = 89.60, θ1 = 0.1392, θ2 = −8.738.

Using the same algorithm to maximize ℓ, we obtain the stochastic gradient ascent update rule. If we compare this to the LMS update rule, we see that it looks identical; but it is not the same algorithm. Is this coincidence, or is there a deeper reason behind this? We'll answer this when we get to GLM models.

The principle of maximum likelihood says we should choose θ to maximize L(θ). In locally weighted regression, examples with small weights are pretty much ignored in the fit. The (unweighted) linear regression algorithm we saw earlier is known as a parametric learning algorithm. As we vary φ, we obtain Bernoulli distributions with different means. (If x is vector-valued, the weights generalize to w(i) = exp(−(x(i) − x)ᵀ(x(i) − x)/(2τ²)).) The ":=" operation overwrites a with the value of b.

The following notes represent a complete, stand-alone interpretation of Stanford's machine learning course presented by Professor Andrew Ng and originally posted on the ml-class.org website during the fall 2011 semester.
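The design matrix makes the closed-form least-squares solution a one-liner. A sketch (the helper name and most of the numbers are illustrative; only the 2104/400 and 1600/330 pairs echo the housing table fragments in the notes, with areas rescaled to thousands of square feet):

```python
import numpy as np

def normal_equations(X, y):
    """Closed-form least squares: solve X^T X theta = X^T y.

    Uses solve() instead of an explicit inverse; assumes X^T X is
    invertible (i.e., the feature columns are linearly independent).
    """
    return np.linalg.solve(X.T @ X, X.T @ y)

# Design matrix rows: [intercept, living area in 1000s of sq ft, bedrooms].
X = np.array([[1.0, 2.104, 3.0],
              [1.0, 1.600, 3.0],
              [1.0, 2.400, 3.0],
              [1.0, 1.416, 2.0],
              [1.0, 3.000, 4.0]])
y = np.array([400.0, 330.0, 369.0, 232.0, 540.0])  # price in $1000s
theta = normal_equations(X, y)
# At the optimum, residuals are orthogonal to the columns of X.
assert np.allclose(X.T @ (y - X @ theta), 0.0, atol=1e-6)
```

Solving the linear system rather than forming (XᵀX)⁻¹ explicitly is the standard numerically safer choice.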
CS229 Lecture notes, Andrew Ng: Mixtures of Gaussians and the EM algorithm. In this set of notes, we discuss the EM (Expectation-Maximization) algorithm for density estimation.

In this example, X = Y = R. The classification problem is one in which y can take on only two values, 0 and 1; 0 is also called the negative class and 1 the positive class, and they are sometimes also denoted by the symbols "-" and "+". Let us assume that the target variables and the inputs are related via the equation y(i) = θᵀx(i) + ǫ(i). (A sample training pair from the housing data: living area 2104 ft², price 400 in $1000s.)

Specifically, let's consider the gradient descent algorithm, which starts with some initial θ, and repeatedly performs the update θj := θj − α ∂J(θ)/∂θj. To minimize J in closed form instead, we set its derivatives to zero and obtain the normal equations XᵀXθ = Xᵀy; thus, the value of θ that minimizes J(θ) is given in closed form by θ = (XᵀX)⁻¹Xᵀy. In the third step of that derivation, we used the fact that aᵀb = bᵀa, and in the fifth step we used the facts ∇x bᵀx = b and ∇x xᵀAx = 2Ax for symmetric A. The above results were obtained with batch gradient descent. For these reasons, particularly when the training set is large, stochastic gradient descent is often preferred over batch gradient descent.

Nonetheless, it's a little surprising that we end up with the same update rule as LMS. In this section, we will show that both of these methods are special cases of the GLM family.

Newton's method gives a way of getting to f(θ) = 0; in the worked example, the iterations rapidly approach θ = 1.3. When Newton's method is applied to maximize the logistic regression log likelihood function ℓ(θ), the resulting method is also called Fisher scoring.

A class of distributions is in the exponential family if it can be written in the form p(y; η) = b(y) exp(ηᵀT(y) − a(η)).

The term "non-parametric" (roughly) refers to the fact that the amount of stuff we need to keep in order to represent the hypothesis h grows linearly with the size of the training set. A fairly standard choice for the weights is w(i) = exp(−(x(i) − x)²/(2τ²)); note that the weights depend on the particular point x at which we're trying to evaluate h(x).

The topics covered are shown below, although for a more detailed summary see lecture 19.
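Newton's method for getting to f(θ) = 0 iterates θ := θ − f(θ)/f′(θ). A minimal one-dimensional sketch (the example function is mine, not the ℓ′ from the notes):

```python
def newton_root(f, fprime, theta0, iters=10):
    """Newton's method for solving f(theta) = 0:
        theta := theta - f(theta) / f'(theta)
    """
    theta = theta0
    for _ in range(iters):
        theta -= f(theta) / fprime(theta)
    return theta

# Illustrative: find the root of f(theta) = theta^2 - 2, starting at 1.0.
root = newton_root(lambda t: t * t - 2.0, lambda t: 2.0 * t, 1.0)
assert abs(root - 2.0 ** 0.5) < 1e-10
```

The quadratic convergence visible here (a handful of iterations reaches machine precision) is why the notes say Newton's method needs many fewer iterations than gradient descent; to maximize ℓ one sets f = ℓ′ and fprime = ℓ′′.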
A parametric learning algorithm has a fixed, finite number of parameters (the θi's), which are fit to the data; a non-parametric algorithm, in contrast, needs to keep the training data around to make future predictions.

We will talk about the locally weighted linear regression (LWR) algorithm which, assuming there is sufficient training data, makes the choice of features less critical; the choice of features is important to ensuring good performance of a learning algorithm. The weight of a training example falls off with the distance of its x(i) from the query point x; τ is called the bandwidth parameter. (Note also that while the formula for the weights takes a form that is cosmetically similar to the density of a Gaussian distribution, the w(i)'s do not directly have anything to do with Gaussians, and in particular the w(i) are not random variables, normally distributed or otherwise.)

Given data like this, how can we learn to predict the prices of other houses as a function of their living areas? (E.g., living area 1600 ft², price 330 in $1000s.) We want h(x) to be close to y, at least for the training examples we have.

This is not the same algorithm as LMS, because hθ(x(i)) is now defined as a non-linear function of θᵀx(i). (Note the positive sign in the gradient ascent update, since we are now maximizing.)

CS229 Lecture Notes, Andrew Ng: Deep Learning. CS229 Lecture notes, Andrew Ng, Part V: Support Vector Machines.

Batch gradient descent has to scan through the entire training set before taking a single step, a costly operation if n is large. In order to implement the algorithm, we have to work out the partial derivative term on the right hand side. By slowly letting the learning rate decrease to zero, it is also possible to ensure that the parameters will converge to the global minimum rather than merely oscillate around it.

Note that we should not condition on θ, since θ is not a random variable. The maxima of ℓ correspond to points where its first derivative ℓ′(θ) is zero; Newton's method generalizes to the update θ := θ − H⁻¹∇θℓ(θ). In the normal-equations method, we will minimize J by explicitly taking its derivatives with respect to the θj's and setting them to zero. (See also the extra credit problem on Q3 of problem set 1.)

A fixed choice of T, a, and b defines a family of distributions parameterized by η; as we vary η, we then get different distributions within this family, e.g. the class of Bernoulli distributions. The Bernoulli and the Gaussian are examples of exponential family distributions.

y(i) = θᵀx(i) + ǫ(i), where ǫ(i) is an error term that captures either unmodeled effects (such as features very pertinent to predicting housing price that we'd left out of the regression) or random noise, with mean zero and some variance σ².

To establish notation for future use, we'll use x(i) to denote the "input" variables (living area in this example), also called input features, and y(i) to denote the "output" or target variable that we are trying to predict.
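A sketch of locally weighted linear regression with the Gaussian-shaped weights discussed above; the function and variable names are mine, and the weighted fit is obtained from the weighted normal equations XᵀWXθ = XᵀWy:

```python
import numpy as np

def lwr_predict(X, y, x_query, tau=0.5):
    """Locally weighted linear regression prediction at x_query.

    Weights w_i = exp(-||x_i - x_query||^2 / (2 tau^2)): examples far from
    the query point are pretty much ignored in the fit. tau is the
    bandwidth parameter controlling how fast the weights fall off.
    """
    w = np.exp(-np.sum((X - x_query) ** 2, axis=1) / (2.0 * tau ** 2))
    XtW = X.T * w                      # same as X.T @ diag(w)
    theta = np.linalg.solve(XtW @ X, XtW @ y)
    return x_query @ theta

# Toy 1-D data with an intercept column; the data is exactly linear,
# so the local fit recovers the global line.
x = np.linspace(0.0, 1.0, 21)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 3.0 * x
pred = lwr_predict(X, y, np.array([1.0, 0.5]), tau=0.3)
assert abs(pred - 2.5) < 1e-8
```

Because θ is refit for every query point, this is non-parametric in the sense above: the whole training set must be kept around to make predictions.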
Consider modifying the logistic regression method to "force" it to output values that are either 0 or 1 exactly. Alternatively, we could ignore the fact that y is discrete-valued, and use our old linear regression algorithm to try to predict y given x. Instead, we will endow the model with a set of probabilistic assumptions, and then fit the parameters by maximum likelihood: in the regression example, we had y|x; θ ∼ N(μ, σ²), and in the classification one, y|x; θ ∼ Bernoulli(φ). When we wish to explicitly view this quantity as a function of θ, we will instead call it the likelihood function; we can then write down the likelihood of the parameters. Similar to our derivation in the case of linear regression, we can use gradient ascent.

In classification, y takes on only a small number of discrete values. If the features are not linearly independent, then XᵀX will not be invertible.

Locally weighted linear regression is a non-parametric algorithm. To make a prediction at a query point x (i.e., to evaluate h(x)), plain linear regression fits θ to minimize Σi (y(i) − θᵀx(i))² and outputs θᵀx; in contrast, the locally weighted linear regression algorithm fits θ to minimize the weighted sum Σi w(i)(y(i) − θᵀx(i))² before outputting θᵀx.

By letting f(θ) = ℓ′(θ), we can use Newton's method to maximize ℓ; the generalization of Newton's method to the multidimensional setting uses the Hessian, θ := θ − H⁻¹∇θℓ(θ). There is an alternative to batch gradient descent that also works very well. While gradient descent can be susceptible to local minima in general, the optimization problem we have posed here has one global optimum and no other local optima.

We use the notation "a := b" to denote an operation (in a computer program) in which we set the value of a variable a to be equal to the value of b.

Let's start by working with just one training example. Let us further assume that the ǫ(i) are distributed IID according to a Gaussian distribution with mean zero and variance σ². Given x(i) and θ, what is the distribution of the y(i)'s?

For a function f mapping n-by-d matrices to the real numbers, we define the derivative of f with respect to A to be the matrix of partial derivatives; thus, the gradient ∇A f(A) is itself an n-by-d matrix, whose (i, j)-element is ∂f/∂Aij. Here, Aij denotes the (i, j) entry of the matrix A.

(The presentation of this material takes inspiration from Michael I. Jordan, Learning in graphical models (unpublished book draft), and also McCullagh and Nelder, Generalized Linear Models (2nd ed.).)

In the previous set of notes, we talked about the EM algorithm as applied to fitting a mixture of Gaussians.
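A sketch of gradient ascent on the logistic regression log likelihood (note the positive sign in the update, since we are maximizing); the toy data, hyperparameters, and function names are my own:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_gradient_ascent(X, y, alpha=0.1, iters=2000):
    """Batch gradient ascent on the logistic log likelihood.

    Vectorized update (positive sign: we maximize, not minimize):
        theta := theta + alpha * X^T (y - sigmoid(X theta))
    """
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        theta += alpha * X.T @ (y - sigmoid(X @ theta))
    return theta

# Toy separable data: label 1 exactly when x > 0.5; intercept column added.
x = np.array([0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9])
X = np.column_stack([np.ones_like(x), x])
y = (x > 0.5).astype(float)
theta = logistic_gradient_ascent(X, y)
preds = (sigmoid(X @ theta) > 0.5).astype(float)
assert np.array_equal(preds, y)
```

The per-example version of this update looks identical to the LMS rule, but it is not the same algorithm: here hθ(x) is the non-linear sigmoid of θᵀx.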
The supervised learning picture looks like this: x → h → predicted y (e.g., predicted price of a house). To formalize this, we will define a function that measures, for each value of the θ's, how close the h(x(i))'s are to the corresponding y(i)'s: the cost function J(θ) = (1/2) Σi (hθ(x(i)) − y(i))².

We assume that the ǫ(i) are distributed IID (independently and identically distributed) as N(0, σ²); i.e., the density of ǫ(i) is given by p(ǫ(i)) = (1/(√(2π) σ)) exp(−(ǫ(i))²/(2σ²)).

Recall how we saw least squares regression could be derived as the maximum likelihood estimate under this Gaussian model. The principle of maximum likelihood says that we should choose θ so as to make the data as high probability as possible; maximizing the log likelihood ℓ(θ) then gives the same answer as minimizing the least-squares cost J(θ).

Note that in the normal-equations step, we are implicitly assuming that XᵀX is an invertible matrix; the closed form gives θ without resorting to an iterative algorithm.

By grouping the updates of the coordinates into an update of the vector θ, we can rewrite the batch update in a slightly more succinct way: θ := θ + α Σi (y(i) − hθ(x(i))) x(i). The magnitude of the update is proportional to the error term (y(i) − hθ(x(i))); thus, for instance, if we encounter a training example on which our prediction nearly matches the actual value of y(i), there is little need to change the parameters.

Consider also the following algorithm: we repeatedly run through the training set, and each time we encounter a training example, we update the parameters according to the gradient of the error with respect to that single training example only. Often, stochastic gradient descent gets θ "close" to the minimum much faster than batch gradient descent; note however that the parameters θ will keep oscillating around the minimum of J(θ), but in practice most of the values near the minimum are reasonably good approximations. Newton's method, by contrast, requires many fewer iterations than gradient descent to get very close to the minimum.
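The vectorized batch update θ := θ + α Σi (y(i) − hθ(x(i))) x(i) for linear regression can be sketched as below (the step size, iteration count, and data are illustrative, and I rescale the gradient by 1/n, which just rescales α); it should agree with the least-squares solution:

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=1.0, iters=2000):
    """Vectorized batch gradient descent for linear regression.

    Grouping the per-coordinate updates into one vector update:
        theta := theta + alpha * X^T (y - X theta) / n
    """
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(iters):
        theta += alpha * X.T @ (y - X @ theta) / n
    return theta

# Toy data: y = 0.5 + 2x, with an intercept column.
x = np.linspace(0.0, 1.0, 40)
X = np.column_stack([np.ones_like(x), x])
y = 0.5 + 2.0 * x
theta = batch_gradient_descent(X, y)
theta_exact, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(theta, theta_exact, atol=1e-6)
```

Every iteration touches the whole training set (the Σi above), which is exactly the "costly operation if n is large" that motivates the stochastic variant.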