
Commit a3f13fa

fixing \ref and \eqref
1 parent 7e65e96 commit a3f13fa

11 files changed: +113 -110 lines

Ch03-linreg-lab.Rmd

Lines changed: 2 additions & 2 deletions
@@ -343,7 +343,7 @@ As mentioned above, there is an existing function to add a line to a plot --- `a
 
 
 Next we examine some diagnostic plots, several of which were discussed
-in Section~\ref{Ch3:problems.sec}.
+in Section 3.3.3.
 We can find the fitted values and residuals
 of the fit as attributes of the `results` object.
 Various influence measures describing the regression model
@@ -440,7 +440,7 @@ We can access the individual components of `results` by name
 and
 `np.sqrt(results.scale)` gives us the RSE.
 
-Variance inflation factors (section~\ref{Ch3:problems.sec}) are sometimes useful
+Variance inflation factors (section 3.3.3) are sometimes useful
 to assess the effect of collinearity in the model matrix of a regression model.
 We will compute the VIFs in our multiple regression fit, and use the opportunity to introduce the idea of *list comprehension*.

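As an aside for readers following the lab: the VIF computation this hunk points to can be sketched with a list comprehension roughly as below. This is a minimal sketch, not the lab's own code chunk; the toy design matrix and its column names are made up for illustration.

```python
# Minimal sketch: VIFs via a list comprehension (toy data, illustrative column names).
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor as VIF

rng = np.random.default_rng(0)
X = pd.DataFrame({'intercept': np.ones(100),
                  'x1': rng.normal(size=100),
                  'x2': rng.normal(size=100)})

# one VIF per non-intercept column of the model matrix
vals = [VIF(X.values, i) for i in range(1, X.shape[1])]
print(pd.DataFrame({'vif': vals}, index=X.columns[1:]))
```
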
Ch04-classification-lab.Rmd

Lines changed: 11 additions & 11 deletions
@@ -405,7 +405,7 @@ lda.fit(X_train, L_train)
 
 ```
 Here we have used the list comprehensions introduced
-in Section~\ref{Ch3-linreg-lab:multivariate-goodness-of-fit}. Looking at our first line above, we see that the right-hand side is a list
+in Section 3.6.4. Looking at our first line above, we see that the right-hand side is a list
 of length two. This is because the code `for M in [X_train, X_test]` iterates over a list
 of length two. While here we loop over a list,
 the list comprehension method works when looping over any iterable object.
@@ -454,7 +454,7 @@ lda.scalings_
 
 ```
 
-These values provide the linear combination of `Lag1` and `Lag2` that are used to form the LDA decision rule. In other words, these are the multipliers of the elements of $X=x$ in (\ref{Ch4:bayes.multi}).
+These values provide the linear combination of `Lag1` and `Lag2` that are used to form the LDA decision rule. In other words, these are the multipliers of the elements of $X=x$ in (4.24).
 If $-0.64\times `Lag1` - 0.51 \times `Lag2` $ is large, then the LDA classifier will predict a market increase, and if it is small, then the LDA classifier will predict a market decline.
 
 ```{python}
@@ -463,7 +463,7 @@ lda_pred = lda.predict(X_test)
 ```
 
 As we observed in our comparison of classification methods
-(Section~\ref{Ch4:comparison.sec}), the LDA and logistic
+(Section 4.5), the LDA and logistic
 regression predictions are almost identical.
 
 ```{python}
@@ -522,7 +522,7 @@ The LDA classifier above is the first classifier from the
 `sklearn` library. We will use several other objects
 from this library. The objects
 follow a common structure that simplifies tasks such as cross-validation,
-which we will see in Chapter~\ref{Ch5:resample}. Specifically,
+which we will see in Chapter 5. Specifically,
 the methods first create a generic classifier without
 referring to any data. This classifier is then fit
 to data with the `fit()` method and predictions are
@@ -875,7 +875,7 @@ This is double the rate that one would obtain from random guessing.
 The number of neighbors in KNN is referred to as a *tuning parameter*, also referred to as a *hyperparameter*.
 We do not know *a priori* what value to use. It is therefore of interest
 to see how the classifier performs on test data as we vary these
-parameters. This can be achieved with a `for` loop, described in Section~\ref{Ch2-statlearn-lab:for-loops}.
+parameters. This can be achieved with a `for` loop, described in Section 2.3.8.
 Here we use a for loop to look at the accuracy of our classifier in the group predicted to purchase
 insurance as we vary the number of neighbors from 1 to 5:
 
@@ -902,7 +902,7 @@ As a comparison, we can also fit a logistic regression model to the
 data. This can also be done
 with `sklearn`, though by default it fits
 something like the *ridge regression* version
-of logistic regression, which we introduce in Chapter~\ref{Ch6:varselect}. This can
+of logistic regression, which we introduce in Chapter 6. This can
 be modified by appropriately setting the argument `C` below. Its default
 value is 1 but by setting it to a very large number, the algorithm converges to the same solution as the usual (unregularized)
 logistic regression estimator discussed above.
@@ -946,7 +946,7 @@ confusion_table(logit_labels, y_test)
 
 ```
 ## Linear and Poisson Regression on the Bikeshare Data
-Here we fit linear and Poisson regression models to the `Bikeshare` data, as described in Section~\ref{Ch4:sec:pois}.
+Here we fit linear and Poisson regression models to the `Bikeshare` data, as described in Section 4.6.
 The response `bikers` measures the number of bike rentals per hour
 in Washington, DC in the period 2010--2012.
 
@@ -987,7 +987,7 @@ variables constant, there are on average about 7 more riders in
 February than in January. Similarly there are about 16.5 more riders
 in March than in January.
 
-The results seen in Section~\ref{sec:bikeshare.linear}
+The results seen in Section 4.6.1
 used a slightly different coding of the variables `hr` and `mnth`, as follows:
 
 ```{python}
@@ -1041,7 +1041,7 @@ np.allclose(M_lm.fittedvalues, M2_lm.fittedvalues)
 ```
 
 
-To reproduce the left-hand side of Figure~\ref{Ch4:bikeshare}
+To reproduce the left-hand side of Figure 4.13
 we must first obtain the coefficient estimates associated with
 `mnth`. The coefficients for January through November can be obtained
 directly from the `M2_lm` object. The coefficient for December
@@ -1081,7 +1081,7 @@ ax_month.set_ylabel('Coefficient', fontsize=20);
 
 ```
 
-Reproducing the right-hand plot in Figure~\ref{Ch4:bikeshare} follows a similar process.
+Reproducing the right-hand plot in Figure 4.13 follows a similar process.
 
 ```{python}
 coef_hr = S2[S2.index.str.contains('hr')]['coef']
@@ -1116,7 +1116,7 @@ M_pois = sm.GLM(Y, X2, family=sm.families.Poisson()).fit()
 
 ```
 
-We can plot the coefficients associated with `mnth` and `hr`, in order to reproduce Figure~\ref{Ch4:bikeshare.pois}. We first complete these coefficients as before.
+We can plot the coefficients associated with `mnth` and `hr`, in order to reproduce Figure 4.15. We first complete these coefficients as before.
 
 ```{python}
 S_pois = summarize(M_pois)

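The `for` loop over the number of neighbors described in the KNN hunk above can be sketched roughly as follows. This is a hedged sketch on synthetic data, not the lab's chunk; the lab works with its insurance data set and different variable names.

```python
# Minimal sketch: vary K in KNN and report accuracy within the group predicted 'Yes'.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X_train, X_test = rng.normal(size=(800, 5)), rng.normal(size=(200, 5))
y_train = rng.choice(['No', 'Yes'], size=800, p=[0.9, 0.1])
y_test = rng.choice(['No', 'Yes'], size=200, p=[0.9, 0.1])

for K in range(1, 6):
    pred = KNeighborsClassifier(n_neighbors=K).fit(X_train, y_train).predict(X_test)
    pos = pred == 'Yes'                                 # predicted purchasers
    hits = np.sum((pred == 'Yes') & (y_test == 'Yes'))
    rate = hits / max(pos.sum(), 1)                     # guard against an empty group
    print(f'K={K}: predicted Yes: {pos.sum():3d}, accuracy among them: {rate:.1%}')
```
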
Ch05-resample-lab.Rmd

Lines changed: 12 additions & 12 deletions
@@ -237,7 +237,7 @@ for i, d in enumerate(range(1,6)):
 cv_error
 
 ```
-As in Figure~\ref{Ch5:cvplot}, we see a sharp drop in the estimated test MSE between the linear and
+As in Figure 5.4, we see a sharp drop in the estimated test MSE between the linear and
 quadratic fits, but then no clear improvement from using higher-degree polynomials.
 
 Above we introduced the `outer()` method of the `np.power()`
@@ -278,7 +278,7 @@ cv_error
 Notice that the computation time is much shorter than that of LOOCV.
 (In principle, the computation time for LOOCV for a least squares
 linear model should be faster than for $k$-fold CV, due to the
-availability of the formula~(\ref{Ch5:eq:LOOCVform}) for LOOCV;
+availability of the formula~(5.2) for LOOCV;
 however, the generic `cross_validate()` function does not make
 use of this formula.) We still see little evidence that using cubic
 or higher-degree polynomial terms leads to a lower test error than simply
@@ -325,7 +325,7 @@ incurred by picking different random folds.
 
 ## The Bootstrap
 We illustrate the use of the bootstrap in the simple example
-{of Section~\ref{Ch5:sec:bootstrap},} as well as on an example involving
+{of Section 5.2,} as well as on an example involving
 estimating the accuracy of the linear regression model on the `Auto`
 data set.
 ### Estimating the Accuracy of a Statistic of Interest
@@ -340,8 +340,8 @@ in a dataframe.
 To illustrate the bootstrap, we
 start with a simple example.
 The `Portfolio` data set in the `ISLP` package is described
-in Section~\ref{Ch5:sec:bootstrap}. The goal is to estimate the
-sampling variance of the parameter $\alpha$ given in formula~(\ref{Ch5:min.var}). We will
+in Section 5.2. The goal is to estimate the
+sampling variance of the parameter $\alpha$ given in formula~(5.7). We will
 create a function
 `alpha_func()`, which takes as input a dataframe `D` assumed
 to have columns `X` and `Y`, as well as a
@@ -360,7 +360,7 @@ def alpha_func(D, idx):
 ```
 This function returns an estimate for $\alpha$
 based on applying the minimum
-variance formula (\ref{Ch5:min.var}) to the observations indexed by
+variance formula (5.7) to the observations indexed by
 the argument `idx`. For instance, the following command
 estimates $\alpha$ using all 100 observations.
 
@@ -430,7 +430,7 @@ intercept and slope terms for the linear regression model that uses
 `horsepower` to predict `mpg` in the `Auto` data set. We
 will compare the estimates obtained using the bootstrap to those
 obtained using the formulas for ${\rm SE}(\hat{\beta}_0)$ and
-${\rm SE}(\hat{\beta}_1)$ described in Section~\ref{Ch3:secoefsec}.
+${\rm SE}(\hat{\beta}_1)$ described in Section 3.1.2.
 
 To use our `boot_SE()` function, we must write a function (its
 first argument)
@@ -499,7 +499,7 @@ This indicates that the bootstrap estimate for ${\rm SE}(\hat{\beta}_0)$ is
 0.85, and that the bootstrap
 estimate for ${\rm SE}(\hat{\beta}_1)$ is
 0.0074. As discussed in
-Section~\ref{Ch3:secoefsec}, standard formulas can be used to compute
+Section 3.1.2, standard formulas can be used to compute
 the standard errors for the regression coefficients in a linear
 model. These can be obtained using the `summarize()` function
 from `ISLP.sm`.
@@ -513,21 +513,21 @@ model_se
 
 
 The standard error estimates for $\hat{\beta}_0$ and $\hat{\beta}_1$
-obtained using the formulas from Section~\ref{Ch3:secoefsec} are
+obtained using the formulas from Section 3.1.2 are
 0.717 for the
 intercept and
 0.006 for the
 slope. Interestingly, these are somewhat different from the estimates
 obtained using the bootstrap. Does this indicate a problem with the
 bootstrap? In fact, it suggests the opposite. Recall that the
 standard formulas given in
-{Equation~\ref{Ch3:se.eqn} on page~\pageref{Ch3:se.eqn}}
+{Equation 3.8 on page~\pageref{Ch3:se.eqn}}
 rely on certain assumptions. For example,
 they depend on the unknown parameter $\sigma^2$, the noise
 variance. We then estimate $\sigma^2$ using the RSS. Now although the
 formulas for the standard errors do not rely on the linear model being
 correct, the estimate for $\sigma^2$ does. We see
-{in Figure~\ref{Ch3:polyplot} on page~\pageref{Ch3:polyplot}} that there is
+{in Figure 3.8 on page~\pageref{Ch3:polyplot}} that there is
 a non-linear relationship in the data, and so the residuals from a
 linear fit will be inflated, and so will $\hat{\sigma}^2$. Secondly,
 the standard formulas assume (somewhat unrealistically) that the $x_i$
@@ -540,7 +540,7 @@ the results from `sm.OLS`.
 Below we compute the bootstrap standard error estimates and the
 standard linear regression estimates that result from fitting the
 quadratic model to the data. Since this model provides a good fit to
-the data (Figure~\ref{Ch3:polyplot}), there is now a better
+the data (Figure 3.8), there is now a better
 correspondence between the bootstrap estimates and the standard
 estimates of ${\rm SE}(\hat{\beta}_0)$, ${\rm SE}(\hat{\beta}_1)$ and
 ${\rm SE}(\hat{\beta}_2)$.

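The `alpha_func()` discussed in the bootstrap hunks above can be sketched roughly as follows, applying the minimum-variance formula $\alpha = (\sigma^2_Y - \sigma_{XY})/(\sigma^2_X + \sigma^2_Y - 2\sigma_{XY})$ to the rows selected by `idx`. This is a sketch on a synthetic stand-in for the `Portfolio` data, not the lab's own chunk.

```python
# Minimal sketch of an alpha_func(): minimum-variance weight from the X/Y covariance.
import numpy as np
import pandas as pd

def alpha_func(D, idx):
    cov_ = np.cov(D[['X', 'Y']].loc[idx], rowvar=False)
    return ((cov_[1, 1] - cov_[0, 1]) /
            (cov_[0, 0] + cov_[1, 1] - 2 * cov_[0, 1]))

rng = np.random.default_rng(0)
Portfolio = pd.DataFrame(                      # synthetic stand-in, 100 rows
    rng.multivariate_normal([0, 0], [[1.0, 0.5], [0.5, 1.25]], size=100),
    columns=['X', 'Y'])
print(alpha_func(Portfolio, range(100)))       # estimate using all 100 observations
```
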
Ch06-varselect-lab.Rmd

Lines changed: 7 additions & 7 deletions
@@ -89,7 +89,7 @@ Hitters.shape
 ```
 
 
-We first choose the best model using forward selection based on $C_p$ (\ref{Ch6:eq:cp}). This score
+We first choose the best model using forward selection based on $C_p$ (6.2). This score
 is not built in as a metric to `sklearn`. We therefore define a function to compute it ourselves, and use
 it as a scorer. By default, `sklearn` tries to maximize a score, hence
 our scoring function computes the negative $C_p$ statistic.
@@ -114,7 +114,7 @@ sigma2 = OLS(Y,X).fit().scale
 
 ```
 
-The function `sklearn_selected()` expects a scorer with just three arguments --- the last three in the definition of `nCp()` above. We use the function `partial()` first seen in Section~\ref{Ch5-resample-lab:the-bootstrap} to freeze the first argument with our estimate of $\sigma^2$.
+The function `sklearn_selected()` expects a scorer with just three arguments --- the last three in the definition of `nCp()` above. We use the function `partial()` first seen in Section 5.3.3 to freeze the first argument with our estimate of $\sigma^2$.
 
 ```{python}
 neg_Cp = partial(nCp, sigma2)
@@ -366,7 +366,7 @@ Since we
 standardize first, in order to find coefficient
 estimates on the original scale, we must *unstandardize*
 the coefficient estimates. The parameter
-$\lambda$ in (\ref{Ch6:ridge}) and (\ref{Ch6:LASSO}) is called `alphas` in `sklearn`. In order to
+$\lambda$ in (6.5) and (6.7) is called `alphas` in `sklearn`. In order to
 be consistent with the rest of this chapter, we use `lambdas`
 rather than `alphas` in what follows. {At the time of publication, ridge fits like the one in code chunk [22] issue unwarranted convergence warning messages; we expect these to disappear as this package matures.}
 
@@ -643,7 +643,7 @@ not perform variable selection!
 ### Evaluating Test Error of Cross-Validated Ridge
 Choosing $\lambda$ using cross-validation provides a single regression
 estimator, similar to fitting a linear regression model as we saw in
-Chapter~\ref{Ch3:linreg}. It is therefore reasonable to estimate what its test error
+Chapter 3. It is therefore reasonable to estimate what its test error
 is. We run into a problem here in that cross-validation will have
 *touched* all of its data in choosing $\lambda$, hence we have no
 further data to estimate test error. A compromise is to do an initial
@@ -779,11 +779,11 @@ Principal components regression (PCR) can be performed using
 `PCA()` from the `sklearn.decomposition`
 module. We now apply PCR to the `Hitters` data, in order to
 predict `Salary`. Again, ensure that the missing values have
-been removed from the data, as described in Section~\ref{Ch6-varselect-lab:lab-1-subset-selection-methods}.
+been removed from the data, as described in Section 6.5.1.
 
 We use `LinearRegression()` to fit the regression model
 here. Note that it fits an intercept by default, unlike
-the `OLS()` function seen earlier in Section~\ref{Ch6-varselect-lab:lab-1-subset-selection-methods}.
+the `OLS()` function seen earlier in Section 6.5.1.
 
 ```{python}
 pca = PCA(n_components=2)
@@ -867,7 +867,7 @@ cv_null = skm.cross_validate(linreg,
 The `explained_variance_ratio_`
 attribute of our `PCA` object provides the *percentage of variance explained* in the predictors and in the response using
 different numbers of components. This concept is discussed in greater
-detail in Section~\ref{Ch10:sec:pca}.
+detail in Section 12.2.
 
 ```{python}
 pipe.named_steps['pca'].explained_variance_ratio_

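The negative $C_p$ scorer and the `partial()` trick referred to in the first two hunks above can be sketched roughly as follows, assuming $C_p = (\mathrm{RSS} + 2\,d\,\hat\sigma^2)/n$. The signature of `nCp()` here and the synthetic data are illustrative; the lab's own definitions may differ in detail.

```python
# Minimal sketch: a negative-Cp scorer (sklearn maximizes scores) plus partial().
import numpy as np
from functools import partial
from sklearn.linear_model import LinearRegression

def nCp(sigma2, estimator, X, Y):
    """Negative Cp statistic for a fitted regression estimator."""
    n, d = X.shape
    RSS = np.sum((Y - estimator.predict(X)) ** 2)
    return -(RSS + 2 * d * sigma2) / n

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
Y = X @ np.array([1.0, 0.0, -2.0]) + rng.normal(size=100)
fit = LinearRegression().fit(X, Y)
sigma2 = np.sum((Y - fit.predict(X)) ** 2) / (100 - 3 - 1)  # rough noise-variance estimate
neg_Cp = partial(nCp, sigma2)   # freeze sigma2, leaving a three-argument scorer
print(neg_Cp(fit, X, Y))
```
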
Ch07-nonlin-lab.Rmd

Lines changed: 12 additions & 12 deletions
@@ -58,7 +58,7 @@ from ISLP.pygam import (approx_lam,
 ```
 
 ## Polynomial Regression and Step Functions
-We start by demonstrating how Figure~\ref{Ch7:fig:poly} can be reproduced.
+We start by demonstrating how Figure 7.1 can be reproduced.
 Let's begin by loading the data.
 
 ```{python}
@@ -70,7 +70,7 @@ age = Wage['age']
 
 Throughout most of this lab, our response is `Wage['wage']`, which
 we have stored as `y` above.
-As in Section~\ref{Ch3-linreg-lab:non-linear-transformations-of-the-predictors}, we will use the `poly()` function to create a model matrix
+As in Section 3.6.6, we will use the `poly()` function to create a model matrix
 that will fit a $4$th degree polynomial in `age`.
 
 ```{python}
@@ -84,7 +84,7 @@ summarize(M)
 This polynomial is constructed using the function `poly()`,
 which creates
 a special *transformer* `Poly()` (using `sklearn` terminology
-for feature transformations such as `PCA()` seen in Section \ref{Ch6-varselect-lab:principal-components-regression}) which
+for feature transformations such as `PCA()` seen in Section 6.5.3) which
 allows for easy evaluation of the polynomial at new data points. Here `poly()` is referred to as a *helper* function, and sets up the transformation; `Poly()` is the actual workhorse that computes the transformation. See also
 the
 discussion of transformations on
@@ -151,7 +151,7 @@ def plot_wage_fit(age_df,
 We include an argument `alpha` to `ax.scatter()`
 to add some transparency to the points. This provides a visual indication
 of density. Notice the use of the `zip()` function in the
-`for` loop above (see Section~\ref{Ch2-statlearn-lab:for-loops}).
+`for` loop above (see Section 2.3.8).
 We have three lines to plot, each with different colors and line
 types. Here `zip()` conveniently bundles these together as
 iterators in the loop. {In `Python`{} speak, an "iterator" is an object with a finite number of values, that can be iterated on, as in a loop.}
@@ -254,7 +254,7 @@ anova_lm(*[sm.OLS(y, X_).fit() for X_ in XEs])
 
 
 As an alternative to using hypothesis tests and ANOVA, we could choose
-the polynomial degree using cross-validation, as discussed in Chapter~\ref{Ch5:resample}.
+the polynomial degree using cross-validation, as discussed in Chapter 5.
 
 Next we consider the task of predicting whether an individual earns
 more than $250,000 per year. We proceed much as before, except
@@ -313,7 +313,7 @@ value do not cover each other up. This type of plot is often called a
 *rug plot*.
 
 In order to fit a step function, as discussed in
-Section~\ref{Ch7:sec:scolstep-function}, we first use the `pd.qcut()`
+Section 7.2, we first use the `pd.qcut()`
 function to discretize `age` based on quantiles. Then we use `pd.get_dummies()` to create the
 columns of the model matrix for this categorical variable. Note that this function will
 include *all* columns for a given categorical, rather than the usual approach which drops one
@@ -345,7 +345,7 @@ evaluation functions are in the `scipy.interpolate` package;
 we have simply wrapped them as transforms
 similar to `Poly()` and `PCA()`.
 
-In Section~\ref{Ch7:sec:scolr-splin}, we saw
+In Section 7.4, we saw
 that regression splines can be fit by constructing an appropriate
 matrix of basis functions. The `BSpline()` function generates the
 entire matrix of basis functions for splines with the specified set of
@@ -360,7 +360,7 @@ bs_age.shape
 ```
 This results in a seven-column matrix, which is what is expected for a cubic-spline basis with 3 interior knots.
 We can form this same matrix using the `bs()` object,
-which facilitates adding this to a model-matrix builder (as in `poly()` versus its workhorse `Poly()`) described in Section~\ref{Ch7-nonlin-lab:polynomial-regression-and-step-functions}.
+which facilitates adding this to a model-matrix builder (as in `poly()` versus its workhorse `Poly()`) described in Section 7.8.1.
 
 We now fit a cubic spline model to the `Wage` data.
 
@@ -469,7 +469,7 @@ of a model matrix with a particular smoothing operation:
 `s` for smoothing spline; `l` for linear, and `f` for factor or categorical variables.
 The argument `0` passed to `s` below indicates that this smoother will
 apply to the first column of a feature matrix. Below, we pass it a
-matrix with a single column: `X_age`. The argument `lam` is the penalty parameter $\lambda$ as discussed in Section~\ref{Ch7:sec5.2}.
+matrix with a single column: `X_age`. The argument `lam` is the penalty parameter $\lambda$ as discussed in Section 7.5.2.
 
 ```{python}
 X_age = np.asarray(age).reshape((-1,1))
@@ -559,7 +559,7 @@ The strength of generalized additive models lies in their ability to fit multiva
 
 We now fit a GAM by hand to predict
 `wage` using natural spline functions of `year` and `age`,
-treating `education` as a qualitative predictor, as in (\ref{Ch7:nsmod}).
+treating `education` as a qualitative predictor, as in (7.16).
 Since this is just a big linear regression model
 using an appropriate choice of basis functions, we can simply do this
 using the `sm.OLS()` function.
@@ -642,9 +642,9 @@ ax.set_title('Partial dependence of year on wage', fontsize=20);
 
 ```
 
-We now fit the model (\ref{Ch7:nsmod}) using smoothing splines rather
+We now fit the model (7.16) using smoothing splines rather
 than natural splines. All of the
-terms in (\ref{Ch7:nsmod}) are fit simultaneously, taking each other
+terms in (7.16) are fit simultaneously, taking each other
 into account to explain the response. The `pygam` package only works with matrices, so we must convert
 the categorical series `education` to its array representation, which can be found
 with the `cat.codes` attribute of `education`. As `year` only has 7 unique values, we

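The step-function construction mentioned in the `pd.qcut()` hunk above can be sketched roughly as follows; the data are synthetic stand-ins for the `Wage` data, and the bin count is arbitrary.

```python
# Minimal sketch: discretize age with pd.qcut(), build indicators with pd.get_dummies(),
# and regress on all indicator columns (so no intercept is added).
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
age = pd.Series(rng.integers(18, 80, size=300), name='age')
wage = pd.Series(50 + 0.5 * age + rng.normal(scale=10, size=300), name='wage')

cut_age = pd.qcut(age, 4)                        # four bins at the quantiles of age
X_step = pd.get_dummies(cut_age).astype(float)   # keeps *all* indicator columns
X_step.columns = X_step.columns.astype(str)      # readable coefficient names
print(sm.OLS(wage, X_step).fit().params)         # one fitted mean wage per age bin
```
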