Title: | Metodos Predictivos de Aprendizaje Estadistico (Statistical Learning Predictive Methods) |
---|---|
Description: | Functions and datasets used in the book: Fernandez-Casal, R., Costa, J. and Oviedo-de la Fuente, M. (2024) "Metodos predictivos de aprendizaje estadistico" <https://rubenfcasal.github.io/aprendizaje_estadistico/>. |
Authors: | Ruben Fernandez-Casal [aut, cre] |
Maintainer: | Ruben Fernandez-Casal <[email protected]> |
License: | GPL (>= 2) |
Version: | 0.1.2 |
Built: | 2025-02-24 05:35:23 UTC |
Source: | https://github.com/rubenfcasal/mpae |
Functions and datasets used in the book Fernández-Casal, Costa and Oviedo-de la Fuente (2024) Métodos predictivos de aprendizaje estadístico.
For more information visit https://rubenfcasal.github.io/mpae/.
Fernández-Casal R., Costa J. and Oviedo-de la Fuente M. (2024). Métodos predictivos de aprendizaje estadístico (github).
Fernández-Casal R., Roca-Pardiñas J., Costa J. and Oviedo-de la Fuente M. (2022). Introducción al Análisis de Datos con R (github).
Fernández-Casal R., Cao R. and Costa J. (2023). Técnicas de Simulación y Remuestreo, segunda edición, (github).
Computes accuracy measurements.
accuracy(pred, obs, na.rm = FALSE, tol = sqrt(.Machine$double.eps))
pred | a numeric vector with the predicted values. |
obs | a numeric vector with the observed values. |
na.rm | a logical indicating whether NA values should be stripped before the computation proceeds. |
tol | divide underflow tolerance. |
Returns a named vector with the following components:
me | mean error |
rmse | root mean squared error |
mae | mean absolute error |
mpe | mean percent error |
mape | mean absolute percent error |
r.squared | pseudo R-squared |
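The components listed above can be illustrated with a minimal base R sketch. This is not the package's implementation: the helper name `accuracy_sketch`, the sign convention (observed minus predicted), the use of `tol` to skip observations that are numerically zero in the percent errors, and the pseudo R-squared formula are all assumptions made for illustration.

```r
# Hypothetical helper, NOT the mpae implementation: a sketch of the
# measures named above under stated assumptions.
accuracy_sketch <- function(pred, obs, na.rm = FALSE,
                            tol = sqrt(.Machine$double.eps)) {
  err <- obs - pred                 # assumed convention: observed minus predicted
  if (na.rm) {
    keep <- !is.na(err)
    err <- err[keep]
    obs <- obs[keep]
  }
  ok <- abs(obs) > tol              # assumption: tol avoids dividing by ~0
  perr <- 100 * err[ok] / obs[ok]   # percent errors
  c(me = mean(err),
    rmse = sqrt(mean(err^2)),
    mae = mean(abs(err)),
    mpe = mean(perr),
    mape = mean(abs(perr)),
    # assumed pseudo R-squared: 1 - SSE / total sum of squares
    r.squared = 1 - sum(err^2) / sum((obs - mean(obs))^2))
}

# Toy data (made up for this sketch)
obs <- c(10, 12, 15, 20)
pred <- c(11, 12, 14, 18)
res <- accuracy_sketch(pred, obs)
round(res, 3)
```

Comparing the output of such a sketch with `accuracy(pred, obs)` on the same vectors is a quick way to check which conventions the package actually follows.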
set.seed(1)
nobs <- nrow(bodyfat)
itrain <- sample(nobs, 0.8 * nobs)
train <- bodyfat[itrain, ]
test <- bodyfat[-itrain, ]
fit <- lm(bodyfat ~ abdomen + wrist, data = train)
pred <- predict(fit, newdata = test)
obs <- test$bodyfat
pred.plot(pred, obs)
accuracy(pred, obs)
Modification of the bodyfat dataset for classification. The response bfan is a factor indicating a body fat value above the normal range. The variable bodyfat was dropped for convenience, and two new variables bmi (body mass index, in kg/m^2) and bmi2 (alternate body mass index, in kg^1.2/m^3.3) were computed (see examples below).
bfan
A data frame with 246 rows and 16 columns:
Body fat above normal range
Age (years)
Weight (kg)
Height (cm)
Neck circumference (cm)
Chest circumference (cm)
Abdomen circumference (cm)
Hip circumference (cm)
Thigh circumference (cm)
Knee circumference (cm)
Ankle circumference (cm)
Biceps (extended) circumference (cm)
Forearm circumference (cm)
Wrist circumference (cm)
Body mass index (kg/m^2)
Alternate body mass index
See bodyfat and bodyfat.raw for details.
StatLib Datasets Archive: https://lib.stat.cmu.edu/datasets/bodyfat.
Penrose, K., Nelson, A. and Fisher, A. (1985). Generalized Body Composition Prediction Equation for Men Using Simple Measurement Techniques. Medicine and Science in Sports and Exercise, 17(2), 189. doi:10.1249/00005768-198504000-00037.
bfan <- bodyfat
# Body fat above normal
bfan[1] <- factor(bfan$bodyfat > 24 , # levels = c('FALSE', 'TRUE'),
                  labels = c('No', 'Yes'))
names(bfan)[1] <- "bfan"
bfan$bmi <- with(bfan, weight/(height/100)^2)
bfan$bmi2 <- with(bfan, weight^1.2/(height/100)^3.3)
fit <- glm(bfan ~ abdomen, family = binomial, data = bfan)
summary(fit)
Modification of the dataset analysed in Penrose et al. (1985). Lists estimates of the percentage of body fat determined by underwater weighing and various body measurements for 246 men.
bodyfat
A data frame with 246 rows and 14 columns:
Percent body fat (from Siri's 1956 equation)
Age (years)
Weight (kg)
Height (cm)
Neck circumference (cm)
Chest circumference (cm)
Abdomen circumference (cm)
Hip circumference (cm)
Thigh circumference (cm)
Knee circumference (cm)
Ankle circumference (cm)
Biceps (extended) circumference (cm)
Forearm circumference (cm)
Wrist circumference (cm)
This data set can be used to illustrate multiple regression techniques (e.g. Johnson 1996). Instead of estimating body fat percentage from body density, which is not easy to measure, it is desirable to have a simpler method that allows this to be done from body measurements.
bodyfat.raw contains the original data. According to Johnson (1996), there were data entry errors (cases 42, 48, 76, 96 and 182 of the original data) and he suggested some rules to correct them. These outliers were removed in the bodyfat dataset, as well as an influential observation (case 39, which has a big effect on regression estimates). Additionally, the variable density was dropped for convenience, and variables height and weight were transformed into metric units (centimetres and kilograms) for consistency.
See bodyfat.raw for more details.
StatLib Datasets Archive: https://lib.stat.cmu.edu/datasets/bodyfat.
Johnson, R. W. (1996). Fitting Percentage of Body Fat to Simple Body Measurements. Journal of Statistics Education, 4(1). doi:10.1080/10691898.1996.11910505.
Penrose, K., Nelson, A. and Fisher, A. (1985). Generalized Body Composition Prediction Equation for Men Using Simple Measurement Techniques. Medicine and Science in Sports and Exercise, 17(2), 189. doi:10.1249/00005768-198504000-00037.
fit <- lm(bodyfat ~ abdomen, bodyfat)
summary(fit)
plot(bodyfat ~ abdomen, bodyfat)
abline(fit)
Popular dataset originally analysed in Penrose et al. (1985). Lists estimates of the percentage of body fat determined by underwater weighing and various body measurements for 252 men.
bodyfat.raw
A data frame with 252 rows and 15 columns:
Density (gm/cm^3; determined from underwater weighing)
Percent body fat (from Siri's 1956 equation)
Age (years)
Weight (lbs)
Height (inches)
Neck circumference (cm)
Chest circumference (cm)
Abdomen 2 circumference (cm)
Hip circumference (cm)
Thigh circumference (cm)
Knee circumference (cm)
Ankle circumference (cm)
Biceps (extended) circumference (cm)
Forearm circumference (cm)
Wrist circumference (cm)
This data set can be used to illustrate data cleaning and multiple regression techniques (e.g. Johnson 1996). Percentage of body fat for an individual can be estimated from body density, for instance by using Siri's (1956) equation:
Percentage of body fat = 495/Density - 450
Volume, and hence body density, can be accurately measured by underwater weighing (e.g. Katch and McArdle, 1977). However, this procedure for the accurate measurement of body fat is inconvenient and costly. It is desirable to have easy methods of estimating body fat from body measurements.
Measurement standards are apparently those listed in Behnke and Wilmore (1974), pp. 45-48 where, for instance, the abdomen 2 circumference is measured 'laterally, at the level of the iliac crests, and anteriorly, at the umbilicus'.
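As a small illustration of Siri's (1956) equation in the form used in the examples of this topic (percent body fat = 495/density - 450), the following sketch converts between density and percent body fat. The function names `siri_bodyfat` and `siri_density` are hypothetical and are not part of the package.

```r
# Hypothetical helpers (not exported by mpae): Siri's (1956) equation
# in the form percent body fat = 495/density - 450, and its inverse.
siri_bodyfat <- function(density) 495 / density - 450
siri_density <- function(bodyfat) 495 / (bodyfat + 450)

siri_bodyfat(1.05)                 # percent body fat for a density of 1.05
siri_density(siri_bodyfat(1.05))   # the inverse recovers the density
siri_bodyfat(1.11) < 0             # densities above 1.1 give negative estimates
```

The last line shows why a density greater than 1.1 (the density of the "fat free mass") is suspicious: the equation then yields a negative percentage.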
Johnson (1996) uses the original data in an activity to introduce students to data cleaning before performing multiple linear regression. An examination of the data reveals some unusual cases:
Cases 48, 76, and 96 seem to have a one-digit error in the listed density values.
Case 42 appears to have a one-digit error in the height value.
Case 182 appears to have an error in the density value (as it is greater than 1.1, the density of the "fat free mass"; resulting in a negative estimate of body fat percentage that was truncated to zero).
Johnson (1996) suggests some rules for correcting these values (see examples below).
StatLib Datasets Archive: https://lib.stat.cmu.edu/datasets/bodyfat.
Johnson, R. W. (1996). Fitting Percentage of Body Fat to Simple Body Measurements. Journal of Statistics Education, 4(1). doi:10.1080/10691898.1996.11910505.
Penrose, K., Nelson, A. and Fisher, A. (1985). Generalized Body Composition Prediction Equation for Men Using Simple Measurement Techniques. Medicine and Science in Sports and Exercise, 17(2), 189. doi:10.1249/00005768-198504000-00037.
Siri, W. E. (1956). Gross Composition of the Body, in Advances in Biological and Medical Physics (Vol. IV), eds. J. H. Lawrence and C. A. Tobias, Academic Press.
bodyfat <- bodyfat.raw
# Johnson's (1996) corrections
cases <- c(48, 76, 96) # bodyfat != 495/density - 450
bodyfat$density[cases] <- 495 / (bodyfat$bodyfat[cases] + 450)
bodyfat$height[42] <- 69.5
# Other possible data entry errors
# See https://stat-ata-asu.github.io/PredictiveModelBuilding/BFdata.html
bodyfat$ankle[31] <- 23.9
bodyfat$ankle[86] <- 23.7
bodyfat$forearm[159] <- 24.9
# Outlier and influential observation
outliers <- c(182, 39)
bodyfat[outliers, ]
bodyfat <- bodyfat[-outliers, ]
# Body mass index (kg/m^2); weight is converted from lbs to kg
# and height from inches to metres
bodyfat$bmi <- with(bodyfat, (weight*0.45359237)/(height*0.0254)^2)
# Alternate body mass index
bodyfat$bmi2 <- with(bodyfat, (weight*0.45359237)^1.2/(height*0.0254)^3.3)
# See e.g. https://en.wikipedia.org/wiki/Body_fat_percentage#From_BMI
# (Adult) body fat percentage = (1.39 * BMI) + (0.16 * age) - (10.34 * gender) - 9
A dataset containing observations of customers of the industrial distribution company HBAT. The variables can be classified into three groups: the first 6 (categorical) are shopper characteristics (data warehouse classification), variables 7 to 19 (numerical) measure shopper perceptions of HBAT and the last 5 are possible target variables (responses), the purchase outcomes.
hbat
A data frame with 200 rows and 24 columns:
Customer ID.
Customer Type. Length of time a particular customer has been buying from HBAT: Menos de 1 año = less than 1 year, De 1 a 5 años = between 1 and 5 years, Más de 5 años = longer than 5 years.
Type of industry that purchases HBAT’s paper products: Revista = magazine industry, Periodico = newsprint industry.
Employee size: Pequeña (<500) = small firm, fewer than 500 employees, Grande (>=500) = large firm, 500 or more employees.
Customer location: America del norte = USA/North America, Otros = outside North America.
Distribution System. How paper products are sold to customers: Indirecta = sold indirectly through a broker, Directa = sold directly.
Product Quality. Perceived level of quality of HBAT’s paper products.
E-Commerce Activities/Web Site. Overall image of HBAT’s Web site, especially user-friendliness.
Technical Support. Extent to which technical support is offered to help solve product/service issues.
Complaint Resolution. Extent to which any complaints are resolved in a timely and complete manner.
Advertising. Perceptions of HBAT’s advertising campaigns in all types of media.
Product Line. Depth and breadth of HBAT’s product line to meet customer needs.
Salesforce Image. Overall image of HBAT’s salesforce.
Competitive Pricing. Extent to which HBAT offers competitive prices.
Warranty and Claims. Extent to which HBAT stands behind its product/service warranties and claims.
New Products. Extent to which HBAT develops and sells new products.
Ordering and Billing. Perception that ordering and billing is handled efficiently and correctly.
Price Flexibility. Perceived willingness of HBAT sales reps to negotiate price on purchases of paper products.
Delivery Speed. Amount of time it takes to deliver the paper products once an order has been confirmed.
Customer satisfaction with past purchases from HBAT, measured on a 10-point graphic rating scale.
Likelihood of recommending HBAT to other firms as a supplier of paper products, measured on a 10-point graphic rating scale.
Likelihood of purchasing paper products from HBAT in the future, measured on a 10-point graphic rating scale.
Percentage of Purchases from HBAT. Percentage of the responding firm’s paper needs purchased from HBAT, measured on a 100-point percentage scale.
Perception of Future Relationship with HBAT. Extent to which the customer/respondent perceives his or her firm would engage in strategic alliance/partnership with HBAT: No = would not consider, Si = yes, would consider strategic alliance or partnership.
For more details, consult the reference Hair et al. (1998).
Hair et al. (1998).
Hair, J. F., Anderson, R. E., Tatham, R. L., and Black, W. (1998). Multivariate Data Analysis. Prentice Hall.
str(hbat)
as.data.frame(attr(hbat, "variable.labels"))
summary(hbat)
Generates plots comparing predictions with observations.
pred.plot(pred, obs, ...)

## Default S3 method:
pred.plot(
  pred,
  obs,
  xlab = "Predicted",
  ylab = "Observed",
  lm.fit = TRUE,
  lowess = TRUE,
  ...
)

## S3 method for class 'factor'
pred.plot(
  pred,
  obs,
  type = c("frec", "perc", "cperc"),
  xlab = "Observed",
  ylab = NULL,
  legend.title = "Predicted",
  label.bars = TRUE,
  ...
)
pred | a numeric vector with the predicted values. |
obs | a numeric vector with the observed values. |
... | additional graphical parameters or further arguments passed to other methods (e.g. to |
xlab | a title for the x axis. |
ylab | a title for the y axis. |
lm.fit | logical indicating if a |
lowess | logical indicating if a |
type | types of the desired plots. Any combination of the following values is possible: |
legend.title | a title for the legend. |
label.bars | if |
The default method draws a scatter plot of the observed values against the predicted values.
pred.plot.factor() creates bar plots representing frequencies, percentages or conditional percentages of pred within levels of obs. This method is a front end to RcmdrMisc::Barplot().
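The three kinds of quantities mentioned for the factor method (frequencies, percentages, and conditional percentages of pred within levels of obs) can be tabulated in base R as sketched below. This is not the package's plotting code; the toy vectors are made up, their levels merely mimic a two-level response, and the mapping of the object names `frec`, `perc` and `cperc` to the `type` values is an assumption based on the description above.

```r
# Toy observed and predicted classes (hypothetical data for this sketch)
obs  <- factor(c("No", "No", "No", "Si", "Si", "Si", "Si", "No"))
pred <- factor(c("No", "Si", "No", "Si", "Si", "No", "Si", "No"))

frec  <- table(pred, obs)                     # joint frequencies
perc  <- 100 * prop.table(frec)               # percentages of the total
cperc <- 100 * prop.table(frec, margin = 2)   # percentages of pred within each level of obs

frec
round(cperc, 1)
```

Each column of `cperc` sums to 100, so it reads as the distribution of the predictions for a given observed class.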
The default method invisibly returns the fitted linear model if lm.fit == TRUE.
pred.plot.factor() invisibly returns the horizontal coordinates of the centers of the bars.
set.seed(1)
nobs <- nrow(hbat)
itrain <- sample(nobs, 0.8 * nobs)
train <- hbat[itrain, ]
test <- hbat[-itrain, ]

# Regression
fit <- lm(fidelida ~ velocida + calidadp, data = train)
pred <- predict(fit, newdata = test)
obs <- test$fidelida
res <- pred.plot(pred, obs)
summary(res)

# Classification
fit2 <- glm(alianza ~ velocida + calidadp, family = binomial, data = train)
obs <- test$alianza
p.est <- predict(fit2, type = "response", newdata = test)
pred <- factor(p.est > 0.5, labels = levels(obs))
pred.plot(pred, obs, type = "frec", style = "parallel")
old.par <- par(mfrow = c(1, 2))
pred.plot(pred, obs, type = c("perc", "cperc"))
par(old.par)
Computes the standardized (regression) coefficients, also called beta coefficients or beta weights, to quantify the importance (the effect) of the predictors on the dependent variable in a multiple regression analysis where the variables are measured in different units.
scaled.coef(object, ...)

## Default S3 method:
scaled.coef(object, scale.response = TRUE, complete = FALSE, ...)
object | an object for which the extraction of model coefficients is meaningful. |
... | further arguments passed to or from other methods. |
scale.response | logical indicating if the response variable should be standardized. |
complete | for the default (used for lm, etc) and aov methods: logical indicating if the full coefficient vector should be returned also in case of an over-determined system where some coefficients will be set to NA. |
The beta weights are the coefficient estimates resulting from a regression
analysis where the underlying data have been standardized so that the
variances of dependent and explanatory variables are equal to 1.
Therefore, standardized coefficients are unitless and refer to how many
standard deviations a dependent variable will change, per standard deviation
increase in the predictor variable.
See https://en.wikipedia.org/wiki/Standardized_coefficient or QuantPsyc::lm.beta.
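The relationship described above can be checked by hand in base R: for a linear model with numeric predictors, each slope multiplied by sd(x)/sd(y) equals the slope obtained after standardizing all variables. The sketch below uses the built-in `mtcars` data as a stand-in (it is not one of the package datasets) and does not call `scaled.coef()` itself.

```r
# Stand-in data: mtcars ships with base R (datasets package)
fit <- lm(mpg ~ wt + hp, data = mtcars)

b <- coef(fit)[-1]                          # slopes, intercept dropped
x <- model.matrix(fit)[, -1, drop = FALSE]  # predictor columns
y <- model.response(model.frame(fit))       # response vector

# Beta weights: slope times sd(predictor) / sd(response)
beta <- b * apply(x, 2, sd) / sd(y)

# Same slopes from a fit on standardized variables
fit_std <- lm(scale(mpg) ~ scale(wt) + scale(hp), data = mtcars)
all.equal(unname(beta), unname(coef(fit_std)[-1]))
```

Multiplying only by sd(x), without dividing by sd(y), gives the slopes of a fit where the predictors are standardized but the response is left in its original units.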
Based on QuantPsyc::lm.beta.
A named vector with the scaled coefficients.
fit <- lm(fidelida ~ velocida + calidadp, hbat)
coef(fit)
scaled.coef(fit)
fit2 <- lm(scale(fidelida) ~ scale(velocida) + scale(calidadp), hbat)
coef(fit2)
fit3 <- lm(fidelida ~ scale(velocida) + scale(calidadp), hbat)
coef(fit3)
scaled.coef(fit, scale.response = FALSE)
A subset related to the white variant of the Portuguese "Vinho Verde" wine, containing physicochemical information (fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates and alcohol) and sensory (quality).
winequality
A data frame with 1,250 rows and 12 columns:
fixed acidity
volatile acidity
citric acid
residual sugar
chlorides
free sulfur dioxide
total sulfur dioxide
density
pH
sulphates
alcohol
median of at least 3 evaluations of wine quality carried out by experts, who evaluated them between 0 (very bad) and 10 (very excellent)
For more details, consult https://www.vinhoverde.pt/en/ or the reference Cortez et al. (2009).
UCI Machine Learning Repository: https://archive.ics.uci.edu/dataset/186/wine+quality.
Cortez, P., Cerdeira, A., Almeida, F., Matos, T., & Reis, J. (2009). Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, 47(4), 547-553.
str(winequality)
A subset related to the white variant of the Portuguese "Vinho Verde" wine, containing physicochemical information (fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates and alcohol) and sensory (taste), which indicates the quality of the wine (it is considered good if the median of the wine quality evaluations, made by experts, who evaluated them between 0 = very bad and 10 = very excellent, is not less than 6).
winetaste
A data frame with 1,250 rows and 12 columns:
fixed acidity
volatile acidity
citric acid
residual sugar
chlorides
free sulfur dioxide
total sulfur dioxide
density
pH
sulphates
alcohol
factor with levels "good" and "bad" indicating the quality of the wine
For more details, consult https://www.vinhoverde.pt/en/ or the reference Cortez et al. (2009).
UCI Machine Learning Repository: https://archive.ics.uci.edu/dataset/186/wine+quality.
Cortez, P., Cerdeira, A., Almeida, F., Matos, T., & Reis, J. (2009). Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, 47(4), 547-553.
winetaste <- winequality[, names(winequality)!="quality"]
winetaste$taste <- factor(winequality$quality < 6, labels = c('good', 'bad')) # levels = c('FALSE', 'TRUE')
str(winetaste)