Title: | Ascent Training Datasets |
---|---|
Description: | Datasets to be used primarily in conjunction with Ascent training materials but also for the book 'SAMS Teach Yourself R in 24 Hours' (ISBN: 978-0-672-33848-9). Version 1.0-7 is largely for use with the book; however, version 1.1 has a much greater focus on use with training materials, whilst retaining compatibility with the book. |
Authors: | Ascent [aut], Harry Alexander [aut, cre, ctb, dtc, rev] |
Maintainer: | Harry Alexander <[email protected]> |
License: | GPL-2 |
Version: | 1.0.0 |
Built: | 2025-02-13 03:56:45 UTC |
Source: | https://github.com/cran/ascentTraining |
Datasets designed to be used in conjunction with Ascent training materials.
Datasets designed to be used in conjunction with Ascent training materials and the book SAMS Teach Yourself R in 24 Hours (ISBN: 978-0-672-33848-9). The data covers a range of applications and has been collected from a number of sources. The airquality dataset, from the core R datasets package, is also provided in xlsx format in the extdata directory of this package.
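As a minimal sketch of how the bundled spreadsheet might be located, the snippet below lists whatever sits in the installed package's extdata directory. The exact file name is not stated here, so the code only lists the directory contents rather than assuming a name; it also assumes the package is installed and otherwise prints a message.

```r
# Locate the extdata directory of the installed package.
# system.file() returns "" when the package or directory is not found.
extdata_dir <- system.file("extdata", package = "ascentTraining")

if (nzchar(extdata_dir)) {
  # Show the files shipped in extdata (the xlsx file should be among them).
  print(list.files(extdata_dir, full.names = TRUE))
} else {
  message("ascentTraining does not appear to be installed; nothing to list.")
}
```

Once the path is known, a spreadsheet reader such as readxl::read_excel() could be used to open the file.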
Ascent
Contact: Ascent [email protected]
Data concerns city-cycle fuel consumption - revised from CMU StatLib library.
auto_mpg
A matrix containing 398 observations and 10 attributes.
mpg
Miles per gallon of the engine. Predictor attribute
cylinders
Number of cylinders in the engine
displacement
Engine displacement
horsepower
Horsepower of the car
weight
Weight of the car (lbs)
acceleration
Acceleration of the car (seconds taken for 0-60mph)
model_year
Model year of the car in the 1900s
origin
Car origin
make
Car manufacturer
car_name
Name of the car
http://archive.ics.uci.edu/ml/datasets/Auto+MPG
Quinlan, R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings of the Tenth International Conference on Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.
A collection of BBC news articles from the business or politics sections. There are a total of 927 articles used.
bbc_articles
A tibble with 201,571 observations, each a word in a document.
word
A word in an article
document
The document/article ID where the word was taken from
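The one-word-per-row layout makes per-document word counts straightforward. The sketch below uses a small made-up data frame with the same two columns (the words and document IDs are hypothetical, not taken from the dataset) and counts words per document with base R.

```r
# Toy stand-in for bbc_articles: one word per row, with a document ID.
# The words and IDs are invented for illustration.
articles <- data.frame(
  word = c("market", "shares", "market", "vote", "vote", "budget"),
  document = c("doc_a", "doc_a", "doc_a", "doc_b", "doc_b", "doc_b"),
  stringsAsFactors = FALSE
)

# Cross-tabulate word by document, then keep only combinations that occur.
counts <- as.data.frame(
  table(word = articles$word, document = articles$document),
  stringsAsFactors = FALSE
)
counts <- counts[counts$Freq > 0, ]

# Most frequent words first within each document.
counts <- counts[order(counts$document, -counts$Freq), ]
print(counts)
```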
Full BBC Articles data
bbc_articles_full
A tibble with 927 observations, one per document, and two columns.
words
The words from a given article
document
The 'document' (article) ID
A collection of business and politics BBC news articles. Each row represents one article (document), with a document ID and a string of the text content with stop words removed. This is a 'dirty' version of the bbc_articles dataset, where we now have a string of text for each observation, as opposed to a single word.
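To illustrate the relationship between the two layouts, the sketch below splits a string-per-document table into one word per row. The data frame is a made-up stand-in with the same column names (words, document); the article text and IDs are hypothetical, and splitting on whitespace is an assumption about how the text is delimited.

```r
# Toy stand-in for bbc_articles_full: one row per document.
full <- data.frame(
  words = c("shares rose sharply", "minister announced budget plans"),
  document = c("business_1", "politics_1"),
  stringsAsFactors = FALSE
)

# Split each text string on whitespace to get a list of word vectors.
tokens <- strsplit(full$words, "\\s+")

# Rebuild a tidy table: one word per row, repeating the document ID
# once for every word that came from that document.
tidy <- data.frame(
  word = unlist(tokens),
  document = rep(full$document, times = lengths(tokens)),
  stringsAsFactors = FALSE
)
print(tidy)
```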
A single BBC Business article (not included in the full BBC articles dataset), given in tidy, one word per row format.
bbc_business_123
A tibble with 107 observations, each a word in a document.
word
A word in an article
document
The document/article ID where the word was taken from. Note: this column has only one unique value; it is kept for comparison with other BBC datasets.
A single BBC Politics article (not included in the full BBC articles dataset), given in tidy, one word per row format.
bbc_politics_123
A tibble with 86 observations, each a word in a document.
word
A word in an article
document
The document/article ID where the word was taken from. Note: this column has only one unique value; it is kept for comparison with other BBC datasets.
Body image dataset
body_image
A tibble of 246 observations on 8 attributes.
ethnicity
Subject's ethnicity (Asian, Europn, Maori, Pacific)
married
How many times have they been married?
bodyim
Subject's rating of themselves (slight.uw, right, slight.ow, mod.ow, very.ow)
sm.ever
Have they ever smoked?
weight
Weight in kilograms
height
Height in centimetres
age
Age in years
stressgp
What stress group are they in?
A simulated dataset containing data on the self-image of subjects with differing body aesthetics
Simulated data
A mixed up collection of words from different book sections of two books.
book_sections
A tibble with 108,657 observations, each a word in a document. This data set is designed to show how LDA can be used to separate a set of mixed documents into two distinct "topics" (or books).
word
Words from a given section within a book.
document
The book section ID that the word came from.
Data taken from two books from Project Gutenberg
Dataset containing housing values in the suburbs of Boston.
boston
This data frame contains the following columns:
tract
Census tract
medv
Median value of owner-occupied homes in $1,000s.
crim
Per capita crime rate by town.
zn
Proportion of residential land zoned for lots over 25,000 sq.ft.
indus
Proportion of non-retail business acres per town.
chas
Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
nox
Nitrogen oxides concentration (parts per 10 million).
rm
Average number of rooms per dwelling.
age
Proportion of owner-occupied units built prior to 1940.
dis
Weighted mean of distances to five Boston employment centres.
rad
Index of accessibility to radial highways.
tax
Full-value property-tax rate per $10,000.
ptratio
Pupil-teacher ratio by town.
b
1000(Bk - 0.63)^2, where Bk is the proportion of blacks by town.
lstat
Lower status of the population (percent).
The boston data frame has 506 rows and 15 columns.
Harrison, D. and Rubinfeld, D.L. (1978) Hedonic prices and the demand for clean air. J. Environ. Economics and Management 5, 81–102.
Belsley D.A., Kuh, E. and Welsch, R.E. (1980) Regression Diagnostics. Identifying Influential Data and Sources of Collinearity. New York: Wiley.
The data contain measurements on cells in suspicious lumps in a woman's breast. Features are computed from a digitised image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image. All samples are classified as either benign or malignant.
breast_cancer
breast_cancer is a tibble with 22 columns. The first column is an ID column. The second indicates whether the sample is classified as benign or malignant. The remaining columns contain measurements for 20 features. Ten real-valued features are computed for each cell nucleus. The references listed below contain detailed descriptions of how these features are computed. The mean and "worst" (or largest: the mean of the three largest values) of these features were computed for each image, resulting in 20 features. Below are descriptions of these features, where * should be replaced by either mean or worst.
*_radius
mean of distances from center to points on the perimeter
*_texture
standard deviation of gray-scale values
*_perimeter
perimeter value
*_area
area value
*_smoothness
local variation in radius lengths
*_compactness
perimeter^2 / area - 1.0
*_concavity
severity of concave portions of the contour
*_concave_points
number of concave portions of the contour
*_symmetry
symmetry value
*_fractal_dimension
"coastline approximation" - 1
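As a small sketch of the naming pattern described above, the 20 feature column names can be generated by combining the two prefixes with the ten feature names. This assumes the columns follow the `mean_*` / `worst_*` pattern exactly as listed; the commented-out subsetting line further assumes the package is loaded.

```r
# The ten per-nucleus features listed above.
features <- c(
  "radius", "texture", "perimeter", "area", "smoothness",
  "compactness", "concavity", "concave_points", "symmetry",
  "fractal_dimension"
)

# Combine each prefix with each feature name: mean_radius, worst_radius, ...
feature_cols <- as.vector(outer(c("mean", "worst"), features, paste, sep = "_"))
print(feature_cols)

# With the package loaded, the feature columns could then be selected with:
# breast_cancer[, feature_cols]
```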
This breast cancer database was obtained from the University of Wisconsin Hospitals, Madison from Dr. William H. Wolberg.
https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)
Bache, K. & Lichman, M. (2013). UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science.
O. L. Mangasarian and W. H. Wolberg: "Cancer diagnosis via linear programming", SIAM News, Volume 23, Number 5, September 1990, pp 1 & 18.
William H. Wolberg and O. L. Mangasarian: "Multisurface method of pattern separation for medical diagnosis applied to breast cytology", Proceedings of the National Academy of Sciences, U.S.A., Volume 87, December 1990, pp 9193-9196.
K. P. Bennett & O. L. Mangasarian: "Robust linear programming discrimination of two linearly inseparable sets", Optimization Methods and Software 1, 1992, 23-34 (Gordon & Breach Science Publishers).
Wisconsin Breast Cancer Database
breast_cancer_clean_features
A list containing a training and test dataset. These come from a data frame with 699 observations on 11 variables, however the ID and class columns have been removed. There is a train to test ratio of 0.8.
Cl.thickness
Clump Thickness
Cell.size
Uniformity of Cell Size
Cell.shape
Uniformity of Cell Shape
Marg.adhesion
Marginal Adhesion
Epith.c.size
Single Epithelial Cell Size
Bare.nuclei
Bare Nuclei
Bl.cromatin
Bland Chromatin
Normal.nucleoli
Normal Nucleoli
Mitoses
Mitoses
Creator: Dr. William H. Wolberg (physician); University of Wisconsin Hospital; Madison; Wisconsin; USA
Donor: Olvi Mangasarian ([email protected])
Received: David W. Aha ([email protected])
These data have been taken from the UCI Repository of Machine Learning Databases and were converted to R format by Evgenia Dimitriadou.
1. Wolberg, W.H., & Mangasarian, O.L. (1990). Multisurface method of pattern separation for medical diagnosis applied to breast cytology. In Proceedings of the National Academy of Sciences, 87, 9193-9196.
- Size of data set: only 369 instances (at that point in time)
- Collected classification results: 1 trial only
- Two pairs of parallel hyperplanes were found to be consistent with 50% of the data
- Accuracy on remaining 50% of dataset: 93.5%
- Three pairs of parallel hyperplanes were found to be consistent with 67% of data
- Accuracy on remaining 33% of dataset: 95.9%
2. Zhang, J. (1992). Selecting typical instances in instance-based learning. In Proceedings of the Ninth International Machine Learning Conference (pp. 470-479). Aberdeen, Scotland: Morgan Kaufmann.
- Size of data set: only 369 instances (at that point in time)
- Applied 4 instance-based learning algorithms
- Collected classification results averaged over 10 trials
- Best accuracy result:
  - 1-nearest neighbor: 93.7%
  - trained on 200 instances, tested on the other 169
- Also of interest:
  - Using only typical instances: 92.2% (storing only 23.1 instances)
  - trained on 200 instances, tested on the other 169
Newman, D.J. & Hettich, S. & Blake, C.L. & Merz, C.J. (1998). UCI Repository of machine learning databases [http://www.ics.uci.edu/~mlearn/MLRepository.html]. Irvine, CA: University of California, Department of Information and Computer Science.
Wisconsin Breast Cancer Database
breast_cancer_clean_target
A list containing a training and test dataset. These come from a data frame with 699 observations on 11 variables, however only the target classes have been kept. There is a train to test ratio of 0.8.
Class.Benign
Whether the sample was classified as benign
Class.malignant
Whether the sample was classified as malignant
2. Zhang, J. (1992). Selecting typical instances in instance-based learning. In Proceedings of the Ninth International Machine Learning Conference (pp. 470-479). Aberdeen, Scotland: Morgan Kaufmann.
- Size of data set: only 369 instances (at that point in time)
- Applied 4 instance-based learning algorithms
- Collected classification results averaged over 10 trials
- Best accuracy result:
  - 1-nearest neighbor: 93.7%
  - trained on 200 instances, tested on the other 169
- Also of interest:
  - Using only typical instances: 92.2% (storing only 23.1 instances)
  - trained on 200 instances, tested on the other 169
Newman, D.J. & Hettich, S. & Blake, C.L. & Merz, C.J. (1998). UCI Repository of machine learning databases [http://www.ics.uci.edu/~mlearn/MLRepository.html]. Irvine, CA: University of California, Department of Information and Computer Science.
Creator: Dr. William H. Wolberg (physician); University of Wisconsin Hospital; Madison; Wisconsin; USA
Donor: Olvi Mangasarian ([email protected])
Received: David W. Aha ([email protected])
These data have been taken from the UCI Repository of Machine Learning Databases and were converted to R format by Evgenia Dimitriadou.
This data comes from the RITA/Transtats database
carriers
A dataframe with 1492 observations and 2 variables
Code
A character string giving the IATA code for the carrier
Description
Carrier name/description
Data from the ACS Survey detailing the use of different transport modes
commute
A tibble containing 3,496 observations of 9 variables
city
City
state
State
city_size
City size: Small = 20K to 99,999; Medium = 100K to 199,999; Large = 200K or more
mode
Mode of transport, either walk or bike
n
Number of individuals
percent
Percent of total individuals
moe
Margin of Error (percent)
state_abb
Abbreviated state name
state_region
ACS State region
American Community Survey, United States Census Bureau
R For Data Science repository: https://github.com/rfordatascience/tidytuesday/tree/master/data/2019/2019-11-05
Article and underlying data can be found at: https://www.census.gov/library/publications/2014/acs/acs-25.html?#
A simulated dataset containing demographic data about a number of subjects.
demo_data demoData
A data frame with 33 observations on the following 7 demographic variables. This data is designed so that it can be merged with the dataset pk_data.
Subject
A numeric vector giving the subject identifier
Sex
A factor with levels F and M
Age
A numeric vector giving the age of the subject
Weight
A numeric vector giving weight in kg
Height
A numeric vector giving height in cm
BMI
A numeric vector giving the subject body mass index
Smokes
A factor with levels No and Yes
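Since the demographic data is designed to be merged with pk_data, a join on the shared Subject column is the natural route. The sketch below uses two small made-up data frames that borrow a few of the documented column names; the values are hypothetical, and merging on Subject alone is an assumption based on the variable descriptions.

```r
# Toy stand-ins for demo_data and pk_data (values invented for illustration).
demo <- data.frame(
  Subject = 1:3,
  Sex = factor(c("F", "M", "F")),
  Age = c(34, 41, 29)
)
pk <- data.frame(
  Subject = rep(1:3, each = 2),
  Time = rep(c(0, 1), times = 3),
  Conc = c(0, 5.2, 0, 4.8, 0, 6.1)
)

# One row per PK observation, with the subject's demographics attached.
merged <- merge(pk, demo, by = "Subject")
print(merged)
```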
This dataset has been renamed using tidyverse-style snake_case naming conventions. However, the original name of the dataset has been kept to ensure backwards compatibility with the book SAMS Teach Yourself R in 24 Hours (ISBN: 978-0-672-33848-9).
Simulated data
Dataset containing the Dow Jones Index, a stock market index that measures the stock performance of 30 large companies listed on stock exchanges in the United States, between 2014-01-01 and 2015-01-01.
dow_jones_data dowJonesData
A data frame with 252 observations on the following 7 variables containing data from 2014-01-01 to 2015-01-01.
Date
Date of observation in character string format "%m/%d/%Y"
DJI.Open
Opening value of DJI on the specified date
DJI.High
High value of the DJI on the specified date
DJI.Low
Low value of the DJI on the specified date
DJI.Close
Closing value of the DJI on the specified date
DJI.Volume
the number of shares or contracts traded
DJI.Adj.Close
Close price adjusted for dividends and splits
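Because Date is stored as a character string in "%m/%d/%Y" format, it usually needs converting before plotting or sorting. A minimal sketch with made-up date strings in that format:

```r
# Example strings in the documented month/day/year layout (invented values).
dates_chr <- c("01/02/2014", "12/31/2014")

# Convert to Date objects using the documented format string.
dates <- as.Date(dates_chr, format = "%m/%d/%Y")
print(dates)

# With the package loaded, the same conversion would be applied as:
# dow_jones_data$Date <- as.Date(dow_jones_data$Date, format = "%m/%d/%Y")
```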
This dataset has been renamed using tidyverse-style snake_case naming conventions. However, the original name of the dataset has been kept to ensure backwards compatibility with the book SAMS Teach Yourself R in 24 Hours (ISBN: 978-0-672-33848-9).
Data obtained using yahooSeries from the fImport package.
Repeated Measures Drug data
drugs
A data frame with 20 observations on the following 3 variables.
Subj
A numeric vector, giving the subject ID
Drug
A numeric vector giving the drug ID, numbered 1 to 4
Value
A numeric vector, giving the observation value
Generated from example data used in https://www.stattutorials.com/SAS/TUTORIAL-PROC-GLM-REPEAT.htm
Data that can be used to fit or plot Emax models
emax_data emaxData
A data frame with 64 observations on the following 6 variables.
Subject
a numeric vector giving the unique subject ID
Dose
a numeric vector giving the dose group
E
a numeric vector giving the Emax
Gender
a numeric vector giving the gender
Age
a numeric vector giving the age of the subject
Weight
a numeric vector giving the weight
This dataset has been renamed using tidyverse-style snake_case naming conventions. However, the original name of the dataset has been kept to ensure backwards compatibility with the book SAMS Teach Yourself R in 24 Hours (ISBN: 978-0-672-33848-9).
Simulated data
Calculation used for Emax in Ascent materials. Note: This function has been renamed using tidyverse-style snake_case naming conventions. However, the original name of the function has been kept to ensure backwards compatibility with the book SAMS Teach Yourself R in 24 Hours (ISBN: 978-0-672-33848-9).
emax_fun(Dose, E0 = 0, ED50 = 50, Emax = 100)
Dose | User-provided dose values |
E0 | Effect at time 0 |
ED50 | Dose producing 50% of the maximum effect |
Emax | Maximum effect |
Numeric value/vector representing the response value.
emax_fun(Dose = 100)
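For orientation, the sketch below writes out the standard Emax relationship, E = E0 + Emax * Dose / (ED50 + Dose), using the same argument names and defaults as the usage above. The function name emax_sketch is hypothetical, and the formula is an assumption about what emax_fun computes rather than a copy of the package source.

```r
# Hypothetical stand-in for emax_fun(); the standard Emax form is assumed.
emax_sketch <- function(Dose, E0 = 0, ED50 = 50, Emax = 100) {
  # Effect rises from E0 towards E0 + Emax, reaching half of Emax at ED50.
  E0 + (Emax * Dose) / (ED50 + Dose)
}

# Vectorised over Dose: evaluate at zero, at ED50, and above it.
emax_sketch(Dose = c(0, 50, 100))
```

Under this assumed form, a dose equal to ED50 gives exactly half of Emax above the baseline E0.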
Simple logistic function as used in Ascent training materials. Note: This function has been renamed using tidyverse-style snake_case naming conventions. However, the original name of the function has been kept to ensure backwards compatibility with the book SAMS Teach Yourself R in 24 Hours (ISBN: 978-0-672-33848-9).
logistic_fun(Dose, E0 = 0, EC50 = 50, Emax = 1, rc = 5)
Dose | The dose value to calculate at |
E0 | Effect at time 0 |
EC50 | Dose producing 50% of the maximum effect |
Emax | Maximum effect |
rc | Rate constant |
Numeric value/vector representing the response value.
logistic_fun(Dose = 50)
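A logistic dose-response curve with these arguments is commonly written as E = E0 + Emax / (1 + exp((EC50 - Dose) / rc)). The sketch below implements that form under a hypothetical name, logistic_sketch; treating rc as a scale on the dose axis is an assumption, and the package's own definition may differ.

```r
# Hypothetical stand-in for logistic_fun(); the functional form is assumed.
logistic_sketch <- function(Dose, E0 = 0, EC50 = 50, Emax = 1, rc = 5) {
  # S-shaped curve centred on EC50; smaller rc gives a steeper transition.
  E0 + Emax / (1 + exp((EC50 - Dose) / rc))
}

# Evaluate below, at, and above the midpoint.
logistic_sketch(Dose = c(25, 50, 75))
```

With this form, the response at Dose = EC50 is halfway between E0 and E0 + Emax.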
Simulated dataset for examples of reshaping data
messy_data messyData
A data frame with 33 observations on the following 7 variables. This data has been designed to show reshaping/tidying of data.
Subject
A numeric vector giving the subject ID
Placebo.1
A numeric vector giving the subject's observed value on treatment Placebo at time 1
Placebo.2
A numeric vector giving the subject's observed value on treatment Placebo at time 2
Drug1.1
A numeric vector giving the subject's observed value on treatment Drug1 at time 1
Drug1.2
A numeric vector giving the subject's observed value on treatment Drug1 at time 2
Drug2.1
A numeric vector giving the subject's observed value on treatment Drug2 at time 1
Drug2.2
A numeric vector giving the subject's observed value on treatment Drug2 at time 2
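The column names encode two variables, treatment and time, separated by a dot, so tidying means stacking the six value columns and splitting the names. The sketch below does this in base R on a made-up two-subject data frame with the same column layout; the values and the output column names (Treatment, Time, Value) are hypothetical.

```r
# Toy stand-in for messy_data with two subjects (values invented).
messy <- data.frame(
  Subject = 1:2,
  Placebo.1 = c(10, 12), Placebo.2 = c(11, 13),
  Drug1.1 = c(20, 22), Drug1.2 = c(21, 23),
  Drug2.1 = c(30, 32), Drug2.2 = c(31, 33)
)

value_cols <- setdiff(names(messy), "Subject")

# Stack the value columns: column 1 for all subjects, then column 2, etc.
long <- data.frame(
  Subject = rep(messy$Subject, times = length(value_cols)),
  key = rep(value_cols, each = nrow(messy)),
  Value = unlist(messy[value_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)

# Split "Treatment.Time" keys into their two parts.
parts <- strsplit(long$key, ".", fixed = TRUE)
long$Treatment <- vapply(parts, `[`, character(1), 1)
long$Time <- as.integer(vapply(parts, `[`, character(1), 2))
long$key <- NULL
print(long)
```

Packages such as tidyr offer more compact ways to do the same reshaping; base R is used here only to keep the sketch dependency-free.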
This dataset has been renamed using tidyverse-style snake_case naming conventions. However, the original name of the dataset has been kept to ensure backwards compatibility with the book SAMS Teach Yourself R in 24 Hours (ISBN: 978-0-672-33848-9).
Simulated data
Clinical trial data
missing_pk missingPk
A data frame with 165 observations on the following 4 variables.
Subject
a numeric vector giving the subject identifier
Dose
a numeric vector giving the dose group
Time
a numeric vector giving the observation times
Conc
a numeric vector giving the observed concentration
This dataset has been renamed using tidyverse-style snake_case naming conventions. However, the original name of the dataset has been kept to ensure backwards compatibility with the book SAMS Teach Yourself R in 24 Hours (ISBN: 978-0-672-33848-9).
Simulated from 'pk_data'
Typical PK data
pk_data pkData
A data frame with 165 observations on the following 4 variables.
Subject
a numeric vector giving the subject identifier
Dose
a numeric vector giving the dose group
Time
a numeric vector giving the observation times
Conc
a numeric vector giving the observed concentration
This dataset has been renamed using tidyverse-style snake_case naming conventions. However, the original name of the dataset has been kept to ensure backwards compatibility with the book SAMS Teach Yourself R in 24 Hours (ISBN: 978-0-672-33848-9).
Simulated data
Insurance Policy Data
policy_data policyData
A data frame with 926 observations on the following 13 variables.
Year
The four digit year of the policy
PolicyNo
The policy number
TotalPremium
The total insurance premium
BonusMalus
Discount level
WeightClass
The weight class of the car
Region
Region of the car owner
Age
Age of the main driver
Mileage
Estimated annual mileage
Usage
Car usage
PremiumClass
Class of the car
NoClaims
Number of previous claims
GrossIncurred
Claim cost
Exposure
How long they have been driving
This dataset has been renamed using tidyverse-style snake_case naming conventions. However, the original name of the dataset has been kept to ensure backwards compatibility with the book SAMS Teach Yourself R in 24 Hours (ISBN: 978-0-672-33848-9).
Simulated based on details of how to simulate car insurance data in Modern Actuarial Risk Theory Using R 2nd Edition (Rob Kaas, Marc Goovaerts, Jan Dhaene, Michel Denuit)
Typical PK data
qtpk2
A data frame with 2061 observations on the following 8 variables.
subjid
A numeric vector giving the subject ID
treat
A factor giving the treatment
time
A numeric vector giving the observation times
qt
A numeric vector giving the QT interval value
qtcb
A numeric vector giving corrected QT interval
hr
A numeric vector giving the heart rate
rr
A numeric vector giving the R-R interval
sex
A factor giving the subject sex
A subset of the data qtpk originally provided in the QT package
This dataset has been renamed using tidyverse-style snake_case naming conventions. However, the original name of the dataset has been kept to ensure backwards compatibility with the book SAMS Teach Yourself R in 24 Hours (ISBN: 978-0-672-33848-9).
run_data runData
A data frame with 73 observations on the following 10 variables.
ID
a numeric vector giving the subject ID
DAY
a numeric vector giving the day of the observation
CL
a numeric vector giving the clearance value
V
a numeric vector giving the volume of distribution
WT
a numeric vector giving the weight
DV
a numeric vector giving the dependent variable
IPRE
a numeric vector giving the individual prediction
PRED
a numeric vector giving the population prediction
RES
a numeric vector giving the residual
WRES
a numeric vector giving the weighted residual
Simulated Data
Students simulated data
students
A tibble with 146 observations of 15 variables.
Grade
Final grade (A, B, C, D)
Pass
Did they pass the course? (No, Yes)
Exam
Mark in final exam (out of 100)
Degree
The degree type undertaken by the student
Gender
Gender of the student
Attend
Did they regularly attend class? (No, Yes)
Assign
Score obtained in mid-term assignment (out of 20)
Test
Score obtained in previous term test (out of 20)
B
Mark for short answer section (out of 20)
C
Mark for long answer section (out of 20)
MC
Mark for multiple choice section (out of 30)
Colour
Colour of exam booklet (Blue, Green, Pink, Yellow)
Stage1
Stage one grade (A, B, C)
Years.Since
Number of years since doing Stage 1
Repeat
Were they repeating the paper? (No, Yes)
Simulated data
London Tube Performance data
tube_data tubeData
A data frame with 1050 observations on the following 9 variables.
Line
A factor with 10 levels, one for each London tube line
Month
A numeric vector indicating the month of the observation
Scheduled
A numeric vector giving the scheduled running time
Excess
A numeric vector giving the excess running time
TOTAL
A numeric vector giving the total running time
Opened
A numeric vector giving the year the line opened
Length
A numeric vector giving the line length
Type
A factor indicating the type of tube line
Stations
A numeric vector giving the number of stations on the line
This dataset has been renamed using tidyverse-style snake_case naming conventions. However, the original name of the dataset has been kept to ensure backwards compatibility with the book SAMS Teach Yourself R in 24 Hours (ISBN: 978-0-672-33848-9).
This data was taken from "https://data.london.gov.uk/dataset/tube-network-performance-data-transport-committee-report"
This data was taken from Edgar Anderson's famous iris data set. This gives the measurements (in centimeters) of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica. However, the species is seen as the target variable, and as such has been removed from this dataset, whilst being added to the counterpart y_iris dataset. Furthermore, the 4 remaining 'predictor' variables have been separated into a training and test set with a ratio of 4:1, followed by centering and scaling.
x_iris
A list of two named matrices, 'train' and 'test', representing the training and test sets for the predictors. These have 4 columns each, with 120 and 30 rows respectively.
Sepal.Length
Sepal length
Sepal.Width
Sepal width
Petal.Length
Petal length
Petal.Width
Petal width
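The preparation described above (a 4:1 split followed by centering and scaling) can be sketched with the built-in iris data. The seed, the random assignment of rows, and the choice to scale the test set with the training set's means and standard deviations are all assumptions; the package does not state exactly how the split and scaling were carried out.

```r
set.seed(1)  # arbitrary seed, chosen only to make the sketch reproducible

# The four predictor columns as a numeric matrix.
x <- as.matrix(iris[, c("Sepal.Length", "Sepal.Width",
                        "Petal.Length", "Petal.Width")])

# 4:1 split: 120 training rows and 30 test rows.
n_train <- round(0.8 * nrow(x))
train_idx <- sample(nrow(x), size = n_train)

# Centre and scale the training set, then reuse its statistics for the
# test set (an assumption; the package may have scaled differently).
train <- scale(x[train_idx, ])
test <- scale(x[-train_idx, ],
              center = attr(train, "scaled:center"),
              scale = attr(train, "scaled:scale"))

x_sketch <- list(train = train, test = test)
vapply(x_sketch, nrow, integer(1))
```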
Fisher, R. A. (1936) The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7, Part II, 179-188. The data were collected by Anderson, Edgar (1935). The irises of the Gaspe Peninsula, Bulletin of the American Iris Society, 59, 2-5
Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) The New S Language. Wadsworth & Brooks/Cole.
Typical NONMEM data
xp_data xpData
A data frame with 1061 observations on the following 23 variables.
ID
a numeric vector giving the subject ID
SEX
a numeric vector giving the subject sex
RACE
a numeric vector giving the subject race
SMOK
a numeric vector giving the subject smoking status
HCTZ
a numeric vector giving the treatment status
PROP
a numeric vector giving the treatment status
CON
a numeric vector giving the treatment status
DV
a numeric vector giving the dependent variable
PRED
a numeric vector giving population prediction
RES
a numeric vector giving the residual
WRES
a numeric vector giving the weighted residual
AGE
a numeric vector giving the subject age
HT
a numeric vector giving the subject height
WT
a numeric vector giving the subject weight
SECR
a numeric vector giving the serum creatinine value
OCC
a numeric vector giving the occasion
TIME
a numeric vector giving the observation time
IPRE
a numeric vector giving individual prediction
IWRE
a numeric vector giving the individual weighted residual
SID
a numeric vector giving the site ID
CL
a numeric vector giving the clearance
V
a numeric vector giving the volume of distribution
KA
a numeric vector giving the absorption rate constant
This dataset has been renamed using tidyverse-style snake_case naming conventions. However, the original name of the dataset has been kept to ensure backwards compatibility with the book SAMS Teach Yourself R in 24 Hours (ISBN: 978-0-672-33848-9).
Simulated Data
This data was taken from Edgar Anderson's famous iris data set. This gives the measurements (in centimeters) of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica. This is the target dataset (as a counterpart to the x_iris dataset) and thus only retains the Species information. As with the x_iris dataset, the data has been split into a training and test set with a ratio of 4:1. Following this, the species class has been one-hot encoded to give three columns, one for each species level.
y_iris
A list of two named matrices, 'train' and 'test', representing the training and test sets for the target. These have 3 indicator columns each, with 120 and 30 rows respectively.
Species.setosa
Indicator column for the species class setosa
Species.versicolor
Indicator column for the species class versicolor
Species.virginica
Indicator column for the species class virginica
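One-hot encoding turns the three-level Species factor into three 0/1 indicator columns. The sketch below builds such a matrix from the built-in iris data and names the columns to match those listed above; it encodes all 150 rows and leaves out the train/test split, so it illustrates the encoding step only.

```r
species <- iris$Species

# One indicator column per factor level: 1 where the row has that level.
y <- sapply(levels(species), function(lv) as.integer(species == lv))
colnames(y) <- paste0("Species.", levels(species))

# Each row has exactly one 1, in the column for its species.
head(y, 3)
```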
Fisher, R. A. (1936) The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7, Part II, 179-188. The data were collected by Anderson, Edgar (1935). The irises of the Gaspe Peninsula, Bulletin of the American Iris Society, 59, 2-5
Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) The New S Language. Wadsworth & Brooks/Cole.