factor.encoder() creates an encoder function for a qualitative (factor or character) variable.
This encoder converts the variable into a one-hot encoded (dummy) design matrix.
factor.frame() is a helper function to create a "factor.frame" object that defines the encoding scheme.
Usage
factor.encoder(
x,
k = NULL,
lump = c("none", "auto", "rank", "order"),
others = "others",
sep = ">",
weights = NULL,
frame = NULL,
tag = "x"
)
factor.frame(levels, others = NULL, map = NULL, original = NULL, tag = "x")
Arguments
- x
a vector to be encoded as a qualitative variable.
- k
an integer specifying the maximum number of distinct levels to retain (including the catch-all level). If not positive, all unique values of x are used.
- lump
a character string specifying the lumping strategy: "none" performs no lumping; "rank" lumps levels based on frequency rank; "order" merges adjacent levels based on cumulative frequency to preserve order; and "auto" selects "order" for ordered factors and "rank" for others.
- others
a character string for the catch-all level (used when lump = "rank").
- sep
a character string used to separate the start and end levels when merging ordered factors (e.g., "Level1>Level3" with the default sep = ">").
- weights
an optional numeric vector of sample weights for x.
- frame
a "factor.frame" object or a character vector that explicitly defines the levels of the variable.
- tag
the name of the variable.
- levels
a vector to be used as the levels of the variable.
- map
a named vector that maps original levels to lumped levels.
- original
a character vector to be used as the original levels for expanding the frame. Defaults to NULL.
Value
factor.encoder() returns an object of class "encoder". This is a list containing the following components:
- frame
a "factor.frame" object containing the encoding information (levels).
- n
the number of encoding levels (i.e., columns in the design matrix).
- type
a character string describing the encoding type: "factor" or "null".
- envir
an environment for the transform and encode functions.
- transform
a function transform(x, lumped = TRUE, ...) that converts a vector into a factor with the encoded levels.
- encode
a function encode(x, ...) that converts a vector into the one-hot encoded matrix.
factor.frame() returns a "factor.frame" object containing the encoding information.
Details
This function is designed to handle qualitative data for use in the MID model's linear system formulation.
The primary mechanism is one-hot encoding.
Each unique level of the input variable becomes a column in the output matrix.
For a given observation, the column corresponding to its level is assigned a 1, and all other columns are assigned 0.
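The mechanism described above can be sketched in base R. This is only an illustration of one-hot encoding in general, not the package's internal implementation; the variable names (x, lv, mat) are chosen here for the example.

```r
# Illustrative base R sketch of one-hot encoding; not the package's own code.
x <- factor(c("setosa", "virginica", "versicolor", "setosa"))
lv <- levels(x)

# Compare every observation with every level name. outer() returns a
# logical matrix (observations x levels); multiplying by 1 gives 0/1 values.
mat <- outer(as.character(x), lv, FUN = "==") * 1
colnames(mat) <- lv
mat
```

Each row of mat contains exactly one 1, in the column that matches the level of that observation.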
When a variable has many unique levels (high cardinality), you can use the lump and k arguments to reduce dimensionality.
This is crucial for preventing MID models from becoming overly complex.
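To make the idea of rank-based lumping concrete, here is a minimal base R sketch. The helper lump_by_rank() is hypothetical and not part of the package; the actual handling of ties, sample weights, and missing values in factor.encoder() may differ.

```r
# Hypothetical helper (not part of the package): keep the (k - 1) most
# frequent levels and map every other non-missing value to a catch-all level.
lump_by_rank <- function(x, k, others = "others") {
  x <- as.character(x)
  freq <- sort(table(x), decreasing = TRUE)
  keep <- names(freq)[seq_len(min(k - 1, length(freq)))]
  x[!is.na(x) & !(x %in% keep)] <- others
  # Add the catch-all level only if some value was actually lumped into it.
  factor(x, levels = c(keep, if (others %in% x) others))
}

x <- c("a", "a", "a", "b", "b", "c", "d")
lumped <- lump_by_rank(x, k = 3)
table(lumped)
```

With k = 3, the two most frequent values ("a" and "b") are retained and the rest are merged into "others", so a one-hot encoding of lumped needs three columns instead of four.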
Examples
# Create an encoder for a qualitative variable
data(iris, package = "datasets")
enc <- factor.encoder(x = iris$Species, lump = "none", tag = "Species")
enc
#>
#> Factor encoder with 3 levels
#>
#> Frame:
#> Species
#> 1 setosa
#> 2 versicolor
#> 3 virginica
#>
# Encode one observation from each species
enc$encode(iris$Species[c(50, 100, 150)])
#> setosa versicolor virginica
#> [1,] 1 0 0
#> [2,] 0 1 0
#> [3,] 0 0 1
# Lumping by rank (retain top k - 1 levels and others)
enc <- factor.encoder(x = iris$Species, k = 2, lump = "rank")
enc$encode(iris$Species[c(50, 100, 150)])
#> setosa others
#> [1,] 1 0
#> [2,] 0 1
#> [3,] 0 1
# Lumping by order (merge adjacent levels)
enc <- factor.encoder(x = iris$Species, k = 2, lump = "order")
enc$encode(iris$Species[c(50, 100, 150)])
#> setosa versicolor>virginica
#> [1,] 1 0
#> [2,] 0 1
#> [3,] 0 1
