Encoder for Qualitative Variables — factor.encoder • midr

factor.encoder() returns an encoder for a qualitative variable.

Usage

factor.encoder(
  x,
  k,
  use.catchall = TRUE,
  catchall = "(others)",
  tag = "x",
  frame = NULL,
  weights = NULL
)

factor.frame(levels, catchall = "(others)", tag = "x")

Arguments

x: a vector to be encoded as a qualitative variable.
k: an integer specifying the maximum number of distinct levels. If not positive, all unique values of x are used as levels.
use.catchall: logical. If TRUE, less frequent levels are dropped and replaced by the catchall level.
catchall: a character string to be used as the catchall level.
tag: character string. The name of the variable.
frame: a "factor.frame" object or a character vector that defines the levels of the variable.
weights: optional. A numeric vector of sample weights for each value of x.
levels: a vector to be used as the levels of the variable.

Value

factor.encoder() returns a list containing the following components:

frame: an object of class "factor.frame".
encode: a function to encode x into a dummy matrix.
n: the number of encoding levels.
type: the type of encoding.

factor.frame() returns a "factor.frame" object containing the encoding information.

Details

factor.encoder() extracts the unique values (levels) from the vector x and returns a list whose encode() function converts a vector into a dummy matrix by one-hot encoding. If use.catchall is TRUE and the number of levels exceeds k, only the k - 1 most frequent levels are kept and all other values are mapped to the catchall level. In the examples below, values that match no level (including NA) are encoded as a row of zeros when the frame has no catchall level, and as the catchall level when it has one.
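The level-selection rule described above can be sketched in base R. This is only an illustration and not midr's implementation: top_levels() and one_hot() are hypothetical helpers, the sketch ignores the weights argument, and its tie-breaking among equally frequent levels is an assumption.

```r
# Hypothetical helper (not part of midr): choose the levels to keep.
top_levels <- function(x, k, catchall = "(others)") {
  # Count each unique value, most frequent first (NA is not counted)
  counts <- sort(table(x), decreasing = TRUE)
  lv <- names(counts)
  # Keep every level when k is not positive or the levels already fit
  if (k <= 0 || length(lv) <= k) return(lv)
  # Otherwise keep the k - 1 most frequent levels and append the catchall
  c(lv[seq_len(k - 1)], catchall)
}

# Hypothetical helper (not part of midr): one-hot encode x against levels.
one_hot <- function(x, levels, catchall = "(others)") {
  x <- as.character(x)
  # With a catchall level, unmatched values and NA are mapped to it
  if (catchall %in% levels) x[is.na(x) | !(x %in% levels)] <- catchall
  # Compare every value with every level to build the 0/1 matrix
  m <- outer(x, levels, "==") * 1
  # Without a catchall, NA comparisons become rows of zeros
  m[is.na(m)] <- 0
  colnames(m) <- levels
  m
}

x <- c("a", "a", "a", "b", "b", "c", "d")
lv <- top_levels(x, k = 3)        # the levels "a", "b" and "(others)"
m <- one_hot(c("a", "c", NA), lv) # "c" and NA land in the "(others)" column
```

Here four distinct values exceed k = 3, so only the two most frequent levels survive and everything else shares the catchall column, which keeps the width of the dummy matrix bounded by k.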

Examples

data(iris, package = "datasets")
enc <- factor.encoder(x = iris$Species, use.catchall = FALSE, tag = "Species")
enc$frame
#>      Species Species_level
#> 1     setosa             1
#> 2 versicolor             2
#> 3  virginica             3
enc$encode(x = c("setosa", "virginica", "ensata", NA, "versicolor"))
#>      setosa versicolor virginica
#> [1,]      1          0         0
#> [2,]      0          0         1
#> [3,]      0          0         0
#> [4,]      0          0         0
#> [5,]      0          1         0

frm <- factor.frame(c("setosa", "virginica"), "other iris")
enc <- factor.encoder(x = iris$Species, frame = frm)
enc$encode(c("setosa", "virginica", "ensata", NA, "versicolor"))
#>      setosa virginica other iris
#> [1,]      1         0          0
#> [2,]      0         1          0
#> [3,]      0         0          1
#> [4,]      0         0          1
#> [5,]      0         0          1

enc <- factor.encoder(x = iris$Species, frame = c("setosa", "versicolor"))
enc$encode(c("setosa", "virginica", "ensata", NA, "versicolor"))
#>      setosa versicolor
#> [1,]      1          0
#> [2,]      0          0
#> [3,]      0          0
#> [4,]      0          0
#> [5,]      0          1