skimr – R Package Use Case Collection

Package overview

The skimr package specializes in summarizing data frames. Its skim() function produces a more detailed and more readable summary than base R's summary().

Summarizing a data frame

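Applying skim() to a data frame prints a summary for each variable: the number of missing values (n_missing), the proportion of non-missing values (complete_rate), the mean and standard deviation of numeric variables, and the most frequent levels and number of levels of factor variables. A distinctive feature is the small inline histogram printed for each numeric variable.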

library(skimr)

skim(iris)

Data summary
Name	iris
Number of rows	150
Number of columns	5
_______________________
Column type frequency:
factor	1
numeric	4
________________________
Group variables	None

Variable type: factor

skim_variable	n_missing	complete_rate	ordered	n_unique	top_counts
Species	0	1	FALSE	3	set: 50, ver: 50, vir: 50

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
Sepal.Length	1	5.84	0.83	4.3	5.1	5.80	6.4	7.9	▆▇▇▅▂
Sepal.Width	1	3.06	0.44	2.0	2.8	3.00	3.3	4.4	▁▆▇▂▁
Petal.Length	1	3.76	1.77	1.0	1.6	4.35	5.1	6.9	▇▁▆▇▂
Petal.Width	1	1.20	0.76	0.1	0.3	1.30	1.8	2.5	▇▁▇▅▃
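
When only a few columns are of interest, skim() can also be restricted to them. The following is a minimal sketch, assuming the tidyselect-style column selection described in the skimr documentation; the object name sepal_skim is hypothetical and used only for illustration.

```r
library(skimr)

# Summarize only the two sepal measurements.
# Columns are listed after the data argument (tidyselect-style selection).
sepal_skim <- skim(iris, Sepal.Length, Sepal.Width)

# The result has one row per summarized variable.
sepal_skim$skim_variable
```

Selecting columns up front keeps the printed summary short for wide data frames.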

Processing and extracting summary results

skim() returns a data frame that carries the special class "skim_df".

iris_skim <- skim(iris)
class(iris_skim)

[1] "skim_df"    "tbl_df"     "tbl"        "data.frame"
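
Because the result is also a tibble, ordinary dplyr verbs can be applied to it. The sketch below assumes the column naming used by skimr, in which type-specific summaries are prefixed with their type (for example numeric.mean); the name high_mean is hypothetical.

```r
library(skimr)

iris_skim <- skim(iris)

# Type-specific summary columns are prefixed with the type they belong to,
# e.g. numeric.mean or factor.n_unique.
names(iris_skim)

# Keep only the numeric variables whose mean exceeds 3.
high_mean <- dplyr::filter(iris_skim, skim_type == "numeric", numeric.mean > 3)
high_mean$skim_variable
```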

skimr provides functions for working with "skim_df" objects. For example, yank() extracts the summary for a single data type (skim_type).

# Extract the numeric variables
iris_skim |> yank("numeric")

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
Sepal.Length	1	5.84	0.83	4.3	5.1	5.80	6.4	7.9	▆▇▇▅▂
Sepal.Width	1	3.06	0.44	2.0	2.8	3.00	3.3	4.4	▁▆▇▂▁
Petal.Length	1	3.76	1.77	1.0	1.6	4.35	5.1	6.9	▇▁▆▇▂
Petal.Width	1	1.20	0.76	0.1	0.3	1.30	1.8	2.5	▇▁▇▅▃

In addition, focus() selects specific summary columns; it always keeps skim_type and skim_variable.

# Select a subset of the summaries, then extract the factor variables
iris_skim |> focus(factor.n_unique, factor.ordered) |> yank("factor")

Variable type: factor

skim_variable	n_unique	ordered
Species	3	FALSE
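
To pull out every data type at once rather than one at a time, skimr also offers partition(). This is a small sketch, assuming partition() returns a list with one element per skim_type as described in the package documentation; the name parts is hypothetical.

```r
library(skimr)

iris_skim <- skim(iris)

# Split the skim_df into one table per data type.
parts <- partition(iris_skim)

# For iris, the elements are named after the two types in the data.
names(parts)

# Each element holds the same information that yank() returns for that type.
parts$factor
```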

Summarizing a grouped data frame

skim() also supports data frames grouped with group_by(), and reports the summary of each variable per group.

iris |> dplyr::group_by(Species) |> skim()

Data summary
Name	dplyr::group_by(iris, Spe…
Number of rows	150
Number of columns	5
_______________________
Column type frequency:
numeric	4
________________________
Group variables	Species

Variable type: numeric

skim_variable	Species	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
Sepal.Length	setosa	1	5.01	0.35	4.3	4.80	5.00	5.20	5.8	▃▃▇▅▁
Sepal.Length	versicolor	1	5.94	0.52	4.9	5.60	5.90	6.30	7.0	▂▇▆▃▃
Sepal.Length	virginica	1	6.59	0.64	4.9	6.23	6.50	6.90	7.9	▁▃▇▃▂
Sepal.Width	setosa	1	3.43	0.38	2.3	3.20	3.40	3.68	4.4	▁▃▇▅▂
Sepal.Width	versicolor	1	2.77	0.31	2.0	2.52	2.80	3.00	3.4	▁▅▆▇▂
Sepal.Width	virginica	1	2.97	0.32	2.2	2.80	3.00	3.18	3.8	▂▆▇▅▁
Petal.Length	setosa	1	1.46	0.17	1.0	1.40	1.50	1.58	1.9	▁▃▇▃▁
Petal.Length	versicolor	1	4.26	0.47	3.0	4.00	4.35	4.60	5.1	▂▂▇▇▆
Petal.Length	virginica	1	5.55	0.55	4.5	5.10	5.55	5.88	6.9	▃▇▇▃▂
Petal.Width	setosa	1	0.25	0.11	0.1	0.20	0.20	0.30	0.6	▇▂▂▁▁
Petal.Width	versicolor	1	1.33	0.20	1.0	1.20	1.30	1.50	1.8	▅▇▃▆▁
Petal.Width	virginica	1	2.03	0.27	1.4	1.80	2.00	2.30	2.5	▂▇▆▅▇
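
As a cross-check of what the grouped summary reports, the per-group means can be reproduced with base R alone. This sketch uses only tapply() and round(); the name group_means is hypothetical.

```r
# Per-species mean of Sepal.Length, computed with base R for comparison
# with the `mean` column of the grouped skim() output.
group_means <- tapply(iris$Sepal.Length, iris$Species, mean)

# Round to two decimals, matching the precision of the printed summary.
round(group_means, 2)
```

The rounded values agree with the Sepal.Length rows of the table above (5.01, 5.94, and 6.59).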

Summarizing objects other than data frames

skimr is designed to summarize data frames efficiently, but it can also be used on other objects that can be converted to a data frame, such as vectors, matrices, and time series.

# Skim an integer vector
skim(1:100)

Data summary
Name	1:100
Number of rows	100
Number of columns	1
_______________________
Column type frequency:
numeric	1
________________________
Group variables	None

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
data	0	1	50.5	29.01	1	25.75	50.5	75.25	100	▇▇▇▇▇

# Skim a matrix
skim(matrix(1:9, 3, 3))

Data summary
Name	matrix(1:9, 3, 3)
Number of rows	3
Number of columns	3
_______________________
Column type frequency:
numeric	3
________________________
Group variables	None

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
V1	1	2	1	1	1.5	2	2.5	3	▇▁▇▁▇
V2	1	5	1	4	4.5	5	5.5	6	▇▁▇▁▇
V3	1	8	1	7	7.5	8	8.5	9	▇▁▇▁▇

# Skim a time series
class(Nile) # ts

[1] "ts"

skim(Nile)

Data summary
Name	Nile
Number of rows	100
Number of columns	1
_______________________
Column type frequency:
ts	1
________________________
Group variables	None

Variable type: ts

skim_variable	n_missing	complete_rate	start	end	frequency	deltat	mean	sd	min	max	median	line_graph
x	0	1	1871	1970	1	1	919.35	169.23	456	1370	893.5	⢁⠊⢂⠊⢄⣀⠔⢄
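
The variable names V1, V2, and V3 in the matrix summary match what base R assigns when a matrix is converted to a data frame. The sketch below makes that conversion explicit; it assumes skim() performs an equivalent as.data.frame()-style coercion internally, and the names m_df and m_skim are hypothetical.

```r
library(skimr)

# Converting a matrix without column names yields columns V1, V2, V3.
m_df <- as.data.frame(matrix(1:9, 3, 3))
names(m_df)

# Skimming the converted data frame directly.
m_skim <- skim(m_df)
m_skim$skim_variable
```

Converting explicitly is also a convenient place to assign meaningful column names before skimming.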

Building a custom summary function

skim_with() lets you build a custom skim function that uses your own summary functions. See the package vignette for details.

my_skim <- skim_with(numeric = sfl(n = length, sum, var))
iris |>
  dplyr::group_by(Species) |>
  my_skim() |>
  yank("numeric") |>
  dplyr::select(skim_variable, Species, hist, n, sum, var)

Variable type: numeric

skim_variable	Species	hist	n	sum	var
Sepal.Length	setosa	▃▃▇▅▁	50	250.3	0.12
Sepal.Length	versicolor	▂▇▆▃▃	50	296.8	0.27
Sepal.Length	virginica	▁▃▇▃▂	50	329.4	0.40
Sepal.Width	setosa	▁▃▇▅▂	50	171.4	0.14
Sepal.Width	versicolor	▁▅▆▇▂	50	138.5	0.10
Sepal.Width	virginica	▂▆▇▅▁	50	148.7	0.10
Petal.Length	setosa	▁▃▇▃▁	50	73.1	0.03
Petal.Length	versicolor	▂▂▇▇▆	50	213.0	0.22
Petal.Length	virginica	▃▇▇▃▂	50	277.6	0.30
Petal.Width	setosa	▇▂▂▁▁	50	12.3	0.01
Petal.Width	versicolor	▅▇▃▆▁	50	66.3	0.04
Petal.Width	virginica	▂▇▆▅▇	50	101.3	0.08
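
The example above adds summaries to the defaults. To replace the defaults instead, skim_with() takes an append argument. This is a minimal sketch, assuming the documented behavior of append = FALSE; the skimmer name robust_skim and the choice of IQR and mad are hypothetical.

```r
library(skimr)

# Hypothetical skimmer that reports only the IQR and MAD of numeric columns.
# append = FALSE replaces the default numeric summaries instead of adding to them.
robust_skim <- skim_with(numeric = sfl(iqr = IQR, mad = mad), append = FALSE)

robust_res <- robust_skim(iris)

# The custom summaries appear as numeric.iqr and numeric.mad.
names(robust_res)
```

Replacing the defaults keeps the output narrow when only a few statistics are needed.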