C50

Author

データサイエンス関連基礎調査WG

Published

August 1, 2025

パッケージの概要

C50パッケージではC5.0アルゴリズムを用いた決定木モデルを使用できます。

使用例:PimaIndiansDiabetesデータの分類

PimaIndiansDiabetesデータを用いて、種々の項目からその患者が糖尿病患者であるかを判定する決定木モデルをC50パッケージを用いて構築します。

library(mlbench)
data(PimaIndiansDiabetes)
df <- PimaIndiansDiabetes
tar.idx <- 9

データの中身の確認です。(head)

head(df)
  pregnant glucose pressure triceps insulin mass pedigree age diabetes
1        6     148       72      35       0 33.6    0.627  50      pos
2        1      85       66      29       0 26.6    0.351  31      neg
3        8     183       64       0       0 23.3    0.672  32      pos
4        1      89       66      23      94 28.1    0.167  21      neg
5        0     137       40      35     168 43.1    2.288  33      pos
6        5     116       74       0       0 25.6    0.201  30      neg

データの中身の確認です。(str)

str(df)
'data.frame':   768 obs. of  9 variables:
 $ pregnant: num  6 1 8 1 0 5 3 10 2 8 ...
 $ glucose : num  148 85 183 89 137 116 78 115 197 125 ...
 $ pressure: num  72 66 64 66 40 74 50 0 70 96 ...
 $ triceps : num  35 29 0 23 35 0 32 0 45 0 ...
 $ insulin : num  0 0 0 94 168 0 88 0 543 0 ...
 $ mass    : num  33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
 $ pedigree: num  0.627 0.351 0.672 0.167 2.288 ...
 $ age     : num  50 31 32 21 33 30 26 29 53 54 ...
 $ diabetes: Factor w/ 2 levels "neg","pos": 2 1 2 1 2 1 2 1 2 2 ...

各変数の意味は以下の通りです。

変数名 データ型 概要
pregnant    num 妊娠回数    
glucose num 血漿グルコース濃度(負荷試験)
pressure     num 拡張期血圧 [mm Hg]    
triceps      num 上腕三頭筋皮膚襞の厚さ [mm]
insulin      num 2時間血清インスリン [mu U/ml]
mass num BMI   
pedigree num 糖尿病血統機能
age num 年齢
diabetes Factor クラス変数(糖尿病の検査)

summaryを表示します。

summary(df)
    pregnant         glucose         pressure         triceps     
 Min.   : 0.000   Min.   :  0.0   Min.   :  0.00   Min.   : 0.00  
 1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 62.00   1st Qu.: 0.00  
 Median : 3.000   Median :117.0   Median : 72.00   Median :23.00  
 Mean   : 3.845   Mean   :120.9   Mean   : 69.11   Mean   :20.54  
 3rd Qu.: 6.000   3rd Qu.:140.2   3rd Qu.: 80.00   3rd Qu.:32.00  
 Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.00  
    insulin           mass          pedigree           age        diabetes 
 Min.   :  0.0   Min.   : 0.00   Min.   :0.0780   Min.   :21.00   neg:500  
 1st Qu.:  0.0   1st Qu.:27.30   1st Qu.:0.2437   1st Qu.:24.00   pos:268  
 Median : 30.5   Median :32.00   Median :0.3725   Median :29.00            
 Mean   : 79.8   Mean   :31.99   Mean   :0.4719   Mean   :33.24            
 3rd Qu.:127.2   3rd Qu.:36.60   3rd Qu.:0.6262   3rd Qu.:41.00            
 Max.   :846.0   Max.   :67.10   Max.   :2.4200   Max.   :81.00            

psychパッケージのpairs.panelsを用いて、散布図の一覧を表示します。

library(psych)
df.x <- df[-tar.idx]
df.y <- df[tar.idx]
df.y <- ifelse(df.y == "pos", 1, 2)
pairs.panels(df.x, bg=c("skyblue", "salmon")[df.y], pch=21)

ggplot2パッケージを用いて、ヒストグラムを表示します。変数ごとにpos/negの分布の傾向を把握することができます。

library(ggplot2)

Attaching package: 'ggplot2'
The following objects are masked from 'package:psych':

    %+%, alpha
library(tidyr)
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
# 数値変数だけを抽出(diabetes以外)
num_vars <- names(df)[sapply(df, is.numeric)]

# 長い形式に変換
df_long <- df %>%
  pivot_longer(cols = all_of(num_vars), names_to = "variable", values_to = "value")

# ヒストグラムを1つの図にまとめて作成
ggplot(df_long, aes(x = value, fill = diabetes)) +
  geom_histogram(position = "identity", alpha = 0.5, bins = 30) +
  facet_wrap(~variable, scales = "free") +
  labs(title = "Histograms of Variables by Diabetes",
       x = "Value", y = "Count") +
  theme_minimal()

モデル構築

モデル生成のための訓練データと、モデル評価のためのテストデータに分割します。データ割合は訓練データを7割、テストデータを3割とします。

# 再現性のためにシードを設定
set.seed(1234)

# データの分割
n <- nrow(df)
train.rate <- 0.7

# データの抽出
train.set <- sample(n, n * train.rate)
train.df <- df[train.set, ]

test.set <- setdiff(1:n, train.set)
test.df <- df[test.set, ]

モデルを生成します。

library(C50)
Warning: package 'C50' was built under R version 4.5.1
model <- C5.0(
  x = train.df[-tar.idx], 
  y = train.df[[tar.idx]]
)

summaryを用いると、モデルのサマリが表示されます。

summary(model)

Call:
C5.0.default(x = train.df[-tar.idx], y = train.df[[tar.idx]])


C5.0 [Release 2.07 GPL Edition]     Fri Aug  1 17:34:30 2025
-------------------------------

Class specified by attribute `outcome'

Read 537 cases (9 attributes) from undefined.data

Decision tree:

glucose <= 127:
:...age <= 28:
:   :...mass <= 32.2: neg (113/1)
:   :   mass > 32.2:
:   :   :...triceps > 8: neg (64/8)
:   :       triceps <= 8:
:   :       :...glucose <= 110: neg (7)
:   :           glucose > 110: pos (4)
:   age > 28:
:   :...mass <= 26.3: neg (33/1)
:       mass > 26.3:
:       :...pedigree <= 0.845: neg (100/30)
:           pedigree > 0.845:
:           :...pregnant <= 1: neg (5/1)
:               pregnant > 1: pos (10)
glucose > 127:
:...glucose <= 165:
    :...mass <= 29.9: neg (55/15)
    :   mass > 29.9:
    :   :...insulin <= 196: pos (70/20)
    :       insulin > 196:
    :       :...age <= 32: neg (13/1)
    :           age > 32: pos (8/1)
    glucose > 165:
    :...pedigree > 1.114: neg (5/2)
        pedigree <= 1.114:
        :...age > 27: pos (41/1)
            age <= 27:
            :...triceps <= 26: neg (3)
                triceps > 26: pos (6)


Evaluation on training data (537 cases):

        Decision Tree   
      ----------------  
      Size      Errors  

        16   81(15.1%)   <<


       (a)   (b)    <-classified as
      ----  ----
       339    22    (a): class neg
        59   117    (b): class pos


    Attribute usage:

    100.00% glucose
     89.76% mass
     75.79% age
     31.66% pedigree
     16.95% insulin
     15.64% triceps
      2.79% pregnant


Time: 0.0 secs

plotを用いると、決定木の様子が可視化されます。(図がつぶれてしまうことがありますので、参考程度です)

plot(model)

構築したモデルを使用して予測を行います。

res_pred <- predict(model, test.df[-tar.idx])

予測結果から得られる混同行列を確認します。
(Summaryにて訓練データでの混同行列が確認できるので、以下テストデータでの混同行列の表示を省略します)

res_act <- test.df[[tar.idx]]
conf_mat <- table(res_act, res_pred)
print(conf_mat)
       res_pred
res_act neg pos
    neg 116  23
    pos  47  45

各種評価指標を確認します。計算用の関数の定義です。

create_df_res <- function(cmat){
  TN <- cmat[1,1]
  FP <- cmat[1,2]
  FN <- cmat[2,1]
  TP <- cmat[2,2]
  
  # 正解率:どれだけ正しく分類できたかの割合
  acc <- round((TP + TN) / (TP + TN + FN + FP), 3)
  
  # 適合率:陽性と判定されたものがどれだけ正しく陽性であるかの割合
  prec <- round(TP / (TP + FP), 3)
  
  # 再現率(真陽性率):実際に陽性のものをどれだけ正しく陽性と判定できたかの割合
  rec <- round(TP / (TP + FN), 3)
  
  # F値:適合率と再現率の調和平均(両者はトレードオフの関係)
  Fval <- round(2 * rec * prec / (rec + prec), 3)
  
  # 真陰性率:実際に陰性のものをどれだけ正しく陰性と判定できたかの割合
  TNRat <- round(TN / (TN + FP), 3)
  
  df_rat <- data.frame(
    Item = c("正解率", "適合率", "再現率(真陽性率)", "F値", "真陰性率"), 
    Rate = c(acc, prec, rec, Fval, TNRat)
  )
  
  df_rat
}

data.frameの出力整形用です。

library(knitr)
library(kableExtra)

Attaching package: 'kableExtra'
The following object is masked from 'package:dplyr':

    group_rows
show_df <- function(df, cap){
  kable(df, format = "pandoc", caption = cap) %>%
    kable_styling(full_width = FALSE, position = "left")
}

各種指標を確認します。

df_res <- create_df_res (conf_mat)
show_df(df_res, "結果")
結果
Item Rate
正解率 0.697
適合率 0.662
再現率(真陽性率) 0.489
F値 0.562
真陰性率 0.835

C5.0のハイパーパラメーター

以下、C5.0モデルのパラメータをいくつか変更し、モデル性能の差を簡易的に見てゆきます。

trials

trials:C5.0モデルはブースティングを実装しており、ブースティングの繰り返しの数を設定できます。

model_trials <- C5.0(
  x = train.df[-tar.idx], 
  y = train.df[[tar.idx]],
  trials = 3
)

summary(model_trials)

Call:
C5.0.default(x = train.df[-tar.idx], y = train.df[[tar.idx]], trials = 3)


C5.0 [Release 2.07 GPL Edition]     Fri Aug  1 17:34:31 2025
-------------------------------

Class specified by attribute `outcome'

Read 537 cases (9 attributes) from undefined.data

-----  Trial 0:  -----

Decision tree:

glucose <= 127:
:...age <= 28:
:   :...mass <= 32.2: neg (113/1)
:   :   mass > 32.2:
:   :   :...triceps > 8: neg (64/8)
:   :       triceps <= 8:
:   :       :...glucose <= 110: neg (7)
:   :           glucose > 110: pos (4)
:   age > 28:
:   :...mass <= 26.3: neg (33/1)
:       mass > 26.3:
:       :...pedigree <= 0.845: neg (100/30)
:           pedigree > 0.845:
:           :...pregnant <= 1: neg (5/1)
:               pregnant > 1: pos (10)
glucose > 127:
:...glucose <= 165:
    :...mass <= 29.9: neg (55/15)
    :   mass > 29.9:
    :   :...insulin <= 196: pos (70/20)
    :       insulin > 196:
    :       :...age <= 32: neg (13/1)
    :           age > 32: pos (8/1)
    glucose > 165:
    :...pedigree > 1.114: neg (5/2)
        pedigree <= 1.114:
        :...age > 27: pos (41/1)
            age <= 27:
            :...triceps <= 26: neg (3)
                triceps > 26: pos (6)

-----  Trial 1:  -----

Decision tree:

glucose > 154: pos (75.3/15.4)
glucose <= 154:
:...mass <= 26.3: neg (96.8/8.6)
    mass > 26.3:
    :...age <= 28:
        :...pressure <= 44: pos (9.1/2.4)
        :   pressure > 44: neg (150.8/33.5)
        age > 28:
        :...insulin <= 142: neg (163.6/79.5)
            insulin > 142: pos (41.4/9.3)

-----  Trial 2:  -----

Decision tree:

glucose <= 120: neg (213.2/29.8)
glucose > 120:
:...mass <= 29.9:
    :...age <= 25: neg (19.5)
    :   age > 25:
    :   :...glucose <= 164: neg (47.7/15.6)
    :       glucose > 164: pos (6.8)
    mass > 29.9:
    :...pedigree <= 0.315:
        :...age <= 30: neg (22.9/2.1)
        :   age > 30: pos (36.5/13.9)
        pedigree > 0.315:
        :...pressure <= 66: pos (29.6/3.3)
            pressure > 66:
            :...age > 49: pos (19.1/2.3)
                age <= 49:
                :...insulin > 465: neg (5.4)
                    insulin <= 465:
                    :...mass > 46.8: pos (10.3)
                        mass <= 46.8:
                        :...glucose <= 123: neg (5.8)
                            glucose > 123:
                            :...pressure <= 70: pos (11.2/0.7)
                                pressure > 70:
                                :...glucose > 156: pos (14.7/2.3)
                                    glucose <= 156:
                                    :...glucose <= 132: pos (16.6/3.5)
                                        glucose > 132: neg (32.9/11.4)


Evaluation on training data (537 cases):

Trial       Decision Tree   
-----     ----------------  
      Size      Errors  

   0        16   81(15.1%)
   1         6  110(20.5%)
   2        15   94(17.5%)
boost            79(14.7%)   <<


       (a)   (b)    <-classified as
      ----  ----
       349    12    (a): class neg
        67   109    (b): class pos


    Attribute usage:

    100.00% glucose
    100.00% mass
     99.81% age
     51.40% pedigree
     46.55% pressure
     45.81% insulin
     15.64% triceps
      2.79% pregnant


Time: 0.0 secs

結果を確認します。各種数値が向上していることが分かります。

res_trials_pred <- predict(model_trials, test.df[-tar.idx])
conf_mat_trials <- table(res_act, res_trials_pred)
df_res_trials <- create_df_res(conf_mat_trials)
show_df(df_res_trials, "結果(trials)")
結果(trials)
Item Rate
正解率 0.740
適合率 0.750
再現率(真陽性率) 0.522
F値 0.616
真陰性率 0.885

winnow

winnow:デフォルトはFALSEです。TRUEにすると、重要でない特徴量を削減してくれます。

model_winnow <- C5.0(
  x = train.df[-tar.idx], 
  y = train.df[[tar.idx]],
 control = C5.0Control(winnow = TRUE)
)

summary(model_winnow)

Call:
C5.0.default(x = train.df[-tar.idx], y = train.df[[tar.idx]], control
 = C5.0Control(winnow = TRUE))


C5.0 [Release 2.07 GPL Edition]     Fri Aug  1 17:34:32 2025
-------------------------------

Class specified by attribute `outcome'

Read 537 cases (9 attributes) from undefined.data

4 attributes winnowed
Estimated importance of remaining attributes:

     29%  glucose
      5%  age
      1%  insulin
     <1%  pedigree

Decision tree:

glucose <= 127: neg (336/55)
glucose > 127:
:...glucose > 165: pos (55/7)
    glucose <= 165:
    :...age <= 23: neg (25/5)
        age > 23:
        :...pedigree > 0.722: pos (21/3)
            pedigree <= 0.722:
            :...pedigree > 0.661: neg (6)
                pedigree <= 0.661:
                :...pedigree <= 0.305: neg (48/19)
                    pedigree > 0.305: pos (46/15)


Evaluation on training data (537 cases):

        Decision Tree   
      ----------------  
      Size      Errors  

         7  104(19.4%)   <<


       (a)   (b)    <-classified as
      ----  ----
       336    25    (a): class neg
        79    97    (b): class pos


    Attribute usage:

    100.00% glucose
     27.19% age
     22.53% pedigree


Time: 0.0 secs

結果の確認です。最初のモデルよりも正解率は高いですが、F値は下がりました。

res_winnow_pred <- predict(model_winnow, test.df[-tar.idx])
conf_mat_winnow <- table(res_act, res_winnow_pred)
df_res_winnow <- create_df_res(conf_mat_winnow)
show_df(df_res_winnow, "結果(winnow)")
結果(winnow)
Item Rate
正解率 0.710
適合率 0.736
再現率(真陽性率) 0.424
F値 0.538
真陰性率 0.899

CF

CF:信頼係数です。デフォルトは0.25です。小さくすると過学習を防いでくれます。

model_CF <- C5.0(
  x = train.df[-tar.idx], 
  y = train.df[[tar.idx]],
  control = C5.0Control(CF = 0.1)
)

summary(model_CF)

Call:
C5.0.default(x = train.df[-tar.idx], y = train.df[[tar.idx]], control
 = C5.0Control(CF = 0.1))


C5.0 [Release 2.07 GPL Edition]     Fri Aug  1 17:34:32 2025
-------------------------------

Class specified by attribute `outcome'

Read 537 cases (9 attributes) from undefined.data

Decision tree:

glucose <= 127:
:...age <= 28: neg (188/13)
:   age > 28:
:   :...mass <= 26.3: neg (33/1)
:       mass > 26.3:
:       :...pedigree <= 0.845: neg (100/30)
:           pedigree > 0.845: pos (15/4)
glucose > 127:
:...glucose > 165: pos (55/7)
    glucose <= 165:
    :...mass <= 29.9: neg (55/15)
        mass > 29.9:
        :...insulin <= 196: pos (70/20)
            insulin > 196:
            :...age <= 32: neg (13/1)
                age > 32: pos (8/1)


Evaluation on training data (537 cases):

        Decision Tree   
      ----------------  
      Size      Errors  

         9   92(17.1%)   <<


       (a)   (b)    <-classified as
      ----  ----
       329    32    (a): class neg
        60   116    (b): class pos


    Attribute usage:

    100.00% glucose
     66.48% age
     54.75% mass
     21.42% pedigree
     16.95% insulin


Time: 0.0 secs

結果の確認です。最初のモデルよりも性能が向上しています。

res_CF_pred <- predict(model_CF, test.df[-tar.idx])
conf_mat_CF <- table(res_act, res_CF_pred)
df_res_CF <- create_df_res(conf_mat_CF)
show_df(df_res_CF, "結果(CF)")
結果(CF)
Item Rate
正解率 0.723
適合率 0.684
再現率(真陽性率) 0.565
F値 0.619
真陰性率 0.827

minCases

minCases:葉ノードに含まれる必要サンプル数です。デフォルトは2です。大きくすると過学習を防いでくれます。

model_minCases <- C5.0(
  x = train.df[-tar.idx], 
  y = train.df[[tar.idx]],
  control = C5.0Control(minCases = 15)
)

summary(model_minCases)

Call:
C5.0.default(x = train.df[-tar.idx], y = train.df[[tar.idx]], control
 = C5.0Control(minCases = 15))


C5.0 [Release 2.07 GPL Edition]     Fri Aug  1 17:34:32 2025
-------------------------------

Class specified by attribute `outcome'

Read 537 cases (9 attributes) from undefined.data

Decision tree:

glucose <= 127:
:...age <= 28: neg (188/13)
:   age > 28:
:   :...mass <= 26.3: neg (33/1)
:       mass > 26.3:
:       :...pedigree <= 0.845: neg (100/30)
:           pedigree > 0.845: pos (15/4)
glucose > 127:
:...glucose > 165: pos (55/7)
    glucose <= 165:
    :...mass <= 29.9: neg (55/15)
        mass > 29.9:
        :...pregnant > 7: pos (18/2)
            pregnant <= 7:
            :...insulin <= 196: pos (56/18)
                insulin > 196: neg (17/4)


Evaluation on training data (537 cases):

        Decision Tree   
      ----------------  
      Size      Errors  

         9   94(17.5%)   <<


       (a)   (b)    <-classified as
      ----  ----
       330    31    (a): class neg
        63   113    (b): class pos


    Attribute usage:

    100.00% glucose
     62.57% age
     54.75% mass
     21.42% pedigree
     16.95% pregnant
     13.59% insulin


Time: 0.0 secs

結果の確認です。最初のモデルよりも性能が向上しています。

res_minCases_pred <- predict(model_minCases, test.df[-tar.idx])
conf_mat_minCases <- table(res_act, res_minCases_pred)
df_res_minCases <- create_df_res(conf_mat_minCases)
show_df(df_res_minCases, "結果(minCases)")
結果(minCases)
Item Rate
正解率 0.727
適合率 0.699
再現率(真陽性率) 0.554
F値 0.618
真陰性率 0.842

各モデルの結果を一覧にします。

df_res_all <- data.frame(
  Item = df_res[,1],
  default = df_res[, 2], 
  trials = df_res_trials[,2], 
  winnow = df_res_winnow[,2], 
  CF = df_res_CF[,2],
  minCases = df_res_minCases[,2])
show_df(df_res_all, "各モデルの結果一覧")
各モデルの結果一覧
Item default trials winnow CF minCases
正解率 0.697 0.740 0.710 0.723 0.727
適合率 0.662 0.750 0.736 0.684 0.699
再現率(真陽性率) 0.489 0.522 0.424 0.565 0.554
F値 0.562 0.616 0.538 0.619 0.618
真陰性率 0.835 0.885 0.899 0.827 0.842