I have recently been a teaching assistant for an undergraduate course and wrote one assignment on implementing a decision tree and another on pre-pruning, so I took the chance to implement them myself.
My implementation mainly follows the C4.5 algorithm, without pruning. It is actually quite simple: it only handles discrete features and builds a binary decision tree. All features are first one-hot encoded, so every feature becomes 0 or 1, and each split is then a binary split on one of these columns.
The idea is straightforward: pick a splitting criterion, iterate over all features to find the best one, split the training set on it, remove that feature from the remaining features, and then recursively build the left and right subtrees (a minimal sketch follows the list below). The only subtlety is the stopping conditions of the recursion, of which there are three:
If all samples in the current node belong to the same class, make it a leaf node directly.
If the current depth reaches the maximum depth, make it a leaf node directly.
If no features remain to split on, make it a leaf node directly.
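As a rough sketch of that recursion (not the assignment code; node_is_pure is a hypothetical stand-in for the purity check, while best_splitting_feature and create_leaf correspond to the helpers implemented later in this post):

def build(data, features, target, depth, max_depth):
    # stopping conditions: pure node, no remaining features, or maximum depth reached
    if node_is_pure(data[target]) or len(features) == 0 or depth >= max_depth:
        return create_leaf(data[target])
    best = best_splitting_feature(data, features, target)   # best feature under the chosen criterion
    rest = [f for f in features if f != best]                # drop the feature just used
    return {'is_leaf': False, 'prediction': None, 'splitting_feature': best,
            'left':  build(data[data[best] == 0], rest, target, depth + 1, max_depth),
            'right': build(data[data[best] == 1], rest, target, depth + 1, max_depth)}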
Problem 3: Implementing a decision tree. Lab content: using the LendingClub Safe Loans dataset:
implement three splitting criteria: information gain, information gain ratio, and Gini index;
train a decision tree with each of the three criteria on the given training set;
compute the accuracy, precision, recall, and F1 score of the three trees on the training and test sets with a maximum depth of 10.
In this part, we will implement a very simple binary decision tree.
1. Loading the data

import pandas as pd
import numpy as np
import json
loans = pd.read_csv('data/lendingclub/lending-club-data.csv', low_memory=False)
The data contains two columns we could predict, safe_loans and bad_loans, marking positive and negative examples respectively. We process them so that safe_loans is +1 for positive examples and -1 for negative ones, and then drop the bad_loans column.
loans['safe_loans'] = loans['bad_loans'].apply(lambda x: +1 if x == 0 else -1)
del loans['bad_loans']
We only use the four columns grade, term, home_ownership, and emp_length as features, with safe_loans as the label, and keep just these five columns of loans.
features = ['grade', 'term', 'home_ownership', 'emp_length']
target = 'safe_loans'
loans = loans[features + [target]]
2. Splitting into training and test sets

from sklearn.utils import shuffle

loans = shuffle(loans, random_state=34)
split_line = int(len(loans) * 0.6)
train_data = loans.iloc[:split_line]
test_data = loans.iloc[split_line:]
3. Feature preprocessing

All of the features are categorical, so we need to preprocess the data with one-hot encoding.
The idea of one-hot encoding is to turn a discrete feature into a vector. Suppose feature $A$ takes the three values $\lbrace a, b, c\rbrace$ and the three values are equivalent. If we encode them as the numbers 1, 2, 3, algorithms that involve distance measures will treat 2 as close to 1 and 3 as far from 1, even though the three values should be interchangeable, which biases the model. The fix is to represent them with a three-dimensional vector: $[1, 0, 0]$ for a, $[0, 1, 0]$ for b, and $[0, 0, 1]$ for c. The distance between any two of these vectors is then the same: in Euclidean space it is always $\sqrt{2}$. That is the idea behind one-hot encoding.
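As a quick toy illustration (not part of the assignment data), pd.get_dummies expands a single categorical column into one 0/1 column per value:

import pandas as pd

toy = pd.DataFrame({'A': ['a', 'b', 'c', 'a']})
print(pd.get_dummies(toy['A'], prefix='A'))
# produces columns A_a, A_b, A_c; each row has a single 1 (newer pandas may show booleans):
#    A_a  A_b  A_c
# 0    1    0    0
# 1    0    1    0
# 2    0    0    1
# 3    1    0    0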
In pandas, get_dummies generates the one-hot vectors.
def one_hot_encoding(data, features_categorical):
    '''
    Parameter
    ----------
    data: pd.DataFrame
    features_categorical: list(str)
    '''
    for cat in features_categorical:
        one_encoding = pd.get_dummies(data[cat], prefix=cat)
        data = pd.concat([data, one_encoding], axis=1)
        del data[cat]
    return data
We first generate one-hot vectors for the training set and then for the test set. One thing to note: if feature $A$ takes the values $\lbrace a, b, c\rbrace$ in the training set, the encoding produces three columns $A_a$, $A_b$, $A_c$, and a model trained on this training set will only ever consider these three features. If a test sample has value $d$ for feature $A$, its $A_a$, $A_b$, $A_c$ are all 0 and we do not create an $A_d$ column, because that feature did not exist when the model was trained.
train_data = one_hot_encoding(train_data, features)
Get the names of all the features:
one_hot_features = train_data.columns.tolist()
one_hot_features.remove(target)
one_hot_features
['grade_A',
'grade_B',
'grade_C',
'grade_D',
'grade_E',
'grade_F',
'grade_G',
'term_ 36 months',
'term_ 60 months',
'home_ownership_MORTGAGE',
'home_ownership_OTHER',
'home_ownership_OWN',
'home_ownership_RENT',
'emp_length_1 year',
'emp_length_10+ years',
'emp_length_2 years',
'emp_length_3 years',
'emp_length_4 years',
'emp_length_5 years',
'emp_length_6 years',
'emp_length_7 years',
'emp_length_8 years',
'emp_length_9 years',
'emp_length_< 1 year']
Next we one-hot encode the test set, keeping only the features that appear in one_hot_features.
test_data_tmp = one_hot_encoding(test_data, features)
test_data = pd.DataFrame(columns=train_data.columns)
for feature in train_data.columns:
    if feature in test_data_tmp.columns:
        test_data[feature] = test_data_tmp[feature].copy()
    else:
        test_data[feature] = np.zeros(test_data_tmp.shape[0], dtype='uint8')

print(train_data.shape)
print(test_data.shape)
(73564, 25)
(49043, 25)
The training set has 73564 samples and the test set has 49043 samples. After preprocessing, every feature is 0 or 1 and the label is 1 or -1. That completes the data preprocessing.
4. Implementing the three splitting criteria

Decision trees have several commonly used splitting criteria, such as information gain, information gain ratio, and the Gini index.
We need functions that, given the labels of all samples in some node of the decision tree, compute the value of the corresponding splitting criterion.
Next we implement the three criteria above.
By convention, samples whose feature value is 0 go to the left subtree, and samples whose feature value is 1 go to the right subtree.
4.1 Information gain

Information entropy:

$\mathrm{Ent}(D) = -\sum_{k} p_k \log_2 p_k$

Information gain:

$\mathrm{Gain}(D, a) = \mathrm{Ent}(D) - \sum_{v} \frac{|D^v|}{|D|}\,\mathrm{Ent}(D^v)$

When computing entropy we adopt the convention that if $p = 0$, then $p \log_2 p = 0$.
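For example, for the label vector [-1, -1, 1, 1, 1] used in the first test case below, $-\left(\frac{3}{5}\log_2\frac{3}{5} + \frac{2}{5}\log_2\frac{2}{5}\right) \approx 0.971$, which matches the first printed value.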
def information_entropy(labels_in_node):
    '''
    Compute the information entropy of the current node.

    Parameter
    ----------
    labels_in_node: np.ndarray, e.g. [-1, 1, -1, 1, 1]

    Returns
    ----------
    float: information entropy
    '''
    num_of_samples = labels_in_node.shape[0]
    if num_of_samples == 0:
        return 0
    num_of_positive = len(labels_in_node[labels_in_node == 1])
    num_of_negative = len(labels_in_node[labels_in_node == -1])
    prob_positive = num_of_positive / num_of_samples
    prob_negative = num_of_negative / num_of_samples
    if prob_positive == 0:
        positive_part = 0
    else:
        positive_part = prob_positive * np.log2(prob_positive)
    if prob_negative == 0:
        negative_part = 0
    else:
        negative_part = prob_negative * np.log2(prob_negative)
    return -(positive_part + negative_part)
Below are six test cases.
example_labels = np.array([-1, -1, 1, 1, 1])
print(information_entropy(example_labels))

example_labels = np.array([-1, -1, 1, 1, 1, 1, 1])
print(information_entropy(example_labels))

example_labels = np.array([-1, -1, -1, -1, -1, 1, 1])
print(information_entropy(example_labels))

example_labels = np.array([-1] * 9 + [1] * 8)
print(information_entropy(example_labels))

example_labels = np.array([1] * 8)
print(information_entropy(example_labels))

example_labels = np.array([])
print(information_entropy(example_labels))
0.970950594455
0.863120568567
0.863120568567
0.997502546369
-0.0
0
Next we implement a function that computes the information gain of every feature.
def compute_information_gains(data, features, target, annotate=False):
    '''
    Compute the information gain of every feature.

    Parameter
    ----------
    data: pd.DataFrame, samples with features and the label
    features: list(str), names of the features
    target: str, name of the label
    annotate: boolean, whether to print the information gain of every feature, default False

    Returns
    ----------
    information_gains: dict, key: str, feature name
                             value: float, information gain
    '''
    information_gains = dict()
    for feature in features:
        left_split_target = data[data[feature] == 0][target]
        right_split_target = data[data[feature] == 1][target]

        left_entropy = information_entropy(left_split_target)
        left_weight = len(left_split_target) / (len(left_split_target) + len(right_split_target))

        right_entropy = information_entropy(right_split_target)
        right_weight = len(right_split_target) / (len(left_split_target) + len(right_split_target))

        current_entropy = information_entropy(data[target])
        gain = current_entropy - (left_weight * left_entropy + right_weight * right_entropy)

        information_gains[feature] = gain

        if annotate:
            print(" ", feature, gain)
    return information_gains
print(compute_information_gains(train_data, one_hot_features, target)['grade_A'])
print(compute_information_gains(train_data, one_hot_features, target)['term_ 60 months'])
print(compute_information_gains(train_data, one_hot_features, target)['grade_B'])
0.0175919801789
0.0142918503294
0.00370492003453
4.2 Information gain ratio

Information gain ratio:

$\mathrm{Gain\_ratio}(D, a) = \dfrac{\mathrm{Gain}(D, a)}{\mathrm{IV}(a)}$

where

$\mathrm{IV}(a) = -\sum_{v} \dfrac{|D^v|}{|D|}\log_2 \dfrac{|D^v|}{|D|}$

Implement a function that computes the information gain ratio of every feature.
def compute_information_gain_ratios(data, features, target, annotate=False):
    '''
    Compute and store the information gain ratio of every feature.

    Parameter
    ----------
    data: pd.DataFrame, samples with features and the label
    features: list(str), names of the features
    target: str, name of the label
    annotate: boolean, default False, whether to print annotations

    Returns
    ----------
    gain_ratios: dict, key: str, feature name
                       value: float, information gain ratio
    '''
    gain_ratios = dict()
    for feature in features:
        left_split_target = data[data[feature] == 0][target]
        right_split_target = data[data[feature] == 1][target]

        left_entropy = information_entropy(left_split_target)
        left_weight = len(left_split_target) / (len(left_split_target) + len(right_split_target))

        right_entropy = information_entropy(right_split_target)
        right_weight = len(right_split_target) / (len(left_split_target) + len(right_split_target))

        current_entropy = information_entropy(data[target])
        gain = current_entropy - (left_weight * left_entropy + right_weight * right_entropy)

        if left_weight == 0:
            left_IV = 0
        else:
            left_IV = left_weight * np.log2(left_weight)

        if right_weight == 0:
            right_IV = 0
        else:
            right_IV = right_weight * np.log2(right_weight)

        IV = -(left_IV + right_IV)

        # add a tiny epsilon so a zero IV does not cause division by zero
        gain_ratio = gain / (IV + np.finfo(np.longdouble).eps)

        gain_ratios[feature] = gain_ratio

        if annotate:
            print(" ", feature, gain_ratio)
    return gain_ratios
print(compute_information_gain_ratios(train_data, one_hot_features, target)['grade_A'])
print(compute_information_gain_ratios(train_data, one_hot_features, target)['grade_B'])
print(compute_information_gain_ratios(train_data, one_hot_features, target)['term_ 60 months'])
0.025734780668
0.00417549506943
0.0197093627186
4.3 Gini index

Gini value of dataset $D$:

$\mathrm{Gini}(D) = 1 - \sum_{k} p_k^2$

Gini index of attribute $a$:

$\mathrm{Gini\_index}(D, a) = \sum_{v} \dfrac{|D^v|}{|D|}\,\mathrm{Gini}(D^v)$

Implement the Gini value of a set of labels.
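For example, for the label vector [-1, -1, 1, 1, 1] used in the first test case below, $\mathrm{Gini} = 1 - (0.6^2 + 0.4^2) = 0.48$, which matches the first printed value.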
def gini(labels_in_node):
    '''
    Compute the Gini value of the samples in a node.

    Parameters
    ----------
    labels_in_node: np.ndarray, sample labels, e.g. [-1, -1, 1, 1, 1]

    Returns
    ---------
    gini: float, Gini value
    '''
    num_of_samples = labels_in_node.shape[0]
    if num_of_samples == 0:
        return 0
    num_of_positive = len(labels_in_node[labels_in_node == 1])
    num_of_negative = len(labels_in_node[labels_in_node == -1])
    prob_positive = num_of_positive / num_of_samples
    prob_negative = num_of_negative / num_of_samples
    gini = 1 - (prob_positive ** 2 + prob_negative ** 2)
    return gini
example_labels = np.array([-1, -1, 1, 1, 1])
print(gini(example_labels))

example_labels = np.array([-1, -1, 1, 1, 1, 1, 1])
print(gini(example_labels))

example_labels = np.array([-1, -1, -1, -1, -1, 1, 1])
print(gini(example_labels))

example_labels = np.array([-1] * 9 + [1] * 8)
print(gini(example_labels))

example_labels = np.array([1] * 8)
print(gini(example_labels))

example_labels = np.array([])
print(gini(example_labels))
0.48
0.40816326530612246
0.40816326530612246
0.4982698961937716
0.0
0
Then compute the Gini index of every feature.
def compute_gini_indices(data, features, target, annotate=False):
    '''
    Compute the Gini index of each feature when it is used for splitting.

    Parameter
    ----------
    data: pd.DataFrame, samples with features and the label
    features: list(str), names of the features
    target: str, name of the label
    annotate: boolean, default False, whether to print annotations

    Returns
    ----------
    gini_indices: dict, key: str, feature name
                        value: float, Gini index
    '''
    gini_indices = dict()
    for feature in features:
        left_split_target = data[data[feature] == 0][target]
        right_split_target = data[data[feature] == 1][target]

        left_gini = gini(left_split_target)
        left_weight = len(left_split_target) / (len(left_split_target) + len(right_split_target))

        right_gini = gini(right_split_target)
        right_weight = len(right_split_target) / (len(left_split_target) + len(right_split_target))

        gini_index = left_weight * left_gini + right_weight * right_gini

        gini_indices[feature] = gini_index

        if annotate:
            print(" ", feature, gini_index)
    return gini_indices
print(compute_gini_indices(train_data, one_hot_features, target)['grade_A'])
print(compute_gini_indices(train_data, one_hot_features, target)['grade_B'])
print(compute_gini_indices(train_data, one_hot_features, target)['term_ 36 months'])
0.3009520964964362
0.3056855375882364
0.30055418611740065
5. Selecting the best splitting feature

We have now implemented the three splitting criteria; next we write the function that finds the best feature.
def best_splitting_feature(data, features, target, criterion='gini', annotate=False):
    '''
    Given the splitting criterion and the data, find the best splitting feature.

    Parameters
    ----------
    data: pd.DataFrame, samples with features and the label
    features: list(str), names of the features
    target: str, name of the label
    criterion: str, which criterion to use, one of: 'information_gain', 'gain_ratio', 'gini'
    annotate: boolean, default False, whether to print annotations

    Returns
    ----------
    best_feature: str, name of the best splitting feature
    '''
    if criterion == 'information_gain':
        if annotate:
            print('using information gain')
        information_gains = compute_information_gains(data, features, target, annotate)
        # the larger the information gain, the better the split
        best_feature = max(information_gains.items(), key=lambda x: x[1])[0]
        return best_feature
    elif criterion == 'gain_ratio':
        if annotate:
            print('using information gain ratio')
        gain_ratios = compute_information_gain_ratios(data, features, target, annotate)
        best_feature = max(gain_ratios.items(), key=lambda x: x[1])[0]
        return best_feature
    elif criterion == 'gini':
        if annotate:
            print('using gini')
        gini_indices = compute_gini_indices(data, features, target, annotate)
        # the smaller the Gini index, the better the split
        best_feature = min(gini_indices.items(), key=lambda x: x[1])[0]
        return best_feature
    else:
        raise Exception("Invalid criterion!", criterion)
6. Checking whether the samples in a node all belong to one class

def intermediate_node_num_mistakes(labels_in_node):
    '''
    Count how many samples belong to the minority class in a node.
    For example, the input [1, 1, -1, -1, 1] returns 2.

    Parameter
    ----------
    labels_in_node: np.ndarray, pd.Series

    Returns
    ----------
    int: the count
    '''
    if len(labels_in_node) == 0:
        return 0
    num_of_one = len(labels_in_node[labels_in_node == 1])
    num_of_minus_one = len(labels_in_node[labels_in_node == -1])
    return num_of_one if num_of_minus_one > num_of_one else num_of_minus_one
print(intermediate_node_num_mistakes(np.array([1, 1, -1, -1, -1])))
print(intermediate_node_num_mistakes(np.array([])))
print(intermediate_node_num_mistakes(np.array([1])))
2
0
0
7. Creating a leaf node

def create_leaf(target_values):
    '''
    Determine the prediction of this leaf node and store the leaf information in a dict.

    Parameter:
    ----------
    target_values: pd.Series, labels of the samples in this leaf node

    Returns:
    ----------
    leaf: dict, representing a leaf node,
          leaf['splitting_feature'], None, a leaf node needs no splitting feature
          leaf['left'], None, a leaf node has no left subtree
          leaf['right'], None, a leaf node has no right subtree
          leaf['is_leaf'], True, whether this is a leaf node
          leaf['prediction'], int, the prediction of this leaf node
    '''
    leaf = {'splitting_feature': None,
            'left': None,
            'right': None,
            'is_leaf': True}
    num_ones = len(target_values[target_values == +1])
    num_minus_ones = len(target_values[target_values == -1])
    # the leaf predicts the majority class of its samples
    if num_ones > num_minus_ones:
        leaf['prediction'] = 1
    else:
        leaf['prediction'] = -1
    return leaf
8. Building the decision tree recursively

The three stopping conditions of the recursion:
If all samples in a node have the same label, the node does not need to be split further and directly becomes a leaf node.
If all features have already been used higher up in the tree, so no features remain to split the samples in the current node, the node directly becomes a leaf node.
If the depth of the current node has reached the maximum tree depth we set, it directly becomes a leaf node.
def decision_tree_create(data, features, target, criterion='gini', current_depth=0, max_depth=10, annotate=False):
    '''
    Parameter:
    ----------
    data: pd.DataFrame, the data
    features: iterable, an iterable of feature names, e.g. a list
    target: str, name of the label
    criterion: str, splitting criterion, one of: 'information_gain', 'gain_ratio', 'gini'
    current_depth: int, current depth, tracked during the recursion
    max_depth: int, the maximum tree depth we set; the recursion stops once it is reached

    Returns:
    ----------
    dict,
        dict['is_leaf']           : False, this node is not a leaf
        dict['prediction']        : None, a non-leaf node has no prediction
        dict['splitting_feature'] : splitting_feature, the feature this node splits on
        dict['left']              : dict
        dict['right']             : dict
    '''
    if criterion not in ['information_gain', 'gain_ratio', 'gini']:
        raise Exception("Invalid criterion!", criterion)

    remaining_features = features[:]
    target_values = data[target]

    print("-" * 50)
    print("Subtree, depth = %s (%s data points)." % (current_depth, len(target_values)))

    # Stopping condition 1: all samples in the node have the same label
    if intermediate_node_num_mistakes(target_values) == 0:
        print("Stopping condition 1 reached.")
        return create_leaf(target_values)

    # Stopping condition 2: no remaining features to split on
    if len(remaining_features) == 0:
        print("Stopping condition 2 reached.")
        return create_leaf(target_values)

    # Stopping condition 3: maximum depth reached
    if current_depth >= max_depth:
        print("Reached maximum depth. Stopping for now.")
        return create_leaf(target_values)

    splitting_feature = best_splitting_feature(data, features, target, criterion, annotate)

    left_split = data[data[splitting_feature] == 0]
    right_split = data[data[splitting_feature] == 1]

    remaining_features.remove(splitting_feature)

    print("Split on feature %s. (%s, %s)" % (splitting_feature, len(left_split), len(right_split)))

    # If the split puts every sample on one side, make a leaf directly
    if len(left_split) == len(data):
        print("Creating leaf node.")
        return create_leaf(left_split[target])
    if len(right_split) == len(data):
        print("Creating right node.")
        return create_leaf(right_split[target])

    left_tree = decision_tree_create(left_split, remaining_features, target, criterion,
                                     current_depth + 1, max_depth, annotate)
    right_tree = decision_tree_create(right_split, remaining_features, target, criterion,
                                      current_depth + 1, max_depth, annotate)

    return {'is_leaf': False,
            'prediction': None,
            'splitting_feature': splitting_feature,
            'left': left_tree,
            'right': right_tree}
my_decision_tree = decision_tree_create(train_data, one_hot_features, target, 'gini', max_depth=6, annotate=False)
--------------------------------------------------
Subtree, depth = 0 (73564 data points).
Split on feature term_ 36 months. (14831, 58733)
--------------------------------------------------
Subtree, depth = 1 (14831 data points).
Split on feature grade_F. (13003, 1828)
--------------------------------------------------
Subtree, depth = 2 (13003 data points).
Split on feature grade_E. (9818, 3185)
--------------------------------------------------
Subtree, depth = 3 (9818 data points).
Split on feature home_ownership_RENT. (6796, 3022)
--------------------------------------------------
Subtree, depth = 4 (6796 data points).
Split on feature grade_G. (6507, 289)
--------------------------------------------------
Subtree, depth = 5 (6507 data points).
Split on feature grade_D. (4368, 2139)
--------------------------------------------------
Subtree, depth = 6 (4368 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 6 (2139 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 5 (289 data points).
Split on feature home_ownership_OWN. (249, 40)
--------------------------------------------------
Subtree, depth = 6 (249 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 6 (40 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 4 (3022 data points).
Split on feature grade_G. (2827, 195)
--------------------------------------------------
Subtree, depth = 5 (2827 data points).
Split on feature grade_D. (1651, 1176)
--------------------------------------------------
Subtree, depth = 6 (1651 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 6 (1176 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 5 (195 data points).
Split on feature emp_length_2 years. (176, 19)
--------------------------------------------------
Subtree, depth = 6 (176 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 6 (19 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 3 (3185 data points).
Split on feature home_ownership_RENT. (1980, 1205)
--------------------------------------------------
Subtree, depth = 4 (1980 data points).
Split on feature emp_length_3 years. (1828, 152)
--------------------------------------------------
Subtree, depth = 5 (1828 data points).
Split on feature emp_length_10+ years. (1057, 771)
--------------------------------------------------
Subtree, depth = 6 (1057 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 6 (771 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 5 (152 data points).
Split on feature home_ownership_OTHER. (151, 1)
--------------------------------------------------
Subtree, depth = 6 (151 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 6 (1 data points).
Stopping condition 1 reached.
--------------------------------------------------
Subtree, depth = 4 (1205 data points).
Split on feature emp_length_1 year. (1124, 81)
--------------------------------------------------
Subtree, depth = 5 (1124 data points).
Split on feature emp_length_8 years. (1073, 51)
--------------------------------------------------
Subtree, depth = 6 (1073 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 6 (51 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 5 (81 data points).
Split on feature grade_A. (81, 0)
Creating leaf node.
--------------------------------------------------
Subtree, depth = 2 (1828 data points).
Split on feature home_ownership_RENT. (1030, 798)
--------------------------------------------------
Subtree, depth = 3 (1030 data points).
Split on feature emp_length_3 years. (957, 73)
--------------------------------------------------
Subtree, depth = 4 (957 data points).
Split on feature emp_length_2 years. (886, 71)
--------------------------------------------------
Subtree, depth = 5 (886 data points).
Split on feature home_ownership_OTHER. (884, 2)
--------------------------------------------------
Subtree, depth = 6 (884 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 6 (2 data points).
Stopping condition 1 reached.
--------------------------------------------------
Subtree, depth = 5 (71 data points).
Split on feature home_ownership_MORTGAGE. (12, 59)
--------------------------------------------------
Subtree, depth = 6 (12 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 6 (59 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 4 (73 data points).
Split on feature home_ownership_MORTGAGE. (13, 60)
--------------------------------------------------
Subtree, depth = 5 (13 data points).
Split on feature grade_A. (13, 0)
Creating leaf node.
--------------------------------------------------
Subtree, depth = 5 (60 data points).
Split on feature grade_A. (60, 0)
Creating leaf node.
--------------------------------------------------
Subtree, depth = 3 (798 data points).
Split on feature emp_length_7 years. (740, 58)
--------------------------------------------------
Subtree, depth = 4 (740 data points).
Split on feature emp_length_3 years. (673, 67)
--------------------------------------------------
Subtree, depth = 5 (673 data points).
Split on feature emp_length_9 years. (646, 27)
--------------------------------------------------
Subtree, depth = 6 (646 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 6 (27 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 5 (67 data points).
Split on feature grade_A. (67, 0)
Creating leaf node.
--------------------------------------------------
Subtree, depth = 4 (58 data points).
Split on feature grade_A. (58, 0)
Creating leaf node.
--------------------------------------------------
Subtree, depth = 1 (58733 data points).
Split on feature grade_A. (45632, 13101)
--------------------------------------------------
Subtree, depth = 2 (45632 data points).
Split on feature grade_B. (25130, 20502)
--------------------------------------------------
Subtree, depth = 3 (25130 data points).
Split on feature grade_C. (11066, 14064)
--------------------------------------------------
Subtree, depth = 4 (11066 data points).
Split on feature home_ownership_MORTGAGE. (6987, 4079)
--------------------------------------------------
Subtree, depth = 5 (6987 data points).
Split on feature grade_F. (6650, 337)
--------------------------------------------------
Subtree, depth = 6 (6650 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 6 (337 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 5 (4079 data points).
Split on feature grade_G. (4003, 76)
--------------------------------------------------
Subtree, depth = 6 (4003 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 6 (76 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 4 (14064 data points).
Split on feature home_ownership_MORTGAGE. (8209, 5855)
--------------------------------------------------
Subtree, depth = 5 (8209 data points).
Split on feature emp_length_2 years. (7209, 1000)
--------------------------------------------------
Subtree, depth = 6 (7209 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 6 (1000 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 5 (5855 data points).
Split on feature emp_length_10+ years. (3802, 2053)
--------------------------------------------------
Subtree, depth = 6 (3802 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 6 (2053 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 3 (20502 data points).
Split on feature home_ownership_MORTGAGE. (10775, 9727)
--------------------------------------------------
Subtree, depth = 4 (10775 data points).
Split on feature home_ownership_OTHER. (10741, 34)
--------------------------------------------------
Subtree, depth = 5 (10741 data points).
Split on feature emp_length_1 year. (9754, 987)
--------------------------------------------------
Subtree, depth = 6 (9754 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 6 (987 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 5 (34 data points).
Split on feature emp_length_10+ years. (27, 7)
--------------------------------------------------
Subtree, depth = 6 (27 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 6 (7 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 4 (9727 data points).
Split on feature emp_length_< 1 year. (9186, 541)
--------------------------------------------------
Subtree, depth = 5 (9186 data points).
Split on feature emp_length_9 years. (8807, 379)
--------------------------------------------------
Subtree, depth = 6 (8807 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 6 (379 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 5 (541 data points).
Split on feature grade_C. (541, 0)
Creating leaf node.
--------------------------------------------------
Subtree, depth = 2 (13101 data points).
Split on feature home_ownership_MORTGAGE. (5830, 7271)
--------------------------------------------------
Subtree, depth = 3 (5830 data points).
Split on feature emp_length_3 years. (5283, 547)
--------------------------------------------------
Subtree, depth = 4 (5283 data points).
Split on feature emp_length_1 year. (4705, 578)
--------------------------------------------------
Subtree, depth = 5 (4705 data points).
Split on feature emp_length_7 years. (4467, 238)
--------------------------------------------------
Subtree, depth = 6 (4467 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 6 (238 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 5 (578 data points).
Split on feature home_ownership_OTHER. (576, 2)
--------------------------------------------------
Subtree, depth = 6 (576 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 6 (2 data points).
Stopping condition 1 reached.
--------------------------------------------------
Subtree, depth = 4 (547 data points).
Split on feature home_ownership_OTHER. (545, 2)
--------------------------------------------------
Subtree, depth = 5 (545 data points).
Split on feature home_ownership_OWN. (468, 77)
--------------------------------------------------
Subtree, depth = 6 (468 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 6 (77 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 5 (2 data points).
Split on feature grade_B. (2, 0)
Creating leaf node.
--------------------------------------------------
Subtree, depth = 3 (7271 data points).
Split on feature emp_length_2 years. (6702, 569)
--------------------------------------------------
Subtree, depth = 4 (6702 data points).
Split on feature emp_length_4 years. (6234, 468)
--------------------------------------------------
Subtree, depth = 5 (6234 data points).
Split on feature emp_length_3 years. (5689, 545)
--------------------------------------------------
Subtree, depth = 6 (5689 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 6 (545 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 5 (468 data points).
Split on feature grade_B. (468, 0)
Creating leaf node.
--------------------------------------------------
Subtree, depth = 4 (569 data points).
Split on feature grade_B. (569, 0)
Creating leaf node.
The model is now trained.
9. Prediction

Next we implement the prediction function.
def classify(tree, x, annotate=False):
    '''
    Predict recursively; only one sample can be predicted at a time.

    Parameters
    ----------
    tree: dict, the model
    x: pd.Series, the sample to predict
    annotate: boolean, whether to print annotations

    Returns
    ----------
    the predicted label
    '''
    if tree['is_leaf']:
        if annotate:
            print("At leaf, predicting %s" % tree['prediction'])
        return tree['prediction']
    else:
        split_feature_value = x[tree['splitting_feature']]
        if annotate:
            print("Split on %s = %s" % (tree['splitting_feature'], split_feature_value))
        if split_feature_value == 0:
            return classify(tree['left'], x, annotate)
        else:
            return classify(tree['right'], x, annotate)
Let's test it on the first sample of the test set.
test_sample = test_data.iloc[0]
print(test_sample)
safe_loans 1
grade_A 0
grade_B 0
grade_C 0
grade_D 0
grade_E 1
grade_F 0
grade_G 0
term_ 36 months 1
term_ 60 months 0
home_ownership_MORTGAGE 1
home_ownership_OTHER 0
home_ownership_OWN 0
home_ownership_RENT 0
emp_length_1 year 0
emp_length_10+ years 0
emp_length_2 years 1
emp_length_3 years 0
emp_length_4 years 0
emp_length_5 years 0
emp_length_6 years 0
emp_length_7 years 0
emp_length_8 years 0
emp_length_9 years 0
emp_length_< 1 year 0
Name: 37225, dtype: int64
print('True class: %s ' % (test_sample['safe_loans']))
print('Predicted class: %s ' % classify(my_decision_tree, test_sample))
True class: 1
Predicted class: 1
Print the decision path taken by the tree:
classify(my_decision_tree, test_sample, annotate=True)
Split on term_ 36 months = 1
Split on grade_A = 0
Split on grade_B = 0
Split on grade_C = 0
Split on home_ownership_MORTGAGE = 1
Split on grade_G = 0
At leaf, predicting 1
1
10. Evaluating the model on the test set

from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
First we write a batch prediction function: it takes a pd.DataFrame like the whole test set and returns an np.ndarray holding the model's predictions.
def predict(tree, data):
    '''
    Iterate over the rows of data, predict each sample, store the values, and return an np.ndarray.

    Parameter
    ----------
    tree: dict, the model
    data: pd.DataFrame, the data

    Returns
    ----------
    predictions: np.ndarray, the model's predictions for these samples
    '''
    predictions = np.zeros(len(data))
    for i in range(len(data)):
        predictions[i] = classify(tree, data.iloc[i])
    return predictions
11. Compute the four metrics of the model for each splitting criterion and fill them in the table below. The maximum depth of the tree is 6.

Double-click here to edit.
| Splitting criterion | Accuracy | Precision | Recall | F1 |
| --- | --- | --- | --- | --- |
| Information gain | 0.0 | 0.0 | 0.0 | 0.0 |
| Information gain ratio | 0.0 | 0.0 | 0.0 | 0.0 |
| Gini index | 0.0 | 0.0 | 0.0 | 0.0 |
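A sketch of how these cells could be filled in, assuming the functions defined above are in scope (the loop itself is not part of the original notebook; the problem statement asks for max_depth 10 while this table uses 6, so adjust max_depth as needed):

for criterion in ['information_gain', 'gain_ratio', 'gini']:
    tree = decision_tree_create(train_data, one_hot_features, target, criterion, max_depth=6)
    for name, dataset in [('train', train_data), ('test', test_data)]:
        y_true = dataset[target].values
        y_pred = predict(tree, dataset)
        # accuracy, precision, recall and F1 for this criterion on this split
        print(criterion, name,
              accuracy_score(y_true, y_pred),
              precision_score(y_true, y_pred),
              recall_score(y_true, y_pred),
              f1_score(y_true, y_pred))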
12. Extension: drawing the decision tree with Echarts

We can draw the trained decision tree with echarts via the pyecharts library. pyecharts integrates seamlessly with jupyter notebook and renders charts directly in the notebook. Note: pyecharts does not yet support jupyter lab.
from pyecharts import Tree
The tree chart in echarts requires data in the following form:

[
    {
        value: 1212,                 # value of the node
        # child nodes
        children: [
            {
                # value of the child node
                value: 2323,
                # name of the child node
                name: 'description of this node',
                children: [...],
            },
            {
                value: 4545,
                name: 'description of this node',
                children: [
                    {
                        value: 5656,
                        name: 'description of this node',
                        children: [...]
                    },
                    ...
                ]
            }
        ]
    },
    ...
]
Documentation for the tree chart in pyecharts: pyecharts Tree
This is similar to the tree structure we trained; each node just needs a "name" attribute for the node's name, a "value" for its value, and a "children" list containing dicts of the same form. We can write a recursive function to generate this data.
def generate_echarts_data(tree):
    value = dict()
    if tree['is_leaf'] == True:
        value['value'] = tree['prediction']
        value['name'] = 'label: %s' % (tree['prediction'])
        return value
    value['name'] = tree['splitting_feature']
    value['children'] = [generate_echarts_data(tree['left']), generate_echarts_data(tree['right'])]
    return value
data = generate_echarts_data(my_decision_tree)
Use the code below to draw it; after rendering, the tree's nodes are clickable, and clicking a node expands its subtree.
tree = Tree(width=800, height=400)
tree.add("", [data],
         tree_collapse_interval=5,
         tree_top="15%",
         tree_right="20%",
         tree_symbol='rect',
         tree_symbol_size=20,
         )
tree.render()
tree
Problem 4: Pre-pruning. Lab content:
Implement pre-pruning with the information gain ratio as the splitting criterion.
We will build a binary decision tree with pre-pruning, using the information gain ratio as the splitting criterion. The data is the same as in Problem 3, but pruning requires a validation set, so the data split differs from Problem 3.
1. Loading the data

import pandas as pd
import numpy as np
loans = pd.read_csv('data/lendingclub/lending-club-data.csv', low_memory=False)
loans['safe_loans'] = loans['bad_loans'].apply(lambda x: +1 if x == 0 else -1)
del loans['bad_loans']
features = ['grade', 'term', 'home_ownership', 'emp_length']
target = 'safe_loans'
loans = loans[features + [target]]
2. Splitting into training and test sets

from sklearn.utils import shuffle
loans = shuffle(loans, random_state=34)
We use 60% of the data as the training set, 20% as the validation set, and 20% as the test set.
split_line1 = int(len(loans) * 0.6)
split_line2 = int(len(loans) * 0.8)
train_data = loans.iloc[:split_line1]
validation_data = loans.iloc[split_line1: split_line2]
test_data = loans.iloc[split_line2:]
3. Feature preprocessing

def one_hot_encoding(data, features_categorical):
    '''
    Parameter
    ----------
    data: pd.DataFrame
    features_categorical: list(str)
    '''
    for cat in features_categorical:
        one_encoding = pd.get_dummies(data[cat], prefix=cat)
        data = pd.concat([data, one_encoding], axis=1)
        del data[cat]
    return data
train_data = one_hot_encoding(train_data, features)
one_hot_features = train_data.columns.tolist()
one_hot_features.remove(target)
validation_tmp = one_hot_encoding(validation_data, features)
validation_data = pd.DataFrame(columns=train_data.columns)
for feature in train_data.columns:
    if feature in validation_tmp:
        validation_data[feature] = validation_tmp[feature].copy()
    else:
        validation_data[feature] = np.zeros(len(validation_tmp), dtype='uint8')
test_data_tmp = one_hot_encoding(test_data, features)
test_data = pd.DataFrame(columns=train_data.columns)
for feature in train_data.columns:
    if feature in test_data_tmp.columns:
        test_data[feature] = test_data_tmp[feature].copy()
    else:
        test_data[feature] = np.zeros(test_data_tmp.shape[0], dtype='uint8')
Print the shapes of the three datasets:
print(train_data.shape, validation_data.shape, test_data.shape)
(73564, 25) (24521, 25) (24522, 25)
4. Implementing the information gain ratio

Information entropy:

$\mathrm{Ent}(D) = -\sum_{k} p_k \log_2 p_k$

Information gain:

$\mathrm{Gain}(D, a) = \mathrm{Ent}(D) - \sum_{v} \frac{|D^v|}{|D|}\,\mathrm{Ent}(D^v)$

Information gain ratio:

$\mathrm{Gain\_ratio}(D, a) = \dfrac{\mathrm{Gain}(D, a)}{\mathrm{IV}(a)}$

where

$\mathrm{IV}(a) = -\sum_{v} \dfrac{|D^v|}{|D|}\log_2 \dfrac{|D^v|}{|D|}$

When computing entropy we adopt the convention that if $p = 0$, then $p \log_2 p = 0$.
def information_entropy(labels_in_node):
    '''
    Compute the information entropy of the current node.

    Parameter
    ----------
    labels_in_node: np.ndarray, e.g. [-1, 1, -1, 1, 1]

    Returns
    ----------
    float: information entropy
    '''
    num_of_samples = labels_in_node.shape[0]
    if num_of_samples == 0:
        return 0
    num_of_positive = len(labels_in_node[labels_in_node == 1])
    num_of_negative = len(labels_in_node[labels_in_node == -1])
    prob_positive = num_of_positive / num_of_samples
    prob_negative = num_of_negative / num_of_samples
    if prob_positive == 0:
        positive_part = 0
    else:
        positive_part = prob_positive * np.log2(prob_positive)
    if prob_negative == 0:
        negative_part = 0
    else:
        negative_part = prob_negative * np.log2(prob_negative)
    return -(positive_part + negative_part)
example_labels = np.array([-1, -1, 1, 1, 1])
print(information_entropy(example_labels))

example_labels = np.array([-1, -1, 1, 1, 1, 1, 1])
print(information_entropy(example_labels))

example_labels = np.array([-1, -1, -1, -1, -1, 1, 1])
print(information_entropy(example_labels))

example_labels = np.array([-1] * 9 + [1] * 8)
print(information_entropy(example_labels))

example_labels = np.array([1] * 8)
print(information_entropy(example_labels))

example_labels = np.array([])
print(information_entropy(example_labels))
0.970950594455
0.863120568567
0.863120568567
0.997502546369
-0.0
0
def compute_information_gain_ratios(data, features, target, annotate=False):
    '''
    Compute and store the information gain ratio of every feature.

    Parameter
    ----------
    data: pd.DataFrame, samples with features and the label
    features: list(str), names of the features
    target: str, name of the label
    annotate: boolean, default False, whether to print annotations

    Returns
    ----------
    gain_ratios: dict, key: str, feature name
                       value: float, information gain ratio
    '''
    gain_ratios = dict()
    for feature in features:
        left_split_target = data[data[feature] == 0][target]
        right_split_target = data[data[feature] == 1][target]

        left_entropy = information_entropy(left_split_target)
        left_weight = len(left_split_target) / (len(left_split_target) + len(right_split_target))

        right_entropy = information_entropy(right_split_target)
        right_weight = len(right_split_target) / (len(left_split_target) + len(right_split_target))

        current_entropy = information_entropy(data[target])
        gain = current_entropy - (left_weight * left_entropy + right_weight * right_entropy)

        if left_weight == 0:
            left_IV = 0
        else:
            left_IV = left_weight * np.log2(left_weight)

        if right_weight == 0:
            right_IV = 0
        else:
            right_IV = right_weight * np.log2(right_weight)

        IV = -(left_IV + right_IV)

        # add a tiny epsilon so a zero IV does not cause division by zero
        gain_ratio = gain / (IV + np.finfo(np.longdouble).eps)

        gain_ratios[feature] = gain_ratio

        if annotate:
            print(" ", feature, gain_ratio)
    return gain_ratios
print(compute_information_gain_ratios(train_data, one_hot_features, target)['grade_A'])
print(compute_information_gain_ratios(train_data, one_hot_features, target)['grade_B'])
print(compute_information_gain_ratios(train_data, one_hot_features, target)['term_ 60 months'])
0.025734780668
0.00417549506943
0.0197093627186
5. Selecting the best splitting feature

Here we do not implement the best-feature search for information gain or the Gini index; interested readers can implement them as in the previous problem.
def best_splitting_feature(data, features, target, criterion='gain_ratio', annotate=False):
    '''
    Given the splitting criterion and the data, find the best splitting feature.

    Parameters
    ----------
    data: pd.DataFrame, samples with features and the label
    features: list(str), names of the features
    target: str, name of the label
    criterion: str, which criterion to use, three options: 'information_gain', 'gain_ratio', 'gini'
    annotate: boolean, default False, whether to print annotations

    Returns
    ----------
    best_feature: str, name of the best splitting feature
    '''
    if criterion == 'information_gain':
        if annotate:
            print('using information gain')
        return None
    elif criterion == 'gain_ratio':
        if annotate:
            print('using information gain ratio')
        gain_ratios = compute_information_gain_ratios(data, features, target, annotate)
        best_feature = max(gain_ratios.items(), key=lambda x: x[1])[0]
        return best_feature
    elif criterion == 'gini':
        if annotate:
            print('using gini')
        return None
    else:
        raise Exception("Invalid criterion!", criterion)
6. Checking whether the samples in a node all belong to one class

def intermediate_node_num_mistakes(labels_in_node):
    '''
    Count how many samples belong to the minority class in a node.
    For example, the input [1, 1, -1, -1, 1] returns 2.

    Parameter
    ----------
    labels_in_node: np.ndarray, pd.Series

    Returns
    ----------
    int: the count
    '''
    if len(labels_in_node) == 0:
        return 0
    num_of_one = len(labels_in_node[labels_in_node == 1])
    num_of_minus_one = len(labels_in_node[labels_in_node == -1])
    return num_of_one if num_of_minus_one > num_of_one else num_of_minus_one
7. Creating a leaf node

First write a helper function, majority_class, that returns the majority class among the samples in a node.
def majority_class(labels_in_node):
    '''
    Return the majority class among the samples in a node.
    '''
    if len(labels_in_node) == 0:
        return 0
    num_of_one = len(labels_in_node[labels_in_node == 1])
    num_of_minus_one = len(labels_in_node[labels_in_node == -1])
    return 1 if num_of_minus_one < num_of_one else -1
def create_leaf(target_values):
    '''
    Determine the prediction of this leaf node and store the leaf information in a dict.

    Parameter:
    ----------
    target_values: pd.Series, labels of the samples in this leaf node

    Returns:
    ----------
    leaf: dict, representing a leaf node,
          leaf['splitting_feature'], None, a leaf node needs no splitting feature
          leaf['left'], None, a leaf node has no left subtree
          leaf['right'], None, a leaf node has no right subtree
          leaf['is_leaf'], True, whether this is a leaf node
          leaf['prediction'], int, the prediction of this leaf node
    '''
    leaf = {'splitting_feature': None,
            'left': None,
            'right': None,
            'is_leaf': True}
    num_ones = len(target_values[target_values == +1])
    num_minus_ones = len(target_values[target_values == -1])
    # the leaf predicts the majority class of its samples
    leaf['prediction'] = majority_class(target_values)
    return leaf
8. Building the decision tree recursively

The three stopping conditions of the decision tree:
If all samples in a node have the same label, the node does not need to be split further and directly becomes a leaf node.
If all features have already been used higher up in the tree, so no features remain to split the samples in the current node, the node directly becomes a leaf node.
If the depth of the current node has reached the maximum tree depth we set, it directly becomes a leaf node.
Pre-pruning essentially adds a fourth stopping condition:
If splitting the current node does not improve the model's generalization ability, do not split.
How do we judge whether generalization improves? We need the validation set. Just as the training data is split recursively, while building the tree we also split the validation set recursively and compute the tree's accuracy on it.
Because the tree is built recursively, each check of whether validation accuracy improves only concerns the samples inside the current node: whether or not we split the current node has no effect on the accuracy of its sibling nodes or their descendants. A minimal sketch of just this check is shown below, before the full implementation.
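Here should_pre_prune is a hypothetical helper distilled from the full function below, which inlines this logic inside decision_tree_create:

def should_pre_prune(target_values, validation_values,
                     left_split, right_split,
                     validation_left_split, validation_right_split, target):
    '''Return True if splitting does not improve accuracy on this node's validation samples.'''
    # accuracy without splitting: the node predicts its (training) majority class
    majority = majority_class(target_values)
    acc_no_split = (validation_values == majority).mean() if len(validation_values) > 0 else 0.0

    # accuracy with splitting: each child predicts its own (training) majority class
    left_major = majority_class(left_split[target])
    right_major = majority_class(right_split[target])
    n_vali = len(validation_left_split) + len(validation_right_split)
    correct = ((validation_left_split[target] == left_major).sum()
               + (validation_right_split[target] == right_major).sum())
    acc_split = correct / n_vali if n_vali > 0 else 0.0

    return acc_split < acc_no_split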
def decision_tree_create(training_data, validation_data, features, target, criterion='gain_ratio',
                         pre_pruning=False, current_depth=0, max_depth=10, annotate=False):
    '''
    Parameter:
    ----------
    training_data: pd.DataFrame, the training data
    validation_data: pd.DataFrame, the validation data
    features: iterable, an iterable of feature names
    target: str, name of the label
    criterion: str, splitting criterion
    pre_pruning: boolean, whether to pre-prune
    current_depth: int, current depth
    max_depth: int, maximum depth of the tree

    Returns:
    ----------
    dict,
        dict['is_leaf']           : False, this node is not a leaf
        dict['prediction']        : None, a non-leaf node has no prediction
        dict['splitting_feature'] : splitting_feature, the feature this node splits on
        dict['left']              : dict
        dict['right']             : dict
    '''
    if criterion not in ['information_gain', 'gain_ratio', 'gini']:
        raise Exception("Invalid criterion!", criterion)

    remaining_features = features[:]
    target_values = training_data[target]
    validation_values = validation_data[target]

    print("-" * 50)
    print("Subtree, depth = %s (%s data points)." % (current_depth, len(target_values)))

    # Stopping condition 1: all samples in the node have the same label
    if intermediate_node_num_mistakes(target_values) == 0:
        print("Stopping condition 1 reached.")
        return create_leaf(target_values)

    # Stopping condition 2: no remaining features to split on
    if len(remaining_features) == 0:
        print("Stopping condition 2 reached.")
        return create_leaf(target_values)

    # Stopping condition 3: maximum depth reached
    if current_depth >= max_depth:
        print("Reached maximum depth. Stopping for now.")
        return create_leaf(target_values)

    splitting_feature = best_splitting_feature(training_data, features, target, criterion, annotate)

    left_split = training_data[training_data[splitting_feature] == 0]
    right_split = training_data[training_data[splitting_feature] == 1]

    validation_left_split = validation_data[validation_data[splitting_feature] == 0]
    validation_right_split = validation_data[validation_data[splitting_feature] == 1]

    if pre_pruning:
        # Validation accuracy of this node without splitting: predict the majority class
        true_class = majority_class(target_values)
        acc_without_splitting = len(validation_values[validation_values == true_class]) / (len(validation_values) + np.finfo(np.longdouble).eps)

        # Validation accuracy with splitting: each child predicts its own (training) majority class
        left_true_class = majority_class(left_split[target])
        right_true_class = majority_class(right_split[target])
        vali_left_num_of_majority = len(validation_left_split[validation_left_split[target] == left_true_class])
        vali_right_num_of_majority = len(validation_right_split[validation_right_split[target] == right_true_class])
        acc_with_splitting = (vali_left_num_of_majority + vali_right_num_of_majority) / (len(validation_data) + np.finfo(np.longdouble).eps)

        if annotate == True:
            print('acc before splitting: %.3f' % (acc_without_splitting))
            print('acc after splitting: %.3f' % (acc_with_splitting))

        # Stopping condition 4 (pre-pruning): splitting does not improve validation accuracy
        if acc_with_splitting < acc_without_splitting:
            print('Pre-Pruning')
            return create_leaf(target_values)

    remaining_features.remove(splitting_feature)
    print("Split on feature %s. (%s, %s)" % (splitting_feature, len(left_split), len(right_split)))

    # If the split puts every sample on one side, make a leaf directly
    if len(left_split) == len(training_data):
        print("Creating leaf node.")
        return create_leaf(left_split[target])
    if len(right_split) == len(training_data):
        print("Creating right node.")
        return create_leaf(right_split[target])

    left_tree = decision_tree_create(left_split, validation_left_split, remaining_features, target, criterion,
                                     pre_pruning, current_depth + 1, max_depth, annotate)
    right_tree = decision_tree_create(right_split, validation_right_split, remaining_features, target, criterion,
                                      pre_pruning, current_depth + 1, max_depth, annotate)

    return {'is_leaf': False,
            'prediction': None,
            'splitting_feature': splitting_feature,
            'left': left_tree,
            'right': right_tree}
tree_without_pre_pruning = decision_tree_create(train_data, validation_data, one_hot_features, target, criterion='gain_ratio', pre_pruning=False, current_depth=0, max_depth=6, annotate=False)
--------------------------------------------------
Subtree, depth = 0 (73564 data points).
Split on feature grade_F. (71229, 2335)
--------------------------------------------------
Subtree, depth = 1 (71229 data points).
Split on feature grade_A. (57869, 13360)
--------------------------------------------------
Subtree, depth = 2 (57869 data points).
Split on feature grade_G. (57232, 637)
--------------------------------------------------
Subtree, depth = 3 (57232 data points).
Split on feature grade_E. (51828, 5404)
--------------------------------------------------
Subtree, depth = 4 (51828 data points).
Split on feature grade_D. (40326, 11502)
--------------------------------------------------
Subtree, depth = 5 (40326 data points).
Split on feature term_ 36 months. (5760, 34566)
--------------------------------------------------
Subtree, depth = 6 (5760 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 6 (34566 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 5 (11502 data points).
Split on feature term_ 36 months. (3315, 8187)
--------------------------------------------------
Subtree, depth = 6 (3315 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 6 (8187 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 4 (5404 data points).
Split on feature term_ 36 months. (3185, 2219)
--------------------------------------------------
Subtree, depth = 5 (3185 data points).
Split on feature home_ownership_OTHER. (3184, 1)
--------------------------------------------------
Subtree, depth = 6 (3184 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 6 (1 data points).
Stopping condition 1 reached.
--------------------------------------------------
Subtree, depth = 5 (2219 data points).
Split on feature emp_length_1 year. (2011, 208)
--------------------------------------------------
Subtree, depth = 6 (2011 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 6 (208 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 3 (637 data points).
Split on feature emp_length_3 years. (590, 47)
--------------------------------------------------
Subtree, depth = 4 (590 data points).
Split on feature emp_length_2 years. (541, 49)
--------------------------------------------------
Subtree, depth = 5 (541 data points).
Split on feature home_ownership_OWN. (495, 46)
--------------------------------------------------
Subtree, depth = 6 (495 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 6 (46 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 5 (49 data points).
Split on feature term_ 36 months. (32, 17)
--------------------------------------------------
Subtree, depth = 6 (32 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 6 (17 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 4 (47 data points).
Split on feature home_ownership_OTHER. (46, 1)
--------------------------------------------------
Subtree, depth = 5 (46 data points).
Split on feature home_ownership_OWN. (44, 2)
--------------------------------------------------
Subtree, depth = 6 (44 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 6 (2 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 5 (1 data points).
Stopping condition 1 reached.
--------------------------------------------------
Subtree, depth = 2 (13360 data points).
Split on feature term_ 36 months. (259, 13101)
--------------------------------------------------
Subtree, depth = 3 (259 data points).
Split on feature emp_length_9 years. (252, 7)
--------------------------------------------------
Subtree, depth = 4 (252 data points).
Split on feature home_ownership_RENT. (202, 50)
--------------------------------------------------
Subtree, depth = 5 (202 data points).
Split on feature emp_length_8 years. (192, 10)
--------------------------------------------------
Subtree, depth = 6 (192 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 6 (10 data points).
Stopping condition 1 reached.
--------------------------------------------------
Subtree, depth = 5 (50 data points).
Split on feature emp_length_4 years. (48, 2)
--------------------------------------------------
Subtree, depth = 6 (48 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 6 (2 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 4 (7 data points).
Stopping condition 1 reached.
--------------------------------------------------
Subtree, depth = 3 (13101 data points).
Split on feature home_ownership_MORTGAGE. (5830, 7271)
--------------------------------------------------
Subtree, depth = 4 (5830 data points).
Split on feature emp_length_7 years. (5592, 238)
--------------------------------------------------
Subtree, depth = 5 (5592 data points).
Split on feature emp_length_3 years. (5045, 547)
--------------------------------------------------
Subtree, depth = 6 (5045 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 6 (547 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 5 (238 data points).
Split on feature home_ownership_OWN. (184, 54)
--------------------------------------------------
Subtree, depth = 6 (184 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 6 (54 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 4 (7271 data points).
Split on feature emp_length_2 years. (6702, 569)
--------------------------------------------------
Subtree, depth = 5 (6702 data points).
Split on feature emp_length_4 years. (6234, 468)
--------------------------------------------------
Subtree, depth = 6 (6234 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 6 (468 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 5 (569 data points).
Split on feature grade_B. (569, 0)
Creating leaf node.
--------------------------------------------------
Subtree, depth = 1 (2335 data points).
Split on feature emp_length_7 years. (2197, 138)
--------------------------------------------------
Subtree, depth = 2 (2197 data points).
Split on feature term_ 36 months. (1719, 478)
--------------------------------------------------
Subtree, depth = 3 (1719 data points).
Split on feature home_ownership_OTHER. (1717, 2)
--------------------------------------------------
Subtree, depth = 4 (1717 data points).
Split on feature emp_length_3 years. (1577, 140)
--------------------------------------------------
Subtree, depth = 5 (1577 data points).
Split on feature home_ownership_RENT. (904, 673)
--------------------------------------------------
Subtree, depth = 6 (904 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 6 (673 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 5 (140 data points).
Split on feature home_ownership_RENT. (73, 67)
--------------------------------------------------
Subtree, depth = 6 (73 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 6 (67 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 4 (2 data points).
Stopping condition 1 reached.
--------------------------------------------------
Subtree, depth = 3 (478 data points).
Split on feature emp_length_8 years. (460, 18)
--------------------------------------------------
Subtree, depth = 4 (460 data points).
Split on feature emp_length_4 years. (433, 27)
--------------------------------------------------
Subtree, depth = 5 (433 data points).
Split on feature home_ownership_MORTGAGE. (287, 146)
--------------------------------------------------
Subtree, depth = 6 (287 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 6 (146 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 5 (27 data points).
Split on feature home_ownership_OWN. (25, 2)
--------------------------------------------------
Subtree, depth = 6 (25 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 6 (2 data points).
Stopping condition 1 reached.
--------------------------------------------------
Subtree, depth = 4 (18 data points).
Split on feature home_ownership_OWN. (17, 1)
--------------------------------------------------
Subtree, depth = 5 (17 data points).
Split on feature home_ownership_MORTGAGE. (11, 6)
--------------------------------------------------
Subtree, depth = 6 (11 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 6 (6 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 5 (1 data points).
Stopping condition 1 reached.
--------------------------------------------------
Subtree, depth = 2 (138 data points).
Split on feature term_ 36 months. (109, 29)
--------------------------------------------------
Subtree, depth = 3 (109 data points).
Split on feature home_ownership_RENT. (51, 58)
--------------------------------------------------
Subtree, depth = 4 (51 data points).
Split on feature home_ownership_MORTGAGE. (8, 43)
--------------------------------------------------
Subtree, depth = 5 (8 data points).
Split on feature grade_A. (8, 0)
Creating leaf node.
--------------------------------------------------
Subtree, depth = 5 (43 data points).
Split on feature grade_A. (43, 0)
Creating leaf node.
--------------------------------------------------
Subtree, depth = 4 (58 data points).
Split on feature grade_A. (58, 0)
Creating leaf node.
--------------------------------------------------
Subtree, depth = 3 (29 data points).
Split on feature home_ownership_OWN. (25, 4)
--------------------------------------------------
Subtree, depth = 4 (25 data points).
Split on feature home_ownership_MORTGAGE. (13, 12)
--------------------------------------------------
Subtree, depth = 5 (13 data points).
Split on feature grade_A. (13, 0)
Creating leaf node.
--------------------------------------------------
Subtree, depth = 5 (12 data points).
Split on feature grade_A. (12, 0)
Creating leaf node.
--------------------------------------------------
Subtree, depth = 4 (4 data points).
Split on feature grade_A. (4, 0)
Creating leaf node.
Now train a tree with the same settings (gain ratio criterion, maximum depth 6) but with pre-pruning enabled:

```python
tree_with_pre_pruning = decision_tree_create(train_data, validation_data, one_hot_features, target,
                                             criterion="gain_ratio", pre_pruning=True,
                                             current_depth=0, max_depth=6, annotate=False)
```
--------------------------------------------------
Subtree, depth = 0 (73564 data points).
Split on feature grade_F. (71229, 2335)
--------------------------------------------------
Subtree, depth = 1 (71229 data points).
Split on feature grade_A. (57869, 13360)
--------------------------------------------------
Subtree, depth = 2 (57869 data points).
Split on feature grade_G. (57232, 637)
--------------------------------------------------
Subtree, depth = 3 (57232 data points).
Split on feature grade_E. (51828, 5404)
--------------------------------------------------
Subtree, depth = 4 (51828 data points).
Split on feature grade_D. (40326, 11502)
--------------------------------------------------
Subtree, depth = 5 (40326 data points).
Split on feature term_ 36 months. (5760, 34566)
--------------------------------------------------
Subtree, depth = 6 (5760 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 6 (34566 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 5 (11502 data points).
Split on feature term_ 36 months. (3315, 8187)
--------------------------------------------------
Subtree, depth = 6 (3315 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 6 (8187 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 4 (5404 data points).
Split on feature term_ 36 months. (3185, 2219)
--------------------------------------------------
Subtree, depth = 5 (3185 data points).
Split on feature home_ownership_OTHER. (3184, 1)
--------------------------------------------------
Subtree, depth = 6 (3184 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 6 (1 data points).
Stopping condition 1 reached.
--------------------------------------------------
Subtree, depth = 5 (2219 data points).
Split on feature emp_length_1 year. (2011, 208)
--------------------------------------------------
Subtree, depth = 6 (2011 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 6 (208 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 3 (637 data points).
Split on feature emp_length_3 years. (590, 47)
--------------------------------------------------
Subtree, depth = 4 (590 data points).
Split on feature emp_length_2 years. (541, 49)
--------------------------------------------------
Subtree, depth = 5 (541 data points).
Pre-Pruning
--------------------------------------------------
Subtree, depth = 5 (49 data points).
Split on feature term_ 36 months. (32, 17)
--------------------------------------------------
Subtree, depth = 6 (32 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 6 (17 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 4 (47 data points).
Split on feature home_ownership_OTHER. (46, 1)
--------------------------------------------------
Subtree, depth = 5 (46 data points).
Split on feature home_ownership_OWN. (44, 2)
--------------------------------------------------
Subtree, depth = 6 (44 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 6 (2 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 5 (1 data points).
Stopping condition 1 reached.
--------------------------------------------------
Subtree, depth = 2 (13360 data points).
Split on feature term_ 36 months. (259, 13101)
--------------------------------------------------
Subtree, depth = 3 (259 data points).
Split on feature emp_length_9 years. (252, 7)
--------------------------------------------------
Subtree, depth = 4 (252 data points).
Split on feature home_ownership_RENT. (202, 50)
--------------------------------------------------
Subtree, depth = 5 (202 data points).
Split on feature emp_length_8 years. (192, 10)
--------------------------------------------------
Subtree, depth = 6 (192 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 6 (10 data points).
Stopping condition 1 reached.
--------------------------------------------------
Subtree, depth = 5 (50 data points).
Split on feature emp_length_4 years. (48, 2)
--------------------------------------------------
Subtree, depth = 6 (48 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 6 (2 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 4 (7 data points).
Stopping condition 1 reached.
--------------------------------------------------
Subtree, depth = 3 (13101 data points).
Split on feature home_ownership_MORTGAGE. (5830, 7271)
--------------------------------------------------
Subtree, depth = 4 (5830 data points).
Split on feature emp_length_7 years. (5592, 238)
--------------------------------------------------
Subtree, depth = 5 (5592 data points).
Split on feature emp_length_3 years. (5045, 547)
--------------------------------------------------
Subtree, depth = 6 (5045 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 6 (547 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 5 (238 data points).
Split on feature home_ownership_OWN. (184, 54)
--------------------------------------------------
Subtree, depth = 6 (184 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 6 (54 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 4 (7271 data points).
Split on feature emp_length_2 years. (6702, 569)
--------------------------------------------------
Subtree, depth = 5 (6702 data points).
Split on feature emp_length_4 years. (6234, 468)
--------------------------------------------------
Subtree, depth = 6 (6234 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 6 (468 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 5 (569 data points).
Split on feature grade_B. (569, 0)
Creating leaf node.
--------------------------------------------------
Subtree, depth = 1 (2335 data points).
Split on feature emp_length_7 years. (2197, 138)
--------------------------------------------------
Subtree, depth = 2 (2197 data points).
Split on feature term_ 36 months. (1719, 478)
--------------------------------------------------
Subtree, depth = 3 (1719 data points).
Split on feature home_ownership_OTHER. (1717, 2)
--------------------------------------------------
Subtree, depth = 4 (1717 data points).
Split on feature emp_length_3 years. (1577, 140)
--------------------------------------------------
Subtree, depth = 5 (1577 data points).
Split on feature home_ownership_RENT. (904, 673)
--------------------------------------------------
Subtree, depth = 6 (904 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 6 (673 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 5 (140 data points).
Split on feature home_ownership_RENT. (73, 67)
--------------------------------------------------
Subtree, depth = 6 (73 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 6 (67 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 4 (2 data points).
Stopping condition 1 reached.
--------------------------------------------------
Subtree, depth = 3 (478 data points).
Split on feature emp_length_8 years. (460, 18)
--------------------------------------------------
Subtree, depth = 4 (460 data points).
Pre-Pruning
--------------------------------------------------
Subtree, depth = 4 (18 data points).
Split on feature home_ownership_OWN. (17, 1)
--------------------------------------------------
Subtree, depth = 5 (17 data points).
Split on feature home_ownership_MORTGAGE. (11, 6)
--------------------------------------------------
Subtree, depth = 6 (11 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 6 (6 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 5 (1 data points).
Stopping condition 1 reached.
--------------------------------------------------
Subtree, depth = 2 (138 data points).
Pre-Pruning
With both trees trained, we need prediction functions. `classify` walks the tree recursively and predicts a single sample:

```python
def classify(tree, x, annotate=False):
    '''
    Predict recursively; classifies one sample at a time.

    Parameters
    ----------
    tree: dict, the trained decision tree
    x: pd.Series, a single sample to predict
    annotate: boolean, whether to print the decision path

    Returns
    ----------
    The predicted label
    '''
    if tree['is_leaf']:
        if annotate:
            print("At leaf, predicting %s" % tree['prediction'])
        return tree['prediction']
    else:
        # Look up the value of this node's splitting feature in the sample
        split_feature_value = x[tree['splitting_feature']]
        if annotate:
            print("Split on %s = %s" % (tree['splitting_feature'], split_feature_value))
        if split_feature_value == 0:
            # Samples with feature value 0 go to the left subtree
            return classify(tree['left'], x, annotate)
        else:
            return classify(tree['right'], x, annotate)
```
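As a quick sanity check, `classify` can be run on a single test sample with `annotate=True` to print the path from the root down to a leaf (a small sketch; the sample index 0 is arbitrary):

```python
sample = test_data.iloc[0]                                         # an arbitrary test sample
prediction = classify(tree_without_pre_pruning, sample, annotate=True)
print('predicted:', prediction, ' actual:', sample[target])
```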
```python
def predict(tree, data):
    '''
    Iterate over the rows of data, predict each sample, and return
    the predictions as an np.ndarray.

    Parameters
    ----------
    tree: dict, the trained decision tree
    data: pd.DataFrame, the samples to predict

    Returns
    ----------
    predictions: np.ndarray, the model's predictions for these samples
    '''
    predictions = np.zeros(len(data))
    for i in range(len(data)):
        predictions[i] = classify(tree, data.iloc[i])
    return predictions
```
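`predict` is a plain Python loop over the rows, which is slow but acceptable for the roughly 49,000 test samples here. A quick sanity check, reusing the trees trained above (a sketch, not part of the original assignment):

```python
preds = predict(tree_without_pre_pruning, test_data)
print(preds.shape)        # expected: (49043,)
print(np.unique(preds))   # the tree only ever predicts -1 or +1
```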
```python
from sklearn.metrics import accuracy_score
```
Test-set accuracy of the decision tree without pre-pruning:
```python
accuracy_score(predict(tree_without_pre_pruning, test_data), test_data[target])
```
0.80882472881494172
Test-set accuracy of the decision tree with pre-pruning:
```python
accuracy_score(predict(tree_with_pre_pruning, test_data), test_data[target])
```
0.80939564472718373
Below, compute the accuracy, precision, recall, and F1 of the decision trees with and without pre-pruning (maximum depth 6, gain ratio as the splitting criterion), rounded to 4 decimal places.

```python
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score

# Note: sklearn's metric functions expect (y_true, y_pred); the calls below
# pass the arguments in (predictions, labels) order.
print(precision_score(predict(tree_without_pre_pruning, test_data), test_data[target]))
print(precision_score(predict(tree_with_pre_pruning, test_data), test_data[target]))
print(recall_score(predict(tree_without_pre_pruning, test_data), test_data[target]))
print(recall_score(predict(tree_with_pre_pruning, test_data), test_data[target]))
print(f1_score(predict(tree_without_pre_pruning, test_data), test_data[target]))
print(f1_score(predict(tree_with_pre_pruning, test_data), test_data[target]))
```
0.998438365825
0.999848874112
0.809739755689
0.809494677597
0.894242916441
0.894658553076
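To fill in the table below, the same metrics can be rounded to 4 decimal places. A minimal helper sketch (the name `report` and the label strings are just for illustration; it keeps the same (predictions, labels) argument order as the calls above so the numbers match the printout):

```python
def report(name, tree):
    preds = predict(tree, test_data)
    labels = test_data[target]
    # Print accuracy, precision, recall, and F1 rounded to 4 decimal places
    print(name,
          round(accuracy_score(preds, labels), 4),
          round(precision_score(preds, labels), 4),
          round(recall_score(preds, labels), 4),
          round(f1_score(preds, labels), 4))

report('with pre-pruning   ', tree_with_pre_pruning)
report('without pre-pruning', tree_without_pre_pruning)
```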
| Model | Accuracy | Precision | Recall | F1 |
| --- | --- | --- | --- | --- |
| With pre-pruning | 0.8094 | 0.9998 | 0.8095 | 0.8947 |
| Without pre-pruning | 0.8088 | 0.9984 | 0.8097 | 0.8942 |

At a maximum depth of 6, the pre-pruned tree scores marginally higher on accuracy, precision, and F1, while recall is essentially tied.
```python
from pyecharts import Tree
```
```python
def generate_echarts_data(tree):
    '''Convert the tree dict into the nested {name, children} format pyecharts expects.'''
    value = dict()
    if tree['is_leaf']:
        value['value'] = tree['prediction']
        value['name'] = 'label: %s' % (tree['prediction'])
        return value
    value['name'] = tree['splitting_feature']
    value['children'] = [generate_echarts_data(tree['left']), generate_echarts_data(tree['right'])]
    return value
```
```python
data1 = generate_echarts_data(tree_without_pre_pruning)
```
```python
tree = Tree(width=800, height=800)
tree.add("", [data1],
         tree_collapse_interval=5,
         tree_top="15%",
         tree_right="20%",
         tree_symbol='rect',
         tree_symbol_size=20)
tree.render()
tree
```
```python
data2 = generate_echarts_data(tree_with_pre_pruning)
```
```python
tree = Tree(width=800, height=800)
tree.add("", [data2],
         tree_collapse_interval=5,
         tree_top="15%",
         tree_right="20%",
         tree_symbol='rect',
         tree_symbol_size=20)
tree.render()
tree
```
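If the chart does not display inline, it can also be written to a standalone HTML file and opened in a browser (a sketch assuming pyecharts 0.x, where `render()` accepts an output path and defaults to `render.html`; the file name here is hypothetical):

```python
# Save the interactive tree chart to an HTML file
tree.render('tree_with_pre_pruning.html')
```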