Decision Tree Implementation

I've recently been a teaching assistant for an undergraduate course and wrote an assignment on implementing a decision tree, plus another one on pre-pruning, so I implemented them myself along the way.

My decision tree mainly follows the C4.5 algorithm, without pruning. The implementation is quite simple: it only handles discrete features and builds a binary decision tree. All features are first one-hot encoded, so every feature becomes 0 or 1, and each split is then a binary split on one of these columns.

The idea is straightforward: pick a splitting criterion, iterate over all features to find the best one, split the training set on it, remove that feature from the remaining features, and then recursively build the left and right subtrees. The only subtlety is the recursion's stopping conditions, of which there are three (a minimal sketch of the whole procedure follows right after this list):

  1. If all samples in the current node belong to the same class, make the node a leaf.
  2. If the current depth reaches the maximum depth, make the node a leaf.
  3. If no features remain to split on, make the node a leaf.
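
To make the recursion concrete, here is a minimal, self-contained sketch of this procedure (toy 0/1 data and the plain Gini criterion are assumed here purely for illustration; the full assignment version with all three criteria is built step by step below):

import numpy as np

def gini(y):
    # Gini impurity of a label array with labels in {-1, +1}
    if len(y) == 0:
        return 0.0
    p = np.mean(y == 1)
    return 1.0 - p ** 2 - (1.0 - p) ** 2

def build(X, y, features, depth=0, max_depth=3):
    # recursively build a binary tree over 0/1 features
    majority = 1 if (y == 1).sum() >= (y == -1).sum() else -1
    # the three stopping conditions listed above (plus an empty-node guard)
    if len(y) == 0 or len(np.unique(y)) == 1 or not features or depth >= max_depth:
        return {'leaf': True, 'prediction': majority}

    def weighted_gini(f):
        # weighted Gini impurity after splitting on feature f
        left, right = y[X[:, f] == 0], y[X[:, f] == 1]
        return (len(left) * gini(left) + len(right) * gini(right)) / len(y)

    best = min(features, key=weighted_gini)
    rest = [f for f in features if f != best]
    left_mask = X[:, best] == 0
    return {'leaf': False, 'feature': best,
            'left':  build(X[left_mask], y[left_mask], rest, depth + 1, max_depth),
            'right': build(X[~left_mask], y[~left_mask], rest, depth + 1, max_depth)}

# toy data: two 0/1 features, the label simply follows feature 0
X = np.array([[0, 1], [0, 0], [1, 1], [1, 0]])
y = np.array([-1, -1, 1, 1])
print(build(X, y, features=[0, 1]))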

Problem 3: Implementing a Decision Tree

Assignment:
Using the LendingClub Safe Loans dataset:

  1. Implement the three splitting criteria: information gain, information gain ratio and the Gini index
  2. Train the three decision trees on the given training set
  3. With the maximum depth set to 10, compute the accuracy, precision, recall and F1 score of the three trees on the training and test sets

In this part we implement a very simple binary decision tree.

1. Loading the data

# import libraries
import pandas as pd
import numpy as np
import json
# load the data
loans = pd.read_csv('data/lendingclub/lending-club-data.csv', low_memory=False)

The data contains two columns related to what we want to predict: safe_loans and bad_loans, marking the positive and negative examples respectively. We preprocess them so that safe_loans is +1 for positive examples and -1 for negative ones, and drop the bad_loans column.

# preprocess: use safe_loans as the label
loans['safe_loans'] = loans['bad_loans'].apply(lambda x : +1 if x==0 else -1)
del loans['bad_loans']

We only use the four columns grade, term, home_ownership and emp_length as features and safe_loans as the label, keeping just these five columns of loans.

features = ['grade',              # grade of the loan
            'term',               # the term of the loan
            'home_ownership',     # home_ownership status: own, mortgage or rent
            'emp_length',         # number of years of employment
           ]
target = 'safe_loans'
loans = loans[features + [target]]

2. Splitting into training and test sets

from sklearn.utils import shuffle
loans = shuffle(loans, random_state = 34)

split_line = int(len(loans) * 0.6)
train_data = loans.iloc[: split_line]
test_data = loans.iloc[split_line:]

3. Feature preprocessing

All of the features are categorical, so we preprocess the data with one-hot encoding.

The idea of one-hot encoding is to turn a discrete feature into a vector. Suppose feature $A$ takes three values $\lbrace a, b, c\rbrace$ and these three values are equivalent. If we encoded them with the numbers 1, 2 and 3, any computation involving distances would be biased: it would treat 2 as close to 1 and 3 as far from 1, even though the three values should be interchangeable, so the model's judgments would be skewed. The fix is to represent them with three-dimensional vectors: $[1, 0, 0]$ for a, $[0, 1, 0]$ for b and $[0, 0, 1]$ for c. The distance between any two of these vectors is then the same, $\sqrt{2}$ in Euclidean space. That is the idea behind one-hot encoding.

In pandas, get_dummies generates the one-hot vectors.
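
For a quick feel of what get_dummies produces, here is a tiny toy example (a made-up column, not the loan data; depending on the pandas version the dummy columns may be 0/1 integers or booleans):

import pandas as pd

toy = pd.DataFrame({'A': ['a', 'b', 'c', 'a']})
print(pd.get_dummies(toy['A'], prefix='A'))
# expected columns: A_a, A_b, A_c, with exactly one of them set per row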

def one_hot_encoding(data, features_categorical):
    '''
    Parameter
    ----------
    data: pd.DataFrame

    features_categorical: list(str)
    '''

    # iterate over all categorical features
    for cat in features_categorical:

        # one-hot encode this column, prefixing the new columns with the feature name
        one_encoding = pd.get_dummies(data[cat], prefix = cat)

        # concatenate the generated one-hot columns with the original dataframe
        data = pd.concat([data, one_encoding], axis=1)

        # drop the original categorical column
        del data[cat]

    return data

We first generate the one-hot vectors for the training set and then for the test set. One thing to note: if feature $A$ takes the values $\lbrace a, b, c\rbrace$ in the training set, we get three columns $A_a$, $A_b$, $A_c$, and a model trained on this set only ever considers these three features. If a test sample has value $d$ for feature $A$, then its $A_a$, $A_b$ and $A_c$ are all 0 and we do not create an $A_d$ column, because that feature did not exist when the model was trained.

train_data = one_hot_encoding(train_data, features)

Get the names of all features.

one_hot_features = train_data.columns.tolist()
one_hot_features.remove(target)
one_hot_features
['grade_A',
 'grade_B',
 'grade_C',
 'grade_D',
 'grade_E',
 'grade_F',
 'grade_G',
 'term_ 36 months',
 'term_ 60 months',
 'home_ownership_MORTGAGE',
 'home_ownership_OTHER',
 'home_ownership_OWN',
 'home_ownership_RENT',
 'emp_length_1 year',
 'emp_length_10+ years',
 'emp_length_2 years',
 'emp_length_3 years',
 'emp_length_4 years',
 'emp_length_5 years',
 'emp_length_6 years',
 'emp_length_7 years',
 'emp_length_8 years',
 'emp_length_9 years',
 'emp_length_< 1 year']

Next we one-hot encode the test set, but we only keep the features that appear in one_hot_features.

test_data_tmp = one_hot_encoding(test_data, features)
# create an empty DataFrame
test_data = pd.DataFrame(columns = train_data.columns)
for feature in train_data.columns:
    # if this training-set feature also appears in test_data_tmp, copy it over
    if feature in test_data_tmp.columns:
        test_data[feature] = test_data_tmp[feature].copy()
    else:
        # otherwise fill the column with zeros
        test_data[feature] = np.zeros(test_data_tmp.shape[0], dtype = 'uint8')
train_data.shape
(73564, 25)
test_data.shape
(49043, 25)

The training set has 73564 samples and the test set has 49043. After preprocessing, every feature takes the values 0 and 1 and the label is 1 or -1. That completes the data preprocessing.

4. Implementing the three splitting criteria

Decision trees can use many splitting criteria, such as information gain, information gain ratio and the Gini index.

We need a function that, given the labels of all samples in some node of the tree, computes the value of the corresponding splitting criterion.

Next we implement the three criteria listed above.

We adopt the convention that samples whose feature value is 0 go to the left subtree and samples whose feature value is 1 go to the right subtree.

4.1 Information gain

Information entropy:

$$\mathrm{Ent}(D) = -\sum_{k} p_k \log_2 p_k$$

Information gain:

$$\mathrm{Gain}(D, a) = \mathrm{Ent}(D) - \sum_{v} \frac{|D^v|}{|D|}\,\mathrm{Ent}(D^v)$$

Convention when computing the entropy: if $p = 0$, then $p \log_2 p = 0$.
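
As a quick check that matches the first test case further down: for the labels $[-1, -1, 1, 1, 1]$ we have $p_+ = 3/5$ and $p_- = 2/5$, so

$$\mathrm{Ent} = -\left(\tfrac{3}{5}\log_2\tfrac{3}{5} + \tfrac{2}{5}\log_2\tfrac{2}{5}\right) \approx 0.97095$$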

def information_entropy(labels_in_node):
    '''
    Compute the information entropy of the current node.

    Parameter
    ----------
    labels_in_node: np.ndarray, e.g. [-1, 1, -1, 1, 1]

    Returns
    ----------
    float: information entropy
    '''
    # total number of samples
    num_of_samples = labels_in_node.shape[0]

    if num_of_samples == 0:
        return 0

    # number of samples labelled 1
    num_of_positive = len(labels_in_node[labels_in_node == 1])

    # number of samples labelled -1
    # YOUR CODE HERE
    num_of_negative = len(labels_in_node[labels_in_node == -1])

    # probability of the positive class
    prob_positive = num_of_positive / num_of_samples

    # probability of the negative class
    # YOUR CODE HERE
    prob_negative = num_of_negative / num_of_samples

    if prob_positive == 0:
        positive_part = 0
    else:
        positive_part = prob_positive * np.log2(prob_positive)

    if prob_negative == 0:
        negative_part = 0
    else:
        negative_part = prob_negative * np.log2(prob_negative)

    return - ( positive_part + negative_part )

Below are six test cases.

# entropy test case 1
example_labels = np.array([-1, -1, 1, 1, 1])
print(information_entropy(example_labels)) # 0.97095

# entropy test case 2
example_labels = np.array([-1, -1, 1, 1, 1, 1, 1])
print(information_entropy(example_labels)) # 0.86312

# entropy test case 3
example_labels = np.array([-1, -1, -1, -1, -1, 1, 1])
print(information_entropy(example_labels)) # 0.86312

# entropy test case 4
example_labels = np.array([-1] * 9 + [1] * 8)
print(information_entropy(example_labels)) # 0.99750

# entropy test case 5
example_labels = np.array([1] * 8)
print(information_entropy(example_labels)) # 0

# entropy test case 6
example_labels = np.array([])
print(information_entropy(example_labels)) # 0
0.970950594455
0.863120568567
0.863120568567
0.997502546369
-0.0
0

Next, complete the function that computes the information gain of every feature.

def compute_information_gains(data, features, target, annotate = False):
    '''
    Compute the information gain of every feature.

    Parameter
    ----------
    data: pd.DataFrame, the samples, a dataframe with features and label

    features: list(str), list of feature names

    target: str, name of the label column

    annotate: boolean, whether to print the information gain of every feature, default False

    Returns
    ----------
    information_gains: dict, key: str, feature name
                             value: float, information gain
    '''

    # the information gain of each candidate split is stored in this dict,
    # keyed by feature name
    information_gains = dict()

    # iterate over all features and evaluate the current criterion for each
    for feature in features:
        # the left subtree contains the samples whose value for this feature is 0
        left_split_target = data[data[feature] == 0][target]

        # the right subtree contains the samples whose value for this feature is 1
        right_split_target = data[data[feature] == 1][target]

        # entropy of the left subtree
        left_entropy = information_entropy(left_split_target)

        # weight of the left subtree
        left_weight = len(left_split_target) / (len(left_split_target) + len(right_split_target))

        # entropy of the right subtree
        # YOUR CODE HERE
        right_entropy = information_entropy(right_split_target)

        # weight of the right subtree
        # YOUR CODE HERE
        right_weight = len(right_split_target) / (len(left_split_target) + len(right_split_target))

        # entropy of the current node
        current_entropy = information_entropy(data[target])

        # information gain of splitting on this feature
        # YOUR CODE HERE
        gain = current_entropy - (left_weight * left_entropy + right_weight * right_entropy)

        # store the feature name and its gain as a key-value pair in information_gains
        information_gains[feature] = gain

        if annotate:
            print(" ", feature, gain)

    return information_gains
# information gain test case 1
print(compute_information_gains(train_data, one_hot_features, target)['grade_A']) # 0.01759

# information gain test case 2
print(compute_information_gains(train_data, one_hot_features, target)['term_ 60 months']) # 0.01429

# information gain test case 3
print(compute_information_gains(train_data, one_hot_features, target)['grade_B']) # 0.00370
0.0175919801789
0.0142918503294
0.00370492003453

4.2 Information gain ratio

Information gain ratio:

$$\mathrm{Gain\_ratio}(D, a) = \frac{\mathrm{Gain}(D, a)}{\mathrm{IV}(a)}$$

where

$$\mathrm{IV}(a) = -\sum_{v} \frac{|D^v|}{|D|}\log_2 \frac{|D^v|}{|D|}$$

Complete the function that computes the information gain ratio of every feature.

def compute_information_gain_ratios(data, features, target, annotate = False):
    '''
    Compute and store the information gain ratio of every feature.

    Parameter
    ----------
    data: pd.DataFrame, data with features and label

    features: list(str), list of feature names

    target: str, name of the label column

    annotate: boolean, default False, whether to print annotations

    Returns
    ----------
    gain_ratios: dict, key: str, feature name
                       value: float, information gain ratio
    '''

    gain_ratios = dict()

    # iterate over all features and evaluate the current criterion for each
    for feature in features:

        # the left subtree contains the samples whose value for this feature is 0
        left_split_target = data[data[feature] == 0][target]

        # the right subtree contains the samples whose value for this feature is 1
        right_split_target = data[data[feature] == 1][target]

        # entropy of the left subtree
        left_entropy = information_entropy(left_split_target)

        # weight of the left subtree
        left_weight = len(left_split_target) / (len(left_split_target) + len(right_split_target))

        # entropy of the right subtree
        right_entropy = information_entropy(right_split_target)

        # weight of the right subtree
        right_weight = len(right_split_target) / (len(left_split_target) + len(right_split_target))

        # entropy of the current node
        current_entropy = information_entropy(data[target])

        # information gain of the current node
        # YOUR CODE HERE
        gain = current_entropy - (left_weight * left_entropy + right_weight * right_entropy)

        # IV term of the left subtree
        if left_weight == 0:
            left_IV = 0
        else:
            # YOUR CODE HERE
            left_IV = left_weight * np.log2(left_weight)

        # IV term of the right subtree
        if right_weight == 0:
            right_IV = 0
        else:
            # YOUR CODE HERE
            right_IV = right_weight * np.log2(right_weight)

        # IV is the negative of the sum of the subtree terms
        IV = - (left_IV + right_IV)

        # information gain ratio of splitting on this feature
        # a tiny constant is added to the denominator so that IV == 0 does not give np.inf
        gain_ratio = gain / (IV + np.finfo(np.longdouble).eps)

        # store the gain ratio
        gain_ratios[feature] = gain_ratio

        if annotate:
            print(" ", feature, gain_ratio)

    return gain_ratios
# information gain ratio test case 1
print(compute_information_gain_ratios(train_data, one_hot_features, target)['grade_A']) # 0.02573

# information gain ratio test case 2
print(compute_information_gain_ratios(train_data, one_hot_features, target)['grade_B']) # 0.00417

# information gain ratio test case 3
print(compute_information_gain_ratios(train_data, one_hot_features, target)['term_ 60 months']) # 0.01970
0.025734780668
0.00417549506943
0.0197093627186

4.3 Gini index

Gini value of a dataset $D$:

$$\mathrm{Gini}(D) = 1 - \sum_{k} p_k^2$$

Gini index of an attribute $a$:

$$\mathrm{Gini\_index}(D, a) = \sum_{v} \frac{|D^v|}{|D|}\,\mathrm{Gini}(D^v)$$
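
For example (matching the first Gini test case below), for the labels $[-1, -1, 1, 1, 1]$:

$$\mathrm{Gini} = 1 - \left(\left(\tfrac{3}{5}\right)^2 + \left(\tfrac{2}{5}\right)^2\right) = 1 - 0.36 - 0.16 = 0.48$$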

Complete the computation of the Gini value of a dataset.

def gini(labels_in_node):
    '''
    Compute the Gini value of the samples in a node.

    Paramters
    ----------
    labels_in_node: np.ndarray, sample labels, e.g. [-1, -1, 1, 1, 1]

    Returns
    ---------
    gini: float, Gini value
    '''

    # total number of samples
    num_of_samples = labels_in_node.shape[0]

    if num_of_samples == 0:
        return 0

    # number of samples labelled 1
    num_of_positive = len(labels_in_node[labels_in_node == 1])

    # number of samples labelled -1
    # YOUR CODE HERE
    num_of_negative = len(labels_in_node[labels_in_node == -1])

    # probability of the positive class
    prob_positive = num_of_positive / num_of_samples

    # probability of the negative class
    # YOUR CODE HERE
    prob_negative = num_of_negative / num_of_samples

    # Gini value
    # YOUR CODE HERE
    gini = 1 - (prob_positive ** 2 + prob_negative ** 2)

    return gini
# Gini value test case 1
example_labels = np.array([-1, -1, 1, 1, 1])
print(gini(example_labels)) # 0.48

# Gini value test case 2
example_labels = np.array([-1, -1, 1, 1, 1, 1, 1])
print(gini(example_labels)) # 0.40816

# Gini value test case 3
example_labels = np.array([-1, -1, -1, -1, -1, 1, 1])
print(gini(example_labels)) # 0.40816

# Gini value test case 4
example_labels = np.array([-1] * 9 + [1] * 8)
print(gini(example_labels)) # 0.49827

# Gini value test case 5
example_labels = np.array([1] * 8)
print(gini(example_labels)) # 0

# Gini value test case 6
example_labels = np.array([])
print(gini(example_labels)) # 0
0.48
0.40816326530612246
0.40816326530612246
0.4982698961937716
0.0
0

Then compute the Gini index of every feature.

def compute_gini_indices(data, features, target, annotate = False):
    '''
    Compute the Gini index of each candidate splitting feature.

    Parameter
    ----------
    data: pd.DataFrame, data with features and label

    features: list(str), list of feature names

    target: str, name of the label column

    annotate: boolean, default False, whether to print annotations

    Returns
    ----------
    gini_indices: dict, key: str, feature name
                        value: float, Gini index
    '''

    gini_indices = dict()
    # iterate over all features and evaluate the current criterion for each
    for feature in features:
        # the left subtree contains the samples whose value for this feature is 0
        left_split_target = data[data[feature] == 0][target]

        # the right subtree contains the samples whose value for this feature is 1
        right_split_target = data[data[feature] == 1][target]

        # Gini value of the left subtree
        left_gini = gini(left_split_target)

        # weight of the left subtree
        left_weight = len(left_split_target) / (len(left_split_target) + len(right_split_target))

        # Gini value of the right subtree
        # YOUR CODE HERE
        right_gini = gini(right_split_target)

        # weight of the right subtree
        # YOUR CODE HERE
        right_weight = len(right_split_target) / (len(left_split_target) + len(right_split_target))

        # Gini index of splitting on this feature
        # YOUR CODE HERE
        gini_index = left_weight * left_gini + right_weight * right_gini

        # store it
        gini_indices[feature] = gini_index

        if annotate:
            print(" ", feature, gini_index)

    return gini_indices
# Gini index test case 1
print(compute_gini_indices(train_data, one_hot_features, target)['grade_A']) # 0.30095

# Gini index test case 2
print(compute_gini_indices(train_data, one_hot_features, target)['grade_B']) # 0.30568

# Gini index test case 3
print(compute_gini_indices(train_data, one_hot_features, target)['term_ 36 months']) # 0.30055
0.3009520964964362
0.3056855375882364
0.30055418611740065

5. Selecting the best splitting feature

With the three splitting criteria implemented, the next step is the function that picks the best feature.

def best_splitting_feature(data, features, target, criterion = 'gini', annotate = False):
    '''
    Given a splitting criterion and the data, find the best splitting feature.

    Parameters
    ----------
    data: pd.DataFrame, data with features and label

    features: list(str), list of feature names

    target: str, name of the label column

    criterion: str, which criterion to use, one of: 'information_gain', 'gain_ratio', 'gini'

    annotate: boolean, default False, whether to print annotations

    Returns
    ----------
    best_feature: str, name of the best splitting feature

    '''

    if criterion == 'information_gain':
        if annotate:
            print('using information gain')

        # information gain of every feature
        information_gains = compute_information_gains(data, features, target, annotate)

        # information_gains is a dict and we want the key of the element with the largest value,
        # which can be obtained directly like this
        # find the best splitting feature from the features and their information gains
        # YOUR CODE HERE
        best_feature = max(information_gains.items(), key = lambda x: x[1])[0]

        return best_feature

    elif criterion == 'gain_ratio':
        if annotate:
            print('using information gain ratio')

        # information gain ratio of every feature
        gain_ratios = compute_information_gain_ratios(data, features, target, annotate)

        # find the best splitting feature from the features and their gain ratios
        # YOUR CODE HERE
        best_feature = max(gain_ratios.items(), key = lambda x: x[1])[0]

        return best_feature

    elif criterion == 'gini':
        if annotate:
            print('using gini')

        # Gini index of every feature
        gini_indices = compute_gini_indices(data, features, target, annotate)

        # find the best splitting feature from the features and their Gini indices
        # YOUR CODE HERE
        best_feature = min(gini_indices.items(), key = lambda x: x[1])[0]

        return best_feature
    else:
        raise Exception("Invalid criterion!", criterion)

6. Checking whether all samples in a node belong to one class

def intermediate_node_num_mistakes(labels_in_node):
    '''
    Count how many samples belong to the minority class in a tree node; for example, given [1, 1, -1, -1, 1], return 2.

    Parameter
    ----------
    labels_in_node: np.ndarray, pd.Series

    Returns
    ----------
    int: the count

    '''
    # if the array is empty, return 0
    if len(labels_in_node) == 0:
        return 0

    # number of 1s
    # YOUR CODE HERE
    num_of_one = len(labels_in_node[labels_in_node == 1])

    # number of -1s
    # YOUR CODE HERE
    num_of_minus_one = len(labels_in_node[labels_in_node == -1])

    return num_of_one if num_of_minus_one > num_of_one else num_of_minus_one
# test case 1
print(intermediate_node_num_mistakes(np.array([1, 1, -1, -1, -1]))) # 2

# test case 2
print(intermediate_node_num_mistakes(np.array([]))) # 0

# test case 3
print(intermediate_node_num_mistakes(np.array([1]))) # 0
2
0
0

7. Creating leaf nodes

def create_leaf(target_values):
    '''
    Work out the label of this leaf node and store the leaf's information in a dict.

    Parameter:
    ----------
    target_values: pd.Series, labels of the samples in this leaf

    Returns:
    ----------
    leaf: dict, representing a leaf node,
          leaf['splitting_feature'], None, a leaf has no splitting feature
          leaf['left'], None, a leaf has no left subtree
          leaf['right'], None, a leaf has no right subtree
          leaf['is_leaf'], True, whether this is a leaf node
          leaf['prediction'], int, the prediction of this leaf
    '''
    # create the leaf node
    leaf = {'splitting_feature' : None,
            'left' : None,
            'right' : None,
            'is_leaf': True}

    # count the -1s and +1s in the node
    num_ones = len(target_values[target_values == +1])
    num_minus_ones = len(target_values[target_values == -1])

    # the leaf's label follows the majority: the class with more samples, stored in leaf['prediction']
    if num_ones > num_minus_ones:
        # YOUR CODE HERE
        leaf['prediction'] = 1
    else:
        # YOUR CODE HERE
        leaf['prediction'] = -1

    # return the leaf node
    return leaf

8. Building the decision tree recursively

We build the decision tree recursively. The recursion terminates in three cases:

  1. If all samples in a node have the same label, there is nothing left to split; make the node a leaf.
  2. If every feature has already been used, so no remaining feature can split the samples in the node, make it a leaf.
  3. If the node has reached the maximum tree depth we set, make it a leaf.
def decision_tree_create(data, features, target, criterion = 'gini', current_depth = 0, max_depth = 10, annotate = False):
    '''
    Parameter:
    ----------
    data: pd.DataFrame, the data

    features: iterable, an iterable of feature names, e.g. a list

    target: str, name of the label column

    criterion: str, splitting criterion, one of: 'information_gain', 'gain_ratio', 'gini'

    current_depth: int, current depth, tracked during the recursion

    max_depth: int, the maximum tree depth we allow; the recursion stops once it is reached

    Returns:
    ----------
    dict, dict['is_leaf']           : False, this node is not a leaf
          dict['prediction']        : None, a non-leaf node has no prediction
          dict['splitting_feature'] : splitting_feature, the feature this node splits on
          dict['left']              : dict
          dict['right']             : dict
    '''

    if criterion not in ['information_gain', 'gain_ratio', 'gini']:
        raise Exception("Invalid criterion!", criterion)

    # copy the features; every time a feature is used for a split we remove it
    remaining_features = features[:]

    # take out the label values
    target_values = data[target]
    print("-" * 50)
    print("Subtree, depth = %s (%s data points)." % (current_depth, len(target_values)))

    # stopping condition 1
    # all samples in this node belong to one class, i.e. the size of the smaller class is 0
    # use intermediate_node_num_mistakes defined above for this check
    # YOUR CODE HERE
    if intermediate_node_num_mistakes(target_values) == 0:
        print("Stopping condition 1 reached.")
        return create_leaf(target_values) # create a leaf node

    # stopping condition 2
    # no features left to split on, i.e. remaining_features is empty
    # YOUR CODE HERE
    if len(remaining_features) == 0:
        print("Stopping condition 2 reached.")
        return create_leaf(target_values) # create a leaf node

    # stopping condition 3
    # the maximum depth has been reached, i.e. the current depth equals the maximum depth
    # YOUR CODE HERE
    if current_depth >= max_depth:
        print("Reached maximum depth. Stopping for now.")
        return create_leaf(target_values) # create a leaf node

    # find the best splitting feature
    # use the best_splitting_feature function
    ## YOUR CODE HERE
    splitting_feature = best_splitting_feature(data, features, target, criterion, annotate)

    # split the data into two parts on the chosen feature
    # data for the left subtree
    left_split = data[data[splitting_feature] == 0]

    # data for the right subtree
    # YOUR CODE HERE
    right_split = data[data[splitting_feature] == 1]

    # the split is done, so remove this feature from the remaining features
    remaining_features.remove(splitting_feature)

    # print the splitting feature and the sizes of the left and right subtrees
    print("Split on feature %s. (%s, %s)" % (\
        splitting_feature, len(left_split), len(right_split)))

    # if this feature sends all samples into a single subtree, turn that subtree into a leaf directly
    # check whether the left subtree is "perfect"
    if len(left_split) == len(data):
        print("Creating leaf node.")
        return create_leaf(left_split[target])

    # check whether the right subtree is "perfect"
    if len(right_split) == len(data):
        print("Creating right node.")
        # YOUR CODE HERE
        return create_leaf(right_split[target])

    # recursively build the left subtree
    left_tree = decision_tree_create(left_split, remaining_features, target, criterion, current_depth + 1, max_depth, annotate)

    # recursively build the right subtree
    ## YOUR CODE HERE
    right_tree = decision_tree_create(right_split, remaining_features, target, criterion, current_depth + 1, max_depth, annotate)

    # return a non-leaf node of the tree
    return {'is_leaf'           : False,
            'prediction'        : None,
            'splitting_feature' : splitting_feature,
            'left'              : left_tree,
            'right'             : right_tree}
my_decision_tree = decision_tree_create(train_data, one_hot_features, target, 'gini', max_depth = 6, annotate = False)
--------------------------------------------------
Subtree, depth = 0 (73564 data points).
Split on feature term_ 36 months. (14831, 58733)
--------------------------------------------------
Subtree, depth = 1 (14831 data points).
Split on feature grade_F. (13003, 1828)
--------------------------------------------------
Subtree, depth = 2 (13003 data points).
Split on feature grade_E. (9818, 3185)
--------------------------------------------------
Subtree, depth = 3 (9818 data points).
Split on feature home_ownership_RENT. (6796, 3022)
--------------------------------------------------
Subtree, depth = 4 (6796 data points).
Split on feature grade_G. (6507, 289)
--------------------------------------------------
Subtree, depth = 5 (6507 data points).
Split on feature grade_D. (4368, 2139)
--------------------------------------------------
Subtree, depth = 6 (4368 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 6 (2139 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 5 (289 data points).
Split on feature home_ownership_OWN. (249, 40)
--------------------------------------------------
Subtree, depth = 6 (249 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 6 (40 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 4 (3022 data points).
Split on feature grade_G. (2827, 195)
--------------------------------------------------
Subtree, depth = 5 (2827 data points).
Split on feature grade_D. (1651, 1176)
--------------------------------------------------
Subtree, depth = 6 (1651 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 6 (1176 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 5 (195 data points).
Split on feature emp_length_2 years. (176, 19)
--------------------------------------------------
Subtree, depth = 6 (176 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 6 (19 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 3 (3185 data points).
Split on feature home_ownership_RENT. (1980, 1205)
--------------------------------------------------
Subtree, depth = 4 (1980 data points).
Split on feature emp_length_3 years. (1828, 152)
--------------------------------------------------
Subtree, depth = 5 (1828 data points).
Split on feature emp_length_10+ years. (1057, 771)
--------------------------------------------------
Subtree, depth = 6 (1057 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 6 (771 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 5 (152 data points).
Split on feature home_ownership_OTHER. (151, 1)
--------------------------------------------------
Subtree, depth = 6 (151 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 6 (1 data points).
Stopping condition 1 reached.
--------------------------------------------------
Subtree, depth = 4 (1205 data points).
Split on feature emp_length_1 year. (1124, 81)
--------------------------------------------------
Subtree, depth = 5 (1124 data points).
Split on feature emp_length_8 years. (1073, 51)
--------------------------------------------------
Subtree, depth = 6 (1073 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 6 (51 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 5 (81 data points).
Split on feature grade_A. (81, 0)
Creating leaf node.
--------------------------------------------------
Subtree, depth = 2 (1828 data points).
Split on feature home_ownership_RENT. (1030, 798)
--------------------------------------------------
Subtree, depth = 3 (1030 data points).
Split on feature emp_length_3 years. (957, 73)
--------------------------------------------------
Subtree, depth = 4 (957 data points).
Split on feature emp_length_2 years. (886, 71)
--------------------------------------------------
Subtree, depth = 5 (886 data points).
Split on feature home_ownership_OTHER. (884, 2)
--------------------------------------------------
Subtree, depth = 6 (884 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 6 (2 data points).
Stopping condition 1 reached.
--------------------------------------------------
Subtree, depth = 5 (71 data points).
Split on feature home_ownership_MORTGAGE. (12, 59)
--------------------------------------------------
Subtree, depth = 6 (12 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 6 (59 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 4 (73 data points).
Split on feature home_ownership_MORTGAGE. (13, 60)
--------------------------------------------------
Subtree, depth = 5 (13 data points).
Split on feature grade_A. (13, 0)
Creating leaf node.
--------------------------------------------------
Subtree, depth = 5 (60 data points).
Split on feature grade_A. (60, 0)
Creating leaf node.
--------------------------------------------------
Subtree, depth = 3 (798 data points).
Split on feature emp_length_7 years. (740, 58)
--------------------------------------------------
Subtree, depth = 4 (740 data points).
Split on feature emp_length_3 years. (673, 67)
--------------------------------------------------
Subtree, depth = 5 (673 data points).
Split on feature emp_length_9 years. (646, 27)
--------------------------------------------------
Subtree, depth = 6 (646 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 6 (27 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 5 (67 data points).
Split on feature grade_A. (67, 0)
Creating leaf node.
--------------------------------------------------
Subtree, depth = 4 (58 data points).
Split on feature grade_A. (58, 0)
Creating leaf node.
--------------------------------------------------
Subtree, depth = 1 (58733 data points).
Split on feature grade_A. (45632, 13101)
--------------------------------------------------
Subtree, depth = 2 (45632 data points).
Split on feature grade_B. (25130, 20502)
--------------------------------------------------
Subtree, depth = 3 (25130 data points).
Split on feature grade_C. (11066, 14064)
--------------------------------------------------
Subtree, depth = 4 (11066 data points).
Split on feature home_ownership_MORTGAGE. (6987, 4079)
--------------------------------------------------
Subtree, depth = 5 (6987 data points).
Split on feature grade_F. (6650, 337)
--------------------------------------------------
Subtree, depth = 6 (6650 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 6 (337 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 5 (4079 data points).
Split on feature grade_G. (4003, 76)
--------------------------------------------------
Subtree, depth = 6 (4003 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 6 (76 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 4 (14064 data points).
Split on feature home_ownership_MORTGAGE. (8209, 5855)
--------------------------------------------------
Subtree, depth = 5 (8209 data points).
Split on feature emp_length_2 years. (7209, 1000)
--------------------------------------------------
Subtree, depth = 6 (7209 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 6 (1000 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 5 (5855 data points).
Split on feature emp_length_10+ years. (3802, 2053)
--------------------------------------------------
Subtree, depth = 6 (3802 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 6 (2053 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 3 (20502 data points).
Split on feature home_ownership_MORTGAGE. (10775, 9727)
--------------------------------------------------
Subtree, depth = 4 (10775 data points).
Split on feature home_ownership_OTHER. (10741, 34)
--------------------------------------------------
Subtree, depth = 5 (10741 data points).
Split on feature emp_length_1 year. (9754, 987)
--------------------------------------------------
Subtree, depth = 6 (9754 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 6 (987 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 5 (34 data points).
Split on feature emp_length_10+ years. (27, 7)
--------------------------------------------------
Subtree, depth = 6 (27 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 6 (7 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 4 (9727 data points).
Split on feature emp_length_< 1 year. (9186, 541)
--------------------------------------------------
Subtree, depth = 5 (9186 data points).
Split on feature emp_length_9 years. (8807, 379)
--------------------------------------------------
Subtree, depth = 6 (8807 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 6 (379 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 5 (541 data points).
Split on feature grade_C. (541, 0)
Creating leaf node.
--------------------------------------------------
Subtree, depth = 2 (13101 data points).
Split on feature home_ownership_MORTGAGE. (5830, 7271)
--------------------------------------------------
Subtree, depth = 3 (5830 data points).
Split on feature emp_length_3 years. (5283, 547)
--------------------------------------------------
Subtree, depth = 4 (5283 data points).
Split on feature emp_length_1 year. (4705, 578)
--------------------------------------------------
Subtree, depth = 5 (4705 data points).
Split on feature emp_length_7 years. (4467, 238)
--------------------------------------------------
Subtree, depth = 6 (4467 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 6 (238 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 5 (578 data points).
Split on feature home_ownership_OTHER. (576, 2)
--------------------------------------------------
Subtree, depth = 6 (576 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 6 (2 data points).
Stopping condition 1 reached.
--------------------------------------------------
Subtree, depth = 4 (547 data points).
Split on feature home_ownership_OTHER. (545, 2)
--------------------------------------------------
Subtree, depth = 5 (545 data points).
Split on feature home_ownership_OWN. (468, 77)
--------------------------------------------------
Subtree, depth = 6 (468 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 6 (77 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 5 (2 data points).
Split on feature grade_B. (2, 0)
Creating leaf node.
--------------------------------------------------
Subtree, depth = 3 (7271 data points).
Split on feature emp_length_2 years. (6702, 569)
--------------------------------------------------
Subtree, depth = 4 (6702 data points).
Split on feature emp_length_4 years. (6234, 468)
--------------------------------------------------
Subtree, depth = 5 (6234 data points).
Split on feature emp_length_3 years. (5689, 545)
--------------------------------------------------
Subtree, depth = 6 (5689 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 6 (545 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 5 (468 data points).
Split on feature grade_B. (468, 0)
Creating leaf node.
--------------------------------------------------
Subtree, depth = 4 (569 data points).
Split on feature grade_B. (569, 0)
Creating leaf node.

The model is now trained.

9. Prediction

Next we need to complete the prediction function.

def classify(tree, x, annotate = False):
    '''
    Predict recursively; only one sample can be predicted at a time.

    Parameters
    ----------
    tree: dict, the model

    x: pd.Series, the sample to predict

    annotate: boolean, whether to print annotations

    Returns
    ----------
    the predicted label
    '''
    if tree['is_leaf']:
        if annotate:
            print ("At leaf, predicting %s" % tree['prediction'])
        return tree['prediction']
    else:
        split_feature_value = x[tree['splitting_feature']]
        if annotate:
            print ("Split on %s = %s" % (tree['splitting_feature'], split_feature_value))
        if split_feature_value == 0:
            return classify(tree['left'], x, annotate)
        else:
            return classify(tree['right'], x, annotate)

Let's test with the first sample of the test set.

test_sample = test_data.iloc[0]
print(test_sample)
safe_loans                 1
grade_A                    0
grade_B                    0
grade_C                    0
grade_D                    0
grade_E                    1
grade_F                    0
grade_G                    0
term_ 36 months            1
term_ 60 months            0
home_ownership_MORTGAGE    1
home_ownership_OTHER       0
home_ownership_OWN         0
home_ownership_RENT        0
emp_length_1 year          0
emp_length_10+ years       0
emp_length_2 years         1
emp_length_3 years         0
emp_length_4 years         0
emp_length_5 years         0
emp_length_6 years         0
emp_length_7 years         0
emp_length_8 years         0
emp_length_9 years         0
emp_length_< 1 year        0
Name: 37225, dtype: int64
print('True class: %s ' % (test_sample['safe_loans']))
print('Predicted class: %s ' % classify(my_decision_tree, test_sample))
True class: 1 
Predicted class: 1 

Print the decision path the tree follows for this sample.

classify(my_decision_tree, test_sample, annotate=True)
Split on term_ 36 months = 1
Split on grade_A = 0
Split on grade_B = 0
Split on grade_C = 0
Split on home_ownership_MORTGAGE = 1
Split on grade_G = 0
At leaf, predicting 1






10. Evaluating the model on the test set

from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score

First we write a batch prediction function: it takes a pd.DataFrame like the whole test set and returns an np.ndarray with the model's predictions.

def predict(tree, data):
    '''
    Iterate over the rows of data, predict each sample, store the values, and return an np.ndarray.

    Parameter
    ----------
    tree: dict, the model

    data: pd.DataFrame, the data

    Returns
    ----------
    predictions: np.ndarray, the model's predictions for these samples
    '''
    predictions = np.zeros(len(data))

    # YOUR CODE HERE
    for i in range(len(data)):
        predictions[i] = classify(tree, data.iloc[i])

    return predictions

11. Compute the four metrics for the model trained with each splitting criterion and fill them in the table below

The maximum tree depth is 6.

# YOUR CODE HERE
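
One possible way to fill this in (a sketch only; it reuses decision_tree_create, predict and the sklearn metrics imported above, and uses max_depth = 6 as stated here):

# train one tree per splitting criterion and report the four metrics
# on both the training set and the test set
for criterion in ['information_gain', 'gain_ratio', 'gini']:
    tree = decision_tree_create(train_data, one_hot_features, target, criterion, max_depth = 6, annotate = False)
    for name, subset in [('train', train_data), ('test', test_data)]:
        y_true = subset[target].values
        y_pred = predict(tree, subset).astype(int)
        print(criterion, name,
              accuracy_score(y_true, y_pred),
              precision_score(y_true, y_pred),
              recall_score(y_true, y_pred),
              f1_score(y_true, y_pred))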

The maximum tree depth is 6.

Double-click here to edit

| Criterion | Accuracy | Precision | Recall | F1 |
| --- | --- | --- | --- | --- |
| Information gain | 0.0 | 0.0 | 0.0 | 0.0 |
| Information gain ratio | 0.0 | 0.0 | 0.0 | 0.0 |
| Gini index | 0.0 | 0.0 | 0.0 | 0.0 |

12. Extension: drawing the decision tree with ECharts

We can draw the trained decision tree with ECharts, using the pyecharts library.
pyecharts
pyecharts integrates seamlessly with jupyter notebook and renders charts directly in the notebook.
Note: pyecharts does not yet support jupyter lab.

# import the tree chart
from pyecharts import Tree

The ECharts tree chart expects data in the following form:

[
    {
        value: 1212,    # value of the node
        # child nodes
        children: [
            {
                # value of the child node
                value: 2323,
                # name of the child node
                name: 'description of this node',
                children: [...],
            },
            {
                value: 4545,
                name: 'description of this node',
                children: [
                    {
                        value: 5656,
                        name: 'description of this node',
                        children: [...]
                    },
                    ...
                ]
            }
        ]
    },
    ...
]

Documentation for the pyecharts tree chart: pyecharts Tree

This is similar to the tree structure we trained, except that every node has a "name" attribute for the node's name, "value" for its value, and "children", a list containing more dicts of the same form. We can write a recursive function to generate this structure.

def generate_echarts_data(tree):

    # dict for the current node
    value = dict()

    # if the tree passed in is already a leaf node
    if tree['is_leaf'] == True:

        # its value is the predicted label
        value['value'] = tree['prediction']

        # its name is "label: <prediction>"
        value['name'] = 'label: %s'%(tree['prediction'])

        # simply return this dict
        return value

    # otherwise the node's name is its splitting feature and its children form a list;
    # add the left and right subtrees to children
    value['name'] = tree['splitting_feature']
    value['children'] = [generate_echarts_data(tree['left']), generate_echarts_data(tree['right'])]
    return value
data = generate_echarts_data(my_decision_tree)

Use the code below to draw it. In the rendered chart the tree nodes are clickable; clicking a node expands its subtree.

tree = Tree(width=800, height=400)
tree.add("",
         [data],
         tree_collapse_interval=5,
         tree_top="15%",
         tree_right="20%",
         tree_symbol = 'rect',
         tree_symbol_size = 20,
        )
tree.render()
tree

Problem 4: Pre-pruning

Assignment

  1. Implement pre-pruning with the information gain ratio as the splitting criterion

We will build a binary decision tree with pre-pruning, using the information gain ratio as the splitting criterion.
The data is the same as in Problem 3, but pruning requires a validation set, so the data split differs from Problem 3.

1. Loading the data

# import libraries
import pandas as pd
import numpy as np
# load the data
loans = pd.read_csv('data/lendingclub/lending-club-data.csv', low_memory=False)
# preprocess: use safe_loans as the label
loans['safe_loans'] = loans['bad_loans'].apply(lambda x : +1 if x==0 else -1)
del loans['bad_loans']
features = ['grade',              # grade of the loan
            'term',               # the term of the loan
            'home_ownership',     # home_ownership status: own, mortgage or rent
            'emp_length',         # number of years of employment
           ]
target = 'safe_loans'
loans = loans[features + [target]]

2. Splitting into training, validation and test sets

from sklearn.utils import shuffle
loans = shuffle(loans, random_state = 34)

We use 60% of the data for training, 20% for validation and 20% for testing.

split_line1 = int(len(loans) * 0.6)
split_line2 = int(len(loans) * 0.8)
train_data = loans.iloc[: split_line1]
validation_data = loans.iloc[split_line1: split_line2]
test_data = loans.iloc[split_line2:]

3. Feature preprocessing

def one_hot_encoding(data, features_categorical):
    '''
    Parameter
    ----------
    data: pd.DataFrame

    features_categorical: list(str)
    '''

    # iterate over all categorical features
    for cat in features_categorical:

        # one-hot encode this column, prefixing the new columns with the feature name
        one_encoding = pd.get_dummies(data[cat], prefix = cat)

        # concatenate the generated one-hot columns with the original dataframe
        data = pd.concat([data, one_encoding], axis=1)

        # drop the original categorical column
        del data[cat]

    return data
train_data = one_hot_encoding(train_data, features)
one_hot_features = train_data.columns.tolist()
one_hot_features.remove(target)
validation_tmp = one_hot_encoding(validation_data, features)
validation_data = pd.DataFrame(columns = train_data.columns)
for feature in train_data.columns:
    if feature in validation_tmp:
        validation_data[feature] = validation_tmp[feature].copy()
    else:
        validation_data[feature] = np.zeros(len(validation_tmp), dtype = 'uint8')
test_data_tmp = one_hot_encoding(test_data, features)
test_data = pd.DataFrame(columns = train_data.columns)
for feature in train_data.columns:
    if feature in test_data_tmp.columns:
        test_data[feature] = test_data_tmp[feature].copy()
    else:
        test_data[feature] = np.zeros(test_data_tmp.shape[0], dtype = 'uint8')

Print the shapes of the three datasets.

print(train_data.shape, validation_data.shape, test_data.shape)
(73564, 25) (24521, 25) (24522, 25)

4. Implementing the information gain ratio

Information entropy:

$$\mathrm{Ent}(D) = -\sum_{k} p_k \log_2 p_k$$

Information gain:

$$\mathrm{Gain}(D, a) = \mathrm{Ent}(D) - \sum_{v} \frac{|D^v|}{|D|}\,\mathrm{Ent}(D^v)$$

Information gain ratio:

$$\mathrm{Gain\_ratio}(D, a) = \frac{\mathrm{Gain}(D, a)}{\mathrm{IV}(a)}$$

where

$$\mathrm{IV}(a) = -\sum_{v} \frac{|D^v|}{|D|}\log_2 \frac{|D^v|}{|D|}$$

Convention when computing the entropy: if $p = 0$, then $p \log_2 p = 0$.

def information_entropy(labels_in_node):
    '''
    Compute the information entropy of the current node.

    Parameter
    ----------
    labels_in_node: np.ndarray, e.g. [-1, 1, -1, 1, 1]

    Returns
    ----------
    float: information entropy
    '''

    # total number of samples
    num_of_samples = labels_in_node.shape[0]

    if num_of_samples == 0:
        return 0

    # number of samples labelled 1
    num_of_positive = len(labels_in_node[labels_in_node == 1])

    # number of samples labelled -1
    # YOUR CODE HERE
    num_of_negative = len(labels_in_node[labels_in_node == -1])

    # probability of the positive class
    prob_positive = num_of_positive / num_of_samples

    # probability of the negative class
    # YOUR CODE HERE
    prob_negative = num_of_negative / num_of_samples

    if prob_positive == 0:
        positive_part = 0
    else:
        positive_part = prob_positive * np.log2(prob_positive)

    if prob_negative == 0:
        negative_part = 0
    else:
        negative_part = prob_negative * np.log2(prob_negative)

    return - ( positive_part + negative_part )
# entropy test case 1
example_labels = np.array([-1, -1, 1, 1, 1])
print(information_entropy(example_labels)) # 0.97095

# entropy test case 2
example_labels = np.array([-1, -1, 1, 1, 1, 1, 1])
print(information_entropy(example_labels)) # 0.86312

# entropy test case 3
example_labels = np.array([-1, -1, -1, -1, -1, 1, 1])
print(information_entropy(example_labels)) # 0.86312

# entropy test case 4
example_labels = np.array([-1] * 9 + [1] * 8)
print(information_entropy(example_labels)) # 0.99750

# entropy test case 5
example_labels = np.array([1] * 8)
print(information_entropy(example_labels)) # 0

# entropy test case 6
example_labels = np.array([])
print(information_entropy(example_labels)) # 0
0.970950594455
0.863120568567
0.863120568567
0.997502546369
-0.0
0
def compute_information_gain_ratios(data, features, target, annotate = False):
    '''
    Compute and store the information gain ratio of every feature.

    Parameter
    ----------
    data: pd.DataFrame, data with features and label

    features: list(str), list of feature names

    target: str, name of the label column

    annotate: boolean, default False, whether to print annotations

    Returns
    ----------
    gain_ratios: dict, key: str, feature name
                       value: float, information gain ratio
    '''

    gain_ratios = dict()

    # iterate over all features and evaluate the current criterion for each
    for feature in features:

        # the left subtree contains the samples whose value for this feature is 0
        left_split_target = data[data[feature] == 0][target]

        # the right subtree contains the samples whose value for this feature is 1
        right_split_target = data[data[feature] == 1][target]

        # entropy of the left subtree
        left_entropy = information_entropy(left_split_target)

        # weight of the left subtree
        left_weight = len(left_split_target) / (len(left_split_target) + len(right_split_target))

        # entropy of the right subtree
        right_entropy = information_entropy(right_split_target)

        # weight of the right subtree
        right_weight = len(right_split_target) / (len(left_split_target) + len(right_split_target))

        # entropy of the current node
        current_entropy = information_entropy(data[target])

        # information gain of the current node
        # YOUR CODE HERE
        gain = current_entropy - (left_weight * left_entropy + right_weight * right_entropy)

        # IV term of the left subtree
        if left_weight == 0:
            left_IV = 0
        else:
            # YOUR CODE HERE
            left_IV = left_weight * np.log2(left_weight)

        # IV term of the right subtree
        if right_weight == 0:
            right_IV = 0
        else:
            # YOUR CODE HERE
            right_IV = right_weight * np.log2(right_weight)

        # IV is the negative of the sum of the subtree terms
        IV = - (left_IV + right_IV)

        # information gain ratio of splitting on this feature
        # a tiny constant is added to the denominator so that IV == 0 does not give np.inf
        gain_ratio = gain / (IV + np.finfo(np.longdouble).eps)

        # store the gain ratio
        gain_ratios[feature] = gain_ratio

        if annotate:
            print(" ", feature, gain_ratio)

    return gain_ratios
# information gain ratio test case 1
print(compute_information_gain_ratios(train_data, one_hot_features, target)['grade_A']) # 0.02573

# information gain ratio test case 2
print(compute_information_gain_ratios(train_data, one_hot_features, target)['grade_B']) # 0.00418

# information gain ratio test case 3
print(compute_information_gain_ratios(train_data, one_hot_features, target)['term_ 60 months']) # 0.01971
0.025734780668
0.00417549506943
0.0197093627186

5. Selecting the best splitting feature

Here we do not implement best-feature selection for information gain or the Gini index; if you are interested you can implement them as in the previous problem.

def best_splitting_feature(data, features, target, criterion = 'gain_ratio', annotate = False):
    '''
    Given a splitting criterion and the data, find the best splitting feature.

    Parameters
    ----------
    data: pd.DataFrame, data with features and label

    features: list(str), list of feature names

    target: str, name of the label column

    criterion: str, which criterion to use, one of: 'information_gain', 'gain_ratio', 'gini'

    annotate: boolean, default False, whether to print annotations

    Returns
    ----------
    best_feature: str, name of the best splitting feature

    '''
    if criterion == 'information_gain':
        if annotate:
            print('using information gain')
        return None

    elif criterion == 'gain_ratio':
        if annotate:
            print('using information gain ratio')

        # information gain ratio of every feature
        gain_ratios = compute_information_gain_ratios(data, features, target, annotate)

        # find the best splitting feature from the features and their gain ratios
        best_feature = max(gain_ratios.items(), key = lambda x: x[1])[0]

        return best_feature

    elif criterion == 'gini':
        if annotate:
            print('using gini')
        return None
    else:
        raise Exception("Invalid criterion!", criterion)

6. Checking whether all samples in a node belong to one class

def intermediate_node_num_mistakes(labels_in_node):
    '''
    Count how many samples belong to the minority class in a tree node; for example, given [1, 1, -1, -1, 1], return 2.

    Parameter
    ----------
    labels_in_node: np.ndarray, pd.Series

    Returns
    ----------
    int: the count

    '''
    # if the array is empty, return 0
    if len(labels_in_node) == 0:
        return 0

    # number of 1s
    # YOUR CODE HERE
    num_of_one = len(labels_in_node[labels_in_node == 1])

    # number of -1s
    # YOUR CODE HERE
    num_of_minus_one = len(labels_in_node[labels_in_node == -1])

    return num_of_one if num_of_minus_one > num_of_one else num_of_minus_one

7. Creating leaf nodes

First we write a helper function, majority_class, that returns the majority class among the samples in a tree node.

def majority_class(labels_in_node):
    '''
    Return the class with more samples in a tree node.
    '''
    # if the array is empty, return 0
    if len(labels_in_node) == 0:
        return 0

    # number of 1s
    # YOUR CODE HERE
    num_of_one = len(labels_in_node[labels_in_node == 1])

    # number of -1s
    # YOUR CODE HERE
    num_of_minus_one = len(labels_in_node[labels_in_node == -1])

    return 1 if num_of_minus_one < num_of_one else -1
def create_leaf(target_values):
    '''
    Work out the label of this leaf node and store the leaf's information in a dict.

    Parameter:
    ----------
    target_values: pd.Series, labels of the samples in this leaf

    Returns:
    ----------
    leaf: dict, representing a leaf node,
          leaf['splitting_feature'], None, a leaf has no splitting feature
          leaf['left'], None, a leaf has no left subtree
          leaf['right'], None, a leaf has no right subtree
          leaf['is_leaf'], True, whether this is a leaf node
          leaf['prediction'], int, the prediction of this leaf
    '''
    # create the leaf node
    leaf = {'splitting_feature' : None,
            'left' : None,
            'right' : None,
            'is_leaf': True}

    # count the -1s and +1s in the node
    num_ones = len(target_values[target_values == +1])
    num_minus_ones = len(target_values[target_values == -1])

    # the leaf's label follows the majority: the class with more samples, stored in leaf['prediction']
    leaf['prediction'] = majority_class(target_values)

    # return the leaf node
    return leaf

8. Building the decision tree recursively

We build the decision tree recursively. The recursion terminates in three cases:

  1. If all samples in a node have the same label, there is nothing left to split; make the node a leaf.
  2. If every feature has already been used, so no remaining feature can split the samples in the node, make it a leaf.
  3. If the node has reached the maximum tree depth we set, make it a leaf.

Pre-pruning essentially adds a fourth stopping condition:

  1. If splitting the current node does not improve the model's generalization performance, do not split.

How do we judge whether generalization improves? We need the validation set.
Just as we recursively split the training data, we also recursively split the validation set while building the tree and measure the tree's accuracy on it.

Because the tree is built recursively, each check of whether validation accuracy improves only involves the samples inside the current node: whether or not we split the current node does not affect the accuracy of its sibling nodes or their descendants.
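
To make the comparison concrete, here is a tiny self-contained toy example of the check (made-up labels; in the real function below the majority classes come from the training slices and the counts from the validation slices):

import numpy as np

eps = np.finfo(np.longdouble).eps

# toy validation labels in the current node, and the two slices a candidate split produces
validation_values = np.array([1, 1, -1, -1, 1])
validation_left, validation_right = validation_values[:3], validation_values[3:]

# accuracy if the node becomes a leaf predicting the (training) majority class, say +1
acc_without_splitting = (validation_values == 1).sum() / (len(validation_values) + eps)

# accuracy if we split and each child predicts its own (training) majority class, say +1 and -1
acc_with_splitting = ((validation_left == 1).sum() + (validation_right == -1).sum()) / (len(validation_values) + eps)

print(acc_without_splitting, acc_with_splitting)
# pre-pruning keeps the split only if acc_with_splitting is not lower than acc_without_splitting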

def decision_tree_create(training_data, validation_data, features, target, criterion = 'gain_ratio', pre_pruning = False, current_depth = 0, max_depth = 10, annotate = False):
    '''
    Parameter:
    ----------
    training_data: pd.DataFrame, the training data

    validation_data: pd.DataFrame, the validation data

    features: iterable, an iterable of feature names

    target: str, name of the label column

    criterion: str, splitting criterion

    pre_pruning: boolean, whether to apply pre-pruning

    current_depth: int, current depth

    max_depth: int, maximum tree depth

    Returns:
    ----------
    dict, dict['is_leaf']           : False, this node is not a leaf
          dict['prediction']        : None, a non-leaf node has no prediction
          dict['splitting_feature'] : splitting_feature, the feature this node splits on
          dict['left']              : dict
          dict['right']             : dict
    '''
    if criterion not in ['information_gain', 'gain_ratio', 'gini']:
        raise Exception("Invalid criterion!", criterion)

    # copy the features; every time a feature is used for a split we remove it
    remaining_features = features[:]

    # take out the label values
    target_values = training_data[target]
    validation_values = validation_data[target]
    print("-" * 50)
    print("Subtree, depth = %s (%s data points)." % (current_depth, len(target_values)))

    # stopping condition 1
    # all samples in this node belong to one class, i.e. the size of the smaller class is 0
    # use intermediate_node_num_mistakes defined above for this check
    # YOUR CODE HERE
    if intermediate_node_num_mistakes(target_values) == 0:
        print("Stopping condition 1 reached.")
        return create_leaf(target_values) # create a leaf node

    # stopping condition 2
    # no features left to split on
    # YOUR CODE HERE
    if len(remaining_features) == 0:
        print("Stopping condition 2 reached.")
        return create_leaf(target_values) # create a leaf node

    # stopping condition 3
    # the maximum depth has been reached
    # YOUR CODE HERE
    if current_depth >= max_depth:
        print("Reached maximum depth. Stopping for now.")
        return create_leaf(target_values) # create a leaf node

    # find the best splitting feature
    # use the best_splitting_feature function
    # YOUR CODE HERE
    splitting_feature = best_splitting_feature(training_data, features, target, criterion, annotate)

    # split the training data into two parts on the chosen feature
    # data for the left subtree
    left_split = training_data[training_data[splitting_feature] == 0]

    # data for the right subtree
    # YOUR CODE HERE
    right_split = training_data[training_data[splitting_feature] == 1]

    # split the validation set on the same feature
    validation_left_split = validation_data[validation_data[splitting_feature] == 0]
    validation_right_split = validation_data[validation_data[splitting_feature] == 1]

    # with pre-pruning, check whether the split improves accuracy on the validation set
    if pre_pruning:
        # first compute the validation accuracy without splitting, i.e. treating this node as a leaf
        # the majority class of the samples in the current node
        true_class = majority_class(target_values)

        # validation accuracy without splitting; eps is added to the denominator because
        # a validation slice may contain no samples after a split
        acc_without_splitting = len(validation_values[validation_values == true_class]) / (len(validation_values) + np.finfo(np.longdouble).eps)

        # if we do split, the majority class of the left subtree
        left_true_class = majority_class(left_split[target])

        # and the majority class of the right subtree
        right_true_class = majority_class(right_split[target])

        # number of validation samples in the left slice that belong to the left subtree's majority class
        vali_left_num_of_majority = len(validation_left_split[validation_left_split[target] == left_true_class])

        # number of validation samples in the right slice that belong to the right subtree's majority class
        vali_right_num_of_majority = len(validation_right_split[validation_right_split[target] == right_true_class])

        # validation accuracy after splitting
        acc_with_splitting = (vali_left_num_of_majority + vali_right_num_of_majority) / (len(validation_data) + np.finfo(np.longdouble).eps)

        if annotate == True:
            print('acc before splitting: %.3f'%(acc_without_splitting))
            print('acc after splitting: %.3f'%(acc_with_splitting))

        # if splitting lowers the validation accuracy, do not split and turn this node into a leaf;
        # otherwise keep splitting and build the left and right subtrees
        if acc_with_splitting < acc_without_splitting:
            print('Pre-Pruning')
            return create_leaf(target_values) # create a leaf node

    # remove this feature from the remaining features
    remaining_features.remove(splitting_feature)

    print("Split on feature %s. (%s, %s)" % (\
        splitting_feature, len(left_split), len(right_split)))

    # if this feature sends all samples into a single subtree, turn that subtree into a leaf directly
    # check whether the left subtree is "perfect"
    if len(left_split) == len(training_data):
        print("Creating leaf node.")
        return create_leaf(left_split[target])

    # check whether the right subtree is "perfect"
    if len(right_split) == len(training_data):
        print("Creating right node.")
        # YOUR CODE HERE
        return create_leaf(right_split[target])

    # recursively build the left subtree, passing in the left slice of the validation set
    left_tree = decision_tree_create(left_split, validation_left_split, remaining_features, target, criterion, pre_pruning, current_depth + 1, max_depth, annotate)

    # recursively build the right subtree, passing in the right slice of the validation set
    ## YOUR CODE HERE
    right_tree = decision_tree_create(right_split, validation_right_split, remaining_features, target, criterion, pre_pruning, current_depth + 1, max_depth, annotate)

    return {'is_leaf'           : False,
            'prediction'        : None,
            'splitting_feature' : splitting_feature,
            'left'              : left_tree,
            'right'             : right_tree}
tree_without_pre_pruning = decision_tree_create(train_data, validation_data, one_hot_features, target, criterion = 'gain_ratio', pre_pruning = False, current_depth = 0, max_depth = 6, annotate = False)
--------------------------------------------------
Subtree, depth = 0 (73564 data points).
Split on feature grade_F. (71229, 2335)
--------------------------------------------------
Subtree, depth = 1 (71229 data points).
Split on feature grade_A. (57869, 13360)
--------------------------------------------------
Subtree, depth = 2 (57869 data points).
Split on feature grade_G. (57232, 637)
--------------------------------------------------
Subtree, depth = 3 (57232 data points).
Split on feature grade_E. (51828, 5404)
--------------------------------------------------
Subtree, depth = 4 (51828 data points).
Split on feature grade_D. (40326, 11502)
--------------------------------------------------
Subtree, depth = 5 (40326 data points).
Split on feature term_ 36 months. (5760, 34566)
--------------------------------------------------
Subtree, depth = 6 (5760 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 6 (34566 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 5 (11502 data points).
Split on feature term_ 36 months. (3315, 8187)
--------------------------------------------------
Subtree, depth = 6 (3315 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 6 (8187 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 4 (5404 data points).
Split on feature term_ 36 months. (3185, 2219)
--------------------------------------------------
Subtree, depth = 5 (3185 data points).
Split on feature home_ownership_OTHER. (3184, 1)
--------------------------------------------------
Subtree, depth = 6 (3184 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 6 (1 data points).
Stopping condition 1 reached.
--------------------------------------------------
Subtree, depth = 5 (2219 data points).
Split on feature emp_length_1 year. (2011, 208)
--------------------------------------------------
Subtree, depth = 6 (2011 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 6 (208 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 3 (637 data points).
Split on feature emp_length_3 years. (590, 47)
--------------------------------------------------
Subtree, depth = 4 (590 data points).
Split on feature emp_length_2 years. (541, 49)
--------------------------------------------------
Subtree, depth = 5 (541 data points).
Split on feature home_ownership_OWN. (495, 46)
--------------------------------------------------
Subtree, depth = 6 (495 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 6 (46 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 5 (49 data points).
Split on feature term_ 36 months. (32, 17)
--------------------------------------------------
Subtree, depth = 6 (32 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 6 (17 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 4 (47 data points).
Split on feature home_ownership_OTHER. (46, 1)
--------------------------------------------------
Subtree, depth = 5 (46 data points).
Split on feature home_ownership_OWN. (44, 2)
--------------------------------------------------
Subtree, depth = 6 (44 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 6 (2 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 5 (1 data points).
Stopping condition 1 reached.
--------------------------------------------------
Subtree, depth = 2 (13360 data points).
Split on feature term_ 36 months. (259, 13101)
--------------------------------------------------
Subtree, depth = 3 (259 data points).
Split on feature emp_length_9 years. (252, 7)
--------------------------------------------------
Subtree, depth = 4 (252 data points).
Split on feature home_ownership_RENT. (202, 50)
--------------------------------------------------
Subtree, depth = 5 (202 data points).
Split on feature emp_length_8 years. (192, 10)
--------------------------------------------------
Subtree, depth = 6 (192 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 6 (10 data points).
Stopping condition 1 reached.
--------------------------------------------------
Subtree, depth = 5 (50 data points).
Split on feature emp_length_4 years. (48, 2)
--------------------------------------------------
Subtree, depth = 6 (48 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 6 (2 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 4 (7 data points).
Stopping condition 1 reached.
--------------------------------------------------
Subtree, depth = 3 (13101 data points).
Split on feature home_ownership_MORTGAGE. (5830, 7271)
--------------------------------------------------
Subtree, depth = 4 (5830 data points).
Split on feature emp_length_7 years. (5592, 238)
--------------------------------------------------
Subtree, depth = 5 (5592 data points).
Split on feature emp_length_3 years. (5045, 547)
--------------------------------------------------
Subtree, depth = 6 (5045 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 6 (547 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 5 (238 data points).
Split on feature home_ownership_OWN. (184, 54)
--------------------------------------------------
Subtree, depth = 6 (184 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 6 (54 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 4 (7271 data points).
Split on feature emp_length_2 years. (6702, 569)
--------------------------------------------------
Subtree, depth = 5 (6702 data points).
Split on feature emp_length_4 years. (6234, 468)
--------------------------------------------------
Subtree, depth = 6 (6234 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 6 (468 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 5 (569 data points).
Split on feature grade_B. (569, 0)
Creating leaf node.
--------------------------------------------------
Subtree, depth = 1 (2335 data points).
Split on feature emp_length_7 years. (2197, 138)
--------------------------------------------------
Subtree, depth = 2 (2197 data points).
Split on feature term_ 36 months. (1719, 478)
--------------------------------------------------
Subtree, depth = 3 (1719 data points).
Split on feature home_ownership_OTHER. (1717, 2)
--------------------------------------------------
Subtree, depth = 4 (1717 data points).
Split on feature emp_length_3 years. (1577, 140)
--------------------------------------------------
Subtree, depth = 5 (1577 data points).
Split on feature home_ownership_RENT. (904, 673)
--------------------------------------------------
Subtree, depth = 6 (904 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 6 (673 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 5 (140 data points).
Split on feature home_ownership_RENT. (73, 67)
--------------------------------------------------
Subtree, depth = 6 (73 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 6 (67 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 4 (2 data points).
Stopping condition 1 reached.
--------------------------------------------------
Subtree, depth = 3 (478 data points).
Split on feature emp_length_8 years. (460, 18)
--------------------------------------------------
Subtree, depth = 4 (460 data points).
Split on feature emp_length_4 years. (433, 27)
--------------------------------------------------
Subtree, depth = 5 (433 data points).
Split on feature home_ownership_MORTGAGE. (287, 146)
--------------------------------------------------
Subtree, depth = 6 (287 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 6 (146 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 5 (27 data points).
Split on feature home_ownership_OWN. (25, 2)
--------------------------------------------------
Subtree, depth = 6 (25 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 6 (2 data points).
Stopping condition 1 reached.
--------------------------------------------------
Subtree, depth = 4 (18 data points).
Split on feature home_ownership_OWN. (17, 1)
--------------------------------------------------
Subtree, depth = 5 (17 data points).
Split on feature home_ownership_MORTGAGE. (11, 6)
--------------------------------------------------
Subtree, depth = 6 (11 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 6 (6 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 5 (1 data points).
Stopping condition 1 reached.
--------------------------------------------------
Subtree, depth = 2 (138 data points).
Split on feature term_ 36 months. (109, 29)
--------------------------------------------------
Subtree, depth = 3 (109 data points).
Split on feature home_ownership_RENT. (51, 58)
--------------------------------------------------
Subtree, depth = 4 (51 data points).
Split on feature home_ownership_MORTGAGE. (8, 43)
--------------------------------------------------
Subtree, depth = 5 (8 data points).
Split on feature grade_A. (8, 0)
Creating leaf node.
--------------------------------------------------
Subtree, depth = 5 (43 data points).
Split on feature grade_A. (43, 0)
Creating leaf node.
--------------------------------------------------
Subtree, depth = 4 (58 data points).
Split on feature grade_A. (58, 0)
Creating leaf node.
--------------------------------------------------
Subtree, depth = 3 (29 data points).
Split on feature home_ownership_OWN. (25, 4)
--------------------------------------------------
Subtree, depth = 4 (25 data points).
Split on feature home_ownership_MORTGAGE. (13, 12)
--------------------------------------------------
Subtree, depth = 5 (13 data points).
Split on feature grade_A. (13, 0)
Creating leaf node.
--------------------------------------------------
Subtree, depth = 5 (12 data points).
Split on feature grade_A. (12, 0)
Creating leaf node.
--------------------------------------------------
Subtree, depth = 4 (4 data points).
Split on feature grade_A. (4, 0)
Creating leaf node.
tree_with_pre_pruning = decision_tree_create(train_data, validation_data, one_hot_features, target, criterion = "gain_ratio", pre_pruning = True, current_depth = 0, max_depth = 6, annotate = False)
--------------------------------------------------
Subtree, depth = 0 (73564 data points).
Split on feature grade_F. (71229, 2335)
--------------------------------------------------
Subtree, depth = 1 (71229 data points).
Split on feature grade_A. (57869, 13360)
--------------------------------------------------
Subtree, depth = 2 (57869 data points).
Split on feature grade_G. (57232, 637)
--------------------------------------------------
Subtree, depth = 3 (57232 data points).
Split on feature grade_E. (51828, 5404)
--------------------------------------------------
Subtree, depth = 4 (51828 data points).
Split on feature grade_D. (40326, 11502)
--------------------------------------------------
Subtree, depth = 5 (40326 data points).
Split on feature term_ 36 months. (5760, 34566)
--------------------------------------------------
Subtree, depth = 6 (5760 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 6 (34566 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 5 (11502 data points).
Split on feature term_ 36 months. (3315, 8187)
--------------------------------------------------
Subtree, depth = 6 (3315 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 6 (8187 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 4 (5404 data points).
Split on feature term_ 36 months. (3185, 2219)
--------------------------------------------------
Subtree, depth = 5 (3185 data points).
Split on feature home_ownership_OTHER. (3184, 1)
--------------------------------------------------
Subtree, depth = 6 (3184 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 6 (1 data points).
Stopping condition 1 reached.
--------------------------------------------------
Subtree, depth = 5 (2219 data points).
Split on feature emp_length_1 year. (2011, 208)
--------------------------------------------------
Subtree, depth = 6 (2011 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 6 (208 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 3 (637 data points).
Split on feature emp_length_3 years. (590, 47)
--------------------------------------------------
Subtree, depth = 4 (590 data points).
Split on feature emp_length_2 years. (541, 49)
--------------------------------------------------
Subtree, depth = 5 (541 data points).
Pre-Pruning
--------------------------------------------------
Subtree, depth = 5 (49 data points).
Split on feature term_ 36 months. (32, 17)
--------------------------------------------------
Subtree, depth = 6 (32 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 6 (17 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 4 (47 data points).
Split on feature home_ownership_OTHER. (46, 1)
--------------------------------------------------
Subtree, depth = 5 (46 data points).
Split on feature home_ownership_OWN. (44, 2)
--------------------------------------------------
Subtree, depth = 6 (44 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 6 (2 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 5 (1 data points).
Stopping condition 1 reached.
--------------------------------------------------
Subtree, depth = 2 (13360 data points).
Split on feature term_ 36 months. (259, 13101)
--------------------------------------------------
Subtree, depth = 3 (259 data points).
Split on feature emp_length_9 years. (252, 7)
--------------------------------------------------
Subtree, depth = 4 (252 data points).
Split on feature home_ownership_RENT. (202, 50)
--------------------------------------------------
Subtree, depth = 5 (202 data points).
Split on feature emp_length_8 years. (192, 10)
--------------------------------------------------
Subtree, depth = 6 (192 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 6 (10 data points).
Stopping condition 1 reached.
--------------------------------------------------
Subtree, depth = 5 (50 data points).
Split on feature emp_length_4 years. (48, 2)
--------------------------------------------------
Subtree, depth = 6 (48 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 6 (2 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 4 (7 data points).
Stopping condition 1 reached.
--------------------------------------------------
Subtree, depth = 3 (13101 data points).
Split on feature home_ownership_MORTGAGE. (5830, 7271)
--------------------------------------------------
Subtree, depth = 4 (5830 data points).
Split on feature emp_length_7 years. (5592, 238)
--------------------------------------------------
Subtree, depth = 5 (5592 data points).
Split on feature emp_length_3 years. (5045, 547)
--------------------------------------------------
Subtree, depth = 6 (5045 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 6 (547 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 5 (238 data points).
Split on feature home_ownership_OWN. (184, 54)
--------------------------------------------------
Subtree, depth = 6 (184 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 6 (54 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 4 (7271 data points).
Split on feature emp_length_2 years. (6702, 569)
--------------------------------------------------
Subtree, depth = 5 (6702 data points).
Split on feature emp_length_4 years. (6234, 468)
--------------------------------------------------
Subtree, depth = 6 (6234 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 6 (468 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 5 (569 data points).
Split on feature grade_B. (569, 0)
Creating leaf node.
--------------------------------------------------
Subtree, depth = 1 (2335 data points).
Split on feature emp_length_7 years. (2197, 138)
--------------------------------------------------
Subtree, depth = 2 (2197 data points).
Split on feature term_ 36 months. (1719, 478)
--------------------------------------------------
Subtree, depth = 3 (1719 data points).
Split on feature home_ownership_OTHER. (1717, 2)
--------------------------------------------------
Subtree, depth = 4 (1717 data points).
Split on feature emp_length_3 years. (1577, 140)
--------------------------------------------------
Subtree, depth = 5 (1577 data points).
Split on feature home_ownership_RENT. (904, 673)
--------------------------------------------------
Subtree, depth = 6 (904 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 6 (673 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 5 (140 data points).
Split on feature home_ownership_RENT. (73, 67)
--------------------------------------------------
Subtree, depth = 6 (73 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 6 (67 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 4 (2 data points).
Stopping condition 1 reached.
--------------------------------------------------
Subtree, depth = 3 (478 data points).
Split on feature emp_length_8 years. (460, 18)
--------------------------------------------------
Subtree, depth = 4 (460 data points).
Pre-Pruning
--------------------------------------------------
Subtree, depth = 4 (18 data points).
Split on feature home_ownership_OWN. (17, 1)
--------------------------------------------------
Subtree, depth = 5 (17 data points).
Split on feature home_ownership_MORTGAGE. (11, 6)
--------------------------------------------------
Subtree, depth = 6 (11 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 6 (6 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 5 (1 data points).
Stopping condition 1 reached.
--------------------------------------------------
Subtree, depth = 2 (138 data points).
Pre-Pruning
def classify(tree, x, annotate = False):
    '''
    Predict recursively; classifies one sample at a time.

    Parameters
    ----------
    tree: dict, the decision tree model

    x: pd.Series, the sample to predict

    annotate: boolean, whether to print the decision path

    Returns
    ----------
    The predicted label
    '''
    if tree['is_leaf']:
        if annotate:
            print("At leaf, predicting %s" % tree['prediction'])
        return tree['prediction']
    else:
        split_feature_value = x[tree['splitting_feature']]
        if annotate:
            print("Split on %s = %s" % (tree['splitting_feature'], split_feature_value))
        # the one-hot features are 0/1: 0 goes to the left subtree, 1 to the right
        if split_feature_value == 0:
            return classify(tree['left'], x, annotate)
        else:
            return classify(tree['right'], x, annotate)
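As a quick sanity check, classify can be run on a single row of the test set with annotate=True to print the path taken through the tree. A minimal usage sketch, assuming tree_without_pre_pruning, test_data and target are already defined as above:

# usage sketch: trace the prediction path for the first test sample
sample = test_data.iloc[0]
pred = classify(tree_without_pre_pruning, sample, annotate=True)
print('predicted:', pred, 'actual:', sample[target])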
def predict(tree, data):
    '''
    Iterate over the rows of data, predict each sample, and return the
    predictions as an np.ndarray.

    Parameters
    ----------
    tree: dict, the decision tree model

    data: pd.DataFrame, the samples to predict

    Returns
    ----------
    predictions: np.ndarray, the model's predictions for these samples
    '''
    predictions = np.zeros(len(data))

    # YOUR CODE HERE
    # classify each row of the DataFrame and store the predicted label
    for i in range(len(data)):
        predictions[i] = classify(tree, data.iloc[i])

    return predictions
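Indexing with iloc inside a Python loop is slow on large DataFrames. An equivalent alternative (just a sketch; predict_apply is a hypothetical name, not part of the assignment) applies classify to each row instead:

# sketch: apply classify row by row (axis=1) and return the results as a NumPy array
def predict_apply(tree, data):
    return data.apply(lambda row: classify(tree, row), axis=1).values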
from sklearn.metrics import accuracy_score

Accuracy of the decision tree without pre-pruning on the test set:

accuracy_score(test_data[target], predict(tree_without_pre_pruning, test_data))
0.80882472881494172

Accuracy of the decision tree with pre-pruning on the test set:

accuracy_score(test_data[target], predict(tree_with_pre_pruning, test_data))
0.80939564472718373

Below, compute the accuracy, precision, recall and F1 score of the decision trees with and without pre-pruning on the test set (max depth 6, using the gain ratio criterion); round the results to 4 decimal places.

# YOUR CODE HERE
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score

# sklearn's metric functions expect the arguments in (y_true, y_pred) order
print(precision_score(test_data[target], predict(tree_without_pre_pruning, test_data)))
print(precision_score(test_data[target], predict(tree_with_pre_pruning, test_data)))
print(recall_score(test_data[target], predict(tree_without_pre_pruning, test_data)))
print(recall_score(test_data[target], predict(tree_with_pre_pruning, test_data)))
print(f1_score(test_data[target], predict(tree_without_pre_pruning, test_data)))
print(f1_score(test_data[target], predict(tree_with_pre_pruning, test_data)))
0.809739755689
0.809494677597
0.998438365825
0.999848874112
0.894242916441
0.894658553076
| Model | Accuracy | Precision | Recall | F1 |
| --- | --- | --- | --- | --- |
| With pre-pruning | 0.8094 | 0.8095 | 0.9998 | 0.8947 |
| Without pre-pruning | 0.8088 | 0.8097 | 0.9984 | 0.8942 |
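To keep the table reproducible in one place, the four metrics for a tree can be wrapped in a small helper that also does the rounding. This is only a sketch (evaluate is a hypothetical name; it reuses predict and the sklearn metrics with the usual (y_true, y_pred) argument order):

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def evaluate(tree, data):
    # hypothetical helper: accuracy / precision / recall / F1 on data, rounded to 4 decimals
    y_true = data[target]
    y_pred = predict(tree, data)
    return {'accuracy':  round(accuracy_score(y_true, y_pred), 4),
            'precision': round(precision_score(y_true, y_pred), 4),
            'recall':    round(recall_score(y_true, y_pred), 4),
            'f1':        round(f1_score(y_true, y_pred), 4)}

print(evaluate(tree_without_pre_pruning, test_data))
print(evaluate(tree_with_pre_pruning, test_data))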
from pyecharts import Tree
def generate_echarts_data(tree):

    # dict describing the current node
    value = dict()

    # if the subtree passed in is already a leaf node
    if tree['is_leaf']:

        # its value is the predicted label
        value['value'] = tree['prediction']

        # and its name is "label: <prediction>"
        value['name'] = 'label: %s' % (tree['prediction'])

        # return this dict directly
        return value

    # otherwise the node is named after its splitting feature, and its children
    # are the recursively generated left and right subtrees
    value['name'] = tree['splitting_feature']
    value['children'] = [generate_echarts_data(tree['left']), generate_echarts_data(tree['right'])]
    return value
data1 = generate_echarts_data(tree_without_pre_pruning)
tree = Tree(width=800, height=800)
tree.add("",
         [data1],
         tree_collapse_interval=5,
         tree_top="15%",
         tree_right="20%",
         tree_symbol='rect',
         tree_symbol_size=20,
         )
tree.render()
tree
data2 = generate_echarts_data(tree_with_pre_pruning)
tree = Tree(width=800, height=800)
tree.add("",
         [data2],
         tree_collapse_interval=5,
         tree_top="15%",
         tree_right="20%",
         tree_symbol='rect',
         tree_symbol_size=20,
         )
tree.render()
tree
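Note that from pyecharts import Tree above is the pyecharts 0.x API. On pyecharts 1.x the chart classes moved to pyecharts.charts and the keyword arguments changed; a rough, untested sketch of the equivalent rendering (parameter names taken from the 1.x Tree examples, adjust for your version):

from pyecharts.charts import Tree
from pyecharts import options as opts

# sketch only: render the same nested dict with the pyecharts 1.x API
chart = (
    Tree(init_opts=opts.InitOpts(width="800px", height="800px"))
    .add("", [data1], collapse_interval=5)
    .set_global_opts(title_opts=opts.TitleOpts(title="Decision tree without pre-pruning"))
)
chart.render("decision_tree.html")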