Multi-class classification experiments with sklearn and xgboost

xiaoxiao2021-03-25 140

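I. Background

Multi-class classification is a common machine learning task. This article runs multi-class classification experiments with sklearn and xgboost on the Fudan University Chinese text classification corpus.

Environment: Linux, Fedora 23

Prerequisite packages:

1. Jieba word segmentation:

Install from source (https://github.com/fxsjy/jieba), or install with pip install jieba

2. Sklearn:

On Fedora, see: http://www.centoscn.com/image-text/install/2014/0403/2715.html

3. xgboost:

https://github.com/dmlc/xgboost.git

After building it, go into the python-package directory and run: python setup.py install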

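II. LCCT code and dataset

1. Code

LCCT (Learning to Classify Chinese Text)

git clone https://github.com/jaylenzhang/lcct.git

2. Dataset

The corpus is provided by Li Ronglu of Fudan University. fudan_test.json is the test corpus, with 9833 documents; fudan_train.json is the training corpus, with 9804 documents. The documents fall into 20 categories, and the training and test corpora are split at roughly 1:1. This article merges the training and test data and runs 5-fold cross-validation on the combined set. Collecting the corpus took considerable effort and resources, so please credit the source when you use it (Natural Language Processing Group, International Database Center, Department of Computer Information and Technology, Fudan University).

Baidu Cloud: http://pan.baidu.com/s/1qYjk0Ni (password: dhs7)

After downloading the dataset, create a data directory under the current directory and extract the files into it.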

Dataset category statistics

Category        Documents
Economy         3200
Computer        2714
Sports          2506
Enviornment     2434
Politics        2048
Agriculture     2042
Art             1480
Space           1280
History         932
Military        148
Education       118
Transport       114
Law             102
Medical         102
Philosophy      (missing)
Literature      (missing)
Mine            (missing)
Energy          (missing)
Electronics     (missing)
Communication   (missing)
Total           19608

III. Configuration file

[basic_conf]
work_dir=/home/disk1/zhangzhiming/code/ml/lcct/ # current working directory; change this to your own path
word_min_count=4 # words whose frequency in the corpus is below 4 are dropped
label2clsName=label.clsName # <0, Art>
id2docName=id.docName
word2id=word.id
word2idf=word.idf
tfidf_svm=tfid.svm
train_test_dir=train_test_dir

[pre_process]
file_tag=classify
json_data=classify.json
wordseg_data=classify.seg
vocab_data=classify.voc
tfidf_data=classify.tfidf
train_test_dir=train_test_dir
train_test_rate=0.5
cross_validation_num=5

[feature_selection]
is_chi2=0
chi2_best_k=10000
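As a rough illustration of how a configuration like the one above could be read, here is a minimal sketch using Python's standard configparser module. It is not the loader used in LCCT: the CONF_TEXT string is a shortened copy of the configuration, load_conf is a hypothetical helper, and treating "#" as an inline comment marker is an assumption about how the file is meant to be parsed.

```python
import configparser

# Shortened copy of the configuration shown above (illustration only).
CONF_TEXT = """
[basic_conf]
work_dir=/home/disk1/zhangzhiming/code/ml/lcct/  # needs to be changed
word_min_count=4  # rare words are dropped

[pre_process]
train_test_rate=0.5
cross_validation_num=5

[feature_selection]
is_chi2=0
chi2_best_k=10000
"""


def load_conf(text):
    """Parse INI text, stripping inline '#' comments (hypothetical helper)."""
    parser = configparser.ConfigParser(inline_comment_prefixes=("#",))
    parser.read_string(text)
    return parser


conf = load_conf(CONF_TEXT)
work_dir = conf.get("basic_conf", "work_dir")
word_min_count = conf.getint("basic_conf", "word_min_count")
cv_num = conf.getint("pre_process", "cross_validation_num")
# is_chi2 is stored as 0/1, which getboolean accepts.
is_chi2 = conf.getboolean("feature_selection", "is_chi2")
chi2_best_k = conf.getint("feature_selection", "chi2_best_k")
```

Reading the values through getint and getboolean keeps a string such as "0" from being treated as true by a later `if` check.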

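IV. Preprocessing

cd ./scripts

Run: sh ./run_pre_process.sh

1. Word segmentation, using Jieba.

2. Build the dictionary: remove stop words, filter out words with a frequency of 2 or less, and generate the dictionary.

3. Compute idf.

4. Compute tf and tf-idf.

6. Generate data in libsvm format.

1) Resampling: sample each class up to a cap set by the size of the largest class in the training data.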
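The preprocessing steps above can be sketched in plain Python. This is an illustration under stated assumptions, not the LCCT implementation: documents are assumed to be already segmented into word lists (for example with jieba.lcut), tf is taken as count divided by document length, idf as log(N / df), words are kept when their count is at least min_count, and resampling is read as random oversampling up to the size of the largest class. All function names are hypothetical.

```python
import math
import random
from collections import Counter


def build_vocab(docs, stopwords, min_count):
    """Map each kept word to a 1-based id; drop stop words and rare words."""
    counts = Counter(w for doc in docs for w in doc if w not in stopwords)
    kept = sorted(w for w, c in counts.items() if c >= min_count)
    return {w: i + 1 for i, w in enumerate(kept)}


def compute_idf(docs, vocab):
    """idf(w) = log(N / df(w)), where df counts documents containing w."""
    df = Counter(w for doc in docs for w in set(doc) if w in vocab)
    n_docs = len(docs)
    return {w: math.log(n_docs / df[w]) for w in vocab}


def tfidf_vector(doc, vocab, idf):
    """Sparse {feature_id: tf * idf} with tf = count / number of kept words."""
    counts = Counter(w for w in doc if w in vocab)
    total = sum(counts.values())
    return {vocab[w]: (c / total) * idf[w] for w, c in counts.items()}


def to_libsvm_line(label, vector):
    """Format one sample as '<label> <id>:<value> ...' with ascending ids."""
    feats = " ".join(f"{i}:{v:.6f}" for i, v in sorted(vector.items()))
    return f"{label} {feats}".rstrip()


def oversample(samples, seed=0):
    """Randomly duplicate samples until every class matches the largest one."""
    rng = random.Random(seed)
    by_label = {}
    for label, vec in samples:
        by_label.setdefault(label, []).append((label, vec))
    target = max(len(group) for group in by_label.values())
    out = []
    for group in by_label.values():
        out.extend(group)
        out.extend(rng.choice(group) for _ in range(target - len(group)))
    return out


# Toy, already-segmented documents standing in for the corpus.
docs = [
    ["stock", "market", "rise"],
    ["football", "match", "market"],
    ["football", "match", "goal"],
]
labels = [0, 1, 1]
vocab = build_vocab(docs, stopwords=set(), min_count=2)
idf = compute_idf(docs, vocab)
samples = [(y, tfidf_vector(d, vocab, idf)) for y, d in zip(labels, docs)]
resampled = oversample(samples)
lines = [to_libsvm_line(y, v) for y, v in resampled]
```

Each entry of lines is one row of a libsvm-format file, which sklearn.datasets.load_svmlight_file can read back as a sparse matrix.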

V. Classification experiments

1. Classifiers

The classifiers tried are naive Bayes, SGDClassifier, logistic regression, random_forest, decision_tree, and xgboost, among others.
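A minimal sketch of how the sklearn classifiers listed above could be instantiated and timed. The short names, the hyperparameters, and the toy data are assumptions for illustration, not the settings used in LCCT. xgboost's sklearn-style wrapper, xgboost.XGBClassifier, could be added to the same dictionary; it is left out here so the sketch runs with sklearn alone.

```python
import time

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier


def make_classifiers(seed=0):
    """Return name -> unfitted estimator; all hyperparameters are guesses."""
    return {
        "NB": MultinomialNB(),
        "SGD": SGDClassifier(random_state=seed),
        "LR": LogisticRegression(max_iter=1000),
        "RF": RandomForestClassifier(n_estimators=50, random_state=seed),
        "DT": DecisionTreeClassifier(random_state=seed),
    }


def fit_timed(clf, train_x, train_y):
    """Fit clf and return the wall-clock training time in seconds."""
    start = time.time()
    clf.fit(train_x, train_y)
    return time.time() - start


# Toy non-negative features (MultinomialNB rejects negative values):
# the column whose index equals the label is boosted for each sample.
rng = np.random.default_rng(0)
y = np.arange(60) % 3
X = rng.random((60, 8))
X[np.arange(60), y] += 2.0

train_times = {name: fit_timed(clf, X, y)
               for name, clf in make_classifiers().items()}
```

In the real pipeline the feature matrix would be the sparse tf-idf matrix loaded from the libsvm files rather than a dense toy array.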

2. Running

cd ./scripts

sh run_train_and_test.sh

3. Results

5-fold cross-validation is used.

classifier  acc    precision  recall  f1-score   train_time(s)
XGBOOST     0.959  0.96       0.959   (missing)  588.17
(missing)   0.821  0.873      0.821   0.841      0.577
SGD         0.925  0.936      0.925   0.929      9.393
(missing)   0.939  0.94       0.939   (missing)  55.869
(missing)   0.913  0.918      0.913   0.912      5.367
(missing)   0.894  0.898      0.894   0.895      20.95
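The metrics reported above could be computed along the following lines. This is a sketch under assumptions, not the LCCT evaluation code: folds are stratified, and precision, recall, and f1 are support-weighted averages over classes. The weighted choice is a guess, but it is consistent with recall matching accuracy in every row of the results, since support-weighted recall always equals accuracy. cross_validate_clf is a hypothetical helper and the data is synthetic.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import StratifiedKFold


def cross_validate_clf(make_clf, X, y, n_splits=5, seed=0):
    """Average acc / precision / recall / f1 over stratified folds."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    rows = []
    for train_idx, test_idx in skf.split(X, y):
        clf = make_clf()  # fresh, unfitted model for every fold
        clf.fit(X[train_idx], y[train_idx])
        pred = clf.predict(X[test_idx])
        acc = accuracy_score(y[test_idx], pred)
        p, r, f1, _ = precision_recall_fscore_support(
            y[test_idx], pred, average="weighted", zero_division=0)
        rows.append((acc, p, r, f1))
    means = np.mean(rows, axis=0)
    return dict(zip(("acc", "precision", "recall", "f1"), means))


# Synthetic 4-class data standing in for the tf-idf features.
rng = np.random.default_rng(0)
y = np.arange(100) % 4
X = rng.random((100, 6))
X[np.arange(100), y] += 2.0

scores = cross_validate_clf(lambda: SGDClassifier(random_state=0), X, y)
```

Passing a factory instead of a fitted estimator keeps the folds independent, because no model state carries over from one fold to the next.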

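VI. Feature engineering

1. Common feature engineering methods in sklearn:

Reference: 使用sklearn做单机特征工程 (Feature engineering on a single machine with sklearn)

Chi-square feature selection is tried here. It is controlled from the configuration file:

[feature_selection]
is_chi2=0
chi2_best_k=10000

2. Code: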

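Chi-square test

from sklearn.feature_selection import SelectKBest, chi2

def choose_best_chi2(self, train_x, train_y, test_x):
    # Fit the selector on the training set only, then apply it to the test set
    ch2 = SelectKBest(chi2, k=self.chi2_best_k)
    train_x_ch2 = ch2.fit_transform(train_x, train_y)
    test_x_ch2 = ch2.transform(test_x)
    return train_x_ch2, test_x_ch2

Call site:

if self.is_chi2:
    train_x, test_x = self.choose_best_chi2(train_x, train_y, test_x)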
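For readers who want to try the selection step outside the class, here is a self-contained sketch of the same idea on a toy sparse matrix. The standalone function signature, the extra return value with the kept column indices, and the data are assumptions made for illustration.

```python
from scipy.sparse import csr_matrix
from sklearn.feature_selection import SelectKBest, chi2


def choose_best_chi2(train_x, train_y, test_x, k):
    """Fit chi-square selection on training data only, then apply to both."""
    selector = SelectKBest(chi2, k=k)
    train_x_ch2 = selector.fit_transform(train_x, train_y)
    test_x_ch2 = selector.transform(test_x)
    return train_x_ch2, test_x_ch2, selector.get_support(indices=True)


# Toy term counts: columns 0 and 1 depend on the label, column 2 is
# constant across classes and carries no information.
train_x = csr_matrix([[3, 0, 1],
                      [2, 0, 1],
                      [0, 4, 1],
                      [0, 3, 1]])
train_y = [0, 0, 1, 1]
test_x = csr_matrix([[1, 0, 1]])

train_sel, test_sel, kept = choose_best_chi2(train_x, train_y, test_x, k=2)
```

chi2 requires non-negative features, which term counts and tf-idf weights satisfy. Fitting on the training fold only keeps test labels from influencing which features are selected.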

3. Results

classifier  acc    precision  recall  f1-score   train_time(s)
XGBOOST     0.957  0.959      0.957   0.958      161.932
(missing)   0.77   0.824      0.77    0.788      0.206
SGD         0.925  0.934      0.925   0.928      3.586
(missing)   0.929  0.93       0.929   (missing)  37.192
(missing)   0.936  0.939      0.936   (missing)  3.315
(missing)   0.895  0.898      0.895   0.896      8.588

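References:

1. Li Ronglu, Wang Jianhui, Chen Xiaoyun, Tao Xiaopeng, Hu Yunfa. "使用最大熵模型进行中文文本分类" (Chinese text classification using a maximum entropy model). 计算机研究与发展 (Journal of Computer Research and Development) 42.1 (2005): 94-101.

2. classifer.py is based on: http://blog.csdn.net/zouxy09/article/details/48903179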

When reposting, please credit the original URL: https://ju.6miu.com/read-6730.html
