Parallel Forestry Text Classification Technology Based on XGBoost in Spark Framework

doi:10.6041/j.issn.1000-1298.2019.06.032

2019, Vol. 50, No. 6, pp. 280-287

Funding: National Natural Science Foundation of China (61772078) and the Beijing Forestry University Hot Spot Tracking Project (2018BLRD18)


Abstract (translated from the Chinese):

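The convergence of "Internet Plus" technology and forestry has produced a large volume of forestry-related text waiting to be mined, yet research on forestry text classification remains immature. To address this, web crawlers were used to collect forestry-related text from the internet, the classification labels were rebuilt from the resulting corpus, and an XGBoost parallelization method based on the Spark computing framework was proposed to classify forestry text. Under cross-validation, the parallel XGBoost classifier reached an accuracy of 0.9234, with per-class F1 ranging from 0.8604 to 0.9984. Its training speedup ratios on datasets of 21,000, 42,000, and 84,000 records were 2.13, 3.47, and 3.82, respectively. The results show that the classification model built on these labels adapts well to the forestry-related text currently found on the internet. The XGBoost parallel algorithm implemented on Spark was significantly more accurate than parallel versions of four other machine learning algorithms (naive Bayes, GBDT, BP neural network, and ELM neural network), ran far faster than the single-machine version, and achieved a higher speedup ratio as the data volume grew, so it can handle real-time, accurate classification of massive forestry text.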
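The speedup figures above can be read with a small sketch. It assumes the usual definition of speedup (single-machine training time divided by parallel training time), which the abstract does not spell out. The function names are hypothetical and the timings in the example call are invented for illustration; only the three reported ratios come from the abstract.

```python
# Speedup ratio under the common definition: single-machine time / parallel time.
# ASSUMPTION: the abstract does not state its definition; this is the usual one.

def speedup(t_single: float, t_parallel: float) -> float:
    """Return how many times faster the parallel run is than the single-machine run."""
    if t_single <= 0 or t_parallel <= 0:
        raise ValueError("timings must be positive")
    return t_single / t_parallel


# Speedup ratios reported in the abstract, keyed by number of training records.
REPORTED_SPEEDUP = {21_000: 2.13, 42_000: 3.47, 84_000: 3.82}


def is_monotonic_increasing(speedups: dict) -> bool:
    """Check that the speedup rises as the dataset size grows."""
    values = [speedups[n] for n in sorted(speedups)]
    return all(a < b for a, b in zip(values, values[1:]))


if __name__ == "__main__":
    # Hypothetical timings in seconds, not taken from the paper.
    print(speedup(120.0, 40.0))
    print(is_monotonic_increasing(REPORTED_SPEEDUP))
```

The monotonicity check restates the abstract's claim that the speedup ratio grows with data volume, using only the three reported values.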

Abstract:

The convergence of computer technology and forestry has produced a large number of forestry texts waiting to be mined. Existing research falls short in two respects. First, the classification labels in existing classification systems are not set scientifically, so the resulting models are poorly suited to classifying texts found on the internet. Second, classification algorithms are mostly trained on a single machine without regard to parallelism, so they cannot cope with large-scale classification problems in practice. It is therefore both practical and urgent to design more scientific classification labels and to classify forestry texts on the Spark framework. A new crawler was used to collect forestry-related texts, and the labels were reconstructed with reference to the existing forestry information retrieval system to improve the adaptability of the classification models. A parallel implementation of XGBoost was then built on Spark, with training and prediction carried out in the RDD programming model. Under cross-validation, the accuracy of the parallel XGBoost algorithm reached 0.9234; the lowest per-class F1-measure was 0.8604 and the highest was 0.9984. Trained on datasets of 21,000, 42,000, and 84,000 records, it achieved speedup ratios of 2.13, 3.47, and 3.82, respectively. The results show that the new classification labels are more scientific and that the system adapts better to forestry-related texts on the internet. The precision and recall of the XGBoost algorithm were significantly better than those of four other Spark-based parallel algorithms (naive Bayes, gradient boosting decision tree, back-propagation neural network, and extreme learning machine), and it ran more efficiently than the stand-alone version. The speedup ratio also increased with the amount of data, which makes the method useful for real-time, accurate classification of massive forestry texts.
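To make the reported metrics concrete, here is a minimal sketch of how accuracy and per-class precision, recall, and F1 are conventionally computed from true and predicted labels. It is not the authors' evaluation code: the function name `per_class_scores` and the toy labels (`pest`, `fire`, `policy`) are hypothetical and do not reflect the paper's label set or data.

```python
from collections import Counter


def per_class_scores(y_true, y_pred):
    """Return (accuracy, {label: (precision, recall, f1)}) for two equal-length label lists."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and the same length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # class p was predicted, but wrongly
            fn[t] += 1  # true class t was missed
    scores = {}
    for label in set(y_true) | set(y_pred):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (precision, recall, f1)
    accuracy = sum(tp.values()) / len(y_true)
    return accuracy, scores


if __name__ == "__main__":
    # Toy labels invented for illustration only.
    y_true = ["pest", "pest", "fire", "fire", "policy", "policy"]
    y_pred = ["pest", "fire", "fire", "fire", "policy", "policy"]
    accuracy, scores = per_class_scores(y_true, y_pred)
    print(accuracy)
    for label in sorted(scores):
        print(label, scores[label])
```

Reporting the lowest and highest per-class F1 alongside overall accuracy, as the abstract does, shows whether a high accuracy hides a weak class.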

Cite this article

CUI Xiaohui, SHI Dongyu, CHEN Zhibo, XU Fu. Parallel Forestry Text Classification Technology Based on XGBoost in Spark Framework[J]. Transactions of the Chinese Society for Agricultural Machinery, 2019, 50(6): 280-287.

History

Received: 2019-03-02
Published online: 2019-06-10
