Building a Scalable Machine Learning Scheme Using Apache Hive

Makoto Yui, Isao Kojima

We have released Hivemall, a machine learning library that runs on Apache Hive, as open source software. Hivemall is one of the most scalable open source machine learning frameworks with respect to data volume, and it has been attracting attention from developers and data scientists familiar with Hadoop/Hive because it efficiently handles machine learning over data stored in the Hadoop Distributed File System (HDFS). In this paper, we present practical findings obtained in achieving scalable machine learning with Hivemall and explain how it is implemented. We conducted a series of experimental evaluations on the advertisement click-through rate prediction task of KDD Cup 2012, Track 2, which uses a commercial advertisement dataset. The results show that, compared with state-of-the-art machine learning frameworks known for their training speed, Hivemall achieves equal or better prediction accuracy with shorter training time. We also show that training time can be reduced by adding computing nodes.
