Evaluation of Checkpointing Mechanism on SCore Cluster System

Abstract

Cluster systems are becoming widely used because of their good performance/cost ratio. However, their reliability in practical environments has not been well discussed so far. As the number of commodity components in a cluster system increases, it becomes indispensable to support reliability in system software. SCore cluster system software is a parallel programming environment for High Performance Computing (HPC). SCore provides checkpointing and rollback-recovery mechanisms for high availability. In this paper, we analyze and evaluate the checkpointing and rollback-recovery mechanisms of SCore quantitatively. The experimental results reveal that the time required for checkpointing scales very well with respect to the number of computing nodes. However, the required time is quite long because of the low effective network bandwidth. Based on the results, we modify SCore and successfully make checkpointing 1.8 to 2.8 times faster and recovery 3.7 to 5.0 times faster. This helps cluster systems achieve both high performance and high availability.

Journal

  • IEICE transactions on information and systems

    IEICE transactions on information and systems 86(12), 2553-2562, 2003-12-01

    The Institute of Electronics, Information and Communication Engineers (IEICE)

Codes

  • NII Article ID (NAID)
    110003213678
  • NII Bibliographic ID (NCID)
    AA10826272
  • Text language code
    ENG
  • Material type
    ART
  • ISSN
    09168532
  • Data source
    CJP bibliographic data, CJP citations, NII-ELS