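Distributed Algorithms for Sequential Pattern Mining on Spark - National Dong Hwa University Thesis and Dissertation Full-Text System (國立東華大學博碩士論文全文影像系統)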

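Author:	胡哲嘉
Author (English):	Che-Chia Hu
Title:	Spark架構上之分散式循序樣式探勘演算法研究
Title (English):	Distributed Algorithms for Sequential Pattern Mining on Spark
Advisor:	吳秀陽
Advisor (English):	Shiow-yang Wu
Oral defense committee:	孫宗瀛、張耀中
Oral defense committee (English):	Tsung-Ying Sun, Yao-Chung Chang
Degree:	Master's
Institution:	National Dong Hwa University (國立東華大學)
Department:	Department of Computer Science and Information Engineering (資訊工程學系)
Student ID:	610321243
Year of publication (ROC calendar):	108 (2019)
Academic year of graduation:	108
Language:	Chinese
Number of pages:	40
Keywords:	sequential pattern mining, MapReduce, Hadoop, Spark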

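Sequential pattern mining has been developed in the field of data mining for many years; its goal is to discover whether the ordering among raw data items exhibits sequential regularities. In recent years data volumes have grown ever faster, and because distributed computing can complete enormous workloads at relatively low cost, its importance increases by the day. The introduction of MapReduce further simplified distributed computing. Hadoop clusters apply the MapReduce architecture to compute over large-scale datasets and are currently the most widely used architecture. Our laboratory has previously conducted sequential pattern mining research on large-scale datasets using Hadoop as the computing architecture. Computer memory is now cheaper than in the past. In view of this, this thesis improves the architecture used previously and performs sequential pattern mining on a more efficient platform. Moreover, because Hadoop relies on disk I/O when executing distributed computations, its speed is increasingly unable to cope with frequently updated databases. Many improvements have therefore been proposed, and Spark is one of them. Spark is a distributed computing architecture similar to MapReduce; unlike Hadoop, Spark computes in memory, which eliminates a large amount of I/O work, and it is officially claimed to be at least 10 times faster than Hadoop. Another goal of this thesis is to exploit the characteristics of the Spark computing architecture to implement sequential pattern mining algorithms and to verify their performance on a Hadoop cluster.
In the research on distributed sequential pattern mining algorithms on Spark proposed in this thesis, we rewrite several earlier sequential pattern mining algorithms in the Scala programming language and then compare their performance on a Spark cluster.
To verify how the algorithms perform on Spark, we also build a Spark cluster for the experiments and use the data generator provided by IBM (IBM Quest Synthetic Data Generator) to produce large-scale synthetic datasets. The experimental results show that sequential pattern mining on a Spark cluster performs better than the MapReduce-based mining we used in the past.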

Sequential pattern mining has been studied in data mining research for years. Its purpose is to discover sequential patterns in large datasets. Because datasets have been growing ever faster in recent years, distributed computing architectures for handling large datasets are becoming increasingly important. MapReduce is a programming model that greatly simplifies distributed computing tasks. Hadoop is a widely used distributed computing architecture based on the MapReduce framework. Our laboratory previously developed algorithms for sequential pattern mining with Hadoop MapReduce. Computer memory is much cheaper today than in the past. In view of this, this thesis improves on the architecture used previously and performs sequential pattern mining on a more efficient platform. In addition, Hadoop distributed computing depends heavily on disk I/O, so programs running on top of it cannot handle frequent database updates gracefully. Much research has therefore targeted this deficiency, and Spark is one result. Spark is similar to MapReduce but employs in-memory computing to reduce disk I/O. It is reported that Spark runs at least 10 times faster than Hadoop MapReduce. We therefore implement our algorithms on Spark.
We develop distributed algorithms for sequential pattern mining on Spark: we rewrite several sequential pattern mining algorithms in the Scala programming language and then compare their performance on a Spark cluster.
To verify the performance of our sequential pattern mining algorithms on Spark, we implement them on a Spark cluster and conduct extensive experiments. We use the IBM Quest Synthetic Data Generator to generate large datasets for our experiments. Experimental results show that sequential pattern mining on a Spark cluster is more efficient than the MapReduce-based mining we used in the past.
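
To make the notion of a frequent sequential pattern concrete, the following is a minimal sketch in plain Python. It is not the thesis's Scala/Spark implementation, which this record does not reproduce. It assumes a simplified model in which every sequence is a list of single items and a sequence supports a pattern when the pattern occurs in it in order, with gaps allowed. The function names, the toy database `db` and the `min_support` threshold are all hypothetical. The counting loop loosely mirrors a map step (emit candidates per sequence) followed by a reduce step (sum counts per candidate).

```python
from collections import Counter
from itertools import combinations


def is_subsequence(pattern, sequence):
    """Return True if `pattern` occurs in `sequence` in order (gaps allowed)."""
    it = iter(sequence)
    # `item in it` advances the iterator, so matches must appear in order.
    return all(item in it for item in pattern)


def candidates(sequence, length):
    """'Map' step: the distinct ordered subsequences of one sequence."""
    # combinations() keeps the original order of positions; the set makes
    # sure each sequence counts a given candidate at most once.
    return set(combinations(sequence, length))


def frequent_patterns(sequences, length, min_support):
    """'Reduce' step: sum per-candidate counts and keep the frequent ones."""
    counts = Counter()
    for seq in sequences:
        counts.update(candidates(seq, length))
    return {p: c for p, c in counts.items() if c >= min_support}


# Hypothetical toy database: each row is one ordered sequence of items.
db = [
    ["a", "b", "c"],
    ["a", "c"],
    ["b", "a", "c"],
    ["a", "b"],
]
result = frequent_patterns(db, length=2, min_support=3)
print(result)
```

Enumerating every subsequence grows exponentially with sequence length, so this is only suitable for illustration on tiny inputs; it does not reflect how the algorithms compared in the thesis generate or prune candidates.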

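Chapter 1 Introduction
1.1 Research Motivation and Objectives
1.2 Research Methods and Results
1.3 Thesis Organization
Chapter 2 Related Technologies and Research
2.1 MapReduce
2.2 Hadoop
2.3 Spark
2.4 Scala
2.5 Sequential Pattern Mining
2.5.1 The PrefixSpan Algorithm
2.6 Our Laboratory's Research on Sequential Pattern Mining
2.6.1 The One-Phase Algorithm
2.6.2 The Sequence Growth Algorithm
Chapter 3 Research Methods and Algorithms
3.1 Research Methods
3.2 One-Phase
3.3 Sequence Growth
3.4 Two-Phase
3.5 Spark Implementation Strategy
3.6 Using Spark to Improve Mining Performance
Chapter 4 Experiments and Performance Evaluation
4.1 Experimental Environment and Test Data
4.2 Experimental Results
4.3 Data Volume Scalability Experiment
4.4 Data Length Variability Experiment
4.5 Performance Differences among Sequential Pattern Mining Algorithms
4.6 Dataset Size and Memory Size Configuration Experiment
4.7 Summary of Experiments
Chapter 5 Conclusions and Future Work
5.1 Conclusions
5.2 Future Work
References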

