串流資料窗口運算動態演算法之研究__國立東華大學博碩士論文全文影像系統

帳號：guest(3.142.197.198) 離開系統

字體大小：

詳目顯示

第 1 筆 / 共 1 筆

/1頁

論文基本資料
摘要
外文摘要
論文目次
參考文獻
電子全文

作者:	李岳勳
作者(英文):	Yueh-Hsun Lee
論文名稱:	串流資料窗口運算動態演算法之研究
論文名稱(英文):	Dynamic Algorithms for Window Operations on Streaming Data
指導教授:	吳秀陽
指導教授(英文):	Shiow-Yang Wu
口試委員:	張耀中孫宗瀛
口試委員(英文):	Yao-Chung Chang Tsung-Ying Sun
學位類別:	碩士
校院名稱:	國立東華大學
系所名稱:	資訊工程學系
學號:	610521216
出版年(民國):	109
畢業學年度:	108
語文別:	中文
論文頁數:	63
關鍵詞:	串流資料處理、窗口運算、工作量預估、動態處理、循序樣式探勘、MapReduce、Hadoop、Spark Streaming
關鍵詞(英文):	streaming data processing、window operations、workload estimation、dynamic processing、sequential pattern mining、MapReduce、Hadoop、Spark Streaming
相關次數:	推薦:0 點閱:30 評分: 下載:19 收藏:0

串流資料處理在近年來越來越受到重視，尤其在物聯網應用中的重要程度越來越高。在串流資料處理中，窗口運算最能反應即時狀況，所以更受重視。在現有串流資料服務系統中，窗口運算實際處理方式主要有兩種，一種為每個窗口皆全部重新計算，另一種為從前一窗口運算結果中除去過時資料並增補新進資料的補差方式。在廣受歡迎的Spark Streaming系統中，就有提供此兩種方法的實作。然而在資料量與前後差異度等變因不同的情況下，不同方法的效能不同，適用的時機也不同，導致採用單一策略的處理方法效能低落。
本論文改進現有技術，提出了一個動態窗口運算架構與演算法，透過預估不同策略之窗口運算工作負擔量的差異，再動態選擇最佳執行策略，以即時反應資料串流型態的變化，增進整體執行效能。本論文實際在Spark叢集上進行實驗，並使用IBM所提供的資料產生器(IBM Quest Synthetic Data Generator)來產生各項實驗用資料，並透過結果歸納不同情境下適合使用的窗口運算執行策略與演算法。

Streaming data processing has received more and more attention in recent years, especially in IoT applications. In streaming data processing, window operations can best reflect the real-time situation. In the existing streaming data service system, there are mainly two actual processing methods for window operations. One is to recalculate each window, and the other is to remove obsolete data from the previous window operation result and add new data. In the popular Spark Streaming system, there are implementations of both methods. However, in the case of data variations in the amount of data and the degree of difference between consecutive windows, the performance of different methods can vary significantly, resulting in a low performance of window processing using a single strategy.
This paper improves existing technologies and proposes a dynamic window operation architecture and algorithm. By estimating the difference in the computing workload of different window processing strategies, the dynamic execution strategy can reflect the changes in the data stream pattern immediately and therefore improve the overall execution efficiency. We conduct extensive experiments on Spark cluster with IBM Quest Synthetic Data Generator to generate various experimental data. Experimental results demonstrate that appropriate window operation execution strategies can significantly improve overall performance. The application scenario of different strategies are also summarized for real-world applications of our techniques.

第1章緒論 1
1.1研究動機與目的 1
1.2研究方法與成果 1
1.3論文架構 2
第2章相關研究與技術 3
2.1 MapReduce 3
2.2 Hadoop 3
2.3 Spark 5
2.4 Spark Streaming 9
2.5 串流資料(Streaming Data) 11
2.6 窗口運算(Window Operation) 11
2.6.1 reduceByKeyAndWindow（一般模式） 12
2.6.2 reduceByKeyAndWindow（Inverse Reduce模式） 13
2.7循序樣式探勘（Sequential Pattern Mining） 13
2.8 One-Phase演算法 15
2.9漸進式循序樣式探勘(Incremental Sequential Pattern Mining) 15
2.10 Cache與Checkpoint設置 18
第3章研究方法與演算法 21
3.1研究方法 21
3.2串流窗口運算動態演算法 24
3.2.1 Unit更新與儲存策略 24
3.2.2直接運算演算法(Direct-Calculation Algorithm) 26
3.2.3差異增減演算法(Increase-Decrease Algorithm) 26
3.2.4動態演算法DWE Algorithm (Dynamic Workload Estimation Algorithm) 27
3.3窗口運算中的One-Phase演算法實踐 32
第4章實驗與效能評估 33
4.1實驗環境與測試資料 33
4.2實驗方法 35
4.3資料相異性與可擴充性實驗 36
4.4 DWE演算法效能實驗 43
4.5窗口長度實驗 44
4.6滑動長度實驗 48
4.7 One-Phase演算法實驗 55
4.8實驗總結 56
第5章結論與未來展望 57
5.1結論 57
5.1.1 Spark Streaming的window運算策略 57
5.1.2 DWE演算法效能 57
5.1.3窗口長度及滑動長度的影響 58
5.1.4搭配循序樣式探勘時的效能 58
5.1.5適用狀況 58
5.2未來展望 59
參考文獻 61

[1] R. Agrawal and R. Srikant, "Mining sequential patterns," Proceedings of the Eleventh International Conference on Data Engineering, Taipei, Taiwan, 1995, pp. 3-14, doi: 10.1109/ICDE.1995.380415.
[2] R. Agrawal, R. Srikant., "Fast algorithms for mining association rules", Very Large Data Base, 20th International Conference on, Vol. 1215, pp. 487-499, 1994.
[3] Ming-Syan Chen, Jiawei Han and P. S. Yu, "Data mining: an overview from a database perspective," in IEEE Transactions on Knowledge and Data Engineering, vol. 8, no. 6, pp. 866-883, Dec. 1996, doi: 10.1109/69.553155.
[4] Jeffrey Dean, Sanjay Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters", Communications of the ACM, Vol. 51(1), pp. 107-113, January 2008.
[5] X. Y. Yang, Z. Liu and Y. Fu, "MapReduce as a programming model for association rules algorithm on Hadoop," The 3rd International Conference on Information Sciences and Interaction Sciences, Chengdu, 2010, pp. 99-102, doi: 10.1109/ICICIS.2010.5534718.
[6] L. Li and M. Zhang, "The Strategy of Mining Association Rule Based on Cloud Computing," 2011 International Conference on Business Computing and Global Informatization, Shanghai, 2011, pp. 475-478, doi: 10.1109/BCGIn.2011.125.
[7] C. Chen, C. Tseng and M. Chen, "Highly Scalable Sequential Pattern Mining Based on MapReduce Model on the Cloud," 2013 IEEE International Congress on Big Data, Santa Clara, CA, 2013, pp. 310-317, doi: 10.1109/BigData.Congress.2013.48.
[8] Z. Farzanyar and N. Cercone, "Efficient mining of frequent itemsets in social network data based on MapReduce framework," 2013 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2013), Niagara Falls, ON, 2013, pp. 1183-1188, doi: 10.1145/2492517.2500301.
[9] The Apache Software Foundation, Apache Hadoop, https://hadoop.apache.org.
[10] The Apache Software Foundation, Apache Spark, https://spark.apache.org.
[11] The Apache Software Foundation, Apache Spark, https://spark.apache.org/docs/latest/streaming-programming-guide.
[12] The Apache Software Foundation, Apache Spark, https://spark.apache.org/docs/latest/streaming-programming-guide#window-operations.
[13] Python.org, https://www.python.org/.
[14] J.Dean,S.Ghemawat,"MapReduce: Simplified Data Processing on Large Clusters". Proc. of Operating Systems Design and Implementation, San Francisco,CA, pp. 137-150, 2004.
[15] Apache Spark - RDD, https://www.tutorialspoint.com/apache_spark/apache_spark_rdd.
[16] Helen Ponto, Jiawei Han, Jian Pei, Ke Wang, "Multi-dimensional Sequential Pattern Mining", Information and Knowledge Management, 10th International Conference on, ACM, pp. 81-88, 2001.
[17] 許國宏，"基於Spark架構上之漸進式循序樣式探勘".碩士論文，國立東華大學，花蓮，台灣，2017.
[18] Mao Yinmin, Yang Lumin, Li Hong, Chen Zhigang and Liu Lixin, "Mining closed frequent itemsets in the sliding window over data stream," 2009 IEEE Youth Conference on Information, Computing and Telecommunication, Beijing, 2009, pp. 146-149, doi: 10.1109/YCICT.2009.5382407.
[19] A. Khodabakhsh, I. Ari, M. Bakir and S. M. Alagoz, "Stream Analytics and Adaptive Windows for Operational Mode Identification of Time-Varying Industrial Systems," 2018 IEEE International Congress on Big Data (BigData Congress), San Francisco, CA, 2018, pp. 242-246, doi: 10.1109/BigDataCongress.2018.00042.
[20] P. Murena, M. Al-Ghossein, T. Abdessalem and A. Cornuéjols, "Adaptive Window Strategy for Topic Modeling in Document Streams," 2018 International Joint Conference on Neural Networks (IJCNN), Rio de Janeiro, 2018, pp. 1-7, doi: 10.1109/IJCNN.2018.8489771.
[21] H. Peiro Sajjad, Y. Liu and V. Vlassov, "Optimizing Windowed Aggregation over Geo-Distributed Data Streams," 2018 IEEE International Conference on Edge Computing (EDGE), San Francisco, CA, 2018, pp. 33-41, doi: 10.1109/EDGE.2018.00012.
[22] A. Xiong, Y. Huang, Y. Wu, J. Zhang and L. Long, "An Adaptive Sliding Window Algorithm for Mining Frequent Itemsets in Computer Forensics," 2018 17th IEEE International Conference On Trust, Security And Privacy In Computing And Communications/ 12th IEEE International Conference On Big Data Science And Engineering (TrustCom/BigDataSE), New York, NY, 2018, pp. 1660-1663, doi: 10.1109/TrustCom/BigDataSE.2018.00246.

01.pdf

推文
推薦
評分
引用網址
轉寄

top

詳目顯示

相關論文