2005 Volume 20 Issue 1 Pages 106_2
By rapid progress of network and storage technologies, a huge amount of weakly structured data such as Web pages and XML data, called semistructured data, have been available on Internet and intranets Therefore, there have been increasing demands for efficient methods that extract rules and patterns from semi-structured data, namely semi-structured data mining However, semi-structured data are a huge amount of complex and hetero-geneous data modeled by trees or graphs Thus, we can not directly apply to semi-structured data traditional data mining methods for relational databases Hence, it is an important challenge to develop efficient and scalable methods for semi-structured data mining In this thesis, we model semi-structured data as labeled trees, and study efficient semi-structured data mining algorithms for various classes of tree patterns In Chap 3, we consider the problem of discovering frequent ordered tree patterns from semi-structured data and developed an efficient algorithm FREQT for the problem The key technique is an efficlent enumeration technique called rightmost expansion, which enumerates all the ordered trees in 0 (1) time per pattern Consequently, FREQT computes all the frequent patterns m 0 (kb^2m) time per pattern without duplication, where k is the size of the pattern, b is the maximum branching of an input data tree, and m is the number of occurrences of the pattern In Chap 4, we then extend the algorithm FREQT to the optimized pattern discovery problem and give an efficient algorithin OPTT for mining optimized ordered tree patterns In Chap 5, we consider the frequent unordered tree pattern discovery problem for semi-structured data and developed an efficient algorithm UNOT for the problem In Chap 6, we study a variant of the frequent tree discovery problem, called frequent pattern mining from semi-structured data streams, where the input to the mining algorithm is not a static dataset, but a rapidly changing and unbounded data stream We developed an online algorithm StreamT, which discovers all the frequent ordered trees appearing in a given data stream by scanning it once Finally, we conclude this thesis and discuss future research directions in Chap 7