Fuzzy clustering has been discussed in order to reflect the pervasiveness of imprecision and uncertainty that exists in the real world. The amount of data is growing, and we face the challenge of analyzing, processing, and extracting useful information from such complicated and vast data. Many fuzzy clustering methods have been developed to deal with such data. This paper outlines some fuzzy clustering methods whose target data are asymmetric similarity data or interval-valued data. Moreover, a method for interpreting the fuzzy clustering result is described.
A logistic regression model using the random subspace method is investigated through experiments. Ensemble learning is known to be one of the better prediction methods. The framework makes it possible to improve the precision of prediction; however, interpretation of the model often becomes difficult. By combining the random subspace method with a logistic regression model, this paper attempts to provide a solution to this problem. The improvement in precision is verified in a preliminary experiment. Furthermore, the meaning of the combined model is easily understandable by means of the coefficients of the prediction model.
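As a rough illustration of this scheme, the following sketch builds a random-subspace ensemble of logistic regression models with scikit-learn's BaggingClassifier; the dataset, subspace fraction, and ensemble size are illustrative choices, not the paper's experimental setup.

```python
# Minimal sketch of a random-subspace ensemble of logistic regression
# models, built with scikit-learn's BaggingClassifier.  Dataset, subspace
# fraction, and ensemble size are illustrative, not the paper's setup.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
base = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Random subspace method: each base model sees only a random subset of the
# features, while all training instances are kept (bootstrap=False).
ensemble = BaggingClassifier(
    base,
    n_estimators=50,
    max_features=0.5,       # fraction of features given to each base model
    bootstrap=False,
    random_state=0,
)

print("single model            :", round(cross_val_score(base, X, y, cv=5).mean(), 3))
print("random subspace ensemble:", round(cross_val_score(ensemble, X, y, cv=5).mean(), 3))
```

Each fitted base model's coefficients refer only to its own feature subset (recorded in `estimators_features_`), so they can be mapped back to the original variables and aggregated, which is one way to read off the meaning of the combined model.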
Research was carried out on infant infectious diseases for the purpose of grasping occurrence trends and detecting epidemics early. Using the patient information data of the infectious disease surveillance program, we examined how to grasp the current state of patient occurrence by area, its spread to neighboring zones, and the spread of epidemics to other prefectures. Moreover, an algorithm for predicting epidemics was examined.
Dimensionality reduction is one of the important preprocessing steps in high-dimensional data analysis. In this paper, we consider the supervised dimensionality reduction problem, i.e., samples are accompanied with class labels. Fisher discriminant analysis (FDA) is a traditional but powerful technique for linear supervised dimensionality reduction. However, FDA tends to give undesired results if samples in a class are multimodal. Locality-preserving projection (LPP) allows us to reduce the dimensionality of multimodal data without losing the local structure. However, LPP is an unsupervised method and is not necessarily effective in supervised learning scenarios. In this paper, we propose a new linear supervised dimensionality reduction method called local Fisher discriminant analysis (LFDA). LFDA effectively combines the ideas of FDA and LPP and works well for dimensionality reduction of multimodal labeled data. LFDA has an analytic form of the embedding transformation and the solution can be easily computed just by solving a generalized eigenvalue problem. This is an advantage over recently proposed supervised dimensionality reduction methods. We demonstrate the practical usefulness and high scalability of the LFDA method in data visualization and classification tasks through extensive simulation studies. We also show that LFDA can be extended to non-linear dimensionality reduction scenarios by applying the kernel trick.
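The following sketch illustrates the generalized-eigenvalue formulation that FDA and LFDA share, using the ordinary global scatter matrices on a standard dataset; LFDA itself replaces these with locally weighted versions built from an affinity matrix, which is not reproduced here.

```python
# Minimal sketch of the generalized-eigenvalue view shared by FDA and LFDA.
# Ordinary (global) FDA scatter matrices are used; LFDA replaces them with
# locally weighted scatter matrices so that multimodal structure within each
# class is preserved.  Illustrative only.
import numpy as np
from scipy.linalg import eigh
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
n, d = X.shape
mean_total = X.mean(axis=0)

S_w = np.zeros((d, d))   # within-class scatter
S_b = np.zeros((d, d))   # between-class scatter
for c in np.unique(y):
    Xc = X[y == c]
    mean_c = Xc.mean(axis=0)
    S_w += (Xc - mean_c).T @ (Xc - mean_c)
    diff = (mean_c - mean_total).reshape(-1, 1)
    S_b += len(Xc) * diff @ diff.T

# Embedding directions: generalized eigenvectors of S_b v = lambda S_w v,
# taken in decreasing order of eigenvalue.
eigvals, eigvecs = eigh(S_b, S_w + 1e-8 * np.eye(d))
order = np.argsort(eigvals)[::-1]
T = eigvecs[:, order[:2]]          # d x 2 transformation matrix
Z = X @ T                          # 2-dimensional embedding
print(Z[:5])
```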
We consider principal component analysis for multi-dimensional sparse functional data. The mixed effect model and the reduced rank model have been used for analyzing sparse functional data. In this paper, we introduce a principal component method for multi-dimensional sparse functional data based on the reduced rank model, and model selection is performed using the Akaike information criterion (AIC) and the Bayesian information criterion (BIC). Furthermore, the use of the proposed method is illustrated through the analysis of human gait data and handwriting data.
This article presents flexible methods for modeling censored survival data using penalized smoothing splines when covariate values change over the duration of the study. The Cox proportional hazards model has been widely used for the analysis of treatment and prognostic effects with censored survival data. However, it involves a number of theoretical problems that remain to be solved with respect to the baseline survival function and the baseline cumulative hazard function. The basic idea in this article is to use a logistic regression model and generalized additive models with B-splines, and then estimate the survival function. The proposed methods are illustrated using data from a long-term study of patients with PBC (primary biliary cirrhosis), for the purpose of facilitating the decision as to when to undertake liver transplantation. As an illustration of the graphical evaluation of covariates, the Stanford Heart Transplant data, which were collected to model patient survival, are also used. We model survival time as a function of patient covariates and transplant status, and compare the results obtained using smoothing spline, partial logistic, Cox proportional hazards, and piecewise exponential models.
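A minimal sketch of the discrete-time (partial logistic) idea is given below, assuming synthetic data: subjects are expanded into person-period records, the event indicator is regressed on a B-spline basis of time plus a covariate, and the survival function is obtained as the cumulative product of one minus the fitted hazard. The penalized-spline details and the PBC and Stanford analyses are not reproduced.

```python
# Minimal sketch of a discrete-time (partial logistic) survival model:
# each subject is expanded into one record per interval at risk, the event
# indicator is regressed on a B-spline basis of time plus a covariate, and
# the survival curve is the cumulative product of (1 - hazard).
# Synthetic data; the penalized-spline and PBC analyses are not reproduced.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import SplineTransformer

rng = np.random.default_rng(0)
n = 300
covariate = rng.normal(size=n)                           # e.g. a prognostic score
true_time = rng.exponential(scale=np.exp(-0.7 * covariate)) * 10
censor_time = rng.uniform(2, 12, size=n)
time = np.minimum(true_time, censor_time).astype(int) + 1
event = (true_time <= censor_time).astype(int)

# Person-period expansion: one row per (subject, interval) while at risk.
rows, labels = [], []
for t_i, e_i, x_i in zip(time, event, covariate):
    for k in range(1, t_i + 1):
        rows.append([k, x_i])
        labels.append(1 if (e_i == 1 and k == t_i) else 0)
rows, labels = np.array(rows, dtype=float), np.array(labels)

# B-spline basis for the interval index approximates a smooth baseline hazard.
spline = SplineTransformer(degree=3, n_knots=5, include_bias=False)
design = np.hstack([spline.fit_transform(rows[:, [0]]), rows[:, [1]]])
model = LogisticRegression(max_iter=2000).fit(design, labels)

# Survival function for a subject with covariate value 0.
grid = np.arange(1, 11, dtype=float).reshape(-1, 1)
design_new = np.hstack([spline.transform(grid), np.zeros_like(grid)])
hazard = model.predict_proba(design_new)[:, 1]
survival = np.cumprod(1.0 - hazard)
print(np.round(survival, 3))
```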
This paper proposes an approach to investigating how students behave during a course of practice in medical informatics. Similarities among students' learning behaviors were measured with a tutorial evaluation form consisting of 19 questionnaire items and were analyzed by multidimensional scaling. The results show that the items in the evaluation sheet can be divided into two classes on a two-dimensional plane. One class includes arrangement of knowledge, important themes, fundamental items, common learning items, a goal for learning, and understanding of other people, which are located in the neighborhood of the origin with high similarities. The other class includes time distribution of discussion announcements, time distribution of the learning plan, the order of learning items, and logical explanations, which are located in the regions surrounding the former class. In each year from 2002 to 2004, while self-presentation was not observed, learning behavior with extroversion and other-directedness was observed.
This paper proposes three algorithms for relational frequent pattern mining based on the logical structure of examples. These methods use bottom-up property extraction from examples, and the extracted properties are combined into patterns in a level-wise manner, as in Apriori. Properties are defined in terms of modes of predicates. The algorithms are evaluated in comparison with WARMR.
Studies using collocation are becoming more complex and their applications are expanding. However, calculating collocation scores has not been efficient. In this paper, we propose a method to calculate collocation scores efficiently by using the frequencies of multi-word collocations, and report experimental results.
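For reference, the sketch below computes one common collocation score, pointwise mutual information, directly from frequency counts on a toy corpus; the paper's efficient computation via multi-word collocation frequencies is not reproduced.

```python
# Minimal sketch of one common collocation score, pointwise mutual
# information (PMI), computed from raw frequency counts.  The paper's
# efficient computation via multi-word collocation frequencies is not
# reproduced; the toy corpus below is illustrative.
import math
from collections import Counter

tokens = "the strong tea and the strong coffee and the hot tea".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
n_uni, n_bi = sum(unigrams.values()), sum(bigrams.values())

def pmi(w1, w2):
    """PMI(w1, w2) = log2( P(w1, w2) / (P(w1) * P(w2)) )."""
    p_joint = bigrams[(w1, w2)] / n_bi
    p_w1 = unigrams[w1] / n_uni
    p_w2 = unigrams[w2] / n_uni
    return math.log2(p_joint / (p_w1 * p_w2))

print(round(pmi("strong", "tea"), 3))
```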
In this paper, we present a method to estimate the minimum number of training instances needed to construct a valid learning model in a rule evaluation support method. In the post-processing phase of the data mining process, the rule evaluation procedure is one of the important and costly procedures, because it requires expertise from human experts. To support such rule evaluation, we have developed a support method based on learning models called rule evaluation models. These models should be valid and constructed from as few training instances as possible to curb learning costs. Therefore, we have evaluated learning costs in terms of the accuracies obtained on an entire training dataset and the achievement rates of sub-sampled training instances. We then show case studies on artificial evaluation results and an actual data mining result.
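The sketch below illustrates the underlying idea under assumed choices of dataset, learner, and a 95% achievement-rate threshold: the accuracy of models trained on growing subsamples is compared with the accuracy obtained on the full training set, and the smallest sufficient size is reported.

```python
# Minimal sketch of estimating the smallest number of training instances
# whose accuracy reaches a given achievement rate of the full-data accuracy.
# Dataset, learner, and the 95% threshold are illustrative choices.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

def accuracy_with(n_samples):
    clf = DecisionTreeClassifier(random_state=0)
    clf.fit(X_tr[:n_samples], y_tr[:n_samples])
    return clf.score(X_te, y_te)

full_acc = accuracy_with(len(X_tr))
for n in range(50, len(X_tr) + 1, 50):
    if accuracy_with(n) >= 0.95 * full_acc:      # 95% achievement rate
        print("sufficient training size:", n, "(full accuracy:", round(full_acc, 3), ")")
        break
else:
    print("needs the full training set of", len(X_tr), "instances")
```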
Total Frequency, proposed by Takano et al., is a useful anti-monotonic measure, which was developed for finding all frequent patterns in a single large-scale data sequence. However, in some application areas, sequences extracted using only a frequency measure are often meaningless, noisy sequences. In this paper, we propose a method based on an information gain measure obtained by combining frequency and self-information. This method can check for and exclude noisy sequences. Note that using only self-information as an extraction measure cannot find important patterns, because subsequences with high self-information hardly ever appear in the base data sequence. It is important to extract pattern sequences that occur neither too many times nor too few times in a single large-scale data sequence.
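A toy sketch of such a combined score follows, assuming a character sequence and fixed-length window patterns; the exact definitions of Total Frequency and of the paper's information gain measure are not reproduced.

```python
# Minimal sketch of scoring subsequence patterns by combining occurrence
# frequency with self-information, so that patterns occurring neither too
# often nor too rarely are favoured.  The exact Total Frequency measure and
# the paper's information-gain definition are not reproduced.
import math
from collections import Counter

sequence = "abcabxabcabyabcab"
length = 3

windows = [sequence[i:i + length] for i in range(len(sequence) - length + 1)]
counts = Counter(windows)
total = len(windows)

scores = {}
for pattern, freq in counts.items():
    p = freq / total
    self_info = -math.log2(p)           # rare patterns carry more information
    scores[pattern] = freq * self_info  # frequency x self-information

for pattern, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pattern, counts[pattern], round(score, 3))
```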
Web usage mining plays an important role in the adaptation of Web sites and the improvement of Web server performance. It applies data mining techniques to discover Web access patterns from Web usage data. It is interesting to use information on the Web hyperlink structure and Web contents, as well as the Web access log, to discover Web access patterns. In this paper, we propose a unified form to represent all three kinds of Web information as sequences. The form is convenient for applying frequent sequential pattern mining algorithms. In addition, taking the characteristics of the Web into consideration, we give two proposals for frequent sequential pattern mining algorithms.
A graph mining technique called Chunkingless Graph-Based Induction (Cl-GBI) can extract discriminative subgraphs from graph-structured data by an operation called chunkingless pairwise expansion, which constructs pseudo-nodes from selected pairs of nodes in the data. Because of its time and space complexities, Cl-GBI sometimes cannot extract subgraphs that are good enough to describe the characteristics of the data. Thus, to improve its efficiency, we propose a pruning method based on the upper bound of information gain. Information gain is used as the criterion of discriminativity in Cl-GBI, and the upper bound of the information gain of a subgraph is the maximum value that its supergraphs can achieve. The proposed method allows Cl-GBI to exclude from its search space unfruitful subgraphs that cannot yield the most discriminative one, by comparing the upper bound of the information gain of each subgraph at hand with the best information gain found so far. Furthermore, in this paper, we experimentally show the usefulness of the proposed pruning method by applying Cl-GBI with the pruning to both a real-world dataset and artificial datasets.
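The sketch below illustrates the kind of optimistic bound involved, assuming binary class labels and pattern coverage counts: since any supergraph covers a subset of a pattern's occurrences, the best achievable information gain is bounded by the gain obtained when only the covered positives (or only the covered negatives) remain. The Cl-GBI procedure itself is not reproduced.

```python
# Minimal sketch of an "optimistic" upper bound used for pruning in pattern
# mining with information gain: any supergraph of a pattern covers a subset
# of the pattern's occurrences, so by convexity the best possible gain is
# attained when it covers only the positives or only the negatives among
# them.  The Cl-GBI machinery itself is not reproduced.
import math

def entropy(pos, neg):
    total = pos + neg
    if total == 0 or pos == 0 or neg == 0:
        return 0.0
    p = pos / total
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def info_gain(pos_cov, neg_cov, pos_all, neg_all):
    """Information gain of splitting the data by pattern occurrence."""
    n = pos_all + neg_all
    covered = pos_cov + neg_cov
    rest_pos, rest_neg = pos_all - pos_cov, neg_all - neg_cov
    return (entropy(pos_all, neg_all)
            - covered / n * entropy(pos_cov, neg_cov)
            - (n - covered) / n * entropy(rest_pos, rest_neg))

def upper_bound(pos_cov, neg_cov, pos_all, neg_all):
    """Best gain any refinement (supergraph) of the pattern can achieve."""
    return max(info_gain(pos_cov, 0, pos_all, neg_all),
               info_gain(0, neg_cov, pos_all, neg_all))

# Pattern covering 40 positive and 35 negative graphs out of 50/50:
best_so_far = 0.30
if upper_bound(40, 35, 50, 50) < best_so_far:
    print("prune: no supergraph can beat the current best")
else:
    print("keep exploring; bound =", round(upper_bound(40, 35, 50, 50), 3))
```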
The data mining process consists of sub-processes such as pre-processing, the mining process, and post-processing, in each of which many algorithms are available. Usually, a data miner has to iterate such a process, varying the algorithms and their parameters in each sub-process, in order to obtain a satisfactory result. Such iterative processes impose a heavy burden on users. In addition, novices in data mining need to learn many things about the algorithms, such as the meanings of their parameters. Thus, the selection of appropriate algorithms to use on a new dataset is an important issue. In this paper, we propose a user support system based on case-based reasoning, which recommends algorithms that seem to be appropriate for a new dataset at hand, based on the most similar cases in its case base. The similarity in our system takes into account not only superficial features of a dataset, such as the number of instances in a class, but also its contents. In addition, the system utilizes rules about the effects of algorithms, learned from the case base, in order to refine the candidate algorithms it selects. We also experimentally show the usefulness of the proposed system by comparing it with an existing method based on case-based reasoning.
The World Wide Web (WWW) has been regarded as one of the most important information sources. We are able to obtain much information from the WWW using search engines and acquire some knowledge from it. However, it is difficult to grasp a huge amount of knowledge in a short time using only search engines. We propose an animation interface to support knowledge reconfirmation. We define the knowledge as the differences between keyword relationships. The interface has three functions: (1) displaying a map of keyword relationships, (2) switching maps depending on the viewpoint, and (3) displaying an animation between two maps. Experimental results showed that the interface is able to support knowledge reconfirmation.
Correlation analysis among variables is frequently used in various approaches in statistics and data mining. However, its application to data obtained from recent ubiquitous sensing systems consisting of massive numbers of sensors is often intractable, since its computational complexity is proportional to the square of the number of variables. On the other hand, strong correlations among variables are usually sparse in various kinds of data, such as small-world data. In this report, we propose a novel method to efficiently estimate the correlations among massive numbers of variables under this sparseness. Experimental evaluations show excellent efficiency compared with the direct computation of all correlation coefficients.
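As one illustrative approach to exploiting such sparseness (not the authors' algorithm), the sketch below screens variable pairs with a cheap random-projection approximation of the correlation and computes the exact coefficient only for pairs passing a threshold.

```python
# Illustrative sketch (not the paper's algorithm): screen variable pairs
# with a cheap random-projection approximation of the correlation and
# compute the exact coefficient only for pairs that pass a threshold.
# Under sparseness of strong correlations, far fewer than d*(d-1)/2 exact
# computations are needed.
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 500, 200, 40          # samples, variables, projection dimension

# Synthetic data in which only a few variable pairs are strongly correlated.
X = rng.normal(size=(n, d))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=n)
X[:, 7] = -X[:, 5] + 0.1 * rng.normal(size=n)

Z = (X - X.mean(axis=0)) / X.std(axis=0)        # standardize: corr = Z_i.Z_j / n
R = rng.normal(size=(n, k)) / np.sqrt(k)        # random projection matrix
S = Z.T @ R                                     # d x k sketch of the variables

approx = (S @ S.T) / n                          # approximate correlation matrix
threshold = 0.6
candidates = np.argwhere(np.triu(np.abs(approx), k=1) > threshold)

for i, j in candidates:
    exact = float(Z[:, i] @ Z[:, j]) / n
    if abs(exact) > threshold:
        print(f"corr(x{i}, x{j}) = {exact:.3f}")
print("exact computations:", len(candidates), "of", d * (d - 1) // 2)
```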
Several studies have investigated efficient algorithms to detect highly correlated itemset pairs. In contrast, we focus on itemset pairs with even a medium degree of correlation in a target database, provided the correlation is drastically higher than the corresponding one in another database to be contrasted. We consider that such a large change in correlation can be evidence that something remarkable occurs implicitly in the target database. Among the problems of finding such itemset pairs, we consider the case where one component is given by the user; for the given component, we try to find the other component. Because of the non-monotonicity of the degree of correlation change, the problem of finding the other component is difficult. However, we prove that some monotonicity holds if we consider certain itemsets in the process of mining the other component.
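The sketch below only illustrates the quantity of interest, assuming the lift as the correlation measure and toy transaction databases: the change of correlation between the target and contrasted databases for pairs sharing a user-given component. The mining algorithm and monotonicity results are not reproduced.

```python
# Minimal sketch of the quantity of interest: the change of a correlation
# measure (here the lift) for an itemset pair between a target database and
# a contrasted database, with one component fixed by the user.  The paper's
# mining algorithm and monotonicity results are not reproduced.
def lift(transactions, left, right):
    n = len(transactions)
    both = sum(1 for t in transactions if left <= t and right <= t)
    only_l = sum(1 for t in transactions if left <= t)
    only_r = sum(1 for t in transactions if right <= t)
    if both == 0 or only_l == 0 or only_r == 0:
        return 0.0
    return (both / n) / ((only_l / n) * (only_r / n))

target = [{"a", "b", "c"}, {"a", "b"}, {"a", "b", "d"}, {"c", "d"}]
contrast = [{"a", "c"}, {"b", "d"}, {"a", "d"}, {"b", "c"}, {"a", "b"}]

fixed = {"a"}                       # component given by the user
for candidate in [{"b"}, {"c"}, {"d"}]:
    change = lift(target, fixed, candidate) - lift(contrast, fixed, candidate)
    print(candidate, "change in lift:", round(change, 3))
```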
Covariances of categorical variables are defined using a regular simplex expression for the categories. The method follows the variance definition by Gini, and it gives the covariance as the solution of simultaneous equations solved by the Newton method. The calculated results give reasonable values for test data. A method of principal component analysis (RS-PCA) using regular simplex expressions is also proposed, which allows easy interpretation of the principal components.
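The sketch below illustrates the regular-simplex encoding in a simplified form, assuming the k unit vectors in R^k as mutually equidistant vertices and toy categorical data; the Newton-method solution of the simultaneous covariance equations is not reproduced.

```python
# Illustrative sketch of the regular-simplex idea: each k-level categorical
# variable is mapped to the vertices of a regular simplex (here the k unit
# vectors in R^k, which are mutually equidistant), and covariance / PCA is
# then computed on the resulting coordinates.  The Newton-method solution
# of the simultaneous covariance equations in the paper is not reproduced.
import numpy as np

def simplex_encode(column):
    """Map category labels to vertices of a regular simplex (one-hot form)."""
    levels = sorted(set(column))
    vertices = np.eye(len(levels))          # k mutually equidistant points
    index = {level: i for i, level in enumerate(levels)}
    return np.array([vertices[index[v]] for v in column])

# Two categorical variables observed on six individuals (toy data).
color = ["red", "blue", "red", "green", "blue", "red"]
size = ["S", "L", "S", "M", "L", "M"]

encoded = np.hstack([simplex_encode(color), simplex_encode(size)])
centered = encoded - encoded.mean(axis=0)

cov = centered.T @ centered / (len(color) - 1)   # covariance of the coordinates
eigvals, eigvecs = np.linalg.eigh(cov)
print("leading principal component:", np.round(eigvecs[:, -1], 3))
print("explained variance ratio   :", np.round(eigvals[-1] / eigvals.sum(), 3))
```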