Hybrid Partitioning-Density Algorithm for K-Means Clustering of Distributed Data Utilizing OPTICS

The authors present the first clustering algorithm for distributed data that is fast, reliable, and makes no assumptions about the data distribution. The algorithm constructs a global clustering model from small local models built from local clustering statistics. This approach outperforms classical non-distributed approaches because it does not require transferring all of the data to a central processing node. The solution is a hybrid algorithm that combines the strengths of partitioning and density-based approaches. The proposed algorithm handles uneven data dispersion without the transfer overhead of additional data. Experiments on large datasets showed that the proposed solution introduces no loss of quality compared to non-distributed approaches and can achieve even better results, approaching the reference clustering. This is an excellent outcome, considering that the algorithm builds its model only from fragmented data while keeping the communication cost between nodes negligible.
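The core idea of exchanging small local summaries instead of raw data can be sketched as follows. This is an illustrative simplification, not the authors' algorithm: here each node reports per-cluster (vector sum, point count) statistics, and a coordinator greedily merges summaries whose centroids fall within an assumed merge radius.

```python
# Illustrative sketch (not the authors' algorithm): each node summarizes its
# local clusters as (vector sum, point count); the coordinator merges nearby
# summaries into global centroids, so raw data never leaves the nodes.
# The merge `radius` is an invented parameter for this example.

def local_summary(points, assignments, k):
    """Per-cluster (sum, count) statistics for one node's 2-D points."""
    sums = [[0.0, 0.0] for _ in range(k)]
    counts = [0] * k
    for (x, y), c in zip(points, assignments):
        sums[c][0] += x
        sums[c][1] += y
        counts[c] += 1
    return [(s, n) for s, n in zip(sums, counts) if n > 0]

def merge_summaries(summaries, radius=1.0):
    """Greedily merge summaries whose centroids lie within `radius`."""
    merged = []  # each entry: [sum_x, sum_y, count]
    for (sx, sy), n in summaries:
        cx, cy = sx / n, sy / n
        for m in merged:
            mx, my = m[0] / m[2], m[1] / m[2]
            if (cx - mx) ** 2 + (cy - my) ** 2 <= radius ** 2:
                m[0] += sx; m[1] += sy; m[2] += n
                break
        else:
            merged.append([sx, sy, n])
    return [(m[0] / m[2], m[1] / m[2]) for m in merged]
```

Only the tiny summary tuples cross the network, which is why the communication cost stays negligible even for large local datasets.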

Efficient Algorithm for Mining High Utility Pattern Considering Length Constraints

High utility itemset (HUI) mining is a popular and important data mining task. Several studies have been carried out on this topic, but they often discover a very large number of itemsets and rules, which reduces not only the efficiency but also the effectiveness of HUI mining. Constraint-based mining plays an important role in increasing efficiency and discovering more interesting HUIs. To address this issue, the authors propose an algorithm named EHIL (Efficient High utility Itemsets with Length constraints) that discovers HUIs under length constraints, decreasing the number of HUIs by removing tiny itemsets. EHIL adopts two new upper bounds, sub-tree utility and local utility, for pruning, and modifies them by incorporating the length constraints. To reduce dataset scans, the proposed algorithm uses transaction merging and dataset projection techniques. Execution time improvements ranged from a modest five percent to two orders of magnitude across benchmark datasets, and memory usage was up to twenty-eight times lower than that of the state-of-the-art algorithm FHM+.
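The effect of length constraints on the result set can be illustrated with a deliberately naive miner. This sketch is not EHIL: it enumerates candidate itemsets by brute force rather than pruning with the sub-tree and local utility upper bounds, and serves only to show how minimum and maximum length bounds shrink the output.

```python
from itertools import combinations

# Illustrative sketch (not EHIL): brute-force high-utility itemset mining
# with minimum/maximum length constraints. Each transaction is a dict
# mapping an item to its utility in that transaction.

def mine_hui(transactions, min_util, min_len=1, max_len=3):
    """Return {itemset: utility} for itemsets meeting all constraints."""
    items = sorted({i for t in transactions for i in t})
    result = {}
    for k in range(min_len, max_len + 1):          # length constraint
        for itemset in combinations(items, k):
            util = sum(
                sum(t[i] for i in itemset)          # utility in one transaction
                for t in transactions
                if all(i in t for i in itemset)     # itemset fully present
            )
            if util >= min_util:
                result[itemset] = util
    return result
```

EHIL obtains the same filtering effect without enumerating every candidate, by tightening its pruning upper bounds with the length constraints.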

A Method of Sanitizing Privacy-Sensitive Sequence Pattern Networks Mined From Trajectories Released

Mobility patterns mined from released trajectories can help to allocate resources and provide personalized services, but they also pose a threat to personal location privacy. Because existing sanitization methods cannot counter location privacy inference attacks based on privacy-sensitive sequence pattern networks, the authors propose a method of sanitizing the privacy-sensitive sequence pattern networks mined from released trajectories by identifying and removing influential nodes from the networks. Extensive experiments showed that, by adjusting the proportional-factor parameter, the proposed method can thoroughly sanitize privacy-sensitive sequence pattern networks and achieve optimal values for the security degree and connectivity degree measurements. In addition, the performance of the proposed method was shown to be stable across multiple networks with essentially the same privacy-sensitive node ratio, and scalable to batches of networks with different sensitive node ratios.

Adaptation of Error Adjusted Bagging Method for Prediction

In this study, the error-adjusted bagging technique is adapted to support vector regression (SVR) and regression tree (RT) methods to obtain more accurate predictions, and the resulting models are evaluated on real data sets and in a simulation study. For this purpose, single models, classical bagging models, and error-adjusted bagging models are constructed with each of the above-mentioned methods, and their prediction performances are compared. The comparison is based mainly on a real dataset of 295 patients with Hodgkin's lymphoma (HL). The effects of several parameters, such as the training set ratio and the number of influential predictors, on model performance are examined over 500 repetitions of simulated data. The results reveal that the error-adjusted bagging method outperforms both the single and the classical bagging versions of each method. Furthermore, a bias-variance analysis confirms that the technique reduces both bias and variance.
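One plausible reading of the error-adjustment idea can be sketched as follows; this is not the authors' exact procedure. Each bootstrap replicate fits a base learner, a second learner is fit to that replicate's residuals, and the ensemble adds the estimated error back before averaging. A trivial 1-nearest-neighbour regressor stands in for SVR/RT only to keep the sketch self-contained.

```python
import random

# Illustrative sketch of error-adjusted bagging (one plausible reading, not
# the study's exact procedure). Base learner: 1-NN regression, chosen only
# so the example has no external dependencies.

def knn1(train, x):
    """1-NN regression on (x_i, y_i) pairs: y of the nearest x."""
    return min(train, key=lambda p: abs(p[0] - x))[1]

def error_adjusted_bagging(data, n_models=10, seed=0):
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        boot = [rng.choice(data) for _ in data]          # bootstrap replicate
        # residuals of the bootstrap model on the full training set
        resid = [(x, y - knn1(boot, x)) for x, y in data]
        models.append((boot, resid))

    def predict(x):
        # base prediction plus estimated error, averaged over replicates
        return sum(knn1(boot, x) + knn1(resid, x)
                   for boot, resid in models) / n_models

    return predict
```

The second (residual) model is what distinguishes this from classical bagging, and it is the mechanism through which both bias and variance can be reduced.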

Maintaining Dimension’s History in Data Warehouses Effectively

A data warehouse is considered a key aspect of success for any decision support system. Research on temporal databases has produced important results in this field, and data warehouses, which store historical data, can clearly benefit from such studies. A slowly changing dimension is a dimension whose attributes in a data warehouse change infrequently over time. Although different solutions have been proposed, each has its own particular disadvantages. In this research work, the authors propose the Object-Relational Temporal Data Warehouse (O-RTDW) model for slowly changing dimensions. Using this approach, it is possible to keep track of the whole history of an object in a data warehouse efficiently. The proposed model has been implemented on a real data set and tested successfully. Several limitations present in other solutions, such as redundancy, surrogate keys, incomplete historical data, and the creation of additional tables, are absent from the proposed solution.
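The object-relational idea of embedding an attribute's full history inside the dimension object, rather than spawning surrogate-keyed rows or extra tables, can be sketched as follows. This is an illustrative simplification, not the O-RTDW model itself; the class and method names are invented for the example.

```python
from datetime import date

# Illustrative sketch (not the O-RTDW model itself): a dimension attribute
# that keeps its full valid-time history as a nested collection of periods,
# so no surrogate keys or additional tables are needed to track changes.

class TemporalAttribute:
    def __init__(self, value, valid_from):
        # history entries: (valid_from, valid_to, value); open period -> None
        self.history = [(valid_from, None, value)]

    def update(self, value, valid_from):
        """Close the current period and open a new one."""
        start, _, old = self.history[-1]
        self.history[-1] = (start, valid_from, old)
        self.history.append((valid_from, None, value))

    def value_at(self, when):
        """Value that was valid at the given date, or None if before history."""
        for start, end, value in self.history:
            if start <= when and (end is None or when < end):
                return value
        return None
```

Because every past value stays queryable from the single nested collection, the history is complete and non-redundant, which mirrors the limitations the abstract says the model avoids.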
