Sentiment analysis of tweets using text and graph multi-view learning

Abstract

With the surge of deep learning frameworks, various studies have attempted to address the challenges of sentiment analysis of tweets (data sparsity, under-specificity, noise, and multilingual content) through text- and network-based representation learning approaches. However, few studies have combined the benefits of textual and structural (graph) representations for sentiment analysis of tweets. This study proposes a multi-view learning framework (end-to-end and ensemble-based) that leverages both text-based and graph-based representation learning approaches to enrich the tweet representation for sentiment classification. The efficacy of the proposed framework is evaluated on three datasets against suitable baseline counterparts. Across various experimental studies, we observe that combining textual and structural views achieves better sentiment classification performance than either view alone.
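
As a concrete illustration of the end-to-end variant, the following is a minimal sketch of fusing a textual view with a structural view (PyTorch; the module names, dimensions, and the assumption of a precomputed graph node embedding per tweet are ours, not the paper's):

```python
import torch
import torch.nn as nn

class MultiViewSentiment(nn.Module):
    """Minimal end-to-end fusion of a textual and a structural view."""
    def __init__(self, vocab_size, text_dim=128, graph_dim=64, n_classes=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, text_dim)
        self.text_enc = nn.LSTM(text_dim, text_dim, batch_first=True,
                                bidirectional=True)
        self.proj = nn.Linear(2 * text_dim + graph_dim, 128)
        self.clf = nn.Linear(128, n_classes)

    def forward(self, token_ids, graph_emb):
        # graph_emb: hypothetical precomputed node embedding of the
        # tweet/user (e.g. from node2vec), treated as a fixed input view
        _, (h, _) = self.text_enc(self.embed(token_ids))
        text_view = torch.cat([h[-2], h[-1]], dim=-1)  # final BiLSTM states
        fused = torch.relu(self.proj(torch.cat([text_view, graph_emb], dim=-1)))
        return self.clf(fused)
```

The ensemble-based variant would instead train separate text and graph classifiers and combine their predictions.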

Session-based recommendation with fusion of hypergraph item global and context features

Abstract

Session-based recommendation (SBR) aims to predict the items a user is likely to click next from their recent click history. Learning item features from existing session data to capture users’ current preferences is the central problem in the SBR domain, and fusing global and local information is an effective way to learn those preferences more accurately. In this paper, we propose a session-based recommendation model with fusion of hypergraph item global and context features (FHGIGC), which learns users’ current preferences by fusing item global and contextual features. Specifically, the model first constructs a global hypergraph and a local hypergraph and uses hypergraph neural networks to learn global and local item features from relevant session information and item contextual information, respectively. The learned features are then fused by an attention mechanism to obtain the final item and session features. Finally, personalized recommendations are generated for users based on the fused features. Experiments were conducted on three session-based recommendation datasets, and the results demonstrate that the FHGIGC model improves recommendation accuracy.
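
One common way to realize the attention-based fusion step is a learned gate over the two feature views; the sketch below is our illustration of that idea, not the paper's exact formulation:

```python
import torch
import torch.nn as nn

class GatedFeatureFusion(nn.Module):
    """Fuse global and local item features with a learned attention gate."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(2 * dim, 1)

    def forward(self, h_global, h_local):
        # h_global, h_local: (n_items, dim) features from the global and
        # local hypergraph views, respectively
        alpha = torch.sigmoid(self.gate(torch.cat([h_global, h_local], dim=-1)))
        return alpha * h_global + (1 - alpha) * h_local
```

In FHGIGC, the fused item features would further be aggregated into a session representation before scoring candidate items; the gate above covers only the fusion step.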

MedTSS: transforming abstractive summarization of scientific articles with linguistic analysis and concept reinforcement

Abstract

This research addresses the limitations of pretrained models (PTMs) in generating accurate and comprehensive abstractive summaries of scientific articles, with a specific focus on the challenges posed by medical research. The proposed solution, medical text simplification and summarization (MedTSS), introduces a dedicated module designed to enrich the source text for PTMs. MedTSS addresses token-limit issues, reinforces multiple concepts, and mitigates entity hallucination without requiring additional training. Furthermore, the module conducts linguistic analysis to simplify the generated summaries, tailored to the complex nature of medical research articles. The results demonstrate a significant enhancement, with MedTSS improving the ROUGE-1 score from 16.46 to 35.17 without additional training. By emphasizing knowledge-driven components, this framework offers a distinct perspective, challenging the common narrative of ’more data’ or ’more parameters’. This alternative approach, especially applicable in health-related domains, constitutes a broader contribution to the field of NLP. MedTSS not only addresses the intricacies of medical research summarization but also presents a paradigm shift with implications for domains beyond its initial scope.
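
The source-enrichment step can be pictured as concept-aware extractive filtering under the PTM's token budget. The sketch below is a hypothetical rendering of that idea only; the function name, the word-based token count, and the externally supplied concept list are our assumptions, not details from the paper:

```python
import re

def enrich_source(text, concepts, max_tokens=1024):
    """Hypothetical pre-processing in the spirit of MedTSS: rank sentences
    by how many domain concepts they mention, then keep the top-ranked
    ones until the PTM token budget is filled."""
    sents = re.split(r'(?<=[.!?])\s+', text)
    def score(s):
        return sum(s.lower().count(c.lower()) for c in concepts)
    ranked = sorted(sents, key=score, reverse=True)
    kept, budget = set(), max_tokens
    for s in ranked:
        n = len(s.split())  # crude word-based token count
        if n <= budget:
            kept.add(s)
            budget -= n
    # restore original sentence order so the PTM sees coherent text
    return ' '.join(s for s in sents if s in kept)
```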

GA-based QoS-aware workflow scheduling of deadline tasks in grid computing

Abstract

Grid computing aggregates the power of heterogeneous, geographically distributed computing resources to provide high-performance computing. To benefit from grid computing capabilities, effective scheduling algorithms are essential. This paper presents a GA-based approach, called the Grid Workflow Tasks Scheduling Algorithm (GWTSA), for scheduling workflow tasks on grid services based on users’ QoS (quality of service) constraints in terms of cost and time. For a given set of inter-dependent workflow tasks, it generates an optimal schedule that minimizes execution time and cost while keeping the optimized time within the deadline imposed by the user. In GWTSA, the workflow tasks are modeled as a DAG, which is divided into branches; the optimal sub-schedules of all task divisions are then computed and combined to obtain the execution schedule of the entire workflow. A GA-based technique is employed in GWTSA to compute the optimal execution sub-schedule for each branch division, which consists of a set of sequential tasks. In this technique, a chromosome represents a branch division, where each gene holds the id of the service provider chosen to execute the corresponding task in the branch. The fitness function is formulated as a weighted multi-objective function of time and cost, which lets users trade speed against cost by changing the weighting coefficients. The paper also reports experimental results assessing the performance of GWTSA on workflow samples of different sizes.
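
The weighted fitness function can be made concrete as follows. The sketch is a plausible reading of the description above, with `time_of` and `cost_of` as hypothetical per-task, per-provider lookup tables; in practice the two objectives would be normalized to comparable scales first:

```python
def fitness(chromosome, time_of, cost_of, w_time=0.5, w_cost=0.5):
    """Weighted time/cost objective for one branch division (minimized).
    chromosome[i] is the id of the provider chosen for sequential task i."""
    total_time = sum(time_of[i][g] for i, g in enumerate(chromosome))
    total_cost = sum(cost_of[i][g] for i, g in enumerate(chromosome))
    # time and cost would normally be normalized before weighting
    return w_time * total_time + w_cost * total_cost

# A user who values speed over cost raises w_time, e.g.
# fitness(ch, time_of, cost_of, w_time=0.8, w_cost=0.2)
```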

Hybrid Henry gas solubility optimization and the equilibrium optimizer for feature selection: real cases with Twitter spam detection

Abstract

The rapid spread and daily usage of social networks have made them vulnerable to spammers. Detecting and eliminating spam and spammers has therefore become essential to reduce the risks they pose to users’ security. To achieve this goal, it is crucial to determine the exact features that help identify and classify whether a user is a spammer or not. This paper proposes a wrapper-based method for selecting the most important features. It combines two recent metaheuristic algorithms, the Henry Gas Solubility Optimization algorithm (HGSO) and the Equilibrium Optimizer algorithm (EO), with the goal of choosing a small and highly influential subset of features that yields good performance and aids the spammer-profile detection process. To demonstrate the ability of the proposed method to achieve these goals, several comparisons are conducted on a modified Social Honeypot dataset. The first comparison is made between HGSOEO and the two algorithms (HGSO and EO) from which it was developed, to demonstrate the power of hybridization. The next two comparisons are made against classical filter- and wrapper-based feature selection methods. The last comparison is carried out against some well-known metaheuristic algorithms for feature selection. Experiments and analysis of the results show that the proposed model is more accurate than the algorithms and methods it was compared to.
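
A typical wrapper fitness of this kind scores a binary feature mask by classification quality plus a small penalty on subset size; the sketch below is a generic illustration (the kNN classifier, 5-fold cross-validation, and weighting alpha are our choices, not necessarily the paper's):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def fs_fitness(mask, X, y, alpha=0.99):
    """Wrapper fitness (minimized): error rate of a classifier trained on
    the selected features, plus a penalty on the number of features."""
    idx = np.flatnonzero(mask)
    if idx.size == 0:
        return 1.0  # worst case: no features selected
    acc = cross_val_score(KNeighborsClassifier(), X[:, idx], y, cv=5).mean()
    return alpha * (1 - acc) + (1 - alpha) * idx.size / X.shape[1]
```

The metaheuristics (HGSO, EO, or their hybrid) then search the space of masks for the one minimizing this fitness.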

Enhancing sentiment analysis via fusion of multiple embeddings using attention encoder with LSTM

Abstract

Different embeddings capture different linguistic aspects, such as syntactic, semantic, and contextual information. Taking these diverse linguistic facets into account, we propose a novel hybrid model. The model fuses multiple embeddings through an attention encoder and feeds the result into an LSTM for sentiment classification. Our approach fuses Paragraph2vec, ELMo, and BERT embeddings to extract contextual information, while FastText is employed to capture syntactic characteristics. These embeddings are then fused with the embeddings obtained from the attention encoder to form the final embeddings, and an LSTM model predicts the final classification. We conducted experiments on both the Twitter Sentiment140 and Twitter US Airline Sentiment datasets. Our fusion model’s performance was evaluated against established models such as LSTM, bidirectional LSTM, BERT, and Att-Coder. The test results clearly demonstrate that our approach surpasses the baseline models.
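
A minimal sketch of such a fusion is given below: each embedding view is projected to a shared size, an attention layer weighs the views per token, and the fused sequence feeds an LSTM classifier. This is our illustration under stated assumptions (shared projection size, per-token softmax over views), not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class EmbeddingAttentionFusion(nn.Module):
    """Attend over several per-token embedding views, then classify."""
    def __init__(self, dims, hidden=256, n_classes=3):
        super().__init__()
        # one projection per view (e.g. Paragraph2vec/ELMo/BERT/FastText)
        self.proj = nn.ModuleList([nn.Linear(d, hidden) for d in dims])
        self.att = nn.Linear(hidden, 1)
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.clf = nn.Linear(hidden, n_classes)

    def forward(self, views):
        # views: list of (batch, seq_len, dim_i) tensors, one per embedding
        stacked = torch.stack([p(v) for p, v in zip(self.proj, views)], dim=2)
        weights = torch.softmax(self.att(stacked), dim=2)  # over the views
        fused = (weights * stacked).sum(dim=2)             # (B, T, hidden)
        _, (h, _) = self.lstm(fused)
        return self.clf(h[-1])
```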

Towards more sustainable and trustworthy reporting in machine learning

Abstract

With machine learning (ML) becoming a popular tool across all domains, practitioners are in dire need of comprehensive reporting on the state of the art. Benchmarks and open databases provide helpful insights for many tasks, but suffer from several shortcomings. Firstly, they focus overly on prediction quality, which is problematic given the demand for more sustainability in ML. Depending on the use case at hand, interested users may also face tight resource constraints and should therefore be able to interact with reporting frameworks in order to prioritize certain reported characteristics. Furthermore, as some practitioners may not yet be well versed in ML, it is important to convey information on a more abstract, comprehensible level. Usability and extendability are key for keeping up with the state of the art, and to be trustworthy, frameworks should explicitly address reproducibility. In this work, we analyze established reporting systems with respect to these issues. We then propose STREP, a novel framework that aims to overcome these shortcomings and paves the way towards more sustainable and trustworthy reporting. We use STREP’s publicly available implementation to investigate various existing report databases. Our experimental results reveal the need to make reporting more resource-aware and demonstrate our framework’s ability to overcome current reporting limitations. With this work, we want to initiate a paradigm shift in reporting and help make ML advances more considerate of sustainability and trustworthiness.
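
To make the idea of user-driven prioritization concrete, the sketch below ranks reported models by a user-weighted combination of min-max-normalized properties. It illustrates the concept only; the function name, data layout, and weights are our assumptions, not STREP's actual API:

```python
def rank_reports(reports, weights):
    """Rank models by user-weighted, normalized reported properties.
    reports: model name -> {property: value}; higher value is better.
    weights: property -> priority, e.g. trading accuracy vs. energy."""
    lo_hi = {p: (min(r[p] for r in reports.values()),
                 max(r[p] for r in reports.values())) for p in weights}
    def score(r):
        total = 0.0
        for p, w in weights.items():
            lo, hi = lo_hi[p]
            norm = (r[p] - lo) / (hi - lo) if hi > lo else 1.0
            total += w * norm
        return total
    return sorted(reports, key=lambda m: score(reports[m]), reverse=True)

# e.g. rank_reports(db, {"accuracy": 0.4, "energy_efficiency": 0.6})
```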

Robustness verification of k-nearest neighbors by abstract interpretation

Abstract

We study the certification of stability properties, such as robustness and individual fairness, of the k-nearest neighbor algorithm (kNN). Our approach leverages abstract interpretation, a well-established program analysis technique that has been proven successful in verifying several machine learning algorithms, notably, neural networks, decision trees, and support vector machines. In this work, we put forward an abstract interpretation-based framework for designing a sound approximate version of the kNN algorithm, which is instantiated to the interval and zonotope abstractions for approximating the range of numerical features. We show how this abstraction-based method can be used for stability, robustness, and individual fairness certification of kNN. Our certification technique has been implemented and experimentally evaluated on several benchmark datasets. These experimental results show that our tool can formally prove the stability of kNN classifiers in a precise and efficient way, thus expanding the range of machine learning models amenable to robustness certification.
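
A sound interval-based check in this spirit over-approximates each training distance over the perturbation region and certifies robustness whenever every training point that could possibly enter the k-nearest set carries the same label. The sketch below is our simplified illustration (an L-infinity box of radius eps and squared Euclidean distances), not the paper's full zonotope-based analysis:

```python
import numpy as np

def interval_sq_dist(x_lo, x_hi, p):
    """Min/max squared distance between the box [x_lo, x_hi] and point p."""
    lo = np.maximum(np.maximum(x_lo - p, p - x_hi), 0.0)
    hi = np.maximum(np.abs(p - x_lo), np.abs(p - x_hi))
    return np.sum(lo ** 2), np.sum(hi ** 2)

def certify_knn_robust(X_train, y_train, x, eps, k=3):
    """Sound (possibly incomplete) robustness check for kNN on a box."""
    x_lo, x_hi = x - eps, x + eps
    bounds = np.array([interval_sq_dist(x_lo, x_hi, p) for p in X_train])
    d_min, d_max = bounds[:, 0], bounds[:, 1]
    # any possible k-nearest neighbour has distance at most the k-th
    # smallest upper bound, so its lower bound is below that threshold
    thresh = np.sort(d_max)[k - 1]
    candidates = y_train[d_min <= thresh]
    # robust if all candidate neighbours agree on one label
    return len(set(candidates)) == 1
```

Because `candidates` is a superset of every kNN set reachable within the box, agreement among them proves stability; disagreement is inconclusive rather than a counterexample.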

BotCL: a social bot detection model based on graph contrastive learning

Abstract

The proliferation of social bots on social networks presents significant challenges to network security due to their malicious activities. While graph neural network models have shown promise in detecting social bots, acquiring a large number of high-quality labeled accounts remains challenging, which limits bot detection performance. To address this issue, we introduce BotCL, a social bot detection model that employs contrastive learning through data augmentation. First, we build a directed graph based on following/follower relationships, using semantic, attribute, and structural features of accounts as initial node features. We then simulate account behaviors within the social network and apply two data augmentation techniques to generate multiple views of the directed graph. Subsequently, we encode the generated views with relational graph convolutional networks and maximize agreement between node representations across views by minimizing the contrastive loss. Finally, node labels are predicted with a softmax classifier. The proposed method augments data according to its distribution, making it robust to noise. Extensive experimental results on the Cresci-2015, Twibot-20, and Twibot-22 datasets demonstrate that our approach surpasses state-of-the-art methods.
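
The cross-view agreement objective is commonly instantiated as an NT-Xent-style contrastive loss between matched nodes of the two augmented views; the sketch below shows a generic version (the temperature and the use of in-batch negatives are standard choices, not details taken from the paper):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z1, z2, tau=0.5):
    """NT-Xent-style loss between node embeddings from two augmented
    views; matching nodes are positives, all other nodes negatives."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    sim = z1 @ z2.t() / tau                         # (n_nodes, n_nodes)
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(sim, targets)            # diagonal = positives
```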

GK index: bridging Gf and K indices for comprehensive author evaluation

Abstract

Accurately predicting scientific impact and ranking researchers by impact has emerged as a crucial research challenge, captivating the interest of scholars across diverse domains. This task holds immense importance in enhancing research efficiency, aiding decision-making, and facilitating scientific evaluation. To this end, the scientific community has put forth a wide array of parameters for identifying the most influential researchers, including citation count, total publication count, hybrid methodologies, the h-index, and its extended or modified versions. Still, there is no consensus on a single optimal parameter for identifying the most influential author. In this study, we introduce a novel index derived from hidden patterns learned from a mathematics-field dataset comprising 1050 researchers, evenly split between awardees and non-awardees. Initially, we ranked selected parameters by assessing their values for individual researchers, identifying the five parameters that most frequently placed awardees within the top 100 records. Additionally, we employed deep learning techniques to identify the five most influential parameters from the initially selected set. Subsequently, we evaluated the disjointness between the results produced by these parameters. To refine our analysis further, we assessed seven different statistical models for combining the top disjoint pair so as to retain the maximum properties of both parameters. The study’s findings revealed that the gf and k indices exhibited a disjointness ratio of 0.96, establishing them as the most disjoint pair. Moreover, the geometric mean demonstrated an average impact of 0.87 in retaining the properties of the top disjoint pair, surpassing the other models. As a result of this study, we propose a new index obtained by taking the geometric mean of the top disjoint pair, which improves results by 12% compared with the best-performing existing individual index.
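
The proposed combination itself is simple arithmetic: the GK index of an author is the geometric mean of their gf and k index values, as in the sketch below (the example numbers are ours):

```python
import math

def gk_index(gf_index, k_index):
    """Proposed GK index: geometric mean of an author's gf and k indices."""
    return math.sqrt(gf_index * k_index)

# e.g. an author with gf = 12 and k = 27 gets gk = sqrt(12 * 27) = 18.0
```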