BDCC, Vol. 8, Pages 36: From Traditional Recommender Systems to GPT-Based Chatbots: A Survey of Recent Developments and Future Directions

Big Data and Cognitive Computing doi: 10.3390/bdcc8040036

Authors: Tamim Mahmud Al-Hasan, Aya Nabil Sayed, Faycal Bensaali, Yassine Himeur, Iraklis Varlamis, George Dimitrakopoulos

Recommender systems are a key technology for many applications, such as e-commerce, streaming media, and social media. Traditional recommender systems rely on collaborative filtering or content-based filtering to make recommendations. However, these approaches have limitations, such as the cold-start and data-sparsity problems. This survey paper presents an in-depth analysis of the paradigm shift from conventional recommender systems to generative pre-trained transformer (GPT)-based chatbots. We highlight recent developments that leverage the power of GPT to create interactive and personalized conversational agents. By exploring natural language processing (NLP) and deep learning techniques, we investigate how GPT models can better understand user preferences and provide context-aware recommendations. The paper further evaluates the advantages and limitations of GPT-based recommender systems, comparing their performance with traditional methods. Additionally, we discuss potential future directions, including the role of reinforcement learning in refining the personalization aspect of these systems.
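
The user-based collaborative filtering that the survey takes as its traditional baseline can be sketched in a few lines. The toy ratings, user names, and the similarity-weighted prediction below are illustrative only, not any system from the paper:

```python
from math import sqrt

# Toy user-item rating matrix; each user maps item ids to ratings.
ratings = {
    "alice": {"a": 5, "b": 3, "c": 4},
    "bob":   {"a": 4, "b": 3, "c": 5},
    "carol": {"a": 1, "b": 5},
}

def cosine_sim(u, v):
    """Cosine similarity over the items both users have rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    nu = sqrt(sum(u[i] ** 2 for i in common))
    nv = sqrt(sum(v[i] ** 2 for i in common))
    return dot / (nu * nv)

def predict(user, item):
    """Similarity-weighted average of neighbours' ratings for `item`."""
    num = den = 0.0
    for other, r in ratings.items():
        if other == user or item not in r:
            continue
        s = cosine_sim(ratings[user], r)
        num += s * r[item]
        den += abs(s)
    return num / den if den else None

p = predict("carol", "c")
```

The cold-start limitation the abstract mentions is visible here: a brand-new user has no overlap with anyone, so every similarity is zero and no prediction can be made.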

BDCC, Vol. 8, Pages 35: Two-Stage Method for Clothing Feature Detection

Big Data and Cognitive Computing doi: 10.3390/bdcc8040035

Authors: Xinwei Lyu, Xinjia Li, Yuexin Zhang, Wenlian Lu

The rapid expansion of e-commerce, particularly in the clothing sector, has led to significant demand for effective clothing feature detection. This study presents a novel two-stage image recognition method. Our approach distinctively combines human keypoint detection, object detection, and classification methods into a two-stage structure. Initially, we utilize open-source libraries, namely OpenPose and Dlib, for accurate human keypoint detection, followed by a custom cropping logic for extracting body part boxes. In the second stage, we employ a blend of Harris Corner, Canny Edge, and skin pixel detection integrated with VGG16 and support vector machine (SVM) models. This configuration allows the bounding boxes to identify ten unique attributes, encompassing facial features and detailed aspects of clothing. Overall, the experiments yielded recognition accuracies of 81.4% for tops and 85.72% for bottoms, highlighting the efficacy of the applied methodologies in garment categorization.
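
The abstract does not spell out the custom cropping logic, but the idea of deriving a padded body-part box from detected keypoints can be sketched as follows; the keypoint names and pixel coordinates are hypothetical OpenPose-style values, not the authors' code:

```python
def crop_box(keypoints, names, margin=0.15):
    """Axis-aligned box around the named keypoints, padded by `margin`
    (a fraction of the box's width/height) on every side."""
    xs = [keypoints[n][0] for n in names]
    ys = [keypoints[n][1] for n in names]
    x0, x1, y0, y1 = min(xs), max(xs), min(ys), max(ys)
    dx, dy = (x1 - x0) * margin, (y1 - y0) * margin
    return (x0 - dx, y0 - dy, x1 + dx, y1 + dy)

# Hypothetical keypoints (pixel coordinates) for a torso crop.
kp = {"l_shoulder": (100, 80), "r_shoulder": (180, 82),
      "l_hip": (110, 200), "r_hip": (172, 198)}
box = crop_box(kp, ["l_shoulder", "r_shoulder", "l_hip", "r_hip"])
```

Each such box would then be passed to the second-stage feature detectors and the VGG16/SVM classifiers.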

BDCC, Vol. 8, Pages 34: A Comparative Study for Stock Market Forecast Based on a New Machine Learning Model

Big Data and Cognitive Computing doi: 10.3390/bdcc8040034

Authors: Enrique González-Núñez, Luis A. Trejo, Michael Kampouridis

This research aims at applying the Artificial Organic Network (AON), a nature-inspired, supervised, metaheuristic machine learning framework, to develop a new algorithm based on this machine learning class. The focus of the new algorithm is to model and predict stock markets based on the Index Tracking Problem (ITP). In this work, we present a new algorithm, based on the AON framework, that we call Artificial Halocarbon Compounds, or the AHC algorithm for short. In this study, we compare the AHC algorithm against genetic algorithms (GAs), by forecasting eight stock market indices. Additionally, we performed a cross-reference comparison against results regarding the forecast of other stock market indices based on state-of-the-art machine learning methods. The efficacy of the AHC model is evaluated by modeling each index, producing highly promising results. For instance, in the case of the IPC Mexico index, the R-square is 0.9806, with a mean relative error of 7 × 10⁻⁴. Several new features characterize our new model, mainly adaptability, dynamism and topology reconfiguration. This model can be applied to systems requiring simulation analysis using time series data, providing a versatile solution to complex problems like financial forecasting.
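
The two evaluation metrics quoted for the IPC Mexico index are standard and can be computed as follows; the series values below are invented for illustration and are not the paper's data:

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

def mean_relative_error(y_true, y_pred):
    """Average of |error| / |actual| over the series."""
    return sum(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical index levels vs. forecasts.
actual   = [100.0, 102.0, 101.5, 103.0]
forecast = [100.1, 101.9, 101.6, 102.9]
r2 = r_squared(actual, forecast)
mre = mean_relative_error(actual, forecast)
```

An R-square near 1 and a mean relative error near zero together indicate a close index-tracking fit.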

BDCC, Vol. 8, Pages 33: Cancer Detection Using a New Hybrid Method Based on Pattern Recognition in MicroRNAs Combining Particle Swarm Optimization Algorithm and Artificial Neural Network

Big Data and Cognitive Computing doi: 10.3390/bdcc8030033

Authors: Sepideh Molaei, Stefano Cirillo, Giandomenico Solimando

MicroRNAs (miRNAs) play a crucial role in cancer development, but not all miRNAs are equally significant in cancer detection. Traditional methods face challenges in effectively identifying cancer-associated miRNAs due to data complexity and volume. This study introduces a novel, feature-based technique for detecting attributes related to cancer-affecting microRNAs. It aims to enhance cancer diagnosis accuracy by identifying the most relevant miRNAs for various cancer types using a hybrid approach. In particular, we used a combination of particle swarm optimization (PSO) and artificial neural networks (ANNs) for this purpose. PSO was employed for feature selection, focusing on identifying the most informative miRNAs, while ANNs were used for recognizing patterns within the miRNA data. This hybrid method aims to overcome limitations in traditional miRNA analysis by reducing data redundancy and focusing on key genetic markers. The application of this method showed a significant improvement in the detection accuracy for various cancers, including breast and lung cancer and melanoma. Our approach demonstrated a higher precision in identifying relevant miRNAs compared to existing methods, as evidenced by the analysis of different datasets. The study concludes that the integration of PSO and ANNs provides a more efficient, cost-effective, and accurate method for cancer detection via miRNA analysis. This method can serve as a supplementary tool for cancer diagnosis and potentially aid in developing personalized cancer treatments.
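
The PSO feature-selection step can be sketched as a binary PSO over miRNA inclusion masks. A toy fitness function stands in for the ANN the authors actually use to score subsets, and the feature counts, PSO constants, and "informative" set below are all illustrative:

```python
import math
import random

random.seed(7)

N_FEATURES = 8
INFORMATIVE = {0, 1, 2}  # toy ground truth: the first three features matter

def fitness(mask):
    """Toy surrogate for classifier accuracy: reward informative features,
    penalise redundant ones (a real run would score an ANN here)."""
    gain = sum(1.0 for i, b in enumerate(mask) if b and i in INFORMATIVE)
    cost = 0.2 * sum(1 for i, b in enumerate(mask) if b and i not in INFORMATIVE)
    return gain - cost

def binary_pso(n_particles=10, n_iter=30, w=0.7, c1=1.5, c2=1.5):
    pos = [[random.randint(0, 1) for _ in range(N_FEATURES)]
           for _ in range(n_particles)]
    pos[0] = [0] * N_FEATURES  # keep one empty-mask baseline particle
    vel = [[0.0] * N_FEATURES for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_f = [fitness(p) for p in pos]
    g = max(range(n_particles), key=lambda i: pbest_f[i])
    gbest, gbest_f = pbest[g][:], pbest_f[g]
    for _ in range(n_iter):
        for i in range(n_particles):
            for d in range(N_FEATURES):
                vel[i][d] = (w * vel[i][d]
                             + c1 * random.random() * (pbest[i][d] - pos[i][d])
                             + c2 * random.random() * (gbest[d] - pos[i][d]))
                # Sigmoid transfer: velocity -> probability of selecting bit d.
                pos[i][d] = 1 if random.random() < 1 / (1 + math.exp(-vel[i][d])) else 0
            f = fitness(pos[i])
            if f > pbest_f[i]:
                pbest[i], pbest_f[i] = pos[i][:], f
                if f > gbest_f:
                    gbest, gbest_f = pos[i][:], f
    return gbest, gbest_f

best_mask, best_score = binary_pso()
```

In the hybrid method described above, the selected mask would feed only the retained miRNAs to the ANN pattern recognizer, which is what reduces data redundancy.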

BDCC, Vol. 8, Pages 32: AI-Generated Text Detector for Arabic Language Using Encoder-Based Transformer Architecture

Big Data and Cognitive Computing doi: 10.3390/bdcc8030032

Authors: Hamed Alshammari, Ahmed El-Sayed, Khaled Elleithy

The effectiveness of existing AI detectors is notably hampered when processing Arabic texts. This study introduces a novel AI text classifier designed specifically for Arabic, tackling the distinct challenges inherent in processing this language. A particular focus is placed on accurately recognizing human-written texts (HWTs), an area where existing AI detectors have demonstrated significant limitations. To achieve this goal, this paper utilized and fine-tuned two Transformer-based models, AraELECTRA and XLM-R, by training them on two distinct datasets: a large dataset comprising 43,958 examples and a custom dataset with 3078 examples, both containing HWTs and AI-generated texts (AIGTs) from various sources, including ChatGPT 3.5, ChatGPT-4, and BARD. The proposed architecture is adaptable to any language, but this work evaluates these models’ efficiency in recognizing HWTs versus AIGTs in Arabic as an example of Semitic languages. The performance of the proposed models has been compared against two prominent existing AI detectors, GPTZero and OpenAI Text Classifier, particularly on the AIRABIC benchmark dataset. The results reveal that the proposed classifiers outperform both GPTZero and OpenAI Text Classifier with 81% accuracy compared to 63% and 50% for GPTZero and OpenAI Text Classifier, respectively. Furthermore, integrating a Dediacritization Layer prior to the classification model demonstrated a significant enhancement in the detection accuracy of both HWTs and AIGTs. This dediacritization step markedly improved the classification accuracy, elevating it from 81% to as high as 99% and, in some instances, even achieving 100%.
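
The core operation of a dediacritization layer, stripping Arabic diacritical marks (harakat) before classification, can be sketched with the standard library. This is an assumption about what the layer does, not the authors' code:

```python
import unicodedata

def dediacritize(text):
    """Remove Arabic diacritics (Unicode combining marks such as fatha,
    kasra, damma, shadda, sukun) while keeping the base letters."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```

For example, a fully vocalized word like كَتَبَ reduces to its bare-letter form كتب, so the classifier sees the same surface form whether or not the input was diacritized.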

BDCC, Vol. 8, Pages 31: Machine Learning Approaches for Predicting Risk of Cardiometabolic Disease among University Students

Big Data and Cognitive Computing doi: 10.3390/bdcc8030031

Authors: Dhiaa Musleh, Ali Alkhwaja, Ibrahim Alkhwaja, Mohammed Alghamdi, Hussam Abahussain, Mohammed Albugami, Faisal Alfawaz, Said El-Ashker, Mohammed Al-Hariri

Obesity is increasingly becoming a prevalent health concern among adolescents, leading to significant risks like cardiometabolic diseases (CMDs). The early discovery and diagnosis of CMD is essential for better outcomes. This study aims to build a reliable artificial intelligence model that can predict CMD using various machine learning techniques. Support vector machines (SVMs), K-Nearest neighbor (KNN), Logistic Regression (LR), Random Forest (RF), and Gradient Boosting are five robust classifiers that are compared in this study. A novel "risk level" feature, previously unused and derived by applying fuzzy logic to the Conicity Index, is introduced to enhance the interpretability and discriminatory properties of the proposed models. As the Conicity Index scores indicate CMD risk, two separate models are developed to address each gender individually. The performance of the proposed models is assessed using two datasets obtained from 295 records of undergraduate students in Saudi Arabia. The dataset comprises 121 male and 174 female students with diverse risk levels. Notably, Logistic Regression emerges as the top performer among males, achieving an accuracy score of 91%, while Gradient Boosting lags with a score of 72%. Among females, both Support Vector Machine and Logistic Regression lead with an accuracy score of 87%, while Random Forest performs least optimally with a score of 80%.
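
A sketch of how a fuzzy "risk level" might be derived from the Conicity Index (CI = waist / (0.109 · √(weight / height)), with waist and height in metres and weight in kilograms). The triangular membership cut-offs below are illustrative assumptions, not the paper's:

```python
from math import sqrt

def conicity_index(waist_m, weight_kg, height_m):
    """Conicity Index: waist / (0.109 * sqrt(weight / height))."""
    return waist_m / (0.109 * sqrt(weight_kg / height_m))

def tri(x, a, b, c):
    """Triangular membership function rising from a, peaking at b, falling to c."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def risk_level(ci):
    """Fuzzy 'risk level': the label with the highest membership grade
    (cut-off values are illustrative, not calibrated)."""
    grades = {
        "low": tri(ci, 0.90, 1.05, 1.20),
        "medium": tri(ci, 1.10, 1.22, 1.34),
        "high": tri(ci, 1.25, 1.40, 1.60),
    }
    return max(grades, key=grades.get), grades

ci = conicity_index(0.80, 70.0, 1.75)
label, grades = risk_level(ci)
```

The resulting label (or the raw membership grades) would then be appended as an extra input feature for the five classifiers.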

BDCC, Vol. 8, Pages 30: Proposal of a Service Model for Blockchain-Based Security Tokens

Big Data and Cognitive Computing doi: 10.3390/bdcc8030030

Authors: Keundug Park, Heung-Youl Youm

The volume of the asset investment and trading market can be expanded through the issuance and management of blockchain-based security tokens that logically divide the value of assets and guarantee ownership. This paper proposes a service model to solve a problem with the existing investment service model, identifies security threats to the service model, and specifies security requirements countering the identified security threats for privacy protection and anti-money laundering (AML) involving security tokens. The identified security threats and specified security requirements should be taken into consideration when implementing the proposed service model. The proposed service model allows users to invest in tokenized tangible and intangible assets and trade in blockchain-based security tokens. This paper discusses considerations to prevent excessive regulation and market monopoly in the issuance of and trading in security tokens when implementing the proposed service model and concludes with future works.

BDCC, Vol. 8, Pages 29: The Distribution and Accessibility of Elements of Tourism in Historic and Cultural Cities

Big Data and Cognitive Computing doi: 10.3390/bdcc8030029

Authors: Wei-Ling Hsu, Yi-Jheng Chang, Lin Mou, Juan-Wen Huang, Hsin-Lung Liu

Historic urban areas are the foundations of urban development. Due to rapid urbanization, the sustainable development of historic urban areas has become challenging for many cities. Elements of tourism and tourism service facilities play an important role in the sustainable development of historic areas. This study analyzed policies related to tourism in Panguifang and Meixian districts in Meizhou, Guangdong, China. Kernel density estimation was used to study the clustering characteristics of tourism elements through point of interest (POI) data, while space syntax was used to study the accessibility of roads. In addition, the Pearson correlation coefficient and regression were used to analyze the correlation between the elements and accessibility. The results show the following: (1) the overall number of tourism elements was high on the western side of the districts and low on the eastern one, and the elements were predominantly distributed along the main transportation arteries; (2) according to the integration degree and depth value, the western side was easier to access than the eastern one; and (3) the depth value of the area negatively correlated with kernel density, while the degree of integration positively correlated with it. Based on the results, the study put forward measures for optimizing the elements of tourism in Meizhou’s historic urban area to improve cultural tourism and emphasize the importance of the elements.
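
Kernel density estimation over POI positions reduces, in one dimension, to the following sketch; the positions, units, and bandwidth are invented for illustration (the study works with two-dimensional POI coordinates):

```python
from math import exp, pi, sqrt

def kde(x, points, bandwidth=1.0):
    """Gaussian kernel density estimate at x from observed POI positions."""
    gauss = lambda u: exp(-0.5 * u * u) / sqrt(2 * pi)
    return sum(gauss((x - p) / bandwidth) for p in points) / (len(points) * bandwidth)

# Hypothetical POI positions (km along a corridor): a cluster near 2, one outlier.
pois = [1.8, 2.0, 2.1, 2.3, 7.5]
dense = kde(2.0, pois)
sparse = kde(5.0, pois)
```

High estimated density at a location corresponds to the clustering of tourism elements that the study cross-references against road accessibility.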

BDCC, Vol. 8, Pages 28: Enhancing Supervised Model Performance in Credit Risk Classification Using Sampling Strategies and Feature Ranking

Big Data and Cognitive Computing doi: 10.3390/bdcc8030028

Authors: Niwan Wattanakitrungroj, Pimchanok Wijitkajee, Saichon Jaiyen, Sunisa Sathapornvajana, Sasiporn Tongman

Credit risk assessment, correctly deciding whether or not a borrower will fail to repay a loan, is essential to the financial health of lenders and institutions. It not only helps in the approval or denial of loan applications but also aids in managing the non-performing loan (NPL) trend. In this study, experiments were conducted on a dataset provided by the LendingClub company, based in San Francisco, CA, USA, covering 2007 to 2020 and consisting of 2,925,492 records and 141 attributes. The loan status was categorized as “Good” or “Risk”. Credit risk prediction experiments were performed using three widely adopted supervised machine learning techniques: logistic regression, random forest, and gradient boosting. In addition, to address the imbalanced data problem, three sampling algorithms, under-sampling, over-sampling, and combined sampling, were employed. The results show that the gradient boosting technique achieves nearly perfect Accuracy, Precision, Recall, and F1-score values, all above 99.92%, while its MCC values exceed 99.77%. All three imbalanced-data handling approaches enhance the performance of the models trained with the three algorithms. Moreover, reducing the number of features based on mutual information revealed only slightly decreased performance with 50 features, with Accuracy values above 99.86%; with the smallest set of 25 features, the random forest model still yielded 99.15% Accuracy. Both sampling strategies and feature selection help to improve supervised models for accurately predicting credit risk, which may be beneficial in the lending business.
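
Of the three sampling algorithms, random under-sampling is the simplest to sketch: drop majority-class rows until the classes are balanced. The toy loan records below are illustrative, not LendingClub data:

```python
import random

def undersample(records, label_key="label", seed=42):
    """Randomly drop majority-class rows until every class has as many
    rows as the smallest class."""
    random.seed(seed)
    by_class = {}
    for r in records:
        by_class.setdefault(r[label_key], []).append(r)
    n_min = min(len(rows) for rows in by_class.values())
    balanced = []
    for rows in by_class.values():
        balanced.extend(random.sample(rows, n_min))
    random.shuffle(balanced)
    return balanced

# Hypothetical imbalanced loan data: many "Good" loans, few "Risk" loans.
data = [{"label": "Good"}] * 90 + [{"label": "Risk"}] * 10
balanced = undersample(data)
```

Over-sampling works in the opposite direction (duplicating or synthesizing minority rows), and combined sampling mixes the two.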

BDCC, Vol. 8, Pages 27: Temporal Dynamics of Citizen-Reported Urban Challenges: A Comprehensive Time Series Analysis

Big Data and Cognitive Computing doi: 10.3390/bdcc8030027

Authors: Andreas F. Gkontzis, Sotiris Kotsiantis, Georgios Feretzakis, Vassilios S. Verykios

In an epoch characterized by the swift pace of digitalization and urbanization, the essence of community well-being hinges on the efficacy of urban management. As cities burgeon and transform, the need for astute strategies to navigate the complexities of urban life becomes increasingly paramount. This study employs time series analysis to scrutinize citizen interactions with the coordinate-based problem mapping platform in the Municipality of Patras in Greece. The research explores the temporal dynamics of reported urban issues, with a specific focus on identifying recurring patterns through the lens of seasonality. The analysis, employing the seasonal decomposition technique, dissects time series data to expose trends in reported issues and areas of the city that might be obscured in raw big data. It accentuates a distinct seasonal pattern, with concentrations peaking during the summer months. The study extends its approach to forecasting, providing insights into the anticipated evolution of urban issues over time. Projections for the coming years show a consistent upward trend in both overall city issues and those reported in specific areas, with distinct seasonal variations. This comprehensive exploration of time series analysis and seasonality provides valuable insights for city stakeholders, enabling informed decision-making and predictions regarding future urban challenges.
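
The seasonal component of an additive decomposition can be sketched as per-position means of the series. A trend-free toy series with a summer peak stands in for the platform's report counts; a real decomposition (e.g. the seasonal decomposition technique named above) also estimates and removes a trend first:

```python
def seasonal_means(series, period):
    """Additive seasonal component: for each position within the period,
    the mean of the series at that position, centred on the overall mean.
    (Trend removal is skipped; assumes a trend-free toy series.)"""
    overall = sum(series) / len(series)
    component = []
    for k in range(period):
        vals = series[k::period]
        component.append(sum(vals) / len(vals) - overall)
    return component

# Toy monthly report counts over two years with a July peak (period 12).
monthly = [10, 11, 12, 14, 18, 25, 30, 28, 20, 15, 12, 10] * 2
seasonal = seasonal_means(monthly, 12)
```

The peak of the seasonal component falls in July, matching the summer concentration the analysis reports, and the component sums to zero by construction since it is centred on the overall mean.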