ESTGV - DEMGI - Artigo em revista científica, indexada ao WoS/Scopus
Recent Submissions
- Performance and Scalability of Data Cleaning and Preprocessing Tools: A Benchmark on Large Real-World Datasets
  Publication . Martins, Pedro; Cardoso, Filipe; Vaz, Paulo; Silva, José; Abbasi, Maryam
  Data cleaning remains one of the most time-consuming and critical steps in modern data science, directly influencing the reliability and accuracy of downstream analytics. In this paper, we present a comprehensive evaluation of five widely used data cleaning tools, namely OpenRefine, Dedupe, Great Expectations, TidyData (PyJanitor), and a baseline Pandas pipeline, applied to large-scale, messy datasets spanning three domains (healthcare, finance, and industrial telemetry). We benchmark each tool on dataset sizes ranging from 1 million to 100 million records, measuring execution time, memory usage, error detection accuracy, and scalability under increasing data volumes. Additionally, we assess qualitative aspects such as usability and ease of integration, reflecting real-world adoption concerns. We incorporate recent findings on parallelized data cleaning and highlight how domain-specific anomalies (e.g., negative amounts in finance, sensor corruption in industrial telemetry) can significantly impact tool choice. Our findings reveal that no single solution excels across all metrics; while Dedupe provides robust duplicate detection and Great Expectations offers in-depth rule-based validation, tools like TidyData and baseline Pandas pipelines demonstrate strong scalability and flexibility under chunk-based ingestion. The choice of tool ultimately depends on domain-specific requirements (e.g., approximate matching in finance and strict auditing in healthcare) and the magnitude of available computational resources. By highlighting each framework’s strengths and limitations, this study offers data practitioners clear, evidence-driven guidance for selecting and combining tools to tackle large-scale data cleaning challenges.
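
The chunk-based ingestion mentioned in the abstract above can be illustrated with a minimal sketch. It is not the benchmark's pipeline: the column names `id` and `amount`, the chunk size, and the helper names are hypothetical, and the standard library stands in for Pandas so the example stays self-contained. It reads a CSV stream in fixed-size chunks, drops rows with missing or negative amounts (the kind of finance anomaly the abstract cites), and removes duplicate ids across chunks.

```python
import csv
import io
from itertools import islice


def iter_chunks(reader, chunk_size):
    """Yield lists of at most chunk_size rows from a csv.DictReader."""
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            return
        yield chunk


def clean_chunk(rows, seen_ids):
    """Keep rows with a valid, non-negative amount and an id not seen before."""
    kept = []
    for row in rows:
        row_id = row["id"].strip()
        try:
            amount = float(row["amount"])
        except (TypeError, ValueError):
            continue  # missing or unparsable amount
        if amount < 0 or row_id in seen_ids:
            continue  # negative amount or duplicate id
        seen_ids.add(row_id)
        kept.append({"id": row_id, "amount": amount})
    return kept


def clean_stream(handle, chunk_size=2):
    """Clean a CSV stream chunk by chunk; only one chunk of raw rows is held at a time."""
    reader = csv.DictReader(handle)
    seen_ids = set()  # shared across chunks so duplicates are caught globally
    cleaned = []
    for chunk in iter_chunks(reader, chunk_size):
        cleaned.extend(clean_chunk(chunk, seen_ids))
    return cleaned


# Hypothetical sample: one negative amount, one duplicate id, one missing amount.
SAMPLE = "id,amount\n1,10.5\n2,-3.0\n1,10.5\n3,\n4,7\n"
result = clean_stream(io.StringIO(SAMPLE))
```

In a real pipeline the cleaned rows would normally be written out per chunk instead of accumulated in memory, and the set of seen ids would itself need a bounded or disk-backed structure at the 100-million-record scale.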
- Adaptive and Scalable Database Management with Machine Learning Integration: A PostgreSQL Case Study
  Publication . Abbasi, Maryam; Bernardo, Marco V.; Vaz, Paulo; Silva, José; Martins, Pedro; ANTUNES VAZ, PAULO JOAQUIM; Silva, José
  The increasing complexity of managing modern database systems, particularly in terms of optimizing query performance for large datasets, presents significant challenges that traditional methods often fail to address. This paper proposes a comprehensive framework for integrating advanced machine learning (ML) models within the architecture of a database management system (DBMS), with a specific focus on PostgreSQL. Our approach leverages a combination of supervised and unsupervised learning techniques to predict query execution times, optimize performance, and dynamically manage workloads. Unlike existing solutions that address specific optimization tasks in isolation, our framework provides a unified platform that supports real-time model inference and automatic database configuration adjustments based on workload patterns. A key contribution of our work is the integration of ML capabilities directly into the DBMS engine, enabling seamless interaction between the ML models and the query optimization process. This integration allows for the automatic retraining of models and dynamic workload management, resulting in substantial improvements in both query response times and overall system throughput. Our evaluations using the Transaction Processing Performance Council Decision Support (TPC-DS) benchmark dataset at scale factors of 100 GB, 1 TB, and 10 TB demonstrate a reduction of up to 42% in query execution times and a 74% improvement in throughput compared with traditional approaches. Additionally, we address challenges such as potential conflicts in tuning recommendations and the performance overhead associated with ML integration, providing insights for future research directions.
  This study is motivated by the need for autonomous tuning mechanisms to manage large-scale, heterogeneous workloads while answering key research questions, such as the following: (1) How can machine learning models be integrated into a DBMS to improve query optimization and workload management? (2) What performance improvements can be achieved through dynamic configuration tuning based on real-time workload patterns? Our results suggest that the proposed framework significantly reduces the need for manual database administration while effectively adapting to evolving workloads, offering a robust solution for modern large-scale data environments.
- Machine Learning Approaches for Predicting Maize Biomass Yield: Leveraging Feature Engineering and Comprehensive Data Integration
  Publication . Abbasi, Maryam; Vaz, Paulo; Silva, José; Martins, Pedro; Silva, José; ANTUNES VAZ, PAULO JOAQUIM
  The efficient prediction of corn biomass yield is critical for optimizing crop production and addressing global challenges in sustainable agriculture and renewable energy. This study employs advanced machine learning techniques, including Gradient Boosting Machines (GBMs), Random Forests (RFs), Support Vector Machines (SVMs), and Artificial Neural Networks (ANNs), integrated with comprehensive environmental, soil, and crop management data from key agricultural regions in the United States. A novel framework combines feature engineering, such as the creation of a Soil Fertility Index (SFI) and Growing Degree Days (GDDs), and the incorporation of interaction terms to address complex non-linear relationships between input variables and biomass yield. We conduct extensive sensitivity analysis and employ SHAP (SHapley Additive exPlanations) values to enhance model interpretability, identifying SFI, GDDs, and cumulative rainfall as the most influential features driving yield outcomes. Our findings highlight significant synergies among these variables, emphasizing their critical role in rural environmental governance and precision agriculture. Furthermore, an ensemble approach combining GBMs, RFs, and ANNs outperformed individual models, achieving an RMSE of 0.80 t/ha and R² of 0.89. These results underscore the potential of hybrid modeling for real-world applications in sustainable farming practices. Addressing the concerns of passive farmer participation, we propose targeted incentives, education, and institutional support mechanisms to enhance stakeholder collaboration in rural environmental governance.
  While the models assume rational decision-making, the inclusion of cultural and political factors warrants further investigation to improve the robustness of the framework. Additionally, a map of the study region and improved visualizations of feature importance enhance the clarity and relevance of our findings. This research contributes to the growing body of knowledge on predictive modeling in agriculture, combining theoretical rigor with practical insights to support policymakers and stakeholders in optimizing resource use and addressing environmental challenges. By improving the interpretability and applicability of machine learning models, this study provides actionable strategies for enhancing crop yield predictions and advancing rural environmental governance.
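
The Growing Degree Days feature named in the abstract above is commonly computed as the daily mean temperature minus a crop-specific base temperature, floored at zero and summed over the season. The sketch below assumes that simple averaging method and a base of 10 °C, a common convention for maize; the abstract does not state which variant or base temperature the study used, and the function names are hypothetical.

```python
def daily_gdd(t_max, t_min, t_base=10.0):
    """Growing degree days for one day (simple averaging method, degrees Celsius)."""
    return max(0.0, (t_max + t_min) / 2.0 - t_base)


def cumulative_gdd(daily_temps, t_base=10.0):
    """Sum daily GDD over an iterable of (t_max, t_min) pairs."""
    return sum(daily_gdd(t_max, t_min, t_base) for t_max, t_min in daily_temps)


# Hypothetical three-day record: warm, cold (contributes nothing), hot.
temps = [(25.0, 15.0), (12.0, 4.0), (30.0, 20.0)]
total = cumulative_gdd(temps)
```

Some variants also cap the daily maximum at an upper threshold before averaging; that refinement is omitted here.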
- Data Privacy and Ethical Considerations in Database Management
  Publication . Pina, Eduardo; Ramos, José; Jorge, Henrique; ANTUNES VAZ, PAULO JOAQUIM; Vaz, Paulo; Silva, José; Wanzeller, Cristina; Abbasi, Maryam; Martins, Pedro; Silva, José; Wanzeller Guedes de Lacerda, Ana Cristina
  Data privacy and ethical considerations ensure the security of databases by respecting individual rights while upholding ethical considerations when collecting, managing, and using information. Nowadays, despite having regulations that help to protect citizens and organizations, we have been presented with thousands of instances of data breaches, unauthorized access, and misuse of data related to such individuals and organizations. In this paper, we propose ethical considerations and best practices associated with critical data and the role of the database administrator who helps protect data. First, we suggest best practices for database administrators regarding data minimization, anonymization, pseudonymization and encryption, access controls, data retention guidelines, and stakeholder communication. Then, we present a case study that illustrates the application of these ethical implementations and best practices in a real-world scenario, showing the approach in action and the benefits of privacy. Finally, the study highlights the importance of a comprehensive approach to deal with data protection challenges and provides valuable insights for future research and developments in this field.
- Enhancing Visual Perception in Immersive VR and AR Environments: AI-Driven Color and Clarity Adjustments Under Dynamic Lighting Conditions
  Publication . Abbasi, Maryam; Silva, José; Martins, Pedro; ANTUNES VAZ, PAULO JOAQUIM; Silva, José
  The visual fidelity of virtual reality (VR) and augmented reality (AR) environments is essential for user immersion and comfort. Dynamic lighting often leads to chromatic distortions and reduced clarity, causing discomfort and disrupting user experience. This paper introduces an AI-driven chromatic adjustment system based on a modified U-Net architecture, optimized for real-time applications in VR/AR. This system adapts to dynamic lighting conditions, addressing the shortcomings of traditional methods like histogram equalization and gamma correction, which struggle with rapid lighting changes and real-time user interactions. We compared our approach with state-of-the-art color constancy algorithms, including Barron’s Convolutional Color Constancy and STAR, demonstrating superior performance. Experimental results from 60 participants show significant improvements, with up to 41% better color accuracy and 39% enhanced clarity under dynamic lighting conditions. The study also included eye-tracking data, which confirmed increased user engagement with AI-enhanced images. Our system provides a practical solution for developers aiming to improve image quality, reduce visual discomfort, and enhance overall user satisfaction in immersive environments. Future work will focus on extending the model’s capability to handle more complex lighting scenarios.
- Real-Time Gesture-Based Hand Landmark Detection for Optimized Mobile Photo Capture and Synchronization
  Publication . Marques, Pedro; ANTUNES VAZ, PAULO JOAQUIM; Silva, José; Martins, Pedro; Abbasi, Maryam
  Gesture recognition technology has emerged as a transformative solution for natural and intuitive human–computer interaction (HCI), offering touch-free operation across diverse fields such as healthcare, gaming, and smart home systems. In mobile contexts, where hygiene, convenience, and the ability to operate under resource constraints are critical, hand gesture recognition provides a compelling alternative to traditional touch-based interfaces. However, implementing effective gesture recognition in real-world mobile settings involves challenges such as limited computational power, varying environmental conditions, and the requirement for robust offline–online data management. In this study, we introduce ThumbsUp, which is a gesture-driven system, and employ a partially systematic literature review approach (inspired by core PRISMA guidelines) to identify the key research gaps in mobile gesture recognition. By incorporating insights from deep learning–based methods (e.g., CNNs and Transformers) while focusing on low resource consumption, we leverage Google’s MediaPipe in our framework for real-time detection of 21 hand landmarks and adaptive lighting pre-processing, enabling accurate recognition of a “thumbs-up” gesture. The system features a secure queue-based offline–cloud synchronization model, which ensures that the captured images and metadata (encrypted with AES-GCM) remain consistent and accessible even with intermittent connectivity. Experimental results under dynamic lighting, distance variations, and partially cluttered environments confirm the system’s superior low-light performance and decreased resource consumption compared to baseline camera applications.
  Additionally, we highlight the feasibility of extending ThumbsUp to incorporate AI-driven enhancements for abrupt lighting changes and, in the future, electromyographic (EMG) signals for users with motor impairments. Our comprehensive evaluation demonstrates that ThumbsUp maintains robust performance on typical mobile hardware, showing resilience to unstable network conditions and minimal reliance on high-end GPUs. These findings offer new perspectives for deploying gesture-based interfaces in the broader IoT ecosystem, thus paving the way toward secure, efficient, and inclusive mobile HCI solutions.
- Head-to-Head Evaluation of FDM and SLA in Additive Manufacturing: Performance, Cost, and Environmental Perspectives
  Publication . Abbasi, Maryam; ANTUNES VAZ, PAULO JOAQUIM; Martins, Pedro; Silva, José
  This paper conducts a comprehensive experimental comparison of two widely used additive manufacturing (AM) processes, Fused Deposition Modeling (FDM) and Stereolithography (SLA), under standardized conditions using the same test geometries and protocols. FDM parts were printed with both Polylactic Acid (PLA) and Acrylonitrile Butadiene Styrene (ABS) filaments, while SLA used a general-purpose photopolymer resin. Quantitative evaluations included surface roughness, dimensional accuracy, tensile properties, production cost, and energy consumption. Additionally, environmental considerations and process reliability were assessed by examining waste streams, recyclability, and failure rates. The results indicate that SLA achieves superior surface quality (Ra ≈ 2 µm vs. 12–13 µm) and dimensional tolerances (±0.05 mm vs. ±0.15–0.20 mm), along with higher tensile strength (up to 70 MPa). However, FDM provides notable advantages in cost (approximately 60% lower on a per-part basis), production speed, and energy efficiency. Moreover, from an environmental perspective, FDM is more favorable when using biodegradable PLA or recyclable ABS, whereas SLA resin waste is hazardous. Overall, the study highlights that no single process is universally superior. FDM offers a rapid, cost-effective solution for prototyping, while SLA excels in precision and surface finish. By presenting a detailed, data-driven comparison, this work guides engineers, product designers, and researchers in choosing the most suitable AM technology for their specific needs.
- Comprehensive Evaluation of Deepfake Detection Models: Accuracy, Generalization, and Resilience to Adversarial Attacks
  Publication . Abbasi, Maryam; ANTUNES VAZ, PAULO JOAQUIM; Silva, José; Martins, Pedro
  The rise of deepfakes (synthetic media generated using artificial intelligence) threatens digital content authenticity, facilitating misinformation and manipulation. However, deepfakes can also depict real or entirely fictitious individuals, leveraging state-of-the-art techniques such as generative adversarial networks (GANs) and emerging diffusion-based models. Existing detection methods face challenges with generalization across datasets and vulnerability to adversarial attacks. This study focuses on subsets of frames extracted from the DeepFake Detection Challenge (DFDC) and FaceForensics++ videos to evaluate three convolutional neural network architectures (XCeption, ResNet, and VGG16) for deepfake detection. Performance metrics include accuracy, precision, F1-score, AUC-ROC, and Matthews Correlation Coefficient (MCC), combined with an assessment of resilience to adversarial perturbations via the Fast Gradient Sign Method (FGSM). Among the tested models, XCeption achieves the highest accuracy (89.2% on DFDC), strong generalization, and real-time suitability, while VGG16 excels in precision and ResNet provides faster inference. However, all models exhibit reduced performance under adversarial conditions, underscoring the need for enhanced resilience. These findings indicate that robust detection systems must consider advanced generative approaches, adversarial defenses, and cross-dataset adaptation to effectively counter evolving deepfake threats.
- A Simulation of Data Censored Rigth Type I with Weibull Distribution
  Publication . Gaspar, Daniel; Andrande Ferreira, Luis
  In the maintenance and reliability field, analyses frequently involve censored data. In reliability research, many articles run simulations, but few explain how they do it. The loss of information resulting from unavailable exact failure times negatively impacts the efficiency of reliability analysis. This paper presents four different algorithms to generate random data with a different number of censored values. The four algorithms are compared, and three parameters are used to select the best one. The Weibull distribution is used to generate the random numbers because it is one of the most used in reliability studies. The results of the chosen algorithm are very relevant; with a sample of n = 50 and a number of simulation cycles m = 1000, the standard deviation is higher when the shape factor of the Weibull distribution is beta = 0.5 and slowly decreases until the shape factor equals 5. The percentage error (PE), one of the indicators selected, is much higher when the percentage of censored data is c = 5%, then goes down when the shape factor increases. The behaviour is different when the percentage of censored data is c = 20%: the percentage error (PE) is higher when the shape factor is beta = 1.5. This article presents the algorithm it considers best for simulating type-I right-censored data. The algorithm offers excellent accuracy, i.i.d. random data, and excellent computational performance.
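
For readers unfamiliar with type-I right censoring, the following is a minimal sketch of one straightforward way to generate such data. It is not one of the four algorithms compared in the paper, which the abstract does not describe. It assumes a fixed censoring time placed at the Weibull quantile where the survival function equals the target censored fraction c, so the expected (not exact) share of censored values is c. The function names and the scale value are hypothetical.

```python
import math
import random


def censoring_time(scale, shape, censored_fraction):
    """Time t_c at which the Weibull survival function equals censored_fraction.

    S(t) = exp(-(t / scale) ** shape) = c  =>  t_c = scale * (-ln c) ** (1 / shape)
    """
    return scale * (-math.log(censored_fraction)) ** (1.0 / shape)


def simulate_type1(n, scale, shape, censored_fraction, rng):
    """Return (t_c, sample), where sample is a list of (time, observed) pairs.

    Failure times beyond t_c are recorded as t_c with observed=False.
    """
    t_c = censoring_time(scale, shape, censored_fraction)
    sample = []
    for _ in range(n):
        t = rng.weibullvariate(scale, shape)  # arguments are (scale, shape)
        if t > t_c:
            sample.append((t_c, False))  # right-censored at the fixed time
        else:
            sample.append((t, True))  # exact failure time observed
    return t_c, sample


rng = random.Random(42)  # seeded for reproducibility
t_c, sample = simulate_type1(n=50, scale=100.0, shape=1.5,
                             censored_fraction=0.20, rng=rng)
n_censored = sum(1 for _, observed in sample if not observed)
```

Because the censoring time is fixed in advance, the number of censored values in any one sample of 50 is random and only averages 20% over many simulation cycles.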
- Reliability Estimation Using EM Algorithm with Censored Data: A Case Study on Centrifugal Pumps in an Oil Refinery
  Publication . Silva, José; VAZ, PAULO; Martins, Pedro; Ferreira, Luís
  Centrifugal pumps are widely employed in the oil refinery industry due to their efficiency and effectiveness in fluid transfer applications. The reliability of pumps plays a pivotal role in ensuring uninterrupted plant productivity and safe operations. Analysis of failure history data shows that bearings have been identified as critical components in oil refinery pump groups. Analyzing historical failure data for such systems is a complex task due to censored data and missing information. This paper addresses the complexity of estimating the Weibull distribution parameters using the maximum likelihood method under these conditions. The likelihood equation lacks an explicit analytical solution, necessitating numerical methods for resolution. The proposed approach presented in this article leverages the expectation maximization (EM) algorithm for estimating the Weibull distribution parameters in a real-world case study of a complex engineering system. The results demonstrate the superior performance of the EM algorithm with censored data, showcasing its ability to overcome the limitations of traditional methods and provide more accurate estimates for reliability metrics. This highlights the importance of obtaining results through these methodologies in the analysis of reliability and in facilitating more informed decision making in complex systems.
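
To make concrete why the likelihood equation has no closed-form solution, the standard log-likelihood for a two-parameter Weibull sample with right-censored observations can be written out. The notation is an assumption, not taken from the paper: shape β, scale η, times t_i, and an indicator δ_i equal to 1 for an observed failure and 0 for a censored time, with r the number of observed failures.

```latex
\ell(\beta,\eta)
  = \sum_{i=1}^{n} \delta_i \left[ \ln\beta - \beta\ln\eta + (\beta-1)\ln t_i \right]
    - \sum_{i=1}^{n} \left( \frac{t_i}{\eta} \right)^{\beta},
\qquad r = \sum_{i=1}^{n} \delta_i .

% Setting the partial derivatives to zero gives
\hat\eta^{\,\hat\beta} = \frac{1}{r} \sum_{i=1}^{n} t_i^{\hat\beta},
\qquad
\frac{1}{\hat\beta} + \frac{1}{r} \sum_{i=1}^{n} \delta_i \ln t_i
  - \frac{\sum_{i=1}^{n} t_i^{\hat\beta} \ln t_i}{\sum_{i=1}^{n} t_i^{\hat\beta}} = 0 .
```

The second equation is nonlinear in β and must be solved numerically; the EM algorithm described in the abstract is one iterative route to that solution.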