Publication

Performance and Scalability of Data Cleaning and Preprocessing Tools: A Benchmark on Large Real-World Datasets

dc.contributor.author: Martins, Pedro
dc.contributor.author: Cardoso, Filipe
dc.contributor.author: Vaz, Paulo
dc.contributor.author: Silva, José
dc.contributor.author: Abbasi, Maryam
dc.date.accessioned: 2025-05-21T15:17:07Z
dc.date.available: 2025-05-21T15:17:07Z
dc.date.issued: 2025-05-18
dc.description.abstract: Data cleaning remains one of the most time-consuming and critical steps in modern data science, directly influencing the reliability and accuracy of downstream analytics. In this paper, we present a comprehensive evaluation of five widely used data cleaning tools—OpenRefine, Dedupe, Great Expectations, TidyData (PyJanitor), and a baseline Pandas pipeline—applied to large-scale, messy datasets spanning three domains (healthcare, finance, and industrial telemetry). We benchmark each tool on dataset sizes ranging from 1 million to 100 million records, measuring execution time, memory usage, error detection accuracy, and scalability under increasing data volumes. Additionally, we assess qualitative aspects such as usability and ease of integration, reflecting real-world adoption concerns. We incorporate recent findings on parallelized data cleaning and highlight how domain-specific anomalies (e.g., negative amounts in finance, sensor corruption in industrial telemetry) can significantly impact tool choice. Our findings reveal that no single solution excels across all metrics; while Dedupe provides robust duplicate detection and Great Expectations offers in-depth rule-based validation, tools like TidyData and baseline Pandas pipelines demonstrate strong scalability and flexibility under chunk-based ingestion. The choice of tool ultimately depends on domain-specific requirements (e.g., approximate matching in finance and strict auditing in healthcare) and the magnitude of available computational resources. By highlighting each framework’s strengths and limitations, this study offers data practitioners clear, evidence-driven guidance for selecting and combining tools to tackle large-scale data cleaning challenges.
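The abstract refers to chunk-based ingestion with a baseline Pandas pipeline and to domain-specific anomalies such as negative amounts in finance data. The snippet below is a minimal sketch of that pattern, not code taken from the paper; the input file name, the column names (amount, transaction_id), the cleaning rules, and the chunk size are illustrative assumptions.

```python
# Minimal sketch of chunk-based cleaning with a baseline Pandas pipeline.
# Not from the paper: the input file, column names, and rules are assumed
# here for illustration only.
import pandas as pd

CHUNK_SIZE = 1_000_000  # records per chunk; the benchmark spans 1M-100M records

cleaned_chunks = []
for chunk in pd.read_csv("transactions.csv", chunksize=CHUNK_SIZE):
    # Remove exact duplicate rows within the chunk.
    chunk = chunk.drop_duplicates()
    # Domain-specific rule (finance example from the abstract): drop negative amounts.
    chunk = chunk[chunk["amount"] >= 0]
    # Discard records missing an identifier, a simple completeness check.
    chunk = chunk.dropna(subset=["transaction_id"])
    cleaned_chunks.append(chunk)

# Reassemble and persist the cleaned dataset.
cleaned = pd.concat(cleaned_chunks, ignore_index=True)
cleaned.to_csv("transactions_clean.csv", index=False)
```

Processing one chunk at a time keeps peak memory roughly proportional to the chunk size rather than to the full dataset, which is the property the abstract credits for the strong scalability of chunk-based Pandas pipelines.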
dc.identifier.citation: Martins, P., Cardoso, F., Váz, P., Silva, J., & Abbasi, M. (2025). Performance and Scalability of Data Cleaning and Preprocessing Tools: A Benchmark on Large Real-World Datasets. Data, 10(5), 68. https://doi.org/10.3390/data10050068
dc.identifier.doi: https://doi.org/10.3390/data10050068
dc.identifier.eissn: 2306-5729
dc.identifier.uri: http://hdl.handle.net/10400.19/9342
dc.language.iso: eng
dc.peerreviewed: yes
dc.publisher: MDPI
dc.relation.hasversion: https://www.mdpi.com/2306-5729/10/5/68
dc.rights.uri: http://creativecommons.org/licenses/by/4.0/
dc.subject: data cleaning
dc.subject: large-scale benchmarking
dc.subject: duplicate detection
dc.subject: data validation
dc.subject: healthcare
dc.subject: finance
dc.title: Performance and Scalability of Data Cleaning and Preprocessing Tools: A Benchmark on Large Real-World Datasets
dc.type: text
dspace.entity.type: Publication
oaire.citation.issue: 5
oaire.citation.startPage: 68
oaire.citation.title: Data
oaire.citation.volume: 10
oaire.version: http://purl.org/coar/version/c_970fb48d4fbd8a85

Files

Original bundle
Name: Performance and Scalability of Data.pdf
Size: 682.77 KB
Format: Adobe Portable Document Format
License bundle
Name: license.txt
Size: 1.79 KB
Format: Item-specific license agreed upon to submission