Publication

Performance and Scalability of Data Cleaning and Preprocessing Tools: A Benchmark on Large Real-World Datasets

dc.contributor.author: Martins, Pedro
dc.contributor.author: Cardoso, Filipe
dc.contributor.author: Vaz, Paulo
dc.contributor.author: Silva, José
dc.contributor.author: Abbasi, Maryam
dc.date.accessioned: 2025-05-21T15:17:07Z
dc.date.available: 2025-05-21T15:17:07Z
dc.date.issued: 2025-05-18
dc.description.abstract: Data cleaning remains one of the most time-consuming and critical steps in modern data science, directly influencing the reliability and accuracy of downstream analytics. In this paper, we present a comprehensive evaluation of five widely used data cleaning tools—OpenRefine, Dedupe, Great Expectations, TidyData (PyJanitor), and a baseline Pandas pipeline—applied to large-scale, messy datasets spanning three domains (healthcare, finance, and industrial telemetry). We benchmark each tool on dataset sizes ranging from 1 million to 100 million records, measuring execution time, memory usage, error detection accuracy, and scalability under increasing data volumes. Additionally, we assess qualitative aspects such as usability and ease of integration, reflecting real-world adoption concerns. We incorporate recent findings on parallelized data cleaning and highlight how domain-specific anomalies (e.g., negative amounts in finance, sensor corruption in industrial telemetry) can significantly impact tool choice. Our findings reveal that no single solution excels across all metrics; while Dedupe provides robust duplicate detection and Great Expectations offers in-depth rule-based validation, tools like TidyData and baseline Pandas pipelines demonstrate strong scalability and flexibility under chunk-based ingestion. The choice of tool ultimately depends on domain-specific requirements (e.g., approximate matching in finance and strict auditing in healthcare) and the magnitude of available computational resources. By highlighting each framework’s strengths and limitations, this study offers data practitioners clear, evidence-driven guidance for selecting and combining tools to tackle large-scale data cleaning challenges.
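The abstract refers to chunk-based ingestion with a baseline Pandas pipeline and to domain-specific anomalies such as negative amounts in finance data. The snippet below is a minimal sketch of that pattern, not code taken from the paper; the input file name, the column names (amount, transaction_id), the cleaning rules, and the chunk size are illustrative assumptions.

```python
# Minimal sketch of chunk-based cleaning with a baseline Pandas pipeline.
# Not from the paper: the input file, column names, and rules are assumed
# here for illustration only.
import pandas as pd

CHUNK_SIZE = 1_000_000  # records per chunk; the benchmark spans 1M-100M records

cleaned_chunks = []
for chunk in pd.read_csv("transactions.csv", chunksize=CHUNK_SIZE):
    # Remove exact duplicate rows within the chunk.
    chunk = chunk.drop_duplicates()
    # Domain-specific rule (finance example from the abstract): drop negative amounts.
    chunk = chunk[chunk["amount"] >= 0]
    # Discard records missing an identifier, a simple completeness check.
    chunk = chunk.dropna(subset=["transaction_id"])
    cleaned_chunks.append(chunk)

# Reassemble and persist the cleaned dataset.
cleaned = pd.concat(cleaned_chunks, ignore_index=True)
cleaned.to_csv("transactions_clean.csv", index=False)
```

Processing one chunk at a time keeps peak memory roughly proportional to the chunk size rather than to the full dataset, which is the property the abstract credits for the strong scalability of chunk-based Pandas pipelines.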
dc.identifier.citation: Martins, P., Cardoso, F., Váz, P., Silva, J., & Abbasi, M. (2025). Performance and Scalability of Data Cleaning and Preprocessing Tools: A Benchmark on Large Real-World Datasets. Data, 10(5), 68. https://doi.org/10.3390/data10050068
dc.identifier.doi: https://doi.org/10.3390/data10050068
dc.identifier.eissn: 2306-5729
dc.identifier.uri: http://hdl.handle.net/10400.19/9342
dc.language.iso: eng
dc.peerreviewed: yes
dc.publisher: MDPI
dc.relation.hasversion: https://www.mdpi.com/2306-5729/10/5/68
dc.rights.uri: http://creativecommons.org/licenses/by/4.0/
dc.subject: data cleaning
dc.subject: large-scale benchmarking
dc.subject: duplicate detection
dc.subject: data validation
dc.subject: healthcare
dc.subject: finance
dc.title: Performance and Scalability of Data Cleaning and Preprocessing Tools: A Benchmark on Large Real-World Datasets
dc.type: text
dspace.entity.type: Publication
oaire.citation.issue: 5
oaire.citation.startPage: 68
oaire.citation.title: Data
oaire.citation.volume: 10
oaire.version: http://purl.org/coar/version/c_970fb48d4fbd8a85

Files

Original bundle
Name: Performance and Scalability of Data.pdf
Size: 682.77 KB
Format: Adobe Portable Document Format
License bundle
Name: license.txt
Size: 1.79 KB
Format: Item-specific license agreed upon to submission