Size | Format
---|---
682.77 KB | Adobe PDF
Advisor(s)
Abstract(s)
Data cleaning remains one of the most time-consuming and critical steps in
modern data science, directly influencing the reliability and accuracy of downstream
analytics. In this paper, we present a comprehensive evaluation of five widely used data
cleaning tools—OpenRefine, Dedupe, Great Expectations, TidyData (PyJanitor), and a
baseline Pandas pipeline—applied to large-scale, messy datasets spanning three domains
(healthcare, finance, and industrial telemetry). We benchmark each tool on dataset sizes
ranging from 1 million to 100 million records, measuring execution time, memory usage,
error detection accuracy, and scalability under increasing data volumes. Additionally,
we assess qualitative aspects such as usability and ease of integration, reflecting real-world adoption concerns. We incorporate recent findings on parallelized data cleaning
and highlight how domain-specific anomalies (e.g., negative amounts in finance, sensor
corruption in industrial telemetry) can significantly impact tool choice. Our findings reveal
that no single solution excels across all metrics; while Dedupe provides robust duplicate
detection and Great Expectations offers in-depth rule-based validation, tools like TidyData
and baseline Pandas pipelines demonstrate strong scalability and flexibility under chunk-based ingestion. The choice of tool ultimately depends on domain-specific requirements
(e.g., approximate matching in finance and strict auditing in healthcare) and on the
computational resources available. By highlighting each framework’s strengths and
limitations, this study offers data practitioners clear, evidence-driven guidance for selecting
and combining tools to tackle large-scale data cleaning challenges.
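As an illustration of the chunk-based ingestion and the measurements described above, the following is a minimal sketch of a baseline Pandas pipeline that cleans a CSV in chunks while recording execution time and peak memory. It is not the authors' benchmark code: the column names (`id`, `amount`), the sample data, the rule that negative amounts are dropped, and the helper names `clean_in_chunks` and `benchmark` are all assumptions made for this example.

```python
import io
import time
import tracemalloc

import pandas as pd

# Hypothetical sample data; column names and values are assumptions.
SAMPLE_CSV = """id,amount
1,10.50
2,-3.00
1,10.50
3,
4,7.25
"""


def clean_in_chunks(source, chunksize=2):
    """Clean a CSV chunk by chunk; return (cleaned DataFrame, counters)."""
    seen_ids = set()  # ids already kept, used to deduplicate across chunks
    kept = []
    stats = {"rows_in": 0, "missing": 0, "negative": 0, "duplicates": 0}

    for chunk in pd.read_csv(source, chunksize=chunksize):
        stats["rows_in"] += len(chunk)

        # Drop rows whose amount is missing.
        missing = chunk["amount"].isna()
        stats["missing"] += int(missing.sum())
        chunk = chunk[~missing]

        # Assumed domain rule: a negative amount is treated as an anomaly.
        negative = chunk["amount"] < 0
        stats["negative"] += int(negative.sum())
        chunk = chunk[~negative]

        # Exact-match deduplication on id, within and across chunks.
        dup = chunk["id"].isin(seen_ids) | chunk["id"].duplicated()
        stats["duplicates"] += int(dup.sum())
        chunk = chunk[~dup]

        seen_ids.update(chunk["id"].tolist())
        if not chunk.empty:
            kept.append(chunk)

    if kept:
        cleaned = pd.concat(kept, ignore_index=True)
    else:
        cleaned = pd.DataFrame(columns=["id", "amount"])
    return cleaned, stats


def benchmark(func, *args, **kwargs):
    """Run func and return (result, elapsed seconds, peak traced bytes)."""
    tracemalloc.start()
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, elapsed, peak


if __name__ == "__main__":
    (cleaned, stats), elapsed, peak = benchmark(
        clean_in_chunks, io.StringIO(SAMPLE_CSV), chunksize=2
    )
    print(cleaned)
    print(stats, f"{elapsed:.4f} s", f"{peak} bytes peak")
```

Keeping only the set of ids already seen, rather than the full data, is what lets memory stay bounded by the chunk size plus the retained rows. The deduplication here is exact matching only; approximate matching of the kind Dedupe provides is outside the scope of this sketch.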
Description
Keywords
data cleaning; large-scale benchmarking; duplicate detection; data validation; healthcare; finance
Citation
Martins, P., Cardoso, F., Váz, P., Silva, J., & Abbasi, M. (2025). Performance and Scalability of Data Cleaning and Preprocessing Tools: A Benchmark on Large Real-World Datasets. Data, 10(5), 68. https://doi.org/10.3390/data10050068
Publisher
MDPI