Streamlit façade
Provides upload or bundled messy demos, sliders for thresholds, deterministic cleaning toggles, previews, downloadable cleaned files, and the HTML artifact.
I ship a practical upload → detect → clean → export loop for messy operational tables. Issue detection covers missing values, duplicates, parse failures, categorical drift, and IQR outliers (reported as flagged statistics, never removed automatically), and notebooks replay the offline path. Exported artifacts and a pytest suite cover the repeatable core.
Static recruiter page only. Hands-on uploads and downloads use the hosted Streamlit instance linked above.
Motivation. Tabular extracts from ERPs, finance exports, and support CSVs routinely mix missing fields, bogus dates, drifting category strings, duplicated keys, and half-parsed numeric text. Analysts rework the same transformations by hand; the goal here is deterministic, repeatable cleaning with receipts.
This repository pairs a Streamlit reviewer with reusable Python helpers so the same transformations can ship in demos, notebooks, or future automation without copying ad hoc notebook cells everywhere.
Streamlit stitches the modules together; the notebooks replay the same offline path.
Merges uploads, summarizes missing-value ratios, flags values that fail format parsing, spots inconsistent categorical tokens, checks for duplicates (optionally on a subset of key columns), and surfaces IQR outlier counts in reporting without silently removing them.
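The detection checks described here can be sketched with the standard library alone. This is a minimal illustration, not the repository's code: the function names (`missing_ratio`, `duplicate_count`, `iqr_outlier_count`), the list-of-dicts row format, and the conventional 1.5 × IQR fence are all assumptions made for the example.

```python
from statistics import quantiles


def is_missing(value):
    """Treat None and blank/whitespace-only text as missing."""
    return value is None or str(value).strip() == ""


def missing_ratio(rows, column):
    """Share of rows in which `column` is absent or missing."""
    if not rows:
        return 0.0
    return sum(is_missing(r.get(column)) for r in rows) / len(rows)


def duplicate_count(rows, key_columns=None):
    """Count rows that repeat an earlier row on `key_columns`.

    With key_columns=None, every column of the row is compared.
    """
    seen = set()
    dupes = 0
    for row in rows:
        cols = key_columns if key_columns is not None else sorted(row)
        key = tuple((c, row.get(c)) for c in cols)
        if key in seen:
            dupes += 1
        else:
            seen.add(key)
    return dupes


def iqr_outlier_count(values, k=1.5):
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR]; nothing is dropped."""
    if len(values) < 2:
        return 0
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return sum(1 for v in values if v < low or v > high)
```

Each function only returns a count or ratio, which matches the report-only stance on outliers: the caller decides what, if anything, to remove.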
Configurable duplicate removal, categorical normalization (case folding and whitespace trimming), numeric coercion, date parsing, and explicit missing-row strategies (leave, drop with warning, deterministic fill).
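A rough sketch of what these deterministic cleaning steps might look like follows. Everything named here is hypothetical: the helper names, the comma-as-thousands-separator rule, the two accepted date formats, and the convention of returning an affected-row count so the caller can raise the warning.

```python
from datetime import datetime

# Assumed formats, tried in this fixed order so results are repeatable.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")


def normalize_category(value):
    """Trim surrounding whitespace and lowercase a categorical token."""
    if value is None:
        return None
    return str(value).strip().lower()


def coerce_number(value):
    """Return a float for numeric-looking text, or None if parsing fails.

    Assumes commas are thousands separators (e.g. "1,200.50").
    """
    if value is None:
        return None
    try:
        return float(str(value).strip().replace(",", ""))
    except ValueError:
        return None


def parse_date(value, formats=DATE_FORMATS):
    """Return a date for the first matching format, else None."""
    if value is None:
        return None
    text = str(value).strip()
    for fmt in formats:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    return None


def apply_missing_strategy(rows, column, strategy="leave", fill_value=None):
    """Apply 'leave', 'drop', or 'fill' to rows where `column` is None.

    Returns (new_rows, affected_count); input rows are not mutated.
    """
    affected = sum(1 for r in rows if r.get(column) is None)
    if strategy == "leave":
        return list(rows), affected
    if strategy == "drop":
        return [r for r in rows if r.get(column) is not None], affected
    if strategy == "fill":
        filled = [
            {**r, column: fill_value} if r.get(column) is None else r
            for r in rows
        ]
        return filled, affected
    raise ValueError(f"unknown strategy: {strategy!r}")
```

Returning `None` for unparseable values, rather than raising, keeps failures visible as missing cells that the chosen missing-row strategy then handles explicitly.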
HTML summaries mirror the tables visible in Streamlit, so they work as attachments for reviews or tickets without forcing stakeholders into an IDE.
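As an illustration of the HTML artifact idea, a summary table can be rendered to a standalone fragment with nothing but string formatting and escaping. The function name `summary_to_html` and the markup layout are assumptions for this sketch; the real report may be structured quite differently.

```python
from html import escape


def summary_to_html(title, rows):
    """Render a list of dict rows as an escaped HTML table fragment.

    Column order follows the keys of the first row.
    """
    columns = list(rows[0]) if rows else []
    head = "".join(f"<th>{escape(str(c))}</th>" for c in columns)
    body = "".join(
        "<tr>"
        + "".join(f"<td>{escape(str(r.get(c, '')))}</td>" for c in columns)
        + "</tr>"
        for r in rows
    )
    return (
        f"<h2>{escape(title)}</h2>"
        f"<table><thead><tr>{head}</tr></thead><tbody>{body}</tbody></table>"
    )
```

Escaping every cell matters here because messy source data can contain angle brackets or ampersands that would otherwise break the attachment.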
Upload CSV, Excel, Parquet, or shallow JSON structures (single-level flatten); optionally merge multiple files with provenance tagging.
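To make "single-level flatten" and "provenance tagging" concrete, here is a small sketch under stated assumptions: the function names, the dotted key separator, and the `_source_file` / `_source_row` column names are invented for illustration and are not taken from the repository.

```python
def flatten_one_level(record, sep="."):
    """Flatten nested dicts exactly one level deep.

    Anything nested deeper is left untouched as a value.
    """
    flat = {}
    for key, value in record.items():
        if isinstance(value, dict):
            for sub_key, sub_value in value.items():
                flat[f"{key}{sep}{sub_key}"] = sub_value
        else:
            flat[key] = value
    return flat


def merge_with_provenance(tables, source_column="_source_file"):
    """Concatenate named tables, tagging each row with its origin.

    `tables` maps a file name to a list of dict rows.
    """
    merged = []
    for name, rows in tables.items():
        for index, row in enumerate(rows):
            merged.append({**row, source_column: name, "_source_row": index})
    return merged
```

With the origin recorded on every row, an issue found after the merge can be traced back to the file and row position it came from.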
Column insight matrix plus aggregate issue summaries; “detect-only” mode previews without mutating downstream assets.
Explicit checklist of transformations, with honest guardrails surfaced in the README (row caps, demo upload MB limit).
Before / after previews, trimmed change samples, deterministic summary bullets, zipped exports (CSV · Parquet · JSON) aligned with QA expectations.
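The zipped export step could be sketched roughly as below, using only the standard library. The function name `build_export_zip` and the file naming are assumptions, and Parquet is left out of the sketch because writing it requires a third-party library.

```python
import csv
import io
import json
import zipfile


def build_export_zip(rows, basename="cleaned"):
    """Return zip bytes containing the rows as CSV and JSON."""
    # Sorted union of keys gives a stable column order across runs.
    fieldnames = sorted({key for row in rows for key in row})

    csv_buffer = io.StringIO(newline="")
    writer = csv.DictWriter(csv_buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)

    # default=str lets dates and similar values serialize as text.
    json_text = json.dumps(rows, indent=2, sort_keys=True, default=str)

    archive = io.BytesIO()
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(f"{basename}.csv", csv_buffer.getvalue())
        zf.writestr(f"{basename}.json", json_text)
    return archive.getvalue()
```

Building the archive in memory means the resulting bytes can be handed straight to a download control. The file contents are stable for the same input, although the zip entries carry the current timestamp, so the archive itself is not byte-identical between runs.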