Tabular data quality & preprocessing

I ship a practical upload → detect → clean → export loop for messy operational tables. Detection covers missing values, duplicates, parse failures, categorical drift, and IQR outliers (reported as flagged counts, never silently removed), and notebooks replay the same path offline. Exported artifacts and a pytest suite cover the repeatable core.

GitHub · Live Streamlit demo · Run locally

Static recruiter page only. Hands-on uploads and downloads use the hosted Streamlit instance linked above.

Python / pandas

Streamlit

CSV · Excel · Parquet · JSON

pytest

Limitations & scope

The demo caps uploads at roughly 25 MB; it is not built for enterprise-scale ingest.
The app warns above ~100k rows; much larger tables will hit memory and time limits.
No hosted REST layer in this repo: Streamlit UX + downloadable outputs + offline notebooks.
Heuristic DQ helpers — not lineage, MDM catalogs, contracts-as-a-service platforms, or SLA-driven ETL control planes.
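
The two size limits above could be enforced with a small guard like the one below. This is a minimal sketch, not the repository's code: the names `MAX_UPLOAD_BYTES`, `ROW_WARNING_THRESHOLD`, and `check_upload` are hypothetical, and the exact values are assumptions taken from the approximate figures in the list.

```python
# Hypothetical constants; the page only states "~25 MB" and "~100k rows".
MAX_UPLOAD_BYTES = 25 * 1024 * 1024
ROW_WARNING_THRESHOLD = 100_000


def check_upload(size_bytes: int, n_rows: int) -> list[str]:
    """Reject oversized uploads and return soft warnings for large tables."""
    if size_bytes > MAX_UPLOAD_BYTES:
        raise ValueError(
            f"upload is {size_bytes} bytes; the demo cap is {MAX_UPLOAD_BYTES}"
        )
    notes = []
    if n_rows > ROW_WARNING_THRESHOLD:
        notes.append(
            f"{n_rows} rows exceeds the {ROW_WARNING_THRESHOLD}-row guidance; "
            "expect slow detection and high memory use"
        )
    return notes
```

The hard cap raises while the row threshold only warns, which matches the distinction the list draws between a safeguard and guidance.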

Motivation. Tabular extracts from ERPs, finance exports, and support CSVs routinely mix missing fields, bogus dates, drifting category strings, duplicated keys, and half-parsed numeric text. Analysts rework the same transformations by hand—the goal here is deterministic, repeatable cleaning with receipts.

This repository pairs a Streamlit reviewer with reusable Python helpers so the same transformations can ship in demos, notebooks, or future automation without copying ad hoc notebook cells everywhere.

End-to-end flow

01 Load & merge
02 Detect issues
03 Configure fixes
04 Clean & audit
05 Export data + HTML report
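
Step 01 can be illustrated with plain pandas. The sketch below stacks several uploads and tags each row with the file it came from; the function name `merge_with_provenance` and the `source_file` column are assumptions made for illustration, not names confirmed by the repository.

```python
import io

import pandas as pd


def merge_with_provenance(frames: dict[str, pd.DataFrame]) -> pd.DataFrame:
    """Stack frames row-wise and record which upload each row came from."""
    tagged = [df.assign(source_file=name) for name, df in frames.items()]
    # Columns missing from one upload are filled with NaN in the merged frame.
    return pd.concat(tagged, ignore_index=True, sort=False)


# Two small in-memory "uploads" with slightly different columns.
orders_a = pd.read_csv(io.StringIO("id,amount\n1,10\n2,20\n"))
orders_b = pd.read_csv(io.StringIO("id,amount,region\n3,30,EU\n"))

merged = merge_with_provenance({"a.csv": orders_a, "b.csv": orders_b})
```

Keeping the source name on every row means later detection results can be traced back to a specific upload.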

What the codebase contains

src/load.py → src/detect.py → src/clean.py → src/report.py

Streamlit stitches the modules together; notebooks replay the same path offline.

Streamlit façade

Provides file upload or bundled messy demo datasets, threshold sliders, deterministic cleaning toggles, previews, downloadable cleaned files, and the HTML report.

Detection

Merges uploads, summarizes missing-value ratios, flags values that fail format parsing, spots inconsistent categorical tokens, evaluates duplicates over the full row or an optional subset of key columns, and surfaces IQR outlier counts in reporting without silent auto-removal.
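
Three of these checks are easy to sketch with pandas. The helper names below are hypothetical and are not the API of `src/detect.py`; the 1.5 × IQR fence is the conventional Tukey default and is an assumption here, since the page does not state the multiplier.

```python
import pandas as pd


def missing_ratios(df: pd.DataFrame) -> pd.Series:
    """Fraction of missing values per column."""
    return df.isna().mean()


def duplicate_count(df: pd.DataFrame, key_columns=None) -> int:
    """Count rows that repeat an earlier row, optionally on key columns only."""
    return int(df.duplicated(subset=key_columns, keep="first").sum())


def iqr_outlier_counts(df: pd.DataFrame, k: float = 1.5) -> dict[str, int]:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] per numeric column."""
    counts = {}
    for col in df.select_dtypes(include="number").columns:
        values = df[col].dropna()
        q1, q3 = values.quantile(0.25), values.quantile(0.75)
        iqr = q3 - q1
        outside = (values < q1 - k * iqr) | (values > q3 + k * iqr)
        counts[col] = int(outside.sum())
    return counts


df = pd.DataFrame(
    {
        "order_id": [1, 2, 2, 3, 4, 5],
        "amount": [10.0, 12.0, 12.0, 11.0, None, 500.0],
    }
)

ratios = missing_ratios(df)
dupes_on_key = duplicate_count(df, key_columns=["order_id"])
outliers = iqr_outlier_counts(df)
```

All three functions only read the frame and return counts, so flagged outliers stay in the data until a reviewer decides what to do with them.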

Cleaning primitives

Configurable duplicate removal, categorical normalization (case folding, whitespace trimming), numeric coercion, date parsing, and explicit missing-row strategies (leave, drop-with-warning, deterministic fill).
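
A minimal sketch of these primitives follows, assuming pandas only. The function names, the `%Y-%m-%d` date format, and the choice to lowercase categories are illustrative assumptions; the actual signatures in `src/clean.py` may differ.

```python
import warnings

import pandas as pd


def normalize_category(values: pd.Series) -> pd.Series:
    """Trim, collapse repeated whitespace, and lowercase category labels."""
    text = values.astype("string").str.strip()
    return text.str.replace(r"\s+", " ", regex=True).str.lower()


def coerce_numeric(values: pd.Series) -> pd.Series:
    """Convert to numbers; unparseable entries become NaN instead of raising."""
    return pd.to_numeric(values, errors="coerce")


def parse_dates(values: pd.Series, fmt: str = "%Y-%m-%d") -> pd.Series:
    """Parse dates in one fixed format; invalid entries become NaT."""
    return pd.to_datetime(values, format=fmt, errors="coerce")


def apply_missing_strategy(df, strategy="leave", fill_values=None):
    """Apply one of the three explicit missing-row strategies."""
    if strategy == "leave":
        return df.copy()
    if strategy == "drop":
        kept = df.dropna()
        dropped = len(df) - len(kept)
        if dropped:
            warnings.warn(f"dropped {dropped} row(s) with missing values")
        return kept
    if strategy == "fill":
        if not fill_values:
            raise ValueError("fill needs explicit per-column fill_values")
        return df.fillna(value=fill_values)
    raise ValueError(f"unknown strategy: {strategy}")


raw = pd.DataFrame(
    {
        "status": [" Open", "open ", "CLOSED"],
        "amount": ["10", "n/a", "12.5"],
        "created": ["2024-01-15", "2024-02-30", "2024-03-01"],
    }
)
clean = raw.assign(
    status=normalize_category(raw["status"]),
    amount=coerce_numeric(raw["amount"]),
    created=parse_dates(raw["created"]),
)
filled = apply_missing_strategy(clean, "fill", {"amount": 0.0})
```

Using `errors="coerce"` turns bad values such as `n/a` or the impossible date `2024-02-30` into explicit missing values, which the chosen missing-row strategy then handles in a separate, visible step.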

Report bytes

HTML summaries mirror Streamlit-visible tables—useful attachments for reviews or ticketing without forcing stakeholders into the IDE.
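
One way to produce such an attachment is to render each summary table with `DataFrame.to_html` and encode the result as bytes. The sketch below is an assumption about the general shape, not the implementation in `src/report.py`; `build_report` and the example table are hypothetical.

```python
import html

import pandas as pd


def build_report(title: str, tables: dict[str, pd.DataFrame]) -> bytes:
    """Render named summary tables into one standalone HTML document."""
    parts = [
        "<!DOCTYPE html>",
        "<html><head><meta charset='utf-8'>",
        f"<title>{html.escape(title)}</title></head><body>",
        f"<h1>{html.escape(title)}</h1>",
    ]
    for heading, table in tables.items():
        parts.append(f"<h2>{html.escape(heading)}</h2>")
        # to_html escapes cell contents by default.
        parts.append(table.to_html(index=False, border=0))
    parts.append("</body></html>")
    return "\n".join(parts).encode("utf-8")


summary = pd.DataFrame({"column": ["amount"], "missing_ratio": [0.17]})
report_bytes = build_report("Data quality report", {"Missing ratios": summary})
```

Returning bytes rather than writing a file keeps the function usable both from a download button and from a notebook.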

Operational walkthrough

Ingress

Upload CSV, Excel, Parquet, or shallow JSON structures (single-level flatten); optionally merge multiple files with provenance tagging.

Detection pass

Column insight matrix plus aggregate issue summaries; “detect-only” mode previews without mutating downstream assets.

Configurable cleaning run

Explicit checklist of transformations, with guardrails (row caps, demo upload size limit) documented in the README.

Verification artifacts

Before / after previews, trimmed change samples, deterministic summary bullets, zipped exports (CSV · Parquet · JSON) aligned with QA expectations.
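
The zipped export can be sketched with the standard library's `zipfile` and an in-memory buffer. This is an illustrative assumption rather than the project's code: `zip_exports` is a hypothetical name, and the Parquet member is left out because `DataFrame.to_parquet` needs an optional engine such as pyarrow.

```python
import io
import zipfile

import pandas as pd


def zip_exports(df: pd.DataFrame, stem: str = "cleaned") -> bytes:
    """Bundle CSV and JSON renderings of a frame into one zip archive."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(f"{stem}.csv", df.to_csv(index=False))
        zf.writestr(f"{stem}.json", df.to_json(orient="records"))
    return buffer.getvalue()


cleaned = pd.DataFrame({"id": [1, 2], "status": ["open", "closed"]})
payload = zip_exports(cleaned)
```

Building the archive in memory avoids temporary files on the host, which suits a hosted demo where the bytes go straight to a download.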

Reproducibility & assurance

pytest exercises loader edge cases plus detect/clean invariants referenced in README.
Notebooks regenerate demo CSV noise and rerun the analytical path without Streamlit—for pair reviews or onboarding.
Transparency: every cleaning action maps to textual descriptions surfaced in-session for audit trails.
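
Invariants of this kind can be written as ordinary pytest functions. The example below is a sketch: `clean` is a tiny hypothetical stand-in for the real cleaning entry point, and the three properties (idempotence, no input mutation, no added rows) are assumed examples of invariants, not a list taken from the README.

```python
import pandas as pd


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in: normalize one column, then drop duplicates."""
    out = df.copy()
    out["status"] = out["status"].str.strip().str.lower()
    return out.drop_duplicates().reset_index(drop=True)


def make_raw() -> pd.DataFrame:
    """Small messy fixture: inconsistent labels that hide a duplicate row."""
    return pd.DataFrame({"id": [1, 1, 2], "status": [" Open", "open ", "Closed"]})


def test_clean_is_idempotent():
    once = clean(make_raw())
    twice = clean(once)
    pd.testing.assert_frame_equal(once, twice)


def test_clean_does_not_mutate_input():
    raw = make_raw()
    snapshot = raw.copy(deep=True)
    clean(raw)
    pd.testing.assert_frame_equal(raw, snapshot)


def test_clean_never_adds_rows():
    raw = make_raw()
    assert len(clean(raw)) <= len(raw)
```

Saved in a test module, these run under `pytest` with no extra configuration.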

Transparent limits

This is practical engineering hygiene—not a silver bullet “AI cleaner” SaaS abstraction.
Browser-oriented Streamlit safeguards reject multi-hundred-megabyte blobs; ingest-hardened services need separate scope.
No workflow scheduler, SLA manager, semantic master-data graph, or auto-discovered ontology engine lives here.

Vahdettin Karatas
