Vahdettin Karatas
Data / ML engineer — preprocessing, validation, repeatable pipelines
  • Location: Czech Republic (EU)
Technical focus
  • Tabular data quality & preprocessing
  • Streamlit tooling for DQ reviews
  • Python / pandas workloads
  • Testing & reproducible notebooks
  • Exportable artifacts & honest limits
Portfolio artifact — Streamlit demo + reproducible preprocessing

Tabular data quality & preprocessing

I ship a practical upload → detect → clean → export loop for messy operational tables. Detection covers missing values, duplicates, parse failures, categorical drift, and IQR outliers (outliers are counted and flagged in the report, not removed), and notebooks replay the offline path. Exported artifacts and pytest tests cover the repeatable core.

Static recruiter page only. Hands-on uploads and downloads use the hosted Streamlit instance linked above.

Python / pandas
Streamlit
CSV · Excel · Parquet · JSON
pytest

Limitations & scope

  • Demo safeguards cap uploads (~25 MB) — not unmanaged enterprise ingest.
  • Guidance warns above ~100k rows; much larger tables will hit memory and time limits.
  • No hosted REST layer in this repo: Streamlit UX + downloadable outputs + offline notebooks.
  • Heuristic DQ helpers — not lineage, MDM catalogs, contracts-as-a-service platforms, or SLA-driven ETL control planes.

Motivation. Tabular extracts from ERPs, finance exports, and support CSVs routinely mix missing fields, bogus dates, drifting category strings, duplicated keys, and half-parsed numeric text. Analysts rework the same transformations by hand—the goal here is deterministic, repeatable cleaning with receipts.

This repository pairs a Streamlit reviewer with reusable Python helpers so the same transformations can ship in demos, notebooks, or future automation without copying ad hoc notebook cells everywhere.

End-to-end flow

  1. Load & merge
  2. Detect issues
  3. Configure fixes
  4. Clean & audit
  5. Export data + HTML report

What the codebase contains

src/load.py src/detect.py src/clean.py src/report.py

Streamlit stitches the modules together; notebooks replay the same path offline, without Streamlit.

Streamlit façade

Provides upload or bundled messy demos, sliders for thresholds, deterministic cleaning toggles, previews, downloadable cleaned files, and the HTML artifact.

Detection

Merges uploads, summarizes missing-value ratios, flags values that fail format parsing, spots inconsistent categorical tokens, checks for duplicates (optionally on a subset of key columns), and surfaces IQR outlier counts in the report without silently removing them.
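
A minimal sketch of three of these checks in pandas. The function names and the 1.5 × IQR multiplier are illustrative assumptions, not the actual API of src/detect.py. Note that the outlier check only counts values; it never drops rows.

```python
from typing import Dict, List, Optional

import pandas as pd


def missing_ratios(df: pd.DataFrame) -> pd.Series:
    """Fraction of missing values per column."""
    return df.isna().mean()


def duplicate_count(df: pd.DataFrame, key_columns: Optional[List[str]] = None) -> int:
    """Rows that repeat an earlier row, optionally judged on key columns only."""
    return int(df.duplicated(subset=key_columns, keep="first").sum())


def iqr_outlier_counts(df: pd.DataFrame, k: float = 1.5) -> Dict[str, int]:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] per numeric column."""
    counts = {}
    for col in df.select_dtypes(include="number").columns:
        q1 = df[col].quantile(0.25)
        q3 = df[col].quantile(0.75)
        iqr = q3 - q1
        outside = (df[col] < q1 - k * iqr) | (df[col] > q3 + k * iqr)
        counts[col] = int(outside.sum())  # reported only, nothing is removed
    return counts
```

Because each helper returns plain counts or ratios, the same results can feed a Streamlit table, a notebook cell, or a report section.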

Cleaning primitives

Configurable duplicate removal, categorical normalization (case / whitespace trims), numeric coercion, date parsing, and explicit missing-row strategies (leave, drop-with-warning, deterministic fill).
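
A hedged sketch of how these primitives might look in pandas. The names (`normalize_categories`, `handle_missing`, and so on) and the strategy strings are assumptions for illustration; the real signatures in src/clean.py may differ.

```python
import warnings

import pandas as pd


def normalize_categories(values: pd.Series) -> pd.Series:
    """Trim surrounding whitespace and lowercase; missing values stay missing."""
    return values.astype("string").str.strip().str.lower()


def coerce_numeric(values: pd.Series) -> pd.Series:
    """Turn numeric-looking text into numbers; unparseable entries become NaN."""
    return pd.to_numeric(values, errors="coerce")


def parse_dates(values: pd.Series) -> pd.Series:
    """Parse date strings; unparseable entries become NaT."""
    return pd.to_datetime(values, errors="coerce")


def handle_missing(df: pd.DataFrame, strategy: str = "leave", fill_value=None) -> pd.DataFrame:
    """Apply one explicit missing-row strategy and return a new frame."""
    if strategy == "leave":
        return df.copy()
    if strategy == "drop":
        cleaned = df.dropna()
        dropped = len(df) - len(cleaned)
        if dropped:
            warnings.warn(f"Dropped {dropped} row(s) containing missing values")
        return cleaned
    if strategy == "fill":
        if fill_value is None:
            raise ValueError("The fill strategy needs an explicit fill_value")
        return df.fillna(fill_value)
    raise ValueError(f"Unknown strategy: {strategy!r}")
```

Using `errors="coerce"` keeps bad values visible as NaN or NaT, so a later detection pass can count them instead of the run failing on the first bad cell.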

Report bytes

HTML summaries mirror Streamlit-visible tables—useful attachments for reviews or ticketing without forcing stakeholders into the IDE.
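
One way such a report could be assembled, as a sketch only: `build_report` and its arguments are hypothetical names, and the real src/report.py may structure its output differently. It relies on pandas' `DataFrame.to_html`, which escapes cell contents by default.

```python
import html
from typing import Dict

import pandas as pd


def build_report(title: str, tables: Dict[str, pd.DataFrame]) -> bytes:
    """Render named summary tables into one standalone HTML document."""
    safe_title = html.escape(title)
    parts = [f"<h1>{safe_title}</h1>"]
    for name, table in tables.items():
        parts.append(f"<h2>{html.escape(name)}</h2>")
        parts.append(table.to_html(index=False))
    body = "\n".join(parts)
    page = (
        "<!doctype html>\n<html><head><meta charset='utf-8'>"
        f"<title>{safe_title}</title></head>\n<body>\n{body}\n</body></html>"
    )
    # Bytes can be handed directly to a download button or written to disk.
    return page.encode("utf-8")
```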

Operational walkthrough

  1. Ingress

    Upload CSV, Excel, Parquet, or shallow JSON structures (single-level flatten); optionally merge multiples with provenance tagging.

  2. Detection pass

    Column insight matrix plus aggregate issue summaries; “detect-only” mode previews without mutating downstream assets.

  3. Configurable cleaning run

    Explicit checklist of transformations; the guardrails (row caps, demo upload MB limit) are documented in the README.

  4. Verification artifacts

    Before / after previews, trimmed change samples, deterministic summary bullets, zipped exports (CSV · Parquet · JSON) aligned with QA expectations.
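
The ingress step above (merging multiple uploads with provenance tagging, and single-level JSON flattening) can be sketched roughly as follows. The `_source_file` column name and both function names are assumptions for illustration, not identifiers taken from src/load.py.

```python
from typing import Dict, List

import pandas as pd


def merge_with_provenance(
    frames: Dict[str, pd.DataFrame], source_column: str = "_source_file"
) -> pd.DataFrame:
    """Stack uploads row-wise and record which upload each row came from."""
    tagged = [df.assign(**{source_column: name}) for name, df in frames.items()]
    return pd.concat(tagged, ignore_index=True, sort=False)


def flatten_shallow_json(records: List[dict]) -> pd.DataFrame:
    """Flatten one level of nesting; deeper structures stay as raw objects."""
    return pd.json_normalize(records, max_level=1)
```

Columns missing from one upload are filled with NaN after the concat, which the detection pass would then report as missing values.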

Reproducibility & assurance

  • pytest exercises loader edge cases plus detect/clean invariants referenced in README.
  • Notebooks regenerate demo CSV noise and rerun the analytical path without Streamlit—for pair reviews or onboarding.
  • Transparency: every cleaning action maps to textual descriptions surfaced in-session for audit trails.
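
As an illustration of the kind of invariant such tests can pin down, here is a small pytest-style sketch. `clean_categories` is a stand-in defined inline for the example, not the repository's function; the two properties checked (idempotence and no mutation of the input) are assumed examples of clean-step invariants.

```python
import pandas as pd


def clean_categories(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Stand-in cleaner: trim and lowercase one column on a copy of the frame."""
    out = df.copy()
    out[column] = out[column].astype("string").str.strip().str.lower()
    return out


def test_cleaning_is_idempotent():
    raw = pd.DataFrame({"status": [" Open", "open ", "CLOSED"]})
    once = clean_categories(raw, "status")
    twice = clean_categories(once, "status")
    # A second pass must not change anything the first pass produced.
    pd.testing.assert_frame_equal(once, twice)


def test_cleaning_does_not_mutate_input():
    raw = pd.DataFrame({"status": [" Open"]})
    snapshot = raw.copy(deep=True)
    clean_categories(raw, "status")
    # The caller's frame must be untouched so before / after previews stay honest.
    pd.testing.assert_frame_equal(raw, snapshot)
```

Saved in a file named like test_clean.py, these functions are collected by pytest without extra configuration.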

Transparent limits

  • This is practical engineering hygiene—not a silver bullet “AI cleaner” SaaS abstraction.
  • Browser-oriented Streamlit safeguards reject multi-hundred-megabyte blobs; ingest-hardened services need separate scope.
  • No workflow scheduler, SLA manager, semantic master-data graph, or auto-discovered ontology engine lives here.

Data Cleaning Toolkit

Proof-of-work · Streamlit demo + modular Python

© Vahdettin Karatas. All rights reserved.