Experience

Data Science Manager
ARID Lab, University of Arizona — Tucson, AZ
Feb 2024 – Oct 2025

Highlights

  • Built and deployed ML models (XGBoost, survival analysis, ClinicalBERT NLP) on 500K+ healthcare records, achieving an AUROC of 0.83 and a 25% accuracy improvement in clinical data extraction
  • Designed end-to-end ETL pipelines integrating multi-source databases into Snowflake, reducing data prep time by 60% through automated validation and metadata standardization
  • Led A/B testing and experimentation frameworks for model optimization, delivering a 20% lift in precision through threshold tuning and statistical analysis
  • Orchestrated production ML workflows on AWS (SageMaker/S3) with automated monitoring, feature engineering, and experiment tracking
  • Mentored 3 analysts on data integration, model validation, and research methodology best practices

Python · R · SQL · XGBoost · ClinicalBERT · AWS SageMaker/S3 · Snowflake · A/B Testing · ETL Pipelines · Survival Analysis · NLP · Tableau/Power BI

Scope & Responsibilities

  • Machine Learning Development: Designed and deployed supervised learning models (XGBoost, logistic regression, Cox survival) for real-time and batch prediction systems; conducted hyperparameter optimization and cross-validation to reach an AUROC of 0.83 (see the cross-validation sketch after this list)
  • NLP Pipeline Engineering: Built a scalable ClinicalBERT pipeline processing unstructured clinical notes with distributed text processing, automated validation, and rule-based audits; improved extraction accuracy by 25% and reduced manual review by 80%
  • Experimentation & Optimization: Designed and executed A/B tests for model threshold optimization; performed statistical significance testing and causal inference analysis to measure business impact (see the threshold-tuning sketch after this list)
  • Data Engineering: Architected ETL workflows integrating 500K+ records from multiple source systems using Python, SQL, and Airflow-style orchestration; implemented data quality checks, deduplication, schema validation, and complete lineage tracking
  • MLOps & Production Systems: Built end-to-end ML serving framework on AWS with automated ingestion, feature engineering, model monitoring, and experiment tracking; reduced data preparation latency by 60%
  • Analytics & Reporting: Developed interactive dashboards (Tableau/Power BI) for stakeholder decision-making; delivered technical reports translating complex model outputs into actionable business insights
  • Data Governance & Compliance: Established data quality standards, validation protocols, and HIPAA-compliant audit controls ensuring reliable, reproducible analytics
  • Program Evaluation: Led multi-site evaluation studies using causal inference, survival analysis, and clustering to quantify treatment effects and outcome disparities across patient cohorts
  • Collaboration & Leadership: Partnered with clinical teams and policy stakeholders to translate research questions into analytical frameworks; mentored junior analysts on statistical methods and data engineering best practices
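
A minimal sketch of the cross-validation setup behind the AUROC figure above. Synthetic data stands in for the real cohort features, which cannot be shared, and the hyperparameters shown are illustrative rather than the tuned production values.

```python
# Cross-validated AUROC for a binary clinical outcome model.
# make_classification stands in for the real (protected) feature table;
# weights=[0.9] mimics the class imbalance typical of clinical endpoints.
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from xgboost import XGBClassifier

X, y = make_classification(
    n_samples=5000, n_features=40, weights=[0.9], random_state=42
)

model = XGBClassifier(n_estimators=300, max_depth=4, learning_rate=0.05)

# Stratified folds preserve outcome prevalence in every split, which keeps
# AUROC estimates stable when the positive class is rare.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
print(f"AUROC: {scores.mean():.3f} +/- {scores.std():.3f}")
```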
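
And a sketch of the threshold-tuning step referenced in the experimentation bullet: pick the cutoff that maximizes precision subject to a recall floor, then validate that cutoff through an A/B test. The labels, scores, and 0.60 recall floor below are synthetic stand-ins, not production values.

```python
# Decision-threshold tuning on a held-out validation split.
import numpy as np
from sklearn.metrics import precision_recall_curve

rng = np.random.default_rng(0)
y_val = rng.integers(0, 2, 1000)                               # stand-in labels
scores = np.clip(0.35 * y_val + 0.8 * rng.random(1000), 0, 1)  # stand-in scores

precision, recall, thresholds = precision_recall_curve(y_val, scores)

# precision/recall have one more entry than thresholds, so drop the last
# point before aligning; keep only cutoffs that still meet the recall floor.
min_recall = 0.60
admissible = recall[:-1] >= min_recall
best = np.argmax(precision[:-1] * admissible)
print(f"threshold={thresholds[best]:.2f}  "
      f"precision={precision[best]:.2f}  recall={recall[best]:.2f}")
```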

Graduate Research Assistant
University of Arizona — Department of Pediatrics
Nov 2022 – Dec 2023

Highlights

  • Built anomaly detection and time-series forecasting pipelines for population health surveillance with backtesting and statistical validation
  • Designed HIPAA-compliant PostgreSQL databases with optimized indexing and partitioning, reducing query latency by 80%
  • Automated REDCap and survey platform ingestion with validation checks, improving data completeness by 25% and shortening reporting cycles by 70%
  • Developed standardized feature tables enabling reproducible modeling and experimentation workflows

Python · R · SQL · PostgreSQL · REDCap · Airflow · Time-Series Forecasting · Anomaly Detection · Data Validation

Scope & Responsibilities

  • Predictive Modeling: Built anomaly detection and time-series forecasting models for epidemiological surveillance; implemented backtesting, cross-validation, and statistical significance tests to ensure model robustness (see the backtesting sketch after this list)
  • Database Engineering: Designed and optimized multi-site PostgreSQL databases on Linux infrastructure; implemented indexing strategies, range partitioning, and query optimization reducing latency by 80%
  • Data Pipeline Automation: Created automated ingestion workflows for survey platforms (REDCap, Amazon MTurk) with validation, deduplication, schema checks, and audit logging; improved completeness by 25% and reduced reporting time by 70%
  • Feature Engineering: Standardized feature table creation for downstream modeling; established reproducible data preparation workflows supporting experimentation and analysis
  • Data Quality & Validation: Implemented comprehensive validation protocols including completeness checks, range validation, and logical consistency rules, ensuring high-quality analytical datasets (a validation sketch follows this list)
  • Visualization & Reporting: Published KPI dashboards and analytical results to Tableau/Power BI, enabling operational decision-making by public health teams
  • Research Support: Conducted statistical analyses (logistic regression, trend analysis) on population-level datasets; delivered baseline summaries and technical documentation for research teams
  • Compliance & Security: Maintained HIPAA-compliant data handling with access controls, encryption, and complete audit trails for sensitive health information
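
A minimal sketch of the rolling-origin backtesting referenced above, with a synthetic weekly series and a seasonal-naive baseline standing in for the production forecasters; each split trains only on the past and scores the following weeks.

```python
# Rolling-origin backtest for a weekly count series.
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

rng = np.random.default_rng(0)
weeks = np.arange(156)  # three years of weekly data
y = 50 + 10 * np.sin(2 * np.pi * weeks / 52) + rng.normal(0, 3, weeks.size)

errors = []
for train_idx, test_idx in TimeSeriesSplit(n_splits=5, test_size=4).split(y):
    train, test = y[train_idx], y[test_idx]
    # Seasonal-naive baseline: forecast the same weeks from one year earlier.
    forecast = train[-52 : -52 + test.size]
    errors.append(np.mean(np.abs(forecast - test)))

print(f"backtested MAE: {np.mean(errors):.2f}")
```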
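
And a sketch of the row-level validation pattern behind the data quality bullet; the field names, ranges, and toy records are hypothetical placeholders, not the study's actual rules.

```python
# Completeness, range, and logical-consistency checks on ingested records.
import pandas as pd

df = pd.DataFrame({
    "record_id":       [1, 2, 2, 4],
    "age":             [34, -1, 57, None],
    "enrollment_date": pd.to_datetime(["2023-01-01", "2023-03-01",
                                       "2023-01-15", "2023-02-20"]),
    "visit_date":      pd.to_datetime(["2023-01-05", "2023-02-10",
                                       "2023-02-10", "2023-03-01"]),
})

checks = {
    "duplicate_id":     df["record_id"].duplicated(keep=False),
    "missing_age":      df["age"].isna(),
    "age_out_of_range": ~df["age"].between(0, 120),
    "visit_precedes_enrollment": df["visit_date"] < df["enrollment_date"],
}

# Flagged rows are routed to an audit log for review, never silently dropped.
report = pd.DataFrame(checks)
print(report.sum())                       # failures per rule
print(report.any(axis=1).sum(), "rows flagged")
```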

Software Development Engineer
Tata Consultancy Services — Mumbai, India
Mar 2018 – Jul 2022

Highlights

  • Re-engineered PCI-compliant ETL from Informatica to distributed PySpark, processing 3M+ daily transactions; led a 12-member team ensuring security compliance and audit trails
  • Built and deployed fraud detection ML pipelines on 10M+ transactions reducing false positives by 20% through experimentation and threshold optimization
  • Automated CI/CD for ETL and ML deployments with data quality validation and rollback capabilities, improving release cadence by 25%
  • Established enterprise data governance standards and automated quality monitoring improving data reliability by 25%

Python · SQL · PySpark · Informatica PowerCenter · Teradata · AWS · ML Deployment · CI/CD · Data Governance · Unix/Linux

Scope & Responsibilities

  • Large-Scale Data Engineering: Re-architected ETL pipelines from Informatica PowerCenter to distributed PySpark on Unix/Linux infrastructure, processing 3M+ daily banking transactions; ensured PCI compliance with Protegrity tokenization for PII protection (see the PySpark sketch after this list)
  • Machine Learning in Production: Designed and deployed fraud detection models using Python and SQL on 10M+ transactions; reduced false positives by 20% through iterative experimentation, feature engineering, and threshold optimization
  • Team Leadership: Led cross-functional team of 12 engineers managing end-to-end ETL development lifecycle; established code review standards, testing protocols, and quality gates
  • DevOps & Automation: Built automated CI/CD pipelines for ETL and ML model deployments across mainframe and enterprise data warehouse systems; implemented data quality validation, automated testing, and rollback procedures, improving release reliability by 25% (see the quality-gate sketch after this list)
  • Data Governance: Established enterprise-wide data quality standards including validation rules, audit controls, and quality metrics; implemented automated monitoring ensuring consistent data reliability across production systems
  • Performance Optimization: Tuned complex SQL queries and ETL workflows reducing processing time by 50%; optimized resource utilization and infrastructure costs through query refactoring and job parallelization
  • Analytics & Forecasting: Built statistical models for revenue forecasting and trend analysis; delivered executive reports on multi-year fraud patterns supporting strategic decision-making
  • Compliance & Security: Maintained complete audit trails for financial data processing; ensured regulatory compliance with automated validation checks and PII tokenization protocols
  • Migration Management: Orchestrated complex data migrations across Dev/Test/UAT/Production environments; managed a 2.5M-record customer data integration, achieving a 30% performance improvement
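
A minimal sketch of the distributed ETL pattern from the first bullet; the bucket paths, column names, and 1% rejection budget are hypothetical, and PII tokenization (Protegrity) is assumed to have happened upstream.

```python
# PySpark ETL: deduplicate, enforce quality rules, write partitioned output.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("txn_etl").getOrCreate()
raw = spark.read.parquet("s3://bucket/raw/transactions/")  # hypothetical path

clean = (
    raw.dropDuplicates(["txn_id"])
       .filter(F.col("amount") > 0)
       .filter(F.col("txn_ts").isNotNull())
       .withColumn("txn_date", F.to_date("txn_ts"))
)

# Fail loudly if too many rows are rejected instead of shipping degraded data.
total, kept = raw.count(), clean.count()
assert (total - kept) / max(total, 1) < 0.01, f"{total - kept} rows rejected"

(clean.write.mode("overwrite")
      .partitionBy("txn_date")
      .parquet("s3://bucket/curated/transactions/"))
```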
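
And a sketch of the kind of pre-deployment gate wired into the CI/CD bullet: compare a candidate model's validation metrics against the production baseline and block promotion on regression. The file names, metric keys, and tolerance are illustrative assumptions.

```python
# Quality gate: a non-zero exit blocks the deploy and triggers rollback.
import json
import sys

with open("candidate_metrics.json") as f:   # written by the training job
    candidate = json.load(f)
with open("production_metrics.json") as f:  # baseline from the live model
    baseline = json.load(f)

TOLERANCE = 0.01  # tolerate one point of noise before failing the gate

for metric in ("auroc", "precision"):
    if candidate[metric] < baseline[metric] - TOLERANCE:
        print(f"GATE FAILED: {metric} regressed "
              f"({candidate[metric]:.3f} < {baseline[metric]:.3f})")
        sys.exit(1)

print("gate passed; promoting candidate model")
```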