Experience
Each role lists highlights and skills first, followed by full scope and responsibilities.
Data Science Manager
ARID Lab, University of Arizona — Tucson, AZ
Feb 2024 – Oct 2025
Highlights
- Built and deployed ML models (XGBoost, survival analysis, ClinicalBERT NLP) on 500K+ healthcare records, achieving 83% AUROC and a 25% accuracy improvement in clinical data extraction (training sketch after this list)
- Designed end-to-end ETL pipelines integrating multi-source databases into Snowflake, reducing data prep time by 60% through automated validation and metadata standardization
- Led A/B testing and experimentation for model optimization, delivering a 20% precision lift through threshold tuning and statistical analysis
- Orchestrated production ML workflows on AWS (SageMaker/S3) with automated monitoring, feature engineering, and experiment tracking
- Mentored 3 analysts on data integration, model validation, and research methodology best practices
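To give the highlights above some texture, here is a minimal sketch of cross-validated XGBoost scoring in the style of the modeling work described; the synthetic data, feature count, and hyperparameters are illustrative stand-ins, not the actual clinical model or records.

```python
# Minimal sketch: cross-validated XGBoost AUROC on synthetic data.
# Everything here is an illustrative placeholder, not the clinical model.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from xgboost import XGBClassifier

# Synthetic stand-in for imbalanced tabular clinical features (no PHI).
X, y = make_classification(n_samples=5000, n_features=30,
                           weights=[0.85, 0.15], random_state=42)

model = XGBClassifier(
    n_estimators=300, max_depth=4, learning_rate=0.05,
    subsample=0.8, eval_metric="logloss", random_state=42,
)

# Stratified folds preserve the outcome ratio in each split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
auroc = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
print(f"AUROC: {auroc.mean():.3f} +/- {auroc.std():.3f}")
```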
Skills: Python/R/SQL · XGBoost/ClinicalBERT · AWS SageMaker/S3 · Snowflake · A/B Testing · ETL Pipelines · Survival Analysis · NLP · Tableau/Power BI
Scope & Responsibilities
- Machine Learning Development: Designed and deployed supervised learning models (XGBoost, logistic regression, Cox survival) for real-time and batch prediction systems; conducted hyperparameter optimization and cross-validation to achieve 83% AUROC
- NLP Pipeline Engineering: Built scalable ClinicalBERT pipeline processing unstructured clinical notes with distributed text processing, automated validation, and rule-based audits; improved extraction accuracy 25% and reduced manual review by 80%
- Experimentation & Optimization: Designed and executed A/B tests for model threshold optimization; performed statistical significance testing and causal inference analysis to measure business impact
- Data Engineering: Architected ETL workflows integrating 500K+ records from multiple source systems using Python, SQL, and Airflow-style orchestration; implemented data quality checks, deduplication, schema validation, and complete lineage tracking
- MLOps & Production Systems: Built end-to-end ML serving framework on AWS with automated ingestion, feature engineering, model monitoring, and experiment tracking; reduced data preparation latency by 60%
- Analytics & Reporting: Developed interactive dashboards (Tableau/Power BI) for stakeholder decision-making; delivered technical reports translating complex model outputs into actionable business insights
- Data Governance & Compliance: Established data quality standards, validation protocols, and HIPAA-compliant audit controls ensuring reliable, reproducible analytics
- Program Evaluation: Led multi-site evaluation studies using causal inference, survival analysis, and clustering to quantify treatment effects and outcome disparities across patient cohorts
- Collaboration & Leadership: Partnered with clinical teams and policy stakeholders to translate research questions into analytical frameworks; mentored junior analysts on statistical methods and data engineering best practices
Graduate Research Assistant
University of Arizona — Department of Pediatrics
Nov 2022 – Dec 2023
Highlights
- Built anomaly detection and time-series forecasting pipelines for population health surveillance, with backtesting and statistical validation (backtest sketch after this list)
- Designed HIPAA-compliant PostgreSQL databases with optimized indexing and partitioning, reducing query latency by 80%
- Automated REDCap and survey platform ingestion with validation checks, improving data completeness by 25% and shortening reporting cycles by 70%
- Developed standardized feature tables enabling reproducible modeling and experimentation workflows
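A minimal sketch of the rolling-origin backtesting mentioned above, using a seasonal-naive forecaster over synthetic weekly counts; the forecaster, horizon, and data are placeholders for the actual surveillance models.

```python
# Minimal sketch of a rolling-origin backtest for a weekly surveillance
# series. The seasonal-naive forecaster and synthetic counts are
# placeholders for the real models and data.
import numpy as np

rng = np.random.default_rng(1)
weeks = 156
season = 10 + 5 * np.sin(2 * np.pi * np.arange(weeks) / 52)
y = rng.poisson(season)  # synthetic weekly case counts

horizon, min_train = 4, 104
errors = []
for origin in range(min_train, weeks - horizon):
    train, actual = y[:origin], y[origin:origin + horizon]
    # Seasonal-naive: reuse the same calendar weeks from one year back.
    forecast = train[-52:][:horizon]
    errors.append(np.abs(forecast - actual).mean())

print(f"backtested MAE over {len(errors)} origins: {np.mean(errors):.2f}")
```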
Skills: Python/R/SQL · PostgreSQL · REDCap · Airflow · Time-Series Forecasting · Anomaly Detection · Data Validation
Scope & Responsibilities
- Predictive Modeling: Built anomaly detection and time-series forecasting models for epidemiological surveillance; implemented backtesting, cross-validation, and statistical significance tests to ensure model robustness
- Database Engineering: Designed and optimized multi-site PostgreSQL databases on Linux infrastructure; implemented indexing strategies, range partitioning, and query optimization reducing latency by 80%
- Data Pipeline Automation: Created automated ingestion workflows for survey platforms (REDCap, Amazon MTurk) with validation, deduplication, schema checks, and audit logging; improved completeness 25% and reduced reporting time 70%
- Feature Engineering: Standardized feature table creation for downstream modeling; established reproducible data preparation workflows supporting experimentation and analysis
- Data Quality & Validation: Implemented comprehensive validation protocols including completeness checks, range validation, and logical consistency rules ensuring high-quality analytical datasets
- Visualization & Reporting: Published KPI dashboards and analytical results to Tableau/Power BI enabling operational decision-making by public health teams
- Research Support: Conducted statistical analyses (logistic regression, trend analysis) on population-level datasets; delivered baseline summaries and technical documentation for research teams
- Compliance & Security: Maintained HIPAA-compliant data handling with access controls, encryption, and complete audit trails for sensitive health information
Software Development Engineer
Tata Consultancy Services — Mumbai, India
Mar 2018 – Jul 2022
Highlights
- Re-engineered PCI-compliant ETL from Informatica to distributed PySpark processing 3M+ daily transactions; led a 12-member team ensuring security compliance and audit trails
- Built and deployed fraud detection ML pipelines on 10M+ transactions reducing false positives by 20% through experimentation and threshold optimization
- Automated CI/CD for ETL and ML deployments with data quality validation and rollback capabilities, improving release cadence by 25% (quality-gate sketch after this list)
- Established enterprise data governance standards and automated quality monitoring improving data reliability by 25%
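A minimal sketch of the quality-gate pattern mentioned above: a standalone check that a CI/CD stage can run before promoting a release, where a nonzero exit halts the deploy and triggers rollback. The checks, thresholds, and input file are hypothetical.

```python
# Minimal sketch of a data-quality gate for a CI/CD stage: a nonzero
# exit signals the pipeline to halt and roll back. The checks,
# thresholds, and input file are hypothetical.
import sys
import pandas as pd

def quality_gate(df: pd.DataFrame) -> bool:
    checks = {
        "row count > 0": len(df) > 0,
        "no duplicate txn ids": df["txn_id"].is_unique,
        "amount non-negative": (df["amount"] >= 0).all(),
        "null rate < 1%": df.isna().mean().max() < 0.01,
    }
    for name, passed in checks.items():
        print(("PASS" if passed else "FAIL"), name)
    return all(checks.values())

if __name__ == "__main__":
    batch = pd.read_parquet(sys.argv[1])  # e.g. a daily transaction extract
    sys.exit(0 if quality_gate(batch) else 1)
```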
Skills: Python/SQL/PySpark · Informatica PowerCenter · Teradata/AWS · ML Deployment · CI/CD · Data Governance · Unix/Linux
Scope & Responsibilities
- Large-Scale Data Engineering: Re-architected ETL pipelines from Informatica PowerCenter to distributed PySpark on Unix/Linux infrastructure, processing 3M+ daily banking transactions; ensured PCI compliance with Protegrity tokenization for PII protection (PySpark sketch after this list)
- Machine Learning in Production: Designed and deployed fraud detection models using Python and SQL on 10M+ transactions; reduced false positives by 20% through iterative experimentation, feature engineering, and threshold optimization
- Team Leadership: Led cross-functional team of 12 engineers managing end-to-end ETL development lifecycle; established code review standards, testing protocols, and quality gates
- DevOps & Automation: Built automated CI/CD pipelines for ETL and ML model deployments across mainframe and enterprise data warehouse systems; implemented data quality validation, automated testing, and rollback procedures improving release reliability by 25%
- Data Governance: Established enterprise-wide data quality standards including validation rules, audit controls, and quality metrics; implemented automated monitoring ensuring consistent data reliability across production systems
- Performance Optimization: Tuned complex SQL queries and ETL workflows reducing processing time by 50%; optimized resource utilization and infrastructure costs through query refactoring and job parallelization
- Analytics & Forecasting: Built statistical models for revenue forecasting and trend analysis; delivered executive reports on multi-year fraud patterns supporting strategic decision-making
- Compliance & Security: Maintained complete audit trails for financial data processing; ensured regulatory compliance with automated validation checks and PII tokenization protocols
- Migration Management: Orchestrated complex data migrations across Dev/Test/UAT/Production environments; managed a 2.5M customer-record integration, achieving a 30% performance improvement
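A minimal sketch in the spirit of the PySpark migration described above: schema-enforced read, business-key deduplication, and a basic validation filter with an audit column. The paths, schema, and rules are illustrative, not the production job.

```python
# Minimal sketch of a PySpark ETL step: schema-checked load,
# deduplication on the business key, and a validation filter.
# Paths, schema, and rules are illustrative placeholders.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import (StructType, StructField, StringType,
                               DoubleType, TimestampType)

spark = SparkSession.builder.appName("txn_etl_sketch").getOrCreate()

schema = StructType([
    StructField("txn_id", StringType(), nullable=False),
    StructField("account_id", StringType(), nullable=False),
    StructField("amount", DoubleType(), nullable=True),
    StructField("txn_ts", TimestampType(), nullable=True),
])

# Enforce the expected schema at read time rather than inferring it.
raw = spark.read.schema(schema).parquet("s3://example-bucket/txns/")  # hypothetical path

clean = (
    raw.dropDuplicates(["txn_id"])                 # dedupe on the business key
       .filter(F.col("amount").isNotNull() & (F.col("amount") >= 0))
       .withColumn("load_date", F.current_date())  # lineage/audit column
)

clean.write.mode("overwrite").partitionBy("load_date").parquet(
    "s3://example-bucket/txns_clean/"              # hypothetical path
)
```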