Andrea – SQL, Apache Spark, Apache Kafka
Andrea is a Senior Data Engineer with extensive experience in distributed data pipelines, Spark/Databricks, and fraud detection systems, primarily in large-scale fintech environments. He demonstrates strong skills in Apache Spark, Kafka, AWS, and the medallion architecture, with hands-on experience in ML lifecycle tooling. Feedback from Lemon.io vetting highlights his calm, collaborative communication and real-world enterprise expertise!
15 years of commercial experience
Experience Highlights
Lead AI Engineer
An intelligent AI agent platform for insurance automation, streamlining B2B insurtech workflows and reducing operational costs by over 50% through AI-driven process optimization (e.g., quoting, product selection).
- Built FastAPI microservices with Pydantic validation to receive and act on automated quoting requests (a minimal sketch of this flow follows this list);
- Integrated Gemini API for AI reasoning and autonomous decisions across insurance tasks;
- Deployed on AWS with containerized services for elastic scaling and fault tolerance;
- Added Nylas integration for email/calendar automation and customer workflow streamlining;
- Implemented automated quote retrieval using Browser-Use and custom Selenium workflows;
- Architected Celery processing for asynchronous quoting requests, enabling scalable task queuing;
- Persisted workflow data and metadata in PostgreSQL.
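As an illustration of the pattern described above, a minimal sketch of a quoting endpoint that validates the payload with Pydantic and hands it to a Celery worker might look like the following. The `QuoteRequest` fields, the Redis broker, and the endpoint path are assumptions for illustration, not details from the project.

```python
# Hypothetical sketch: a FastAPI endpoint validating a quoting request with
# Pydantic and handing it off to a Celery worker for asynchronous processing.
from celery import Celery
from fastapi import FastAPI
from pydantic import BaseModel, Field

# Broker choice is an assumption; the real deployment details are not specified.
celery_app = Celery("quoting", broker="redis://localhost:6379/0")
app = FastAPI()


class QuoteRequest(BaseModel):
    customer_id: str
    product_line: str
    coverage_amount: float = Field(gt=0)


@celery_app.task
def process_quote(payload: dict) -> None:
    # Placeholder for the actual quoting workflow (AI reasoning, browser
    # automation, persistence to PostgreSQL, etc.).
    ...


@app.post("/quotes", status_code=202)
def submit_quote(request: QuoteRequest):
    # Pydantic has already validated the payload; enqueue the task and
    # return immediately so the API stays responsive.
    task = process_quote.delay(request.model_dump())
    return {"task_id": task.id, "status": "queued"}
```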
Tech Lead
An ML-as-a-service platform on Databricks for internal teams to accelerate model development and deployment. The platform enabled large-scale data engineering to ingest raw data from S3, Snowflake, and Kafka, with templated data pipelines for cleaning, feature engineering, encoding, and sampling following the medallion model. It leveraged Ray APIs for distributed model training and tuning, and included integrated model serving and monitoring for seamless production readiness.
- Owned all technical decisions, defining architecture, tools, and frameworks to ensure scalability and performance;
- Led a global team to deliver high-quality, collaborative code aligned with business goals;
- Delivered scalable data engineering pipelines handling large volumes from S3, Snowflake, and Kafka, using the medallion model architecture;
- Developed custom feature encoding libraries for distributed processing on Databricks and optimized PySpark feature engineering to calculate critical time-based features (see the sketch after this list);
- Implemented and productionized Ray on Databricks for distributed model training and tuning, leveraging its richer APIs for superior training efficiency and performance;
- Applied MLOps best practices, including multi-environment setups and champion-challenger strategies for robust production workflows.
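To make the time-based feature engineering concrete, here is a rough PySpark sketch of a rolling 24-hour window over per-card transactions; the table names, columns, and window length are hypothetical and only illustrate the general approach.

```python
# Illustrative sketch: rolling time-based features in PySpark, in the spirit
# of the silver-layer feature engineering described above. Table and column
# names are hypothetical.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

txns = spark.table("silver.transactions")  # assumed medallion silver table

# 24-hour rolling window per card, ordered by event time in epoch seconds.
day_window = (
    Window.partitionBy("card_id")
    .orderBy(F.col("event_ts").cast("long"))
    .rangeBetween(-24 * 3600, 0)
)

features = txns.select(
    "card_id",
    "event_ts",
    F.count("*").over(day_window).alias("txn_count_24h"),
    F.sum("amount").over(day_window).alias("txn_amount_24h"),
    F.avg("amount").over(day_window).alias("txn_avg_amount_24h"),
)

features.write.mode("overwrite").saveAsTable("gold.card_time_features")
```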
Senior Data Engineer
A customized CDAP platform (the open-source project behind Google Cloud Data Fusion) for designing no-to-low-code ETL pipelines. It leveraged AWS services such as EMR, Kinesis, and S3, and included custom serverless solutions for encryption and tokenization. The platform featured a visual pipeline builder, reusable plugins, and lifecycle management, enabling external teams to efficiently design and manage complex data workflows.
- Helped customers productionize a handful of data pipelines within a few months, covering both batch and real-time streaming use cases;
- Developed a custom Spark-based plugin to add metadata, validate partitioning, and ensure transformation consistency (see the sketch after this list);
- Customized the AWS Kinesis plugin, tuning shard configuration to maximize throughput and minimize ingestion latency to Snowflake;
- Created a custom plugin to integrate serverless AWS Lambda functions for data tokenization and encryption of sensitive PCI and PII data.
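Actual CDAP plugins are typically Java classes built against the CDAP plugin API, but the core transform logic of a metadata/partition-validation plugin like the one above could look roughly like this PySpark sketch; the bucket paths, column names, and `source_system` value are placeholders.

```python
# Hypothetical sketch of the plugin's core logic (real CDAP plugins are
# usually Java; this only illustrates the transformation itself in PySpark).
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.read.parquet("s3://example-bucket/raw/events/")  # assumed input

# Add lineage/metadata columns so downstream consumers can trace each record.
enriched = (
    df.withColumn("ingestion_ts", F.current_timestamp())
    .withColumn("source_system", F.lit("pos-gateway"))       # assumed name
    .withColumn("pipeline_run_id", F.lit("run-2024-01-01"))  # injected per run
)

# Validate partitioning: every record must carry a non-null partition date.
bad_rows = enriched.filter(F.col("partition_date").isNull()).count()
if bad_rows > 0:
    raise ValueError(f"{bad_rows} records are missing partition_date")

enriched.write.partitionBy("partition_date").mode("append").parquet(
    "s3://example-bucket/curated/events/"
)
```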
Senior Data Engineer
The project migrated an on-premises Hadoop/Hive platform to a cloud architecture built on AWS, Snowflake, and Databricks. Data from Kafka was streamed into S3, then ingested and processed by Databricks for heavy transformations and ML workloads. Snowflake hosted BI-ready data with dbt transformations, serving key stakeholders. All data pipelines that fed the on-premises platform were updated and migrated to the Snowflake and Databricks environments. This hybrid platform balanced advanced data engineering in Databricks with scalable analytics in Snowflake, catering to diverse user needs.
- Actively led end-to-end migration, redesigning data ingestion pipelines from Kafka and databases via AWS S3;
- Implemented a custom Kafka Connect component to ingest data into S3 efficiently;
- Migrated Spark pipelines from Hadoop to Databricks for scalable processing and advanced transformations;
- Ported Hive-based data pipelines into dbt ELT workflows running on Snowflake, optimizing for BI stakeholder needs;
- Orchestrated automated ELT pipelines with Airflow for reliable and monitored data flows (see the sketch after this list);
- Coordinated data quality and performance tuning across both dbt and Spark pipelines;
- Integrated Snowflake’s app/gold data layers with Tableau and delivered custom dashboards and reporting solutions.
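A rough sketch of the Airflow orchestration pattern described above, assuming the Databricks provider's `DatabricksRunNowOperator` followed by a `dbt run` step via `BashOperator`; the job ID, connection IDs, schedule, and model selector are placeholders rather than project specifics.

```python
# Illustrative Airflow DAG: trigger the Databricks transformation job, then
# run the dbt models that build the BI-ready layers in Snowflake.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.providers.databricks.operators.databricks import (
    DatabricksRunNowOperator,
)

with DAG(
    dag_id="kafka_s3_databricks_snowflake",
    start_date=datetime(2024, 1, 1),
    schedule="@hourly",
    catchup=False,
) as dag:
    run_databricks_job = DatabricksRunNowOperator(
        task_id="run_databricks_transformations",
        databricks_conn_id="databricks_default",
        job_id=12345,  # placeholder job id
    )

    run_dbt_models = BashOperator(
        task_id="run_dbt_models",
        bash_command="dbt run --select staging+ --profiles-dir /opt/dbt",
    )

    run_databricks_job >> run_dbt_models
```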
Senior Data Engineer
A containerized (Docker) monitoring platform orchestrated on AWS Elastic Kubernetes Service (EKS), designed to oversee fraud model performance in real time. It processed streaming data via Kafka Streams while computing batch KPIs with Spark, ensuring comprehensive health tracking.
- Implemented custom Spark processing libraries for hourly/daily batch KPIs, including model drift via KL divergence (see the sketch after this list), and produced results to Kafka for downstream consumption;
- Designed and developed a Kafka Streams solution for real-time KPI generation, processing streaming metrics with low-latency aggregations and anomaly thresholds;
- Built Kafka consumers to push metrics into Prometheus and Grafana, plus multiple dashboards visualizing drift, accuracy, latency, and uptime;
- Integrated Prometheus and Grafana with corporate observability platforms, configuring alerts via email and Slack for rapid incident response;
- Coordinated optimization of container orchestration on EKS, ensuring high availability and fault-tolerant metric ingestion;
- Designed end-to-end integration tests validating real-time Kafka Streams, batch Spark KPIs, Prometheus metrics flow, and Grafana dashboards in staging before prod promotion;
- Introduced PostgreSQL to maintain state for static and non-temporal KPIs.
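As a minimal sketch of the drift KPI idea, the snippet below compares current model scores against a baseline window using KL divergence and publishes the metric to Kafka with the kafka-python client; the bucketing, topic name, and sample data are assumptions.

```python
# Minimal sketch: score-distribution drift via KL divergence, published to
# Kafka for downstream dashboards. All names and thresholds are placeholders.
import json

import numpy as np
from kafka import KafkaProducer  # kafka-python client, assumed here


def kl_divergence(p: np.ndarray, q: np.ndarray, eps: float = 1e-9) -> float:
    """KL(P || Q) over two histograms normalized to probabilities."""
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(p * np.log((p + eps) / (q + eps))))


def score_drift(baseline_scores, current_scores, bins: int = 20) -> float:
    # Bucket scores into a shared [0, 1] grid so the histograms are comparable.
    edges = np.linspace(0.0, 1.0, bins + 1)
    baseline_hist, _ = np.histogram(baseline_scores, bins=edges)
    current_hist, _ = np.histogram(current_scores, bins=edges)
    return kl_divergence(current_hist.astype(float), baseline_hist.astype(float))


producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Synthetic example data standing in for baseline vs. current model scores.
drift = score_drift(np.random.beta(2, 5, 10_000), np.random.beta(2, 4, 10_000))
producer.send("model-kpis", {"metric": "score_kl_divergence", "value": drift})
producer.flush()
```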
Senior Data Engineer
A machine learning feedback system that closed the loop between fraud analysis in the on-prem Hadoop data lake and the company's AWS-hosted FraudSight API, enabling continuous model retraining with confirmed fraud cases (CHB/RFIs). It processed massive transaction volumes to identify confirmed fraud, chargebacks, and refunds, then fed these back to the API as negative feedback.
- Optimized fraud matching pipeline from source tables through Hive partitioning and Spark resource tuning;
- Implemented a Spark producer with Confluent Schema Registry enforcement, using Avro for the FraudSight API contract (see the sketch after this list);
- Developed custom Kafka Connect HTTP sink consuming schema-validated messages with batching and serialization;
- Implemented API throttling & retry logic with rate limiting, backpressure, and dead letter queue;
- Added Hadoop pipeline monitoring, tracking job SLAs, data freshness, and failure alerts;
- Implemented Kafka Connect monitoring via Prometheus/Grafana dashboards.
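The production pipeline used a Spark producer, but the schema-enforcement idea can be sketched with the confluent-kafka Python client as below; the Avro schema, topic, and registry/broker URLs are placeholders.

```python
# Sketch: schema-enforced feedback publishing via Confluent Schema Registry.
from confluent_kafka import Producer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroSerializer
from confluent_kafka.serialization import MessageField, SerializationContext

# Hypothetical Avro contract for a confirmed-fraud feedback record.
FEEDBACK_SCHEMA = """
{
  "type": "record",
  "name": "FraudFeedback",
  "fields": [
    {"name": "transaction_id", "type": "string"},
    {"name": "outcome", "type": "string"},
    {"name": "confirmed_at", "type": "long"}
  ]
}
"""

schema_registry = SchemaRegistryClient({"url": "http://localhost:8081"})
serializer = AvroSerializer(schema_registry, FEEDBACK_SCHEMA)
producer = Producer({"bootstrap.servers": "localhost:9092"})

topic = "fraud-feedback"
record = {"transaction_id": "txn-123", "outcome": "chargeback", "confirmed_at": 1700000000}

# Serialization fails fast if the record violates the registered schema.
producer.produce(
    topic=topic,
    value=serializer(record, SerializationContext(topic, MessageField.VALUE)),
)
producer.flush()
```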
Senior Data Engineer
A custom data processing solution that parsed and loaded the company's Point-of-Sale (POS) transaction log files into the enterprise Hadoop data lake, providing analytics visibility to the finance department.
- Designed the end-to-end ingestion pipeline from SFTP landing through tokenization, encryption, and Hadoop data lake loading;
- Developed a custom Scala parser with concurrent multi-threaded processing for high-throughput PTLF file handling;
- Implemented multi-stage recovery logic enabling partial parsing of malformed files while skipping corrupt records;
- Implemented SFTP server integration with automated file discovery, secure credential management, and incremental fetching;
- Integrated PCI-DSS compliant encryption for data at rest/transit with field-level tokenization of cardholder data;
- Optimized Hive storage with the ORC format for columnar compression, predicate pushdown, and 10x query performance gains (see the sketch below).
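The parser itself was written in Scala; the PySpark sketch below only illustrates the ORC-backed, partitioned Hive storage layout choice, with hypothetical table, column, and path names.

```python
# Sketch of the ORC storage layout for parsed POS records (names are
# placeholders; the production parser was a Scala component).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .enableHiveSupport()
    .getOrCreate()
)

parsed = spark.read.json("hdfs:///landing/pos/parsed/")  # output of the parser

(
    parsed.write
    .format("orc")
    .option("compression", "zlib")   # columnar compression
    .partitionBy("business_date")    # enables partition pruning at query time
    .mode("append")
    .saveAsTable("finance.pos_transactions")
)
```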