Databricks case study: How Oportun simplified data engineering and machine learning

June 15, 2026

At Oportun, data is not something we “report on” after the fact. It is mission-critical and part of how we run the business. It drives how we understand customers, measure performance, manage risk, detect fraud, and operationalize machine learning in day-to-day workflows. As our data footprint grew and more teams relied on data for decisions, one reality became clear: the long-term challenge was not whether we could build pipelines and models. We could. The challenge was whether we could do it reliably, quickly, and securely as adoption expanded, without multiplying tools, handoffs, and operational overhead.

That is the story behind our move to Databricks. We have transitioned to a Databricks-first platform that consolidates ingestion, transformation, analytics, and machine learning into a consistent operating model. Equally important, we strengthened our governance posture by ensuring that all of our data is governed by Unity Catalog. In practice, that means permissions, auditing, lineage, and policy-based protections are applied centrally and consistently, regardless of whether the data is used for analytics, reporting, or ML.

This journey is a Databricks success story because it made our platform simpler and our delivery faster: data ingestion timelines fell from up to two weeks to about two days, ML model development shortened from roughly seven weeks to two, and production deployments became more predictable, all while governance became the default rather than an afterthought.

Here’s what we’re going to cover

Where we started: Spark ETL that touched many systems
The shift we made: one platform and one operating model
Turning data into products: Bronze, Silver, and Gold as a shared contract
dbt as the modeling discipline: making transformation scalable and maintainable
Consumption aligned to curated datasets and shared definitions
Governance at scale: Unity Catalog, ABAC, and PII masking
What changed materially: cycle time improvements across data and ML
Conclusion: what this unlocks for Oportun

Key takeaways

Moving to a Databricks-first platform helped Oportun simplify ingestion, transformation, analytics, and machine learning in one operating model.
Standardized data products and dbt-based modeling helped reduce logic drift and make trusted datasets easier to scale.
Unity Catalog, ABAC, and PII masking helped make governance more consistent as data access expanded across the organization.

Where we started: Spark ETL that touched many systems

Our earlier environment was capable and grew organically over time. Data engineering workflows expanded into Spark code that interacted with multiple AWS services and interfaces, including EMR on EC2, S3, DynamoDB, Glue, Athena, and Redshift (including Redshift Spectrum), along with discovery and catalog tooling.

That ecosystem gave us flexibility, but it also increased what we often call the “integration tax.” Each product introduced its own operational considerations: how jobs are scheduled, how failures are handled, how schema changes are managed, how data is discovered, and how access is enforced. As pipelines multiplied and more teams contributed, the overall system naturally accumulated variability. Two pipelines solving similar problems could look very different because they were implemented through different tools, conventions, or patterns.

The impact of that fragmentation wasn’t abstract. It showed up in longer onboarding cycles, a higher troubleshooting burden, and slower iteration when changes were needed. It also increased the risk of logic drift, where the same concept or metric is computed differently depending on where it is used, which creates reconciliation work and reduces trust.

In parallel, our notebook-first ML approach supported rapid experimentation, but the path from “notebook success” to “reliable production model” was not always consistent. Feature logic often needed to be standardized, dependencies hardened, deployment workflows formalized, and access controls repeatedly validated. This is a common inflection point: experimentation scales with notebooks, but production ML scales with repeatable platform primitives and consistent governance.

The integration tax shows up when capable systems become harder to operate consistently as adoption grows.

The shift we made: one platform and one operating model

We moved to a Databricks-first approach to simplify how we build and operate data and ML workloads. Databricks now provides a single environment where ingestion, transformations, analytics, and machine learning can be developed and run with consistent patterns. This consolidation reduced operational variance and clarified ownership, because work is no longer spread across multiple platforms with different operational models.

Just as importantly, we built governance into the foundation. Unity Catalog governs all of our data, providing a centralized layer for permissions and access controls, auditing, lineage, and consistent policy enforcement. This matters at scale because governance can’t be dependent on where a dataset happens to live or which tool created it. Whether the consumer is an analyst, a dashboard, or a model pipeline, governance is applied consistently at the platform layer.

This is the core difference between a collection of tools and an operating model: we are no longer stitching together platform behavior across systems. We are standardizing it.

Turning data into products: Bronze, Silver, and Gold as a shared contract

A major unlock in our Databricks-first approach has been standardizing how raw data becomes trusted, reusable data products. We organize data through an explicit progression from raw ingestion to validated datasets to curated, consumption-ready assets.

Raw data is retained to preserve traceability and enable replay when needed. Validated and conformed layers apply quality checks and business rules, producing datasets that are standardized and reusable. Gold datasets are then curated around domains and KPIs so teams can consume them confidently for analytics and downstream use cases.

This structure benefits both technical and non-technical stakeholders. For technical teams, it clarifies where transformations belong and how quality is enforced. For business users, it establishes a clear contract: Gold datasets represent curated, decision-ready data, not raw extracts with hidden assumptions. It also helps keep business logic centralized in the platform rather than scattered across ad hoc pipelines or downstream calculation layers.
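
To make the Bronze, Silver, and Gold progression concrete, here is a minimal sketch in plain Python. It is an illustration of the layering idea only, not Oportun's pipeline code: the record fields (`loan_id`, `state`, `amount`), the validation rules, and the function names `to_silver` and `to_gold` are all hypothetical, and a real implementation would run on Spark tables rather than in-memory lists.

```python
from collections import defaultdict

# Bronze (hypothetical sample): raw records kept exactly as received,
# including malformed rows, so the data stays traceable and replayable.
BRONZE = [
    {"loan_id": "A1", "state": "ca", "amount": "1200.50"},
    {"loan_id": "A2", "state": "TX", "amount": "800"},
    {"loan_id": None, "state": "TX", "amount": "300"},          # missing key
    {"loan_id": "A3", "state": "CA", "amount": "not-a-number"},  # bad amount
]


def to_silver(bronze_rows):
    """Validate and conform raw rows. Rejected rows are returned with a
    reason instead of being dropped silently."""
    valid, rejected = [], []
    for row in bronze_rows:
        try:
            if not row["loan_id"]:
                raise ValueError("missing loan_id")
            valid.append({
                "loan_id": row["loan_id"],
                "state": row["state"].upper(),   # conform casing
                "amount": float(row["amount"]),  # enforce numeric type
            })
        except (KeyError, TypeError, ValueError) as exc:
            rejected.append({"row": row, "reason": str(exc)})
    return valid, rejected


def to_gold(silver_rows):
    """Curate a consumption-ready aggregate: total amount per state."""
    totals = defaultdict(float)
    for row in silver_rows:
        totals[row["state"]] += row["amount"]
    return dict(totals)


silver, rejected = to_silver(BRONZE)
gold = to_gold(silver)
print(gold)
print(f"{len(rejected)} rows rejected at the Silver boundary")
```

The point of the sketch is the contract: Bronze is never modified, quality rules live at the Silver boundary, and Gold is computed only from rows that passed those rules.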

dbt as the modeling discipline: making transformation scalable and maintainable

To keep modeling consistent as more teams contribute, we use dbt as our data modeling layer. dbt provides a repeatable workflow for authoring transformations as modular models with standardized tests and documentation. This improves maintainability and reduces duplication because teams can build on shared models and agreed patterns rather than re-implementing logic in isolated ways.

From an operating standpoint, dbt helps us scale collaboration. Transformation logic becomes version-controlled, reviewable, and easier to evolve safely. The outcome is not merely cleaner code; it is a more predictable and scalable system for producing trusted datasets.
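
As a rough illustration of what standardized tests check, the sketch below mimics the semantics of two of dbt's built-in generic tests, `not_null` and `unique`, in plain Python. In dbt itself these tests are declared in YAML and compiled to SQL that returns the failing rows; the helper functions, the `customers` sample, and the column names here are hypothetical and exist only to show the idea.

```python
from collections import Counter


def not_null_failures(rows, column):
    """Return the rows where `column` is missing or None."""
    return [row for row in rows if row.get(column) is None]


def unique_failures(rows, column):
    """Return the non-null values of `column` that appear more than once."""
    counts = Counter(
        row[column] for row in rows if row.get(column) is not None
    )
    return sorted(value for value, n in counts.items() if n > 1)


# Hypothetical model output with one null email and one duplicated key.
customers = [
    {"customer_id": 1, "email": "a@example.com"},
    {"customer_id": 2, "email": None},
    {"customer_id": 2, "email": "c@example.com"},
]

results = {
    "not_null_customer_id": not_null_failures(customers, "customer_id"),
    "not_null_email": not_null_failures(customers, "email"),
    "unique_customer_id": unique_failures(customers, "customer_id"),
}

# A test passes when it finds no failing rows or values.
failed = sorted(name for name, failures in results.items() if failures)
print("failed tests:", failed)
```

Because every model declares the same kinds of checks, a reviewer can tell at a glance which guarantees a dataset carries, which is what makes shared models safe to build on.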

Consumption aligned to curated datasets and shared definitions

We also took a deliberate approach to consumption. Rather than spreading core metric logic across many endpoints, we emphasize consumption from curated datasets and shared definitions. The goal is to reduce metric drift and increase trust: when different teams look at performance, risk metrics, or customer outcomes, the organization benefits when those insights are anchored to the same curated sources and definitions.

This is one of the most important changes for non-technical readers: the value isn’t only faster pipelines. The value is confidence that when teams talk about a metric, they are talking about the same thing, because the definition is centralized and governed.

Governance at scale: Unity Catalog, ABAC, and PII masking

As data usage expands, protecting sensitive data must be systematic. That is why we standardized on Unity Catalog as our governance foundation for the entire data estate. Governance is not a separate process; it is how the platform operates.

Unity Catalog provides centralized access management and auditing so we can apply consistent controls and maintain visibility into data usage. On top of that, we use attribute-based access control (ABAC) and dynamic masking to protect PII in a policy-driven way. Sensitive values are masked by default and exposed only to authorized roles or approved use cases, without requiring redundant “masked copies” of datasets or manual enforcement in multiple tools.

This approach allows us to expand access safely. It supports the outcome we care about most: enabling self-service and broad adoption while maintaining strong controls and auditability as the organization scales.
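
The policy-driven masking described above can be sketched in a few lines of plain Python. This is a conceptual model, not the Unity Catalog API: in Unity Catalog, tags, policies, and masks are defined and enforced by the platform, whereas the `COLUMN_TAGS` mapping, the `pii_reader` attribute, the `User` class, and the `read_row` function below are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass

MASK = "***MASKED***"

# Hypothetical governance metadata: columns carry tags, and each tag names
# the user attribute required to read the column in the clear.
COLUMN_TAGS = {"ssn": {"pii"}, "email": {"pii"}, "loan_amount": set()}
REQUIRED_ATTRIBUTE = {"pii": "pii_reader"}


@dataclass(frozen=True)
class User:
    name: str
    attributes: frozenset = frozenset()


def read_row(user, row):
    """Return `row` as `user` is allowed to see it. Tagged columns are
    masked unless the user holds every attribute their tags require."""
    visible = {}
    for column, value in row.items():
        required = {
            REQUIRED_ATTRIBUTE[tag]
            for tag in COLUMN_TAGS.get(column, set())
            if tag in REQUIRED_ATTRIBUTE
        }
        allowed = required <= user.attributes
        visible[column] = value if allowed else MASK
    return visible


row = {"ssn": "123-45-6789", "email": "x@example.com", "loan_amount": 1500}
analyst = User("analyst")
officer = User("fraud_officer", frozenset({"pii_reader"}))

print(read_row(analyst, row))  # PII columns masked
print(read_row(officer, row))  # PII columns visible
```

Both users query the same single dataset. The decision is made from column tags and user attributes at read time, which is why no separate masked copy is needed.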

When governance sits at the platform layer, self-service access can expand without relying on manual controls in multiple tools.

What changed materially: cycle time improvements across data and ML

The most tangible outcomes of this platform shift show up in delivery speed and predictability.

In the previous environment, end-to-end data ingestion from key source databases could take up to two weeks, largely because ingestion workflows spanned many steps, systems, and dependencies. With our Databricks-first platform and standardized ingestion patterns, we reduced that timeline to two days, enabling teams to move from “data request” to “usable data” much faster.

We saw similar acceleration across the ML lifecycle. Model development timelines fell from roughly seven weeks to two weeks, supported by easier access to curated datasets, more consistent feature work, and faster iteration. Productionizing models also became more predictable: the time to deploy models into production has been reduced to approximately four to seven weeks, supported by standardized workflows and governance applied consistently through Unity Catalog.

These improvements are not driven by a single tool feature. They come from simplifying the system: fewer integration points, fewer handoffs, consistent patterns from ingestion through consumption, and governance that is centralized and enforced by default.
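
For readers who want the figures above as ratios, the short calculation below converts them into a speedup factor and a percentage reduction. It assumes "two weeks" means 14 calendar days and treats the stated "up to" and "roughly" values as point estimates, so the results are upper-bound approximations rather than measured averages. The `improvement` helper is a hypothetical name.

```python
def improvement(before, after):
    """Return (speedup factor, percent reduction) for a cycle time that
    went from `before` to `after`, both in the same unit."""
    speedup = before / after
    reduction_pct = (before - after) / before * 100
    return round(speedup, 1), round(reduction_pct)


# Ingestion: up to two weeks (assumed 14 days) down to two days.
ingestion = improvement(before=14, after=2)

# Model development: roughly seven weeks down to two weeks.
development = improvement(before=7, after=2)

print(f"ingestion: {ingestion[0]}x faster, {ingestion[1]}% shorter")
print(f"model development: {development[0]}x faster, {development[1]}% shorter")
```

Under those assumptions, ingestion is about 7 times faster (roughly an 86% reduction) and model development about 3.5 times faster (roughly a 71% reduction).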

Conclusion: what this unlocks for Oportun

Oportun’s Databricks journey is a success story because it delivered simplification with measurable operational impact. We moved from an ecosystem where Spark ETL touched many systems and where ML productionization required bespoke hardening to a Databricks-first platform that standardizes how data products and models are built, governed, and delivered.

With Bronze/Silver/Gold curation, dbt-driven modeling discipline, and Unity Catalog governing all data with ABAC and PII masking, we now have a foundation that scales as adoption grows. The outcomes are concrete: faster ingestion, faster model development, and a more predictable route to production, while governance remains consistent across the entire data estate.

This is what modern data at scale looks like for Oportun: one platform, trusted data products, and governance that is built in by design, enabling teams to move faster with confidence.

Frequently asked questions about Oportun’s Databricks success story

What was the main goal of Oportun’s move to Databricks?

The main goal was to simplify how data engineering, analytics, and ML workloads are built and governed so teams can move faster with more consistency.

How did Databricks improve data delivery at Oportun?

Oportun reduced ingestion timelines from up to two weeks to about two days by using more standardized patterns in a single platform.

Why does Unity Catalog matter in this Databricks success story?

Unity Catalog helps apply permissions, lineage, auditing, and policy-based protections consistently across analytics, reporting, and ML use cases.