Most developers haven't had to think about insurance carrier integrations. If you have, you know they involve a particular kind of complexity: one that doesn't feel complicated in any single piece but turns into something genuinely subtle when you're coordinating across five different carriers, each with their own submission API, status model, and timing expectations.
During my time at Porch Software, I worked on the Edward Jones e-application platform (a system that lets financial advisers submit life insurance applications on behalf of clients directly to carriers like Protective, Lincoln Financial, Prudential, Pacific Life, and Nationwide). One of the more involved things I built there was the Holds & Releases workflow engine. This post walks through what that system did, why it was more nuanced than it sounds, and what I'd approach differently.
What is a "Hold" in this context?
In life insurance underwriting, a hold is a pause in the application process triggered by the carrier (usually because they need something before they can proceed). Common causes include a missing medical record, an incomplete owner section, a beneficiary designation that needs clarification, or a question the proposed insured needs to answer via a phone interview.
From the platform's perspective, a hold means: the application was submitted successfully, the carrier received it, and now we're waiting, but not passively. We need to know when the hold is cleared, why it was placed, and in some cases prompt the adviser or client to take action.
A release is the resolution: the carrier has what they need, the hold is lifted, and the application can move forward in the underwriting queue.
The architecture: SFTP-based status feeds
Each carrier delivers status updates differently. Some expose a REST API you can poll. Others (particularly the older, larger ones) drop flat files on an SFTP server on a schedule. Protective, for example, delivers a batch status feed via SFTP rather than real-time webhooks.
The Holds & Releases engine needed to handle both patterns without the calling code caring which it was dealing with. Here's the rough flow:
- A scheduled job picks up incoming SFTP file drops from each carrier directory.
- Files are parsed into a normalized status model (hold type, case ID, timestamp, notes).
- The normalized status is matched to the corresponding application record in our database.
- If a hold is detected, the workflow engine triggers the appropriate downstream actions: an email notification to the adviser, a status update in the UI, and (if required) a prompt for client action such as completing an online interview.
- When the hold clears, the same pipeline processes the release and updates the application state.
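The steps above can be sketched roughly as follows. This is a minimal illustration with hypothetical names (`NormalizedStatus`, `process_feed_file`), not the actual implementation:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class NormalizedStatus:
    """Shared status model every carrier's file is parsed into."""
    case_id: str
    hold_type: str       # internal category, not the raw carrier code
    is_release: bool     # True when the event lifts a hold
    timestamp: str
    notes: str = ""

def process_feed_file(raw_lines, parse_line, applications, notify):
    """Parse one carrier file, match statuses to applications, trigger actions.

    parse_line:   a per-carrier parser returning a NormalizedStatus
    applications: case_id -> application record (a plain dict here)
    notify:       callback for downstream actions (email, UI update, ...)
    """
    for line in raw_lines:
        status = parse_line(line)
        app = applications.get(status.case_id)
        if app is None:
            continue  # unmatched status; the real system would log this
        app["on_hold"] = not status.is_release
        notify(app, status)
```

The key property is that `process_feed_file` never sees carrier-specific fields; everything carrier-shaped lives inside `parse_line`.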
One thing that made this harder than it sounds: the status files from different carriers use different schemas, different field names for conceptually identical things, and different conventions for representing holds versus releases. Normalizing across them reliably required building per-carrier parsers that fed into a shared status model, rather than trying to write generic parsing logic that handled everyone.
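To make the per-carrier-parser point concrete, here is a toy sketch. The field names (`PolicyRef`, `case_number`, and so on) are invented for illustration; real carrier schemas are far messier:

```python
# Two hypothetical carrier schemas for the same conceptual event.
def parse_carrier_a(record):
    return {
        "case_id": record["PolicyRef"],
        "hold_code": record["HoldCode"],
        "timestamp": record["EventDt"],
    }

def parse_carrier_b(record):
    return {
        "case_id": record["case_number"],
        "hold_code": record["status_reason"],
        "timestamp": record["ts"],
    }

# One parser per carrier, all feeding the same normalized shape.
PARSERS = {"carrier_a": parse_carrier_a, "carrier_b": parse_carrier_b}

def normalize(carrier, record):
    return PARSERS[carrier](record)
```

Each parser is small and boring on its own; the value is that everything downstream of `normalize` deals with exactly one shape.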
Email notifications and triggering logic
When a hold lands on an application, the adviser needs to know, but not every hold warrants the same urgency or the same message. An administrative hold (missing a cover page, for example) is different from a medical hold that requires the proposed insured to schedule a call.
The notification logic was keyed off hold type and carrier. Each carrier surfaces hold types differently, so the engine maintained a mapping from carrier-specific hold codes to internal hold categories, and the notification templates were driven by those categories rather than raw carrier codes.
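A sketch of that mapping layer, with made-up hold codes and template text (the real codes and copy were carrier- and compliance-specific):

```python
# Hypothetical (carrier, hold code) pairs mapped to internal categories.
HOLD_CATEGORY = {
    ("carrier_a", "MED01"): "medical",
    ("carrier_a", "ADM07"): "administrative",
    ("carrier_b", "H-MED"): "medical",
}

# Templates keyed by internal category, never by raw carrier code.
TEMPLATES = {
    "medical": "Your client may need to schedule a call with the carrier.",
    "administrative": "Paperwork is missing; please review the application.",
}

def notification_for(carrier, hold_code):
    category = HOLD_CATEGORY.get((carrier, hold_code))
    if category is None:
        return None  # unknown code: log it and route to manual review
    return TEMPLATES[category]
```

Because templates hang off categories, adding a new carrier means adding mapping entries, not new notification logic.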
One specific flow worth describing is Protective's online interview hold. When an underwriter places this type of hold, Protective needs the proposed insured to complete a phone interview before the case can proceed. The notification we sent wasn't triggered by the hold being placed (that goes to the adviser) but by the client actually starting the interview, which Protective surfaces as a separate status event in the feed. That distinction matters: an adviser seeing "interview hold placed" is different from "your client has started the interview," and mixing them up produces noise that trains people to ignore your notifications.
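The distinction between those two events can be shown with a trivial routing sketch (event names and message text are invented here):

```python
# Two distinct feed events for the same case, each with its own message.
def message_for(event):
    if event == "INTERVIEW_HOLD_PLACED":
        return ("adviser", "An interview hold was placed on this case.")
    if event == "INTERVIEW_STARTED":
        return ("adviser", "Your client has started the phone interview.")
    return None  # other events are handled elsewhere
```

Collapsing both events into one "interview hold" notification is exactly the mistake this guards against.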
The trickier problem was deduplication. Because SFTP files are batch drops processed on a schedule, the same status file could be reprocessed if a job failed and retried. Without explicit deduplication, a single hold event could fire multiple email notifications to the adviser. The fix was straightforward once identified: before triggering any notification, the engine checks whether that specific combination of case ID, hold type, and status timestamp has already been acted on. If it has, the event is logged and skipped. If it hasn't, it's processed and the combination is recorded. The cost of that check is negligible; the cost of not having it is an adviser getting the same email four times and losing trust in the system.
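The dedup check is simple enough to sketch in a few lines. In the real system the "seen" set would be a database table so it survives restarts; an in-memory set stands in for it here:

```python
def make_dedup_filter():
    """Return a filter that processes each (case, hold, timestamp) once."""
    seen = set()  # stand-in for a persistent store

    def should_process(case_id, hold_type, status_timestamp):
        key = (case_id, hold_type, status_timestamp)
        if key in seen:
            return False  # already acted on: log and skip
        seen.add(key)
        return True

    return should_process
```

A retried batch job replays the same keys and gets `False` back, so no notification fires twice.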
Concurrent multi-carrier deployments
The trickiest operational challenge wasn't the logic. It was the fact that the platform was running integrations with multiple carriers simultaneously, each at a different stage of rollout. Protective's Holds & Releases flow went live while Lincoln's was still in integration testing and Prudential's was still being built.
This meant the engine had to be safe to deploy in a partially complete state: carriers that weren't yet integrated needed to pass through without errors, carriers in testing needed to be isolated from production data, and carriers in production needed to keep working when we shipped changes for a different carrier.
We handled this with per-carrier feature switches. Each carrier's integration path was gated behind a flag, so we could deploy new carrier code to production without activating it, run internal testing against a mock server that simulated the carrier's SFTP feed and submission API, certify the integration, and then flip the switch. The mock servers were particularly useful here: rather than needing a live connection to Protective's or Lincoln's sandbox environment for every test run, the mock let us reproduce specific hold scenarios (including edge cases we'd seen in production) on demand.
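The gating can be sketched as a per-carrier switch with three states. This is a simplified illustration; the real flags lived in configuration, not code:

```python
# Hypothetical rollout states per carrier.
CARRIER_FLAGS = {
    "protective": "live",
    "lincoln": "testing",    # isolated from production data
    "prudential": "off",     # code deployed but inactive
}

def route(carrier, event, live_handler, test_handler):
    state = CARRIER_FLAGS.get(carrier, "off")
    if state == "live":
        return live_handler(event)
    if state == "testing":
        return test_handler(event)  # e.g. the mock-server path
    return None  # not yet integrated: pass through without error
```

The important property is the default: an unknown or inactive carrier falls through silently rather than raising, so shipping code for one carrier can never break another.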
The cross-carrier bugs were the most time-consuming to diagnose. The most instructive one involved how the platform serialized the "Relationship to Proposed Insured" field, which appears in owner and beneficiary sections across all carriers. Each carrier has its own set of accepted values for this field, mapped from a shared internal enum. When we corrected a mapping for one carrier's expected values, it changed how the shared model serialized the field, which caused a different carrier's validation to start rejecting submissions that had previously passed. The fix was in the mapping layer, but finding it meant tracing through two carriers' submission logs to identify that the same field was behaving differently between them after a deploy that neither team thought was related.
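The shape of the fix is worth sketching: a shared internal enum with fully separate per-carrier value maps, so correcting one carrier's values cannot change another's serialization. The enum members and accepted strings below are invented for illustration:

```python
from enum import Enum

class Relationship(Enum):
    """Shared internal enum for 'Relationship to Proposed Insured'."""
    SPOUSE = "spouse"
    CHILD = "child"
    TRUST = "trust"

# Each carrier's accepted strings live in its own map; editing one map
# cannot leak into another carrier's output.
CARRIER_VALUES = {
    "carrier_a": {Relationship.SPOUSE: "SP", Relationship.CHILD: "CH",
                  Relationship.TRUST: "TR"},
    "carrier_b": {Relationship.SPOUSE: "Spouse", Relationship.CHILD: "Child",
                  Relationship.TRUST: "Trust"},
}

def serialize_relationship(carrier, rel):
    return CARRIER_VALUES[carrier][rel]
```

The bug we hit is what happens when the mapping isn't isolated this way: one shared serialization path, two carriers, and a "fix" for one silently changing the bytes the other receives.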
What I'd do differently
A few things stand out in retrospect:
Build the per-carrier parser abstraction earlier. The first carrier's parser was written directly against the shared status model because there was only one carrier at the time. By the time the second carrier arrived, there was already logic that assumed the first carrier's shape in places it shouldn't have been. Retrofitting the abstraction cleanly took longer than it would have to design it upfront, and there were a few spots where the seams were never quite clean. When you know multi-carrier support is coming, the extra hour to define the interface first is almost always worth it.
Add observability before you need it, not after. The deduplication logging and the per-event status tracking were added reactively, after the first time a batch reprocessed and fired duplicate emails. The same was true of the monitoring that told us how many SFTP files were being picked up per run, whether any carrier directories were unexpectedly empty, or how long the normalization step was taking per carrier. None of this was hard to add, but adding it under pressure while also debugging a production issue is worse than having it from day one.
Treat hold detection and notification triggering as separate concerns from the start. In the initial implementation they were close together in the same code path, which made sense when there was one carrier and one notification type. As more carriers came on with different hold types and different notification rules, that coupling made the logic harder to follow and harder to test in isolation. Separating "this application is now on hold" from "given this hold, what should happen" into distinct layers would have made both easier to reason about independently.
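That separation looks roughly like this. Hold types and rules here are hypothetical; the point is the two layers, not the contents:

```python
# Layer 1: detection -- decides only whether the application is on hold.
def detect_hold(status):
    """Return the hold type, or None if this status is a release."""
    return None if status["is_release"] else status["hold_type"]

# Layer 2: policy -- given a hold, decides what should happen.
def actions_for(hold_type):
    rules = {
        "medical": ["email_adviser", "prompt_client_interview"],
        "administrative": ["email_adviser"],
    }
    return rules.get(hold_type, ["email_adviser"])  # safe default

def handle(status):
    hold = detect_hold(status)
    return [] if hold is None else actions_for(hold)
```

With the layers split, a new carrier only touches detection, a new notification rule only touches policy, and each can be unit-tested without the other.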
Takeaways
Insurance platform work is rarely the kind of engineering that gets written up in blog posts. It's not distributed systems at Google scale or novel ML infrastructure. It's a dense domain with decades of accumulated convention, standards like ACORD XML that predate most of the frameworks we use, and carriers that move slowly for good reasons.
What I found interesting about it was that the complexity was mostly in the coordination: between carriers, between teams, between stages of rollout. Getting that coordination right in software is the same problem you encounter in every sufficiently complex system. The insurance domain just makes the stakes visible in an unusually concrete way.