by Chris Given

GAO: IRS Should Strengthen Oversight of Its Identity-Proofing Program

James R. McTigue, Jr.:

IRS officials were unable to show us that they had independently documented measurable goals or objectives to manage the outcomes of its identity-proofing program. According to federal internal control standards, officials should define objectives clearly to enable the identification of risks and define risk tolerances. Without independently established measures and goals, IRS cannot determine whether the performance of ID.me’s solutions meets IRS needs.

Without goals or objectives set by IRS, it is also not clear which of the several measures that ID.me provides are the best matches for what IRS needs or what level of performance is appropriate for a given application. For example, ID.me’s true pass rate excludes both users who abandon the process and users identified as highly probable fraudulent. In establishing measurable goals, IRS could determine that information on such users are essential performance measures that need to be established. Furthermore, as IRS continues to expand online services that require identity proofing it will need to consider additional metrics for these services.

At some point I’m going to write a long post about Direct File and identity. Today is not that day. But expect this report to be cited when I do.

GAO reports are pretty technical, and it’s understandable that some of the early takes I’ve seen kind of miss the mark. This is the key point: the IRS abdicated its responsibility to define what good looks like, outsourcing that responsibility to its vendor.

Algorithm Improvement Suggestions

So last week’s post was, not so subtly, a provocation aimed at the “just add AI” approach to government. But hey, now that I’ve already done the work of sussing out what’s going on and writing it out in detail, let’s add AI and see what happens.

I asked ChatGPT to read the post and suggest a few ways the algorithm could be improved.

Thanks for sharing the post. It’s a rich and thoughtful deep dive into a real-world government algorithm with significant care and attention to fairness, complexity, and transparency. Still, even great systems can be improved. Here are several ways the algorithm might be enhanced or revisited, based on what the blog described.

Oh god, the sycophancy.

  1. Reduce Reliance on Heuristics in Proportional Allocation
  • Current approach: The AMA algorithm uses fixed parameters like the INTERPOLATED_DIRECT_REVIEW_PROPORTION_ADJUSTMENT (e.g., 67%) and a MAXIMUM_DIRECT_REVIEW_PROPORTION cap (e.g., 80%).
  • Suggested improvement: Replace static interpolation factors and hardcoded caps with dynamic, data-driven optimization. Techniques like Bayesian updating or reinforcement learning could help the system adapt in real time based on backlog levels, processing rates, and success in meeting deadlines.

Right off the bat, I think this provokes an interesting question. ChatGPT suggests a system that requires less human intervention because it optimizes itself toward a defined set of goals.

In the algorithm as I designed it, there are certainly examples of exactly this. You could imagine a version of the algorithm that preserved the concept of “docket month,” and allowed a human operator to specify how deep on the docket the machine was allowed to look, historically a policy decision that humans at the Board had been making. In this case, I opted to completely replace this process. Given the pace at which humans can orient and act upon new information, a process with a human in the loop would necessarily require a greater margin of error, and it was better for the algorithm to determine the most efficient possible parameter at any given time.

The same could be said for the parameters ChatGPT has called out above. Except that I think those items differ from the docket month example in terms of the pros and cons that must be weighed to change them. There is no right answer to whether the Board should prioritize Direct Reviews over legacy appeals.

These are policy questions in a complex decision space without an optimal answer. As I explained the algorithm to different audiences, one thing that I would always emphasize was that it provided “levers” that decision makers could pull. It wasn’t my place to decide those questions as the algorithm’s designer, as I would if I had obfuscated my opinions via an optimization function specified in a black box algorithm.

  2. Implement Feedback Loops for Goal Reassessment
  • Current approach: The system assumes the 365-day goal for Direct Review decisions remains static, even when it becomes unrealistic.
  • Suggested improvement: Introduce a feedback mechanism that automatically flags when a policy goal (like the 365-day target) is routinely being missed, and prompts a reassessment. This could be both internal (staff dashboards, alerts) and external (public transparency reports).

This is right on, and actually prompted me to go back and add a note about the dashboard we built for this very purpose. (ChatGPT read an early draft; I was paranoid it might find some egregious error past me had made and thus undermine current me’s whole reason for writing.) But I like the suggestion that the data should be public.

  3. Address Race Conditions in FIFO Distribution
  • Current approach: Race conditions can occur when two judges request cases simultaneously and might skip over the true “oldest” FIFO case.
  • Suggested improvement: Introduce record-level locking or atomic fetch-and-assign mechanisms at the database level, even for legacy VACOLS records, possibly via a transactional wrapper layer or a temporary in-memory locking system.

My post flagged this as a theoretical problem, and yeah, it’s totally fixable. Caseflow already was using Redis; it wouldn’t have been hard to implement a locking mechanism. Heck, I added locking in Redis to my Slackbot to look up acronyms, which was nowhere near as mission critical. But the cost is adding complexity to the system, adding one more thing that could break. Given the negligible potential impact, I think I can defend letting that one go.
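
For the record, a locking mechanism like that is only a few lines. Here’s a minimal sketch using the redis-rb gem; the key name, timeout, and structure are made up for illustration, not Caseflow’s actual code.

require "redis"

redis = Redis.new

# Sketch only: take an advisory lock before selecting cases. NX sets the key
# only if it doesn't already exist; EX expires it after 60 seconds so a
# crashed process can't hold the lock forever.
if redis.set("distribution_lock", "locked", nx: true, ex: 60)
  begin
    # ... select and assign the next batch of cases here ...
  ensure
    redis.del("distribution_lock")
  end
else
  # Another distribution is in flight; wait briefly and retry.
end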

  4. Use Better Predictive Models for Case Inflow and Processing Time
  • Current approach: Direct Review pacesetting proportions and legacy NOD adjustments are based on historical averages (e.g., 40% of NODs becoming Form 9s).
  • Suggested improvement: Use predictive analytics or machine learning to estimate case inflow and judge workload more precisely. For example:
    • Train models on NOD metadata to predict Form 9 likelihood.
    • Use survival analysis to forecast case readiness timelines.
    • Update predictions continuously based on actual outcomes.

This one’s amusing to me. Once I finally got a VA computer (six weeks after I started work at VA), one of the first things I did was use survival analysis to model the appeals process end-to-end. Those models ended up having a ton of uses, both internal (workforce planning) and external (time estimates for Veterans checking their appeal’s status on VA.gov). ChatGPT would be proud. As for whether I should have plugged those models into the algorithm instead of the hardcoded 40% heuristic, nah, not worth the extra complexity.

  5. More Granular Customization of Batch Sizes

The feature it turned out nobody needed? Yeah, not adding that. I guess I did call it a “disadvantage,” though, so that one’s on me.

  6. Improve Fairness Monitoring Across Judges
  7. Add Resilience for Sudden Capacity Shifts

These are nice-sounding words that mean nothing.

  8. Publish Algorithm Documentation and Code
  • Current approach: The algorithm used to be public but was removed from GitHub.
  • Suggested improvement: Restore the public repository, redacting any sensitive implementation details, to meet modern expectations for algorithmic transparency in government decision-making. This helps build trust and allows outside experts to contribute improvements.

I agree completely.

Sittin’ On the Docket At VA

And now for something completely different.

This is a blog post about designing algorithms for government. “Algorithm,” in this context, just means “an automated way of making a decision.” I am going to describe, in great detail, a specific algorithm that I created for the U.S. Department of Veterans Affairs (VA). If you’re a U.S. Veteran who had an active case with the Board of Veterans Appeals (the Board) at any time since November 2018, this algorithm affected your appeal.

This post is going to get wonky. And it’s very long. I’m going to try my best to inverted pyramid this thing. It’s entirely okay to stop reading when you’ve had enough. There isn’t a twist at the end. There isn’t a reward for having finished it. The only reward is more detail. There is no bottom.

I’m sorry.

Why does any of this matter?

I hope to illustrate three things.

First, this is a story about automation. In this story, I automate a manual, bureaucratic process, from start to finish. But no one was fired, and in fact, the people who used to own the process were thrilled to be able to focus on more important work. To quote Paul Tagliamonte, “Let machines do what machines are good at, let humans do what humans are good at.” Everyone on all sides of this automation story was committed to doing right by Veterans. I sought to approach the task with the same level of care its former stewards had shown for many years, and I leveraged insights from the manual steps they had worked out through trial and error.

Automating this process involved more than just translating legal rules into instructions for a machine. The manual process involved bounded but significant discretion to ensure that the appeals continued to move apace. As we’ll see, a statutory change created even more opportunities for discretion. Machines don’t do discretion (or at least, they shouldn’t). So in concretizing the process into an algorithm, that discretion falls to the algorithm’s designer. Automation is thus a kind of rulemaking, but without established methods of public participation like notice-and-comment. And even while the decision space was sufficiently constrained to preclude bias on the basis of protected factors, I still needed to grapple with fundamental questions of fairness and ensure that humans remained in the driver’s seat.

Second, this is a story about complexity. You hear about civil servants studiously attending to their small niche of bureaucratic arcana, and this is a guided tour of one such niche. It’s not my niche; I was just a visitor. If, a couple thousand words into this post, the fleeting thought “maybe this whole process should just be simpler” crosses your mind, I understand. And maybe it should be! Sometimes complexity is a once good idea that’s gone rotten with time. Sometimes it was always a bad idea. But I guarantee there was intention behind the complexity. It’s trying to help someone who was left behind. It’s there because a court ordered that it must be, or because Congress passed a law. And sometimes, it might actually be load-bearing complexity. Take a sledgehammer to it at your own risk.

I think we should strive to “make it simpler” more often. Certainly I feel that Congress and the courts rarely give the challenges of implementing their will due consideration, nor do they always consider how things like administrative burden can adversely affect outcomes. But from the standpoint of most government workers, you have no choice but to make complexity work every single day. This is a story about how to make it work.

Finally, this is a story about transparency. The system that we are going to be looking at was developed in the open. Anyone in the world could check out its source code or read discussions the team had while designing and building it. This system manages every Veteran’s appeal, and instead of asking them to trust us, we’re showing our work. Because the U.S. government built it, it’s in the public domain, and you can find it on GitHub.

Except that’s a dead link. It’s dead because VA decided to take a project that had successfully operated with complete transparency for more than three years and hide it from the public. I don’t know why they decided to do that; I was long gone. I have no reason to believe there were any nefarious motivations, apart from a desire of certain officials to steer the project in a very different direction than that which my team believed in (a direction that, as the subsequent years have made clear, has not worked out).

So I’m going to have to describe the algorithm as it existed when I left VA in May 2019. I can tell from the Board’s public reports that, at a minimum, the parameters of the algorithm are set to different values than when I left (which is great; that’s why I put them there). Maybe the algorithm itself has been completely replaced (that would be cool; I don’t have a monopoly on how to solve the problem, and maybe I got something wrong!).

What’s not cool is that the public has no way of knowing, that Veterans have no way of knowing. For all I know, the algorithm has been replaced by an octopus (literal or figurative).

It is just a tiny component of a tiny system in a tiny part of VA, which is just one part of the U.S. federal government. But all across government, more and more decisions are being automated every day. For all the attention paid to the introduction of automation into decisions of enormous consequence rife with potential for discrimination (rightly so!), there are a host of smaller decisions that nevertheless matter to someone. How we approach the task of automating these decisions matters. The guardrails we put around it matter. Whether anyone even understands how the decision was made matters.

If you make it to the end of this post (and again, it’s really, really okay if you don’t), you will be able to explain every decision this particular algorithm will ever make. Why shouldn’t every government algorithm be knowable in this way? No, you couldn’t possibly invest the time to understand each and every algorithm that affects your life (are you absolutely sure you want to invest it here?), but wouldn’t it be reassuring to know that you theoretically could? (And, satisfied in the knowledge that you could have read this post to the end, wouldn’t it be so much better to go to your local library and pick up a nice Agatha Christie?)

What it do?

When a Veteran appeals a decision made by VA, such as a denial of a claim for disability benefits, they enter a sometimes years-long, labyrinthine process. Eventually, their case will reach the Board, a group of more than 130 Veterans Law Judges and their teams of attorneys, who decide the appeal.

At time of writing, there are more than 200,000 appeals waiting for a decision from the Board.[1] These appeals wait on a “docket,” which is lawyer speak meaning they wait in a line. Per the regulations, “Except as otherwise provided, each appeal will be decided in the order in which it is entered on the docket to which it is assigned.”[2]

When a judge is ready to decide another appeal, they need to retrieve the first case on the docket, the next case in line. The algorithm we’re discussing determines what case to give the judge. It’s called the Automatic Case Distribution algorithm.

That’s it. Really. “What’s the next case in line?” is all we’re doing here.

Of course, as you scroll through this 6,400-word blog post, you can probably guess that it’s going to get a lot more complicated than your average deli counter. The manual process was run by a team of four people (with other responsibilities too), armed with lots and lots of Excel spreadsheets.

Then Congress passed the Veterans Appeals Improvement and Modernization Act of 2017. Now the Board would have not one, but four dockets. Any chance that four humans could keep track of it all went out the window. Now automation wasn’t optional, it was essential. But before tackling all of that complexity, I started by automating just the one docket. So let’s start there too.

Easy mode: a single docket

When an appeal arrives at the Board, it is assigned a sequential docket number that starts with the date.[3] If we just sort the appeals by that number (and thus, implicitly, by the date they arrived), we’ll get a list of appeals in docket order.

Judges request appeals in batches, which they’ll divvy up among their team of attorneys, who will review the available evidence and write decisions for the judge to review and sign.

In order to supply a judge with a fresh batch of cases, the team managing the docket would run a report in the legacy Veterans Appeals Control and Locator System (VACOLS) to retrieve a spreadsheet of the oldest cases ready to be decided. Working in docket order, they would move the requested number of cases to the judge in VACOLS and email the judge a list of those cases for tracking. (Before the case files were digitized, the paper case file would also need to be sent to the judge’s office.)

It’s pretty easy to imagine what this process looked like once automated. The judge would log into Caseflow, the system we were building to replace VACOLS. Assuming they had already assigned all of their pending cases to attorneys, they would be presented with a button to request a new batch of appeals. Click that and new cases would appear, ready to be assigned.

One downside of the automated approach relative to the manual process was that I designed it to always assign each judge a set number of cases, three cases for each active attorney on the judge’s team. This parameter, cases per attorney, was configurable by the Board, but not by individual judges. Back when the humans were running things, judges were able to request whatever number of cases they wanted. But in user research with judges, we didn’t hear that anyone really needed to customize the number of cases, a finding that was confirmed after we launched and didn’t get any pushback. Using a fixed number of cases, called the batch size, kept things more predictable.

There are two complications we have to care about. First, some cases are prioritized. A case can be prioritized because it is Advanced on the Docket (AOD), a status granted to cases where the Veteran is facing a financial or other hardship, a serious illness, or is 75 years or older. A case can also be prioritized because the Board’s decision was appealed to the Court of Appeals for Veterans Claims (CAVC) and it was remanded back to the board to correct a problem. If the case is prioritized, it must be worked irrespective of docket order.

Second, if a judge had held a hearing with the Veteran, or if they had already issued a decision on the appeal, that appeal is tied to the judge and must be assigned to them, and no one else. If the judge retires, the Veteran would need to be offered another hearing with a different judge before the Board could decide the case.

For these reasons, the humans had a more difficult task than just selecting the first several rows of a spreadsheet. Prioritized cases were front of the line, but they didn’t want any single judge to become a bottleneck after getting assigned too many of the most time-sensitive cases, so the team would try to ensure that each judge got a balanced number of AOD cases, while also ensuring that no AOD case sat too long.

If there were cases that were tied to a particular judge, it wouldn’t make sense to assign that judge a bunch of cases that could have been worked by anybody (gen pop cases). So the team was allowed to look past the strictly oldest cases on the docket to keep things moving. In order to keep this aboveboard (after all, cases are supposed to be assigned in docket order), each week the team would determine a “docket month” by calculating the age of the Nth oldest case, where N was a number that was agreed upon by the Board and stakeholders. Any case that was docketed in that month or before was considered fair game, giving the team the wiggle room they needed to keep things moving smoothly.

The algorithm basically replicates this approach, with some machine-like flair. The concept of docket month, an easy-to-use tool for humans to keep track of which cases they could assign, is an unnecessary abstraction for a computer. I replaced it with a concept called docket margin. Even though judges request cases on their own schedules, the algorithm starts by asking, “What if every single judge requested a new batch of cases at the same time? How many cases would I distribute?” That count is our docket margin, a rough estimate of the concurrent bandwidth of the Board.
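
In code, that might look something like the sketch below. This is illustrative rather than the actual Caseflow implementation; the helper names (active_judges, active_attorneys) are assumptions.

CASES_PER_ATTORNEY = 3 # configurable by the Board

# Sketch only: each judge's batch size is a fixed number of cases per
# active attorney on their team.
def batch_size_for(judge)
  CASES_PER_ATTORNEY * judge.active_attorneys.count
end

# Sketch only: the docket margin imagines every judge requesting a new
# batch of cases at the same moment.
def docket_margin
  active_judges.sum { |judge| batch_size_for(judge) }
end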

To determine how many prioritized cases to give to each judge, we count the number of prioritized cases and divide it by the docket margin to arrive at a priority proportion. Multiplying the batch size by this proportion and always rounding up, we arrive at the priority target: the number of prioritized appeals we want to distribute to the judge.

Here’s some Ruby code. The code is just another way of saying the same thing as the above paragraph, so if you don’t like to read code, you’re not missing anything.

def priority_target
  proportion = [legacy_priority_count.to_f / docket_margin, 1.0].min
  (proportion * batch_size).ceil
end

We can also use the docket margin to derive the docket margin net of priority, which is the docket margin minus the count of prioritized cases. Like the docket month, this range determines how far back we are allowed to look on the docket and still be considered to be pulling from the front of the line. Unlike the docket month, it is calculated on demand and is more precise.

def docket_margin_net_of_priority
  [docket_margin - legacy_priority_count, 0].max
end

Now we have everything we need to distribute cases from our single docket. The algorithm has four steps.

In the first step, we distribute any prioritized appeals that are tied to the judge requesting cases. As no other judge can decide them, we will distribute as many such appeals as we have ready, up to the batch size.

In the second step, we distribute any non-prioritized appeals that are tied to the judge. We will again distribute an unlimited number of such appeals, but only searching within the docket margin net of priority, i.e. the cases that are considered to be at the front of the line.

Note that, in terms of the ordering of the steps, we are considering first whether the appeal is tied to a specific judge before considering whether the appeal is prioritized. This is because at the micro level, it’s more important for any given judge to be working the cases that only they can work. At the macro level, between the Board’s more than 130 judges, there will always be plenty of judges available to work AOD cases quickly. Note that even in extreme circumstances, such as if every appeal was tied to a judge, the algorithm would be self-healing, because the docket margin net of priority shrinks the more prioritized cases are waiting, thus reducing the number of cases distributed in step two.

In the third step, we check the priority target. It’s possible we already hit or even exceeded the target in step one, in which case we skip this step. But if we still need more prioritized cases, we’ll distribute gen pop prioritized cases until we reach the target. In order to ensure that prioritized cases continuously cycle, they are not sorted by docket date, but rather by how long they’ve been waiting for a decision, or in programmer speak, a first-in, first-out (FIFO) queue.

At any point, if we have reached the limit of the batch size, we stop distributing cases. Our work is done.

But assuming we still need more cases, our fourth and final step is to look to non-prioritized, gen pop appeals. We distribute those in docket order until we’ve reached the batch size and the judge has the cases they need.

Here’s what it looks like in code:

def legacy_distribution
  rem = batch_size

  priority_hearing_appeals =
    docket.distribute_priority_appeals(self,
                                       genpop: "not_genpop",
                                       limit: rem)
  rem -= priority_hearing_appeals.count

  nonpriority_hearing_appeals =
    docket.distribute_nonpriority_appeals(self,
                                          genpop: "not_genpop",
                                          range: docket_margin_net_of_priority,
                                          limit: rem)
  rem -= nonpriority_hearing_appeals.count

  if priority_hearing_appeals.count < priority_target
    priority_rem =
      [priority_target - priority_hearing_appeals.count, rem].min

    priority_nonhearing_appeals =
      docket.distribute_priority_appeals(self,
                                         genpop: "only_genpop",
                                         limit: priority_rem)
    rem -= priority_nonhearing_appeals.count
  end

  nonpriority_appeals =
    docket.distribute_nonpriority_appeals(self,
                                          limit: rem)

  [
    *priority_hearing_appeals,
    *nonpriority_hearing_appeals,
    *priority_nonhearing_appeals,
    *nonpriority_appeals
  ]
end

Congratulations, that’s one full government algorithm. It was turned on in production in November 2018, but it would only stay on for three months…

The Appeals Modernization Act

On August 23, 2017, the Veterans Appeals Improvement and Modernization Act of 2017, a.k.a. the Appeals Modernization Act (AMA), was enacted into law. It was the first significant reform of the VA appeals process in decades, and VA was given 540 days to implement it. There’s no possible way this already overstuffed post can accommodate my opinions about AMA and how it was implemented, so let’s just narrowly focus on what this new law meant for the Automatic Case Distribution algorithm.

My team was working to replace VACOLS—a legacy system, built in PowerBuilder, that had been maintained for decades by a single person—with Caseflow. The passage of AMA was a clarifying moment for our team. It was quickly apparent that there was no path for VACOLS to be retrofitted for the new law, so the 540-day clock provided a deadline for Caseflow to be able to stand on its own two legs.

545 days later, VA was ready and the law went into effect. Nothing changed for Caseflow. Every single piece of new functionality had already gone live in production and been used to manage the real appeals of real Veterans via the Board Early Applicability of Appeals Modernization (BEAAM) program,[4] which invited a small group of 35 Veterans to conduct a trial run of AMA. The program not only helped to ensure a smooth rollout of the technology, but it also gave us valuable insights as we prepared updated regulations and procedures for the new law and designed informational material for Veterans and their representatives.

By the way, I think this is the Platonic ideal of government IT launches. You just throw a nice party for the team because everything has already shipped.

A man stands next to a banner reading 'grand opening' holding a pair of oversized scissors. The scissors are approximately 30 inches (75 centimeters) long. The man wears glasses and a leather jacket, and is of east Asian descent. Another man stands in the background, holding one end of the banner. They are in a windowless room with bare white walls, presumably a nondescript government office building.
Photo: And you buy jumbo scissors, February 19, 2019.

Everything had gone live, that is, except for the new version of the Automatic Case Distribution algorithm. The whole reason we implemented the single-docket algorithm described above, despite only needing it for three more months, was so that we could roll out as much as possible ahead of the applicability date of the law. But on AMA day, we had to flip a feature flag and swap in a completely different algorithm.

In order to test the new algorithm ahead of time, I built a discrete-event simulation (DES) to model what the Board would look like under AMA, and I used that to pressure test the algorithm under various conditions. I had done the same for the single-docket version before rolling it out too, although that was easier thanks to decades of historical data. For example, when I said above that docket margin net of priority was “more precise” than docket month, the evidence for that claim before we took the algorithm live was simulation results showing that it never had to look more than 3,000 cases deep on the docket, which was narrower than the range the humans were using at the time.

I evaluated various algorithms using four measures: (1) docket efficiency, how deep the algorithm had to look on the docket to find cases; (2) distribution diversity, how balanced prioritized cases were between judges; (3) priority timeliness, how long it took to distribute a new prioritized case; and (4) priority pending, the maximum number of prioritized cases waiting at any given time.

The challenge was that for AMA, I was modeling a novel and highly complex procedural framework with only limited data as to what might happen (collecting a preliminary evidence base was another goal of the BEAAM program, which featured extensive interviews with Veterans and their representatives to explore how they would approach making choices under AMA). It was extremely important to test the algorithm under extreme scenarios, not just how VA hoped they would play out.

AMA says, “The Board shall maintain at least two separate dockets,”[5] one for appeals that requested a hearing, and one for appeals that didn’t. The Board chose to create not two, but three dockets. As required by the statute, one docket would be for appeals that requested a hearing. A second docket would be for appeals that didn’t request a hearing, but where the Veteran had added new evidence to their case that had not been reviewed by the agency that had originally made the contested decision (i.e. the Veterans Benefits Administration for disability benefit claims). A final docket, the “Direct Review” docket, would offer Veterans the guarantee of a faster decision in exchange for not requesting a hearing or adding new evidence. When a Veteran appealed a decision to the Board, they would have to choose which docket they wanted to use. And as the dockets were maintained separately, the Board could choose how to allocate resources between them, working some dockets faster than others, while still respecting docket order within any given docket.

Veterans who already had an appeal could keep it under the old “legacy” rules. As a result, there would now be four separate dockets for the algorithm to consider: the hearings docket, the new evidence docket, the Direct Review docket, and the legacy docket.

The Board’s policy goals

The Board articulated three policy goals to inform the design of the algorithm. As is often the case, the goals are sometimes vague and contradictory. That’s what makes this fun, I guess.

The first goal was not vague. Appeals on the Direct Review docket should be decided within one year, 365 days. I understood at the time that this was not realistic in the medium term,[6] but the Board was unwilling to acknowledge that fact. As of December 2024, appeals on the Direct Review docket take an average of 722 days to complete, down from a peak of 1,049 days in July 2024.[7] Absent any staffing reductions, it is possible that the Board will be able to reach a steady state where it consistently achieves its goal by 2026. From the perspective of algorithm design, I sought to give the Board the best shot at achieving its goal, while also ensuring the Board didn’t shoot itself in the foot if the goal turned out to be unachievable.

The second goal was that the dockets should be balanced “proportionately.” The definition of “proportionately” was left to me to interpret, but any definition was in contradiction with the other two goals. In the end, I excepted the Direct Review docket from any claim of proportionality and rearticulated this goal as “the other dockets should be balanced proportionately.”

The third goal was that the Board would prioritize completing legacy cases. The size of the legacy appeals backlog, then about 400,000 cases, was the primary reason I felt the first goal was not realistic. Under the legacy rules, Veterans have options to keep an appeal going almost indefinitely, so the Board will continue to work legacy appeals for literal decades. However, unless the Board clung too long to the 365-day goal for the Direct Review docket, I expected the Board could reach “functional zero” by 2025 or 2026, where only a small proportion of its resources were going toward the long tail of legacy appeals.

Keeping these goals in mind, let’s take a look at the AMA algorithm. Or, you know, maybe you could take a pleasant walk in the fresh air instead?

Hard mode: four dockets

As before, we start by calculating the docket margin and docket margin net of priority, only now looking at prioritized cases on any of the four dockets.

Setting aside the prioritized cases for a moment, we need to determine what proportion of non-prioritized cases should come from each of the dockets. Each docket has a weight, which is generally equal to the number of non-prioritized cases waiting.

The legacy docket’s weight is adjusted to account for cases that we know about, because the Veteran has filed a Notice of Disagreement, but which have yet to come to the Board and be docketed, because they are waiting on a Form 9. About 40% of Notices of Disagreement end up reaching the Form 9 stage, so we add a discounted count of pre-Form 9 cases to the number of docketed cases to give us the legacy docket’s weight.

class LegacyDocket
  # When counting the total number of appeals on the legacy docket for purposes of docket balancing, we
  # include NOD-stage appeals at a discount reflecting the likelihood that they will advance to a Form 9.
  NOD_ADJUSTMENT = 0.4

  def weight
    count(priority: false) + nod_count * NOD_ADJUSTMENT
  end

  # ...
end

We’ve now fulfilled the Board’s second goal, and calculated a set of proportions using the size of each docket. If that were our only goal, we could stop, but in response to the other two goals, we’ll need to make two adjustments.

First, we need to fix the Direct Review docket such that cases are decided within 365 days. When a case is docketed on the Direct Review docket, Caseflow stamps it with a target decision date, 365 days after the day it was docketed. We record the target decision date for each case to enable the Board to later adjust the timeliness goal (should it prove to be infeasible), while continuing to honor the commitment that was made to Veterans when they chose the Direct Review docket. The goal is adjusted for future cases, but we continue working the cases we have within the time we promised.

From the target decision date, we can derive the distribution due date, the date that we want to start looking for a judge in order to get a decision out the door by the 365 day mark. This was initially 45 days before the target decision date, but we planned to adjust this number as we got real-world data.
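
The date math itself is simple. Here’s a sketch; the attribute names and the ActiveSupport-style date helpers are assumptions for illustration, not necessarily what Caseflow uses.

# Sketch only: stamp the Direct Review timeliness dates at docketing.
DIRECT_REVIEW_TIME_GOAL_DAYS = 365 # adjustable by the Board for future cases
DISTRIBUTION_LEAD_TIME_DAYS = 45   # adjustable as real-world data comes in

def stamp_direct_review_dates(appeal)
  appeal.target_decision_date = appeal.docketed_at + DIRECT_REVIEW_TIME_GOAL_DAYS.days
  appeal.distribution_due_date = appeal.target_decision_date - DISTRIBUTION_LEAD_TIME_DAYS.days
end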

We can count the number of cases where distribution has come due and divide by the docket margin net of priority to calculate the approximate proportion of non-prioritized cases that need to go to Direct Reviews in order to achieve the timeliness goal. Initially, no cases are due, and so this proportion would be zero. But the Board didn’t want to wait to start working Direct Review appeals, preferring to start working them early and notch the win of beating its timeliness goal (even if this wasn’t sustainable). As a result, I constructed a curve out for the Direct Review docket.

We start by estimating the number of Direct Reviews that we expect to be requested in a year. If we’re still within the first year of AMA, we extrapolate from the data we have. We can divide this number by the number of non-priority decisions the Board writes in a year to calculate the pacesetting Direct Review proportion, the percentage of non-priority decision capacity that would need to go to Direct Reviews in order to keep pace with what is arriving.

def pacesetting_direct_review_proportion
  return @pacesetting_direct_review_proportion if @pacesetting_direct_review_proportion

  receipts_per_year = dockets[:direct_review].nonpriority_receipts_per_year

  @pacesetting_direct_review_proportion = receipts_per_year.to_f / nonpriority_decisions_per_year
end

Our goal is to curve out to the pacesetting proportion over time. So we calculate the interpolated minimum Direct Review proportion, using the age of the oldest waiting Direct Review appeal relative to the current timeliness goal as an interpolation factor. We apply an adjustment to this proportion, initially 67%, to lower the number of cases that are being worked ahead of schedule.

def interpolated_minimum_direct_review_proportion
  return @interpolated_minimum_direct_review_proportion if @interpolated_minimum_direct_review_proportion

  interpolator = 1 - (dockets[:direct_review].time_until_due_of_oldest_appeal.to_f /
                      dockets[:direct_review].time_until_due_of_new_appeal)

  @interpolated_minimum_direct_review_proportion =
    (pacesetting_direct_review_proportion * interpolator * INTERPOLATED_DIRECT_REVIEW_PROPORTION_ADJUSTMENT)
      .clamp(0, MAXIMUM_DIRECT_REVIEW_PROPORTION)
end

This gives us a nice curve out, as shown below. The jolt upward occurs when we catch up with the distribution due date of the docketed appeals.

A chart showing the simulated percentage of each distribution to a judge that went to the Direct Review over time. Each distribution is plotted as a blue circle. At first, very few cases go to Direct Reviews, but this percentage curves upward until a little less than half of all cases go to Direct Reviews. Then the distributions jolt upward, showing that about 70% of all cases are going to Direct Reviews, with individual distributions ranging between 55% and 80% Direct Reviews.
Figure: A simulation where Direct Reviews take up about 70% of the Board's capacity.

We also apply a limit to the number of cases that can go to the Direct Review docket, initially set at 80%.

direct_review_proportion = (direct_review_due_count.to_f / docket_margin_net_of_priority)
  .clamp(interpolated_minimum_direct_review_proportion, MAXIMUM_DIRECT_REVIEW_PROPORTION)

The second adjustment that we apply to the raw proportions is that we set a floor of 10% of non-priority cases to come from the legacy docket, provided there are at least that many available. This adjustment ensures that the Board continues working the legacy appeals backlog, even as it dwindles to only a handful of cases.
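
A sketch of that floor, assuming the proportions live in a hash like the one shown later and that a hypothetical enough_legacy_cases? check guards availability; this is illustrative, not the actual Caseflow code.

MINIMUM_LEGACY_PROPORTION = 0.1

# Sketch only: guarantee the legacy docket a minimum share of non-priority
# cases, scaling the other dockets down so everything still sums to 100%.
if docket_proportions[:legacy] < MINIMUM_LEGACY_PROPORTION && enough_legacy_cases?
  others_total = 1.0 - docket_proportions[:legacy]
  scale = (1.0 - MINIMUM_LEGACY_PROPORTION) / others_total
  docket_proportions.each_key do |docket|
    docket_proportions[docket] *= scale unless docket == :legacy
  end
  docket_proportions[:legacy] = MINIMUM_LEGACY_PROPORTION
end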

A brief aside

An implicit assumption here is that the Board needed to be willing to admit when it was no longer able to meet the 365-day goal. The 80% limit gave me confidence that at least nothing would break. But if the percentage of Direct Reviews was continuously pegged to 80%, there would be scarce capacity to work anything other than Direct Reviews, and in particular, to work down the legacy appeals backlog.

To this day, VA.gov reads, “If you choose Direct Review, […] the Board’s goal is to send you a decision within 365 days (1 year).” VA continued to make that promise to Veterans even as it was issuing decisions on Direct Reviews that were more than 1,000 days old.

Again, because it’s hidden from the public, I don’t know how the Board has updated the algorithm’s parameters, or even if they’re still using it. But it’s quite apparent from publicly available data that whatever algorithm is in use, the Board has not been able to keep pace with the 365-day goal it continues to claim. Fortunately, it looks like things started to turn around in 2024 as the legacy backlog began to dry up, and it’s possible the Board will be able to meet its stated goal within the next couple years. If its staffing levels aren’t cut.

Hearings make this even harder

Okay, one last complication. Under AMA, cases on the hearings docket get distributed to judges as soon as they are ready, generally after the hearing has occurred, been transcribed, and the evidentiary period (the time after the hearing in which the Veteran can add new evidence to their case) has expired or been waived. The Board’s term for this is “one-touch hearings,” which is in contrast to legacy hearings, which could take place months or even years before the case was decided. As a result, the number of cases that get worked on the docket is not decided when cases are distributed, but rather when we determine how many AMA hearings to hold.

Fortunately, Caseflow is also responsible for scheduling hearings. Every quarter, Caseflow Hearing Schedule asks Caseflow Queue (where the algorithm lives) to tell it the number of AMA hearings it should schedule. Caseflow calculates the docket proportions, as above, and multiplies the hearings docket proportion by the number of non-prioritized decisions it plans to issue that quarter.

def target_number_of_ama_hearings(time_period)
  decisions_in_days = time_period.to_f / 1.year.to_f * nonpriority_decisions_per_year
  (decisions_in_days * docket_proportions[:hearing]).round
end

One complication of this number is that a Veteran could withdraw their hearing request. Due to a legal technicality, however, their appeal would remain on the hearings docket. This means that Caseflow Hearing Schedule needs to check whether the hearing request has been withdrawn. If so, it marks the case as ready for a decision; if not, it schedules a hearing.

Under AMA, cases with hearings are not required to be decided by the same judge who held the hearing, as they were under the legacy rules. However, it remains better for everyone involved if the same judge decides the case when possible, so the algorithm continues to work mostly the same way, a behavior I termed affinity. The only difference between affinity and the old rules is that if a judge retires or takes a leave of absence, Caseflow treats the cases they heard as gen pop, available to be assigned to anyone, instead of requiring another hearing.
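
The affinity check itself might boil down to something like this sketch; the method names are assumptions for illustration, not Caseflow’s actual code.

# Sketch only: a hearings-docket case is "genpop" (available to any judge)
# when the judge who held the hearing is no longer available to decide it.
def genpop?(appeal)
  hearing_judge = appeal.most_recent_hearing&.judge
  hearing_judge.nil? || !hearing_judge.available_for_distribution?
end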

At long last, let’s automatically distribute some cases

Okay, so now we know the proportion of the Board’s decisions that should be allocated to each docket. We’ve helped Caseflow Hearing Schedule hold the right number of hearings some months ago. Now a judge comes along and asks for a new batch of cases. Let’s help them out, shall we?

As before, we calculate the priority target and docket margin net of priority. The only difference is that we now need to look at the number of prioritized appeals on any of the four dockets. We’ll also calculate how deep we can look on the legacy docket specifically, the legacy docket range, by multiplying the docket margin net of priority by the legacy docket proportion.

def legacy_docket_range
  (docket_margin_net_of_priority * docket_proportions[:legacy]).round
end

We can start distributing appeals by looking at cases that are either tied to the judge (legacy docket) or have affinity for the judge (hearings docket). And of course, prioritized cases go first.

# Distribute priority appeals that are tied to judges (not genpop).
distribute_appeals(:legacy, limit: @rem, priority: true, genpop: "not_genpop")
distribute_appeals(:hearing, limit: @rem, priority: true, genpop: "not_genpop")

# Distribute nonpriority appeals that are tied to judges.
# Legacy docket appeals that are tied to judges are only distributed when they are within the docket range.
distribute_appeals(:legacy, limit: @rem, priority: false, genpop: "not_genpop", range: legacy_docket_range)
distribute_appeals(:hearing, limit: @rem, priority: false, genpop: "not_genpop")

Next, we need to see if we’ve reached the priority target, and if not, distribute more prioritized cases. We’ll need to find prioritized cases irrespective of docket, looking at how long the case has been sitting rather than its docket order to ensure that prioritized cases keep moving. This is a two-step process: first we ask the databases which dockets have the oldest appeals, and then we distribute that many appeals from each. As a result, there is a risk of a race condition where two judges could request cases at approximately the same time and the second judge would get a case that wasn’t strictly the oldest. Any case that is skipped over in this way would likely be picked up by the very next judge, so it wasn’t necessary to try to implement some form of locking (not trivial because legacy appeals are stored in VACOLS), but it was important to flag that edge case in our documentation so we weren’t hiding even a trivial deviation from our stated procedures.

# If we haven't yet met the priority target, distribute additional priority appeals.
priority_rem = (priority_target - @appeals.count(&:priority)).clamp(0, @rem)
oldest_priority_appeals_by_docket(priority_rem).each do |docket, n|
  distribute_appeals(docket, limit: n, priority: true)
end

Now we need to distribute the remaining non-prioritized cases. We’ll deduct the non-priority appeals that we’ve already distributed from the proportions, and then it’s time to figure out how many cases we distribute from each docket.

If we were to multiply the docket proportions by the number of cases we have left in the batch size, it’s quite unlikely we would get a whole number result, and it’s rather difficult to distribute a fraction of a case. To avoid this, we’ll use a form of stochastic rounding to ensure we distribute the requested number of cases and don’t leave any docket behind.

In our stochastic rounding method, we multiply the docket proportions by the number of cases remaining. We allocate each docket the whole number of cases it is owed, setting aside the fractional remainders. Using these remainders as weights, we probabilistically allocate the remaining cases to dockets. While each individual batch can vary, over many distributions the actual proportions will converge toward the target proportions, and even dockets that are wildly outnumbered will still get the right amount of attention.

# Extends a Hash where the values represent parts of a whole.
#
# {
#   a: 0.5,
#   b: 0.25,
#   c: 0.25
# }.extend(ProportionHash)

module ProportionHash
  def stochastic_allocation(num)
    result = transform_values { |proportion| (num * proportion).floor }
    rem = num - result.values.reduce(0, :+)

    return result if rem == 0

    cumulative_probabilities = inject({}) do |hash, (key, proportion)|
      probability = (num * proportion).modulo(1) / rem
      hash[key] = (hash.values.last || 0) + probability
      hash
    end

    rem.times do
      random = rand
      pick = cumulative_probabilities.find { |_, cumprob| cumprob > random }
      key = pick ? pick[0] : cumulative_probabilities.keys.last
      result[key] += 1
    end

    result
  end

  # ...
end

We run this step in a loop, in case any docket runs out of available cases (quite likely with the hearings docket, where most cases have affinity for a judge). At the end of each pass, we zero out the docket proportion of any docket that has been unable to supply us with all of the cases we requested, we normalize the proportions to sum to 100%, and then we re-run the stochastic allocation method. We repeat until we’ve either provided the judge with all the cases they need, or all of our dockets run out of cases.

until @rem == 0 || @remaining_docket_proportions.all_zero?
  @remaining_docket_proportions
    .normalize!
    .stochastic_allocation(@rem)
    .each do |docket, n|
      appeals = distribute_appeals(docket, limit: n, priority: false)
      @remaining_docket_proportions[docket] = 0 if appeals.count < n
    end
end
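
The normalize! and all_zero? helpers are elided above. Minimal versions consistent with how the loop uses them might look like this; a guess, not necessarily the actual implementations.

module ProportionHash
  # Sketch only: rescale values so they sum to 1.0 after some dockets
  # have been zeroed out.
  def normalize!
    total = values.reduce(0.0, :+)
    transform_values! { |proportion| proportion / total } unless total.zero?
    self
  end

  # Sketch only: true once every docket's proportion has been zeroed out.
  def all_zero?
    values.all?(&:zero?)
  end
end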

Monitoring and tweaking

Each time a judge requests a distribution, Caseflow records the state of the world as the algorithm saw it, enabling us to reconstruct why it made the decisions it made and to study the behavior of the algorithm in the real world. All of this data was reported on an easy-to-use dashboard so that Board staff could understand whether the algorithm was still correctly configured to meet their goals and to provide an “early warning” system for when things were clearly no longer sustainable.

I took my best stab at initial values for the different parameters of the algorithm. But these things were parameterized explicitly in anticipation that they would need to change. All told, the Board could control the batch size per attorney, the Direct Review docket time goal, the Direct Review distribution due date, the maximum Direct Review proportion, the interpolated Direct Review proportion adjustment, and the minimum legacy docket proportion. The Board also had the option of overriding the recommended number of AMA hearings to schedule.

Tweaking these parameters could adapt the algorithm to any situation that I could anticipate, and hopefully many that I couldn’t. It also ensured that the decisions made by the algorithm remained properly within the control of VA leadership and were not coopted by an enigmatic machine.

Fin

Figuring all that out was maybe 10–20% of my job for four months of my life. All of the code is pretty straightforward and boring. It’s learning the whys behind the existing process, working out what needed to change, finding ways to test it to make sure nothing went horribly wrong, getting feedback and buy-in from stakeholders, and documenting and explaining it to different audiences; that’s where the real challenge was. Anyway, that’s my experience of what it looks like to actually automate bureaucracy.

I did say there isn’t a reward for making it to the end. I hope you’re happy.


  1. VA Board of Veterans’ Appeals Annual Report, Fiscal Year 2024 ↩︎

  2. 38 CFR 20.800(b) ↩︎

  3. Unfortunately, the Board used a two-digit year for this date. The Board was created in 1930, so it has its own flavor of Y2K bug scheduled to hit in 2030. Generally speaking, there shouldn’t be new appeals using this docket numbering scheme after 2019, but it’s possible for a closed appeal to be reopened through a finding of clear and unmistakable error, which would result in a new docket number being assigned. The Board still has four years to decide how to fix that one, and I built error messaging to ensure no one forgets. ↩︎

  4. I designed and managed the program, but I’m not responsible for the name. Technically, we used the authority under section 4 of AMA (“Programs to test assumptions relied on in development of comprehensive plan for processing of legacy appeals and supporting new appeals system”), not the provisions of section 2(x)(3) (“Early applicability”), as the name inaccurately suggests. Yes, it still grates on me. ↩︎

  5. 38 USC §7107(a)(1) ↩︎

  6. In the first year of the new law, any cases you decide are definitionally less than 365 days old. In the long term, once the legacy appeals backlog has been cleared, you have more capacity for the AMA dockets, and you can keep pace so long as the Board is adequately staffed. But in the medium term, even my most optimistic scenarios showed the Board losing the battle to keep pace with its goal unless it worked the Direct Review docket to the exclusion of the others. ↩︎

  7. Board of Veterans’ Appeals, “More Board personnel address pending AMA appeals and wait times.” Retrieved June 9, 2025. ↩︎

Direct File is Open Source!

Alex Mark:

The tax code contains some pretty crazy edge cases that required a good amount of effort to build and test. Two of my favorite tax rules are the different ways to calculate whether the taxpayer was 65 or older during the tax year — people over 65 receive a greater standard deduction, but they also phase out of eligibility for EITC if they don’t have qualifying children. Confusingly, these rules use different definitions of what it means to be 65 during the tax year[…]

I wish that these edge cases were rare — but they’re all over the tax code (children who are not children, married couples filing jointly who don’t meet the Joint Return Test, people who can be claimed as a dependent who aren’t, etc.). I feel confident that we represented these cases and gave the taxpayer every possible shot to file an accurate, beneficial return.

When Alex joined USDS in 2023, I knew only one thing about him, which was that he volunteered in the VITA program. I immediately wanted to get him on Direct File. Here he gives a guided tour of some of Direct File’s (and the tax code’s) nooks and crannies, and I insisted he put in a photo of him dumping a bucket of ice water on his head.

The 2025 Direct File Report

Nina Olson and Gabriel Zucker:

…the team continued to evaluate Direct File, as it had all along, preparing a comprehensive year-two report on its work. We filed a Freedom of Information Act request for that evaluation, and this week received Direct File’s 2025 season-end filing report. It tells an inspiring story of a product genuinely transforming tax administration in America, a story made bittersweet only by the fact that such progress has been cut short.

In addition to their summary, the Center for Taxpayer Rights has posted the full report for all to read.

The Who Cares Era

Dan Sinker:

In the Who Cares Era, the most radical thing you can do is care.

In a moment where machines churn out mediocrity, make something yourself. Make it imperfect. Make it rough. Just make it.

At a time where the government’s uncaring boot is pressing down on all of our necks, the best way to fight back is to care. Care loudly. Tell others. Get going.

At last week’s Code for America Summit, Suzanne Chapman and Katie Aloisi spoke about how Direct File was designed to build trust. Katie spent a significant portion of her time getting into the weeds of the tax situation of a fictional multi-generational household.

The audience wasn’t bored. They were riveted. Suzanne and Katie’s care was self-evident and contagious.

Caring is more than saying the right things. Care is shown.

Saying Goodbye

When the Treasury Department ordered the Direct File team to stop work and disband, the toughest part was saying goodbye.

Direct File deserves an Ocean’s Eleven-style montage for assembling the team. The old partner who had to be convinced to come back for one more job (the best, and perpetually most reluctant, user experience lead in the business). The loose cannon I was wary of bringing onto the team due to a checkered past (a brilliant engineer with a crypto startup on his resume). The whiz kid with the precise set of skills needed to pull off the plan (an interaction designer by training, she and I had studied up on tax law together, and she took to it like a fish to water).

We came together, a band of misfits and weirdos, under the banner of a fantastical, improbable mission. It felt like we were unstoppable. And then we were stopped.

And as much as we had accomplished, we knew we were just getting started. So we did our best to document what we had learned. We prepared the project for hibernation, writing notes for a future team that might never come. We said goodbye.

It was August 2022.


It wasn’t Direct File’s first death. By my count it was the second, although there’s an argument for it being the third. Neither count includes all of the doomed attempts to tilt at this particular windmill over the two and a half decades before we took up the lance.

During the initial policy process[1] on Direct File in 2021, I had argued that we would need to make a go/no go decision by June 2022 in order to have the best shot at success in filing season 2024, the final filing season of the Biden administration. And so when the policy process was reconstituted in early 2022, we were told our goal would be to prepare options and recommendations for a June decision. But by May, it was clear that we were going to blow that deadline.

So Merici Vinton, my U.S. Digital Service (USDS) project co-lead, and I hatched a plan. We asked to use our small USDS team to build a “proof of concept.” This prototype[2] of Direct File would illustrate some of the key issues being discussed in the policy process and develop an evidence base to inform future strategy. It would build momentum ahead of a decision, enabling us to hit the ground running if and when that decision came. I said the words “preserving optionality” a lot.

It was also, and everyone knew it, an advocacy tool. As Merici put it, she wanted something so self-evidently good that it would “make it hurt” when decision makers declined to give Direct File the green light. And so a team of three designers and four engineers (if you count me, despite splitting time between coding, memos, and meetings) built a proof-of-concept version of Direct File in eight weeks.

Seizing an opportunity to grow the expertise and capacity of the team was an under-appreciated benefit of this work. Even from this early stage, the product was designed alongside users, and these early insights from taxpayers helped shape our intuitions. The proof of concept served as a laboratory to try out ideas and as a springboard for future iterations. We made mistakes, and learned from them (whatever you do, don’t try to express the logic of the Internal Revenue Code as YAML). The team developed a shared vocabulary, critical for collaboration between disciplines, which the reluctant user experience lead termed our taxonomy. And then there were the inside jokes, and the fridge magnets, and the Pictionary games.

At the end of the eight weeks, we not only had a version of a Direct File, but we had a close-knit group of people who knew how to build a Direct File. Because they had done it.


We began to strategize about what to do next with our small but mighty team. But then the administration’s climate and tax bill, thought dead, was suddenly accelerating toward becoming law. The Inflation Reduction Act (IRA), as it was now titled, would give the IRS an additional $80 billion over ten years. What should have been very good news for Direct File unexpectedly spelled its demise.

The decision was made within Treasury, so I can’t speak to the deliberations that occurred. The stated rationale was that with the imminent passage of the IRA, the IRS was ready to take on Direct File without outside help, and that the effort would be best managed as part of the emerging IRA strategic process. Instead of handing off the work in progress, however, the project would inexplicably start from scratch, walking away from the progress that had been made. Treasury confirmed that they understood this decision would likely foreclose delivery of Direct File by 2024, which, writing from the future where we know the results of the 2024 election, meant that Direct File would never have shipped.

It also didn’t help that the IRA required the IRS to produce a report on Direct File. At this point, no one had a plan for that report (apart from MITRE, which assumed a lucrative contract would soon come their way). The optics of one part of government studying the question of Direct File while another part moved full steam ahead weren’t great. The eventual plan—delaying a decision on Direct File until May 2023 but producing another, more sophisticated prototype as part of the report work and thus preserving optionality (my favorite phrase)—was still a few months off. It turns out that tasking an agency to spend $15 million and nine months writing a report probably isn’t a great way to achieve a result, even if it worked in our case.[3]

Starting over from square one, the IRS was not set up for success. As just one example, one of my top anxieties during this time was how Direct File would handle state taxes. I had written a four-page memo outlining the problem and five potential solutions, the whiz-kid designer had storyboarded out an early version of the experience that would eventually ship, and we had assembled a spreadsheet categorizing the filing options available in every state in the nation. Treasury’s July 2022 memo to the IRS, informing the IRS of its interest in pursuing Direct File, boiled all of this down to the single sentence, “It will be critical to determine the extent to which a federal filing tool can offer taxpayers an integrated way to file their state returns.” Helpful.

More than any memo, though, and even more than any prototype, what the IRS was deprived of was people. Because my top anxiety was: who is going to build Direct File? The IRS didn’t have the capacity to build a product like Direct File, which is not to say that the agency was incapable of it, just that it would take time and leadership to recruit and develop a team with the right combination of skills.

There aren’t very many user experience designers who understand tax law and can effectively collaborate with the IRS Office of Chief Counsel. For that matter, it’s also pretty difficult to find tax lawyers who know how to collaborate with designers. But we absolutely could find talented designers and talented lawyers and give them the space to learn from each other.

These problems can be hand-waved away if the plan is to outsource the work (this was the IRS’s preferred approach). Of course, that doesn’t solve the problem; it just makes it someone else’s problem. It doesn’t take a crystal ball to predict the steaming pile of crap taxpayers would have received after shelling out nine figures (cf. FAFSA).

The work of growing a team ready for any challenge takes time and deliberate care. There are no shortcuts. What hurt us the most was not pausing the building of Direct File but abandoning the capacity to build we were cultivating.


Our team was disbanded. A majority went to the Department of Education and helped launch the ill-fated student loan forgiveness application.[4] The loose-cannon engineer was assigned to a project that made him miserable, but where he would meet a couple of similarly miserable designers with whom he would share the dream of Direct File. Only Merici and I remained, advising Treasury on IRA implementation, hoping we might someday light the beacons and call the dream team back.

Three months later, with interest in Direct File on the rise, Treasury would reverse course and bring our team back to help pull off Direct File from within the IRS. I would be in charge of the team designing and building Direct File, and Merici would serve as deputy to the IRS executive who would run the Direct File program. That’s a story for another time, however. We lit the beacons, and the team returned. And then the team grew.

There were many moments in the course of the following year that I wished we had those three months back. We were comically past the ideal timeline I had laid out back in 2021, and we also now had to write a report to Congress. But regardless of whether this hiatus was death number two or three, it would be the last time Direct File died. Until this year.


I’ve been continuously working in government, one way or another, since 2015. A few weeks ago, I resigned, as did the vast majority of the Direct File team. This time, it was even harder to say goodbye.

I do mourn the fact that Direct File won’t be available to taxpayers next year. But as I told the team as the end closed in, “We took a pipe dream and made it a policy choice.” No one can claim with a straight face that Direct File is impossible anymore; bringing it back requires only that our elected leaders make a different choice.

What I mourn the most, though, is the dissolution of the team, the disregard for the vast impact they were poised and eager to deliver. The team itself is what I am proudest of from my time working on Direct File. Their manic dedication to the mission. The care they consistently took to get it right. The trust and love they had for each other.

One of the team’s many sayings was “the next version of Direct File is the best version of Direct File.” But just as Direct File constantly got better, so too did the team. We were growing our capacity to deliver for taxpayers each and every day, and we were helping to build new capacity across the IRS. Now that progress is interrupted, a reflection of progress interrupted across the federal government, of scientific research interrupted, of international aid interrupted, of all manner of interruptions.

Direct File is a policy choice, but the Direct File team was an orchard, just starting to produce a meaningful yield. Now it’s been razed.

We’ve documented what we learned. We’ve prepared the project for hibernation and written notes for a future team that might never come. We’ve open sourced it for the world to see.

We’ve said goodbye.


Except, well, a member of the team reviewed an early draft of this post, and she would like to refute this point, i.e. my title. According to her, the team has never said goodbye, only “ta ta for now.” We remain as delusional as ever. How else do you think Direct File happened?

It’s a peculiar sort of person who lives through the many deaths of Direct File and says, “Sign me up for one more, please.” But that’s the Direct File team for you. I dare you to try and stop them.

Maybe we’ll be able to relight the beacons someday. And maybe this will be the last time Direct File dies.

About 70 people who worked on Direct File pose on the Navy Steps of the Eisenhower Executive Office Building. They are dressed professionally but with an abundance of bright colors, including pinks, purples, and blues. The building's facade is granite, and it is designed in an ornate Baroque Revival style.
Photo: The IRS Direct File team, May 2, 2024.

  1. “Policy process” is a fancy way of saying a bunch of government officials talking about what to do. The process was facilitated via an interagency policy committee (IPC), a working group of people from different parts of government, each empowered to represent their “building’s” perspective. We teased out problems that needed to be solved and other considerations for a decision with the goal of presenting options and recommendations to “principals,” the highest-ranking political appointees. ↩︎

  2. Not to be confused with a separate prototype that was built in 2023 under the aegis of the Inflation Reduction Act-mandated Report to Congress. The later prototype was significantly more robust, and it built out the codebase that would eventually ship to taxpayers. ↩︎

  3. I like to say that the Direct File pilot cost just two reports to Congress, and we threw in a tax filing tool for free. Oddly enough, $15 million being a reasonable price tag for a report is something on which both parties agree. ↩︎

  4. Tilting at windmills is great exercise. ↩︎

Performance.gov: IRS Direct File

Speaking of Merici, I’ve been meaning to reupload this video from a July 2024 event, where she and I spoke about lessons learned from the Direct File pilot. The original was taken down in January 2025 due to the transition.

ESP Launches Future of Tax Filing Fellowship

Economic Security Project (ESP) announces a new fellowship focused on the future of tax filing, civic tech, and benefits equity. Launching in the wake of the Trump administration’s decision to sunset IRS Direct File—one of the most popular civic tech pilots in recent memory—this new initiative brings together technologists, policy experts, and equity-driven strategists to preserve hard-won progress and accelerate new ideas that make government work better for everyone.

In personal news, I’ve joined with our friends at the Economic Security Project to document and build upon the legacy of Direct File.

The Economic Security Project was one of the earliest and most effective advocates for Direct File, and throughout the program’s existence, they were a leader in getting the word out to the taxpayers who needed it the most. I’m grateful for this opportunity to continue collaborating with Gabriel, Jen, and Merici, reflecting on what we learned and what work remains unfinished.

Peeking Behind the Code

Andrew Leahey, for Forbes:

It isn’t every day that the IRS drops something that reads like a blend of legal code, logic theory, and open government idealism. But that is exactly what it did, laying Direct File bare—and releasing it into the public domain—for the world to scrutinize, fork, or admire in stunned silence.

At the heart of this code disclosure is something called the Fact Graph, a dry name for a tool that is quietly revolutionary: a logic engine that parses ambiguous tax scenarios and infers relationships while consistently applying tax rules.

The first line of code for Direct File, written February 8, 2022, was for an experiment to explore the concepts that would become the Fact Graph. Even from that first experiment, the work was motivated by the need to establish trust with taxpayers. Direct File can ask hundreds of questions about your tax situation, but most users will see only a small fraction of that. The Fact Graph provides us with a means of proving that none of the unasked questions would have changed the bottom line of your tax return and that you’re getting every tax benefit to which you’re entitled.
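
If you’re curious what that kind of proof looks like in miniature, here’s a rough sketch in Python. It is my own illustration, not Direct File’s code, and every name and threshold in it is invented for the example. The key property is that a derived fact resolves as soon as the facts already gathered settle the question, which is exactly what lets you show that the questions you were never asked couldn’t have changed the outcome.

```python
# A toy illustration of the idea, not the IRS implementation. Every fact is
# either complete (answered) or incomplete (never asked). A derived fact
# resolves as soon as its known inputs settle the question; if it resolves
# while other inputs are still unanswered, the graph has effectively shown
# that those unasked questions could not change the answer.

INCOMPLETE = object()  # sentinel for "this question hasn't been asked/answered"

class FactGraphSketch:
    def __init__(self):
        self.facts = {}

    def set(self, name, value):
        self.facts[name] = value

    def get(self, name):
        return self.facts.get(name, INCOMPLETE)

    def child_tax_credit_eligible(self):
        """Derived fact that short-circuits on the inputs it already has."""
        kids = self.get("num_qualifying_children")
        if kids is INCOMPLETE:
            return INCOMPLETE
        if kids == 0:
            # Settled even though income, filing status, etc. were never
            # asked: no unasked question can change this answer.
            return False
        agi = self.get("adjusted_gross_income")
        if agi is INCOMPLETE:
            return INCOMPLETE
        return agi < 200_000  # illustrative threshold, not the actual rule

graph = FactGraphSketch()
graph.set("num_qualifying_children", 0)
assert graph.child_tax_credit_eligible() is False  # provably settled early
```

The real Fact Graph is far more general than this, but that notion of completeness, of knowing precisely when a question is settled, is the intuition I’d point to.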

The IRS has been working for years to develop a simpler model of taxes, its “Tax Calculator” effort, which would model the relationships between the lines of the various tax forms. But Direct File and its Fact Graph go further by also tackling the last-mile problem of taxes: the often fuzzy relationship between what gets entered on tax forms and the assorted facts about your life that inform what you’re eligible to claim, a relationship described piecemeal across statutes, regulations, court decisions, instructions, publications, and other guidance. This wasn’t trivial; our team was often teeing up novel questions for the IRS Office of Chief Counsel to consider as we sought to bring that ambiguity into sharp focus.
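
To make that distinction concrete, here’s another small, purely illustrative sketch (again mine, with made-up function names and a deliberately oversimplified rule): the first function is the kind of line-to-line arithmetic a Tax Calculator-style model captures, while the second is the last-mile work of turning facts about a person’s life into the yes-or-no answer a form line actually needs.

```python
# My own illustration, not IRS code, of the two layers described above:
# line-to-line arithmetic between forms versus the "last mile" of deriving
# what goes on a line from plain facts about a person's life.

def form_1040_line_arithmetic(wages: float, interest: float) -> float:
    # Line-to-line modeling: one line is just a sum of other lines.
    return wages + interest

def qualifies_as_dependent(age: int, is_full_time_student: bool,
                           lived_with_taxpayer_months: int,
                           provided_over_half_own_support: bool) -> bool:
    # The "last mile": translating messy life facts into the yes/no a form
    # needs. The real rules span statute, regulations, and guidance; this
    # condenses a qualifying-child-style test into a deliberately
    # simplified example.
    age_ok = age < 19 or (age < 24 and is_full_time_student)
    residency_ok = lived_with_taxpayer_months > 6
    support_ok = not provided_over_half_own_support
    return age_ok and residency_ok and support_ok

assert form_1040_line_arithmetic(50_000.0, 150.0) == 50_150.0
assert qualifies_as_dependent(20, True, 12, False) is True
```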

There’s ample prior art of knowledge graphs for taxes, but for me personally, the lineage of my thinking traces to a conversation with Ben Klemens at Open Data Day 2016, discussing his python script for doing his own taxes.

This all matters in an age when public trust in institutions is frayed and when algorithms, or artificial intelligence (AI), increasingly mediates our rights and obligations. If the tax code is going to be enforced by machines, and the preparation calculated and handled by the same, the rules those machines follow should be legible and contestable. The open-sourcing of Direct File hints at a future where regulatory logic and automated decision-making isn’t hidden, but published like case law—reviewable, revisable, and maybe even improved by the very public it serves.

Bingo.

Direct File on GitHub

The IRS has now published the vast majority of Direct File’s code on GitHub as open-source software. As a work of the U.S. government, Direct File is in the public domain. And now everyone can check it out.

Releasing Direct File’s source code demonstrates that the IRS is fulfilling its obligations under the SHARE IT Act (three weeks ahead of schedule!). Now that Direct File has paved the way, I hope that more of the IRS’s code, paid for with taxpayer dollars, will soon be available to all of us.

Open sourcing Direct File has long been planned, and even longer desired. Explaining last May why open source is particularly important for Direct File, the team wrote:

The IRS could take further steps to build public trust and enable independent assessment of its work. The Direct File product team was given the mandate to develop software that ensures every taxpayer receives the full benefit of any tax provisions for which they are eligible. Releasing components of Direct File as open-source software would enable the team to demonstrate this commitment.

Establishing trust with taxpayers was core to our approach for designing and building Direct File. By creating the most accurate option for filing, by making taxes accessible to all, by keeping taxpayer data secure, and now, by publicly sharing Direct File’s code, the Direct File team showed our dedication to earning taxpayers’ trust.

Please note: As of two weeks ago, I no longer work at the IRS. I am writing solely in my personal capacity.

Hello world

I made a thing. Apologies for the walls being pretty bare, I only just moved in.

I have an RSS feed if you’d like to know when I post something. No promises.