The Plumbing
Or: a technical appendix that accidentally became an argument about what 'public' means.
I. Seven Cabinets
Go to fec.gov. Pull up any member of Congress. You can see every contribution they received above $200, each with an amount and a date: $3,300, 03/15/2024. Now go to congress.gov. Pull up the same member's voting record. You can see every roll-call vote they've cast: bill number, date, yea or nay. It's all there. Public. Free. Now try to answer a simple question: did the member who took $3,300 from a Pfizer VP vote on any bills that affected the pharmaceutical industry?
You will find that you cannot answer this question. Not because the information is classified, or behind a paywall, or legally restricted. You cannot answer it because the contribution lives in one database and the vote lives in another and the two databases do not know about each other. They don't share an ID system. They don't share a vocabulary. They don't link together.
You can, if you're patient, parse this out manually: cross-referencing names, dates, bill subjects, committee jurisdictions, lobbying filings from a third database that uses yet another ID system. Assume maybe ten minutes per contribution, if you know what you're doing. The member you're looking at received 1,200 contributions last cycle. That's 200 hours. And that's one member. There are 534 more.
Or you can build a machine.
Because the information is scattered across seven federal databases maintained by seven different organizations using seven incompatible identification systems, and the act of connecting a dollar to a vote, which is the thing the disclosure laws were presumably designed to let citizens do, turns out to require roughly 91 gigabytes of data processing, an entity resolution pipeline that maps four different ID systems onto a single human being, a donor deduplication engine running string similarity algorithms at thresholds chosen (I'll be honest here) partly by feel, and about 36 million joins.¹
The companion post describes what the machine found: 6.64%, which is the share of the model's vote predictions attributable to financial features. This post describes the machine itself. I thought, when I started writing it, that I was writing a technical appendix. I am no longer sure that's what this is. Because what the technical details keep revealing, once you line them up and look at them truthfully, is something about the architecture of American campaign finance disclosure that I think is more interesting than the 6.64% itself:
The system discloses the information. It buries the legibility. And the distance between those two things is where influence lives.
II. The Mirror and the Map
I want to start with the Federal Election Commission, because the FEC is where the architectural principle becomes most visible, and because the FEC's design philosophy, once you understand it, explains almost everything else.
The FEC does not assign stable identifiers to individual donors. When you give money to a congressional candidate, the FEC records your name, city, state, employer, and occupation, exactly as reported on the form. This means the same person can appear as JOHNSON, ROBERT, BOEING CO, ENGINEER in one filing, BOB JOHNSON, BOEING COMPANY, AEROSPACE ENGR in another, and JOHNSON R, BOEING, ENGR in a third. The FEC does not consider these to be the same person. Its matching logic is string identity: are the characters the same? No? Different people.
The instinct here is annoyance. This seems broken. But if you sit with it (which I have, at hours suggesting poor judgment or genuine obsession, and possibly both), what emerges is a system that has made a very specific philosophical choice about its own purpose. The FEC is a filing agency. Its mandate, per the Federal Election Campaign Act, is to receive and make available reports and statements filed with the Commission. Its job is to receive filings and publish them. To be a cabinet. The form says JOHNSON, ROBERT. The database says JOHNSON, ROBERT. The database has done its job.²
This distinction (between recording what was filed and asserting what is true about the world) is, I've come to believe, the keystone of the entire disclosure system, and I want to name it clearly because everything downstream depends on it. Every database is one of two things. It is either a mirror (it reflects what someone reported, no more, no less) or a map (it models what someone believes about reality). The FEC's contribution database is a mirror. It is a very good mirror. It is also, if what you want to know is "which industries are funding which legislators and how does that correlate with their votes," a mirror pointed at the wrong wall.
Congress.gov is also a mirror. It records votes. It does not record why anyone voted that way. VoteView (UCLA) is closer to a map: it computes ideology scores from voting patterns, which is an interpretive act. The Senate Lobbying Disclosure database is a mirror. It records what lobbyists filed. It does not record whether the lobbying worked. OpenSecrets is a map: it classifies donors into industry categories, which requires judgment. The congress-legislators YAML file on GitHub is a Rosetta Stone maintained by volunteers, without which the other six databases are, in a meaningful sense, written in mutually unintelligible languages.³
Seven databases. Each one does its job. Each one's job is narrower than what you need it to do if you want to answer the question "does money predict votes." And the collective result is a disclosure system that is simultaneously transparent (every record is public, as the law requires) and opaque (no record connects to any other record, because the law does not require that).
I do not think this was designed. I think it congealed. The FEC was established in 1975. Congress.gov's API launched decades later. The Senate lobbying database was built to a different spec by different people in a different era. Nobody sat in a room and said "let's make the data public but unconstructable." It happened the way a city without sidewalks happens: each road was built for a purpose, and the purposes didn't include pedestrians, and by the time anyone noticed the pedestrians, the concrete had settled.
But the effect is the same whether it was designed or accreted. A citizen who wants to know whether their representative's votes correlate with their representative's donors faces a choice:
A. Spending several hundred hours manually reviewing filings across multiple federal websites.
B. Building a data pipeline.
Option A is theoretically available to every American. Option B requires programming skills that only a subset of the population possesses, and that even fewer want to use. The information is public. The comprehension is gated. And the gate is plumbing, which is harder to object to than a paywall or a classification level or a legal restriction, because plumbing sounds like a solvable engineering problem rather than a structural feature of the political economy it describes.
I did the plumbing. Here's what's in it.⁴
III. The Identity Problem
Zachary Nunn, a congressman from Iowa, is H4NY15024 to the FEC, ICPSR 21500 to VoteView, Bioguide T000474 to Congress.gov, and CID N00036154 to OpenSecrets. Four strings of characters that mean "Zachary Nunn" in four different institutional languages. The chain that connects a dollar donated to Nunn's campaign to a vote Nunn cast on the House floor runs: contribution → FEC committee ID → FEC candidate ID → Bioguide (via YAML crosswalk) → vote record. Four joins. Each one can fail. The committee master file has committees with no candidate. The candidate file has candidates with no Bioguide mapping. The YAML crosswalk has legislators from 1789 who, for reasons that will be obvious upon reflection, lack FEC identifiers.
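To make the chain concrete, here is a minimal sketch of the lookups and the places they can break. It is not the pipeline's code: the table names, the IDs, and the `resolve_votes` helper are all hypothetical, and real tables would be loaded from the FEC bulk files and the crosswalk YAML rather than typed in.

```python
from typing import Optional

# Hypothetical lookup tables with made-up IDs, for illustration only.
COMMITTEE_TO_CANDIDATE = {
    "C00000001": "H0XX00001",
    "C00000002": None,          # a committee with no candidate attached
    "C00000003": "H0XX00003",
}
CANDIDATE_TO_BIOGUIDE = {"H0XX00001": "X000001"}  # H0XX00003 has no mapping
BIOGUIDE_TO_VOTES = {"X000001": ["rollcall-1", "rollcall-2"]}


def resolve_votes(committee_id: str) -> tuple[Optional[list[str]], str]:
    """Follow committee -> candidate -> Bioguide -> votes.

    Returns (votes, status). When a link is missing, votes is None and
    status names the step where the chain broke.
    """
    candidate_id = COMMITTEE_TO_CANDIDATE.get(committee_id)
    if candidate_id is None:
        return None, "committee has no candidate"
    bioguide = CANDIDATE_TO_BIOGUIDE.get(candidate_id)
    if bioguide is None:
        return None, "candidate has no Bioguide mapping"
    votes = BIOGUIDE_TO_VOTES.get(bioguide)
    if votes is None:
        return None, "no vote record for Bioguide ID"
    return votes, "ok"


for committee in ("C00000001", "C00000002", "C00000003"):
    print(committee, resolve_votes(committee))
```

Returning the failure reason instead of silently dropping the row is a design choice worth keeping in any version of this: the count of records lost at each step is itself a measurement of how well the databases line up.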
The legislator mapping is, comparatively, clean. There are 12,000 of them, each a public figure with a paper trail. The donor mapping is where the disclosure system's architectural choice becomes most consequential.
Because the FEC doesn't assign donor IDs, my pipeline has to decide whether ROBERT JOHNSON and BOB JOHNSON are the same person. I use Jaro-Winkler similarity at a 0.92 threshold with 70/30 name-to-employer weighting. At 0.92, they merge. At 0.93, they don't. The difference between one person and three people can be a 0.01 change in a similarity threshold. I set it to 0.92 and Bob Johnson is one guy. I set it to 0.93 and he's triplets. Every number downstream, including the 6.64%, lives in that reality.
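For readers who want to see the mechanics, here is a minimal sketch of Jaro-Winkler similarity with a 70/30 name-to-employer weighting and a merge threshold. It is a textbook implementation written for this illustration, not the pipeline's code; `match_score` and `same_donor` are hypothetical names, and a real deduplication step would presumably normalize names (token order, punctuation, nicknames) before scoring. Scores from this sketch on raw strings should not be read as the pipeline's.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as matching only within this sliding window.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that line up in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1) -> float:
    """Jaro similarity boosted for a shared prefix of up to four characters."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1 - base)


def match_score(a: tuple[str, str], b: tuple[str, str], name_weight: float = 0.7) -> float:
    """Weighted similarity of two (name, employer) records."""
    (name_a, emp_a), (name_b, emp_b) = a, b
    return (name_weight * jaro_winkler(name_a, name_b)
            + (1 - name_weight) * jaro_winkler(emp_a, emp_b))


def same_donor(a: tuple[str, str], b: tuple[str, str], threshold: float = 0.92) -> bool:
    """Merge decision: True when the weighted score clears the threshold."""
    return match_score(a, b) >= threshold


a = ("JOHNSON, ROBERT", "BOEING CO")
b = ("JOHNSON R", "BOEING")
score = match_score(a, b)
print(round(score, 4), same_donor(a, b, 0.92), same_donor(a, b, 0.93))
```

The merge decision is a single comparison against a constant, which is the whole point: nothing in the arithmetic says where the constant should sit.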
Here is the thing I keep coming back to: the FEC has 36 million contribution records. If those records had stable donor IDs (the way, say, the Social Security Administration assigns stable IDs to earners),⁵ the entire donor-industry analysis would be trivial. You could look up a donor, see their employer, classify their industry, and connect them to their contributions in a single query. The reason it requires a deduplication engine running fuzzy string matching at thresholds chosen by a person staring at pairs of records at his desk is that the FEC designed a system for filing, and filing requires fidelity to the form, and fidelity to the form means three records for one person, and three records for one person means the difference between understanding and noise is a similarity threshold set by feel.
Three contributions, incidentally, totaling $6,200, to the same legislator, in the same cycle, from the same human being who happens to have a nickname and a job change. The mirror reflects three donations from three strangers. The map, if you build one, shows one person with a pattern.
Multiply this by 36 million and you have the donor identity problem. Multiply the donor identity problem by the legislator identity problem by the bill identity problem⁶ by the lobbying identity problem, and you have the reason that connecting money to votes requires a pipeline instead of a query, which is itself the reason that the connection between money and votes is, for the vast majority of Americans, functionally invisible despite being technically public.
IV. What the Machine Sees
So you build the pipeline. You resolve the identities. You connect the databases. And now you have to decide what to show the model, which is a decision that is often presented as engineering but is actually theory.
A feature, in machine learning, is a thing (number, category, other) you believe contains information relevant to the prediction you're trying to make. "Believe" is doing real work there. By including a measurement, you assert it matters. By excluding one, you assert (or at least imply) it doesn't. The model will tell you whether you were right, which is nice, but the initial decision about what to measure is a hypothesis wearing an engineering hat.
The Congressional Yield Index uses 276 features per vote. Each vote is a (legislator, bill) pair. The features decompose into several categories:
The legislator (~60 features): party, DW-NOMINATE ideology scores,⁷ district demographics, committee assignments, seniority, caucus memberships, interest group ratings, predecessor voting patterns, attendance rate, defection trend, and features from structured political profiles generated by submitting all 535 current legislators to an LLM.⁸
The bill (~64 features): subject tags, sponsor party, vote type (passage, amendment, procedural, nomination), bipartisan history scores, and PCA-reduced dimensions of bill text embeddings from a local model that converts 200 pages of legislative language into numbers that capture semantic similarity. What those numbers lose in the conversion is, roughly, everything that makes legislation legislation rather than a vector.
The financial relationship (107 features): For each of the top 50 CRP industry categories, what percentage of this legislator's total funding came from that industry. The concentration index of that vector.⁹ PAC-to-individual ratio. In-state vs. out-of-state ratio. Independent expenditure totals. Dark money exposure. Donor cluster exposure (from a Gaussian Mixture Model on the co-funding graph, which finds emergent clusters of donors who fund the same legislators; the clusters are what the algorithm discovered, real in the sense that they're predictive and unreal in the sense that nobody in Washington would recognize them).¹⁰ Financial conflict scores (the dot product of the legislator's donor-industry vector and the bill's stakeholder-industry vector, weighted by direction and magnitude). Lobbying pressure. Employer concentration. And temporal financial features: contribution velocity in the 90 days before a vote, new donor counts from affected industries, quarter-over-quarter lobbying spend changes, donor tenure pressure, and employer-bill exposure signals.
These 107 features are the ones that produce the 6.64%. They are also the ones that were hardest to build, because they required connecting the databases that were designed to exist in isolation. The industry concentration vector requires linking contributions (FEC) to industry codes (OpenSecrets) via donor names (fuzzy-matched) and committee IDs (crosswalked). The lobbying pressure feature requires a chain of four joins: contributions (FEC) → organizations (built from scratch, because no shared org ID exists across the databases) → lobbying filings (Senate LDA) → issue codes → bill subjects (Congress.gov). Every one of these joins crosses an institutional boundary. Every boundary crossing is a potential failure point, a place where identities don't match, formats don't align, or the mirror on one side is reflecting something that the mirror on the other side doesn't recognize.
The 107 financial features are, in other words, the features that the disclosure system's architecture made maximally difficult to compute. The non-financial features (party, ideology, district, committee) were comparatively easy, because they come from databases that were designed to be legible. Make of this what you will.
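Two of the financial features named above, the concentration index and the financial conflict score, reduce to a few lines of arithmetic once the joins are done. The sketch below assumes a legislator's funding has already been totaled by industry and that each bill carries a signed stake per industry in [-1, 1]; the function names, the industry labels, and the stake encoding are assumptions made for illustration, not the pipeline's definitions.

```python
def industry_shares(totals: dict[str, float]) -> dict[str, float]:
    """Convert dollar totals per industry into shares of total funding."""
    total = sum(totals.values())
    if total == 0:
        return {industry: 0.0 for industry in totals}
    return {industry: amount / total for industry, amount in totals.items()}


def hhi(shares: dict[str, float]) -> float:
    """Herfindahl-Hirschman Index: sum of squared shares.

    Close to 0 for a widely spread funding base, 1.0 for a single industry.
    """
    return sum(share * share for share in shares.values())


def conflict_score(donor_shares: dict[str, float], bill_stakes: dict[str, float]) -> float:
    """Dot product of the donor-industry vector and the bill's stake vector.

    bill_stakes maps industry -> signed stake (hypothetical encoding:
    positive if the bill helps the industry, negative if it hurts).
    """
    return sum(share * bill_stakes.get(industry, 0.0)
               for industry, share in donor_shares.items())


# Made-up totals for one legislator and one bill.
shares = industry_shares({"pharma": 75.0, "banking": 25.0})
concentration = hhi(shares)                            # 0.75**2 + 0.25**2
conflict = conflict_score(shares, {"pharma": 1.0})     # only pharma has a stake
print(shares, concentration, conflict)
```

The arithmetic is the easy part. Every input to it sits on the far side of a fuzzy match or a crosswalk.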
V. The Machine That Plays Twenty Questions
The model is CatBoost, which plays 20 Questions.
I mean this almost literally. In 1948, Claude Shannon published "A Mathematical Theory of Communication," in which he proved that information is the resolution of uncertainty and that the optimal way to identify an unknown is to ask the sequence of binary questions that eliminates the most possibilities per question. A decision tree does exactly this. It takes 2.4 million votes, each described by 276 features and a binary target (did this legislator defect from their party?), and learns a sequence of yes/no splits. Is this a Democrat? Is their DW-NOMINATE above -0.3? Does more than 10% of their funding come from financial services? Each split is chosen to maximize Shannon's information gain: the biggest reduction in uncertainty about the outcome. Eight layers of questions, and you've carved the space into regions where the answer is mostly yes or mostly no.¹¹
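The idea of scoring a question by how much uncertainty it removes fits in a few lines. This is a sketch of the textbook entropy-based information gain for a binary target, with made-up vote counts. CatBoost's own split scoring works on gradients of the loss rather than raw entropy, so treat this as an illustration of the Shannon idea, not of the library's internals.

```python
import math


def entropy(defections: int, total: int) -> float:
    """Binary Shannon entropy (in bits) of a group of votes."""
    if total == 0 or defections in (0, total):
        return 0.0
    p = defections / total
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def information_gain(left: tuple[int, int], right: tuple[int, int]) -> float:
    """Entropy removed by splitting votes into two groups.

    Each argument is (defections, votes) for one side of the yes/no question.
    """
    left_pos, left_n = left
    right_pos, right_n = right
    total = left_n + right_n
    parent = entropy(left_pos + right_pos, total)
    children = (left_n / total) * entropy(left_pos, left_n) \
        + (right_n / total) * entropy(right_pos, right_n)
    return parent - children


# A question that separates defectors perfectly removes all the uncertainty...
perfect = information_gain(left=(50, 50), right=(0, 50))
# ...and one that leaves both groups as mixed as before removes none.
useless = information_gain(left=(25, 50), right=(25, 50))
print(perfect, useless)
```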
The model uses 3,000 such trees, each one trained to correct the errors of the previous ensemble (gradient boosting), with hyperparameters selected by Bayesian optimization over 100 trials.¹²
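"Each one trained to correct the errors of the previous ensemble" can be shown on a toy problem. The sketch below is plain gradient boosting for squared error with one-split trees (stumps) on a single feature. It is an assumption-laden miniature: CatBoost uses deeper symmetric trees, a classification loss, and ordered boosting, none of which appear here, and the names `fit_stump`, `boost`, and `predict` are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the one threshold on xs that best fits the residuals (least squares)."""
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        if not left or not right:
            continue
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((r - left_mean) ** 2 for r in left) \
            + sum((r - right_mean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:
        raise ValueError("need at least two distinct x values to split")
    return best[1:]  # (threshold, left_mean, right_mean)


def boost(xs, ys, n_trees=50, learning_rate=0.3):
    """Fit stumps one at a time, each to the residuals of the ensemble so far."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]  # current errors
        threshold, left_mean, right_mean = fit_stump(xs, residuals)
        stumps.append((threshold, left_mean, right_mean))
        preds = [p + learning_rate * (left_mean if x <= threshold else right_mean)
                 for x, p in zip(xs, preds)]
    return base, stumps, preds


def predict(base, stumps, learning_rate, x):
    """Sum the baseline and every stump's scaled correction for one input."""
    return base + sum(learning_rate * (lm if x <= t else rm) for t, lm, rm in stumps)


xs = [1, 2, 3, 4]
ys = [0.0, 0.0, 1.0, 1.0]
base, stumps, preds = boost(xs, ys)
worst_error = max(abs(p - y) for p, y in zip(preds, ys))
print(worst_error)
```

The learning rate below 1 is why many trees are needed: each tree removes only a fraction of the remaining error, which trades speed for stability.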
The thing I want you to notice, though, is what the tree doesn't know. It doesn't know what a Republican is. It doesn't know what the pharmaceutical industry is. It doesn't know what it means for a legislator to defy their party on a vote about banking regulation while their top five donors are banks. It knows that certain conjunctions of feature values, in the training data, correlated with defection at certain rates. It has answered every question. It has understood nothing.
Shannon himself recognized this. His theory deliberately excluded meaning. "Frequently the messages have meaning," he wrote. "These semantic aspects of communication are irrelevant to the engineering problem." The 6.64% lives in the same exclusion zone. It is a statistical statement. Whether it means what you think it means when you read it is, as Shannon would say, a different department.
VI. The Counterfactual
The model has two stages. The architecture is standard.
Stage 1 sees 169 features: everything except money. Party. Ideology. Seniority. Committees. Bill subjects. Bill embeddings. Interest group ratings. District demographics. Caucus memberships. Vote type. Predecessor patterns. Stage 1 is a what-if machine. It asks: if you could not see this legislator's donors, how would you predict their vote?
Stage 2 sees all 276 features, including the 107 financial ones. Same algorithm. Same data.
For each vote, the pipeline computes P₁ (Stage 1's defection probability) and P₂ (Stage 2's). The financial delta is P₂ − P₁.
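The bookkeeping behind the two stages is simple enough to sketch. The `fin_` column prefix, the `fin_` feature names, and the probabilities below are hypothetical, chosen only to show how the feature split and the per-vote delta fit together; in practice the two probability lists would come from two fitted models.

```python
def split_features(columns: list[str], financial_prefix: str = "fin_") -> tuple[list[str], list[str]]:
    """Return (stage1_columns, stage2_columns).

    Stage 1 drops every financial column; Stage 2 keeps everything.
    The 'fin_' naming convention is an assumption made for this sketch.
    """
    stage1 = [c for c in columns if not c.startswith(financial_prefix)]
    return stage1, list(columns)


def financial_delta(p1: list[float], p2: list[float]) -> list[float]:
    """Per-vote delta P2 - P1 between the two stages' defection probabilities."""
    if len(p1) != len(p2):
        raise ValueError("both stages must score the same votes")
    return [b - a for a, b in zip(p1, p2)]


columns = ["party", "leg_dw_nominate_dim1", "fin_pharma_share", "fin_hhi"]
stage1_cols, stage2_cols = split_features(columns)

# Made-up probabilities for three votes, as if produced by two fitted models.
p_stage1 = [0.125, 0.5, 0.25]
p_stage2 = [0.25, 0.5, 0.125]
deltas = financial_delta(p_stage1, p_stage2)
print(stage1_cols, deltas)
```

A positive delta means the financial features raised the predicted chance of defection for that vote; a negative one means they lowered it.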
And here is where I need to slow down, because the counterfactual is the foundation of the entire analysis, and the foundation has a crack in it that I think is important enough to show you rather than plaster over.
Stage 1 simulates a world in which the model cannot see campaign money. That is a different thing from a world without campaign money. In a world without contributions, the legislator might behave differently because the incentive structure has changed. Industries that currently shape agendas through financial support might still shape them through expertise, information asymmetry, revolving-door employment, or the simple fact that the pharmaceutical industry employs 30,000 people in a particular district and those people vote. The legislator who votes with pharma might vote with pharma even in a moneyless world, because they genuinely believe the industry's position is correct, or because their committee staff drafted the language and the staff came from industry, or because their district would lose jobs if the bill passed.
In Stage 1's world, all of those mechanisms still exist. The model just can't see the financial features. The delta between P₁ and P₂ is the additional predictive power that financial features provide after everything else the model can observe has been accounted for. That claim is statistical. The causal claim ("money changed this vote") is a different and much stronger assertion. The gap between them cannot be closed with observational data. To close it, you would need to randomly assign campaign contributions to legislators and observe the result, which would require an experimental design that violates several federal laws and all known principles of political science.¹³
So the 6.64% lives in the gap. It is a measurement of something, but the something is ambiguous: it could be money's influence on votes, or money's correlation with the part of voting behavior that ideology and party don't already explain, or some combination that varies by member and by vote in proportions we cannot determine.
I want to be transparent about this ambiguity because I think being honest about it is a prerequisite for saying what I'm about to say, which is that despite the ambiguity, I believe the number is pointing at something real.
Here is why. The confounders that Stage 1 controls for are strong. DW-NOMINATE is the gold standard of ideology measurement. Party membership is the single strongest predictor of votes. District demographics capture constituent pressure. Committee assignments capture institutional position. Interest group ratings capture revealed policy preferences. Bill embeddings capture semantic content. If the financial signal were merely proxying for ideology (if the money followed preferences rather than shaping them), Stage 1 should absorb it, and Stage 2 should show minimal improvement. Instead, Stage 2 adds predictive power on top of all of that. Consistently. And in patterns that cohere with how you'd expect financial influence to operate if it were real: the tenure curve that peaks for freshmen and declines as ideology features absorb the signal, the concentration of financial signal on amendment votes (the low-scrutiny, high-substance votes where party discipline is weakest), and the diffuse distribution of money's signal across 107 features with no single financial variable dominating (which is what structural influence looks like, as opposed to transactional quid pro quo).
A random confounder would produce a signal that is uniform across vote types, uniform across tenure levels, uniform across feature importance rankings. The financial signal varies in patterned ways that are consistent with a specific hypothesis (money as structural alignment between donors and legislators) and inconsistent with the leading alternative (money as noise correlated with ideology). This is evidence, and I think reasonable people should treat it as such, while acknowledging that evidence and proof occupy different floors of the same building and I am standing on the lower one.
VII. Decomposition
The 6.64% is computed using Shapley values, a concept from cooperative game theory that Lloyd Shapley published in 1953, roughly sixty years before he won a Nobel Prize. The Shapley value decomposes a prediction into additive contributions from each feature. For 276 features, the brute-force computation requires 2²⁷⁶ subsets (a number with 84 digits, roughly a thousand times the usual estimate of the number of atoms in the observable universe), but TreeSHAP computes exact values in polynomial time by exploiting the structure of decision trees.
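The brute-force definition is worth seeing once, on a game small enough to enumerate. The sketch below computes exact Shapley values for three "features" by averaging each one's marginal contribution over every coalition of the others. The toy value function, including the party-and-money interaction, is invented for illustration; it is not the model, and this is not TreeSHAP.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values by enumerating every coalition (exponential cost)."""
    n = len(players)
    result = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for size in range(n):
            # Probability weight of a coalition of this size preceding p.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for coalition in combinations(others, size):
                s = frozenset(coalition)
                total += weight * (value(s | {p}) - value(s))
        result[p] = total
    return result


def toy_value(coalition):
    """Made-up payoff: two additive effects plus a party-and-money interaction."""
    v = 0.0
    if "party" in coalition:
        v += 0.25
    if "ideology" in coalition:
        v += 0.5
    if "party" in coalition and "money" in coalition:
        v += 0.25
    return v


players = ["party", "ideology", "money"]
phi = shapley_values(players, toy_value)
print(phi, sum(phi.values()))
```

Two properties show up even at this scale. The contributions add up to the payoff of the full coalition, and the interaction term is split evenly between the two features that produce it, so "money" is credited with half of an effect it cannot produce alone.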
For each vote, the decomposition produces 276 numbers, one per feature, each answering the question: how much did this feature push the prediction toward or away from defection? Feature party_seat_margin (how many seats the member's party holds beyond the majority threshold, which is a rough measure of how much the party can afford to lose them on any given vote) pushed -0.31 toward party loyalty. Feature leg_dw_nominate_dim1 (the member's position on the liberal-conservative spectrum, computed from their lifetime voting record) pushed +0.31 toward defection. Every feature gets a number. Added to the model's baseline output, the numbers sum to the prediction.
The 6.64% is what you get when you add up the absolute contributions of all 107 financial features, divide by the absolute contributions of all 276, and average that fraction across every vote in the test set. It is, in other words, money's share of the model's total explanatory effort, which is a thing you can compute precisely and interpret only approximately.
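That recipe translates directly into code. The sketch below assumes the SHAP values are already available as one row of numbers per vote and that the positions of the financial features are known; the function name and the tiny three-feature matrix are made up for illustration.

```python
def financial_share(shap_rows: list[list[float]], financial_idx: list[int]) -> float:
    """Mean over votes of |financial SHAP| / |all SHAP|.

    shap_rows: one list of per-feature SHAP values for each vote.
    financial_idx: column positions of the financial features.
    """
    financial = set(financial_idx)
    fractions = []
    for row in shap_rows:
        total = sum(abs(v) for v in row)
        if total == 0:
            continue  # no attribution at all for this vote; skip it
        money = sum(abs(v) for i, v in enumerate(row) if i in financial)
        fractions.append(money / total)
    return sum(fractions) / len(fractions) if fractions else 0.0


# Two made-up votes, three features, with the last column financial.
rows = [
    [-0.5, 0.25, 0.25],   # financial share 0.25 / 1.0
    [1.0, 0.0, 1.0],      # financial share 1.0 / 2.0
]
share = financial_share(rows, financial_idx=[2])
print(share)
```

Taking absolute values before summing is what makes this a share of effort rather than a net effect: a feature that pushes hard in opposite directions on different votes still counts in full.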
I want to flag one thing about this number and then move on. The SHAP values decompose a prediction; they do not explain a phenomenon. The difference is the difference between "60% of the repair cost was the transmission" and "the transmission caused the breakdown." One is a partition. The other is a story. The 6.64% is a partition. The companion post sometimes uses language that implies it's a story. I chose that language because the accurate version ("financial features account for 6.64% of the mean absolute SHAP attribution across a gradient-boosted tree ensemble trained on 2.4 million votes with a defection target, evaluated on a time-separated test set") would lose every reader I'd like to keep. Whether the imprecision is forgivable depends on how much you trust the rest of the machinery to justify it.
VIII. The Target and the Test
The model predicts defection from party, which is the interesting question: when does a legislator break? Predicting yea or nay is trivially accurate (guess the party's usual position and you'll be right almost every time), and trivial predictions tell you nothing about influence. Defection is rare (roughly 6% of votes), which makes it hard to predict and therefore informative when you can. The trivial baseline (always predict no defection) achieves roughly 94% accuracy. The model achieves 97.42% accuracy with a 0.9551 AUC-ROC.¹⁴
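The reason to report AUC-ROC next to accuracy shows up in a toy example. With a 6% defection rate, a predictor that never says "defect" scores 94% accuracy while ranking defectors no better than a coin flip. The sketch below uses synthetic labels and a pairwise-comparison definition of AUC; the helper names are hypothetical and nothing here reproduces the model's reported figures.

```python
def accuracy(y_true: list[int], y_pred: list[int]) -> float:
    """Fraction of predictions that match the labels."""
    return sum(int(t == p) for t, p in zip(y_true, y_pred)) / len(y_true)


def auc_roc(y_true: list[int], scores: list[float]) -> float:
    """Probability that a random defection is scored above a random non-defection.

    Ties count as half a win. Quadratic in the number of votes, which is
    fine for a sketch and far too slow for millions of rows.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Synthetic test set: 94 loyal votes, 6 defections.
labels = [0] * 94 + [1] * 6

# NeverDefect: always predict loyalty, with a constant score for every vote.
baseline_accuracy = accuracy(labels, [0] * 100)
baseline_auc = auc_roc(labels, [0.0] * 100)

# A scorer that ranks every defection above every loyal vote.
perfect_auc = auc_roc(labels, [0.1] * 94 + [0.9] * 6)
print(baseline_accuracy, baseline_auc, perfect_auc)
```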
The evaluation has three layers. Standard metrics on a time-separated test set. Baselines (NeverDefect, party-only, logistic regression) that the model must beat to justify its existence; it beats all three, though the margin over logistic regression is modest enough that I want to flag it rather than crow about it. And the CYI Evaluation Set: a curated subset of votes where money plausibly matters most (high financial conflict, close margins, high defection rates, bills with multiple identified stakeholder industries; a vote must meet at least 2 of 4 criteria). If the financial signal is real, the model should perform better on this subset. It does. If the signal were an artifact of bill contentiousness, you'd expect the performance to be uniform.
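The "at least 2 of 4 criteria" rule for the CYI Evaluation Set is easy to state in code. The four criteria come from the description above, but every threshold in this sketch is a placeholder I made up, as are the `VoteSummary` fields and the function name; the post does not give the actual cutoffs.

```python
from dataclasses import dataclass


@dataclass
class VoteSummary:
    financial_conflict: float      # conflict score for the bill (hypothetical scale)
    margin: float                  # winning margin as a fraction of votes cast
    defection_rate: float          # share of members who broke with their party
    stakeholder_industries: int    # number of identified stakeholder industries


def in_cyi_eval_set(
    vote: VoteSummary,
    *,
    conflict_min: float = 0.5,      # placeholder threshold
    margin_max: float = 0.05,       # placeholder threshold
    defection_min: float = 0.10,    # placeholder threshold
    industries_min: int = 2,        # "multiple" stakeholder industries
    required: int = 2,
) -> bool:
    """True when the vote meets at least `required` of the four criteria."""
    criteria = [
        vote.financial_conflict >= conflict_min,
        vote.margin <= margin_max,
        vote.defection_rate >= defection_min,
        vote.stakeholder_industries >= industries_min,
    ]
    return sum(criteria) >= required


close_and_contested = VoteSummary(0.2, 0.02, 0.15, 1)   # meets margin + defection
lopsided_and_quiet = VoteSummary(0.6, 0.40, 0.01, 1)    # meets conflict only
print(in_cyi_eval_set(close_and_contested), in_cyi_eval_set(lopsided_and_quiet))
```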
IX. What the Plumbing Reveals
This was supposed to be a technical appendix. I want to acknowledge that, because I think the gap between what I thought I was writing and what I ended up writing is itself part of the finding.
I sat down to describe a data pipeline: sources, joins, features, model, evaluation. A document for the technically curious. And what kept emerging, section after section, was the implications of the pipeline's existence rather than its mechanics. Because every hard part of building this system was hard for the same reason: the data was designed to exist in isolation. The identities don't match. The formats don't align. The FEC records dollars, Congress.gov records votes, the Senate records lobbying, and none of them record the relationship between any of these things, because none of them were asked to.
Congress passed laws requiring that every campaign contribution above $200 be disclosed. Each disclosure mandate was fulfilled. Each mandate, fulfilled in isolation, produces a filing cabinet of extraordinary completeness that is, in isolation, of almost no use for answering the question the disclosure was presumably designed to let you ask. The result is a transparency regime that is, in the strictest legal sense, transparent.
And in every practical sense these disclosures might as well be written on the dark side of the moon. The information is disclosed. The legibility is absent.
The effect, whether it was intended or coalesced, is that connecting money to votes requires building a machine. The machine I built uses gradient-boosted trees, SHAP decomposition, donor clustering, and bill embeddings. It ingests 91GB of data. It makes 36 million joins. And after all of that, it produces a number (6.64%) that is, at best, a noisy estimate of the financial system's statistical footprint on legislative behavior, hedged by every caveat I've described in this post: the deduplication threshold that creates its own reality, the counterfactual that imagines a world that doesn't exist, the SHAP values that partition without explaining, the emergent clusters that are real but uninterpretable.
All of this was necessary to see a thing that was, legally speaking, never hidden.
The Federal Election Campaign Act of 1971 begins with a statement of purpose. The purpose, per the legislative history, is to promote "public disclosure of the sources and uses of funds in Federal election campaigns." The word "disclosure" appears 70 times¹⁵ in the original act. The word "comprehension" appears zero times. And I keep coming back to this, because I think the gap between those two words is where the entire story lives.
Disclosure means making information available. Comprehension means making information usable. The distance between those two definitions is the distance between putting all the ingredients on the counter and cooking dinner. The FECA put every ingredient on the counter. It did not, at any point, in any section, require that the ingredients be combinable into a meal. The result, fifty years later, is a countertop so covered with disconnected ingredients that the only people who can cook with them are the ones who build their own kitchens from scratch, which is a group that includes, as far as I can determine: some political scientists, some investigative journalists, some well-funded advocacy organizations, and a dude writing a blog post sitting next to a cat litterbox.¹⁶
I want to be careful with what I'm saying here, because there is a version of this argument that sounds like a conspiracy theory, and that is not what I mean. Nobody designed this. The opacity is emergent. But emergent opacity and engineered opacity produce the same result for the citizen standing outside the system trying to look in, which is: nothing.
And here's where I arrive at a question I'm not qualified to answer and that I don't think the unanswerability excuses me from asking: is a disclosure system that discloses everything and connects nothing actually fulfilling the purpose for which it was created? When James Madison wrote in Federalist No. 52 that the House of Representatives should have "an immediate dependence on, and an intimate sympathy with, the people," he was articulating a principle that assumes citizens can see what their representatives are doing and why. The "why" is the part that requires connection. And the connection is the part that requires plumbing. And the plumbing is the part that the law neither provides nor requires, which means the "intimate sympathy with the people" part depends, structurally, on whether someone felt like building a data pipeline that week.¹⁷
The 6.64% is interesting. I believe it points at something real. But I think the deeper finding is the one the pipeline's existence reveals: that the system is working exactly as designed, that the design achieves exactly what it specifies, and that what it specifies falls so far short of what most people assume it specifies that the distance between the specification and the assumption is wide enough to hold $14 billion and 2.4 million votes and the entire relationship between the two, in plain sight, in public, invisible.
Footnotes
-
The seven sources: FEC bulk data (fec.gov), Congress.gov API, VoteView (voteview.com), congress-legislators GitHub repo (unitedstates/congress-legislators), Senate LDA API (lda.senate.gov), OpenSecrets/CRP bulk data, and ProPublica Nonprofit Explorer. Total raw data: 36 million contributions, 2.4 million member-vote pairs, 460,000 lobbying filings, 81,000 bills, 12,000 legislators. If you want to reproduce this, bring coffee and a high tolerance for government data formats. ↩
-
The Federal Election Campaign Act of 1971, as amended, requires the Commission to "receive and make available" reports and statements. The word "analyze" does not appear in the mandate. This distinction matters: a filing cabinet does its job by receiving and publishing; a regulator does its job by interpreting and acting. The FEC is, by design, the former, which is one of several reasons why the FEC has been, for roughly its entire existence, the subject of intense debate about whether it is structurally capable of fulfilling the purpose most people assume it exists to fulfill. The Commission has an even number of members (six), no more than three from any party, and a history of partisan deadlocks so consistent they have become, in the academic literature, an object of study in their own right. ↩
-
The
congress-legislatorsYAML is maintained by volunteers at theunitedstatesGitHub organization. It maps Bioguide to FEC to OpenSecrets to VoteView ICPSR to GovTrack IDs. Its memory footprint, loaded into a bidirectional lookup index, is about 15MB. The infrastructure connecting 40% of American political data projects weighs less than a single high-resolution photo of a legislator. ↩ -
Ew. ↩
-
In 1938, a wallet manufacturer in Lockport, New York, included a sample Social Security card in its wallets bearing the real SSN of a company secretary named Hilda Schrader Whitcher. 40,000 Americans adopted Hilda's number as their own. At the peak in 1943, 5,755 people were simultaneously using it. The SSN was designed for payroll tracking and printed NOT FOR IDENTIFICATION, which turned out to be one of the least enforceable directives in the history of American bureaucracy. The IRS adopted it in 1962. The military in 1969. The credit agencies eventually. Each adoption was pragmatic. The cumulative effect was a de facto national ID that nobody voted for. The FEC, notably, did not adopt it as a donor identifier, which is one of many choices that is easy to criticize and hard to fault. ↩
-
The bill identity problem is its own special nightmare. VoteView identifies bills by chamber-specific codes (HR21, S567). Congress.gov uses a different format. The pipeline reconciles them, but VoteView frequently assigns placeholder IDs (prefixed vv-) for roll-call votes whose associated bill can't be matched to Congress.gov metadata, which means a nontrivial number of votes float in the database attached to bills that the pipeline knows exist but can't fully describe. These phantom bills are excluded from analysis, but their existence is a reminder that even the vote database, which is the simplest part of the pipeline, contains lacunae. ↩
-
DW-NOMINATE stands for Dynamic Weighted NOMINAl Three-step Estimation, an acronym that could only have been produced by political scientists. The first dimension captures the liberal-conservative spectrum. The second dimension captures something that political scientists have been arguing about for decades. ↩
-
Structured political profiles for 535 members of Congress, generated via the Anthropic Batch API: $9. The profiles include interest group ratings, district demographics, electoral vulnerability, caucus memberships, biographical details, and donor tension analysis. This is opposition research at 99.997% below market rate, which is either a marvel of technology or a sign that the opposition research industry should update its pricing model. ↩
-
The Herfindahl-Hirschman Index was developed for antitrust economics, where it measures market concentration. Applied to campaign donors, it measures how close a legislator's funding base is to a monoculture. A legislator whose funding is a monopoly of one industry is, at minimum, in a different structural position than one whose funding is a competitive market. Whether "different structural position" translates to "different voting behavior" is what the model is trying to determine. ↩
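For readers who want the arithmetic, here is a minimal Python sketch of the index. The function name `donor_hhi`, the industry buckets, and the dollar amounts are all hypothetical, and the 0-to-1 scale (rather than the 0-to-10,000 scale common in antitrust work) is an assumption: nothing here says which scale or bucketing the pipeline actually uses.

```python
from typing import Mapping


def donor_hhi(totals_by_industry: Mapping[str, float]) -> float:
    """Herfindahl-Hirschman Index of a funding base, on the 0-to-1 scale.

    The index is the sum of squared shares: each industry's share of the
    legislator's total receipts, squared, then added up.
    """
    total = sum(totals_by_industry.values())
    if total <= 0:
        raise ValueError("total contributions must be positive")
    return sum((amount / total) ** 2 for amount in totals_by_industry.values())


# Made-up funding bases, purely for illustration.
monoculture = {"pharma": 1_000_000}
diversified = {
    "pharma": 250_000,
    "energy": 250_000,
    "labor": 250_000,
    "tech": 250_000,
}

hhi_monoculture = donor_hhi(monoculture)
hhi_diversified = donor_hhi(diversified)
```

A single-industry funding base scores 1.0, and four equal industries score 0.25; the more evenly the money is spread, the closer the index gets to zero.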
-
The donor clusters are computed via soft Gaussian Mixture Model clustering on the co-funding graph: donors who give to the same legislators end up in the same clusters, regardless of what industry they work in or where they live. The clusters are emergent, which means they were defined by the data rather than by any prior theory about how donor networks work. Whether they correspond to actual coordination among donors or merely to shared preferences is, like most interesting questions in this project, one the model raises and cannot answer. ↩
-
Shannon's information gain, which measures the reduction in entropy from a binary split, is literally the criterion that classic decision tree algorithms such as ID3 and C4.5 use to choose each split. The 20 Questions analogy is genuinely literal. Decision trees are Shannon's information theory applied to classification. The optimal question is the one that resolves the most bits of uncertainty about the target variable. The tree asks: what single yes/no question, right now, would most reduce my confusion about whether this legislator is about to defect? ↩
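The calculation is short enough to write out. What follows is a minimal Python sketch of textbook information gain for one yes/no question; the function names and the toy labels are hypothetical, and it illustrates the idea rather than the split criterion the project's boosted model actually uses, which is not stated here.

```python
import math
from collections import Counter
from typing import Sequence


def entropy(labels: Sequence[int]) -> float:
    """Shannon entropy, in bits, of a sequence of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(labels: Sequence[int], answers: Sequence[bool]) -> float:
    """Bits of uncertainty about `labels` removed by one yes/no question."""
    if len(labels) != len(answers):
        raise ValueError("labels and answers must have the same length")
    n = len(labels)
    if n == 0:
        return 0.0
    yes = [y for y, a in zip(labels, answers) if a]
    no = [y for y, a in zip(labels, answers) if not a]
    # Expected entropy left over after hearing the answer.
    remainder = (len(yes) / n) * entropy(yes) + (len(no) / n) * entropy(no)
    return entropy(labels) - remainder


# Toy data: 1 = defected from the party line, 0 = did not.
defected = [1, 1, 0, 0]
perfect_question = [True, True, False, False]  # separates the classes exactly
useless_question = [True, False, True, False]  # tells us nothing

gain_perfect = information_gain(defected, perfect_question)
gain_useless = information_gain(defected, useless_question)
```

On this toy data the perfect question resolves the full 1 bit of uncertainty and the useless one resolves 0 bits, so a tree choosing by this criterion would ask the first.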
-
3,000 trees, max depth 8, learning rate 0.05, 80% row and column subsampling, L2 leaf regularization 5.0, SqrtBalanced class weights. Hyperparameters via Optuna with time-series cross-validation (train on earlier congresses, validate on later ones, so the model never sees the future during training). ↩
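The "never sees the future" constraint can be sketched in a few lines of Python. This is an illustration under assumptions, not the project's code: the function name `expanding_congress_splits` is hypothetical, the congress numbers are made up, and an expanding training window is only one way to do it (a fixed-width sliding window would satisfy the same constraint).

```python
from typing import Iterator, List, Sequence, Tuple


def expanding_congress_splits(
    congresses: Sequence[int], min_train: int = 1
) -> Iterator[Tuple[List[int], int]]:
    """Yield (training congresses, validation congress) pairs in time order.

    Every training congress is strictly earlier than the validation
    congress, so no fold leaks information from the future.
    """
    ordered = sorted(set(congresses))
    for i in range(min_train, len(ordered)):
        yield ordered[:i], ordered[i]


# Hypothetical range of congresses, purely for illustration.
splits = list(expanding_congress_splits([113, 114, 115, 116]))
for train, validate in splits:
    print(f"train on {train}, validate on {validate}")
```

Each fold trains on everything before the validation congress and nothing after it, which is the property that ordinary shuffled cross-validation would break.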
-
The technical term for this in the political science literature is the "selection vs. influence" problem, and it has been the subject of approximately ten thousand published papers, none of which have resolved it. The best summary I've encountered is from a researcher who said the field's consensus is "probably both, in proportions that vary by member and by vote, in ways we cannot measure." This is a very expensive way of saying "we don't know." ↩
-
Weather forecasting achieves 80-90% accuracy at the 24-hour horizon, on a problem with much more regular underlying physics. Whether human political behavior should be more or less predictable than a thunderstorm is a question I find myself thinking about at odd moments. ↩
-
I dare you to count to see if I actually did it or not. ↩
-
I recognize that this list is both incomplete and self-aggrandizing, and I want to acknowledge that the self-aggrandizement bothers me more than the incompleteness, which probably tells you something about my priorities that I'd rather it didn't. And the general rule of thumb is one litter box per cat, plus one extra. ↩
-
Madison, of course, was writing in a context where the total federal budget was approximately [...] $14 billion in campaign contributions across 539 members of Congress is a question that I suspect he would find fascinating and I find genuinely unanswerable. ↩