Correction, Recourse, and Surplus

What AI productivity studies do not measure. Starts from the cleanest published case (Brynjolfsson, Li, and Raymond's randomized rollout to 5,172 customer-support agents) and reconstructs the institutional questions the headline 14 percent number leaves silent: where the surplus went, who can contest the suggestion logs, what repair labour the agents performed, and whose situation lets them turn the gain into mobility. Closes with five testable hypotheses and a data-collection protocol.

The Cleanest Result, Read Closely

In 2023, Erik Brynjolfsson, Danielle Li, and Lindsey R. Raymond ran a randomized rollout of a generative-AI assistant to 5,172 customer-support agents at a Fortune-500 US software company. The assistant surfaced suggested replies during live customer chats. The phased deployment supplied a clean difference-in-differences identification. The paper is now published in the Quarterly Journal of Economics (vol. 140, no. 2, 2025).1
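
As a reminder of what that identification buys, here is a minimal sketch of the two-by-two difference-in-differences logic behind a phased rollout. The numbers are toy values chosen to mirror the +14 percent headline, not the paper's data or its estimator.

```python
from collections import defaultdict

def did_estimate(rows):
    """rows: (treated, post, outcome) tuples.
    Returns (treated post - treated pre) - (control post - control pre)."""
    sums = defaultdict(lambda: [0.0, 0])  # (treated, post) -> [sum, count]
    for treated, post, y in rows:
        sums[(treated, post)][0] += y
        sums[(treated, post)][1] += 1
    mean = lambda key: sums[key][0] / sums[key][1]
    return (mean((1, 1)) - mean((1, 0))) - (mean((0, 1)) - mean((0, 0)))

# Toy cells: control agents resolve 2.0 issues/hour in both periods;
# treated agents move from 2.0 to 2.28 (+14 percent) after access.
rows = [(0, 0, 2.0), (0, 1, 2.0), (1, 0, 2.0), (1, 1, 2.28)]
print(round(did_estimate(rows), 2))  # 0.28
```

The control group's pre-post difference nets out any common time trend; what survives is the treatment effect. The actual study uses agent-level panel data with fixed effects, but the subtraction is the same.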

The result is one paragraph, and it is the cleanest piece of public evidence the field has for the strong AI-productivity claim in customer-facing white-collar work:

  • +14 percent issues resolved per hour, on average, across treated agents.
  • Almost all of the gain concentrated in the lowest-skill quartile. The top-skill quartile showed essentially no productivity change.
  • Improved customer-transcript sentiment on average.
  • Lower attrition in the treated arm.

If the question were "do generative-AI assistants raise productivity in customer-facing knowledge work," that paragraph is the strongest yes the field has. It is also where the interesting questions begin. The headline number does not say where the surplus went. It does not say what behavioural log the assistant produced or who can contest it. It does not say what repair labour the agents performed when the assistant was wrong. And it does not say whose situation lets a productivity gain convert into mobility versus whose situation makes it a tightening of shifts and intensified review.

This essay names those four silences, gives them testable form, and sketches the data a serious follow-up would collect. The four philosophical lenses (Rawls, Foucault, Marx, Beauvoir) are tools used in passing; the spine is the gap between what one randomized field experiment can identify and what an institutional account of AI productivity actually requires.

Four Silences

Silence 1: where the surplus went. The published paper measures issues resolved per hour and customer-transcript sentiment. It does not measure how the firm divided the productivity gain across the parties that have a claim on it. The full distribution table has at least six columns: agent wages, agent retained jobs, customer prices and quality, firm operating margin, AI-vendor licensing fees, and public-revenue effects through earnings. None of those columns are reported. This is not a fault of the paper; the paper was not the institutional study. It is a limitation of treating the productivity number as a complete account.
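
To make the first silence concrete, here is the six-column ledger as an accounting identity. Every figure below is a placeholder, not an estimate; the point is only that the columns must sum to the total surplus, and that none of them are reported.

```python
# Hypothetical surplus ledger for Silence 1. Units and numbers are
# illustrative placeholders; no column comes from the published paper.
def surplus_shares(ledger):
    """Convert an absolute surplus ledger into shares of the total."""
    total = sum(ledger.values())
    return {party: amount / total for party, amount in ledger.items()}

ledger = {
    "agent_wages": 1.0,
    "agent_jobs_retained": 0.5,
    "customer_price_quality": 1.5,
    "firm_margin": 4.0,
    "vendor_license_fees": 2.0,
    "public_revenue": 1.0,
}
shares = surplus_shares(ledger)
print(round(shares["firm_margin"], 2))  # 0.4
```

Filling in real values for even one deployment would turn "the gain looks Rawlsian" from an inference into a measurement.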

A productivity gain that lands on the lowest-skill quartile looks Rawlsian on its face. The difference principle permits inequalities only when they work to the greatest benefit of the least advantaged, so a gain concentrated in the lowest quartile points in that direction. But "novice agents do better at the task" and "novice agents end up better off in the basic structure" are different claims. The first is supported by the data. The second is silent.

Silence 2: contestability of the assistant's logs. Call-center work was already heavily measured before generative AI arrived: handle time, schedule adherence, post-call survey scores, call recording. The assistant adds two new categories of measurement. The logs of which suggestions were surfaced, accepted, overridden, or ignored are themselves a worker-monitoring record. And the suggestion-acceptance rate becomes a single tractable scalar that, once tractable, tends to become a management metric.

Neither of those moves is wrong by itself. But there is a cliff between a system the agent can inspect, contest, and refuse versus a system whose logs flow only into management dashboards. The paper does not say which side of the cliff the deployment landed on. That is the Foucauldian audit, and it is not legible from the productivity number.
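
Here is the scalar Silence 2 worries about, sketched from an assistant event log. The field names and event taxonomy are illustrative, not from any real deployment; the point is how little code separates "log for debugging" from "per-agent performance metric."

```python
from collections import Counter

def acceptance_rate(events):
    """events: (agent_id, action) pairs, action in
    {'accepted', 'overridden', 'ignored'}. Returns {agent_id: rate}."""
    shown = Counter()
    accepted = Counter()
    for agent, action in events:
        shown[agent] += 1
        if action == "accepted":
            accepted[agent] += 1
    return {agent: accepted[agent] / shown[agent] for agent in shown}

log = [("a1", "accepted"), ("a1", "overridden"), ("a1", "accepted"),
       ("a2", "ignored"), ("a2", "accepted")]
print({a: round(r, 2) for a, r in acceptance_rate(log).items()})
# {'a1': 0.67, 'a2': 0.5}
```

Whether that dictionary lands in an agent-visible dashboard or only in a management one is exactly the cliff the paragraph above describes.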

Silence 3: repair labour absorbed as AI output. The 14 percent gain was not produced by the model alone. Some fraction of the productivity uplift was the agent overriding a wrong suggestion, rephrasing a stilted reply, catching a hallucinated commitment before the customer saw it, or escalating a case the assistant misclassified. That repair labour shows up in the productivity number as if the model produced it. From an accounting standpoint, the firm treats one combined output. From a Marxian standpoint, the question is whether the worker who absorbed the model's failures sees any share of the headline number, or whether the gain accrues to vendor seat-licenses and firm margin while the worker takes on more cognitive load per hour for the same wage.

Silence 4: situated freedom of the agent. Two agents at the same call center can face the same tool from different situations. A part-time agent with caregiving responsibilities, no degree, and a non-portable visa has more to lose if the firm uses the productivity gain to consolidate shifts. A college-student agent has more to gain if the new productivity opens room for transfer or internal promotion. The aggregate gain is real and average attrition fell; neither fact tells us which workers stayed, or whether their staying reflects better work or fewer outside options. Beauvoir's contribution is the reminder that an average over situations obscures the distribution of options that the situations actually allow.

The four silences are not separate issues. They are four different ways the same productivity number under-describes the institution that produced it.

Five Hypotheses

Each silence can be pinned to a falsifiable claim. The next generation of workplace-AI studies should be designed to test these directly.

| Hypothesis | What it predicts | What would falsify it |
|---|---|---|
| H1 | AI assistance compresses the novice learning curve; time to a given productivity level shortens. | Treated novices show no faster ramp than untreated novices on tasks the assistant did not help with. |
| H2 | Compression weakens the wage premium for experienced agents; the hourly-pay advantage of seniority converges toward zero. | Senior wages keep their premium over a 3-to-5-year window despite no productivity gap. |
| H3 | Suggestion-acceptance logs become performance-management data unless contract or regulation explicitly prohibits it. | Two years post-deployment, no employer in the relevant sector cites suggestion-acceptance rates in promotion or termination decisions. |
| H4 | Firms and AI vendors capture most of the productivity surplus unless compensation, bargaining, or regulation intervenes. | Wage data over a 3-to-5-year window shows the surplus split with workers in proportion to the productivity gain. |
| H5 | "Lower attrition" is ambiguous as a worker-welfare signal without satisfaction, outside-option, wage, and scheduling data. | Survey and labour-market data show treated agents report higher satisfaction, comparable outside options, and more predictable schedules. |

H1 is a learning-curve claim. H2 and H4 are claims about the bargaining and compensation tails of the productivity number. H3 is a claim about how surveillance accumulates in the absence of explicit limits. H5 is a claim about the welfare interpretation of an aggregate worker-flow statistic.
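
One way to operationalize the learning-curve claim in H1 is time-to-threshold: how many months of tenure until an agent reaches a given productivity level. The tenure curves below are synthetic and illustrative; a real test would use the panel data described in the next section.

```python
# Hypothetical H1 test sketch: compare time-to-threshold for treated
# vs untreated novices. Curves are invented, not from any study.
def months_to_threshold(curve, threshold):
    """curve: productivity by month of tenure (index 0 = first month).
    Returns the first month at or above threshold, or None if never."""
    for month, productivity in enumerate(curve):
        if productivity >= threshold:
            return month
    return None

untreated = [1.0, 1.3, 1.6, 1.9, 2.2, 2.5]  # slow ramp
treated = [1.0, 1.8, 2.5, 2.7, 2.7, 2.7]    # compressed ramp
print(months_to_threshold(untreated, 2.5),
      months_to_threshold(treated, 2.5))  # 5 2
```

H1 predicts the treated number is smaller; the falsifier in the table is the case where the two numbers match on tasks the assistant never touched.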

What Data You Would Actually Collect

The hypotheses matter only if they can be tested. Here is the data-collection protocol.

| Variable | Measurement | Tests | Source |
|---|---|---|---|
| Wage trajectories | Hourly or annual wage by tenure, pre and post deployment, broken out by skill quartile | H1, H2 | Firm payroll data; matched non-treated firm as counterfactual |
| Suggestion-acceptance rate in HR records | Whether the rate appears in performance reviews, raise decisions, or terminations | H3 | Firm policy documents; HR audits; grievance filings |
| Headcount and substitution | Net employment in the role over 12-36 months; new-hire profile shift | H1, H4 | Public filings; LinkedIn workforce data; trade-press reporting |
| Vendor licensing terms | Per-seat fee; data-rights clauses; agent-log ownership clauses | H4 | Procurement contracts; vendor SEC filings; SOC reports |
| Worker survey | Pre- and post-deployment self-reported satisfaction, outside-option breadth, schedule predictability | H5 | Pre-registered instrument; longitudinal panel |
| Attrition reasons | Why workers leave; whether they leave to better, comparable, or worse outside options | H5 | Exit interviews; matched LinkedIn trajectories; tax-filing earnings |
| Log access rights | Who can read the logs; whether agents can review or contest; whether worker representatives have audit rights | H3 | Firm IT policy; HR rights memos; collective-bargaining language |

That table is the diligence list. It is also the design memo for a follow-up randomized or quasi-experimental study. The published productivity number anchors none of these variables. A reader who comes away with the productivity number alone has read the abstract. A reader who comes away with H1-H5 has the research agenda. A reader who comes away with this table has the protocol.
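
A protocol like this can be checked mechanically: does every hypothesis bind to at least one collected variable? The sketch below encodes the table's rows (the keys are shorthand I chose, not official variable names) and flags any hypothesis left uncovered.

```python
# Coverage check for the protocol table. Keys are shorthand labels
# for the table's rows; "tests" mirrors its Tests column.
PROTOCOL = {
    "wage_trajectories": {"tests": {"H1", "H2"}},
    "acceptance_rate_in_hr_records": {"tests": {"H3"}},
    "headcount_and_substitution": {"tests": {"H1", "H4"}},
    "vendor_licensing_terms": {"tests": {"H4"}},
    "worker_survey": {"tests": {"H5"}},
    "attrition_reasons": {"tests": {"H5"}},
    "log_access_rights": {"tests": {"H3"}},
}

def uncovered(hypotheses, protocol):
    """Return hypotheses not tested by any variable in the protocol."""
    covered = set().union(*(v["tests"] for v in protocol.values()))
    return sorted(set(hypotheses) - covered)

print(uncovered(["H1", "H2", "H3", "H4", "H5"], PROTOCOL))  # []
```

Dropping any single row from the dictionary and re-running the check shows which hypotheses that row was carrying.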

Where the Lenses Land

This is where Rawls, Foucault, Marx, and Beauvoir earn their keep. They are not the headline; they are tools that make the silences legible.

  • Rawls pushes the analyst from "the model is fairer than the old process" to "is the institution this model sits inside justifiable to the least advantaged worker?" Silence 1 is Rawlsian.
  • Foucault insists that measurement is a technique of power, not a neutral tool, and that the question is who can inspect, contest, and refuse the record. Silence 2 is Foucauldian.
  • Marx asks who owns the productive system, who supplies the labour that makes the system work, and who captures the surplus. Silence 3 is Marxian.
  • Beauvoir insists that freedom is lived from within concrete material conditions and that an average across workers obscures the distribution of options. Silence 4 is Beauvoir's.

Read this way, the lenses are diagnostic, not decorative. Each one names a class of variable the productivity literature still owes an answer on.

The Synthesis

The line that ties this essay to the rest of the Philosophy x AI series is short:

A technical system can be corrected by feedback. A worker, applicant, or citizen needs recourse. Do not confuse correction in the model with correction in the institution.

That sentence is the spine. The earlier essays in the series build it up from different directions: Empiricism gave the epistemic version (a model corrected by validation loss is not necessarily a model corrected by deployment); Plato's Cave to the Era of Experience gave the source-and-action version (an interactive system corrected only by a benchmark verifier is corrected within a smaller world than its deployment); the AI Economy and Distribution Question essays gave the institutional version. The Brynjolfsson-Li-Raymond reading in this essay is the empirical version: the gap between what the model loop measured and what the worker experienced is exactly the institutional gap the lenses warn about.

A productivity number is not enough. A research design that names what was not measured, names hypotheses that bind to specific variables, and names the data those variables require is the closest a single essay can come to making the productivity literature corrigible. That is what this essay is for.

References

  • Brynjolfsson, Erik, Danielle Li, and Lindsey R. Raymond. "Generative AI at Work." Quarterly Journal of Economics 140, no. 2 (2025): 889-942. The cleanest randomized field study of generative-AI productivity in customer-facing knowledge work.
  • Selbst, Andrew D., danah boyd, Sorelle A. Friedler, Suresh Venkatasubramanian, and Janet Vertesi. "Fairness and Abstraction in Sociotechnical Systems." FAT* 2019. The canonical statement of why technical interventions abstracted from the surrounding sociotechnical system can fail as fairness instruments.
  • Kellogg, Katherine C., Melissa A. Valentine, and Angèle Christin. "Algorithms at Work: The New Contested Terrain of Control." Academy of Management Annals, 2020. The algorithmic-control synthesis.
  • Ajunwa, Ifeoma. The Quantified Worker: Law and Technology in the Modern Workplace. Cambridge University Press, 2023. Book-length treatment of workplace data collection and the legal lag.
  • Acemoglu, Daron. "The Simple Macroeconomics of AI." NBER Working Paper 32487, 2024. The macro counterweight to the productivity-optimism literature.
  • IMF. "Gen-AI: Artificial Intelligence and the Future of Work." Staff Discussion Note, 2024. The widely-cited 40 percent global / 60 percent advanced-economy exposure estimate.
  • ILO. "Generative AI and Jobs: A Refined Global Index of Occupational Exposure." 2025. The 2025 update to the ILO global exposure index.

Footnotes

  1. Brynjolfsson, Erik, Danielle Li, and Lindsey R. Raymond. "Generative AI at Work." Quarterly Journal of Economics 140, no. 2 (2025): 889-942. Originally NBER Working Paper 31161, 2023. https://academic.oup.com/qje/article/140/2/889/7990658