The Result
In 2023, Erik Brynjolfsson, Danielle Li, and Lindsey R. Raymond ran a randomized rollout of a generative-AI assistant to 5,172 customer-support agents at a Fortune-500 US software company. The assistant surfaced suggested replies during live customer chats. The phased deployment produced a clean difference-in-differences identification.
The result, in one paragraph: average productivity (issues resolved per hour) rose roughly 14 percent across treated agents. Almost all of the gain was concentrated among the least-skilled agents; the highest-skill agents showed no measurable productivity change. Average customer-transcript sentiment improved. Attrition fell. The paper is published in the Quarterly Journal of Economics (Vol. 140, no. 2, 2025).1
That is the cleanest piece of public evidence for the strong AI-productivity claim in customer-facing white-collar work. It is also the worst case for treating the headline number as a complete answer. The next four sections run four audits on the same finding. Each audit names a question the headline number does not address, and each one points to data the published study does not collect.
What This Case Is For
This essay is the corrective to the framework essays earlier in the series. The four lenses (Rawls, Foucault, Marx, Beauvoir) earn their keep only when they bite on a real deployment. The Brynjolfsson-Li-Raymond study is the right anchor for four reasons:
- The productivity number is real and randomized.
- It is the friendliest published case for the optimistic AI-productivity claim.
- It is the worst case for treating that claim as a complete answer.
- Two of the lenses (Foucault and Marx) bind hard regardless of the productivity number's sign.
A case study should expose, not flatter, the framework. This one does.
The Setup, Tersely
Before the lenses, the facts.
| What | Detail |
|---|---|
| Firm | A US software company; full identity not disclosed in the published paper |
| Workforce studied | 5,172 customer-support agents |
| Treatment | Phased rollout of an LLM-based "AI assistant" surfacing recommended replies during customer chats |
| Identification | Staggered rollout supplies a difference-in-differences design with phased treatment |
| Headline outcome | Issues resolved per hour, averaged across all agents |
| Heterogeneity | Gain concentrated in the lowest two skill quartiles; top-quartile agents essentially unchanged |
| Customer sentiment | Net improvement in chat-transcript sentiment scores |
| Attrition | Lower in the treated arm |
| Open question 1 | Whether the productivity gain persists; published follow-up data are limited |
| Open question 2 | Where the productivity surplus went: agent wages, retained jobs, customer pricing, firm margin, AI-vendor licensing |
| Open question 3 | What logs the assistant produced, who could read them, and whether agents could contest them |
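The identification row in the table can be made concrete with a toy sketch. Everything below is simulated for illustration (the agent counts, baseline rates, and 0.28 effect size are invented, not the study's microdata); it reduces the staggered-rollout logic to a single difference-in-differences comparison.

```python
import numpy as np

rng = np.random.default_rng(0)

n_agents, n_months = 400, 12
rollout_month = 6  # treated cohort gets the assistant from month 6 onward

# Simulated panel: agent fixed effects + a shared time trend + noise.
agent_fe = rng.normal(2.0, 0.3, n_agents)       # baseline resolutions/hour
treated = np.arange(n_agents) < n_agents // 2   # first half is the treated cohort
true_effect = 0.28                              # ~14% of a 2.0 baseline

rates = np.empty((n_agents, n_months))
for m in range(n_months):
    rates[:, m] = agent_fe + 0.02 * m + rng.normal(0, 0.2, n_agents)
    if m >= rollout_month:
        rates[treated, m] += true_effect

# 2x2 difference-in-differences: (treated post - pre) - (control post - pre).
# Agent fixed effects cancel within groups; the time trend cancels across groups.
pre, post = slice(0, rollout_month), slice(rollout_month, n_months)
did = (rates[treated, post].mean() - rates[treated, pre].mean()) \
    - (rates[~treated, post].mean() - rates[~treated, pre].mean())
print(f"DiD estimate: {did:.3f} (true effect {true_effect})")
```

A real staggered design would use a two-way fixed-effects or event-study estimator rather than one pre/post split, but the core subtraction is the same idea.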
That is the case. Now the four audits.
The Rawlsian Audit
The headline finding looks Rawlsian on its face. The difference principle prefers arrangements where inequalities are arranged to the greatest benefit of the least advantaged. A productivity gain that lands on the lowest-skill quartile points in that direction.
That reading is too quick. Rawls's actual question is structural, not local. Where does the productivity surplus go?
The published paper measures issues resolved per hour and customer ratings. It does not measure how the firm split the gains. The full distribution table has at least six rows:
| Recipient | What "gain" looks like | Was it measured? |
|---|---|---|
| Agents (raises) | Higher wages tied to higher productivity | Not reported |
| Agents (retained jobs) | Headcount that would otherwise have been cut is preserved | Indirectly via lower attrition; mechanism unclear |
| Customers | Lower prices or improved service | Not reported |
| Firm (shareholders) | Operating margin improvement | Not reported |
| AI vendor | Per-seat licensing revenue | Not reported |
| Public revenue | Tax on increased earnings | Not reported |
None of those gaps is the study's fault; the distribution of the gains was not its research question. They are, however, the structural variables a Rawlsian institutional analysis cares about. "Novice agents do better at the task" and "novice agents end up better off in the basic structure" are different claims; the published evidence supports the first and is silent on the second.
Selbst, Boyd, Friedler, Venkatasubramanian, and Vertesi argue this point sharply for fairness specifically: technical interventions that abstract away the surrounding sociotechnical system can become ineffective or misguided as fairness instruments because the variables that determine real-world outcomes lie at the institutional layer the abstraction excluded.2 The same trap applies to productivity. A randomized field experiment measures the decision-rule-and-tool layer cleanly and is silent on the institutional layer above it.
The Rawlsian recommendation, therefore, is not "do not deploy AI assistants in call centers." It is: the productivity number is informative about the model and uninformative about the institution. The institutional question is a separate empirical project, not a corollary of the model evaluation.
The Foucauldian Audit
Call-center work was already heavily measured before generative-AI assistants arrived. Average handle time. Schedule adherence. Post-call survey scores. Call recording. The supervisor's view of the floor was already a dashboard.
The AI assistant adds two new categories of measurement. First, it sees every keystroke, every recommendation surfaced, every recommendation accepted, every recommendation overridden. The assistant's logs are themselves a worker-monitoring record. Second, "agent uses suggestion X percent of the time" becomes a tractable metric. Once tractable, it tends to become managerial.
Kellogg, Valentine, and Christin synthesize the management-research literature on what they call "algorithmic control": the shift from human supervisors directing work to algorithmic systems directing, monitoring, evaluating, and disciplining workers.3 The call-center assistant is a textbook instance: it directs (recommended replies), monitors (every interaction), evaluates (suggestion-acceptance rate), and ultimately can discipline (the rate becomes a performance metric).
Ifeoma Ajunwa's The Quantified Worker is the closest book-length treatment of what happens when this kind of measurement scales: every aspect of the worker becomes a continuous data stream, the legal and policy framework lags the technology by years, and the asymmetry between what the firm sees and what the worker can see widens.4
The Foucauldian audit does not say "surveillance bad." Some measurement protects workers. Recording is necessary for safety and for unpaid-overtime claims. The audit asks five harder questions.
| Question | What the published study says |
|---|---|
| Who can read the assistant's logs? | Not specified; presumably management. |
| Can the agent inspect the record before a performance review? | Not specified. |
| Can the agent contest a measurement they believe is wrong? | Not specified. |
| Are the logs admissible against the agent in a termination dispute? | Not specified. |
| What measurement boundaries does the firm refuse to cross even with technical capability? | Not specified. |
Five "not specifieds." That is exactly Foucault's point. The productivity result is published; the governance regime around the productivity result is not. The dashboard precedes its rules.
The recommendation that follows is concrete: the assistant's worker-facing artifact is a different system from the assistant's manager-facing artifact, and they should be specified separately. If there is no contestability, do not call it accountability. Call it control. That phrasing is from the framework essay; it lands harder here, on a real product, than it did as an abstract heuristic.
The Marxian Audit
The 14 percent productivity gain was not produced by the model alone. It was produced by:
- the contact-center agents using the assistant (the workers in the study)
- the engineers who built the assistant (employees of the AI vendor or of the firm)
- the data labellers who annotated training examples (often outsourced to lower-wage labour markets)
- the platform vendor selling the assistant as a per-seat license
- the cloud provider charging per inference call
- the prior generation of customer-support workers whose conversation logs trained the assistant in the first place
Mary Gray and Siddharth Suri's Ghost Work is the canonical reference for the data-labelling and prior-worker items on that list: the human labour that is structurally invisible inside an "AI" product, scoped out of the labour-cost line, and counted as part of the model's apparent capability.5 The productivity gain reads cleanly in the headline because the data-labelling layer is already absorbed into the vendor's cost basis and the historical-conversation layer is absorbed into the training data.
A subtler form of absorbed labour shows up at the agents' own desks. The assistant recommends; the agent accepts, edits, or overrides. When the recommendation is wrong and the agent corrects it, the conversation succeeds and the productivity number moves up. The repair work is real labour and it is what made the metric look clean. In the Marxian frame, that repair labour was absorbed into "AI productivity." The agent's intelligence is part of the product specification of the AI assistant whether or not the firm names it as such.
The political question (whose surplus is this?) has a determinate answer only if the institutional context is specified. Three illustrative scenarios on the same productivity number:
| Scenario | Where the 14% lands | Distribution outcome |
|---|---|---|
| Tight labour market, strong worker bargaining | Wages rise; firm absorbs vendor licensing as ordinary cost | Agents capture a meaningful share |
| Loose labour market, vendor lock-in | Headcount cut by ~14%; remaining wages flat; vendor margin grows | Firm and vendor split the surplus; agents lose |
| Public-sector contracted call centers, multi-year procurement | Firm bills public buyer flat; productivity gain becomes vendor margin | Public buyer overpays for unchanged service; agents and citizens lose |
The same model and the same productivity number produce three different surplus distributions because the institution sitting around the model is different. That is not a Marxist slogan; it is the clean reading of the published evidence.
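The scenario table reduces to toy arithmetic. Every dollar figure and every share below is an assumption invented for illustration; only the 14 percent gain and the 5,172 headcount come from the study.

```python
# Hypothetical surplus arithmetic for the three scenarios above.
BASELINE_COST_PER_AGENT = 45_000   # assumed fully loaded annual cost, USD
PRODUCTIVITY_GAIN = 0.14           # the study's headline gain
N_AGENTS = 5_172                   # the study's workforce size

# Surplus: value of the extra work the same headcount now produces.
surplus = BASELINE_COST_PER_AGENT * PRODUCTIVITY_GAIN * N_AGENTS

scenarios = {
    # Shares of the surplus captured by each party (assumed; each row sums to 1).
    "tight labour market, strong bargaining": {"agents": 0.50, "firm": 0.35, "vendor": 0.15},
    "loose labour market, vendor lock-in":    {"agents": 0.00, "firm": 0.55, "vendor": 0.45},
    "platform-mediated contracting":          {"agents": 0.00, "firm": 0.10, "vendor": 0.90},
}

for name, shares in scenarios.items():
    split = {party: round(share * surplus) for party, share in shares.items()}
    print(f"{name}: {split}")
```

The point of the exercise is that the first line (the surplus) is identical in all three scenarios; only the institutional parameters in the dictionary change.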
The Beauvoirian Audit
Two agents at the same call-center see the same tool from different situations.
Agent A is a part-time worker and a single parent, with two years' tenure, no four-year degree, a work visa tied to the current employer, and a non-portable schedule because of childcare. Agent B is a college student working evenings, lives at home, treats the part-time wage as supplemental income, and expects to leave the job within a year. Both receive the assistant. Both improve productivity by the average amount.
The productivity number is the same. The freedom that productivity number creates is not.
Agent A's gain looks like a margin of safety: the metric improves, attrition risk falls, the visa renewal is more likely, the schedule stays. Agent A's loss, if the firm uses the productivity gain to consolidate shifts and reduce headcount, is a category beyond compensation. It includes the visa, the school district the kids are in, the proximity to extended family. None of those are reflected in any productivity dashboard.
Agent B's gain looks like upward optionality: the metric improves, the dossier is stronger, the leverage to ask for a transfer or to leave is greater. Agent B's loss is bounded; another job is available, the rent is low, the timeline is short.
Beauvoir's argument is that the average is not wrong; the average is incomplete. Whose situation lets them turn this gain into mobility, and whose situation makes it a tightening of shifts? That is a different question from "did productivity rise?" and it is the question that decides whether the deployment counts as humane. The published study does not have demographic data adequate to answer it; this is a research opportunity, not a published result.
A second-order Beauvoirian question: what happens to the agents who would have been hired but no longer are, because the AI assistant compressed the headcount needed to handle the same call volume? They are the most invisible category in any AI-productivity study because they were never on the firm's payroll. The productivity number is silent on them by construction.
What the Four Audits Tell You, Together
Two of the lenses (Rawls, Beauvoir) push toward "wait, look at distribution and situation." Two of them (Foucault, Marx) push toward "the productivity number under-describes what changed." That is the right way to use the lenses: as four different audits run on the same firm-level claim, not as one consensus diagnosis.
A summary of where each lens lands:
| Lens | What the headline number does not tell you |
|---|---|
| Rawls | Where the surplus went: agent wages, firm margin, customer prices, vendor fees, public revenue |
| Foucault | Whether agents can inspect, contest, or refuse the assistant's logs as performance evidence |
| Marx | Whether the productivity gain absorbs worker repair labour and prior-worker conversation logs as if they were the model's output |
| Beauvoir | Whose situation lets them turn the productivity gain into mobility, and whose situation makes it a tightening of shifts |
The right next deployment question is therefore not "should we ship the assistant?" It is which of these four audit questions the firm has answered, and which it is silent on. The silences are policy. Not having a specified contestability path is itself a Foucauldian decision; not specifying where the surplus goes is itself a Marxian decision; not measuring situated impact is itself a Beauvoirian decision.
Three Plausible Futures, Same Productivity Result
The published study is point-in-time. Two years later there are three plausible futures, all consistent with the same underlying technology.
Future 1: integrated augmentation. The firm formalizes the assistant's logs as worker-visible. The contestability path is a real grievance procedure with a human reviewer. The productivity gain is partly absorbed into wage growth and partly into reducing the worst-quintile of customer-experience problems. Headcount holds. The vendor licensing is one of several costs the firm renegotiates. The agents who were lowest-skill move up the skill ladder partly because the assistant made the ladder cheaper to climb.
Future 2: managerial intensification. The assistant's logs become the basis of weekly performance reviews. Agents who override the recommendation too often or accept it too often are flagged. The productivity gain accrues mostly to the firm and the vendor. Headcount falls 12 percent. Attrition continues to fall, but partly because the labour-market alternative for the displaced agents is worse. The bottom quartile is helped relative to not using the tool and is hurt relative to a counterfactual deployment where the same gain landed in wages.
Future 3: platform-mediated invisibility. The firm contracts the call-center function to a vendor whose offering bundles the assistant. The vendor charges per resolved ticket. Agents become contractors of the vendor rather than employees of the firm. Almost all of the productivity gain becomes vendor margin. Worker-facing logs are now governed by the vendor's terms of service. The agents are technically more productive and technically less directly supervised, and they have no way to contest a single record.
All three futures are consistent with the published 14 percent. The framework essays in this series explain why. The case study makes the choice between them concrete enough to argue about.
Where the Case Breaks
Limits on the use of this case as a stand-in for "AI in the workplace":
- The published evidence is one industry, one firm, one type of work, in one labour market. The productivity gain in customer support is unusually clean to measure; in software engineering, the gain is contested and depends heavily on task type; in nursing, the assistant's product-shape would be entirely different.
- The lowest-skill-gains-most pattern may not reproduce. Subsequent deployments may show productivity gains accruing to higher-skill workers, no aggregate gain, or productivity gains paired with quality declines that compensating measurement did not catch.
- The customer-sentiment improvement is from chat-transcript sentiment scores, which are themselves model outputs and inherit the same alignment limitations as the assistant.
- The firm's institutional choices in this case (logs, contestability, surplus split) are private; the analysis above describes plausible options, not specific firm policies.
The case study earns its conclusions only when the limits are stated. The four lenses still produce four audits.
The Generalization
A reader can re-run this exercise on four other domains. The same four columns apply; the binding lens changes.
| Domain | Which lens binds hardest |
|---|---|
| Résumé screening at scale | Rawls: the system rations opportunity, and the binding question is whether the institutional pipeline that uses the model gives the least-advantaged candidate a route in |
| Warehouse worker monitoring | Foucault: measurement is the product, the contestability path is the design problem, and worker bargaining about which behaviours count is more important than the model's accuracy |
| AI coding assistants | Marx: the question of who captures the productivity gain over five years (junior workers? senior workers? employers? cloud vendors? open-source maintainers?) is genuinely open and depends on bargaining outside the model |
| Care work / nursing assistants | Beauvoir: situated freedom is the binding axis, with the question of whether the assistant gives the worker more discretion or restructures the workflow into more compliance |
That is the practical version of "make the lenses argue." Different domains foreground different lenses. Which domain you choose decides which lens is doing the most work.
Five Hypotheses That the Headline Number Cannot Answer
The case study is most useful as a research design rather than as a settled verdict. Each lens turned a silence into a hypothesis the published evidence cannot adjudicate. Naming the hypotheses explicitly is what makes the framework portable.
| Hypothesis | What it predicts | What would falsify it | Lens |
|---|---|---|---|
| H1 | AI assistance compresses the novice learning curve: time to a given productivity level shortens. | Treated novices show no faster ramp than untreated novices on tasks the assistant did not help with. | Rawls / Marx |
| H2 | Compression weakens the wage premium for experienced agents: the productivity advantage of experience converges toward zero. | Experienced wages keep their premium; senior agents retain higher per-hour pay despite no productivity gap. | Marx |
| H3 | Suggestion-acceptance logs become performance-management data unless explicitly prohibited by contract or regulation. | Two years post-deployment, no employer in the relevant sector cites suggestion-acceptance rates in promotion or termination decisions. | Foucault |
| H4 | Firms and AI vendors capture most of the productivity surplus unless compensation, bargaining, or regulation intervene. | Wage data over a 3-5 year window shows the surplus split with workers in proportion to the productivity gain. | Marx / Rawls |
| H5 | "Lower attrition" is ambiguous in worker welfare without satisfaction, outside-option, wage, and scheduling data. | Survey and labour-market data show treated agents report higher satisfaction, comparable outside options, and more predictable schedules. | Beauvoir |
The Brynjolfsson, Li, and Raymond paper does not test any of these directly; that was not its job. The hypotheses are testable and falsifiable, and they require longitudinal wage data, contract analysis, regulatory filings, worker surveys, and audit access to suggestion-log usage. They are the kind of follow-up work the productivity literature still owes.
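As one concrete illustration, H2 reduces to a comparison a follow-up study could run on wage panels: does the senior/novice wage premium shrink after deployment? The wage numbers below are invented placeholders, not data from the study or any firm.

```python
import statistics

# Hypothetical hourly wages, pre- and post-deployment, by experience group.
senior_pre, novice_pre = [24.0, 25.5, 23.8, 26.1], [17.0, 16.5, 17.8, 16.9]
senior_post, novice_post = [24.2, 25.0, 24.1, 25.8], [18.5, 19.0, 18.2, 19.4]

def premium(senior, novice):
    """Senior wage premium as a fraction of the novice mean wage."""
    s, n = statistics.mean(senior), statistics.mean(novice)
    return (s - n) / n

pre = premium(senior_pre, novice_pre)
post = premium(senior_post, novice_post)
print(f"premium pre: {pre:.2%}, post: {post:.2%}")

# H2 predicts post < pre (premium erosion); it is falsified if the premium holds.
h2_consistent = post < pre
```

A serious version would control for tenure, hours, and firm fixed effects; the sketch only shows that the hypothesis pins to a single comparable statistic.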
What Data You Would Actually Collect
The reason H1–H5 matter is that each of them pins to a specific, collectable piece of evidence. A serious follow-up study to the Brynjolfsson, Li, and Raymond paper would gather the following:
| Variable | Measurement | Test for | Source |
|---|---|---|---|
| Wage trajectories | Hourly or annual wage by tenure, before and after deployment, broken out by skill quartile | H1 (novice ramp), H2 (premium erosion) | Firm payroll data; counterfactual from a matched non-treated firm |
| Suggestion-acceptance rate as performance evidence | Whether the rate appears in performance reviews, raise decisions, or termination decisions | H3 (logs become management data) | Firm policy documents; HR file audits; legal filings if grievances surface |
| Headcount and substitution | Net employment in the role over 12-36 months; new-hire profile | H1, H4 (surplus capture) | Public filings; LinkedIn-scrape for firm headcount; trade-press reporting |
| Vendor licensing terms | Per-seat license fee; data-rights clauses; clauses on agent log ownership | H4 (surplus to vendor) | Procurement contracts; vendor SEC filings; SOC reports |
| Worker survey | Self-reported satisfaction, outside-option breadth, schedule predictability | H5 (attrition welfare) | Pre- and post-deployment instruments; ideally pre-registered |
| Attrition reasons | Why workers leave; whether they leave to comparable, better, or worse outside options | H5 | Exit-interview data; matched LinkedIn-trajectory data; tax-filing-based earnings before and after |
| Suggestion-log access | Who can read the logs; whether agents can request review; whether unions or works councils have audit rights | H3 (Foucauldian audit) | Firm IT policy; HR rights memos; collective-bargaining language |
That table is the data-collection plan a researcher would actually need to write before running the follow-up study. It is also the diligence list a worker representative or labour regulator would need to review the deployment honestly. The published productivity number anchors none of these variables. A reader who comes away with the productivity number alone has read the abstract. A reader who comes away with H1–H5 has the research agenda. A reader who comes away with this table has the protocol.
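One way to make the protocol concrete is to fix the record shape before collection begins. The sketch below is an assumed schema for one agent-quarter record mirroring the table's variables; every field name is an invention for illustration, not the study's schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AgentQuarterRecord:
    """Hypothetical follow-up record: one agent, one quarter."""
    agent_id: str
    quarter: str                     # e.g. "2024Q3"
    skill_quartile: int              # 1 = lowest, 4 = highest, at baseline
    hourly_wage: float               # H1, H2: wage trajectories
    treated: bool                    # assistant deployed for this agent
    resolutions_per_hour: float      # the study's headline metric
    suggestion_acceptance_rate: Optional[float] = None  # None if logs withheld
    acceptance_rate_in_review: Optional[bool] = None    # H3: cited in reviews?
    satisfaction_score: Optional[float] = None          # H5: survey instrument
    separated: bool = False
    separation_reason: Optional[str] = None             # H5: exit-interview data

record = AgentQuarterRecord(
    agent_id="a-0001", quarter="2024Q3", skill_quartile=1,
    hourly_wage=17.50, treated=True, resolutions_per_hour=2.3,
)
```

The Optional fields are the point: each None marks a variable the firm can withhold, which is exactly where the governance questions from the Foucauldian audit live.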
Why This Matters for the Series
This essay is the case study the series needed. The lenses arrive at different recommendations. The published evidence is real and is also incomplete in specific, traceable ways. The framework does not collapse under the weight of a real example; it gets sharper.
The synthesis the series has been building toward fits in one line:
A technical system can be corrected by feedback. A worker needs recourse. Do not confuse correction in the model with correction in the institution.
That distinction is the through-line. Locke / Hume / Kant gave the epistemic correction story (essay 1). Plato gave the source-and-experience correction story (essay 2). Rawls / Foucault / Marx / Beauvoir give the institutional correction story (essays 4 and 5). The case study (this essay) shows the gap between the two senses of "correction" in one concrete domain. The next essay in the series, The Right to Correct the Record (forthcoming), is the explicit synthesis: a model can be corrected by feedback; a worker, applicant, or citizen needs recourse, contestability, and power.
This case study is the bridge.
Sources
The anchor study:
- Brynjolfsson, Erik, Danielle Li, and Lindsey R. Raymond. "Generative AI at Work." NBER Working Paper 31161, 2023; later published in Quarterly Journal of Economics, 2025. https://academic.oup.com/qje/article/140/2/889/7990658
Workplace and algorithmic-control sources (new in this essay relative to the rest of the series):
- Selbst, Andrew D., danah boyd, Sorelle A. Friedler, Suresh Venkatasubramanian, and Janet Vertesi. "Fairness and Abstraction in Sociotechnical Systems." FAT* 2019. The canonical statement of why fairness interventions abstracted from the surrounding sociotechnical system can fail as fairness instruments.
- Kellogg, Katherine C., Melissa A. Valentine, and Angèle Christin. "Algorithms at Work: The New Contested Terrain of Control." Academy of Management Annals, 2020. Synthesizes the algorithmic-control literature: directing, evaluating, disciplining, and tracking workers.
- Ajunwa, Ifeoma. The Quantified Worker: Law and Technology in the Modern Workplace. Cambridge University Press, 2023. Book-length treatment of workplace data collection and the legal lag.
- Gray, Mary L., and Siddharth Suri. Ghost Work. Houghton Mifflin Harcourt, 2019. Hidden human labour inside apparently autonomous AI products.
Frame essays in this series:
Internal Links
- PhilosophyPath:
- empiricism-induction-and-llm-limits. Essay 1.
- platos-cave-and-the-era-of-experience. Essay 2.
- platos-cave-and-the-era-of-experience. Essay 3.
- four-lenses-on-the-ai-economy. Essay 4.
- ai-productivity-and-the-distribution-question. Essay 5.
- TheoremPath:
- PedagogyPath (forthcoming): workplace-learning, AI-literacy.
Footnotes
1. Brynjolfsson, Erik, Danielle Li, and Lindsey R. Raymond. "Generative AI at Work." Quarterly Journal of Economics 140, no. 2 (2025): 889–942. Originally NBER Working Paper 31161, 2023. https://academic.oup.com/qje/article/140/2/889/7990658 ↩
2. Selbst, Andrew D., danah boyd, Sorelle A. Friedler, Suresh Venkatasubramanian, and Janet Vertesi. "Fairness and Abstraction in Sociotechnical Systems." FAT* 2019, January 2019. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3265913 ↩
3. Kellogg, Katherine C., Melissa A. Valentine, and Angèle Christin. "Algorithms at Work: The New Contested Terrain of Control." Academy of Management Annals 14, no. 1 (2020): 366–410. https://journals.aom.org/doi/10.5465/annals.2018.0174 ↩
4. Ajunwa, Ifeoma. The Quantified Worker: Law and Technology in the Modern Workplace. Cambridge University Press, 2023. ↩
5. Gray, Mary L., and Siddharth Suri. Ghost Work: How to Stop Silicon Valley from Building a New Global Underclass. Houghton Mifflin Harcourt, 2019. ↩