What 50+ data team leaders told us about scaling agents to 1,000s of tables — and the architecture pattern that actually works in production.
Field research from 50+ data teams
12 months of conversations with 50+ data team leaders.
They shared that more tables lead to significantly higher error rates.
Ontology graphs encode rich semantics.
Text-to-SQL agents can only consult them as reference.
Most of the graph's structural power is unused.
A cybersecurity firm built a system across 1,500 tables using PuppyGraph.
Iceberg-native. Zero ETL. Customer-facing.
50+ data team leaders we've spoken with, spanning financial services, security, networking, retail, and semiconductors.
Natural language question → Context engineering → Generated query → Execution → Back to user
At small schema sizes, this works well.
At enterprise scale, context exceeds prompt limits, joins compound, and semantic ambiguity multiplies.
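One rough way to see why joins compound: the number of table pairs an agent could try to join grows quadratically with the table count. The sketch below assumes every pair of tables is a potential join candidate. That is an illustrative upper bound, not a figure from the field research, and `candidate_join_pairs` is a hypothetical helper.

```python
from math import comb


def candidate_join_pairs(table_count: int) -> int:
    """Upper bound on distinct two-table joins: n choose 2.

    Assumption: any two tables could be joined. Real schemas allow far
    fewer meaningful joins, which is exactly what the agent has to guess.
    """
    return comb(table_count, 2)


for n in (10, 100, 1500):
    print(f"{n} tables -> {candidate_join_pairs(n)} candidate pairs")
```

At 10 tables there are 45 candidate pairs. At 1,500 tables there are more than a million, and multi-hop joins multiply the space further.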
SQL assumes the analyst already knows which joins make sense. Replace the analyst with an agent, and that knowledge vanishes — every patch is an attempt to put it back.
Lives OUTSIDE the query engine.
Lives INSIDE the query engine.
A map is reference knowledge: useful, but ignorable. A railway is structural: the agent can only travel where the tracks go.
3-hop joins in SQL: ~15 lines.
Same in Cypher: 1 line.
Not a new language for the LLM.
It's a well-represented one.
Agents write Cypher.
Humans ask in English.
For agent-generated queries on graph-shaped data, Cypher is the lower-friction path. Humans don't need to learn it — they don't write it.
MATCH (r:Role)-[:ALLOWS_ACCESS_TO]->(res:Resource)
WITH r, count(res) AS permissionCount
WHERE permissionCount > 4
MATCH path = (vm:VMInstance)-[ar:ASSIGNED_ROLE]->(r)
             -[at:ALLOWS_ACCESS_TO]->(res:Resource)
RETURN vm, ar, r, at, res
WITH role_permissions AS (
  SELECT r.role_id
  FROM Roles r
  JOIN RoleResourceAccess rra ON r.role_id = rra.role_id
  GROUP BY r.role_id
  HAVING COUNT(DISTINCT rra.resource_id) > 4
)
SELECT vm.*, r.*, res.*
FROM role_permissions rp
JOIN Roles r ON r.role_id = rp.role_id
JOIN VMInstances vm ON vm.role_id = r.role_id
JOIN RoleResourceAccess rra ON rra.role_id = r.role_id
JOIN Resources res ON res.resource_id = rra.resource_id;
Agent attempts to retrieve student grades alongside teacher salary data — a join that violates business meaning. Production deployment becomes real when agents recover autonomously, without silent wrong answers.
SELECT s.name, s.grade, sal.salary FROM students s JOIN salaries sal ON s.id = sal.person_id;
No error message is returned. Students now appear with salaries they should not have.
Nothing went wrong.
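The silent failure is easy to reproduce. The sketch below uses Python's built-in `sqlite3` module with an in-memory database. The table layout and rows are hypothetical: they assume student IDs and staff IDs come from separate sequences and happen to collide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT, grade TEXT);
    CREATE TABLE salaries (person_id INTEGER, salary INTEGER);
    -- Hypothetical data: student 1 and staff member 1 share an ID by accident.
    INSERT INTO students VALUES (1, 'Ada', 'A');
    INSERT INTO salaries VALUES (1, 90000);
    """
)

# The join is valid SQL, so the engine runs it without complaint.
rows = conn.execute(
    "SELECT s.name, s.grade, sal.salary "
    "FROM students s JOIN salaries sal ON s.id = sal.person_id"
).fetchall()

print(rows)  # [('Ada', 'A', 90000)]
conn.close()
```

The engine has no way to know that `person_id` refers to staff rather than students, so the result looks like any other successful query.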
MATCH (s:Student)-[:HAS_SALARY]->(sal:Salary) RETURN s.name, s.grade, sal.salary
No edge 'HAS_SALARY' exists between 'Student' and 'Salary'.
Salary is semantically out of scope from Student.
Wrong joins. Silent wrong answers.
Business rules built into the query structure. Wrong queries are structurally impossible to express, not just discouraged.
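The idea can be sketched in a few lines of plain Python: a traversal is checked against the set of edges the schema defines, and anything else is rejected with an explicit error the agent can act on. This is an illustration only, not PuppyGraph's implementation. `ALLOWED_EDGES`, `SchemaError`, `check_hop`, and `try_hop` are hypothetical names, and the toy schema is an assumption.

```python
# Hypothetical graph schema: (source label, edge type, target label).
ALLOWED_EDGES = {
    ("Student", "ENROLLED_IN", "Course"),
    ("Teacher", "TEACHES", "Course"),
    ("Teacher", "HAS_SALARY", "Salary"),
}


class SchemaError(Exception):
    """Raised when a query uses an edge the schema does not define."""


def check_hop(source: str, edge: str, target: str) -> None:
    """Reject any hop that is not part of the schema."""
    if (source, edge, target) not in ALLOWED_EDGES:
        raise SchemaError(
            f"No edge '{edge}' exists between '{source}' and '{target}'."
        )


def try_hop(source: str, edge: str, target: str) -> str:
    """Return 'ok' for a valid hop, or the error text an agent would see."""
    try:
        check_hop(source, edge, target)
        return "ok"
    except SchemaError as err:
        return str(err)


print(try_hop("Teacher", "HAS_SALARY", "Salary"))
print(try_hop("Student", "HAS_SALARY", "Salary"))
```

The second call fails loudly rather than returning rows, which gives the agent a concrete signal to revise its query.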
Enterprise data = massive & spread out.
Scale and reach in one storage:
Too slow for real-time agents.
MPP architecture and vectorized execution. Subsecond response for multi-hop traversals — agents don't wait.
Iceberg-native by design. Federate across whatever else your enterprise runs on.
Query Iceberg natively. Data never leaves your storage.
Your lakehouse can be the analytical core without forcing a full migration of operational data.
Built a "Glean++" agent to cut IT support costs by reducing human-in-loop ticket resolution time. Using it as the blueprint for company-wide AI overhaul.
"This work is a strong example of how we're operationalizing AI and data across the enterprise — building the foundation for more autonomous capabilities ahead."
Hasmukh Ranjan · CIO @ AMD
It gave AMD's agent the context to reason, transforming a static chatbot into a self-learning knowledge engine.
Only PuppyGraph delivered the scale, speed, and live reasoning AMD needed to put enterprise AI into production.
Built a graph-powered digital cyber twin. Agentic workflow to dynamically compute attack vectors, blast radius, and threat paths.
"We process billions of edges on a daily basis using PuppyGraph. It's one of the products in hindsight to ask, how is this just now?"
Leon Goldberg · CTO @ Sola Security
This enables AI agents to access context and answer questions like:
Which users have admin access to my cloud and SaaS apps?
How does a change in security settings affect my environment?
Do attack paths exist to system-critical resources?
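The last question is a graph reachability check. The sketch below answers it with a breadth-first search over a toy adjacency list in plain Python. The node names and `find_path` are hypothetical. In a real deployment the traversal would run inside the graph engine as a query, and this does not represent Sola Security's implementation.

```python
from collections import deque

# Hypothetical access graph: each node lists what it can reach directly.
EDGES = {
    "user:alice": ["role:admin"],
    "role:admin": ["vm:web-1"],
    "vm:web-1": ["db:payments"],
    "user:bob": ["role:viewer"],
    "role:viewer": [],
}


def find_path(graph, start, target):
    """Return the shortest path from start to target, or None if unreachable."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == target:
            return path
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None


print(find_path(EDGES, "user:alice", "db:payments"))
print(find_path(EDGES, "user:bob", "db:payments"))
```

A returned path is an attack path to the critical resource. `None` means no chain of access edges leads there from that starting point.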