ICLR 2026 Orals

Actions Speak Louder than Prompts: A Large-Scale Study of LLMs for Graph Inference

Ben Finkelshtein, Silviu Cucerzan, Sujay Kumar Jauhar, Ryen W. White

LLMs & Reasoning Thu, Apr 23 · 10:54 AM–11:04 AM · 203 A/B Avg rating: 5.50 (2–8)
Author-provided TL;DR

A comprehensive study of LLMs for node classification, providing a principled understanding of their capabilities in processing graph information that practitioners can apply in real-world tasks

Abstract

Large language models (LLMs) are increasingly leveraged for text-rich graph machine learning tasks, with node classification standing out due to its high-impact application domains such as fraud detection and recommendation systems. Yet, despite a surge of interest, the field lacks a principled understanding of the capabilities of LLMs in processing graph data. In this work, we conduct a large-scale, controlled evaluation across the key axes of variability: the LLM-graph interaction mode, comparing prompting, tool-use, and code generation; dataset domains, spanning citation, web-link, e-commerce, and social networks; homophilic vs. heterophilic regimes; short- vs. long-text features; LLM sizes and reasoning capabilities. We further analyze dependencies by independently truncating features, deleting edges, and removing labels to quantify reliance on input types. Our findings provide actionable guidance for both research and practice. (1) Code generation mode achieves the strongest overall performance, with especially large gains on long-text or high-degree graphs where prompting quickly exceeds the token budget. (2) All interaction strategies remain effective on heterophilic graphs, challenging the assumption that LLM-based methods collapse under low homophily. (3) Code generation mode is able to flexibly shift its reliance to the most informative input type, whether that be structure, features, or labels. Together, these results establish a clear picture of the strengths and limitations of current LLM–graph interaction modes and point to design principles for future methods.
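The contrast between the interaction modes compared in the abstract can be illustrated with a minimal sketch. Prompting serializes a node's ego-network into text, so its cost grows with node degree and feature length and can exceed the token budget; code generation instead lets the model emit a program that queries the graph directly. All function and field names below are hypothetical, not the authors' implementation:

```python
from collections import Counter

# Hypothetical graph: node id -> {'text': ..., 'neighbors': [...], 'label': ...}

def build_prompt(graph, node, max_chars=500):
    """Prompting mode: flatten the target node's text plus its neighbors'
    texts and labels into one string. On long-text or high-degree graphs
    this string must be truncated to fit a token budget."""
    parts = [f"Target node: {graph[node]['text']}"]
    for nbr in graph[node]['neighbors']:
        label = graph[nbr].get('label', 'unknown')
        parts.append(f"Neighbor ({label}): {graph[nbr]['text']}")
    return "\n".join(parts)[:max_chars]

def generated_classifier(graph, node):
    """Code-generation mode: the kind of program an LLM might emit, here a
    simple majority vote over labeled neighbors. It traverses the graph
    programmatically, so no text serialization or truncation is needed."""
    labels = [graph[n]['label'] for n in graph[node]['neighbors']
              if 'label' in graph[n]]
    return Counter(labels).most_common(1)[0][0] if labels else None
```

The sketch also shows why code generation can shift its reliance between input types: the emitted program can ignore truncated features and lean on structure and labels instead.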

One-sentence summary · Auto-generated by claude-haiku-4-5-20251001

Large-scale study comparing LLM-graph interaction modes for node classification, finding code generation outperforms prompting on long-text and high-degree graphs.

Contributions · Auto-generated by claude-haiku-4-5-20251001
  • Comprehensive controlled evaluation across prompting, tool-use, and code generation modes for LLM-graph interaction
  • Analysis showing code generation achieves strongest performance especially on long-text or high-degree graphs
  • Dependency analysis revealing LLMs can flexibly shift reliance to most informative input type
Methods used · Auto-generated by claude-haiku-4-5-20251001
  • Prompting
  • ReAct-style tool use
  • Graph-as-Code generation
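The dependency analysis described in the abstract independently degrades one input type at a time (truncating features, deleting edges, removing labels) to quantify how much a method relies on each. A minimal sketch of that protocol, using an illustrative dict-based graph and hypothetical names rather than the authors' exact setup:

```python
import copy

def ablate(graph, mode, keep_chars=10):
    """Return a copy of `graph` with one input type degraded.
    mode='features' truncates every node's text, mode='edges' deletes all
    edges, mode='labels' removes all labels. The original graph is left
    untouched so the three ablations can be run independently."""
    g = copy.deepcopy(graph)
    for node in g.values():
        if mode == 'features':
            node['text'] = node['text'][:keep_chars]   # truncate features
        elif mode == 'edges':
            node['neighbors'] = []                     # delete edges
        elif mode == 'labels':
            node.pop('label', None)                    # remove labels
    return g
```

Evaluating the same classifier on each ablated copy, and comparing accuracy drops, is one way to measure reliance on structure, features, and labels separately.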
Datasets used · Auto-generated by claude-haiku-4-5-20251001
  • citation networks
  • web-link networks
  • e-commerce networks
  • social networks
Limitations (author-stated) · Auto-generated by claude-haiku-4-5-20251001

Authors did not state explicit limitations.

Future work (author-stated) · Auto-generated by claude-haiku-4-5-20251001

Authors did not state explicit future directions.

Author keywords

  • Large Language Models
  • Prompting
  • In-Context Learning
  • Tool-augmented Reasoning
  • Text-rich Graphs
