Actions Speak Louder than Prompts: A Large-Scale Study of LLMs for Graph Inference
Ben Finkelshtein, Silviu Cucerzan, Sujay Kumar Jauhar, Ryen W. White
A comprehensive study of LLMs for node classification, providing a principled understanding of their capabilities in processing graph information that practitioners can apply in real-world tasks.
Abstract
Large language models (LLMs) are increasingly leveraged for text-rich graph machine learning tasks, with node classification standing out due to its high-impact application domains such as fraud detection and recommendation systems. Yet, despite a surge of interest, the field lacks a principled understanding of the capabilities of LLMs in processing graph data. In this work, we conduct a large-scale, controlled evaluation across the key axes of variability: the LLM-graph interaction mode, comparing prompting, tool-use, and code generation; dataset domains, spanning citation, web-link, e-commerce, and social networks; homophilic vs. heterophilic regimes; short- vs. long-text features; and LLM sizes and reasoning capabilities. We further analyze dependencies by independently truncating features, deleting edges, and removing labels to quantify reliance on each input type. Our findings provide actionable guidance for both research and practice. (1) Code generation mode achieves the strongest overall performance, with especially large gains on long-text or high-degree graphs where prompting quickly exceeds the token budget. (2) All interaction strategies remain effective on heterophilic graphs, challenging the assumption that LLM-based methods collapse under low homophily. (3) Code generation mode is able to flexibly shift its reliance to the most informative input type, whether that be structure, features, or labels. Together, these results establish a clear picture of the strengths and limitations of current LLM–graph interaction modes and point to design principles for future methods.
Large-scale study comparing LLM-graph interaction modes for node classification, finding code generation outperforms prompting on long-text and high-degree graphs.
- Comprehensive controlled evaluation across prompting, tool-use, and code generation modes for LLM-graph interaction
- Analysis showing code generation achieves strongest performance especially on long-text or high-degree graphs
- Dependency analysis revealing LLMs can flexibly shift reliance to most informative input type
Interaction modes
- Prompting
- ReAct-style tool use
- Graph-as-Code generation
Dataset domains
- citation networks
- web-link networks
- e-commerce networks
- social networks
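The interaction modes differ in how the LLM touches the graph: prompting serializes the graph into the context window, while in the code-generation (Graph-as-Code) mode the model emits a program that queries the graph directly, sidestepping the token budget on long-text or high-degree graphs. A minimal, hypothetical sketch of the kind of classifier such a mode might produce (function names, the toy graph, and the voting heuristic are illustrative assumptions, not taken from the paper):

```python
from collections import Counter

def predict_by_neighbor_vote(adj, labels, node, default="unknown"):
    """Predict `node`'s class by majority vote over its labeled neighbors.

    adj    : dict mapping node -> list of neighbor nodes
    labels : dict mapping node -> known class label (partial labeling)
    """
    # Count labels among neighbors that have a known label.
    votes = Counter(labels[n] for n in adj.get(node, []) if n in labels)
    if not votes:
        return default  # no labeled neighbors to vote with
    return votes.most_common(1)[0][0]

# Toy citation graph: p1 cites p2, p3, p4; labels are partially known.
adj = {"p1": ["p2", "p3", "p4"], "p2": ["p1"], "p3": ["p1"], "p4": ["p1"]}
labels = {"p2": "ML", "p3": "ML", "p4": "DB"}
print(predict_by_neighbor_vote(adj, labels, "p1"))  # -> ML
```

Note that a neighbor-vote heuristic like this implicitly assumes homophily; part of what the study probes is whether such code-generated strategies adapt when that assumption fails.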
Authors did not state explicit limitations.
Authors did not state explicit future directions.
Author keywords
- Large Language Models
- Prompting
- In-Context Learning
- Tool-augmented Reasoning
- Text-rich Graphs
Related orals
Benchmarking Empirical Privacy Protection for Adaptations of Large Language Models
Benchmarks practical privacy risks in LLMs adapted with differential privacy, revealing that distribution shifts and model choice affect protection effectiveness.
Half-order Fine-Tuning for Diffusion Model: A Recursive Likelihood Ratio Optimizer
Proposes Recursive Likelihood Ratio optimizer for efficient fine-tuning of diffusion models with lower variance gradient estimation.
Invisible Safety Threat: Malicious Finetuning for LLM via Steganography
Demonstrates that LLMs can be fine-tuned to generate harmful steganographically hidden outputs while appearing benign to safety systems.
Reducing Belief Deviation in Reinforcement Learning for Active Reasoning of LLM Agents
Proposes T3 algorithm to detect belief deviation in LLM agents and truncate trajectories for improved reinforcement learning in active reasoning tasks.
RefineStat: Efficient Exploration for Probabilistic Program Synthesis
RefineStat enforces semantic constraints and applies diagnostic-aware refinement for synthesizing valid probabilistic programs from smaller language models.