Actions Speak Louder than Prompts: A Large-Scale Study of LLMs for Graph Inference
Ben Finkelshtein, Silviu Cucerzan, Sujay Kumar Jauhar, Ryen W. White
A comprehensive study of LLMs for node classification, providing a principled understanding of their capabilities in processing graph information that practitioners can apply in real-world tasks.
Abstract
Large language models (LLMs) are increasingly leveraged for text-rich graph machine learning tasks, with node classification standing out due to its high-impact application domains such as fraud detection and recommendation systems. Yet, despite a surge of interest, the field lacks a principled understanding of the capabilities of LLMs in processing graph data. In this work, we conduct a large-scale, controlled evaluation across the key axes of variability: the LLM-graph interaction mode, comparing prompting, tool-use, and code generation; dataset domains, spanning citation, web-link, e-commerce, and social networks; homophilic vs. heterophilic regimes; short- vs. long-text features; and LLM sizes and reasoning capabilities. We further analyze dependencies by independently truncating features, deleting edges, and removing labels to quantify reliance on each input type. Our findings provide actionable guidance for both research and practice. (1) Code generation mode achieves the strongest overall performance, with especially large gains on long-text or high-degree graphs where prompting quickly exceeds the token budget. (2) All interaction strategies remain effective on heterophilic graphs, challenging the assumption that LLM-based methods collapse under low homophily. (3) Code generation mode is able to flexibly shift its reliance to the most informative input type, whether that be structure, features, or labels. Together, these results establish a clear picture of the strengths and limitations of current LLM–graph interaction modes and point to design principles for future methods.
Large-scale study comparing LLM-graph interaction modes for node classification, finding code generation outperforms prompting on long-text and high-degree graphs.
- Comprehensive controlled evaluation across prompting, tool-use, and code generation modes for LLM-graph interaction
- Analysis showing code generation achieves strongest performance especially on long-text or high-degree graphs
- Dependency analysis revealing LLMs can flexibly shift reliance to most informative input type
Interaction modes
- Prompting
- ReAct-style tool use
- Graph-as-Code generation
Dataset domains
- citation networks
- web-link networks
- e-commerce networks
- social networks
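The interaction modes differ in how the LLM touches the graph: prompting serializes the graph into the context window, while in the code-generation (Graph-as-Code) mode the model emits a program that queries the graph directly, sidestepping the token budget on long-text or high-degree graphs. A minimal, hypothetical sketch of the kind of classifier such a mode might produce (function names, the toy graph, and the voting heuristic are illustrative assumptions, not taken from the paper):

```python
from collections import Counter

def predict_by_neighbor_vote(adj, labels, node, default="unknown"):
    """Predict `node`'s class by majority vote over its labeled neighbors.

    adj    : dict mapping node -> list of neighbor nodes
    labels : dict mapping node -> known class label (partial labeling)
    """
    # Count labels among neighbors that have a known label.
    votes = Counter(labels[n] for n in adj.get(node, []) if n in labels)
    if not votes:
        return default  # no labeled neighbors to vote with
    return votes.most_common(1)[0][0]

# Toy citation graph: p1 cites p2, p3, p4; labels are partially known.
adj = {"p1": ["p2", "p3", "p4"], "p2": ["p1"], "p3": ["p1"], "p4": ["p1"]}
labels = {"p2": "ML", "p3": "ML", "p4": "DB"}
print(predict_by_neighbor_vote(adj, labels, "p1"))  # -> ML
```

Note that a neighbor-vote heuristic like this implicitly assumes homophily; part of what the study probes is whether such code-generated strategies adapt when that assumption fails.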
Authors did not state explicit limitations.
Authors did not state explicit future directions.
Author keywords
- Large Language Models
- Prompting
- In-Context Learning
- Tool-augmented Reasoning
- Text-rich Graphs
Related orals
Benchmarking Empirical Privacy Protection for Adaptations of Large Language Models
Benchmarks practical privacy risks in LLMs adapted with differential privacy, revealing that distribution shifts and model choice affect protection effectiveness.
Half-order Fine-Tuning for Diffusion Model: A Recursive Likelihood Ratio Optimizer
Proposes Recursive Likelihood Ratio optimizer for efficient fine-tuning of diffusion models with lower variance gradient estimation.
Invisible Safety Threat: Malicious Finetuning for LLM via Steganography
Demonstrates that LLMs can be fine-tuned to generate harmful steganographically hidden outputs while appearing benign to safety systems.
Reducing Belief Deviation in Reinforcement Learning for Active Reasoning of LLM Agents
Proposes T3 algorithm to detect belief deviation in LLM agents and truncate trajectories for improved reinforcement learning in active reasoning tasks.
RefineStat: Efficient Exploration for Probabilistic Program Synthesis
RefineStat enforces semantic constraints and applies diagnostic-aware refinement for synthesizing valid probabilistic programs from smaller language models.