Test-Time Scaling Makes Overtraining Compute-Optimal
arXiv:2604.01411v1 Announce Type: new Abstract: Modern LLMs scale at test-time, e.g. via repeated sampling, where inference cost grows with model size and the number of …
Tag: stat.ML
arXiv:2603.23568v1 Announce Type: new Abstract: Sentiment signals derived from sparse news are commonly used in financial analysis and technology monitoring, yet transforming raw article-level observations …
arXiv:2603.23783v1 Announce Type: new Abstract: Adapting large-scale foundation models to new domains with limited supervision remains a fundamental challenge due to latent distribution mismatch, unstable …
arXiv:2603.23792v1 Announce Type: new Abstract: Diffusion models often generate novel samples even when the learned score is only \emph{coarse} -- a phenomenon not accounted for …
arXiv:2603.23805v1 Announce Type: new Abstract: Neural Collapse is a phenomenon that helps identify sparse and low rank structures in deep classifiers. Recent work has extended …
arXiv:2603.23831v1 Announce Type: new Abstract: Deep neural networks (DNNs), particularly those using Rectified Linear Unit (ReLU) activation functions, have achieved remarkable success across diverse machine …
arXiv:2603.23926v1 Announce Type: new Abstract: Online reinforcement learning in infinite-horizon Markov decision processes (MDPs) remains less theoretically and algorithmically developed than its episodic counterpart, with …
arXiv:2603.22320v1 Announce Type: new Abstract: While climate models provide insights for climate decision-making, their use is constrained by significant computational and technical demands. Although machine …
arXiv:2603.22328v1 Announce Type: new Abstract: Despite the strong predictive performance achieved by machine learning models across many application domains, assessing their trustworthiness through reliable estimates …
arXiv:2603.22339v1 Announce Type: new Abstract: Chinchilla Approach 2 is among the most widely used methods for fitting neural scaling laws. Its parabolic approximation introduces systematic …
arXiv:2603.22465v1 Announce Type: new Abstract: Federated Learning (FL) is constrained by the communication and energy limitations of decentralized edge devices. While gradient sparsification via Top-K …
arXiv:2603.20939v1 Announce Type: new Abstract: Large language models are increasingly used as personal assistants, yet most lack a persistent user model, forcing users to repeatedly …