Oldest Papers


Automated clinical coding using off-the-shelf large language models

Joseph S. Boyle, Antanas Kascenas, Pat Lok, Maria Liakata, Alison Q. O'Neil

arXiv·2023

The task of assigning diagnostic ICD codes to patient hospital admissions is typically performed by expert human coders. Efforts towards automated ICD coding are dominated by supervised deep learning models. However, difficulties in learning to predict the large number of rare codes remain a barrier to adoption in clinical practice. In this work, we leverage off-the-shelf pre-trained generative large language models (LLMs) to develop a practical solution that is suitable for zero-shot and few-shot code assignment, with no need for further task-specific training. Unsupervised pre-training alone does not guarantee precise knowledge of the ICD ontology and specialist clinical coding task, therefore we frame the task as information extraction, providing a description of each coded concept and asking the model to retrieve related mentions. For efficiency, rather than iterating over all codes, we leverage the hierarchical nature of the ICD ontology to sparsely search for relevant codes.

★ 5.0 (1)


The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets

Samuel Marks, Max Tegmark

arXiv·2023

Large Language Models (LLMs) have impressive capabilities, but are prone to outputting falsehoods. Recent work has developed techniques for inferring whether a LLM is telling the truth by training probes on the LLM's internal activations. However, this line of work is controversial, with some authors pointing out failures of these probes to generalize in basic ways, among other conceptual issues. In this work, we use high-quality datasets of simple true/false statements to study in detail the structure of LLM representations of truth, drawing on three lines of evidence: 1. Visualizations of LLM true/false statement representations, which reveal clear linear structure. 2. Transfer experiments in which probes trained on one dataset generalize to different datasets. 3. Causal evidence obtained by surgically intervening in a LLM's forward pass, causing it to treat false statements as true and vice versa. Overall, we present evidence that at sufficient scale, LLMs linearly represent the truth or falsehood of factual statements. We also show that simple difference-in-mean probes generalize as well as other probing techniques while identifying directions which are more causally implicated in model outputs.
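The difference-in-mean probe named in this abstract can be illustrated with a minimal sketch. It is not the authors' code: the toy two-dimensional "activations", the function names, and the choice to place the decision threshold at the midpoint of the two class means are all assumptions made for illustration.

```python
from statistics import fmean


def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    return [fmean(col) for col in zip(*rows)]


def fit_diff_in_means(true_acts, false_acts):
    """Return (direction, bias) for a difference-in-means probe.

    direction = mean(true activations) - mean(false activations).
    The bias places the decision boundary at the midpoint of the two
    class means (an assumption of this sketch).
    """
    mu_true = mean_vector(true_acts)
    mu_false = mean_vector(false_acts)
    direction = [t - f for t, f in zip(mu_true, mu_false)]
    midpoint = [(t + f) / 2 for t, f in zip(mu_true, mu_false)]
    bias = -sum(d * m for d, m in zip(direction, midpoint))
    return direction, bias


def predict_true(direction, bias, act):
    """Classify an activation as 'true' if it projects past the midpoint."""
    return sum(d * a for d, a in zip(direction, act)) + bias > 0


# Hypothetical toy activations; real ones would come from a model's hidden states.
true_acts = [[1.0, 0.2], [0.8, -0.1], [1.2, 0.0]]
false_acts = [[-1.0, 0.1], [-0.9, -0.2], [-1.1, 0.3]]
direction, bias = fit_diff_in_means(true_acts, false_acts)
```

The probe has no trained parameters beyond the two class means, which is what makes it a simple baseline against other probing techniques.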

No ratings yet


MemGPT: Towards LLMs as Operating Systems

Charles Packer, Sarah Wooders, Kevin Lin, Vivian Fang, Shishir G. Patil, Ion Stoica, Joseph E. Gonzalez

arXiv·2023

Large language models (LLMs) have revolutionized AI, but are constrained by limited context windows, hindering their utility in tasks like extended conversations and document analysis. To enable using context beyond limited context windows, we propose virtual context management, a technique drawing inspiration from hierarchical memory systems in traditional operating systems that provide the appearance of large memory resources through data movement between fast and slow memory. Using this technique, we introduce MemGPT (Memory-GPT), a system that intelligently manages different memory tiers in order to effectively provide extended context within the LLM's limited context window, and utilizes interrupts to manage control flow between itself and the user. We evaluate our OS-inspired design in two domains where the limited context windows of modern LLMs severely handicaps their performance: document analysis, where MemGPT is able to analyze large documents that far exceed the underlying LLM's context window, and multi-session chat, where MemGPT can create conversational agents that remember, reflect, and evolve dynamically through long-term interactions with their users. We release MemGPT code and data for our experiments at https://memgpt.ai.

No ratings yet



Detecting Multidimensional DIF in Polytomous Items with IRT Methods and Estimation Approaches

Güler Yavuz Temel

Journal of Educational Measurement·2023·1 citations

The purpose of this study was to investigate multidimensional DIF with a simple and nonsimple structure in the context of the multidimensional Graded Response Model (MGRM). This study examined and compared the performance of the IRT‐LR and Wald tests using MML‐EM and MHRM estimation approaches with different test factors and test structures, both in simulation studies and in applications to real data sets. When the test structure included two dimensions, the IRT‐LR (MML‐EM) generally performed better than the Wald test and provided higher power rates. If the test included three dimensions, the methods provided similar performance in DIF detection. In contrast to these results, when the number of dimensions in the test was four, MML‐EM estimation completely lost precision in estimating the nonuniform DIF, even with large sample sizes. The Wald test with the MHRM estimation approach outperformed the Wald test (MML‐EM) and IRT‐LR (MML‐EM). The Wald test had a higher power rate and acceptable type I error rates for nonuniform DIF with the MHRM estimation approach. The small and/or unbalanced sample sizes, small DIF magnitudes, unequal ability distributions between groups, number of dimensions, estimation methods and test structure were evaluated as important test factors for detecting multidimensional DIF.

No ratings yet


Can Large Language Models Explain Themselves? A Study of LLM-Generated Self-Explanations

Shiyuan Huang, Siddarth Mamidanna, Shreedhar Jangam, Yilun Zhou, Leilani H. Gilpin

arXiv·2023

Large language models (LLMs) such as ChatGPT have demonstrated superior performance on a variety of natural language processing (NLP) tasks including sentiment analysis, mathematical reasoning and summarization. Furthermore, since these models are instruction-tuned on human conversations to produce "helpful" responses, they can and often will produce explanations along with the response, which we call self-explanations. For example, when analyzing the sentiment of a movie review, the model may output not only the positivity of the sentiment, but also an explanation (e.g., by listing the sentiment-laden words such as "fantastic" and "memorable" in the review). How good are these automatically generated self-explanations? In this paper, we investigate this question on the task of sentiment analysis and for feature attribution explanation, one of the most commonly studied settings in the interpretability literature (for pre-ChatGPT models). Specifically, we study different ways to elicit the self-explanations, evaluate their faithfulness on a set of evaluation metrics, and compare them to traditional explanation methods such as occlusion or LIME saliency maps. Through an extensive set of experiments, we find that ChatGPT's self-explanations perform on par with traditional ones, but are quite different from them according to various agreement metrics, meanwhile being much cheaper to produce (as they are generated along with the prediction). In addition, we identified several interesting characteristics of them, which prompt us to rethink many current model interpretability practices in the era of ChatGPT(-like) LLMs.

No ratings yet


Large Language Model Prediction Capabilities: Evidence from a Real-World Forecasting Tournament

Philipp Schoenegger, Peter S. Park

arXiv·2023

Accurately predicting the future would be an important milestone in the capabilities of artificial intelligence. However, research on the ability of large language models to provide probabilistic predictions about future events remains nascent. To empirically test this ability, we enrolled OpenAI's state-of-the-art large language model, GPT-4, in a three-month forecasting tournament hosted on the Metaculus platform. The tournament, running from July to October 2023, attracted 843 participants and covered diverse topics including Big Tech, U.S. politics, viral outbreaks, and the Ukraine conflict. Focusing on binary forecasts, we show that GPT-4's probabilistic forecasts are significantly less accurate than the median human-crowd forecasts. We find that GPT-4's forecasts did not significantly differ from the no-information forecasting strategy of assigning a 50% probability to every question. We explore a potential explanation, that GPT-4 might be predisposed to predict probabilities close to the midpoint of the scale, but our data do not support this hypothesis. Overall, we find that GPT-4 significantly underperforms in real-world predictive tasks compared to median human-crowd forecasts. A potential explanation for this underperformance is that in real-world forecasting tournaments, the true answers are genuinely unknown at the time of prediction; unlike in other benchmark tasks like professional exams or time series forecasting, where strong performance may at least partly be due to the answers being memorized from the training data. This makes real-world forecasting tournaments an ideal environment for testing the generalized reasoning and prediction capabilities of artificial intelligence going forward.
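To make the 50% no-information baseline concrete, the sketch below scores binary forecasts with the Brier score, one common accuracy measure for probabilistic predictions. The abstract does not name the scoring rule used in the study, so the choice of Brier score is an assumption, and the outcomes and forecasts are made-up illustrative values, not data from the tournament.

```python
def brier_score(probs, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes.

    Lower is better; 0.0 is a perfect forecaster.
    """
    if len(probs) != len(outcomes):
        raise ValueError("probs and outcomes must have the same length")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


# Hypothetical resolved questions (1 = happened, 0 = did not happen).
outcomes = [1, 0, 0, 1]

# No-information strategy: assign 50% to every question.
baseline = brier_score([0.5] * len(outcomes), outcomes)

# Hypothetical forecaster that leans in the right direction on each question.
forecasts = [0.8, 0.3, 0.1, 0.6]
score = brier_score(forecasts, outcomes)
```

Under this scoring rule the always-50% strategy scores 0.25 regardless of how the questions resolve, so a forecaster adds information only to the extent that its score falls below that value.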

★ 3.0 (1)


Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection

Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, Hannaneh Hajishirzi

arXiv·2023

Despite their remarkable capabilities, large language models (LLMs) often produce responses containing factual inaccuracies due to their sole reliance on the parametric knowledge they encapsulate. Retrieval-Augmented Generation (RAG), an ad hoc approach that augments LMs with retrieval of relevant knowledge, decreases such issues. However, indiscriminately retrieving and incorporating a fixed number of retrieved passages, regardless of whether retrieval is necessary, or passages are relevant, diminishes LM versatility or can lead to unhelpful response generation. We introduce a new framework called Self-Reflective Retrieval-Augmented Generation (Self-RAG) that enhances an LM's quality and factuality through retrieval and self-reflection. Our framework trains a single arbitrary LM that adaptively retrieves passages on-demand, and generates and reflects on retrieved passages and its own generations using special tokens, called reflection tokens. Generating reflection tokens makes the LM controllable during the inference phase, enabling it to tailor its behavior to diverse task requirements. Experiments show that Self-RAG (7B and 13B parameters) significantly outperforms state-of-the-art LLMs and retrieval-augmented models on a diverse set of tasks. Specifically, Self-RAG outperforms ChatGPT and retrieval-augmented Llama2-chat on Open-domain QA, reasoning and fact verification tasks, and it shows significant gains in improving factuality and citation accuracy for long-form generations relative to these models.

No ratings yet


HANDS UP: WHAT’S UP WITH HASTAS REVIEWING THE DISCOURSE OF INTRODUCING HAND GESTURES TO CODIFY WAACKING

Sangram Mukhopadhyay

ShodhKosh: Journal of Visual and Performing Arts·2023

Waacking/Whacking as a 70s Los Angeles disco gay dance form is consistently being discovered and localized in different dance ecosystems of the world. The Indian chapter of its trajectory is uniquely drawing a certain kind of display which is in direct cognizance of its own codified forms of hand movements. The present study focuses on the seepages and abstinence of hand gestures through understanding the impetus to do so. Based on the analysis of the occurrences associated with its usage, through detailed conversations, literature review and reading of performance texts, several inferences can be drawn, and a better understanding of the genre can be further achieved.

No ratings yet


GraphMaker: Can Diffusion Models Generate Large Attributed Graphs?

Mufei Li, Eleonora Kreačić, Vamsi K. Potluru, Pan Li

arXiv·2023

Large-scale graphs with node attributes are increasingly common in various real-world applications. Creating synthetic, attribute-rich graphs that mirror real-world examples is crucial, especially for sharing graph data for analysis and developing learning models when original data is restricted to be shared. Traditional graph generation methods are limited in their capacity to handle these complex structures. Recent advances in diffusion models have shown potential in generating graph structures without attributes and smaller molecular graphs. However, these models face challenges in generating large attributed graphs due to the complex attribute-structure correlations and the large size of these graphs. This paper introduces a novel diffusion model, GraphMaker, specifically designed for generating large attributed graphs. We explore various combinations of node attribute and graph structure generation processes, finding that an asynchronous approach more effectively captures the intricate attribute-structure correlations. We also address scalability issues through edge mini-batching generation. To demonstrate the practicality of our approach in graph data dissemination, we introduce a new evaluation pipeline. The evaluation demonstrates that synthetic graphs generated by GraphMaker can be used to develop competitive graph machine learning models for the tasks defined over the original graphs without actually accessing these graphs, while many leading graph generation methods fall short in this evaluation.

No ratings yet


Testing exchangeability by pairwise betting

Aytijhya Saha, Aaditya Ramdas

arXiv·2023

In this paper, we address the problem of testing exchangeability of a sequence of random variables, $X_1, X_2,\cdots$. This problem has been studied under the recently popular framework of testing by betting. But the mapping of testing problems to game is not one to one: many games can be designed for the same test. Past work established that it is futile to play single game betting on every observation: test martingales in the data filtration are powerless. Two avenues have been explored to circumvent this impossibility: betting in a reduced filtration (wealth is a test martingale in a coarsened filtration), or playing many games in parallel (wealth is an e-process in the data filtration). The former has proved to be difficult to theoretically analyze, while the latter only works for binary or discrete observation spaces. Here, we introduce a different approach that circumvents both drawbacks. We design a new (yet simple) game in which we observe the data sequence in pairs. Despite the fact that betting on individual observations is futile, we show that betting on pairs of observations is not. To elaborate, we prove that our game leads to a nontrivial test martingale, which is interesting because it has been obtained by shrinking the filtration very slightly. We show that our test controls type-1 error despite continuous monitoring, and achieves power one for both binary and continuous observations, under a broad class of alternatives. Due to the shrunk filtration, optional stopping is only allowed at even stopping times, not at odd ones: a relatively minor price. We provide a wide array of simulations that align with our theoretical findings.

No ratings yet


The Shutdown Problem: An AI Engineering Puzzle for Decision Theorists

EJT

AI Alignment Forum·2023

[NOTE: This paper was previously titled 'The Shutdown Problem: Three Theorems'.] This paper is an updated version of the first half of my AI Alignment Awards contest entry. My theorems build on the theorems of Soares, Fallenstein, Yudkowsky, and Armstrong in various ways. These theorems can guide our search for solutions to the shutdown problem. One aim of the paper is to get academic philosophers and decision theorists interested in the shutdown problem and related topics in AI alignment. They’re my assumed audience. I’m posting here because I think the theorems will also be interesting to people already familiar with the shutdown problem. For discussion and feedback, I thank Adam Bales, Ryan Carey, Bill D’Alessandro, Tomi Francis, Vera Gahlen, Dan Hendrycks, Cameron Domenico Kirk-Giannini, Jojo Lee, Andreas Mogensen, Sami Petersen, Rio Popper, Brad Saad, Nate Soares, Rhys Southan, Christian Tarsney, Teru Thomas, John Wentworth, Tim L. Williamson, and Keith Wynroe. Abstract: I explain and motivate the shutdown problem: the problem of designing artificial agents that (1) shut down when a shutdown button is pressed, (2) don’t try to prevent or cause the pressing of the shutdown button, and (3) otherwise pursue goals competently. I prove three theorems that make the difficulty precise. These theorems suggest that agents satisfying some innocuous-seeming conditions will often try to prevent or cause the pressing of the shutdown button, even in cases where it’s costly t...

No ratings yet

