Aim and Scope
Leveraging the foundation built in the prior workshops SpLU-RoboNLP 2023, SpLU-RoboNLP 2021, SpLU 2020, SpLU-RoboNLP 2019, SpLU 2018, and RoboNLP 2017, we propose the fourth combined workshop on Spatial Language Understanding and Grounded Communication for Robotics. Natural language communication with general-purpose embodied robots has long been a dream inspired by science fiction, and natural language interfaces have the potential to make robots accessible to a wider range of users. Achieving this goal requires continuous improvement of existing technologies and development of new ones for linking language to perception and action in the physical world. In particular, given the rise of large vision and language generative models, spatial language understanding and natural interaction have become even more exciting topics to explore. This joint workshop aims to bring together the perspectives of researchers working on physical robot systems with human users, simulated embodied environments, multimodal interaction, and spatial language understanding, and to forge collaborations among them.
Topics of Interest
We welcome original research including, but not limited to, computational models, benchmarks, evaluation metrics, analyses, surveys, and position papers on the following topics:
- Deployment of Large Language Models for Situated Dialogue and Language Grounding
- Spatial Reasoning with Large Language Models
- Aligning and Translating Language to Situated Actions
- Evaluation Metrics for Language Grounding and Human-Robot Communication
- Human-Computer Interactions Through Natural or Structured Language
- Instruction Understanding and Spatial Reasoning based on Multimodal Information for Navigation, Articulation, and Manipulation
- Interactive Situated Dialogue for Physical Tasks
- Language-based Game Playing for Grounding
- Spatial Language and Skill Learning via Grounded Dialogue
- (Spatial) Language Generation for Embodied Tasks
- (Spatially-) Grounded Knowledge Representations
- Spatial Reasoning in Image and Video Diffusion Models
- Qualitative Spatial Representations and Neuro-symbolic Modeling
- Utilization and Limitations of Large (Multimodal-)Language Models in Spatial Understanding and Grounded Communication
Call for Papers
We cordially invite authors to contribute by submitting their long papers and short papers.
Long Papers
Technical papers: 8 pages excluding references; 1 additional page allowed for the camera-ready version.
Short Papers
Position statements describing previously unpublished work or demos: 4 pages excluding references; 1 additional page allowed for the camera-ready version.
Non-Archival Option
ACL workshops are traditionally archival. To allow dual submission of work to SpLU-RoboNLP 2024 and to *ACL Findings or other conferences/journals, we also offer a non-archival track. Space permitting, authors of these submissions will still present their work at the workshop, and the papers will be hosted on the workshop website, but they will not be included in the official proceedings. Please apply the ACL format and submit through OpenReview, but indicate that the paper is a cross-submission (non-archival) at the bottom of the submission form.
Submission Instructions
All submissions must comply with the ACL formatting guidelines and code of ethics. Papers should be submitted electronically through OpenReview. Peer review will be double-blind.
Style and Formatting
ACL Template
Submissions Website
OpenReview
Important Dates
- Submission Open: 5 February 2024 (Anywhere on Earth)
- Submission Deadline: 17 May 2024 (Anywhere on Earth)
- Extended Submission Deadline: 27 May 2024 (Anywhere on Earth)
- Notification of Acceptance: 17 June 2024 (Anywhere on Earth)
- Extended Notification of Acceptance: 24 June 2024 (Anywhere on Earth)
- Camera Ready Deadline: 30 June 2024 (GMT)
- Workshop Day: 16 August 2024 (co-located with ACL 2024)
Invited Speakers
Inderjeet Mani (Formerly, Yahoo! Labs, Georgetown University)
Title: Grounding Spatial Natural Language with Generative AI
Bio: Inderjeet Mani is a research scientist specializing in NLP and AI. His research in NLP has included automatic summarization, narrative modeling, and temporal and spatial information extraction from text. He has also contributed to research on question answering, bioinformatics, ontologies, machine translation, geographical information systems, and multimedia information processing. His publications include a hundred-odd scientific papers (totaling over 13,000 citations), six scholarly books, and a science-fiction thriller, along with nearly fifty shorter literary pieces, including a popular essay about robot readers. Now semi-retired, he has held affiliations with Georgetown University (Associate Professor), Yahoo (Senior Director), Cambridge University (Visiting Fellow), MITRE (Senior Principal Scientist), Brandeis University (Visiting Scholar), MIT (Research Affiliate), and the Indian Institute of Science (Consulting Scientist). He has served on the editorial boards of the journals Computational Linguistics (2002-2004) and Natural Language Engineering (2011-2015) and has reviewed for numerous AI journals and conferences.
Abstract: When communicating in NL about spatial relations and movement, humans seldom use precise geometries and equations of motion, instead relying on a qualitative understanding. Qualitative and quantitative grounding of spatial relations in NL can together provide more flexible and higher-level communication with robots. I begin with an outline of a spatial semantics for natural language based on qualitative reasoning formalisms. Generative AI systems can extract spatial relations based on these formalisms from visual and/or text data, and generate NL descriptions based on them. They also appear to carry out rudimentary reasoning using these formalisms, providing large-scale, albeit noisy, training data. However, for the results to be convincing, formal reasoning needs to be integrated more tightly with neural architectures, beginning with improved tokenization. Evaluations need to go well beyond SOTA benchmarks and must extend to extrinsic tasks of interest to robotics.
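For readers unfamiliar with qualitative spatial formalisms, the sketch below maps two 2D bounding boxes to RCC-8-style relations. It is a minimal illustration only, not the speaker's system; the function names and box format are hypothetical choices.

```python
# A minimal, illustrative sketch of mapping quantitative geometry (2D bounding
# boxes) to RCC-8-style qualitative spatial relations, in the spirit of the
# qualitative formalisms discussed in the abstract. The relation names follow
# RCC-8; everything else (function names, box format) is a hypothetical choice.

from typing import Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)

def _contains(a: Box, b: Box) -> bool:
    """True if box a contains box b (not necessarily strictly)."""
    return a[0] <= b[0] and a[1] <= b[1] and a[2] >= b[2] and a[3] >= b[3]

def _touches_boundary(outer: Box, inner: Box) -> bool:
    """True if the inner box shares at least one edge with the outer box."""
    return any(outer[i] == inner[i] for i in range(4))

def rcc8(a: Box, b: Box) -> str:
    """Classify the qualitative relation between two axis-aligned boxes."""
    ox = min(a[2], b[2]) - max(a[0], b[0])   # overlap extent along x
    oy = min(a[3], b[3]) - max(a[1], b[1])   # overlap extent along y
    if ox < 0 or oy < 0:
        return "DC"    # disconnected
    if ox == 0 or oy == 0:
        return "EC"    # externally connected (boundaries touch)
    if a == b:
        return "EQ"
    if _contains(a, b):
        return "TPPi" if _touches_boundary(a, b) else "NTPPi"  # b is inside a
    if _contains(b, a):
        return "TPP" if _touches_boundary(b, a) else "NTPP"    # a is inside b
    return "PO"        # partial overlap

# e.g. rcc8((0, 0, 2, 2), (3, 3, 4, 4)) -> "DC"
#      rcc8((0, 0, 4, 4), (1, 1, 2, 2)) -> "NTPPi"
```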
Malihe Alikhani (Northeastern University)
Title: From Ambiguity to Clarity: Navigating Uncertainty in Human-Machine Conversations
Bio: Malihe Alikhani is an assistant professor of AI and social justice at the Khoury School of Computer Science, Northeastern University. She is affiliated with the Northeastern Ethics Institute as well as the Institute for Experiential AI. Her research interests center on using representations of communicative structure, machine learning, and cognitive science to design equitable and inclusive NLP systems for critical applications such as education, health, and social justice. She has designed several models for sign language understanding and generation, dialogue systems for deaf and hard-of-hearing users, and AI systems for the evaluation of speech impairment, and she has worked on projects involving fairness and limited data. Her work has received multiple best paper awards at ACL 2021, UAI 2022, INLG 2021, UMAP 2022, and EMNLP 2023 and has been supported by DARPA, NIH, CDC, Google, and Amazon.
Abstract: This talk delves into the intricacies of uncertainty in human-machine dialogue, mainly focusing on the challenges and solutions related to ambiguities arising from impoverished contextual representations. We examine how linguistically informed context representations can mitigate data-related uncertainty in a deployed dialogue system similar to Alexa. We acknowledge that certain types of data-related uncertainty are unavoidable and investigate the capabilities of modern billion-scale language models in representing this form of uncertainty in conversations. Shifting our focus to epistemic uncertainty arising from misaligned background knowledge between humans and machines, we explore strategies for quantifying and reducing this form of uncertainty. Our discussion encompasses various facets of human-machine convergence, including Theory of Mind, fairness, and pragmatics. By leveraging machine learning theory and cognitive science insights, we aim to quantify epistemic uncertainty and propose algorithms that improve grounding between humans and machines. This exploration sheds light on the theoretical underpinnings of uncertainty in dialogue systems and offers practical solutions for improving human-machine communication.
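As a purely illustrative companion to the abstract's discussion of quantifying epistemic uncertainty, the sketch below uses a standard ensemble-based decomposition of predictive entropy into data-related and epistemic components. It is not the speaker's method, and the ensemble of response distributions is a hypothetical stand-in.

```python
# A small, illustrative sketch of one common way to separate data (aleatoric)
# from knowledge (epistemic) uncertainty: average per-model entropy vs. the
# mutual information across an ensemble of response distributions. The
# ensemble is a hypothetical stand-in, e.g. several samples or checkpoints of
# a dialogue model scoring the same set of candidate replies.

import math
from typing import List, Tuple

def entropy(p: List[float]) -> float:
    return -sum(x * math.log(x) for x in p if x > 0)

def uncertainty_decomposition(ensemble: List[List[float]]) -> Tuple[float, float, float]:
    """ensemble[m][k] = probability model m assigns to candidate reply k."""
    k = len(ensemble[0])
    mean = [sum(p[i] for p in ensemble) / len(ensemble) for i in range(k)]
    total = entropy(mean)                                    # predictive entropy
    aleatoric = sum(entropy(p) for p in ensemble) / len(ensemble)
    epistemic = total - aleatoric                            # mutual information (>= 0)
    return total, aleatoric, epistemic

# Models that agree on a flat distribution -> high aleatoric, low epistemic;
# models that confidently disagree -> high epistemic, as below.
print(uncertainty_decomposition([[0.9, 0.1], [0.1, 0.9]]))
```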
Daniel Fried (Carnegie Mellon University)
Title: Benchmarks and Tree Search for Multimodal LLM Web Agents
Bio: Daniel Fried is an assistant professor in the Language Technologies Institute at CMU, and a research scientist at Meta AI. His research focuses on language grounding, interaction, and applied pragmatics, with a particular focus on language interfaces such as grounded instruction following and code generation. Previously, he was a postdoc at Meta AI and the University of Washington and completed a PhD at UC Berkeley. His research has been supported by an Okawa Research Award, a Google PhD Fellowship and a Churchill Fellowship.
Abstract: LLMs have the potential to help automate tasks on the web, and thereby help people better interact with the digital world. We present benchmarks, WebArena and VisualWebArena, which evaluate the ability of LLMs to act as agentive, language-based interfaces to carry out everyday tasks on the web. Succeeding on these tasks requires systems to ground language into the visual and semi-structured content of web pages, carry out common sub-tasks which are shared across tasks and web pages, and make decisions under uncertainty and partial observability -- all challenges for current agentive methods built on top of (vision and) language models. We also present a tree search method which addresses some of these challenges, obtaining strong improvements across state-of-the-art agents on these benchmarks.
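For readers who want a concrete picture of what tree search over web-agent actions can look like, here is a generic best-first search skeleton. It is only a sketch under assumed stubs (action proposer, environment step, value function), not the method presented in the talk.

```python
# A generic best-first tree-search skeleton over candidate agent actions, meant
# only to illustrate the kind of search the abstract refers to. The environment,
# action proposer, and value function are hypothetical stubs (e.g., the value
# function could be a vision-language model scoring a page state against the goal).

import heapq
import itertools
from typing import Any, Callable, List

State, Action = Any, str

def tree_search(
    root: State,
    propose_actions: Callable[[State], List[Action]],  # e.g., LLM-proposed clicks/typing
    step: Callable[[State, Action], State],             # environment transition
    value: Callable[[State], float],                    # higher = closer to task success
    is_done: Callable[[State], bool],
    max_expansions: int = 50,
    branching: int = 3,
) -> List[Action]:
    """Return the action sequence leading to the best state found."""
    counter = itertools.count()  # tie-breaker so heapq never compares raw states
    frontier = [(-value(root), next(counter), root, [])]
    best_state, best_trace = root, []
    for _ in range(max_expansions):
        if not frontier:
            break
        neg_v, _, state, trace = heapq.heappop(frontier)
        if -neg_v > value(best_state):
            best_state, best_trace = state, trace
        if is_done(state):
            return trace
        for action in propose_actions(state)[:branching]:
            child = step(state, action)
            heapq.heappush(frontier, (-value(child), next(counter), child, trace + [action]))
    return best_trace
```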
Yu Su (Ohio State University)
Title: Web Agents: A New Frontier for Embodied Agents
Bio: Yu Su is a Distinguished Assistant Professor at the Ohio State University and a researcher at Microsoft. He co-directs the OSU NLP group. He has broad interests in artificial and biological intelligence, with a primary interest in the role of language as a vehicle of thought and communication. His recent interests center around language agents with a series of well-received works such as Mind2Web, SeeAct, HippoRAG, LLM-Planner, and MMMU. His work has received multiple paper awards, including the Best Student Paper Award at CVPR 2024 and Outstanding Paper Award at ACL 2023.
Abstract: The digital world, e.g., the World Wide Web, provides a powerful yet underexplored form of embodiment for AI agents, where perception involves understanding visual renderings and the underlying markup languages, and effectors are mice and keyboards. Empowered by large language models (LLMs), web agents are rapidly rising as a new frontier for embodied agents that provide both the breadth and depth needed for driving agent development. On the other hand, web agents can potentially lead to many practical applications, thus raising substantial commercial interests as well. In this talk, I will present an overview of web agents, covering the history, the promises, and the challenges. I will then give an in-depth discussion on multimodality and grounding. Finally, I will conclude with an outlook of promising future directions, including planning, synthetic data, and safety.
Yoav Artzi (Cornell University)
Title: Back to Fundamentals: Challenges in Visual Reasoning
Bio: Yoav Artzi is an Associate Professor in the Department of Computer Science and Cornell Tech at Cornell University, arXiv's associate faculty director, and a researcher at ASAPP. His research focuses on developing models and learning methods for natural language understanding and generation in interactive systems. He received an NSF CAREER award, and his work was acknowledged by awards and honorable mentions at ACL, EMNLP, NAACL, and IROS. Yoav holds a B.Sc. from Tel Aviv University and a Ph.D. from the University of Washington.
Abstract: Various visual reasoning capacities remain elusive even for large state-of-the-art models. This includes fairly basic spatial reasoning and abstraction abilities, revealing a gap between model capabilities and what is easy for humans that is not entirely surprising (or, to some, perhaps surprising). In this talk, I will describe several focused benchmarks that reveal fundamental deficiencies in the reasoning of contemporary high-performance models.
Manling Li (Northwestern University)
Title: From Language Models to Agent Models: Planning and Reasoning with Physical World Knowledge
Bio: Manling Li is a postdoc at Stanford University and an incoming Assistant Professor at Northwestern University. She obtained her PhD in Computer Science from the University of Illinois Urbana-Champaign in 2023. She works at the intersection of language, vision, and robotics. Her work on multimodal knowledge extraction won the ACL'20 Best Demo Paper Award, her work on AI for Science won the NAACL'21 Best Demo Paper Award, and her work on controlling LLMs won the ACL'24 Outstanding Paper Award. She was a recipient of the Microsoft Research PhD Fellowship in 2021, an EECS Rising Star in 2022, and a DARPA Riser in 2022, among other honors. She has served on the organizing committees of ACL 2025 and EMNLP 2024 and has delivered tutorials on multimodal knowledge at IJCAI'23, CVPR'23, NAACL'22, AAAI'21, ACL'21, and other venues. Additional information is available at https://limanling.github.io/.
Abstract: While Large Language Models excel in language processing, Large Agent Models are designed to interact with the environment. This transition poses significant challenges in understanding lower-level visual details and in long-horizon reasoning for effective goal interpretation and decision-making. Despite the impressive performance of LLMs/VLMs on various benchmarks, these models perceive images as bags of words (semantic concepts): they use semantic understanding as a shortcut but lack the ability to recognize geometric structures or solve spatial problems such as mazes. To interact with the physical world, we focus on two dimensions: (1) From high-level semantic to low-level geometric understanding: we introduce a low-level visual description language that serves as geometric tokens, allowing the abstraction of multimodal low-level geometric structures. (2) From fast thinking to slow thinking: we propose to quantify long-horizon reasoning by incorporating Markov Decision Process (MDP) based decision-making. The key difference between language models and agent models lies in their decision-making capabilities. This fundamental difference necessitates a shift in how we approach the development of large agent models, focusing on both geometric understanding and long-term planning to create more capable embodied AI agents.
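To illustrate the MDP framing of long-horizon decision-making mentioned above, here is a minimal value-iteration sketch on a toy grid maze. The grid, reward, and discount factor are hypothetical, and this is not the speaker's model.

```python
# A minimal value-iteration sketch on a toy grid maze, illustrating the MDP view
# of long-horizon decision-making (states = free cells, actions = four moves,
# reward 1 for reaching the goal cell, gamma discounting long-horizon returns).

GRID = [
    "....#",
    ".##.#",
    "...#.",
    "#..G.",
]
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
GAMMA = 0.95

cells = {(r, c) for r, row in enumerate(GRID) for c, ch in enumerate(row) if ch != "#"}
goal = next((r, c) for r, row in enumerate(GRID) for c, ch in enumerate(row) if ch == "G")

def step(state, move):
    """Deterministic transition: bumping into a wall leaves the agent in place."""
    nxt = (state[0] + MOVES[move][0], state[1] + MOVES[move][1])
    return nxt if nxt in cells else state

# Value iteration: V(s) = max_a [ R(s, a) + gamma * V(s') ]
V = {s: 0.0 for s in cells}
for _ in range(100):
    V = {
        s: 0.0 if s == goal else max(
            (1.0 if step(s, a) == goal else 0.0) + GAMMA * V[step(s, a)]
            for a in MOVES
        )
        for s in cells
    }

# Greedy policy with respect to the converged values.
policy = {
    s: max(MOVES, key=lambda a: (1.0 if step(s, a) == goal else 0.0) + GAMMA * V[step(s, a)])
    for s in cells if s != goal
}
print(policy[(0, 0)])  # first move of an optimal path from the top-left corner
```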
Schedule
Organizing Committee
Parisa Kordjamshidi, Michigan State University, kordjams@msu.edu
Xin Eric Wang, University of California Santa Cruz, xwang366@ucsc.edu
Yue Zhang, Michigan State University, zhan1624@msu.edu
Ziqiao Ma, University of Michigan, marstin@umich.edu
Mert Inan, Northeastern University, inan.m@northeastern.edu
Advisory Committee
Raymond J. Mooney, The University of Texas at Austin
Joyce Y. Chai, University of Michigan
Anthony G. Cohn, University of Leeds
Program Committee (Alphabetical Order)
Meta
Bosch
Army Research Lab
Google DeepMind
Oregon State University
University of California Santa Barbara
University of Washington
University of North Carolina at Chapel Hill
University of California Berkeley
University of Groningen
University of Texas Health Science Center at Houston
Northwestern University
University of Illinois Chicago
Heidelberg University
Google DeepMind
University of Gothenburg
Army Research Lab
Stanford University
Amazon Alexa AI
University of Adelaide
University of Michigan
University of California Santa Cruz
Ethics Committee
University of Maryland Baltimore County
University of Michigan