Aim and Scope
Leveraging the foundation built in the prior workshops SpLU-RoboNLP 2023, SpLU-RoboNLP 2021, SpLU 2020, SpLU-RoboNLP 2019, SpLU 2018, and RoboNLP 2017, we propose the fourth combined workshop on Spatial Language Understanding and Grounded Communication for Robotics. Natural language communication with general-purpose embodied robots has long been a dream inspired by science fiction, and natural language interfaces have the potential to make robots accessible to a wider range of users. Achieving this goal requires continuous improvement of existing technologies and development of new ones for linking language to perception and action in the physical world. In particular, given the rise of large vision and language generative models, spatial language understanding and natural interaction have become even more exciting topics to explore. This joint workshop aims to bring together the perspectives of researchers working on physical robot systems with human users, simulated embodied environments, multimodal interaction, and spatial language understanding, and to forge collaborations among them.
Topics of Interest
We welcome original research including, but not limited to, computational models, benchmarks, evaluation metrics, analyses, surveys, and position papers on the following topics:
- Deployment of Large Language Models for Situated Dialogue and Language Grounding
- Spatial Reasoning with Large Language Models
- Aligning and Translating Language to Situated Actions
- Evaluation Metrics for Language Grounding and Human-Robot Communication
- Human-Computer Interactions Through Natural or Structured Language
- Instruction Understanding and Spatial Reasoning based on Multimodal Information for Navigation, Articulation, and Manipulation
- Interactive Situated Dialogue for Physical Tasks
- Language-based Game Playing for Grounding
- Spatial Language and Skill Learning via Grounded Dialogue
- (Spatial) Language Generation for Embodied Tasks
- (Spatially-) Grounded Knowledge Representations
- Spatial Reasoning in Image and Video Diffusion Models
- Qualitative Spatial Representations and Neuro-symbolic Modeling
- Utilization and Limitations of Large (Multimodal-)Language Models in Spatial Understanding and Grounded Communication
Call for Papers
We cordially invite authors to contribute by submitting their long papers and short papers.
Long Papers
Technical papers: 8 pages excluding references; 1 additional page allowed for the camera-ready version.
Short Papers
Position statements describing previously unpublished work or demos: 4 pages excluding references; 1 additional page allowed for the camera-ready version.
Non-Archival Option
ACL workshops are traditionally archival. To allow dual submission of work to SpLU-RoboNLP 2024 and to *ACL Findings or other conferences/journals, we also offer a non-archival track. Space permitting, authors of these submissions will still present their work at the workshop, and the papers will be hosted on the workshop website, but they will not be included in the official proceedings. Please apply the ACL format and submit through OpenReview, but indicate that the paper is a cross-submission (non-archival) at the bottom of the submission form.
Submission Instructions
All submissions must comply with the ACL formatting guidelines and code of ethics. Papers should be submitted electronically through OpenReview. Peer review will be double-blind.
Style and Formatting
ACL Template
Submissions Website
OpenReview
Important Dates
- Submission Open: 5 February 2024 (Anywhere on Earth)
- Submission Deadline: 17 May 2024 (Anywhere on Earth)
- Extended Submission Deadline: 27 May 2024 (Anywhere on Earth)
- Notification of Acceptance: 17 June 2024 (Anywhere on Earth)
- Extended Notification of Acceptance: 24 June 2024 (Anywhere on Earth)
- Camera Ready Deadline: 30 June 2024 (GMT)
- Workshop Day: 16 August 2024 (co-located with ACL 2024)
Invited Speakers
Inderjeet Mani (Formerly, Yahoo! Labs, Georgetown University)
Title: Grounding Spatial Natural Language with Generative AI
Bio: Inderjeet Mani is a research scientist specializing in NLP and AI. His research in NLP has included automatic summarization, narrative modeling, and temporal and spatial information extraction from text. He has also contributed to research on question answering, bioinformatics, ontologies, machine translation, geographical information systems, and multimedia information processing. His publications include a hundred-odd scientific papers (totaling over 13,000 citations), six scholarly books, and a science-fiction thriller, along with nearly fifty shorter literary pieces, including a popular essay about robot readers. Now semi-retired, he has held affiliations with Georgetown University (Associate Professor), Yahoo (Senior Director), Cambridge University (Visiting Fellow), MITRE (Senior Principal Scientist), Brandeis University (Visiting Scholar), MIT (Research Affiliate), and the Indian Institute of Science (Consulting Scientist). He has served on the editorial boards of the journals Computational Linguistics (2002-2004) and Natural Language Engineering (2011-2015) and has reviewed for numerous AI journals and conferences.
Abstract: When communicating in NL about spatial relations and movement, humans seldom use precise geometries and equations of motion, instead relying on a qualitative understanding. Qualitative and quantitative grounding of spatial relations in NL can together provide more flexible and higher-level communication with robots. I begin with an outline of a spatial semantics for natural language based on qualitative reasoning formalisms. Generative AI systems can extract spatial relations based on these formalisms from visual and/or text data, and generate NL descriptions based on them. They also appear to carry out rudimentary reasoning using these formalisms, providing large-scale, albeit noisy, training data. However, for the results to be convincing, formal reasoning needs to be integrated more tightly with neural architectures, beginning with improved tokenization. Evaluations need to go well beyond SOTA benchmarks and must extend to extrinsic tasks of interest to robotics.
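For readers unfamiliar with qualitative spatial formalisms, the sketch below maps two 2D bounding boxes to RCC-8-style relations. It is a minimal illustration only, not the speaker's system; the function names and box format are hypothetical choices.

```python
# A minimal, illustrative sketch of mapping quantitative geometry (2D bounding
# boxes) to RCC-8-style qualitative spatial relations, in the spirit of the
# qualitative formalisms discussed in the abstract. The relation names follow
# RCC-8; everything else (function names, box format) is a hypothetical choice.

from typing import Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)

def _contains(a: Box, b: Box) -> bool:
    """True if box a contains box b (not necessarily strictly)."""
    return a[0] <= b[0] and a[1] <= b[1] and a[2] >= b[2] and a[3] >= b[3]

def _touches_boundary(outer: Box, inner: Box) -> bool:
    """True if the inner box shares at least one edge with the outer box."""
    return any(outer[i] == inner[i] for i in range(4))

def rcc8(a: Box, b: Box) -> str:
    """Classify the qualitative relation between two axis-aligned boxes."""
    ox = min(a[2], b[2]) - max(a[0], b[0])   # overlap extent along x
    oy = min(a[3], b[3]) - max(a[1], b[1])   # overlap extent along y
    if ox < 0 or oy < 0:
        return "DC"    # disconnected
    if ox == 0 or oy == 0:
        return "EC"    # externally connected (boundaries touch)
    if a == b:
        return "EQ"
    if _contains(a, b):
        return "TPPi" if _touches_boundary(a, b) else "NTPPi"  # b is inside a
    if _contains(b, a):
        return "TPP" if _touches_boundary(b, a) else "NTPP"    # a is inside b
    return "PO"        # partial overlap

# e.g. rcc8((0, 0, 2, 2), (3, 3, 4, 4)) -> "DC"
#      rcc8((0, 0, 4, 4), (1, 1, 2, 2)) -> "NTPPi"
```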
Malihe Alikhani (Northeastern University)
Title: From Ambiguity to Clarity: Navigating Uncertainty in Human-Machine Conversations
Bio: Malihe Alikhani is an assistant professor of AI and social justice at the Khoury School of Computer Science, Northeastern University. She is affiliated with the Northeastern Ethics Institute as well as the Institute for Experiential AI. Her research interests center on using representations of communicative structure, machine learning, and cognitive science to design equitable and inclusive NLP systems for critical applications such as education, health, and social justice. She has designed several models for sign language understanding and generation, dialogue systems for deaf and hard-of-hearing users, and AI systems for the evaluation of speech impairment, and she has worked on projects involving fairness and limited data. Her work has received multiple best paper awards at ACL 2021, UAI 2022, INLG 2021, UMAP 2022, and EMNLP 2023 and has been supported by DARPA, NIH, CDC, Google, and Amazon.
Abstract: This talk delves into the intricacies of uncertainty in human-machine dialogue, mainly focusing on the challenges and solutions related to ambiguities arising from impoverished contextual representations. We examine how linguistically informed context representations can mitigate data-related uncertainty in a deployed dialogue system similar to Alexa. We acknowledge that certain types of data-related uncertainty are unavoidable and investigate the capabilities of modern billion-scale language models in representing this form of uncertainty in conversations. Shifting our focus to epistemic uncertainty arising from misaligned background knowledge between humans and machines, we explore strategies for quantifying and reducing this form of uncertainty. Our discussion encompasses various facets of human-machine convergence, including Theory of Mind, fairness, and pragmatics. By leveraging machine learning theory and cognitive science insights, we aim to quantify epistemic uncertainty and propose algorithms that improve grounding between humans and machines. This exploration sheds light on the theoretical underpinnings of uncertainty in dialogue systems and offers practical solutions for improving human-machine communication.
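As a purely illustrative companion to the abstract's discussion of quantifying epistemic uncertainty, the sketch below uses a standard ensemble-based decomposition of predictive entropy into data-related and epistemic components. It is not the speaker's method, and the ensemble of response distributions is a hypothetical stand-in.

```python
# A small, illustrative sketch of one common way to separate data (aleatoric)
# from knowledge (epistemic) uncertainty: average per-model entropy vs. the
# mutual information across an ensemble of response distributions. The
# ensemble is a hypothetical stand-in, e.g. several samples or checkpoints of
# a dialogue model scoring the same set of candidate replies.

import math
from typing import List, Tuple

def entropy(p: List[float]) -> float:
    return -sum(x * math.log(x) for x in p if x > 0)

def uncertainty_decomposition(ensemble: List[List[float]]) -> Tuple[float, float, float]:
    """ensemble[m][k] = probability model m assigns to candidate reply k."""
    k = len(ensemble[0])
    mean = [sum(p[i] for p in ensemble) / len(ensemble) for i in range(k)]
    total = entropy(mean)                                    # predictive entropy
    aleatoric = sum(entropy(p) for p in ensemble) / len(ensemble)
    epistemic = total - aleatoric                            # mutual information (>= 0)
    return total, aleatoric, epistemic

# Models that agree on a flat distribution -> high aleatoric, low epistemic;
# models that confidently disagree -> high epistemic, as below.
print(uncertainty_decomposition([[0.9, 0.1], [0.1, 0.9]]))
```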
Daniel Fried (Carnegie Mellon University)
Title: Benchmarks and Tree Search for Multimodal LLM Web Agents
Bio: Daniel Fried is an assistant professor in the Language Technologies Institute at CMU, and a research scientist at Meta AI. His research focuses on language grounding, interaction, and applied pragmatics, with a particular focus on language interfaces such as grounded instruction following and code generation. Previously, he was a postdoc at Meta AI and the University of Washington and completed a PhD at UC Berkeley. His research has been supported by an Okawa Research Award, a Google PhD Fellowship and a Churchill Fellowship.
Abstract: LLMs have the potential to help automate tasks on the web, and thereby help people better interact with the digital world. We present benchmarks, WebArena and VisualWebArena, which evaluate the ability of LLMs to act as agentive, language-based interfaces to carry out everyday tasks on the web. Succeeding on these tasks requires systems to ground language into the visual and semi-structured content of web pages, carry out common sub-tasks which are shared across tasks and web pages, and make decisions under uncertainty and partial observability -- all challenges for current agentive methods built on top of (vision and) language models. We also present a tree search method which addresses some of these challenges, obtaining strong improvements across state-of-the-art agents on these benchmarks.
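For readers who want a concrete picture of what tree search over web-agent actions can look like, here is a generic best-first search skeleton. It is only a sketch under assumed stubs (action proposer, environment step, value function), not the method presented in the talk.

```python
# A generic best-first tree-search skeleton over candidate agent actions, meant
# only to illustrate the kind of search the abstract refers to. The environment,
# action proposer, and value function are hypothetical stubs (e.g., the value
# function could be a vision-language model scoring a page state against the goal).

import heapq
import itertools
from typing import Any, Callable, List

State, Action = Any, str

def tree_search(
    root: State,
    propose_actions: Callable[[State], List[Action]],  # e.g., LLM-proposed clicks/typing
    step: Callable[[State, Action], State],             # environment transition
    value: Callable[[State], float],                    # higher = closer to task success
    is_done: Callable[[State], bool],
    max_expansions: int = 50,
    branching: int = 3,
) -> List[Action]:
    """Return the action sequence leading to the best state found."""
    counter = itertools.count()  # tie-breaker so heapq never compares raw states
    frontier = [(-value(root), next(counter), root, [])]
    best_state, best_trace = root, []
    for _ in range(max_expansions):
        if not frontier:
            break
        neg_v, _, state, trace = heapq.heappop(frontier)
        if -neg_v > value(best_state):
            best_state, best_trace = state, trace
        if is_done(state):
            return trace
        for action in propose_actions(state)[:branching]:
            child = step(state, action)
            heapq.heappush(frontier, (-value(child), next(counter), child, trace + [action]))
    return best_trace
```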
Yu Su (Ohio State University)
Title: Web Agents: A New Frontier for Embodied Agents
Bio: Yu Su is a Distinguished Assistant Professor at the Ohio State University and a researcher at Microsoft. He co-directs the OSU NLP group. He has broad interests in artificial and biological intelligence, with a primary interest in the role of language as a vehicle of thought and communication. His recent interests center around language agents with a series of well-received works such as Mind2Web, SeeAct, HippoRAG, LLM-Planner, and MMMU. His work has received multiple paper awards, including the Best Student Paper Award at CVPR 2024 and Outstanding Paper Award at ACL 2023.
Abstract: The digital world, e.g., the World Wide Web, provides a powerful yet underexplored form of embodiment for AI agents, where perception involves understanding visual renderings and the underlying markup languages, and effectors are mice and keyboards. Empowered by large language models (LLMs), web agents are rapidly rising as a new frontier for embodied agents that provide both the breadth and depth needed for driving agent development. On the other hand, web agents can potentially lead to many practical applications, thus raising substantial commercial interests as well. In this talk, I will present an overview of web agents, covering the history, the promises, and the challenges. I will then give an in-depth discussion on multimodality and grounding. Finally, I will conclude with an outlook of promising future directions, including planning, synthetic data, and safety.
Yoav Artzi (Cornell University)
Title: Back to Fundamentals: Challenges in Visual Reasoning
Bio: Yoav Artzi is an Associate Professor in the Department of Computer Science and Cornell Tech at Cornell University, arXiv's associate faculty director, and a researcher at ASAPP. His research focuses on developing models and learning methods for natural language understanding and generation in interactive systems. He received an NSF CAREER award, and his work was acknowledged by awards and honorable mentions at ACL, EMNLP, NAACL, and IROS. Yoav holds a B.Sc. from Tel Aviv University and a Ph.D. from the University of Washington.
Abstract: Various visual reasoning capacities remain elusive even for large state-of-the-art models. This includes fairly basic spatial reasoning and abstraction abilities, revealing a gap between model capabilities and what is easy for humans that is not entirely surprising (or, to some, perhaps surprising). In this talk, I will describe several focused benchmarks that reveal fundamental deficiencies in the reasoning of contemporary high-performance models.
Manling Li (Northwestern University)
Title: From Language Models to Agent Models: Planning and Reasoning with Physical World Knowledge
Bio: Manling Li is a postdoc at Stanford University and an incoming Assistant Professor at Northwestern University. She obtained her PhD in Computer Science from the University of Illinois Urbana-Champaign in 2023. She works at the intersection of language, vision, and robotics. Her work on multimodal knowledge extraction won the ACL'20 Best Demo Paper Award, her work on AI for Science won the NAACL'21 Best Demo Paper Award, and her work on controlling LLMs won the ACL'24 Outstanding Paper Award. She was a recipient of the Microsoft Research PhD Fellowship in 2021, an EECS Rising Star in 2022, and a DARPA Riser in 2022, among other honors. She has served on the organizing committees of ACL 2025 and EMNLP 2024 and has delivered tutorials on multimodal knowledge at IJCAI'23, CVPR'23, NAACL'22, AAAI'21, ACL'21, and other venues. Additional information is available at https://limanling.github.io/.
Abstract: While Large Language Models excel in language processing, Large Agent Models are designed to interact with the environment. This transition poses significant challenges in understanding lower-level visual details and in long-horizon reasoning for effective goal interpretation and decision-making. Despite the impressive performance of LLMs/VLMs on various benchmarks, these models perceive images as bags of words (semantic concepts): they use semantic understanding as a shortcut but lack the ability to recognize geometric structures or solve spatial problems such as mazes. To interact with the physical world, we focus on two dimensions: (1) From high-level semantic to low-level geometric understanding: we introduce a low-level visual description language that serves as geometric tokens, allowing the abstraction of multimodal low-level geometric structures. (2) From fast thinking to slow thinking: we propose to quantify long-horizon reasoning by incorporating Markov Decision Process (MDP) based decision-making. The key difference between language models and agent models lies in their decision-making capabilities. This fundamental difference necessitates a shift in how we approach the development of large agent models, focusing on both geometric understanding and long-term planning to create more capable embodied AI agents.
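To illustrate the MDP framing of long-horizon decision-making mentioned above, here is a minimal value-iteration sketch on a toy grid maze. The grid, reward, and discount factor are hypothetical, and this is not the speaker's model.

```python
# A minimal value-iteration sketch on a toy grid maze, illustrating the MDP view
# of long-horizon decision-making (states = free cells, actions = four moves,
# reward 1 for reaching the goal cell, gamma discounting long-horizon returns).

GRID = [
    "....#",
    ".##.#",
    "...#.",
    "#..G.",
]
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
GAMMA = 0.95

cells = {(r, c) for r, row in enumerate(GRID) for c, ch in enumerate(row) if ch != "#"}
goal = next((r, c) for r, row in enumerate(GRID) for c, ch in enumerate(row) if ch == "G")

def step(state, move):
    """Deterministic transition: bumping into a wall leaves the agent in place."""
    nxt = (state[0] + MOVES[move][0], state[1] + MOVES[move][1])
    return nxt if nxt in cells else state

# Value iteration: V(s) = max_a [ R(s, a) + gamma * V(s') ]
V = {s: 0.0 for s in cells}
for _ in range(100):
    V = {
        s: 0.0 if s == goal else max(
            (1.0 if step(s, a) == goal else 0.0) + GAMMA * V[step(s, a)]
            for a in MOVES
        )
        for s in cells
    }

# Greedy policy with respect to the converged values.
policy = {
    s: max(MOVES, key=lambda a: (1.0 if step(s, a) == goal else 0.0) + GAMMA * V[step(s, a)])
    for s in cells if s != goal
}
print(policy[(0, 0)])  # first move of an optimal path from the top-left corner
```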
Schedule
Organizing Committee
Parisa Kordjamshidi, Michigan State University, kordjams@msu.edu
Xin Eric Wang, University of California Santa Cruz, xwang366@ucsc.edu
Yue Zhang, Michigan State University, zhan1624@msu.edu
Ziqiao Ma, University of Michigan, marstin@umich.edu
Mert Inan, Northeastern University, inan.m@northeastern.edu
Advisory Committee
Raymond J. Mooney, The University of Texas at Austin
Joyce Y. Chai, University of Michigan
Anthony G. Cohn, University of Leeds
Program Committee (Alphabetical Order)
Meta
Bosch
Army Research Lab
Google DeepMind
Oregon State University
University of California Santa Barbara
University of Washington
University of North Carolina at Chapel Hill
University of California Berkeley
University of Groningen
University of Texas Health Science Center at Houston
Northwestern University
University of Illinois Chicago
Heidelberg University
Google DeepMind
University of Gothenburg
Army Research Lab
Stanford University
Amazon Alexa AI
University of Adelaide
University of Michigan
University of California Santa Cruz
Ethics Committee
University of Maryland Baltimore County
University of Michigan