Publications

  • 2025

    Maps from Motion (MfM): Generating 2D Semantic Maps from Sparse Multi-view Images

    Matteo Toso, Stefano Fiorini, Stuart James, Alessio Del Bue

    International Conference on 3D Vision (3DV'25) | Singapore

    World-wide detailed 2D maps require enormous collective efforts. OpenStreetMap is the result of 11 million registered users manually annotating the GPS location of over 1.75 billion entries, including distinctive landmarks and common urban objects. At the same time, manual annotations can include errors and are slow to update, limiting the map's accuracy. Maps from Motion (MfM) is a step towards automating this time-consuming map-making procedure by computing 2D maps of semantic objects directly from a collection of uncalibrated multi-view images. From each image, we extract a set of object detections, and estimate their spatial arrangement in a top-down local map centered in the reference frame of the camera that captured the image. Aligning these local maps is not a trivial problem, since they provide incomplete, noisy fragments of the scene, and matching detections across them is unreliable because of repeated patterns and the limited appearance variability of urban objects. We address this with a novel graph-based framework that encodes the spatial and semantic distribution of the objects detected in each image, and learns how to combine them to predict the objects' poses in a global reference system, while taking into account all possible detection matches and preserving the topology observed in each image. Despite the complexity of the problem, our best model achieves global 2D registration with an average accuracy within 4 meters (i.e., below GPS accuracy) even on sparse sequences with strong viewpoint change, on which COLMAP has an 80% failure rate. We provide extensive evaluation on synthetic and real-world data, showing how the method obtains a solution even in scenarios where standard optimization techniques fail.
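
    As an illustrative aside, the geometric core of registering a local object map into a global frame is a 2D rigid (Kabsch/Procrustes) alignment. The sketch below is a minimal NumPy version of that building block only; MfM itself learns the detection matching and the alignment jointly with a graph neural network.

    # Minimal 2D rigid alignment between matched object positions.
    import numpy as np

    def align_2d(local_pts, global_pts):
        """local_pts, global_pts: (N, 2) arrays of matched positions."""
        mu_l, mu_g = local_pts.mean(0), global_pts.mean(0)
        H = (local_pts - mu_l).T @ (global_pts - mu_g)   # 2x2 cross-covariance
        U, _, Vt = np.linalg.svd(H)
        d = np.sign(np.linalg.det(Vt.T @ U.T))           # guard against reflections
        R = Vt.T @ np.diag([1.0, d]) @ U.T
        t = mu_g - R @ mu_l
        return R, t                                      # global ~ (R @ local.T).T + t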
  • 2024

    Re-assembling the past: The RePAIR dataset and benchmark for real world 2D and 3D puzzle solving

    Theodore Tsesmelis, Luca Palmieri, Marina Khoroshiltseva, Adeela Islam, Gur Elkin, Ofir Itzhak Shahar, Gianluca Scarpellini, Stefano Fiorini, Yaniv Ohayon, Nadav Alali, Sinem Aslan, Pietro Morerio, Sebastiano Vascon, Elena Gravina, Maria Cristina Napolitano, Giuseppe Scarpati, Gabriel Zuchtriegel, Alexandra Spühler, Michel E. Fuchs, Stuart James, Ohad Ben-Shahar, Marcello Pelillo, Alessio Del Bue

    Conference on Neural Information Processing Systems (NeurIPS'24) Datasets and Benchmarks Track | Vancouver, Canada

    This paper proposes the RePAIR dataset, a challenging benchmark for testing modern computational and data-driven methods on puzzle-solving and reassembly tasks. Our dataset has unique properties that are uncommon in current benchmarks for 2D and 3D puzzle solving. The fragments and fractures are realistic, caused by the collapse of a fresco during a World War II bombing at the Pompeii archaeological park. The fragments are also eroded and have missing pieces with irregular shapes and different dimensions, further challenging the reassembly algorithms. The dataset is multi-modal, providing high-resolution images with characteristic pictorial elements, detailed 3D scans of the fragments, and metadata annotated by archaeologists. Ground truth has been generated through several years of unceasing fieldwork, including the excavation and cleaning of each fragment, followed by the manual reassembly by archaeologists of a subset of 1,000 pieces among the 16,000 available. After digitizing all the fragments in 3D, a benchmark was prepared to challenge current reassembly and puzzle-solving methods, which often address more simplistic synthetic scenarios. The tested baselines show that there clearly exists a gap to fill in solving this computationally complex problem.
  • GANzzle++: Generative approaches for jigsaw puzzle solving as local to global assignment in latent spatial representations

    Davide Talon, Alessio Del Bue, and Stuart James

    Pattern Recognition Letters |

    Jigsaw puzzles are a popular and enjoyable pastime that humans can easily solve, even with many pieces. However, solving a jigsaw is a combinatorial problem, and the space of possible solutions grows exponentially with the number of pieces, making pairwise approaches intractable. In contrast to the classical pairwise local matching of pieces based on edge heuristics, we estimate an approximate solution image, i.e., a mental image, of the puzzle and exploit it to guide the placement of pieces as a piece-to-global assignment problem. Therefore, from unordered pieces, we consider conditioned generation approaches, including Generative Adversarial Network (GAN) models, Slot Attention (SA) and Vision Transformers (ViT), to recover the solution image. Given the generated solution representation, we cast jigsaw solving as a 1-to-1 assignment matching problem using Hungarian attention, which places pieces in corresponding positions in the global solution estimate. Results show that the newly proposed GANzzle-SA and GANzzle-ViT benefit from the early fusion strategy where pieces are jointly compressed and gathered for global structure recovery. A single deep learning model generalizes to puzzles of different sizes and improves performance by a large margin. Evaluated on PuzzleCelebA and PuzzleWikiArts, our approaches narrow the gap between deep learning strategies and optimization-based classic puzzle solvers.
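
    To make the piece-to-global step concrete, here is a toy version of the 1-to-1 assignment with random stand-in embeddings: pieces are matched to cells of the estimated solution image by solving a linear assignment. The paper's Hungarian attention is a differentiable counterpart of this non-differentiable solver.

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    rng = np.random.default_rng(0)
    piece_emb = rng.normal(size=(16, 64))   # 16 pieces, hypothetical 64-d embeddings
    cell_emb = rng.normal(size=(16, 64))    # 16 cells of the generated "mental image"

    a = piece_emb / np.linalg.norm(piece_emb, axis=1, keepdims=True)
    b = cell_emb / np.linalg.norm(cell_emb, axis=1, keepdims=True)
    cost = -(a @ b.T)                       # negative cosine similarity

    rows, cols = linear_sum_assignment(cost)  # optimal 1-to-1 placement of pieces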
  • Positional diffusion: Graph-based diffusion models for set ordering

    Francesco Giuliari, Gianluca Scarpellini, Stefano Fiorini, Stuart James, Pietro Morerio, Yiming Wang, Alessio Del Bue

    Pattern Recognition Letters |

    Positional reasoning is the process of ordering an unsorted set of parts into a consistent structure. To address this problem, we present Positional Diffusion, a plug-and-play graph formulation with Diffusion Probabilistic Models. Using a diffusion process, we add Gaussian noise to the set elements' positions and map them to random positions in a continuous space. Positional Diffusion learns to reverse the noising process and recover the original positions through an Attention-based Graph Neural Network. To evaluate our method, we conduct extensive experiments on three different tasks and seven datasets, comparing our approach against the state-of-the-art methods for visual puzzle-solving, sentence ordering, and room arrangement, demonstrating that our method outperforms long-lasting research on puzzle solving, with up to +18% improvement over the second-best deep learning method, and performs on par with the state-of-the-art methods on sentence ordering and room arrangement. Our work highlights the suitability of diffusion models for ordering problems and proposes a novel formulation and method for solving various ordering tasks. We release our code at https://github.com/IIT-PAVIS/Positional_Diffusion
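
    The forward half of the method can be sketched in a few lines: element positions are progressively mapped toward Gaussian noise, and the attention-based GNN is trained to reverse this. A minimal NumPy illustration with a standard linear noise schedule (the schedule values are assumptions, not the paper's configuration):

    import numpy as np

    rng = np.random.default_rng(0)
    T = 1000
    betas = np.linspace(1e-4, 0.02, T)        # illustrative linear schedule
    alpha_bar = np.cumprod(1.0 - betas)

    x0 = rng.uniform(-1, 1, size=(8, 2))      # 8 set elements, 2-D positions
    t = 500
    eps = rng.normal(size=x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1 - alpha_bar[t]) * eps
    # xt is the noised input the network learns to denoise back towards x0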
  • ArtAI4DS: AI Art and its Empowering Role in Digital Storytelling

    Teresa Fernandes, Valentina Nisi, Nuno Nunes, and Stuart James

    IFIP International Conference on Entertainment Computing (IFIP-ICEC'24) | Manaus/Amazonas, Brazil

    In an era of global interconnections, storytelling is a compelling medium for fostering understanding, building connections, and facilitating cultural exchange. Throughout history, visual imagery has been used to enrich narratives. However, this has been a privilege for those with artistic skills. Artificial Intelligence, specifically Generative AI, has the potential to democratize the process, allowing individuals to bring their narratives to life visually, regardless of their artistic prowess. To address this challenge, we developed an AI-powered tool called ArtAI4DS (Art AI for Digital Storytelling) that employs generative images (i.e., from Stable Diffusion) created from story-derived keywords. ArtAI4DS emerged from a research process starting with a `Wizard of Oz' pre-workshop, which informed the structure of a subsequent co-design workshop. Here, participants' hand-drawn images were compared with AI-generated ones, providing insights into user preferences and tool efficacy. ArtAI4DS then went through four iterative prototypes, drawing valuable insights from various participants. The tool's refinement process balanced the intricate duality of human creativity and technological innovation, culminating in an artistic expression platform that transforms stories into vivid and captivating images. The final tool, evaluated through user interviews and the AttrakDiff questionnaire, showcases its potential as an engaging platform for transforming narratives, with solid user affirmation of its motivational and emotional resonance.
  • Interactive Digital Storytelling Navigating the Inherent Currents of the Diasporic Mind

    Valentina Nisi, Paulo Bala, Miguel Pessoa, Stuart James, Nuno Nunes

    International Conference on Interactive Digital Storytelling (ICIDS'24) | Manaus/Amazonas, Brazil

    Due to a recent increase in conflicts, natural disasters, and economic crises, a growing wave of migrant populations has been searching for asylum in Europe. For this population of asylum seekers, the migration process, like currents and rapids, can be dangerous, uneven, and violent, and the integration into their host communities can add to the preexisting trauma. Extending HCI's increasing attention to the caring understanding of human life values, this paper presents initial research focused on refugees' storytelling activities to support their well-being. Here, we describe and discuss the results from a set of studies with our community partner to design and refine a bespoke interactive digital storytelling authoring tool. This study aims to promote social cohesion and equal participation in European society by using Digital Storytelling to allow migrant communities to share and connect their stories and experiences. The authors contribute a novel digital storytelling prototype tool and the discussion and reflections stemming from the user-centered design approach. The insights gained from this work are relevant for interaction designers and researchers seeking to support vulnerable populations through Interactive Digital Storytelling.
  • 6DGS: 6D Pose Estimation from a Single Image and a 3D Gaussian Splatting Model

    Matteo Bortolon, Theodore Tsesmelis, Stuart James, Fabio Poiesi, Alessio Del Bue

    European Conference on Computer Vision (ECCV'24) | Milan, Italy

    We propose 6DGS to estimate the camera pose of a target RGB image given a 3D Gaussian Splatting (3DGS) model representing the scene. 6DGS avoids the iterative process typical of analysis-by-synthesis methods (e.g., iNeRF), which also require an initialization of the camera pose in order to converge. Instead, our method estimates a 6DoF pose by inverting the 3DGS rendering process. Starting from the object surface, we define a radiant Ellicell that uniformly generates rays departing from each of the ellipsoids that parameterize the 3DGS model. Each Ellicell ray is associated with the rendering parameters of its ellipsoid, which are in turn used to obtain the best bindings between the target image pixels and the cast rays. These pixel-ray bindings are then ranked to select the best scoring bundle of rays, whose intersection provides the camera center and, in turn, the camera rotation.
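
    The final geometric step, recovering the camera centre from the selected bundle of rays, amounts to a small least-squares problem: find the point closest to all rays. A generic NumPy sketch of that step alone (the learned pixel-ray scoring that selects the bundle is the paper's contribution and is not shown):

    import numpy as np

    def closest_point_to_rays(origins, dirs):
        """origins: (N, 3) ray origins; dirs: (N, 3) unit directions."""
        A = np.zeros((3, 3))
        b = np.zeros(3)
        for o, d in zip(origins, dirs):
            P = np.eye(3) - np.outer(d, d)   # projector orthogonal to the ray
            A += P
            b += P @ o
        return np.linalg.solve(A, b)         # least-squares "intersection"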
  • IFFNeRF: Initialisation Free and Fast 6DoF pose estimation from a single image and a NeRF model

    Matteo Bortolon, Theodore Tsesmelis, Stuart James, Fabio Poiesi, Alessio Del Bue

    International Conference on Robotics and Automation (ICRA'24) | Yokohama, Japan

    We introduce IFFNeRF to estimate the six degrees-of-freedom (6DoF) camera pose of a given image, building on the Neural Radiance Fields (NeRF) formulation. IFFNeRF is specifically designed to operate in real-time and eliminates the need for an initial pose guess proximate to the sought solution. IFFNeRF utilizes the Metropolis-Hastings algorithm to sample surface points from within the NeRF model. From these sampled points, we cast rays and deduce the color for each ray through pixel-level view synthesis. The camera pose can then be estimated as the solution to a Least Squares problem by selecting correspondences between the query image and the resulting bundle. We facilitate this process through a learned attention mechanism, bridging the query image embedding with the embedding of parameterized rays, thereby matching rays pertinent to the image. Through synthetic and real evaluation settings, we show that our method can improve angular and translation accuracy by 80.1% and 67.3%, respectively, compared to iNeRF, while performing at 34 fps on consumer hardware and not requiring an initial pose guess.
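
    For readers unfamiliar with the sampler, a generic Metropolis-Hastings loop with a Gaussian random-walk proposal looks as follows; in IFFNeRF the target density would come from the NeRF model, whereas here a simple stand-in keeps the sketch self-contained.

    import numpy as np

    rng = np.random.default_rng(0)

    def density(x):                          # hypothetical unnormalised target
        return np.exp(-0.5 * np.sum(x ** 2))

    x = np.zeros(3)
    samples = []
    for _ in range(5000):
        proposal = x + 0.5 * rng.normal(size=3)   # symmetric random walk
        if rng.uniform() < density(proposal) / density(x):
            x = proposal                     # accept with MH probability
        samples.append(x)
    samples = np.asarray(samples)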
  • PRAGO: Differentiable multi-view pose optimization from objectness detections

    Matteo Taiana, Matteo Toso, Stuart James, Alessio Del Bue

    International Conference on 3D Vision (3DV'24) | Davos, Switzerland

    Robustly estimating camera poses from a set of images is a fundamental task that remains challenging for differentiable methods, especially in the case of small and sparse camera pose graphs. To overcome this challenge, we propose Pose-refined Rotation Averaging Graph Optimization (PRAGO). From a set of objectness detections on unordered images, our method reconstructs the rotational pose, and in turn, the absolute pose, in a differentiable manner, benefiting from the optimization of a sequence of geometrical tasks. We show how our objectness pose-refinement module in PRAGO is able to refine the inherent ambiguities in pairwise relative pose estimation without removing edges, avoiding early decisions on the viability of graph edges. PRAGO then refines the absolute rotations through iterative graph construction, reweighting the graph edges to compute the final rotational pose, which can be converted into absolute poses using translation averaging. We show that PRAGO is able to outperform non-differentiable solvers on small and sparse scenes extracted from 7-Scenes, achieving a relative improvement of 21% for rotations while achieving similar translation estimates.
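
    As background, the classical (non-differentiable) rotation averaging that PRAGO improves upon can be sketched as a fixed-point iteration: each absolute rotation is re-estimated as the average of its neighbours' predictions through the relative measurements. A toy SciPy version on a fully connected 4-node graph with exact measurements (PRAGO instead refines and reweights the edges differentiably):

    from scipy.spatial.transform import Rotation as Rot

    true = Rot.random(4, random_state=0)              # ground-truth absolute rotations
    rel = {(i, j): true[j] * true[i].inv()            # measurements R_ij = R_j R_i^-1
           for i in range(4) for j in range(4) if i < j}

    est = [Rot.identity() for _ in range(4)]          # node 0 anchored at identity
    for _ in range(20):
        for j in range(1, 4):
            preds = [(rel[(i, j)] if i < j else rel[(j, i)].inv()) * est[i]
                     for i in range(4) if i != j]
            est[j] = Rot.concatenate(preds).mean()    # quaternion averaging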
  • Towards the Reusability and Compositionality of Causal Representations

    Davide Talon, Phillip Lippe, Stuart James, Alessio Del Bue, Sara Magliacane

    Causal Representation Learning Workshop | New Orleans, USA

    Causal Representation Learning (CRL) aims at identifying high-level causal factors and their relationships from high-dimensional observations, e.g., images. While most CRL works focus on learning causal representations in a single environment, in this work we instead propose a first step towards learning causal representations from temporal sequences of images that can be adapted to a new environment, or composed across multiple related environments. In particular, we introduce DECAF, a framework that detects which causal factors can be reused and which need to be adapted from previously learned causal representations. Our approach is based on the availability of intervention targets, which indicate the variables perturbed at each time step. Experiments on three benchmark datasets show that integrating our framework with four state-of-the-art CRL approaches leads to accurate representations in a new environment with only a few samples.
  • 2023

    Inclusive Digital Storytelling: Artificial Intelligence and Augmented Reality to re-centre Stories from the Margins

    Valentina Nisi, Stuart James, Paulo Bala, Alessio Del Bue, Nuno Jardim Nunes

    International Conference on Interactive Digital Storytelling (ICIDS) | Kobe, Japan

    As the concept of the Metaverse becomes a reality, storytelling tools sharpen their teeth to include Artificial Intelligence and Augmented Reality as prominent enabling features. While digitally savvy and privileged populations are well-positioned to use technology, marginalized groups risk being left behind and excluded from societal progress, deepening the digital divide. In this paper, we describe MEMEX, an interactive digital storytelling tool where Artificial Intelligence and Augmented Reality play enabling roles in support of the cultural integration of communities at risk of exclusion. The tool was developed in the context of a three-year EU-funded project, and in this paper we focus on describing its final working prototype with its pilot study.
  • Connected to the people: Social Inclusion & Cohesion in Action through a Cultural Heritage Digital Tool

    Valentina Nisi, Paulo Bala, Vanessa Cesário, Stuart James, Alessio Del Bue, and Nuno Jardim Nunes

    ACM Conference On Computer-Supported Cooperative Work And Social Computing (CSCW) | Minneapolis, USA

  • Positional Diffusion: Ordering Unordered Sets with Diffusion Probabilistic Models

    Francesco Giuliari, Gianluca Scarpellini, Stuart James, Yiming Wang, Alessio Del Bue

    arXiv | preprint

    Positional reasoning is the process of ordering unsorted parts contained in a set into a consistent structure. We present Positional Diffusion, a plug-and-play graph formulation with Diffusion Probabilistic Models to address positional reasoning. We use the forward process to map elements' positions in a set to random positions in a continuous space. Positional Diffusion learns to reverse the noising process and recover the original positions through an Attention-based Graph Neural Network. We conduct extensive experiments with benchmark datasets, including two puzzle datasets, three sentence ordering datasets, and one visual storytelling dataset, demonstrating that our method outperforms long-lasting research on puzzle solving with up to +18% compared to the second-best deep learning method, and performs on par with the state-of-the-art methods on sentence ordering and visual storytelling. Our work highlights the suitability of diffusion models for ordering problems and proposes a novel formulation and method for solving various ordering tasks. Project website at https://iit-pavis.github.io/Positional_Diffusion/
  • You are here! Finding position and orientation on a 2D map from a single image: The Flatlandia localization problem and dataset

    Matteo Toso, Matteo Taiana, Stuart James, Alessio Del Bue

    arXiv | preprint

    We introduce Flatlandia, a novel problem for visual localization of an image from object detections, composed of two specific tasks: i) Coarse Map Localization: localizing a single image observing a set of objects with respect to a 2D map of object landmarks; ii) Fine-grained 3DoF Localization: estimating the latitude, longitude, and orientation of the image within a 2D map. Solutions for these new tasks exploit the wide availability of open urban maps annotated with the GPS locations of common objects (e.g., via surveying or crowd-sourcing). Such maps are also more storage-friendly than the standard large-scale 3D models often used in visual localization, while additionally being privacy-preserving. As existing datasets are unsuited for the proposed problem, we provide the Flatlandia dataset, designed for 3DoF visual localization in multiple urban settings and based on crowd-sourced data from five European cities. We use the Flatlandia dataset to validate the complexity of the proposed tasks.
  • Locality-aware subgraphs for inductive link prediction in knowledge graphs

    Hebatallah A. Mohamed, Diego Pilutti, Stuart James, Alessio Del Bue, Marcello Pelillo, Sebastiano Vascon

    Pattern Recognition Letters (PR-L) | Journal

    Recent methods of inductive reasoning on Knowledge Graphs (KGs) transform the link prediction problem into a graph classification task. They first extract a subgraph around each target link based on the k-hop neighborhood of the target entities, encode the subgraphs using a Graph Neural Network (GNN), then learn a function that maps subgraph structural patterns to link existence. Although these methods have witnessed great successes, increasing k often leads to an exponential expansion of the neighborhood, thereby degrading the GNN expressivity due to oversmoothing. In this paper, we formulate the subgraph extraction as a local clustering procedure that aims at sampling tightly-related subgraphs around the target links, based on a personalized PageRank (PPR) approach. Empirically, on three real-world KGs, we show that reasoning over subgraphs extracted by PPR-based local clustering can lead to a more accurate link prediction model than relying on neighbors within fixed hop distances. Furthermore, we investigate graph properties such as average clustering coefficient and node degree, and show that there is a relation between these and the performance of subgraph-based link prediction.
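
    The PPR-based extraction itself is compact: run personalized PageRank seeded on the two endpoints of the target link and keep the top-scoring nodes as the subgraph. A NetworkX sketch on a stand-in graph (the entity ids and subgraph size are placeholders, not the paper's settings):

    import networkx as nx

    G = nx.karate_club_graph()              # stand-in for a knowledge graph
    head, tail = 0, 33                      # hypothetical target link endpoints
    ppr = nx.pagerank(G, alpha=0.85, personalization={head: 0.5, tail: 0.5})
    top = sorted(ppr, key=ppr.get, reverse=True)[:10]
    subgraph = G.subgraph(top)              # tightly-related local cluster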
  • 2022

    Writing with (Digital) Scissors: Designing a Text Editing Tool for Assisted Storytelling using Crowd-Generated Content

    Paulo Bala, Stuart James, Alessio Del Bue, Valentina Nisi

    International Conference on Interactive Digital Storytelling (ICIDS 2022) | Santa Cruz, USA

    Digital Storytelling can exploit numerous technologies and sources of information to support the creation, refinement and enhancement of a narrative. Research on text editing tools has created novel interactions that support authors in different stages of the creative process, such as the inclusion of crowd-generated content for writing. While these interactions have the potential to change workflows, integration of these in a way that is useful and matches users’ needs is unclear. In order to investigate the space of Assisted Storytelling, we designed and conducted a study to analyze how users write and edit a story about Cultural Heritage using an auxiliary source like Wikipedia. Through a diffractive analysis of stories, creative processes, and social and cultural contexts, we reflect and derive implications for design. These were applied to develop an AI-supported text editing tool using crowd-sourced content from Wikipedia and Wikidata.
  • PoserNet: Refining Relative Camera Poses Exploiting Object Detections

    Matteo Taiana, Matteo Toso, Stuart James, Alessio Del Bue

    European Conference on Computer Vision (ECCV 2022) | Tel Aviv, Israel

    The estimation of the camera poses associated with a set of images commonly relies on feature matches between the images. In contrast, we are the first to address this challenge by using objectness regions to guide the pose estimation problem rather than explicit semantic object detections. We propose Pose Refiner Network (PoserNet), a light-weight Graph Neural Network that refines approximate pair-wise relative camera poses. PoserNet exploits associations between the objectness regions - concisely expressed as bounding boxes - across multiple views to globally refine sparsely connected view graphs. We evaluate on the 7-Scenes dataset across varied sizes of graphs and show how this process can be beneficial to optimisation-based Motion Averaging algorithms, improving the median error on the rotation by 62° with respect to the initial estimates obtained based on bounding boxes. Code and data are available at github.com/IIT-PAVIS/PoserNet.
    @inproceedings{posernet_eccv2022,
    Title = {PoserNet: Refining Relative Camera Poses Exploiting Object Detections},
    Author = {Matteo Taiana and Matteo Toso and Stuart James and Alessio Del Bue},
    booktitle = {Proceedings of the European Conference on Computer Vision ({ECCV})},
    Year = {2022},
    }
  • Geolocation of Cultural Heritage using Multi-View Knowledge Graph Embedding

    Hebatallah A. Mohamed, Sebastiano Vascon, Feliks Hibraj, Stuart James, Diego Pilutti, Alessio Del Bue, Marcello Pelillo

    International Workshop on Pattern Recognition for Cultural Heritage (PatReCH 2022) | Montréal, Québec, Canada

    Knowledge Graphs (KGs) have proven to be a reliable way of structuring data. They can provide a rich source of contextual information about cultural heritage collections. However, cultural heritage KGs are far from being complete. They are often missing important attributes such as geographical location, especially for sculptures and mobile or indoor entities such as paintings. In this paper, we first present a framework for ingesting knowledge about tangible cultural heritage entities from various data sources and their connected multi-hop knowledge into a geolocalized KG. Secondly, we propose a multi-view learning model for estimating the relative distance between a given pair of cultural heritage entities, based on the geographical as well as the knowledge connections of the entities.
  • GANzzle: Reframing jigsaw puzzle solving as a retrieval task using a generative mental image

    Davide Talon, Alessio Del Bue, Stuart James

    IEEE International Conference on Image Processing (ICIP 2022) | Bordeaux, France

    Puzzle solving is a combinatorial challenge due to the difficulty of matching adjacent pieces. Instead, we infer a mental image from all pieces, against which a given piece can then be matched, avoiding the combinatorial explosion. Exploiting advancements in Generative Adversarial methods, we learn how to reconstruct the image given a set of unordered pieces, allowing the model to learn a joint embedding space to match an encoding of each piece to the cropped layer of the generator. Therefore, we frame the problem as an R@1 retrieval task, and then solve the linear assignment using differentiable Hungarian attention, making the process end-to-end. In doing so, our model is puzzle size agnostic, in contrast to prior deep learning methods, which are single-size. We evaluate on two new large-scale datasets, where our model is on par with deep learning methods, while generalizing to multiple puzzle sizes.
  • Emerging Strategies in Asymmetric Sketch Interactions for Object Retrieval in Virtual Reality

    Daniele Giunchi, Riccardo Bovo, Donald Degraen, Stuart James, and Anthony Steed

    Interactive Media, Smart Systems and Emerging Technologies (IMET 2022) | Cyprus

  • Multi-view 3D Objects Localization from Street-level Scenes

    Javed Ahmad, Matteo Taiana, Matteo Toso, Stuart James, and Alessio Del Bue

    International Conference on Image Analysis and Processing (ICIAP 2021) | Lecce, Italy

    This paper presents a method to localize street-level objects in 3D from images of an urban area. Our method processes 3D sparse point clouds reconstructed from multi-view images and leverages 2D instance segmentation to find all objects within the scene and to generate, for each object, the corresponding cluster of 3D points and matched 2D detections. The proposed approach is robust to changes in image size, viewpoint, and object appearance across different views. We validate our approach on challenging street-level crowdsourced images from the Mapillary platform, showing a significant improvement in the mean average precision of object localization for the available Mapillary annotations. These results showcase our method's effectiveness in localizing objects in 3D, which could potentially be used in applications such as high-definition map generation of urban environments.
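
    One step of such a pipeline, grouping the masked 3D points into per-object clusters, can be illustrated with a generic density-based clustering; DBSCAN is used here purely as a stand-in, with synthetic points in place of a reconstructed cloud.

    import numpy as np
    from sklearn.cluster import DBSCAN

    rng = np.random.default_rng(0)
    pts = np.concatenate([rng.normal(c, 0.2, size=(50, 3))
                          for c in ([0, 0, 0], [5, 0, 0], [0, 5, 0])])
    labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(pts)   # -1 marks noise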
  • 2021

    Square peg, round hole: A case study on using Visual Question & Answering in Games

    Paulo Bala, Valentina Nisi, Mara Dionísio, Nuno Jardim Nunes, Stuart James

    CHI Play - WIP Track | Virtual

    The discussion about what Artificial Intelligence (AI) can contribute to games has been running for a long time; however, recent advances in AI show promise of providing new kinds of experiences for players and new tools for game developers. In contrast with the traditional Finite State Machine for interaction and response, we consider the scenario of Visual Question & Answering (VQA) - the automatic answering of a textual question about an image. VQA is a tool that can enrich possible answers by combining both visual and textual information, and it extrapolates readily to a game setting without straying from its training domain. In this Work In Progress, we present two original prototypes designed to explore the potential of VQA in games and discuss preliminary findings from a Wizard of Oz (WOz) pilot study using VQA to investigate how people interact with such an AI algorithm.
  • Amnesia in the Atlantic: an AI Driven Serious Game on Marine Biodiversity

    Mara Dionísio, Valentina Nisi, Jin Xin, Paulo Bala, Stuart James, Nuno Jardim Nunes

    IFIP International Conference on Entertainment Computing (IFIP-ICEC) - Work In Progress (WIP) Track | Coimbra, Portugal

    The use of Conversational Interfaces has evolved rapidly in numerous fields; in particular, they are an interesting tool for Serious Games to leverage. Conversational Interfaces can assist Serious Games' goals, namely in presenting knowledge through dialogue. With the global acknowledgment of the joint crisis in nature and climate change, it is essential to raise awareness of the fact that many ecosystems are being destroyed and that the biodiversity of our planet is at risk. Therefore, in this paper, we present Amnesia in the Atlantic, a Serious Game enhanced with a Conversational Interface, embracing the challenge of critically engaging players with marine biodiversity issues.
  • Artificial Intelligence and Art History: A Necessary Debate?

    Mathieu Aubry, Lisandra Costiner, Stuart James

    Histoire de l'art | Debate

  • Mixing Modalities of 3D Sketching and Speech for Interactive Model Retrieval in Virtual Reality

    Daniele Giunchi, Alejandro Sztrajman, Stuart James, Anthony Steed

    IMX'21 | New York

    Sketch and speech are intuitive interaction methods that convey complementary information and have been independently used for 3D model retrieval in virtual environments. While sketch has been shown to be an effective retrieval method, not all collections are easily navigable using this modality alone. To overcome this, we implement a multimodal interface for querying 3D model databases within a virtual environment. We design a new challenging database for sketch comprising 3D chairs where each of the components (arms, legs, seat, back) is independently colored. We base the sketch interaction on the state-of-the-art for 3D Sketch Retrieval, and use a Wizard-of-Oz style experiment to process the voice input. In this way, we avoid the complexities of natural language processing, which frequently requires fine-tuning to be robust. We conduct two user studies and show that hybrid search strategies emerge from the combination of interactions, fostering the advantages provided by both modalities.
    @inbook{10.1145/3452918.3458806,
    author = {Giunchi, Daniele and Sztrajman, Alejandro and James, Stuart and Steed, Anthony},
    title = {Mixing Modalities of 3D Sketching and Speech for Interactive Model Retrieval in Virtual Reality},
    year = {2021},
    isbn = {9781450383899},
    publisher = {Association for Computing Machinery},
    address = {New York, NY, USA},
    url = {https://doi.org/10.1145/3452918.3458806},
    booktitle = {ACM International Conference on Interactive Media Experiences},
    pages = {144–155},
    numpages = {12}}
  • Consistent Mesh Colors for Multi-View Reconstructed 3D Scenes

    Mohamed Dahy Elkhouly, Alessio Del Bue, Stuart James

    arXiv | preprint

    We address the issue of creating consistent mesh texture maps captured from scenes without color calibration. We find that the method for aggregating the multiple views is crucial for creating spatially consistent meshes without the need to explicitly optimize for spatial consistency. We compute a color prior from the cross-correlation of observable view faces and the faces per view to identify an optimal per-face color. We then use this color in a re-weighting ratio for the best-view texture, which is identified by prior mesh texturing work, to create a spatially consistent texture map. Despite our method not explicitly handling spatial consistency, our results are qualitatively more consistent than other state-of-the-art techniques while being computationally more efficient. We evaluate on prior datasets and additionally on Matterport3D, showing qualitative improvements.
  • LIGHTS: LIGHT Specularity Dataset for specular detection in Multi-view

    Mohamed Dahy Elkhouly, Theodore Tsesmelis, Alessio Del Bue, Stuart James

    IEEE International Conference on Image Processing (ICIP 2021) | Anchorage, Alaska

    Specular highlights are commonplace in images; however, methods for detecting them, and in turn removing the phenomenon, are particularly challenging. One reason for this is the difficulty of creating a dataset for training or evaluation, as in the real world we lack the necessary control over the environment. Therefore, we propose a novel physically-based rendered LIGHT Specularity (LIGHTS) Dataset for the evaluation of the specular highlight detection task. Our dataset consists of 18 high-quality architectural scenes, where each scene is rendered with multiple views. In total we have 2,603 views with an average of 145 views per scene. Additionally, we propose a simple aggregation-based method for specular highlight detection that outperforms prior work by 3.6% in two orders of magnitude less time on our dataset.
    @inproceedings{ElkhoulyICIP21lights,
    author={Elkhouly, Mohamed Dahy and Tsesmelis, Theodore and Bue, Alessio Del and James, Stuart},
    booktitle={2021 IEEE International Conference on Image Processing (ICIP)},
    title={Lights: Light Specularity Dataset For Specular Detection In Multi-View},
    year={2021},
    volume={},
    number={},
    pages={2908-2912},
    doi={10.1109/ICIP42928.2021.9506354}}
  • 2020

    Machine Learning for Cultural Heritage: A Survey

    Marco Fiorucci, Marina Khoroshiltseva, Massimiliano Pontil, Arianna Traviglia, Alessio Del Bue and Stuart James

    Pattern Recognition Letters (PR-L) | Elsevier

    The application of Machine Learning (ML) to Cultural Heritage (CH) has evolved from basic statistical approaches, such as Linear Regression, to complex Deep Learning models. The question remains how much of this work actively improves on the underlying algorithm versus using it within a 'black box' setting. We survey across the ML and CH literature to identify the theoretical changes that contribute to the algorithm and in turn make it suitable for CH applications. Alternatively, and most commonly, when there are no such changes, we review the CH applications, features and pre/post-processing which make the algorithm suitable for its use. We analyse the dominant divides within ML, Supervised, Semi-supervised and Unsupervised, and reflect on a variety of algorithms that have been extensively used. From such an analysis, we give a critical look at the use of ML in CH and consider why CH has only limited adoption of ML.
    @article{FiorucciPRL20ml4ch, title = "Machine Learning for Cultural Heritage: A Survey",
    journal = "Pattern Recognition Letters",
    volume = "133",
    pages = "102 - 108",
    year = "2020",
    issn = "0167-8655",
    doi = "https://doi.org/10.1016/j.patrec.2020.02.017",
    url = "http://www.sciencedirect.com/science/article/pii/S0167865520300532",
    author = "Marco Fiorucci and Marina Khoroshiltseva and Massimiliano Pontil and Arianna Traviglia and Alessio [Del Bue] and Stuart James",
    keywords = "Artificial Intelligence, Machine Learning, Cultural Heritage, Digital Humanities",
    abstract = "The application of Machine Learning (ML) to Cultural Heritage (CH) has evolved since basic statistical approaches such as Linear Regression to complex Deep Learning models. The question remains how much of this actively improves on the underlying algorithm versus using it within a ‘black box’ setting. We survey across ML and CH literature to identify the theoretical changes which contribute to the algorithm and in turn them suitable for CH applications. Alternatively, and most commonly, when there are no changes, we review the CH applications, features and pre/post-processing which make the algorithm suitable for its use. We analyse the dominant divides within ML, Supervised, Semi-supervised and Unsupervised, and reflect on a variety of algorithms that have been extensively used. From such an analysis, we give a critical look at the use of ML in CH and consider why CH has only limited adoption of ML."}
  • 2019

    Mixing realities for sketch retrieval in Virtual Reality

    Daniele Giunchi, Stuart James, Donald Degraen and Anthony Steed

    VRCAI'19 | Brisbane, Australia

    Users within a Virtual Environment often need support designing the environment around them, with the need to find relevant content while remaining immersed. We therefore focus on the familiar sketch-based interaction to support the process of content placing, and specifically investigate how interactions from a tablet or desktop translate into the virtual environment. To understand sketching interaction within a virtual environment, we compare different methods of sketch interaction, i.e., 3D mid-air sketching, 2D sketching on a virtual tablet, 2D sketching on a fixed virtual whiteboard, and 2D sketching on a real tablet. The user remains immersed within the environment, queries a database containing detailed 3D models, and places them into the virtual environment. Our results show that 3D mid-air sketching is considered to be a more intuitive method to search a collection of models, while the addition of physical devices creates confusion due to the complications of their inclusion within a virtual environment. While we pose our work as a retrieval problem for 3D models of chairs, our results are extendable to other sketching tasks for virtual environments.
    @inproceedings{GiunchiVRCAI19mixingReal,
    author = {Giunchi, Daniele and James, Stuart and Degraen, Donald and Steed, Anthony},
    title = {Mixing Realities for Sketch Retrieval in Virtual Reality},
    year = {2019},
    isbn = {9781450370028},
    publisher = {Association for Computing Machinery},
    address = {New York, NY, USA},
    url = {https://doi.org/10.1145/3359997.3365751},
    doi = {10.1145/3359997.3365751},
    booktitle = {The 17th International Conference on Virtual-Reality Continuum and Its Applications in Industry},
    articleno = {Article 50},
    numpages = {2},
    keywords = {HCI, Sketch, CNN, Virtual Reality},
    location = {Brisbane, QLD, Australia},
    series = {VRCAI ’19}}
  • re-OBJ: Jointly learning the foreground and background for object instance re-identification

    Vaibhav Bansal, Stuart James and Alessio Del Bue

    ICIAP'19 | Trento, Italy (Best Student Paper Award)

    Conventional approaches to object instance re-identification rely on matching appearances of the target objects among a set of frames. However, learning the appearance of an object alone might fail when there are multiple objects with similar appearance or multiple instances of the same object class present in the scene. This paper proposes that partial observations of the background can be utilized to aid the object re-identification task for a rigid scene, especially a rigid environment with many recurring, identical models of objects. Using an extension to the Mask R-CNN architecture, we learn to encode the important and distinct information in the background jointly with the foreground, relevant to rigid real-world scenarios such as an indoor environment where objects are static and the camera moves around the scene. We demonstrate the effectiveness of our joint visual feature in the re-identification of objects in the ScanNet dataset and show a relative improvement of around 28.25% in the rank-1 accuracy over the DeepSORT method.
    @inproceedings{BansalICIAP19reobj,
    author = {Vaibhav Bansal and Stuart James and Alessio {Del Bue}},
    editor = {Elisa Ricci and Samuel Rota Bul{\`{o}} and Cees Snoek and Oswald Lanz and Stefano Messelodi and Nicu Sebe},
    title = {re-OBJ: Jointly Learning the Foreground and Background for Object Instance Re-identification},
    booktitle = {Image Analysis and Processing - {ICIAP} 2019 - 20th International Conference,
    Trento, Italy, September 9-13, 2019, Proceedings, Part{II}},
    series = {Lecture Notes in Computer Science},
    volume = {11752}, pages = {402--413},
    publisher = {Springer},
    year = {2019},
    url = {https://doi.org/10.1007/978-3-030-30645-8\_37},
    doi = {10.1007/978-3-030-30645-8\_37}}
  • Augmenting datasets for Visual Question and Answering for complex spatial reasoning

    Stuart James and Alessio Del Bue

    CVPR Workshop on VQA | California, USA

  • Autonomous 3D reconstruction, mapping and exploration of indoor environments with a robotic arm

    Yiming Wang, Stuart James, Elisavet Konstantina Stathopoulou, Carlos Beltran-Gonzalez, Yoshinori Konishi and Alessio Del Bue

    IEEE Robotics and Automation Letters | Macau

    We propose a novel information gain metric that combines hand-crafted and data-driven metrics to address the next best view problem for autonomous 3D mapping of unknown indoor environments. For the hand-crafted metric, we propose an entropy-based information gain that accounts for the previous view points, preventing the camera from revisiting the same location and promoting motion toward unexplored or occluded areas. For the learnt metric, we adopt a Convolutional Neural Network (CNN) architecture and formulate the problem as a classification problem. The CNN takes as input the current depth image and outputs the motion direction that suggests the largest unexplored surface. We train and test the CNN using a new synthetic dataset based on the SUNCG dataset. The learnt motion direction is then combined with the proposed hand-crafted metric to help handle situations where using only the hand-crafted metric tends to face ambiguities. We finally evaluate the autonomous paths over several real and synthetic indoor scenes, including complex industrial and domestic settings, and show that our combined metric is able to further improve the exploration coverage compared to using only the proposed hand-crafted metric.
    @ARTICLE{WangRAL19explore,
    author={Y. {Wang} and S. {James} and E. K. {Stathopoulou} and C. {Beltrán-González} and Y. {Konishi} and A. {Del Bue}},
    journal={IEEE Robotics and Automation Letters},
    title={Autonomous 3-D Reconstruction, Mapping, and Exploration of Indoor Environments With a Robotic Arm},
    year={2019},
    volume={4},
    number={4},
    pages={3340-3347}}
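
    The hand-crafted half of the metric rests on a standard quantity: the Shannon entropy of occupancy probabilities, which peaks for unknown cells (p near 0.5). A minimal sketch of that term alone, with a placeholder occupancy grid and visibility selection:

    import numpy as np

    def cell_entropy(p):
        p = np.clip(p, 1e-6, 1 - 1e-6)
        return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

    occupancy = np.full((32, 32, 32), 0.5)   # fully unknown map: maximal entropy
    visible = occupancy[:16]                 # hypothetical cells seen by a view
    info_gain = cell_entropy(visible).sum()  # higher = more unexplored volume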
  • 2018

    Visual Graphs from Motion (VGfM): Scene understanding with object geometry reasoning

    Paul Gay, Stuart James, Alessio Del Bue

    ACCV'18 | Perth, Australia

    Recent approaches to visual scene understanding attempt to build a scene graph -- a computational representation of objects and their pairwise relationships. Such a rich semantic representation is very appealing, yet difficult to obtain from a single image, especially when considering complex spatial arrangements in the scene. In contrast, an image sequence conveys useful information through the multi-view geometric relations arising from camera motion. Indeed, in such cases, object relationships are naturally related to the 3D scene structure. To this end, this paper proposes a system that first computes the geometrical location of objects in a generic scene and then efficiently constructs scene graphs from video by embedding such geometrical reasoning. This compelling representation is obtained using a new model where geometric and visual features are merged using an RNN framework. We report results on a dataset we created for the task of 3D scene graph generation in multiple views.
    @InProceedings{GayACCV19vgfm,
    author="Gay, Paul and Stuart, James and Del Bue, Alessio",
    editor="Jawahar, C. V.and Li, Hongdong and Mori, Greg and Schindler, Konrad",
    title="Visual Graphs from Motion (VGfM): Scene Understanding with Object Geometry Reasoning",
    booktitle="Computer Vision -- ACCV 2018",
    year="2019",
    publisher="Springer International Publishing",address="Cham",
    pages="330--346",
    abstract="Recent approaches on visual scene understanding attempt to build a scene graph -- a computational representation of objects and their pairwise relationships. Such rich semantic representation is very appealing, yet difficult to obtain from a single image, especially when considering complex spatial arrangements in the scene. Differently, an image sequence conveys useful information using the multi-view geometric relations arising from camera motions. Indeed, object relationships are naturally related to the 3D scene structure. To this end, this paper proposes a system that first computes the geometrical location of objects in a generic scene and then efficiently constructs scene graphs from video by embedding such geometrical reasoning. Such compelling representation is obtained using a new model where geometric and visual features are merged using an RNN framework. We report results on a dataset we created for the task of 3D scene graph generation in multiple views.",
    isbn="978-3-030-20893-6"}
  • Multi-view Aggregation for Color Naming with Shadow Detection and Removal

    Mohamed Dahy Elkhouly, Stuart James, Alessio Del Bue

    IPAS'18 | Nice, France (Best Paper Award)

    This paper presents a set of methods for classifying the color attribute of objects when multiple images of the same objects are available. This problem is more complex than single-image estimation, since varying environmental effects, such as shadows or specularities from light sources, can result in poor accuracy. These depend primarily on the camera positions and the material type of the objects. Single-image techniques focus on improving the discrimination between colors, whereas in multi-view systems additional information is available but should be utilized wisely. To this end, we propose three methods to aggregate image pixel information in multi-view that boost the performance of color name classification. Moreover, we study the effect of shadows by employing automatic shadow detection and correction techniques on the color naming problem. We tested our proposals on a new multi-view color names dataset (M3DCN), which contains indoor and outdoor objects. The experimental evaluation shows that one of the three presented aggregation methods is very efficient, achieving the highest accuracy in terms of classification results. Also, we experimentally show that addressing visual outliers like shadows in multi-view images improves the performance of the color attribute decision process.
    @INPROCEEDINGS{ElkhoulyIPAS18mvcolor,
    author={M. D. {Elkhouly} and S. {James} and A. {Del Bue}},
    booktitle={2018 IEEE International Conference on Image Processing, Applications and Systems (IPAS)},
    title={Multi-view Aggregation for Color Naming with Shadow Detection and Removal},
    year={2018},
    pages={115-120}}
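
    A minimal example of the kind of multi-view aggregation at stake: per-view colour-name probabilities combined by confidence-weighted voting, so that a single shadowed view cannot flip the decision. The numbers are invented, and this is an illustrative stand-in rather than one of the paper's three methods verbatim.

    import numpy as np

    color_names = ["red", "green", "blue", "brown"]
    probs = np.array([[0.70, 0.10, 0.10, 0.10],    # hypothetical per-view outputs
                      [0.60, 0.20, 0.10, 0.10],
                      [0.20, 0.10, 0.10, 0.60],    # shadowed view drifts to brown
                      [0.80, 0.10, 0.05, 0.05]])
    weights = probs.max(axis=1)                    # per-view confidence
    agg = (weights[:, None] * probs).sum(axis=0)
    print(color_names[int(agg.argmax())])          # -> "red"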
  • 3D Sketching for Interactive Model Retrieval in Virtual Reality

    Daniele Giunchi, Stuart James, Anthony Steed

    Expressive | Victoria, British Columbia, Canada

    Users within a Virtual Environment often need support designing the environment around them, with the need to find relevant content while remaining immersed. We therefore focus on the familiar sketch-based interaction to support the process of content placing, and specifically investigate how interactions from a tablet or desktop translate into the virtual environment. To understand sketching interaction within a virtual environment, we compare different methods of sketch interaction, i.e., 3D mid-air sketching, 2D sketching on a virtual tablet, 2D sketching on a fixed virtual whiteboard, and 2D sketching on a real tablet. The user remains immersed within the environment, queries a database containing detailed 3D models, and places them into the virtual environment. Our results show that 3D mid-air sketching is considered to be a more intuitive method to search a collection of models, while the addition of physical devices creates confusion due to the complications of their inclusion within a virtual environment. While we pose our work as a retrieval problem for 3D models of chairs, our results are extendable to other sketching tasks for virtual environments.
    @inproceedings{10.1145/3229147.3229166,
    author = {Giunchi, Daniele and James, Stuart and Steed, Anthony},
    title = {3D Sketching for Interactive Model Retrieval in Virtual Reality},
    year = {2018},
    isbn = {9781450358927},
    publisher = {Association for Computing Machinery},
    address = {New York, NY, USA},
    url = {https://doi.org/10.1145/3229147.3229166},
    doi = {10.1145/3229147.3229166},
    booktitle = {Proceedings of the Joint Symposium on Computational Aesthetics and Sketch-Based Interfaces and Modeling and Non-Photorealistic Animation and Rendering},
    articleno = {Article 1},
    numpages = {12},
    keywords = {HCI, sketch, virtual reality, CNN},
    location = {Victoria, British Columbia, Canada},
    series = {Expressive ’18}}
  • Model Retrieval by 3D Sketching in Immersive Virtual Reality

    Daniele Giunchi, Stuart James, Anthony Steed

    IEEE VR Poster | Reutlingen, Germany

    We describe a novel method for searching 3D model collections using free-form sketches within a virtual environment as queries. As opposed to traditional Sketch Retrieval, our queries are drawn directly onto an example model. Using immersive virtual reality, the user can express their query through a sketch that demonstrates the desired structure, color and texture. Unlike previous sketch-based retrieval methods, users remain immersed within the environment without relying on textual queries or 2D projections which can disconnect the user from the environment. We show how a convolutional neural network (CNN) can create multi-view representations of colored 3D sketches. Using such a descriptor representation, our system is able to rapidly retrieve models, and in this way we provide the user with an interactive method of navigating large object datasets. Through a preliminary user study we demonstrate that by using our VR 3D model retrieval system, users can perform quick and intuitive searches. Using our system, users can rapidly populate a virtual environment with specific models from a very large database, and thus the technique has the potential to be broadly applicable in immersive editing systems.
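
    Once the CNN has produced descriptors, the retrieval step reduces to nearest-neighbour ranking by cosine similarity. A minimal sketch with random placeholder embeddings standing in for the sketch and model descriptors:

    import numpy as np

    rng = np.random.default_rng(0)
    model_desc = rng.normal(size=(1000, 128))   # hypothetical database descriptors
    query = rng.normal(size=128)                # descriptor of the user's 3D sketch

    m = model_desc / np.linalg.norm(model_desc, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    ranking = np.argsort(-(m @ q))              # best-matching model ids first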
  • 2017

    Texture Stationarization: Turning Photos into Tileable Textures

    Joep Moritz, Stuart James, Tom S.F. Haines, Tobias Ritschel, Tim Weyrich

    Computer Graphics Forum (Proc. Eurographics) | Lyon, France

    Texture synthesis has grown into a mature field in computer graphics, allowing the synthesis of naturalistic textures and images from photographic exemplars. Surprisingly little work, however, has been dedicated to synthesizing tileable textures, that is, textures that when laid out in a regular grid of tiles form a homogeneous appearance suitable for use in memory-sensitive real-time graphics applications. One of the key challenges in doing so is that most natural input exemplars exhibit uneven spatial variations that, when tiled, show as repetitive patterns. We propose an approach to synthesize tileable textures while enforcing stationarity properties that effectively mask repetitions while maintaining the unique characteristics of the exemplar. We explore a number of alternative measures for texture stationarity and show how each measure can be integrated into a standard texture synthesis method (PatchMatch) to enforce stationarity at user-controlled scales. We demonstrate the efficacy of our approach using a database of 118 exemplar images, both from publicly available sources as well as new ones captured under uncontrolled conditions, and we quantitatively analyze alternative stationarity measures for their robustness across many test runs using different random seeds. In conclusion, we suggest a novel synthesis approach that employs local histogram matching to reliably turn input photographs of natural surfaces into tiles well suited for artifact-free tiling.
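
    The local histogram matching named in the conclusion is available off the shelf; below is a minimal scikit-image sketch with synthetic arrays standing in for the photograph (the paper integrates this into PatchMatch-based synthesis rather than applying it in isolation).

    import numpy as np
    from skimage.exposure import match_histograms

    rng = np.random.default_rng(0)
    exemplar = rng.random((256, 256, 3))     # stand-in for the input photo
    tile = exemplar[:128, :128] * 0.6        # simulate uneven local lighting
    matched = match_histograms(tile, exemplar, channel_axis=-1)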
  • Digital Photographic Practices as Expressions of Personhood and Identity: Variations Across School Leavers and Recent Retirees

    K Orzech, W Moncur, A Durrant, S James, J Collomosse

    Journal of Visual Studies |

  • 2016

    Evolutionary Data Purification for Social Media Classification

    Stuart James, John Collomosse

    International Conference on Pattern Recognition (ICPR'16) | Cancun, Mexico

  • Towards Sketched Visual Narratives for Retrieval

    Stuart James

    SketchX - Human Sketch Analysis and its Applications | London, UK

  • 2015

    Visual Narratives: Free-hand Sketch for Visual Search and Navigation of Video

    Stuart James

    PhD Thesis | University of Surrey, Guildford, UK

    Humans have an innate ability to communicate visually; the earliest forms of communication were cave drawings, and children can communicate visual descriptions of scenes through drawings well before they can write. Drawings and sketches offer an intuitive and efficient means for communicating visual concepts. Today, society faces a deluge of digital visual content driven by a surge in the generation of video on social media and the online availability of video archives. Mobile devices are emerging as the dominant platform for consuming this content, with Cisco predicting that by 2018 over 80% of mobile traffic will be video. Sketch offers a familiar and expressive modality for interacting with video on the touch-screens commonly present on such devices. This thesis contributes several new algorithms for searching and manipulating video using free-hand sketches. We propose the Visual Narrative (VN): a storyboarded sequence of one or more actions in the form of sketch that collectively describe an event. We show that VNs can be used both to efficiently search video repositories and to synthesise video clips. First, we describe a sketch based video retrieval (SBVR) system that fuses multiple modalities (shape, colour, semantics, and motion) in order to find relevant video clips. An efficient multi-modal video descriptor is proposed, enabling the search of hundreds of videos in milliseconds. This contrasts with prior SBVR, which lacks an efficient index representation and takes minutes or hours to search similar datasets. This contribution not only makes SBVR practical at interactive speeds, but also enables user refinement of results through relevance feedback to resolve sketch ambiguity, including the relative priority of the different VN modalities. Second, we present the first algorithm for sketch based pose retrieval. A pictographic representation (stick-men) is used to specify a desired human pose within the VN, and similar poses are found within a video dataset. We use archival dance performance footage from the UK National Resource Centre for Dance (UK-NRCD), containing diverse examples of human pose. We investigate appropriate descriptors for sketch and video, and propose a novel manifold learning technique for mapping between the two descriptor spaces and so performing sketched pose retrieval. We show that domain adaptation can be applied to boost the performance of this system through a novel piece-wise feature-space warping technique. Third, we present a graph representation for VNs comprising multiple actions. We focus on the extension of our pose retrieval system to a sequence of poses interspersed with actions (e.g. jump, twirl). We show that our graph representation can be used for multiple applications: 1) to retrieve sequences of video comprising multiple actions; 2) to navigate, in pictorial form, the retrieved video sequences; 3) to synthesise new video sequences by retrieving and concatenating video fragments from archival footage.
  • 2014

    Enhanced Digital Literacy by Multi-modal Data Mining of the Digital Lifespan

    John Collomosse, Stuart James, Abigail Durrant, Diego Trujillo-Pisanty, Wendy Moncur, Kathryn Orzech, Sarah Martindale, Mike Chantler

    DE2015 | London, UK

  • Interactive Video Asset Retrieval using Sketched Queries

    Stuart James and John Collomosse

    CVMP'14 | London

  • Particle Filtering approach to salient video object localization

    C Gray, S James, J Collomosse and P Asente

    ICIP'14 | Switzerland

  • ReEnact: Sketch based Choreographic Design from Archival Dance Footage

    S James, M Fonseca and J Collomosse

    ACM International Conference on Multimedia Retrieval (ICMR'14) | Glasgow, UK

  • Admixed Portrait: Design Intervention to Prompt Reflection on Being Online as a New Parent

    D Trujillo-Pisanty, A Durrant, S Martindale, S James, J Collomosse

    ACM DIS'14 |

  • 2013

    Markov Random Fields for Sketch based Video Retrieval

    R Hu, S James, T Wang and J Collomosse

    ACM International Conference on Multimedia Retrieval (ICMR'13) |

  • 2012

    Skeletons from Sketches of Dancing Poses

    M Fonseca, S James and J Collomosse

    IEEE VL/HCC'12 |

  • Annotated Free-hand Sketches for Video Retrieval using Object Semantics and Motion

    R Hu, S James and J Collomosse

    Springer ACM MultiMedia Modelling (MMM'12) |

  • 2011

    Annotated Sketches for Intuitive Video Retrieval

    Stuart James and John Collomosse

    BMVA/AVA Workshop on Biological and Machine Vision, Perception Journal | Cardiff, UK

  • Sketched Visual Narratives for Content Based Video Retrieval

    Stuart James

    MPhil Transfer Report | University of Surrey, UK