Posted on 2025-05-01, 00:00 · Authored by Raj Paresh Mehta
3D reconstruction is a fundamental problem in computer vision and graphics, with applications in virtual reality, robotics, and digital content creation. Classical formulations such as voxel grids and meshes provide structured representations, but their discrete structure and sparsity make them difficult to use when training deep learning models for real-time processing. To address these challenges, recent works have adopted Neural Radiance Fields (NeRFs) and 3D Gaussian Splatting (3DGS) representations, which enable high-quality scene reconstruction with improved rendering capabilities. Compared to NeRFs, 3D Gaussian Splatting offers real-time rendering and the ability to represent large-scale scenes efficiently. However, a key limitation is its high memory and storage requirements, since accurately reconstructing a scene often requires millions of Gaussians. To overcome this, we propose a novel spatial grouping and 2D attention-based framework that learns compressed representations of 3D Gaussian Splatting scenes. Our method significantly reduces memory overhead while preserving visual fidelity and rendering quality, making it a viable solution for efficient and scalable 3D scene representation. We use the CO3D Dataset by Meta AI, which contains 1.5 million frames from nearly 19,000 videos capturing objects from 50 MS-COCO categories. The dataset mimics real-world settings, as it is captured without any coordinate calibration; despite the resulting challenges, this allows us to test our framework under conditions close to everyday applications. Our experimental evaluations indicate that it is possible to compress a 3DGS scene by as much as 16 times without compromising its visual quality.
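The abstract does not specify the architecture in detail, so the following is only a minimal PyTorch sketch of the general idea: Gaussians are grouped by spatial proximity, and attention within each group produces a small set of latent tokens that summarize it. The per-Gaussian parameter layout, group size, latent width, and all names (`GroupedAttentionCompressor`, `group_by_position`) are assumptions for illustration, not the thesis's actual implementation.

```python
# Illustrative sketch only; sizes and layout below are assumptions,
# not the architecture described in the thesis.
import torch
import torch.nn as nn

GAUSSIAN_DIM = 59   # assumed: 3 pos + 3 scale + 4 rot + 1 opacity + 48 SH coeffs
GROUP_SIZE = 256    # assumed number of Gaussians per spatial group
N_LATENTS = 16      # assumed number of latent tokens stored per group
D_MODEL = 64        # assumed latent token width


class GroupedAttentionCompressor(nn.Module):
    """Compress each spatial group of Gaussians into a few latent tokens
    via cross-attention, then reconstruct the group from those tokens."""

    def __init__(self, d_in=GAUSSIAN_DIM, d_model=D_MODEL, n_heads=4,
                 group_size=GROUP_SIZE, n_latents=N_LATENTS):
        super().__init__()
        self.embed = nn.Linear(d_in, d_model)
        # learned queries: a small set for encoding, one per Gaussian for decoding
        self.latent_queries = nn.Parameter(torch.randn(n_latents, d_model))
        self.out_queries = nn.Parameter(torch.randn(group_size, d_model))
        self.enc_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.dec_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.proj_out = nn.Linear(d_model, d_in)

    def encode(self, gaussians):
        # gaussians: (B, group_size, d_in) -> latent: (B, n_latents, d_model)
        tokens = self.embed(gaussians)
        q = self.latent_queries.expand(tokens.shape[0], -1, -1)
        latent, _ = self.enc_attn(q, tokens, tokens)
        return latent

    def decode(self, latent):
        # latent: (B, n_latents, d_model) -> reconstruction: (B, group_size, d_in)
        q = self.out_queries.expand(latent.shape[0], -1, -1)
        tokens, _ = self.dec_attn(q, latent, latent)
        return self.proj_out(tokens)

    def forward(self, gaussians):
        return self.decode(self.encode(gaussians))


def group_by_position(params, group_size=GROUP_SIZE):
    """Sort Gaussians along a crude 1-D spatial key (a stand-in for a proper
    space-filling order such as Morton code) and pack into fixed-size groups."""
    xyz = params[:, :3]
    order = torch.argsort(xyz[:, 0] + 1e3 * xyz[:, 1] + 1e6 * xyz[:, 2])
    s = params[order]
    n_groups = s.shape[0] // group_size  # leftover Gaussians are dropped here
    return s[: n_groups * group_size].reshape(n_groups, group_size, -1)


if __name__ == "__main__":
    scene = torch.randn(100_000, GAUSSIAN_DIM)  # stand-in for a 3DGS scene
    groups = group_by_position(scene)
    model = GroupedAttentionCompressor()
    recon = model(groups)
    print(groups.shape, "->", recon.shape)
```

With these assumed sizes, each group of 256 Gaussians (256 × 59 floats) is summarized by 16 latent tokens of width 64 (1,024 floats), roughly a 15× reduction, in the spirit of the 16× compression figure reported above; the actual ratio in the thesis depends on its chosen group size, latent width, and any quantization applied to the stored latents.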