Skull segmentation from three-dimensional (3D) cone-beam computed tomography (CBCT) images is critical for the diagnosis and treatment planning of patients with craniomaxillofacial (CMF) deformities. Convolutional neural network (CNN)-based methods currently dominate volumetric image segmentation, but they are constrained by limited GPU memory relative to the large image size (e.g., 512 × 512 × 448). Typical ad hoc strategies, such as down-sampling or patch cropping, degrade segmentation accuracy because they fail to capture either local fine details or global contextual information. Other methods, such as Global-Local Networks (GLNet), focus on improving the network architecture, aiming to combine local details and global contextual information in a GPU memory-efficient manner. However, all of these methods operate on regular grids, which is computationally inefficient for volumetric image segmentation. In this work, we propose a novel VoxelRend-based network (VR-U-Net) that combines a memory-efficient variant of 3D U-Net with a voxel-based rendering (VoxelRend) module, which refines local details via voxel-based predictions on non-regular grids. Built on relatively coarse feature maps, the VoxelRend module significantly improves segmentation accuracy while using only a fraction of the GPU memory. We evaluate the proposed VR-U-Net on the skull segmentation task using a high-resolution CBCT dataset collected from local hospitals. Experimental results show that VR-U-Net yields high-quality segmentation results in a memory-efficient manner, highlighting the practical value of our method.
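
To make the memory argument and the idea of predicting on a non-regular set of voxels more concrete, the following is a minimal sketch in Python with NumPy. It is an illustration under stated assumptions, not the VoxelRend module itself: the function names (`feature_map_bytes`, `select_uncertain_voxels`, `refine`), the uncertainty criterion (distance of the foreground probability from 0.5), and the `point_head` callable are all hypothetical and do not come from the method described above.

```python
import numpy as np


def feature_map_bytes(shape, channels=1, bytes_per_value=4):
    """Bytes needed to store one dense feature map on a regular grid.

    bytes_per_value=4 assumes float32 storage (an assumption of this sketch).
    """
    return int(np.prod(shape)) * channels * bytes_per_value


def select_uncertain_voxels(prob, num_points):
    """Return (num_points, 3) integer coordinates of the most uncertain voxels.

    Uncertainty is taken here as closeness of the foreground probability
    to 0.5; this criterion is an assumption, not the paper's definition.
    """
    uncertainty = -np.abs(prob - 0.5)          # larger value = more uncertain
    flat = uncertainty.ravel()
    # Indices of the num_points largest uncertainty values (unordered).
    idx = np.argpartition(flat, -num_points)[-num_points:]
    return np.stack(np.unravel_index(idx, prob.shape), axis=1)


def refine(prob, coords, point_head):
    """Re-predict only the selected voxels and leave the rest untouched.

    point_head is a hypothetical per-voxel predictor: it maps a 1D array
    of values at the selected voxels to refined probabilities.
    """
    refined = prob.copy()
    z, y, x = coords.T
    refined[z, y, x] = point_head(prob[z, y, x])
    return refined


# A single float32 channel at the image size quoted above (512 x 512 x 448).
full_res_bytes = feature_map_bytes((512, 512, 448))
print(full_res_bytes / 2**20, "MiB per channel")
```

In this sketch, only the voxels returned by `select_uncertain_voxels` are passed to `point_head`, so the cost of the refinement step scales with the number of selected voxels rather than with the full volume. A real per-voxel head would typically consume feature vectors sampled at those coordinates rather than the coarse probabilities alone.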