Fabric materials are central to recreating the realistic appearance of avatars in virtual worlds and in many VR applications, ranging from virtual try-on and teleconferencing to character animation. We propose an end-to-end network model that uses video input to estimate the fabric materials of a garment worn by a human or an avatar in a virtual world. To achieve high accuracy, we jointly learn the human body and the garment geometry as conditions for material prediction. Because cloth is highly dynamic and deformable, general data-driven garment modeling remains a challenge. To address this problem, we propose a two-level auto-encoder that accounts for both the global and the local features of any garment geometry that directly affect material perception. Using this network, we can also achieve smooth geometric transitions between different garment topologies. During estimation, we use a closed-loop optimization structure to share information between tasks and feed the learned garment features into the temporal estimation of garment materials. Experiments show that our proposed network structures improve material classification accuracy by 1.5x and generalize to unseen input. The method also runs at least three orders of magnitude faster than the state of the art. We demonstrate the recovered fabric materials on virtual try-on, where we recreate the entire avatar appearance, including body shape and pose, garment geometry, and materials, from only a single video.