Joint reasoning about objects and 3D scene layout has shown great promise in scene interpretation. One visual cue that has been overlooked is texture arising from the spatial repetition of objects in a scene (e.g., the windows of a building). Such texture provides scene-specific constraints among objects, and thus facilitates scene interpretation. We present an approach for (1) detecting distinct textures of objects in a scene, (2) reconstructing the 3D shape of the detected texture surfaces, and (3) combining object detections and shape-from-texture toward a globally consistent scene interpretation. Inference is formulated within the reinforcement learning framework as a sequential interpretation of image regions, starting with confident regions, whose interpretation then guides that of the remaining regions. Our algorithm finds an optimal policy that maps states of detected objects and reconstructed surfaces to the actions that should be taken in those states, including detecting new objects and identifying new textures, so as to minimize a long-term loss. Tests against ground truth obtained from stereo images demonstrate that we can coarsely reconstruct a 3D model of the scene from a single image and, unlike prior work, without learning the layout of common scene surfaces. We also show that reasoning about the texture of objects improves object detection.
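
For concreteness, the policy objective described above can be sketched in standard reinforcement-learning notation. All symbols are assumptions introduced here for illustration and are not taken from the paper: $s_t$ is the state at step $t$ (the objects detected and surfaces reconstructed so far), $\pi(s_t)$ is the action the policy selects in that state (e.g., detecting a new object or identifying a new texture), $\ell$ is a per-step loss, $\gamma \in (0, 1]$ is a discount factor, and $T$ is the number of interpretation steps. The exact loss and horizon used by the authors may differ.

```latex
\[
  \pi^{*} \;=\; \arg\min_{\pi}\;
  \mathbb{E}\!\left[\, \sum_{t=0}^{T} \gamma^{t}\, \ell\bigl(s_t, \pi(s_t)\bigr) \right]
\]
```

Under this reading, an early action on a confident region is worthwhile when it lowers the expected loss of later steps, not only its own immediate loss.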