Effective human-robot collaboration requires the robot to observe human actions and to predict the future state of the collaborative workspace, so that it can generate anticipatory behavior. The objective of this project is to learn a sequence of increasingly abstract representations of the shared human-robot workspace that support predictions over increasing time horizons. We will employ unsupervised deep learning methods to form hierarchical representations of the scene's contents and changes by optimizing reconstruction and prediction objectives, respectively. These representations are computed by convolutional neural networks with relational autoencoders and spatio-temporal pooling; higher layers have a decreasing spatio-temporal granularity and an increasing number of feature maps. To tailor the representations to human-robot collaboration, we will use and fine-tune them for semantic perception and semantic prediction. Based on these semantic percepts of humans, objects, and actions, human-robot collaborative tasks will be modeled by structural recurrent neural networks. This yields a predicted semantic state of the joint human-robot workspace on multiple levels of description, which will be used to plan anticipatory robot behavior in a coarse-to-fine manner. We will demonstrate the utility of our approach in collaborative tasks in which the robot supports the human by providing the needed objects in the right order at the right moment.
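The layer structure described above can be illustrated with a minimal sketch. The functions, layer counts, and feature-map sizes below are illustrative assumptions, not the project's actual network; the sketch only shows how each level of the hierarchy halves the spatio-temporal resolution via pooling while doubling the number of feature maps, so that higher layers describe the workspace more coarsely but more richly.

```python
def pool2(seq):
    """Average-pool a 1-D signal by a factor of 2 (stand-in for spatio-temporal pooling)."""
    return [(seq[i] + seq[i + 1]) / 2 for i in range(0, len(seq) - 1, 2)]

def build_hierarchy(signal, n_layers=3, base_maps=8):
    """Return (resolution, feature_map_count) per layer for a toy input signal.

    Mirrors the stated design: each successive layer has half the
    spatio-temporal granularity and twice the feature maps.
    """
    layers = []
    current = signal
    for level in range(n_layers):
        maps = base_maps * 2 ** level          # feature maps grow with depth
        layers.append((len(current), maps))    # coarser grid, more maps
        current = pool2(current)               # reduce resolution for next level
    return layers

print(build_hierarchy(list(range(16))))
# → [(16, 8), (8, 16), (4, 32)]
```

In the actual system each level would be a convolutional relational autoencoder trained with reconstruction and prediction losses; this toy version only makes the granularity/feature-map trade-off concrete.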