To facilitate the emerging applications in the 5G networks and beyond, mobile network operators will provide many powerful control functionalities such as RAN slicing and resource scheduling. These control functionalities generally comprise a series of prediction tasks such as channel state information prediction, cellular traffic prediction and user mobility prediction which will be enabled by machine learning (ML) techniques. However, training the ML models offline is inefficient, due to the excessive overhead for forwarding the huge volume of data samples from cellular networks to remote ML training clouds. Thanks to the promising edge computing paradigm, we advocate cooperative online in-network ML training across edge clouds. To alleviate the data skew issue caused by the capacity heterogeneity and dynamics of edge clouds while avoiding excessive overhead, we propose Cocktail, a cost-efficient and data skew-aware online in-network distributed machine learning framework. We build a comprehensive model and formulate an online data scheduling problem to optimize the framework cost while reconciling the data skew from both short-term and long-term perspective. We exploit the stochastic gradient descent to devise an online asymptotically optimal algorithm. As its core building block, we propose optimal policies based on novel graph constructions to respectively solve two subproblems. We also improve the proposed online algorithm with online learning for fast convergence of in-network ML training. A small-scale testbed and large-scale simulations validate the superior performance of our framework.