Analyzing the "Goddess Gala" (女神大会) with Python: the actress coders most want to marry turns out to be...

December 17, 2018, 51CTO Blog

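As a football fan, I consider Dongqiudi (懂球帝) an indispensable app. Even on a phone with only 16 GB of storage, I have never uninstalled it.


Today's topic, however, has nothing to do with football. We are looking at the "Goddess Gala" (女神大会) column on Dongqiudi. The app is a major gathering place for "steel straight men" (钢铁直男), so the scores its users give each actress are reasonably representative of that group.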


Data source


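The Goddess Gala has now reached its 90th installment, with 90 actresses featured in total. The interface looks like this:

We use Fiddler to capture the photo URL of each actress and the id of each article in this interface, for use in the later scraping and visualization. The code is as follows: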

import json
import os
import re

import pandas as pd
import requests
from bs4 import BeautifulSoup

os.chdir('D:/爬虫/女神')

id_list = []
title_list = []
pic_list = []
date_list = []

# Walk search-result pages 1 to 5 for the keyword 女神大会 (percent-encoded in the URL)
for i in range(1, 6):
    url = 'http://api.dongqiudi.com/search?keywords=%E5%A5%B3%E7%A5%9E%E5%A4%A7%E4%BC%9A&type=all&page=' + str(i)
    html = requests.get(url=url).content
    news = json.loads(html.decode('utf-8'))['news']
    this_id = [k['id'] for k in news]
    this_pic = [k['thumb'] for k in news]
    this_title = [k['title'] for k in news]
    this_date = [k['pubdate'] for k in news]
    # Titles contain HTML markup; keep only the text
    this_title = [BeautifulSoup(k, "html.parser").text for k in this_title]
    id_list = id_list + this_id
    title_list = title_list + this_title
    pic_list = pic_list + this_pic
    date_list = date_list + this_date
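
The long `keywords` value in the search URL is just the percent-encoded UTF-8 form of 女神大会. The following is a minimal sketch of building the same URL with the standard library, so the keyword stays readable; `build_search_url` is a hypothetical helper name and not part of the original script:

```python
from urllib.parse import urlencode

BASE_URL = 'http://api.dongqiudi.com/search'


def build_search_url(page, keywords='女神大会'):
    """Return the search URL for one result page (hypothetical helper)."""
    # urlencode percent-encodes the UTF-8 bytes of the non-ASCII keyword
    query = urlencode({'keywords': keywords, 'type': 'all', 'page': page})
    return BASE_URL + '?' + query


print(build_search_url(1))
```

For page 1 this reproduces the URL that the loop builds by string concatenation.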


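Each actress's score, however, is published in the following installment, so we have to scrape the article bodies to get it:

The scraping code is as follows: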

prev_title_list = []
score_list = []
count_list = []
for article_id in id_list:
    url = 'http://www.dongqiudi.com/archive/{k}.html'.format(k=article_id)
    header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win32; x32; rv:54.0) Gecko/20100101 Firefox/54.0',
              'Connection': 'keep-alive'}
    cookies = 'v=3; iuuid=1A6E888B4A4B29B16FBA1299108DBE9CDCB327A9713C232B36E4DB4FF222CF03; webp=true; ci=1%2C%E5%8C%97%E4%BA%AC; __guid=26581345.3954606544145667000.1530879049181.8303; _lxsdk_cuid=1646f808301c8-0a4e19f5421593-5d4e211f-100200-1646f808302c8; _lxsdk=1A6E888B4A4B29B16FBA1299108DBE9CDCB327A9713C232B36E4DB4FF222CF03; monitor_count=1; _lxsdk_s=16472ee89ec-de2-f91-ed0%7C%7C5; __mta=189118996.1530879050545.1530936763555.1530937843742.18'
    # Turn the raw cookie string into the dict that requests expects
    cookie = {}
    for line in cookies.split(';'):
        name, value = line.strip().split('=', 1)
        cookie[name] = value
    html = requests.get(url, cookies=cookie, headers=header).content
    try:
        content = BeautifulSoup(html.decode('utf-8'), "html.parser")
        # Score: the red span; previous installment's title: the link with target "_self"
        score = content.find('span', attrs={'style': "color:#ff0000"}).text
        prev_title = content.find('a', attrs={"target": "_self"}).text
        # Keep the first comma-separated fragment that mentions 截至目前
        sentence = content.text.split(',')
        count = [k for k in sentence if re.search('截至目前', str(k))][0]
    except Exception:
        # Skip articles where any of the three fields is missing
        continue
    # Append only once all three values were found, so the lists stay aligned
    prev_title_list.append(prev_title)
    score_list.append(score)
    count_list.append(count)
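
The values collected by this loop are still raw text: `score` is the text of a span, and `count` is a sentence fragment. Before any comparison they need to become numbers. The following is a minimal sketch of that cleanup; the exact wording of the scraped strings is not shown in the article, so the sample strings are made up for illustration, and `parse_score` and `parse_count` are hypothetical helper names:

```python
import re


def parse_score(text):
    """Return the first decimal number in text as a float, or None."""
    match = re.search(r'\d+(?:\.\d+)?', text)
    return float(match.group()) if match else None


def parse_count(sentence):
    """Return the first run of digits in sentence as an int, or None."""
    match = re.search(r'\d+', sentence)
    return int(match.group()) if match else None


# Made-up sample strings, for illustration only; not real scraped data
print(parse_score(' 8.6 '))
print(parse_count('截至目前共有1234人参与了打分'))
```

Returning `None` instead of raising keeps a malformed article from aborting the whole run; such rows can be dropped afterwards.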


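Overall comparison


This time we use the ggimage package in R to place the actresses' photos we collected into the final charts, which improves the visualization.


First, the TOP 15 by overall score:

Athena Chu (朱茵), Lin Chi-ling (林志玲), and Gao Yuanyuan (高圆圆) take the top three places. Whether or not this list matches your own idea of a goddess, these three also happen to be the highest scorers from Hong Kong, Taiwan, and mainland China, respectively, among the 90 actresses featured so far.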
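
The article builds its charts in R, but a ranking like this one can also be produced on the Python side. The following is a minimal sketch, assuming the cleaned data is a list of `(name, score)` pairs; `top_n` is a hypothetical helper, and the names and scores are placeholders, not the article's data:

```python
import heapq


def top_n(records, n=15):
    """Return the n highest-scoring (name, score) pairs, best first."""
    return heapq.nlargest(n, records, key=lambda record: record[1])


# Placeholder names and scores, not the article's data
sample = [('A', 7.9), ('B', 9.1), ('C', 8.4), ('D', 6.5)]
print(top_n(sample, n=3))
```

`heapq.nsmallest` with the same arguments would give the bottom of the list instead.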


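Notably, Dongqiudi's editors have a soft spot for Hong Kong actresses who were active in the 1990s: they have picked a great many of them, and those actresses' scores all rank near the top.


Next, the lowest-ranked of the 90 actresses featured so far:

Many readers may feel this list is a little harsh on younger actresses. Perhaps it also reflects the high hopes users have for them and their unlimited potential.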


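Regional comparison


Broken down by region, here are the top ten for each:

Having seen the TOP 10 for each region, we now compare the regions:

We combine a violin plot with a box plot to compare the regions, and the scores of mainland actresses are visibly lower.


On the one hand, a few actresses with low scores pull down the overall figure; on the other, the mainland actresses featured so far are generally younger, which the next section confirms.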
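
The center line of each box in a box plot is the group median, so the same regional comparison can be checked numerically. The following is a minimal sketch, assuming `(region, score)` pairs; `median_by_group` is a hypothetical helper, and the region labels and scores are placeholders rather than the article's data:

```python
from collections import defaultdict
from statistics import median


def median_by_group(records):
    """Map each group label to the median score of its members."""
    groups = defaultdict(list)
    for label, score in records:
        groups[label].append(score)
    return {label: median(scores) for label, scores in groups.items()}


# Placeholder labels and scores, not the article's data
sample = [('mainland', 7.0), ('mainland', 8.0), ('hongkong', 9.0),
          ('hongkong', 8.0), ('hongkong', 8.5)]
print(median_by_group(sample))
```

The median is less sensitive than the mean to a few very low scores, which matters here because the text attributes part of the gap to exactly such outliers.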


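Comparison by birth decade


Here is how overall scores compare across birth decades. The "post-60s" (60 后) group also includes actresses born before the 1960s, and the "post-90s" (90 后) group also includes those born in the 2000s: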
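
The grouping rule just described amounts to clamping each birth year into one of four buckets. The following is a minimal sketch of that rule; `decade_bucket` and its label strings are hypothetical names chosen for illustration, and the article does not show how birth years were obtained:

```python
def decade_bucket(birth_year):
    """Map a birth year to one of four buckets, merging the extremes."""
    if birth_year < 1970:
        return '60s'  # also covers anyone born before 1960
    if birth_year < 1980:
        return '70s'
    if birth_year < 1990:
        return '80s'
    return '90s'      # also covers anyone born in 2000 or later


for year in (1958, 1975, 1989, 2001):
    print(year, decade_bucket(year))
```

Merging the extremes keeps every bucket from being too thin, at the cost of making the first and last groups wider than a decade.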

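Actresses born in the 1960s and 1970s average higher scores than those born in the 1980s, who in turn score markedly higher than those born in the 1990s.


This shows the recognition that established actresses enjoy, and also the high hopes people have for the newer generation.


Next we combine region and birth decade:

The mainland actresses who were scored are generally younger, which goes some way toward explaining the lower overall scores for mainland actresses noted earlier.


Hong Kong and Taiwan actresses, by contrast, are concentrated in the post-60s and post-70s groups. The 1990s, when they were most active, were also the golden age of Hong Kong film and television, and we look forward to a revival of the Hong Kong screen industry.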


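Afterword


Dongqiudi's Goddess Gala has run for 90 installments so far and does not yet cover the field: none of the "four dans and two Bings" (四旦双冰) has appeared, for example. This data therefore cannot fully capture the goddess standards of "steel straight men." As more installments come out, a more complete analysis should become possible.


Finally, on a whim, I wanted to see whether scores differ by the day of the week on which an actress is featured:

To my surprise, the actresses featured on my three happiest days of the week, Thursday (the weekend is coming), Friday (welcoming the weekend), and Saturday (enjoying the weekend), actually score lower. This may simply be due to the small amount of data, and I will keep a close eye on it as more installments come out.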
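
A day-of-week comparison like this can be derived from the `pubdate` values collected earlier. The following is a minimal sketch; the article does not show the format of `pubdate`, so the `'%Y-%m-%d %H:%M:%S'` format is an assumption, `mean_score_by_weekday` is a hypothetical helper, and the scores are placeholders:

```python
from collections import defaultdict
from datetime import datetime


def mean_score_by_weekday(records, fmt='%Y-%m-%d %H:%M:%S'):
    """Map weekday (0 = Monday ... 6 = Sunday) to the mean score."""
    buckets = defaultdict(list)
    for pubdate, score in records:
        weekday = datetime.strptime(pubdate, fmt).weekday()
        buckets[weekday].append(score)
    return {day: sum(s) / len(s) for day, s in sorted(buckets.items())}


# Placeholder dates and scores, not the article's data
sample = [('2018-12-17 08:00:00', 8.0),  # a Monday
          ('2018-12-10 08:00:00', 9.0),  # also a Monday
          ('2018-12-14 08:00:00', 7.0)]  # a Friday
print(mean_score_by_weekday(sample))
```

With only 90 installments spread over seven days, each bucket holds few scores, so differences between days should be read with caution.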


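Author: Xu Lin (徐麟)

About the author: currently works in the data department of an internet company; a Columbia University statistics graduate and self-described "data dog" doing data mining and analysis, who enjoys using R and Python to play with unusual data.

Editors: Tao Jialong (陶家龙), Sun Shujuan (孙淑娟)

Source: reprinted with permission from the WeChat official account 数据森麟 (ID: shujusenlin).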
