To what extent and how

With ORES data. Whiteboard plots.

import pandas as pd
import matplotlib.pyplot as plt
import datetime
import numpy as np
import pymysql
from sqlalchemy import create_engine
%matplotlib inline
import requests

Description by Hong (24.01.2019):

  • item_sample: 50 items ordered by their current ORES itemquality category
  • item_history: those 50 items combined with all of their historical revisions
  • item_history_ores: the revisions together with their ORES scores, plus comment and user information (the comment column is spelled "commend" in the CSV)
data_itemHistory = pd.read_csv('ORES_sample_hong_24012019/item_history.csv')
data_itemHistoryORES = pd.read_csv('ORES_sample_hong_24012019/item_history_ores.csv')
data_itemSample = pd.read_csv('ORES_sample_hong_24012019/item_sample.csv') #added "item_id" to the header to make it compliant
#data_itemHistory.columns = ['item_id','rev_id','item_name','category']
#data_itemHistoryORES.columns = ['item_id','item_name','user_name','rev_comment','rev_id','category','probE','probD','probC','probB','probA']
#data_itemSample.columns = ['item_name','unsure','category','probA','probB','probC','probD','probE','item_id']
data_itemHistory.head()
   item_id  rev_id  item_name  itemquality_category
0      909    6207  Q624       A
1      909    6214  Q624       A
2      909    6218  Q624       A
3      909    6220  Q624       A
4      909    6226  Q624       A
data_itemHistoryORES.head()
   item_id  item_name  user_name  commend                                            rev_id  itemquality_category  itemquality_A  itemquality_B  itemquality_C  itemquality_D  itemquality_E
0      909  Q624       Erik1991   Created page with "An Italian footballer"            6207  E                               0.0            0.0       0.022993       0.154185       0.822822
1      909  Q624       Erik1991   /* wbsetsitelink-set:1|dewiki */ Alessandro De...    6214  E                               0.0            0.0       0.001042       0.135689       0.863269
2      909  Q624       Erik1991   /* wbsetsitelink-set:1|frwiki */ Alessandro De...    6218  E                               0.0            0.0       0.001042       0.135689       0.863269
3      909  Q624       Erik1991   /* wbsetsitelink-set:1|itwiki */ Alessandro De...    6220  E                               0.0            0.0       0.001042       0.135689       0.863269
4      909  Q624       Stryn      /* wbsetsitelink-set:1|fiwiki */ Alessandro De...    6226  E                               0.0            0.0       0.001042       0.135689       0.863269
data_itemSample.head()
   item_name     rev_id  itemquality_category  itemquality_A  itemquality_B  itemquality_C  itemquality_D  itemquality_E  item_id
0  Q624       783929576  A                          0.546310       0.177505       0.276185       0.000000            0.0      909
1  Q1310      785914523  A                          0.465916       0.279888       0.252273       0.001923            0.0     1699
2  Q1514      787502981  A                          0.803967       0.137084       0.058949       0.000000            0.0     1969
3  Q1804      787143300  A                          0.510762       0.263424       0.225814       0.000000            0.0     2446
4  Q2191      768899981  A                          0.442920       0.337640       0.219440       0.000000            0.0     3117
data_itemHistoryORES.size
64196
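Note that DataFrame.size counts cells (rows × columns), not rows: 64196 = 5836 rows × 11 columns, which agrees with the row count that len(data_itemHistoryORES) prints in the augmentation step. A minimal sketch of the difference on a made-up frame (the toy values are illustrative and not taken from the sample):

```python
import pandas as pd

# Toy frame (made-up values) with 3 rows and 2 columns.
toy = pd.DataFrame({
    "rev_id": [1, 2, 3],
    "itemquality_category": ["A", "B", "E"],
})

n_cells = toy.size               # rows * columns
n_rows = len(toy)                # rows only
n_rows_alt, n_cols = toy.shape   # (rows, columns)
```

For counting revisions, len(...) or .shape[0] is the number to report.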
data_itemSample.itemquality_category.value_counts()
D    10
A    10
B    10
E    10
C    10
Name: itemquality_category, dtype: int64

Augmenting the data

Get:

  • ugroup from revision_history_201710 (Howl)
  • qualityDimension from rev_quality2 (Howl)

And merge here
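
One way the merge step could look, as a sketch only. It assumes the two Howl extracts arrive as tables keyed by rev_id; the frame names (revisions, ugroups, quality) and all values below are hypothetical.

```python
import pandas as pd

# Hypothetical stand-ins for the revision sample and the two Howl extracts.
revisions = pd.DataFrame({"rev_id": [101, 102, 103], "item_name": ["Q1", "Q1", "Q2"]})
ugroups = pd.DataFrame({"rev_id": [101, 102, 103], "ugroup": [0, 0, 3]})
quality = pd.DataFrame({"rev_id": [101, 103], "qualitydim": [3, 5]})

# Left joins keep every sampled revision; validate raises if rev_id is duplicated
# on either side, which would silently multiply rows otherwise.
augmented = (
    revisions
    .merge(ugroups, on="rev_id", how="left", validate="one_to_one")
    .merge(quality, on="rev_id", how="left", validate="one_to_one")
)

# Revisions without a quality dimension show up as NaN and can be counted.
missing_quality = augmented["qualitydim"].isna().sum()
```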

# Use data_itemHistoryORES - Why are the other two data tables needed anyway? 
data_itemHistoryORES.rev_id.to_csv("mydata/rev_sampleORES.csv", header=True, index=False)
print(len(data_itemHistoryORES))
5836
data_sampleORESaug = pd.read_csv("mydata/augmented_sampleORES29012019.csv")
data_sampleORESaug.columns = ['rev_id','user_name','item_id','ugroup','qualitydim','timestamp','automated_tool']
data_sampleORESaug.head()
      rev_id  user_name    item_id  ugroup  qualitydim  timestamp            automated_tool
0  313208509  ԱշոտՏՆՂ      Q8327         0           3  2016-03-17 15:08:34  t
1  344766082  Lisp.hippie  Q8673         0           3  2016-06-07 19:02:03  t
2  251151916  Viscontino   Q624          0           3  2015-09-11 09:48:00  t
3  251152267  Viscontino   Q624          0           3  2015-09-11 09:49:37  t
4  251153210  Viscontino   Q624          0           3  2015-09-11 09:53:41  t
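# Attach itemquality_category from data_itemSample to data_sampleORESaug.
# NOTE: pd.concat(axis=1) aligns on the row index, not on the item, so only the first
# 50 rows receive a category, and they receive it by position. A keyed join on the Q-id
# (data_sampleORESaug.item_id vs. data_itemSample.item_name) would give one category per item.
result = pd.concat([data_sampleORESaug, data_itemSample.itemquality_category], axis=1, sort=False)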
# per item?
# z-score
result.head(30)
       rev_id  user_name     item_id  ugroup  qualitydim  timestamp            automated_tool  itemquality_category
 0  313208509  ԱշոտՏՆՂ       Q8327         0           3  2016-03-17 15:08:34  t               A
 1  344766082  Lisp.hippie   Q8673         0           3  2016-06-07 19:02:03  t               A
 2  251151916  Viscontino    Q624          0           3  2015-09-11 09:48:00  t               A
 3  251152267  Viscontino    Q624          0           3  2015-09-11 09:49:37  t               A
 4  251153210  Viscontino    Q624          0           3  2015-09-11 09:53:41  t               A
 5  251153565  Viscontino    Q624          0           3  2015-09-11 09:55:06  t               A
 6  251153718  Viscontino    Q624          0           3  2015-09-11 09:55:45  t               A
 7  454464007  Thierry Caro  Q624          0           3  2017-02-22 18:39:44  t               A
 8  359712777  Lymantria     Q1310         0           3  2016-07-26 13:49:08  t               A
 9  359728384  Lymantria     Q1310         0           3  2016-07-26 14:54:04  t               A
10  359734915  Lymantria     Q1310         0           3  2016-07-26 15:13:46  t               B
11  359740525  Lymantria     Q1310         0           3  2016-07-26 15:33:04  t               B
12  359746177  Lymantria     Q1310         0           3  2016-07-26 15:52:44  t               B
13  359753584  Lymantria     Q1310         0           3  2016-07-26 16:15:38  t               B
14  449984376  Pintoch       Q1310         0           3  2017-02-17 00:03:40  t               B
15  453582331  Pintoch       Q1310         0           3  2017-02-21 09:20:02  t               B
16  489983134  PokestarFan   Q1514         0           3  2017-05-26 00:22:09  t               B
17  412039139  Teolemon      Q1804         0           3  2016-11-28 10:52:16  t               B
18  367410377  Lockal        Q2117         0           3  2016-08-20 23:39:52  t               B
19   67823165  LBE           Q4868         0           4  2013-09-01 07:10:30  f               B
20    2078060  Tbennert      Q5100         0           4  2012-12-22 03:01:10  f               C
21  514973708  Alexchris     Q5641         0           5  2017-07-07 04:54:08  f               C
22   83971539  Kani          Q5845         0           5  2013-11-05 17:09:52  f               C
23  147101116  Wylve         Q5641         0           4  2014-07-23 17:30:34  f               C
24     307474  Stryn         Q5862         0           4  2012-11-03 09:41:15  f               C
25  258180857  Peppepz       Q8327         0           4  2015-10-15 06:48:09  f               C
26    6213008  B25es         Q8527         0           4  2013-02-10 17:43:01  f               C
27  156101717  Ariefz        Q9667         0           4  2014-09-06 23:19:13  f               C
28  144283387  Miguel Chong  Q624          0           4  2014-07-13 05:33:16  f               C
29   85561874  83.5.136.177  Q1310         3           5  2013-11-11 09:38:29  f               C
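
In the output above the categories follow row position (rows 0 to 9 are A, 10 to 19 are B, 20 to 29 are C), and the same item lands in several categories: Q624 is A in row 2 but C in row 28. That is what pd.concat(..., axis=1) does, since it aligns on the row index rather than on the item. If the intent is one category per item, a keyed merge is one option. A minimal sketch with made-up rows, assuming that item_id in the augmented table holds the same Q-identifiers as item_name in the item sample (the frame names aug and sample are hypothetical):

```python
import pandas as pd

# Made-up stand-ins: a few augmented revisions and a two-item sample.
aug = pd.DataFrame({
    "rev_id": [11, 12, 13, 14],
    "item_id": ["Q624", "Q1310", "Q624", "Q9999"],  # Q9999 is not in the sample
})
sample = pd.DataFrame({
    "item_name": ["Q624", "Q1310"],
    "itemquality_category": ["A", "A"],
})

# Join on the Q-identifier; many revisions map to one sampled item.
merged = aug.merge(
    sample[["item_name", "itemquality_category"]],
    left_on="item_id",
    right_on="item_name",
    how="left",
    validate="many_to_one",
).drop(columns="item_name")

# Sanity check: no item should carry more than one category after the join.
categories_per_item = merged.groupby("item_id")["itemquality_category"].nunique()
```

Revisions of items outside the sample keep a missing category instead of inheriting one by position.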
count = result.groupby(['itemquality_category','item_id','ugroup'])['rev_id'].count()
count.head()
itemquality_category  item_id  ugroup
A                     Q1310    0         2
                      Q624     0         6
                      Q8327    0         1
                      Q8673    0         1
B                     Q1310    0         6
Name: rev_id, dtype: int64
c = count.reset_index(name='count')  # name the counted column directly instead of renaming all columns
c.head()
   itemquality_category  item_id  ugroup  count
0  A                     Q1310         0      2
1  A                     Q624          0      6
2  A                     Q8327         0      1
3  A                     Q8673         0      1
4  B                     Q1310         0      6
totalItems = c.groupby(['itemquality_category','item_id'])['count'].agg('sum')
totalItems.head()
itemquality_category  item_id
A                     Q1310      2
                      Q624       6
                      Q8327      1
                      Q8673      1
B                     Q1310      6
Name: count, dtype: int64
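
A possible next step for the "per item?" question is to turn these counts into each user group's share of an item's revisions. A minimal sketch on made-up counts laid out like c; the share column name and the toy values are assumptions.

```python
import pandas as pd

# Made-up counts in the same layout as c.
c_toy = pd.DataFrame({
    "itemquality_category": ["A", "A", "A", "B"],
    "item_id": ["Q1", "Q1", "Q2", "Q3"],
    "ugroup": [0, 3, 0, 0],
    "count": [6, 2, 1, 4],
})

# transform('sum') broadcasts each item's total back onto its rows,
# so the division stays row-aligned without a second merge.
item_total = c_toy.groupby(["itemquality_category", "item_id"])["count"].transform("sum")
c_toy["share"] = c_toy["count"] / item_total
```

Shares are comparable across items with very different revision counts, which raw counts are not.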