import pandas as pd
import numpy as np
%matplotlib inline
data = pd.read_csv("https://docs.google.com/uc?export=download&id=1mr-KGEeKq-QS7xKtNKakWlDrjdLukv47",encoding='utf8')
data.dropna(inplace=True)
data.head()
better_0 _unit_id _started_at _created_at _trust _worker_id _city age similarity_0 explanation_0 asi1 time_spent score
0 Your keyword 4 1/10/2018 15:55:41 1/10/2018 16:20:41 0.385118 32 Ernakulam 36-50 6 they are all dressed well and using computers ... 4 00:25:00 84
1 The two keywords are completely identical 6 1/10/2018 17:04:22 1/10/2018 17:23:42 0.033270 13 Kolkata 36-50 6 Almost identicalexcept the tiny spelling diffe... 3 00:19:20 52
4 The two keywords are completely identical 6 1/11/2018 05:14:03 1/11/2018 05:21:25 0.708808 70 Mangalagiri 19-25 7 both are similar 5 00:07:22 16
5 The two keywords are completely identical 20 1/10/2018 16:41:16 1/10/2018 17:06:26 0.899786 95 Patna 26-35 7 they both describe the same kind of people 3 00:25:10 42
6 Your keyword 13 1/10/2018 15:47:20 1/10/2018 16:01:19 0.873825 37 Ulhasnagar 19-25 6 We can see a relaxed state in that images 5 00:13:59 41

Time spent on a question (this can be a useful signal of worker ability)

pd.to_datetime(data['_created_at'])- pd.to_datetime(data['_started_at']) #use pd.to_numeric() to convert to number of ns
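As a minimal sketch (not yet stored in the dataframe), the resulting timedeltas can be turned into plain numbers with pd.to_numeric(); the values come out in nanoseconds, so dividing by 1e9 gives seconds:

# Sketch: timedeltas -> nanoseconds -> seconds
time_spent = pd.to_datetime(data['_created_at']) - pd.to_datetime(data['_started_at'])
(pd.to_numeric(time_spent) / 1e9).head()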
len(data)
data.describe()
_unit_id _trust _worker_id similarity_0 asi1 time_spent
count 63.000000 63.000000 63.000000 63.000000 63.000000 63
mean 15.460317 0.544987 46.571429 5.555556 3.650794 0 days 00:20:49.984126
std 8.918674 0.321671 28.953642 1.329295 1.109471 0 days 00:06:18.184563
min 1.000000 0.033270 1.000000 2.000000 0.000000 0 days 00:06:07
25% 7.000000 0.282123 21.000000 4.000000 3.000000 0 days 00:17:42
50% 15.000000 0.609291 49.000000 6.000000 4.000000 0 days 00:21:55
75% 24.000000 0.852443 70.500000 7.000000 4.000000 0 days 00:25:19
max 31.000000 0.974416 98.000000 7.000000 5.000000 0 days 00:29:49

Let's see how many judgments we have per unit

data.groupby('_unit_id').size()
data.groupby('_unit_id').size().values
data.groupby('_unit_id').size().hist()

Let's remove the units that have only one judgment

(data.groupby('_unit_id').size()==1).values
sizes = data.groupby('_unit_id').size()
a = list(sizes[sizes == 1].index)  # the unit ids with a single judgment (np.where would give positions, not ids)
a
data[data['_unit_id'].isin(a)]
data = data[~data['_unit_id'].isin(a)]
len(data)
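As an alternative sketch (same goal as above, not what the notebook does), groupby().filter() removes the single-judgment units in one step:

# Alternative sketch: keep only units with at least two judgments
data = data.groupby('_unit_id').filter(lambda g: len(g) > 1)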
  1. Create a column with time spent (use pd.to_datetime)
  2. Compute the average time per worker
data['time_spent'] = pd.to_datetime(data['_created_at']) - pd.to_datetime(data['_started_at'])
data.head()
data.groupby('_worker_id')['time_spent'].mean()  # average time spent per worker

Basic aggregation

Quantitative variables

data.groupby('_unit_id')['similarity_0'].mean()

If we are also doing a per-worker analysis, we can compute per-worker values

data.groupby('_worker_id')['_trust'].mean().values
data.groupby('_worker_id')['_trust'].mean().hist()

Categorical variables

We can't do the following, because 'better_0' is a categorical variable (taking the mean of text raises an error):

data.groupby('_unit_id')['better_0'].mean()

Let's explore what this column contains and decide what to do

data.groupby('_unit_id')['better_0'].describe() 
print(data['better_0'].unique())
len(data['better_0'].unique())

The majority vote of an array is simply the mode

data['better_0'].mode()

How is the variable distributed?

data.groupby('better_0')['better_0'].size()

Let's compute the majority voting

data.groupby('_unit_id')['better_0'].apply(lambda x: x.mode())

Sometimes this returns two values (a tie); in that case let's take the first one (a better approach would be to pick one at random, as sketched below)

data.groupby('_unit_id')['better_0'].apply(lambda x: x.mode()[0])
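A sketch of the random tie-break mentioned above (sample(1) is just one possible way to pick among tied modes):

# Sketch: break ties between modes by picking one at random
data.groupby('_unit_id')['better_0'].apply(lambda x: x.mode().sample(1).iloc[0])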

Weighted measures

Weighted mean
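As a reminder (standard notation, not from the notebook), the weighted mean of values $x_i$ with weights $w_i$ is $\bar{x}_w = \frac{\sum_i w_i x_i}{\sum_i w_i}$; here the weights are the workers' '_trust' scores.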

def weighted_mean(df, weights, values):  # df is a dataframe containing a single question
    sum_values = (df[weights]*df[values]).sum()
    total_weight = df[weights].sum()
    return sum_values/total_weight
data.groupby('_unit_id').apply(lambda x: weighted_mean(x, '_trust', 'similarity_0'))
data.groupby('_unit_id').apply(lambda x: (x['_trust']*x['similarity_0']).sum()/(x['_trust'].sum()))

Weighted majority voting

Now we need, for each unit, to find the category with the highest total trust score
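Formally (standard notation, not from the notebook): for a unit with answers $y_i$ and trust weights $w_i$, the weighted majority vote is $\hat{y} = \arg\max_c \sum_{i:\, y_i = c} w_i$.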

data.head()
def weighted_majority(df, weights, values):  # df is a dataframe containing a single question
    #print(df.groupby(values)[weights].sum())
    best_value = df.groupby(values)[weights].sum().idxmax()  # idxmax returns the category with the highest total trust
    return best_value
data.groupby('_unit_id').apply(lambda x: weighted_majority(x, '_trust', 'better_0'))

Creating a summary table

results = pd.DataFrame()
results['better'] = data.groupby('_unit_id').apply(lambda x: weighted_majority(x, '_trust', 'better_0'))
results['similarity'] = data.groupby('_unit_id').apply(lambda x: weighted_mean(x, '_trust', 'similarity_0'))
results['better_code'] = results['better'].astype('category').cat.codes
results

Free text

Now we analyse the case in which we have free text

data['better_0'].unique()
array(['Your keyword', 'The two keywords are completely identical',
       'Search engine query'], dtype=object)
data['explanation_0'].unique()
array(['they are all dressed well and using computers so its more like a  business scenario.',
       'Almost identicalexcept the tiny spelling difference.',
       'both are similar', 'they both describe the same kind of people',
       'We can see a relaxed state in that images', 'YES',
       'A person is generalized and one cannot find the images of Einstein or kids in them.',
       'they are calm', 'genious', 'only 1 image',
       'interested in their work',
       'i think this is correct that calm person because every one is calm in this images',
       'images looks like taking a deep breath',
       'it now seems more like to give these results whn we think of interested person rather than thinking and surprising',
       'based on result of image', 'whipping', 'Yes',
       'calm person and calmness same',
       'result suits more to this kerword', 'yes', 'anger',
       'hot air baloon', 'both are the same', 'same attitude of boss',
       'the results are same',
       'all my words are feature of Search engine query',
       'They all are working in the office',
       'in image person looking very casual',
       'both refer to the same traits but intelligent word is more suited',
       'i know', 'Because all people here look casual.', 'both are same',
       'Casualness is used in both the words',
       'interested person only can do Research, smart, thinging',
       'Casual person is more accurate of the images.',
       'i believe this is my personal theory..so i think aggressive person would be better keyword for these images',
       'My keyword "happy people" and Search engine query "calm person" is almost same.',
       'My answer is more specific regarding images.',
       'i know need search engine when i already knew it',
       'BOTH ARE SIMILAR',
       'by query image i understood that person seems very angry',
       'Everything is related with warm',
       'it gives better ideas about all the image',
       'we got the same image when search in google',
       'with the facial expression we can find him too aggresive',
       'very much about that', "It's the image of that",
       'Smart Person Bring Innovation and must have high IQ',
       'person in aggression is shouting at others',
       'they are all were casual dress', 'By nature',
       'because it shows that',
       'people are working i guess working people is more apt',
       'They also look happy',
       'On detailed viewing smart person might be a better keyword.',
       'everybody is yelling',
       'a complete act of expression works out here'], dtype=object)

We can't use weighted majority voting here! We first need to assign a score to these values.

Exercise

  • Create a function that assigns a score to each value of the column 'explanation_0' (for example the text length, len(text), or whether it contains some words from a list, str in value); see here for reference https://pandas.pydata.org/pandas-docs/stable/text.html
  • Create a column with this score
  • Generate a weighted mean for it (using '_trust')
def compute_score(text):
    score = 0
    for i in ['similar', 'name', 'something']:  # count how many of these words appear in the explanation
        if i in text:
            score += 1
    return score
data['score'] = data['explanation_0'].apply(compute_score)
data.groupby('_unit_id').apply(lambda x: weighted_mean(x, '_trust', 'score'))
_unit_id
1     0.000000
2     0.000000
3     0.000000
4     0.000000
6     0.955167
7     0.000000
10    0.000000
13    0.000000
14    0.000000
15    0.000000
16    0.000000
17    0.467747
20    0.000000
21    0.000000
23    0.369922
24    0.000000
25    0.464503
26    0.000000
27    0.000000
30    0.000000
31    0.000000
dtype: float64
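The pandas string methods referenced in the exercise give a vectorized alternative to compute_score; a minimal sketch with the same keyword list:

# Sketch: vectorized keyword counting with .str.contains
keywords = ['similar', 'name', 'something']
data['score'] = sum(data['explanation_0'].str.contains(w) for w in keywords)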
data['time_spent'] = pd.to_datetime(data['_created_at']) - pd.to_datetime(data['_started_at'])
data['time'] = pd.to_numeric(data['time_spent'])/1e9  # nanoseconds -> seconds
data.groupby('_unit_id').apply(lambda x: weighted_mean(x, '_trust', 'time'))
_unit_id
1     1226.583864
2     1346.047720
3     1541.481515
4     1396.977989
6      474.190359
7     1522.166724
10    1016.627134
13    1064.354107
14    1700.328074
15    1330.235426
16     690.252255
17     697.133681
20    1510.000000
21    1315.286800
23    1025.093862
24    1356.460091
25    1196.422715
26    1626.726723
27    1244.551237
30     681.941651
31    1631.000000
dtype: float64
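A more readable way to get seconds out of the timedelta column (a sketch of an alternative to the pd.to_numeric() division above) is the .dt accessor:

# Alternative sketch: seconds via the timedelta accessor
data['time'] = data['time_spent'].dt.total_seconds()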