import pandas as pd
import numpy as np
%pylab inline
Populating the interactive namespace from numpy and matplotlib
data = pd.read_csv("https://docs.google.com/uc?export=download&id=1mr-KGEeKq-QS7xKtNKakWlDrjdLukv47",encoding='utf8')
data.dropna(inplace=True)
data.head()
better_0 _unit_id _started_at _created_at _trust _worker_id _city age similarity_0 explanation_0 asi1
0 Your keyword 4 1/10/2018 15:55:41 1/10/2018 16:20:41 0.385118 32 Ernakulam 36-50 6 they are all dressed well and using computers ... 4
1 The two keywords are completely identical 6 1/10/2018 17:04:22 1/10/2018 17:23:42 0.033270 13 Kolkata 36-50 6 Almost identicalexcept the tiny spelling diffe... 3
2 Search engine query 18 1/10/2018 15:53:29 1/10/2018 16:21:35 0.551213 21 Pune 36-50 6 All the images represents the search better... 1
4 The two keywords are completely identical 6 1/11/2018 05:14:03 1/11/2018 05:21:25 0.708808 70 Mangalagiri 19-25 7 both are similar 5
5 The two keywords are completely identical 20 1/10/2018 16:41:16 1/10/2018 17:06:26 0.899786 95 Patna 26-35 7 they both describe the same kind of people 3
len(data)
69
data.describe()
_unit_id _trust _worker_id similarity_0 asi1
count 69.000000 69.000000 69.000000 69.000000 69.000000
mean 15.739130 0.554085 46.028986 5.608696 3.652174
std 8.841167 0.316064 28.716654 1.297253 1.135342
min 1.000000 0.033270 1.000000 2.000000 0.000000
25% 7.000000 0.300126 21.000000 4.000000 3.000000
50% 15.000000 0.588048 48.000000 6.000000 4.000000
75% 24.000000 0.853188 70.000000 7.000000 4.000000
max 31.000000 0.988551 98.000000 7.000000 5.000000

Let's see how many judgments we have per unit

data.groupby('_unit_id').size()
_unit_id
1     2
2     3
3     2
4     4
6     2
7     5
9     1
10    4
11    1
13    3
14    3
15    6
16    2
17    3
18    2
20    1
21    2
23    4
24    5
25    3
26    2
27    3
28    2
30    3
31    1
dtype: int64
data.groupby('_unit_id').size().values
array([2, 3, 2, 4, 2, 5, 1, 4, 1, 3, 3, 6, 2, 3, 2, 1, 2, 4, 5, 3, 2, 3,
       2, 3, 1])
data.groupby('_unit_id').size().hist()
<matplotlib.axes._subplots.AxesSubplot at 0x110c574a8>

Let's remove the units that have only one judgment

(data.groupby('_unit_id').size()==1).values
array([False, False, False, False, False, False,  True, False,  True,
       False, False, False, False, False, False,  True, False, False,
       False, False, False, False, False, False,  True])
a = np.where((data.groupby('_unit_id').size()==1))
a
(array([ 6,  8, 15, 24]),)
a = list(a[0])
a
[6, 8, 15, 24]
data[data['_unit_id'].isin(a)]
better_0 _unit_id _started_at _created_at _trust _worker_id _city age similarity_0 explanation_0 asi1
1 The two keywords are completely identical 6 1/10/2018 17:04:22 1/10/2018 17:23:42 0.033270 13 Kolkata 36-50 6 Almost identicalexcept the tiny spelling diffe... 3
4 The two keywords are completely identical 6 1/11/2018 05:14:03 1/11/2018 05:21:25 0.708808 70 Mangalagiri 19-25 7 both are similar 5
19 Your keyword 15 1/10/2018 19:23:35 1/10/2018 19:45:53 0.814008 11 Bhopal 36-50 4 based on result of image 5
26 Search engine query 15 1/10/2018 16:23:16 1/10/2018 16:44:03 0.851697 94 Kolkata 19-25 6 yes 3
35 Your keyword 15 1/10/2018 17:28:04 1/10/2018 17:55:56 0.300126 60 Bhubaneswar 19-25 6 same attitude of boss 4
40 Your keyword 15 1/10/2018 17:03:40 1/10/2018 17:21:24 0.851131 63 New Delhi 26-35 6 They all are working in the office 5
41 Search engine query 24 1/10/2018 17:45:51 1/10/2018 18:09:20 0.345770 48 Bhopal 26-35 4 in image person looking very casual 4
53 Your keyword 15 1/10/2018 16:30:44 1/10/2018 16:57:56 0.609291 8 Chennai 36-50 4 interested person only can do Research, smart,... 3
65 Your keyword 24 1/10/2018 18:14:38 1/10/2018 18:30:30 0.124243 33 Thanjavur 19-25 6 i know need search engine when i already knew it 4
74 Search engine query 24 1/11/2018 04:06:47 1/11/2018 04:31:16 0.354657 67 Mangalagiri 19-25 7 we got the same image when search in google 5
78 Search engine query 24 1/10/2018 16:36:03 1/10/2018 17:00:17 0.525274 77 Kolkata 26-35 7 It's the image of that 3
87 Your keyword 24 1/10/2018 19:54:03 1/10/2018 20:15:33 0.888706 35 Delhi 26-35 7 because it shows that 3
97 Search engine query 15 1/11/2018 01:56:50 1/11/2018 02:20:43 0.046097 7 Chennai 26-35 5 On detailed viewing smart person might be a be... 4
data = data[~data['_unit_id'].isin(a)]
len(data)
56
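
A caveat on the index trick above: np.where returns positional indices into the grouped result, which coincide with unit ids only when the ids run 0..n-1. Here the positions [6, 8, 15, 24] are not the ids of the single-judgment units (those are 9, 11, 20, 31), so the isin filter above removes different units than intended. A safer sketch (on a made-up toy frame) filters by the index of the size series itself:

```python
import pandas as pd

# toy frame: unit ids are not consecutive, so positions != ids
data = pd.DataFrame({'_unit_id': [4, 4, 9, 20, 20, 31],
                     'similarity_0': [6, 7, 6, 7, 7, 4]})
sizes = data.groupby('_unit_id').size()
singles = sizes[sizes == 1].index            # the actual unit ids with one judgment
data = data[~data['_unit_id'].isin(singles)]  # keeps only units 4 and 20
```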

Basic aggregation

Quantitative variables

data.groupby('_unit_id')['similarity_0'].mean()
_unit_id
1     5.500000
2     4.666667
3     6.000000
4     5.750000
7     5.400000
9     6.000000
10    6.000000
11    7.000000
13    6.333333
14    4.000000
16    5.000000
17    5.666667
18    6.000000
20    7.000000
21    7.000000
23    6.500000
25    4.333333
26    4.500000
27    5.666667
28    6.000000
30    5.333333
31    4.000000
Name: similarity_0, dtype: float64

If we are also doing a per-worker analysis, we can compute per-worker values, for example the mean trust:

data.groupby('_worker_id')['_trust'].mean().values
array([0.45086904, 0.92770687, 0.62552536, 0.93464997, 0.04204837,
       0.77016858, 0.5741613 , 0.91541507, 0.95153081, 0.18914215,
       0.54772031, 0.30560117, 0.80167709, 0.5512134 , 0.97441615,
       0.98855115, 0.92891881, 0.96747946, 0.09118499, 0.62159951,
       0.51243052, 0.58563959, 0.58804797, 0.38511825, 0.79730475,
       0.87382468, 0.67732971, 0.85318828, 0.28074398, 0.80507253,
       0.05786407, 0.09042158, 0.42789365, 0.05809224, 0.32548398,
       0.03610733, 0.37399525, 0.79652694, 0.62465149, 0.36259017,
       0.91410825, 0.28350176, 0.91083596, 0.33243423, 0.03891988,
       0.26484709, 0.21250903, 0.63448413, 0.93515637, 0.03589091,
       0.91445626, 0.2602371 , 0.9408621 , 0.9150934 , 0.89978581,
       0.61376345])
data.groupby('_worker_id')['_trust'].mean().hist()
<matplotlib.axes._subplots.AxesSubplot at 0x110cb0a58>

Categorical variables

We can't do the following, because better_0 is a categorical variable:

data.groupby('_unit_id')['better_0'].mean()
---------------------------------------------------------------------------
DataError                                 Traceback (most recent call last)
<ipython-input-266-968d2955e2fa> in <module>()
----> 1 data.groupby('_unit_id')['better_0'].mean()

/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/pandas/core/groupby.py in mean(self, *args, **kwargs)
   1126         nv.validate_groupby_func('mean', args, kwargs, ['numeric_only'])
   1127         try:
-> 1128             return self._cython_agg_general('mean', **kwargs)
   1129         except GroupByError:
   1130             raise

/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/pandas/core/groupby.py in _cython_agg_general(self, how, alt, numeric_only, min_count)
    925 
    926         if len(output) == 0:
--> 927             raise DataError('No numeric types to aggregate')
    928 
    929         return self._wrap_aggregated_output(output, names)

DataError: No numeric types to aggregate

Let's explore what this column contains and decide what to do

data.groupby('_unit_id')['better_0'].describe() 
count unique top freq
_unit_id
1 2 2 Your keyword 1
2 3 2 Your keyword 2
3 2 2 Your keyword 1
4 4 3 The two keywords are completely identical 2
7 5 2 Your keyword 3
9 1 1 Your keyword 1
10 4 2 Search engine query 2
11 1 1 The two keywords are completely identical 1
13 3 2 Your keyword 2
14 3 2 Search engine query 2
16 2 1 Search engine query 2
17 3 3 Your keyword 1
18 2 2 Search engine query 1
20 1 1 The two keywords are completely identical 1
21 2 2 Search engine query 1
23 4 1 The two keywords are completely identical 4
25 3 3 Search engine query 1
26 2 1 Search engine query 2
27 3 2 The two keywords are completely identical 2
28 2 2 Search engine query 1
30 3 2 Your keyword 2
31 1 1 Search engine query 1
print(data['better_0'].unique())
len(data['better_0'].unique())
['Your keyword' 'Search engine query'
 'The two keywords are completely identical']
3

The majority vote of an array is simply the mode

data['better_0'].mode()
0    The two keywords are completely identical
dtype: object

How is the variable distributed?

data.groupby('better_0')['better_0'].size()
better_0
Search engine query                          19
The two keywords are completely identical    20
Your keyword                                 17
Name: better_0, dtype: int64
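
Grouping a column by itself and taking the size is equivalent to value_counts; a toy check (made-up data):

```python
import pandas as pd

s = pd.Series(['x', 'y', 'x', 'x'])
counts = s.value_counts()     # x: 3, y: 1
same = s.groupby(s).size()    # identical counts, indexed by category
```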

Let's compute the majority voting

data.groupby('_unit_id')['better_0'].apply(lambda x: x.mode())
_unit_id   
1         0    The two keywords are completely identical
          1                                 Your keyword
2         0                                 Your keyword
3         0    The two keywords are completely identical
          1                                 Your keyword
4         0    The two keywords are completely identical
7         0                                 Your keyword
9         0                                 Your keyword
10        0                          Search engine query
          1    The two keywords are completely identical
11        0    The two keywords are completely identical
13        0                                 Your keyword
14        0                          Search engine query
16        0                          Search engine query
17        0                          Search engine query
          1    The two keywords are completely identical
          2                                 Your keyword
18        0                          Search engine query
          1    The two keywords are completely identical
20        0    The two keywords are completely identical
21        0                          Search engine query
          1    The two keywords are completely identical
23        0    The two keywords are completely identical
25        0                          Search engine query
          1    The two keywords are completely identical
          2                                 Your keyword
26        0                          Search engine query
27        0    The two keywords are completely identical
28        0                          Search engine query
          1    The two keywords are completely identical
30        0                                 Your keyword
31        0                          Search engine query
Name: better_0, dtype: object

Sometimes this returns more than one value (a tie); in that case, let's take the first one (a better approach would be to pick one at random)

data.groupby('_unit_id')['better_0'].apply(lambda x: x.mode()[0])
_unit_id
1     The two keywords are completely identical
2                                  Your keyword
3     The two keywords are completely identical
4     The two keywords are completely identical
7                                  Your keyword
9                                  Your keyword
10                          Search engine query
11    The two keywords are completely identical
13                                 Your keyword
14                          Search engine query
16                          Search engine query
17                          Search engine query
18                          Search engine query
20    The two keywords are completely identical
21                          Search engine query
23    The two keywords are completely identical
25                          Search engine query
26                          Search engine query
27    The two keywords are completely identical
28                          Search engine query
30                                 Your keyword
31                          Search engine query
Name: better_0, dtype: object
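
The random tie-breaking mentioned above can be sketched as follows (toy votes, seeded generator for reproducibility):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # seeded so the tie-break is reproducible

def majority_random(s):
    modes = s.mode()             # all tied top categories
    return rng.choice(modes)     # pick one at random instead of always the first

votes = pd.Series(['a', 'b', 'a', 'b'])  # a 2-2 tie
winner = majority_random(votes)
```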

Weighted measures

Weighted mean

def weigthed_mean(df,weights,values): #df is a dataframe containing a single question
    sum_values = (df[weights]*df[values]).sum()
    total_weight = df[weights].sum()
    return sum_values/total_weight
data.groupby('_unit_id').apply(lambda x: weigthed_mean(x,'_trust','similarity_0'))
_unit_id
1     5.532764
2     4.961362
3     6.806888
4     5.789938
7     5.675547
9     6.000000
10    5.739468
11    7.000000
13    6.437989
14    4.000000
16    4.175357
17    5.840521
18    6.000000
20    7.000000
21    7.000000
23    6.465985
25    4.525120
26    4.556271
27    4.706914
28    5.558800
30    4.415340
31    4.000000
dtype: float64
data.groupby('_unit_id').apply(lambda x: (x['_trust']*x['similarity_0']).sum()/(x['_trust'].sum()))
_unit_id
1     5.532764
2     4.961362
3     6.806888
4     5.789938
7     5.675547
9     6.000000
10    5.739468
11    7.000000
13    6.437989
14    4.000000
16    4.175357
17    5.840521
18    6.000000
20    7.000000
21    7.000000
23    6.465985
25    4.525120
26    4.556271
27    4.706914
28    5.558800
30    4.415340
31    4.000000
dtype: float64
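
The same weighted mean can also be obtained with np.average, which takes a weights argument; a toy check on made-up numbers for a single unit:

```python
import numpy as np
import pandas as pd

# toy judgments for one unit (made-up numbers)
df = pd.DataFrame({'_trust': [0.2, 0.8], 'similarity_0': [4, 6]})
wm = np.average(df['similarity_0'], weights=df['_trust'])  # (0.2*4 + 0.8*6) / 1.0
```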

Weighted majority voting

Now, for each unit, we need to find the category with the highest total trust score

data.head()
better_0 _unit_id _started_at _created_at _trust _worker_id _city age similarity_0 explanation_0 asi1
0 Your keyword 4 1/10/2018 15:55:41 1/10/2018 16:20:41 0.385118 32 Ernakulam 36-50 6 they are all dressed well and using computers ... 4
2 Search engine query 18 1/10/2018 15:53:29 1/10/2018 16:21:35 0.551213 21 Pune 36-50 6 All the images represents the search better... 1
5 The two keywords are completely identical 20 1/10/2018 16:41:16 1/10/2018 17:06:26 0.899786 95 Patna 26-35 7 they both describe the same kind of people 3
6 Your keyword 13 1/10/2018 15:47:20 1/10/2018 16:01:19 0.873825 37 Ulhasnagar 19-25 6 We can see a relaxed state in that images 5
7 Search engine query 30 1/10/2018 15:40:18 1/10/2018 15:57:06 0.264847 78 Kolkata 19-25 5 YES 3
def weigthed_majority(df,weights,values): #df is a dataframe containing a single question
    #sum the trust per category, then take the category with the largest total
    best_value = df.groupby(values)[weights].sum().idxmax()
    return best_value
data.groupby('_unit_id').apply(lambda x: weigthed_majority(x,'_trust','better_0'))
_unit_id
1     The two keywords are completely identical
2                                  Your keyword
3     The two keywords are completely identical
4     The two keywords are completely identical
7                           Search engine query
9                                  Your keyword
10                          Search engine query
11    The two keywords are completely identical
13                                 Your keyword
14                          Search engine query
16                          Search engine query
17    The two keywords are completely identical
18                          Search engine query
20    The two keywords are completely identical
21                          Search engine query
23    The two keywords are completely identical
25    The two keywords are completely identical
26                          Search engine query
27    The two keywords are completely identical
28                          Search engine query
30                                 Your keyword
31                          Search engine query
dtype: object
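
The mechanics can be seen on a toy unit: trust is summed per category, and idxmax returns the category with the largest total (made-up numbers):

```python
import pandas as pd

# made-up judgments for one unit
df = pd.DataFrame({'better_0': ['A', 'B', 'B'],
                   '_trust':   [0.9, 0.3, 0.4]})
totals = df.groupby('better_0')['_trust'].sum()  # A: 0.9, B: 0.7
winner = totals.idxmax()                         # 'A' wins despite having fewer votes
```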

Creating a summary table

results = pd.DataFrame()
results['better'] = data.groupby('_unit_id').apply(lambda x: weigthed_majority(x,'_trust','better_0'))
results['similarity'] = data.groupby('_unit_id').apply(lambda x: weigthed_mean(x,'_trust','similarity_0'))
results['better_code'] = results['better'].astype('category').cat.codes
results
better similarity better_code
_unit_id
1 The two keywords are completely identical 5.532764 1
2 Your keyword 4.961362 2
3 The two keywords are completely identical 6.806888 1
4 The two keywords are completely identical 5.789938 1
7 Search engine query 5.675547 0
9 Your keyword 6.000000 2
10 Search engine query 5.739468 0
11 The two keywords are completely identical 7.000000 1
13 Your keyword 6.437989 2
14 Search engine query 4.000000 0
16 Search engine query 4.175357 0
17 The two keywords are completely identical 5.840521 1
18 Search engine query 6.000000 0
20 The two keywords are completely identical 7.000000 1
21 Search engine query 7.000000 0
23 The two keywords are completely identical 6.465985 1
25 The two keywords are completely identical 4.525120 1
26 Search engine query 4.556271 0
27 The two keywords are completely identical 4.706914 1
28 Search engine query 5.558800 0
30 Your keyword 4.415340 2
31 Search engine query 4.000000 0

Free text

Now we analyse the case in which we have free text

data['better_0'].unique()
array(['Your keyword', 'Search engine query',
       'The two keywords are completely identical'], dtype=object)
data['explanation_0'].unique()
array(['they are all dressed well and using computers so its more like a  business scenario.',
       'All the images represents the search better...',
       'they both describe the same kind of people',
       'We can see a relaxed state in that images', 'YES',
       'THEY ARE THINKING',
       'A person is generalized and one cannot find the images of Einstein or kids in them.',
       'they are calm', 'genious', 'only 1 image',
       'interested in their work',
       'i think this is correct that calm person because every one is calm in this images',
       'images looks like taking a deep breath',
       'it now seems more like to give these results whn we think of interested person rather than thinking and surprising',
       'whipping', 'Yes', 'calm person and calmness same',
       'result suits more to this kerword',
       'working person uses the  things that i mentioned', 'anger',
       'hot air baloon', 'both are the same', 'both are similar',
       'the results are same',
       'all my words are feature of Search engine query',
       'both refer to the same traits but intelligent word is more suited',
       'i know', 'Because all people here look casual.', 'both are same',
       'Casualness is used in both the words',
       'Casual person is more accurate of the images.',
       'i believe this is my personal theory..so i think aggressive person would be better keyword for these images',
       'They are similar',
       'My keyword "happy people" and Search engine query "calm person" is almost same.',
       'My answer is more specific regarding images.',
       'both were similar', 'BOTH ARE SIMILAR',
       'by query image i understood that person seems very angry',
       'Everything is related with warm',
       'it gives better ideas about all the image',
       'with the facial expression we can find him too aggresive',
       'very much about that',
       'Smart Person Bring Innovation and must have high IQ',
       'person in aggression is shouting at others',
       'they are all were casual dress', 'By nature',
       'we got the same image when search in google',
       'people are working i guess working people is more apt',
       'They also look happy', 'similar', 'everybody is yelling',
       'a complete act of expression works out here'], dtype=object)

We can't use the weighted majority voting here! We first need to assign a score to these values.

Exercise

  • Create a function that assigns a score to each value of the column 'explanation_0' (for example the text length, len(text), or whether it contains certain words from a list); see https://pandas.pydata.org/pandas-docs/stable/text.html for reference
  • Create a column with this score
  • Generate a weighted mean for it
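
One possible scoring sketch, assuming text length as the score (hypothetical mini-frame; column names follow the notebook):

```python
import pandas as pd

# hypothetical toy data with the notebook's column names
data = pd.DataFrame({
    '_unit_id':      [1, 1, 2],
    '_trust':        [0.5, 0.9, 0.7],
    'explanation_0': ['they are calm', 'both are similar', 'yes'],
})
# score = text length (one simple choice among many)
data['explanation_score'] = data['explanation_0'].str.len()
# trust-weighted mean of the score per unit
num = (data['_trust'] * data['explanation_score']).groupby(data['_unit_id']).sum()
den = data['_trust'].groupby(data['_unit_id']).sum()
weighted = num / den
```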