import pandas as pd
import numpy as np
%matplotlib inline
data = pd.read_csv("https://docs.google.com/uc?export=download&id=1mr-KGEeKq-QS7xKtNKakWlDrjdLukv47",encoding='utf8')
data.dropna(inplace=True)
data.head()
better_0 _unit_id _started_at _created_at _trust _worker_id _city age similarity_0 explanation_0 asi1
0 Your keyword 4 1/10/2018 15:55:41 1/10/2018 16:20:41 0.385118 32 Ernakulam 36-50 6 they are all dressed well and using computers ... 4
1 The two keywords are completely identical 6 1/10/2018 17:04:22 1/10/2018 17:23:42 0.033270 13 Kolkata 36-50 6 Almost identicalexcept the tiny spelling diffe... 3
2 Search engine query 18 1/10/2018 15:53:29 1/10/2018 16:21:35 0.551213 21 Pune 36-50 6 All the images represents the search better... 1
4 The two keywords are completely identical 6 1/11/2018 05:14:03 1/11/2018 05:21:25 0.708808 70 Mangalagiri 19-25 7 both are similar 5
5 The two keywords are completely identical 20 1/10/2018 16:41:16 1/10/2018 17:06:26 0.899786 95 Patna 26-35 7 they both describe the same kind of people 3

Time spent on a question (can be useful for worker ability)

pd.to_datetime(data['_created_at'])- pd.to_datetime(data['_started_at']) #use pd.to_numeric() to convert to number of ns
0    00:25:00
1    00:19:20
2    00:28:06
4    00:07:22
5    00:25:10
6    00:13:59
7    00:16:48
8    00:16:12
9    00:22:55
10   00:13:19
11   00:10:26
13   00:17:22
14   00:17:55
15   00:24:56
16   00:27:48
17   00:25:27
18   00:28:41
19   00:22:18
20   00:08:38
21   00:06:07
22   00:28:04
23   00:27:28
25   00:23:54
26   00:20:47
27   00:18:40
29   00:09:51
30   00:17:40
33   00:09:06
34   00:24:43
35   00:27:52
       ...   
57   00:26:12
59   00:13:00
61   00:21:06
62   00:24:37
63   00:25:15
65   00:15:52
66   00:12:22
68   00:14:17
70   00:24:54
71   00:29:49
72   00:23:56
74   00:24:29
75   00:28:43
76   00:16:54
77   00:20:06
78   00:24:14
79   00:20:13
82   00:24:47
83   00:18:22
85   00:18:31
87   00:21:30
88   00:20:33
89   00:23:10
93   00:26:09
94   00:20:54
95   00:19:57
96   00:29:33
97   00:23:53
98   00:21:24
99   00:27:11
Length: 73, dtype: timedelta64[ns]
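As the comment above notes, pd.to_numeric() turns a timedelta column into nanoseconds; a minimal sketch of the conversion to seconds, using hypothetical values:

```python
import pandas as pd

# Hypothetical timedeltas standing in for _created_at - _started_at.
td = pd.Series(pd.to_timedelta(['00:25:00', '00:07:22']))

# pd.to_numeric() yields nanoseconds; divide by 1e9 to get seconds.
seconds = pd.to_numeric(td) / 1e9
print(seconds.tolist())  # [1500.0, 442.0]
```

Series.dt.total_seconds() is an equivalent, more readable alternative.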
len(data)
73
data.describe()
_unit_id _trust _worker_id similarity_0 asi1
count 73.000000 73.000000 73.000000 73.000000 73.000000
mean 15.315068 0.556718 45.726027 5.630137 3.643836
std 8.925427 0.313751 29.161552 1.274830 1.110178
min 0.000000 0.033270 0.000000 2.000000 0.000000
25% 7.000000 0.300126 20.000000 5.000000 3.000000
50% 15.000000 0.588048 48.000000 6.000000 4.000000
75% 24.000000 0.853188 70.000000 7.000000 4.000000
max 31.000000 0.988551 98.000000 7.000000 5.000000

Let's see how many judgments we have per unit

data.groupby('_unit_id').size()
_unit_id
0     1
1     2
2     3
3     2
4     4
5     1
6     2
7     5
8     1
9     1
10    4
11    1
13    3
14    3
15    6
16    2
17    3
18    2
19    1
20    1
21    2
23    4
24    5
25    3
26    2
27    3
28    2
30    3
31    1
dtype: int64
data.groupby('_unit_id').size().values
array([1, 2, 3, 2, 4, 1, 2, 5, 1, 1, 4, 1, 3, 3, 6, 2, 3, 2, 1, 1, 2, 4,
       5, 3, 2, 3, 2, 3, 1])
data.groupby('_unit_id').size().hist()
<matplotlib.axes._subplots.AxesSubplot at 0x7f999a0de860>

Let's remove the units that have only one judgment

(data.groupby('_unit_id').size()==1).values
array([ True, False, False, False, False,  True, False, False,  True,
        True, False,  True, False, False, False, False, False, False,
        True,  True, False, False, False, False, False, False, False,
       False,  True])
a = np.where((data.groupby('_unit_id').size()==1))
a
(array([ 0,  5,  8,  9, 11, 18, 19, 28]),)
a = list(a[0])
a
[0, 5, 8, 9, 11, 18, 19, 28]
data[data['_unit_id'].isin(a)]
better_0 _unit_id _started_at _created_at _trust _worker_id _city age similarity_0 explanation_0 asi1
2 Search engine query 18 1/10/2018 15:53:29 1/10/2018 16:21:35 0.551213 21 Pune 36-50 6 All the images represents the search better... 1
8 Your keyword 9 1/11/2018 04:55:31 1/11/2018 05:11:43 0.512431 29 Mangalagiri 19-25 6 THEY ARE THINKING 3
25 Search engine query 28 1/10/2018 15:44:18 1/10/2018 16:08:12 0.935156 83 Pune 19-25 5 working person uses the things that i mentioned 4
34 Search engine query 5 1/11/2018 04:13:24 1/11/2018 04:38:07 0.690587 14 Mangalagiri 19-25 7 both are different 3
49 Search engine query 0 1/10/2018 15:57:40 1/10/2018 16:24:15 0.230164 89 Aligarh 26-35 6 beacuse it is a hot air balloon 3
61 The two keywords are completely identical 18 1/10/2018 15:42:35 1/10/2018 16:03:41 0.547720 18 Hyderabad 36-50 6 They are similar 4
66 The two keywords are completely identical 28 1/11/2018 04:26:36 1/11/2018 04:38:58 0.362590 68 Mangalagiri 26-35 7 both were similar 5
76 Search engine query 19 1/10/2018 16:31:02 1/10/2018 16:47:56 0.967048 59 Pune 19-25 6 Their some people shouting at each other 4
95 The two keywords are completely identical 11 1/11/2018 05:45:54 1/11/2018 06:05:51 0.988551 23 Mangalagiri 0-18 7 similar 5
96 Search engine query 8 1/11/2018 02:23:26 1/11/2018 02:52:59 0.520720 0 Hyderabad 26-35 5 Since not all the images belong to science exa... 4
data = data[~data['_unit_id'].isin(a)]
len(data)
63
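The same filtering can be done without np.where, by indexing the group sizes directly; a sketch on a hypothetical mini-dataframe:

```python
import pandas as pd

# Toy stand-in for `data`: unit 0 has a single judgment.
df = pd.DataFrame({'_unit_id': [0, 1, 1, 2, 2, 2],
                   'similarity_0': [5, 6, 7, 4, 5, 6]})

sizes = df.groupby('_unit_id').size()
singletons = sizes[sizes == 1].index          # unit ids with exactly one judgment
df = df[~df['_unit_id'].isin(singletons)]     # drop them
print(len(df))  # 5
```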
Exercise

  1. Create a column with the time spent (use pd.to_datetime)
  2. Compute the average time per worker
data['time_spent'] = pd.to_datetime(data['_created_at']) - pd.to_datetime(data['_started_at'])
data.head()
better_0 _unit_id _started_at _created_at _trust _worker_id _city age similarity_0 explanation_0 asi1 time_spent
0 Your keyword 4 1/10/2018 15:55:41 1/10/2018 16:20:41 0.385118 32 Ernakulam 36-50 6 they are all dressed well and using computers ... 4 00:25:00
1 The two keywords are completely identical 6 1/10/2018 17:04:22 1/10/2018 17:23:42 0.033270 13 Kolkata 36-50 6 Almost identicalexcept the tiny spelling diffe... 3 00:19:20
4 The two keywords are completely identical 6 1/11/2018 05:14:03 1/11/2018 05:21:25 0.708808 70 Mangalagiri 19-25 7 both are similar 5 00:07:22
5 The two keywords are completely identical 20 1/10/2018 16:41:16 1/10/2018 17:06:26 0.899786 95 Patna 26-35 7 they both describe the same kind of people 3 00:25:10
6 Your keyword 13 1/10/2018 15:47:20 1/10/2018 16:01:19 0.873825 37 Ulhasnagar 19-25 6 We can see a relaxed state in that images 5 00:13:59

Basic aggregation

Quantitative variables

data.groupby('_unit_id')['similarity_0'].mean()
_unit_id
1     5.500000
2     4.666667
3     6.000000
4     5.750000
6     6.500000
7     5.400000
10    6.000000
13    6.333333
14    4.000000
15    5.166667
16    5.000000
17    5.666667
20    7.000000
21    7.000000
23    6.500000
24    6.200000
25    4.333333
26    4.500000
27    5.666667
30    5.333333
31    4.000000
Name: similarity_0, dtype: float64

If we are also doing a per-worker analysis, we can compute per-worker values, e.g. each worker's average trust

data.groupby('_worker_id')['_trust'].mean().values
array([0.45086904, 0.92770687, 0.62552536, 0.93464997, 0.04204837,
       0.77016858, 0.04609719, 0.60929146, 0.5741613 , 0.81400838,
       0.03326987, 0.91541507, 0.95153081, 0.18914215, 0.30560117,
       0.80167709, 0.97441615, 0.92891881, 0.96747946, 0.09118499,
       0.62159951, 0.58563959, 0.58804797, 0.38511825, 0.12424342,
       0.79730475, 0.88870635, 0.87382468, 0.67732971, 0.85318828,
       0.34576951, 0.28074398, 0.80507253, 0.05786407, 0.09042158,
       0.42789365, 0.05809224, 0.32548398, 0.30012607, 0.03610733,
       0.85113121, 0.37399525, 0.79652694, 0.62465149, 0.3546574 ,
       0.91410825, 0.70880836, 0.28350176, 0.91083596, 0.33243423,
       0.03891988, 0.52527424, 0.26484709, 0.21250903, 0.63448413,
       0.03589091, 0.91445626, 0.2602371 , 0.9408621 , 0.9150934 ,
       0.85169676, 0.89978581, 0.61376345])
data.groupby('_worker_id')['_trust'].mean().hist()
<matplotlib.axes._subplots.AxesSubplot at 0x7f999a0db240>

Categorical variables

We can't do the following, because better_0 is a categorical variable:

data.groupby('_unit_id')['better_0'].mean()
---------------------------------------------------------------------------
DataError                                 Traceback (most recent call last)
<ipython-input-20-2f584eae24bb> in <module>()
----> 1 data.groupby('_unit_id')['better_0'].mean()

/srv/paws/lib/python3.6/site-packages/pandas/core/groupby.py in mean(self, *args, **kwargs)
   1126         nv.validate_groupby_func('mean', args, kwargs, ['numeric_only'])
   1127         try:
-> 1128             return self._cython_agg_general('mean', **kwargs)
   1129         except GroupByError:
   1130             raise

/srv/paws/lib/python3.6/site-packages/pandas/core/groupby.py in _cython_agg_general(self, how, alt, numeric_only, min_count)
    925 
    926         if len(output) == 0:
--> 927             raise DataError('No numeric types to aggregate')
    928 
    929         return self._wrap_aggregated_output(output, names)

DataError: No numeric types to aggregate

Let's explore what this column contains and decide what to do

data.groupby('_unit_id')['better_0'].describe() 
count unique top freq
_unit_id
1 2 2 Your keyword 1
2 3 2 Your keyword 2
3 2 2 Your keyword 1
4 4 3 The two keywords are completely identical 2
6 2 1 The two keywords are completely identical 2
7 5 2 Your keyword 3
10 4 2 Search engine query 2
13 3 2 Your keyword 2
14 3 2 Search engine query 2
15 6 2 Your keyword 4
16 2 1 Search engine query 2
17 3 3 Search engine query 1
20 1 1 The two keywords are completely identical 1
21 2 2 Search engine query 1
23 4 1 The two keywords are completely identical 4
24 5 2 Search engine query 3
25 3 3 Search engine query 1
26 2 1 Search engine query 2
27 3 2 The two keywords are completely identical 2
30 3 2 Your keyword 2
31 1 1 Search engine query 1
print(data['better_0'].unique())
len(data['better_0'].unique())
['Your keyword' 'The two keywords are completely identical'
 'Search engine query']
3

The majority vote of an array is simply the mode

data['better_0'].mode()
0    Search engine query
1           Your keyword
dtype: object

How is the variable distributed?

data.groupby('better_0')['better_0'].size()
better_0
Search engine query                          22
The two keywords are completely identical    19
Your keyword                                 22
Name: better_0, dtype: int64
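The distribution above shows why mode() returned two values: 'Search engine query' and 'Your keyword' are tied at 22. A minimal illustration of the same behaviour on toy data:

```python
import pandas as pd

# 'a' and 'b' both appear twice, so mode() returns both (sorted).
s = pd.Series(['a', 'b', 'a', 'b', 'c'])
print(s.mode().tolist())  # ['a', 'b']
```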

Let's compute the majority vote

data.groupby('_unit_id')['better_0'].apply(lambda x: x.mode())
_unit_id   
1         0    The two keywords are completely identical
          1                                 Your keyword
2         0                                 Your keyword
3         0    The two keywords are completely identical
          1                                 Your keyword
4         0    The two keywords are completely identical
6         0    The two keywords are completely identical
7         0                                 Your keyword
10        0                          Search engine query
          1    The two keywords are completely identical
13        0                                 Your keyword
14        0                          Search engine query
15        0                                 Your keyword
16        0                          Search engine query
17        0                          Search engine query
          1    The two keywords are completely identical
          2                                 Your keyword
20        0    The two keywords are completely identical
21        0                          Search engine query
          1    The two keywords are completely identical
23        0    The two keywords are completely identical
24        0                          Search engine query
25        0                          Search engine query
          1    The two keywords are completely identical
          2                                 Your keyword
26        0                          Search engine query
27        0    The two keywords are completely identical
30        0                                 Your keyword
31        0                          Search engine query
Name: better_0, dtype: object

Sometimes this returns more than one value (a tie); in that case let's take the first (a better approach would be to pick one at random)

data.groupby('_unit_id')['better_0'].apply(lambda x: x.mode()[0])
_unit_id
1     The two keywords are completely identical
2                                  Your keyword
3     The two keywords are completely identical
4     The two keywords are completely identical
6     The two keywords are completely identical
7                                  Your keyword
10                          Search engine query
13                                 Your keyword
14                          Search engine query
15                                 Your keyword
16                          Search engine query
17                          Search engine query
20    The two keywords are completely identical
21                          Search engine query
23    The two keywords are completely identical
24                          Search engine query
25                          Search engine query
26                          Search engine query
27    The two keywords are completely identical
30                                 Your keyword
31                          Search engine query
Name: better_0, dtype: object
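The random tie-break mentioned above can be sketched like this (random_mode is a hypothetical helper, not part of pandas):

```python
import numpy as np
import pandas as pd

def random_mode(x):
    # mode() returns every value tied for the highest count;
    # pick one of them uniformly at random.
    modes = x.mode()
    return np.random.choice(modes)

s = pd.Series(['a', 'b', 'a', 'b'])
print(random_mode(s))  # 'a' or 'b', chosen at random
```

It plugs into the same groupby: data.groupby('_unit_id')['better_0'].apply(random_mode).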

Weighted measures

Weighted mean

def weigthed_mean(df, weights, values):  # df is a dataframe containing a single question
    sum_values = (df[weights] * df[values]).sum()  # trust-weighted sum of the answers
    total_weight = df[weights].sum()
    return sum_values / total_weight
data.groupby('_unit_id').apply(lambda x: weigthed_mean(x,'_trust','similarity_0'))
_unit_id
1     5.532764
2     4.961362
3     6.806888
4     5.789938
6     6.955167
7     5.675547
10    5.739468
13    6.437989
14    4.000000
15    5.166934
16    4.175357
17    5.840521
20    7.000000
21    7.000000
23    6.465985
24    6.481138
25    4.525120
26    4.556271
27    4.706914
30    4.415340
31    4.000000
dtype: float64
data.groupby('_unit_id').apply(lambda x: (x['_trust']*x['similarity_0']).sum()/(x['_trust'].sum()))
_unit_id
1     5.532764
2     4.961362
3     6.806888
4     5.789938
6     6.955167
7     5.675547
10    5.739468
13    6.437989
14    4.000000
15    5.166934
16    4.175357
17    5.840521
20    7.000000
21    7.000000
23    6.465985
24    6.481138
25    4.525120
26    4.556271
27    4.706914
30    4.415340
31    4.000000
dtype: float64

Weighted majority voting

Now, for each unit, we need to find the category with the highest total trust

data.head()
better_0 _unit_id _started_at _created_at _trust _worker_id _city age similarity_0 explanation_0 asi1 time_spent
0 Your keyword 4 1/10/2018 15:55:41 1/10/2018 16:20:41 0.385118 32 Ernakulam 36-50 6 they are all dressed well and using computers ... 4 00:25:00
1 The two keywords are completely identical 6 1/10/2018 17:04:22 1/10/2018 17:23:42 0.033270 13 Kolkata 36-50 6 Almost identicalexcept the tiny spelling diffe... 3 00:19:20
4 The two keywords are completely identical 6 1/11/2018 05:14:03 1/11/2018 05:21:25 0.708808 70 Mangalagiri 19-25 7 both are similar 5 00:07:22
5 The two keywords are completely identical 20 1/10/2018 16:41:16 1/10/2018 17:06:26 0.899786 95 Patna 26-35 7 they both describe the same kind of people 3 00:25:10
6 Your keyword 13 1/10/2018 15:47:20 1/10/2018 16:01:19 0.873825 37 Ulhasnagar 19-25 6 We can see a relaxed state in that images 5 00:13:59
def weigthed_majority(df, weights, values):  # df is a dataframe containing a single question
    # sum the trust weights per category and pick the category with the highest total
    best_value = df.groupby(values)[weights].sum().idxmax()
    return best_value
data.groupby('_unit_id').apply(lambda x: weigthed_majority(x,'_trust','better_0'))
_unit_id
1     The two keywords are completely identical
2                                  Your keyword
3     The two keywords are completely identical
4     The two keywords are completely identical
6     The two keywords are completely identical
7                           Search engine query
10                          Search engine query
13                                 Your keyword
14                          Search engine query
15                                 Your keyword
16                          Search engine query
17    The two keywords are completely identical
20    The two keywords are completely identical
21                          Search engine query
23    The two keywords are completely identical
24                          Search engine query
25    The two keywords are completely identical
26                          Search engine query
27    The two keywords are completely identical
30                                 Your keyword
31                          Search engine query
dtype: object
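On toy data the weighted vote is just a per-category sum of the weights followed by idxmax (hypothetical column names):

```python
import pandas as pd

df = pd.DataFrame({'answer': ['A', 'A', 'B'],
                   'trust':  [0.2, 0.3, 0.9]})

# Two low-trust votes for 'A' (0.5 total) lose to one high-trust vote for 'B'.
winner = df.groupby('answer')['trust'].sum().idxmax()
print(winner)  # B
```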

Creating a summary table

results = pd.DataFrame()
results['better'] = data.groupby('_unit_id').apply(lambda x: weigthed_majority(x,'_trust','better_0'))
results['similarity'] = data.groupby('_unit_id').apply(lambda x: weigthed_mean(x,'_trust','similarity_0'))
results['better_code'] = results['better'].astype('category').cat.codes
results
better similarity better_code
_unit_id
1 The two keywords are completely identical 5.532764 1
2 Your keyword 4.961362 2
3 The two keywords are completely identical 6.806888 1
4 The two keywords are completely identical 5.789938 1
6 The two keywords are completely identical 6.955167 1
7 Search engine query 5.675547 0
10 Search engine query 5.739468 0
13 Your keyword 6.437989 2
14 Search engine query 4.000000 0
15 Your keyword 5.166934 2
16 Search engine query 4.175357 0
17 The two keywords are completely identical 5.840521 1
20 The two keywords are completely identical 7.000000 1
21 Search engine query 7.000000 0
23 The two keywords are completely identical 6.465985 1
24 Search engine query 6.481138 0
25 The two keywords are completely identical 4.525120 1
26 Search engine query 4.556271 0
27 The two keywords are completely identical 4.706914 1
30 Your keyword 4.415340 2
31 Search engine query 4.000000 0
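The better_code column comes from .cat.codes, which numbers the categories in sorted order; a sketch using the three labels from this dataset:

```python
import pandas as pd

s = pd.Series(['Search engine query',
               'Your keyword',
               'The two keywords are completely identical']).astype('category')

# Categories are sorted alphabetically, so:
# 'Search engine query' -> 0, 'The two keywords...' -> 1, 'Your keyword' -> 2
print(s.cat.codes.tolist())  # [0, 2, 1]
```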

Free text

Now we analyse the case in which the answers are free text

data['better_0'].unique()
array(['Your keyword', 'The two keywords are completely identical',
       'Search engine query'], dtype=object)
data['explanation_0'].unique()
array(['they are all dressed well and using computers so its more like a  business scenario.',
       'Almost identicalexcept the tiny spelling difference.',
       'both are similar', 'they both describe the same kind of people',
       'We can see a relaxed state in that images', 'YES',
       'A person is generalized and one cannot find the images of Einstein or kids in them.',
       'they are calm', 'genious', 'only 1 image',
       'interested in their work',
       'i think this is correct that calm person because every one is calm in this images',
       'images looks like taking a deep breath',
       'it now seems more like to give these results whn we think of interested person rather than thinking and surprising',
       'based on result of image', 'whipping', 'Yes',
       'calm person and calmness same',
       'result suits more to this kerword', 'yes', 'anger',
       'hot air baloon', 'both are the same', 'same attitude of boss',
       'the results are same',
       'all my words are feature of Search engine query',
       'They all are working in the office',
       'in image person looking very casual',
       'both refer to the same traits but intelligent word is more suited',
       'i know', 'Because all people here look casual.', 'both are same',
       'Casualness is used in both the words',
       'interested person only can do Research, smart, thinging',
       'Casual person is more accurate of the images.',
       'i believe this is my personal theory..so i think aggressive person would be better keyword for these images',
       'My keyword "happy people" and Search engine query "calm person" is almost same.',
       'My answer is more specific regarding images.',
       'i know need search engine when i already knew it',
       'BOTH ARE SIMILAR',
       'by query image i understood that person seems very angry',
       'Everything is related with warm',
       'it gives better ideas about all the image',
       'we got the same image when search in google',
       'with the facial expression we can find him too aggresive',
       'very much about that', "It's the image of that",
       'Smart Person Bring Innovation and must have high IQ',
       'person in aggression is shouting at others',
       'they are all were casual dress', 'By nature',
       'because it shows that',
       'people are working i guess working people is more apt',
       'They also look happy',
       'On detailed viewing smart person might be a better keyword.',
       'everybody is yelling',
       'a complete act of expression works out here'], dtype=object)

We can't use weighted majority voting here! We first need to assign a score to these values.

Exercise

  • Create a function that assigns a score to each value of the column 'explanation_0' (for example the text length, len(text)); see https://pandas.pydata.org/pandas-docs/stable/text.html for reference
  • create a column with this score
  • generate a weighted mean for it (using '_trust')
def compute_score(text):
    score = len(text)
    return score
data['score'] = data['explanation_0'].apply(compute_score)
data.groupby('_unit_id').apply(lambda x: weigthed_mean(x,'_trust','score'))
_unit_id
1      23.943016
2      31.753655
3      16.386224
4      35.375638
6      17.614001
7      61.580451
10     17.415827
13     35.607012
14    105.873602
15     26.945191
16     17.682504
17     20.668617
20     42.000000
21     52.155783
23     17.500025
24     28.380821
25     27.679910
26     63.362268
27     23.662647
30      7.575972
31     43.000000
dtype: float64

Exercise

  1. aggregate per _unit_id using the mean of time_spent (you need to apply pd.to_numeric() and divide by 1e9 to get a column in seconds)
  2. write code that assigns 1 if the text contains any element of a list of words (list_words=['similar','same']), by doing a for loop (for i in list_words) and checking with (if i in text)
  3. compute the weighted mean for that
data['time'] = pd.to_numeric(data['time_spent'])/1e9 

data.groupby('_unit_id').apply(lambda x: weigthed_mean(x,'_trust','time'))
data
better_0 _unit_id _started_at _created_at _trust _worker_id _city age similarity_0 explanation_0 asi1 time_spent score time asd
0 Your keyword 4 1/10/2018 15:55:41 1/10/2018 16:20:41 0.385118 32 Ernakulam 36-50 6 they are all dressed well and using computers ... 4 00:25:00 84 1500.0 NaN
1 The two keywords are completely identical 6 1/10/2018 17:04:22 1/10/2018 17:23:42 0.033270 13 Kolkata 36-50 6 Almost identicalexcept the tiny spelling diffe... 3 00:19:20 52 1160.0 1226.583864
4 The two keywords are completely identical 6 1/11/2018 05:14:03 1/11/2018 05:21:25 0.708808 70 Mangalagiri 19-25 7 both are similar 5 00:07:22 16 442.0 1396.977989
5 The two keywords are completely identical 20 1/10/2018 16:41:16 1/10/2018 17:06:26 0.899786 95 Patna 26-35 7 they both describe the same kind of people 3 00:25:10 42 1510.0 NaN
6 Your keyword 13 1/10/2018 15:47:20 1/10/2018 16:01:19 0.873825 37 Ulhasnagar 19-25 6 We can see a relaxed state in that images 5 00:13:59 41 839.0 474.190359
7 Search engine query 30 1/10/2018 15:40:18 1/10/2018 15:57:06 0.264847 78 Kolkata 19-25 5 YES 3 00:16:48 3 1008.0 1522.166724
9 Your keyword 7 1/10/2018 15:39:53 1/10/2018 16:02:48 0.260237 87 Hyderabad 26-35 6 A person is generalized and one cannot find th... 5 00:22:55 83 1375.0 NaN
10 Search engine query 10 1/10/2018 16:25:03 1/10/2018 16:38:22 0.915093 91 New Delhi 19-25 4 they are calm 3 00:13:19 13 799.0 1016.627134
11 Your keyword 25 1/11/2018 03:56:10 1/11/2018 04:06:36 0.212509 79 Roorkee 19-25 4 genious 3 00:10:26 7 626.0 NaN
13 Search engine query 10 1/10/2018 19:57:39 1/10/2018 20:15:01 0.770169 6 Hyderabad 26-35 6 only 1 image 4 00:17:22 12 1042.0 1064.354107
14 The two keywords are completely identical 10 1/11/2018 03:02:17 1/11/2018 03:20:12 0.914456 86 Roorkee 26-35 7 interested in their work 4 00:17:55 24 1075.0 1700.328074
15 Search engine query 26 1/10/2018 16:49:06 1/10/2018 17:14:02 0.283502 71 Cochin 19-25 3 i think this is correct that calm person becau... 3 00:24:56 81 1496.0 1330.235426
16 Search engine query 4 1/10/2018 20:14:31 1/10/2018 20:42:19 0.373995 64 Kolkata 19-25 5 YES 4 00:27:48 3 1668.0 690.252255
17 Your keyword 14 1/10/2018 17:45:54 1/10/2018 18:11:21 0.035891 85 Mumbai 26-35 4 images looks like taking a deep breath 5 00:25:27 38 1527.0 697.133681
18 Search engine query 14 1/11/2018 04:34:08 1/11/2018 05:02:49 0.797305 34 Amritsar 26-35 4 it now seems more like to give these results w... 4 00:28:41 114 1721.0 NaN
19 Your keyword 15 1/10/2018 19:23:35 1/10/2018 19:45:53 0.814008 11 Bhopal 36-50 4 based on result of image 5 00:22:18 24 1338.0 NaN
20 Your keyword 30 1/11/2018 04:06:57 1/11/2018 04:15:35 0.634484 80 Dehradun 19-25 4 whipping 3 00:08:38 8 518.0 1510.000000
21 Search engine query 17 1/10/2018 16:51:40 1/10/2018 16:57:47 0.613763 98 New Delhi 19-25 4 Yes 2 00:06:07 3 367.0 1315.286800
22 The two keywords are completely identical 10 1/10/2018 16:19:16 1/10/2018 16:47:20 0.189142 17 Kolkata 36-50 7 calm person and calmness same 5 00:28:04 29 1684.0 NaN
23 Your keyword 1 1/10/2018 16:49:39 1/10/2018 17:17:07 0.801677 20 Bangalore 26-35 5 result suits more to this kerword 4 00:27:28 33 1648.0 1025.093862
26 Search engine query 15 1/10/2018 16:23:16 1/10/2018 16:44:03 0.851697 94 Kolkata 19-25 6 yes 3 00:20:47 3 1247.0 1626.726723
27 The two keywords are completely identical 27 1/10/2018 17:12:07 1/10/2018 17:30:47 0.796527 65 Meerut 19-25 4 anger 2 00:18:40 5 1120.0 1244.551237
29 Search engine query 16 1/10/2018 16:11:25 1/10/2018 16:21:16 0.940862 90 Dehradun 36-50 4 hot air baloon 3 00:09:51 14 591.0 NaN
30 The two keywords are completely identical 4 1/10/2018 17:14:11 1/10/2018 17:31:51 0.585640 30 Chennai 19-25 7 both are the same 0 00:17:40 17 1060.0 681.941651
33 The two keywords are completely identical 17 1/11/2018 05:50:49 1/11/2018 05:59:55 0.915415 15 Mangalagiri 19-25 7 both are similar 3 00:09:06 16 546.0 NaN
35 Your keyword 15 1/10/2018 17:28:04 1/10/2018 17:55:56 0.300126 60 Bhubaneswar 19-25 6 same attitude of boss 4 00:27:52 21 1672.0 NaN
37 The two keywords are completely identical 23 1/11/2018 05:40:36 1/11/2018 06:00:46 0.450869 1 Durgapur 36-50 5 the results are same 4 00:20:10 20 1210.0 NaN
38 Search engine query 26 1/10/2018 19:19:35 1/10/2018 19:48:43 0.305601 19 Bardhaman 19-25 6 all my words are feature of Search engine query 3 00:29:08 47 1748.0 NaN
40 Your keyword 15 1/10/2018 17:03:40 1/10/2018 17:21:24 0.851131 63 New Delhi 26-35 6 They all are working in the office 5 00:17:44 34 1064.0 NaN
41 Search engine query 24 1/10/2018 17:45:51 1/10/2018 18:09:20 0.345770 48 Bhopal 26-35 4 in image person looking very casual 4 00:23:29 35 1409.0 NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
50 The two keywords are completely identical 13 1/11/2018 10:20:36 1/11/2018 10:40:45 0.934650 4 Kolkata 26-35 7 Because all people here look casual. 4 00:20:09 36 1209.0 NaN
51 The two keywords are completely identical 23 1/10/2018 16:00:44 1/10/2018 16:22:39 0.574161 10 Rajahmundry 19-25 7 both are same 4 00:21:55 13 1315.0 NaN
52 Search engine query 14 1/10/2018 15:40:36 1/10/2018 16:05:59 0.057864 51 Mumbai 36-50 4 Casualness is used in both the words 4 00:25:23 36 1523.0 NaN
53 Your keyword 15 1/10/2018 16:30:44 1/10/2018 16:57:56 0.609291 8 Chennai 36-50 4 interested person only can do Research, smart,... 3 00:27:12 55 1632.0 NaN
54 Search engine query 25 1/10/2018 15:58:13 1/10/2018 16:27:48 0.910836 72 Indore 26-35 2 Casual person is more accurate of the images. 3 00:29:35 45 1775.0 NaN
57 Search engine query 7 1/11/2018 02:58:59 1/11/2018 03:25:11 0.927707 2 Bhubaneswar 26-35 6 i believe this is my personal theory..so i thi... 5 00:26:12 107 1572.0 NaN
59 The two keywords are completely identical 25 1/11/2018 05:16:40 1/11/2018 05:29:40 0.974416 22 Mangalagiri 19-25 7 both are similar 4 00:13:00 16 780.0 NaN
62 The two keywords are completely identical 23 1/10/2018 16:30:20 1/10/2018 16:54:57 0.038920 76 Patna 26-35 7 My keyword "happy people" and Search engine qu... 5 00:24:37 79 1477.0 NaN
63 Your keyword 2 1/10/2018 15:41:12 1/10/2018 16:06:27 0.967479 26 Ranchi 26-35 4 My answer is more specific regarding images. 4 00:25:15 44 1515.0 NaN
65 Your keyword 24 1/10/2018 18:14:38 1/10/2018 18:30:30 0.124243 33 Thanjavur 19-25 6 i know need search engine when i already knew it 4 00:15:52 48 952.0 NaN
68 The two keywords are completely identical 1 1/11/2018 05:19:08 1/11/2018 05:33:25 0.914108 69 Mangalagiri 19-25 6 BOTH ARE SIMILAR 3 00:14:17 16 857.0 NaN
70 Your keyword 17 1/11/2018 04:26:12 1/11/2018 04:51:06 0.427894 53 Mangalagiri 19-25 6 by query image i understood that person seems ... 3 00:24:54 56 1494.0 NaN
71 Your keyword 7 1/10/2018 23:35:07 1/11/2018 00:04:56 0.621600 28 Kolkata 50-80 7 Everything is related with warm 3 00:29:49 31 1789.0 NaN
72 Search engine query 7 1/10/2018 16:06:43 1/10/2018 16:30:39 0.677330 38 Hyderabad 26-35 5 it gives better ideas about all the image 4 00:23:56 41 1436.0 NaN
74 Search engine query 24 1/11/2018 04:06:47 1/11/2018 04:31:16 0.354657 67 Mangalagiri 19-25 7 we got the same image when search in google 5 00:24:29 43 1469.0 NaN
75 Search engine query 16 1/10/2018 19:32:49 1/10/2018 20:01:32 0.090422 52 Hyderabad 26-35 6 with the facial expression we can find him too... 4 00:28:43 56 1723.0 NaN
77 Your keyword 2 1/10/2018 21:04:55 1/10/2018 21:25:01 0.928919 25 Howrah 19-25 6 very much about that 3 00:20:06 20 1206.0 NaN
78 Search engine query 24 1/10/2018 16:36:03 1/10/2018 17:00:17 0.525274 77 Kolkata 26-35 7 It's the image of that 3 00:24:14 22 1454.0 NaN
79 The two keywords are completely identical 21 1/11/2018 03:08:06 1/11/2018 03:28:19 0.588048 31 Unnao 26-35 7 Smart Person Bring Innovation and must have hi... 0 00:20:13 51 1213.0 NaN
82 The two keywords are completely identical 4 1/10/2018 15:39:51 1/10/2018 16:04:38 0.625525 3 Hyderabad 36-50 5 person in aggression is shouting at others 5 00:24:47 42 1487.0 NaN
83 Your keyword 30 1/11/2018 05:56:56 1/11/2018 06:15:18 0.042048 5 Burdwan 26-35 7 they are all were casual dress 5 00:18:22 30 1102.0 NaN
85 Your keyword 7 1/10/2018 17:08:03 1/10/2018 17:26:34 0.280744 49 Noida 0-18 3 By nature 5 00:18:31 9 1111.0 NaN
87 Your keyword 24 1/10/2018 19:54:03 1/10/2018 20:15:33 0.888706 35 Delhi 26-35 7 because it shows that 3 00:21:30 21 1290.0 NaN
88 The two keywords are completely identical 27 1/11/2018 04:05:34 1/11/2018 04:26:07 0.058092 55 Mangalagiri 19-25 7 we got the same image when search in google 3 00:20:33 43 1233.0 NaN
89 Search engine query 21 1/10/2018 15:50:11 1/10/2018 16:13:21 0.805073 50 Hyderabad 26-35 7 people are working i guess working people is m... 4 00:23:10 53 1390.0 NaN
93 The two keywords are completely identical 3 1/11/2018 04:13:16 1/11/2018 04:39:25 0.853188 45 Mangalagiri 19-25 7 BOTH ARE SIMILAR 5 00:26:09 16 1569.0 NaN
94 Your keyword 13 1/10/2018 16:01:31 1/10/2018 16:22:25 0.325484 57 Guwahati 36-50 6 They also look happy 4 00:20:54 20 1254.0 NaN
97 Search engine query 15 1/11/2018 01:56:50 1/11/2018 02:20:43 0.046097 7 Chennai 26-35 5 On detailed viewing smart person might be a be... 4 00:23:53 59 1433.0 NaN
98 Your keyword 3 1/11/2018 06:49:39 1/11/2018 07:11:03 0.091185 27 Kolkata 36-50 5 everybody is yelling 4 00:21:24 20 1284.0 NaN
99 Search engine query 31 1/11/2018 04:19:00 1/11/2018 04:46:11 0.951531 16 Mangalagiri 19-25 4 a complete act of expression works out here 3 00:27:11 43 1631.0 NaN

63 rows × 15 columns

def compute_score(text):
    # Count how many of the keywords 'similar' and 'same' occur in an
    # explanation. Matching is case-sensitive, so 'BOTH ARE SIMILAR' scores 0.
    score = 0
    for s in ['similar', 'same']:
        if s in text:
            score += 1
    return score

data['code'] = data['explanation_0'].apply(compute_score)
data
better_0 _unit_id _started_at _created_at _trust _worker_id _city age similarity_0 explanation_0 asi1 time_spent score time asd code
0 Your keyword 4 1/10/2018 15:55:41 1/10/2018 16:20:41 0.385118 32 Ernakulam 36-50 6 they are all dressed well and using computers ... 4 00:25:00 84 1500.0 NaN 0
1 The two keywords are completely identical 6 1/10/2018 17:04:22 1/10/2018 17:23:42 0.033270 13 Kolkata 36-50 6 Almost identicalexcept the tiny spelling diffe... 3 00:19:20 52 1160.0 1226.583864 0
4 The two keywords are completely identical 6 1/11/2018 05:14:03 1/11/2018 05:21:25 0.708808 70 Mangalagiri 19-25 7 both are similar 5 00:07:22 16 442.0 1396.977989 1
5 The two keywords are completely identical 20 1/10/2018 16:41:16 1/10/2018 17:06:26 0.899786 95 Patna 26-35 7 they both describe the same kind of people 3 00:25:10 42 1510.0 NaN 1
6 Your keyword 13 1/10/2018 15:47:20 1/10/2018 16:01:19 0.873825 37 Ulhasnagar 19-25 6 We can see a relaxed state in that images 5 00:13:59 41 839.0 474.190359 0
7 Search engine query 30 1/10/2018 15:40:18 1/10/2018 15:57:06 0.264847 78 Kolkata 19-25 5 YES 3 00:16:48 3 1008.0 1522.166724 0
9 Your keyword 7 1/10/2018 15:39:53 1/10/2018 16:02:48 0.260237 87 Hyderabad 26-35 6 A person is generalized and one cannot find th... 5 00:22:55 83 1375.0 NaN 0
10 Search engine query 10 1/10/2018 16:25:03 1/10/2018 16:38:22 0.915093 91 New Delhi 19-25 4 they are calm 3 00:13:19 13 799.0 1016.627134 0
11 Your keyword 25 1/11/2018 03:56:10 1/11/2018 04:06:36 0.212509 79 Roorkee 19-25 4 genious 3 00:10:26 7 626.0 NaN 0
13 Search engine query 10 1/10/2018 19:57:39 1/10/2018 20:15:01 0.770169 6 Hyderabad 26-35 6 only 1 image 4 00:17:22 12 1042.0 1064.354107 0
14 The two keywords are completely identical 10 1/11/2018 03:02:17 1/11/2018 03:20:12 0.914456 86 Roorkee 26-35 7 interested in their work 4 00:17:55 24 1075.0 1700.328074 0
15 Search engine query 26 1/10/2018 16:49:06 1/10/2018 17:14:02 0.283502 71 Cochin 19-25 3 i think this is correct that calm person becau... 3 00:24:56 81 1496.0 1330.235426 0
16 Search engine query 4 1/10/2018 20:14:31 1/10/2018 20:42:19 0.373995 64 Kolkata 19-25 5 YES 4 00:27:48 3 1668.0 690.252255 0
17 Your keyword 14 1/10/2018 17:45:54 1/10/2018 18:11:21 0.035891 85 Mumbai 26-35 4 images looks like taking a deep breath 5 00:25:27 38 1527.0 697.133681 0
18 Search engine query 14 1/11/2018 04:34:08 1/11/2018 05:02:49 0.797305 34 Amritsar 26-35 4 it now seems more like to give these results w... 4 00:28:41 114 1721.0 NaN 0
19 Your keyword 15 1/10/2018 19:23:35 1/10/2018 19:45:53 0.814008 11 Bhopal 36-50 4 based on result of image 5 00:22:18 24 1338.0 NaN 0
20 Your keyword 30 1/11/2018 04:06:57 1/11/2018 04:15:35 0.634484 80 Dehradun 19-25 4 whipping 3 00:08:38 8 518.0 1510.000000 0
21 Search engine query 17 1/10/2018 16:51:40 1/10/2018 16:57:47 0.613763 98 New Delhi 19-25 4 Yes 2 00:06:07 3 367.0 1315.286800 0
22 The two keywords are completely identical 10 1/10/2018 16:19:16 1/10/2018 16:47:20 0.189142 17 Kolkata 36-50 7 calm person and calmness same 5 00:28:04 29 1684.0 NaN 1
23 Your keyword 1 1/10/2018 16:49:39 1/10/2018 17:17:07 0.801677 20 Bangalore 26-35 5 result suits more to this kerword 4 00:27:28 33 1648.0 1025.093862 0
26 Search engine query 15 1/10/2018 16:23:16 1/10/2018 16:44:03 0.851697 94 Kolkata 19-25 6 yes 3 00:20:47 3 1247.0 1626.726723 0
27 The two keywords are completely identical 27 1/10/2018 17:12:07 1/10/2018 17:30:47 0.796527 65 Meerut 19-25 4 anger 2 00:18:40 5 1120.0 1244.551237 0
29 Search engine query 16 1/10/2018 16:11:25 1/10/2018 16:21:16 0.940862 90 Dehradun 36-50 4 hot air baloon 3 00:09:51 14 591.0 NaN 0
30 The two keywords are completely identical 4 1/10/2018 17:14:11 1/10/2018 17:31:51 0.585640 30 Chennai 19-25 7 both are the same 0 00:17:40 17 1060.0 681.941651 1
33 The two keywords are completely identical 17 1/11/2018 05:50:49 1/11/2018 05:59:55 0.915415 15 Mangalagiri 19-25 7 both are similar 3 00:09:06 16 546.0 NaN 1
35 Your keyword 15 1/10/2018 17:28:04 1/10/2018 17:55:56 0.300126 60 Bhubaneswar 19-25 6 same attitude of boss 4 00:27:52 21 1672.0 NaN 1
37 The two keywords are completely identical 23 1/11/2018 05:40:36 1/11/2018 06:00:46 0.450869 1 Durgapur 36-50 5 the results are same 4 00:20:10 20 1210.0 NaN 1
38 Search engine query 26 1/10/2018 19:19:35 1/10/2018 19:48:43 0.305601 19 Bardhaman 19-25 6 all my words are feature of Search engine query 3 00:29:08 47 1748.0 NaN 0
40 Your keyword 15 1/10/2018 17:03:40 1/10/2018 17:21:24 0.851131 63 New Delhi 26-35 6 They all are working in the office 5 00:17:44 34 1064.0 NaN 0
41 Search engine query 24 1/10/2018 17:45:51 1/10/2018 18:09:20 0.345770 48 Bhopal 26-35 4 in image person looking very casual 4 00:23:29 35 1409.0 NaN 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
50 The two keywords are completely identical 13 1/11/2018 10:20:36 1/11/2018 10:40:45 0.934650 4 Kolkata 26-35 7 Because all people here look casual. 4 00:20:09 36 1209.0 NaN 0
51 The two keywords are completely identical 23 1/10/2018 16:00:44 1/10/2018 16:22:39 0.574161 10 Rajahmundry 19-25 7 both are same 4 00:21:55 13 1315.0 NaN 1
52 Search engine query 14 1/10/2018 15:40:36 1/10/2018 16:05:59 0.057864 51 Mumbai 36-50 4 Casualness is used in both the words 4 00:25:23 36 1523.0 NaN 0
53 Your keyword 15 1/10/2018 16:30:44 1/10/2018 16:57:56 0.609291 8 Chennai 36-50 4 interested person only can do Research, smart,... 3 00:27:12 55 1632.0 NaN 0
54 Search engine query 25 1/10/2018 15:58:13 1/10/2018 16:27:48 0.910836 72 Indore 26-35 2 Casual person is more accurate of the images. 3 00:29:35 45 1775.0 NaN 0
57 Search engine query 7 1/11/2018 02:58:59 1/11/2018 03:25:11 0.927707 2 Bhubaneswar 26-35 6 i believe this is my personal theory..so i thi... 5 00:26:12 107 1572.0 NaN 0
59 The two keywords are completely identical 25 1/11/2018 05:16:40 1/11/2018 05:29:40 0.974416 22 Mangalagiri 19-25 7 both are similar 4 00:13:00 16 780.0 NaN 1
62 The two keywords are completely identical 23 1/10/2018 16:30:20 1/10/2018 16:54:57 0.038920 76 Patna 26-35 7 My keyword "happy people" and Search engine qu... 5 00:24:37 79 1477.0 NaN 1
63 Your keyword 2 1/10/2018 15:41:12 1/10/2018 16:06:27 0.967479 26 Ranchi 26-35 4 My answer is more specific regarding images. 4 00:25:15 44 1515.0 NaN 0
65 Your keyword 24 1/10/2018 18:14:38 1/10/2018 18:30:30 0.124243 33 Thanjavur 19-25 6 i know need search engine when i already knew it 4 00:15:52 48 952.0 NaN 0
68 The two keywords are completely identical 1 1/11/2018 05:19:08 1/11/2018 05:33:25 0.914108 69 Mangalagiri 19-25 6 BOTH ARE SIMILAR 3 00:14:17 16 857.0 NaN 0
70 Your keyword 17 1/11/2018 04:26:12 1/11/2018 04:51:06 0.427894 53 Mangalagiri 19-25 6 by query image i understood that person seems ... 3 00:24:54 56 1494.0 NaN 0
71 Your keyword 7 1/10/2018 23:35:07 1/11/2018 00:04:56 0.621600 28 Kolkata 50-80 7 Everything is related with warm 3 00:29:49 31 1789.0 NaN 0
72 Search engine query 7 1/10/2018 16:06:43 1/10/2018 16:30:39 0.677330 38 Hyderabad 26-35 5 it gives better ideas about all the image 4 00:23:56 41 1436.0 NaN 0
74 Search engine query 24 1/11/2018 04:06:47 1/11/2018 04:31:16 0.354657 67 Mangalagiri 19-25 7 we got the same image when search in google 5 00:24:29 43 1469.0 NaN 1
75 Search engine query 16 1/10/2018 19:32:49 1/10/2018 20:01:32 0.090422 52 Hyderabad 26-35 6 with the facial expression we can find him too... 4 00:28:43 56 1723.0 NaN 0
77 Your keyword 2 1/10/2018 21:04:55 1/10/2018 21:25:01 0.928919 25 Howrah 19-25 6 very much about that 3 00:20:06 20 1206.0 NaN 0
78 Search engine query 24 1/10/2018 16:36:03 1/10/2018 17:00:17 0.525274 77 Kolkata 26-35 7 It's the image of that 3 00:24:14 22 1454.0 NaN 0
79 The two keywords are completely identical 21 1/11/2018 03:08:06 1/11/2018 03:28:19 0.588048 31 Unnao 26-35 7 Smart Person Bring Innovation and must have hi... 0 00:20:13 51 1213.0 NaN 0
82 The two keywords are completely identical 4 1/10/2018 15:39:51 1/10/2018 16:04:38 0.625525 3 Hyderabad 36-50 5 person in aggression is shouting at others 5 00:24:47 42 1487.0 NaN 0
83 Your keyword 30 1/11/2018 05:56:56 1/11/2018 06:15:18 0.042048 5 Burdwan 26-35 7 they are all were casual dress 5 00:18:22 30 1102.0 NaN 0
85 Your keyword 7 1/10/2018 17:08:03 1/10/2018 17:26:34 0.280744 49 Noida 0-18 3 By nature 5 00:18:31 9 1111.0 NaN 0
87 Your keyword 24 1/10/2018 19:54:03 1/10/2018 20:15:33 0.888706 35 Delhi 26-35 7 because it shows that 3 00:21:30 21 1290.0 NaN 0
88 The two keywords are completely identical 27 1/11/2018 04:05:34 1/11/2018 04:26:07 0.058092 55 Mangalagiri 19-25 7 we got the same image when search in google 3 00:20:33 43 1233.0 NaN 1
89 Search engine query 21 1/10/2018 15:50:11 1/10/2018 16:13:21 0.805073 50 Hyderabad 26-35 7 people are working i guess working people is m... 4 00:23:10 53 1390.0 NaN 0
93 The two keywords are completely identical 3 1/11/2018 04:13:16 1/11/2018 04:39:25 0.853188 45 Mangalagiri 19-25 7 BOTH ARE SIMILAR 5 00:26:09 16 1569.0 NaN 0
94 Your keyword 13 1/10/2018 16:01:31 1/10/2018 16:22:25 0.325484 57 Guwahati 36-50 6 They also look happy 4 00:20:54 20 1254.0 NaN 0
97 Search engine query 15 1/11/2018 01:56:50 1/11/2018 02:20:43 0.046097 7 Chennai 26-35 5 On detailed viewing smart person might be a be... 4 00:23:53 59 1433.0 NaN 0
98 Your keyword 3 1/11/2018 06:49:39 1/11/2018 07:11:03 0.091185 27 Kolkata 36-50 5 everybody is yelling 4 00:21:24 20 1284.0 NaN 0
99 Search engine query 31 1/11/2018 04:19:00 1/11/2018 04:46:11 0.951531 16 Mangalagiri 19-25 4 a complete act of expression works out here 3 00:27:11 43 1631.0 NaN 0

63 rows × 16 columns
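The loop in `compute_score` can also be written as a vectorized expression over the whole column. A minimal sketch on toy data (keeping the same case-sensitive matching, so all-caps explanations still score 0):

```python
import pandas as pd

# Toy explanations mirroring rows from the table above.
df = pd.DataFrame({'explanation_0': [
    'both are similar',
    'they both describe the same kind of people',
    'BOTH ARE SIMILAR',
]})

# Vectorized equivalent of compute_score: one point per keyword found.
df['code'] = (df['explanation_0'].str.contains('similar', regex=False).astype(int)
              + df['explanation_0'].str.contains('same', regex=False).astype(int))
print(df['code'].tolist())  # → [1, 1, 0]
```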

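The `weigthed_mean` helper called below is defined earlier in the notebook and not shown in this section; a minimal sketch consistent with its call signature would average `value_col` per group using `weight_col` as the weights:

```python
import pandas as pd

# Hypothetical sketch of the weigthed_mean helper (the actual definition
# lives in an earlier cell): trust-weighted average of a column per group.
def weigthed_mean(group, weight_col, value_col):
    w = group[weight_col]
    return (w * group[value_col]).sum() / w.sum()

# Toy check: weights 1 and 3 on values 0 and 1 give 3/4.
toy = pd.DataFrame({'_trust': [1.0, 3.0], 'code': [0, 1]})
print(weigthed_mean(toy, '_trust', 'code'))  # → 0.75
```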
data.groupby('_unit_id').apply(lambda x: weigthed_mean(x, '_trust', 'code'))
_unit_id
1     0.000000
2     0.000000
3     0.000000
4     0.297237
6     0.955167
7     0.000000
10    0.067821
13    0.000000
14    0.000000
15    0.086433
16    0.000000
17    0.467747
20    1.000000
21    0.000000
23    1.000000
24    0.158425
25    0.464503
26    0.000000
27    0.328988
30    0.000000
31    0.000000
dtype: float64
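Each per-unit value can be read as a trust-weighted estimate of how strongly that unit's explanations invoke 'similar'/'same'. To turn these into hard per-unit labels, one could threshold at 0.5 (an assumed cutoff, not chosen in the notebook), sketched here on a few of the values above:

```python
import pandas as pd

# A few of the per-unit scores printed above, keyed by _unit_id.
scores = pd.Series({4: 0.297237, 6: 0.955167, 20: 1.000000, 23: 1.000000})

# Assumed thresholding step: scores above 0.5 become label 1.
labels = (scores > 0.5).astype(int)
print(labels.to_dict())  # → {4: 0, 6: 1, 20: 1, 23: 1}
```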