Analysing the Results of the Abandonment Controlled Experiment

# Auxiliary Formatting Cell -------------------------------------------
from IPython.display import HTML
style = "<style>div.info { background-color: #DCD8D7;border-color: #dFb5b4; border-left: 5px solid #DCD8D7  ; padding: 0.5em;}</style>"
HTML(style)
# ---------------------------------------------------------------------

Notebook Metadata

Author Cristina Sarasua
Last Update 01.08.2018

Purpose Analyse the results of the controlled experiment run for the abandonment project, where the same batch of HITs was deployed several times on CrowdFlower(?), controlling for two variables:

  1. the reward (B1: reward \$0.10 per HIT; B2: reward \$0.30 per HIT)
  2. the length (B1: 3 docs, \$0.15 per HIT; B2: 6 docs, \$0.30 per HIT), i.e., 5 cents per document

Work Description

Reminder: Abandonment is defined in the paper as "workers previewing or starting a HIT and later on deciding to drop it before completion, thus giving up the reward."

Data

Data Files

  1. task_res JSON files from F8
  2. logs JSON files generated by Kevin et al. with logger

Data Explanation

  • Ground truth for Topic 418

SET 1
d1 -- LA010790-0121 -- REL
d2 -- LA010289-0001 -- NOT REL
d3 -- LA010289-0021 -- NOT REL
d4 -- LA011190-0156 -- REL
d5 -- LA010289-0060 -- NOT REL
d6 -- LA012590-0067 -- REL

SET 2
d1 -- LA052189-0009 -- REL
d2 -- LA052189-0189 -- NOT REL
d3 -- LA052189-0196 -- NOT REL
d4 -- LA052589-0174 -- REL
d5 -- LA052590-0132 -- NOT REL
d6 -- LA052590-0204 -- REL

EXP 1A and 2A --> SET 1
EXP 1B and 2B --> SET 2
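For later scoring, the ground truth above can be encoded as plain Python dicts. A minimal sketch, where the names `GROUND_TRUTH`, `EXP_TO_SET` and `accuracy` are illustrative assumptions, not part of the notebook:

```python
# Ground truth for Topic 418, copied from the listing above (True = REL).
GROUND_TRUTH = {
    "SET1": {
        "LA010790-0121": True,  "LA010289-0001": False, "LA010289-0021": False,
        "LA011190-0156": True,  "LA010289-0060": False, "LA012590-0067": True,
    },
    "SET2": {
        "LA052189-0009": True,  "LA052189-0189": False, "LA052189-0196": False,
        "LA052589-0174": True,  "LA052590-0132": False, "LA052590-0204": True,
    },
}

# EXP 1A and 2A used SET 1; EXP 1B and 2B used SET 2.
EXP_TO_SET = {"1A": "SET1", "2A": "SET1", "1B": "SET2", "2B": "SET2"}

def accuracy(judgments, exp):
    """Fraction of {doc_id: bool} judgments that match the ground truth."""
    gt = GROUND_TRUTH[EXP_TO_SET[exp]]
    hits = sum(1 for doc, rel in judgments.items() if gt.get(doc) == rel)
    return hits / len(judgments)
```
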


  • Each worker works on only ONE unit / HIT. Therefore, the worker-unit analysis does not apply here. We can, however, do e.g. a worker-document analysis for time.

Methodology

Research Question & Hypotheses

People-oriented

  1. Work done (actions, sessions)
  2. Time: When paying more, people who abandon ... Per worker // per worker-document
  3. Quality: B

Comment: theories?

Experiment-oriented: proportions of the various groups. Probably not meaningful, since the total number of workers may differ across experiments?


Findings

Data Quality / General Things
  • All workers were logged, no error
Abandonment

Discussion

  • Interpretation: TBC
  • Implications: TBC
  • The limitations of this analysis are ... TBC

Code

import pandas as pd
import matplotlib.pyplot as plt
import datetime
import numpy as np
%matplotlib inline
# Data Directories
dataDir = ""
taskDir = dataDir + "tasks/"
logDir = dataDir + "logs/"


#dataDir = "/Users/sarasua/Documents/RESEARCH/collab_Abandonment/controlledexperiment/results/"
#taskDir = dataDir + "task_res/"
#logDir = dataDir + "logs/"

# Concrete Task Files
fExp1A = taskDir + "CONTROLLED_EXPFIXED_REWARD_PAY15_DOCSET1.json"
fExp1B = taskDir + "CONTROLLED_EXPFIXED_REWARD_PAY30_DOCSET2.json"
fExp2A = taskDir + "CONTROLLED_EXPFIXED_LEN_PAY10_DOCSET1.json"
fExp2B = taskDir + "CONTROLLED_EXPFIXED_LEN_PAY30_DOCSET2.json"

# Concrete Logged Events Files
fLog1A = logDir + "CONTROLLED_EXPFIXED_REWARD_PAY15_DOCSET1.json"
fLog1B = logDir + "CONTROLLED_EXPFIXED_REWARD_PAY30_DOCSET2.json"
fLog2A = logDir + "CONTROLLED_EXPFIXED_LEN_PAY10_DOCSET1.json"
fLog2B = logDir + "CONTROLLED_EXPFIXED_LEN_PAY30_DOCSET2.json"

Data Preprocessing

TODO: Data Preprocessing Tasks

  • Load data
  • Separate abandoned and submitted workers, as Lei did
  • If we want to compare with the other batch of tasks (aka the experiment in the wild), we need to rescale the relevance judgements, because the first experiment used a 4-level scale and this one uses a 2-level scale
  • Merge the files into the DataFrame of interest
  • Split the groups in a similar way (but how to analyse them later?)
  • Cross-check: did they ensure that the workers in experiments A and B are always disjoint? That would be a between-subjects experiment design
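The rescaling step mentioned above could be sketched as follows. The 0-3 encoding of the 4-level scale and the cut point between levels 1 and 2 are assumptions that should be checked against the original scale's labels:

```python
def rescale_judgment(level4):
    """Map a 4-level relevance judgment (assumed 0-3) to a binary one (0/1).

    Assumption: the top two levels count as relevant; adjust the cut
    point if the original scale is labelled differently.
    """
    return 1 if level4 >= 2 else 0
```
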
exp1A = pd.read_json(path_or_buf = fExp1A, lines = True, encoding = 'utf-8', orient = "records")
exp1B = pd.read_json(path_or_buf = fExp1B, lines = True, encoding = 'utf-8', orient = "records")
exp2A = pd.read_json(path_or_buf = fExp2A, lines = True, encoding = 'utf-8', orient = "records")
exp2B = pd.read_json(path_or_buf = fExp2B, lines = True, encoding = 'utf-8', orient = "records")

log1A = pd.read_json(path_or_buf = fLog1A, lines = True, encoding = 'utf-8', orient = "records")
log1B = pd.read_json(path_or_buf = fLog1B, lines = True, encoding = 'utf-8', orient = "records")
log2A = pd.read_json(path_or_buf = fLog2A, lines = True, encoding = 'utf-8', orient = "records")
log2B = pd.read_json(path_or_buf = fLog2B, lines = True, encoding = 'utf-8', orient = "records")
# Data Format & Content Exploration - with example exp1A 
exp1A.head()
agreement created_at data gold_pool id job_id judgments_count missed_count results state updated_at
0 1 2018-07-25 22:54:27 {'unit_id': '1'} NaN 1830562407 1286845 1 0 {'judgments': [{'id': 3895462947, 'created_at'... finalized 2018-07-26 01:40:32
1 1 2018-07-25 22:54:27 {'unit_id': '2'} NaN 1830562408 1286845 1 0 {'judgments': [{'id': 3895409444, 'created_at'... finalized 2018-07-26 01:03:17
2 1 2018-07-25 22:54:27 {'unit_id': '3'} NaN 1830562409 1286845 1 0 {'judgments': [{'id': 3895402529, 'created_at'... finalized 2018-07-26 00:59:09
3 1 2018-07-25 22:54:27 {'unit_id': '4'} NaN 1830562410 1286845 1 0 {'judgments': [{'id': 3895398081, 'created_at'... finalized 2018-07-26 00:56:45
4 1 2018-07-25 22:54:27 {'unit_id': '5'} NaN 1830562411 1286845 1 0 {'judgments': [{'id': 3895417305, 'created_at'... finalized 2018-07-26 01:08:23
# Create the column unit_id extracted from the data dict
def extractUnitId(row):
    resDic = row['data']    
    unitId = resDic['unit_id']
    return unitId
exp1A['unit_id'] = exp1A.apply(extractUnitId,axis=1)
exp1B['unit_id'] = exp1B.apply(extractUnitId,axis=1)
exp2A['unit_id'] = exp2A.apply(extractUnitId,axis=1)
exp2B['unit_id'] = exp2B.apply(extractUnitId,axis=1)
# Create the column worker_id extracted from the results dict
def extractWorkerId(row):
    resDic = row['results']
    workerId = resDic['judgments'][0]['worker_id']
    if(len(resDic['judgments']) > 1):
        print('One worker with more than one judgment! '+ str(workerId))
    
    return workerId
    
exp1A['worker_id'] = exp1A.apply(extractWorkerId,axis=1)
exp1B['worker_id'] = exp1B.apply(extractWorkerId,axis=1)
exp2A['worker_id'] = exp2A.apply(extractWorkerId,axis=1)
exp2B['worker_id'] = exp2B.apply(extractWorkerId,axis=1)
# Data Format & Content Exploration - with example log1A 
log1A.head()
judgments message pay server_time session_id step steps task_id time times worker_id
0 [None, -1, -1, -1] Start 15 2018-07-26 00:53:33.683495 0.ldhmu5i2 1 [1] 1286845 1532566410102 [None, 0, 0, 0] 43978656
1 [None, -1, -1, -1] MW Worker Rejected:43978656 15 2018-07-26 00:53:33.911709 0.ldhmu5i2 1 [1] 1286845 1532566410115 [None, 0, 0, 0] 43978656
2 [None, -1, -1, -1] Start 15 2018-07-26 00:53:42.309351 0.2g5zq7ff 1 [1] 1286845 1532566414517 [None, 0, 0, 0] 41202032
3 [None, -1, -1, -1] Start 15 2018-07-26 00:54:02.042663 0.er79jbbc 1 [1] 1286845 1532566439356 [None, 0, 0, 0] 39404795
4 [None, -1, -1, -1] MW Worker Rejected:39404795 15 2018-07-26 00:54:02.229070 0.er79jbbc 1 [1] 1286845 1532566439373 [None, 0, 0, 0] 39404795

Explanation by Kevin:

var final_log = {
    "session_id": session_id, // unique session id (to capture page refresh)
    "message": String(message), // message that triggered the log -- see below
    "worker_id": worker_id, // worker id
    "task_id": task_id, // task id
    "time": Date.now(), // time of sending the log
    "step": step, // step into the task (i.e., 1,2,3,4... no_docs, paystep)
    "judgments": judgments, // array of judgments -- starts at 1 (0 is null)
    "times": times, // array of times for the judgments -- starts at 1 (0 is null)
    "steps": steps // array of steps into the task; e.g., if the worker pressed back at step 2 the array is 1,2,1,2,3,...
};

the message-set is:

  • nextButton
  • backButton
  • Final_OK --> task concluded successfully
  • paying --> paying the worker
  • Start --> start task
  • 'MW Worker Rejected:' + worker_id --> a blacklisted worker tried to start the task
  • MWorker ok --> opposite of the last (not sure if present)
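Since the rejection messages embed the worker id as a suffix, the message set can be normalised into event types. A minimal sketch (the helper name is an assumption):

```python
def classify_message(message):
    """Return (event, worker_id); worker_id is None except for rejections."""
    if message.startswith('MW Worker Rejected:'):
        # The suffix after the colon is the rejected worker's id.
        return 'rejected', int(message.split(':', 1)[1])
    return message, None
```

Applied e.g. as `log1A.message.map(lambda m: classify_message(m)[0])`, this collapses all the per-worker rejection messages into a single 'rejected' category.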
log1A.message.unique()
array(['Start', 'MW Worker Rejected:43978656',
       'MW Worker Rejected:39404795', 'MW Worker Rejected:43726491',
       'MW Worker Rejected:39718779', 'nextButton', 'Final_OK', 'paying',
       'MW Worker Rejected:44239317', 'backButton',
       'MW Worker Rejected:44103483', 'MW Worker Rejected:44179475',
       'MW Worker Rejected:28009209', 'MW Worker Rejected:6476374',
       'MW Worker Rejected:37101159', 'MW Worker Rejected:40073351',
       'MW Worker Rejected:44003641', 'MW Worker Rejected:31822324',
       'MW Worker Rejected:34949514', 'MW Worker Rejected:38968301',
       'MW Worker Rejected:33313161', 'MW Worker Rejected:44204171',
       'MW Worker Rejected:44361126', 'MW Worker Rejected:11001780',
       'MW Worker Rejected:44433928', 'MW Worker Rejected:42074684',
       'MW Worker Rejected:44259527', 'MW Worker Rejected:18495435',
       'MW Worker Rejected:33453966', 'MW Worker Rejected:43967749'],
      dtype=object)
sessions = log1A[['worker_id', 'session_id']]
sessions.groupby(['worker_id']).size().unique()
array([ 2,  6,  8,  1,  7,  3,  9,  4, 10, 14,  5, 11])
import json
def getJudgments(row):
    text_result = row['results']['judgments'][0]['data']['text_result']
    textrjson = json.loads(text_result)
    judgments = textrjson['judgments']
     # returning pd.Series(judgments) would also work, but then apply expects the shape of the DataFrame calling it
    return len(judgments)
# Helpers
def countJudgments(row):
    #return len(row['judgments'])
    return row['judgments_count'] # it's wrapped in the judgments - data
# Cross checks & Basic stats - units per people etc. Global and separating people? 
def checkTask(taskDf):
    
    # checking published config
    print('total number of HITs:' + str(len(taskDf)))
    # KO print('number of judgments per HIT' + str(taskDf.results.map(lambda x: len(x)).max()))   
    nulls = pd.isnull(taskDf)
    
    # missing values
       
    print('Empty value in data column: ' + str(len(nulls.loc[nulls['data'] == True])) + ' out of '+ str(len(nulls['data'])))
    print('Empty value in results column: ' + str(len(nulls.loc[nulls['results'] == True])) + ' out of '+ str(len(nulls['results'])))
    print('Empty value in created_at column: ' + str(len(nulls.loc[nulls['created_at'] == True])) + ' out of '+ str(len(nulls['created_at'])))
    print('Empty value in updated_at column: ' + str(len(nulls.loc[nulls['updated_at'] == True])) + ' out of '+ str(len(nulls['updated_at'])))
    print('Empty value in id column: ' + str(len(nulls.loc[nulls['id'] == True])) + ' out of '+ str(len(nulls['id'])))
    print('Empty value in job_id column: ' + str(len(nulls.loc[nulls['job_id'] == True])) + ' out of '+ str(len(nulls['job_id'])))
    print('Empty value in worker_id column: ' + str(len(nulls.loc[nulls['worker_id'] == True])) + ' out of '+ str(len(nulls['worker_id'])))
    print('Empty value in unit_id column: ' + str(len(nulls.loc[nulls['unit_id'] == True])) + ' out of '+ str(len(nulls['unit_id'])))
    
    
    
    # counts
    print('Total number of workers: ' + str(taskDf['worker_id'].nunique()))
    print('Total number of units - they are judgments: ' + str(taskDf['unit_id'].nunique())) 
    print('AVG Number of units per worker: '+ str(taskDf.groupby(['worker_id'])['unit_id'].nunique().mean()) + ' Max Number of units per worker: '+ str(taskDf.groupby(['worker_id'])['unit_id'].nunique().max()) )
    print('Number of judgments per worker: ' )
    # Note: when returning an array, apply takes the length of the DF here (Pandas)
    judgmentsCount = taskDf.apply(getJudgments, axis=1)
    print(judgmentsCount.describe())
        
   
exp1A['results'][0]['judgments'][0]['data']['text_result'] # this one gives an array of 4?
'{"session_id":"0.4nbus5f6","message":"paying","worker_id":33017350,"task_id":1286845,"time":1532569214713,"step":5,"judgments":[null,"2","0","0"],"times":[null,77.2,70.998,41.138],"steps":[1,2,3,4,5],"pay":15}'
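The text_result field is itself a JSON-encoded string, so it needs a second json.loads. A minimal sketch of unpacking it, using the string printed above; per Kevin's explanation, position 0 of the judgments/times arrays is a null placeholder, which is why a 3-document HIT shows an array of length 4:

```python
import json

# Example text_result string, copied from the cell output above.
text_result = (
    '{"session_id":"0.4nbus5f6","message":"paying","worker_id":33017350,'
    '"task_id":1286845,"time":1532566410102,"step":5,'
    '"judgments":[null,"2","0","0"],"times":[null,77.2,70.998,41.138],'
    '"steps":[1,2,3,4,5],"pay":15}'
)

parsed = json.loads(text_result)
# Drop the null placeholder at position 0 to count actual judgments.
n_judged = sum(1 for j in parsed['judgments'] if j is not None)
```
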
checkTask(exp1A) # are the file titles misleading? From the number of judgments sent by workers, it looks like exp1A is the one for length
checkTask(exp1B)
checkTask(exp2A)
checkTask(exp2B)
total number of HITs:100
Empty value in data column: 0 out of 100
Empty value in results column: 0 out of 100
Empty value in created_at column: 0 out of 100
Empty value in updated_at column: 0 out of 100
Empty value in id column: 0 out of 100
Empty value in job_id column: 0 out of 100
Empty value in worker_id column: 0 out of 100
Empty value in unit_id column: 0 out of 100
Total number of workers: 100
Total number of units - they are judgments: 100
AVG Number of units per worker: 1.0 Max Number of units per worker: 1
Number of judgments per worker: 
count    100.0
mean       4.0
std        0.0
min        4.0
25%        4.0
50%        4.0
75%        4.0
max        4.0
dtype: float64
total number of HITs:100
Empty value in data column: 0 out of 100
Empty value in results column: 0 out of 100
Empty value in created_at column: 0 out of 100
Empty value in updated_at column: 0 out of 100
Empty value in id column: 0 out of 100
Empty value in job_id column: 0 out of 100
Empty value in worker_id column: 0 out of 100
Empty value in unit_id column: 0 out of 100
Total number of workers: 100
Total number of units - they are judgments: 100
AVG Number of units per worker: 1.0 Max Number of units per worker: 1
Number of judgments per worker: 
count    100.0
mean       7.0
std        0.0
min        7.0
25%        7.0
50%        7.0
75%        7.0
max        7.0
dtype: float64
total number of HITs:100
Empty value in data column: 0 out of 100
Empty value in results column: 0 out of 100
Empty value in created_at column: 0 out of 100
Empty value in updated_at column: 0 out of 100
Empty value in id column: 0 out of 100
Empty value in job_id column: 0 out of 100
Empty value in worker_id column: 0 out of 100
Empty value in unit_id column: 0 out of 100
Total number of workers: 100
Total number of units - they are judgments: 100
AVG Number of units per worker: 1.0 Max Number of units per worker: 1
Number of judgments per worker: 
count    100.0
mean       7.0
std        0.0
min        7.0
25%        7.0
50%        7.0
75%        7.0
max        7.0
dtype: float64
total number of HITs:100
Empty value in data column: 0 out of 100
Empty value in results column: 0 out of 100
Empty value in created_at column: 0 out of 100
Empty value in updated_at column: 0 out of 100
Empty value in id column: 0 out of 100
Empty value in job_id column: 0 out of 100
Empty value in worker_id column: 0 out of 100
Empty value in unit_id column: 0 out of 100
Total number of workers: 100
Total number of units - they are judgments: 100
AVG Number of units per worker: 1.0 Max Number of units per worker: 1
Number of judgments per worker: 
count    100.0
mean       7.0
std        0.0
min        7.0
25%        7.0
50%        7.0
75%        7.0
max        7.0
dtype: float64
def checkLog(logDf):
    # missing values
    nulls = pd.isnull(logDf)
    print('Empty value in message column: ' + str(len(nulls.loc[nulls['message'] == True])) + ' out of '+ str(len(nulls['message'])))
    print('Empty value in session_id column: ' + str(len(nulls.loc[nulls['session_id'] == True])) + ' out of '+ str(len(nulls['session_id'])))
    print('Empty value in task_id column: ' + str(len(nulls.loc[nulls['task_id'] == True])) + ' out of '+ str(len(nulls['task_id'])))
    print('Empty value in time column: ' + str(len(nulls.loc[nulls['time'] == True])) + ' out of '+ str(len(nulls['time'])))
    print('Empty value in times column: ' + str(len(nulls.loc[nulls['times'] == True])) + ' out of '+ str(len(nulls['times'])))
    print('Empty value in worker_id column: ' + str(len(nulls.loc[nulls['worker_id'] == True])) + ' out of '+ str(len(nulls['worker_id'])))
    print('Empty value in pay column: ' + str(len(nulls.loc[nulls['pay'] == True])) + ' out of '+ str(len(nulls['pay'])))
    print('Empty value in judgments column: ' + str(len(nulls.loc[nulls['judgments'] == True])) + ' out of '+ str(len(nulls['judgments'])))

    # counts
    print('Total number of workers: ' + str(logDf['worker_id'].nunique()))
    print('Total number of tasks: ' + str(logDf['task_id'].nunique())) # task = unit
    print('AVG Number of sessions per worker: '+ str(logDf.groupby(['worker_id'])['session_id'].nunique().mean()) + ' Max Number of sessions per worker: '+ str(logDf.groupby(['worker_id'])['session_id'].nunique().max()) )
    print('AVG Number of tasks per worker: '+ str(logDf.groupby(['worker_id'])['task_id'].nunique().mean()) + ' Max Number of tasks per worker: '+ str(logDf.groupby(['worker_id'])['task_id'].nunique().max()) )
checkLog(log1A)
checkLog(log1B)
checkLog(log2A)
checkLog(log2B)
Empty value in message column: 0 out of 790
Empty value in session_id column: 0 out of 790
Empty value in task_id column: 0 out of 790
Empty value in time column: 0 out of 790
Empty value in times column: 0 out of 790
Empty value in worker_id column: 0 out of 790
Empty value in pay column: 0 out of 790
Empty value in judgments column: 0 out of 790
Total number of workers: 209
Total number of tasks: 1
AVG Number of sessions per worker: 1.1961722488038278 Max Number of sessions per worker: 6
AVG Number of tasks per worker: 1.0 Max Number of tasks per worker: 1
Empty value in message column: 0 out of 1129
Empty value in session_id column: 0 out of 1129
Empty value in task_id column: 0 out of 1129
Empty value in time column: 0 out of 1129
Empty value in times column: 0 out of 1129
Empty value in worker_id column: 0 out of 1129
Empty value in pay column: 0 out of 1129
Empty value in judgments column: 0 out of 1129
Total number of workers: 202
Total number of tasks: 1
AVG Number of sessions per worker: 1.2227722772277227 Max Number of sessions per worker: 8
AVG Number of tasks per worker: 1.0 Max Number of tasks per worker: 1
Empty value in message column: 0 out of 1089
Empty value in session_id column: 0 out of 1089
Empty value in task_id column: 1 out of 1089
Empty value in time column: 0 out of 1089
Empty value in times column: 0 out of 1089
Empty value in worker_id column: 0 out of 1089
Empty value in pay column: 0 out of 1089
Empty value in judgments column: 0 out of 1089
Total number of workers: 207
Total number of tasks: 1
AVG Number of sessions per worker: 1.1159420289855073 Max Number of sessions per worker: 6
AVG Number of tasks per worker: 0.9951690821256038 Max Number of tasks per worker: 1
Empty value in message column: 0 out of 1146
Empty value in session_id column: 0 out of 1146
Empty value in task_id column: 1 out of 1146
Empty value in time column: 0 out of 1146
Empty value in times column: 0 out of 1146
Empty value in worker_id column: 0 out of 1146
Empty value in pay column: 0 out of 1146
Empty value in judgments column: 0 out of 1146
Total number of workers: 178
Total number of tasks: 1
AVG Number of sessions per worker: 1.1629213483146068 Max Number of sessions per worker: 4
AVG Number of tasks per worker: 0.9943820224719101 Max Number of tasks per worker: 1
def checkTaskJobJointly(taskDf, logDf):
    
    abandonedDf = logDf[~logDf['worker_id'].isin(taskDf['worker_id'])]
    completedDf = logDf[logDf['worker_id'].isin(taskDf['worker_id'])]
    
    # all the answers in the task completion report are also in the log data set
    print('Number of people who abandoned: ' + str(len(logDf['worker_id'][~logDf['worker_id'].isin(taskDf['worker_id'])].unique())) ) #+ ' and they are IDs: '+  str(logDf['worker_id'][~logDf['worker_id'].isin(taskDf['worker_id'])])
    print('Number of people who submitted: ' + str(len(logDf['worker_id'][logDf['worker_id'].isin(taskDf['worker_id'])].unique())) ) #+ ' and they are IDs: '+  str(logDf['worker_id'][logDf['worker_id'].isin(taskDf['worker_id'])])
    print('Number of people who were not logged: ' + str(len(taskDf['worker_id'][~taskDf['worker_id'].isin(logDf['worker_id'])].unique())) )
    print('*Total number of workers in Task*: '+ str(taskDf['worker_id'].nunique()))
    print('*Total number of workers in Log*: '+ str(logDf['worker_id'].nunique()))
    
    return abandonedDf, completedDf
   
    

print('--- Experiment 1  ------------------------')
print('--- (A) ------------------------')
aban_1A, complet_1A = checkTaskJobJointly(exp1A, log1A)
print('--- (B) ------------------------')
aban_1B, complet_1B = checkTaskJobJointly(exp1B, log1B)
print('--- Experiment 2  ------------------------')
print('--- (A) ------------------------')
aban_2A, complet_2A = checkTaskJobJointly(exp2A, log2A)
print('--- (B) ------------------------')
aban_2B, complet_2B = checkTaskJobJointly(exp2B, log2B)
--- Experiment 1  ------------------------
--- (A) ------------------------
Number of people who abandoned: 109
Number of people who submitted: 100
Number of people who were not logged: 0
*Total number of workers in Task*: 100
*Total number of workers in Log*: 209
--- (B) ------------------------
Number of people who abandoned: 102
Number of people who submitted: 100
Number of people who were not logged: 0
*Total number of workers in Task*: 100
*Total number of workers in Log*: 202
--- Experiment 2  ------------------------
--- (A) ------------------------
Number of people who abandoned: 107
Number of people who submitted: 100
Number of people who were not logged: 0
*Total number of workers in Task*: 100
*Total number of workers in Log*: 207
--- (B) ------------------------
Number of people who abandoned: 78
Number of people who submitted: 100
Number of people who were not logged: 0
*Total number of workers in Task*: 100
*Total number of workers in Log*: 178

Building the 4 groups of people:

Focus is on the log files, filtering in one way or the other.

# Get the two subgroups of abandoned workers: those who abandoned right away vs. those who abandoned after restarting (more than one session)
def abandSpec(df):
    # (!!) Pandas passes through the first twice
    dfG = df.groupby(['worker_id'])
    abanA = dfG.filter(lambda x: len(x['session_id'].unique()) == 1)
    abanB = dfG.filter(lambda x: len(x['session_id'].unique()) > 1)
    return abanA,abanB
# Get the two subgroups of completed workers: those who submitted answers right away vs. those who submitted after restarting (more than one session)
# Coded in a separate method for extensibility reasons
def completSpec(df):
    # (!!) Pandas passes through the first twice
    dfG = df.groupby(['worker_id'])
    complA = dfG.filter(lambda x: len(x['session_id'].unique()) == 1)
    complB = dfG.filter(lambda x: len(x['session_id'].unique()) > 1)
    return complA,complB
    
    
# Get all the concrete subsets for all versions of the two controlled experiments.

# Experiment 1 (A,B settings)
abanA_1A,abanB_1A = abandSpec(aban_1A)
completA_1A,completB_1A = completSpec(complet_1A)

abanA_1B,abanB_1B = abandSpec(aban_1B)
completA_1B,completB_1B = completSpec(complet_1B)

# Experiment 2 (A,B settings)
abanA_2A,abanB_2A = abandSpec(aban_2A)
completA_2A,completB_2A = completSpec(complet_2A)

abanA_2B,abanB_2B = abandSpec(aban_2B)
completA_2B,completB_2B = completSpec(complet_2B)
# Cross-check - CORRECT
print('abandoned subgroups 1A')
print(abanA_1A.worker_id.nunique() + abanB_1A.worker_id.nunique())
print('abandoned subgroups 1B')
print(abanA_1B.worker_id.nunique() + abanB_1B.worker_id.nunique())
print('abandoned subgroups 2A')
print(abanA_2A.worker_id.nunique() + abanB_2A.worker_id.nunique())
print('abandoned subgroups 2B')
print(abanA_2B.worker_id.nunique() + abanB_2B.worker_id.nunique())

print('completed subgroups 1A')
print(completA_1A.worker_id.nunique() + completB_1A.worker_id.nunique())
print('completed subgroups 1B')
print(completA_1B.worker_id.nunique() + completB_1B.worker_id.nunique())
print('completed subgroups 2A')
print(completA_2A.worker_id.nunique() + completB_2A.worker_id.nunique())
print('completed subgroups 2B')
print(completA_2B.worker_id.nunique() + completB_2B.worker_id.nunique())
abandoned subgroups 1A
109
abandoned subgroups 1B
102
abandoned subgroups 2A
107
abandoned subgroups 2B
78
completed subgroups 1A
100
completed subgroups 1B
100
completed subgroups 2A
100
completed subgroups 2B
100
# ----- Testing Pandas
#log1A[log1A['worker_id']==41202032]
#d = log1A.sort_values(by=['worker_id'])
#d.head(100)
#log1Ag = log1A.groupby(['worker_id'])
#abb = log1Ag.filter(lambda x: len(x['session_id'].unique()) > 1)
#abb
#abb.groupby(['worker_id']).get_group(41202032)
#abb.groupby(['worker_id']).get_group(6476374) #- does not find it - it's correct
# --
# a = [1,2,3]
# b = [2,3,4]
# data = pd.DataFrame()
# data['a'] = pd.Series(a)
# data['b'] = pd.Series(b)
# data.head()
# data['a'][~data['a'].isin(data['b'])]
# data['a'][data['a'].isin(data['b'])]
# data['a'].isin(data['b'])
# data[~data['a'].isin(data['b'])]
# ----------- end of testing Pandas

Experiment-based Hypotheses

Compare per group, compare per experiment controlled.

from scipy.stats import shapiro
from scipy.stats import normaltest
from scipy.stats import anderson

# Input: series has the sample whose distribution we want to test
# Output: gaussian boolean True if it is normal distribution and False otherwise.
def testNormality(series):
    
    alpha = 0.05
    gaussian = False
    
    # Only if all the tests say normal do we conclude the distribution is normal;
    # if any test fails, it is NOT normal.
    
    # Shapiro-Wilk Test - for smaller data sets, up to around thousands of records
    print('length of series in Shapiro is: '+ str(len(series)))
    stats1, p1 = shapiro(series)
    print('Statistics Shapiro-Wilk Test=%.3f, p=%.3f' % (stats1, p1))
    shapiroNormal = p1 > alpha
    print('Shapiro-Wilk says it is Normal ' + str(shapiroNormal))
    
    # D'Agostino and Pearson's Test
    stats2, p2 = normaltest(series)
    print('Statistics D\'Agostino and Pearson\'s Test=%.3f, p=%.3f' % (stats2, p2))
    dagostinoNormal = p2 > alpha
    print('D\'Agostino and Pearson\'s says it is Normal ' + str(dagostinoNormal))
    
    gaussian = shapiroNormal and dagostinoNormal
    
    # Anderson-Darling Test
    '''result = anderson(series) 
    print('Statistic: %.3f' % result.statistic)
    for i in range(len(result.critical_values)):
        sl, cv = result.significance_level[i], result.critical_values[i]
        if result.statistic > result.critical_values[i]:
            gaussian = False'''
        
    
    return gaussian
    
from scipy.stats import ttest_ind
from scipy.stats import mannwhitneyu

# Input:
# series1 is the series with the set of measurements for every single worker in case A of controlled experiment
# series2 is the series with the set of measurements for every single worker in case B of controlled experiment
# gaussian is the boolean value indicating if the samples have passed the test of normality or not (True is apply parametric test)
# Output:
# stats of statistical test 
# p-value 
# acceptHo (True if we fail to reject it and False if we reject it) 
# See also all tables for picking the tests (e.g., https://help.xlstat.com/customer/en/portal/articles/2062457-which-statistical-test-should-you-use-)
def compareTwoSamples(series1,series2, gaussian):
    # Tests to compare two samples (H0: they have equal distribution; H1: they have different distribution)
    
    alpha = 0.05
    acceptH0 = False
    
    if (gaussian == True):
        # Run Student's T-test
        stats, p = ttest_ind(series1, series2)
        print('Statistics=%.3f, p=%.3f' % (stats, p))
        
    else:
        
        # Run Mann-Whitney; Kruskal-Wallis test is for more samples.
        stats, p = mannwhitneyu(series1, series2)
        print('Statistics=%.3f, p=%.3f' % (stats, p))
        
        # result - hypothesis testing
   
    if p > alpha:
        acceptH0 = True
    
    print('Fail to reject H0 (no evidence the distributions differ): ' + str(acceptH0))
    return stats,p,acceptH0        
        
    
    
    

Work Done

Just compare per controlled experiment.

Comparing the mean number of sessions per worker in the two populations

Idea: among the workers who abandoned, the higher the reward and the longer the set of documents, the more value the HIT has and the more times they retry:

H0R means of session count are equal in both populations (reward)

H1R not equal

H0L means of session count are equal in both populations (length)

H1L not equal
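The H0R/H1R comparison above can be sketched with SciPy's Mann-Whitney U test. The two samples below are synthetic stand-ins for the per-worker session counts of the two populations (the real inputs come from the session counts computed in this section):

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Synthetic per-worker session counts (made up for illustration only).
rng = np.random.default_rng(42)
sessions_low_reward = rng.integers(1, 3, size=100)   # 1-2 sessions per worker
sessions_high_reward = rng.integers(1, 6, size=100)  # 1-5 sessions per worker

# H0R: equal session-count distributions; reject when p < alpha.
stat, p = mannwhitneyu(sessions_low_reward, sessions_high_reward)
reject_h0r = p < 0.05
```
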

# Functions to compute the measurements
def describeSessionCount(df):
    dfG = df.groupby(['worker_id'])
    sessionCounts = dfG.apply(lambda x: len(x['session_id'].unique()))
    print(sessionCounts.describe())
    sessionCountsRI = sessionCounts.reset_index()
    del(sessionCountsRI['worker_id'])
    sessionCountsRI.columns=['sessionCount']
    return sessionCountsRI
sessionC_abanA_1A = describeSessionCount(abanA_1A)
sessionC_abanB_1A = describeSessionCount(abanB_1A)
sessionC_abanA_1B = describeSessionCount(abanA_1B)
sessionC_abanB_1B = describeSessionCount(abanB_1B)
sessionC_abanA_2A = describeSessionCount(abanA_2A)
sessionC_abanB_2A = describeSessionCount(abanB_2A)
sessionC_abanA_2B = describeSessionCount(abanA_2B)
sessionC_abanB_2B = describeSessionCount(abanB_2B)
count    90.0
mean      1.0
std       0.0
min       1.0
25%       1.0
50%       1.0
75%       1.0
max       1.0
dtype: float64
count    19.000000
mean      2.105263
std       0.315302
min       2.000000
25%       2.000000
50%       2.000000
75%       2.000000
max       3.000000
dtype: float64
count    90.0
mean      1.0
std       0.0
min       1.0
25%       1.0
50%       1.0
75%       1.0
max       1.0
dtype: float64
count    12.000000
mean      2.666667
std       1.723281
min       2.000000
25%       2.000000
50%       2.000000
75%       2.250000
max       8.000000
dtype: float64
count    100.0
mean       1.0
std        0.0
min        1.0
25%        1.0
50%        1.0
75%        1.0
max        1.0
dtype: float64
count    7.000000
mean     2.571429
std      1.511858
min      2.000000
25%      2.000000
50%      2.000000
75%      2.000000
max      6.000000
dtype: float64
count    73.0
mean      1.0
std       0.0
min       1.0
25%       1.0
50%       1.0
75%       1.0
max       1.0
dtype: float64
count    5.000000
mean     2.600000
std      0.894427
min      2.000000
25%      2.000000
50%      2.000000
75%      3.000000
max      4.000000
dtype: float64
sessionC_completA_1A = describeSessionCount(completA_1A)
sessionC_completB_1A = describeSessionCount(completB_1A)
sessionC_completA_1B = describeSessionCount(completA_1B)
sessionC_completB_1B = describeSessionCount(completB_1B)
sessionC_completA_2A = describeSessionCount(completA_2A)
sessionC_completB_2A = describeSessionCount(completB_2A)
sessionC_completA_2B = describeSessionCount(completA_2B)
sessionC_completB_2B = describeSessionCount(completB_2B)
count    88.0
mean      1.0
std       0.0
min       1.0
25%       1.0
50%       1.0
75%       1.0
max       1.0
dtype: float64
count    12.000000
mean      2.666667
std       1.230915
min       2.000000
25%       2.000000
50%       2.000000
75%       3.000000
max       6.000000
dtype: float64
count    83.0
mean      1.0
std       0.0
min       1.0
25%       1.0
50%       1.0
75%       1.0
max       1.0
dtype: float64
count    17.000000
mean      2.470588
std       0.874475
min       2.000000
25%       2.000000
50%       2.000000
75%       3.000000
max       5.000000
dtype: float64
count    89.0
mean      1.0
std       0.0
min       1.0
25%       1.0
50%       1.0
75%       1.0
max       1.0
dtype: float64
count    11.000000
mean      2.181818
std       0.404520
min       2.000000
25%       2.000000
50%       2.000000
75%       2.000000
max       3.000000
dtype: float64
count    84.0
mean      1.0
std       0.0
min       1.0
25%       1.0
50%       1.0
75%       1.0
max       1.0
dtype: float64
count    16.00000
mean      2.31250
std       0.60208
min       2.00000
25%       2.00000
50%       2.00000
75%       2.25000
max       4.00000
dtype: float64
'''The normality check needs n >= 20 (scipy's D'Agostino-Pearson test requires at least 20 samples; Shapiro-Wilk itself works for smaller n), and when splitting, some of the groups have n under 20
norm_sessionC_abanA_1A = testNormality(sessionC_abanA_1A)
print("final: " + str(norm_sessionC_abanA_1A))
norm_sessionC_abanB_1A = testNormality(sessionC_abanB_1A)
print("final: " + str(norm_sessionC_abanB_1A))

norm_sessionC_abanA_1B = testNormality(sessionC_abanA_1B)
print("final: " + str(norm_sessionC_abanA_1B))
norm_sessionC_abanB_1B = testNormality(sessionC_abanB_1B)
print("final: " + str(norm_sessionC_abanB_1B ))

print('-------')

norm_sessionC_completA_1A = testNormality(sessionC_completA_1A)
print("final: " + str(norm_sessionC_completA_1A))
norm_sessionC_completB_1A = testNormality(sessionC_completB_1A)
print("final: " + str(norm_sessionC_completB_1A))
norm_sessionC_completA_1B = testNormality(sessionC_completA_1B)
print("final: " + str(norm_sessionC_completA_1B))
norm_sessionC_completB_1B = testNormality(sessionC_completB_1B)
print("final: " + str(norm_sessionC_completB_1B ))'''
# merge the two types of abandoned
sessionC_aban_1A = sessionC_abanA_1A.append(sessionC_abanB_1A,ignore_index=True)
sessionC_aban_1B = sessionC_abanA_1B.append(sessionC_abanB_1B,ignore_index=True)
sessionC_aban_2A = sessionC_abanA_2A.append(sessionC_abanB_2A,ignore_index=True)
sessionC_aban_2B = sessionC_abanA_2B.append(sessionC_abanB_2B,ignore_index=True)

# merge the two types of completed
sessionC_complet_1A = sessionC_completA_1A.append(sessionC_completB_1A,ignore_index=True)
sessionC_complet_1B = sessionC_completA_1B.append(sessionC_completB_1B,ignore_index=True)
sessionC_complet_2A = sessionC_completA_2A.append(sessionC_completB_2A,ignore_index=True)
sessionC_complet_2B = sessionC_completA_2B.append(sessionC_completB_2B,ignore_index=True)
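Side note: `Series.append`, used above, was deprecated in pandas 1.4 and removed in 2.0; on newer pandas the same merge is written with `pd.concat`. A minimal sketch with toy series standing in for the session-count series:

```python
import pandas as pd

# Toy series standing in for e.g. sessionC_abanA_1A and sessionC_abanB_1A.
a = pd.Series([1, 1, 2])
b = pd.Series([2, 6])

# Equivalent of a.append(b, ignore_index=True) on pandas >= 2.0:
merged = pd.concat([a, b], ignore_index=True)
```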


norm_sessionC_aban_1A = testNormality(sessionC_aban_1A)
print("final: " + str(norm_sessionC_aban_1A))
norm_sessionC_aban_1B = testNormality(sessionC_aban_1B)
print("final: " + str(norm_sessionC_aban_1B))
norm_sessionC_aban_2A = testNormality(sessionC_aban_2A)
print("final: " + str(norm_sessionC_aban_2A))
norm_sessionC_aban_2B = testNormality(sessionC_aban_2B)
print("final: " + str(norm_sessionC_aban_2B))


norm_sessionC_complet_1A = testNormality(sessionC_complet_1A)
print("final: " + str(norm_sessionC_complet_1A))
norm_sessionC_complet_1B = testNormality(sessionC_complet_1B)
print("final: " + str(norm_sessionC_complet_1B))
norm_sessionC_complet_2A = testNormality(sessionC_complet_2A)
print("final: " + str(norm_sessionC_complet_2A))
norm_sessionC_complet_2B = testNormality(sessionC_complet_2B)
print("final: " + str(norm_sessionC_complet_2B))

print('\n')
print('-- TESTS for all abandoned and all completed --')

normal = norm_sessionC_aban_1A and norm_sessionC_aban_1B
print('Abandoned 1A and 1B')
compareTwoSamples(sessionC_aban_1A, sessionC_aban_1B, normal )

normal = norm_sessionC_complet_1A and norm_sessionC_complet_1B
print('Completed 1A and 1B')
compareTwoSamples(sessionC_complet_1A, sessionC_complet_1B, normal)

print('--')

normal = norm_sessionC_aban_2A and norm_sessionC_aban_2B
print('Abandoned 2A and 2B')
compareTwoSamples(sessionC_aban_2A, sessionC_aban_2B, normal)

normal = norm_sessionC_complet_2A and norm_sessionC_complet_2B
print('Completed 2A and 2B')
compareTwoSamples(sessionC_complet_2A, sessionC_complet_2B, normal)

print('--')
length of series in Shapiro is: 109
Statistics Shapiro-Wilk Test =0.476, p=0.000
Shapiro.Wilk says it is Normal False
Statistics D'Agostino and Pearson's Test=60.346, p=0.000
final: False
length of series in Shapiro is: 102
Statistics Shapiro-Wilk Test =0.261, p=0.000
Shapiro.Wilk says it is Normal False
Statistics D'Agostino and Pearson's Test=176.924, p=0.000
final: False
length of series in Shapiro is: 107
Statistics Shapiro-Wilk Test =0.190, p=0.000
Shapiro.Wilk says it is Normal False
Statistics D'Agostino and Pearson's Test=197.283, p=0.000
final: False
length of series in Shapiro is: 78
Statistics Shapiro-Wilk Test =0.249, p=0.000
Shapiro.Wilk says it is Normal False
Statistics D'Agostino and Pearson's Test=115.498, p=0.000
final: False
length of series in Shapiro is: 100
Statistics Shapiro-Wilk Test =0.331, p=0.000
Shapiro.Wilk says it is Normal False
Statistics D'Agostino and Pearson's Test=133.321, p=0.000
final: False
length of series in Shapiro is: 100
Statistics Shapiro-Wilk Test =0.438, p=0.000
Shapiro.Wilk says it is Normal False
Statistics D'Agostino and Pearson's Test=97.299, p=0.000
final: False
length of series in Shapiro is: 100
Statistics Shapiro-Wilk Test =0.367, p=0.000
Shapiro.Wilk says it is Normal False
Statistics D'Agostino and Pearson's Test=88.741, p=0.000
final: False
length of series in Shapiro is: 100
Statistics Shapiro-Wilk Test =0.446, p=0.000
Shapiro.Wilk says it is Normal False
Statistics D'Agostino and Pearson's Test=82.240, p=0.000
final: False


-- TESTS for all abandoned and all completed --
Abandoned 1A and 1B
Statistics=5261.500, p=0.138
The two samples have the same distribution (meanA = meanB): True
Completed 1A and 1B
Statistics=4755.500, p=0.165
The two samples have the same distribution (meanA = meanB): True
--
Abandoned 2A and 2B
Statistics=4171.000, p=0.496
The two samples have the same distribution (meanA = meanB): True
Completed 2A and 2B
Statistics=4743.000, p=0.145
The two samples have the same distribution (meanA = meanB): True
--
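`testNormality` and `compareTwoSamples` are defined earlier in the notebook. Given the statistics printed above, and that every group failed the normality tests, a plausible sketch of the comparison helper is shown below, assuming a two-sided Mann-Whitney U test for non-normal samples and an independent t-test otherwise; the actual implementation may differ:

```python
from scipy import stats

def compare_two_samples(sample_a, sample_b, normal, alpha=0.05):
    """Sketch: t-test if both samples look normal, Mann-Whitney U otherwise."""
    if normal:
        stat, p = stats.ttest_ind(sample_a, sample_b)
    else:
        stat, p = stats.mannwhitneyu(sample_a, sample_b, alternative='two-sided')
    print('Statistics=%.3f, p=%.3f' % (stat, p))
    # True means we fail to reject the null hypothesis at level alpha,
    # not that the two distributions are proven identical.
    return bool(p > alpha)
```

Note that a non-significant result only means the null hypothesis is not rejected at the chosen level; it does not prove that the two samples have the same distribution.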
 
 
 

Time

Compare per group, and compare across the controlled experiment conditions.

# Functions to compute the measurements
log1A['times'].head(100)
0                              [None, 0, 0, 0]
1                              [None, 0, 0, 0]
2                              [None, 0, 0, 0]
3                              [None, 0, 0, 0]
4                              [None, 0, 0, 0]
5                              [None, 0, 0, 0]
6                              [None, 0, 0, 0]
7                              [None, 0, 0, 0]
8                              [None, 0, 0, 0]
9                              [None, 0, 0, 0]
10                             [None, 0, 0, 0]
11                             [None, 0, 0, 0]
12                             [None, 0, 0, 0]
13                             [None, 0, 0, 0]
14                             [None, 0, 0, 0]
15                             [None, 0, 0, 0]
16                             [None, 0, 0, 0]
17                             [None, 0, 0, 0]
18                             [None, 0, 0, 0]
19                             [None, 0, 0, 0]
20                             [None, 0, 0, 0]
21                             [None, 0, 0, 0]
22                             [None, 0, 0, 0]
23                             [None, 0, 0, 0]
24                             [None, 0, 0, 0]
25                             [None, 0, 0, 0]
26                             [None, 0, 0, 0]
27                        [None, 47.377, 0, 0]
28                             [None, 0, 0, 0]
29                             [None, 0, 0, 0]
                        ...                   
70                             [None, 0, 0, 0]
71                             [None, 0, 0, 0]
72                             [None, 0, 0, 0]
73                             [None, 0, 0, 0]
74                             [None, 0, 0, 0]
75                             [None, 0, 0, 0]
76                             [None, 0, 0, 0]
77                             [None, 0, 0, 0]
78                       [None, 132.372, 0, 0]
79                         [None, 7.416, 0, 0]
80                             [None, 0, 0, 0]
81                             [None, 0, 0, 0]
82        [None, 7.416, 4.5760000000000005, 0]
83    [None, 7.416, 4.5760000000000005, 5.477]
84    [None, 7.416, 4.5760000000000005, 5.477]
85             [None, 132.372, 18.238, 14.263]
86                  [None, 132.372, 18.238, 0]
87             [None, 132.372, 18.238, 14.263]
88                             [None, 0, 0, 0]
89                             [None, 0, 0, 0]
90                             [None, 0, 0, 0]
91                             [None, 0, 0, 0]
92                             [None, 0, 0, 0]
93                             [None, 0, 0, 0]
94                             [None, 0, 0, 0]
95                             [None, 0, 0, 0]
96                        [None, 59.931, 0, 0]
97                        [None, 83.099, 0, 0]
98                             [None, 0, 0, 0]
99                             [None, 0, 0, 0]
Name: times, Length: 100, dtype: object
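Each entry of `log1A['times']` is a list `[None, t1, t2, t3]` with per-document times in seconds, where 0 apparently means the worker never reached that document. For the worker-document analysis, these lists can be flattened into one row per judged document; `explode_times` below is a hypothetical helper name, a minimal sketch under that reading of the data:

```python
import pandas as pd

def explode_times(times):
    """Flatten [None, t1, t2, t3] lists into (worker, doc, seconds) rows."""
    rows = []
    for worker, ts in times.items():
        for pos, t in enumerate(ts[1:], start=1):  # drop the leading None
            if t:  # keep only documents that were actually worked on
                rows.append({'worker': worker, 'doc': pos, 'seconds': t})
    return pd.DataFrame(rows)

# e.g. per_doc = explode_times(log1A['times'])
```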
 
 
 

Quality

Compare per group, and compare across the controlled experiment conditions.

# Functions to compute the measurements
 
 
 
# Calling for all cases
 
 

Prediction task

TODO: Feature engineering

(non confidence determination)

  • Number of starts
  • Length of start-gone-start-...
  • Time between sessions on non-completed work?

There are indeed several possible prediction tasks:

  • % of abandonment in the job (as they wrote in the paper, but that is hard with only one job, right?)
  • Classification of a worker as abandoning or not - try this
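A minimal sketch of the second task, classifying a worker as abandoning or not from the features listed above (number of starts, start-gone cycles, time between sessions). All feature values and labels below are made up for illustration, and scikit-learn is assumed:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical feature matrix, one row per worker:
# [number_of_starts, start_gone_cycles, mean_gap_between_sessions_sec]
X = np.array([
    [1, 0,   0.0],   # started once, finished
    [3, 2, 120.5],   # kept leaving and coming back
    [2, 1,  60.0],
    [1, 0,   0.0],
])
y = np.array([0, 1, 1, 0])  # 1 = abandoned, 0 = completed

clf = LogisticRegression().fit(X, y)
predictions = clf.predict(X)
```

With only one job's worth of data, a proper evaluation would need cross-validation rather than predicting on the training rows as done here.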