Analysing the Results of the Abandonment Controlled Experiment

# Auxiliary Formatting Cell -------------------------------------------
from IPython.display import HTML
style = "<style>div.info { background-color: #DCD8D7;border-color: #dFb5b4; border-left: 5px solid #DCD8D7  ; padding: 0.5em;}</style>"
HTML(style)
# ---------------------------------------------------------------------

Notebook Metadata

Author Cristina Sarasua
Last Update 01.08.2018

Purpose Analyse the results of the controlled experiment run for the abandonment project, where the same batch of HITs was deployed several times in CrowdFlower(?), controlling for two variables:

  1. the reward (B1: reward \$0.10 per HIT; B2: reward \$0.30 per HIT)
  2. the length (B1: 3 docs, \$0.15 per HIT; B2: 6 docs, \$0.30 per HIT), i.e., 5 cents per document

Work Description

Reminder: Abandonment is defined in the paper as "workers previewing or starting a HIT and later on deciding to drop it before completion, thus giving up the reward."

Data

Data Files

  1. task_res JSON files from F8
  2. logs JSON files generated by Kevin et al. with logger

Data Explanation

  • Experiment 1: REWARD
    • A \$0.10 per HIT, 6 documents
    • B \$0.30 per HIT, 6 documents
    • C \$0.30 per HIT (like B), with quality checks
  • Experiment 2: LENGTH
    • A 3 documents (\$0.15 per HIT)
    • B 6 documents (\$0.30 per HIT)
  • Ground truth for Topic 418

```
SET 1
d1 -- LA010790-0121 -- REL
d2 -- LA010289-0001 -- NOT REL
d3 -- LA010289-0021 -- NOT REL
d4 -- LA011190-0156 -- REL
d5 -- LA010289-0060 -- NOT REL
d6 -- LA012590-0067 -- REL

SET 2
d1 -- LA052189-0009 -- REL
d2 -- LA052189-0189 -- NOT REL
d3 -- LA052189-0196 -- NOT REL
d4 -- LA052589-0174 -- REL
d5 -- LA052590-0132 -- NOT REL
d6 -- LA052590-0204 -- REL

EXP 1A and 2A --> SET 1
EXP 1B and 2B --> SET 2
```
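For convenience in later cross-checks, the ground truth above can be encoded as a lookup table (document IDs and labels are copied from the listing; the variable name is illustrative):

```python
# Ground truth for Topic 418, encoded from the listing above.
# Keys are LA Times document IDs; values are binary relevance labels.
GROUND_TRUTH = {
    "SET1": {
        "LA010790-0121": True,   # d1 REL
        "LA010289-0001": False,  # d2 NOT REL
        "LA010289-0021": False,  # d3 NOT REL
        "LA011190-0156": True,   # d4 REL
        "LA010289-0060": False,  # d5 NOT REL
        "LA012590-0067": True,   # d6 REL
    },
    "SET2": {
        "LA052189-0009": True,   # d1 REL
        "LA052189-0189": False,  # d2 NOT REL
        "LA052189-0196": False,  # d3 NOT REL
        "LA052589-0174": True,   # d4 REL
        "LA052590-0132": False,  # d5 NOT REL
        "LA052590-0204": True,   # d6 REL
    },
}
```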

  • Each worker works on only ONE unit / HIT. Therefore, the worker-unit analysis does not apply here. We can instead do, e.g. for time, a worker-document analysis.

Findings

Data Quality / General Things
  • All workers were logged, no errors
Abandonment
  • Lower reward --> more people abandoned. Shorter length --> more people abandoned. But, in both cases the difference is small. (See also "Abandonment Stats")

(A and B comparisons)

  • Comparison of the means of sessionCount between the two populations, both in experiment 1 (1A and 1B) and experiment 2 (2A and 2B), indicates that the means are equal (or that there is no significant difference between them). That is, increasing the reward or the length of documents (in these experiments) did not change the distribution. (See also "1. Work Done")
  • Comparison of the means of the number of messages between the two populations, both in experiment 1 (1A and 1B) and experiment 2 (2A and 2B), indicates the same: the means are equal (or there is no significant difference between them). (See also "1. Work Done")
  • Comparison of means of time invested in session between the pairs of populations: in 1A and 1B, the two populations are significantly different, while 2A and 2B are not. (Note: I updated this 1A and 1B comparison after correcting a variable name.) (See also "2. Time Invested")

(C comparisons with quality checks)

  • Comparison of means of sessionCount between the pairs of populations (2B and 2C) indicates that the two samples are significantly different. Between 1B and 2C we don’t see a significant difference. (See also "1. Work Done")
  • Comparison of means of number of messages between the pairs of populations, (2B and 2C) indicates that the two samples are significantly different and (1B and 2C) are significantly different too. (See also "1. Work Done")
  • Comparison of means of time invested in session between the pairs of populations, (2B and 2C) indicates that the two samples are significantly different and (1B and 2C) are significantly different too. (See also "2. Time Invested")

Notes

  • Changes from the last version:
    • exp1 is exp2 and vice versa, because the original task and log files had inverted names.

Discussion

  • Interpretation: TBC
  • Implications: TBC
  • The limitations of this analysis are ... TBC

Code

#!pip install statsmodels
import pandas as pd
import matplotlib.pyplot as plt
import datetime
import numpy as np
%matplotlib inline
import seaborn as sns
from statsmodels.sandbox.stats.multicomp import multipletests 
/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/importlib/_bootstrap.py:222: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
  return f(*args, **kwds)
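multipletests is imported above but not used in the cells shown here; a minimal sketch of how it could correct the family of pairwise p-values produced by the comparisons in this notebook (the p-values below are placeholders, not real results):

```python
from statsmodels.sandbox.stats.multicomp import multipletests

# Hypothetical raw p-values from the pairwise comparisons
# (1A-1B, 2A-2B, 2B-2C, 1B-2C) -- placeholders, not the real results
pvals = [0.021, 0.300, 0.004, 0.048]

# Holm step-down correction controls the family-wise error rate
reject, pvals_corrected, _, _ = multipletests(pvals, alpha=0.05, method='holm')
for p_raw, p_adj, rej in zip(pvals, pvals_corrected, reject):
    print('raw p=%.3f  adjusted p=%.3f  reject H0: %s' % (p_raw, p_adj, rej))
```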
# Data Directories
dataDir = ""
taskDir = dataDir + "F8tasks/"
logDir = dataDir + "logs/"


#dataDir = "/Users/sarasua/Documents/RESEARCH/collab_Abandonment/controlledexperiment/results/"
#taskDir = dataDir + "task_res/"
#logDir = dataDir + "logs/"

# Concrete Task Files 
#original files are inverted, what is called LEN is REWARD and what is called REWARD is LEN! looking at the number of judgments we can see that
#REWARD
fExp1A = taskDir + "CONTROLLED_EXPFIXED_LEN_PAY10_DOCSET1.json"
fExp1B = taskDir + "CONTROLLED_EXPFIXED_LEN_PAY30_DOCSET2.json"
#LENGTH
fExp2A = taskDir + "CONTROLLED_EXPFIXED_REWARD_PAY15_DOCSET1.json" 
fExp2B = taskDir + "CONTROLLED_EXPFIXED_REWARD_PAY30_DOCSET2.json"

# Concrete Logged Events Files
#original files are inverted, what is called LEN is REWARD and what is called REWARD is LEN! looking at the number of judgments we can see that

#REWARD
fLog1A = logDir + "CONTROLLED_EXPFIXED_LEN_PAY10_DOCSET1.json"
fLog1B = logDir + "CONTROLLED_EXPFIXED_LEN_PAY30_DOCSET2.json"
fLog1C = logDir + "CONTROLLED_EXPFIXED_LEN_QUALITY_PAY30_DOCSET1.json" 
#LENGTH
fLog2A = logDir + "CONTROLLED_EXPFIXED_REWARD_PAY15_DOCSET1.json"
fLog2B = logDir + "CONTROLLED_EXPFIXED_REWARD_PAY30_DOCSET2.json"

Data Preprocessing

TODO: Data Preprocessing Tasks

  • Load data
  • Separate abandoned and submitted people, like Lei did
  • If we want to compare with the other batch of tasks (aka the experiment in the wild), we need to rescale the relevance judgments, because the first experiment used a 4-level scale and this one uses a 2-level scale
  • Merge the files into the DF that I am interested in
  • Do the split of groups in a similar way - but how to analyse them later?
  • Cross-check: did they ensure that experiments A and B always have disjoint workers? That would be a between-subjects experiment design
exp1A = pd.read_json(path_or_buf = fExp1A, lines = True, encoding = 'utf-8', orient = "records")
exp1B = pd.read_json(path_or_buf = fExp1B, lines = True, encoding = 'utf-8', orient = "records")
exp2A = pd.read_json(path_or_buf = fExp2A, lines = True, encoding = 'utf-8', orient = "records")
exp2B = pd.read_json(path_or_buf = fExp2B, lines = True, encoding = 'utf-8', orient = "records")

log1A = pd.read_json(path_or_buf = fLog1A, lines = True, encoding = 'utf-8', orient = "records")
log1B = pd.read_json(path_or_buf = fLog1B, lines = True, encoding = 'utf-8', orient = "records")
log1C = pd.read_json(path_or_buf = fLog1C, lines = True, encoding = 'utf-8', orient = "records")
log2A = pd.read_json(path_or_buf = fLog2A, lines = True, encoding = 'utf-8', orient = "records")
log2B = pd.read_json(path_or_buf = fLog2B, lines = True, encoding = 'utf-8', orient = "records")
# Data Format & Content Exploration - with example exp1A 
exp1A.head()
agreement created_at data gold_pool id job_id judgments_count missed_count results state updated_at
0 1 2018-07-25 11:15:26 {'unit_id': '1'} NaN 1829425306 1286626 1 0 {'judgments': [{'created_at': '2018-07-25T12:0... finalized 2018-07-25 12:08:15
1 1 2018-07-25 11:15:26 {'unit_id': '2'} NaN 1829425307 1286626 1 0 {'judgments': [{'created_at': '2018-07-25T12:1... finalized 2018-07-25 12:16:04
2 1 2018-07-25 11:15:26 {'unit_id': '3'} NaN 1829425308 1286626 1 0 {'judgments': [{'created_at': '2018-07-25T12:0... finalized 2018-07-25 12:09:02
3 1 2018-07-25 11:15:26 {'unit_id': '4'} NaN 1829425309 1286626 1 0 {'judgments': [{'created_at': '2018-07-25T13:2... finalized 2018-07-25 13:29:20
4 1 2018-07-25 11:15:26 {'unit_id': '5'} NaN 1829425310 1286626 1 0 {'judgments': [{'created_at': '2018-07-25T12:0... finalized 2018-07-25 12:08:20
# Create the column unit_id extracted from the data dict
def extractUnitId(row):
    resDic = row['data']    
    unitId = resDic['unit_id']
    return unitId
exp1A['unit_id'] = exp1A.apply(extractUnitId,axis=1)
exp1B['unit_id'] = exp1B.apply(extractUnitId,axis=1)
exp2A['unit_id'] = exp2A.apply(extractUnitId,axis=1)
exp2B['unit_id'] = exp2B.apply(extractUnitId,axis=1)
# Create the column worker_id extracted from the results dict
def extractWorkerId(row):
    resDic = row['results']
    workerId = resDic['judgments'][0]['worker_id']
    if(len(resDic['judgments']) > 1):
        print('One worker with more than one judgment! '+ str(workerId))
    
    return workerId
    
exp1A['worker_id'] = exp1A.apply(extractWorkerId,axis=1)
exp1B['worker_id'] = exp1B.apply(extractWorkerId,axis=1)
exp2A['worker_id'] = exp2A.apply(extractWorkerId,axis=1)
exp2B['worker_id'] = exp2B.apply(extractWorkerId,axis=1)
# Data Format & Content Exploration - with example log1A 
log1A.head()
judgments message pay server_time session_id step steps task_id time times worker_id
0 [None, -1, -1, -1, -1, -1, -1] Start 10 2018-07-25 11:27:40.008932 0.f03mqh9l 1 [1] 1286626.0 1532518056947 [None, 0, 0, 0, 0, 0, 0] 44031296
1 [None, -1, -1, -1, -1, -1, -1] Start 10 2018-07-25 11:28:02.781684 0.euhmtag5 1 [1] 1286626.0 1532518078966 [None, 0, 0, 0, 0, 0, 0] 43415523
2 [None, -1, -1, -1, -1, -1, -1] Start 10 2018-07-25 11:28:09.870873 0.ir8afh08 1 [1] 1286626.0 1532518090184 [None, 0, 0, 0, 0, 0, 0] 6335115
3 [None, -1, -1, -1, -1, -1, -1] Start 10 2018-07-25 11:28:13.668804 0.mlxv7rso 1 [1] 1286626.0 1532518080321 [None, 0, 0, 0, 0, 0, 0] 44643986
4 [None, -1, -1, -1, -1, -1, -1] Start 10 2018-07-25 11:28:18.000698 0.mfy9iuvp 1 [1] 1286626.0 1532518095276 [None, 0, 0, 0, 0, 0, 0] 42787196

Explanation by Kevin:

```
var final_log = {
  "session_id": session_id,    // unique session id (to capture page refresh)
  "message": String(message),  // message that triggered the log -- see below
  "worker_id": worker_id,      // worker id
  "task_id": task_id,          // task_id
  "time": Date.now(),          // time of sending the log
  "step": step,                // step into the task (i.e., 1,2,3,4... no_docs, paystep)
  "judgments": judgments,      // array of judgments -- starts at 1 (0 is null)
  "times": times,              // array of times for the judgments -- starts at 1 (0 is null)
  "steps": steps,              // array of steps into the task; e.g., if the worker pressed back at step 2 the array is 1,2,1,2,3,...
};
```

the message-set is:

  • nextButton
  • backButton
  • Final_OK --> task concluded successfully
  • paying --> paying the worker
  • Start --> start task
  • MW Worker Rejected:'+worker_id --> a blacklisted worker tried to start the task
  • MWorker ok --> opposite of the previous one (not sure if present)
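The message vocabulary above is enough to separate completed from abandoned sessions; a sketch on a made-up log with the same schema:

```python
import pandas as pd

# Toy log with the schema described above (worker ids are made up)
toy_log = pd.DataFrame({
    'worker_id': [1, 1, 1, 2, 2, 2, 2, 3],
    'message':   ['Start', 'nextButton', 'backButton',
                  'Start', 'nextButton', 'Final_OK', 'paying',
                  'Start'],
})

# Workers whose log contains 'paying' completed the task; the rest abandoned
completed = toy_log.groupby('worker_id')['message'].apply(lambda m: (m == 'paying').any())
print(completed)
```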
log1A.message.unique()
array(['Start', 'nextButton', 'backButton', 'Final_OK', 'paying'],
      dtype=object)
sessions = log1A[['worker_id', 'session_id']]
sessions.groupby(['worker_id']).size().unique()
array([ 1,  9, 11,  2, 10,  8, 19, 17,  6, 13, 14])
import json
def getJudgments(row):
    text_result = row['results']['judgments'][0]['data']['text_result']
    textrjson = json.loads(text_result)
    judgments = textrjson['judgments']
     # return pd.Series(judgments) OK but just the array expects the shape of the original data frame calling the apply
    return len(judgments)
# Helpers
def countJudgments(row):
    #return len(row['judgments'])
    return row['judgments_count'] # it's wrapped in the judgments - data
# Cross checks & Basic stats - units per people etc. Global and separating people? 
def checkTask(taskDf):
    
    # checking published config
    print('total number of HITs:' + str(len(taskDf)))
    # KO print('number of judgments per HIT' + str(taskDf.results.map(lambda x: len(x)).max()))   
    nulls = pd.isnull(taskDf)
    
    # missing values
       
    print('Empty value in data column: ' + str(len(nulls.loc[nulls['data'] == True])) + ' out of '+ str(len(nulls['data'])))
    print('Empty value in results column: ' + str(len(nulls.loc[nulls['results'] == True])) + ' out of '+ str(len(nulls['results'])))
    print('Empty value in created_at column: ' + str(len(nulls.loc[nulls['created_at'] == True])) + ' out of '+ str(len(nulls['created_at'])))
    print('Empty value in updated_at column: ' + str(len(nulls.loc[nulls['updated_at'] == True])) + ' out of '+ str(len(nulls['updated_at'])))
    print('Empty value in id column: ' + str(len(nulls.loc[nulls['id'] == True])) + ' out of '+ str(len(nulls['id'])))
    print('Empty value in job_id column: ' + str(len(nulls.loc[nulls['job_id'] == True])) + ' out of '+ str(len(nulls['job_id'])))
    print('Empty value in worker_id column: ' + str(len(nulls.loc[nulls['worker_id'] == True])) + ' out of '+ str(len(nulls['worker_id'])))
    print('Empty value in unit_id column: ' + str(len(nulls.loc[nulls['unit_id'] == True])) + ' out of '+ str(len(nulls['unit_id'])))
    
    
    
    # counts
    print('Total number of workers: ' + str(taskDf['worker_id'].nunique()))
    print('Total number of units - they are judgments: ' + str(taskDf['unit_id'].nunique())) 
    print('AVG Number of units per worker: '+ str(taskDf.groupby(['worker_id'])['unit_id'].nunique().mean()) + ' Max Number of units per worker: '+ str(taskDf.groupby(['worker_id'])['unit_id'].nunique().max()) )
    print('Number of judgments per worker: ' )
    # when returning an array, apply takes the length of the DF here (pandas) - print(len(taskDf.columns))
    judgmentsCount = taskDf.apply(getJudgments,axis=1)
    print(judgmentsCount.describe())
        
   
exp1A['results'][0]['judgments'][0]['data']['text_result'] # this one gives an array of 4?
'{"session_id":"0.23pxaqv8","message":"paying","worker_id":37101159,"task_id":1286626,"time":1532520481330,"step":8,"judgments":[null,"0","0","0","0","0","0"],"times":[null,149.016,15.209,12.168,25.372,63.684,20.503],"steps":[1,2,3,4,5,6,7,8],"pay":10}'
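The text_result payload is itself a JSON string; parsing it recovers the per-document judgments and times (the string below is the one printed above):

```python
import json

# The text_result string shown above, split for readability
text_result = ('{"session_id":"0.23pxaqv8","message":"paying","worker_id":37101159,'
               '"task_id":1286626,"time":1532520481330,"step":8,'
               '"judgments":[null,"0","0","0","0","0","0"],'
               '"times":[null,149.016,15.209,12.168,25.372,63.684,20.503],'
               '"steps":[1,2,3,4,5,6,7,8],"pay":10}')

parsed = json.loads(text_result)
# Index 0 is a null placeholder, so the worker judged len - 1 documents
print(len(parsed['judgments']) - 1)  # 6
print(parsed['times'][1])            # time spent on the first document
```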
checkTask(exp1A) # is the title of the files misleading? From the number of judgments sent by workers, it looks like exp1A is the length one
checkTask(exp1B)
checkTask(exp2A)
checkTask(exp2B)
total number of HITs:100
Empty value in data column: 0 out of 100
Empty value in results column: 0 out of 100
Empty value in created_at column: 0 out of 100
Empty value in updated_at column: 0 out of 100
Empty value in id column: 0 out of 100
Empty value in job_id column: 0 out of 100
Empty value in worker_id column: 0 out of 100
Empty value in unit_id column: 0 out of 100
Total number of workers: 100
Total number of units - they are judgments: 100
AVG Number of units per worker: 1.0 Max Number of units per worker: 1
Number of judgments per worker: 
count    100.0
mean       7.0
std        0.0
min        7.0
25%        7.0
50%        7.0
75%        7.0
max        7.0
dtype: float64
total number of HITs:100
Empty value in data column: 0 out of 100
Empty value in results column: 0 out of 100
Empty value in created_at column: 0 out of 100
Empty value in updated_at column: 0 out of 100
Empty value in id column: 0 out of 100
Empty value in job_id column: 0 out of 100
Empty value in worker_id column: 0 out of 100
Empty value in unit_id column: 0 out of 100
Total number of workers: 100
Total number of units - they are judgments: 100
AVG Number of units per worker: 1.0 Max Number of units per worker: 1
Number of judgments per worker: 
count    100.0
mean       7.0
std        0.0
min        7.0
25%        7.0
50%        7.0
75%        7.0
max        7.0
dtype: float64
total number of HITs:100
Empty value in data column: 0 out of 100
Empty value in results column: 0 out of 100
Empty value in created_at column: 0 out of 100
Empty value in updated_at column: 0 out of 100
Empty value in id column: 0 out of 100
Empty value in job_id column: 0 out of 100
Empty value in worker_id column: 0 out of 100
Empty value in unit_id column: 0 out of 100
Total number of workers: 100
Total number of units - they are judgments: 100
AVG Number of units per worker: 1.0 Max Number of units per worker: 1
Number of judgments per worker: 
count    100.0
mean       4.0
std        0.0
min        4.0
25%        4.0
50%        4.0
75%        4.0
max        4.0
dtype: float64
total number of HITs:100
Empty value in data column: 0 out of 100
Empty value in results column: 0 out of 100
Empty value in created_at column: 0 out of 100
Empty value in updated_at column: 0 out of 100
Empty value in id column: 0 out of 100
Empty value in job_id column: 0 out of 100
Empty value in worker_id column: 0 out of 100
Empty value in unit_id column: 0 out of 100
Total number of workers: 100
Total number of units - they are judgments: 100
AVG Number of units per worker: 1.0 Max Number of units per worker: 1
Number of judgments per worker: 
count    100.0
mean       7.0
std        0.0
min        7.0
25%        7.0
50%        7.0
75%        7.0
max        7.0
dtype: float64
def checkLog(logDf):
    # missing values
    nulls = pd.isnull(logDf)
    print('Empty value in message column: ' + str(len(nulls.loc[nulls['message'] == True])) + ' out of '+ str(len(nulls['message'])))
    print('Empty value in session_id column: ' + str(len(nulls.loc[nulls['session_id'] == True])) + ' out of '+ str(len(nulls['session_id'])))
    print('Empty value in task_id column: ' + str(len(nulls.loc[nulls['task_id'] == True])) + ' out of '+ str(len(nulls['task_id'])))
    print('Empty value in time column: ' + str(len(nulls.loc[nulls['time'] == True])) + ' out of '+ str(len(nulls['time'])))
    print('Empty value in times column: ' + str(len(nulls.loc[nulls['times'] == True])) + ' out of '+ str(len(nulls['times'])))
    print('Empty value in worker_id column: ' + str(len(nulls.loc[nulls['worker_id'] == True])) + ' out of '+ str(len(nulls['worker_id'])))
    print('Empty value in pay column: ' + str(len(nulls.loc[nulls['pay'] == True])) + ' out of '+ str(len(nulls['pay'])))
    print('Empty value in judgments column: ' + str(len(nulls.loc[nulls['judgments'] == True])) + ' out of '+ str(len(nulls['judgments'])))

    # counts
    print('Total number of workers: ' + str(logDf['worker_id'].nunique()))
    print('Total number of tasks: ' + str(logDf['task_id'].nunique())) # task = unit
    print('AVG Number of sessions per worker: '+ str(logDf.groupby(['worker_id'])['session_id'].nunique().mean()) + ' Max Number of sessions per worker: '+ str(logDf.groupby(['worker_id'])['session_id'].nunique().max()) )
    print('AVG Number of tasks per worker: '+ str(logDf.groupby(['worker_id'])['task_id'].nunique().mean()) + ' Max Number of tasks per worker: '+ str(logDf.groupby(['worker_id'])['task_id'].nunique().max()) )
checkLog(log1A)
checkLog(log1B)
checkLog(log1C)
checkLog(log2A)
checkLog(log2B)
Empty value in message column: 0 out of 1089
Empty value in session_id column: 0 out of 1089
Empty value in task_id column: 1 out of 1089
Empty value in time column: 0 out of 1089
Empty value in times column: 0 out of 1089
Empty value in worker_id column: 0 out of 1089
Empty value in pay column: 0 out of 1089
Empty value in judgments column: 0 out of 1089
Total number of workers: 207
Total number of tasks: 1
AVG Number of sessions per worker: 1.1159420289855073 Max Number of sessions per worker: 6
AVG Number of tasks per worker: 0.9951690821256038 Max Number of tasks per worker: 1
Empty value in message column: 0 out of 1146
Empty value in session_id column: 0 out of 1146
Empty value in task_id column: 1 out of 1146
Empty value in time column: 0 out of 1146
Empty value in times column: 0 out of 1146
Empty value in worker_id column: 0 out of 1146
Empty value in pay column: 0 out of 1146
Empty value in judgments column: 0 out of 1146
Total number of workers: 178
Total number of tasks: 1
AVG Number of sessions per worker: 1.1629213483146068 Max Number of sessions per worker: 4
AVG Number of tasks per worker: 0.9943820224719101 Max Number of tasks per worker: 1
Empty value in message column: 0 out of 3592
Empty value in session_id column: 0 out of 3592
Empty value in task_id column: 1 out of 3592
Empty value in time column: 0 out of 3592
Empty value in times column: 0 out of 3592
Empty value in worker_id column: 1 out of 3592
Empty value in pay column: 0 out of 3592
Empty value in judgments column: 0 out of 3592
Total number of workers: 318
Total number of tasks: 1
AVG Number of sessions per worker: 1.8962264150943395 Max Number of sessions per worker: 210
AVG Number of tasks per worker: 1.0 Max Number of tasks per worker: 1
Empty value in message column: 0 out of 790
Empty value in session_id column: 0 out of 790
Empty value in task_id column: 0 out of 790
Empty value in time column: 0 out of 790
Empty value in times column: 0 out of 790
Empty value in worker_id column: 0 out of 790
Empty value in pay column: 0 out of 790
Empty value in judgments column: 0 out of 790
Total number of workers: 209
Total number of tasks: 1
AVG Number of sessions per worker: 1.1961722488038278 Max Number of sessions per worker: 6
AVG Number of tasks per worker: 1.0 Max Number of tasks per worker: 1
Empty value in message column: 0 out of 1129
Empty value in session_id column: 0 out of 1129
Empty value in task_id column: 0 out of 1129
Empty value in time column: 0 out of 1129
Empty value in times column: 0 out of 1129
Empty value in worker_id column: 0 out of 1129
Empty value in pay column: 0 out of 1129
Empty value in judgments column: 0 out of 1129
Total number of workers: 202
Total number of tasks: 1
AVG Number of sessions per worker: 1.2227722772277227 Max Number of sessions per worker: 8
AVG Number of tasks per worker: 1.0 Max Number of tasks per worker: 1
def checkTaskJobJointly(taskDf, logDf):
    
    abandonedDf = logDf[~logDf['worker_id'].isin(taskDf['worker_id'])]
    completedDf = logDf[logDf['worker_id'].isin(taskDf['worker_id'])]
    
    # all the answers in the task completion report are also in the log data set
    print('Number of people who abandoned: ' + str(len(logDf['worker_id'][~logDf['worker_id'].isin(taskDf['worker_id'])].unique())) ) #+ ' and they are IDs: '+  str(logDf['worker_id'][~logDf['worker_id'].isin(taskDf['worker_id'])])
    print('Number of people who submitted: ' + str(len(logDf['worker_id'][logDf['worker_id'].isin(taskDf['worker_id'])].unique())) ) #+ ' and they are IDs: '+  str(logDf['worker_id'][logDf['worker_id'].isin(taskDf['worker_id'])])
    print('Number of people who were not logged: ' + str(len(taskDf['worker_id'][~taskDf['worker_id'].isin(logDf['worker_id'])].unique())) )
    print('*Total number of workers in Task*: '+ str(taskDf['worker_id'].nunique()))
    print('*Total number of workers in Log*: '+ str(logDf['worker_id'].nunique()))
    
    return abandonedDf, completedDf
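The abandoned/completed split above is an anti-join on worker_id; the same logic on made-up frames:

```python
import pandas as pd

# Toy task results (submitted workers) and log (all workers who started)
task_df = pd.DataFrame({'worker_id': [1, 2]})
log_df = pd.DataFrame({'worker_id': [1, 1, 2, 3, 4]})

# Workers who appear in the log but not in the task results abandoned
abandoned = log_df[~log_df['worker_id'].isin(task_df['worker_id'])]
completed = log_df[log_df['worker_id'].isin(task_df['worker_id'])]
print(sorted(abandoned['worker_id'].unique()))  # [3, 4]
print(sorted(completed['worker_id'].unique()))  # [1, 2]
```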
   
    

Abandonment Stats

print('--- Experiment 1  ------------------------')
print('--- (A) ------------------------')
aban_1A, complet_1A = checkTaskJobJointly(exp1A, log1A)
print('--- (B) ------------------------')
aban_1B, complet_1B = checkTaskJobJointly(exp1B, log1B)
print('--- (C) ------------------------')
aban_1C, complet_1C = checkTaskJobJointly(exp1B, log1C) 
print('--- Experiment 2  ------------------------')
print('--- (A) ------------------------')
aban_2A, complet_2A = checkTaskJobJointly(exp2A, log2A)
print('--- (B) ------------------------')
aban_2B, complet_2B = checkTaskJobJointly(exp2B, log2B)
print('--- (C) ------------------------')
aban_2C, complet_2C = checkTaskJobJointly(exp2B, log1B)
--- Experiment 1  ------------------------
--- (A) ------------------------
Number of people who abandoned: 107
Number of people who submitted: 100
Number of people who were not logged: 0
*Total number of workers in Task*: 100
*Total number of workers in Log*: 207
--- (B) ------------------------
Number of people who abandoned: 78
Number of people who submitted: 100
Number of people who were not logged: 0
*Total number of workers in Task*: 100
*Total number of workers in Log*: 178
--- (C) ------------------------
Number of people who abandoned: 292
Number of people who submitted: 27
Number of people who were not logged: 73
*Total number of workers in Task*: 100
*Total number of workers in Log*: 318
--- Experiment 2  ------------------------
--- (A) ------------------------
Number of people who abandoned: 109
Number of people who submitted: 100
Number of people who were not logged: 0
*Total number of workers in Task*: 100
*Total number of workers in Log*: 209
--- (B) ------------------------
Number of people who abandoned: 102
Number of people who submitted: 100
Number of people who were not logged: 0
*Total number of workers in Task*: 100
*Total number of workers in Log*: 202
--- (C) ------------------------
Number of people who abandoned: 177
Number of people who submitted: 1
Number of people who were not logged: 99
*Total number of workers in Task*: 100
*Total number of workers in Log*: 178

Building the 4 groups of people:

Focus is on the log files, filtering in one way or the other.

# Get the two subgroups of abandoned workers: those who abandoned right away and those who abandoned after restarting -- more than one session
def abandSpec(df):
    # (!!) Pandas passes the first group through twice
    dfG = df.groupby(['worker_id'])
    abanA = dfG.filter(lambda x: len(x['session_id'].unique()) == 1)
    abanB = dfG.filter(lambda x: len(x['session_id'].unique()) > 1)
    return abanA,abanB
# Get the two subgroups of completed workers: those who submitted answers right away and those who submitted after restarting -- more than one session
# Coded as a separate method for extensibility reasons
def completSpec(df):
    # (!!) Pandas passes the first group through twice
    dfG = df.groupby(['worker_id'])
    complA = dfG.filter(lambda x: len(x['session_id'].unique()) == 1)
    complB = dfG.filter(lambda x: len(x['session_id'].unique()) > 1)
    return complA,complB
    
    
# Get all the concrete subsets for all versions of the two controlled experiments.

# Experiment 1 (A,B settings)
abanA_1A,abanB_1A = abandSpec(aban_1A)
completA_1A,completB_1A = completSpec(complet_1A)

abanA_1B,abanB_1B = abandSpec(aban_1B)
completA_1B,completB_1B = completSpec(complet_1B)

# Experiment 2 (A,B settings)
abanA_2A,abanB_2A = abandSpec(aban_2A)
completA_2A,completB_2A = completSpec(complet_2A)

abanA_2B,abanB_2B = abandSpec(aban_2B)
completA_2B,completB_2B = completSpec(complet_2B)
# Cross-check - CORRECT
print('abandoned subgroups 1A')
print(abanA_1A.worker_id.nunique() + abanB_1A.worker_id.nunique())
print('abandoned subgroups 1B')
print(abanA_1B.worker_id.nunique() + abanB_1B.worker_id.nunique())
print('abandoned subgroups 2A')
print(abanA_2A.worker_id.nunique() + abanB_2A.worker_id.nunique())
print('abandoned subgroups 2B')
print(abanA_2B.worker_id.nunique() + abanB_2B.worker_id.nunique())

print('completed subgroups 1A')
print(completA_1A.worker_id.nunique() + completB_1A.worker_id.nunique())
print('completed subgroups 1B')
print(completA_1B.worker_id.nunique() + completB_1B.worker_id.nunique())
print('completed subgroups 2A')
print(completA_2A.worker_id.nunique() + completB_2A.worker_id.nunique())
print('completed subgroups 2B')
print(completA_2B.worker_id.nunique() + completB_2B.worker_id.nunique())
abandoned subgroups 1A
107
abandoned subgroups 1B
78
abandoned subgroups 2A
109
abandoned subgroups 2B
102
completed subgroups 1A
100
completed subgroups 1B
100
completed subgroups 2A
100
completed subgroups 2B
100
# ----- Testing Pandas
#log1A[log1A['worker_id']==41202032]
#d = log1A.sort_values(by=['worker_id'])
#d.head(100)
#log1Ag = log1A.groupby(['worker_id'])
#abb = log1Ag.filter(lambda x: len(x['session_id'].unique()) > 1)
#abb
#abb.groupby(['worker_id']).get_group(41202032)
#abb.groupby(['worker_id']).get_group(6476374) #- does not find it - it's correct
# --
# a = [1,2,3]
# b = [2,3,4]
# data = pd.DataFrame()
# data['a'] = pd.Series(a)
# data['b'] = pd.Series(b)
# data.head()
# data['a'][~data['a'].isin(data['b'])]
# data['a'][data['a'].isin(data['b'])]
# data['a'].isin(data['b'])
# data[~data['a'].isin(data['b'])]
# ----------- end of testing Pandas

Experiment-based Hypotheses

Normality test and statistical tests to analyse the difference between the means (in measurement X) of two populations (experiment in setting A and experiment in setting B).

# There is no worker that appears in both settings (A and B)
print(len(exp1A[exp1A['worker_id'].isin(exp1B['worker_id'])]))
print(len(exp2A[exp2A['worker_id'].isin(exp2B['worker_id'])]))
0
0
print(len(log1A[log1A['worker_id'].isin(log1B['worker_id'])]))
print(len(log2A[log2A['worker_id'].isin(log2B['worker_id'])]))
print(len(log1B[log1B['worker_id'].isin(log1C['worker_id'])]))
0
0
0
from scipy.stats import shapiro
from scipy.stats import normaltest
from scipy.stats import anderson

# Input: series has the sample whose distribution we want to test
# Output: gaussian boolean True if it is normal distribution and False otherwise.
def testNormality(series):
    
    alpha = 0.05
    gaussian = False
    
    # Normal only if all executed tests say normal; if any test fails, it is NOT normal. (Anderson-Darling is currently disabled.)
    
    # Shapiro-Wilk Test - for smaller data sets around thousands of records
    print('length of series in Shapiro is: '+ str(len(series)))
    stats1, p1 = shapiro(series)
    print('Statistics Shapiro-Wilk Test =%.3f, p=%.3f' % (stats1, p1))
    gaussian1 = p1 > alpha
    print('Shapiro-Wilk says it is Normal ' + str(gaussian1))
    
    # D'Agostino and Pearson's Test
    stats2, p2 = normaltest(series)
    print('Statistics D\'Agostino and Pearson\'s Test=%.3f, p=%.3f' % (stats2, p2))
    gaussian2 = p2 > alpha
    print('D\'Agostino and Pearson\'s says it is Normal ' + str(gaussian2))
    
    # normal only if both executed tests say normal
    gaussian = gaussian1 and gaussian2
    
    # Anderson-Darling Test
    '''result = anderson(series) 
    print('Statistic: %.3f' % result.statistic)
    for i in range(len(result.critical_values)):
        sl, cv = result.significance_level[i], result.critical_values[i]
        if result.statistic > result.critical_values[i]:
            gaussian = False'''
        
    
    return gaussian
    
from scipy.stats import ttest_ind
from scipy.stats import mannwhitneyu

# Input:
# series1 is the series with the set of measurements for every single worker in case A of controlled experiment
# series2 is the series with the set of measurements for every single worker in case B of controlled experiment
# gaussian is the boolean value indicating if the samples have passed the test of normality or not (True is apply parametric test)
# Output:
# stats of statistical test 
# p-value 
# acceptH0 (True if we fail to reject it and False if we reject it) 
# See also all tables for picking the tests (e.g., https://help.xlstat.com/customer/en/portal/articles/2062457-which-statistical-test-should-you-use-)
def compareTwoSamples(series1,series2, gaussian):
    # Tests to compare two samples (H0: they have equal distribution; H1: they have different distribution)
    
    alpha = 0.05
    acceptH0 = False
    
    if gaussian:
        # Run Student's t-test
        stats, p = ttest_ind(series1, series2)
        print('Statistics=%.3f, p=%.3f' % (stats, p))
    else:
        # Run the Mann-Whitney U test; the Kruskal-Wallis test would be for more than two samples.
        stats, p = mannwhitneyu(series1, series2)
        print('Statistics=%.3f, p=%.3f' % (stats, p))

    if p > alpha:
        acceptH0 = True
    
    print('The two samples have the same distribution: ' + str(acceptH0))
    return stats,p,acceptH0        
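As a self-contained illustration of the pipeline these two helpers implement (normality check first, then a parametric or non-parametric comparison), here is a minimal sketch on synthetic data; the sample sizes and distributions are illustrative only.

```python
# Minimal sketch: check normality of both samples, then pick the test accordingly.
import numpy as np
from scipy.stats import shapiro, ttest_ind, mannwhitneyu

rng = np.random.RandomState(0)
a = rng.exponential(scale=1.0, size=200)  # clearly non-normal sample
b = rng.exponential(scale=1.2, size=200)

alpha = 0.05
_, p_norm_a = shapiro(a)
_, p_norm_b = shapiro(b)
gaussian = (p_norm_a > alpha) and (p_norm_b > alpha)

if gaussian:
    stat, p = ttest_ind(a, b)         # parametric comparison
else:
    stat, p = mannwhitneyu(a, b)      # non-parametric comparison

print('gaussian=%s, p=%.3f' % (gaussian, p))
```

With exponential samples the Shapiro-Wilk test rejects normality, so the Mann-Whitney branch is taken, mirroring what happens with the session and message counts below.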
        
    
    
    

1. Work Done

Idea: People who abandon *try longer* when they see more value / potential in the HIT. The more reward / the more documents, the more value the HIT has for a worker. Workers may abandon out of fear of getting a "bad reputation", but when the reward is higher, the extrinsic motivation is stronger, and one could expect them to try longer (either clicking through the answers or restarting the process after having closed it).

  • We have two pairs of populations to compare:
    • Experiment 1 (A and B) and
    • Experiment 2 (A and B)
  • We measure "trying longer" using two different measurements:
    • The number of sessions: start, leave, start again
    • The number of messages: they go further in the process (e.g., they click on many answers instead of staying at the start)

Functions to compute the measurements

def getSessionCount(df):
    # Number of distinct sessions per worker
    dfG = df.groupby(['worker_id'])
    sessionCounts = dfG.apply(lambda x: len(x['session_id'].unique()))
    sessionCountsRI = sessionCounts.reset_index()
    del sessionCountsRI['worker_id']
    sessionCountsRI.columns = ['sessionCount']
    return sessionCountsRI

def getMessageCount(df):
    # Number of logged messages per worker
    dfG = df.groupby(['worker_id'])
    messageCounts = dfG.apply(lambda x: len(x['message']))
    messageCountsRI = messageCounts.reset_index()
    del messageCountsRI['worker_id']
    messageCountsRI.columns = ['messageCount']
    return messageCountsRI
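Both helpers reduce to one-line groupby aggregations; a quick sanity check on a toy log (the values are illustrative, the column names follow the log schema used above):

```python
# Toy log: worker 1 opens two sessions and logs three messages,
# worker 2 opens one session and logs two messages.
import pandas as pd

toy = pd.DataFrame({
    'worker_id':  [1, 1, 1, 2, 2],
    'session_id': ['s1', 's1', 's2', 's3', 's3'],
    'message':    ['start', 'click', 'start', 'start', 'leave'],
})

sessions = toy.groupby('worker_id')['session_id'].nunique()  # distinct sessions per worker
messages = toy.groupby('worker_id')['message'].size()        # logged messages per worker

print(sessions.tolist())  # [2, 1]
print(messages.tolist())  # [3, 2]
```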

SessionCount

sessionC_aban_1A= getSessionCount(aban_1A)
sessionC_aban_1B= getSessionCount(aban_1B)
sessionC_aban_1C= getSessionCount(aban_1C)

sessionC_aban_2A= getSessionCount(aban_2A)
sessionC_aban_2B= getSessionCount(aban_2B)
print(sessionC_aban_1A.describe())
print(sessionC_aban_1B.describe())
print(sessionC_aban_1C.describe())

print(sessionC_aban_2A.describe())
print(sessionC_aban_2B.describe())
       sessionCount
count    107.000000
mean       1.102804
std        0.530834
min        1.000000
25%        1.000000
50%        1.000000
75%        1.000000
max        6.000000
       sessionCount
count     78.000000
mean       1.102564
std        0.444000
min        1.000000
25%        1.000000
50%        1.000000
75%        1.000000
max        4.000000
       sessionCount
count    291.000000
mean       1.951890
std       12.251576
min        1.000000
25%        1.000000
50%        1.000000
75%        1.000000
max      210.000000
       sessionCount
count    109.000000
mean       1.192661
std        0.440477
min        1.000000
25%        1.000000
50%        1.000000
75%        1.000000
max        3.000000
       sessionCount
count    102.000000
mean       1.196078
std        0.783988
min        1.000000
25%        1.000000
50%        1.000000
75%        1.000000
max        8.000000

Are the populations (in both pairs) normally distributed?

norm_sessionC_aban_1A = testNormality(sessionC_aban_1A)
print("final: " + str(norm_sessionC_aban_1A))
norm_sessionC_aban_1B = testNormality(sessionC_aban_1B)
print("final: " + str(norm_sessionC_aban_1B))
norm_sessionC_aban_1C = testNormality(sessionC_aban_1C)
print("final: " + str(norm_sessionC_aban_1C))
norm_sessionC_aban_2A = testNormality(sessionC_aban_2A)
print("final: " + str(norm_sessionC_aban_2A))
norm_sessionC_aban_2B = testNormality(sessionC_aban_2B)
print("final: " + str(norm_sessionC_aban_2B))
length of series in Shapiro is: 107
Statistics Shapiro-Wilk Test =0.190, p=0.000
Shapiro.Wilk says it is Normal False
Statistics D'Agostino and Pearson's Test=197.283, p=0.000
final: False
length of series in Shapiro is: 78
Statistics Shapiro-Wilk Test =0.249, p=0.000
Shapiro.Wilk says it is Normal False
Statistics D'Agostino and Pearson's Test=115.498, p=0.000
final: False
length of series in Shapiro is: 291
Statistics Shapiro-Wilk Test =0.045, p=0.000
Shapiro.Wilk says it is Normal False
Statistics D'Agostino and Pearson's Test=668.252, p=0.000
final: False
length of series in Shapiro is: 109
Statistics Shapiro-Wilk Test =0.476, p=0.000
Shapiro.Wilk says it is Normal False
Statistics D'Agostino and Pearson's Test=60.346, p=0.000
final: False
length of series in Shapiro is: 102
Statistics Shapiro-Wilk Test =0.261, p=0.000
Shapiro.Wilk says it is Normal False
Statistics D'Agostino and Pearson's Test=176.924, p=0.000
final: False

Exp1_H0: means of sessionCount are equal in both populations

Exp1_H1: means of sessionCount are not equal in both populations

normal = norm_sessionC_aban_1A and norm_sessionC_aban_1B
print('Abandoned 1A and 1B')
stats, p1sc, accept1sc = compareTwoSamples(sessionC_aban_1A, sessionC_aban_1B, normal )
Abandoned 1A and 1B
Statistics=4171.000, p=0.496
The two samples have the same distribution: True
ax = sns.distplot(sessionC_aban_1A)
ax.set_xlabel("Number of sessions per worker")
ax.set_ylabel("Frequency")
ax.set_title("Distribution of session count abandoned group Experiment 1 - setting A")
/Users/alessandro/.virtualenvs/newpymc3/lib/python3.5/site-packages/scipy/stats/stats.py:1713: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  return np.add.reduce(sorted[indexer] * weights, axis=axis) / sumval
Text(0.5,1,'Distribution of session count abandoned group Experiment 1 - setting A')
ax = sns.distplot(sessionC_aban_1B)
ax.set_xlabel("Number of sessions per worker")
ax.set_ylabel("Frequency")
ax.set_title("Distribution of session count abandoned group Experiment 1 - setting B")
Text(0.5,1,'Distribution of session count abandoned group Experiment 1 - setting B')

Exp2_H0: means of sessionCount are equal in both populations

Exp2_H1: means of sessionCount are not equal in both populations

normal = norm_sessionC_aban_2A and norm_sessionC_aban_2B
print('Abandoned 2A and 2B')
stats,p2sc,accept2sc = compareTwoSamples(sessionC_aban_2A, sessionC_aban_2B, normal)
Abandoned 2A and 2B
Statistics=5261.500, p=0.138
The two samples have the same distribution: True
ax = sns.distplot(sessionC_aban_2A)
ax.set_xlabel("Number of sessions per worker")
ax.set_ylabel("Frequency")
ax.set_title("Distribution of session count abandoned group Experiment 2 - setting A")
Text(0.5,1,'Distribution of session count abandoned group Experiment 2 - setting A')

ExpQ_H0: means of sessionCount are equal in both populations

ExpQ_H1: means of sessionCount are not equal in both populations

ax = sns.distplot(sessionC_aban_1B)
ax.set_xlabel("Number of sessions per worker")
ax.set_ylabel("Frequency")
ax.set_title("Distribution of session count abandoned group Experiment 1 - setting B")
Text(0.5,1,'Distribution of session count abandoned group Experiment 1 - setting B')
normal = norm_sessionC_aban_1C and norm_sessionC_aban_1B
print('Abandoned 1C and 1B')
stats,pqsc,acceptqsc= compareTwoSamples(sessionC_aban_1C, sessionC_aban_1B, normal)
Abandoned 1C and 1B
Statistics=10032.500, p=0.006
The two samples have the same distribution: False
ax = sns.distplot(sessionC_aban_1C)
ax.set_xlabel("Number of sessions per worker")
ax.set_ylabel("Frequency")
ax.set_title("Distribution of session count abandoned group Experiment 1 - setting C")
Text(0.5,1,'Distribution of session count abandoned group Experiment 1 - setting C')
normal = norm_sessionC_aban_2B and norm_sessionC_aban_1C
print('Abandoned 2B and 1C')
stats,pq2sc,acceptq2sc = compareTwoSamples(sessionC_aban_2B, sessionC_aban_1C, normal)
Abandoned 2B and 1C
Statistics=13891.500, p=0.068
The two samples have the same distribution: True

Number of Messages

messageC_aban_1A= getMessageCount(aban_1A)
messageC_aban_1B= getMessageCount(aban_1B)
messageC_aban_1C= getMessageCount(aban_1C)

messageC_aban_2A= getMessageCount(aban_2A)
messageC_aban_2B= getMessageCount(aban_2B)
print(messageC_aban_1A.describe())
print(messageC_aban_1B.describe())
print(messageC_aban_1C.describe())

print(messageC_aban_2A.describe())
print(messageC_aban_2B.describe())
       messageCount
count    107.000000
mean       1.158879
std        0.568837
min        1.000000
25%        1.000000
50%        1.000000
75%        1.000000
max        6.000000
       messageCount
count     78.000000
mean       1.487179
std        1.585188
min        1.000000
25%        1.000000
50%        1.000000
75%        1.000000
max       11.000000
       messageCount
count    291.000000
mean      10.701031
std       17.677669
min        1.000000
25%        1.000000
50%        4.000000
75%       12.000000
max      229.000000
       messageCount
count    109.000000
mean       1.522936
std        0.834403
min        1.000000
25%        1.000000
50%        1.000000
75%        2.000000
max        6.000000
       messageCount
count    102.000000
mean       1.705882
std        1.668510
min        1.000000
25%        1.000000
50%        1.000000
75%        2.000000
max       15.000000

Normality

norm_messageC_aban_1A = testNormality(messageC_aban_1A)
print("final: " + str(norm_messageC_aban_1A))
norm_messageC_aban_1B = testNormality(messageC_aban_1B)
print("final: " + str(norm_messageC_aban_1B))
norm_messageC_aban_1C = testNormality(messageC_aban_1C)
print("final: " + str(norm_messageC_aban_1C))

norm_messageC_aban_2A = testNormality(messageC_aban_2A)
print("final: " + str(norm_messageC_aban_2A))
norm_messageC_aban_2B = testNormality(messageC_aban_2B)
print("final: " + str(norm_messageC_aban_2B))
length of series in Shapiro is: 107
Statistics Shapiro-Wilk Test =0.290, p=0.000
Shapiro.Wilk says it is Normal False
Statistics D'Agostino and Pearson's Test=172.215, p=0.000
final: False
length of series in Shapiro is: 78
Statistics Shapiro-Wilk Test =0.349, p=0.000
Shapiro.Wilk says it is Normal False
Statistics D'Agostino and Pearson's Test=105.635, p=0.000
final: False
length of series in Shapiro is: 291
Statistics Shapiro-Wilk Test =0.499, p=0.000
Shapiro.Wilk says it is Normal False
Statistics D'Agostino and Pearson's Test=428.798, p=0.000
final: False
length of series in Shapiro is: 109
Statistics Shapiro-Wilk Test =0.613, p=0.000
Shapiro.Wilk says it is Normal False
Statistics D'Agostino and Pearson's Test=83.161, p=0.000
final: False
length of series in Shapiro is: 102
Statistics Shapiro-Wilk Test =0.438, p=0.000
Shapiro.Wilk says it is Normal False
Statistics D'Agostino and Pearson's Test=152.159, p=0.000
final: False

Exp1_H0: means of messageCount are equal in both populations

Exp1_H1: means of messageCount are not equal in both populations

normal = norm_messageC_aban_1A and norm_messageC_aban_1B
print('Abandoned 1A and 1B')
stats,p1mc,accept1mc = compareTwoSamples(messageC_aban_1A, messageC_aban_1B, normal )
Abandoned 1A and 1B
Statistics=3988.000, p=0.194
The two samples have the same distribution: True
ax = sns.distplot(messageC_aban_1A)
ax.set_xlabel("Number of messages per worker")
ax.set_ylabel("Frequency")
ax.set_title("Distribution of message count abandoned group Experiment 1 - setting A")
Text(0.5,1,'Distribution of message count abandoned group Experiment 1 - setting A')
ax = sns.distplot(messageC_aban_1B)
ax.set_xlabel("Number of messages per worker")
ax.set_ylabel("Frequency")
ax.set_title("Distribution of message count abandoned group Experiment 1 - setting B")
Text(0.5,1,'Distribution of message count abandoned group Experiment 1 - setting B')

Exp2_H0: means of messageCount are equal in both populations

Exp2_H1: means of messageCount are not equal in both populations

normal = norm_messageC_aban_2A and norm_messageC_aban_2B
print('Abandoned 2A and 2B')
stats,p2mc,accept2mc = compareTwoSamples(messageC_aban_2A, messageC_aban_2B, normal)
Abandoned 2A and 2B
Statistics=5436.000, p=0.374
The two samples have the same distribution: True
ax = sns.distplot(messageC_aban_2A)
ax.set_xlabel("Number of messages per worker")
ax.set_ylabel("Frequency")
ax.set_title("Distribution of message count abandoned group Experiment 2 - setting A")
Text(0.5,1,'Distribution of message count abandoned group Experiment 2 - setting A')
ax = sns.distplot(messageC_aban_2B)
ax.set_xlabel("Number of messages per worker")
ax.set_ylabel("Frequency")
ax.set_title("Distribution of message count abandoned group Experiment 2 - setting B")
Text(0.5,1,'Distribution of message count abandoned group Experiment 2 - setting B')
print('values in population A')
messageC_aban_2A['messageCount'].value_counts()
values in population A
1    65
2    38
4     2
3     2
6     1
5     1
Name: messageCount, dtype: int64
print('values in population B')
messageC_aban_2B['messageCount'].value_counts()
values in population B
1     66
2     24
4      7
3      2
15     1
6      1
5      1
Name: messageCount, dtype: int64
messageC_aban_2A.hist(log=True)
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x119bd1710>]],
      dtype=object)

ExpQ_H0: means of messageCount are equal in both populations

ExpQ_H1: means of messageCount are not equal in both populations

normal = norm_messageC_aban_1C and norm_messageC_aban_1B
print('Abandoned 1C and 1B')
stats,pqmc,acceptqmc = compareTwoSamples(messageC_aban_1C, messageC_aban_1B, normal)
Abandoned 1C and 1B
Statistics=4628.000, p=0.000
The two samples have the same distribution: False
ax = sns.distplot(messageC_aban_1C)
ax.set_xlabel("Number of messages per worker")
ax.set_ylabel("Frequency")
ax.set_title("Distribution of message count abandoned group Experiment 1 - setting C")
Text(0.5,1,'Distribution of message count abandoned group Experiment 1 - setting C')
normal = norm_messageC_aban_2B and norm_messageC_aban_1C
print('Abandoned 2B and 1C')
stats,pq2mc,acceptq2mc = compareTwoSamples(messageC_aban_2B, messageC_aban_1C, normal)
Abandoned 2B and 1C
Statistics=7448.500, p=0.000
The two samples have the same distribution: False

2. Time Invested

Functions to compute the measurements

group = log1A.groupby(['worker_id','session_id'])['server_time']
sessiontime = group.apply(lambda x: (x.max() - x.min()).total_seconds())
# testing
# group.get_group((6476374, '0.8lvbip6m'))
# group.get_group((6476374, '0.8lvbip6m')).max()-group.get_group((6476374, '0.8lvbip6m')).min()
def getTimePerSession(df):
   
    dfG = df.groupby(['worker_id','session_id'])['server_time']
    sessionTimes = dfG.apply(lambda x: (x.max() - x.min()).total_seconds())
    sessionTimesRI = sessionTimes.reset_index()
    del(sessionTimesRI['worker_id'])
    del(sessionTimesRI['session_id'])
    sessionTimesRI.columns=['sessionTime']
    return sessionTimesRI
    
    
# more?
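A minimal sketch of what getTimePerSession computes, on toy timestamps (the timestamps and IDs are illustrative): the session duration is the span between the first and last logged event, so single-event sessions get a duration of zero, which explains the many zeros in the descriptive statistics below.

```python
# Toy log: worker 1 has a 2-minute session, worker 2 logs a single event.
import pandas as pd

toy = pd.DataFrame({
    'worker_id':  [1, 1, 1, 2],
    'session_id': ['s1', 's1', 's1', 's2'],
    'server_time': pd.to_datetime([
        '2018-08-01 10:00:00',
        '2018-08-01 10:00:30',
        '2018-08-01 10:02:00',
        '2018-08-01 11:00:00',
    ]),
})

# Same computation as getTimePerSession: last event minus first event, in seconds
durations = (toy.groupby(['worker_id', 'session_id'])['server_time']
                .apply(lambda x: (x.max() - x.min()).total_seconds()))
print(durations.tolist())  # [120.0, 0.0]
```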

Time per session

sessionTime_aban_1A= getTimePerSession(aban_1A)
sessionTime_aban_1B= getTimePerSession(aban_1B)
sessionTime_aban_1C= getTimePerSession(aban_1C)

sessionTime_aban_2A= getTimePerSession(aban_2A)
sessionTime_aban_2B= getTimePerSession(aban_2B)
print(sessionTime_aban_1A.describe())
print(sessionTime_aban_1B.describe())
print(sessionTime_aban_1C.describe())

print(sessionTime_aban_2A.describe())
print(sessionTime_aban_2B.describe())
       sessionTime
count   118.000000
mean     14.225572
std     108.940254
min       0.000000
25%       0.000000
50%       0.000000
75%       0.000000
max    1147.664558
       sessionTime
count    86.000000
mean     48.604076
std     241.855162
min       0.000000
25%       0.000000
50%       0.000000
75%       0.000000
max    2122.246052
       sessionTime
count   568.000000
mean    138.322278
std     255.149176
min       0.000000
25%       0.000000
50%       0.000000
75%     174.020518
max    1648.308943
       sessionTime
count   130.000000
mean      5.573162
std      38.715014
min       0.000000
25%       0.000000
50%       0.000000
75%       0.007999
max     414.607930
       sessionTime
count   122.000000
mean     17.836769
std     142.297113
min       0.000000
25%       0.000000
50%       0.000000
75%       0.115205
max    1466.078933

Normality

norm_sessionTime_aban_1A = testNormality(sessionTime_aban_1A)
print("final: " + str(norm_sessionTime_aban_1A))
norm_sessionTime_aban_1B = testNormality(sessionTime_aban_1B)
print("final: " + str(norm_sessionTime_aban_1B))
norm_sessionTime_aban_1C = testNormality(sessionTime_aban_1C)
print("final: " + str(norm_sessionTime_aban_1C))

norm_sessionTime_aban_2A = testNormality(sessionTime_aban_2A)
print("final: " + str(norm_sessionTime_aban_2A))
norm_sessionTime_aban_2B = testNormality(sessionTime_aban_2B)
print("final: " + str(norm_sessionTime_aban_2B))
length of series in Shapiro is: 118
Statistics Shapiro-Wilk Test =0.113, p=0.000
Shapiro.Wilk says it is Normal False
Statistics D'Agostino and Pearson's Test=242.973, p=0.000
final: False
length of series in Shapiro is: 86
Statistics Shapiro-Wilk Test =0.201, p=0.000
Shapiro.Wilk says it is Normal False
Statistics D'Agostino and Pearson's Test=167.391, p=0.000
final: False
length of series in Shapiro is: 568
Statistics Shapiro-Wilk Test =0.620, p=0.000
Shapiro.Wilk says it is Normal False
Statistics D'Agostino and Pearson's Test=277.372, p=0.000
final: False
length of series in Shapiro is: 130
Statistics Shapiro-Wilk Test =0.130, p=0.000
Shapiro.Wilk says it is Normal False
Statistics D'Agostino and Pearson's Test=256.100, p=0.000
final: False
length of series in Shapiro is: 122
Statistics Shapiro-Wilk Test =0.107, p=0.000
Shapiro.Wilk says it is Normal False
Statistics D'Agostino and Pearson's Test=240.925, p=0.000
final: False

Exp1_H0: means of time per session are equal in both populations

Exp1_H1: means of time per session are not equal in both populations

normal = norm_sessionTime_aban_1A and norm_sessionTime_aban_1B
print('Abandoned 1A and 1B')
stats,p1ts,accept1ts = compareTwoSamples(sessionTime_aban_1A, sessionTime_aban_1B, normal )
Abandoned 1A and 1B
Statistics=4789.000, p=0.066
The two samples have the same distribution: True
ax = sns.distplot(sessionTime_aban_1A)
ax.set_xlabel("Time per session per worker")
ax.set_ylabel("Frequency")
ax.set_title("Distribution of session time abandoned group Experiment 1 - setting A")
Text(0.5,1,'Distribution of session time abandoned group Experiment 1 - setting A')
ax = sns.distplot(sessionTime_aban_1B)
ax.set_xlabel("Time per session per worker")
ax.set_ylabel("Frequency")
ax.set_title("Distribution of session time abandoned group Experiment 1 - setting B")
Text(0.5,1,'Distribution of session time abandoned group Experiment 1 - setting B')

Exp2_H0: means of time per session are equal in both populations

Exp2_H1: means of time per session are not equal in both populations

normal = norm_sessionTime_aban_2A and norm_sessionTime_aban_2B
print('Abandoned 2A and 2B')
stats,p2ts,accept2ts = compareTwoSamples(sessionTime_aban_2A, sessionTime_aban_2B, normal)
Abandoned 2A and 2B
Statistics=6913.000, p=0.016
The two samples have the same distribution: False
ax = sns.distplot(sessionTime_aban_2A)
ax.set_xlabel("Time per session per worker")
ax.set_ylabel("Frequency")
ax.set_title("Distribution of session time abandoned group Experiment 2 - setting A")
Text(0.5,1,'Distribution of session time abandoned group Experiment 2 - setting A')
ax = sns.distplot(sessionTime_aban_2B)
ax.set_xlabel("Time per session per worker")
ax.set_ylabel("Frequency")
ax.set_title("Distribution of session time abandoned group Experiment 2 - setting B")
Text(0.5,1,'Distribution of session time abandoned group Experiment 2 - setting B')

ExpQ_H0: means of time per session are equal in both populations

ExpQ_H1: means of time per session are not equal in both populations

normal = norm_sessionTime_aban_1C and norm_sessionTime_aban_1B
print('Abandoned 1C and 1B')
stats,pqts,acceptqts = compareTwoSamples(sessionTime_aban_1C, sessionTime_aban_1B, normal)
Abandoned 1C and 1B
Statistics=17866.500, p=0.000
The two samples have the same distribution: False
ax = sns.distplot(sessionTime_aban_1C)
ax.set_xlabel("Time per session per worker")
ax.set_ylabel("Frequency")
ax.set_title("Distribution of session time abandoned group Experiment 1 - setting C")
Text(0.5,1,'Distribution of session time abandoned group Experiment 1 - setting C')
normal = norm_sessionTime_aban_2B and norm_sessionTime_aban_1C
print('Abandoned 2B and 1C')
stats,pq2ts,acceptq2ts = compareTwoSamples(sessionTime_aban_2B, sessionTime_aban_1C, normal)
Abandoned 2B and 1C
Statistics=30410.000, p=0.007
The two samples have the same distribution: False
sessionTime_aban_1C.hist(log=True)
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x11d4d3cf8>]],
      dtype=object)
sessionTime_aban_2B.hist(log=True)
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x11d6ac1d0>]],
      dtype=object)

Corrections

ps = pd.Series([p1sc,p2sc,pqsc,pq2sc,p1mc,p2mc,pqmc,pq2mc,p1ts,p2ts,pqts,pq2ts])
ps.head(11)
0     4.961015e-01
1     1.376391e-01
2     6.414342e-03
3     6.818856e-02
4     1.938103e-01
5     3.735934e-01
6     2.877487e-17
7     4.219429e-15
8     6.552328e-02
9     1.628683e-02
10    9.067023e-07
dtype: float64
from statsmodels.stats.multitest import multipletests

corrected_p = multipletests(ps, alpha=0.05, method='sidak')
print(str(accept1sc)+' '+str(accept2sc)+' '+str(acceptqsc)+' '+str(acceptq2sc)+' '+str(accept1mc)+' '+str(accept2mc)+' '+str(acceptqmc)+' '+str(acceptq2mc)+' '+str(accept1ts)+' '+str(accept2ts)+' '+str(acceptqts)+' '+str(acceptq2ts))
print(corrected_p)
print('p > 0.05 is accept H0')
True True False True True True False False True False False False
(array([False, False, False, False, False, False,  True,  True, False,
       False,  True, False]), array([9.99732011e-01, 8.30851230e-01, 7.43138476e-02, 5.71514182e-01,
       9.24621597e-01, 9.96350145e-01, 0.00000000e+00, 5.06261699e-14,
       5.56573311e-01, 1.78851248e-01, 1.08803738e-05, 8.35732391e-02]), 0.004265318777560645, 0.004166666666666667)
p > 0.05 is accept H0
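The last two values in the multipletests output are the corrected per-test significance levels; they can be reproduced directly from the Šidák and Bonferroni formulas for m = 12 tests:

```python
# Corrected per-test alphas for a family of m = 12 tests
alpha, m = 0.05, 12
alpha_sidak = 1 - (1 - alpha) ** (1.0 / m)   # ~0.004265
alpha_bonf = alpha / m                        # ~0.004167
print(alpha_sidak, alpha_bonf)
```

These match the 0.004265... and 0.004166... reported in the tuple above.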
#pip install scikit-posthocs
import scikit_posthocs as sp

x = [[1,2,3,5,1], [12,31,54], [10,12,6,74,11]]
sp.posthoc_conover(x, p_adjust = 'holm')
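A post-hoc test such as Conover's is normally preceded by an omnibus test that checks whether any group differs at all; a sketch with the Kruskal-Wallis test (the non-parametric analogue of one-way ANOVA for more than two groups) on the same toy groups:

```python
# Omnibus Kruskal-Wallis test; a post-hoc comparison (e.g., Conover's)
# is only meaningful if this rejects H0.
from scipy.stats import kruskal

x = [[1, 2, 3, 5, 1], [12, 31, 54], [10, 12, 6, 74, 11]]
stat, p = kruskal(*x)
print('H=%.3f, p=%.3f' % (stat, p))
```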

With two-way ANOVA

  • independent vars
    • reward, length --> categorical values

Building the data for ANOVA

sessionC | exp | variant
1        | 1   | A
3        | 1   | A
1        | 1   | A
...
1        | 1   | B
2        | 1   | B
1        | 1   | B
...
1        | 1   | C
2        | 1   | C
1        | 1   | C
...
1        | 2   | A
1        | 2   | A
...
1        | 2   | B
1        | 2   | B
...

sessionC_aban_1A
sessionC_aban_1B
sessionC_aban_1C

sessionTime_aban_2A
sessionTime_aban_2B
sessionTime
0 0.302825
1 0.000000
2 0.000000
3 0.051633
4 0.225698
5 0.000000
6 0.000000
7 0.068203
8 0.000000
9 0.015886
10 0.117176
11 0.000000
12 0.244297
13 0.000000
14 0.000000
15 25.788719
16 0.000000
17 0.000000
18 0.059415
19 0.000000
20 0.000000
21 0.000751
22 0.000000
23 1466.078933
24 0.000000
25 0.608219
26 0.013120
27 0.000000
28 0.046127
29 0.409284
... ...
92 0.126355
93 0.000000
94 0.003054
95 0.000000
96 0.000000
97 0.000000
98 0.000000
99 0.094291
100 0.069831
101 1.212784
102 1.364625
103 1.089849
104 0.573025
105 3.096394
106 0.295953
107 0.000000
108 0.409919
109 0.499067
110 0.000000
111 0.000000
112 0.109291
113 0.000000
114 0.000000
115 0.000000
116 0.000000
117 0.000000
118 95.189342
119 0.000000
120 0.000000
121 0.000000

122 rows × 1 columns

s1A = pd.DataFrame(sessionC_aban_1A)
s1A['exp'] = 1
s1A['variant'] = 'A'
s1A['qual'] = 0

s1B = pd.DataFrame(sessionC_aban_1B)
s1B['exp'] = 1
s1B['variant'] = 'B'
s1B['qual'] = 0

s1C = pd.DataFrame(sessionC_aban_1C)
s1C['exp'] = 1
s1C['variant'] = 'C'
s1C['qual'] = 1

s2A = pd.DataFrame(sessionC_aban_2A)
s2A['exp'] = 2
s2A['variant'] = 'A'
s2A['qual'] = 0

s2B = pd.DataFrame(sessionC_aban_2B)
s2B['exp'] = 2
s2B['variant'] = 'B'
s2B['qual'] = 0
data_anova = pd.concat([s1A, s1B, s1C, s2A, s2B])
#data_anova = pd.concat([s1A, s1B, s2A, s2B])
data_anova['variant'].unique()
array(['A', 'B', 'C'], dtype=object)
data_anova.shape
(687, 4)
import seaborn as sns

def plot_anova(data, y):
    sns.set(style="whitegrid")

    # Draw a pointplot showing the measurement y as a function of the three categorical factors
    g = sns.catplot(x="variant", y=y, hue="qual", col="exp",
                capsize=.2, palette="YlGnBu_d", height=6, aspect=.75,
                kind="point", data=data)
    g.despine(left=True)
plot_anova(data_anova,y="sessionCount")
data_anova.isnull().values.any()
print(s1A.shape)
print(s1B.shape)
print(s1C.shape)
print(s2A.shape)
print(s2B.shape)
data_anova[data_anova['variant']=='B']['exp'].unique()
 
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

#model = ols(formula='sessionCount ~ C(exp) + C(variant)', data=data_anova).fit()
model = ols(formula='sessionCount ~ C(exp) * C(variant) * C(qual)', data=data_anova).fit()
 
aov_table = anova_lm(model, typ=2) 
---------------------------------------------------------------------------
LinAlgError                               Traceback (most recent call last)
<ipython-input-116-a0e5489a4827> in <module>()
      5 model = ols(formula='sessionCount ~ C(exp) * C(variant) * C(qual)', data=data_anova).fit()
      6 
----> 7 aov_table = anova_lm(model, typ=2)

[... statsmodels anova_lm -> f_test -> wald_test -> np.linalg.inv ...]

LinAlgError: Singular matrix
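The singular matrix most likely comes from collinearity in the design: `qual` is 1 exactly when `variant == 'C'`, so the `C(variant) * C(qual)` interaction produces linearly dependent columns and the full factorial model is rank-deficient. A minimal sketch (with a toy design, not the real data) that exposes the rank deficiency with NumPy:

```python
import numpy as np
import pandas as pd

# Toy reproduction of the design: qual == 1 exactly when variant == 'C',
# so the qual dummies are linear combinations of the variant dummies.
df = pd.DataFrame({
    'variant': ['A', 'B', 'C'] * 4,
    'qual':    [0, 0, 1] * 4,
})
dummies = pd.get_dummies(df[['variant', 'qual']].astype(str))
X = dummies.to_numpy(dtype=float)

# Rank is strictly below the number of columns -> singular design matrix
print(np.linalg.matrix_rank(X), X.shape[1])
```

A reduced model that drops the confounded factor, e.g. `ols('sessionCount ~ C(exp) + C(variant)', data=data_anova)` (the commented-out formula above), or comparing 1B vs. 1C separately to isolate the quality-check effect, should avoid the singularity.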
aov_table
import seaborn as sns
sns.set(style="whitegrid")

# Load the example exercise dataset
df = sns.load_dataset("exercise")
df.head()
# Reference example from the seaborn docs: pulse as a function of three categorical factors
g = sns.catplot(x="time", y="pulse", hue="kind", col="diet",
                capsize=.2, palette="YlGnBu_d", height=6, aspect=.75,
                kind="point", data=df)
g.despine(left=True)
Discuss: should we frame this as a prediction task? E.g., a binary classification task with classes worker_to_abandon (C1) vs. worker_to_complete (C2)?

# When this is done, we want to run both comparisons: 1C vs. 1B and 1C vs. 2B
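If the classification framing is pursued, a minimal sketch of the pipeline could look like the following. Everything here is hypothetical: the features would really come from the session logs (e.g. session counts, time per document, reward condition), and the labels from whether the worker completed the HIT; synthetic data stands in for both.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 200

# Hypothetical worker features (placeholders for log-derived features)
X = rng.normal(size=(n, 3))
# Hypothetical labels: 1 = worker_to_abandon (C1), 0 = worker_to_complete (C2)
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
clf = LogisticRegression().fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)
print(acc)
```

With real features the interesting output would be the coefficients, i.e. which task properties (reward, length, quality checks) are most associated with abandonment.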