Analysing the Results of the Abandonment Controlled Experiment

# Auxiliary Formatting Cell -------------------------------------------
from IPython.display import HTML
style = "<style>div.info { background-color: #DCD8D7;border-color: #dFb5b4; border-left: 5px solid #DCD8D7  ; padding: 0.5em;}</style>"
HTML(style)
# ---------------------------------------------------------------------

Notebook Metadata

Author Cristina Sarasua
Last Update 01.08.2018

Purpose Analyse the results of the controlled experiment run for the abandonment project, where the same batch of HITs has been deployed several times on CrowdFlower(?), controlling for two variables:

  1. the reward (B1: reward \$0.10 per HIT; B2: reward \$0.30 per HIT)
  2. the length (B1: 3 docs, \$0.15 per HIT; B2: 6 docs, \$0.30 per HIT), i.e., 5 cents per document

Work Description

Reminder: Abandonment is defined in the paper as "workers previewing or starting a HIT and later on deciding to drop it before completion, thus giving up the reward."

Data

Data Files

  1. task_res JSON files from F8
  2. logs JSON files generated by Kevin et al. with logger

Data Explanation

  • Experiment 1: REWARD
    • A: \$0.10 per HIT, 6 documents
    • B: \$0.30 per HIT, 6 documents
    • C: \$0.30 per HIT (like B), with quality checks
  • Experiment 2: LENGTH

    • A: 3 documents (\$0.15 per HIT)
    • B: 6 documents (\$0.30 per HIT)
  • Ground truth for Topic 418

```
SET 1
d1 -- LA010790-0121 -- REL
d2 -- LA010289-0001 -- NOT REL
d3 -- LA010289-0021 -- NOT REL
d4 -- LA011190-0156 -- REL
d5 -- LA010289-0060 -- NOT REL
d6 -- LA012590-0067 -- REL

SET 2
d1 -- LA052189-0009 -- REL
d2 -- LA052189-0189 -- NOT REL
d3 -- LA052189-0196 -- NOT REL
d4 -- LA052589-0174 -- REL
d5 -- LA052590-0132 -- NOT REL
d6 -- LA052590-0204 -- REL

EXP 1A and 2A --> SET 1
EXP 1B and 2B --> SET 2
```
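If one later wants to score judgments against this ground truth in code, a minimal sketch could encode it as dicts (the names GROUND_TRUTH_SET1/2 are illustrative, not from the original notebook):

```
# Hypothetical encoding of the ground truth above (True = REL, False = NOT REL).
GROUND_TRUTH_SET1 = {
    'LA010790-0121': True,  'LA010289-0001': False, 'LA010289-0021': False,
    'LA011190-0156': True,  'LA010289-0060': False, 'LA012590-0067': True,
}
GROUND_TRUTH_SET2 = {
    'LA052189-0009': True,  'LA052189-0189': False, 'LA052189-0196': False,
    'LA052589-0174': True,  'LA052590-0132': False, 'LA052590-0204': True,
}
# EXP 1A and 2A use SET 1; EXP 1B and 2B use SET 2.
```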

  • Each worker works on only ONE unit / HIT. Therefore, the worker-unit analysis does not apply here. For, e.g., time, we can do a worker-document analysis.

Findings

Data Quality / General Things
  • All workers were logged, no errors
Abandonment
  • Lower reward --> more people abandoned. Shorter length --> more people abandoned. But in both cases the difference is small. (See also "Abandonment Stats")

(A and B comparisons)

  • Comparison of means of sessionCount between the two populations, both in experiment 1 (1A and 1B) and experiment 2 (2A and 2B), indicates that the means are equal (or that there is no significant difference between them). That is, increasing the reward or the number of documents (in these experiments) did not change the distribution. (See also "1. Work Done")
  • Comparison of means of number of messages between the two populations, both in experiment 1 (1A and 1B) and experiment 2 (2A and 2B), indicates that the means are equal (or that there is no significant difference between them). That is, increasing the reward or the number of documents (in these experiments) did not change the distribution. (See also "1. Work Done")
  • Comparison of means of time invested in session between the pairs of populations: in 1A and 1B the two populations are significantly different, while 2A and 2B are not. (Note: I updated this 1A and 1B comparison after correcting a variable name.) (See also "2. Time Invested")

(C comparisons with quality checks)

  • Comparison of means of sessionCount between the pairs of populations (2B and 2C) indicates that the two samples are significantly different. Between 1B and 2C we don’t see a significant difference. (See also "1. Work Done")
  • Comparison of means of number of messages between the pairs of populations, (2B and 2C) indicates that the two samples are significantly different and (1B and 2C) are significantly different too. (See also "1. Work Done")
  • Comparison of means of time invested in session between the pairs of populations, (2B and 2C) indicates that the two samples are significantly different and (1B and 2C) are significantly different too. (See also "2. Time Invested")

Notes

  • Changes from the last version:
    • exp1 is exp2 and vice versa, because the original task and log files had inverted names.

Discussion

  • Interpretation: TBC
  • Implications: TBC
  • The limitations of this analysis are ... TBC

Code

#!pip install statsmodels
import pandas as pd
import matplotlib.pyplot as plt
import datetime
import numpy as np
%matplotlib inline
import seaborn as sns
from statsmodels.sandbox.stats.multicomp import multipletests 
# Data Directories
dataDir = ""
taskDir = dataDir + "tasks/"
logDir = dataDir + "logs/"


#dataDir = "/Users/sarasua/Documents/RESEARCH/collab_Abandonment/controlledexperiment/results/"
#taskDir = dataDir + "task_res/"
#logDir = dataDir + "logs/"

# Concrete Task Files 
# original files are inverted: what is called LEN is REWARD and what is called REWARD is LEN! We can see this from the number of judgments.
#REWARD
fExp1A = taskDir + "CONTROLLED_EXPFIXED_LEN_PAY10_DOCSET1.json"
fExp1B = taskDir + "CONTROLLED_EXPFIXED_LEN_PAY30_DOCSET2.json"
#LENGTH
fExp2A = taskDir + "CONTROLLED_EXPFIXED_REWARD_PAY15_DOCSET1.json" 
fExp2B = taskDir + "CONTROLLED_EXPFIXED_REWARD_PAY30_DOCSET2.json"

# Concrete Logged Events Files
# original files are inverted: what is called LEN is REWARD and what is called REWARD is LEN! We can see this from the number of judgments.

#REWARD
fLog1A = logDir + "CONTROLLED_EXPFIXED_LEN_PAY10_DOCSET1.json"
fLog1B = logDir + "CONTROLLED_EXPFIXED_LEN_PAY30_DOCSET2.json"
fLog1C = logDir + "CONTROLLED_EXPFIXED_LEN_QUALITY_PAY30_DOCSET1.json" 
#LENGTH
fLog2A = logDir + "CONTROLLED_EXPFIXED_REWARD_PAY15_DOCSET1.json"
fLog2B = logDir + "CONTROLLED_EXPFIXED_REWARD_PAY30_DOCSET2.json"

Data Preprocessing

TODO: Data Preprocessing Tasks

  • Load data
  • Separate abandoned and submitted people like Lei did
  • If we want to compare with the other batch of tasks (aka the experiment in the wild), we need to rescale the relevance judgements, because the first experiment used a 4-level scale and this one uses 2 levels (see the sketch after this list)
  • Merge the files into a DF with the columns I am interested in
  • Do the split of groups in a similar way - but how to analyse later?
  • In cross-check: did they ensure that in experiments A and B the workers are always disjoint? That would be a between-subjects experiment design
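A minimal sketch of the rescaling step mentioned above, assuming the 4-level scale is encoded 0..3 and that levels >= 2 count as relevant (both assumptions to be confirmed against the first experiment's data):

```
# Hypothetical rescaling: collapse a 4-level relevance scale (0..3) to binary.
# The threshold (>= 2 means relevant) is an assumption, not from the notebook.
def rescale_judgment(level_4pt, threshold=2):
    return 1 if level_4pt >= threshold else 0

# e.g.: df['judgment_bin'] = df['judgment_4pt'].apply(rescale_judgment)
```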
exp1A = pd.read_json(path_or_buf = fExp1A, lines = True, encoding = 'utf-8', orient = "records")
exp1B = pd.read_json(path_or_buf = fExp1B, lines = True, encoding = 'utf-8', orient = "records")
exp2A = pd.read_json(path_or_buf = fExp2A, lines = True, encoding = 'utf-8', orient = "records")
exp2B = pd.read_json(path_or_buf = fExp2B, lines = True, encoding = 'utf-8', orient = "records")

log1A = pd.read_json(path_or_buf = fLog1A, lines = True, encoding = 'utf-8', orient = "records")
log1B = pd.read_json(path_or_buf = fLog1B, lines = True, encoding = 'utf-8', orient = "records")
log1C = pd.read_json(path_or_buf = fLog1C, lines = True, encoding = 'utf-8', orient = "records")
log2A = pd.read_json(path_or_buf = fLog2A, lines = True, encoding = 'utf-8', orient = "records")
log2B = pd.read_json(path_or_buf = fLog2B, lines = True, encoding = 'utf-8', orient = "records")
 
# Data Format & Content Exploration - with example exp1A 
exp1A.head()
# Create the column unit_id extracted from the data dict
def extractUnitId(row):
    resDic = row['data']    
    unitId = resDic['unit_id']
    return unitId
exp1A['unit_id'] = exp1A.apply(extractUnitId,axis=1)
exp1B['unit_id'] = exp1B.apply(extractUnitId,axis=1)
exp2A['unit_id'] = exp2A.apply(extractUnitId,axis=1)
exp2B['unit_id'] = exp2B.apply(extractUnitId,axis=1)
# Create the column worker_id extracted from the results dict
def extractWorkerId(row):
    resDic = row['results']
    workerId = resDic['judgments'][0]['worker_id']
    if(len(resDic['judgments']) > 1):
        print('One worker with more than one judgment! '+ str(workerId))
    
    return workerId
    
exp1A['worker_id'] = exp1A.apply(extractWorkerId,axis=1)
exp1B['worker_id'] = exp1B.apply(extractWorkerId,axis=1)
exp2A['worker_id'] = exp2A.apply(extractWorkerId,axis=1)
exp2B['worker_id'] = exp2B.apply(extractWorkerId,axis=1)
# Data Format & Content Exploration - with example log1A 
log1A.head()

Explanation by Kevin:

```
var final_log = {
  "session_id": session_id,   // unique session id (to capture page refresh)
  "message": String(message), // message that triggered the log -- see below
  "worker_id": worker_id,     // worker id
  "task_id": task_id,         // task_id
  "time": Date.now(),         // time of sending log
  "step": step,               // step into the task (i.e., 1,2,3,4... no_docs, paystep)
  "judgments": judgments,     // array of judgments -- starts at 1 (0 is null)
  "times": times,             // array of times for the judgments -- starts at 1 (0 is null)
  "steps": steps              // array of steps into the task; e.g., if the worker pressed back at step 2 the array is 1,2,1,2,3,...
};
```

the message-set is:

  • nextButton
  • backButton
  • Final_OK --> task concluded successfully
  • paying --> paying the worker
  • Start --> start task
  • 'MW Worker Rejected:' + worker_id --> blacklisted worker that tried to start the task
  • MWorker ok --> opposite of the last (not sure if present)
log1A.message.unique()
sessions = log1A[['worker_id', 'session_id']]
sessions.groupby(['worker_id']).size().unique()
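Given this message set, a hedged cross-check of the abandoned/completed split computed below is to look for workers whose log never contains a Final_OK message (a sketch, assuming the message strings match the list above exactly):

```
# Sketch: workers without a 'Final_OK' message presumably never finished the task.
finished = set(log1A.loc[log1A['message'] == 'Final_OK', 'worker_id'])
all_logged = set(log1A['worker_id'])
print('workers without Final_OK:', len(all_logged - finished))
```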
import json
def getJudgments(row):
    text_result = row['results']['judgments'][0]['data']['text_result']
    textrjson = json.loads(text_result)
    judgments = textrjson['judgments']
     # returning pd.Series(judgments) would also work, but then apply expects the result to match the shape of the original data frame
    return len(judgments)
# Helpers
def countJudgments(row):
    #return len(row['judgments'])
    return row['judgments_count'] # it's wrapped in the judgments - data
# Cross checks & Basic stats - units per people etc. Global and separating people? 
def checkTask(taskDf):
    
    # checking published config
    print('total number of HITs:' + str(len(taskDf)))
    # KO print('number of judgments per HIT' + str(taskDf.results.map(lambda x: len(x)).max()))   
    nulls = pd.isnull(taskDf)
    
    # missing values
       
    print('Empty value in data column: ' + str(len(nulls.loc[nulls['data'] == True])) + ' out of '+ str(len(nulls['data'])))
    print('Empty value in results column: ' + str(len(nulls.loc[nulls['results'] == True])) + ' out of '+ str(len(nulls['results'])))
    print('Empty value in created_at column: ' + str(len(nulls.loc[nulls['created_at'] == True])) + ' out of '+ str(len(nulls['created_at'])))
    print('Empty value in updated_at column: ' + str(len(nulls.loc[nulls['updated_at'] == True])) + ' out of '+ str(len(nulls['updated_at'])))
    print('Empty value in id column: ' + str(len(nulls.loc[nulls['id'] == True])) + ' out of '+ str(len(nulls['id'])))
    print('Empty value in job_id column: ' + str(len(nulls.loc[nulls['job_id'] == True])) + ' out of '+ str(len(nulls['job_id'])))
    print('Empty value in worker_id column: ' + str(len(nulls.loc[nulls['worker_id'] == True])) + ' out of '+ str(len(nulls['worker_id'])))
    print('Empty value in unit_id column: ' + str(len(nulls.loc[nulls['unit_id'] == True])) + ' out of '+ str(len(nulls['unit_id'])))
    
    
    
    # counts
    print('Total number of workers: ' + str(taskDf['worker_id'].nunique()))
    print('Total number of units - they are judgments: ' + str(taskDf['unit_id'].nunique())) 
    print('AVG Number of units per worker: '+ str(taskDf.groupby(['worker_id'])['unit_id'].nunique().mean()) + ' Max Number of units per worker: '+ str(taskDf.groupby(['worker_id'])['unit_id'].nunique().max()) )
    print('Number of judgments per worker: ' )
    judgmentsCount = pd.Series()
    # when returning an array it takes the length of the DF here! Pandas - print(len(taskDf.columns))
    judgmentsCount = taskDf.apply(getJudgments,axis=1)
    print(judgmentsCount.describe())
        
   
exp1A['results'][0]['judgments'][0]['data']['text_result'] # this one gives an array of 4?
checkTask(exp1A) # is the title of the files misleading? From the number of judgments sent by workers it looks like exp1A is the one of the length
checkTask(exp1B)
checkTask(exp2A)
checkTask(exp2B)
def checkLog(logDf):
    # missing values
    nulls = pd.isnull(logDf)
    print('Empty value in message column: ' + str(len(nulls.loc[nulls['message'] == True])) + ' out of '+ str(len(nulls['message'])))
    print('Empty value in session_id column: ' + str(len(nulls.loc[nulls['session_id'] == True])) + ' out of '+ str(len(nulls['session_id'])))
    print('Empty value in task_id column: ' + str(len(nulls.loc[nulls['task_id'] == True])) + ' out of '+ str(len(nulls['task_id'])))
    print('Empty value in time column: ' + str(len(nulls.loc[nulls['time'] == True])) + ' out of '+ str(len(nulls['time'])))
    print('Empty value in times column: ' + str(len(nulls.loc[nulls['times'] == True])) + ' out of '+ str(len(nulls['times'])))
    print('Empty value in worker_id column: ' + str(len(nulls.loc[nulls['worker_id'] == True])) + ' out of '+ str(len(nulls['worker_id'])))
    print('Empty value in pay column: ' + str(len(nulls.loc[nulls['pay'] == True])) + ' out of '+ str(len(nulls['pay'])))
    print('Empty value in judgments column: ' + str(len(nulls.loc[nulls['judgments'] == True])) + ' out of '+ str(len(nulls['judgments'])))

    # counts
    print('Total number of workers: ' + str(logDf['worker_id'].nunique()))
    print('Total number of tasks: ' + str(logDf['task_id'].nunique())) # task = unit
    print('AVG Number of sessions per worker: '+ str(logDf.groupby(['worker_id'])['session_id'].nunique().mean()) + ' Max Number of sessions per worker: '+ str(logDf.groupby(['worker_id'])['session_id'].nunique().max()) )
    print('AVG Number of tasks per worker: '+ str(logDf.groupby(['worker_id'])['task_id'].nunique().mean()) + ' Max Number of tasks per worker: '+ str(logDf.groupby(['worker_id'])['task_id'].nunique().max()) )
checkLog(log1A)
checkLog(log1B)
checkLog(log1C)
checkLog(log2A)
checkLog(log2B)
def checkTaskJobJointly(taskDf, logDf):
    
    abandonedDf = logDf[~logDf['worker_id'].isin(taskDf['worker_id'])]
    completedDf = logDf[logDf['worker_id'].isin(taskDf['worker_id'])]
    
    # all the answers in the task completion report are also in the log data set
    print('Number of people who abandoned: ' + str(len(logDf['worker_id'][~logDf['worker_id'].isin(taskDf['worker_id'])].unique())) ) #+ ' and they are IDs: '+  str(logDf['worker_id'][~logDf['worker_id'].isin(taskDf['worker_id'])])
    print('Number of people who submitted: ' + str(len(logDf['worker_id'][logDf['worker_id'].isin(taskDf['worker_id'])].unique())) ) #+ ' and they are IDs: '+  str(logDf['worker_id'][logDf['worker_id'].isin(taskDf['worker_id'])])
    print('Number of people who were not logged: ' + str(len(taskDf['worker_id'][~taskDf['worker_id'].isin(logDf['worker_id'])].unique())) )
    print('*Total number of workers in Task*: '+ str(taskDf['worker_id'].nunique()))
    print('*Total number of workers in Log*: '+ str(logDf['worker_id'].nunique()))
    
    return abandonedDf, completedDf
   
    

Abandonment Stats

print('--- Experiment 1  ------------------------')
print('--- (A) ------------------------')
aban_1A, complet_1A = checkTaskJobJointly(exp1A, log1A)
print('--- (B) ------------------------')
aban_1B, complet_1B = checkTaskJobJointly(exp1B, log1B)
print('--- (C) ------------------------')
aban_1C, complet_1C = checkTaskJobJointly(exp1B, log1C) 
print('--- Experiment 2  ------------------------')
print('--- (A) ------------------------')
aban_2A, complet_2A = checkTaskJobJointly(exp2A, log2A)
print('--- (B) ------------------------')
aban_2B, complet_2B = checkTaskJobJointly(exp2B, log2B)
print('--- (C) ------------------------')
aban_2C, complet_2C = checkTaskJobJointly(exp2B, log1B)

Building the 4 groups of people:

Focus is on the log files, filtering in one way or the other.

# Get the two subgroups of abandoned workers: those who abandoned right away and those who abandoned after restarting -- more than one session
def abandSpec(df):
    # (!!) Pandas passes through the first twice
    dfG = df.groupby(['worker_id'])
    abanA = dfG.filter(lambda x: len(x['session_id'].unique()) == 1)
    abanB = dfG.filter(lambda x: len(x['session_id'].unique()) > 1)
    return abanA,abanB
# Get the two subgroups of completed workers: those who submitted answers right away and those who submitted after restarting -- more than one session
# Coded in a separate method for extensibility reasons
def completSpec(df):
    # (!!) Pandas passes through the first twice
    dfG = df.groupby(['worker_id'])
    complA = dfG.filter(lambda x: len(x['session_id'].unique()) == 1)
    complB = dfG.filter(lambda x: len(x['session_id'].unique()) > 1)
    return complA,complB
    
    
# Get all the concrete subsets for all versions of the two controlled experiments.

# Experiment 1 (A,B settings)
abanA_1A,abanB_1A = abandSpec(aban_1A)
completA_1A,completB_1A = completSpec(complet_1A)

abanA_1B,abanB_1B = abandSpec(aban_1B)
completA_1B,completB_1B = completSpec(complet_1B)

# Experiment 2 (A,B settings)
abanA_2A,abanB_2A = abandSpec(aban_2A)
completA_2A,completB_2A = completSpec(complet_2A)

abanA_2B,abanB_2B = abandSpec(aban_2B)
completA_2B,completB_2B = completSpec(complet_2B)
# Cross-check - CORRECT
print('abandoned subgroups 1A')
print(abanA_1A.worker_id.nunique() + abanB_1A.worker_id.nunique())
print('abandoned subgroups 1B')
print(abanA_1B.worker_id.nunique() + abanB_1B.worker_id.nunique())
print('abandoned subgroups 2A')
print(abanA_2A.worker_id.nunique() + abanB_2A.worker_id.nunique())
print('abandoned subgroups 2B')
print(abanA_2B.worker_id.nunique() + abanB_2B.worker_id.nunique())

print('completed subgroups 1A')
print(completA_1A.worker_id.nunique() + completB_1A.worker_id.nunique())
print('completed subgroups 1B')
print(completA_1B.worker_id.nunique() + completB_1B.worker_id.nunique())
print('completed subgroups 2A')
print(completA_2A.worker_id.nunique() + completB_2A.worker_id.nunique())
print('completed subgroups 2B')
print(completA_2B.worker_id.nunique() + completB_2B.worker_id.nunique())
# ----- Testing Pandas
#log1A[log1A['worker_id']==41202032]
#d = log1A.sort_values(by=['worker_id'])
#d.head(100)
#log1Ag = log1A.groupby(['worker_id'])
#abb = log1Ag.filter(lambda x: len(x['session_id'].unique()) > 1)
#abb
#abb.groupby(['worker_id']).get_group(41202032)
#abb.groupby(['worker_id']).get_group(6476374) #- does not find it - it's correct
# --
# a = [1,2,3]
# b = [2,3,4]
# data = pd.DataFrame()
# data['a'] = pd.Series(a)
# data['b'] = pd.Series(b)
# data.head()
# data['a'][~data['a'].isin(data['b'])]
# data['a'][data['a'].isin(data['b'])]
# data['a'].isin(data['b'])
# data[~data['a'].isin(data['b'])]
# ----------- end of testing Pandas

Experiment-based Hypotheses

Normality tests and statistical tests to analyse the difference between the means (in measurement X) of two populations (experiment in setting A and experiment in setting B).

# There is no worker that appears in both settings (A and B)
print(len(exp1A[exp1A['worker_id'].isin(exp1B['worker_id'])]))
print(len(exp2A[exp2A['worker_id'].isin(exp2B['worker_id'])]))
print(len(log1A[log1A['worker_id'].isin(log1B['worker_id'])]))
print(len(log2A[log2A['worker_id'].isin(log2B['worker_id'])]))
print(len(log1B[log1B['worker_id'].isin(log1C['worker_id'])]))
from scipy.stats import shapiro
from scipy.stats import normaltest
from scipy.stats import anderson

# Input: series has the sample whose distribution we want to test
# Output: gaussian boolean True if it is normal distribution and False otherwise.
def testNormality(series):
    
    alpha = 0.05
    
    # Only if all tests say normal do we consider it normal; if any test rejects, it is NOT normal.
    
    # Shapiro-Wilk Test - for smaller data sets, around thousands of records
    print('length of series in Shapiro is: '+ str(len(series)))
    stats1, p1 = shapiro(series)
    print('Statistics Shapiro-Wilk Test =%.3f, p=%.3f' % (stats1, p1))
    gaussian1 = p1 > alpha
    print('Shapiro-Wilk says it is Normal: '+ str(gaussian1))
    
    # D'Agostino and Pearson's Test
    stats2, p2 = normaltest(series)
    print('Statistics D\'Agostino and Pearson\'s Test=%.3f, p=%.3f' % (stats2, p2))
    gaussian2 = p2 > alpha
    print('D\'Agostino and Pearson\'s says it is Normal: '+ str(gaussian2))
    
    gaussian = gaussian1 and gaussian2
    
    # Anderson-Darling Test
    '''result = anderson(series) 
    print('Statistic: %.3f' % result.statistic)
    for i in range(len(result.critical_values)):
        sl, cv = result.significance_level[i], result.critical_values[i]
        if result.statistic > result.critical_values[i]:
            gaussian = False'''
        
    
    return gaussian
    
'''
from scipy.stats import ttest_ind
from scipy.stats import mannwhitneyu

# Input:
# series1 is the series with the set of measurements for every single worker in case A of controlled experiment
# series2 is the series with the set of measurements for every single worker in case B of controlled experiment
# gaussian is the boolean value indicating if the samples have passed the test of normality or not (True is apply parametric test)
# Output:
# stats of statistical test 
# p-value 
# acceptHo (True if we fail to reject it and False if we reject it) 
# See also all tables for picking the tests (e.g., https://help.xlstat.com/customer/en/portal/articles/2062457-which-statistical-test-should-you-use-)
def compareTwoSamples(series1,series2, gaussian):
    # Tests to compare two samples (H0: they have equal distribution; H1: they have different distribution)
    
    alpha = 0.05
    acceptH0 = False
    
    if (gaussian == True):
        # Run Student's T-test
        stats, p = ttest_ind(series1, series2)
        print('Statistics=%.3f, p=%.3f' % (stats, p))
        
    else:
        
        # Run Mann-Whitney; Kruskal-Wallis test is for more samples.
        stats, p = mannwhitneyu(series1, series2)
        print('Statistics=%.3f, p=%.3f' % (stats, p))
        
        # result - hypothesis testing
   
    if p > alpha:
        acceptH0 = True
    
    print('The two samples have the same distribution: ' + str(acceptH0))
    return stats,p,acceptH0        
        
    
'''
# This is the implementation I had, with variable acceptH0: True (fail to reject) and False (reject). Since multipletests from statsmodels and other libraries use True for reject and False for fail to reject, I changed this to avoid confusion.
    
from scipy.stats import ttest_ind
from scipy.stats import mannwhitneyu

# Input:
# series1 is the series with the set of measurements for every single worker in case A of controlled experiment
# series2 is the series with the set of measurements for every single worker in case B of controlled experiment
# gaussian is the boolean value indicating if the samples have passed the test of normality or not (True is apply parametric test)
# Output:
# stats of statistical test 
# p-value 
# rejectH0 (True if we reject it and False if we fail to reject it (i.e., accept)) 
# See also all tables for picking the tests (e.g., https://help.xlstat.com/customer/en/portal/articles/2062457-which-statistical-test-should-you-use-)
def compareTwoSamples(series1,series2, gaussian):
    # Tests to compare two samples (H0: they have equal distribution; H1: they have different distribution)
    
    alpha = 0.05
    rejectH0 = True
    
    if (gaussian == True):
        # Run Student's T-test
        stats, p = ttest_ind(series1, series2)
        print('Statistics=%.3f, p=%.3f' % (stats, p))
        
    else:
        
        # Run Mann-Whitney; Kruskal-Wallis test is for more samples.
        stats, p = mannwhitneyu(series1, series2)
        print('Statistics=%.3f, p=%.3f' % (stats, p))
        
        # result - hypothesis testing
   
    if p > alpha:
        rejectH0 = False
    
    print('The two samples have a statistically different distribution: ' + str(rejectH0))
    # reject True means we go for H1 which is that the populations do not have the same means
    return stats,p,rejectH0        
        
    

1. Work Done

Idea: People who abandon *try longer* when they see more value / potential in the HIT. The more reward / the more documents, the more value the HIT has for a worker. Workers may abandon because they fear getting a "bad reputation", but when the reward is higher, the extrinsic motivation is stronger and one could think that they try longer (either clicking the answers or restarting the process after having closed it).

  • We have two pairs of populations to compare:
    • Experiment 1 (A and B) and
    • Experiment 2 (A and B)
  • We measure "trying longer" using two different measurements:
    • The number of sessions: start, leave, start again
    • The number of messages: they go further in the process (e.g., they click on many answers instead of staying at the start)

Functions to compute the measurements

def getSessionCount(df):
    dfG = df.groupby(['worker_id'])
    sessionCounts = dfG.apply(lambda x: len(x['session_id'].unique()))
    sessionCountsRI = sessionCounts.reset_index()
    del(sessionCountsRI['worker_id'])
    sessionCountsRI.columns=['sessionCount']
    return sessionCountsRI
def getMessageCount(df):
    dfG = df.groupby(['worker_id'])
    messageCounts = dfG.apply(lambda x: len(x['message']))
    messageCountsRI = messageCounts.reset_index()
    del(messageCountsRI['worker_id'])
    messageCountsRI.columns=['messageCount']
    return messageCountsRI

SessionCount

sessionC_aban_1A= getSessionCount(aban_1A)
sessionC_aban_1B= getSessionCount(aban_1B)
sessionC_aban_1C= getSessionCount(aban_1C)

sessionC_aban_2A= getSessionCount(aban_2A)
sessionC_aban_2B= getSessionCount(aban_2B)
print(sessionC_aban_1A.describe())
print(sessionC_aban_1B.describe())
print(sessionC_aban_1C.describe())

print(sessionC_aban_2A.describe())
print(sessionC_aban_2B.describe())

Are the populations (in both pairs) normally distributed?

norm_sessionC_aban_1A = testNormality(sessionC_aban_1A)
print("final: " + str(norm_sessionC_aban_1A))
norm_sessionC_aban_1B = testNormality(sessionC_aban_1B)
print("final: " + str(norm_sessionC_aban_1B))
norm_sessionC_aban_1C = testNormality(sessionC_aban_1C)
print("final: " + str(norm_sessionC_aban_1C))
norm_sessionC_aban_2A = testNormality(sessionC_aban_2A)
print("final: " + str(norm_sessionC_aban_2A))
norm_sessionC_aban_2B = testNormality(sessionC_aban_2B)
print("final: " + str(norm_sessionC_aban_2B))

Exp1_H0: means of sessionCount are equal in both populations

Exp1_H1: means of sessionCount are not equal in both populations

normal = norm_sessionC_aban_1A and norm_sessionC_aban_1B
print('Abandoned 1A and 1B')
stats, p1sc, reject1sc = compareTwoSamples(sessionC_aban_1A, sessionC_aban_1B, normal )
ax = sns.distplot(sessionC_aban_1A)
ax.set_xlabel("Number of sessions per worker")
ax.set_ylabel("Frequency")
ax.set_title("Distribution of session count abandoned group Experiment 1 - setting A")
ax = sns.distplot(sessionC_aban_1B)
ax.set_xlabel("Number of sessions per worker")
ax.set_ylabel("Frequency")
ax.set_title("Distribution of session count abandoned group Experiment 1 - setting B")

Exp2_H0: means of sessionCount are equal in both populations

Exp2_H1: means of sessionCount are not equal in both populations

normal = norm_sessionC_aban_2A and norm_sessionC_aban_2B
print('Abandoned 2A and 2B')
stats,p2sc,reject2sc = compareTwoSamples(sessionC_aban_2A, sessionC_aban_2B, normal)
ax = sns.distplot(sessionC_aban_2A)
ax.set_xlabel("Number of sessions per worker")
ax.set_ylabel("Frequency")
ax.set_title("Distribution of session count abandoned group Experiment 2 - setting A")

ExpQ_H0: means of sessionCount are equal in both populations

ExpQ_H1: means of sessionCount are not equal in both populations

ax = sns.distplot(sessionC_aban_1B)
ax.set_xlabel("Number of sessions per worker")
ax.set_ylabel("Frequency")
ax.set_title("Distribution of session count abandoned group Experiment 1 - setting B")
normal = norm_sessionC_aban_1C and norm_sessionC_aban_1B
print('Abandoned 1C and 1B')
stats,pqsc,rejectqsc= compareTwoSamples(sessionC_aban_1C, sessionC_aban_1B, normal)
ax = sns.distplot(sessionC_aban_1C)
ax.set_xlabel("Number of sessions per worker")
ax.set_ylabel("Frequency")
ax.set_title("Distribution of session count abandoned group Experiment 1 - setting C")
normal = norm_sessionC_aban_2B and norm_sessionC_aban_1C
print('Abandoned 2B and 1C')
stats,pq2sc,rejectq2sc = compareTwoSamples(sessionC_aban_2B, sessionC_aban_1C, normal)
 

Number of Messages

messageC_aban_1A= getMessageCount(aban_1A)
messageC_aban_1B= getMessageCount(aban_1B)
messageC_aban_1C= getMessageCount(aban_1C)

messageC_aban_2A= getMessageCount(aban_2A)
messageC_aban_2B= getMessageCount(aban_2B)
print(messageC_aban_1A.describe())
print(messageC_aban_1B.describe())
print(messageC_aban_1C.describe())

print(messageC_aban_2A.describe())
print(messageC_aban_2B.describe())

Normality

norm_messageC_aban_1A = testNormality(messageC_aban_1A)
print("final: " + str(norm_messageC_aban_1A))
norm_messageC_aban_1B = testNormality(messageC_aban_1B)
print("final: " + str(norm_messageC_aban_1B))
norm_messageC_aban_1C = testNormality(messageC_aban_1C)
print("final: " + str(norm_messageC_aban_1C))

norm_messageC_aban_2A = testNormality(messageC_aban_2A)
print("final: " + str(norm_messageC_aban_2A))
norm_messageC_aban_2B = testNormality(messageC_aban_2B)
print("final: " + str(norm_messageC_aban_2B))

Exp1_H0: means of messageCount are equal in both populations

Exp1_H1: means of messageCount are not equal in both populations

normal = norm_messageC_aban_1A and norm_messageC_aban_1B
print('Abandoned 1A and 1B')
stats,p1mc,reject1mc = compareTwoSamples(messageC_aban_1A, messageC_aban_1B, normal )
ax = sns.distplot(messageC_aban_1A)
ax.set_xlabel("Number of messages per worker")
ax.set_ylabel("Frequency")
ax.set_title("Distribution of session count abandoned group Experiment 1 - setting A")
ax = sns.distplot(messageC_aban_1B)
ax.set_xlabel("Number of messages per worker")
ax.set_ylabel("Frequency")
ax.set_title("Distribution of session count abandoned group Experiment 1 - setting B")

Exp2_H0: means of messageCount are equal in both populations

Exp2_H1: means of messageCount are not equal in both populations

normal = norm_messageC_aban_2A and norm_messageC_aban_2B
print('Abandoned 2A and 2B')
stats,p2mc,reject2mc = compareTwoSamples(messageC_aban_2A, messageC_aban_2B, normal)
ax = sns.distplot(messageC_aban_2A)
ax.set_xlabel("Number of messages per worker")
ax.set_ylabel("Frequency")
ax.set_title("Distribution of session count abandoned group Experiment 2 - setting A")
ax = sns.distplot(messageC_aban_2B)
ax.set_xlabel("Number of messages per worker")
ax.set_ylabel("Frequency")
ax.set_title("Distribution of session count abandoned group Experiment 2 - setting B")
print('values in population A')
messageC_aban_2A['messageCount'].value_counts()
print('values in population B')
messageC_aban_2B['messageCount'].value_counts()
messageC_aban_2A.hist(log=True)

ExpQ_H0: means of messageCount are equal in both populations

ExpQ_H1: means of messageCount are not equal in both populations

normal = norm_messageC_aban_1C and norm_messageC_aban_1B
print('Abandoned 1C and 1B')
stats,pqmc,rejectqmc = compareTwoSamples(messageC_aban_1C, messageC_aban_1B, normal)
ax = sns.distplot(messageC_aban_1C)
ax.set_xlabel("Number of messages per worker")
ax.set_ylabel("Frequency")
ax.set_title("Distribution of session count abandoned group Experiment 2 - setting C")
normal = norm_messageC_aban_2B and norm_messageC_aban_1C
print('Abandoned 2B and 1C')
stats,pq2mc,rejectq2mc = compareTwoSamples(messageC_aban_2B, messageC_aban_1C, normal)

2. Time invested

Functions to compute the measurements

group = log1A.groupby(['worker_id','session_id'])['server_time']
time = group.apply(lambda x: (x.max() - x.min()).total_seconds())
# testing
# group.get_group((6476374, '0.8lvbip6m'))
# group.get_group((6476374, '0.8lvbip6m')).max()-group.get_group((6476374, '0.8lvbip6m')).min()
def getTimePerSession(df):
   
    '''
    dfG = df.groupby(['worker_id','session_id'])['server_time']
    times = dfG.apply(lambda x: (x.max() - x.min()).total_seconds())
    timesRI = times.reset_index()
    del(timesRI['worker_id'])
    del(timesRI['session_id'])
    timesRI.columns=['time']
    
    print('original df shape:')
    print(df.shape)
    
    print('timeRI  shape:')
    print(timesRI.shape)
    return timesRI
    '''

    dfG = df.groupby(['worker_id','session_id'])['server_time']
    times = dfG.apply(lambda x: (x.max() - x.min()).total_seconds())
   
    timesRI = times.reset_index()
    timesRI.columns=['worker_id','session_id','time_spent']
    
    
    timesPerSession = timesRI.groupby(['worker_id'])['time_spent'].mean()
    timesPerSRI = timesPerSession.reset_index()
   
    del(timesPerSRI ['worker_id'])
   
    timesPerSRI.columns=['avgtimesession']
   
    return timesPerSRI
    
    
# more?

Time per session

sessionTime_aban_1A= getTimePerSession(aban_1A)
sessionTime_aban_1B= getTimePerSession(aban_1B)
sessionTime_aban_1C= getTimePerSession(aban_1C)

sessionTime_aban_2A= getTimePerSession(aban_2A)
sessionTime_aban_2B= getTimePerSession(aban_2B)
print(sessionTime_aban_1A.describe())
print(sessionTime_aban_1B.describe())
print(sessionTime_aban_1C.describe())

print(sessionTime_aban_2A.describe())
print(sessionTime_aban_2B.describe())

Normality

norm_sessionTime_aban_1A = testNormality(sessionTime_aban_1A)
print("final: " + str(norm_sessionTime_aban_1A))
norm_sessionTime_aban_1B = testNormality(sessionTime_aban_1B)
print("final: " + str(norm_sessionTime_aban_1B))
norm_sessionTime_aban_1C = testNormality(sessionTime_aban_1C)
print("final: " + str(norm_sessionTime_aban_1C))

norm_sessionTime_aban_2A = testNormality(sessionTime_aban_2A)
print("final: " + str(norm_sessionTime_aban_2A))
norm_sessionTime_aban_2B = testNormality(sessionTime_aban_2B)
print("final: " + str(norm_sessionTime_aban_2B))

Exp1_H0: means of time per session are equal in both populations

Exp1_H1: means of time per session are not equal in both populations

normal = norm_sessionTime_aban_1A and norm_sessionTime_aban_1B
print('Abandoned 1A and 1B')
stats,p1ts,reject1ts = compareTwoSamples(sessionTime_aban_1A, sessionTime_aban_1B, normal )
ax = sns.distplot(sessionTime_aban_1A)
ax.set_xlabel("AVG Time per session per worker")
ax.set_ylabel("Frequency")
ax.set_title("Distribution of session count abandoned group Experiment 1 - setting A")
ax = sns.distplot(sessionTime_aban_1B)
ax.set_xlabel("AVG Time per session per worker")
ax.set_ylabel("Frequency")
ax.set_title("Distribution of session count abandoned group Experiment 1 - setting B")

Exp2_H0: means of time per session are equal in both populations

Exp2_H1: means of time per session are not equal in both populations

normal = norm_sessionTime_aban_2A and norm_sessionTime_aban_2B
print('Abandoned 2A and 2B')
stats,p2ts,reject2ts = compareTwoSamples(sessionTime_aban_2A, sessionTime_aban_2B, normal)
ax = sns.distplot(sessionTime_aban_2A)
ax.set_xlabel("AVG Time per session per worker")
ax.set_ylabel("Frequency")
ax.set_title("Distribution of session count abandoned group Experiment 2 - setting A")
ax = sns.distplot(sessionTime_aban_2B)
ax.set_xlabel("AVG Time per session per worker")
ax.set_ylabel("Frequency")
ax.set_title("Distribution of session count abandoned group Experiment 2 - setting B")

ExpQ_H0: means of time per session are equal in both populations

ExpQ_H1: means of time per session are not equal in both populations

normal = norm_sessionTime_aban_1C and norm_sessionTime_aban_1B
print('Abandoned 1C and 1B')
stats,pqts,rejectqts = compareTwoSamples(sessionTime_aban_1C, sessionTime_aban_1B, normal)
ax = sns.distplot(sessionTime_aban_1C)
ax.set_xlabel("AVG Time per session per worker")
ax.set_ylabel("Frequency")
ax.set_title("Distribution of session count abandoned group Experiment 2 - setting C")
normal = norm_sessionTime_aban_2B and norm_sessionTime_aban_1C
print('Abandoned 2B and 1C')
stats,pq2ts,rejectq2ts = compareTwoSamples(sessionTime_aban_2B, sessionTime_aban_1C, normal)
sessionTime_aban_1C.hist(log=True)
sessionTime_aban_2B.hist(log=True)

Plots distributions together

print(messageC_aban_1A.head())

sns.distplot(messageC_aban_1A)
sns.distplot(messageC_aban_1B)

plt.show()

Corrections

ps = pd.Series([p1sc,p2sc,pqsc,pq2sc,p1mc,p2mc,pqmc,pq2mc,p1ts,p2ts,pqts,pq2ts])
ps.head(11)
corrected_p = multipletests(ps, alpha=0.05, method='sidak')
print('WHAT WE WANT IS REJECT TRUE, BECAUSE THAT WOULD SHOW A STATISTICALLY SIGNIFICANT DIFFERENCE BETWEEN THE TWO POPULATIONS')
print(str(reject1sc)+' '+str(reject2sc)+' '+str(rejectqsc)+' '+str(rejectq2sc)+' '+str(reject1mc)+' '+str(reject2mc)+' '+str(rejectqmc)+' '+str(rejectq2mc)+' '+str(reject1ts)+' '+str(reject2ts)+' '+str(rejectqts)+' '+str(rejectq2ts))
print(corrected_p)
print('p < 0.05 is reject H0, which is accept H1')
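For readability, one could unpack the multipletests tuple (reject flags, corrected p-values, Šidák- and Bonferroni-corrected alphas); a minimal sketch, with labels matching the order of ps above:

```
# Sketch: unpack the multipletests output into labelled lines.
reject_flags, p_corr, alpha_sidak, alpha_bonf = corrected_p
labels = ['1sc','2sc','qsc','q2sc','1mc','2mc','qmc','q2mc','1ts','2ts','qts','q2ts']
for name, rej, p in zip(labels, reject_flags, p_corr):
    print('%s: corrected p=%.4f -> reject H0: %s' % (name, p, rej))
print('Sidak-corrected alpha: %.4f' % alpha_sidak)
```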

ANOVA

Building the data for ANOVA

```
sessionC | exp | variant
    1    |  1  |   A
    3    |  1  |   A
    1    |  1  |   A
   ...
    1    |  1  |   B
    2    |  1  |   B
    1    |  1  |   B
   ...
    1    |  1  |   C
    2    |  1  |   C
    1    |  1  |   C
   ...
    1    |  2  |   A
    1    |  2  |   A
   ...
    1    |  2  |   B
    1    |  2  |   B
   ...
```

def add_effect_size(aov, conf_value=0.05):
    # mean squared error of the residual (last row of the ANOVA table)
    mse = aov['sum_sq'][-1]/aov['df'][-1]
    aov['omega_sq'] = 'NaN'
    # omega squared effect size per factor (all rows except the residual)
    aov['omega_sq'] = (aov[:-1]['sum_sq']-(aov[:-1]['df']*mse))/(sum(aov['sum_sq'])+mse)
    aov['passed?'] = aov['PR(>F)'] < conf_value
    return aov
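The effect size added by add_effect_size is omega squared, computed per factor row exactly as in the code above:

$$\omega^2 = \frac{SS_{\text{factor}} - df_{\text{factor}} \cdot MS_{\text{error}}}{SS_{\text{total}} + MS_{\text{error}}}, \qquad MS_{\text{error}} = \frac{SS_{\text{residual}}}{df_{\text{residual}}}$$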
def buildDataForAnova(series1A,series1B,series1C,series2A,series2B,withC):
    

    s1A = pd.DataFrame(series1A)
    s1A['qual'] = '0'
    s1A['pay_per_doc'] = 0.017
    s1A['length'] = 6
    s1A['doc_set'] = '1'

    s1B = pd.DataFrame(series1B)
    s1B['qual'] = '0'
    s1B['pay_per_doc'] = 0.05
    s1B['length'] = 6
    s1B['doc_set'] = '2'

    s1C = pd.DataFrame(series1C)
    s1C['qual'] = '1'
    s1C['pay_per_doc'] = 0.05
    s1C['length'] = 6
    s1C['doc_set'] = '2'

    s2A = pd.DataFrame(series2A)
    s2A['qual'] = '0'
    s2A['pay_per_doc'] = 0.05
    s2A['length'] = 3
    s2A['doc_set'] = '1'

    s2B = pd.DataFrame(series2B)
    s2B['qual'] = '0'
    s2B['pay_per_doc'] = 0.05
    s2B['length'] = 6
    s2B['doc_set'] = '2'
    
   
    
    data_anova= pd.DataFrame() 
    data_anova = pd.concat([s1A.reset_index(),s1B.reset_index(),s1C.reset_index(),s2A.reset_index(),s2B.reset_index()])
#    data_anova = pd.concat([s1B.reset_index(),s1C.reset_index(),s2A.reset_index(),s2B.reset_index()])
    print(data_anova.shape)
    print(data_anova.columns)
    print(data_anova.isnull().values.any())
   
    
    
    if withC==True :
        return data_anova
    else:
        return data_anova[~(data_anova['qual']=='1')]
    
'''def buildDataForAnova(series1A,series1B,series1C,series2A,series2B,withC):
    
    s1A = pd.DataFrame(series1A)
    s1A['exp'] = 1
    s1A['variant'] = 'A'
    #s1A['qual'] = 0

    s1B = pd.DataFrame(series1B)
    s1B['exp'] = 1
    s1B['variant'] = 'B'
    #s1B['qual'] = 0

    s1C = pd.DataFrame(series1C)
    s1C['exp'] = 1
    s1C['variant'] = 'C'
    #s1C['qual'] = 1

    s2A = pd.DataFrame(series2A)
    s2A['exp'] = 2
    s2A['variant'] = 'A'
    #s2A['qual'] = 0

    s2B = pd.DataFrame(series2B)
    s2B['exp'] = 2
    s2B['variant'] = 'B'
    #s1B['qual'] = 0
    
   
    
    data_anova= pd.DataFrame() 
    data_anova = pd.concat([s1A.reset_index(),s1B.reset_index(),s1C.reset_index(),s2A.reset_index(),s2B.reset_index()])
    print(data_anova.shape)
    print(data_anova.columns)
    print(data_anova.isnull().values.any())
   
    
    
    if withC==True :
        return data_anova
    else:
        #return 
        return data_anova[~data_anova['variant'].isin(['C'])]
  '''  
# SessionC
data_anova_wC_sessionC = buildDataForAnova(sessionC_aban_1A,sessionC_aban_1B,sessionC_aban_1C,sessionC_aban_2A,sessionC_aban_2B, True)
data_anova_withoutC_sessionC = buildDataForAnova(sessionC_aban_1A,sessionC_aban_1B,sessionC_aban_1C,sessionC_aban_2A,sessionC_aban_2B, False)
# MessageC
data_anova_wC_messageC = buildDataForAnova(messageC_aban_1A,messageC_aban_1B,messageC_aban_1C,messageC_aban_2A,messageC_aban_2B, True)
data_anova_withoutC_messageC = buildDataForAnova(messageC_aban_1A,messageC_aban_1B,messageC_aban_1C,messageC_aban_2A,messageC_aban_2B, False)
# time
data_anova_wC_time = buildDataForAnova(sessionTime_aban_1A,sessionTime_aban_1B,sessionTime_aban_1C,sessionTime_aban_2A,sessionTime_aban_2B, True)
data_anova_withoutC_time = buildDataForAnova(sessionTime_aban_1A,sessionTime_aban_1B,sessionTime_aban_1C,sessionTime_aban_2A,sessionTime_aban_2B, False)
d =  pd.concat([data_anova_wC_sessionC['sessionCount'],data_anova_wC_messageC['messageCount'],data_anova_wC_time['avgtimesession'],data_anova_wC_sessionC[['qual','pay_per_doc','length','doc_set']]], axis=1, sort=False)
d_withoutC = d[~(d['qual']=='1')]
print(len(d),len(d_withoutC))
len(sessionC_aban_1C) + 396
d.head()
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

#
d_withoutC.head()
d_withoutC[['sessionCount','messageCount','length']] = d_withoutC[['sessionCount','messageCount','length']].apply(lambda x: pd.DataFrame.astype(x,dtype=np.float64))
d[['sessionCount','messageCount','length']] = d[['sessionCount','messageCount','length']].apply(lambda x: pd.DataFrame.astype(x,dtype=np.float64))
d.dtypes, d_withoutC.dtypes
#ALEX MAN(C)OVA It fails!
mod = ols(formula='sessionCount ~ pay_per_doc * length', data=d_withoutC)
res = mod.fit()
print(res.summary())
from statsmodels.multivariate import manova
#ALEX MAN(C)OVA also this fails!
mod = manova.MANOVA.from_formula(formula='messageCount+avgtimesession+sessionCount ~ pay_per_doc * length', data=d_withoutC)
#x = d_withoutC[['pay_per_doc','length']]
#x = sm.add_constant(x, prepend=False)
#mod = manova.MANOVA(endog=d_withoutC[['avgtimesession','messageCount']],exog=x)
res = mod.fit()
'''
model_sessionC = ols(formula='sessionCount ~ C(exp) * C(variant)', data=data_anova_withoutC_sessionC).fit()
#model = ols(formula='sessionCount ~ C(exp) * C(variant) * C(qual)', data=data_anova_withoutC_sessionC).fit()
 
aov_table_sessionC = anova_lm(model_sessionC, typ=2) 
aov_table_sessionC = add_effect_size(aov_table_sessionC)
print(aov_table_sessionC)

model_messageC = ols(formula='messageCount ~ C(exp) * C(variant)', data=data_anova_withoutC_messageC).fit()
#model = ols(formula='sessionCount ~ C(exp) * C(variant) * C(qual)', data=data_anova_withoutC_sessionC).fit()
 
aov_table_messageC = anova_lm(model_messageC, typ=2) 
aov_table_messageC = add_effect_size(aov_table_messageC)
print(aov_table_messageC)

model_time = ols(formula='time ~ C(exp) * C(variant)', data=data_anova_withoutC_time).fit()
#model = ols(formula='sessionCount ~ C(exp) * C(variant) * C(qual)', data=data_anova_withoutC_sessionC).fit()
 
aov_table_time = anova_lm(model_time, typ=2) 
aov_table_time = add_effect_size(aov_table_time)
print(aov_table_time)
'''
# Plots
''' OLD
import seaborn as sns


def plot_anova (data,y):
    sns.set(style="whitegrid")

    # Draw a pointplot to show pulse as a function of three categorical factors
    g = sns.catplot(x="variant", y=y, col="exp",capsize=.2, palette="YlGnBu_d", height=6, aspect=.75,kind="point", data=data)
    g.despine(left=True)
    
    '''
#!pip install --upgrade seaborn
#plot_anova(data_anova_withoutC_sessionC,y="sessionCount")
#plot_anova(data_anova_withoutC_messageC,y="messageCount")
#plot_anova(data_anova_withoutC_time,y="time")
# Improved
# joined df
 
#without C, categorical
for y in ['messageCount','sessionCount','avgtimesession']:
    model_session = ols(formula= y+ ' ~ C(pay_per_doc) * C(length) ', data=d_withoutC).fit()
    aov_table_session = anova_lm(model_session, typ=2) 
    aov_table_session = add_effect_size(aov_table_session)
    print('------------')
    print('RESULTS for ',y)
    display(aov_table_session)
# without C, non-categorical (not safe)
# for y in ['messageCount','sessionCount','avgtimesession']:
#     model_session = ols(formula= y + ' ~ pay_per_doc * length', data=d_withoutC).fit()
#     aov_table_session = anova_lm(model_session, typ=2)
#     aov_table_session = add_effect_size(aov_table_session)
#     print('------------')
#     print('RESULTS for ', y)
#     display(aov_table_session)
#additional ANOVA for C (1B,1C,2B)
d_forC = d[(d['pay_per_doc']==0.05 ) &(d['length']==6.0)]
#display(d_forC.head())
#display(d_forC.describe())
for y in ['messageCount','sessionCount','avgtimesession']:
    model_session = ols(formula= y+ ' ~ C(qual)', data=d_forC).fit()
    aov_table_session = anova_lm(model_session, typ=1) 
    aov_table_session = add_effect_size(aov_table_session)
    print('------------')
    print('RESULTS for ',y)
    display(aov_table_session)
# with C, categorical (can't trust it! too sparse)
# for y in ['messageCount','sessionCount','avgtimesession']:
#     model_session = ols(formula= y + ' ~ ( C(pay_per_doc) + C(length) + C(qual) )**3 ', data=d).fit()
#     aov_table_session = anova_lm(model_session, typ=2)
#     aov_table_session = add_effect_size(aov_table_session)
#     print('------------')
#     print('RESULTS for ', y)
#     display(aov_table_session)

# with C, non-categorical (can't trust it! too sparse)
# for y in ['messageCount','sessionCount','avgtimesession']:
#     model_session = ols(formula= y + ' ~ ( pay_per_doc + length + C(qual) )**3 ', data=d).fit()
#     aov_table_session = anova_lm(model_session, typ=2)
#     aov_table_session = add_effect_size(aov_table_session)
#     print('------------')
#     print('RESULTS for ', y)
#     display(aov_table_session)

Discussion of the results

For all experiments without quality control, we performed a 2-way ANOVA (considering pay_per_doc and length factor effects) on 'messageCount', 'sessionCount', and 'avgtimesession'. The length significantly affects ($p<0.05$) all three measures, with a large effect size ($\omega^2>0.06$; medium ($\omega^2>0.01$) for sessionCount). No significant interaction effects have been measured. Regarding the quality control, we performed a one-way ANOVA over experiments 1B, 1C, 2B (the ones that share the same length and pay per document). The quality control affected ($p<0.05$) all three measures, with a large effect size ($\omega^2>0.06$; medium ($\omega^2>0.01$) for sessionCount).

 
#One way ANOVA C
 
 
# Writing
 
 
 

Discuss: Prediction task? classification task: worker_to_abandon (C1), worker_to_complete (C2)?

#when it's done we want to make both the comparisons 1C-1B and 1C-2B