Abandonment - analysis of single task (HIT) - Part III

Step 6 - Measuring the Quality (comparing those who abandoned with those who completed the tasks)

  • In this section, we extract the responses of all workers, both those who abandoned and those who submitted, and check their responses against the ground truth.
  • Each HIT contains 8 questions, and we have ground truth for only some of them (e.g. 3 out of 8). Since these are S4 tasks, the answer scale has 4 levels with values {0, 1, 2, 3}. We therefore initialize every question value to -1 and then, iterating over the ground truth, fill in the 8-element value vector of each HIT. Elements that remain -1 are removed, together with the worker answers at the same indices, because there is no ground truth to compare those answers against.
  • Applying the pairwise agreement algorithm (see Section 4.2.1 of On Fine Grained Relevance Scales for the formal definition), we estimate the agreement value of each HIT from the questions that have a ground truth value in our database.
  • Since we allow workers up to 3 attempts at a HIT if they fail the Final Check, the results that submitting workers eventually turned in may differ from their first-attempt answers. For those who abandoned, we use the answers from their first attempts. (The consistency implications of this choice will be discussed later.)
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

import time, os, json
import itertools

DATA_DIR = "data-180614/"
DATA_PROCESSED_DIR = "data-processed/"
# Algorithm of *pairwise agreement*; scores1 is the ground truth
def tomAgree(scores1, scores2, which_group = 'first'):
    if which_group=='second':
        scores1, scores2 = scores2, scores1
    assert len(scores1) == len(scores2), 'Error on lengths'
    # degenerate case: a single ground-truth level yields no cross-level pairs
    if len(np.unique(scores1))==1:
        return 1
    # sort both score lists by the ground-truth values
    scores1, scores2 = (list(t) for t in zip(*sorted(zip(scores1, scores2))))
    # bucket the worker scores by their ground-truth level
    groups = np.unique(scores1)
    lst_groups = [np.array([]) for i in range(len(groups))]
    dict_group = {}
    for i in range(len(groups)):
        dict_group[groups[i]] = i
    for i in range(len(scores1)):
        lst_groups[dict_group[scores1[i]]] = np.append( lst_groups[dict_group[scores1[i]]], scores2[i])
    # form every couple of worker scores drawn from two different ground-truth levels
    groups_to_check  = list(  itertools.combinations(range(len(lst_groups)), 2)  )
    list_couples =  [ list(itertools.product(lst_groups[j[0]], lst_groups[j[1]])) for j in groups_to_check ]
    list_couples = [item for sublist in list_couples for item in sublist]
    # a couple agrees when the worker scores preserve the ground-truth ordering
    agreement_couple = 0
    possible_couples = len(list_couples)
    for k in list_couples:
        if k[0] < k[1]:
            agreement_couple +=1
    return float(agreement_couple)/(possible_couples)
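A quick hand-checkable example of tomAgree (the values are made up for illustration): with ground truth [0, 2, 2] and worker scores [1, 3, 0], the only cross-level couples are (1, 3) and (1, 0); the first preserves the ground-truth ordering and the second does not, so the agreement is 0.5.

# sanity check of tomAgree on made-up values
print(tomAgree([0, 2, 2], [1, 3, 0]))        # 0.5 -> one of the two couples is ordered correctly
print(tomAgree([0, 1, 2, 3], [0, 1, 2, 3]))  # 1.0 -> all six couples preserve the ordering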
# prepare data (restart point)
jsonFile = "IR_SCALE_1264351_442.json"
csvFile = "job_1264351.json"  # the job result file is JSON Lines, despite the *csv* prefix

# read DB Abandon
tmp_db = []
with open(DATA_PROCESSED_DIR + jsonFile + '.dbAbandon', 'r') as f:
    lines = f.readlines()
    for line in lines:
        lineJSON = json.loads(line.strip())
        tmp_db.append(lineJSON)

# unify the *message* field into a *json.dumps* string format
for i, r in enumerate(tmp_db):
    if (type(r['message']) == dict):
        tmp_db[i]['message'] = json.dumps(r['message'])
    else:
        tmp_db[i]['message'] = r['message'].encode('utf-8')

# convert from JSON format to pd format
dbAbandon = pd.DataFrame(tmp_db)
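For orientation, each line of the .dbAbandon log is a JSON record; judging by the fields accessed in this notebook, a record looks roughly like this (all values below are invented placeholders):

# expected shape of one abandonment-log record (placeholder values):
# {"worker_id": "w123", "unit_id": "u456", "session": "s789",
#  "timestamp": 1528970000, "message": {"step": 1, "doc": 3, "rel": 2}}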

# read CSV result
csvRead = pd.read_json(path_or_buf = DATA_DIR + csvFile, lines = True, encoding = 'utf-8', orient = "records")
jsonCSV = json.loads(csvRead.to_json(orient = "records"))
jsonAbandon = json.loads(dbAbandon.sort_values(['worker_id', 'unit_id', 'session', 'timestamp']).to_json(orient = "records"))

# read ground truth
tmp_db = []
with open(DATA_PROCESSED_DIR + jsonFile + '.groundTruth', 'r') as f:
    lines = f.readlines()
    for line in lines:
        lineJSON = json.loads(line.strip())
        tmp_db.append(lineJSON)
dbUnitGroundTruth = pd.DataFrame(tmp_db)
groundTruthArray = []
for g in tmp_db:
    groundTruthArray.append({
        "id_unit": g['id_unit'],
        "value": [int(g['doc_%d' % d][1]) for d in range(1, 9)]
    })

# remove elements whose value is -1, and record the indices of the kept ones
for g in groundTruthArray:
    newValue = []
    newIndex = []
    for i, v in enumerate(g['value']):
        if (v != -1):
            newValue.append(v)
            newIndex.append(i)
    g['groundTruth'] = newValue
    g['groundIdx'] = newIndex
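After this step, each entry of groundTruthArray carries both the masked values and their original positions. For example (made-up values):

# value       [-1, 2, -1, 0, -1, -1, 3, -1]
# groundTruth [2, 0, 3]
# groundIdx   [1, 3, 6]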
# extract CSV array (result vector for each HIT)
csvArray = []
for csv in jsonCSV:
    currUnit = csv['data']['id_unit']
    currValueList = []
    currWorker = csv['results']['judgments'][0]['worker_id']
    # the *output* field is JSON-encoded twice; element [2] of the decoded
    # list holds the per-document relevance answers
    currRs = json.loads(json.loads(csv['results']['judgments'][0]['data']['output']))[2]
    for a in currRs:
        currValueList.append(a['rel'])
    csvArray.append({"value": currValueList, "id_unit": currUnit, "worker_id": currWorker})
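Each resulting entry of csvArray then has the following shape (placeholder values):

# {"id_unit": "u456", "worker_id": "w123", "value": [2, 0, 3, 1, 2, 0, 3, 1]}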

# extract answers for those who abandoned (from their FIRST attempts)
abandonRsArray = []
currValueList = []
currObj = {}
currSession = ''
msgObj = {}
notSaved = True
for i, r in enumerate(jsonAbandon):
    if (r['session'] != currSession):  # new BROWSER session
        # flush the previous session (if its last answer vector was not saved yet)
        if (i > 0):
            if (notSaved):
                currObj['answer'].append(currValueList)
            abandonRsArray.append(currObj)
        currSession = r['session']
        currObj = {"session": currSession, "id_unit": r['unit_id'], "worker_id": r['worker_id'], "answer": []}
        currValueList = [-1, -1, -1, -1, -1, -1, -1, -1]
        notSaved = True
    if (r['message'][0] == "{"):  # only JSON messages carry answers
        msgObj = json.loads(r['message'])
        if ("step" in msgObj.keys()):
            # a *step* message records the answer to one document
            currValueList[msgObj['doc'] - 1] = msgObj['rel']
            notSaved = True
        if ("final_checks_passed" in msgObj.keys()):
            # the attempt reached the Final Check: freeze this answer vector
            currObj['answer'].append(currValueList)
            notSaved = False
    if (i == len(jsonAbandon) - 1):
        # flush the very last session
        if (notSaved):
            currObj['answer'].append(currValueList)
        abandonRsArray.append(currObj)
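The two message types this loop relies on look roughly as follows (hypothetical examples, reconstructed from the fields used above):

# {"step": 4, "doc": 3, "rel": 2}   -> the worker judged document 3 at level 2
# {"final_checks_passed": ...}      -> the attempt reached the Final Check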
# set *checkList* for each unit (keep only the answers at indices that have ground truth)
for x in csvArray:
    for y in groundTruthArray:
        if (x['id_unit'] == y['id_unit']):
            x['checkList'] = []
            for z in y['groundIdx']:
                x['checkList'].append(x['value'][z])
            break
for x in abandonRsArray:
    for y in groundTruthArray:
        if (x['id_unit'] == y['id_unit']):
            x['checkList'] = []
            for z in y['groundIdx']:
                x['checkList'].append(x['answer'][0][z])
            break
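For instance (made-up values), with groundIdx = [1, 3, 6] and an answer vector [2, 1, 3, 0, 2, 1, 3, 0], the resulting checkList is [1, 0, 3]; the same indexing is applied to the submitted answers (value) and to the first-attempt answers of those who abandoned (answer[0]).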
# compute the agreement score of each unit against its ground truth
scoreSubmit = []
scoreAbandon = []
for a in csvArray:
    s = 0
    for g in groundTruthArray:
        if (a['id_unit'] == g['id_unit']):
            s = tomAgree(g['groundTruth'], a['checkList'], which_group='first')
            scoreSubmit.append({"worker_id": a['worker_id'], "id_unit": a['id_unit'], "agreeScore": s})
            break
for a in abandonRsArray:
    stayOnRecord = True
    s = 0
    for x in a['checkList']:
        if (x == -1):
            stayOnRecord = False
    if (stayOnRecord):
        for g in groundTruthArray:
            if (a['id_unit'] == g['id_unit']):
                s = tomAgree(g['groundTruth'], a['checkList'], which_group='first')
                scoreAbandon.append({"worker_id": a['worker_id'], "id_unit": a['id_unit'], "agreeScore": s})
                break

# convert to pd format
dbSubmit_agreeScore = pd.DataFrame(scoreSubmit)
dbAbandon_agreeScore = pd.DataFrame(scoreAbandon)
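As a quick sanity check, the two tables should contain 679 and 336 rows respectively, matching the counts in the describe() output below:

print('%d submitted, %d abandoned' % (len(dbSubmit_agreeScore), len(dbAbandon_agreeScore)))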
# distribution of the score
print 'Agreement Score for those who Submitted'
dbSubmit_agreeScore.describe().agreeScore
Agreement Score for those who Submitted
count    679.000000
mean       0.924422
std        0.152000
min        0.375000
25%        1.000000
50%        1.000000
75%        1.000000
max        1.000000
Name: agreeScore, dtype: float64
print 'Agreement Score for those who Abandoned'
dbAbandon_agreeScore.describe().agreeScore
Agreement Score for those who Abandoned
count    336.000000
mean       0.540476
std        0.455557
min        0.000000
25%        0.000000
50%        0.666667
75%        1.000000
max        1.000000
Name: agreeScore, dtype: float64

Summary - 6.1 - Quality Comparison between those who Completed and those who Abandoned - (1)

  • The average Agreement Score is 92.44% for those who completed, but only 54.05% for those who abandoned.
  • For those who abandoned, workers are likely to have stopped before answering all of the questions that have ground truth. For example, if we have ground truth for Questions 3, 4, and 5 but the worker quit right after answering Questions 1-4, we lack the elements needed for a comparison, so we discard the answers of that HIT.
  • Recall that there are 1683 worker-units in the abandoned group, but only 336 of them (19.96% = 336 / 1683) cover all the questions that have ground truth.
  • The lowest score in the Completed group is 37.5%, while in the Abandoned group it is 0.
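The coverage figure can be re-derived from the arrays built above (a quick check; entries lacking a checkList, if any, are counted as uncovered):

# fraction of abandoned worker-units whose checkList is fully answered
covered = sum(1 for a in abandonRsArray if -1 not in a.get('checkList', [-1]))
print('%d / %d = %.2f%%' % (covered, len(abandonRsArray), 100.0 * covered / len(abandonRsArray)))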

# plot in histogram
fig = plt.figure()
ax1 = fig.add_subplot(111)
ax2 = fig.add_subplot(111, frame_on = False)

# both colors must be defined before either histogram is drawn
color1 = 'orange'   # completed
color2 = 'gray'     # abandoned

# agree score of abandoned
ax1.hist(dbAbandon_agreeScore['agreeScore'], label = 'abandoned', color = color2, alpha=0.5, density = False, edgecolor = 'white', linewidth = 0.7, bins = 10)
ax1.set_xlabel('Agreement Score of those who Abandoned', color = color2)
ax1.set_ylabel('frequency of occurrence', color = color2)
ax1.tick_params(axis='x', labelcolor = color2)
ax1.tick_params(axis='y', labelcolor = color2)

# agree score of completed
ax2.hist(dbSubmit_agreeScore['agreeScore'], label = 'completed', color = color1, alpha=0.5, density = False, edgecolor = 'white', linewidth = 0.7, bins = 10)
#ax2 = ax1.twinx()  # instantiate a second axes that shares the same x-axis
ax2.xaxis.tick_top()
ax2.yaxis.tick_right()
ax2.xaxis.set_label_position('top')
ax2.yaxis.set_label_position('right')
ax2.set_xlabel('Agreement Score of those who Completed', color = color1)
ax2.set_ylabel('frequency of occurrence', color = color1)
ax2.tick_params(axis = 'x', labelcolor = color1)
ax2.tick_params(axis = 'y', labelcolor = color1)

fig.legend(loc='upper right', bbox_to_anchor=(0.66, 0.87), ncol=1, fancybox=True, shadow=True)
plt.show()
# comparison of distribution
plt.hist([dbAbandon_agreeScore['agreeScore'], dbSubmit_agreeScore['agreeScore']], label = ['abandoned', 'completed'], density = True, edgecolor = 'white', linewidth = 0, bins = 10)
plt.xlabel('Agreement Score')
plt.ylabel('probability density')
plt.legend(loc = 'upper right', bbox_to_anchor = (0.75, 0.98), ncol = 1, fancybox = True, shadow = True)
plt.title('Comparison of Probability Density')
plt.show()
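Before reading percentages off the histograms, the exact share of zero scores in each group can be computed directly:

# fraction of HITs with an agreement score of exactly 0, per group
print('abandoned: %.1f%%' % (100.0 * (dbAbandon_agreeScore['agreeScore'] == 0).mean()))
print('completed: %.1f%%' % (100.0 * (dbSubmit_agreeScore['agreeScore'] == 0).mean()))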

Summary - 6.1 - Quality Comparison between those who Completed and those who Abandoned - (2)

  • The distribution of the agreement score differs markedly between those who completed and those who abandoned. More than 500 (out of 679) HITs were completed with an agreement score of 1 (100%), and nearly 100 more were completed with an agreement score between 60% and 70%. For the abandoned group, by contrast, only about 40% of the observable tasks (about 140 out of 336) reached an agreement score of 1 (100%), while another roughly 40% had a zero (0%) agreement score, which shows that the quality of those who abandoned is markedly lower than that of those who completed. The remaining observable abandoned tasks had agreement scores of around 60%.
  • The probability density comparison (the second figure) makes the contrast clear: roughly half of the observable abandoned HITs show no agreement with the ground truth at all. In other words, among the abandoned HITs we can measure, complete disagreement with the ground truth is about as likely as not, whereas it never occurs among completed HITs.
  • Here we used exact matching between the ground truth and the workers' answers. For example, if the ground truth is 3 (Highly Relevant) but the worker chooses 2 (Somewhat Relevant), we treat the answer as incorrect. In practice, however, different workers perceive relevance with different degrees of subjectivity. We will therefore soften this constraint later: when a worker chooses 2 for a question whose ground truth is 3, we could treat the answer as partially correct.
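As a preview of that softer treatment, one possible partial-credit metric (just a sketch, not necessarily the variant we will adopt) scales each answer's credit linearly with its distance from the ground truth on the 4-level scale:

# partial-credit agreement on the 4-level scale {0, 1, 2, 3}:
# full credit for an exact match, minus 1/3 for each level of distance
def softAgree(groundTruth, answers):
    assert len(groundTruth) == len(answers), 'Error on lengths'
    credits = [1.0 - abs(g - a) / 3.0 for g, a in zip(groundTruth, answers)]
    return sum(credits) / len(credits)

print(softAgree([3, 0, 2], [2, 0, 3]))  # ~0.778: two answers are off by one level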