Abandonment - analysis of a single task (HIT) - Part I

Author: Tom
Purpose: To understand abandonment on crowdsourcing platforms
Last update: 6 July, 2018
Dataset used: DIR = data-180614/ (Downloaded from AWS)

Definitions & Explanations - important

  1. Unit: a specific task. When a worker starts a task, the pair <worker, unit> identifies the specific task being done by that worker.

  2. Starts: the beginning of a task, observed in one of 2 forms, as shown below:

    • (i) any first logging message received by the server for the specific <worker, unit> pair, or
    • (ii) a start msg or Start button event that follows a Start button event or question-answering messages from a previous Session of the same <worker, unit> pair.

For the second type of Starts, workers are allowed to start the same task multiple times, probably because they failed the FINAL CHECK. In that case, the second start is logged as a new Start of the task.

Correspondingly, the End of a task is defined as the last message received by the server from the specific <worker, unit> pair if there is only one Start, or the last message before any new Start if there are multiple Starts.

  3. Session: the period of interaction between a worker and a task, from Start to End. Any second Start by the same <worker, unit> pair is treated as a second Session, regardless of whether the browser session has changed. This concept differs from the Browser Session for the following reason (a toy sketch follows this list).
    • A worker may restart the same task immediately after failing the FINAL CHECK. In that case we identify a second Session while the browser session remains the same, so Browser Session statistics alone cannot tell us how many Starts the worker had.
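
To make these definitions concrete, here is a minimal sketch on hypothetical log rows (not the pipeline code; the real logs are loaded in Step 1). It identifies the Starts, and hence the Sessions, of one <worker, unit> pair:

import pandas as pd

# hypothetical log of one <worker, unit> pair: the worker reaches the Start
# button, answers a question, fails the FINAL CHECK and begins again
log = pd.DataFrame({
    'worker_id':      [1, 1, 1, 1, 1],
    'unit_id':        ['442_0'] * 5,
    'message_number': [0, 1, 2, 0, 1],
    'message':        ['start task', 'Start button pressed.',
                       '{"msg": "radio button: change rel", "doc": 1}',
                       'start task', 'Start button pressed.'],
})

# type (i) Start: the first message of the pair;
# type (ii) Start: a later 'start task' after activity from the previous Session
starts = log[log['message'] == 'start task']
print('Starts / Sessions of this pair:', len(starts))  # -> 2
# the End of each Session is the last message before the next Start (or the last message overall)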

Outline of Analysis

Methodology of analysis

The following questions will be analysed at 5 different levels (not always all five): (i) worker level, (ii) worker-task level, (iii) session level, (iv) Browser Session level, and (v) question level.

  • (i) worker level: what differs from worker to worker
  • (ii) worker-task level: what happened while each worker was working on each of their tasks, identified by <worker, unit> pairs
  • (iii) session level: since workers can start the same task multiple times, the Session is a finer-grained unit for understanding how many attempts workers made at their tasks
  • (iv) Browser Session level: using Browser Sessions to determine how long workers were engaged in their current tasks
  • (v) question level: how workers interacted with the individual questions of their tasks

Questions to be answered:

  • How much have they done?
    • actions they performed (msg analysis) (Section Step 4 in Part I)
    • when did they quit? (time analysis) (Section Step 5 in Part II, together with a comparison with those who completed the tasks)
    • docs they judged (did they attempt any questions?)
  • Quality (Section Step 6 in Part III)
    • finished vs. abandoned
    • quality measurement: agreement rate against the ground truth
The following topics are not included in this document:
  • Does abandonment occur often?
    • comparison horizontally (across different HITs)
  • Other factors
    • reward: $0.20 --> $0.40
    • materials: doc vs. img vs. audio
  • Prediction
    • task/HIT --> model --> predict abandonment rate

Step 0 - Initialisation / Global Variables / Packages

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

import time, os, json

DATA_DIR = "/Users/sarasua/Documents/RESEARCH/collab_Abandonment/lei-analysis/data/s4/"
DATA_PROCESSED_DIR = "/Users/sarasua/Documents/RESEARCH/collab_Abandonment/lei-analysis/data/0-data-processed/"

jsonFile = "IR_SCALE_1264351_442.json"  # DB of log messages (used in Step 1)
csvFile = "job_1264351.json"            # results export (line-delimited JSON, despite the name; used in Step 2)

Step 1 - Load JSON from DB and run some statistics

db = pd.read_json(path_or_buf = DATA_DIR + jsonFile, lines = True, encoding = 'utf-8', orient = "records")
dbMsgJSON = pd.read_json(path_or_buf = DATA_PROCESSED_DIR + jsonFile + '.msgJSON', lines = True, encoding = 'utf-8', orient = "records")
dbMsgStr = pd.read_json(path_or_buf = DATA_PROCESSED_DIR + jsonFile+'.msgStr', lines = True, encoding = 'utf-8', orient = "records")
print ('(DB) columns\n'+ str(db.columns))
(DB) columns
Index(['message', 'message_number', 'server_time', 'session', 'task_id',
       'timestamp', 'topic', 'unit_id', 'worker_id'],
      dtype='object')
db.head()
message message_number server_time session task_id timestamp topic unit_id worker_id
0 start task 0 2018-05-08 00:10:38.612031 Y5030AAO 1264351 2018-05-08 00:10:37.221 442.0 442265 40687124
1 start task 0 2018-05-08 00:11:15.519643 OBLG4DP4 1264351 2018-05-08 00:07:59.345 442.0 442265 39068713
2 start task 0 2018-05-08 00:12:45.830076 OT7C59TB 1264351 2018-05-08 00:13:23.820 442.0 442637 38896482
3 Start button pressed. 1 2018-05-08 00:13:53.764660 OT7C59TB 1264351 2018-05-08 00:14:28.593 442.0 442637 38896482
4 start task 0 2018-05-08 00:14:10.599818 VXUVKXNB 1264351 2018-05-08 00:14:08.922 442.0 442241 43811497
dbMsgJSON.head()
message message_number server_time session task_id timestamp topic unit_id worker_id
0 {'msg': 'radio button: change rel', 'doc': 1, ... 2 2018-05-08 00:14:14.438607 OT7C59TB 1264351 2018-05-08 00:14:53.710 442 442637 38896482
1 {'msg': 'test not pass: bad comment value', 'd... 3 2018-05-08 00:14:15.114812 OT7C59TB 1264351 2018-05-08 00:14:55.581 442 442637 38896482
2 {'msg': 'radio button: change rel', 'doc': 1, ... 4 2018-05-08 00:15:04.544346 OT7C59TB 1264351 2018-05-08 00:15:43.673 442 442637 38896482
3 {'msg': 'radio button: change rel', 'doc': 1, ... 5 2018-05-08 00:15:04.569996 OT7C59TB 1264351 2018-05-08 00:15:44.065 442 442637 38896482
4 {'msg': 'radio button: change rel', 'doc': 1, ... 6 2018-05-08 00:15:05.141764 OT7C59TB 1264351 2018-05-08 00:15:45.522 442 442637 38896482
print('\ntotal records:', len(db), '   json msg:', len(dbMsgJSON), '   json str:', len(dbMsgStr))
print('\nUNIQUE VALUES\ntask ID:', db.task_id.unique(), ', topic:', db.topic.unique(), ', unit ID:', len(db.unit_id.unique()))
print('time stamp:', len(db.timestamp.unique()), ', server time:', len(db.server_time.unique()))
print('session:', len(db.session.unique()), ', worker ID:', len(db.worker_id.unique()))
  • (#)worker ID < (#)session: some workers started multiple times.
  • (#)unit ID < (#)worker ID: different workers attempted the same task. (n:n relationship)

Step 2 - Load CSV results and run some statistics

csv = pd.read_json(path_or_buf = DATA_DIR + csvFile, lines = True, encoding = 'utf-8', orient = "records")
print ('(CSV) columns\n', csv.columns)
(CSV) columns
 Index(['agreement', 'created_at', 'data', 'gold_pool', 'id', 'job_id',
       'judgments_count', 'missed_count', 'results', 'state', 'updated_at'],
      dtype='object')
csv.head()
agreement created_at data gold_pool id job_id judgments_count missed_count results state updated_at
0 1 2018-05-08 00:05:59 {u'topic': u'442', u'id_unit': u'442_0', u'doc... NaN 1722970310 1264351 1 0 {u'judgments': [{u'unit_data': {u'topic': u'44... finalized 2018-05-08 05:00:45
1 1 2018-05-08 00:05:59 {u'topic': u'442', u'id_unit': u'442_1', u'doc... NaN 1722970311 1264351 1 0 {u'judgments': [{u'unit_data': {u'topic': u'44... finalized 2018-05-10 13:48:47
2 1 2018-05-08 00:05:59 {u'topic': u'442', u'id_unit': u'442_2', u'doc... NaN 1722970312 1264351 1 0 {u'judgments': [{u'unit_data': {u'topic': u'44... finalized 2018-05-11 10:03:12
3 1 2018-05-08 00:05:59 {u'topic': u'442', u'id_unit': u'442_3', u'doc... NaN 1722970313 1264351 1 0 {u'judgments': [{u'unit_data': {u'topic': u'44... finalized 2018-05-09 02:11:26
4 1 2018-05-08 00:05:59 {u'topic': u'442', u'id_unit': u'442_4', u'doc... NaN 1722970314 1264351 1 0 {u'judgments': [{u'unit_data': {u'topic': u'44... finalized 2018-05-09 14:45:55
print('\nunique value')
print('agreement:', csv.agreement.unique(), '   gold_pool:', csv.gold_pool.unique(), '   job_id:', csv.job_id.unique(), '   state:', csv.state.unique())
print('judgments_count:', csv.judgments_count.unique(), '   missed_count:', csv.missed_count.unique())
unique value
agreement: [1]    gold_pool: [nan]    job_id: [1264351]    state: [u'finalized']
judgments_count: [1]    missed_count: [0]
# extract the worker ID list of those who submitted their answers
rsJSON = json.loads(csv['results'].to_json(orient = "records"))
rsWorkerList = []

# walk the judgments of every result record and collect the unique worker IDs
for rs in rsJSON:
    for j in rs['judgments']:
        if j['worker_id'] not in rsWorkerList:
            rsWorkerList.append(j['worker_id'])
print('the number(#) of workers submitting the answer:', len(rsWorkerList), len(csv))
the number(#) of workers submitting the answer: 679 679

The number(#) of unique unit IDs in *output [9]* is 677, while there are 679 answers here, so two unit IDs must be duplicated.

# check for duplicate *unit ID*s in the CSV
unitIDList = []
for rs in rsJSON:
    for j in rs['judgments']:
        uid = j["unit_data"]["id_unit"]
        if (uid not in unitIDList):
            unitIDList.append(uid)
        else:
            print(uid, '  -- duplicate')
print('unitIDList:', len(unitIDList))
442_102   -- duplicate
442_565   -- duplicate
unitIDList: 677
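
The same duplicate check can be written more compactly with pandas (an equivalent sketch, reusing rsJSON from above):

# flatten all id_unit values and report second occurrences directly
ids = pd.Series([j["unit_data"]["id_unit"] for rs in rsJSON for j in rs['judgments']])
print(ids[ids.duplicated()].tolist())  # ['442_102', '442_565']
print('unitIDList:', ids.nunique())    # 677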

Step 3 - Split JSON DB into Abandon & Submit by worker ID

  • JSON DB Abandon (Set A) = worker IDs NOT found in the CSV, including:
    1. workers who abandoned once
    2. workers who abandoned more than once (started again and then abandoned)
  • JSON DB Submit (Set B) = worker IDs found in the CSV, including:
    1. workers who submitted without abandonment (started --> submitted)
    2. workers who submitted after previous abandonment (started --> abandoned --> started again --> ... --> submitted)
# check workerID in CSV but not in JSON (javascript not working?)
dbWorkerID = db.worker_id.value_counts().index.tolist()
dbWorkerID_missing = []
for i in rsWorkerList:
    if (i not in dbWorkerID):
        dbWorkerID_missing.append(i)
print ('\nworkerID in CSV but not in JSON (javascript not working?)\n', dbWorkerID_missing)
workerID in CSV but not in JSON (javascript not working?)
 [44053585]
dbAbandon = db.loc[~db['worker_id'].isin(rsWorkerList)]
dbSubmit = db.loc[db['worker_id'].isin(rsWorkerList)]
print ('(worker ID) total:', len(db.worker_id.unique()), '   Abandoned:', len(dbAbandon.worker_id.unique()), '   Submitted:', len(dbSubmit.worker_id.unique()), '/', len(rsWorkerList))
print ('(DB/json/msg) total records:', len(db), '   Abandoned:', len(dbAbandon), '   Submitted:', len(dbSubmit))
(worker ID) total: 1849    Abandoned: 1171    Submitted: 678 / 679
(DB/json/msg) total records: 31903    Abandoned: 12424    Submitted: 19479

Restart from this point: read DB Abandon from file and unify format
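
Note that the *.dbAbandon file read below is not produced anywhere in this notebook; presumably it was written by a step like the following sketch (line-delimited JSON records, which would also explain why the timestamps reappear as epoch milliseconds in the head() output below):

# presumed write step (not shown in this notebook)
dbAbandon.to_json(DATA_PROCESSED_DIR + jsonFile + '.dbAbandon', orient = 'records', lines = True)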

tmp_db = []
with open(DATA_PROCESSED_DIR + jsonFile + '.dbAbandon', 'r') as f:
    lines = f.readlines()
    for line in lines:
        lineJSON = json.loads(line.strip())
        tmp_db.append(lineJSON)

# unify the 'message' field: dict messages are serialised via *json.dumps*, plain strings become UTF-8 bytes
for i, r in enumerate(tmp_db):
    if (type(r['message']) == dict):
        tmp_db[i]['message'] = json.dumps(r['message'])
    else:
        tmp_db[i]['message'] = r['message'].encode('utf-8')

# convert from JSON format to pd format
dbAbandon = pd.DataFrame(tmp_db)
print ('worker abandoned:', len(dbAbandon.worker_id.unique()))
worker abandoned: 1171
dbAbandon.head()
message message_number server_time session task_id timestamp topic unit_id worker_id
0 b'start task' 0 1525738238612 Y5030AAO 1264351 1525738237221 442.0 442_265 40687124
1 b'start task' 0 1525738275519 OBLG4DP4 1264351 1525738079345 442.0 442_265 39068713
2 b'start task' 0 1525738365830 OT7C59TB 1264351 1525738403820 442.0 442_637 38896482
3 b'Start button pressed.' 1 1525738433764 OT7C59TB 1264351 1525738468593 442.0 442_637 38896482
4 b'start task' 0 1525738450599 VXUVKXNB 1264351 1525738448922 442.0 442_241 43811497

Step 4 - Analysis of Actions / msg before Abandonment

  • In this section, the message logs received from worker browsers are analysed along three dimensions: (i) the actions (number of msg) performed by each worker, (ii) the number of units engaged by each worker, and (iii) how many times each worker ever started the task (once, twice, ...).
  • Based on these three dimensions, an analysis on the worker-task level (different <worker, unit> pairs) is performed as well.
  • Each sub-section ends with a summary, and the section is structured as follows.

Dimension                 per worker     per <worker, unit> pair (worker-task)
actions (# of msg)        Summary 4.1    Summary 4.3
units                     Summary 4.2    /
sessions (# of Starts)    Summary 4.4    Summary 4.5
dbAbandonGroup_worker = dbAbandon.groupby('worker_id')
print ('count(#) of workers:', len(dbAbandonGroup_worker))
count(#) of workers: 1171
countMsg = dbAbandonGroup_worker.count().message.value_counts().sort_index()
countMsg.head()
1    451
2    214
3     90
4     74
5     41
Name: message, dtype: int64
# statistics of msg per worker
dbAbandonGroup_worker.count().message.describe()
count    1171.000000
mean       10.609735
std        25.936389
min         1.000000
25%         1.000000
50%         2.000000
75%         6.000000
max       353.000000
Name: message, dtype: float64
vecX = countMsg.index.tolist()  # the number of count observed
vecY = countMsg.tolist()  # frequency of occurrence
plt.plot(vecX, vecY)
plt.xlabel('number of actions performed by worker till abandonment')
plt.ylabel('frequency of occurrence')
plt.show()
# consistency check
def tomCheck(vecX, vecY, t = 1):
    y = 0
    if (t == 1):  # vecX * vecY = (x1, x2, ...) * (y1, y2, ...) = x1 * y1 + x2 * y2 + ...
        for i, x in enumerate(vecX):
            y += x * vecY[i]
    elif (t == -11):  # sum of vecX
        for i in vecX:
            y += i
    elif (t == -12):  # sum of vecY
        for i in vecY:
            y += i
    return y

print('(tom check) vec(X) * vec(Y) =', tomCheck(vecX, vecY), '= total number of messages')
print('total number of workers:', tomCheck(vecX, vecY, -12))
(tom check) vec(X) * vec(Y) = 12424 = total number of messages
total number of workers: 1171
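
Since the consistency check is just a dot product and a sum, it can equivalently be done with numpy (a minimal sketch using the vectors computed above):

# vectorised equivalents of tomCheck(vecX, vecY) and tomCheck(vecX, vecY, -12)
assert np.dot(vecX, vecY) == 12424  # total number of messages
assert sum(vecY) == 1171            # total number of workers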
# plot (part of)
plt.plot(vecX[2:50], vecY[2:50])
plt.xlabel('number of actions performed by worker till abandonment')
plt.ylabel('frequency of occurrence')
plt.show()

Summary - 4.1 - Actions per Worker (worker id)

  1. There are 1171 workers who abandoned along the way. Before abandoning, they performed 12,424 activities in total.
  2. The average number of messages per worker is 10.61, which lies between the 75th and 100th percentile, showing that the distribution is right-skewed (it has a long right tail).
  3. More than half of the workers, i.e. 56.79% = (451 + 214) / 1171, gave up before the third message was sent. This indicates that these workers did not even start the task proper, since 2 steps precede it (reading the instructions and clicking the Start button).
  4. There is a peak around 20 actions on the horizontal axis: about 20 workers quit after 20 recorded actions. Since completing a task requires about 20 actions, the last being the FINAL CHECK, these workers probably failed the FINAL CHECK and then gave up.
  5. Values greater than 20 on the horizontal axis mean that the worker started a second time or more, on either the same or a different task. Unit IDs are not considered here (they are analysed later).
  6. The maximum number of actions performed by a single worker is 353, and there are many observations above 50, meaning these workers started again and again without ever finishing a task.

# statistics of Units per worker
dbAbandon.groupby('worker_id').unit_id.nunique().describe()
count    1171.000000
mean        1.437233
std         1.004323
min         1.000000
25%         1.000000
50%         1.000000
75%         2.000000
max        10.000000
Name: unit_id, dtype: float64
countUnit = dbAbandon.groupby('worker_id').unit_id.nunique().value_counts().sort_index()
countUnit.head()
1    872
2    188
3     71
4     15
5      6
Name: unit_id, dtype: int64
dbAbandonGroup_workerUnit = dbAbandon.groupby(['worker_id', 'unit_id'])
print('count(#) <worker, unit>:', len(dbAbandonGroup_workerUnit))

vecX = countUnit.index.tolist()  # the number of count observed
vecY = countUnit.tolist()  # frequency of occurrence
print('(tom check) vec(X) * vec(Y) =', tomCheck(vecX, vecY), '= total number of <worker, unit>')
print('total number of workers:', tomCheck(vecX, vecY, -12))
count(#) <worker, unit>: 1683
(tom check) vec(X) * vec(Y) = 1683 = total number of <worker, unit>
total number of workers: 1171
plt.plot(vecX, vecY)
plt.xlabel('number of units attempted but abandoned by each worker')
plt.ylabel('frequency of occurrence')
plt.show()

Summary - 4.2 - Units per Worker

  1. The total number of workers is 1171, and they started 1683 worker-units in total. Different workers starting the same unit count as different worker-units, since the <worker, unit> pairs differ.
  2. The average number of units per worker is 1.437, which lies between the 50th and 75th percentile, showing that the distribution is right-skewed, though not as strongly as the distribution of messages per worker.
  3. 872 workers (74.47% = 872 / 1171) attempted just one unit, while the remaining 25.53% of abandoned workers attempted several units.
  4. The maximum number of units started by a single worker is 10, meaning that this worker started 10 different tasks but gave up on all of them.

# statistics of msg per <worker, unit> pair
dbAbandonGroup_workerUnit.count().message.describe()
count    1683.000000
mean        7.382056
std        12.601439
min         1.000000
25%         1.000000
50%         2.000000
75%         5.000000
max        96.000000
Name: message, dtype: float64
countMsg_workerUnit = dbAbandonGroup_workerUnit.count().message.value_counts().sort_index()
countMsg_workerUnit.head()
1    836
2    270
3     75
4     67
5     37
Name: message, dtype: int64
vecX = countMsg_workerUnit.index.tolist()  # the number of count observed
vecY = countMsg_workerUnit.tolist()  # frequency of occurrence

print('(tom check) vec(X) * vec(Y) =', tomCheck(vecX, vecY), '= total number of messages')
print('total number of <worker, unit>:', tomCheck(vecX, vecY, -12))
(tom check) vec(X) * vec(Y) = 12424 = total number of messages
total number of <worker, unit>: 1683
plt.plot(vecX, vecY)
plt.xlabel('number of actions performed in each <worker, unit> till abandonment')
plt.ylabel('frequency of occurrence')
plt.show()
# plot (part of)
plt.plot(vecX[2:50], vecY[2:50])
plt.xlabel('number of actions performed in each <worker, unit> till abandonment')
plt.ylabel('frequency of occurrence')
plt.show()

Summary - 4.3 - Actions per pair

  1. The total number of <worker, unit> pairs is 1683, showing that the 1171 workers started 1683 worker-tasks that were given up along the way.
  2. The average number of messages per <worker, unit> pair is 7.382, which lies between the 75th and 100th percentile, so the distribution is still strongly right-skewed.
  3. Nearly half of the worker-tasks (49.67% = 836 / 1683) ended right after the first recorded message, which suggests that for these units the workers decided to stop while reading the instructions and never even clicked the START button.
  4. Another 270 worker-tasks (16.04% = 270 / 1683) ended after the worker clicked the START button, suggesting that these workers saw the first question, learned what the questions look like, and then decided to leave.
  5. Again there is a spike around 20 messages per worker-task: 45 worker-tasks stopped abruptly there. Compared with Summary 4.1 (about 20 workers quit after the FINAL CHECK), at least 25 of these worker-tasks were started a second time and abandoned again at the FINAL CHECK.
  6. The maximum number of actions in a single worker-task is 96, and there are also observations above 50 messages per worker-task, meaning these workers restarted the task multiple times (perhaps failing the FINAL CHECK and starting again).

Restart from this point: read DB Starts from file and analyse starting msg

# read from ***.json.msgStr (DB of starting messages: *dbStarts*)
tmp_db = []
with open(DATA_PROCESSED_DIR + jsonFile + '.msgStr', 'r') as f:
    lines = f.readlines()
    for line in lines:
        lineJSON = json.loads(line.strip())
        tmp_db.append(lineJSON)
dbStarts = pd.DataFrame(tmp_db)  # convert from JSON format to pd format

# read from ***.json.msgJSON (DB of judgments during the task: JSON msg)
tmp_db = []
with open(DATA_PROCESSED_DIR + jsonFile + '.msgJSON', 'r') as f:
    lines = f.readlines()
    for line in lines:
        lineJSON = json.loads(line.strip())
        tmp_db.append(lineJSON)
dbJudgments = pd.DataFrame(tmp_db)  # convert from JSON format to pd format

# check which workers never *Started* but have *Judged* (javascript not working at the beginning?)
setA = dbStarts.worker_id.unique()
setB = dbJudgments.worker_id.unique()
for i in setB:
    if i not in setA:
        print('Worker who does not have *starting msg* but has judgments recorded (has done some parts of tasks):\nID ', i)
print('Actually, worker (id == 20559745) submitted the task and is not in any Abandon DB')
Worker who does not have *starting msg* but has judgments recorded (has done some parts of tasks):
ID  20559745
Actually, worker (id == 20559745) submitted the task and is not in any Abandon DB
# split *dbStarts* into abandon & submit
dbStartsAbandon = dbStarts.loc[~dbStarts['worker_id'].isin(rsWorkerList)]
dbStartsSubmit = dbStarts.loc[dbStarts['worker_id'].isin(rsWorkerList)]
dbStartsAbandon.head()
message message_number server_time session task_id timestamp topic unit_id worker_id
0 start task 0 2018-05-08 00:10:38.612031 Y5030AAO 1264351 2018-05-08T00:10:37.221Z 442.0 442_265 40687124
1 start task 0 2018-05-08 00:11:15.519643 OBLG4DP4 1264351 2018-05-08T00:07:59.345Z 442.0 442_265 39068713
2 start task 0 2018-05-08 00:12:45.830076 OT7C59TB 1264351 2018-05-08T00:13:23.820Z 442.0 442_637 38896482
3 Start button pressed. 1 2018-05-08 00:13:53.764660 OT7C59TB 1264351 2018-05-08T00:14:28.593Z 442.0 442_637 38896482
4 start task 0 2018-05-08 00:14:10.599818 VXUVKXNB 1264351 2018-05-08T00:14:08.922Z 442.0 442_241 43811497
# some statistics
print('(workers in dbStarts)\ntotal:',  len(dbStarts.worker_id.unique()), ' Abandoned:', len(dbStartsAbandon.worker_id.unique()), '  Submitted:', len(dbStartsSubmit.worker_id.unique()), '/', len(rsWorkerList))
print('(DB of Starts)\ntotal records (msg):', len(dbStarts), '   Abandoned:', len(dbStartsAbandon), '   Submitted:', len(dbStartsSubmit))
(workers in dbStarts)
total: 1848  Abandoned: 1171   Submitted: 677 / 679
(DB of Starts)
total records (msg): 5044    Abandoned: 2919    Submitted: 2125
# function that counts Starts per <worker, unit>
# the output is a DB that contains one row per Start of each *worker-task*, i.e. <worker, unit> pair
# There are 3 kinds of starting msg; see Summary 4.4 for details.
def countStarts_perWorkerUnit(db):
    # convert DB to json format, ordered by worker, unit and arrival time
    db = json.loads(db.sort_values(['worker_id', 'unit_id', 'server_time']).to_json(orient = "records"))

    # finding new Starts
    workerID = 0
    unitID = ""
    messageNo = 0
    lastMsg = ""
    newStarts = []
    for r in db:
        if (r['message'][1:5] != "tart"):
            # neither "start task" nor "Start button pressed.", i.e. an Ethics review click
            if (r['message_number'] == (messageNo + 1)):
                # ethics review right after starting; otherwise workers may start a new session after this review
                messageNo = r['message_number']
        elif (r['worker_id'] == workerID and r['unit_id'] == unitID and r['message_number'] == (messageNo + 1) and r['message'] != lastMsg):
            # consecutive message of the same Start (e.g. the Start button following "start task"): not a new Start
            r['message_number'] = messageNo
            r['message'] = lastMsg
        else:
            # this is a new Start
            workerID = r['worker_id']
            unitID = r['unit_id']
            messageNo = r['message_number']
            lastMsg = r['message']
            newStarts.append({"worker_id": workerID, "unit_id": unitID, "message_number": messageNo, "message": lastMsg})

    # DB of new Starts
    db = pd.DataFrame(newStarts)
    return db
# get the DB of counted(#) Starts
countStartsAbandon = countStarts_perWorkerUnit(dbStartsAbandon)
print('total number(#) of Starts that lead to abandonment:', len(countStartsAbandon))
countStartsAbandon.head()
total number(#) of Starts that lead to abandonment: 2111
message message_number unit_id worker_id
0 start task 0 442_120 1883983
1 start task 0 442_32 1883983
2 start task 0 442_502 1883983
3 start task 0 442_334 1963188
4 start task 0 442_434 1963188
# count(#) of Starts by workerID
countStartsAbandonGroup_worker = countStartsAbandon.groupby('worker_id')
print('the number(#) of worker ID after counting:', len(countStartsAbandonGroup_worker))
print('the number(#) of worker ID before counting:', len(dbStartsAbandon.worker_id.unique()))
the number(#) of worker ID after counting: 1170
the number(#) of worker ID before counting: 1171
# check which worker is missing (javascript not working when starting)
setA = countStartsAbandon.worker_id.unique()
setB = dbStartsAbandon.worker_id.unique()
for i in setB:
    if i not in setA:
        print(i)
40457860
db.loc[db['worker_id'] == 40457860]
message message_number server_time session task_id timestamp topic unit_id worker_id
29213 Ethics button pressed. 0 2018-05-19 07:07:36.942305 6K4PFNER 1264351 2018-05-19 07:19:21.577 442.0 442_366 40457860
# statistics of Starts per worker
countStartsAbandonGroup_worker.count().message.describe()
count    1170.000000
mean        1.804274
std         2.011553
min         1.000000
25%         1.000000
50%         1.000000
75%         2.000000
max        25.000000
Name: message, dtype: float64
countStarts = countStartsAbandonGroup_worker.count().message.value_counts().sort_index()
countStarts.head()
1    771
2    208
3     98
4     34
5     15
Name: message, dtype: int64
vecX = countStarts.index.tolist()  # the number of count observed
vecY = countStarts.tolist()  # frequency of occurrence
print('(tom check) vec(X) * vec(Y) =', tomCheck(vecX, vecY), '= total number of Starts (after counting)')
print('total number of workers:', tomCheck(vecX, vecY, -12))
print(vecX)
print(vecY)
(tom check) vec(X) * vec(Y) = 2111 = total number of Starts (after counting)
total number of workers: 1170
[1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 13, 15, 16, 19, 20, 21, 24, 25]
[771, 208, 98, 34, 15, 15, 8, 7, 3, 1, 3, 1, 1, 1, 1, 1, 1, 1]
plt.plot(vecX, vecY)
plt.xlabel('number of Starts performed by worker till abandonment')
plt.ylabel('frequency of occurrence')
plt.show()

Summary - 4.4 - Starts per worker

When a worker starts a task, there are typically 4 steps:

  • (i) the string "start task" is sent to the server;
  • (ii) the worker reads the instructions;
  • (iii) the worker may press the Ethics button to read more information; and
  • (iv) the Start button is pressed to go to Question 1.

In this part, all server-logged messages are analysed to distinguish how many Starts made by workers led to abandonment in the end. Different user behaviours fall into one of the following sequences of activities.

  • (A) msg "start task" sent to server --> read instructions or not --> stopped / quit
  • (B) msg "start task" sent to server --> read instructions or not --> "Ethics button pressed." --> read information or not --> stopped / quit
  • (C) msg "start task" sent to server --> read instructions or not --> "Ethics button pressed." --> read information or not --> "Start button pressed." --> start answering one or more questions or not --> stopped / quit
  • (D) msg "start task" sent to server --> read instructions or not --> "Start button pressed." --> start answering one or more questions or not --> stopped / quit

After answering all questions, a Final Check is conducted. A worker who passes this check submits the task and does not appear in the Abandonment DB. A worker who fails the Final Check may or may not start again, or may move on to other tasks. In that case, the behaviour is one of the following.

  • (E) answered all questions, Final Check failed --> stopped / quit
  • (F) answered all questions, Final Check failed --> msg "start task" sent to server, followed by one of Paths (A)-(D) above

In all of the above cases, 3 kinds of String messages are recorded in the DB: (i) "start task", (ii) "Ethics button pressed." and (iii) "Start button pressed.". All other logged messages are of type JSON. Specifically, Paths A to D are treated as ONE Start, while Paths E and F mark the beginning of a new Start, because in these two cases the worker has completed all steps of one attempt (see the toy example below).

For technical reasons (for example, the browser-side logging code not running properly), some logs may be missing. Based on observation, some workers' logs begin with "Start button pressed.", i.e. the "start task" message is missing, which makes it impossible to compute the time they spent reading the instructions. (The time analysis comes in Step 5; please refer to the outline at the beginning of this document for the structure of the whole document.)
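
As a toy check of these rules (hypothetical rows; only the columns used by countStarts_perWorkerUnit are filled in), a Path C sequence collapses into ONE Start, while a Path F restart opens a second one:

# Path C followed by a Path F restart for one hypothetical <worker, unit> pair
toy = pd.DataFrame([
    {'worker_id': 1, 'unit_id': 'u1', 'server_time': 1, 'message_number': 0, 'message': 'start task'},
    {'worker_id': 1, 'unit_id': 'u1', 'server_time': 2, 'message_number': 1, 'message': 'Ethics button pressed.'},
    {'worker_id': 1, 'unit_id': 'u1', 'server_time': 3, 'message_number': 2, 'message': 'Start button pressed.'},
    # restart after a failed Final Check: counted as a new Start
    {'worker_id': 1, 'unit_id': 'u1', 'server_time': 4, 'message_number': 0, 'message': 'start task'},
])
print(len(countStarts_perWorkerUnit(toy)))  # -> 2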

Here is the summary of counting Starts per worker, after eliminating duplicate counts based on Paths A to F.

  1. The total number of abandoned workers (distinct worker IDs) is 1171, among which there is ONE worker whose "starting message" is missing, leaving only an "Ethics button pressed." log. Therefore, the total number of abandoned workers who ever started the task is 1170.
  2. The number of messages that account for starting a task that was abandoned in the end is 2919. After eliminating duplicate counts, there are 2111 Starts in this specific task.
  3. The distribution of the number of Starts per worker has a mean of 1.804 and a standard deviation of 2.012. It is right-skewed as well, since the mean lies between the 50th and 75th percentile.
  4. 771 abandoned workers (771 / 1170 = 65.9%) started only once, while another 208 workers (208 / 1170 = 17.78%) started a second time. The percentages of workers who started three, four or five times are 8.38%, 2.9% and 1.28%, respectively; fewer and fewer workers started more often.
  5. The maximum number of Starts observed is 25. Single workers are also observed at 15, 16, 19, 20, 21 and 24 Starts: one worker started this specific task 15 times, another 16 times, another 19 times, and so forth. None of them earned any reward on this task.

# statistics of Starts per worker-task (<worker, unit> pair)
countStartsAbandonGroup_workerUnit = countStartsAbandon.groupby(['worker_id', 'unit_id'])
print('count(#) <worker, unit>:', len(countStartsAbandonGroup_workerUnit), '\none worker is missing (the same as previous section)')
countStartsAbandonGroup_workerUnit.count().message.describe()
count(#) <worker, unit>: 1682 
one worker is missing (the same as previous section)
count    1682.000000
mean        1.255054
std         0.658799
min         1.000000
25%         1.000000
50%         1.000000
75%         1.000000
max         7.000000
Name: message, dtype: float64
countStarts = countStartsAbandonGroup_workerUnit.count().message.value_counts().sort_index()
countStarts.head()
1    1399
2     173
3      90
4      10
5       5
Name: message, dtype: int64
vecX = countStarts.index.tolist()  # the number of count observed
vecY = countStarts.tolist()  # frequency of occurrence
print(vecX)
print(vecY)
print('\n(tom check) vec(X) * vec(Y) =', tomCheck(vecX, vecY), '= total number of Starts of all <worker, unit> pairs')
print('total number of <worker, unit> pairs:', tomCheck(vecX, vecY, -12))
[1, 2, 3, 4, 5, 6, 7]
[1399, 173, 90, 10, 5, 4, 1]

(tom check) vec(X) * vec(Y) = 2111 = total number of Starts of all <worker, unit> pairs
total number of <worker, unit> pairs: 1682
plt.plot(vecX, vecY)
plt.xlabel('number of Starts performed in each <worker, unit> till abandonment')
plt.ylabel('frequency of occurrence')
plt.show()

Summary - 4.5 - Starts per pair

  1. Recall that the total number of <worker, unit> pairs is 1683, and that there is ONE pair whose "starting message" is missing. Therefore, the DB of abandoned Starts contains 1682 unique <worker, unit> pairs, i.e. distinct worker-tasks.
  2. The average number of Starts per worker-task is 1.255, and 83.17% (= 1399 / 1682) of worker-tasks were started only once.
  3. 173 worker-tasks (173 / 1682 = 10.28%) were started a second time, while 90 worker-tasks (90 / 1682 = 5.35%) were started a third time.
  4. The maximum number of Starts per <worker, unit> pair is 7. Compared with the previous section, where the maximum was 25 Starts per worker, some workers must have tried at least 4 different units in their attempts to complete a task. Unfortunately, they earned no monetary reward from this task.

Step 5 - Analysis of Time Spent on each task till Abandonment

Please go to tom2.ipynb.