Abandonment - analysis of single task (HIT) - Part II

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

import time, os, json

DATA_DIR = "data-180614/"
DATA_PROCESSED_DIR = "data-processed/"

jsonFile = "IR_SCALE_1264351_442.json"
csvFile = "job_1264351.json"

# read DB Abandon
tmp_db = []
with open(DATA_PROCESSED_DIR + jsonFile + '.dbAbandon', 'r') as f:
    lines = f.readlines()
    for line in lines:
        lineJSON = json.loads(line.strip())
        tmp_db.append(lineJSON)

# unifying JSON in *json.dumps* format
for i, r in enumerate(tmp_db):
    if (type(r['message']) == dict):
        tmp_db[i]['message'] = json.dumps(r['message'])
    else:
        tmp_db[i]['message'] = r['message'].encode('utf-8')

# convert from JSON format to pd format
dbAbandon = pd.DataFrame(tmp_db)
print 'worker abandoned:', len(dbAbandon.worker_id.unique())
worker abandoned: 1171

Step 5 - Analysis of Time Spent on each task till Abandonment (with comparison with those who completed the tasks)

  • In this section, the message logs will be analysed to determine how long workers spent on 3 different periods: (i) the total time they spent, (ii) the time spent reading instructions, (iii) the time spent answering questions. For each period, the first log is treated as the starting time point and the last log as the ending time point. If the last log does not exist, the first log of the next period is used instead. Therefore, if there is only one log in the current period AND there is no next period, the period is discarded. Take the following as examples (a toy sketch follows the table below).
    • (A) The start task msg & the Start button click msg define a period of reading instructions (first log & last log).
    • (B) If the Start button click msg is missing, the period of reading instructions is defined by the start task msg & the first question 1 msg (first log of the current period & first log of the next period).
    • (C) If there is only a start task msg in the session, the session is discarded.
  • All analysis is at the BROWSER Session level, giving a description of how long workers stayed in the above-mentioned periods after they intended to participate.
  • It is also worth analysing total time at the worker-unit level, particularly for those who failed the FINAL CHECK and started again, to understand how long they spent engaging with a specific unit.
  • For the time spent answering questions, analysis at the question level gives us some clues about why workers abandoned as well.
  • Each sub-section ends with a summary, and is structured as follows.
Periods                              | per BROWSER Session | per <worker, unit> pair | per question
total time                           | Summary 5.1         | Summary 5.2             | /
reading instructions                 | Summary 5.3         | /                       | /
answering questions                  | Summary 5.4         | /                       | Summary 5.5
Comparison with those who submitted  | ------------------- | ----------------------- | ------------
total time                           | 5.6                 | /                       | /
reading instructions                 | 5.7                 | /                       | /
answering questions                  | 5.8                 | /                       | /
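As a toy illustration of rules (A)-(C), the sketch below derives the reading-instructions period from one session's ordered logs; the string labels ('start task', 'Start button click', 'question 1') are hypothetical placeholders, since the real logs are classified later by the first character of the message.
def reading_period(logs):
    # logs: one session's messages as (label, server_time) tuples, sorted by time
    if (len(logs) == 0 or logs[0][0] != 'start task'):
        return None
    startTime = logs[0][1]
    for label, t in logs[1:]:
        if (label == 'Start button click'):   # rule (A): last log of the reading period
            return (startTime, t)
        if (label.startswith('question')):    # rule (B): fall back to the first log of the next period
            return (startTime, t)
    return None                               # rule (C): only the start task msg, discard the session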
# BROWSER Session counts(#)
dbAbandonGroup_BSession = dbAbandon.groupby('session').message.count()
print '(DB Abandon) count(#) of Browser sessions:', len(dbAbandon.session.unique())
print '(DB Abandon) count(#) of SINGLE MSG sessions', len(dbAbandonGroup_BSession.loc[dbAbandonGroup_BSession.apply(lambda x: x == 1)])
(DB Abandon) count(#) of Browser sessions: 1850
(DB Abandon) count(#) of SINGLE MSG sessions 1034

Summary - 5.1 - Total Time spent on each BROWSER Session - (1)

  • There are 1850 BROWSER Sessions in total, among which 1034 have only a single log message; these sessions are discarded, leaving 1850 - 1034 = 816 BROWSER Sessions for analysis.
  • Let us recall the analysis of Sessions in **Summary 4.4**. There are 2'111 Starts leading to granular Sessions that were performed in 1'850 BROWSER Sessions. The numbers show that some workers started their tasks a second time within their current BROWSER Sessions.

# What is the relationship between BROWSER Session & Worker?
dbAbandon.groupby('session').worker_id.nunique().sort_values()
session
00SIIXAA    1
OD7X2S44    1
OBLG4DP4    1
OB4ODP5T    1
OAZOMV4C    1
OARJMUO4    1
O9PT85H9    1
O97YOALE    1
O7KY91U0    1
O6BRY61P    1
O6BKIAMK    1
O458848O    1
ODISNCQ1    1
O0DQ4F6W    1
NXA630KW    1
NWGJRC15    1
NVKOONWQ    1
NVDZ26PZ    1
NV8CJU99    1
NV1WADY0    1
NUPSYEC7    1
NUIIVOXQ    1
NU2WS6ZY    1
NTQ6RNF3    1
NTCSLABK    1
NXLL7PJV    1
ODOHAPOI    1
OEASODE6    1
OEGGB5TF    1
OQDI9KRN    1
           ..
CM58NJJ2    1
BZETAIFG    1
C0Z386EC    1
CKLU4VW0    1
CIA9W437    1
CFSLRWC1    1
CFOEWTNT    1
CFM541D5    1
CDAOIR45    1
CCGHB3VV    1
CC7YJXF1    1
CBD4NPK8    1
C9LAIFTP    1
C9BLJVLH    1
C96ZL9DN    1
C8YQCDP6    1
C8N3A9AD    1
C7ZQLBLT    1
C7DCQ5E3    1
C79Q3MC0    1
C5ZIXB1I    1
C5KJE1MC    1
C5AQVOGT    1
C571LP7F    1
C4AMIIPG    1
C40B2JHK    1
C337XJJ9    1
C26HJRA0    1
BZYMYZRR    1
ZZ6C8BCY    1
Name: worker_id, Length: 1850, dtype: int64
# What is the relationship between BROWSER Session & Unit?
dbAbandon.groupby('session').unit_id.nunique().sort_values()
session
00SIIXAA    1
OD7X2S44    1
OBLG4DP4    1
OB4ODP5T    1
OAZOMV4C    1
OARJMUO4    1
O9PT85H9    1
O97YOALE    1
O7KY91U0    1
O6BRY61P    1
O6BKIAMK    1
O458848O    1
ODISNCQ1    1
O0DQ4F6W    1
NXA630KW    1
NWGJRC15    1
NVKOONWQ    1
NVDZ26PZ    1
NV8CJU99    1
NV1WADY0    1
NUPSYEC7    1
NUIIVOXQ    1
NU2WS6ZY    1
NTQ6RNF3    1
NTCSLABK    1
NXLL7PJV    1
ODOHAPOI    1
OEASODE6    1
OEGGB5TF    1
OQDI9KRN    1
           ..
CM58NJJ2    1
BZETAIFG    1
C0Z386EC    1
CKLU4VW0    1
CIA9W437    1
CFSLRWC1    1
CFOEWTNT    1
CFM541D5    1
CDAOIR45    1
CCGHB3VV    1
CC7YJXF1    1
CBD4NPK8    1
C9LAIFTP    1
C9BLJVLH    1
C96ZL9DN    1
C8YQCDP6    1
C8N3A9AD    1
C7ZQLBLT    1
C7DCQ5E3    1
C79Q3MC0    1
C5ZIXB1I    1
C5KJE1MC    1
C5AQVOGT    1
C571LP7F    1
C4AMIIPG    1
C40B2JHK    1
C337XJJ9    1
C26HJRA0    1
BZYMYZRR    1
ZZ6C8BCY    1
Name: unit_id, Length: 1850, dtype: int64

Summary - 5.1 - Total Time spent on each BROWSER Session - (2)

  • For every BROWSER Session, the number of distinct workers and of distinct units is exactly 1, which gives us evidence that once a BROWSER Session is determined, its worker ID and unit ID are implied. Since one worker or unit can start different BROWSER Sessions, for example someone doing the same task again, the relationship between worker and BROWSER Session is 1:n, and so is the one between unit and BROWSER Session.
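This 1:1 mapping per session can be double-checked directly; a minimal sketch assuming the dbAbandon frame loaded above:
# every BROWSER Session should map to exactly one worker and one unit
perSession = dbAbandon.groupby('session')[['worker_id', 'unit_id']].nunique()
print 'each session maps to exactly one worker & unit:', (perSession.max() == 1).all()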

# extract FIRST & LAST message of each BROWSER Session
multiMsgBSessionList = dbAbandonGroup_BSession.loc[dbAbandonGroup_BSession.apply(lambda x: x != 1)].index.tolist()
dbAbandon_BSessionMultiMsg = dbAbandon.loc[dbAbandon['session'].isin(multiMsgBSessionList)]
print 'count(#) of msgs in MultiMsg Browser Sessions:', len(dbAbandon_BSessionMultiMsg)
db1 = json.loads(dbAbandon_BSessionMultiMsg.sort_values(['session', 'server_time']).to_json(orient = "records"))
db2 = []
currSession = ''
currServerTime = 0
currObj = {}
for i, r in enumerate(db1):
    if (r['session'] != currSession):  # cursor is on the FIRST msg of next BROWSER session
        if (i != 0):
            currObj['last_serverTime'] = currServerTime
            db2.append(currObj)
        currSession = r['session']
        currServerTime = r['server_time']
        currObj = {"session": currSession, "first_serverTime": currServerTime, "unit_id": r['unit_id'], "worker_id": r['worker_id']}
    else:
        currServerTime = r['server_time']
        if (i == len(db1) - 1):
            currObj['last_serverTime'] = currServerTime
            db2.append(currObj)

# compute the time (seconds) that each BROWSER Session lasts, the output is *db3[]*
db3 = []
for r in db2:
    db3.append({"sessionTime_total": int((r['last_serverTime'] - r['first_serverTime']) / 1000), "session": r['session'], "unit_id": r['unit_id'], "worker_id": r['worker_id']})

# (DB Abandon) total time for each BROWSER Session
dbAbandon_BSessionTotalTime = pd.DataFrame(db3)
print 'count(#) of BROWSER sessions that lasts for some time (MultiMsg):', len(dbAbandon_BSessionTotalTime)
count(#) of msgs in MultiMsg Browser Sessions: 11390
count(#) of BROWSER sessions that lasts for some time (MultiMsg): 816
dbAbandon_BSessionTotalTime.head()
session sessionTime_total unit_id worker_id
0 01O0V2OS 202 442_420 42958246
1 0286S4TK 154 442_173 42711446
2 02WG27WF 45 442_647 43863256
3 03KF0O7G 1430 442_547 35065712
4 059WTKKH 9 442_436 43621163
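The same per-session durations can also be derived with a compact groupby aggregation; a sketch under the assumption that dbAbandon_BSessionMultiMsg is the frame built above and that server_time is in milliseconds:
# alternative to the explicit loop: first/last server_time per session via groupby
grp = dbAbandon_BSessionMultiMsg.groupby('session')
alt = pd.DataFrame({'first_serverTime': grp.server_time.min(),
                    'last_serverTime': grp.server_time.max(),
                    'unit_id': grp.unit_id.first(),
                    'worker_id': grp.worker_id.first()}).reset_index()
alt['sessionTime_total'] = ((alt.last_serverTime - alt.first_serverTime) / 1000).astype(int)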

Summary - 5.1 - Total Time spent on each BROWSER Session - (3)

  • In DB Abandon, there are 12'424 logged messages. From previous discussion, we know that there are 1'034 BROWSER Sessions that have only one logged message. Therefore, the number of remaining messages should be 12424 - 1034 = 11390, which is consistent with the calculation above.
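The arithmetic above can also be asserted in code; a small sketch reusing the frames already defined:
# consistency check: total msgs minus single-msg sessions equals the msgs in MultiMsg sessions
nSingle = (dbAbandonGroup_BSession == 1).sum()
assert len(dbAbandon) - nSingle == len(dbAbandon_BSessionMultiMsg)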

# distribution of total time in each BROWSER Session
dbAbandon_BSessionTotalTime.sessionTime_total.describe()
count     816.000000
mean      393.700980
std       431.875981
min         2.000000
25%        86.000000
50%       230.500000
75%       534.750000
max      3285.000000
Name: sessionTime_total, dtype: float64
# in terms of minutes
(dbAbandon_BSessionTotalTime.sessionTime_total / 60).round(0).describe()
count    816.000000
mean       6.561275
std        7.202329
min        0.000000
25%        1.000000
50%        4.000000
75%        9.000000
max       55.000000
Name: sessionTime_total, dtype: float64

Summary - 5.1 - Total Time spent on each BROWSER Session - (4)

  • The minimum time spent in a single BROWSER Session is 2 seconds, while the maximum is 55 minutes.
  • The average is 6.56 minutes, which lies between the 50th and 75th percentiles.

# count(#) different value in terms of 10 SECONDS
statis_BSessionTotalTime = (dbAbandon_BSessionTotalTime.sessionTime_total / 10).round(0).value_counts().sort_index()
vecX = statis_BSessionTotalTime.index.tolist()  # the number of count(#) observed
vecY = statis_BSessionTotalTime.tolist()  # frequency of occurrence
print 'X =\n', vecX
print 'Y =\n', vecY
plt.plot(vecX, vecY)
plt.xlabel('time spent on single BROWSER Session (in terms of 10 seconds)')
plt.ylabel('frequency of occurrence')
plt.show()
X =
[0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0, 30.0, 31.0, 32.0, 33.0, 34.0, 35.0, 36.0, 37.0, 38.0, 39.0, 40.0, 41.0, 42.0, 43.0, 44.0, 45.0, 46.0, 47.0, 48.0, 49.0, 50.0, 51.0, 52.0, 53.0, 54.0, 55.0, 56.0, 57.0, 58.0, 59.0, 61.0, 62.0, 63.0, 64.0, 65.0, 66.0, 67.0, 68.0, 69.0, 70.0, 71.0, 72.0, 73.0, 74.0, 75.0, 76.0, 77.0, 78.0, 79.0, 80.0, 81.0, 82.0, 83.0, 84.0, 85.0, 86.0, 88.0, 89.0, 90.0, 91.0, 93.0, 94.0, 95.0, 97.0, 98.0, 99.0, 100.0, 101.0, 102.0, 104.0, 105.0, 107.0, 108.0, 109.0, 112.0, 113.0, 114.0, 115.0, 116.0, 117.0, 119.0, 122.0, 123.0, 124.0, 126.0, 127.0, 129.0, 130.0, 131.0, 132.0, 134.0, 135.0, 137.0, 138.0, 141.0, 143.0, 146.0, 147.0, 149.0, 150.0, 157.0, 158.0, 159.0, 160.0, 162.0, 163.0, 164.0, 165.0, 166.0, 172.0, 173.0, 175.0, 177.0, 179.0, 189.0, 208.0, 249.0, 328.0]
Y =
[5, 15, 30, 21, 37, 18, 35, 18, 22, 26, 28, 9, 19, 17, 7, 9, 16, 17, 11, 8, 13, 9, 12, 8, 7, 14, 9, 7, 8, 7, 10, 5, 7, 7, 7, 7, 5, 7, 8, 5, 10, 5, 6, 7, 8, 5, 9, 2, 6, 4, 4, 2, 8, 6, 4, 2, 6, 3, 6, 2, 2, 5, 3, 5, 4, 1, 5, 3, 4, 1, 3, 1, 1, 4, 3, 2, 1, 4, 3, 2, 2, 3, 2, 8, 2, 1, 4, 2, 2, 2, 3, 1, 1, 2, 2, 1, 2, 1, 3, 1, 2, 2, 2, 1, 1, 1, 1, 3, 2, 1, 3, 1, 4, 1, 3, 1, 2, 1, 1, 2, 2, 1, 1, 3, 1, 3, 1, 2, 4, 1, 1, 2, 1, 1, 2, 1, 2, 2, 1, 2, 1, 1, 3, 1, 1, 1, 1, 1]
# count(#) different value in terms of 30 SECONDS
statis_BSessionTotalTime = (dbAbandon_BSessionTotalTime.sessionTime_total / 30).round(0).value_counts().sort_index()
vecX = statis_BSessionTotalTime.index.tolist()  # the number of count(#) observed
vecY = statis_BSessionTotalTime.tolist()  # frequency of occurrence
print 'X =\n', vecX
print 'Y =\n', vecY
plt.plot(vecX, vecY)
plt.xlabel('time spent on single BROWSER Session (in terms of 30 seconds)')
plt.ylabel('frequency of occurrence')
plt.show()
X =
[0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0, 30.0, 31.0, 32.0, 33.0, 34.0, 35.0, 36.0, 37.0, 38.0, 39.0, 40.0, 41.0, 42.0, 43.0, 44.0, 45.0, 46.0, 47.0, 48.0, 49.0, 50.0, 52.0, 53.0, 54.0, 55.0, 57.0, 58.0, 59.0, 60.0, 63.0, 69.0, 83.0, 110.0]
Y =
[22, 80, 78, 73, 51, 28, 37, 32, 31, 23, 24, 18, 21, 22, 20, 20, 13, 13, 13, 14, 5, 12, 10, 8, 5, 9, 8, 6, 14, 3, 7, 4, 4, 4, 4, 3, 5, 1, 5, 3, 3, 6, 4, 3, 3, 3, 4, 1, 3, 3, 5, 2, 3, 4, 4, 2, 2, 3, 1, 1, 1, 1, 1]
# count(#) different value in terms of MINUTES
statis_BSessionTotalTime = (dbAbandon_BSessionTotalTime.sessionTime_total / 60).round(0).value_counts().sort_index()
vecX = statis_BSessionTotalTime.index.tolist()  # the number of count(#) observed
vecY = statis_BSessionTotalTime.tolist()  # frequency of occurrence
print 'X =\n', vecX
print 'Y =\n', vecY
plt.plot(vecX, vecY)
plt.xlabel('time spent on single BROWSER Session (in terms of minutes)')
plt.ylabel('frequency of occurrence')
plt.show()
X =
[0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0, 30.0, 32.0, 35.0, 41.0, 55.0]
Y =
[65, 148, 97, 76, 57, 44, 42, 39, 28, 29, 19, 20, 15, 14, 17, 12, 7, 10, 5, 9, 8, 7, 7, 5, 5, 6, 4, 6, 3, 6, 2, 1, 1, 1, 1]
c = np.array([0, 0, 0, 0, 0], dtype='float64')
for i in dbAbandon_BSessionTotalTime.sessionTime_total.sort_values().tolist():
    if (i <= 120):  # 2 minutes
        c[0] += 1
    elif (i <= 300):  # 2-5 minutes
        c[1] += 1
    elif (i <= 600):  # 5-10 minutes
        c[2] += 1
    elif (i <= 900):  # 10-15 minutes
        c[3] += 1
    else:  # 15+ minutes
        c[4] += 1
print c, '\n', c/816
[275. 189. 171.  82.  99.] 
[0.3370098  0.23161765 0.20955882 0.1004902  0.12132353]
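The manual binning loop above can equivalently be expressed with pd.cut; a sketch with right-inclusive bins matching the <= checks (labels are illustrative):
bins = [0, 120, 300, 600, 900, np.inf]
labels = ['<=2 min', '2-5 min', '5-10 min', '10-15 min', '15+ min']
buckets = pd.cut(dbAbandon_BSessionTotalTime.sessionTime_total, bins = bins, labels = labels)
print buckets.value_counts().sort_index() / float(len(dbAbandon_BSessionTotalTime))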

Summary - 5.1 - Total Time spent on each BROWSER Session - (5)

  • One third of the BROWSER Sessions were abandoned within 2 minutes, while more than half (56.86% = (275 + 189) / 816) of the BROWSER Sessions were abandoned within 5 minutes.
  • The frequency of abandonment decays as more time is spent in a single BROWSER Session.
  • The longest and second-longest times observed before abandonment are 55 and 41 minutes in a single BROWSER Session, while only 12.13% (= 99 / 816) of BROWSER Sessions lasted longer than 15 minutes, indicating that the distribution has a long right tail, which is consistent with the statistics above (i.e. the mean lies between the 50th and 75th percentiles).

5.2 - total time per <worker, unit> pair

  • To compute the total time that a worker spent on a particular unit, it is not correct to naively take the Start & End messages for the <worker, unit> pair across different BROWSER Sessions. It is likely that a worker started a task (triggering logging messages), abandoned it after doing something, and then started the same task again several days later, triggering further logging messages for the same <worker, unit> pair. Taking the First & Last logging messages for the pair would then include a gap of several days during which the worker was not actually working on the task. Therefore, to be more precise, we only account for each granular BROWSER Session, and compute the total time per <worker, unit> pair by adding up all BROWSER Sessions for that <worker, unit>.
# total time per <worker, unit> pair
dbAbandonGroup_BSessionTotalTime_workerUnit = dbAbandon_BSessionTotalTime.groupby(['worker_id', 'unit_id']).sessionTime_total.sum()
print '(DB Abandon) the number(#) of unique <worker, unit> pairs in all BROWSER Sessions:', len(dbAbandonGroup_BSessionTotalTime_workerUnit)
(DB Abandon) the number(#) of unique <worker, unit> pairs in all BROWSER Sessions: 786

Summary - 5.2 - Total Time spent by each pair - (1)

  • Recall that 816 BROWSER Sessions were abandoned on the way, while the number of unique <worker, unit> pairs in these sessions is 786, showing that only 30 (= 816 - 786) BROWSER Sessions correspond to tasks that were restarted at a later time.
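The restarted tasks can be listed explicitly; a sketch assuming the dbAbandon_BSessionTotalTime frame from above:
# <worker, unit> pairs that span more than one BROWSER Session, i.e. restarted tasks
sessionsPerPair = dbAbandon_BSessionTotalTime.groupby(['worker_id', 'unit_id']).session.nunique()
print sessionsPerPair.loc[sessionsPerPair > 1]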

# distribution in terms of SECONDS
dbAbandonGroup_BSessionTotalTime_workerUnit.describe()
count     786.000000
mean      408.727735
std       442.549337
min         4.000000
25%        88.000000
50%       246.500000
75%       563.750000
max      3285.000000
Name: sessionTime_total, dtype: float64
# distribution in terms of MINUTES
(dbAbandonGroup_BSessionTotalTime_workerUnit / 60).round(0).describe()
count    786.000000
mean       6.811705
std        7.379446
min        0.000000
25%        1.000000
50%        4.000000
75%        9.000000
max       55.000000
Name: sessionTime_total, dtype: float64

Summary - 5.2 - Total Time spent by each pair - (2)

  • Compared with the distribution in the previous section, although there are some differences in the distribution of seconds spent per worker-task, the distribution at the minute level shows no significant change.

# plot in 10-SECONDS
statis_BSessionTotalTime_workerUnit = (dbAbandonGroup_BSessionTotalTime_workerUnit / 10).round(0).value_counts().sort_index()
vecX = statis_BSessionTotalTime_workerUnit.index.tolist()  # the number of count(#) observed
vecY = statis_BSessionTotalTime_workerUnit.tolist()  # frequency of occurrence
print 'X =\n', vecX
print 'Y =\n', vecY
plt.plot(vecX, vecY)
plt.xlabel('time spent by <worker, unit> pair (in terms of 10 seconds)')
plt.ylabel('frequency of occurrence')
plt.show()
X =
[0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0, 30.0, 31.0, 32.0, 33.0, 34.0, 35.0, 36.0, 37.0, 38.0, 39.0, 40.0, 41.0, 42.0, 43.0, 44.0, 45.0, 46.0, 47.0, 48.0, 49.0, 50.0, 51.0, 52.0, 53.0, 54.0, 55.0, 56.0, 57.0, 58.0, 59.0, 60.0, 61.0, 62.0, 63.0, 64.0, 65.0, 66.0, 67.0, 68.0, 69.0, 70.0, 71.0, 72.0, 73.0, 74.0, 75.0, 76.0, 77.0, 78.0, 79.0, 80.0, 81.0, 82.0, 83.0, 84.0, 85.0, 86.0, 88.0, 89.0, 90.0, 91.0, 92.0, 93.0, 94.0, 95.0, 96.0, 97.0, 98.0, 99.0, 100.0, 101.0, 102.0, 104.0, 105.0, 107.0, 108.0, 109.0, 112.0, 113.0, 114.0, 115.0, 116.0, 117.0, 119.0, 122.0, 123.0, 124.0, 126.0, 127.0, 129.0, 130.0, 131.0, 132.0, 134.0, 135.0, 136.0, 137.0, 138.0, 141.0, 142.0, 143.0, 146.0, 147.0, 149.0, 157.0, 158.0, 159.0, 160.0, 161.0, 162.0, 163.0, 164.0, 165.0, 166.0, 172.0, 173.0, 175.0, 177.0, 179.0, 189.0, 208.0, 249.0, 328.0]
Y =
[3, 14, 28, 20, 37, 17, 32, 18, 19, 26, 25, 9, 18, 16, 6, 8, 14, 17, 9, 8, 12, 9, 12, 8, 6, 15, 6, 8, 9, 7, 6, 6, 7, 7, 7, 4, 5, 7, 9, 5, 10, 7, 6, 6, 8, 5, 9, 2, 5, 4, 4, 2, 7, 5, 3, 2, 6, 3, 4, 2, 1, 2, 5, 3, 5, 5, 2, 5, 3, 4, 2, 3, 1, 1, 4, 3, 2, 1, 3, 3, 2, 2, 3, 2, 5, 2, 1, 3, 2, 3, 2, 1, 3, 1, 1, 1, 2, 2, 1, 2, 1, 3, 1, 2, 2, 2, 1, 1, 2, 1, 3, 2, 1, 3, 1, 4, 1, 3, 1, 2, 1, 1, 2, 2, 1, 1, 1, 3, 1, 1, 3, 1, 2, 4, 1, 2, 1, 1, 1, 2, 2, 2, 2, 1, 2, 1, 1, 3, 1, 1, 1, 1, 1]
# plot in 30-SECONDS
statis_BSessionTotalTime_workerUnit = (dbAbandonGroup_BSessionTotalTime_workerUnit / 30).round(0).value_counts().sort_index()
vecX = statis_BSessionTotalTime_workerUnit.index.tolist()  # the number of count(#) observed
vecY = statis_BSessionTotalTime_workerUnit.tolist()  # frequency of occurrence
print 'X =\n', vecX
print 'Y =\n', vecY
plt.plot(vecX, vecY)
plt.xlabel('time spent by <worker, unit> pair (in terms of 30 seconds)')
plt.ylabel('frequency of occurrence')
plt.show()
X =
[0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0, 30.0, 31.0, 32.0, 33.0, 34.0, 35.0, 36.0, 37.0, 38.0, 39.0, 40.0, 41.0, 42.0, 43.0, 44.0, 45.0, 46.0, 47.0, 48.0, 49.0, 50.0, 52.0, 53.0, 54.0, 55.0, 57.0, 58.0, 59.0, 60.0, 63.0, 69.0, 83.0, 110.0]
Y =
[18, 78, 73, 68, 49, 24, 35, 31, 30, 23, 20, 19, 18, 23, 21, 20, 12, 12, 11, 12, 6, 12, 12, 9, 5, 9, 7, 6, 11, 2, 8, 5, 5, 4, 4, 3, 5, 1, 6, 3, 3, 6, 4, 3, 3, 4, 4, 2, 3, 3, 4, 2, 3, 6, 4, 2, 2, 3, 1, 1, 1, 1, 1]
# plot in MINUTES
statis_BSessionTotalTime_workerUnit = (dbAbandonGroup_BSessionTotalTime_workerUnit / 60).round(0).value_counts().sort_index()
vecX = statis_BSessionTotalTime_workerUnit.index.tolist()  # the number of count(#) observed
vecY = statis_BSessionTotalTime_workerUnit.tolist()  # frequency of occurrence
print 'X =\n', vecX
print 'Y =\n', vecY
plt.plot(vecX, vecY)
plt.xlabel('time spent by <worker, unit> pair (in terms of minutes)')
plt.ylabel('frequency of occurrence')
plt.show()
X =
[0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0, 30.0, 32.0, 35.0, 41.0, 55.0]
Y =
[59, 141, 91, 70, 55, 42, 40, 40, 27, 26, 18, 22, 16, 13, 14, 13, 8, 10, 5, 10, 8, 7, 7, 6, 6, 5, 4, 8, 3, 6, 2, 1, 1, 1, 1]
c = np.array([0, 0, 0, 0, 0], dtype='float64')
for i in dbAbandonGroup_BSessionTotalTime_workerUnit.sort_values().tolist():
    if (i <= 120):  # 2 minutes
        c[0] += 1
    elif (i <= 300):  # 2-5 minutes
        c[1] += 1
    elif (i <= 600):  # 5-10 minutes
        c[2] += 1
    elif (i <= 900):  # 10-15 minutes
        c[3] += 1
    else:  # 15+ minutes
        c[4] += 1
print c, '\n', c/786
[259. 178. 163.  80. 106.] 
[0.32951654 0.2264631  0.20737913 0.10178117 0.13486005]

Summary - 5.2 - Total Time spent by each pair - (3)

  • Since only 3.68% (= 30 / 816) of the BROWSER Sessions belong to restarted worker-tasks, most of the time values remain the same when we change the level of analysis from BROWSER Session to <worker, unit> pair, and thus the curves shown above show no significant differences from those in the previous analysis.
  • Some short BROWSER Sessions merge into longer engagements, increasing the number of engagements lasting more than 15 minutes from 99 (12.13% = 99 / 816) to 106 (13.49% = 106 / 786). None of these engagements, however, earned any monetary reward, not even the longest one lasting nearly an hour.

5.3 - time in reading instructions per BROWSER Session

  • To compute the time that a worker spent in reading instructions, four types of logging messages will be used:
    • (A) start task
    • (B) Ethics button click
    • (C) Start button pressed
    • (D) first msg that involves answering questions

Ideally, msgs (A) & (C) are used to compute the time that workers spent reading instructions at the beginning, denoted by RT0. If (A) is missing, (B) is used instead to estimate the time. If both (A) & (B) are missing, the BROWSER Session is marked as having no reading time at the beginning. If (C) is missing, (D) is used instead of (C), unless the session is abandoned immediately.

  • Sometimes within a BROWSER Session, the worker may fail the FINAL CHECK and subsequently refer back to the instructions. In that case, the concept of subsequent reading time is introduced, so for each BROWSER Session there may exist a series of reading times <RT0, RT1, RT2, ...> until the worker gives up.
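The loops below distinguish message types by the first character of the message string. The helper below sketches that convention; it is an assumption read off the checks in the code (answer events are JSON strings, *Start button pressed.* starts with 'S', *Ethics button pressed.* with 'E'), not a documented schema.
def msg_kind(message):
    first = message[0]
    if (first == '{'):
        return 'answering'     # (D) click on alternative choices / FINAL CHECK, JSON-encoded
    if (first == 'S'):
        return 'start_button'  # (C) Start button pressed
    if (first == 'E'):
        return 'ethics'        # (B) Ethics button click
    return 'reading'           # (A) start task msg and other instruction-related logs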
# count(#) of BROWSER Session (duplicated)
dbAbandonGroup_BSession = dbAbandon.groupby('session').message.count()
print '(DB Abandon) count(#) of Browser sessions:', len(dbAbandon.session.unique())
print '(DB Abandon) count(#) of SINGLE MSG sessions', len(dbAbandonGroup_BSession.loc[dbAbandonGroup_BSession.apply(lambda x: x == 1)])
(DB Abandon) count(#) of Browser sessions: 1850
(DB Abandon) count(#) of SINGLE MSG sessions 1034
# extract MultiMsg BROWSER Sessions
multiMsgBSessionList = dbAbandonGroup_BSession.loc[dbAbandonGroup_BSession.apply(lambda x: x != 1)].index.tolist()
dbAbandon_BSessionMultiMsg = dbAbandon.loc[dbAbandon['session'].isin(multiMsgBSessionList)]
print 'count(#) of msgs in MultiMsg Browser Sessions:', len(dbAbandon_BSessionMultiMsg)
count(#) of msgs in MultiMsg Browser Sessions: 11390
db1 = json.loads(dbAbandon_BSessionMultiMsg.sort_values(['session', 'server_time']).to_json(orient = "records"))
db2 = []

# extract reading time at the beginning of NEW BROWSER Session
currSession = ''
currServerTime_startRead = 0
currServerTime_endRead = 0
currObj = {}
currStr = ''
stayOnSession = True
for i, r in enumerate(db1):
    if (r['session'] != currSession):  # cursor is on the FIRST msg of NEW BROWSER session
        currSession = r['session']
        if (i != 0):
            currObj['timeReading'] = currServerTime_endRead - currServerTime_startRead
            db2.append(currObj)

        currServerTime_startRead = 0
        if (r['message'][0] != "{"):  # this msg is not involving question answering
            currServerTime_startRead = r['server_time']
        currServerTime_endRead = r['server_time']
        currStr = r['message'][0]
        currObj = {"session": currSession, "seqNo": 0, "unit_id": r['unit_id'], "worker_id": r['worker_id'], "startReading": currServerTime_startRead}
        stayOnSession = True
        if (r['message'][0] == "S"):  # this msg is *Start button pressed*; discard it, since the session goes directly to the question part
            stayOnSession = False
    else:
        if (stayOnSession):
            if (r['message'][0] != "{"):
                currServerTime_endRead = r['server_time']
                currStr = r['message'][0]
            else:
                if (currStr != "S"):  # the last Msg is not *Start button click*
                    currServerTime_endRead = r['server_time']
                stayOnSession = False  # no need to check following Msg within this Session

    if (i == len(db1) - 1):
        currObj['timeReading'] = currServerTime_endRead - currServerTime_startRead
        db2.append(currObj)
print len(db2)
print db2[:5]
816
[{'timeReading': 152074, 'seqNo': 0, 'unit_id': u'442_420', 'session': u'01O0V2OS', 'startReading': 1526312419434, 'worker_id': 42958246}, {'timeReading': 122012, 'seqNo': 0, 'unit_id': u'442_173', 'session': u'0286S4TK', 'startReading': 1526240092381, 'worker_id': 42711446}, {'timeReading': 45257, 'seqNo': 0, 'unit_id': u'442_647', 'session': u'02WG27WF', 'startReading': 1525866272673, 'worker_id': 43863256}, {'timeReading': 30292, 'seqNo': 0, 'unit_id': u'442_547', 'session': u'03KF0O7G', 'startReading': 1526814619717, 'worker_id': 35065712}, {'timeReading': 0, 'seqNo': 0, 'unit_id': u'442_436', 'session': u'059WTKKH', 'startReading': 1525834084043, 'worker_id': 43621163}]

Summary - 5.3 - Time in Reading Instructions per BROWSER Session - (1)

  • Again, we discarded Single-Msg BROWSER Sessions, leaving 816 sessions that can be analysed. The total number of logging messages for these 816 BROWSER Sessions is 11390 (= 12424 - 1034), which is consistent with the total number of messages and the number of Single-Msg BROWSER Sessions.

# extract reading time in the middle of BROWSER Sessions (can only be run ONCE, unless re-run *db2[]* setting up)
currSession = ''
currServerTime_startRead = 0
currServerTime_endRead = 0
currObj = {}
currStr = ''
readingValid = False
for i, r in enumerate(db1):
    if (r['session'] != currSession):  # cursor is on the FIRST msg of NEW BROWSER session
        if (readingValid and currServerTime_startRead != 0):
            currObj['timeReading'] = currServerTime_endRead - currServerTime_startRead
            db2.append(currObj)
        currSession = r['session']
        readingValid = False
        currServerTime_startRead = 0
        currServerTime_endRead = 0
        currObj = {}
        currStr = ''
    else:
        if (readingValid):
            if (r['message'][0] == "s" or r['message'][0] == "E"):  # this msg is NOT question answering (reading or starting)
                currServerTime_endRead = r['server_time']
                if (currStr == ''):
                    currServerTime_startRead = r['server_time']
                    currStr = r['message'][0]
                    currObj = {"session": currSession, "seqNo": 1, "unit_id": r['unit_id'], "worker_id": r['worker_id'], "startReading": currServerTime_startRead}
            else:
                if (currStr != ''):
                    currServerTime_endRead = r['server_time']
                    currObj['timeReading'] = currServerTime_endRead - currServerTime_startRead
                    db2.append(currObj)
                    readingValid = False
                    currServerTime_startRead = 0
                    currServerTime_endRead = 0
                    currObj = {}
                    currStr = ''
    if (not readingValid and (r['message'][0] == "{" or r['message'][0] == "S")):  # start question answering
        readingValid = True

    if (i == len(db1) - 1):
        if (readingValid and currServerTime_startRead != 0):
            currObj['timeReading'] = currServerTime_endRead - currServerTime_startRead
            db2.append(currObj)
print len(db2)
print db2[-5:]
826
[{'timeReading': 37059, 'seqNo': 1, 'unit_id': u'442_255', 'session': u'O6BKIAMK', 'startReading': 1526091243913, 'worker_id': 43607944}, {'timeReading': 52455, 'seqNo': 1, 'unit_id': u'442_471', 'session': u'PF5DDLC5', 'startReading': 1525980333711, 'worker_id': 43932836}, {'timeReading': 29698, 'seqNo': 1, 'unit_id': u'442_263', 'session': u'TN14Z7GX', 'startReading': 1526563248162, 'worker_id': 44274852}, {'timeReading': 0, 'seqNo': 1, 'unit_id': u'442_657', 'session': u'V7FGYZR7', 'startReading': 1525928399628, 'worker_id': 44190274}, {'timeReading': 18531, 'seqNo': 1, 'unit_id': u'442_27', 'session': u'VM46R7HQ', 'startReading': 1526702380066, 'worker_id': 11665399}]

Summary - 5.3 - Time in Reading Instructions per BROWSER Session - (2)

  • 816 BROWSER Sessions were analysed to compute the reading time at the Beginning; after also analysing reading in the Middle of the session, the total number of records becomes 826, which means there are 10 observations of BROWSER Sessions during which workers referred back to the instructions (either the Ethics page or the starting instructions).

# (DB Abandon) reading time for each BROWSER Session (beginning + middle)
dbAbandon_BSessionReadingTime = pd.DataFrame(db2)
print 'count(#) of BROWSER Session:', dbAbandon_BSessionReadingTime.session.nunique()

dbAbandon_BSessionReadingTime = dbAbandon_BSessionReadingTime.loc[dbAbandon_BSessionReadingTime['timeReading'] != 0]
print 'count(#) of BROWSER Sessions that has non-zero reading time:', dbAbandon_BSessionReadingTime.session.nunique()
count(#) of BROWSER Session: 816
count(#) of BROWSER Sessions that has non-zero reading time: 764

Summary - 5.3 - Time in Reading Instructions per BROWSER Session - (3)

  • We ruled out zero reading times in the data, which probably stem from technical reasons such as the browser's JavaScript not working, so that the logging messages begin with Start button pressed (the case discussed in the 5.3 instructions where both (A) & (B) are missing).
  • The number of BROWSER Sessions with non-zero reading time is 764, discarding 52 (= 816 - 764) BROWSER Sessions.

# check if these BSession with ZERO-Reading begin with *Start button click*
db1 = json.loads(dbAbandon_BSessionMultiMsg.sort_values(['session', 'message_number']).to_json(orient = "records"))
currSession = ''
tomList = []
y = 0
for r in db1:
    if (r['session'] != currSession):  # cursor is on the FIRST msg of NEW BROWSER session
        currSession = r['session']
        if (r['message'][0] == 'S'):
            y += 1
            tomList.append(currSession)
            if (r['message_number'] != 0):
                print 'begin with *Start button* & *MsgNo* is 1 (0 is missing):', currSession
print '\ncount(#) of BROWSER Session begin with *Start button click* (no reading at beginning):', y
print '\nin session AJ2FFXN7, the worker did not read at the beginning but clicked the *Ethics* button in the middle'
dbAbandon.loc[dbAbandon['session'] == 'AJ2FFXN7']
begin with *Start button* & *MsgNo* is 1 (0 is missing): T9K69ZQU

count(#) of BROWSER Session begin with *Start button click* (no reading at beginning): 53

in session AJ2FFXN7, the worker did not read at the beginning but clicked the *Ethics* button in the middle
message message_number server_time session task_id timestamp topic unit_id worker_id
6126 Start button pressed. 0 1526303696338 AJ2FFXN7 1264351 1526303720534 442.0 442_606 43607944
6130 Ethics button pressed. 1 1526303757445 AJ2FFXN7 1264351 1526303781593 442.0 442_606 43607944
6131 {"msg": "radio button: change rel", "doc": 1, ... 2 1526303784054 AJ2FFXN7 1264351 1526303808384 442.0 442_606 43607944
6132 {"msg": "change doc", "doc": 1, "step": 1, "co... 3 1526303784853 AJ2FFXN7 1264351 1526303809564 442.0 442_606 43607944
6133 {"msg": "radio button: change rel", "doc": 2, ... 4 1526303809345 AJ2FFXN7 1264351 1526303833623 442.0 442_606 43607944
6134 {"msg": "radio button: change rel", "doc": 2, ... 5 1526303814176 AJ2FFXN7 1264351 1526303838954 442.0 442_606 43607944
6135 {"msg": "change doc", "doc": 2, "step": 2, "co... 6 1526303862081 AJ2FFXN7 1264351 1526303886309 442.0 442_606 43607944
6136 {"msg": "radio button: change rel", "doc": 3, ... 7 1526303896803 AJ2FFXN7 1264351 1526303921093 442.0 442_606 43607944
6142 {"msg": "change doc", "doc": 3, "step": 3, "co... 8 1526304890679 AJ2FFXN7 1264351 1526304914862 442.0 442_606 43607944

Summary - 5.3 - Time in Reading Instructions per BROWSER Session - (4)

  • After scanning all 816 BROWSER Sessions, 53 among them started with Start button pressed and would be classified as no reading time at the start.
  • One of these 53 BROWSER Sessions started with message number being 1, probably because message 0 is missing due to network errors, while the other 52 BROWSER Sessions started with message 0 being Start button pressed, probably because javascript in their browsers was not working properly.
  • One observation among these 53 BROWSER Sessions is shown above. The logging starts with Start button pressed, so the session is classified as having no reading time at the start. But just after starting to answer questions, the worker pressed the Ethics button and spent some time reading. In that case, the BROWSER Session is classified as having reading in the middle, and therefore belongs to the group with non-zero reading time. So the number of zero-reading-time BROWSER Sessions should be 52, which is consistent with all the other numbers.

# statistics & plot (analysis) per BROWSER Session (adding up multiple reading, beginning + middle)
statis_BSessionReadingTime = (dbAbandon_BSessionReadingTime.groupby('session').timeReading.sum().sort_values()) / 1000
print 'distribution of reading time (in seconds)'
statis_BSessionReadingTime.describe()
distribution of reading time (in seconds)
count     764.000000
mean      119.838671
std       187.836779
min         0.007000
25%        31.822500
50%        59.256500
75%       126.268250
max      1575.380000
Name: timeReading, dtype: float64
# plot in 10-SECONDS
plot_BSessionReadingTime = (statis_BSessionReadingTime / 10).round(0).value_counts().sort_index()
vecX = plot_BSessionReadingTime.index.tolist()  # the number of count(#) observed
vecY = plot_BSessionReadingTime.tolist()  # frequency of occurrence
print 'X = \n', vecX
print 'Y = \n', vecY
plt.plot(vecX, vecY)
plt.xlabel('time in reading on single BROWSER Session (in 10 seconds)')
plt.ylabel('frequency of occurrence')
plt.show()
X = 
[0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0, 30.0, 31.0, 32.0, 33.0, 34.0, 35.0, 37.0, 38.0, 39.0, 41.0, 42.0, 43.0, 44.0, 45.0, 47.0, 49.0, 50.0, 53.0, 56.0, 57.0, 58.0, 61.0, 64.0, 68.0, 70.0, 71.0, 73.0, 75.0, 76.0, 81.0, 84.0, 97.0, 100.0, 109.0, 121.0, 122.0, 126.0, 130.0, 136.0, 149.0, 150.0, 158.0]
Y = 
[7, 49, 70, 88, 87, 63, 47, 40, 24, 32, 29, 14, 15, 18, 15, 9, 11, 14, 11, 6, 12, 8, 6, 4, 1, 7, 2, 3, 7, 5, 4, 1, 2, 2, 2, 1, 2, 4, 2, 3, 1, 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
# plot in 30-SECONDS
plot_BSessionReadingTime = (statis_BSessionReadingTime / 30).round(0).value_counts().sort_index()
vecX = plot_BSessionReadingTime.index.tolist()  # the number of count(#) observed
vecY = plot_BSessionReadingTime.tolist()  # frequency of occurrence
print 'X = \n', vecX
print 'Y = \n', vecY
plt.plot(vecX, vecY)
plt.xlabel('time in reading on single BROWSER Session (in 30 seconds)')
plt.ylabel('frequency of occurrence')
plt.show()
X = 
[0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 23.0, 24.0, 25.0, 27.0, 28.0, 32.0, 33.0, 36.0, 40.0, 41.0, 42.0, 43.0, 45.0, 50.0, 53.0]
Y = 
[56, 245, 150, 85, 47, 35, 31, 26, 12, 12, 10, 6, 3, 6, 6, 3, 3, 1, 1, 3, 1, 1, 2, 2, 3, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1]
# plot in 60-SECONDS (MINUTES)
plot_BSessionReadingTime = (statis_BSessionReadingTime / 60).round(0).value_counts().sort_index()
vecX = plot_BSessionReadingTime.index.tolist()  # the number of count(#) observed
vecY = plot_BSessionReadingTime.tolist()  # frequency of occurrence
print 'X = \n', vecX
print 'Y = \n', vecY
plt.plot(vecX, vecY)
plt.xlabel('time in reading on single BROWSER Session (in minutes)')
plt.ylabel('frequency of occurrence')
plt.show()
X = 
[0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 16.0, 17.0, 18.0, 20.0, 21.0, 22.0, 23.0, 25.0, 26.0]
Y = 
[176, 316, 111, 61, 26, 23, 10, 10, 4, 3, 2, 2, 4, 4, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1]
# manually count the frequency
c = np.array([0, 0, 0, 0, 0], dtype='float64')
for i in dbAbandon_BSessionReadingTime.groupby('session').timeReading.sum().sort_values().tolist():
    if (i <= 30000):  # 0.5 minutes
        c[0] += 1
    elif (i <= 60000):  # 0.5-1 minutes
        c[1] += 1
    elif (i <= 120000):  # 1-2 minutes
        c[2] += 1
    elif (i <= 300000):  # 2-5 minutes
        c[3] += 1
    else:  # 5+ minutes
        c[4] += 1
print c, '\n', c/764
[176. 211. 173. 146.  58.] 
[0.23036649 0.27617801 0.22643979 0.19109948 0.07591623]
# comparison of the 2 distributions (total time & reading time)
fig = plt.figure()
ax1 = fig.add_subplot(111)
ax2 = fig.add_subplot(111, frame_on = False)

# total time per BROWSER Session (MINUTES)
statis_BSessionTotalTime = (dbAbandon_BSessionTotalTime.sessionTime_total / 60).round(0).value_counts().sort_index()
vecX = statis_BSessionTotalTime.index.tolist()  # the number of count(#) observed
vecY = statis_BSessionTotalTime.tolist()        # frequency of occurrence
color1 = 'tab:red'
ax1.plot(vecX, vecY, color = color1, label = 'total time')
ax1.set_xlabel('total time per BROWSER Session (minutes)', color = color1)
ax1.set_ylabel('frequency of occurrence', color = color1)
ax1.tick_params(axis = 'x', labelcolor = color1)
ax1.tick_params(axis = 'y', labelcolor = color1)

# reading time per BROWSER Session (MINUTES)
plot_BSessionReadingTime = (statis_BSessionReadingTime / 60).round(0).value_counts().sort_index()
vecX = plot_BSessionReadingTime.index.tolist()  # the number of count(#) observed
vecY = plot_BSessionReadingTime.tolist()        # frequency of occurrence
color2 = 'tab:blue'
ax2.plot(vecX, vecY, color = color2, label = 'reading time')
#ax2 = ax1.twinx()  # instantiate a second axes that shares the same x-axis
ax2.xaxis.tick_top()
ax2.yaxis.tick_right()
ax2.xaxis.set_label_position('top') 
ax2.yaxis.set_label_position('right') 
ax2.set_xlabel('reading time per BROWSER Session (minutes)', color = color2)
ax2.set_ylabel('frequency of occurrence', color = color2)
ax2.tick_params(axis='x', labelcolor = color2)
ax2.tick_params(axis='y', labelcolor = color2)

fig.legend(loc='upper center', bbox_to_anchor=(0.75, 0.88), ncol=1, fancybox=True, shadow=True)
plt.show()

Summary - 5.3 - Time in Reading Instructions per BROWSER Session - (5)

  • The average time that a worker spent reading instructions is nearly 2 minutes (119.8 seconds), which lies between the 50th and 75th percentiles, indicating that the distribution has a long right tail.
  • More than half (23.04% + 27.62% = 50.66%) of the BROWSER Sessions had a reading time of less than 1 minute. More precisely, 176 (23.04%) BROWSER Sessions had a reading time of less than half a minute, while 211 (27.62%) had a reading time between half a minute and one minute.
  • A reading time between 1 and 2 minutes was observed in 173 BROWSER Sessions, accounting for 22.64%, while another 19.11% had a reading time between 2 and 5 minutes.
  • Reading the instructions for more than 5 minutes is not commonly observed (less than 8% of sessions), but the longest reading time is 1'575 seconds (more than 26 minutes), which gives us some clues that the worker may not have understood our instructions.
  • Plotting the two distributions together for comparison, i.e. Total Time and Reading Time, it is clear that the two distributions follow a similar pattern: a steep increase at the beginning followed by a dramatic drop with a long right tail.

5.4 - time in answering questions per BROWSER Session

  • Start button pressed is regarded as the moment when workers started to deal with the questions. In case this msg is missing, the first Click on alternative choices msg is used to estimate when they started tackling the questions.
  • The last msg accounts for the end of answering questions, as long as it falls into one of the following types.
    • (A) Any Clicks on alternative choices
    • (B) Final Check

If the last msg is neither (A) nor (B), for example an Ethics button event or a start task msg, then the second-to-last msg is examined to see whether it is of type (A) or (B), and so forth.
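The backward scan just described can be sketched as follows, assuming msgs is one session's logs sorted by server_time and that answer / Final Check events are JSON strings (i.e. they start with '{'):
def last_answer_time(msgs):
    for m in reversed(msgs):
        if (m['message'][0] == '{'):   # (A) click on alternative choices or (B) Final Check
            return m['server_time']
    return None                        # no answering activity logged in this session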

# extract time of answering TOTAL questions in each BROWSER Session
db1 = json.loads(dbAbandon_BSessionMultiMsg.sort_values(['session', 'server_time']).to_json(orient = "records"))
db2 = []
currSession = ''
currServerTime_startAnswer = 0
currServerTime_endAnswer = 0
currObj = {}
currStr = ''
for i, r in enumerate(db1):
    if (r['session'] != currSession):  # new BROWSER session
        if (i != 0):
            currObj['startAnswer'] = currServerTime_startAnswer
            if (currServerTime_endAnswer == 0):
                currObj['timeAnswering'] = 0
            else:
                currObj['timeAnswering'] = currServerTime_endAnswer - currServerTime_startAnswer
            db2.append(currObj)
        currSession = r['session']
        currServerTime_startAnswer = 0
        currServerTime_endAnswer = 0
        currObj = {"session": currSession, "unit_id": r['unit_id'], "worker_id": r['worker_id']}
        currStr = ''
    if (currServerTime_startAnswer == 0):
        if (r['message'][0] == "S" or r['message'][0] == "{"):
            currServerTime_startAnswer = r['server_time']
    else:
        if (r['message'][0] == "{"):
            currServerTime_endAnswer = r['server_time']
    if (i == len(db1) - 1):
        currObj['startAnswer'] = currServerTime_startAnswer
        if (currServerTime_endAnswer == 0):
            currObj['timeAnswering'] = 0
        else:
            currObj['timeAnswering'] = currServerTime_endAnswer - currServerTime_startAnswer  # keep milliseconds, consistent with the records appended above
        db2.append(currObj)
# convert to pd format
dbAbandon_BSessionAnsweringTime = pd.DataFrame(db2)
print 'total number of BSessions with MultiMsg:', len(dbAbandon_BSessionAnsweringTime)
print 'count(#) of BSessions that *Start time* is ZERO:', len(dbAbandon_BSessionAnsweringTime.loc[dbAbandon_BSessionAnsweringTime['startAnswer'] == 0])
dbAbandon_BSessionAnsweringTime.loc[dbAbandon_BSessionAnsweringTime['startAnswer'] == 0].head()
total number of BSessions with MultiMsg: 816
count(#) of BSessions that *Start time* is ZERO: 26
session startAnswer timeAnswering unit_id worker_id
44 1U6LVQUY 0 0 442_502 44430139
46 1V23VE0C 0 0 442_259 43903982
78 39ECNZRA 0 0 442_155 38735742
136 6BWZ4EOD 0 0 442_125 42646230
232 BBU3TM3I 0 0 442_459 41807814

Summary - 5.4 - Time in Answering Questions per BROWSER Session - (1)

  • Let us recall that among all 1850 BROWSER Sessions, only 816 of them have more than one logging msg.
  • 26 BROWSER Sessions have a ZERO Start Time, which means the workers quit without clicking the Start button or any alternative choice on the question pages. In other words, they gave up before the task actually started.
  • A ZERO Start Time implies that the time spent answering questions is ZERO as well.

print 'count(#) of BSessions that *End time* is ZERO:'
print len(dbAbandon_BSessionAnsweringTime.loc[dbAbandon_BSessionAnsweringTime['timeAnswering'] == 0])

dbAbandon_BSessionAnsweringTime = dbAbandon_BSessionAnsweringTime.loc[dbAbandon_BSessionAnsweringTime['timeAnswering'] != 0]
print 'count(#) of BSessions that has non-zero answering time:', len(dbAbandon_BSessionAnsweringTime)
dbAbandon_BSessionAnsweringTime.head()
count(#) of BSessions that *End time* is ZERO:
259
count(#) of BSessions that has non-zero answering time: 557
session startAnswer timeAnswering unit_id worker_id
0 01O0V2OS 1526312571508 50573 442_420 42958246
1 0286S4TK 1526240214393 32116 442_173 42711446
3 03KF0O7G 1526814650009 1400238 442_547 35065712
4 059WTKKH 1525834084043 9890 442_436 43621163
5 0744LHG5 1525864944787 597736 442_100 44159665

Summary - 5.4 - Time in Answering Questions per BROWSER Session - (2)

  • 259 BROWSER Sessions ended with a ZERO End Time, including the 26 ZERO-Start-Time BROWSER Sessions discussed above. The remaining 233 (= 259 - 26, 28.55% = 233 / 816) BROWSER Sessions have a Start Time but no End Time, which is evidence that the workers quit just after Start button pressed, without any other activity involved in answering questions.
  • The number of BROWSER Sessions with non-zero question-answering time is 557 (= 816 - 259); in these sessions, the workers did engage, to a greater or lesser extent, in answering the questions.

# statistics per BROWSER Session (time spent answering questions)
statis_BSessionAnsweringTime = (dbAbandon_BSessionAnsweringTime['timeAnswering'].sort_values()) / 1000
print 'distribution of answering time (in 1s)\n', statis_BSessionAnsweringTime.describe()
distribution of answering time (in 1s)
count     557.000000
mean      412.445447
std       405.828551
min         2.271000
25%       102.272000
50%       278.847000
75%       597.736000
max      2461.251000
Name: timeAnswering, dtype: float64
# manually count the frequency
c = np.array([0, 0, 0, 0, 0], dtype='float64')
for i in dbAbandon_BSessionAnsweringTime['timeAnswering'].tolist():
    if (i <= 60000):  # 1 minutes
        c[0] += 1
    elif (i <= 120000):  # 1-2 minutes
        c[1] += 1
    elif (i <= 300000):  # 2-5 minutes
        c[2] += 1
    elif (i <= 600000):  # 5-10 minutes
        c[3] += 1
    else:  # 10+ minutes
        c[4] += 1
print c, '\n', c/557
[ 88.  67. 134. 129. 139.] 
[0.15798923 0.12028725 0.24057451 0.23159785 0.24955117]
plt.hist(statis_BSessionAnsweringTime, density = False, edgecolor = 'white', linewidth = 0.7, bins = 20)
plt.xlabel('time spent on answering questions in a single BROWSER Session (in seconds)')
plt.ylabel('frequency of occurrence')
plt.show()

Summary - 5.4 - Time in Answering Questions per BROWSER Session - (3)

  • The average time spent answering questions in a single BROWSER Session is 412 seconds, i.e. nearly 7 minutes, which again lies between the 50th and 75th percentiles and indicates a long right tail.
  • The percentage of BROWSER Sessions with less than one minute spent answering questions is 15.8%, and 12.03% of sessions have an answering time between one and two minutes.
  • The remaining BROWSER Sessions can be divided into three groups of approximately equal size, with answering times of 2-5 minutes, 5-10 minutes and more than 10 minutes; the percentages are 24.06%, 23.16% and 24.96% respectively.
  • The longest time a worker spent answering questions is 41 minutes. In other words, this worker spent that long trying to earn $0.2 but gave up in the end without any monetary reward.

# comparison of distribution
plt.hist([dbAbandon_BSessionTotalTime.sessionTime_total, statis_BSessionAnsweringTime], label = ['entire BROWSER Session', 'answering questions'], density = False, edgecolor = 'white', linewidth = 0.7, bins = 10)
plt.xlabel('time spent in a single BROWSER Session (in seconds)')
plt.ylabel('frequency of occurrence')
plt.legend(loc = 'upper right', bbox_to_anchor = (0.95, 0.98), ncol = 1, fancybox = True, shadow = True)
plt.title('Comparison of Distribution in Absolute Value')
plt.show()
# comparison of distribution
plt.hist([dbAbandon_BSessionTotalTime.sessionTime_total, statis_BSessionAnsweringTime], label = ['entire BROWSER Session', 'answering questions'], density = True, edgecolor = 'white', linewidth = 0.7, bins = 10)
plt.xlabel('time spent in a single BROWSER Session (in seconds)')
plt.ylabel('probability density')
plt.legend(loc = 'upper right', bbox_to_anchor = (0.95, 0.98), ncol = 1, fancybox = True, shadow = True)
plt.title('Comparison of Probability Density')
plt.show()

Summary - 5.4 - Time in Answering Questions per BROWSER Session - (4)

  • The two plots above compare the time spent answering questions against the time spent in entire BROWSER Sessions.
  • In the absolute-value comparison (first figure), the difference between the two series in the first bar is caused by the elimination of ZERO answering times. Recall that 259 BROWSER Sessions ended before any click on an alternative answer in the question pages, and we omitted these sessions when computing the time spent answering questions.
  • In the probability density comparison (second figure), the densities multiplied by the bin widths sum to 1 for each series. The 259 excluded BROWSER Sessions all have an answering time of 0, so they contribute nothing beyond the very first bin, and the probability densities of the two series follow the same shape.
  • Combining this comparison with the one in the previous section, we can conclude that the time spent reading instructions, answering questions and in entire BROWSER Sessions follows the same distribution across worker-tasks. Statistically, the same proportion of workers spent the same proportion of time reading instructions and answering questions as engaging in the whole session.

5.5 - time in answering questions per question

  • Each question actually introduces 2 types of logging messages: Click on alternative choices and Next button event.
  • A logging msg is sent to the server each time a worker makes a choice; therefore, if workers change their choices before going to the next question, multiple Click on msgs are logged before the Next button event.
  • In this section, timestamp is used because, on the worker's side, the browser timestamp is more accurate than server_time in the presence of network delay.
# extract time of answering EACH questions for each BROWSER Session
db1 = json.loads(dbAbandon_BSessionMultiMsg.sort_values(['session', 'timestamp']).to_json(orient = "records"))
db2 = []

currSession = ''
currServerTime_startQ = 0
currServerTime_endQ = 0
currObj = {}
questionNo = 0
for i, r in enumerate(db1):
    if (r['session'] != currSession):  # new BROWSER session
        if (i > 0):
            if (currServerTime_endQ == 0):
                currObj['timeAnswering'] = 0
            else:
                currObj['timeAnswering'] = currServerTime_endQ - currServerTime_startQ
            db2.append(currObj)
        currSession = r['session']
        currServerTime_startQ = 0
        currServerTime_endQ = 0
        questionNo = 0
        currObj = {"session": currSession, "unit_id": r['unit_id'], "worker_id": r['worker_id']}
        currObj['startAnswer'] = currServerTime_startQ
        currObj['questionNo'] = questionNo
    if (currServerTime_startQ == 0):
        if (r['message'][0] == "S" or r['message'][0] == "{"):
            currServerTime_startQ = r['timestamp']
            questionNo += 1
            currObj['startAnswer'] = currServerTime_startQ
            currObj['questionNo'] = questionNo
    else:
        if (r['message'][0] == "{"):
            currServerTime_endQ = r['timestamp']
            if ('msg' in json.loads(r['message']).keys() and json.loads(r['message'])["msg"] == 'change doc'):
                if (currServerTime_endQ == 0):
                    currObj['timeAnswering'] = 0
                else:
                    currObj['timeAnswering'] = currServerTime_endQ - currServerTime_startQ
                db2.append(currObj)
                currServerTime_startQ = r['timestamp']
                questionNo += 1
                currServerTime_endQ = 0
                currObj = {"session": currSession, "unit_id": r['unit_id'], "worker_id": r['worker_id']}
                currObj['startAnswer'] = currServerTime_startQ
                currObj['questionNo'] = questionNo
        else:
            if (currServerTime_endQ == 0):
                currObj['timeAnswering'] = 0
            else:
                currObj['timeAnswering'] = currServerTime_endQ - currServerTime_startQ
            db2.append(currObj)
            currServerTime_startQ = r['timestamp']
            questionNo += 1
            currServerTime_endQ = 0
            currObj = {"session": currSession, "unit_id": r['unit_id'], "worker_id": r['worker_id']}
            currObj['startAnswer'] = currServerTime_startQ
            currObj['questionNo'] = questionNo
    if (i == len(db1) - 1):
        if (currServerTime_endQ == 0):
            currObj['timeAnswering'] = 0
        else:
            currObj['timeAnswering'] = currServerTime_endQ - currServerTime_startQ
        db2.append(currObj)

# (DB Abandon) time of answering questions for each BROWSER Session
dbAbandon_questionTime_raw = pd.DataFrame(db2)
print 'count(#) of records of both questions AND FINAL CHECKs:', len(dbAbandon_questionTime_raw)
print 'count(#) of *FINAL CHECK*:', len(dbAbandon_questionTime_raw.loc[dbAbandon_questionTime_raw['timeAnswering'] <= 100])
count(#) of records of both questions AND FINAL CHECKs: 5837
count(#) of *FINAL CHECK*: 928
# (DB Abandon) time per question (excluding *FINAL CHECK*)
dbAbandon_questionTime_question = dbAbandon_questionTime_raw.loc[dbAbandon_questionTime_raw['timeAnswering'] > 100]
print 'count(#) of questions that were answered in DB Abandon:', len(dbAbandon_questionTime_question)
print 'count(#) of BROWSER sessions that has non-zero answering question time:', dbAbandon_questionTime_question.session.nunique()

# consistency checking results:
print '\nThere is ONE BSession in *non-zero answering question time* dataset but NOT in *non-zero answering time* dataset.'
print 'The reason is that in previous analysis *server_time* was adopted, but here it is based on *timestamp*.'
print 'There is a slight difference between these two columns, probably due to network delay. See *sessionID* = PZL3DMDH for details.'
count(#) of questions that were answered in DB Abandon: 4909
count(#) of BROWSER sessions that has non-zero answering question time: 558

There is ONE BSession in *non-zero answering question time* dataset but NOT in *non-zero answering time* dataset.
The reason is that in previous analysis *server_time* was adopted, but here it is based on *timestamp*.
There is a slight difference between these two columns, probably due to network delay. See *sessionID* = PZL3DMDH for details.
# statistics & plot (analysis) per BROWSER Session (do not consider *less time spent when restarted the same one*)
statis_questionAnsweringTime = (dbAbandon_questionTime_question['timeAnswering'].sort_values()) / 1000
print 'distribution of answering time for EACH question (in 1s)\n', statis_questionAnsweringTime.describe()

vecX = statis_questionAnsweringTime.round(0).value_counts().sort_index().index.tolist()  # the number of count(#) observed
vecY = statis_questionAnsweringTime.round(0).value_counts().sort_index().tolist()  # frequency of occurrence
print vecX[:100]
print vecY[:100]

plt.plot(vecX[:100], vecY[:100])
#plt.hist(statis_questionAnsweringTime, density = False, edgecolor = 'white', linewidth = 0.7, bins = 50)
plt.xlabel('time spent in answering each question (in seconds)')
plt.ylabel('frequency of occurrence')
plt.show()
distribution of answering time for EACH question (in 1s)
count    4909.000000
mean       45.748237
std        79.413043
min         0.677000
25%         7.048000
50%        20.312000
75%        50.282000
max      1446.453000
Name: timeAnswering, dtype: float64
[1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0, 30.0, 31.0, 32.0, 33.0, 34.0, 35.0, 36.0, 37.0, 38.0, 39.0, 40.0, 41.0, 42.0, 43.0, 44.0, 45.0, 46.0, 47.0, 48.0, 49.0, 50.0, 51.0, 52.0, 53.0, 54.0, 55.0, 56.0, 57.0, 58.0, 59.0, 60.0, 61.0, 62.0, 63.0, 64.0, 65.0, 66.0, 67.0, 68.0, 69.0, 70.0, 71.0, 72.0, 73.0, 74.0, 75.0, 76.0, 77.0, 78.0, 79.0, 80.0, 81.0, 82.0, 83.0, 84.0, 85.0, 86.0, 87.0, 88.0, 89.0, 90.0, 91.0, 92.0, 93.0, 94.0, 95.0, 96.0, 97.0, 98.0, 99.0, 100.0]
[80, 234, 292, 231, 170, 142, 142, 101, 125, 101, 109, 96, 95, 98, 84, 87, 91, 60, 53, 77, 68, 53, 59, 56, 58, 63, 52, 56, 44, 56, 40, 42, 40, 40, 26, 36, 34, 36, 36, 34, 46, 36, 31, 25, 28, 28, 22, 23, 16, 36, 26, 21, 21, 20, 19, 13, 19, 17, 21, 20, 11, 10, 16, 17, 14, 19, 19, 18, 12, 12, 15, 13, 12, 15, 12, 16, 13, 15, 14, 13, 16, 13, 8, 7, 11, 14, 13, 10, 7, 13, 14, 8, 8, 8, 9, 11, 7, 6, 4, 2]
# manually count the frequency
c = np.array([0, 0, 0, 0, 0], dtype='float64')
for i in statis_questionAnsweringTime.tolist():
    if (i <= 5):  # <= 5 seconds
        c[0] += 1
    elif (i <= 10):  # 5-10 seconds
        c[1] += 1
    elif (i <= 30):  # 10-30 seconds
        c[2] += 1
    elif (i <= 60):  # 30-60 seconds
        c[3] += 1
    else:  # more than 1 minute
        c[4] += 1
print c, '\n', c/4909
[ 919.  654. 1443.  859. 1034.] 
[0.18720717 0.13322469 0.29394989 0.17498472 0.21063353]
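
As a cross-check (not part of the original analysis), the same bucket counts could also be obtained directly with pandas; a minimal sketch using pd.cut over the same boundaries:

# (sketch) cross-check of the manual bucket counts using pd.cut (boundaries in seconds)
bins = [0, 5, 10, 30, 60, np.inf]
labels = ['<= 5s', '5-10s', '10-30s', '30-60s', '> 60s']
buckets = pd.cut(statis_questionAnsweringTime, bins = bins, labels = labels)
print buckets.value_counts().reindex(labels)
print buckets.value_counts(normalize = True).reindex(labels)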

Summary - 5.5 - Time in Answering Questions per Question - (1)

  • The total number of questions answered across all BSessions is 4909. The average answering time per question is 45.7 seconds, with the minimum time spent on a single question being 0.677 seconds and the maximum 1446.5 seconds (about 24.1 minutes).
  • 919 questions (18.72% = 919 / 4909) were answered within 5 seconds, and 654 (13.32% = 654 / 4909) were answered in between 5 and 10 seconds. The proportions of questions answered in 10 to 30 seconds and in 30 to 60 seconds are 29.39% and 17.50% respectively. 1034 questions (21.06% = 1034 / 4909) took more than 1 minute.
  • A further observation is that workers who started the same task a second time, for example after failing the FINAL CHECK, spent significantly less time answering each question. This phenomenon is taken into account in the following section.

# split the FIRST & Subsequent attempt of each question from *db2[]* (Subsequent attempt is defined as any answers after the FIRST Final Check)
db3 = []  # FIRST attempt
db4 = []  # Subsequent attempt
currSession = ''
stayOnSession = True
for r in db2:
    if (r['session'] != currSession):  # new BROWSER session
        currSession = r['session']
        stayOnSession = True
    if (r['questionNo'] > 8 and r['timeAnswering'] < 100):  # a FINAL-CHECK-like record (near-zero answering time) marks the end of the FIRST attempt
        stayOnSession = False
    if (stayOnSession):
        db3.append(r)
    else:
        db4.append(r)

dbAbandon_questionTime_first = pd.DataFrame(db3)
dbAbandon_questionTime_subseq = pd.DataFrame(db4)

# discard any *Final Check* msg
dbAbandon_questionTime_first = dbAbandon_questionTime_first.loc[dbAbandon_questionTime_first['timeAnswering'] > 100]
dbAbandon_questionTime_subseq = dbAbandon_questionTime_subseq.loc[dbAbandon_questionTime_subseq['timeAnswering'] > 100]
print 'count(#) of questions before FIRST Final Check:', len(dbAbandon_questionTime_first)
print 'count(#) of questions after FIRST Final Check:', len(dbAbandon_questionTime_subseq)
print 'count(#) of BSession having FIRST & Subsequent attempt:', dbAbandon_questionTime_first.session.nunique(), dbAbandon_questionTime_subseq.session.nunique()
count(#) of questions before FIRST Final Check: 3056
count(#) of questions after FIRST Final Check: 1853
count(#) of BSession having FIRST & Subsequent attempt: 558 156

Summary - 5.5 - Time in Answering Questions per Question - (2)

  • Splitting into FIRST and Subsequent attempts divides the 4909 question answers into two groups: 3056 (62.25% = 3056 / 4909) questions were answered during a First attempt, while 1853 (37.75%) were answered during Subsequent attempts.
  • 156 out of the 558 BSessions (27.96%) contain a Subsequent attempt, meaning that in these BSessions the workers did not pass the FINAL CHECK at the first try and then restarted.

# statistics & plot (analysis) per BROWSER Session (in both FIRST & Subsequent attempt)
statis_questionTime_first = (dbAbandon_questionTime_first['timeAnswering'].sort_values()) / 1000
statis_questionTime_subseq = (dbAbandon_questionTime_subseq['timeAnswering'].sort_values()) / 1000
print 'distribution of answering time for EACH question in *FIRST* attempt (in 1s)\n', statis_questionTime_first.describe()
print '\ndistribution of answering time for EACH question in *Subsequent* attempt (in 1s)\n', statis_questionTime_subseq.describe()
distribution of answering time for EACH question in *FIRST* attempt (in 1s)
count    3056.000000
mean       66.564946
std        93.619119
min         1.410000
25%        18.196000
50%        36.843000
75%        76.720250
max      1446.453000
Name: timeAnswering, dtype: float64

distribution of answering time for EACH question in *Subsequent* attempt (in 1s)
count    1853.000000
mean       11.416955
std        18.992013
min         0.677000
25%         3.016000
50%         5.436000
75%        12.406000
max       397.208000
Name: timeAnswering, dtype: float64

Summary - 5.5 - Time in Answering Questions per Question - (3)

  • The average time spent on each question in a First attempt is about six times that in Subsequent attempts, and roughly the same ratio holds at the 25th, 50th and 75th percentiles. This suggests that the two distributions have a very similar shape and differ mainly by a scale factor of about six, i.e. the workers follow the same working patterns but move much faster in Subsequent attempts (see the quantile-ratio sketch below).
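
To make the "same shape, scaled by about six" observation concrete, the two samples can be compared quantile by quantile; a minimal sketch:

# (sketch) ratio of FIRST-attempt to Subsequent-attempt answering time at several quantiles
for q in [0.10, 0.25, 0.50, 0.75, 0.90]:
    ratio = statis_questionTime_first.quantile(q) / statis_questionTime_subseq.quantile(q)
    print 'quantile %.2f: first / subsequent = %.2f' % (q, ratio)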

# plot in line-chart
fig = plt.figure()
ax1 = fig.add_subplot(111)
ax2 = fig.add_subplot(111, frame_on = False)

# FIRST attempt
plot_questionTime_first = statis_questionTime_first.round(0).value_counts().sort_index()
vecX = plot_questionTime_first.index.tolist()  # observed answering times, rounded to whole seconds
vecY = plot_questionTime_first.tolist()        # frequency of occurrence
color1 = 'tab:pink'
ax1.plot(vecX[:270], vecY[:270], color = color1, alpha=0.5, label = 'first attempt')
ax1.set_xlabel('time spent in answering each question in FIRST attempt (seconds)', color = color1)
ax1.set_ylabel('frequency of occurrence', color = color1)
ax1.tick_params(axis = 'x', labelcolor = color1)
ax1.tick_params(axis = 'y', labelcolor = color1)

# SUBSEQUENT attempt
plot_questionTime_subseq = statis_questionTime_subseq.round(0).value_counts().sort_index()
vecX = plot_questionTime_subseq.index.tolist()  # observed answering times, rounded to whole seconds
vecY = plot_questionTime_subseq.tolist()        # frequency of occurrence
color2 = 'tab:blue'
ax2.plot(vecX[:55], vecY[:55], color = color2, label = 'subsequent attempt')
#ax2 = ax1.twinx()  # instantiate a second axes that shares the same x-axis
ax2.xaxis.tick_top()
ax2.yaxis.tick_right()
ax2.xaxis.set_label_position('top')
ax2.yaxis.set_label_position('right')
ax2.set_xlabel('time spent in answering each question in SUBSEQUENT attempt (seconds)', color = color2)
ax2.set_ylabel('frequency of occurrence', color = color2)
ax2.tick_params(axis='x', labelcolor = color2)
ax2.tick_params(axis='y', labelcolor = color2)

fig.legend(loc='upper right', bbox_to_anchor=(0.87, 0.87), ncol=1, fancybox=True, shadow=True)
plt.show()

Summary - 5.5 - Time in Answering Questions per Question - (4)

  • The plots strengthen the conclusion that workers follow the same patterns across attempts, although the time they spend per question in Subsequent attempts is only about 1/6 of that in their First attempts.
  • The two curves largely overlap (note the different axis scales), which suggests that this "principle of 1/6" holds across the whole distribution of workers rather than only on average (see the rescaled-distribution check below).
  • For example, suppose 100 workers are engaged in their second/third attempts while 500 workers only participate in a first attempt, and 150 of the 500 (30%) spend 9 minutes on their tasks. Then, among the 100 workers in their second/third attempts, roughly the same proportion (30%) should spend about 1/6 of that time, i.e. 9 min / 6 = 1.5 min (1 min 30 s). (How often does this hold? This will be checked across different HITs.)
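
One way to probe whether the "principle of 1/6" holds across the whole distribution rather than only on average is to rescale the First-attempt times by the observed ratio and compare the two samples; a minimal sketch using a two-sample Kolmogorov-Smirnov test, assuming scipy is available (not part of the original analysis):

from scipy import stats

# (sketch) rescale FIRST-attempt times by the observed mean ratio and compare the samples
ratio = statis_questionTime_first.mean() / statis_questionTime_subseq.mean()
ks_stat, p_value = stats.ks_2samp(statis_questionTime_first / ratio, statis_questionTime_subseq)
print 'observed mean ratio:', round(ratio, 2)
print 'KS statistic: %.3f, p-value: %.4f' % (ks_stat, p_value)
# a small KS statistic would support the view that the two distributions
# differ mainly by a scale factor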

5.6 - Total Time per Browser Session, comparison (a) First Attempt vs. Subsequent Attempt, (b) completed session vs. abandoned session

  • It is possible that a worker failed the FINAL CHECK in a First attempt but, after restarting, passed it and submitted the task successfully. Sessions that failed in the First attempt but completed in a Subsequent attempt provide richer information for a thorough comparison between workers who completed the tasks and those who did not. That analysis comes later; for now, the focus is on the total time workers spent completing their tasks at BROWSER Session level.
# read DB Submit
tmp_db = []
with open(DATA_PROCESSED_DIR + jsonFile + '.dbSubmit', 'r') as f:
    lines = f.readlines()
    for line in lines:
        lineJSON = json.loads(line.strip())
        tmp_db.append(lineJSON)

# unifying JSON in *json.dumps* format
for i, r in enumerate(tmp_db):
    if (type(r['message']) == dict):
        tmp_db[i]['message'] = json.dumps(r['message'])
    else:
        tmp_db[i]['message'] = r['message'].encode('utf-8')

# convert to pd format, sort, and then convert back to JSON
dbSubmit = pd.DataFrame(tmp_db)
db = json.loads(dbSubmit.sort_values(['session', 'timestamp']).to_json(orient = "records"))

# split *dbSubmit* into *submit in FIRST attempt* & *submit in Subsequent attempt*
firstSubmitSessionList = []
subseqSubmitSessionList = []
currSession = ''
passFinalCheck = True
firstTime = True
currObj = {}
i = 0
for r in db:
    if (currSession != r['session']):
        currSession = r['session']
        passFinalCheck = True
        firstTime = True
        currObj = {}
    if (currSession in firstSubmitSessionList):
        print '********* msg AFTER submit *********\n', r
        passFinalCheck = False   # no need to continue in current session
    else:
        if (r['message'][0] == "{"):  # JSON-formatted log message
            currObj = json.loads(r['message'])
            if (passFinalCheck):
                if ('final_checks_passed' in currObj.keys()):  # this message carries the result of a FINAL CHECK
                    if (firstTime):
                        if (currObj['final_checks_passed']):
                            firstSubmitSessionList.append(currSession)
                        else:
                            firstTime = False
                    else:
                        if (currObj['final_checks_passed']):
                            subseqSubmitSessionList.append(currSession)
print '(DB Submit) count(#) of sessions with *First attempt* completion:', len(firstSubmitSessionList)
print '(DB Submit) count(#) of sessions with *Subsequent attempt* completion:', len(subseqSubmitSessionList)

# DB in pd format
dbSubmit_firstSession = dbSubmit.loc[dbSubmit['session'].isin(firstSubmitSessionList)]
dbSubmit_subseqSession = dbSubmit.loc[dbSubmit['session'].isin(subseqSubmitSessionList)]
(DB Submit) count(#) of sessions with *First attempt* completion: 578
(DB Submit) count(#) of sessions with *Subsequent attempt* completion: 94
# copy from previous code - (FIRST ATTEMPT) extract FIRST & LAST message of each BROWSER Session
# note that "timestamp" is used here, whereas "server_time" was used in the previous section
db1 = json.loads(dbSubmit_firstSession.sort_values(['session', 'timestamp']).to_json(orient = "records"))
db2 = []
currSession = ''
currTimestamp = 0
currObj = {}
for i, r in enumerate(db1):
    if (r['session'] != currSession):  # new BROWSER session
        if (i != 0):
            currObj['last_timestamp'] = currTimestamp
            db2.append(currObj)
        currSession = r['session']
        currTimestamp = r['timestamp']
        currObj = {"session": currSession, "first_timestamp": currTimestamp, "unit_id": r['unit_id'], "worker_id": r['worker_id']}
    else:
        currTimestamp = r['timestamp']
        if (i == len(db1) - 1):
            currObj['last_timestamp'] = currTimestamp
            db2.append(currObj)

# compute the time (seconds) that each BROWSER Session lasts, the output is *db3[]*
db3 = []
for r in db2:
    db3.append({"sessionTime_total": int((r['last_timestamp'] - r['first_timestamp']) / 1000), "session": r['session'], "unit_id": r['unit_id'], "worker_id": r['worker_id']})

# (DB Submit) total time for each BROWSER Session
dbSubmitFirst_BSessionTotalTime = pd.DataFrame(db3)

##################################################################
# copy from previous code - (SUBSEQUENT ATTEMPT) extract FIRST & LAST message of each BROWSER Session
# note that "timestamp" is used here, whereas "server_time" was used in the previous section
db1 = json.loads(dbSubmit_subseqSession.sort_values(['session', 'timestamp']).to_json(orient = "records"))
db2 = []
currSession = ''
currTimestamp = 0
currObj = {}
for i, r in enumerate(db1):
    if (r['session'] != currSession):  # new BROWSER session
        if (i != 0):
            currObj['last_timestamp'] = currTimestamp
            db2.append(currObj)
        currSession = r['session']
        currTimestamp = r['timestamp']
        currObj = {"session": currSession, "first_timestamp": currTimestamp, "unit_id": r['unit_id'], "worker_id": r['worker_id']}
    else:
        currTimestamp = r['timestamp']
        if (i == len(db1) - 1):
            currObj['last_timestamp'] = currTimestamp
            db2.append(currObj)

# compute the time (seconds) that each BROWSER Session lasts, the output is *db3[]*
db3 = []
for r in db2:
    db3.append({"sessionTime_total": int((r['last_timestamp'] - r['first_timestamp']) / 1000), "session": r['session'], "unit_id": r['unit_id'], "worker_id": r['worker_id']})

# (DB Submit) total time for each BROWSER Session
dbSubmitSubseq_BSessionTotalTime = pd.DataFrame(db3)

# distribution in SECONDS
print 'distribution of total time of submitted sessions in FIRST ATTEMPT'
dbSubmitFirst_BSessionTotalTime['sessionTime_total'].describe()
distribution of total time of submitted sessions in FIRST ATTEMPT
count     578.000000
mean      837.875433
std       437.116800
min       195.000000
25%       470.250000
50%       743.500000
75%      1155.000000
max      3500.000000
Name: sessionTime_total, dtype: float64
# distribution in SECONDS
print 'distribution of total time of submitted sessions in SUBSEQUENT ATTEMPT'
dbSubmitSubseq_BSessionTotalTime['sessionTime_total'].describe()
distribution of total time of submitted sessions in SUBSEQUENT ATTEMPT
count      94.000000
mean      697.212766
std       429.490793
min       210.000000
25%       354.250000
50%       521.500000
75%       938.750000
max      1750.000000
Name: sessionTime_total, dtype: float64
# comparison
dbAbandon_BSessionTotalTime.sessionTime_total.describe()
count     816.000000
mean      393.700980
std       431.875981
min         2.000000
25%        86.000000
50%       230.500000
75%       534.750000
max      3285.000000
Name: sessionTime_total, dtype: float64
# plot in histogram
fig = plt.figure()
ax1 = fig.add_subplot(111)
ax2 = fig.add_subplot(111, frame_on = False)

# FIRST attempt
color1 = 'orange'
ax1.hist(dbSubmitFirst_BSessionTotalTime.sessionTime_total, label = 'first submit', color = color1, alpha=0.5, density = False, edgecolor = 'white', linewidth = 0.7, bins = 20)
ax1.set_xlabel('total time spent in tasks submitted in FIRST attempt (seconds)', color = color1)
ax1.set_ylabel('frequency of occurrence', color = color1)
ax1.tick_params(axis = 'x', labelcolor = color1)
ax1.tick_params(axis = 'y', labelcolor = color1)

# SUBSEQUENT attempt
color2 = 'gray'
ax2.hist(dbSubmitSubseq_BSessionTotalTime.sessionTime_total, label = 'subsequent submit', color = color2, alpha=0.5, density = False, edgecolor = 'white', linewidth = 0.7, bins = 20)
#ax2 = ax1.twinx()  # instantiate a second axes that shares the same x-axis
ax2.xaxis.tick_top()
ax2.yaxis.tick_right()
ax2.xaxis.set_label_position('top')
ax2.yaxis.set_label_position('right')
ax2.set_xlabel('total time spent in tasks submitted in SUBSEQUENT attempt (seconds)', color = color2)
ax2.set_ylabel('frequency of occurrence', color = color2)
ax2.tick_params(axis='x', labelcolor = color2)
ax2.tick_params(axis='y', labelcolor = color2)

fig.legend(loc='upper right', bbox_to_anchor=(0.87, 0.87), ncol=1, fancybox=True, shadow=True)
plt.show()
# comparison of distribution
plt.hist([dbSubmitFirst_BSessionTotalTime.sessionTime_total, dbSubmitSubseq_BSessionTotalTime.sessionTime_total], label = ['first submit', 'subsequent submit'], density = True, edgecolor = 'white', linewidth = 0, bins = 60)
plt.xlabel('time spent in a single BROWSER Session (in seconds)')
plt.ylabel('density of probability')
plt.legend(loc = 'upper right', bbox_to_anchor = (0.95, 0.98), ncol = 1, fancybox = True, shadow = True)
plt.title('Comparison of Probability Density')
plt.show()

Summary - 5.6 - Comparison of Total Time per BSession - (1)

  • We do not have strong evidence that the distributions of BSession duration differ between First and second/third attempts (a quantitative check is sketched below).
  • The two histograms and the probability densities overlap substantially, which indicates that the workers' patterns in doing the tasks persist even after restarting the HIT. The main difference is that sessions submitted in a Subsequent attempt tend to be somewhat shorter: their median duration (521.5 s) is about 70% of the First-attempt median (743.5 s).
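
The "no strong evidence" statement can be backed up quantitatively; a minimal sketch comparing the two duration samples with a Mann-Whitney U test, assuming a reasonably recent scipy is available (not part of the original analysis):

from scipy import stats

# (sketch) compare total BSession duration: FIRST-attempt vs SUBSEQUENT-attempt submissions
# (the *alternative* argument requires scipy >= 0.17)
u_stat, p_value = stats.mannwhitneyu(dbSubmitFirst_BSessionTotalTime['sessionTime_total'],
                                     dbSubmitSubseq_BSessionTotalTime['sessionTime_total'],
                                     alternative = 'two-sided')
print 'Mann-Whitney U: %.1f, p-value: %.4f' % (u_stat, p_value)
# a large p-value would be consistent with the two duration distributions being similar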

# comparison of TOTAL TIME of those who COMPLETED with who ABANDONED
fig = plt.figure()
ax1 = fig.add_subplot(111)
ax2 = fig.add_subplot(111, frame_on = False)

# those who completed
color1 = 'orange'
ax1.hist(pd.concat([dbSubmitFirst_BSessionTotalTime['sessionTime_total'], dbSubmitSubseq_BSessionTotalTime['sessionTime_total']]), label = 'completed', color = color1, alpha=0.5, density = False, edgecolor = 'white', linewidth = 0.7, bins = 20)
ax1.set_xlabel('total time spent in tasks completed (seconds)', color = color1)
ax1.set_ylabel('frequency of occurrence', color = color1)
ax1.tick_params(axis = 'x', labelcolor = color1)
ax1.tick_params(axis = 'y', labelcolor = color1)

# those who abandoned
color2 = 'gray'
ax2.hist(dbAbandon_BSessionTotalTime['sessionTime_total'], label = 'abandoned', color = color2, alpha=0.5, density = False, edgecolor = 'white', linewidth = 0.7, bins = 20)
#ax2 = ax1.twinx()  # instantiate a second axes that shares the same x-axis
ax2.xaxis.tick_top()
ax2.yaxis.tick_right()
ax2.xaxis.set_label_position('top')
ax2.yaxis.set_label_position('right')
ax2.set_xlabel('total time spent in tasks abandoned (seconds)', color = color2)
ax2.set_ylabel('frequency of occurrence', color = color2)
ax2.tick_params(axis='x', labelcolor = color2)
ax2.tick_params(axis='y', labelcolor = color2)

fig.legend(loc='upper right', bbox_to_anchor=(0.87, 0.87), ncol=1, fancybox=True, shadow=True)
plt.show()
# comparison of probability density
plt.hist([dbAbandon_BSessionTotalTime['sessionTime_total'], pd.concat([dbSubmitFirst_BSessionTotalTime['sessionTime_total'], dbSubmitSubseq_BSessionTotalTime['sessionTime_total']])], label = ['abandoned', 'completed'], density = True, edgecolor = 'white', linewidth = 0, bins = 60)
plt.xlabel('time spent in a single BROWSER Session (in seconds)')
plt.ylabel('density of probability')
plt.legend(loc = 'upper right', bbox_to_anchor = (0.95, 0.98), ncol = 1, fancybox = True, shadow = True)
plt.title('Comparison of Probability Density')
plt.show()

Summary - 5.6 - Comparison of Total Time per BSession - (2)

  • The two plots show that the distribution of duration per BSession of those who abandoned is very different from that of those who completed. Workers who completed spent longer on their tasks than those who quit at some point, and a large proportion of workers in the Abandoned group stopped at an early stage of their tasks.
  • As the duration increases in the Abandoned group, fewer and fewer workers remain, which suggests the hypothesis that the more time workers have already invested in a task, the less likely they are to leave without completing it (see the conditional completion-rate sketch below). (Is this always true? This will be verified across different HITs.)
  • The time comparison will later be carried out at question level, to answer whether workers in the Abandoned group spent much less time on EACH QUESTION than those in the Completed group.
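
The hypothesis in the second bullet above can be made measurable: for a set of duration thresholds, compute the proportion of submitted sessions among all BSessions that lasted at least that long. A minimal sketch, assuming the abandoned and submitted session sets are disjoint and using the variables defined earlier in this notebook:

# (sketch) conditional completion rate: among BSessions lasting at least t seconds,
# what fraction belongs to the Completed (submitted) group?
completedTimes = pd.concat([dbSubmitFirst_BSessionTotalTime['sessionTime_total'],
                            dbSubmitSubseq_BSessionTotalTime['sessionTime_total']])
abandonedTimes = dbAbandon_BSessionTotalTime['sessionTime_total']
for t in [0, 60, 120, 300, 600, 900]:
    nCompleted = (completedTimes >= t).sum()
    nTotal = nCompleted + (abandonedTimes >= t).sum()
    if (nTotal > 0):
        print 't >= %4d s: %4d sessions, completion rate = %.2f' % (t, nTotal, float(nCompleted) / nTotal)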

 