Introduction

This study aims to determine whether receiving thanks affects editor activity.

SQL Query

  • returns historical edit and thank information for a set of editors

use PROJECT;

select A.user_name as Username,
       A.user_registration as Registration_Date,
       B.num_edits as Num_Edits,
       coalesce(C.num_thanks, 0) as Num_Thanks,
       coalesce(D.will_be_thanked, 0) as Thanked_Tomorrow,
       coalesce(E.thanks, 0) as Weeks_Thanks,
       coalesce(F.edits, 0) as Weeks_Edits,
       coalesce(G.edits, 0) as Future_Edits
from (select user_name, user_registration
      from user
      where user_registration < timestamp(TIME2)) as A
join (select rev_user_text, count(rev_user_text) as num_edits
      from revision
      where rev_timestamp < timestamp(TIME1)
        and rev_timestamp >= timestamp(TIME2)
        and rev_user != 0
      group by rev_user_text) as B
  on A.user_name = B.rev_user_text
left join (select log_title, count(log_title) as num_thanks
           from logging_userindex
           where log_action = 'thank' and log_type = 'thanks'
             and log_timestamp < timestamp(TIME1)
             and log_timestamp >= timestamp(TIME2)
           group by log_title) as C
  on A.user_name = C.log_title
left join (select log_title, count(log_title) as will_be_thanked
           from logging_userindex
           where log_action = 'thank' and log_type = 'thanks'
             and log_timestamp < timestamp(TIME3)
             and log_timestamp >= timestamp(TIME1)
           group by log_title) as D
  on A.user_name = D.log_title
left join (select log_title, count(log_title) as thanks
           from logging_userindex
           where log_action = 'thank' and log_type = 'thanks'
             and log_timestamp < timestamp(TIME1)
             and log_timestamp >= timestamp(TIME4)
           group by log_title) as E
  on A.user_name = E.log_title
left join (select rev_user_text, count(rev_user_text) as edits
           from revision
           where rev_timestamp < timestamp(TIME1)
             and rev_timestamp >= timestamp(TIME4)
             and rev_user != 0
           group by rev_user_text) as F
  on A.user_name = F.rev_user_text
left join (select rev_user_text, count(rev_user_text) as edits
           from revision
           where rev_timestamp < timestamp(TIME5)
             and rev_timestamp >= timestamp(TIME3)
             and rev_user != 0
           group by rev_user_text) as G
  on A.user_name = G.rev_user_text
order by B.num_edits;

Example:

PROJECT = plwiki_p

Data for Number of Edits One Day After Thank

TIME1 = '2018-03-08' TIME2 = '2017-12-08' TIME3 = '2018-03-09' TIME4 = '2018-03-01' TIME5 = '2018-03-10'

Data for Number of Edits One Week After Thank

TIME1 = '2018-03-08' TIME2 = '2017-12-08' TIME3 = '2018-03-09' TIME4 = '2018-03-01' TIME5 = '2018-03-16'

Data for Number of Edits One Month After Thank

TIME1 = '2018-03-08' TIME2 = '2017-12-08' TIME3 = '2018-03-09' TIME4 = '2018-03-01' TIME5 = '2018-04-09'

Data for Number of Edits One Quarter After Thank

TIME1 = '2018-03-08' TIME2 = '2017-12-08' TIME3 = '2018-03-09' TIME4 = '2018-03-01' TIME5 = '2018-06-09'
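
The TIME values follow a fixed pattern relative to the reference date (TIME1). Below is a minimal sketch (not part of the original pipeline) of how they could be derived, assuming the day-count offsets visible in the examples above; note that the one-month and one-quarter horizons above use calendar offsets rather than fixed day counts.

from datetime import date, timedelta

def make_time_params(ref=date(2018, 3, 8), horizon_days=1):
    #derives the substitution values for the query; horizon_days is the length
    #of the future-edit window that starts the day after the reference date
    return {
        'TIME1': str(ref),                                     # reference date
        'TIME2': str(ref - timedelta(days=90)),                # start of the 3-month history window
        'TIME3': str(ref + timedelta(days=1)),                 # day after the reference date
        'TIME4': str(ref - timedelta(days=7)),                 # start of the 1-week history window
        'TIME5': str(ref + timedelta(days=1 + horizon_days)),  # end of the future-edit window
    }

make_time_params(horizon_days=7) #reproduces the one-week example above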

The goal of this study is to see whether receiving a thank leads to a change in edit behavior. We control for five variables:

  • Tenure: the number of days since the editor registered
  • Edits: the editor's edit count over the three months leading up to the thank
  • Thanks: the editor's thanks-received count over the same three months
  • Short-term edits: the editor's edit count over the week leading up to the thank
  • Short-term thanks: the editor's thanks-received count over the same week

Each editor also has a sixth field recording whether they received a thank the next day (the day after the one relative to which the five features were collected). We can then compare the future edit counts of editors who received a thank against those who did not; because the other features are controlled for, any differences can plausibly be attributed to the thank, though unmeasured confounders remain possible.
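
For concreteness, here is one formatted record in dictionary form (illustrative only; the actual output is a row of Formatted2Data.csv, and the values are taken from a thanked editor that appears later in the matching section):

example_row = {
    'ID': 4855,              # row identifier assigned during formatting
    'Tenure': 279,           # days since registration, as of the reference date
    'Edits': 27,             # edits over the prior three months
    'Thanks': 0,             # thanks received over the prior three months
    'Short-term Edits': 26,  # edits over the prior week
    'Short-term Thanks': 0,  # thanks received over the prior week
    'Prediction': 1,         # the treatment: 1 if thanked the next day, else 0
    'Future Edits': 4,       # the outcome: edits in the future window
}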

Format Data

import csv
from datetime import datetime
#filename variables -- tells computer where to find files and what they're called
src_stem = '(part 2)-data/'
src_proj = 'pl-data'
src_stem += src_proj + '/'
srcs = ['next-day/', 'next-week/', 'next-month/', 'next-quarter/']
input_prefix = ''
input_suffix = '.csv'
output_prefix = 'Formatted2'
output_suffix = 'Data.csv'
results_prefix = '(part 2)-results/'
results_suffix = '-results.csv'

#date variables -- used to make reading files easy if they're in my naming convention, ex 'Mar2018'
start_year = 2018
start_month = 3
start_day = 8
num_months = 12

#timestamp variables -- used to read the timestamps in the input files
dtime_format = "%Y%m%d"
dtime_len = 8

input_files = []
output_files = []

#change features and format_data_inner() if using different features
#features = ['Tenure', 'Edits', 'Thanks', 'User Group', 'Total Edits']
features = ['Tenure', 'Edits', 'Thanks', 'Short-term Edits', 'Short-term Thanks']
#takes in a start year, a start month, and a timeframe and returns labels
def generate_dates(start_year, m, n):
    m -= 1
    months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
    dates = []
    for i in range(0, n):
        dates.append(months[m]+str(start_year))
        m -= 1
        if (m < 0):
            m = len(months) - 1
            start_year -= 1
    return dates
#uses the start dates to make a timestamp object
def generate_timestamp(year=start_year, month=start_month, day=start_day):
    return datetime(year, month, day)
months = generate_dates(start_year, start_month, num_months)
start_time = generate_timestamp(start_year, start_month, start_day)
#these variables are used in the find_group_size function
num_groups = 10
sample = [0, 1] #[0, 1] is all data, [0.2, 0.8] would be the data between the 20th and 80th percentiles, for example
file_size, group_size, remainder = -1, -1, -1
#finds the numerical size of a percentage group (ex: how large is 5%)
def find_group_size(input_file, n):
    #computes file_size, group_size, and remainder for the given file
    global file_size, group_size, remainder
    file_size = 0
    with open(input_file, 'rt', encoding = 'utf-8') as csvfile:
        rder = csv.DictReader(csvfile)
        for row in rder:
            file_size += 1
    group_size = int(file_size/n)
    remainder = file_size - group_size * n
#uses the file variables to make a list of the relevant files
def make_file_lst():
    for src in srcs:
        data_category = []
        for month in months:
            input_file = src_stem + src + input_prefix + month + input_suffix
            data_category.append(input_file)
        output_file = src_stem + src + output_prefix + output_suffix
        output_files.append(output_file)
        input_files.append(data_category)
make_file_lst()
#passes format_data_inner() the input_files by category 
#category = looking at future edits over a day (or a week or a month...)
def format_data(input_files=input_files, output_files=output_files):
    for i in range(0, len(input_files)):
        format_data_inner(input_files[i], output_files[i])
#combines the files in a category into one file that contains the data in its proper form
#(ex tenure as a number of days as opposed to a string) and assigns ids
def format_data_inner(input_files, output_file, sample=sample, num_groups=num_groups): 
    next_id=0
    global file_size, group_size, remainder
    with open(output_file, 'w', encoding = 'utf-8') as csvfile:
        fieldnames = ['ID'] + features + ['Prediction', 'Future Edits']
        wrter = csv.DictWriter(csvfile, fieldnames=fieldnames)
        wrter.writeheader()

        for input_file in input_files:
            
            find_group_size(input_file, num_groups)
            start_group = round(sample[0] * num_groups)
            end_group = round(sample[1] * num_groups)
            
            with open(input_file, 'r', encoding = 'utf-8') as csvfile:
                rder = csv.DictReader(csvfile)
                
                i, j = 0, 0
                adjusted_group_size = group_size
                for row in rder:
                    next_id += 1
                    i += 1
                    pred = 0 if (int(row['Thanked_Tomorrow']) == 0) else 1
                    tenure = find_tenure_length(row['Registration_Date'])
                    if (j >= start_group and j < end_group):
                        wrter.writerow({'ID' : next_id, 'Tenure' : tenure, 'Edits' : row['Num_Edits'], 'Thanks' : row['Num_Thanks'],
                                        'Short-term Edits' : row['Weeks_Edits'], 'Short-term Thanks' : row['Weeks_Thanks'], 
                                        'Prediction' : pred, 'Future Edits' : row['Future_Edits']})
                    if (i == adjusted_group_size):
                        i = 0
                        j += 1
                        if (j == num_groups - remainder):
                            adjusted_group_size += 1
#converts the registration string to an integer representing the number of days since registration
def find_tenure_length(strtime, d1=start_time, dtime_format=dtime_format):
    d2 = datetime.strptime(strtime[:dtime_len], dtime_format)
    return abs((d1-d2).days)
#counts how many rows across all files were thanked the next day (t) vs not (u)
def find_thanked_total(input_files=input_files, output_files=output_files):
    t = 0
    u = 0
    for i in range(0, len(input_files)):
        d = find_thanked_total_inner(input_files[i], output_files[i])
        t += d[0]
        u += d[1]
    return [t, u]
def find_thanked_total_inner(input_files, output_file, sample=sample, num_groups=num_groups):
    t = 0
    u = 0
    global file_size, group_size, remainder

    for input_file in input_files:
        find_group_size(input_file, num_groups)
        start_group = round(sample[0] * num_groups)
        end_group = round(sample[1] * num_groups)

        with open(input_file, 'r', encoding = 'utf-8') as csvfile:
            rder = csv.DictReader(csvfile)

            i, j = 0, 0
            adjusted_group_size = group_size
            for row in rder:
                i += 1
                if (int(row['Thanked_Tomorrow']) == 1):
                    t += 1
                else:
                    u += 1
                #note: the sampling check is commented out, so the totals cover every row
                #if (j >= start_group and j < end_group):
                if (i == adjusted_group_size):
                    i = 0
                    j += 1
                    if (j == num_groups - remainder):
                        adjusted_group_size += 1
    return [t, u]
#only needs to be run once
#format_data()
find_thanked_total()
[888, 256264]
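
Only 888 of the 257,152 rows received a thank the next day. This heavy class imbalance is why the classifiers below randomly downsample the unthanked majority before training.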

Data Analysis

Several forms of data analysis are presented below:

import sklearn
import numpy
import random

srcs[0] = next day's edit data, srcs[1] = next week's edit data, srcs[2] = next month's edit data, and srcs[3] = next quarter's edit data

input_file = src_stem + srcs[0] + output_prefix + output_suffix
x_train, y_train, x_test, y_test = [], [], [], []
x_set, y_set = [], []
set_ids = []
Random Forest 1
  • Used to evaluate the power of both the features and the treatment (receiving a thank) to predict future edit count.
import math
x_train, y_train, x_test, y_test = [], [], [], []
num_categories = 2
#random forest will predict which category a person falls into
#num_categories = 2 => one category for edit count <= x and one for > x
def get_data_for_random_forest(input_file=input_file):
    fut_edits_avg = 0
    total = 0
    with open(input_file, 'r', encoding = 'utf-8') as csvfile:
        rder = csv.DictReader(csvfile)
        for row in rder:
            fut_edits_avg += int(row['Future Edits'])
            total += 1
            
    fut_edits_avg *= 1.0/total
    fut_edits_avg = math.ceil(fut_edits_avg)

    print(fut_edits_avg)
    
    with open(input_file, 'r', encoding = 'utf-8') as csvfile:
        rder = csv.DictReader(csvfile)
        for row in rder:
            lst = []
            for f in features:
                lst.append(int(row[f]))
            lst.append(int(row['Prediction']))
            d = numpy.array(lst)
            
            #two categories: fut_edits <= fut_edits_avg, fut_edits > fut_edits_avg
            fut_edits = 0
            if (int(row['Future Edits']) > fut_edits_avg):
                fut_edits = 1
            
            #downsample the majority category (editors at or below the average), keeping ~10% of it
            if (fut_edits == 0):
                if (random.random() >= 0.1):
                    continue

            #~75/25 train/test split
            dest = [x_train, y_train]
            if (random.random() < 0.25):
                dest = [x_test, y_test]
            dest[0].append(d)
            dest[1].append(fut_edits)
input_file
'(part 2)-data/pl-data/next-day/Formatted2Data.csv'
get_data_for_random_forest() #makes training and test data
2
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(max_depth=5)
clf.fit(x_train, y_train)
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=5, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
#transposes x_lst so that each row holds one variable, as numpy.corrcoef expects
def make_lst(x_lst):
    lst = [[0]*len(x_lst) for i in range(0, len(x_lst[0]))]
    for i in range(0, len(x_lst[0])):
        for j in range(0, len(x_lst)):
            lst[i][j] = x_lst[j][i]
    return lst
x_lst = make_lst(x_train)
y_lst = y_train
numpy.corrcoef(x_lst, y_lst)
array([[1.        , 0.02327246, 0.10493703, 0.03674099, 0.08276475,
        0.057167  , 0.21532266],
       [0.02327246, 1.        , 0.04343161, 0.25771246, 0.03046261,
        0.0182006 , 0.09883411],
       [0.10493703, 0.04343161, 1.        , 0.02828037, 0.6627535 ,
        0.38345555, 0.3455305 ],
       [0.03674099, 0.25771246, 0.02828037, 1.        , 0.02746271,
        0.01395177, 0.07410533],
       [0.08276475, 0.03046261, 0.6627535 , 0.02746271, 1.        ,
        0.29325322, 0.25777004],
       [0.057167  , 0.0182006 , 0.38345555, 0.01395177, 0.29325322,
        1.        , 0.19463442],
       [0.21532266, 0.09883411, 0.3455305 , 0.07410533, 0.25777004,
        0.19463442, 1.        ]])
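
The rows and columns of this matrix are ordered: Tenure, Edits, Thanks, Short-term Edits, Short-term Thanks, Prediction (the treatment), and finally the binarized future-edit category.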
#calculate true/false positives, true/false negatives, accuracy, precision, and recall
def get_stats(x_test=x_test, y_test=y_test):
    pred = clf.predict(x_test)
    stats = [[0]*4 for _ in range(num_categories)]
    for i in range(0, len(stats)):
        for j in range(0, len(pred)):
            stats[i][0] += 1 if (pred[j] == i and y_test[j] == i) else 0
            stats[i][1] += 1 if (pred[j] == i and y_test[j] != i) else 0
            stats[i][2] += 1 if (pred[j] != i and y_test[j] != i) else 0
            stats[i][3] += 1 if (pred[j] != i and y_test[j] == i) else 0
    print ("tp, fp, tn, fn")
    print (stats)
    
    data = [[0]*3 for _ in range(num_categories)]
    for i in range(0, len(data)):
        tp = stats[i][0]
        fp = stats[i][1]
        tn = stats[i][2]
        fn = stats[i][3]
        data[i][0] = (tp+tn)/(tp+fp+tn+fn)
        data[i][1] = tp/(tp+fp)
        data[i][2] = tp/(tp+fn)
    print ("accuracy, precision, recall")
    print (data)
get_stats(x_test, y_test)
#prints stats for both categories on the test data
tp, fp, tn, fn
[[1407, 105, 586, 93], [586, 93, 1407, 105]]
accuracy, precision, recall
[[0.90963030579644, 0.9305555555555556, 0.938], [0.90963030579644, 0.8630338733431517, 0.8480463096960926]]
get_stats(x_train, y_train)
#same as above only with training data
tp, fp, tn, fn
[[4369, 287, 1865, 291], [1865, 291, 4369, 287]]
accuracy, precision, recall
[[0.9151497357604228, 0.9383591065292096, 0.9375536480686695], [0.9151497357604228, 0.8650278293135436, 0.866635687732342]]
 
Random Forest 2
  • Used to evaluate the power of the features to predict whether or not a person will receive a thank.
x_train, y_train, x_test, y_test = [], [], [], []
def get_data_for_random_forest(input_file=input_file):
    with open(input_file, 'r', encoding = 'utf-8') as csvfile:
        rder = csv.DictReader(csvfile)
        for row in rder:
            lst = []
            for f in features:
                lst.append(int(row[f]))
            d = numpy.array(lst)
            
            pred = int(row['Prediction'])
                
            #downsample the unthanked majority, keeping ~5% of it
            if (pred == 0):
                if (random.random() >= 0.05):
                    continue

            #~75/25 train/test split
            dest = [x_train, y_train]
            if (random.random() < 0.25):
                dest = [x_test, y_test]
            dest[0].append(d)
            dest[1].append(pred)
get_data_for_random_forest()
clf = RandomForestClassifier(max_depth=5)
clf.fit(x_train, y_train)
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=5, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
get_stats(x_test, y_test)
#prints stats for both categories on the test data
tp, fp, tn, fn
[[809, 31, 38, 10], [38, 10, 809, 31]]
accuracy, precision, recall
[[0.9538288288288288, 0.9630952380952381, 0.9877899877899878], [0.9538288288288288, 0.7916666666666666, 0.5507246376811594]]
get_stats(x_train, y_train)
#same as above only with training data
tp, fp, tn, fn
[[2359, 62, 139, 26], [139, 26, 2359, 62]]
accuracy, precision, recall
[[0.9659706109822119, 0.9743907476249484, 0.9890985324947589], [0.9659706109822119, 0.8424242424242424, 0.6915422885572139]]
Linear Regression
  • Used to evaluate the power of both the features and the treatment to predict future edit count.
x_train, y_train, x_test, y_test = [], [], [], []
def get_data_for_linear_regr(input_file=input_file):
    with open(input_file, 'r', encoding = 'utf-8') as csvfile:
        rder = csv.DictReader(csvfile)
        for row in rder:
            lst = []
            for f in features:
                lst.append(int(row[f]))
            lst.append(int(row['Prediction']))
            d = numpy.array(lst)
            
            #~75/25 train/test split
            dest = [x_train, y_train]
            if (random.random() < 0.25):
                dest = [x_test, y_test]
            dest[0].append(d)
            dest[1].append(int(row['Future Edits']))
get_data_for_linear_regr() #makes training and test data
from sklearn.linear_model import LinearRegression
linearRegr = LinearRegression()
linearRegr.fit(x_train[-1000:], y_train[-1000:])
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
def model_accuracy(x_test=x_test, y_test=y_test, tolerance=[5, 0.1]):
    total = 0
    true = 0
    for i in range(0, len(x_test)):
        x = x_test[i]
        y = y_test[i]
        y_pred = linearRegr.predict(x.reshape(1, -1))[0]
        total += 1
        dif = abs(y-y_pred)                                                            
        if (dif <= tolerance[0] or dif<=tolerance[1]*y):
            true += 1
    return true*1.0/total
#only looking at the accuracy of the last 500 examples  (that's where most of the relevant examples are)
model_accuracy(x_test[-500:], y_test[-500:])
0.908
Logistic Regression
  • Used to evaluate the power of the features to predict whether or not a person will receive a thank.
x_train, y_train, x_test, y_test = [], [], [], []
def get_data_for_logistic_regr(input_file=input_file):
    with open(input_file, 'r', encoding = 'utf-8') as csvfile:
        rder = csv.DictReader(csvfile)
        for row in rder:
            lst = []
            for f in features:
                lst.append(int(row[f]))
            d = numpy.array(lst)
            
            #~75/25 train/test split
            dest = [x_train, y_train]
            if (random.random() < 0.25):
                dest = [x_test, y_test]
            dest[0].append(d)
            dest[1].append(int(row['Prediction']))
get_data_for_logistic_regr(input_file)
from sklearn.linear_model import LogisticRegression
logisticRegr = LogisticRegression(class_weight='balanced')
logisticRegr.fit(x_train, y_train)
LogisticRegression(C=1.0, class_weight='balanced', dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False)
odds = [numpy.exp(x) for x in logisticRegr.coef_]
odds
[array([1.00010152, 0.99998481, 1.4643195 , 1.00246761, 2.08065692])]
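
Each value is the exponentiated coefficient for one feature (in the order Tenure, Edits, Thanks, Short-term Edits, Short-term Thanks), i.e. the multiplicative change in the odds of being thanked per unit increase in that feature. For example, each thank received over the prior three months multiplies the odds by about 1.46, and each thank over the prior week by about 2.08.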
Inter-Feature Correlations
  • Examines the relationships between different features by taking "cross-sections" of the data.
import numpy as np
import matplotlib.pyplot as plt
#split data into two groups based on one feature 
#(ie all members of split1 have x < threshold, all members of split2 have x >= threshold)
def split_data(x, threshold, data=x_test, predictions=y_test):
    split1 = []
    split2 = []
    split1_size=0
    for i in range(0, len(data)):
        editor = data[i]
        covariates = np.append(editor[:x], editor[(x+1):])
        if (editor[x] < threshold):
            split1.append(np.append(covariates, predictions[i]))
            split1_size+=1
        else:
            split2.append(np.append(covariates, predictions[i]))
    split1 = find_means(split1)
    split2 = find_means(split2)
    print("split 1 size: "  + str(split1_size))
    print ("split 2 size: " + str(len(data)-split1_size))
    return [split1, split2]
def find_means(data):
    features_avgs = []
    for i in range(0, len(data)):
        for j in range(0, len(data[i])):
            if (i == 0):
                features_avgs.append(0)
            features_avgs[j] += data[i][j]
    return [(x*1.0/len(data)) for x in features_avgs]
fig, ax = plt.subplots()

x=4 #id of feature used to make split
threshold=1 #threshold for split
data = split_data(x, threshold)

columns = features + ['Prediction']
#columns = all features
rows = [columns[x] + ' < ' + str(threshold), columns[x] + ' >= ' + str(threshold)]

plt.title('Group by ' + columns[x])

columns = columns[:x]+columns[x+1:]

ax.axis('off')
table = ax.table(cellText=data,rowLabels=rows,colLabels=columns,loc='center')
table.scale(2, 2)
split 1 size: 15706
split 2 size: 367
#returns percentage of data that is people who were thanked
def percentage_with_thank_tmrw(data=x_test, predictions=y_test):
    num_thanked = 0
    for i in range(0, len(data)):
        if (predictions[i] == 1):
            num_thanked += 1
    return (num_thanked * 100.0 / len(data))
#prints percentage of data that is people who were thanked
print(str(round(percentage_with_thank_tmrw(), 3)) + '%')
0.46%
Matching
  • Match each thanked editor with an unthanked editor in such a way that all features of both data sets are, on average, balanced
input_file = src_stem + srcs[0] + output_prefix + output_suffix
trial = 'Next Day 1'
output_file = results_prefix + src_proj + results_suffix
x_set, y_set, set_ids = [], [], []
def make_datasets_for_matching(input_file=input_file):
    with open(input_file, 'r', encoding = 'utf-8') as csvfile:
        rder = csv.DictReader(csvfile)
        for row in rder:
            lst = []
            for f in features:
                lst.append(int(row[f]))
            d = numpy.array(lst)
            
            #makes the datasets which will be used for matching
            x_set.append(d)
            y_set.append(int(row['Prediction']))
            set_ids.append(int(row['ID']))
make_datasets_for_matching()
#each editor will be represented by a node 
#(using nodes is not necessary for the matching algorithm I used, 
#but it should make it easier to switch other algorithms in, for example network flow)
class Node():
    def __init__(self, tpe, ID, score, future_edits=0):
        self.t = tpe
        self.id = ID
        self.score = score
        self.future_edits = future_edits
    
    def __eq__(self, other):
        return (self.id == other.id)
    
    def __hash__(self):
        return hash(self.id)
        
    def __str__(self):
        return (str(self.id))
    
    def pretty_print(self):
        return ("ID: " + str(self.id) + ", score: " + str(self.score) + ", future edits: " + str(self.future_edits))

#define global variables
t_nodes = []
u_nodes = []
matches = {}
#make nodes
def make_nodes(input_file):
    #collect relevant information from set_ids
    global t_nodes, u_nodes
    node_info = {}
    for i in range(0, len(set_ids)):
        ID = set_ids[i]
        score = x_set[i]
        node_type = y_set[i]
        node_info[ID] = {'score' : score, 'type' : node_type}
    #join information from set_ids with information about future edits
    with open(input_file, 'r', encoding = 'utf-8') as csvfile:
        rder = csv.DictReader(csvfile)
        for row in rder:
            if (not int(row['ID']) in node_info):
                continue
            node_info[int(row['ID'])]['future edits'] = [row['Future Edits']]
    #construct node lists, keeping every thanked node and a ~25% sample of unthanked nodes
    for ID in node_info:
        node = Node(node_info[ID]['type'], ID, node_info[ID]['score'], int(node_info[ID]['future edits'][0]))
        if (node.t == 1):
            t_nodes.append(node)
        elif random.random() < 0.25:
            u_nodes.append(node)
make_nodes(input_file)
print(len(t_nodes), len(u_nodes))
270 16010
#match each thanked node to an unthanked node, if possible
def make_matches(t_nodes=t_nodes, u_nodes=u_nodes):
    j = 3 #index of the feature (Short-term Edits) ultimately used to make the closest match
    matched = set()
    for t_node in t_nodes:
        match = u_nodes[0]
        match_score = [20, -1]
        found = False
        #try to match every unthanked node to a thanked node
        for u_node in u_nodes:
            if (u_node in matched): #if the node has already been matched, continue
                continue
            is_match = True
            #if any feature is not reasonably close (within 20% of the unthanked editor's value), do not make the match
            score1, score2 = 0, 0
            for i in range(0, len(u_node.score)):
                if (abs(u_node.score[i]-t_node.score[i]) > 0.2 * u_node.score[i]):
                    is_match = False
                #score1 += u_node.score[i]-t_node.score[i] 
                if (u_node.score[i]>=t_node.score[i]):
                    score2 += 1
            score1 = u_node.score[j] - t_node.score[j]
            if (not is_match):
                continue
            if ((score2 > match_score[1]) or (score2 == match_score[1] and score1 < match_score[0])):
                match_score = [score1, score2]
                match = u_node
                found = True
        #if a match was actually found    
        if (found):
            matches[t_node] = match
            matched.add(match)

Matching is one-to-one and greedy: each thanked editor is matched in turn to an unmatched unthanked editor whose features are all within 20%, preferring candidates whose features are at least as large; the result therefore depends on iteration order and is not guaranteed to be globally optimal.

make_matches()
t_nodes[0].pretty_print()
'ID: 4855, score: [279  27   0  26   0], future edits: 4'
t_nodes[0].score
array([279,  27,   0,  26,   0])
features
['Tenure', 'Edits', 'Thanks', 'Short-term Edits', 'Short-term Thanks']
with open('testing_match_code.csv', 'w', encoding = 'utf-8') as csvfile:
        wrter = csv.writer(csvfile)
        wrter.writerow(['Type'] + features + ['Future Edit Count'])
        for d in t_nodes:
            wrter.writerow([d.t, d.score[0], d.score[1], d.score[2], d.score[3], d.score[4], d.future_edits])
        for d in u_nodes:
            wrter.writerow([d.t, d.score[0], d.score[1], d.score[2], d.score[3], d.score[4], d.future_edits])
len(matches) #number of thanked editors who were matched
49
len(t_nodes) #number of thanked editors total
15
Analysis of pairings
  • compares the future edit counts on both sides of the matching
  • checks how well the nodes were matched (ideally, each feature should average out slightly higher for the unthanked editor nodes)
def compare_future_edits(t_nodes=t_nodes, matches=matches):
    t_edits_avg = 0
    u_edits_avg = 0
    num_pairs = 0
    greater = 0
    #sum future_edits of both groups
    for node in matches:
        match = matches[node]
        t_edits_avg += node.future_edits
        u_edits_avg += match.future_edits
        if (node.future_edits >= match.future_edits):
            greater += 1
        num_pairs += 1
    
    #calculate and print averages
    t_edits_avg *= 1.0/num_pairs
    u_edits_avg *= 1.0/num_pairs
    #print (greater, len(matches))
    #print("thanked people's future edit count average: " + str(t_edits_avg) + ", unthanked counterparts': " + str(u_edits_avg))
    return [[t_edits_avg, greater], [u_edits_avg, len(matches)-greater]]
compare_future_edits()
[[0.07142857142857142, 14], [0.0, 0]]
def sanity_check(matches=matches):
    for node in matches:
        print (node.score, node.future_edits, matches[node].score, matches[node].future_edits)
sanity_check()
[1267   15    1    0    0] 1 [1523   15    1    0    0] 0
[2816    1    0    1    0] 0 [3295    1    0    1    0] 0
[2111    9    0    0    0] 0 [2564    9    0    0    0] 0
[3205   16    0    4    0] 0 [2826   18    0    4    0] 0
[2860    2    0    0    0] 0 [3335    2    0    0    0] 0
[3054    7    0    2    0] 0 [3556    7    0    2    0] 0
[3754    1    0    0    0] 0 [4259    1    0    0    0] 0
[1909    3    0    0    0] 0 [2320    3    0    0    0] 0
[3152   17    1    0    0] 0 [3650   17    1    0    0] 0
[3709    4    0    4    0] 0 [4359    4    0    4    0] 0
[3715    1    0    0    0] 0 [3871    1    0    0    0] 0
[1633    3    0    1    0] 0 [1860    3    0    1    0] 0
[2910    9    0    6    0] 0 [3086   10    0    6    0] 0
[3527   14    2    0    0] 0 [3927   14    2    0    0] 0
#to test how balanced our matches were
def test_match_efficacy():
    t_dict = {}
    u_dict = {}
    #for every node, store (ID, position in set_ids)
    for i in range(0, len(set_ids)):
        ID = set_ids[i]
        node_type = y_set[i]
        if (node_type == 1):
            t_dict[ID] = i
        else:
            u_dict[ID] = i
            
    header = features
    #print(header)
    t_data = [0] * len(header)
    u_data = [0] * len(header)
    
    for t_node in matches:
        #sum data for each feature of a t_node and its matched u_node
        #(t_dict/u_dict map a node ID to its position in x_set)
        idx1 = t_dict[t_node.id]
        data1 = x_set[idx1]
        idx2 = u_dict[matches[t_node].id]
        data2 = x_set[idx2]
        for i in range(0, len(t_data)):
            t_data[i] += data1[i]
            u_data[i] += data2[i]
            
    #calculate averages        
    for i in range(0, len(t_data)):
        t_data[i] /= (1.0 * len(matches))
        u_data[i] /= (1.0 * len(matches))
    #print(str(t_data) + ",  " + str(u_data))
    return [t_data, u_data]
test_match_efficacy() #a good, balanced set should have every number of the first list 
#be slightly lower than its corresponding position in the second
[[2830.1428571428573,
  7.285714285714286,
  0.2857142857142857,
  1.2857142857142858,
  0.0],
 [3173.6428571428573, 7.5, 0.2857142857142857, 1.2857142857142858, 0.0]]
fieldnames = ['Trial'] + features + ['Future Edit Count', 'Greater Edit Count']
def save_data(trial=trial, output_file=output_file, fieldnames=fieldnames):
    data = test_match_efficacy()
    data2 = compare_future_edits()
    data = [data[i] + data2[i] for i in range(0, len(data))]
    data = [[trial]+d for d in data]
    with open(output_file, 'a+', encoding = 'utf-8') as csvfile:
        wrter = csv.writer(csvfile)
        wrter.writerow(fieldnames)
        for d in data:
            wrter.writerow(d)
#save_data()
Conclusion:

From the full results (which can be found in (part 2)-results), we can claim that thanks motivate editors to be more active in the short term. This conclusion should be treated with caution, however, as there may be confounding variables the analysis did not control for.

Note: in the runs below, the sample ranges were not actually taking effect. These runs should be redone so that sampling is applied, sweeping across all of the 20% groups.

sample = [0.8,1]
[[2968.5945945945946,
  750.3513513513514,
  7.405405405405405,
  68.0,
  0.43243243243243246],
 [3062.837837837838,
  751.8918918918919,
  7.486486486486487,
  73.24324324324324,
  0.4594594594594595]]
sample = [0.6, 0.8]
[[2746.5714285714284,
  12.428571428571429,
  0.5714285714285714,
  1.7142857142857142,
  0.0],
 [3177.5714285714284,
  12.142857142857142,
  0.5714285714285714,
  1.7142857142857142,
  0.0]]
sample = [0.2, 0.6]
[[2830.1428571428573,
  7.285714285714286,
  0.2857142857142857,
  1.2857142857142858,
  0.0],
 [3143.1428571428573,
  7.714285714285714,
  0.2857142857142857,
  1.2857142857142858,
  0.0]]