Welcome to an exploration of the Teahouse question archives!

This notebook explores the various questions asked at the Teahouse, and figures out what we can do with this data!

Fetch all the Teahouse Question Archives

We'll first fetch the entire contents of the Teahouse questions archives before attempting to parse them.

# We shall use pywikibot to do our data fetching
from pywikibot import Site, pagegenerators
# Teahouse exists only on enwiki right now
enwiki = Site('en', 'wikipedia')

Teahouse archives are all pages of the form Wikipedia:Teahouse/Questions/Archive_NN, where NN is a number. So let's get them all by looking for pages with the prefix Wikipedia:Teahouse/Questions/Archive_

prefix = 'Wikipedia:Teahouse/Questions/Archive_'
# Make a list of all these pages, requesting 50 page results per API request
# We use a PreloadingGenerator to fetch the contents of the pages as well, since we
# will be working with those
pages = list(
    pagegenerators.PreloadingGenerator(
        pagegenerators.PrefixingPageGenerator(prefix=prefix, site=enwiki, step=50)
    )
)
Retrieving 50 pages from wikipedia:en.
Retrieving 50 pages from wikipedia:en.
Retrieving 50 pages from wikipedia:en.
Retrieving 50 pages from wikipedia:en.
Retrieving 50 pages from wikipedia:en.
Retrieving 50 pages from wikipedia:en.
Retrieving 50 pages from wikipedia:en.
Retrieving 50 pages from wikipedia:en.
Retrieving 47 pages from wikipedia:en.
print("We have found {} pages of archives!".format(len(pages)))
We have found 447 pages of archives!
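
Before parsing, let's do a quick sanity check that we fetched what we expected (a minimal sketch; the exact text printed depends on the live archives):

# Peek at the first archive page: its title and the start of its wikitext
first = pages[0]
print(first.title())
print(first.text[:200])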

Parsing the Archives

The wonderful WikiChatter library makes it super easy to parse the questions and answers out of the archives. We install it in the notebook with the ! magic (which executes everything after it in a shell).

!pip install git+https://github.com/yuvipanda/WikiChatter.git
Collecting git+https://github.com/yuvipanda/WikiChatter.git
  Cloning https://github.com/yuvipanda/WikiChatter.git to /tmp/pip-rjbm2krk-build
Installing collected packages: WikiChatter
  Running setup.py install for WikiChatter
Successfully installed WikiChatter-0.1
from wikichatter import TalkPageParser

sections = []
for page in pages:
    if page.isRedirectPage():
        continue
 
    parsed_page = TalkPageParser.parse(page.text)
    for c in parsed_page.children:
        c.from_page = page.title() 
    sections += parsed_page.children
            
print("There are %d sections!" % len(sections))
There are 13845 sections!
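
As a sanity check on the parse, let's peek at one section's heading and the archive it came from (via the from_page attribute we attached above):

# Peek at one parsed section
sample = sections[0]
print(sample.value.heading)
print(sample.from_page)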
class Question:
    def __init__(self, title, user, text, timestamp, replies, tree):
        self.title = title
        self.user = user
        self.text = text
        self.timestamp = timestamp
        self.replies = replies
        self.tree = tree
    
    @property
    def replies_count(self):
        return sum([len(r) for r in self.replies])
    
    @property
    def flat_replies(self):
        return sum([r.flat() for r in self.replies], [])
        
        
class Reply:
    def __init__(self, user, text, timestamp, parent, replies):
        self.user = user
        self.text = text
        self.timestamp = timestamp
        self.parent = parent
        self.replies = replies
    
    def __len__(self):
        return len(self.replies) + 1
    
    def flat(self):
        return sum([r.flat() for r in self.replies], [self])
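
To see what flat() does, here's a tiny hand-built thread (hypothetical data, purely for illustration):

# A two-level thread: one top-level reply with one nested reply
root = Reply(user='A', text='a top-level reply', timestamp=None, parent=None, replies=[])
child = Reply(user='B', text='a nested reply', timestamp=None, parent=root, replies=[])
root.replies = [child]
print(len(root))                      # 2: the reply itself plus its nested child
print([r.user for r in root.flat()])  # ['A', 'B']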
from datetime import datetime
questions = []

def parse_time(time):
    # Normalize abbreviated 'Jun' so strptime's %B (full month name) matches
    time = time.replace(' Jun ', ' June ')
    return datetime.strptime(time.strip(), "%H:%M, %d %B %Y (%Z)")

    
def make_answer(comment, parent):
    answer = Reply(
        user=comment.value.user,
        timestamp=parse_time(comment.value.timestamp),
        text='\n'.join([b.text for b in comment.value.blocks]),
        parent=parent,
        replies=[]
    )
    answer.replies = [make_answer(c, answer) for c in comment.children]
    return answer

for sec in sections:
    questions.append(Question(
        title=sec.value.heading.strip().strip('=').strip(),
        user=sec.children[0].value.user,
        timestamp=parse_time(sec.children[0].value.timestamp),
        text='\n'.join([b.text for b in sec.children[0].value.blocks]),
        replies=[make_answer(c, None) for c in sec.children[1:] + sec.children[0].children],
        tree=sec
    ))
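
A quick check that the questions parsed cleanly (the exact question shown depends on the archive contents):

print('Parsed %d questions' % len(questions))
q = questions[0]
print(q.title, '--', q.user, '--', q.timestamp)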

Ask some questions!

Now that we have a parsed list of all questions, we can ask questions of it!

%matplotlib inline
import matplotlib.pyplot as plt

Replies per question

Let's figure out the distributions of 'replies' per question.

plt.hist([len(q.flat_replies) for q in questions])
plt.title('Total replies per question')
plt.xlabel('Number of replies')
plt.ylabel('Number of questions')
plt.show()

Most questions get under 15 replies! The breakdown below that is a bit hard to see in this graph because of outliers, so let's treat everything above 15 replies as an outlier and zoom in...

plt.hist([len(q.flat_replies) for q in questions if len(q.flat_replies) <= 15], bins=15)
plt.title('Total replies per question (<= 15 replies only)')
plt.xlabel('Number of replies')
plt.ylabel('Number of questions')
plt.show()
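
To put numbers on the plot, here's a quick summary of the distribution (a small sketch using the standard library):

from statistics import median

reply_counts = [len(q.flat_replies) for q in questions]
print('Median replies per question:', median(reply_counts))
print('Max replies on a single question:', max(reply_counts))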

Number of questions per asker

Do people keep asking questions over and over again?

askers = {}
for q in questions:
    if q.user in askers:
        askers[q.user] += 1
    else:
        askers[q.user] = 1
plt.hist(list(askers.values()))
plt.title('Histogram of number of questions asked per user')
plt.show()
print('Total %d askers' % len(askers))
Total 8452 askers
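
By the way, the counting loop above is a hand-rolled collections.Counter; this equivalent sketch produces the same mapping:

from collections import Counter

# Equivalent to the loop above: maps each asker to their question count
askers = Counter(q.user for q in questions)
# askers.most_common(20) would give the top-askers table below directly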

Outliers again! Let's produce a table of the 'top' askers.

top_askers = sorted(askers, key=askers.get, reverse=True)[:20]
for asker in top_askers:
    print('{0: <24} {1}'.format(asker, askers[asker]))
Miss Bono                133
Rubbish computer         92
DangerousJXD             79
Anne Delong              75
Robert McClenon          59
Matty.007                59
Titodutta                56
Gronk Oz                 53
Jodosma                  45
Capankajsmilyo           44
Acagastya                41
Tlqk56                   41
Bonkers The Clown        40
JHUbal27                 37
ColinFine                36
Bali88                   36
Nahnah4                  34
Tattoodwaitress          32
Bananasoldier            32
Frogger48                30

So let's exclude the top 20 (anyone with 30 or more questions) from the histogram and see what we get!

plt.hist([i for i in askers.values() if i < 30], bins=30)
plt.title('Histogram of number of questions asked per user (n < 30)')
plt.show()