Welcome to an exploration of the Teahouse question archives!

This notebook explores the questions asked at the Teahouse and looks at what we can do with that data!

Fetch all the Teahouse Question Archives

We'll first fetch the entire contents of the Teahouse question archives before attempting to parse them.

# We shall use pywikibot to do our data fetching
from pywikibot import Site, pagegenerators
# Teahouse exists only on enwiki right now
enwiki = Site('en', 'wikipedia')

Teahouse archives are all pages of the form Wikipedia:Teahouse/Questions/Archive_NN, where NN is a number. So let's get them all by looking for pages with the prefix Wikipedia:Teahouse/Questions/Archive_.

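prefix = 'Wikipedia:Teahouse/Questions/Archive_'
# Make a list of all these pages, requesting 50 page results per API request
# (`step` is the batch-size argument in the pywikibot release used here).
# We wrap the search in a PreloadingGenerator to fetch the contents of the
# pages as well, since we will be working with those
pages = list(
    pagegenerators.PreloadingGenerator(
        pagegenerators.PrefixingPageGenerator(prefix=prefix, site=enwiki, step=50)
    )
)
print("We have found {} pages of archives!".format(len(pages)))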
VERBOSE:pywiki:Found 1 wikipedia:en processes running, including this one.
Retrieving 50 pages from wikipedia:en.
INFO:pywiki:Retrieving 50 pages from wikipedia:en.
Retrieving 50 pages from wikipedia:en.
INFO:pywiki:Retrieving 50 pages from wikipedia:en.
Retrieving 50 pages from wikipedia:en.
INFO:pywiki:Retrieving 50 pages from wikipedia:en.
Retrieving 50 pages from wikipedia:en.
INFO:pywiki:Retrieving 50 pages from wikipedia:en.
Retrieving 50 pages from wikipedia:en.
INFO:pywiki:Retrieving 50 pages from wikipedia:en.
Retrieving 50 pages from wikipedia:en.
INFO:pywiki:Retrieving 50 pages from wikipedia:en.
Retrieving 50 pages from wikipedia:en.
INFO:pywiki:Retrieving 50 pages from wikipedia:en.
Retrieving 50 pages from wikipedia:en.
INFO:pywiki:Retrieving 50 pages from wikipedia:en.
Retrieving 45 pages from wikipedia:en.
INFO:pywiki:Retrieving 45 pages from wikipedia:en.
We have found 445 pages of archives!
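
A prefix search returns every page whose title starts with the prefix, so it can be worth checking that each title really ends in a number. The following is a minimal sketch using only the standard library; the helper name `archive_number` and the sample titles are hypothetical and not part of the notebook's data.

```python
import re

PREFIX = 'Wikipedia:Teahouse/Questions/Archive_'

def archive_number(title):
    """Return the integer NN from an archive title, or None if it does not match."""
    # MediaWiki treats spaces and underscores in titles as equivalent
    normalized = title.replace(' ', '_')
    match = re.fullmatch(re.escape(PREFIX) + r'(\d+)', normalized)
    return int(match.group(1)) if match else None

# Hypothetical titles, for illustration only
sample_titles = [
    'Wikipedia:Teahouse/Questions/Archive 10',
    'Wikipedia:Teahouse/Questions/Archive 2',
    'Wikipedia:Teahouse/Questions/Archive_1',
    'Wikipedia:Teahouse/Questions/Archive index',
]

# Keep only numbered archives and sort them numerically rather than alphabetically
numbered = sorted(
    (t for t in sample_titles if archive_number(t) is not None),
    key=archive_number,
)
print(numbered)
```

Sorting on the extracted number keeps Archive 2 ahead of Archive 10, which a plain string sort would not.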

Splitting the archives into individual questions

Now we need to parse all the text into a more malleable form, using the wonderful mwparserfromhell library. It parses the wikitext into a tree of nodes that is much easier to query than raw text.

from mwparserfromhell import parse, nodes
# Parse all the things!
parsed_pages = [parse(page.text) for page in pages]
sections = []
for parsed_page in parsed_pages:
    # Get only level 2 sections, so subsections all get subsumed
    sections += parsed_page.get_sections(levels=[2])
print("There are a total of {} questions in the archives!".format(len(sections)))
There are a total of 12325 questions in the archives!
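
To see why restricting the split to level 2 folds subsections into their parent question, here is a rough standard-library approximation. It is not how mwparserfromhell works internally; the regex, the helper name `split_level2`, and the sample wikitext are all assumptions made for illustration.

```python
import re

# Matches a level-2 heading line such as "== Title ==" but not "=== Sub ==="
LEVEL2_RE = re.compile(r'^==(?!=)\s*(.*?)\s*(?<!=)==\s*$', re.MULTILINE)

def split_level2(wikitext):
    """Return (title, body) pairs, one per level-2 heading."""
    matches = list(LEVEL2_RE.finditer(wikitext))
    sections = []
    for i, match in enumerate(matches):
        # A section runs until the next level-2 heading or the end of the text
        end = matches[i + 1].start() if i + 1 < len(matches) else len(wikitext)
        sections.append((match.group(1), wikitext[match.end():end].strip()))
    return sections

# Hypothetical wikitext, for illustration only
sample = (
    "Lead text before any heading.\n"
    "== First question ==\n"
    "How do I add a reference?\n"
    "=== Follow-up ===\n"
    "Thanks!\n"
    "== Second question ==\n"
    "Where is the sandbox?\n"
)

sample_sections = split_level2(sample)
for title, body in sample_sections:
    print(repr(title), repr(body))
```

The level-3 "Follow-up" heading stays inside the body of the first question, and the lead text before the first heading is dropped.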

Parsing the contents of the questions

Each section represents a question. We take the section heading as the question title, then use a rough heuristic to split the rest of the section into individual replies: a text node containing anything that looks like a signature timestamp marks a boundary between replies.

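import re
# Dirty regex to detect signature timestamps, e.g. "20:11, 7 August 2013 (UTC)"
date_sig_re = re.compile(r'(\d\d:\d\d, \d\d? (?:January|February|March|April|May|June|July|August|September|October|November|December) \d\d\d\d \(UTC\))')
def parse_conversations(section):
    title = None
    conversations = []
    current = []
    for node in section.nodes:
        if title is None:
            # Just starting, so first one gotta be a heading. That's our title
            assert isinstance(node, nodes.Heading)
            title = node.title.strip_code(normalize=True, collapse=True).strip()
            continue

        # Every other node belongs to the reply currently being collected
        current.append(node)
        if isinstance(node, nodes.Text) and date_sig_re.search(node.value):
            # A timestamp closes off a signed reply, so start a new one
            conversations.append(current)
            current = []
    if current:
        # Keep trailing unsigned content as a final reply
        conversations.append(current)
    return (title, conversations)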
# Parse all the things!
questions = [parse_conversations(section) for section in sections]
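
The timestamp heuristic is easier to see on a plain string than on parsed nodes. Below is a minimal sketch that cuts a block of text after every signature timestamp; the helper name `split_replies`, the user names, and the sample text are hypothetical, and real archives contain unsigned or oddly formatted replies that this will not handle.

```python
import re

MONTHS = ('January|February|March|April|May|June|July|'
          'August|September|October|November|December')
# Same shape as a Wikipedia signature timestamp, e.g. "20:11, 7 August 2013 (UTC)"
sig_re = re.compile(r'\d\d:\d\d, \d\d? (?:' + MONTHS + r') \d\d\d\d \(UTC\)')

def split_replies(text):
    """Split text into replies, each ending at a signature timestamp."""
    replies = []
    start = 0
    for match in sig_re.finditer(text):
        replies.append(text[start:match.end()].strip())
        start = match.end()
    # Anything after the last timestamp is kept as an unsigned trailing reply
    tail = text[start:].strip()
    if tail:
        replies.append(tail)
    return replies

# Hypothetical conversation, for illustration only
sample = (
    "How do I cite a book? ExampleUser (talk) 20:11, 7 August 2013 (UTC)\n"
    ":Try the cite book template. HelperUser (talk) 09:05, 8 May 2013 (UTC)\n"
    ":: Thanks!"
)

replies = split_replies(sample)
for reply in replies:
    print(repr(reply))
```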

Ask some questions!

Now that we have a parsed list of all questions, we can ask questions of it!

# Questions whose title contains the substring 'reference'
# (case-sensitive, so it also matches e.g. 'Preferences' and 'Unreferenced')
[q[0] for q in questions if 'reference' in q[0]]
['my IMDb references are being deleted?',
 'Regarding the references of an article...',
 'Wikilinks in references',
 'In-text reference question',
 'Just need help making a reference',
 'adding references and footnotes',
 'Using eBooks for a reference',
 'Unreferenced article',
 'Editing references',
 'Harvnb reference when year is unknown',
 'Verifying references for a departed person',
 'Needing help with references, seeking experienced editor',
 'How do I combine multiple uses of a reference into one?',
 'Give reference',
 'How do you reference a',
 'Do I have to reference external links? Gchac (talk) 20:11, 7 August 2013 (UTC)',
 "How to add a PDF reference that isn't online/How do I track a Photo request",
 'how do I "verify" references in my page posts?',
 'Official press releases as references',
 'Quoting a reference',
 'How to deal with Reflist, how to edit out a reference..........',
 'How to reference something not online?',
 'How to insert references',
 'I have created a page, but think that the punctuation is wrong for the references.',
 'reference material does not include information given in article',
 'How to cite newspaper references',
 'Found a reference for music soundtrack credited to "Alexander Courage" from 1938',
 'Help with reference',
 'How to edit complex references?',
 'Other language references',
 'Is there a definitive way to cite a reference from a book?',
 'Using a company website as a reference for sales information',
 'Does University Dissertations only as a reference make an article notable?',
 'How do I make proper references (they exist) for this page. It is factual but evidently does not have all the references. The person/artist DOES absolutely meet Wikipedia criteria for a page for a person.',
 'will references from wikipedia work',
 'How to make multiple references to one source',
 'How do I give a reference when I am editing an existing article? (talk) 19:11, 23 October 2013 (UTC)',
 'How to indicate that a bio is poorly referenced self-promotion',
 'How much citation and reference is enough?',
 'How do I add a tag that an article is seriously lacking in references?',
 'looking for references for Knowledge-Based systems',
 'named references',
 "Someone's article contains scientific misstatements with unsupportive references. Can I just substitute correct statements or am I required first to refute his erroneous statements?",
 'Posting date in references list',
 'Article says it needs references - but it seems to have them already',
 'Trying to fulfill article reference issues',
 'not understanding how to code references / reference template',
 'Can "This article does not cite any references or sources" tag be removed after external link added?',
 'Can a paragraph about a living person titled "biography" have no references or citations and still be valid?',
 'Help with references after editing',
 'Regarding reference provided for the article',
 'How can I add reference notes to my article?',
 'Help for a newbie with references on a BLP',
 'Formatting questions; reference list and erroneous "page does not exist" markings',
 'Problem adding references in new article',
 'Updated one reference on the Winona County Minnesota information page, then lost everything',
 'revelance and references',
 'Are Flickr links ok for image reference?',
 'Same reference, different pages',
 'Adding references',
 'Using same reference many times in the article',
 'Book reference usage',
 "How do I align references so that it's not just one long paragraph?",
 'How to give references',
 "Can I write a new page if person's name is already referenced?",
 'Content removal without references',
 'Does *everything* need an online reference?!',
 'Linking to references from body of article',
 'Question regarding references',
 'references in another language',
 'Format for listing references and sources',
 'I\'ve added a "reflist" and references section, but the red message still shows about there not being one.',
 'Duplicate references',
 'Best ways to search for sources and references',
 'How to fix a broken reference link?',
 'How to manage multiple references to the same source, but on different pages?',
 'What exactly is wrong with this as a reference entry? Please help, it keeps getting rejected.',
 'Using non-English references?',
 'The subject of an article which kept on getting deleted is tempted to do it himself because he knows the materials and references that can support the article.',
 'Unreferenced? Or not',
 'how do I insert cross-reference to another wikipedia article?',
 'What type of references are necessary when discussing references of topics in media?',
 'How to add references in page?',
 "Is this article unreferenced? Or referenced in a way I'm unaware of?",
 'Is there a policy for "stashing" references for a future editor of a red-linked article?',
 'Defining a reference',
 'reg- references & links',
 'Newspaper name is red in reference',
 'Contribs in Preferences, Contribs at edit counter',
 'People for planet: let us make it a strong reference.',
 'How to add references / citations that are neither online nor publications',
 'Bal des débutantes article language reference',
 'Circular references',
 'Links in references?',
 'After edditing my sandboxed ext. links and references are gone',
 "I need a reflist template to complete my references but I don't know what that is!",
 'I have info that I want to put in an article, but there are no references.',
 'Viewing page 2 and also references',
 'How do I remove dead references?',
 'No reference for quote',
 'editing references',
 'New member trying to submit/references/citing',
 'Can you direct me or provide, and how do I reference a TED video?',
 'Appropriate way to reference article in an edited book?',
 'a reference was rejected,  see below',
 'non-profit with many pages referenced that need correction or clarification - can we do this ourselves?',
 'Citation preferences: {{sfn|name|year|etc.}} versus <ref>{{harvnb|name|year|etc.}}</ref>',
 'List-defined references',
 'Having trouble with references',
 'Multiple references to various pages of the same source',
 'Adding references',
 'I don\'t know what they want me to do with "The named reference $1 was invoked but never defined (see the help page)."',
 'Accepted with few references?',
 'Talk page: Keep references on their section',
 'I just added a reference - not sure I got the style correct.',
 'article references',
 'why my article is deleted or changed in spite of having references?',
 'reference format',
 'Regarding including references,external links and pictures to the articles we write',
 'Acceptable references for film and music articles',
 'Citing the same reference twice.',
 'Foreign language references',
 "How do I reference details of the South Australian government's Women's Information Service?",
 'More Preferences',
 'I am unable to add reference and sources. Please help in getting the page activate. I am even unable to upload picture. Nayab Sami 14:21, 11 September 2014 (UTC)',
 'My article was declined due to references? Would you please assist?',
 'Citing a movie as a reference',
 'Can you wikilink something in the references?',
 'How to I reference to external source?',
 'trouble with my references',
 'How many references does an article need to have',
 'PubMed reference template generator makes reference at the bottom repeat – DOI generator does not – recommendations? PubMed ID recommended over DOI?',
 'How do I eliminate duplicate references?',
 'permission for using as a reference a youtube video',
 'How can I ensure that the references are eligible and verified ?',
 'How to give references',
 'Unverifiable references',
 'Problems with adding references',
 'opinion is not the same as personal experience! do i need a note from my GP as reference to substantiate my treatment for poison oak?',
 'How do I insert references and external links and how do I know if my article submitted?',
 'Formatting for references vs. footnotes:',
 'how to include references that were written in Danish',
 'Watchlist and Preferences problem',
 'How do i properly reference images taken from published scientific papers?',
 'How do I condense the same reference used multiple times?',
 'Technical: Could not figure out how to edit references',
 "I can't get my references to work how do I do that?",
 "What's the usual date format for references?",
 'I am writing an article and I am not sure if the references are good enough',
 'Adding references',
 'All references to AIPAC are deleted from Washington Institute for Near East Policy',
 'Copyright notice and references and knowledge problems',
 'Why is IMDb not considered a reliable cited reference?',
 'removing a referenced line.',
 'Using other wikipedia links as a reference',
 'How do reference sections differ?',
 'How to reference a letter',
 'How many references do I need for verification?',
 'How to connect references to numbered superscripts in the text',
 'How to draw attention to an erroneous book reference',
 'Can someone please check the notability and references in my article draft?',
 'Article references and notability',
 'reference problem',
 'Formatting issues and unreliable references',
 'How can I put references more professionally?',
 'Can I take Youtube videos as references?',
 'Using references and .wav files',
 'I asked a question about references and .wav files',
 'Telluride Blues & Brews references',
 'articles without references',
 'Inserting references',
 'Adding references to the Mars 3 article',
 'How can I improve the notability of my references?',
 'Adding narrative to a numbered reference',
 'disambiguation pages and how to reference correct one!',
 'Where do you put reference material if there is no created page?',
 'Citing references',
 'Is there a way to cite video games as a reference?',
 'Can my reference pages be added....',
 'citing references and stuff...',
 'Thurlaston Brook Article - i cannot reference this as i wrote it entirely from my own survey',
 'Newspaper as a reference',
 'Question about references',
 'How to create a new page with all reference',
 'How can I reference it properly',
 'I need to edit my references',
 'Translating references',
 'Proper references',
 'Using French citation/reference templates in translated articles',
 'Concerned about edits made with references and article reverted back to before edited copy',
 'citations and references',
 'Books as references',
 'could someone help me in finding and citing appropriate references',
 'is notability issues only for the references or for the content as well?',
 'Using social media as a reference',
 'How do I edit a page to say that a reference is needed?',
 'What type of references are allowed for an article submission?',
 'How to edit reference links?',
 'How to cite a reference twice',
 'Is it acceptable to use a URL from archive.org for a reference?',
 'How do I repeat the same reference but with different page numbers',
 'a bot to check the references list',
 'How to differentiate sectioning of notes and references for an wiki article',
 'Help with references',
 'Ordering reference list',
 'My references are not auto-populating in the reference list',
 "I have two books as my reference but I don't know how to cite them as a reference. Please help. Thank you :-)",
 'E-commerce portals as a reference ??',
 'I would like the reference I added to have a number, but it is in the text of the article itself.',
 'Correcting references',
 'Which references need to be checked?',
 'how to cite when there are not references',
 'How to make reference links "clickable"',
 'Notability & references (also additional) problem',
 'How to reference a YouTube video in an article?',
 'How to add a reference?',
 'How many references for a Wikipedia article?',
 'Notability/ Circular references',
 "what's wrong with my references?",
 'Getting links into references',
 'third party references',
 'reference list mucked up',
 'I think my submission on The Interpersonal Gap was rejected becasue of a lack of references but I included three and they are not showing in the version you rejected',
 'How do I show the article needs references',
 'Shortening redundant information in references?',
 'Formatting footnotes, references, & external links question',
 'Improving references',
 'User page need references????',
 'Upgrade with no references',
 'My first wiki page - adding references',
 'can I share the information and the references before actually updating it on the account.',
 'preferences: highlighted in bold, where to find?',
 'Help with references',
 "can't find a reference",
 'Article denied due to lack of references',
 'Outdated references?',
 "How do I correct my submission's references to show the subject's notability?",
 "What if I can't find unbiased references?",
 'How do I do pinpoint citations without repeating the complete info about the source in each reference?',
 'Draft:Greger_Huttu was declined due to notability and verifiable references.',
 'Article references in foreign languages',
 'can you take a look at my reference page? looks weird. thank you!',
 "Why aren't my references being accepted?",
 'literature reference is not converted',
 'Should reference sources in citations use Ibid. when they repeat sequentially?',
 'Dealing with inaccessible and unconfirmed references',
 'Citing references when they come from a private chat (Privacy question)',
 'Article denied because of references, wondering if someone could take a look?',
 'What qualifies as a good reference when trying to back up a statement?',
 'formatting links and references on Francis Cauffman',
 'Is the reference sufficient?',
 'I need assistance fixing the broken reference that has been detected, in order to get the "Gabriele Corcos" page published please',
 'Problem fixing references on Bot Colony article',
 'Can the reference be repeated in use?',
 'Where to place reference, for a table/list',
 'Questionable reference',
 'difficulty with adding references',
 'Is it OK for editors to delete all references to an unpopular decision by the 11th Circuit Court of Appeals?',
 'ASIN references',
 'How to make changes to an edit and how to cite a reference?',
 'How to put inline references in my article?',
 'Unreferenced quotations + creating an article',
 "Inquisitive!!! Can someone please check my article's page and see if the references are reliable and enough?",
 'Do YouTube links work as references? I',
 'Do bulletin sources work as references without using those <ref></ref> things?',
 'Tell me how to make references',
 'What to do when the references are pay for view scholarly journals?',
 'What about a reference that is magazine article where the magazine charges for the full content?',
 'reference inside of an article?',
 "Can't get a reference to work properly...",
 'Problem with references',
 'Creating an article with few formal references',
 'reference section / online article',
 'how do I add a reference??',
 'Missing reference tags',
 'The use of company/personal websites as references.',
 'Quoting references',
 'Help neded to sort out citations/references',
 'cannot add ref list because it says there are blacklisted references - how do I know which are',
 'How do I provide more reliable references?',
 'how should I list the references I need to present in a way to verify my article .. ie CD critics and press  .?',
 'How can I prove authenticity of my article without online reference.',
 'Want to add a link to the references section',
 'link vs. reference',
 'Can I use school or local publications as references for an article about a high school?',
 'how to add reference or a source on wikipedia?',
 'What references are reliable?',
 'About references',
 'How to update the references?',
 'What do I do to fix my references for ZoSia Karbowiak',
 'Publishing a new term where references will be added in the coming months',
 'help with references for beginner',
 'Desire to post a reference about a supplier of innovative point of sale and payment technology',
 'Help Me! The reference on the LPS: Popular disappears! How can I prevent the reference from disappearing?',
 'How does one insert a superscript number, say "1" at the end of a sentence and then provide the reference data, in this case a PNAS paper form 1979?',
 'How to do references',
 'Can I delete footnote 1 without losing the reference for a later footnote?',
 'Bizarre behaviour in the link within a reference',
 'Mobile versions as reference',
 'Third party references and opinion about a subject, however failing to understand why the article is not accepted',
 'seperating references into references into two categories',
 'problem with the reference section',
 'Please review the reference image as follows?',
 'A reference as something other than a published article',
 'I have submitted National Newspaper articles for a certain article for references, but its not approved. How do i go about it?',
 'Formatting references',
 'How to add a reference.',
 'how to add a reference',
 'How to use a map as a reference',
 'Creating references.',
 'How to fix "cite" references',
 'Disambiguation and confusing references',
 'A reference I wanted to use is on the Spam List.',
 "This submission's references do not adequately evidence the subject's notability.",
 'another problem/issue with reference section',
 'Is the references and citations OK',
 'How to indicate references on a list',
 'Unsure if I should reference',
 'How to use the same reference more than once?',
 'Trying to address concerns about neutral tone and references for article',
 'Writing up references in advance',
 'how to add a reference in the reflist section',
 'How to reference pictures',
 'Article references',
 'reference creator',
 'Does a bibliography page need references?',
 'How do I add more links, references and new sources to War on Terror page?',
 'Adding references for first person accounts or personal experiences',
 'reference problem',
 'how do i enter references?',
 'updating references',
 'Editing references',
 'How to  provide a reference link?',
 'using references',
 'bullet list in a reference?',
 'What happens in the case of new movements which do not have written references?',
 'Use of IMDB as a reference',
 'Blog as reference?',
 'Problem with ref name: uses only first reference']
# Questions whose title contains the substring 'textbook' (case-sensitive)
[q[0] for q in questions if 'textbook' in q[0]]
['Spam link?? I entered a link to a textbook?']
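
The substring checks above are case-sensitive and also match inside longer words such as "Preferences". A possible refinement, sketched here with a hypothetical helper `titles_matching`, is a case-insensitive whole-word match that also accepts a plural "s". The sample titles are adapted from the output above, with one capitalised for illustration.

```python
import re

def titles_matching(titles, word):
    """Return titles containing `word` (or its plural) as a whole word, ignoring case."""
    pattern = re.compile(r'\b' + re.escape(word) + r's?\b', re.IGNORECASE)
    return [title for title in titles if pattern.search(title)]

# Sample titles, for illustration only
sample_titles = [
    'Wikilinks in references',
    'More Preferences',
    'Unreferenced article',
    'Reference format',
]

matched = titles_matching(sample_titles, 'reference')
print(matched)
```

Whether "Unreferenced article" should count as a match depends on the question being asked, so the word-boundary rule is a design choice rather than a correction.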