Lab 2 - Hyperlink Networks

Professor Brian Keegan
Department of Information Science, CU Boulder
This notebook is copyrighted and made available under the Apache License v2.0.

This is the second of five lab notebooks that will explore how to do some introductory data extraction and analysis from Wikipedia data. This lab will extend the methods from the prior lab, which analyzed a single article's revision history, and use network science methods to analyze the network of hyperlinks around a single article. You do not need to be fluent in these methods to complete the lab, but there are many options for extending the analyses we do here using more advanced queries and scripting methods.

Acknowledgements
I'd like to thank the Wikimedia Foundation for the PAWS system and related Wikitech infrastructure that this workbook runs within. Yuvi Panda, Aaron Halfaker, Jonathan Morgan, and Dario Taraborelli have all provided crucial support and feedback.

Confirm that basic Python commands work

a = 3
b = 4
a**b
81

Import modules and setup environment

Load up all the libraries we'll need to connect to the database, retrieve information for analysis, and visualize results.

# Makes the plots appear within the notebook
%matplotlib inline

# Two fundamental packages for doing data manipulation
import numpy as np                   # http://www.numpy.org/
import pandas as pd                  # http://pandas.pydata.org/

# Two related packages for plotting data
import matplotlib.pyplot as plt      # http://matplotlib.org/
import seaborn as sb                 # https://stanford.edu/~mwaskom/software/seaborn/

# Package for requesting data via the web and parsing resulting JSON
import requests
import json
from bs4 import BeautifulSoup

# Two packages for accessing the MySQL server
import pymysql                       # http://pymysql.readthedocs.io/en/latest/
import os                            # https://docs.python.org/3.4/library/os.html

# Packages for analyzing complex networks
import networkx as nx                # https://networkx.github.io/
import igraph as ig

# Set up the code environment to use plots with a white background and let DataFrames show more columns and rows
sb.set_style('whitegrid')
pd.options.display.max_columns = 100
pd.options.display.max_rows = 110

Define the name of the article you want to use for the rest of the lab.

page_title = "2013 Egyptian coup d'état"
#practice calling the langlinks API (this example queries the Main Page)
_S="https://en.wikipedia.org/w/api.php?action=query&format=json&prop=langlinks&formatversion=2&titles=Main_Page&llprop=autonym|langname&lllimit=500&"
    
req = requests.get(_S)

json_string = json.loads(req.text)
    
#inspect the JSON response: it lists every other language edition the page appears in
json_string
{'batchcomplete': True,
 'query': {'normalized': [{'from': 'Main_Page',
    'fromencoded': False,
    'to': 'Main Page'}],
  'pages': [{'langlinks': [{'autonym': 'العربية',
      'lang': 'ar',
      'langname': 'Arabic',
      'title': ''},
     {'autonym': 'български',
      'lang': 'bg',
      'langname': 'Bulgarian',
      'title': ''},
     {'autonym': 'bosanski', 'lang': 'bs', 'langname': 'Bosnian', 'title': ''},
     {'autonym': 'català', 'lang': 'ca', 'langname': 'Catalan', 'title': ''},
     {'autonym': 'čeština', 'lang': 'cs', 'langname': 'Czech', 'title': ''},
     {'autonym': 'dansk', 'lang': 'da', 'langname': 'Danish', 'title': ''},
     {'autonym': 'Deutsch', 'lang': 'de', 'langname': 'German', 'title': ''},
     {'autonym': 'Ελληνικά', 'lang': 'el', 'langname': 'Greek', 'title': ''},
     {'autonym': 'Esperanto',
      'lang': 'eo',
      'langname': 'Esperanto',
      'title': ''},
     {'autonym': 'español', 'lang': 'es', 'langname': 'Spanish', 'title': ''},
     {'autonym': 'eesti', 'lang': 'et', 'langname': 'Estonian', 'title': ''},
     {'autonym': 'euskara', 'lang': 'eu', 'langname': 'Basque', 'title': ''},
     {'autonym': 'فارسی', 'lang': 'fa', 'langname': 'Persian', 'title': ''},
     {'autonym': 'suomi', 'lang': 'fi', 'langname': 'Finnish', 'title': ''},
     {'autonym': 'français', 'lang': 'fr', 'langname': 'French', 'title': ''},
     {'autonym': 'galego', 'lang': 'gl', 'langname': 'Galician', 'title': ''},
     {'autonym': 'עברית', 'lang': 'he', 'langname': 'Hebrew', 'title': ''},
     {'autonym': 'hrvatski',
      'lang': 'hr',
      'langname': 'Croatian',
      'title': ''},
     {'autonym': 'magyar', 'lang': 'hu', 'langname': 'Hungarian', 'title': ''},
     {'autonym': 'Bahasa Indonesia',
      'lang': 'id',
      'langname': 'Indonesian',
      'title': ''},
     {'autonym': 'italiano', 'lang': 'it', 'langname': 'Italian', 'title': ''},
     {'autonym': '日本語', 'lang': 'ja', 'langname': 'Japanese', 'title': ''},
     {'autonym': 'ქართული', 'lang': 'ka', 'langname': 'Georgian', 'title': ''},
     {'autonym': '한국어', 'lang': 'ko', 'langname': 'Korean', 'title': ''},
     {'autonym': 'lietuvių',
      'lang': 'lt',
      'langname': 'Lithuanian',
      'title': ''},
     {'autonym': 'latviešu', 'lang': 'lv', 'langname': 'Latvian', 'title': ''},
     {'autonym': 'Bahasa Melayu',
      'lang': 'ms',
      'langname': 'Malay',
      'title': ''},
     {'autonym': 'Nederlands', 'lang': 'nl', 'langname': 'Dutch', 'title': ''},
     {'autonym': 'norsk nynorsk',
      'lang': 'nn',
      'langname': 'Norwegian Nynorsk',
      'title': ''},
     {'autonym': 'norsk bokmål',
      'lang': 'no',
      'langname': 'Norwegian',
      'title': ''},
     {'autonym': 'polski', 'lang': 'pl', 'langname': 'Polish', 'title': ''},
     {'autonym': 'português',
      'lang': 'pt',
      'langname': 'Portuguese',
      'title': ''},
     {'autonym': 'română', 'lang': 'ro', 'langname': 'Romanian', 'title': ''},
     {'autonym': 'русский', 'lang': 'ru', 'langname': 'Russian', 'title': ''},
     {'autonym': 'srpskohrvatski / српскохрватски',
      'lang': 'sh',
      'langname': 'Serbo-Croatian',
      'title': ''},
     {'autonym': 'Simple English',
      'lang': 'simple',
      'langname': 'Simple English',
      'title': ''},
     {'autonym': 'slovenčina',
      'lang': 'sk',
      'langname': 'Slovak',
      'title': ''},
     {'autonym': 'slovenščina',
      'lang': 'sl',
      'langname': 'Slovenian',
      'title': ''},
     {'autonym': 'српски / srpski',
      'lang': 'sr',
      'langname': 'Serbian',
      'title': ''},
     {'autonym': 'svenska', 'lang': 'sv', 'langname': 'Swedish', 'title': ''},
     {'autonym': 'ไทย', 'lang': 'th', 'langname': 'Thai', 'title': ''},
     {'autonym': 'Türkçe', 'lang': 'tr', 'langname': 'Turkish', 'title': ''},
     {'autonym': 'українська',
      'lang': 'uk',
      'langname': 'Ukrainian',
      'title': ''},
     {'autonym': 'Tiếng Việt',
      'lang': 'vi',
      'langname': 'Vietnamese',
      'title': ''},
     {'autonym': '中文', 'lang': 'zh', 'langname': 'Chinese', 'title': ''}],
    'ns': 0,
    'pageid': 15580374,
    'title': 'Main Page'}]}}
#make a variable for the page ID so we don't have to hard-code a specific ID
_pageID=list(json_string['query']['pages'].keys())[0]
#once we have the page ID, pull out the list of language links
_langlink_list=json_string['query']['pages'][_pageID]['langlinks']
len(_langlink_list)
45
#make a dictionary mapping each language code to the page's title in that language
_langlink_dict=dict()

for d in _langlink_list:
    _lang=d['lang']
    _title=d['*']
    _langlink_dict[_lang]=_title
#make a dictionary mapping each language abbreviation to its full language name
_langAbrev_dict=dict()

for d in _langlink_list:
    _lang=d['lang']
    _langname=d['langname']
    _langAbrev_dict[_lang]=_langname
_langAbrev_dict
{'af': 'Afrikaans',
 'als': 'Alemannisch',
 'am': 'Amharic',
 'an': 'Aragonese',
 'ang': 'Old English',
 'ar': 'Arabic',
 'arz': 'Egyptian Arabic',
 'as': 'Assamese',
 'ast': 'Asturian',
 'ay': 'Aymara',
 'az': 'Azerbaijani',
 'azb': 'تۆرکجه',
 'ba': 'Bashkir',
 'bat-smg': 'Samogitian',
 'bcl': 'Bikol Central',
 'be': 'Belarusian',
 'be-x-old': 'беларуская (тарашкевіца)\u200e',
 'bg': 'Bulgarian',
 'bm': 'Bambara',
 'bn': 'Bangla',
 'bpy': 'Bishnupriya',
 'br': 'Breton',
 'bs': 'Bosnian',
 'bxr': 'буряад',
 'ca': 'Catalan',
 'cbk-zam': 'Chavacano de Zamboanga',
 'cdo': 'Min Dong Chinese',
 'ce': 'Chechen',
 'ceb': 'Cebuano',
 'ckb': 'Central Kurdish',
 'co': 'Corsican',
 'cs': 'Czech',
 'cv': 'Chuvash',
 'cy': 'Welsh',
 'da': 'Danish',
 'de': 'German',
 'diq': 'Zazaki',
 'el': 'Greek',
 'eo': 'Esperanto',
 'es': 'Spanish',
 'et': 'Estonian',
 'eu': 'Basque',
 'ext': 'Extremaduran',
 'fa': 'Persian',
 'fi': 'Finnish',
 'fiu-vro': 'Võro',
 'fo': 'Faroese',
 'fr': 'French',
 'frp': 'Arpitan',
 'frr': 'Northern Frisian',
 'fy': 'Western Frisian',
 'ga': 'Irish',
 'gan': 'Gan Chinese',
 'gd': 'Scottish Gaelic',
 'gl': 'Galician',
 'gn': 'Guarani',
 'gom': 'Goan Konkani',
 'gu': 'Gujarati',
 'gv': 'Manx',
 'hak': 'Hakka Chinese',
 'haw': 'Hawaiian',
 'he': 'Hebrew',
 'hi': 'Hindi',
 'hif': 'Fiji Hindi',
 'hr': 'Croatian',
 'ht': 'Haitian Creole',
 'hu': 'Hungarian',
 'hy': 'Armenian',
 'ia': 'Interlingua',
 'id': 'Indonesian',
 'ie': 'Interlingue',
 'ig': 'Igbo',
 'ilo': 'Iloko',
 'io': 'Ido',
 'is': 'Icelandic',
 'it': 'Italian',
 'ja': 'Japanese',
 'jam': 'Jamaican Creole English',
 'jbo': 'Lojban',
 'jv': 'Javanese',
 'ka': 'Georgian',
 'kaa': 'Kara-Kalpak',
 'kab': 'Kabyle',
 'kk': 'Kazakh',
 'km': 'Khmer',
 'kn': 'Kannada',
 'ko': 'Korean',
 'ksh': 'Colognian',
 'ku': 'Kurdish',
 'kw': 'Cornish',
 'ky': 'Kyrgyz',
 'la': 'Latin',
 'lad': 'Ladino',
 'lb': 'Luxembourgish',
 'lez': 'Lezghian',
 'lg': 'Ganda',
 'li': 'Limburgish',
 'lij': 'Ligurian',
 'lmo': 'Lombard',
 'lrc': 'Northern Luri',
 'lt': 'Lithuanian',
 'lv': 'Latvian',
 'mai': 'Maithili',
 'map-bms': 'Basa Banyumasan',
 'mg': 'Malagasy',
 'mk': 'Macedonian',
 'ml': 'Malayalam',
 'mn': 'Mongolian',
 'mr': 'Marathi',
 'ms': 'Malay',
 'mwl': 'Mirandese',
 'my': 'Burmese',
 'mzn': 'Mazanderani',
 'na': 'Nauru',
 'nah': 'Nāhuatl',
 'nds': 'Low German',
 'nds-nl': 'Low Saxon',
 'ne': 'Nepali',
 'new': 'Newari',
 'nl': 'Dutch',
 'nn': 'Norwegian Nynorsk',
 'no': 'Norwegian',
 'nov': 'Novial',
 'nv': 'Navajo',
 'oc': 'Occitan',
 'olo': 'Livvi-Karelian',
 'om': 'Oromo',
 'or': 'Odia',
 'os': 'Ossetic',
 'pa': 'Punjabi',
 'pam': 'Pampanga',
 'pcd': 'Picard',
 'pl': 'Polish',
 'pms': 'Piedmontese',
 'pnb': 'Western Punjabi',
 'ps': 'Pashto',
 'pt': 'Portuguese',
 'qu': 'Quechua',
 'ro': 'Romanian',
 'roa-rup': 'Aromanian',
 'ru': 'Russian',
 'rue': 'Rusyn',
 'sa': 'Sanskrit',
 'sah': 'Sakha',
 'sc': 'Sardinian',
 'scn': 'Sicilian',
 'sco': 'Scots',
 'sd': 'Sindhi',
 'se': 'Northern Sami',
 'sh': 'Serbo-Croatian',
 'si': 'Sinhala',
 'simple': 'Simple English',
 'sk': 'Slovak',
 'sl': 'Slovenian',
 'so': 'Somali',
 'sq': 'Albanian',
 'sr': 'Serbian',
 'su': 'Sundanese',
 'sv': 'Swedish',
 'sw': 'Swahili',
 'szl': 'Silesian',
 'ta': 'Tamil',
 'te': 'Telugu',
 'th': 'Thai',
 'tk': 'Turkmen',
 'tl': 'Tagalog',
 'tpi': 'Tok Pisin',
 'tr': 'Turkish',
 'tt': 'Tatar',
 'tyv': 'Tuvinian',
 'ug': 'Uyghur',
 'uk': 'Ukrainian',
 'ur': 'Urdu',
 'uz': 'Uzbek',
 'vec': 'Venetian',
 'vep': 'Veps',
 'vi': 'Vietnamese',
 'vo': 'Volapük',
 'wa': 'Walloon',
 'war': 'Waray',
 'wo': 'Wolof',
 'wuu': 'Wu Chinese',
 'xmf': 'Mingrelian',
 'yi': 'Yiddish',
 'yo': 'Yoruba',
 'za': 'Zhuang',
 'zea': 'Zeelandic',
 'zh': 'Chinese',
 'zh-min-nan': 'Chinese (Min Nan)',
 'zh-yue': 'Cantonese'}
#combine all the steps above into one function: for a given page, list its title in every other language it's published in, written in that language
def link_getter(page_title):
    
    _S="https://en.wikipedia.org/w/api.php?action=query&format=json&prop=langlinks&titles={0}&llprop=autonym|langname&lllimit=500".format(page_title)
    
    req = requests.get(_S)

    json_string = json.loads(req.text)
    
    _pageID=list(json_string['query']['pages'].keys())[0]

    _langlink_list=json_string['query']['pages'][_pageID]['langlinks']
    
    _langlink_dict=dict()

    for d in _langlink_list:
        _lang=d['lang']
        _title=d['*']
        _langlink_dict[_lang]=_title
        
    
    return _langlink_dict
#returns a dictionary of the page's titles in each language it is published in
link_getter(page_title)
{'af': 'Egiptiese staatsgreep van 2013',
 'ar': 'انقلاب 2013 في مصر',
 'arz': 'خريطة المستقبل (مصر)',
 'az': 'Misirdə hərbi çeviriliş (2013)',
 'bg': 'Държавен преврат в Египет (2013 г.)',
 'ca': "Cop d'Estat a Egipte l'any 2013",
 'ckb': 'کودەتای ٢٠١٣ی میسر',
 'de': 'Militärputsch in Ägypten 2013',
 'el': 'Αιγυπτιακό πραξικόπημα 2013',
 'es': 'Golpe de Estado en Egipto de 2013',
 'fa': 'کودتای ۲۰۱۳ مصر',
 'fi': 'Egyptin vallankaappaus 2013',
 'fr': "Coup d'État du 3 juillet 2013 en Égypte",
 'he': 'ההפיכה במצרים (2013)',
 'hi': 'मिस्र में सैन्य तख्तापलट २०१३',
 'id': 'Kudeta Mesir 2013',
 'it': 'Golpe egiziano del 2013',
 'ja': '2013年エジプトクーデター',
 'ko': '2013년 이집트 쿠데타',
 'nl': 'Protesten en staatsgreep in Egypte in 2013',
 'pl': 'Zamach stanu w Egipcie (2013)',
 'pt': 'Golpe de Estado no Egito em 2013',
 'ro': 'Lovitura de stat din Egipt din 2013',
 'ru': 'Военный переворот в Египте (2013)',
 'sr': 'Државни удар у Египту (2013)',
 'tg': 'Кудатои 2013 Миср',
 'tr': '2013 Mısır askerî darbesi',
 'uk': 'Військовий переворот в Єгипті 2013',
 'ur': '2013ء مصری فوجی تاخت',
 'vi': 'Đảo chính Ai Cập 2013',
 'zh': '2013年埃及政变'}

Retrieve the content of the page via API

Write a function that takes an article title and returns the list of links in the body of the article. Note that the reason we don't use the "pagelinks" table in MySQL or the "links" parameter in the API is that both include links within templates. Articles sharing templates link to each other, forming over-dense clusters in the resulting networks. We only want the links appearing in the body of the text.

We pass a request to the API, which returns a JSON-formatted string containing the HTML of the page. We use BeautifulSoup to parse through the HTML tree and extract the non-template links and return them as a list.

def get_page_outlinks(page_title,lang='en',redirects=1):
    # Replace spaces with underscores
    page_title = page_title.replace(' ','_')
    
    bad_titles = ['Special:','Wikipedia:','Help:','Template:','Category:','International Standard','Portal:','s:','File:']
    
    # Get the response from the API for a query
    # After passing a page title, the API returns the HTML markup of the current article version within a JSON payload
    req = requests.get('https://{2}.wikipedia.org/w/api.php?action=parse&format=json&page={0}&redirects={1}&prop=text&disableeditsection=1&disabletoc=1'.format(page_title,redirects,lang))
    
    # Read the response into JSON to parse and extract the HTML
    json_string = json.loads(req.text)
    
    # Initialize an empty list to store the links
    outlinks_list = [] 
    
    if 'parse' in json_string.keys():
        page_html = json_string['parse']['text']['*']

        # Parse the HTML into Beautiful Soup
        soup = BeautifulSoup(page_html,'lxml')

        # Delete tags associated with templates
        for tag in soup.find_all('tr'):
            tag.replace_with('')

        # For each paragraph tag, extract the titles within the links
        for para in soup.find_all('p'):
            for link in para.find_all('a'):
                if link.has_attr('title'):
                    title = link['title']
                    # Ignore links that aren't interesting
                    if all(bad not in title for bad in bad_titles):
                        outlinks_list.append(title)

        # For each unordered list, extract the titles within the child links
        for unordered_list in soup.find_all('ul'):
            for item in unordered_list.find_all('li'):
                for link in item.find_all('a'):
                    if link.has_attr('title'):
                        title = link['title']
                        # Ignore links that aren't interesting
                        if all(bad not in title for bad in bad_titles):
                            outlinks_list.append(title)

    return outlinks_list
#test: grab the outlinks for the German version of the article by naming the page and its language
german_outlinks=get_page_outlinks('Militärputsch in Ägypten 2013',lang='de')
#link_getter('|'.join(german_outlinks[:5]))
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-14-07aecd4f7aac> in <module>()
----> 1 link_getter('|'.join(german_outlinks[:5]))

<ipython-input-10-6634150ec041> in link_getter(page_title)
     10     _pageID=list(json_string['query']['pages'].keys())[0]
     11 
---> 12     _langlink_list=json_string['query']['pages'][_pageID]['langlinks']
     13 
     14     _langlink_dict=dict()

KeyError: 'langlinks'
#query the language links for the first ten German outlinks in one batched request and record each page's English title (empty if none exists)
_page_title='|'.join(german_outlinks[0:10])

_S="https://de.wikipedia.org/w/api.php?action=query&format=json&prop=langlinks&titles={0}&redirects=1&llprop=autonym|langname&lllimit=500".format(_page_title)

req = requests.get(_S)

json_string = json.loads(req.text)

_pageID_list=list(json_string['query']['pages'].keys())

#_langlink_list=json_string['query']['pages'][_pageID]['langlinks']

#_langlink_dict=dict()

#for d in _langlink_list:
#    _lang=d['lang']
#    _title=d['*']
#    _langlink_dict[_lang]=_title

translation_dict=dict()


for _pageID in _pageID_list:
    try:

        _langlink_list=json_string['query']['pages'][_pageID]['langlinks']
        #print("title is", json_string['query']['pages'][_pageID]['title'])
        _title=json_string['query']['pages'][_pageID]['title']
        
    
        for d in _langlink_list:      
            if d['lang']=='en' and d['*']:
                #print(d, _title)
               
                translation_dict[_title]=d['*']
                if (_title == "Abd al-Fattah as-Sisi"):
                    print(_title, d['lang'], translation_dict)
    except KeyError:
        _title=json_string['query']['pages'][_pageID]['title']
        translation_dict[_title]=''
translation_dict
{'Abd al-Fattah as-Sisi': '',
 'Mohammed Mursi': '',
 'Muslimbrüder': 'Muslim Brotherhood',
 'Oberster Rat der Streitkräfte': '',
 'Putsch': "Coup d'état",
 'Staatskrise in Ägypten 2013/2014': '',
 'Streitkräfte Ägyptens': 'Egyptian Armed Forces',
 'Ultimatum': 'Ultimatum',
 'Ägypten': 'Egypt'}
json_string['query']['pages']['17531']['title']
'Putsch'
german_outlinks
['Putsch',
 'Streitkräfte Ägyptens',
 'Oberster Rat der Streitkräfte',
 'Abd al-Fattah as-Sisi',
 'Ägypten',
 'Mohammed Mursi',
 'Ultimatum',
 'Islamismus',
 'Muslimbrüder',
 'Staatskrise in Ägypten 2013',
 'Vereinigte Staaten von Amerika',
 'Europäische Union',
 'Terrorismus',
 'Sinai-Halbinsel',
 'Gotteskrieger',
 'Menschenrechtsorganisation',
 'Husni Mubarak',
 'Putsch',
 'Koptische Kirche',
 'Patriarch',
 'Tawadros II.',
 'Imam',
 'Al-Azhar-Universität',
 'Ahmed Tayeb',
 'Tamarod',
 'Linksliberalismus',
 'Nationale Heilsfront',
 'Mohammed el-Baradei',
 'Salafisten',
 'Partei des Lichts',
 'Hasim al-Beblawi',
 'Staatskrise in Ägypten 2013',
 'Christentum',
 'Bischof',
 'Tawadros II.',
 'Sanktion',
 'Römisch-katholische Kirche',
 'Deutsche Bischofskonferenz',
 'Muslimbrüder',
 'Tiefer Staat',
 'Husni Mubarak',
 'Revolution in Ägypten 2011',
 'Restauration (Geschichte)',
 'Konterrevolution',
 'Parlament',
 'Verfassunggebende Versammlung',
 'Gremium',
 'Husni Mubarak',
 'Nachrichtendienst',
 'Streitkräfte Ägyptens',
 'Demonstration',
 'Protest',
 'Interessengruppe',
 'Unterschriftenaktion',
 'Tamarod',
 'Naguib Sawiris',
 'Infrastruktur',
 'Partei der Freien Ägypter',
 'Verfassungsgerichtsbarkeit',
 'Tahani al-Gebali',
 'Elektrizitätsversorgung',
 'Kraftstoff',
 'Erdgas',
 'Menschenrechte',
 'Folter',
 'Pressefreiheit',
 'Lebensmittelteuerung (Seite nicht vorhanden)',
 'Arbeitslosigkeit',
 'Tankstelle',
 'Stromausfall',
 'Boykott',
 'Kriminalität',
 'Nachrichtendienst',
 'Tawadros II.',
 'Ahmad Mohammad al-Tayyeb',
 'Mohammed el-Baradei',
 'Islam',
 'Al-Azhar-Universität',
 'Linksliberalismus',
 'Mohammed el-Baradei',
 'Salafismus',
 'Partei des Lichts',
 'Elite',
 'Übergangsregierung',
 'Technokratie',
 'Großunternehmen',
 'Gouvernement',
 'Gouverneur',
 'Abd al-Fattah as-Sisi',
 'Mohammed Hussein Tantawi',
 'Gremium',
 'Arbeitslosigkeit',
 'Inflation',
 'Revolution in Ägypten 2011',
 'Verfassung der Republik Ägypten',
 'Gouvernements in Ägypten',
 'Adel al-Chajat',
 'Gamaa Islamija',
 'Al-Azhar-Moschee',
 'Tamarod',
 'Mohammed Mursi',
 'Tahrir-Platz',
 'Al-Wasat-Partei',
 'Streitkräfte Ägyptens',
 'Barack Obama',
 'Mohamed Kamel Amr',
 'Middle East News Agency (Seite nicht vorhanden)',
 'Abd al-Fattah as-Sisi',
 'Hescham Kandil',
 'Abdel Meguid Mahmud',
 'Koalition (Politik)',
 'Universität Kairo',
 'Al-Dschamāʿa al-islāmiyya',
 'Assem Abdel-Maged (Seite nicht vorhanden)',
 'Anne W. Patterson',
 'Mitteleuropäische Sommerzeit',
 'Heliopolis',
 'Greenwich Mean Time',
 'Übergangsregierung',
 'Abd al-Fattah as-Sisi',
 'Vorgezogene Neuwahl',
 'Mahmoud Badr (Seite nicht vorhanden)',
 'Entführung',
 'Wiki Thawra (Seite nicht vorhanden)',
 'Washington Post',
 'Revolution',
 'Stiftung Wissenschaft und Politik',
 'Volker Perthes',
 'Die Zeit',
 'Kurier (Tageszeitung)',
 'Der Spiegel',
 'Putsch',
 'Politikwissenschaft',
 'NDR Info',
 'Putsch',
 'John Kerry',
 'Martin E. Dempsey',
 'Sedki Sobhi (Seite nicht vorhanden)',
 'Gunter Mulack',
 'Türkei',
 'Recep Tayyip Erdoğan',
 'Tunesien',
 'Ennahda',
 'The Daily Beast',
 'Philip J. Crowley',
 'Außenminister der Vereinigten Staaten',
 'Zentrum für Forschung zur Arabischen Welt (Seite nicht vorhanden)',
 'Deutsche Gesellschaft für Auswärtige Politik',
 'Europäische Union',
 'Krieg gegen den Terror',
 'Israelisch-ägyptischer Friedensvertrag',
 'Menschenrechte',
 'Dirk Emmerich (Seite nicht vorhanden)',
 'N-tv',
 'Tiefer Staat',
 'Judikative',
 'Exekutive',
 'Administrative',
 'Revolution in Ägypten 2011',
 'Wirtschaft',
 'Militär',
 'Staat im Staate',
 'Konterrevolution',
 'Elite',
 'Militärparade',
 'Akademischer Grad',
 'Militärakademie',
 'Vierte Gewalt',
 'The Guardian',
 'The Washington Post',
 'Al-Arabiya',
 'Al Jazeera',
 'Deutsche Welle',
 'BBC Arabic (Seite nicht vorhanden)',
 'Analphabetismus',
 'Kinderschänder',
 'Michael Thumann',
 'Bürgerrecht',
 'Arabic Network for Human Rights Information (Seite nicht vorhanden)',
 'Muhammad Badi’e',
 'Rābiʿa-al-ʿAdawiyya-Moschee',
 'Chairat al-Schater',
 'Saad al-Katatni',
 'Freiheits- und Gerechtigkeitspartei',
 'Rashad Bajumi (Seite nicht vorhanden)',
 'Human Rights Watch',
 'Adli Mansur',
 'Marsa Matruh',
 'Kafr asch-Schaich',
 'Alexandria',
 'Al-Minya',
 'Alexandria',
 'Luxor',
 'Damanhur',
 'Konterrevolution',
 'Flughafen al-Arisch',
 'Gouvernement as-Suwais',
 'Gouvernement Dschanub Sina',
 'Hosni Mubarak',
 'Kopten',
 'Kreuzzug',
 'Scharia',
 'Mohammed el-Baradei',
 'Partei des Lichts',
 'Ägyptische Sozialdemokratische Partei',
 'Siad Bahaa El-Din (Seite nicht vorhanden)',
 'Koptische Kirche',
 'Al-Arisch',
 'Verfassung der Republik Ägypten',
 'Voice of America',
 'Afrikanische Union',
 'Republikanische Partei',
 'John McCain',
 'Partei des Lichts',
 'Hasim al-Beblawi',
 'Weltbank',
 'Recep Tayyip Erdoğan',
 'Abdullah Gül',
 'Blutbad in Kairo und Gizeh vom 14. August 2013',
 'Cairo Institute for Human Rights Studies (Seite nicht vorhanden)',
 'Kosovo',
 'Libyen',
 'Syrien',
 'Libanon',
 'Ukraine',
 'Europäische Union',
 'Militärputsch in Ägypten 1952',
 'Gamal Abdel Nasser',
 'Tawadros II.',
 'Ahmad Mohammad al-Tayyeb',
 'Mohammed el-Baradei',
 'Verfassung der Republik Ägypten',
 'Oberstes Verfassungsgericht Ägyptens',
 'Adli Mansur',
 'Übergangsregierung',
 'Technokratie',
 'British Broadcasting Corporation',
 'Nationale Heilsfront',
 'Freiheits- und Gerechtigkeitspartei',
 'Joachim Schroedel',
 'Deutsche Bischofskonferenz',
 'The European',
 'Afrikanische Union',
 'Afrikanische Union',
 'Afrikanische Union',
 'Addis Abeba',
 'Nkosazana Dlamini-Zuma',
 'Revolution in Ägypten 2011',
 'Hosni Mubarak',
 'Mohammed Edrees (Seite nicht vorhanden)',
 'Dekolonisation Afrikas',
 'Deutschland',
 'Deutschland',
 'Guido Westerwelle',
 'Dänemark',
 'Dänemark',
 'Iran',
 'Iran',
 'Jordanien',
 'Jordanien',
 'Katar',
 'Katar',
 'Kuwait',
 'Kuwait',
 'Russland',
 'Russland',
 'Alexej Puchow (Seite nicht vorhanden)',
 'Saudi-Arabien',
 'Saudi-Arabien',
 'Abdullah ibn Abd al-Aziz',
 'Somalia',
 'Somalia',
 'Al-Shabaab (Somalia)',
 'Twitter',
 'Rosarote Brille',
 'Syrien',
 'Syrien',
 'Baschar al-Assad',
 'Türkei',
 'Türkei',
 'Ahmet Davutoğlu',
 'Recep Tayyip Erdoğan',
 'Israel',
 'Tunesien',
 'Tunesien',
 'Moncef Marzouki',
 'Kongress für die Republik (Tunesien)',
 'Vereinigte Arabische Emirate',
 'Vereinigte Arabische Emirate',
 'Chalifa bin Zayid Al Nahyan',
 'Vereinigtes Königreich',
 'Vereinigtes Königreich',
 'William Hague',
 'Vereinigte Staaten',
 'Vereinigte Staaten',
 'Vereinigte Staaten',
 'Tansania',
 'Barack Obama',
 'Tunesien',
 'Algerien',
 'L’Orient-Le Jour',
 'Libanon',
 'The Daily Star (Libanon)',
 'Baschar al-Assad',
 'Syrien',
 'Iran',
 'Bahrain',
 'Gulf News',
 'Vereinigte Arabische Emirate',
 'Israel',
 'Israel HaYom',
 'Jedi’ot Acharonot',
 'Haaretz',
 'Frankreich',
 'Paris',
 'Le Figaro',
 'Ouest-France',
 'The New York Times',
 'Handelsblatt',
 'Die Welt',
 'Süddeutsche Zeitung',
 'Frankfurter Allgemeine Zeitung',
 'Jen Psaki',
 'George Orwell',
 'Neusprech',
 'Federal Reserve Bank of New York',
 'Rüstungsindustrie',
 'General Dynamics F-16',
 'Mehrzweckkampfflugzeug',
 'Hughes AH-64',
 'Kampfhubschrauber',
 'M1 Abrams',
 'Kampfpanzer',
 'Fregatte',
 'Krieg in Afghanistan',
 'Krieg gegen den Terror',
 'Naher Osten',
 'Ostafrika',
 'Freiheiten der Luft',
 'Luftraum',
 'Sueskanal',
 'Ölvorkommen',
 'Naher Osten',
 'Brookings Institution',
 'Demokratiemessung',
 'Demokratiemessung',
 'Husni Mubarak',
 'Anwar as-Sadat',
 'Blutbad in Kairo und Gizeh 2013',
 'Staatskrise in Ägypten 2013/2014 (Kabinett Beblawi)',
 'Chile',
 'Argentinien',
 'Algerien',
 'Martin Gehlen',
 'James Franklin Jeffrey',
 'American Council on Germany',
 'Council on Foreign Relations',
 'George W. Bush']
def translation_getter(page_title, lang='de', target_lang='en'):
    
    # Query the source-language Wikipedia for the page's language links
    _T="https://{1}.wikipedia.org/w/api.php?action=query&format=json&prop=langlinks&titles={0}&llprop=autonym|langname&lllimit=500".format(page_title,lang)
    
    req = requests.get(_T)

    json_string = json.loads(req.text)
    
    _pageID=list(json_string['query']['pages'].keys())[0]

    # Pages with no language links lack the 'langlinks' key, so default to an empty list
    _translation_list=json_string['query']['pages'][_pageID].get('langlinks',[])
    
    _translation_dict=dict()

    # Keep only the link pointing at the requested target language
    for translate in _translation_list:
        if translate['lang']==target_lang:
            _translation_dict[translate['lang']]=translate['*']
        
    
    return _translation_dict
translation_getter(page_title)
german_outlinks
translate_links_en_dict=dict()

# For each German outlink, look up its English title (empty string if the page has none)
for title in set(german_outlinks):
    translate_links_en=translation_getter(title,lang='de',target_lang='en')
    
    translate_links_en_dict[title]=translate_links_en.get('en','')

Run an example article: this collects the outlinks for the article in every language it's published in.

#pull all out links for each specific language page 
_langlink_AllList_dict=dict()

for lang,title in link_getter(page_title).items():
    LangLinksAll=get_page_outlinks(page_title=title,lang=lang)
   
    _langlink_AllList_dict[lang]=LangLinksAll
    
    
 
_langlink_AllList_dict
 
page_outlinks = get_page_outlinks(page_title)
page_outlinks
['Abdel Fattah el-Sisi',
 'Mohamed Morsi',
 'Egyptian Constitution of 2012',
 'June 2013 Egyptian protests',
 'Muslim Brotherhood',
 'Supreme Constitutional Court of Egypt',
 'Adly Mansour',
 'Grand Imam of al-Azhar',
 'Ahmed el-Tayeb',
 'Pope of the Coptic Orthodox Church of Alexandria',
 'Pope Tawadros II of Alexandria',
 'Mohamed ElBaradei',
 'Tunisia',
 'African Union',
 'Revolution',
 'August 2013 Rabaa massacre',
 'Post-coup unrest in Egypt (2013–14)',
 'Hosni Mubarak',
 'Egyptian Revolution of 2011',
 'History of Egypt under Hosni Mubarak',
 'Egyptian presidential election, 2012',
 'Muslim Brotherhood in post-Mubarak electoral politics of Egypt',
 'Mohamed ElBaradei',
 'Amr Moussa',
 'Hamdeen Sabahi',
 'The Wall Street Journal',
 'Tamarod',
 'National Salvation Front (Egypt)',
 'April 6 Youth Movement',
 'Strong Egypt Party',
 'The Gallup Organization',
 'Foreign involvement in the Syrian civil war',
 'International Crisis Group',
 'Egyptian constitution',
 'Anti-Coup Alliance',
 'El-Hossari Mosque (page does not exist)',
 'El-Nahda Square (page does not exist)',
 'Cairo University',
 'Ain Shams',
 "Coup d'état",
 'Tamarod',
 'Politics of the United Arab Emirates',
 'Cairo',
 'Alexandria',
 'Dakahlia Governorate',
 'Gharbiya',
 'Aswan',
 'Rabia Al-Adawiya Mosque',
 'Egyptian Presidential Palace',
 'El-Quba Palace (page does not exist)',
 'Damietta',
 'Tahrir Square',
 'Heliopolis Palace',
 'Port Said',
 'Suez',
 'Mokatam (page does not exist)',
 'Egyptian Armed Forces',
 'Ministry of Tourism (Egypt)',
 'Hisham Zazou',
 "Al-Gama'a al-Islamiyya",
 'Luxor massacre',
 'Luxor',
 'Ministry of Communications and Information Technology (Egypt)',
 'Atef Helmi',
 'Hatem Bagato (page does not exist)',
 'Khaled Abdel Aal (page does not exist)',
 'Freedom and Justice Party (Egypt)',
 'Barack Obama',
 'United States',
 'Minister of Foreign Affairs (Egypt)',
 'Mohamed Kamel Amr',
 'Egyptian Army',
 'List of Ministers of Defence of Egypt',
 'Abdel Fattah el-Sisi',
 'Court of Cassation (Egypt) (page does not exist)',
 'Abdel Meguid Mahmoud',
 'Talaat Abdallah (page does not exist)',
 'Al-Ahram',
 'Constitution of Egypt',
 'Sami Hafez Anan',
 'Egyptian Armed Forces',
 'Egyptian Armed Forces',
 'Mohamed El-Baradei',
 'National Salvation Front (Egypt)',
 'Abdel Fattah el-Sisi',
 'Waleed al-Haddad (page does not exist)',
 'Mohammed Zaki (page does not exist)',
 'Yahya Hamed (page does not exist)',
 "Talk:2013 Egyptian coup d'état",
 'Abdel Fattah el-Sisi',
 'Adli Mansour',
 'Technocracy',
 'Republican Guard (Egypt)',
 'Adli Mansour',
 'Shura Council',
 'Allahu akbar',
 'Pope of the Coptic Orthodox Church of Alexandria',
 'Tawadros II',
 'Grand Imam of al-Azhar',
 'Ahmed el-Tayeb',
 'Mohamed ElBaradei',
 'Tamarod',
 'Mahmoud Badr',
 'Al-Nour party',
 'Galal Murra (page does not exist)',
 'National Salvation Front (Egypt)',
 'Egyptian Armed Forces',
 'Republican Guard (Egypt)',
 'Egyptian Armed Forces',
 'Colonel',
 'Ahmed Mohammed Ali',
 'Egyptian Armed Forces',
 'Catherine Ashton',
 'European Union',
 'African Union',
 'Freedom and Justice Party (Egypt)',
 'Saad El-Katatni',
 'Rashad al-Bayoumi',
 'Muslim Brotherhood',
 'Al-Ahram',
 'Mohammed Badie',
 'Khairat El-Shater',
 'Mahdi Akef',
 'Mohamed Beltagy',
 'Safwat Hegazi',
 'Al-Wasat Party',
 'Abou Elela Mady',
 'Essam Sultan (page does not exist)',
 'Al Jazeera English',
 'Misr 25',
 'Al Hafez (page does not exist)',
 'Al Nas (page does not exist)',
 'Al Jazeera',
 'Mubasher Misr (page does not exist)',
 'Associated Press Television News',
 'Cairo News Company (page does not exist)',
 'Committee to Protect Journalists',
 'BBC News',
 'Jeremy Bowen',
 'Al-Ahram',
 'Friday prayers',
 '2013 Republican Guard headquarters clashes',
 'BBC News',
 'Jeremy Bowen',
 'Qena',
 '6th October Bridge',
 'Gaza Strip',
 'Rafah border crossing',
 'Prime Minister of the Gaza Strip',
 'Ismail Haniyeh',
 '2013 Republican Guard headquarters clashes',
 'Mohamed Beltagy',
 'Al-Dustour (Egypt)',
 'Foreign rebel fighters in the Syrian civil war',
 'University of California at Berkeley',
 'Qalyoub (page does not exist)',
 'Rabaa al-Adawiya mosque',
 "Talk:2013 Egyptian coup d'état",
 'Coptic Christian',
 'Christians Against the Coup',
 'Anti-Coup Alliance',
 'Al-Arish',
 'Treason',
 "2005 Mauritanian coup d'état",
 "2012 Malian coup d'état",
 '2009 Malagasy political crisis',
 "1999 Pakistani coup d'état",
 'Egyptian American',
 'Michigan',
 'Amnesty International',
 'Muslim Brotherhood',
 'Freedom and Justice Party (Egypt)',
 'Egyptian Army',
 'United Arab Emirates',
 'Tamarod',
 'African Union',
 'Nkosazana Dlamini-Zuma',
 'European Union',
 'High Representative of the Union for Foreign Affairs and Security Policy',
 'Catherine Ashton',
 'United Nations',
 'Ban Ki-moon',
 'Nabil Fahmy',
 'Navi Pillay',
 'Argentina',
 'Australia',
 'Kevin Rudd',
 'Bahrain',
 'Hamad bin Isa Al-Khalifa',
 'Canada',
 'John Baird (Canadian politician)',
 'China',
 'Colombia',
 'France',
 'Francois Hollande',
 'Tunisian revolution',
 'Aftermath of the Libyan civil war',
 'Syrian civil war',
 'Laurent Fabius',
 'Germany',
 'Guido Westerwelle',
 'Iran',
 'Ali Akbar Salehi',
 'Iraq',
 'Nouri al-Maliki',
 'Israel',
 'Benjamin Netanyahu',
 'Haaretz',
 'Yisrael Katz (politician born 1955)',
 'Israeli Army Radio',
 'Eli Shaked (page does not exist)',
 'Eli Shaked (page does not exist)',
 'Jordan',
 'Kuwait',
 'Kuwait News Agency',
 'Sabah Al-Ahmad Al-Jaber Al-Sabah',
 'Lebanon',
 'Tammam Salam',
 'Libya',
 'Rome',
 'Ali Zidan',
 'Malaysia',
 'Najib Razak',
 'Ministry of Youth and Sports (Malaysia)',
 'Khairy Jamaluddin',
 'Pan-Malaysian Islamic Party',
 'Nik Abdul Aziz Nik Mat',
 'Anwar Ibrahim',
 'Pan-Malaysian Islamic Party',
 'Nik Abdul Aziz Nik Mat',
 'Anwar Ibrahim',
 'Norway',
 'Espen Barth Eide',
 'Netherlands',
 'Pakistan',
 'Nawaz Sharif',
 'State of Palestine',
 'President of the State of Palestine',
 'Mahmoud Abbas',
 'Hanan Ashrawi',
 'Gaza Strip',
 'Hamas',
 'Governance of the Gaza Strip',
 'Yahia Moussa (page does not exist)',
 'Hamas',
 'Ahmad Yousef (page does not exist)',
 'Sic',
 'Gaza Strip',
 'Hamas',
 'Governance of the Gaza Strip',
 'Yahia Moussa (page does not exist)',
 'Hamas',
 'Ahmad Yousef (page does not exist)',
 'Sic',
 'Eli Shaked (page does not exist)',
 'Pan-Malaysian Islamic Party',
 'Nik Abdul Aziz Nik Mat',
 'Anwar Ibrahim',
 'Gaza Strip',
 'Hamas',
 'Governance of the Gaza Strip',
 'Yahia Moussa (page does not exist)',
 'Hamas',
 'Ahmad Yousef (page does not exist)',
 'Sic',
 'Philippines',
 'Benigno Aquino III',
 'Edwin Lacierda',
 'Department of Foreign Affairs (Philippines)',
 'Poland',
 'Qatar',
 'Al Jazeera',
 'Tamim bin Hamad Al Thani',
 'Khaled al-Attiya (page does not exist)',
 'Russia',
 'Saudi Arabia',
 'Abdullah of Saudi Arabia',
 'Somalia',
 'Al-Shabaab (militant group)',
 'Twitter',
 'Al-Shabaab (militant group)',
 'Twitter',
 'Sudan',
 'Ali Karti (page does not exist)',
 'Mohamed Kamel Amr',
 'Egypt-Sudan relations',
 'Hassan al-Turabi',
 'Hassan al-Turabi',
 'Sweden',
 'Carl Bildt',
 'Switzerland',
 'Syria',
 'Bashar al-Assad',
 'Tunisia',
 'Arab Spring',
 'Ennahda Movement',
 'Rachid Ghannouchi',
 'Turkey',
 'Recep Tayyip Erdogan',
 'Ahmet Davutoglu',
 'Hüseyin Çelik',
 'Justice and Development Party (Turkey)',
 'Cabinet Erdoğan II',
 "Republican People's Party (Turkey)",
 'Kemal Kılıçdaroğlu',
 "Republican People's Party (Turkey)",
 'Kemal Kılıçdaroğlu',
 'United Arab Emirates',
 'Abdullah bin Zayed Al Nahyan',
 'United Kingdom',
 'William Hague',
 'United States',
 'William Joseph Burns',
 'John McCain',
 'Senate Foreign Relations Committee',
 'Ed Royce',
 'House Foreign Affairs Committee',
 'Eliot Engel',
 'Dan Shapiro',
 'Tel Aviv',
 'Frank G. Wisner',
 'United States Secretary of State',
 'John Kerry',
 'William Joseph Burns',
 'John McCain',
 'Senate Foreign Relations Committee',
 'Ed Royce',
 'House Foreign Affairs Committee',
 'Eliot Engel',
 'Dan Shapiro',
 'Tel Aviv',
 'Frank G. Wisner',
 'United States Secretary of State',
 'John Kerry',
 'Yemen',
 'Abd Rabbuh Mansur Hadi',
 'Hamid al-Ahmar',
 'Al-Islah (Yemen)',
 'Muslim Brotherhood',
 'Hamid al-Ahmar',
 'Al-Islah (Yemen)',
 'Muslim Brotherhood',
 'Al-Shabaab (militant group)',
 'Twitter',
 'Hassan al-Turabi',
 "Republican People's Party (Turkey)",
 'Kemal Kılıçdaroğlu',
 'William Joseph Burns',
 'John McCain',
 'Senate Foreign Relations Committee',
 'Ed Royce',
 'House Foreign Affairs Committee',
 'Eliot Engel',
 'Dan Shapiro',
 'Tel Aviv',
 'Frank G. Wisner',
 'United States Secretary of State',
 'John Kerry',
 'Hamid al-Ahmar',
 'Al-Islah (Yemen)',
 'Muslim Brotherhood',
 'Al-Qaeda',
 'Ayman al-Zawahiri',
 'Sharia',
 'Post-coup unrest in Egypt (2013–14)',
 'August 2013 Rabaa massacre',
 'Egyptian Revolution of 2011',
 'Egyptian Revolution of 1952',
 'Egyptian Revolution of 1919',
 'Digital object identifier',
 'Digital object identifier']
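
Before building the network, note that page_outlinks contains duplicate titles whenever the article links to the same page more than once; these repeats are what later become edge weights. A quick frequency count (a minimal sketch using the pandas library already imported above) shows the most heavily repeated links.

# Count how often each outlink title appears in the article body
pd.Series(page_outlinks).value_counts().head(10)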
#check language links to translate the outlinks found on each language's page back to English
translated_links_dict = dict()

# _langlink_AllList_dict maps each language code to that page's outlinks;
# look up an English title for each outlink (empty string if none exists)
for lang,outlinks in _langlink_AllList_dict.items():
    translated_links_dict[lang] = [translation_getter(title,lang=lang,target_lang='en').get('en','') for title in set(outlinks)]

You could write a recursive function like recursively_get_hyperlink_network that would crawl the hyperlink network out to an arbitrary distance, but this becomes exorbitantly expensive at any depth greater than 1.

Here's an example function, but it is not executable here to prevent you from harming yourself. :)

def recursively_get_hyperlink_network(seed_page,depth):
    neighbors = {}
    if depth < 0:
        return neighbors
    neighbors[seed_page] = get_page_outlinks(seed_page)
    for neighbor in neighbors[seed_page]:
        neighbors[neighbor] = recursively_get_hyperlink_network(neighbor,depth-1)
    return neighbors
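
To get a feel for why depth 2 is already prohibitive, here is a rough back-of-the-envelope sketch that assumes each neighbor has about as many unique outlinks as the seed article itself (using the page_outlinks list retrieved above).

# Rough request-count estimate for a breadth-first crawl, assuming a constant branching factor
branching_factor = len(set(page_outlinks))
print("Roughly {0} requests at depth 1 and {1:,} at depth 2".format(branching_factor,branching_factor**2))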
 req = requests.get("https://en.wikipedia.org/w/api.php?action=query&titles=2013%20Egyptian%20coup%20d'état&prop=langlinks")
 
    # Replace spaces with underscores
    page_title = page_title.replace(' ','_')
    
   # bad_titles = ['Special:','Wikipedia:','Help:','Template:','Category:','International Standard','Portal:','s:','File:']
    
    # Get the response from the API for a query
    # After passing a page title, the API returns the HTML markup of the current article version within a JSON payload
    req = requests.get("https://en.wikipedia.org/w/api.php?action=parse&format=json&page={0}&prop=langlinks".format(page_title,redirects))
    
    # Read the response into JSON to parse and extract the HTML
    json_string = json.loads(req.text)
    
    # Initialize an empty list to store the links
    #outlinks_list = [] 
    
    #if 'parse' in json_string.keys():
        #page_html = json_string['parse']['text']['*']

        # Parse the HTML into Beautiful Soup
        #soup = BeautifulSoup(page_html,'lxml')

        # Delete tags associated with templates
        #for tag in soup.find_all('tr'):
            #tag.replace_with('')

        # For each paragraph tag, extract the titles within the links
        #for para in soup.find_all('p'):
           # for link in para.find_all('a'):
                #if link.has_attr('title'):
                   # title = link['title']
                    # Ignore links that aren't interesting
                    #if all(bad not in title for bad in bad_titles):
                        #outlinks_list.append(title)

        # For each unordered list, extract the titles within the child links
        #for unordered_list in soup.find_all('ul'):
            #for item in unordered_list.find_all('li'):
                #for link in item.find_all('a'):
                    #if link.has_attr('title'):
                        #title = link['title']
                        # Ignore links that aren't interesting
                        #if all(bad not in title for bad in bad_titles):
                            #outlinks_list.append(title)

    #return outlinks_list
 
page_outlinks = get_page_outlinks(page_title)
page_outlinks[:10]

Instead, define a simple function to get the 1.5-step ego hyperlink network. The "ego" is the seed page you start from and the "alters" are the neighbors that the ego links out to. We also get the alters of the alters (2nd-order alters), but only include these 2nd-order connections if they link back to the ego or to 1st-order alters. In other words, the 1.5-step ego hyperlink network is all the pages linked from the seed page plus the connections among this set of articles.

def get_hyperlink_alters(seed_page):
    # Initialize an empty dictionary to act as an adjacency "list"
    neighbors = {}
    
    # Get all the alters for the seed page and store them in the adjacency dictionary
    neighbors[seed_page] = get_page_outlinks(seed_page,redirects=1)
    
    # For each of the alters, get their alters and store in the adjacency dictionary
    for neighbor in list(set(neighbors[seed_page])): # Don't recrawl duplicates
        neighbors[neighbor] = get_page_outlinks(neighbor,redirects=0)
    
    # Initialize an empty graph that we will add nodes and edges into
    g = nx.DiGraph()
    
    # For each entry in the adjacency dictionary, check if the alter's alters are also the seed page's alters
    # If they are and the edge is already in the graph, increment the edge weight by one
    # If they are but the edge is not already in the graph, add the edge with a weight of one
    for article,neighbor_list in neighbors.items():
        for neighbor in neighbor_list:
            if neighbor in neighbors[seed_page] + [seed_page]:
                if g.has_edge(article,neighbor):
                    g[article][neighbor]['weight'] += 1
                else:
                    g.add_edge(article,neighbor,weight=1)
    
    # Return the weighted graph
    return g

Run this on an example article and save the resulting graph object to disk.

This step could take more than a minute depending on the number of links and size of the neighboring pages.

# Create the hyperlink network
hyperlink_g = get_hyperlink_alters(page_title)

# Save the graph to disk to visualize in Gephi
nx.write_graphml(hyperlink_g,'hyperlink_{0}.graphml'.format(page_title.replace(' ','_')))
hg_nodes = hyperlink_g.number_of_nodes()
hg_edges = hyperlink_g.number_of_edges()

print("There are {0} nodes and {1} edges in the hyperlink network.".format(hg_nodes,hg_edges))
hg_density = nx.density(hyperlink_g)
print('{0:.2%} of the possible edges actually exist.'.format(hg_density))
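
As a sanity check on what nx.density reports for a directed graph, here is a minimal sketch of the formula it uses: the number of observed edges divided by the n*(n-1) possible directed edges, using the hg_nodes and hg_edges counts from above.

# Density of a directed graph: observed edges divided by the n*(n-1) possible edges
manual_density = hg_edges/float(hg_nodes*(hg_nodes - 1))
print('Manual density check: {0:.2%}'.format(manual_density))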
def reciprocity(g):
    reciprocated_edges = []
    
    for (i,j) in g.edges():
        if g.has_edge(j,i):
            reciprocated_edges.append((i,j))
    
    return len(reciprocated_edges)/float(g.number_of_edges())

hg_reciprocity = reciprocity(hyperlink_g)

print('{0:.2%} of the edges in the hyperlink network are reciprocated.'.format(hg_reciprocity))
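
Newer releases of networkx (2.x and later) also ship a built-in reciprocity function; this optional cross-check, written defensively in case your version lacks it, should agree with the hand-rolled value above.

# Optional cross-check against networkx's built-in reciprocity (networkx >= 2.x)
try:
    print('{0:.2%} (built-in nx.reciprocity)'.format(nx.reciprocity(hyperlink_g)))
except AttributeError:
    print('This networkx version does not include nx.reciprocity.')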

Play the Wikipedia Game!

Using only the hyperlinks on the article, try to get from the first article to the second article.

page1,page2 = np.random.choice(list(hyperlink_g.nodes()),2)
print("Try to navigate from \"{0}\" to \"{1}\" using only hyperlinks.\n".format(page1,page2))
print("Start at: https://en.wikipedia.org/wiki/{0}".format(page1.replace(' ','_')))

No cheating!

After you've played the game a few times, see what an optimal shortest path is. You may get an error indicating there is no shortest path, in which case, try a new pair of nodes.

nx.shortest_path(hyperlink_g,page1,page2)

The shortest path length is the smallest number of steps needed to get from one node to another. This is related to the "small world" effect where everyone in the world is just a few handshakes away from each other. It's rare to find complex networks where the longest shortest path is above 5. Nodes that are this far from each other are likely about very unrelated topics.

If there are no paths greater than 5, lower the path_length_threshold from 5 to 4.

The long_path_lengths dictionary below is populated by computing all the shortest path lengths between nodes in the network and only keeping those paths that are longer than the path_length_threshold. In a directed graph like our hyperlink network, it's important to follow the direction of the arrows: if page A links to page B but page B doesn't link to page A, then a path from B to A can't use that edge and has to find another route.

path_length_threshold = 4
long_path_lengths = {}

for k,d in nx.all_pairs_shortest_path_length(hyperlink_g).items():
    long_paths = [v for v,l in d.items() if l > path_length_threshold]
    if len(long_paths) > 0:
        long_path_lengths[k] = long_paths
        
long_path_lengths.keys()

The shortest path between the articles can be identified using the shortest_path function and supplying the graph and the names of two nodes.

# Randomly choose two articles in the list of long shortest paths
page1,page2 = np.random.choice(list(long_path_lengths.keys()),2)
print("The two pages randomly selected are: \"{0}\" and \"{1}\"".format(page1,page2))

# Display the path between these articles
nx.shortest_path(hyperlink_g,page1,page2)

Test out different combinations of articles from the long_path_lengths to find the articles that are farthest apart by entering different article names for page1 and page2.

page1 = 'National Association for Business Economics'
page2 = 'NATO'
nx.shortest_path(hyperlink_g,page1,page2)
hg_in_degree_d = {node:int(centrality*(len(hyperlink_g) - 1)) for node,centrality in nx.in_degree_centrality(hyperlink_g).items()}
hg_out_degree_d = {node:int(centrality*(len(hyperlink_g) - 1)) for node,centrality in nx.out_degree_centrality(hyperlink_g).items()}
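
These dictionaries recover raw degree counts from the normalized centralities. The same counts can also be read directly off the graph object, which is a quick way to spot-check them; wrapping the result in dict() works on both older and newer networkx releases, though the converted values may occasionally differ by one because of floating-point rounding.

# Spot-check: raw in-degrees straight from the graph, sorted from largest to smallest
raw_in_degree = dict(hyperlink_g.in_degree())
print(sorted(raw_in_degree.items(), key=lambda kv: kv[1], reverse=True)[:5])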

Look at the nodes with the highest in-degree: these are the pages that many other pages in the network point to.

degree_df = pd.DataFrame({'In':hg_in_degree_d,'Out':hg_out_degree_d})
degree_df['In'].sort_values(ascending=False).head(10)

Look at the nodes with the highest out-degree: these pages point to many other pages.

degree_df['Out'].sort_values(ascending=False).head(10)

Look at the nodes that have no links out.

degree_df.query('Out == 0')['Out']

Look at nodes that have a single link in. These are also known as (in-) pendants. If there are none, it should appear as an empty series.

degree_df.query('In == 1')['In']

Look at the nodes with a single link out. These are also known as (out-)pendants. If there are none, it should appear as an empty series.

degree_df.query('Out == 1')['Out']

Given a page, what are the neighbors that link in to it and out from it? Assign a specific article title to the page1 variable by replacing np.random.choice(degree_df.index) below.

page1 = np.random.choice(degree_df.index)

in_connections = list(hyperlink_g.predecessors(page1))
print("The links into node \"{0}\" are:\n{1}".format(page1,in_connections))
out_connections = list(hyperlink_g.successors(page1))
print("The links out from node \"{0}\" are:\n{1}".format(page1,out_connections))
in_degree_dist_df = degree_df['In'].value_counts().reset_index()
out_degree_dist_df = degree_df['Out'].value_counts().reset_index()

f,ax = plt.subplots(1,1)
in_degree_dist_df.plot.scatter(x='index',y='In',ax=ax,c='blue',label='In')
out_degree_dist_df.plot.scatter(x='index',y='Out',ax=ax,c='red',label='Out')
ax.set_xscale('symlog')
ax.set_yscale('symlog')
ax.set_xlim((0,1e3))
ax.set_ylim((0,1e3))

ax.set_xlabel('Connections')
ax.set_ylabel('Count')

Calculate communities within the network

Define a function to compute node community memberships for multiple community detection algorithms within igraph. The output is a dictionary of dictionaries where the top-level key is the name of the algorithm and the second-level dictionary is keyed by the page name with values being the community membership label. Documentation and details about these algorithms can be found under the igraph graph-class documentation.

def comparative_community_detector(igraph):
    memberships = {}
    
    # Directed memberships
    memberships['betweenness'] = igraph.community_edge_betweenness().as_clustering().membership
    memberships['infomap'] = igraph.community_infomap().membership
    memberships['spinglass'] = igraph.community_spinglass().membership
    memberships['walktrap'] = igraph.community_walktrap().as_clustering().membership
    
    # Undirected memberships
    undirected = igraph.as_undirected()
    memberships['fastgreedy'] = undirected.community_fastgreedy().as_clustering().membership
    memberships['leading_eigenvector'] = undirected.community_leading_eigenvector().membership
    memberships['multilevel'] = undirected.community_multilevel().membership
    
    labelled_memberships = {}
    for label,membership in memberships.items():
        labelled_memberships[label] = dict(zip(igraph.vs['id'],membership))
        
    return labelled_memberships

Not included in the comparative_community_detector function are two additional community detection algorithms that are too intensive or are not working properly. They're documented below if you ever care to explore in the future.

# Uses up a ton of memory and crashes kernel immediately
ig_hg_optimal_modularity = hyperlink_g.community_optimal_modularity().membership
ig_hg_optimal_modularity_labels = dict(zip(ig_hg.vs['id'],ig_hg_optimal_modularity))
pd.Series(ig_hg_optimal_modularity_labels).value_counts().head(10)

# Lumps everyone into a single community
ig_hg_label_propagation = hyperlink_g.community_label_propagation(initial=range(ig_hg_d.vcount()),fixed=[False]*ig_hg_d.vcount()).membership
ig_hg_label_propagation_labels = dict(zip(ig_hg_d.vs['id'],ig_hg_label_propagation))
pd.Series(ig_hg_label_propagation_labels).value_counts().head(10)

Here we need to shift from using the networkx library to using the igraph library. The former is built purely in Python, which makes it easier to use but somewhat slower, while the latter is a "wrapper" that lets us write Python but does the calculations in much faster C code behind the scenes.

# Load the hyperlink network data from disk into a networkx graph object
nx_hg = nx.read_graphml('hyperlink_{0}.graphml'.format(page_title.replace(' ','_')))

# Load the hyperlink network data from disk into a igraph graph object
ig_hg = ig.read('hyperlink_{0}.graphml'.format(page_title.replace(' ','_')))
ig.summary(ig_hg) # Get statistics about the igraph graph object

Run the function on the igraph version of the hyperlink network.

This may take a minute or more since these are intensive calculations

# Run the community detection labelling on the igraph graph object
comparative_community_labels = comparative_community_detector(ig_hg)

# Convert the node labels into a dict-of-dicts keyed by page name and inner-dict containing community labels
comparative_community_labels_transposed = pd.DataFrame(comparative_community_labels).to_dict('index')

# Update each node in the networkx graph object to reflect the community membership labels
for _node in nx_hg.nodes():
    try:
        nx_hg.node[_node]['label'] = _node
        for (label,membership) in comparative_community_labels_transposed[_node].items():
            nx_hg.node[_node][label] = int(membership)
    except KeyError: # Concerning that some labels aren't present, but skip them for now
        print("Error in assigning \"{0}\" to a community.".format(_node))
        pass

# Write the labeled graph back to disk to visualize in Gephi
nx.write_graphml(nx_hg,'hyperlink_communities_{0}.graphml'.format(page_title.replace(' ','_')))
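
As an optional follow-up, one quick way to compare the algorithms is to count how many distinct communities each one found. This is just a sketch using the comparative_community_labels dictionary computed above; it is not part of the saved GraphML output.

# Number of distinct community labels found by each algorithm
pd.DataFrame(comparative_community_labels).apply(pd.Series.nunique)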