EDA (exploratory data analysis) plays a crucial role in understanding the what, why, and how of the problem statement.

Importing different modules and exploring their usage

Pandas is a widely used data manipulation tool built on the NumPy package. Its key data structure, the DataFrame, stores and manipulates tabular data as rows of observations and columns of variables.

import pandas as pd
import numpy as np
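
As a minimal sketch of what "rows of observations" means in practice, the example below builds a tiny DataFrame and runs a few basic operations on it. The column names and values are invented for illustration and are not part of the translation data.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "words": np.array([120, 45, 300]),
})

n_rows, n_cols = toy.shape            # each row is one observation
long_rows = toy[toy["words"] > 100]   # boolean filtering keeps matching rows
mean_words = toy["words"].mean()      # column-wise aggregate
```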

json stands for JavaScript Object Notation, a lightweight data-interchange format. I ran multiple examples to get acquainted with the JSON format, its syntax, and its data types: string, number, object (JSON object), array, boolean, and null.

import json
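
One such example might look like the sketch below, which parses a small JSON string and shows the Python type each JSON data type becomes. The record is hypothetical and exists only to exercise every data type.

```python
import json

# Hypothetical record, invented to show each JSON data type.
text = (
    '{"title": "Tree", "views": 42, "ratio": 0.5, "tags": ["a", "b"], '
    '"meta": {"lang": "en"}, "draft": false, "parent": null}'
)

record = json.loads(text)  # JSON text -> Python objects

# object -> dict, array -> list, number -> int/float,
# true/false -> bool, null -> None
types = {key: type(value).__name__ for key, value in record.items()}

round_trip = json.dumps(record, sort_keys=True)  # Python objects -> JSON text
```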

re is Python's built-in library for working with regular expressions. I learnt its different functions and special sequences.

import re
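
A small sketch of three commonly used re functions (search, sub, and findall) is given below. The URL is a hypothetical one written in the style of a Wikipedia article link, not a value taken from the data.

```python
import re

# Hypothetical URL in the style of a Wikipedia article link.
url = "https://hi.wikipedia.org/wiki/Example_Page"

# re.search finds the first match; the groups capture language code and title.
match = re.search(r"https://(\w+)\.wikipedia\.org/wiki/(.+)", url)
lang, title = match.group(1), match.group(2)

# re.sub replaces every match; re.findall returns all matches as a list.
readable = re.sub(r"_", " ", title)
words = re.findall(r"[A-Za-z]+", readable)
```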

Seaborn and Matplotlib are imported for visualization, because charts convey analysis results faster and with more impact than paragraphs of text describing them.

import seaborn as sns
import matplotlib.pyplot as plt

I imported mwapi, a client for the MediaWiki API, and read through the sections and actions it supports, such as querying.

import mwapi

Interaction with API

session = mwapi.Session(host='https://en.wikipedia.org',
                        user_agent='Miriiyala Pujitha Jaji')

# Translations from English to Hindi, one of the official languages of India
parameters = {'action':'query',
              'format':'json',
              'list':'cxpublishedtranslations',
              'from':'en',
              'to':'hi',
              'limit':500,
              'offset':200}
res = session.get(**parameters)
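
The query above fetches a single page of results. As a rough sketch of how several pages could be described, the helper below builds one parameter dictionary per offset. The function name page_parameters is hypothetical, and it is only an assumption (based on the limit and offset keys used above) that the API pages this way. No request is sent.

```python
def page_parameters(source, target, limit, offset):
    """Build the query parameters for one page of published translations."""
    return {
        'action': 'query',
        'format': 'json',
        'list': 'cxpublishedtranslations',
        'from': source,
        'to': target,
        'limit': limit,
        'offset': offset,
    }

# Three consecutive pages of 500 records each; nothing is sent to the API here.
pages = [page_parameters('en', 'hi', 500, offset) for offset in range(0, 1500, 500)]
offsets = [page['offset'] for page in pages]
```

Each dictionary could then be passed to the session one at a time, for example as session.get(**page).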

To see what the data looks like, I ran the following:

res['result']['translations'][:5]
Data = pd.DataFrame(res['result']['translations'])
Data.head(5)
Data.describe()
Data.columns
Data.index
Data.shape
Data.info()

Check whether there are any missing or null values

sns.heatmap(Data.isnull(),cbar=False,yticklabels=False,cmap = 'viridis')
Data['sourceURL'].nunique()
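
The heatmap gives a visual impression. A numeric count per column can complement it. The sketch below uses a toy frame with deliberately missing entries (not the real data) to show the idea.

```python
import numpy as np
import pandas as pd

# Toy frame with deliberately missing entries (not the real data).
toy = pd.DataFrame({
    "sourceTitle": ["A", None, "C"],
    "targetTitle": ["x", "y", "z"],
    "words": [10.0, np.nan, np.nan],
})

# Number of null values in each column.
missing_per_column = toy.isnull().sum()

# Number of rows that contain at least one null value.
rows_with_missing = int(toy.isnull().any(axis=1).sum())
```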

There are 11 Columns

print(Data['publishedDate'].describe())
print(Data['sourceLanguage'].describe())
print(Data['sourceRevisionId'].describe())
print(Data['sourceTitle'].describe())
print(Data['sourceURL'].describe())
print(Data['targetLanguage'].describe())
print(Data['targetRevisionId'].describe())
print(Data['targetTitle'].describe())

Noted the Observations....

print(Data['targetURL'].describe())

As discussed in Contribution #1 (the Jupyter notebook named kickstart, which explored the data and other modules), stats is a Python dictionary, so accessing the value for each key can be cumbersome. I therefore turned each value of stats into a separate column of the data. The stats fields carry the most weight in this analysis.

Data_stats = Data.drop(columns='stats').assign(**Data['stats'].apply(pd.Series))
Data_stats[:5]
Data_stats.info()
Data_stats.describe()
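
To make the dictionary-to-columns step easier to follow, here is a minimal sketch on toy data. It uses a slightly different technique from the cell above: it builds a new DataFrame from the list of dictionaries and concatenates it, instead of calling apply(pd.Series). The values are invented.

```python
import pandas as pd

# Toy rows with a dictionary-valued column (values invented).
toy = pd.DataFrame({
    "targetTitle": ["x", "y"],
    "stats": [
        {"any": 0.9, "human": 0.4, "mt": 0.5},
        {"any": 0.2, "human": 0.2, "mt": 0.0},
    ],
})

# Build one column per dictionary key, aligned on the original index.
expanded = pd.DataFrame(toy["stats"].tolist(), index=toy.index)

# Replace the dictionary column with the expanded columns.
flat = pd.concat([toy.drop(columns="stats"), expanded], axis=1)
```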
Data_stats["any"][:10]
print(Data_stats['any'].describe())
Data_stats["mt"][:10]
print(Data_stats['mt'].describe())
Data_stats["mtSectionsCount"][:10]
print(Data_stats['mtSectionsCount'].describe())
Data_stats["human"][:10]
print(Data_stats['human'].describe())
sns.distplot(Data_stats['any'], color='g', bins=28, hist_kws={'alpha': 0.8});
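# Quality correlation matrix: numeric columns most correlated with 'any'
k = 12  # maximum number of variables in the heatmap
numeric_corr = Data_stats.select_dtypes('number').corr()
cols = numeric_corr.nlargest(k, 'any')['any'].index
cm = Data_stats[cols].corr()
import matplotlib.pyplot as plt
plt.figure(figsize=(7, 7))  # pyplot figure, so the heatmap is drawn on it
sns.heatmap(cm, annot=True, cmap='viridis')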
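
The ranking behind the correlation heatmap can be sketched on toy numbers as follows. The values are invented. The point is only that sorting one column of the correlation matrix orders the variables by how strongly they move with 'any'.

```python
import pandas as pd

# Toy values, invented for illustration.
toy = pd.DataFrame({
    "any":   [0.1, 0.4, 0.6, 0.9],
    "mt":    [0.1, 0.3, 0.5, 0.8],
    "human": [0.9, 0.2, 0.7, 0.1],
    "title": ["a", "b", "c", "d"],
})

# Correlation is only defined for numeric columns, so drop the text column first.
corr = toy.select_dtypes("number").corr()

# Rank the variables by their correlation with 'any', strongest first.
ranked = corr["any"].sort_values(ascending=False)
top_two = list(ranked.index[:2])
```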
sns.distplot(Data_stats['mt'], color='g', bins=28, hist_kws={'alpha': 0.8});
sns.distplot(Data_stats['human'], color='g', bins=28, hist_kws={'alpha': 0.8});
sns.distplot(Data_stats['mtSectionsCount'], color='g', bins=28, hist_kws={'alpha': 0.8});
# Sort by each stat in turn, largest values first
anyMore_Data_stats = Data_stats.sort_values(by='any', ascending=False)
anyMore_Data_stats.head()
mtMore_Data_stats = Data_stats.sort_values(by='mt', ascending=False)
mtMore_Data_stats.head()
humanMore_Data_stats = Data_stats.sort_values(by='human', ascending=False)
humanMore_Data_stats.head()
mtSectionsCountMore_Data_stats = Data_stats.sort_values(by='mtSectionsCount', ascending=False)
mtSectionsCountMore_Data_stats.head()

Sort by multiple columns:

Data_stats.sort_values(by=['mt', 'human'], ascending=[True, False]).head()
Data_stats.sort_values(by=[ 'human','mt'], ascending=[True, False]).head()
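
How the two sort keys interact is easier to see on a toy frame. In the sketch below (values invented), rows are ordered by 'mt' ascending first, and ties in 'mt' are broken by 'human' descending.

```python
import pandas as pd

# Toy values with ties in 'mt' so the second key matters.
toy = pd.DataFrame({
    "title": ["a", "b", "c", "d"],
    "mt":    [0.5, 0.1, 0.5, 0.1],
    "human": [0.2, 0.9, 0.7, 0.3],
})

# Primary key 'mt' ascending, secondary key 'human' descending.
ordered = toy.sort_values(by=["mt", "human"], ascending=[True, False])
order = list(ordered["title"])
```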

Better Comparison among the stats

col_names = ['any','mt', 'mtSectionsCount', 'human']

fig, ax = plt.subplots(len(col_names), figsize=(16,12))

for i, col_val in enumerate(col_names):

    sns.distplot(Data_stats[col_val], hist=True, ax=ax[i])
    ax[i].set_title('Freq dist '+col_val, fontsize=10)
    ax[i].set_xlabel(col_val, fontsize=8)
    ax[i].set_ylabel('Count', fontsize=8)

plt.show()
Data_stats1 = Data_stats[:10]
col_names = ['any','mt', 'mtSectionsCount', 'human']

fig, ax = plt.subplots(len(col_names), figsize=(8,40))

for i, col_val in enumerate(col_names):

    sns.boxplot(x=Data_stats1[col_val], ax=ax[i])
    ax[i].set_title('Box plot - {}'.format(col_val), fontsize=10)
    ax[i].set_xlabel(col_val, fontsize=8)

plt.show()
Data_stats2 = Data_stats.drop(['any','mt', 'mtSectionsCount', 'human'], axis=1)
sns.pairplot(Data_stats2)
 
import matplotlib.pyplot as plt

# Box plot for every numeric column of Data_stats, four plots per row
l = Data_stats.select_dtypes('number').columns.values
number_of_columns = 4
number_of_rows = (len(l) - 1) // number_of_columns + 1
plt.figure(figsize=(4 * number_of_columns, 5 * number_of_rows))
sns.set_style('whitegrid')
for i in range(len(l)):
    plt.subplot(number_of_rows, number_of_columns, i + 1)
    sns.boxplot(y=Data_stats[l[i]], color='green')
plt.tight_layout()

plt.figure(figsize=(7, 7))
# Heatmap of the first eight rows of the 'any' and 'mt' columns
sns.heatmap(Data_stats[['any', 'mt']][:8], cmap='Blues', annot=False)

Data_num = Data.select_dtypes(include=['float64', 'int64'])
Data_num1 = Data_num[:10]
Data_num1.hist(figsize=(16, 20), bins=50, xlabelsize=8, ylabelsize=8)