Understanding the impact of the Content Translation tool at its best

Importing the modules and exploring their usage

Pandas is a widely used data manipulation library built on the NumPy package. Its key data structure, the DataFrame, stores and manipulates tabular data as rows of observations and columns of variables.

import pandas as pd

NumPy is a general-purpose array-processing package for scientific computation.

import numpy as np

json handles JavaScript Object Notation (JSON), a lightweight data-interchange format. I ran multiple examples to get acquainted with the JSON format, its syntax, and its data types: string, number, object (JSON object), array, boolean, and null.

import json
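As a small illustration of the JSON data types listed above, here is a minimal sketch that parses a made-up record (not real API output) and prints the Python type each JSON type maps to. The keys and values are purely illustrative.

```python
import json

# A made-up JSON document; the keys and values are illustrative only.
raw = (
    '{"title": "Cat", "views": 42, "stats": {"mt": 0.5}, '
    '"tags": ["a", "b"], "live": true, "note": null}'
)

record = json.loads(raw)

# JSON string -> str, number -> int/float, object -> dict,
# array -> list, boolean -> bool, null -> None
for key, value in record.items():
    print(key, type(value).__name__)

# json.dumps turns the Python object back into JSON text,
# and parsing that text again gives an equal object.
round_trip = json.loads(json.dumps(record))
```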

re is Python's built-in library for working with regular expressions. I learnt its different functions and special sequences.

import re
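To make the use of re concrete, here is a minimal sketch that splits a 14-digit timestamp into year, month, and day. It assumes the YYYYMMDDHHMMSS layout that the publishedDate comparisons later in this notebook imply; the helper name split_timestamp and the sample values are made up for illustration.

```python
import re

# Assumed layout: 14 digits, YYYYMMDDHHMMSS. Named groups capture the date part.
TIMESTAMP = re.compile(r"^(?P<year>\d{4})(?P<month>\d{2})(?P<day>\d{2})\d{6}$")


def split_timestamp(value):
    """Return (year, month, day) as ints, or None if value does not match."""
    match = TIMESTAMP.match(value)
    if match is None:
        return None
    return int(match["year"]), int(match["month"]), int(match["day"])


print(split_timestamp("20170615123000"))
print(split_timestamp("not-a-date"))
```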

Seaborn and Matplotlib are imported for visualization. Analysis results are often conveyed better by graphics, which have a quicker and stronger impact than paragraphs of text describing the same results. Data visualisation is an essential part of analysis since it allows even non-programmers to decipher trends and patterns.

import seaborn as sns
import matplotlib.pyplot as plt

I have imported the MediaWiki API client (mwapi) and read through the sections and actions it supports, such as querying.

import mwapi

Working with the API

session = mwapi.Session(host='https://en.wikipedia.org',
                        user_agent='Miriiyala Pujitha Jaji')
te_session = mwapi.Session(host='https://te.wikipedia.org',
                        user_agent='Miriiyala Pujitha Jaji')

# Published translations from English ('en') to Telugu ('te'); set 'to' to 'hi' for Hindi
parameters = {'action':'query',
              'format':'json',
              'list':'cxpublishedtranslations',
              'from':'en',
              'to':'te',
              'limit':500,
              'offset':2000}
res2 = session.get(parameters)
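Since limit and offset control which slice of the results comes back, stepping the offset is one way to walk through more than 500 translations. The sketch below only builds the parameter dictionaries and makes no network request; cx_query_params is a hypothetical helper name, and the parameter keys are the ones used in the cell above.

```python
def cx_query_params(source, target, limit=500, offset=0):
    """Build the query parameters for one page of published translations.

    Hypothetical helper: it mirrors the dictionary used above and
    only varies the language pair, limit, and offset.
    """
    return {
        'action': 'query',
        'format': 'json',
        'list': 'cxpublishedtranslations',
        'from': source,
        'to': target,
        'limit': limit,
        'offset': offset,
    }


# Parameters for the first three pages of 500 results each.
pages = [cx_query_params('en', 'te', offset=start) for start in range(0, 1500, 500)]
print([page['offset'] for page in pages])
```

Each dictionary could then be passed to the session in the same way as parameters above, and the resulting translation lists concatenated.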

To see what the data looks like, I ran the following:

res2['result']['translations'][:15]
[]
Data = pd.DataFrame(res2['result']['translations'])
Data.head(15)

EDA plays a crucial role in understanding the what, why, and how of the problem statement. Exploratory data analysis is an approach to analyzing data sets by summarizing their main characteristics, often with visualizations. The insights it uncovers often help in building robust models.

Let's have a look at data dimensionality, feature names, and feature types.

Data.describe()
Data.columns
Data.index
Data.shape
Data.info()

To check whether there are any missing or null values:

sns.heatmap(Data.isnull(),cbar=False,yticklabels=False,cmap = 'viridis')

The heatmap shows that there are no null values.
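A heatmap gives a quick visual check; counting nulls per column confirms it numerically. This is a minimal sketch on toy data (not the real translations table), with column names borrowed from the data above.

```python
import numpy as np
import pandas as pd

# Toy data, not the real translations table.
toy = pd.DataFrame({
    'sourceTitle': ['Cat', 'Dog', None],
    'targetRevisionId': [101.0, np.nan, 103.0],
    'targetLanguage': ['te', 'te', 'te'],
})

# isnull() marks missing cells; sum() counts them per column.
null_counts = toy.isnull().sum()
print(null_counts)

# A single yes/no answer for the whole frame.
print('any nulls:', bool(toy.isnull().values.any()))
```

On the real data, the same two calls on Data should report zero for every column if the heatmap reading is right.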

Data['sourceURL'].nunique()

There are 11 columns.

print(Data['publishedDate'].describe())
print(Data['sourceLanguage'].describe())
print(Data['sourceRevisionId'].describe())
print(Data['sourceTitle'].describe())
print(Data['sourceURL'].describe())
print(Data['targetLanguage'].describe())
print(Data['targetRevisionId'].describe())
print(Data['targetTitle'].describe())

Observations noted.

print(Data['targetURL'].describe())

As discussed in the earlier Contribution #1, in the Jupyter notebook named kickstart (exploration of the data and other modules), stats is a Python dictionary, so accessing the value for each key can be awkward. I therefore turned each value of stats into a separate column of the data. The stats have the highest importance, as they drive everything in this analysis.

# Expand each key of the 'stats' dict into its own column, then drop the original column
Data_stats = Data.drop(columns='stats').assign(**Data['stats'].apply(pd.Series))
Data_stats[:5]
Data_stats.info()
Data_stats.describe()
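For comparison, here is a minimal sketch of the same dict-to-columns step on toy data, using pd.json_normalize instead of apply(pd.Series). This is an alternative technique, not the one used above, and the numbers are made up.

```python
import pandas as pd

# Toy rows with a dict-valued 'stats' column; the values are made up.
toy = pd.DataFrame({
    'sourceTitle': ['Cat', 'Dog'],
    'stats': [
        {'any': 0.9, 'mt': 0.6, 'human': 0.3, 'mtSectionsCount': 4},
        {'any': 0.5, 'mt': 0.1, 'human': 0.4, 'mtSectionsCount': 1},
    ],
})

# json_normalize turns a list of dicts into one column per key.
expanded = pd.json_normalize(toy['stats'].tolist())
expanded.index = toy.index  # keep rows aligned with the original frame

# Replace the dict column with the expanded columns.
toy_stats = toy.drop(columns='stats').join(expanded)
print(toy_stats.columns.tolist())
```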

Let's look at the newer published translations, since we are more focused on recent trends, although observing the history may also lead to interesting conclusions. Okay, let's do both! Nothing missed, no confusion.

As the results above show, the 'publishedDate' column has the object dtype. We should convert it to an integer dtype so that we can perform comparison operations to extract information about recent activity.

Data_stats = Data_stats.sort_values('publishedDate')

Using the astype method, I changed the publishedDate dtype to int64.

Data_stats['publishedDate'] = Data_stats['publishedDate'].astype('int64')
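Integer timestamps compare correctly, but they are not evenly spaced across month and year boundaries, which can distort a histogram. As an alternative sketch, the same values can be parsed into real datetimes. This assumes the 14-digit YYYYMMDDHHMMSS layout implied by the comparisons in this notebook; the sample values are made up.

```python
import pandas as pd

# Made-up timestamps in the assumed 14-digit layout.
toy = pd.DataFrame({
    'publishedDate': ['20160301101500', '20170715083000', '20180102120000'],
})

# Parse the strings into proper datetime values.
toy['published'] = pd.to_datetime(toy['publishedDate'], format='%Y%m%d%H%M%S')

# Filter by a readable cutoff date instead of a 14-digit integer.
recent = toy[toy['published'] >= pd.Timestamp('2017-06-01')]

# Count publications per calendar year.
per_year = toy['published'].dt.year.value_counts().sort_index()

print(recent)
print(per_year)
```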

Let us look at the frequency of publications by published date.

# histplot with kde=True replaces the deprecated distplot
sns.histplot(Data_stats['publishedDate'], kde=True)

Most of the published translations were made in 2016 and 2017!

Why not after that period? Was everything finished during that period? To be studied!

Let us try to get the translations published from 2018 onward.

Data_stats[Data_stats['publishedDate'] > 20180000000000]

Oops! There is not much data from 2018 within the limit (500) we have set, so let's go back another six months.

Data_stats[Data_stats['publishedDate'] > 20170600000000]
Data_stats["any"][:10]
print(Data_stats['any'].describe())
Data_stats["mt"][:10]
print(Data_stats['mt'].describe())
Data_stats["mtSectionsCount"][:10]
print(Data_stats['mtSectionsCount'].describe())
Data_stats["human"][:10]
print(Data_stats['human'].describe())

We have now looked at the recent period and at the individual stats columns:
any -> machine translation or human translation
mt -> machine translation
human -> human translation
mtSectionsCount -> number of sections translated
Let us now combine the date filter with each of them and see whether that leads to new conclusions.

Data_stats[Data_stats['publishedDate'] > 20170600000000].sort_values('any', ascending=True)
Data_stats[Data_stats['publishedDate'] > 20170600000000].sort_values('mt', ascending=True)
Data_stats[Data_stats['publishedDate'] > 20170600000000].sort_values('human', ascending=True)
Data_stats[Data_stats['publishedDate'] > 20170600000000].sort_values('mtSectionsCount', ascending=True)
# histplot with kde=True replaces the deprecated distplot
sns.histplot(Data_stats['any'], color='g', bins=28, kde=True, alpha=0.8);
sns.histplot(Data_stats['human'], color='g', bins=28, kde=True, alpha=0.8);
sns.histplot(Data_stats['mtSectionsCount'], color='g', bins=28, kde=True, alpha=0.8);
sns.histplot(Data_stats['mt'], color='g', bins=28, kde=True, alpha=0.8);
plt.rcParams['figure.figsize'] = (3,4)
sns.countplot(x='any',  data=Data_stats[:50])
plt.rcParams['figure.figsize'] = (3,4)
sns.countplot(x='mt',hue = 'mtSectionsCount' , data=Data_stats[:5])

The two plots above are not bad, but we may need to compare different parameters against each other to get meaningful observations. Let's do it.

Let us use scatter plots and try to gain some insights from them.

sns.jointplot(x="human", y="mt", data=Data_stats)
sns.jointplot(x="mt", y="mtSectionsCount", data=Data_stats)
sns.jointplot(x="any", y="mt", data=Data_stats)
sns.jointplot(x="any", y="mtSectionsCount", data=Data_stats)
sns.jointplot(x="any", y="human", data=Data_stats)
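The joint plots show pairwise relationships visually; a correlation matrix summarises the same pairs as numbers. This is a minimal sketch on toy values (not the real stats), using the Pearson correlation that DataFrame.corr computes by default.

```python
import pandas as pd

# Toy values only; they do not come from the real translations data.
toy = pd.DataFrame({
    'any': [0.6, 0.7, 0.8, 0.9],
    'mt': [0.1, 0.4, 0.6, 0.8],
    'human': [0.5, 0.3, 0.2, 0.1],
})

# Pairwise Pearson correlation between all numeric columns.
corr = toy.corr()
print(corr.round(2))
```

Applied to Data_stats[['any', 'mt', 'human', 'mtSectionsCount']], the same call would give one number per pair plotted above.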

Let us identify which articles, and how many, show post-translation effort such as edits or translations by a human or a machine. How were they changed? When were they changed or translated? How much of each was translated?

First, let us see which articles were translated the most by either a human or a machine.

anyMore_Data_stats=Data_stats.sort_values(by='any', ascending=False) #false indicates descending order
anyMore_Data_stats.head()

Second, let us see which articles were translated the most by machine alone, then by humans, and then by the number of machine-translated sections.

# Sort each view by its own column, in descending order
mtMore_Data_stats=Data_stats.sort_values(by='mt', ascending=False)
mtMore_Data_stats.head()
humanMore_Data_stats=Data_stats.sort_values(by='human', ascending=False)
humanMore_Data_stats.head()
mtSectionsCountMore_Data_stats=Data_stats.sort_values(by='mtSectionsCount', ascending=False)
mtSectionsCountMore_Data_stats.head()

Sorting by multiple columns:

Robust comparisons often lead us to sound conclusions!

Let's keep a detective's eye on the data so that no insightful comparison escapes us.

Data_stats.sort_values(by=[ 'mt','human',], ascending=[True, False]).head()
Data_stats.sort_values(by=[ 'human','mt'], ascending=[True, False]).head()
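To show how the multi-column sort breaks ties, here is a minimal sketch on toy values: rows are ordered by mt ascending, and rows with equal mt are ordered by human descending. The row labels a to d are made up.

```python
import pandas as pd

# Toy values; rows 'a' and 'b' tie on 'mt' so that 'human' decides their order.
toy = pd.DataFrame(
    {'mt': [0.2, 0.2, 0.1, 0.5], 'human': [0.3, 0.7, 0.9, 0.1]},
    index=['a', 'b', 'c', 'd'],
)

# First key ascending, second key descending.
ordered = toy.sort_values(by=['mt', 'human'], ascending=[True, False])
print(ordered)
```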
Better Comparison among the stats - the central part of Data Analysis
# __import__ returns the top-level matplotlib package, so pyplot is reached as plt.pyplot below
plt=__import__("matplotlib.pyplot")


col_names = ['any','mt', 'mtSectionsCount', 'human']

fig, ax = plt.pyplot.subplots(len(col_names), figsize=(16,12))

for i, col_val in enumerate(col_names):

    # histplot with kde=True replaces the deprecated distplot
    sns.histplot(Data_stats[col_val], kde=True, ax=ax[i])
    ax[i].set_title('Freq dist '+col_val, fontsize=10)
    ax[i].set_xlabel(col_val, fontsize=8)
    ax[i].set_ylabel('Count', fontsize=8)

plt.pyplot.show()

Outliers

Data_stats1 = Data_stats
col_names = ['any','mt', 'mtSectionsCount', 'human']

fig, ax = plt.pyplot.subplots(len(col_names), figsize=(8,40))

for i, col_val in enumerate(col_names):

    # Pass the column as x explicitly so the box is drawn horizontally
    sns.boxplot(x=Data_stats1[col_val], ax=ax[i])
    ax[i].set_title('Box plot - {}'.format(col_val), fontsize=10)
    ax[i].set_xlabel(col_val, fontsize=8)

plt.pyplot.show()
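Box plots flag points beyond 1.5 times the interquartile range from the quartiles by default. The sketch below applies that same rule numerically so the outliers can be listed rather than only seen. The helper name iqr_outliers and the toy values are made up for illustration.

```python
import pandas as pd


def iqr_outliers(series, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    return series[(series < lower) | (series > upper)]


# Toy section counts with one obviously extreme value.
toy = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 50], name='mtSectionsCount')
print(iqr_outliers(toy))
```

Calling iqr_outliers on each of the four stats columns would give the rows behind the points drawn outside the whiskers.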
 
 
 
 
 
 
"On top of the content that was translated, which the above notebook demonstrates ways to access, more data can be accessed about the translations and what occurred after them. Try comparing statistics about edits, pageviews, etc. between the source and translated versions of articles. More advanced analyses in a project might eventually compare translated articles with similar articles that were not translated or classify edits based upon their 'type' for more fine-grained analyses of what happens to translated articles."

"We would hope for a mixed-methods approach that uses both quantitative analyses (e.g., edit counts, topics that are more frequently translated, etc.) and qualitative analyses (e.g., content analysis of translated pages and subsequent edits, talk pages, etc.)."
 
!pip install git+https://github.com/mediawiki-utilities/python-mwviews.git

Now heading towards the actual quantitative analysis

First comes the pageviews

import mwviews
from mwviews.api import PageviewsClient
p = PageviewsClient(user_agent="Miriiyala Pujitha Jaji Outreachy Aspirant")
# p.article_views('en.wikipedia', ['Selfie', 'Cat', 'Dog'])
# p.project_views(['ro.wikipedia', 'de.wikipedia', 'commons.wikimedia'])
#p.top_articles('en.wikipedia', limit=10)
#p.article_views('en.wikipedia', ['Selfie', 'Cat'])
ViewsinEnHi= p.project_views(['en.wikipedia', 'hi.wikipedia','commons.wikimedia'])
ViewsinEnHi =  pd.DataFrame(ViewsinEnHi).transpose()
ViewsinEnHi.head()

Seeing such a significant number of views for Hindi Wikipedia, even though literacy rates in India are low, is really surprising, isn't it?

So far we have compared the views for Commons, English Wikipedia, and Hindi Wikipedia.

Interesting topic ahead! Let us see which articles got the most views in Hindi and in English. Hindi first!

p.top_articles('hi.wikipedia', limit=10)

How about English?

p.top_articles('en.wikipedia', limit=10)

Some Insights!

In both languages, Main_Page has the highest views, which is to be expected.

Special:Search is in the top 5 in both languages (position 2 in English and position 4 in Hindi), but not at the same position. Is that because Hindi-speaking readers do not know about this feature, or because they are more interested in the topics at positions 2 and 3?

Let us compare with more focus

viewsinEnglish = p.article_views('en.wikipedia', 'Special:Search', start = "20180101", end = "20190101")
viewsinEnglish = pd.DataFrame(viewsinEnglish).transpose()
viewsinEnglish.head()
viewsinHindi = p.article_views('hi.wikipedia','Special:Search' , start = "20180101", end = "20190101")
viewsinHindi = pd.DataFrame(viewsinHindi).transpose()
viewsinHindi.head()
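To compare the two series side by side, they can be joined on date. The sketch below uses made-up numbers and assumes the nested {date: {title: views}} shape that the transpose calls above imply; the column names en, hi, and hi_share are illustrative.

```python
import datetime as dt

import pandas as pd

# Made-up view counts in the assumed {date: {title: views}} shape.
en_raw = {
    dt.datetime(2018, 1, 1): {'Special:Search': 1000},
    dt.datetime(2018, 1, 2): {'Special:Search': 1200},
}
hi_raw = {
    dt.datetime(2018, 1, 1): {'Special:Search': 50},
    dt.datetime(2018, 1, 2): {'Special:Search': 30},
}

# Same reshaping as above: one row per date, one column per title.
en = pd.DataFrame(en_raw).transpose()['Special:Search']
hi = pd.DataFrame(hi_raw).transpose()['Special:Search']

# Align both series on date and compute the Hindi share of combined views.
combined = pd.DataFrame({'en': en, 'hi': hi})
combined['hi_share'] = combined['hi'] / (combined['en'] + combined['hi'])
print(combined)
```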

Okay, the views are done!

Meanwhile, what is going on at the developer side? Once articles are published, do they stay the same forever? No changes at all? Impossible, right? Because the world is constantly changing, articles are revised and updated often.

Now let us dig into the edits section!

Data_stats[:5]

Data_stats.head(12)

# df is assumed to be a DataFrame holding per-article edit counts in the
# columns 'englishEditCount' and 'hindiEditCount'
plt.rcParams['figure.figsize'] = [12, 10]
plt.pyplot.hist([df['englishEditCount'][~np.isnan(df['englishEditCount'])],
                 df['hindiEditCount'][~np.isnan(df['hindiEditCount'])]],
                bins=50, label=['English edits count', 'Hindi edits count'])
plt.pyplot.xlabel('Number of edits')
plt.pyplot.ylabel('Number of articles')
plt.pyplot.legend(loc='upper right')
plt.pyplot.show()

plt.rcParams['figure.figsize'] = [6, 5]
plt.pyplot.plot(df.loc[:50, 'englishEditCount'], label='English Edit Count')
plt.pyplot.plot(df.loc[:50, 'hindiEditCount'], label='Hindi Edit Count')
plt.pyplot.legend(loc='center right', bbox_to_anchor=(0.75, 0.5), ncol=1, fancybox=True, shadow=True)
plt.pyplot.title('hindiEditCount vs englishEditCount')
plt.pyplot.show()