Topic Modeling of Wikidata Items via fastText and WikiProjects taxonomy

The code in this notebook documents an example of how we might predict topics for a given Wikipedia article based upon its associated Wikidata item. In practice, the training dataset used will be much larger (all articles that are part of a WikiProject on English Wikipedia) and further model adjustments might be made.

Data

  • Features: a Wikipedia article is represented as a bag of Wikidata statements
  • Labels: topics are based on the WikiProjects that have tagged English Wikipedia articles. These WikiProjects are then mapped to ~40 different mid-level categories.
  • Example:
  • New topic predictions can then be made for any Wikidata item, even if it has no associated English Wikipedia article that has been tagged by a WikiProject.
  • In practice, WikiProject labels are relatively sparse (though they cover a substantial portion of English Wikipedia articles). Because many labels are therefore likely to be missing from our dataset, a substantial number of apparent false positives are probably true positives whose label was never assigned, i.e. the reported precision is a conservative estimate for classes where labeling is less consistent.
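
As a rough illustration of the representation described above, the sketch below turns a list of statements and a list of topics into one line of fastText's supervised format, where labels carry the `__label__` prefix. The helper name `to_fasttext_line` and the input layout are assumptions made for illustration; the preprocessing that produced the actual training files is not shown in this notebook.

```python
def to_fasttext_line(statements, topics):
    """Build one fastText training line from Wikidata statements and topic labels.

    statements: list of (property, value) pairs. Assumption: value is None when the
                statement's value is not a Wikidata item (e.g. a date or a
                coordinate), in which case only the property is kept.
    topics:     list of topic names such as 'Geography.Americas'.
    """
    tokens = []
    for prop, value in statements:
        tokens.append(prop)
        if value is not None:
            tokens.append(value)
    # fastText expects whitespace-free labels prefixed with __label__
    labels = ['__label__' + t.replace(' ', '_') for t in topics]
    return ' '.join(labels + tokens)

# Statements and topics taken from Example 3 in the output further down.
line = to_fasttext_line([('P856', None), ('P625', None), ('P17', 'Q16')],
                        ['History_And_Society.History_and_society', 'Geography.Americas'])
print(line)
```

Because the item is a bag of statements, the order of the tokens on the line carries no meaning when `wordNgrams` is 1.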

fastText Model

The fastText classification model is a simple linear model that learns an embedding for each vocabulary word (in this case, Wikidata properties and values), averages those embeddings for a given document, and then learns a multinomial logistic classifier on top of this document embedding. In practice, it is very quick and often matches (or exceeds) the performance of more complex approaches.
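
The forward pass described above can be sketched in a few lines of NumPy. This is a conceptual illustration with random, untrained weights and a toy vocabulary, not fastText's actual implementation; all names in it are hypothetical. Because this notebook trains with the one-vs-all loss, the sketch applies an independent sigmoid per label rather than a softmax.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ['P31', 'Q5', 'P106', 'Q36180']   # toy vocabulary of properties and values
labels = ['Culture.Language_and_literature', 'Geography.Europe', 'STEM.Biology']
dim = 50

word_vectors = rng.normal(scale=0.1, size=(len(vocab), dim))      # one embedding per token
output_weights = rng.normal(scale=0.1, size=(len(labels), dim))   # one linear classifier per label

def predict_proba(tokens):
    # tokens outside the vocabulary are ignored
    idx = [vocab.index(t) for t in tokens if t in vocab]
    doc_vector = word_vectors[idx].mean(axis=0)    # average the token embeddings
    scores = output_weights @ doc_vector           # linear layer on the document embedding
    return 1.0 / (1.0 + np.exp(-scores))           # independent sigmoid per label

probs = predict_proba(['P31', 'Q5', 'P106', 'Q36180', 'P9999'])
print(dict(zip(labels, probs.round(3))))
```

With untrained weights the probabilities all sit close to 0.5; training adjusts both matrices so that the correct labels score high.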

Imports, parameters, etc.

!pip install fasttext
# imports
import fasttext
import numpy as np
import pandas as pd
# parameters
input_train_fn = './wikidata_mlcs_fasttext_train_98.txt'
input_test_fn = './wikidata_mlcs_fasttext_test_02.txt'
verbosity = 2
learning_rate = 0.1
epochs = 25
# 50-dimensional vectors are learned for properties/values and for the Wikidata item embedding from which the label predictions are made
ndim = 50
# one-vs-all: multi-label classification
loss = 'ova'
# treat each property / value as independent -- i.e. order does not matter
wordngrams = 1
# property or value must appear at least k times in the training data to be included
minCount = 3

Train model

model = fasttext.train_supervised(input=input_train_fn, minCount=minCount, wordNgrams=wordngrams,
                                  lr=learning_rate, epoch=epochs, dim=ndim, loss=loss, verbose=verbosity)
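
Once trained, the model can score any item that is expressed as a string of properties and values. The sketch below shows one way to keep only the labels above a probability threshold. The helper `labels_above_threshold` is hypothetical, and the prediction tuple is a made-up stand-in with the same (labels, probabilities) shape that `model.predict(text, k=-1)` returns, so the block runs without a trained model.

```python
def labels_above_threshold(prediction, threshold=0.5):
    """Keep labels whose predicted probability exceeds the threshold.

    prediction: (labels, probabilities) pair, as returned by model.predict(text, k=-1)
    """
    labels, probabilities = prediction
    return [(lbl.replace('__label__', ''), float(p))
            for lbl, p in zip(labels, probabilities) if p > threshold]

# Stand-in for model.predict('P31 Q5 P106 Q36180', k=-1); the probabilities are invented.
example_prediction = (('__label__Culture.Language_and_literature',
                       '__label__Geography.Europe',
                       '__label__STEM.Biology'),
                      [0.91, 0.62, 0.03])
print(labels_above_threshold(example_prediction))
```

With the one-vs-all loss each label is scored independently, so an item can receive several topics or none at all.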

Example datapoints

num_examples = 5
with open(input_train_fn, 'r') as fin:
    with open(input_train_fn.replace('train_', 'train_qids_'), 'r') as fin_qids:
        for i, datapoint in enumerate(fin, start=1):
            claims, topics = model.get_line(datapoint.strip())
            qid = next(fin_qids).strip()
            print("\nExample {0}.".format(i))
            print("QID: https://www.wikidata.org/wiki/{0}".format(qid))
            print("x-data: {0}".format(claims))
            print("y-data: {0}".format(topics))
            if i == num_examples:
                break
Example 1.
QID: https://www.wikidata.org/wiki/Q17143790 Smart_Approaches_to_Marijuana
x-data: ['P31', 'Q43229', 'P571', '</s>']
y-data: ['__label__Geography.Americas', '__label__STEM.Biology', '__label__History_And_Society.History_and_society', '__label__History_And_Society.Politics_and_government']

Example 2.
QID: https://www.wikidata.org/wiki/Q24928029 Babu_Banarasi_Das_Indoor_Stadium
x-data: ['P131', 'Q1498', 'P31', 'Q641226', 'P137', 'Q799599', 'P466', 'Q14507242', 'P17', 'Q668', 'P127', 'Q5589328', 'P641', 'Q7291', 'P1083', 'P856', 'P625', '</s>']
y-data: ['__label__Geography.Asia']

Example 3.
QID: https://www.wikidata.org/wiki/Q55625419 North_Country_Fair
x-data: ['P856', 'P625', 'P17', 'Q16', '</s>']
y-data: ['__label__History_And_Society.History_and_society', '__label__Geography.Americas']

Example 4.
QID: https://www.wikidata.org/wiki/Q333900 Friedrich_Reinhold_Kreutzwald
x-data: ['P3762', 'P2924', 'P950', 'P213', 'P268', 'P2163', 'P463', 'Q265058', 'P20', 'Q13972', 'P269', 'P1280', 'P119', 'Q7278380', 'P1207', 'P1412', 'Q9072', 'P1412', 'Q188', 'P906', 'P409', 'P5587', 'P69', 'Q204181', 'P227', 'P1296', 'P569', 'P646', 'P1343', 'Q678259', 'P1343', 'Q59995154', 'P1343', 'Q1960551', 'P434', 'P214', 'P1315', 'P3154', 'P1871', 'P570', 'P1899', 'P735', 'Q14038597', 'P735', 'Q18091397', 'P19', 'Q1013773', 'P6886', 'Q188', 'P373', 'P3222', 'P1368', 'P734', 'Q43780741', 'P1695', 'P244', 'P27', 'Q34266', 'P1442', 'P1417', 'P1938', 'P21', 'Q6581097', 'P31', 'Q5', 'P18', 'P1006', 'P1015', 'P106', 'Q333634', 'P106', 'Q49757', 'P106', 'Q4853732', 'P106', 'Q551835', 'P106', 'Q36180', 'P106', 'Q64733534', 'P949', 'P910', 'Q9896837', 'P1017', 'P691', 'P648', 'P1233', '</s>']
y-data: ['__label__Culture.Language_and_literature', '__label__Geography.Europe']

Example 5.
QID: https://www.wikidata.org/wiki/Q1145471 Kuchipudi
x-data: ['P31', 'Q11639', 'P373', 'P18', 'P1417', 'P646', 'P2924', 'P2581', 'P3827', 'P495', 'Q668', 'P279', 'Q1990304', 'P3417', 'P910', 'Q8576786', '</s>']
y-data: ['__label__Culture.Performing_arts', '__label__History_And_Society.History_and_society', '__label__Geography.Asia']

Collect statistics

# build statistics dataframe for printing
def ft_to_toplevel(fasttext_lbl):
    return fasttext_lbl.replace('__label__','').split('.')[0]

lbl_statistics = {}
toplevel_statistics = {}
threshold = 0.5
all_lbls = model.get_labels()
for lbl in all_lbls:
    lbl_statistics[lbl] = {'n':0, 'FP':0, 'TP':0, 'FN':0, 'TN':0}
    toplevel_statistics[ft_to_toplevel(lbl)] = {'n':0, 'FP':0, 'TP':0, 'FN':0, 'TN':0}
with open(input_test_fn, 'r') as fin:
    for line_no, datapoint in enumerate(fin):
        claims, topics = model.get_line(datapoint.strip())
        prediction = model.predict(datapoint.strip(), k=-1)
        predicted_labels = [l for idx, l in enumerate(prediction[0]) if prediction[1][idx] > threshold]
        for lbl in all_lbls:
            if lbl in topics and lbl in predicted_labels:
                lbl_statistics[lbl]['n'] += 1
                lbl_statistics[lbl]['TP'] += 1
            elif lbl in topics:
                lbl_statistics[lbl]['n'] += 1
                lbl_statistics[lbl]['FN'] += 1
            elif lbl in predicted_labels:
                lbl_statistics[lbl]['FP'] += 1
            else:
                lbl_statistics[lbl]['TN'] += 1
        toplevel_topics = [ft_to_toplevel(l) for l in topics]
        toplevel_predictions = [ft_to_toplevel(l) for l in predicted_labels]
        for lbl in toplevel_statistics:
            if lbl in toplevel_topics and lbl in toplevel_predictions:
                toplevel_statistics[lbl]['n'] += 1
                toplevel_statistics[lbl]['TP'] += 1
            elif lbl in toplevel_topics:
                toplevel_statistics[lbl]['n'] += 1
                toplevel_statistics[lbl]['FN'] += 1
            elif lbl in toplevel_predictions:
                toplevel_statistics[lbl]['FP'] += 1
            else:
                toplevel_statistics[lbl]['TN'] += 1
            

            
for lbl in all_lbls:
    s = lbl_statistics[lbl]
    try:
        s['precision'] = s['TP'] / (s['TP'] + s['FP'])
    except ZeroDivisionError:
        s['precision'] = 0
    try:
        s['recall'] = s['TP'] / (s['TP'] + s['FN'])
    except ZeroDivisionError:
        s['recall'] = 0
    try:
        s['f1'] = 2 * (s['precision'] * s['recall']) / (s['precision'] + s['recall'])
    except ZeroDivisionError:
        s['f1'] = 0
        
for lbl in toplevel_statistics:
    s = toplevel_statistics[lbl]
    try:
        s['precision'] = s['TP'] / (s['TP'] + s['FP'])
    except ZeroDivisionError:
        s['precision'] = 0
    try:
        s['recall'] = s['TP'] / (s['TP'] + s['FN'])
    except ZeroDivisionError:
        s['recall'] = 0
    try:
        s['f1'] = 2 * (s['precision'] * s['recall']) / (s['precision'] + s['recall'])
    except ZeroDivisionError:
        s['f1'] = 0
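
The two loops above repeat the same arithmetic, so it could be factored into one helper. The function below is a minimal sketch of that refactoring (the name `add_metrics` is hypothetical); it guards against division by zero in the same way as the try/except blocks. The example counts are those of Culture.Language and literature in the full statistics table.

```python
def add_metrics(counts):
    """Add precision, recall and F1 to a dict holding TP, FP and FN counts."""
    tp, fp, fn = counts['TP'], counts['FP'], counts['FN']
    counts['precision'] = tp / (tp + fp) if (tp + fp) > 0 else 0
    counts['recall'] = tp / (tp + fn) if (tp + fn) > 0 else 0
    denom = counts['precision'] + counts['recall']
    counts['f1'] = 2 * counts['precision'] * counts['recall'] / denom if denom > 0 else 0
    return counts

example = add_metrics({'n': 3071, 'TP': 2826, 'FP': 160, 'FN': 245, 'TN': 6155})
print(example['precision'], example['recall'], example['f1'])
```

It could then be applied to every entry of `lbl_statistics` and `toplevel_statistics` in a single loop each.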

Full statistics

mlc_statistics = pd.DataFrame(lbl_statistics).T
mlc_statistics['mid-level-category'] = [s.replace('__label__', '').replace('_', ' ') for s in mlc_statistics.index]
mlc_statistics.set_index('mid-level-category', inplace=True)
mlc_statistics.insert(1, '', '-->')
for col in ['n','TP','FP','TN','FN']:
    mlc_statistics[col] = mlc_statistics[col].astype('int32')
mlc_statistics[['n','','TP','FP','TN','FN','precision','recall','f1']]
mid-level-category                               n         TP   FP    TN   FN precision    recall        f1
Culture.Language and literature               3071  -->  2826  160  6155  245  0.946417  0.920221  0.933135
Geography.Europe                              2262  -->  1601  305  6819  661  0.839979  0.707781  0.768234
Geography.Americas                            1672  -->  1105  244  7470  567  0.819125  0.660885  0.731546
Culture.Sports                                1553  -->  1337   54  7779  216  0.961179  0.860914  0.908288
Geography.Asia                                1417  -->   867  111  7858  550  0.886503  0.611856  0.724008
History And Society.History and society        995  -->   301  117  8274  694  0.720096  0.302513  0.426044
STEM.Biology                                   844  -->   548   45  8497  296  0.924115  0.649289  0.762700
Culture.Philosophy and religion                639  -->   256   52  8695  383  0.831169  0.400626  0.540655
History And Society.Politics and government    562  -->   227   72  8752  335  0.759197  0.403915  0.527294
STEM.Technology                                473  -->   211   55  8858  262  0.793233  0.446089  0.571042
Culture.Broadcasting                           394  -->   270   42  8950  124  0.865385  0.685279  0.764873
Culture.Entertainment                          380  -->   223   47  8959  157  0.825926  0.586842  0.686154
Culture.Plastic arts                           357  -->   214   46  8983  143  0.823077  0.599440  0.693679
History And Society.Transportation             316  -->   184   18  9052  132  0.910891  0.582278  0.710425
Culture.Performing arts                        351  -->   123   68  8967  228  0.643979  0.350427  0.453875
Geography.Oceania                              321  -->   187   26  9039  134  0.877934  0.582555  0.700375
History And Society.Military and warfare       319  -->   164   41  9026  155  0.800000  0.514107  0.625954
History And Society.Business and economics     269  -->   113   62  9055  156  0.645714  0.420074  0.509009
Culture.Music                                  259  -->   115  112  9015  144  0.506608  0.444015  0.473251
Geography.Africa                               244  -->   123   14  9128  121  0.897810  0.504098  0.645669
STEM.Medicine                                  241  -->    95   31  9114  146  0.753968  0.394191  0.517711
History And Society.Education                  229  -->   121   22  9135  108  0.846154  0.528384  0.650538
Culture.Visual arts                            172  -->    85   23  9191   87  0.787037  0.494186  0.607143
STEM.Science                                   136  -->    25   16  9234  111  0.609756  0.183824  0.282486
Culture.Food and drink                         129  -->    65   15  9242   64  0.812500  0.503876  0.622010
Culture.Games and toys                         137  -->    59   31  9218   78  0.655556  0.430657  0.519824
STEM.Geosciences                               127  -->    65    5  9254   62  0.928571  0.511811  0.659898
STEM.Chemistry                                 112  -->    47   14  9260   65  0.770492  0.419643  0.543353
Geography.Bodies of water                       99  -->    66    7  9280   33  0.904110  0.666667  0.767442
Geography.Landforms                             81  -->    51    5  9300   30  0.910714  0.629630  0.744526
STEM.Space                                      69  -->    43    8  9309   26  0.843137  0.623188  0.716667
STEM.Time                                       73  -->    25    4  9309   48  0.862069  0.342466  0.490196
STEM.Meteorology                                55  -->    24    4  9327   31  0.857143  0.436364  0.578313
STEM.Physics                                    65  -->    14    3  9318   51  0.823529  0.215385  0.341463
Culture.Crafts and hobbies                      57  -->    15    2  9327   42  0.882353  0.263158  0.405405
Culture.Media                                   33  -->     3    3  9350   30  0.500000  0.090909  0.153846
STEM.Mathematics                                35  -->    15    1  9350   20  0.937500  0.428571  0.588235
STEM.Information science                        27  -->    10    2  9357   17  0.833333  0.370370  0.512821
Culture.Arts                                    26  -->    14    1  9359   12  0.933333  0.538462  0.682927
Culture.Internet culture                        21  -->     5    1  9364   16  0.833333  0.238095  0.370370
Geography.Maps                                  15  -->     1    1  9370   14  0.500000  0.066667  0.117647

Top-level categories only

# e.g., if label is STEM.Mathematics and STEM.Information Science is predicted, then that's considered correct still
tlc_statistics = pd.DataFrame(toplevel_statistics).T
tlc_statistics['top-level-category'] = [s.replace('__label__', '').replace('_', ' ') for s in tlc_statistics.index]
tlc_statistics.set_index('top-level-category', inplace=True)
tlc_statistics.insert(1, '', '-->')
tlc_statistics
top-level-category         FN          FP      TN      TP        f1       n  precision    recall
Culture                1076.0  -->  170.0  3640.0  4500.0  0.878392  5576.0   0.963597  0.807030
Geography              1714.0  -->  513.0  3158.0  4001.0  0.782286  5715.0   0.886354  0.700087
History And Society    1246.0  -->  240.0  6799.0  1101.0  0.597072  2347.0   0.821029  0.469110
STEM                    901.0  -->   99.0  7264.0  1122.0  0.691739  2023.0   0.918919  0.554622
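
To make the top-level scoring rule concrete, the small sketch below reuses `ft_to_toplevel` from the statistics cell and compares a mid-level match with a top-level match for a single hypothetical item, whose true label is STEM.Mathematics and whose prediction is STEM.Information_science.

```python
def ft_to_toplevel(fasttext_lbl):
    # '__label__STEM.Mathematics' -> 'STEM'
    return fasttext_lbl.replace('__label__', '').split('.')[0]

# Hypothetical item: labeled as mathematics, predicted as information science.
true_labels = ['__label__STEM.Mathematics']
predicted_labels = ['__label__STEM.Information_science']

mid_level_hit = bool(set(true_labels) & set(predicted_labels))
top_level_hit = bool({ft_to_toplevel(l) for l in true_labels}
                     & {ft_to_toplevel(l) for l in predicted_labels})
print(mid_level_hit, top_level_hit)  # False True
```

This is why the top-level figures are higher than the mid-level ones: a prediction only has to land in the right branch of the taxonomy.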
# order from here: https://github.com/wikimedia/drafttopic/blob/master/model_info/enwiki.drafttopic.md
# allows for easier comparison with word-embeddings model
class_order = ['Geography.Oceania',
               'STEM.Mathematics',
               'STEM.Science',
               'STEM.Meteorology',
               'Culture.Sports',
               'Culture.Performing arts',
               'Culture.Entertainment',
               'Assistance.Article improvement and grading',
               'Culture.Language and literature',
               'Culture.Visual arts',
               'STEM.Biology',
               'History And Society.Business and economics',
               'Assistance.Files',
               'History And Society.History and society',
               'STEM.Medicine',
               'Culture.Crafts and hobbies',
               'STEM.Geosciences',
               'Culture.Food and drink',
               'History And Society.Transportation',
               'Geography.Cities',
               'Geography.Landforms',
               'Assistance.Maintenance',
               'STEM.Information science',
               'STEM.Time',
               'Geography.Europe',
               'STEM.Engineering',
               'Culture.Media',
               'STEM.Technology',
               'STEM.Space',
               'History And Society.Education',
               'Geography.Countries',
               'History And Society.Military and warfare',
               'Culture.Plastic arts',
               'STEM.Physics',
               'History And Society.Politics and government',
               'STEM.Chemistry',
               'Culture.Broadcasting',
               'Geography.Maps',
               'Culture.Arts',
               'Culture.Internet culture',
               'Geography.Bodies of water',
               'Assistance.Contents systems',
               'Culture.Philosophy and religion']
class_order = class_order + [i for i in mlc_statistics.index if i not in class_order]
mlc_statistics = mlc_statistics.reindex(class_order)
print("Statistics:")
print("counts (n={0})".format(line_no + 1))
display(mlc_statistics[['n','','TP','FP','FN','TN']])
Statistics:
counts (n=9386)
mid-level-category                                 n           TP     FP     FN      TN
Geography.Oceania                              321.0  -->   187.0   26.0  134.0  9039.0
STEM.Mathematics                                35.0  -->    15.0    1.0   20.0  9350.0
STEM.Science                                   136.0  -->    25.0   16.0  111.0  9234.0
STEM.Meteorology                                55.0  -->    24.0    4.0   31.0  9327.0
Culture.Sports                                1553.0  -->  1337.0   54.0  216.0  7779.0
Culture.Performing arts                        351.0  -->   123.0   68.0  228.0  8967.0
Culture.Entertainment                          380.0  -->   223.0   47.0  157.0  8959.0
Assistance.Article improvement and grading       NaN  NaN     NaN    NaN    NaN     NaN
Culture.Language and literature               3071.0  -->  2826.0  160.0  245.0  6155.0
Culture.Visual arts                            172.0  -->    85.0   23.0   87.0  9191.0
STEM.Biology                                   844.0  -->   548.0   45.0  296.0  8497.0
History And Society.Business and economics     269.0  -->   113.0   62.0  156.0  9055.0
Assistance.Files                                 NaN  NaN     NaN    NaN    NaN     NaN
History And Society.History and society        995.0  -->   301.0  117.0  694.0  8274.0
STEM.Medicine                                  241.0  -->    95.0   31.0  146.0  9114.0
Culture.Crafts and hobbies                      57.0  -->    15.0    2.0   42.0  9327.0
STEM.Geosciences                               127.0  -->    65.0    5.0   62.0  9254.0
Culture.Food and drink                         129.0  -->    65.0   15.0   64.0  9242.0
History And Society.Transportation             316.0  -->   184.0   18.0  132.0  9052.0
Geography.Cities                                 NaN  NaN     NaN    NaN    NaN     NaN
Geography.Landforms                             81.0  -->    51.0    5.0   30.0  9300.0
Assistance.Maintenance                           NaN  NaN     NaN    NaN    NaN     NaN
STEM.Information science                        27.0  -->    10.0    2.0   17.0  9357.0
STEM.Time                                       73.0  -->    25.0    4.0   48.0  9309.0
Geography.Europe                              2262.0  -->  1601.0  305.0  661.0  6819.0
STEM.Engineering                                 NaN  NaN     NaN    NaN    NaN     NaN
Culture.Media                                   33.0  -->     3.0    3.0   30.0  9350.0
STEM.Technology                                473.0  -->   211.0   55.0  262.0  8858.0
STEM.Space                                      69.0  -->    43.0    8.0   26.0  9309.0
History And Society.Education                  229.0  -->   121.0   22.0  108.0  9135.0
Geography.Countries                              NaN  NaN     NaN    NaN    NaN     NaN
History And Society.Military and warfare       319.0  -->   164.0   41.0  155.0  9026.0
Culture.Plastic arts                           357.0  -->   214.0   46.0  143.0  8983.0
STEM.Physics                                    65.0  -->    14.0    3.0   51.0  9318.0
History And Society.Politics and government    562.0  -->   227.0   72.0  335.0  8752.0
STEM.Chemistry                                 112.0  -->    47.0   14.0   65.0  9260.0
Culture.Broadcasting                           394.0  -->   270.0   42.0  124.0  8950.0
Geography.Maps                                  15.0  -->     1.0    1.0   14.0  9370.0
Culture.Arts                                    26.0  -->    14.0    1.0   12.0  9359.0
Culture.Internet culture                        21.0  -->     5.0    1.0   16.0  9364.0
Geography.Bodies of water                       99.0  -->    66.0    7.0   33.0  9280.0
Assistance.Contents systems                      NaN  NaN     NaN    NaN    NaN     NaN
Culture.Philosophy and religion                639.0  -->   256.0   52.0  383.0  8695.0
Geography.Americas                            1672.0  -->  1105.0  244.0  567.0  7470.0
Geography.Asia                                1417.0  -->   867.0  111.0  550.0  7858.0
Culture.Music                                  259.0  -->   115.0  112.0  144.0  9015.0
Geography.Africa                               244.0  -->   123.0   14.0  121.0  9128.0
Culture.Games and toys                         137.0  -->    59.0   31.0   78.0  9218.0
class_order = ['History_And_Society.Education',
               'STEM.Geosciences',
               'Culture.Language and literature',
               'Assistance.Maintenance',
               'STEM.Technology',
               'Geography.Cities',
               'Culture.Sports',
               'STEM.Chemistry',
               'STEM.Physics',
               'Culture.Broadcasting',
               'Assistance.Contents systems',
               'Geography.Oceania',
               'Assistance.Files',
               'Geography.Maps',
               'Assistance.Article improvement and grading',
               'Geography.Landforms',
               'Culture.Visual arts',
               'STEM.Medicine',
               'Culture.Plastic arts',
               'Culture.Arts',
               'Culture.Food and drink',
               'STEM.Information science',
               'STEM.Engineering',
               'Culture.Philosophy and religion',
               'STEM.Science',
               'Culture.Crafts and hobbies',
               'History_And_Society.Business and economics',
               'Geography.Countries',
               'STEM.Time',
               'STEM.Biology',
               'History_And_Society.Transportation',
               'STEM.Meteorology',
               'History_And_Society.Politics and government',
               'Culture.Internet culture',
               'History_And_Society.Military and warfare',
               'Culture.Media',
               'STEM.Mathematics',
               'STEM.Space',
               'Culture.Performing arts',
               'Geography.Bodies of water',
               'Geography.Europe',
               'History_And_Society.History and society',
               'Culture.Entertainment']
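# NOTE: reindex() only keeps rows whose names appear in class_order. The 'History_And_Society.*'
# entries above use underscores while the index uses 'History And Society.*', so those rows
# (and the classes missing from the list, e.g. Geography.Americas) come back as NaN and are
# dropped from `report` below. The averages that follow therefore cover 30 of the 41 classes.
mlc_statistics = mlc_statistics.reindex(class_order)
report = mlc_statistics[~mlc_statistics['precision'].isnull()]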

Recall

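# weighting per-class recall by n (= TP + FN) pools the counts, so this equals micro-averaged recall
micro = np.average(report['recall'], weights=report['n'])
# unweighted mean over classes
macro = np.nanmean(report['recall'])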
print("recall (micro={0:.3f}, macro={1:.3f})".format(micro, macro))
display(report['recall'])
recall (micro=0.693, macro=0.474)
mid-level-category
STEM.Geosciences                   0.511811
Culture.Language and literature    0.920221
STEM.Technology                    0.446089
Culture.Sports                     0.860914
STEM.Chemistry                     0.419643
STEM.Physics                       0.215385
Culture.Broadcasting               0.685279
Geography.Oceania                  0.582555
Geography.Maps                     0.066667
Geography.Landforms                0.629630
Culture.Visual arts                0.494186
STEM.Medicine                      0.394191
Culture.Plastic arts               0.599440
Culture.Arts                       0.538462
Culture.Food and drink             0.503876
STEM.Information science           0.370370
Culture.Philosophy and religion    0.400626
STEM.Science                       0.183824
Culture.Crafts and hobbies         0.263158
STEM.Time                          0.342466
STEM.Biology                       0.649289
STEM.Meteorology                   0.436364
Culture.Internet culture           0.238095
Culture.Media                      0.090909
STEM.Mathematics                   0.428571
STEM.Space                         0.623188
Culture.Performing arts            0.350427
Geography.Bodies of water          0.666667
Geography.Europe                   0.707781
Culture.Entertainment              0.586842
Name: recall, dtype: float64

Precision

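# support-weighted average of per-class precision (weights = n); this is not the pooled
# micro-precision sum(TP) / (sum(TP) + sum(FP)), although it is labeled "micro" below
micro = np.average(report['precision'], weights=report['n'])
# unweighted mean over classes
macro = np.nanmean(report['precision'])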
print("precision (micro={0:.3f}, macro={1:.3f})".format(micro, macro))
display(report['precision'])
precision (micro=0.876, macro=0.821)
mid-level-category
STEM.Geosciences                   0.928571
Culture.Language and literature    0.946417
STEM.Technology                    0.793233
Culture.Sports                     0.961179
STEM.Chemistry                     0.770492
STEM.Physics                       0.823529
Culture.Broadcasting               0.865385
Geography.Oceania                  0.877934
Geography.Maps                     0.500000
Geography.Landforms                0.910714
Culture.Visual arts                0.787037
STEM.Medicine                      0.753968
Culture.Plastic arts               0.823077
Culture.Arts                       0.933333
Culture.Food and drink             0.812500
STEM.Information science           0.833333
Culture.Philosophy and religion    0.831169
STEM.Science                       0.609756
Culture.Crafts and hobbies         0.882353
STEM.Time                          0.862069
STEM.Biology                       0.924115
STEM.Meteorology                   0.857143
Culture.Internet culture           0.833333
Culture.Media                      0.500000
STEM.Mathematics                   0.937500
STEM.Space                         0.843137
Culture.Performing arts            0.643979
Geography.Bodies of water          0.904110
Geography.Europe                   0.839979
Culture.Entertainment              0.825926
Name: precision, dtype: float64
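
The cell above weights per-class precision by `n`, which gives a support-weighted average rather than the pooled micro-average. The sketch below shows the difference on three rows copied from the full statistics table; it is only an illustration of the two definitions and is not meant to replace the reported figures.

```python
import numpy as np

# (n, TP, FP) for three classes, copied from the full statistics table
counts = {
    'Culture.Sports': (1553, 1337, 54),
    'Culture.Music':  (259, 115, 112),
    'Geography.Maps': (15, 1, 1),
}
n = np.array([c[0] for c in counts.values()], dtype=float)
tp = np.array([c[1] for c in counts.values()], dtype=float)
fp = np.array([c[2] for c in counts.values()], dtype=float)

per_class_precision = tp / (tp + fp)
support_weighted = np.average(per_class_precision, weights=n)   # same weighting as the cell above
pooled_micro = tp.sum() / (tp.sum() + fp.sum())                 # micro-average from pooled counts
macro = per_class_precision.mean()                              # unweighted mean over classes
print("weighted={0:.3f}, micro={1:.3f}, macro={2:.3f}".format(support_weighted, pooled_micro, macro))
```

For recall the two definitions coincide, because the weight `n` is exactly the denominator TP + FN of each class's recall.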

F1

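# support-weighted average of per-class F1 (weights = n), labeled "micro" below
micro = np.average(report['f1'], weights=report['n'])
# unweighted mean over classes
macro = np.average(report['f1'])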
print("f1 (micro={0:.3f}, macro={1:.3f})".format(micro, macro))
display(report['f1'])
f1 (micro=0.763, macro=0.583)
mid-level-category
STEM.Geosciences                   0.659898
Culture.Language and literature    0.933135
STEM.Technology                    0.571042
Culture.Sports                     0.908288
STEM.Chemistry                     0.543353
STEM.Physics                       0.341463
Culture.Broadcasting               0.764873
Geography.Oceania                  0.700375
Geography.Maps                     0.117647
Geography.Landforms                0.744526
Culture.Visual arts                0.607143
STEM.Medicine                      0.517711
Culture.Plastic arts               0.693679
Culture.Arts                       0.682927
Culture.Food and drink             0.622010
STEM.Information science           0.512821
Culture.Philosophy and religion    0.540655
STEM.Science                       0.282486
Culture.Crafts and hobbies         0.405405
STEM.Time                          0.490196
STEM.Biology                       0.762700
STEM.Meteorology                   0.578313
Culture.Internet culture           0.370370
Culture.Media                      0.153846
STEM.Mathematics                   0.588235
STEM.Space                         0.716667
Culture.Performing arts            0.453875
Geography.Bodies of water          0.767442
Geography.Europe                   0.768234
Culture.Entertainment              0.686154
Name: f1, dtype: float64