Checking The Met open data consistency against Wikidata

This set of scripts will check the consistency of the Met data set and Wikidata.

For objects, we should check the title, accession number, creator, creation date and instance of info.

Import Met CSV database into a dataframe

This is more than 400,000 rows, so it may take 10-20 seconds or more

%%time
import csv
from urllib.request import urlopen
import codecs
# from tqdm import tqdm
# from tqdm._tqdm_notebook import tqdm_notebook
import requests
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from dateutil.parser import parse

# The Met's weekly CSV dump URL in Github is quite big at 250 Mbytes
# Comment out for now, since this is slow to load. 
# url = 'https://media.githubusercontent.com/media/metmuseum/openaccess/master/MetObjects.csv'

# Use local copy of CSV file for speed and read the CSV file into a pandas dataframe
url = 'metmuseum/MetObjects-20190425.csv'
met_df = pd.read_csv(url,low_memory=False)
CPU times: user 8.36 s, sys: 904 ms, total: 9.26 s
Wall time: 9.05 s

Examine the structure of rows

Take a look at some of the rows. NaN means "not a number" or a blank from the CSV file.

met_df.sample(3)
Object Number Is Highlight Is Public Domain Object ID Department Object Name Title Culture Period Dynasty ... Locale Locus Excavation River Classification Rights and Reproduction Link Resource Metadata Date Repository Tags
365896 22.1.1329 False False 559520 Egyptian Art Spacer\r\n Spacer NaN Middle Kingdom–Early New Kingdom Dynasty 12–18 ... Cemetery NaN MMA excavations, 1920–22 NaN NaN NaN http://www.metmuseum.org/art/collection/search... 4/22/2019 8:00:03 AM Metropolitan Museum of Art, New York, NY NaN
250216 23.112.493 False True 378568 Drawings and Prints Drawing Design for Omnibus or Wagonette, no. 3369 NaN NaN NaN ... NaN NaN NaN NaN Drawings NaN http://www.metmuseum.org/art/collection/search... 4/22/2019 8:00:03 AM Metropolitan Museum of Art, New York, NY Carriages
135256 41.205.3c False True 226810 European Sculpture and Decorative Arts Panel Firescreen panel French NaN NaN ... NaN NaN NaN NaN Textiles-Embroidered NaN http://www.metmuseum.org/art/collection/search... 4/22/2019 8:00:03 AM Metropolitan Museum of Art, New York, NY Insignia|Embroidery

3 rows × 44 columns

Statistics on most used and unique terms

The most frequently used terms can be found in the row labeled "top." In summary:

  1. Drawings and prints departement has the most items
  2. Object number "62.635" is used four times, which we can investigate below.
met_df.describe(include=[np.object])
Object Number Department Object Name Title Culture Period Dynasty Reign Portfolio Artist Role ... Locale Locus Excavation River Classification Rights and Reproduction Link Resource Metadata Date Repository Tags
count 494311 494311 489851 463050 209050 89550 23284 11205 22217 285440 ... 15562 7329 15967 2098 437900 24940 494311 494311 494311 277566
unique 491608 19 28302 251441 7267 1811 381 380 3274 6326 ... 859 1352 364 230 1227 1427 494311 1 1 55172
top 62.635 Drawings and Prints Print Fragment American Edo period (1615–1868) Dynasty 18 reign of Amenhotep III Les Images De Tous Les Saincts et Saintes de L... Artist ... Cemetery debris MMA excavations Upper Sepik River Prints © Walker Evans Archive, The Metropolitan Museu... http://www.metmuseum.org/art/collection/search... 4/22/2019 8:00:03 AM Metropolitan Museum of Art, New York, NY Men|Portraits|Baseball|Athletes
freq 4 178182 101945 7011 28404 8879 7178 2730 601 115028 ... 1925 416 2387 361 76967 9204 1 494311 494311 9706

4 rows × 39 columns

met_df[met_df['Object Number'].str.match('^62.635$')]
Object Number Is Highlight Is Public Domain Object ID Department Object Name Title Culture Period Dynasty ... Locale Locus Excavation River Classification Rights and Reproduction Link Resource Metadata Date Repository Tags
267275 62.635 False True 396034 Drawings and Prints Print The Delivery of the Keys to Peter NaN NaN NaN ... NaN NaN NaN NaN Prints NaN http://www.metmuseum.org/art/collection/search... 4/22/2019 8:00:03 AM Metropolitan Museum of Art, New York, NY Sheep|Christ|Christian Imagery|Saint Peter|Keys
475029 62.635 False False 772842 Drawings and Prints Print Air NaN NaN NaN ... NaN NaN NaN NaN Prints NaN http://www.metmuseum.org/art/collection/search... 4/22/2019 8:00:03 AM Metropolitan Museum of Art, New York, NY NaN
475031 62.635 False False 772844 Drawings and Prints Print Water NaN NaN NaN ... NaN NaN NaN NaN Prints NaN http://www.metmuseum.org/art/collection/search... 4/22/2019 8:00:03 AM Metropolitan Museum of Art, New York, NY NaN
476670 62.635 False False 774810 Drawings and Prints Print Dilapidated Tower and Gothic Gate NaN NaN NaN ... NaN NaN NaN NaN Prints NaN http://www.metmuseum.org/art/collection/search... 4/22/2019 8:00:03 AM Metropolitan Museum of Art, New York, NY NaN

4 rows × 44 columns

met_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 494311 entries, 0 to 494310
Data columns (total 44 columns):
Object Number              494311 non-null object
Is Highlight               494311 non-null bool
Is Public Domain           494311 non-null bool
Object ID                  494311 non-null int64
Department                 494311 non-null object
Object Name                489851 non-null object
Title                      463050 non-null object
Culture                    209050 non-null object
Period                     89550 non-null object
Dynasty                    23284 non-null object
Reign                      11205 non-null object
Portfolio                  22217 non-null object
Artist Role                285440 non-null object
Artist Prefix              98323 non-null object
Artist Display Name        287530 non-null object
Artist Display Bio         237989 non-null object
Artist Suffix              12208 non-null object
Artist Alpha Sort          287495 non-null object
Artist Nationality         193440 non-null object
Artist Begin Date          240617 non-null object
Artist End Date            237808 non-null object
Object Date                479252 non-null object
Object Begin Date          494311 non-null int64
Object End Date            494311 non-null int64
Medium                     486728 non-null object
Dimensions                 417876 non-null object
Credit Line                493520 non-null object
Geography Type             60369 non-null object
City                       32248 non-null object
State                      2805 non-null object
County                     8576 non-null object
Country                    76812 non-null object
Region                     31975 non-null object
Subregion                  22277 non-null object
Locale                     15562 non-null object
Locus                      7329 non-null object
Excavation                 15967 non-null object
River                      2098 non-null object
Classification             437900 non-null object
Rights and Reproduction    24940 non-null object
Link Resource              494311 non-null object
Metadata Date              494311 non-null object
Repository                 494311 non-null object
Tags                       277566 non-null object
dtypes: bool(2), int64(3), object(39)
memory usage: 159.3+ MB
met_df[['Object ID','Object Name','Title','Department','Object Date','Object Begin Date','Object End Date']].sample(20)
Object ID Object Name Title Department Object Date Object Begin Date Object End Date
248200 376395 Print Sultana wearing a pearl necklace and a turban,... Drawings and Prints 1649–50 1649 1650
445916 726443 Baseball card, print Card Number 134, Steve Mura, San Diego Padres Drawings and Prints 1981 1981 1981
362289 555415 Amulet, Taweret Taweret Amulet Egyptian Art ca. 1295–1070 B.C. -1295 -1070
205752 327242 Cylinder seal Cylinder seal Ancient Near Eastern Art ca. 19th–16th century B.C. -1900 -1500
238121 366043 Print Landscape No. 16 Drawings and Prints 1958 1958 1958
396555 649532 Photograph [Construction of the Interborough Rapid Transi... Photographs 1900–1907 1900 1907
369436 563291 Jar Label Jar label fragments Egyptian Art ca. 1390–1352 B.C. -1390 -1390
274884 403779 Baseball card, photograph James H. "Jim" Devlin, Pitcher, St. Louis Brow... Drawings and Prints 1887 1887 1887
179316 282803 Photograph [Street Scene, Toledo, Ohio] Photographs ca. 1915 1914 1916
317186 451665 Bowl Bowl Islamic Art 15th–16th century 1400 1599
387311 636744 Print Wampum, from the Indian Gum series (R73), issu... Drawings and Prints 1933 1933 1933
31092 36410 Folding fan mounted as an album leaf 近代 田世光 海棠小鳥 扇面|Flower and Bird Asian Art 20th century 1914 1999
6944 7413 Shaving stand Shaving stand American Decorative Arts 1700–30 1710 1730
306740 436467 Painting, miniature Portrait of a Child European Paintings NaN 1790 1800
382329 629974 Ornament Figure Arts of Africa, Oceania, and the Americas A.D. 900–1400 900 1400
323101 460967 Frame Astragal frame Robert Lehman Collection early 17th century(?) 1600 1625
147461 246437 Bracelet Bracelet Greek and Roman Art 1st–3rd century A.D. 1 300
473061 769131 Fragment Ceramic Fragment Islamic Art 8th–12th century 700 1199
341387 489044 Figure Figure Modern and Contemporary Art 1934 1934 1934
308958 443348 Fragment Fragment Islamic Art 5th century 400 499

Examine random artist names

met_df['Artist Display Name'].sample(10)
157181                                        Hieron|Makron
223579    Kenneth Grahame|Robert James Enraght Moony|Bel...
460246         American Tobacco Company|W. Duke, Sons & Co.
224537    Jean Ingelow|University Press: Welch, Bigelow ...
192395                                                  NaN
480356                                   Bowman Gum Company
423104                                                  NaN
86666                                    René-Jules Lalique
179915    Davis Pratt|Cyril Connolly|Walker Evans|Olivia...
434373                                                  NaN
Name: Artist Display Name, dtype: object

Public domain works

met_df['Is Public Domain'].value_counts().plot.pie(autopct='%.0f')
<matplotlib.axes._subplots.AxesSubplot at 0x7f64b27c0d30>

Most frequent artists

Photographer Walker Evans leads the list here, followed by publishers of trade or baseball cards (Kinney Brothers Tobacco, Allen & Ginter, W. Duke, Sons, et al), ephemera and other materials.

met_df['Artist Display Name'].value_counts()[:50]
Walker Evans                                                  9353
Kinney Brothers Tobacco Company                               4282
Unknown                                                       3586
Allen & Ginter                                                3184
W. Duke, Sons & Co.                                           3148
Topps Chewing Gum Company                                     3068
Goodwin & Company                                             2746
Brewster & Co.                                                2696
Bowman Gum Company                                            2633
American Tobacco Company                                      2122
Exhibit Supply Company                                        1664
The Diamond Match Company                                     1305
Goudey Gum Company                                            1208
Anonymous                                                     1094
Reginald Marsh                                                1007
Anonymous, British, 19th century                               989
W. Duke, Sons & Co.|Knapp & Company                            984
Giovanni Battista Piranesi                                     930
Auguste Edouart                                                870
Jacques Callot|Israël Henriet                                  862
William S. Kimball & Company                                   829
Gum, Inc.                                                      798
Turkish Trophies Cigarettes                                    751
National Chicle Gum Company, Cambridge, Massachusetts          740
Albrecht Dürer                                                 729
John Singer Sargent                                            716
Wenceslaus Hollar                                              715
Jacques Androuet Du Cerceau                                    686
Utagawa Hiroshige                                              673
Anonymous, French, 19th century                                621
J. Claude Frères & Co.|Unknown Designer                        602
Jean Le Pautre|Pierre Mariette le fils                         589
W. Duke, Sons & Co.|American Tobacco Company                   587
American Tobacco Company|W. Duke, Sons & Co.                   564
American Caramel Company, Lancaster and York, Pennsylvania     559
Giorgio di Sant'Angelo                                         555
Stefano della Bella                                            549
James McNeill Whistler                                         535
Unidentified Artist                                            523
Jules-Edmond-Charles Lachaise|Eugène-Pierre Gourdet            519
Jacques Callot                                                 516
Anonymous, French, 18th century                                501
Meissen Manufactory                                            494
André-Adolphe-Eugène Disdéri                                   490
Stephen Shore                                                  469
Charles James                                                  465
Thomas Rowlandson                                              464
Alphonse Bertillon                                             440
Katsushika Hokusai                                             422
American Caramel Company, Philadelphia                         417
Name: Artist Display Name, dtype: int64

Anonymous creators

Many entries in the Met database are "Unknown", "Anonymous" or "Anonymous..." with the addition of some qualifying details. Sometimes it is "Anonymous|..." with a pipe symbol, and other details are included. Here are the most frequent uses of "Anonymous." We should somehow capture this info in Wikidata, but it's possible that we might not want to and just infer the era instead from the artwork's Wikidata inception time?

Question - Should we keep this distinction on "Unknown" and "Anonymous"?

Question - Should we capture the nationality and era of the "Anonymous" creator in Wikidata?

We may want to sidestep this at this point.

met_df[met_df['Artist Display Name'].str.match('Anonymous',na=False)]['Artist Display Name'].value_counts()[:50]
Anonymous                                                                                                      1094
Anonymous, British, 19th century                                                                                989
Anonymous, French, 19th century                                                                                 621
Anonymous, French, 18th century                                                                                 501
Anonymous, Italian, 16th century                                                                                254
Anonymous, American, 19th century                                                                               233
Anonymous, Italian, Roman-Bolognese, 17th century                                                               230
Anonymous, French, 16th century                                                                                 207
Anonymous, British, 18th century                                                                                192
Anonymous, Alsatian, 19th century                                                                               180
Anonymous, Italian, 17th century                                                                                172
Anonymous, American, 20th century                                                                               169
Anonymous, French, 20th century                                                                                 166
Anonymous, Italian, Piedmontese, 18th century                                                                   158
Anonymous, Italian, 19th century                                                                                123
Anonymous, Italian, first half of the 18th century                                                              100
Anonymous, Italian, 18th century                                                                                 97
Anonymous, Czech, early 20th century                                                                             90
Anonymous, German, 19th century                                                                                  86
Anonymous, German, 15th century                                                                                  83
Anonymous, Dutch, 17th century                                                                                   83
Anonymous, French, 17th century                                                                                  62
Anonymous, Irish, 19th century                                                                                   55
Anonymous, Italian or Spanish, late 15th to early 16th century|Master F|Giovanni Battista di Jacopo Cennini      50
Anonymous, American, 20th century|American Lithographic Co.                                                      46
Anonymous, Spanish, School of Seville, 17th century                                                              45
Anonymous, American, 19th to 20th centuries                                                                      37
Anonymous, 19th century                                                                                          35
Anonymous, German, 18th century                                                                                  32
Anonymous, German, 16th century                                                                                  32
Anonymous|Remondini Family                                                                                       32
Anonymous, Italian                                                                                               31
Anonymous, Netherlandish, 16th century                                                                           28
Anonymous, French, 18th century|Antoine Caire-Morand|Anonymous, French, 19th century                             28
Anonymous, French, 16th century|Jacques Androuet Du Cerceau                                                      27
Anonymous, Italian, 17th or 18th century                                                                         27
Anonymous, 17th century|Guido Reni                                                                               26
Anonymous, Chinese, 19th century                                                                                 26
Anonymous, Italian, 16th century|Parmigianino (Girolamo Francesco Maria Mazzola)                                 25
Anonymous|Jacques Callot                                                                                         25
Anonymous, American, 18th century                                                                                25
Anonymous, Spanish, School of Seville, 17th century|Pedro  Duque Cornejo                                         25
Anonymous, British, early 19th century                                                                           24
Anonymous|Antonio Lafreri                                                                                        24
Anonymous, Japanese, 19th Century                                                                                23
Anonymous, British, late 18th–early 19th century                                                                 23
Anonymous, French                                                                                                23
Anonymous, Italian, 16th century|Enea Vico                                                                       22
Anonymous, German, 17th century                                                                                  21
Anonymous, British                                                                                               20
Name: Artist Display Name, dtype: int64
met_df[met_df['Artist Display Name'].str.match('Painter',na=False)]['Artist Display Name'].value_counts()[:50]
Painter of Berlin 2268                     20
Painter of London D 12                     20
Painter of the Paris Gigantomachy          14
Painter of Brussels R 330                  10
Painter of London E 777                     9
Painter of Bologna 417                      8
Painter of London E 106                     6
Painter Z                                   5
Painter of Munich 2676                      4
Painter of the New York Hypnos              4
Painter of Louvre G 265                     4
Painter of London E 100                     4
Painter of Munich 2335                      3
Painter of Louvre G 456                     3
Painter of London E 342                     3
Painter of the Dublin Situlae               3
Painter of the Louvre Komoi                 3
Painter of the Vienna Stamnos 318           2
Painter of Louvre CA 1694                   2
Painter of London B 76                      2
Painter of Palermo 4                        2
Painter of the Louvre Centauromachy         2
Painter of London E 80                      2
Painter of the Yale Cup                     2
Painter of Vatican 365                      2
Painter of New York 07                      2
Painter of the Boulogne Horse               2
Painter of Würzburg 487                     2
Painter of Bologna 228                      2
Painter of the Nicosia Olpe                 2
Painter of the Yale Lekythos                2
Painter of Boston C.A.                      1
Painter of the Cambridge Hydria             1
Painter of London E 497                     1
Painter of the Bari Orestes                 1
Painter of Munich 2363                      1
Painter of Munich 1842                      1
Painter of Palermo 1162                     1
Painter of London E 100|Orleans Painter     1
Painter of Heidelberg 209                   1
Painter of Tarquinia RC 6847                1
Painter of the Villa Giulia 1664            1
Painter of Berlin 2464                      1
Painter of London E 2                       1
Painter of the New York Centauromachy       1
Painter of the Long Overfalls               1
Painter of London E 317                     1
Painter of New York 30.115.26               1
Painter of Vienna 113                       1
Painter of Oxford 213                       1
Name: Artist Display Name, dtype: int64
met_df[met_df['Artist Display Name'].str.match('Master',na=False)]['Artist Display Name'].value_counts()[:50]
Master G.A.                                                                                      59
Master F. P.|Parmigianino (Girolamo Francesco Maria Mazzola)                                     45
Master F. P.                                                                                     40
Master ES                                                                                        37
Master of the Die|Antonio Salamanca|Michiel Coxie (I)                                            17
Master F                                                                                         16
Master of the Aeneid                                                                             16
Master G.A.|Lambert Suavius                                                                      16
Master of the Die|Raphael (Raffaello Sanzio or Santi)                                            14
Master with the Name of Jesus                                                                    13
Master FG|Francesco Primaticcio                                                                  12
Master of the Die                                                                                11
Master A D                                                                                       10
Master G.A.|Anonymous, Italian, 16th century                                                     10
Master PS|Jacques Prevost de Gray                                                                10
Master IAM of Zwolle                                                                              8
Master I.B.                                                                                       7
Master of the Riccardiana Lactantius                                                              7
Master of the Die|Baldassare Tommaso Peruzzi                                                      6
Master S.E.|Jacques Androuet Du Cerceau                                                           6
Master of the Triptych of Louis XII                                                               6
Master A. F.                                                                                      6
Master I.S.                                                                                       6
Master B                                                                                          6
Master F. P.|Parmigianino (Girolamo Francesco Maria Mazzola)|Anonymous, Italian, 16th century     6
Master FVB                                                                                        5
Master FG                                                                                         5
Master IO. F.F. (Giovanni Fondulino dei Fonduli)                                                  5
Master of the Devasano Pado Kalpasutra                                                            4
Master at the Court of Mankot                                                                     4
Master of the Die|Perino del Vaga (Pietro Buonaccorsi)|Antonio Salamanca                          4
Master P.R.K                                                                                      4
Master of the Virgin Mary's Reliquary Casket                                                      4
Master IQV|Giulio Romano                                                                          4
Master PW of Cologne                                                                              4
Master of the Die|Michiel Coxie (I)                                                               4
Master of the Die|Perino del Vaga (Pietro Buonaccorsi)                                            4
Master of the Playing Cards                                                                       4
Master HL                                                                                         3
Master PS|Polidoro da Caravaggio                                                                  3
Master of the Die|Raphael (Raffaello Sanzio or Santi)|Antonio Lafreri                             3
Master of the Orcagnesque Misericordia                                                            3
Master of the Magdalen                                                                            3
Master of the Die|Philippe Thomassin                                                              3
Masters of the Dispersed Bhagavata Purana                                                         3
Master of Rimini                                                                                  3
Master of the Martyrdom of St. Sebastian                                                          3
Master of the Orléans Triptych                                                                    3
Master of the Saint Ursula Legend                                                                 3
Master I. C. (probably Jean Court or Jean de Court)|Heinrich Aldegrever                           2
Name: Artist Display Name, dtype: int64
met_df[met_df['Artist Display Name'].str.match('School',na=False)]['Artist Display Name'].value_counts()[:50]
School of Rembrandt van Rijn                                                                                                                                                                                                                                                                                                                          7
School of Nottingham                                                                                                                                                                                                                                                                                                                                  7
School of Cambiaso|Luca Cambiaso                                                                                                                                                                                                                                                                                                                      2
School of Antwerp                                                                                                                                                                                                                                                                                                                                     2
School of Cambiaso                                                                                                                                                                                                                                                                                                                                    2
School of Burgundy                                                                                                                                                                                                                                                                                                                                    2
School of Rembrandt van Rijn|Rembrandt (Rembrandt van Rijn)                                                                                                                                                                                                                                                                                           1
School of Toulon                                                                                                                                                                                                                                                                                                                                      1
School of Tielke                                                                                                                                                                                                                                                                                                                                      1
School of Zuccarelli                                                                                                                                                                                                                                                                                                                                  1
School of Art Needlework|Princess Helena Augusta Victoria of Great Britain and Ireland|Lady Marianne Margaret Compton Cust, Viscountess Alford|Leticia "Lily" Higgin|Sampson Low, Marston, Searle & Rivington|Vincent Brooks, Day & Son|William Morris|Selwyn Image|Fairfax Wade|Walter Crane|Mary Herbert|George Aitchison|Sir Edward Burne-Jones    1
School of Troyes                                                                                                                                                                                                                                                                                                                                      1
Name: Artist Display Name, dtype: int64
ax = met_df[met_df['Artist Display Name'].str.match('Anonymous',na=False)]['Artist Display Name'].value_counts()[:25].plot(kind='barh')

ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
ax.spines['left'].set_visible(False)
ax.spines['bottom'].set_visible(False)

# Switch off ticks
ax.tick_params(axis="both", which="both", bottom=False, top=False, labelbottom=True, left=False, right=False, labelleft=True)

# Draw vertical axis lines
vals = ax.get_xticks()
for tick in vals:
    ax.axvline(x=tick, linestyle='dashed', alpha=0.4, color='#eeeeee', zorder=1)

# Set x-axis label
ax.set_xlabel("Count", labelpad=20, weight='bold', size=12)

# Set y-axis label
ax.set_ylabel("Artist", labelpad=20, weight='bold', size=12)

# Format y-axis label
# ax.xaxis.set_major_formatter(StrMethodFormatter('{x:,g}'))
    
ax.invert_yaxis()
met_df['Artist Display Name'].isin(['Anonymous']).value_counts(normalize=True)
False    0.997787
True     0.002213
Name: Artist Display Name, dtype: float64

Let's take a look at some of the examples of using a pipe (|) in the Anonymous entries

Question - Is there a consistent interpretation of the pipe symbol, in terms of it being an "and" or an "or" ... or does it depend on the department and their use of it differently?

met_df[met_df['Artist Display Name'].str.match('^Anonymous.*\|',na=False)]['Artist Display Name']
30538     Anonymous, Italian|Enea Vico|Polidoro da Carav...
140504    Anonymous Flemish weavers|Pieter Coecke van Aelst
207275    Anonymous, Netherlandish, 19th century|G. van ...
207621    Anonymous, German, 19th century|Anonymous, Ger...
207622    Anonymous, German, 19th century|Anonymous, Ger...
207629    Anonymous, German, 16th century|Anonymous, Swi...
207759                       Anonymous|Marcantonio Raimondi
207795    Anonymous, Swiss, 16th Century|Niklaus Manuel ...
207890    Anonymous, American, 19th century|Century Maga...
208017    Anonymous, Dutch, 17th century|Allart van Ever...
208018    Anonymous, Netherlandish, 17th century|Anonymo...
208020    Anonymous, Netherlandish, 17th century|Willem ...
208023    Anonymous, Dutch, 17th century|Anonymous, Dutc...
208024    Anonymous, Netherlandish, 17th century|Anonymo...
208025    Anonymous, Netherlandish, 17th century|Anonymo...
208027    Anonymous, Dutch, 17th century|Anonymous, Dutc...
208028    Anonymous, Flemish, 18th century|Jan Brueghel ...
208029    Anonymous, Flemish, 18th century|Jan Brueghel ...
208083    Anonymous, Netherlandish, 16th century|Anonymo...
208253    Anonymous, Netherlandish, 16th century|Jan de ...
208256    Anonymous, Netherlandish, 16th century|Anonymo...
208262    Anonymous, Flemish, 17th century|Anthony van Dyck
208268    Anonymous, Flemish, 18th century|Anonymous, Ne...
208467    Anonymous, German, 19th century|Moritz von Sch...
208495    Anonymous, German, 16th century|Hans Holbein t...
208522    Anonymous, German, 19th century|Anonymous, Ger...
208528    Anonymous, German, 19th century|Anonymous, Ger...
208544    Anonymous, German, 19th century|Anonymous, Ger...
208563    Anonymous, German, 16th century|Friedrich Sustris
208609    Anonymous, Netherlandish, 16th century|Anonymo...
                                ...                        
493117                           Anonymous|Remondini Family
493122                           Anonymous|Remondini Family
493124                           Anonymous|Remondini Family
493125                           Anonymous|Remondini Family
493135                           Anonymous|Remondini Family
493138                           Anonymous|Remondini Family
493140                           Anonymous|Remondini Family
493142                           Anonymous|Remondini Family
493143                           Anonymous|Remondini Family
493149                           Anonymous|Remondini Family
493175                           Anonymous|Remondini Family
493203                           Anonymous|Remondini Family
493214                           Anonymous|Remondini Family
493215                           Anonymous|Remondini Family
493222                           Anonymous|Remondini Family
493229                           Anonymous|Remondini Family
493230                           Anonymous|Remondini Family
493231                           Anonymous|Remondini Family
493233                           Anonymous|Remondini Family
493254                           Anonymous|Remondini Family
493257                           Anonymous|Remondini Family
493260                           Anonymous|Remondini Family
493261                           Anonymous|Remondini Family
493262                           Anonymous|Remondini Family
493291    Anonymous, British, early 19th century|Rudolph...
493293     Anonymous, British, 19th century|George Thompson
493294     Anonymous, British, 19th century|George Thompson
493517    Anonymous, British, 19th century|Henry William...
493525    Anonymous, British, 19th century|Henry William...
494020                           Anonymous|Remondini Family
Name: Artist Display Name, Length: 1277, dtype: object
met_df[met_df['Artist Display Name'].str.match('Anonymous',na=False)]['Artist Display Name'].sample(10)
218987             Anonymous, Italian, 17th or 18th century
254418    Anonymous, French, 16th century|Anonymous, 16t...
234504                      Anonymous, French, 18th century
263700                      Anonymous, French, 19th century
282318    Anonymous, Italian or Spanish, late 15th to ea...
232814                      Anonymous, French, 16th century
262929    Anonymous, Italian, 16th century|Parmigianino ...
258120                     Anonymous, British, 19th century
242829    Anonymous, Italian, 16th century|Enea Vico|Ant...
213259    Anonymous, Italian, Roman-Bolognese, 17th century
Name: Artist Display Name, dtype: object

Analyzing textile classifications and object names

met_df[met_df['Object Name'].str.match('[Tt]extile',na=False)][['Object ID','Title','Object Name','Classification']]
Object ID Title Object Name Classification
16022 17453 Textile Textile Textiles
17295 21107 Textile with Floral Medallions in a Decorated ... Textile with Floral Medallions in a Decorated ... Textiles
32299 37631 NaN Textile panel Textiles-Embroidered
34056 39495 中亞或華北(十三世紀) 蓮池水禽紋緙絲|Textile with Aquatic Birds... Textile Textiles-Tapestries
34147 39598 NaN Textile Textiles-Woven
34261 39726 NaN Textile Textiles-Woven
34262 39727 NaN Textile Textiles-Woven
34264 39729 NaN Textile Textiles-Woven
34275 39740 NaN Textile Textiles-Woven
34278 39743 唐末宋初 彩繪花鳥嬰戯紋羅|Textile fragment with boys an... Textile Textiles-Painted
34578 40108 NaN Textile Textiles-Woven
36433 42117 元 童子戯蓮紋羅|Vertical pendant with boys holding... Textile pendant Textiles-Woven
38756 44607 NaN Textile Sample Book Textiles-Sample Books
38757 44608 NaN Textile Sample Book Textiles-Sample Books
38758 44609 NaN Textile Sample Book Textiles-Sample Books
38854 44705 NaN Textile fragment Textiles-Woven
38858 44709 NaN Textile for a handscroll Paintings
38859 44710 NaN Textile for a handscroll Paintings
39732 45694 NaN Textile fragment Textiles-Woven
44669 51486 清 緙絲隸書聯|Calligraphy Couplet Textile Textiles-Tapestries
46455 53664 NaN Textile Textiles-Woven
46456 53665 NaN Textile Textiles-Woven
46457 53666 NaN Textile Textiles-Woven
46458 53667 NaN Textile Textiles-Woven
46459 53668 NaN Textile Textiles-Embroidered
46460 53669 NaN Textile Textiles-Woven
46461 53670 NaN Textile Textiles-Woven
46462 53671 NaN Textile Textiles-Woven
46463 53672 NaN Textile Textiles-Woven
46464 53673 NaN Textile Textiles-Woven
... ... ... ... ...
426930 698659 Textile Fragment Textile fragment Textiles-Brocades
426931 698660 Textile Fragment Textile fragment Textiles-Brocades
426932 698661 Textile Fragment Textile fragment Textiles
426933 698662 Three Textile Fragments Textile fragments Textiles
426934 698663 Textile Fragment Textile fragment Textiles
426935 698664 Three textile fragments Textile fragments Textiles-Brocades
426936 698665 Textile Fragment Textile fragment Textiles-Brocades
426937 698666 Textile Fragment Textile fragment Textiles-Brocades
426939 698668 Textile Fragment Textile fragment Textiles-Brocades
426940 698669 Textile Fragment Textile fragment Textiles-Brocades
426941 698670 Textile Fragment Textile fragment Textiles-Brocades
426942 698671 Textile Fragment Textile fragment Textiles
426943 698672 Textile Fragment Textile fragment Textiles
426944 698673 Textile Fragment Textile fragment Textiles
426946 698675 Four Textile Fragments Textile fragments Textiles-Brocades
426947 698676 Textile Fragment Textile fragment Textiles
426948 698677 Fragment of a Brocaded Silk Textile fragment Textiles-Brocades
436312 711931 Two attached fragments Textile fragments Textiles
440445 717754 Lampas Fragment and case Textile fragment Textiles
443700 722092 Tabard with Pelicans Textile Feathers-Costumes
443712 722116 Tunic Textile Textiles-Woven
443720 722127 Mantle Textile Textiles-Costumes
443721 722129 Checkerboard Tunic Textile Textiles-Woven
443722 722130 Tabard with Lizard-Like Creatures Textile Textiles-Featherwork
443733 722143 Miniature Checkerboard Tunic Textile Textiles-Woven
478011 776412 City Hall, New York and Street Criers Textile ; print Textiles-Printed
487622 812961 Wool Roundel with Mythological Animal Textile Textiles-Woven
487624 812963 Textile with Diaper Pattern Textile Textiles-Woven
487625 812964 Wool textile with Geometric Pattern Textile Textiles-Woven
487626 812965 Silk Squares on Linen Textile Textiles-Woven

2764 rows × 4 columns

ax = met_df[met_df['Object Name'].str.match('[Tt]extile',na=False)]['Object Name'].value_counts()[:15].plot(kind='barh')

ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
ax.spines['left'].set_visible(False)
ax.spines['bottom'].set_visible(False)

# Switch off ticks
ax.tick_params(axis="both", which="both", bottom=False, top=False, labelbottom=True, left=False, right=False, labelleft=True)

# Draw vertical axis lines
vals = ax.get_xticks()
for tick in vals:
    ax.axvline(x=tick, linestyle='dashed', alpha=0.4, color='#eeeeee', zorder=1)

# Set x-axis label
ax.set_xlabel("Count", labelpad=20, weight='bold', size=12)

# Set y-axis label
ax.set_ylabel("Artist", labelpad=20, weight='bold', size=12)

# Format y-axis label
# ax.xaxis.set_major_formatter(StrMethodFormatter('{x:,g}'))
    
ax.invert_yaxis()
ax = met_df[met_df['Classification'].str.match('[Tt]extile',na=False)]['Classification'].value_counts()[:15].plot(kind='barh')

ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
ax.spines['left'].set_visible(False)
ax.spines['bottom'].set_visible(False)

# Switch off ticks
ax.tick_params(axis="both", which="both", bottom=False, top=False, labelbottom=True, left=False, right=False, labelleft=True)

# Draw vertical axis lines
vals = ax.get_xticks()
for tick in vals:
    ax.axvline(x=tick, linestyle='dashed', alpha=0.4, color='#eeeeee', zorder=1)

# Set x-axis label
ax.set_xlabel("Count", labelpad=20, weight='bold', size=12)

# Set y-axis label
ax.set_ylabel("Artist", labelpad=20, weight='bold', size=12)

# Format y-axis label
# ax.xaxis.set_major_formatter(StrMethodFormatter('{x:,g}'))
    
ax.invert_yaxis()
# Object_Name -> Textile sample
# Histogram of how others are used
met_df[met_df['Object Name'].str.match('Textile sample',na=False)]['Classification'].value_counts()
Textiles-Woven                  888
Textiles-Printed                474
Textiles-Painted and Printed     28
Textiles                         22
Textiles-Dyed                     2
Textiles-Velvets                  2
Textiles-Laces                    1
Textiles-Painted                  1
Name: Classification, dtype: int64
# Object_Name -> Textile
# Histogram of how others are used
met_df[met_df['Object Name'].str.match('^Textile$',na=False)]['Classification'].value_counts()
Textiles-Woven                                     313
Textiles                                           115
Textiles-Printed                                   105
Textiles-Woven-Brocade                              38
Textiles-Costumes                                   26
Textiles-Embroidered                                24
Textiles-Tapestries                                 16
Textiles-Painted and Printed                        13
Textiles-Miscellaneous                              10
Textiles-Velvets                                     7
Textiles-Painted                                     3
Main dress-Womenswear                                3
Textiles-Dyed                                        3
Textiles-Laces                                       3
Textiles-Painted and/or Printed                      2
Textiles-Embroidered-Painted and Printed             1
Miscellaneous                                        1
Feathers-Costumes                                    1
Barkcloth                                            1
Miscellaneous|Textiles                               1
Textiles-Beadwork                                    1
Sculpture|Textiles|Textiles-Painted and Printed      1
Textiles-Ecclesiastical                              1
Textiles-Embroidered|Textiles-Laces                  1
Textiles-Methods and Materials                       1
Textiles-Featherwork                                 1
Name: Classification, dtype: int64
# Object_Name -> Textile fragment
# Histogram of how others are used
met_df[met_df['Object Name'].str.match('Textile fragment',na=False)]['Classification'].value_counts()
Textiles-Woven                     192
Textiles-Brocades                   26
Textiles                            24
Textiles-Printed                     8
Textiles-Velvets                     7
Textiles-Embroidered                 3
Textiles-Painted and/or Printed      3
Textiles-Painted and Printed         2
Textiles-Non-Woven                   1
Name: Classification, dtype: int64
# Object_Name -> Classification
# Histogram of how others are used
met_df[met_df['Classification'].str.match('[Tt]extile',na=False)]['Object Name'].value_counts()
Piece                                  8630
Fragment                               4479
Textile sample                         1409
Sample                                 1005
Sampler                                 903
Border                                  788
Panel                                   754
Textile                                 684
Ribbon                                  603
Galloon                                 545
Strip                                   457
Carpet                                  413
Edging                                  363
Tapestry                                342
Fringe                                  301
Band                                    268
Insertion                               268
Textile fragment                        254
Cover                                   240
Handkerchief                            217
Sleeve Band                             196
Tassel                                  187
Hanging                                 182
Length                                  165
Sample book                             161
Collar                                  161
Bag                                     160
Quilt                                   150
Rank Badge                              150
Flounce                                 147
                                       ... 
Barkcloth                                 1
Vertical Panel                            1
Coffin shield                             1
Wall covering fragment                    1
Temple Hanging with a Hindu Devotee       1
Wall panels                               1
Roundel and band                          1
Surcoat (Jinbaori)                        1
Ball                                      1
Case, Hanging                             1
Glove                                     1
Crib quilt                                1
Waistcoat panels                          1
Sleeve trimmings                          1
Needle Case                               1
Fragment of a sleeve band                 1
Cravat end or rabat                       1
Chasuble front with orphreys              1
Printed velvet                            1
Wedding Festival hanging                  1
Belts                                     1
Hanging with Crown and Escutcheon         1
Mittens                                   1
Cloth and border fragments                1
Decoration                                1
Altar hanging                             1
Chasuble front with orphrey band          1
Box and textile samples                   1
Galloons                                  1
Garment fragment (?)                      1
Name: Object Name, Length: 1261, dtype: int64
met_df[met_df['Classification'].str.match('[Tt]extile',na=False)]['Classification'].value_counts()
Textiles-Woven                                                                10994
Textiles-Laces                                                                 4962
Textiles-Embroidered                                                           4098
Textiles                                                                       3275
Textiles-Printed                                                               2559
Textiles-Trimmings                                                             1950
Textiles-Velvets                                                               1408
Textiles-Costumes                                                               672
Textiles-Tapestries                                                             646
Textiles-Rugs                                                                   621
Textiles-Methods and Materials                                                  257
Textiles-Painted and Printed                                                    196
Textiles-Sample Books                                                           188
Textiles-Painted and/or Printed                                                 139
Textiles-Implements                                                             116
Textiles-Featherwork                                                            109
Textiles-Beadwork                                                                95
Textiles-Ecclesiastical                                                          79
Textiles-Woven-Brocade                                                           78
Textiles-Woven|Textiles-Ecclesiastical                                           71
Textiles-Painted                                                                 67
Textiles-Dyed                                                                    60
Textiles-Dyed and Embroidered                                                    57
Textiles-Embroidered|Textiles-Ecclesiastical                                     56
Textiles-Brocades                                                                43
Textiles-Costumes-Accessories                                                    39
Textiles-Miscellaneous                                                           31
Textiles-Costumes-Secular                                                        29
Textiles-Costumes-Ecclesiastical                                                 29
Textiles-Painted and Dyed                                                        25
                                                                              ...  
Textiles-Woven|Textiles-Costumes                                                  1
Textiles-Embroidered|Textiles-Ecclesiastical|Textiles-Velvets                     1
Textiles-Paintings                                                                1
Textiles-Velvets|Textiles-Costumes-Ecclesiastical                                 1
Textiles-Embroidered|Textiles-Ecclesiastical|Textiles-Woven                       1
Textiles-Printed|Textiles-Ecclesiastical                                          1
Textiles-Embroidered|Textiles-Woven                                               1
Textiles-Methods and Materials|Textiles-Laces                                     1
Textiles-Beadwork|Textiles-Costumes                                               1
Textiles-Embroidered|Textiles-Dyed and Embroidered                                1
Textiles-Costumes|Textiles-Dyed and Embroidered                                   1
Textiles-Painted and Embroidered                                                  1
Textiles|Paintings                                                                1
Textiles-Woven|Textiles-Embroidered|Textiles-Ecclesiastical                       1
Textiles-Embroidered-Painted and Printed                                          1
Textiles-Velvets|Textiles-Woven|Textiles-Ecclesiastical                           1
Textiles-Embroidered|Textiles-Costumes-Ecclesiastical                             1
Textiles-Woven|Textiles-Laces                                                     1
Textiles-Painted-Reproductions                                                    1
Textiles-Costumes|Textiles-Beadwork                                               1
Textiles-Woven|Textiles-Ecclesiastical|Textiles-Embroidered|Textiles-Laces        1
Textiles-Costumes|Textiles-Embroidered                                            1
Textiles-Woven|Textiles-Printed|Textiles-Ecclesiastical                           1
Textiles-Laces|Textiles-Woven|Textiles-Ecclesiastical                             1
Textiles-Ecclesiastical|Textiles-Woven|Textiles-Embroidered                       1
Textiles-Costumes-Accessories|Beads-Costumes                                      1
Textiles-Painted|Textiles-Woven                                                   1
Textiles-Reproductions                                                            1
Textiles-Woven|Textiles-Laces|Textiles-Ecclesiastical                             1
Textiles-Woven|Textiles-Ecclesiastical|Textiles-Laces                             1
Name: Classification, Length: 93, dtype: int64

Examine some Credit Lines to extract a date

We can extract a year from the "Credit Line" column to set collection (P195) qualifier start time (P580). The object page on The Met web site has a "Provenance" section, but this does not seem accessible from the API or dump. It also seems to be different from Credit Line. Example:

https://www.metmuseum.org/art/collection/search/681504

Question - Is this a reasonable inference?

Question - Why is the above example inconsistent - Met API/database has no date, but object page says 1000 BC - 1 AD, which seems to be the incorrect range of dates

met_df['Credit Line'].sample(15)
356372                                    Rogers Fund, 1927
159234                               Museum Accession, 1974
480525    The Jefferson R. Burdick Collection, Gift of J...
22743                        Gift of William H. Riggs, 1913
159012    The Elisha Whittelsey Collection, The Elisha W...
217192                      Harris Brisbane Dick Fund, 1917
55814                            Gift of Susumu Ikuta, 1985
128730                      Bequest of Maria P. James, 1911
275474    The Jefferson R. Burdick Collection, Gift of J...
19630                      Bequest of George C. Stone, 1935
387428    The Jefferson R. Burdick Collection, Gift of J...
243562                                    Rogers Fund, 1917
443913    Museo Cao, Magdalena de Cao, Peru, Ministerio ...
231470                   Gift of Mrs. Henry T. Curtis, 1953
216522        Gift of Mr. and Mrs. Charles Wrightsman, 1982
Name: Credit Line, dtype: object

A method for extracing the year:

import re

for index, row in met_df.sample(15).iterrows(): 
    try:
        found = re.search('(.+?), ([0-9]+)', row['Credit Line']).group(2)
    except AttributeError:
        # AAA, ZZZ not found in the original string
        found = ''
    
    print (row['Credit Line'], '|', found)
    # print (found)
Conde Nast Fund and Everfast Fund, 1973 | 1973
Bequest of Maria P. James, 1911 | 1911
Fletcher Fund, 1924 | 1924
The Jefferson R. Burdick Collection, Gift of Jefferson R. Burdick | 
Gift of Dietrich von Bothmer, Distinguished Research Curator, Greek and Roman Art, 2011 | 2011
Harris Brisbane Dick Fund, 1946 | 1946
Bequest of George Blumenthal, 1941 | 1941
Private collection, New York | 
Rogers Fund, Transferred from the Library, 1941 | 1941
Bequest of Herbert Mitchell, 2008 | 2008
The Jefferson R. Burdick Collection, Gift of Jefferson R. Burdick | 
Gift of Mrs. Samuel Stiefel, 1940 | 1940
The Crosby Brown Collection of Musical Instruments, 1889 | 1889
Harris Brisbane Dick Fund, 1917 | 1917
Benaki Museum, Athens (10738) | 

Send Wikidata Query to pick up Met objects

Met objects are currently (April 2019) modeled slightly differently, so one goal of the project is to normalize this and make it consistent. There are currently two different methods to pick up Met objects:

  1. Anything with Met ID (P3634)
  2. Anything with inventory number (P217) qualified with collection (P195) set to Met (Q160236)

For a SPARQL query, these two are combined with UNION, and optional fields returned.

wikidata_api_url = 'https://query.wikidata.org/bigdata/namespace/wdq/sparql'

query = '''
SELECT DISTINCT ?item ?inventorynumber ?metid ?creator ?creatorLabel ?inception 
WHERE 
{
  {
    ?item p:P217 [ ps:P217 ?inventorynumber ; pq:P195 wd:Q160236 ]  .
  } UNION {
    ?item wdt:P3634 ?metid .
  }
  OPTIONAL { ?item p:P217 [ ps:P217 ?inventorynumber ; pq:P195 wd:Q160236 ]  . }
  OPTIONAL { ?item wdt:P3634 ?metid . }
  OPTIONAL { ?item wdt:P170 ?creator . }
  OPTIONAL { ?item wdt:P571 ?inception . }

  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
} 
'''

data = requests.post(wikidata_api_url, data={'query': query, 'format': 'json'}).json()

Examine some random records to understand the JSON structure

Now that we've made the SPARQL query, which may take 5-10 seconds, let's look at some of the raw JSON that is returned to understand how to parse it.

# Examine the return data by selecting a few random ones
import random
import pprint

pp = pprint.PrettyPrinter()
pp.pprint(random.choices(data['results']['bindings'], k=3))
[{'creator': {'type': 'uri',
              'value': 'http://www.wikidata.org/entity/Q2636828'},
  'creatorLabel': {'type': 'literal',
                   'value': 'Alan Shields',
                   'xml:lang': 'en'},
  'inception': {'datatype': 'http://www.w3.org/2001/XMLSchema#dateTime',
                'type': 'literal',
                'value': '1976-01-01T00:00:00Z'},
  'inventorynumber': {'type': 'literal', 'value': '1984.291'},
  'item': {'type': 'uri', 'value': 'http://www.wikidata.org/entity/Q19920757'},
  'metid': {'type': 'literal', 'value': '483117'}},
 {'creator': {'type': 'uri',
              'value': 'http://www.wikidata.org/entity/Q3120087'},
  'creatorLabel': {'type': 'literal',
                   'value': 'John La Farge',
                   'xml:lang': 'en'},
  'inception': {'datatype': 'http://www.w3.org/2001/XMLSchema#dateTime',
                'type': 'literal',
                'value': '1886-01-01T00:00:00Z'},
  'inventorynumber': {'type': 'literal', 'value': '67.55.177a–c'},
  'item': {'type': 'uri', 'value': 'http://www.wikidata.org/entity/Q20178373'},
  'metid': {'type': 'literal', 'value': '11389'}},
 {'inception': {'datatype': 'http://www.w3.org/2001/XMLSchema#dateTime',
                'type': 'literal',
                'value': '1730-01-01T00:00:00Z'},
  'inventorynumber': {'type': 'literal', 'value': '2005.365'},
  'item': {'type': 'uri', 'value': 'http://www.wikidata.org/entity/Q29385696'},
  'metid': {'type': 'literal', 'value': '504256'}}]

Convert the WDQ JSON result to a dataframe

Make life easier by converting the JSON to a Pandas Dataframe, which is basically a 2D spreadsheet-like data structure. We're going to also do some integrity checks as we import. Most of the data are strings and numbers, but the "inception" is a formal date string in the format +1984-01-01T00:00:00Z and it's possible Wikidata has dates that validate but are illogical, like year 0. It will error out on these, and show up in pink below.

Problem - It is also possible inception is set to "Unknown value" in Wikidata which is tricky to handle in Python.

In SPARQL parlance, it would be tested like this:

?item wdt:P571 ?date .

FILTER isBLANK(?date) .

We're have to figure out how to best represent this while doing our data work, since a Python dateTime module is quite strict. Some research indicates that there is quite a need for this type of function of handling outliers, but there is no simple or pat solution.

(https://stackoverflow.com/questions/6697770/allowing-invalid-dates-in-python-datetime)

import logging as logger
import numpy as np

resultarray = []
for item in data['results']['bindings']:
    # Need to fill in 0 and not None for int64, or this will be cast as float64
    metid = item['metid']['value'] if 'metid' in item else 0
    inventorynumber = item['inventorynumber']['value'] if 'inventorynumber' in item else None
    cl = item['creatorLabel']['value'] if 'creatorLabel' in item else None
    
    # Dates are weird - they are usually like +1984-01-01T00:00:00Z
    #   They may also be "unknown" with or without a start and end time
    try:
        inception = parse(item['inception']['value']) if 'inception' in item else None
    except (TypeError, ValueError):
        logger.warning('failed to parse QID %s - time %r', item['item']['value'], item['inception']['value'])
    
    resultarray.append({
        'qid': int(item['item']['value'].replace('http://www.wikidata.org/entity/Q','')),
        'metid': int(metid),
        'creatorLabel': cl,
        'inception': inception,
        'inventorynumber': inventorynumber,
        }
    )

# resultarray

# Create a Pandas dataframe    
wd_metitems_df = pd.DataFrame(resultarray)

# wd_metitems_df.astype(dtype={'metid':np.int64})

wd_metitems_df.info()
WARNING:root:failed to parse QID http://www.wikidata.org/entity/Q61730536 - time 't1466139731'
WARNING:root:failed to parse QID http://www.wikidata.org/entity/Q59767196 - time 't1466138755'
WARNING:root:failed to parse QID http://www.wikidata.org/entity/Q62761150 - time 't1415652024'
WARNING:root:failed to parse QID http://www.wikidata.org/entity/Q62759189 - time 't1415652016'
WARNING:root:failed to parse QID http://www.wikidata.org/entity/Q63434597 - time 't1445847308'
WARNING:root:failed to parse QID http://www.wikidata.org/entity/Q59770772 - time 't1466138991'
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10800 entries, 0 to 10799
Data columns (total 5 columns):
creatorLabel       9565 non-null object
inception          10416 non-null object
inventorynumber    10695 non-null object
metid              10800 non-null int64
qid                10800 non-null int64
dtypes: int64(2), object(3)
memory usage: 422.0+ KB

Examine some random records to check they are being imported correctly

wd_metitems_df.sample(15)
creatorLabel inception inventorynumber metid qid
6705 Robert Havell 1840-01-01 00:00:00+00:00 54.90.32 11042 20174458
4004 Emmanuel Tzanes 1663-01-01 00:00:00+00:00 33.79.18 437857 19913426
8214 anonymous 1350-01-01 00:00:00+00:00 25.120.289a-e 473891 20201854
2933 anonymous 1700-01-01 00:00:00+00:00 1976.100.13 436347 19913242
9540 None 0433-01-01 00:00:00+00:00 11.141 248499 29383763
1923 anonymous 1750-01-01 00:00:00+00:00 25.106.36 435939 19913745
1984 anonymous 1800-01-01 00:00:00+00:00 62.122.32 437665 19913880
10203 None 0265-01-01 00:00:00+00:00 55.11.5 254819 29383917
6913 Pavel Svinyin 1811-01-01 00:00:00+00:00 42.95.42 12729 20178059
6479 David Teniers the Younger 1655-01-01 00:00:00+00:00 1975.1.127 459069 20187112
5995 William Scott 1957-01-01 00:00:00+00:00 2007.49.87 492759 20190127
10725 None None None 38159 43712728
3366 Henry Colton Shumway 1843-01-01 00:00:00+00:00 2007.437 19718 19924795
8606 None 1800-01-01 00:00:00+00:00 1988.301 327369 29384521
5120 William P. Chappel 1870-01-01 00:00:00+00:00 54.90.514 10440 20175745

Wikidata items for The Met with 'anonymous' as creator

wd_metitems_df.creatorLabel.isin(['anonymous']).value_counts(normalize=True).plot.pie(autopct='%.0f')
<matplotlib.axes._subplots.AxesSubplot at 0x7f64a0e70128>
wd_metitems_df.creatorLabel.isna().value_counts(normalize=True).plot.pie(autopct='%.0f')
<matplotlib.axes._subplots.AxesSubplot at 0x7f64a840e400>
met_df['Artist Display Name'].notna().value_counts(normalize=True).plot.pie(autopct='%.0f')
<matplotlib.axes._subplots.AxesSubplot at 0x7f64a83cf240>
met_artist_notna_df = met_df[met_df['Artist Display Name']].notna().copy()
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-14-0d2376b2e4ca> in <module>
----> 1 met_artist_notna_df = met_df[met_df['Artist Display Name']].notna().copy()
      2 

/srv/paws/lib/python3.6/site-packages/pandas/core/frame.py in __getitem__(self, key)
   2915 
   2916         # Do we have a (boolean) 1d indexer?
-> 2917         if com.is_bool_indexer(key):
   2918             return self._getitem_bool_array(key)
   2919 

/srv/paws/lib/python3.6/site-packages/pandas/core/common.py in is_bool_indexer(key)
    122             if not lib.is_bool_array(key):
    123                 if isna(key).any():
--> 124                     raise ValueError(na_msg)
    125                 return False
    126             return True

ValueError: cannot index with vector containing NA / NaN values
met_df[met_df['Artist Display Name'].str.match('Anonymous')]
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-59-2bfae3594be6> in <module>
----> 1 met_df[met_df['Artist Display Name'].str.match('Anonymous')]

/srv/paws/lib/python3.6/site-packages/pandas/core/frame.py in __getitem__(self, key)
   2915 
   2916         # Do we have a (boolean) 1d indexer?
-> 2917         if com.is_bool_indexer(key):
   2918             return self._getitem_bool_array(key)
   2919 

/srv/paws/lib/python3.6/site-packages/pandas/core/common.py in is_bool_indexer(key)
    122             if not lib.is_bool_array(key):
    123                 if isna(key).any():
--> 124                     raise ValueError(na_msg)
    125                 return False
    126             return True

ValueError: cannot index with vector containing NA / NaN values

Wikidata items believed to be Met objects but missing Met Object ID statement

These items don't have Met Object ID (P3634) set but are in the list because the inventorynumber and collection->The Met Museum was set. We test to see if Met ID = 0.

wd_missing_metid_df = wd_metitems_df[wd_metitems_df['metid'] == 0].copy()

wd_missing_metid_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 14 entries, 204 to 10677
Data columns (total 5 columns):
creatorLabel       14 non-null object
inception          14 non-null object
inventorynumber    14 non-null object
metid              14 non-null int64
qid                14 non-null int64
dtypes: int64(2), object(3)
memory usage: 672.0+ bytes

Outlier problems

The following entries had some issues with matching what was in the Met database - either these are outdated or there are some issues.

wd_missing_metid_df
creatorLabel inception inventorynumber metid qid
204 William-Adolphe Bouguereau 1873-01-01 00:00:00+00:00 L.2012.29 0 3555076
1601 anonymous 1800-01-01 00:00:00+00:00 24.80.488 0 19912903
9965 anonymous 1633-01-01 00:00:00+00:00 15.30.1 0 22097536
9967 anonymous 1633-01-01 00:00:00+00:00 15.30.2 0 22097547
9969 anonymous 1630-01-01 00:00:00+00:00 1971.186 0 22054880
9972 Rembrandt 1630-01-01 00:00:00+00:00 1971.186 0 22054880
10019 Johan Danckerts 1640-01-01 00:00:00+00:00 27.146 0 28608971
10279 David Vinckboons 1620-01-01 00:00:00+00:00 1976.100.20 0 30086888
10290 Adam Willaerts 1643-01-01 00:00:00+00:00 06.1303 0 30086759
10305 Albrecht Dürer 1526-01-01 00:00:00+00:00 9.73.120 0 40703137
10332 t1457747052 1655-01-01 00:00:00+00:00 74.55 0 62570242
10335 David Teniers the Younger 1655-01-01 00:00:00+00:00 74.55 0 62570242
10676 Philippe Pavy 1886-01-01 00:00:00+00:00 N.A.2016.9.3 0 57506362
10677 Jan van Goyen 1636-01-01 00:00:00+00:00 06.290 0 57170461

Use Pandas equivalent of a database join

Do an "inner" join that makes a new dataframe based on the Met database (met_df) but adds a new columns from the Wikidata query (wd_missing_metid_df) that supplies qid and Object Number/inventory number

merged_todo_metid_df = pd.merge(met_df, wd_missing_metid_df, how='inner', left_on='Object Number', right_on='inventorynumber')

Test some rows for sanity:

merged_todo_metid_df[['Object Number','Object ID','qid']].info()
<class 'pandas.core.frame.DataFrame'>
Index: 0 entries
Data columns (total 3 columns):
Object Number    0 non-null object
Object ID        0 non-null int64
qid              0 non-null int64
dtypes: int64(2), object(1)
memory usage: 0.0+ bytes

Good - we fixed 82 items missing the Met ID on May 24, 2019, so now we are in sync.

Generate Quickstatements to fix the problem

In case you need to create Quickstatements, here they are:

# To generate Quickstatements, iterate over the rows
# Desired output: Q43742238|P3634|26606
QS_ADDMISSING_METID='Q%s|P3634|"%d"'

for index, row in merged_todo_metid_df.iterrows(): 
    print (QS_ADDMISSING_METID % (row['qid'], row['Object ID']))

Wikidata items believed to be Met objects but missing inventory number statement

These items don't have inventory number set but are in the list because Met Object ID (P3634) was set. We test to see if Wikidata results for Met items has inventorynumber set to None.

FIX for this would be to generate and ingest Quickstatements to fill in inventory number.

Something like:

Q61876946|P217|"2003.161"|P195|Q160236

wd_missing_inventory_df = wd_metitems_df[wd_metitems_df['inventorynumber'].isnull()].copy()

wd_missing_inventory_df.sample(10)
creatorLabel inception inventorynumber metid qid
10786 Katsushika Hokusai 1825-01-01 00:00:00+00:00 None 54925 60920142
10729 Makron None None 248697 42186356
10795 Giovanni Paolo Panini 1754-01-01 00:00:00+00:00 None 437245 3094651
10735 None 0589-01-01 00:00:00+00:00 None 548211 28670008
10694 Winslow Homer 1899-01-01 00:00:00+00:00 None 20011478 62022405
10797 None 1550-01-01 00:00:00+00:00 None 812566 61781455
10789 Katsushika Hokusai None None 78638 60682065
10695 Pablo Picasso 1914-01-01 00:00:00+00:00 None 490563 61884489
10765 Kitagawa Utamaro None None 36631 60590430
10758 Gaspare Diziani None None 338609 28732893

Use Pandas equivalent of a database join

Do an "inner" join that makes a new dataframe based on the Met database (met_df) but adds a new columns from the Wikidata query (wd_missing_inventory_df) that supplies qid and matched metid

merged_todo_inventorynumber_df = pd.merge(met_df, wd_missing_inventory_df, how='inner', left_on='Object ID', right_on='metid')

merged_todo_inventorynumber_df[['Object Number','Object ID','qid']].info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 104 entries, 0 to 103
Data columns (total 3 columns):
Object Number    104 non-null object
Object ID        104 non-null int64
qid              104 non-null int64
dtypes: int64(2), object(1)
memory usage: 3.2+ KB
merged_todo_inventorynumber_df[['Object Number','Object ID','qid']].sample(5)
Object Number Object ID qid
91 L.2014.77 503040 63082710
44 2013.730 78801 60661765
63 80.3.384 338609 28732893
87 1998.456.3 490686 61781317
58 2005.100.20 283092 29881716

Generate Quickstatements to fix the problem

# To generate Quickstatements, iterate over the rows
# Desired output: Q61876946|P217|32767|P195|Q160236

# It is possible that the inventory number does exist already, but is missing the qualifier
# For example, it may have:
#   Q61750964|P217|"66.109"
# but not:
#   Q61750964|P217|"66.109"|P195|Q160236

# We could just subtract

QS_ADDMISSING_INVENTORYNUMBER='Q%s|P217|"%s"|P195|Q160236'
QS_ADDMISSING_COLLECTION='Q%s|P195|Q160236|P217|"%s"'

for index, row in merged_todo_inventorynumber_df.iterrows(): 
    print (QS_ADDMISSING_INVENTORYNUMBER % (row['qid'], row['Object Number']))
    print (QS_ADDMISSING_COLLECTION % (row['qid'], row['Object Number']))
Q61750964|P217|"66.109"|P195|Q160236
Q61750964|P195|Q160236|P217|"66.109"
Q30993455|P217|"2016.797.21"|P195|Q160236
Q30993455|P195|Q160236|P217|"2016.797.21"
Q30994036|P217|"2016.797.17"|P195|Q160236
Q30994036|P195|Q160236|P217|"2016.797.17"
Q61750993|P217|"43.120.1208a, b"|P195|Q160236
Q61750993|P195|Q160236|P217|"43.120.1208a, b"
Q30063003|P217|"36.25.508"|P195|Q160236
Q30063003|P195|Q160236|P217|"36.25.508"
Q60590430|P217|"JP152"|P195|Q160236
Q60590430|P195|Q160236|P217|"JP152"
Q60579810|P217|"JP555"|P195|Q160236
Q60579810|P195|Q160236|P217|"JP555"
Q60628715|P217|"JP1122"|P195|Q160236
Q60628715|P195|Q160236|P217|"JP1122"
Q60601286|P217|"JP2739"|P195|Q160236
Q60601286|P195|Q160236|P217|"JP2739"
Q60567840|P217|"JP3118"|P195|Q160236
Q60567840|P195|Q160236|P217|"JP3118"
Q60568037|P217|"JP3129"|P195|Q160236
Q60568037|P195|Q160236|P217|"JP3129"
Q43712728|P217|"1982.220.7"|P195|Q160236
Q43712728|P195|Q160236|P217|"1982.220.7"
Q62794859|P217|"36.100.70"|P195|Q160236
Q62794859|P195|Q160236|P217|"36.100.70"
Q61751040|P217|"30.32.15"|P195|Q160236
Q61751040|P195|Q160236|P217|"30.32.15"
Q43742393|P217|"19.103.2"|P195|Q160236
Q43742393|P195|Q160236|P217|"19.103.2"
Q60628584|P217|"1996.463"|P195|Q160236
Q60628584|P195|Q160236|P217|"1996.463"
Q28126677|P217|"1991.134"|P195|Q160236
Q28126677|P195|Q160236|P217|"1991.134"
Q60590216|P217|"JP990"|P195|Q160236
Q60590216|P195|Q160236|P217|"JP990"
Q47544982|P217|"1985.227.1"|P195|Q160236
Q47544982|P195|Q160236|P217|"1985.227.1"
Q43376787|P217|"1999.263a, b"|P195|Q160236
Q43376787|P195|Q160236|P217|"1999.263a, b"
Q61746498|P217|"1989.363.123"|P195|Q160236
Q61746498|P195|Q160236|P217|"1989.363.123"
Q60920142|P217|"JP1016"|P195|Q160236
Q60920142|P195|Q160236|P217|"JP1016"
Q60572478|P217|"JP1121"|P195|Q160236
Q60572478|P195|Q160236|P217|"JP1121"
Q60595333|P217|"JP2394"|P195|Q160236
Q60595333|P195|Q160236|P217|"JP2394"
Q60706622|P217|"JIB5"|P195|Q160236
Q60706622|P195|Q160236|P217|"JIB5"
Q61409206|P217|"JIB69"|P195|Q160236
Q61409206|P195|Q160236|P217|"JIB69"
Q61413284|P217|"JIB80"|P195|Q160236
Q61413284|P195|Q160236|P217|"JIB80"
Q61748802|P217|"10.7.4"|P195|Q160236
Q61748802|P195|Q160236|P217|"10.7.4"
Q61876908|P217|"2002.325"|P195|Q160236
Q61876908|P195|Q160236|P217|"2002.325"
Q43648711|P217|"2015.300.301a–c"|P195|Q160236
Q43648711|P195|Q160236|P217|"2015.300.301a–c"
Q63243745|P217|"SL.17.2011.22.4"|P195|Q160236
Q63243745|P195|Q160236|P217|"SL.17.2011.22.4"
Q61781621|P217|"2011.127"|P195|Q160236
Q61781621|P195|Q160236|P217|"2011.127"
Q60682065|P217|"2013.736a–f"|P195|Q160236
Q60682065|P195|Q160236|P217|"2013.736a–f"
Q60704425|P217|"2013.738a, b"|P195|Q160236
Q60704425|P195|Q160236|P217|"2013.738a, b"
Q60570674|P217|"2013.852a, b"|P195|Q160236
Q60570674|P195|Q160236|P217|"2013.852a, b"
Q60571703|P217|"2013.853"|P195|Q160236
Q60571703|P195|Q160236|P217|"2013.853"
Q60571061|P217|"2013.858"|P195|Q160236
Q60571061|P195|Q160236|P217|"2013.858"
Q60571356|P217|"2013.859"|P195|Q160236
Q60571356|P195|Q160236|P217|"2013.859"
Q60704878|P217|"2013.875"|P195|Q160236
Q60704878|P195|Q160236|P217|"2013.875"
Q60704659|P217|"2013.878"|P195|Q160236
Q60704659|P195|Q160236|P217|"2013.878"
Q60701873|P217|"2013.881"|P195|Q160236
Q60701873|P195|Q160236|P217|"2013.881"
Q60571569|P217|"2013.897"|P195|Q160236
Q60571569|P195|Q160236|P217|"2013.897"
Q60705589|P217|"2013.726"|P195|Q160236
Q60705589|P195|Q160236|P217|"2013.726"
Q60661034|P217|"2013.729a–c"|P195|Q160236
Q60661034|P195|Q160236|P217|"2013.729a–c"
Q60661765|P217|"2013.730"|P195|Q160236
Q60661765|P195|Q160236|P217|"2013.730"
Q41601734|P217|"29.158.485"|P195|Q160236
Q41601734|P195|Q160236|P217|"29.158.485"
Q30063394|P217|"2009.300.3275a–c"|P195|Q160236
Q30063394|P195|Q160236|P217|"2009.300.3275a–c"
Q582428|P217|"24.232"|P195|Q160236
Q582428|P195|Q160236|P217|"24.232"
Q30263483|P217|"42.30.17a, b"|P195|Q160236
Q30263483|P195|Q160236|P217|"42.30.17a, b"
Q54965918|P217|"58.75.3"|P195|Q160236
Q54965918|P195|Q160236|P217|"58.75.3"
Q30262521|P217|"42.30.18a, b"|P195|Q160236
Q30262521|P195|Q160236|P217|"42.30.18a, b"
Q60646192|P217|"L.2011.66.56a, b"|P195|Q160236
Q60646192|P195|Q160236|P217|"L.2011.66.56a, b"
Q42186356|P217|"12.231.1"|P195|Q160236
Q42186356|P195|Q160236|P217|"12.231.1"
Q42149613|P217|"17.230.5"|P195|Q160236
Q42149613|P195|Q160236|P217|"17.230.5"
Q36446060|P217|"24.97.21a, b"|P195|Q160236
Q36446060|P195|Q160236|P217|"24.97.21a, b"
Q36446060|P217|"24.97.21a, b"|P195|Q160236
Q36446060|P195|Q160236|P217|"24.97.21a, b"
Q42244789|P217|"46.11.7"|P195|Q160236
Q42244789|P195|Q160236|P217|"46.11.7"
Q3209564|P217|"1976.646"|P195|Q160236
Q3209564|P195|Q160236|P217|"1976.646"
Q29881716|P217|"2005.100.20"|P195|Q160236
Q29881716|P195|Q160236|P217|"2005.100.20"
Q61876946|P217|"2003.161"|P195|Q160236
Q61876946|P195|Q160236|P217|"2003.161"
Q61781606|P217|"1987.394.544"|P195|Q160236
Q61781606|P195|Q160236|P217|"1987.394.544"
Q26221327|P217|"62.151"|P195|Q160236
Q26221327|P195|Q160236|P217|"62.151"
Q61876710|P217|"1999.171.1, .2"|P195|Q160236
Q61876710|P195|Q160236|P217|"1999.171.1, .2"
Q28732893|P217|"80.3.384"|P195|Q160236
Q28732893|P195|Q160236|P217|"80.3.384"
Q56604898|P217|"26.72.24"|P195|Q160236
Q56604898|P195|Q160236|P217|"26.72.24"
Q56604898|P217|"26.72.24"|P195|Q160236
Q56604898|P195|Q160236|P217|"26.72.24"
Q23688157|P217|"06.1322.2"|P195|Q160236
Q23688157|P195|Q160236|P217|"06.1322.2"
Q18338529|P217|"19.73.66"|P195|Q160236
Q18338529|P195|Q160236|P217|"19.73.66"
Q61876882|P217|"2003.20.4"|P195|Q160236
Q61876882|P195|Q160236|P217|"2003.20.4"
Q18339638|P217|"19.73.152"|P195|Q160236
Q18339638|P195|Q160236|P217|"19.73.152"
Q61876879|P217|"17.3.4"|P195|Q160236
Q61876879|P195|Q160236|P217|"17.3.4"
Q61781468|P217|"2009.436.1a–w"|P195|Q160236
Q61781468|P195|Q160236|P217|"2009.436.1a–w"
Q18338468|P217|"68.793.66"|P195|Q160236
Q18338468|P195|Q160236|P217|"68.793.66"
Q18338961|P217|"19.73.82"|P195|Q160236
Q18338961|P195|Q160236|P217|"19.73.82"
Q61781441|P217|"53.600.261(5)"|P195|Q160236
Q61781441|P195|Q160236|P217|"53.600.261(5)"
Q3395798|P217|"49.7.21"|P195|Q160236
Q3395798|P195|Q160236|P217|"49.7.21"
Q17275914|P217|"14.40.626–27"|P195|Q160236
Q17275914|P195|Q160236|P217|"14.40.626–27"
Q44214076|P217|"49.7.22"|P195|Q160236
Q44214076|P195|Q160236|P217|"49.7.22"
Q3094651|P217|"52.63.2"|P195|Q160236
Q3094651|P195|Q160236|P217|"52.63.2"
Q25936897|P217|"66.25"|P195|Q160236
Q25936897|P195|Q160236|P217|"66.25"
Q25936897|P217|"66.25"|P195|Q160236
Q25936897|P195|Q160236|P217|"66.25"
Q31073129|P217|"1979.317.1"|P195|Q160236
Q31073129|P195|Q160236|P217|"1979.317.1"
Q31068759|P217|"1979.317.2ab"|P195|Q160236
Q31068759|P195|Q160236|P217|"1979.317.2ab"
Q31070828|P217|"1979.317.3ab"|P195|Q160236
Q31070828|P195|Q160236|P217|"1979.317.3ab"
Q29049154|P217|"1990.38.1"|P195|Q160236
Q29049154|P195|Q160236|P217|"1990.38.1"
Q933665|P217|"1990.38.3"|P195|Q160236
Q933665|P195|Q160236|P217|"1990.38.3"
Q61884489|P217|"CTO.135"|P195|Q160236
Q61884489|P195|Q160236|P217|"CTO.135"
Q5506360|P217|"2016.237.33"|P195|Q160236
Q5506360|P195|Q160236|P217|"2016.237.33"
Q61781317|P217|"1998.456.3"|P195|Q160236
Q61781317|P195|Q160236|P217|"1998.456.3"
Q60596830|P217|"CTO.128"|P195|Q160236
Q60596830|P195|Q160236|P217|"CTO.128"
Q60517079|P217|"SL.17.2014.1.3"|P195|Q160236
Q60517079|P195|Q160236|P217|"SL.17.2014.1.3"
Q61479355|P217|"SL.17.2014.1.9"|P195|Q160236
Q61479355|P195|Q160236|P217|"SL.17.2014.1.9"
Q63082710|P217|"L.2014.77"|P195|Q160236
Q63082710|P195|Q160236|P217|"L.2014.77"
Q7107718|P217|"31.4.2"|P195|Q160236
Q7107718|P195|Q160236|P217|"31.4.2"
Q28670008|P217|"07.229.1a, b"|P195|Q160236
Q28670008|P195|Q160236|P217|"07.229.1a, b"
Q61781516|P217|"2011.599.1a"|P195|Q160236
Q61781516|P195|Q160236|P217|"2011.599.1a"
Q61627764|P217|"63.350.207.135.17"|P195|Q160236
Q61627764|P195|Q160236|P217|"63.350.207.135.17"
Q44178853|P217|"2014.520"|P195|Q160236
Q44178853|P195|Q160236|P217|"2014.520"
Q55619837|P217|"SL.1.2016.39.1"|P195|Q160236
Q55619837|P195|Q160236|P217|"SL.1.2016.39.1"
Q61781550|P217|"1972.717.1"|P195|Q160236
Q61781550|P195|Q160236|P217|"1972.717.1"
Q61781403|P217|"1972.717.4"|P195|Q160236
Q61781403|P195|Q160236|P217|"1972.717.4"
Q56651549|P217|"SL.2.2018.1.2"|P195|Q160236
Q56651549|P195|Q160236|P217|"SL.2.2018.1.2"
Q61781450|P217|"2016.113"|P195|Q160236
Q61781450|P195|Q160236|P217|"2016.113"
Q56697085|P217|"SL.2.2018.21.20"|P195|Q160236
Q56697085|P195|Q160236|P217|"SL.2.2018.21.20"
Q61781455|P217|"2018.756"|P195|Q160236
Q61781455|P195|Q160236|P217|"2018.756"