Exploring the Wikidata Query Log (Special Interest in Links)

Source of this data set: https://iccl.inf.tu-dresden.de/web/Wikidata_SPARQL_Logs/en
Why this analysis? For **CHAPTER 3** =)
import pandas as pd
import matplotlib.pyplot as plt
import datetime
import numpy as np
%matplotlib inline
import seaborn as sns
import urllib
from urllib.parse import unquote
#Can I look at co-occurrence of XX in the queries?
data = pd.read_csv(filepath_or_buffer="2017-06-12_2017-07-09_organic.tsv",delimiter='\t')
#data = pd.read_csv(filepath_or_buffer="Wikidata-queries-sample.tsv",delimiter='\t')
data.head()
#anonymizedQuery #timestamp #sourceCategory #user_agent
0 SELECT+%3Fvar1++%3Fvar1Label+%0AWHERE+%7B%0A++... 2017-06-12 00:03:59 organic browser
1 SELECT+%3Fvar1++%3Fvar1Label++%3Fvar2++%3Fvar2... 2017-06-12 00:05:14 organic browser
2 SELECT+%3Fvar1++%3Fvar2+%0AWHERE+%7B%0A++%3Cht... 2017-06-12 00:05:35 organic browser
3 SELECT+%3Fvar1++%3Fvar1Label++%3Fvar2++%3Fvar3... 2017-06-12 00:05:35 organic browser
4 SELECT+%3Fvar1++%3Fvar1Label++%3Fvar2++%3Fvar2... 2017-06-12 00:05:57 organic browser
testQuery = data['#anonymizedQuery'][0]
readableQuery = urllib.parse.unquote(testQuery)
readableQuery 
'SELECT+?var1++?var1Label+\nWHERE+{\n++?var1++<http://www.wikidata.org/prop/direct/P31>++<http://www.wikidata.org/entity/Q4423781>+;\n+<http://www.w3.org/2000/01/rdf-schema#label>++?var1Label+.\n+FILTER+(++(+(++LANG+(++?var1Label++)++=++"ru"+)+)+\n)+.\n}\n'
query = readableQuery.replace("\n"," ")
query2 = query.replace("+"," ")
query2
'SELECT ?var1  ?var1Label  WHERE {   ?var1  <http://www.wikidata.org/prop/direct/P31>  <http://www.wikidata.org/entity/Q4423781> ;  <http://www.w3.org/2000/01/rdf-schema#label>  ?var1Label .  FILTER (  ( (  LANG (  ?var1Label  )  =  "ru" ) )  ) . } '
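The decode-then-replace chain above can be folded into one helper: `urllib.parse.unquote_plus` (standard library) already turns `+` into spaces, so only the newlines remain to flatten. A minimal sketch (the helper name is mine):

```python
from urllib.parse import unquote_plus

def decode_query(encoded):
    # unquote_plus decodes %XX escapes and turns '+' into spaces in one step;
    # remaining newlines are flattened so the query fits on one line.
    return unquote_plus(encoded).replace('\n', ' ')
```

For this query the result matches `query2`; a side benefit is that a literal plus encoded as `%2B` survives as `+` instead of being turned into a space by the manual replace.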
# Works
if ('WHERE' in readableQuery):
    print('found!')
else:
    print('meh!')
found!

def isFederated(row):
    # A SPARQL query is federated if it calls out to a remote endpoint via SERVICE
    return 'SERVICE' in unquote(row['#anonymizedQuery'])

import re
text = 'an example word:cat!!'  # not named str, to avoid shadowing the built-in
match = re.search(r'word:\w\w\w', text)
# If-statement after search() tests whether it succeeded
if match:
    print('found')
    print(match.group())  # 'word:cat'
else:
    print('did not find')
found
word:cat
def feProperties(query):
    # Count how many direct-property URIs appear in the decoded query
    decquery = unquote(query)
    count = re.findall(re.escape('http://www.wikidata.org/prop/direct/'), decquery)
    return len(count)
countProps = data['#anonymizedQuery'].apply(feProperties)
countProps
0         1
1         2
2         3
3         3
4         2
5         3
6         3
7         1
8         3
9         0
10        0
11        1
12        0
13        3
14        0
15        2
16        1
17        3
18        4
19        4
20        4
21        1
22        1
23        1
24        1
25        1
26        2
27        1
28        1
29        2
         ..
192300    3
192301    3
192302    3
192303    3
192304    3
192305    3
192306    3
192307    3
192308    3
192309    3
192310    3
192311    3
192312    3
192313    3
192314    3
192315    3
192316    3
192317    3
192318    3
192319    3
192320    3
192321    3
192322    3
192323    1
192324    1
192325    1
192326    1
192327    1
192328    6
192329    1
Name: #anonymizedQuery, Length: 192330, dtype: int64
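Beyond counting, the co-occurrence question from the first cell needs the actual property IDs. A hedged sketch (the helper name, and the assumption that every direct property matches `P\d+`, are mine):

```python
import re
from urllib.parse import unquote

PROP_RE = re.compile(r'http://www\.wikidata\.org/prop/direct/(P\d+)')

def extractProperties(query):
    # Return the list of direct-property IDs (e.g. 'P31') in a decoded query;
    # duplicates are kept so co-occurrence counts stay faithful.
    return PROP_RE.findall(unquote(query))
```

`data['#anonymizedQuery'].apply(extractProperties)` would then give a Series of ID lists ready to feed into a co-occurrence matrix.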
sns.distplot(countProps);
len(data)
192330
sns.countplot(countProps)
/Users/sarasua/anaconda3/lib/python3.6/site-packages/seaborn/categorical.py:1460: FutureWarning: remove_na is deprecated and is a private function. Do not use.
  stat_data = remove_na(group_data)
<matplotlib.axes._subplots.AxesSubplot at 0x1150d14a8>
 

Testing structured query parsing (wikier's library)

!pip install rdflib
Collecting rdflib
  Using cached https://files.pythonhosted.org/packages/3c/fe/630bacb652680f6d481b9febbb3e2c3869194a1a5fc3401a4a41195a2f8f/rdflib-4.2.2-py3-none-any.whl
Collecting isodate (from rdflib)
  Using cached https://files.pythonhosted.org/packages/9b/9f/b36f7774ff5ea8e428fdcfc4bb332c39ee5b9362ddd3d40d9516a55221b2/isodate-0.6.0-py2.py3-none-any.whl
Requirement already satisfied: pyparsing in /srv/paws/lib/python3.6/site-packages (from rdflib)
Requirement already satisfied: six in /srv/paws/lib/python3.6/site-packages (from isodate->rdflib)
Installing collected packages: isodate, rdflib
Successfully installed isodate-0.6.0 rdflib-4.2.2
from rdflib.namespace import FOAF
#INFO:rdflib:RDFLib Version: 4.2.2
from rdflib.plugins.sparql import prepareQuery
#q = prepareQuery('SELECT ?s WHERE { ?person foaf:knows ?s .}',initNs = { "foaf": FOAF })
q = prepareQuery('SELECT ?var1  ?var1Label  WHERE {   ?var1  <http://www.wikidata.org/prop/direct/P31>  <http://www.wikidata.org/entity/Q4423781> ;  <http://www.w3.org/2000/01/rdf-schema#label>  ?var1Label .  FILTER (  ( (  LANG (  ?var1Label  )  =  "ru" ) )  ) . } ',initNs = { "foaf": FOAF })
dir(q)
['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_original_args',
 'algebra',
 'prologue']
q.algebra
CompValue([('p',
            CompValue([('p',
                        CompValue([('expr',
                                    Expr([('expr',
                                           Expr([('arg',
                                                  rdflib.term.Variable('var1Label')),
                                                 ('_vars',
                                                  {rdflib.term.Variable('var1Label')})])),
                                          ('op', '='),
                                          ('other', rdflib.term.Literal('ru')),
                                          ('_vars', set())])),
                                   ('p',
                                    CompValue([('triples',
                                                [(rdflib.term.Variable('var1'),
                                                  rdflib.term.URIRef('http://www.wikidata.org/prop/direct/P31'),
                                                  rdflib.term.URIRef('http://www.wikidata.org/entity/Q4423781')),
                                                 (rdflib.term.Variable('var1'),
                                                  rdflib.term.URIRef('http://www.w3.org/2000/01/rdf-schema#label'),
                                                  rdflib.term.Variable('var1Label'))]),
                                               ('_vars',
                                                {rdflib.term.Variable('var1'),
                                                 rdflib.term.Variable('var1Label')})])),
                                   ('_vars',
                                    {rdflib.term.Variable('var1'),
                                     rdflib.term.Variable('var1Label')})])),
                       ('PV',
                        [rdflib.term.Variable('var1'),
                         rdflib.term.Variable('var1Label')]),
                       ('_vars',
                        {rdflib.term.Variable('var1'),
                         rdflib.term.Variable('var1Label')})])),
           ('datasetClause', None),
           ('PV',
            [rdflib.term.Variable('var1'), rdflib.term.Variable('var1Label')]),
           ('_vars',
            {rdflib.term.Variable('var1'),
             rdflib.term.Variable('var1Label')})])
r = q.algebra
print(type(r))
<class 'rdflib.plugins.sparql.parserutils.CompValue'>
v = r.get('p').get('PV')[0]
v.n3()
'?var1'
def parseQuery(row):
    readableQ = urllib.parse.unquote(row['#anonymizedQuery'])
    query = readableQ.replace("\n"," ").replace("+"," ")
    
    #print(query)
    q = prepareQuery(query)
    print(q.algebra.get('p').get('PV')) #[0]
data.apply(parseQuery,axis=1)
[rdflib.term.Variable('var1'), rdflib.term.Variable('var1Label')]
[rdflib.term.Variable('var1'), rdflib.term.Variable('var1Label'), rdflib.term.Variable('var2'), rdflib.term.Variable('var2Label'), rdflib.term.Variable('var3'), rdflib.term.Variable('var4'), rdflib.term.Variable('var5'), rdflib.term.Variable('var6'), rdflib.term.Variable('var7'), rdflib.term.Variable('var7Label'), rdflib.term.Variable('var8'), rdflib.term.Variable('var8Label'), rdflib.term.Variable('var9'), rdflib.term.Variable('var10'), rdflib.term.Variable('var10Label'), rdflib.term.Variable('var11'), rdflib.term.Variable('var11Label'), rdflib.term.Variable('var12'), rdflib.term.Variable('var12Label'), rdflib.term.Variable('var13'), rdflib.term.Variable('var14')]
[rdflib.term.Variable('var1'), rdflib.term.Variable('var2')]
PV
[rdflib.term.Variable('var1'), rdflib.term.Variable('var1Label'), rdflib.term.Variable('var2'), rdflib.term.Variable('var2Label'), rdflib.term.Variable('var3'), rdflib.term.Variable('var4'), rdflib.term.Variable('var5'), rdflib.term.Variable('var6'), rdflib.term.Variable('var7'), rdflib.term.Variable('var7Label'), rdflib.term.Variable('var8'), rdflib.term.Variable('var8Label'), rdflib.term.Variable('var9'), rdflib.term.Variable('var10'), rdflib.term.Variable('var10Label'), rdflib.term.Variable('var11'), rdflib.term.Variable('var11Label'), rdflib.term.Variable('var12'), rdflib.term.Variable('var12Label'), rdflib.term.Variable('var13'), rdflib.term.Variable('var14')]
[rdflib.term.Variable('var1'), rdflib.term.Variable('var1Label'), rdflib.term.Variable('var2'), rdflib.term.Variable('var2Label'), rdflib.term.Variable('var3'), rdflib.term.Variable('var4'), rdflib.term.Variable('var5'), rdflib.term.Variable('var6'), rdflib.term.Variable('var7'), rdflib.term.Variable('var7Label'), rdflib.term.Variable('var8'), rdflib.term.Variable('var8Label'), rdflib.term.Variable('var9'), rdflib.term.Variable('var10'), rdflib.term.Variable('var10Label'), rdflib.term.Variable('var11'), rdflib.term.Variable('var11Label'), rdflib.term.Variable('var12'), rdflib.term.Variable('var12Label'), rdflib.term.Variable('var13'), rdflib.term.Variable('var14')]
PV
PV
PV
[rdflib.term.Variable('var1')]
[rdflib.term.Variable('var1')]
[rdflib.term.Variable('var1')]
[rdflib.term.Variable('var1')]
PV
[rdflib.term.Variable('var1')]
PV
[rdflib.term.Variable('var1')]
[rdflib.term.Variable('var1'), rdflib.term.Variable('var1Label')]
[rdflib.term.Variable('var1'), rdflib.term.Variable('var1Label'), rdflib.term.Variable('var2'), rdflib.term.Variable('var2Label'), rdflib.term.Variable('var3'), rdflib.term.Variable('var4'), rdflib.term.Variable('var5'), rdflib.term.Variable('var6'), rdflib.term.Variable('var7'), rdflib.term.Variable('var7Label'), rdflib.term.Variable('var8'), rdflib.term.Variable('var8Label'), rdflib.term.Variable('var9'), rdflib.term.Variable('var10'), rdflib.term.Variable('var10Label'), rdflib.term.Variable('var11'), rdflib.term.Variable('var11Label'), rdflib.term.Variable('var12'), rdflib.term.Variable('var12Label'), rdflib.term.Variable('var13'), rdflib.term.Variable('var14')]
[rdflib.term.Variable('var1'), rdflib.term.Variable('var2')]
PV
[rdflib.term.Variable('var1'), rdflib.term.Variable('var1Label'), rdflib.term.Variable('var2'), rdflib.term.Variable('var2Label'), rdflib.term.Variable('var3'), rdflib.term.Variable('var4'), rdflib.term.Variable('var5'), rdflib.term.Variable('var6'), rdflib.term.Variable('var7'), rdflib.term.Variable('var7Label'), rdflib.term.Variable('var8'), rdflib.term.Variable('var8Label'), rdflib.term.Variable('var9'), rdflib.term.Variable('var10'), rdflib.term.Variable('var10Label'), rdflib.term.Variable('var11'), rdflib.term.Variable('var11Label'), rdflib.term.Variable('var12'), rdflib.term.Variable('var12Label'), rdflib.term.Variable('var13'), rdflib.term.Variable('var14')]
[rdflib.term.Variable('var1'), rdflib.term.Variable('var1Label'), rdflib.term.Variable('var2'), rdflib.term.Variable('var2Label'), rdflib.term.Variable('var3'), rdflib.term.Variable('var4'), rdflib.term.Variable('var5'), rdflib.term.Variable('var6'), rdflib.term.Variable('var7'), rdflib.term.Variable('var7Label'), rdflib.term.Variable('var8'), rdflib.term.Variable('var8Label'), rdflib.term.Variable('var9'), rdflib.term.Variable('var10'), rdflib.term.Variable('var10Label'), rdflib.term.Variable('var11'), rdflib.term.Variable('var11Label'), rdflib.term.Variable('var12'), rdflib.term.Variable('var12Label'), rdflib.term.Variable('var13'), rdflib.term.Variable('var14')]
PV
PV
PV
[rdflib.term.Variable('var1')]
[rdflib.term.Variable('var1')]
[rdflib.term.Variable('var1')]
[rdflib.term.Variable('var1')]
PV
[rdflib.term.Variable('var1')]
PV
[rdflib.term.Variable('var1')]
---------------------------------------------------------------------------
ParseException                            Traceback (most recent call last)
<ipython-input-58-9ebfd081a441> in <module>()
----> 1 data.apply(parseQuery,axis=1)

~/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py in apply(self, func, axis, broadcast, raw, reduce, result_type, args, **kwds)
   6485                          args=args,
   6486                          kwds=kwds)
-> 6487         return op.get_result()
   6488 
   6489     def applymap(self, func):

~/anaconda3/lib/python3.6/site-packages/pandas/core/apply.py in get_result(self)
    149             return self.apply_raw()
    150 
--> 151         return self.apply_standard()
    152 
    153     def apply_empty_result(self):

~/anaconda3/lib/python3.6/site-packages/pandas/core/apply.py in apply_standard(self)
    255 
    256         # compute the result using the series generator
--> 257         self.apply_series_generator()
    258 
    259         # wrap results

~/anaconda3/lib/python3.6/site-packages/pandas/core/apply.py in apply_series_generator(self)
    284             try:
    285                 for i, v in enumerate(series_gen):
--> 286                     results[i] = self.f(v)
    287                     keys.append(v.name)
    288             except Exception as e:

<ipython-input-57-417f96967916> in parseQuery(row)
      4 
      5     #print(query)
----> 6     q = prepareQuery(query)
      7     print(q.algebra.get('p').get('PV')) #[0]

~/anaconda3/lib/python3.6/site-packages/rdflib/plugins/sparql/processor.py in prepareQuery(queryString, initNs, base)
     23     Parse and translate a SPARQL Query
     24     """
---> 25     ret = translateQuery(parseQuery(queryString), base, initNs)
     26     ret._original_args = (queryString, initNs, base)
     27     return ret

~/anaconda3/lib/python3.6/site-packages/rdflib/plugins/sparql/parser.py in parseQuery(q)
   1056 
   1057     q = expandUnicodeEscapes(q)
-> 1058     return Query.parseString(q, parseAll=True)
   1059 
   1060 

[... repeated pyparsing internal frames elided ...]

~/anaconda3/lib/python3.6/site-packages/pyparsing.py in parseImpl(self, instring, loc, doActions)
   2411             (self.matchLen==1 or instring.startswith(self.match,loc)) ):
   2412             return loc+self.matchLen, self.match
-> 2413         raise ParseException(instring, loc, self.errmsg, self)
   2414 _L = Literal
   2415 ParserElement._literalStringClass = Literal

ParseException: Expected {SelectQuery | ConstructQuery | DescribeQuery | AskQuery} (at char 36), (line:1, col:37)

Descriptive statistics of the query log

TODOS
  • How many queries per user (classified per type of user)?
  • What percentage of queries use internal / external links?
    • Evolution of that over time?
  • What is the length (VLDB had that) / expressivity / ... of queries in QWL (queries with links) vs. QWOL (queries without links)?
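The internal/external-links percentage from the TODO list can be approximated lexically. A sketch under the assumption that URIs on `www.wikidata.org` and `www.w3.org` count as internal and everything else as external:

```python
import re
from urllib.parse import unquote

URI_RE = re.compile(r'<(https?://[^/>]+)')
INTERNAL_HOSTS = {'www.wikidata.org', 'www.w3.org'}  # assumed definition of "internal"

def usesExternalLink(query):
    # True if the decoded query mentions any URI outside the internal hosts
    hosts = URI_RE.findall(unquote(query))
    return any(h.split('//')[1] not in INTERNAL_HOSTS for h in hosts)

# Percentage over the log (not run here):
# data['#anonymizedQuery'].apply(usesExternalLink).mean() * 100
```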

Tests on analysing queries

query = readableQuery.replace("\n"," ")
query2 = query.replace("+"," ")
query2
'SELECT DISTINCT ?var1  ?var2  ?var3  WHERE {   BIND (  <http://www.wikidata.org/entity/Q641161>  AS  ?var1 ).   ?var1  <http://www.wikidata.org/prop/P150>  ?var4 .   ?var4  <http://www.wikidata.org/prop/statement/P150>  ?var3 .  OPTIONAL {   ?var1 ( <http://www.wikidata.org/prop/direct/P582> | <http://www.wikidata.org/prop/direct/P576> ) ?var5 .  }  OPTIONAL {   ?var3 ( <http://www.wikidata.org/prop/direct/P582> | <http://www.wikidata.org/prop/direct/P576> ) ?var5 .  }  OPTIONAL {   ?var4  <http://www.wikidata.org/prop/qualifier/P582>  ?var5 .  }  FILTER (  ( !( BOUND (  ?var5  ) ) )  ) .  OPTIONAL {   ?var3  <http://www.w3.org/2000/01/rdf-schema#label>  ?var2 .  FILTER (  ( (  LANG (  ?var2  )  =  "en" ) )  ) .  } } '
#working with readableQuery might still be better

How are explicit / implicit links used? Or, to include the retrieval of entities, how are data sets queried in conjunction?


What do links bring into play?

  • How can links be used (in general)?
    • (Browsing) Follow-your-nose style, looking up HTTP URIs
    • (Querying) To get data from source and target in various ways:
      • Mandatory / OPTIONAL
      • As the centrally wanted data (in the SELECT) or as an intermediate step (only in the WHERE)
      • Query Federation of two data sets (source and target) or more (!)
  • What properties can we analyse from links that are used in SPARQL queries?
  • How to measure the value that links bring to the query (results)?

    • Expected value: what the query executor defines in the query
    • Real value: how the results match the expected (and maybe whether they bring something additional?)
    • Note that what I had in SeaStar is like the potential value, and this is the practical value (the one people exploit via queries). Actually comparing these two would be very good! Even if I don't have the diversity-based measures, just diff and count. But it should not be only descriptive work; add some hypothesis testing too.
    • It would also be quite cool to crawl GitHub code that uses SPARQL to see how people are in general using links, because after all, I don't need more than queries. What a query log gives me in addition is the certainty that the queries were indeed executed and also a sequence of queries.
    • Is this different depending on the role of the data set? For example, Wikidata and DBpedia centralize links because everyone wants to be linked to them, as entity disambiguators / for visibility etc.
  • (Don't know where to put this in the section classification, but) looking at sequences of queries and the influence of federation on any dimension related to the sequenced queries would be really interesting.
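Several of the usage dimensions above (mandatory vs. OPTIONAL, federation, link fixed in the projection vs. only in the pattern) can be flagged lexically before investing in full algebra analysis. A rough, assumption-laden sketch (keyword heuristics, not a parser):

```python
import re
from urllib.parse import unquote

def linkUsageFlags(query):
    # Crude per-query flags for the link-usage dimensions discussed above
    q = unquote(query).replace('+', ' ').replace('\n', ' ')
    select_part = q.split('WHERE', 1)[0]  # heuristic: everything before WHERE
    return {
        'optional': 'OPTIONAL' in q,                            # non-mandatory pattern present
        'federated': re.search(r'\bSERVICE\b', q) is not None,  # SPARQL 1.1 federation
        'uri_in_select': '<http' in select_part,                # URI already fixed before WHERE
    }
```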

IMPORTANT NOTE: If no one in the literature (Fed. Query?) has looked into it, I can develop some measures to assess the value that links bring into the query. If I can turn these into indicators for the data publisher when linking (e.g. for a SeaStar extension), that would be very useful to motivate people to link further.

TODOS
  • Based on what I would like to observe, I need some methods to annotate the queries, and come up with a structured db that I can later exploit
  • What questions do I have and what hypotheses?
  • low-level things
    • in the results (which I would need to retrieve, noting when each query was executed, or even better do it with a JSON dump), can I identify URIs with different URI bases to see the effects of the federated query?
    • to look into how things are asked (values / URIs), the place to look is not the results but the query itself. If I can, I should restrict myself to queries only.
Unfocused thoughts
  • Anything around utility theory?
    • Any relation about the status of the entities and what the queries ask about them? That would be a lot of work actually
  • Can I go into things like curiosity theory or something? Associationism here?
  • Exploratory queries vs. xx vs. ... (what were the information needs? User intent from Broder 2002: informational, transactional, navigational). If I could do that with a focus on the linked data perspective ...
  • do different agents of different domains (i.e. querying the federated source of another domain) do something different?