Wikidata query duplicates

2.3k views Asked by At

Sorry if my english is bad, but I don't really have any place where I can ask this question in my native language. I've been trying to create SPARQL query for Wikidata that should create a list of all horror fiction that was created in 1925-1950 years, names of authors and, if available, pictures:

SELECT DISTINCT ?item ?itemLabel ?author ?name ?creation ?picture
WHERE
{
    ?item wdt:P136 wd:Q193606 . # book
    ?item wdt:P50 ?author .   # author
    ?item wdt:P577 ?creation .
    ?item wdt:P577 ?end .
  ?author rdfs:label ?name .    
  OPTIONAL{ ?item wdt:P18 ?picture }
  FILTER (?creation >= "1925-01-01T00:00:00Z"^^xsd:dateTime) .
  FILTER (?end <= "1950-12-31T23:59:59Z"^^xsd:dateTime) .

SERVICE wikibase:label
{ 
bd:serviceParam wikibase:language "en" .
} 
}

However, for some reason this query placing duplicates in the list. DISTINCT doesn't do much. After some time I figured out that the reason is "?item rdfs:label ?name .". If this line is removed, no duplicates are listed. But I need this line to show author name in the list! Any ideas on how to fix this?

3

There are 3 answers

4
maxlath On

You don't need to use ?item rdfs:label ?name . as you already get items labels as ?itemLabel thank to SERVICE wikibase:label.

Then, you will get duplicate results for every items that have a SELECTed property with possibly multiple values: here, you are SELECTing authors (P50), which will create duplicates for every item with several authors.

5
Kiran.B On

The query is actually giving you distinct items. The problem is that some items have multiple rdfs:labels. You can see as an example the item:

SELECT *
WHERE
{
   wd:Q2882840 rdfs:label ?label

SERVICE wikibase:label
{ 
bd:serviceParam wikibase:language "en" .
} 
}

And since there are multiple rdfs:label predicates for some items, they are showing up in separate rows.

1
Shlomi Uziel On

You can aggregate your results according to the book title (the item's label) using the

group by

keyword. Thus, every result will be a group which will show up once, and other fields which have different values, will be aggregated using the separator (in this case, a comma).

The fixed query:

SELECT DISTINCT ?item ?itemLabel 
(group_concat(distinct ?author;separator=",") as ?author)
(group_concat(distinct ?name;separator=",") as ?name)
 (group_concat(distinct ?creation;separator=",") as ?creation)
 (group_concat(distinct ?picture;separator=",") as ?picture)
WHERE
{
    ?item wdt:P136 wd:Q193606 . # book
    ?item wdt:P50 ?author .   # author
    ?item wdt:P577 ?creation .
    ?item wdt:P577 ?end .
  ?author rdfs:label ?name .    
  OPTIONAL{ ?item wdt:P18 ?picture }
  FILTER (?creation >= "1925-01-01T00:00:00Z"^^xsd:dateTime) .
  FILTER (?end <= "1950-12-31T23:59:59Z"^^xsd:dateTime) .

SERVICE wikibase:label
{ 
bd:serviceParam wikibase:language "en" .
} 
}
group by ?item ?itemLabel