Wiki-MID: a very large Multi-domain Interests Dataset of Twitter users with mappings to Wikipedia


New! (May 2020)

Message-based interests for both Italian and English are now provided with timestamps:

CLICK HERE TO DOWNLOAD

Wiki-MID Dataset

Wiki-MID is a LOD-compliant multi-domain interests dataset for training and testing Recommender Systems. Our English dataset includes an average of 90 multi-domain preferences per user on music, books, movies, celebrities, sport, politics and much more, for about half a million Twitter users traced during six months in 2017. Preferences are either extracted from messages of users who use Spotify, Goodreads and other similar content-sharing platforms, or induced from their "topical" friends, i.e., followees representing an interest rather than a social relation between peers. In addition, preferred items are matched with the Wikipedia articles describing them. This unique feature of our dataset provides a means to categorize preferred items by exploiting available semantic resources linked to Wikipedia, such as the Wikipedia Category Graph, DBpedia, BabelNet and others.

Data model:

Figure 1: The data model adopted for the design of our resource.

Our resource is designed on top of the Semantically-Interlinked Online Communities (SIOC) core ontology. The SIOC ontology favors the inclusion of data mined from social network communities into the Linked Open Data (LOD) cloud. As shown in Figure 1, we represent Twitter users as instances of the SIOC UserAccount class. Topical users and message-based user interests are then associated, through the Simple Knowledge Organization System (SKOS) predicate relatedMatch, with a corresponding Wikipedia page as a result of our automated mapping methodology.

Examples:

To illustrate the released resource, this section provides an example instance of the adopted data model. Let "https://twitter.com/intent/user?user_id=100000647" be a generic Twitter user and "https://twitter.com/intent/user?user_id=21447363" be the Twitter account of one of that user's friends (i.e. "Katy Perry").
The first two triples below define the two Twitter users as instances of the SIOC UserAccount class; the third states that the first user follows the second:

<https://twitter.com/intent/user?user_id=100000647> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://rdfs.org/sioc/ns#UserAccount>  .
<https://twitter.com/intent/user?user_id=21447363> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://rdfs.org/sioc/ns#UserAccount>  .  
<https://twitter.com/intent/user?user_id=100000647> <http://rdfs.org/sioc/ns#follows> <https://twitter.com/intent/user?user_id=21447363>  .
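
For readers who want to produce triples of this shape themselves, the three statements above can be sketched in plain Java as follows. This is only an illustration, not the code used to build the resource: the class name TripleWriterSketch and its methods are hypothetical, and the sketch assumes that subject, predicate and object are all IRIs that need no escaping.

```java
import java.util.List;

// Hypothetical helper, not part of the Wiki-MID code base.
final class TripleWriterSketch {

    static final String RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type";
    static final String SIOC_USER_ACCOUNT = "http://rdfs.org/sioc/ns#UserAccount";
    static final String SIOC_FOLLOWS = "http://rdfs.org/sioc/ns#follows";
    static final String TWITTER_USER = "https://twitter.com/intent/user?user_id=";

    // Serializes one triple whose three terms are all IRIs (assumed to need no escaping).
    static String triple(String subject, String predicate, String object) {
        return "<" + subject + "> <" + predicate + "> <" + object + "> .";
    }

    // Builds the IRI used in the examples for a numeric Twitter user id.
    static String userIri(long userId) {
        return TWITTER_USER + userId;
    }

    // Two type statements plus one follows statement, in the order shown above.
    static List<String> followTriples(long followerId, long followeeId) {
        String follower = userIri(followerId);
        String followee = userIri(followeeId);
        return List.of(
                triple(follower, RDF_TYPE, SIOC_USER_ACCOUNT),
                triple(followee, RDF_TYPE, SIOC_USER_ACCOUNT),
                triple(follower, SIOC_FOLLOWS, followee));
    }

    public static void main(String[] args) {
        followTriples(100000647L, 21447363L).forEach(System.out::println);
    }
}
```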

With the following triple we define that "https://twitter.com/intent/user?user_id=21447363" is a related match to the Wikipedia page "Katy_Perry":

<https://twitter.com/intent/user?user_id=21447363> <http://www.w3.org/2004/02/skos/core#relatedMatch> <https://en.wikipedia.org/wiki/Katy_Perry> .

With the following triple we define that "https://twitter.com/intent/user?user_id=100000647" is interested in the movie "http://www.imdb.com/title/tt5537374":

<https://twitter.com/intent/user?user_id=100000647> <http://rdfs.org/sioc/ns#likes> <http://www.imdb.com/title/tt5537374> .

And finally, with the following triple we map the extracted interest to the corresponding Wikipedia page:

<http://www.imdb.com/title/tt5537374> <http://www.w3.org/2004/02/skos/core#relatedMatch> <https://it.wikipedia.org/wiki/Prison_Break> .
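
Taken together, the likes and relatedMatch triples let a consumer go from a user to the Wikipedia pages of that user's interests. The following is a minimal sketch of that join in plain Java, under the assumption that every relevant line has IRIs in all three positions, as in the examples above. The class name InterestResolverSketch and the method wikipediaInterests are hypothetical, and the regular expression is not a complete N-Triples parser; for real workloads an RDF library such as Jena is the safer choice.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the Wiki-MID code base.
final class InterestResolverSketch {

    static final String SIOC_LIKES = "http://rdfs.org/sioc/ns#likes";
    static final String SKOS_RELATED_MATCH = "http://www.w3.org/2004/02/skos/core#relatedMatch";

    // Matches a line made of three IRIs followed by a dot: "<s> <p> <o> ."
    private static final Pattern IRI_TRIPLE =
            Pattern.compile("^\\s*<([^>]*)>\\s+<([^>]*)>\\s+<([^>]*)>\\s*\\.\\s*$");

    // Returns the Wikipedia pages reachable from `user` through likes + relatedMatch.
    static List<String> wikipediaInterests(List<String> lines, String user) {
        List<String> likedItems = new ArrayList<>();
        // Simplification: if an item has several matches, only the last one read is kept.
        Map<String, String> pageOfItem = new HashMap<>();

        for (String line : lines) {
            Matcher m = IRI_TRIPLE.matcher(line);
            if (!m.matches()) {
                continue; // skip blank lines and anything that is not an all-IRI triple
            }
            String subject = m.group(1);
            String predicate = m.group(2);
            String object = m.group(3);
            if (predicate.equals(SIOC_LIKES) && subject.equals(user)) {
                likedItems.add(object);
            } else if (predicate.equals(SKOS_RELATED_MATCH)) {
                pageOfItem.put(subject, object);
            }
        }

        // Join after the full pass so the order of lines in the input does not matter.
        List<String> pages = new ArrayList<>();
        for (String item : likedItems) {
            String page = pageOfItem.get(item);
            if (page != null) {
                pages.add(page);
            }
        }
        return pages;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                "<https://twitter.com/intent/user?user_id=100000647> <http://rdfs.org/sioc/ns#likes> <http://www.imdb.com/title/tt5537374> .",
                "<http://www.imdb.com/title/tt5537374> <http://www.w3.org/2004/02/skos/core#relatedMatch> <https://it.wikipedia.org/wiki/Prison_Break> .");
        System.out.println(
                wikipediaInterests(lines, "https://twitter.com/intent/user?user_id=100000647"));
    }
}
```

The sketch holds all relatedMatch pairs in memory, which is fine for small samples but is unlikely to scale to the full dump without a proper triple store.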

License:

If you use the Wiki-MID Dataset in your research, please cite this publication:

    Di Tommaso, Giorgia and Faralli, Stefano and Stilo, Giovanni and Velardi, Paola
    Wiki-MID: a very large Multi-domain Interests Dataset of Twitter users with mappings to Wikipedia
    In Proceedings of the 17th International Semantic Web Conference, ISWC2018,
     2018, Monterey (California)

       Paper        Bibtex
The resources are licensed under:
      Creative Commons Attribution-NonCommercial-ShareAlike 4.0 License.

Download:

How to use it with Jena:

package it.uniroma1.wikimid;

import org.apache.jena.query.*;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.shared.PrefixMapping;

import java.io.*;

public class App {

    // Path to the N-Triples dump of the dataset (here, the Italian one).
    static final String datasetPath = "in/Wiki-MID_IT.nt";

    public static void main(String[] args) throws IOException {

        Model model = ModelFactory.createDefaultModel();
        model.setNsPrefixes(PrefixMapping.Standard);
        // Read from an InputStream rather than a FileReader, so decoding does not
        // depend on the platform default charset. The second argument of read()
        // is the base URI (none is needed here), not a character encoding.
        try (InputStream in = new FileInputStream(datasetPath)) {
            model.read(in, null, "N-TRIPLES");
        }
        // Select every resource typed as a SIOC UserAccount.
        String queryString = "SELECT ?x WHERE { ?x  <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://rdfs.org/sioc/ns#UserAccount> }";

        Query qry = QueryFactory.create(queryString);
        try (QueryExecution qe = QueryExecutionFactory.create(qry, model)) {
            ResultSet rs = qe.execSelect();

            while (rs.hasNext()) {
                QuerySolution sol = rs.nextSolution();
                System.out.println(sol);
            }
        }
    }

}
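
The query in the program above only lists user accounts. As a further illustration, the sketch below builds a SPARQL query that follows the likes and relatedMatch predicates described in the Examples section, returning each liked item of one user together with its Wikipedia page. The class InterestQuerySketch and the method interestsQuery are hypothetical names; the sketch uses only the standard library, and the resulting string is meant to be passed to QueryFactory.create in place of queryString.

```java
// Hypothetical helper, not part of the Wiki-MID code base.
final class InterestQuerySketch {

    // Builds a SELECT query returning (?item, ?page) pairs for one user IRI.
    static String interestsQuery(String userIri) {
        // Reject characters that would break out of the <...> IRI syntax.
        if (userIri.contains("<") || userIri.contains(">") || userIri.contains(" ")) {
            throw new IllegalArgumentException("Not a plain IRI: " + userIri);
        }
        return String.join("\n",
                "PREFIX sioc: <http://rdfs.org/sioc/ns#>",
                "PREFIX skos: <http://www.w3.org/2004/02/skos/core#>",
                "SELECT ?item ?page WHERE {",
                "  <" + userIri + "> sioc:likes ?item .",
                "  ?item skos:relatedMatch ?page .",
                "}");
    }

    public static void main(String[] args) {
        System.out.println(
                interestsQuery("https://twitter.com/intent/user?user_id=100000647"));
    }
}
```

Replacing sioc:likes with sioc:follows in the first pattern would, under the same data model, return the topical friends of the user and their Wikipedia pages instead.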

Software Repository:

We provide a code repository to share part of the pipeline components used for the construction of the Wiki-MID resource: https://github.com/stefanofaralli/wikimid