Knowledge Innovation Research Center

Please refer to the papers for more details on the datasets available on this page. Email munyi AT kaist DOT ac DOT kr if more information is needed regarding these datasets.

KSE643 Dataset

The dataset for KSE643 assignment 4 is available here. This dataset is a snapshot of the IMDb movie data, from which 2,000 hit movies were selected for this assignment. Your task is to set up a simple version of a content-based movie recommender system. The dataset consists of three files: Movie, plotsummary, and keyword. Fields are separated by '|'. Informally, the fields have the following meanings:

Movie table:
Idmovie: unique identifier of movies
Ranking: a box-office record
Name: the name of the movie
Country: the country where the movie was made
ReleaseYear: the release year of the movie
Synopsis: synopsis of the movie
Certificate: film rating of the movie
Runtime: runtime of the movie

Plotsummary table:
IdplotSummary: unique identifier of the plot summary
Content: text of the plot summary
Movie_idmovie: id of the movie the plot summary belongs to

Keyword table:
Idkeyword: unique identifier of the keyword
Keyword: an individual keyword for a movie
Movie_idmovie: id of the movie the keyword belongs to
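
As a starting point for the assignment, the following is a minimal sketch of reading a '|'-separated table and ranking movies by keyword overlap. It is not the reference solution. The sample rows, the column order, the absence of a header row, and the helper names (read_pipe_table, keywords_by_movie, jaccard, recommend) are all assumptions made for illustration; check them against the actual files. Jaccard similarity is only one possible choice of content-based measure.

```python
import csv
import io

# Hypothetical sample rows in the assumed column order
# Idkeyword|Keyword|Movie_idmovie. The real file may differ.
KEYWORD_TEXT = """1|robot|10
2|future|10
3|robot|11
4|love|12
5|future|11
6|war|11
"""

KEYWORD_COLUMNS = ["Idkeyword", "Keyword", "Movie_idmovie"]


def read_pipe_table(text, columns):
    """Parse '|'-separated text into a list of dicts keyed by column name."""
    reader = csv.reader(io.StringIO(text), delimiter="|", quoting=csv.QUOTE_NONE)
    return [dict(zip(columns, row)) for row in reader if row]


def keywords_by_movie(rows):
    """Group keywords into one set per movie id."""
    result = {}
    for row in rows:
        result.setdefault(row["Movie_idmovie"], set()).add(row["Keyword"])
    return result


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(movie_id, keyword_sets, top_n=5):
    """Return up to top_n (movie id, score) pairs, most similar first."""
    target = keyword_sets[movie_id]
    scores = [
        (other, jaccard(target, kws))
        for other, kws in keyword_sets.items()
        if other != movie_id
    ]
    # Sort by descending score, then by id so ties are deterministic.
    scores.sort(key=lambda pair: (-pair[1], pair[0]))
    return scores[:top_n]


rows = read_pipe_table(KEYWORD_TEXT, KEYWORD_COLUMNS)
keyword_sets = keywords_by_movie(rows)
recommendations = recommend("10", keyword_sets)
```

With the sample rows above, movie 11 shares two of its three keywords with movie 10 and is ranked first, while movie 12 shares none. The same parsing helper could be reused for the Movie and plotsummary files by passing their column lists.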

Clinical Laboratory Test (CST17) Dataset

The CST17 dataset is available here. This dataset contains a list of queries extracted from CST-based clinical notes, together with their relevance judgements. Relevance is labeled from 1 (Not Relevant) to 3 (Definitely Relevant). To be "Definitely Relevant," a passage should provide information relevant to the particular patient described in the topic, such as the diagnosis, testing, and treatment of that patient. A passage is judged "Possibly Relevant" if an assessor believed it was not immediately informative on its own but might be relevant in the context of a broader literature review. Finally, a passage is judged "Not Relevant" if it does not provide any information relevant to the particular aspect of the patient described in the case report. Passages are extracted from three medical textbooks. The CSV files for passages are named by their ISBNs.
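
The three-level relevance scale can be sketched in code as follows. This is an illustration only: the (query id, passage id, grade) tuple layout, the sample values, and the helper names are hypothetical, and the actual file layout should be taken from the dataset itself. The mapping of grade 2 to "Possibly Relevant" follows from the ordering of the scale described above.

```python
from collections import Counter

# Grade-to-label mapping for the three-level relevance scale.
RELEVANCE_LABELS = {
    1: "Not Relevant",
    2: "Possibly Relevant",
    3: "Definitely Relevant",
}

# Hypothetical judgements as (query id, passage id, grade) tuples.
SAMPLE_JUDGEMENTS = [
    ("q1", "p1", 3),
    ("q1", "p2", 1),
    ("q1", "p3", 2),
    ("q2", "p1", 1),
]


def label_counts(judgements):
    """Count how many judgements fall under each relevance label."""
    counts = Counter()
    for _query, _passage, grade in judgements:
        counts[RELEVANCE_LABELS[int(grade)]] += 1
    return counts


def relevant_passages(judgements, query_id, min_grade=2):
    """Return passage ids judged at or above min_grade for one query."""
    return [
        passage
        for query, passage, grade in judgements
        if query == query_id and int(grade) >= min_grade
    ]


counts = label_counts(SAMPLE_JUDGEMENTS)
relevant_q1 = relevant_passages(SAMPLE_JUDGEMENTS, "q1")
```

Whether "Possibly Relevant" passages count as relevant depends on the evaluation setting; the min_grade parameter leaves that choice to the caller.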

Please refer to the following paper for a more detailed description of the dataset, and cite the paper if you use the dataset:

• The related paper is currently under review and will be available shortly.

Domain Ontology for Clinical Laboratory Test

The ontology for clinical laboratory tests is available here. The ontology consists of 4 classes, 4 properties, 668 instances, and 1,961 synonyms in the lexicons. The class named "Test" refers to a clinical laboratory test component for reporting a patient's health status, and it connects every class within the domain ontology. For instance, an amylase test is conducted on serum to examine whether or not the patient has a fructosuria. The other classes, "Specimen", "Category", and "Disease", are used to define properties of the corresponding part. "Specimen" defines which specimen is used to perform the given test, such as serum or urine. "Category" defines which category the given test belongs to. Finally, "Disease" holds disease information such as the disease name and the body part the disease affects.

Please refer to the following paper for a more detailed description of the dataset, and cite the paper if you use the dataset:

• The related paper is currently under review and will be available shortly.
