Sunday, February 24, 2008

Semantic Web, the Wikipedia, TextLinguistics and Information Extraction

To paraphrase an article published on AI3 on Feb 18, 2008: there is finally a recognized collective wave of information extraction tools and sources that mine Wikipedia to help enrich the Semantic Web.

Here are a few instances that showcase the successful marriage of text-linguistics and information extraction (a rough parsing sketch follows the list):
-First paragraphs: Wikipedia articles are being mined for term definitions. It is standard text structure (not only in Wikipedia articles) for the first paragraph to outline the main terms discussed in the article. For this reason, initial paragraphs are good places to look up term definitions.
-Redirects: Mining for synonymous terms, spelling variations, abbreviations and other "equivalents" of a term is just par for the course.
-Document Title: This is where we locate named entities and domain-specific terms or semantic variants.
-Subject Line and Section Headings: For category identification (topic classification).
-Full text: Whereas the first paragraph defines new terms, it is in the rest of the document that one finds a fuller description of the meaning, along with related terms, translations and other collocations (linguistic context).
-Embedded article links: Links to and from other pages provide more related terms, potential synonyms, and clues for disambiguation and categorization.
-Embedded Lists (and other Hierarchies): Look here for hyponyms, meronyms and other semantic relationships among related terms.
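Below is a minimal Python sketch of how one might pull some of these structural elements out of raw wikitext. It assumes the wikitext of a single article has already been fetched (e.g. via Special:Export or the MediaWiki API) and saved locally; the file name and the regular expressions are illustrative only, not a production parser.

import re

def first_paragraph(wikitext):
    """Return the first non-empty prose paragraph -- the usual place for a term definition."""
    # Drop simple (non-nested) templates and infoboxes so we start at real prose.
    text = re.sub(r"\{\{[^{}]*\}\}", "", wikitext)
    for block in text.split("\n\n"):
        block = block.strip()
        if block and not block.startswith(("=", "[[", "*", "#", "|")):
            return block
    return ""

def section_headings(wikitext):
    """Section headings (== Heading ==) as clues for topic classification."""
    return re.findall(r"^=+\s*(.*?)\s*=+\s*$", wikitext, re.MULTILINE)

def internal_links(wikitext):
    """Embedded article links [[Target|label]] -- related terms and disambiguation clues."""
    return [m.group(1) for m in re.finditer(r"\[\[([^\]|#]+)", wikitext)]

def list_items(wikitext):
    """Bulleted list entries -- candidate hyponyms/meronyms of the article's subject."""
    return re.findall(r"^\*+\s*(.+)$", wikitext, re.MULTILINE)

if __name__ == "__main__":
    sample = open("article.wikitext").read()  # hypothetical local copy of one article
    print(first_paragraph(sample))
    print(section_headings(sample))
    print(internal_links(sample)[:20])
    print(list_items(sample)[:20])

Nothing here is clever; the point is that each of the structural elements above maps onto a cheap, mostly regular pattern in the markup.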

Notice that all of the above are overt structural elements in Wikipedia articles. This type of structure is not unique to Wikipedia articles, although Wikipedia's standards impose a conscious effort at homogeneity. However, detecting such structural clues in text is nothing new in the field of text-linguistics (for seminal work in the field, check here).
What's new here is the application of text-linguistics analysis techniques to the Web (and Wikipedia in particular) for purposes of Web mining, Information Extraction and the Semantic Web initiative.

The output of such analyses and metrics helps populate ontologies and taxonomies, as well as link records. Areas of focus for these types of applications are (a toy metric follows the list):
-subcategorization
-word sense disambiguation (WSD), named entity recognition (NER) and named entity disambiguation (NED)
-semantic similarity and relatedness analysis and metrics
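As a toy illustration of the last bullet (and, indirectly, of disambiguation), here is a tiny Python sketch that scores relatedness between two articles as the Jaccard overlap of their outgoing-link sets. The link index below is hand-made stand-in data, not pulled from real Wikipedia pages; in practice it would be built from a dump or the API.

def relatedness(links_a, links_b):
    """Jaccard overlap of two articles' link sets (0 = unrelated, 1 = identical)."""
    a, b = set(links_a), set(links_b)
    if not a or not b:
        return 0.0
    return len(a & b) / float(len(a | b))

# Toy link index -- stand-in data for illustration only.
link_index = {
    "Jaguar (animal)": ["Felidae", "Panthera", "South America", "Carnivore"],
    "Leopard":         ["Felidae", "Panthera", "Africa", "Carnivore"],
    "Jaguar Cars":     ["Coventry", "Automobile", "Ford Motor Company"],
}

print(relatedness(link_index["Jaguar (animal)"], link_index["Leopard"]))      # high overlap
print(relatedness(link_index["Jaguar (animal)"], link_index["Jaguar Cars"]))  # no overlap

Jaguar (animal) and Leopard share most of their links, while Jaguar Cars shares none of them, which is exactly the kind of signal disambiguation and relatedness systems lean on.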

Tools of the trade

Try NoteTab Light, a freeware editor whose commercial features are available for a 31-day trial. It allows embedded scripting (HTML, Perl, Gawk). It looks pretty loaded compared with Notepad.