2nd International Workshop on Automated Information Extraction in Media Production (AIEMPro09)
Special Session at WIAMIS 2009
London, 6-8 May 2009
After its successful debut at DEXA 2008, the second edition of AIEMPro will take the form of a special session at WIAMIS 2009 (the International Workshop on Image Analysis for Multimedia Interactive Services).
Tentative deadlines:
Paper submission: 11 January 2009
Notification of reviews: 1 February 2009
Final camera ready (this is a STRICT DEADLINE): 13 February 2009
Areas of Interest (including, but not limited to):
· Efficient and real-time audiovisual indexing in acquisition
· Automated repurposing of archived material on new media channels
· Automated news production
· Efficient indexing and retrieval of multimedia streams
· Automatic speech recognition and personality identification
· Collaborative systems for media production
· Information retrieval systems for multimedia archives
· Automated material copyright infringement detection and material fingerprinting
· Content summarisation (e.g., sports highlights)
· Audiovisual genre and editorial format detection and characterisation
· Cross-media indexing and integration
· Content segmentation tools (e.g., shot and scene segmentation)
· Evaluation methods for multimedia analysis tools
Prospective authors must submit their work following the WIAMIS formatting instructions (http://wiamis2009.qmul.net/submissions.php) and send the paper in PDF format DIRECTLY to the organisers by e-mail.
Organisers:
Alberto Messina (RAI CRIT) a.messina@rai.it
Jean-Pierre Evain (European Broadcasting Union) evain@ebu.ch
Robbie De Sutter (VRT medialab) robbie.desutter@vrt.be
Everything that comes to mind about language, linguistic software, and life in between. Just an alternative linguist's blog, I guess.
Showing posts with label Information Extraction.
Monday, February 25, 2008
Tired of superficial text processing
I've been looking at CALAIS, Reuters' open web services for performing information extraction. The information extraction technology under the hood is based on ClearForest's proprietary rule-based extraction language, DIAL. Like the comparable tools of the trade from SRA and Inxight (now Business Objects and SAP), languages of this kind are tailor-made for general-purpose information extraction. The biggest asset of working with such tools is that they let the developer go as deep as she can and extend the original NLP basis in order to meet specific customer data and requirements. However, tools like CALAIS shift the focus heavily away from the underlying NLP/IE technology and towards the web services and I/O front and the related bells and whistles. They even offer "bounties" for innovative web service applications built on CALAIS. All this while the single most important and most attractive element of this tool, its NLP extensibility, remains concealed and under wraps, with a roadmap promise of a release by the end of the year. Until then the tool runs with its out-of-the-box IE capabilities, which are, arguably, pretty limited and only impressive to those with little prior NLP/IE experience. Does someone have their priorities screwed up?
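For what it's worth, the client side of such a service is trivial. Here is a minimal sketch in Python of what a call to an entity-extraction web service of this kind looks like; the endpoint URL, auth header and response fields are placeholders I made up for illustration, not the actual CALAIS API, so treat it as a shape rather than a recipe.

```python
# A hedged sketch of calling an entity-extraction web service.
# NOTE: the endpoint, header names and response layout below are hypothetical
# placeholders for illustration only; consult the real API documentation.
import requests

API_ENDPOINT = "https://api.example-extraction-service.com/enrich"  # hypothetical
API_KEY = "YOUR_LICENSE_KEY"                                        # hypothetical

def extract_entities(text: str) -> list[dict]:
    """Send raw text to the service and return a list of extracted entities."""
    response = requests.post(
        API_ENDPOINT,
        headers={
            "x-api-key": API_KEY,           # hypothetical auth header
            "Content-Type": "text/plain",
            "Accept": "application/json",
        },
        data=text.encode("utf-8"),
        timeout=30,
    )
    response.raise_for_status()
    payload = response.json()
    # Assume the service answers with {"entities": [{"type": ..., "name": ...}, ...]}
    return payload.get("entities", [])

if __name__ == "__main__":
    sample = "Reuters acquired ClearForest to power the CALAIS web services."
    for entity in extract_entities(sample):
        print(entity.get("type"), "->", entity.get("name"))
```

Which is precisely the point: all the interesting work happens behind that endpoint, and that is the part that stays locked away.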
Labels:
application development,
CALAIS,
ClearForest,
Information Extraction,
Inxight,
NLP,
Reuters,
SRA,
web services
Sunday, February 24, 2008
Semantic Web, the Wikipedia, TextLinguistics and Information Extraction
To paraphrase an article published on AI3 on Feb 18, 2008: there is finally a recognized, collective wave of information extraction tools and sources that mine Wikipedia in order to help enrich the Semantic Web.
Here are a few instances that showcase the successful marriage of text linguistics and information extraction:
-First paragraphs: Wikipedia articles are being mined for term definitions. It is standard text structure (even in Wikipedia articles) for the first paragraph to outline the main terms discussed in the article. For this reason, initial paragraphs are good places to look up terms in the document.
-Redirects: Mining for synonymous terms, spelling variations, abbreviations and other "equivalents" of a term is just par for the course.
-Document Title: This is where we locate named entities and domain-specific terms or semantic variants.
-Subject Line and Section Headings: For category identification (topic classification).
-Full text: Whereas in the first paragraph new terms are being defined, it is in the rest of the document that one will find a full description of the definition/meaning, along with related terms, translations and other collocations (linguistic context).
-Embedded article links: Links to and from external pages provide more related terms, potential synonyms, and clues for disambiguation and categorization.
-Embedded Lists (and other Hierarchies): Look here for hyponyms, meronyms and other semantic relationships among related terms.
Notice that all of the above are overt structural elements of Wikipedia articles. This type of structure is not unique to Wikipedia, although Wikipedia's standards impose a conscious effort towards homogeneity. Detecting such structural clues in text, however, is nothing new in the field of text linguistics (for seminal work in the field, check here).
What's new here is the application of text-linguistics analysis techniques to the Web (and Wikipedia in particular) for purposes of Web mining, Information Extraction and the Semantic Web initiative.
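To make this concrete, here is a rough sketch of how one might pull a few of these structural elements (the introductory paragraph, the redirects and the embedded article links) through the MediaWiki web API. The parameter names reflect the public API as I know it today and may need checking; take it as an illustration of the approach, not production code.

```python
# A rough sketch of mining structural elements of a Wikipedia article
# (intro paragraph, redirects, embedded links) via the MediaWiki API.
# Parameter names reflect the current public API and may need adjusting.
import requests

API = "https://en.wikipedia.org/w/api.php"

def get_json(params: dict) -> dict:
    return requests.get(API, params={"format": "json", **params}, timeout=30).json()

def first_paragraph(title: str) -> str:
    """The introductory extract: a good place to find term definitions."""
    data = get_json({"action": "query", "prop": "extracts",
                     "exintro": 1, "explaintext": 1, "titles": title})
    page = next(iter(data["query"]["pages"].values()))
    return page.get("extract", "")

def redirects_to(title: str) -> list:
    """Redirects: synonyms, spelling variants and abbreviations of the title term."""
    data = get_json({"action": "query", "list": "backlinks", "bltitle": title,
                     "blfilterredir": "redirects", "bllimit": "max"})
    return [r["title"] for r in data["query"]["backlinks"]]

def outgoing_links(title: str) -> list:
    """Embedded article links: related terms and clues for disambiguation."""
    data = get_json({"action": "query", "prop": "links",
                     "titles": title, "pllimit": "max"})
    page = next(iter(data["query"]["pages"].values()))
    return [link["title"] for link in page.get("links", [])]

if __name__ == "__main__":
    term = "Information extraction"
    print(first_paragraph(term)[:300])
    print(redirects_to(term)[:10])
    print(outgoing_links(term)[:10])
```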
The output of such analyses and metrics helps populate ontologies and taxonomies, as well as link records. Areas of focus for these types of applications are:
-subcategorization
-WSD, NER and NED
-semantic similarity and relatedness analysis and metrics
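On the last point, one simple family of relatedness metrics treats two Wikipedia articles as related to the degree that they link to the same things. The toy sketch below uses a plain Jaccard overlap of outgoing links as a crude stand-in for the more refined link-based measures in the literature; the article titles in the example are just placeholders.

```python
# A toy relatedness metric between two Wikipedia articles: the Jaccard overlap
# of their outgoing links, a crude stand-in for refined link-based measures.
import requests

API = "https://en.wikipedia.org/w/api.php"

def outgoing_links(title: str) -> set:
    data = requests.get(API, params={"action": "query", "prop": "links",
                                     "titles": title, "pllimit": "max",
                                     "format": "json"}, timeout=30).json()
    page = next(iter(data["query"]["pages"].values()))
    return {link["title"] for link in page.get("links", [])}

def link_relatedness(title_a: str, title_b: str) -> float:
    a, b = outgoing_links(title_a), outgoing_links(title_b)
    return len(a & b) / len(a | b) if a and b else 0.0

if __name__ == "__main__":
    # Placeholder article titles, chosen only to illustrate the call.
    print(link_relatedness("Information extraction", "Named-entity recognition"))
    print(link_relatedness("Information extraction", "Opera"))
```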
Sunday, December 16, 2007
ACL 2008
ACL-08: HLT will be held in Columbus, Ohio. Deadline for paper submissions is Jan 10, 2008.
Notice that Information Extraction makes for a separate subfield/umbrella of topics acceptable for submission.