Friday, March 14, 2008

The unbearable lightness of being

See the movie if you can. Better yet, read the book (by Kundera).
A bit postmodern by some standards, but I find it agrees with me these days.

Tuesday, March 4, 2008

semantic computing and "vaguely-formulated human intentions"

The field of Semantic Computing (SC) brings together those disciplines concerned with connecting the (often vaguely-formulated) intentions of humans with computational content. This connection can go both ways: retrieving, using and manipulating existing content according to user's goals ("do what the user means"); and creating, rearranging, and managing content that matches the author's intentions ("do what the author means").


There is no such thing as "vaguely-formulated intentions of humans". Humans have intentions. Only the human who has expressed the specific intentions knows how vaguely or precisely those intentions were "formulated". How do we know? Usually vague intentions bring about vague results, or undesired (unintended) ones. It's all in the eye (and mind) of the beholder.

What's pertinent for CL and SemComputing here is the fact that intentions are hard to detect. We will never be 100% sure that we matched the intention of the author, nor should we even try. Regardless of what the author "meant" (which can be very imprecise and gauged only by the author's own judgement), a document bears evidence of such intentions. If we focus on the intentions evidently expressed in the document, we can happily dispense with the "intentions of humans".

Friday, February 29, 2008

Keeping subjectivity out of CL...

Reading the announcement about the COLING workshop on "human judgements in Computational Linguistics":


Human judgements play a key role in the development and the assessment of linguistic resources and methods in Computational Linguistics. [...]
We invite papers about experiments that collect human judgements for Computational Linguistic purposes, with a particular focus on linguistic tasks that are controversial from a theoretical point of view (e.g., some coding tasks having to do with semantics or pragmatics). Such experimental tasks are usually difficult to design and interpret, and they typically result in mediocre inter-rater reliability.


So let me think. "Coding tasks having to do with semantics and pragmatics" "typically result in mediocre inter-rater reliability". Seriously? So are we back to the 50s and Chomsky's concept of "ungrammaticality"? The dominance of syntax and the marginalization of semantics and pragmatics as "subjective"?
Now that, 50+ years later, we have finally made it past Chomsky, now that CL is finally breaking free of attempts to formalize semantics, now that we have finally figured out how to relate language and information theory, we willingly take a turn back and look at "human judgements"? Why? Language is definitely not created in a vacuum. Virtually every level of natural language (and hence also of CL) is potentially subjective, in that it inevitably reflects the 'theory' of the linguist who looks at it. There is no way around this. Claiming that "some coding" is subjective implies that some other "coding" is not. Well, the point is that if it is not, then it has nothing to do with *natural* language.
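
For the record, the "mediocre inter-rater reliability" the call mentions is usually quantified with a chance-corrected agreement coefficient such as Cohen's kappa. A minimal sketch (the coding task, labels and annotators below are made up for illustration):

from collections import Counter

def cohens_kappa(rater_a, rater_b):
    # Cohen's kappa: chance-corrected agreement between two annotators.
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    dist_a, dist_b = Counter(rater_a), Counter(rater_b)
    # Expected agreement if each rater labelled items at random according
    # to their own label distribution.
    expected = sum(dist_a[label] * dist_b[label] for label in dist_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical semantic coding task: two annotators tag the same ten
# utterances as 'literal' or 'ironic'.
a = ['literal', 'ironic', 'literal', 'literal', 'ironic',
     'literal', 'ironic', 'literal', 'literal', 'literal']
b = ['literal', 'literal', 'literal', 'ironic', 'ironic',
     'literal', 'ironic', 'literal', 'ironic', 'literal']
print(round(cohens_kappa(a, b), 2))  # 0.29 -- squarely in the "mediocre" range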

Mental exercise of the day

Focus on the negative and you will be immediately inundated by an avalanche of negative experience.

Focus on the positive, and more positive will turn up out of the blue...

On idiotic management...

Managing people is hard enough. Managing smart people is definitely harder.
Managing smart people who constantly blabber about new technology must scare the hell out of most managers.

What takes the cake is:
Managing people and technology and making decisions without listening to your experts.
Managing people and technology and being too scared to make any decisions.

Thursday, February 28, 2008

words....

Interesting neologisms of the day (only read with a sense of humor):

celebritology, noun:
1. the study of the lives of celebrities
2. the endless gossip about Britney's life
3. the main subject of attention of People magazine


chatological, adj. (as in "chatological humor", reminiscent of "eschatological"):
1. the system or theory concerning online chats, online chat rooms, and any other online life species
2. the branch of logic dealing with the same...


Interesting syntactic phenomenon of the day:

Clapton Invited to Play North Korea*

Lucky North Korea will be played by Clapton...

* to confirm the meaning of this schema, look at the article

Monday, February 25, 2008

If it's text processing, it is also batch processing...

Here's a consequence of isolating web services from real text processing:

What about batch processing?
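
To make that concrete: when the only interface to the text processing is a remote web service, "batch processing" a corpus turns into one HTTP round-trip per document (plus whatever rate limits the service imposes). The endpoint URL and request format below are placeholders, not any real service's API; it's just a sketch of the pattern:

import glob
import urllib.request

# Hypothetical extraction endpoint; substitute the real service's URL,
# authentication and parameters.
ENDPOINT = "http://example.com/extract"

def extract(text):
    # POST one document to the (hypothetical) IE web service.
    req = urllib.request.Request(ENDPOINT, data=text.encode("utf-8"),
                                 headers={"Content-Type": "text/plain"})
    with urllib.request.urlopen(req) as response:
        return response.read().decode("utf-8")

# "Batch processing": a network round-trip for every file in the corpus.
for path in glob.glob("corpus/*.txt"):
    with open(path, encoding="utf-8") as doc:
        result = extract(doc.read())
    # ... store or post-process the per-document result ...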

Tired of superficial text processing

I've been looking at CALAIS, Reuters' free web service for performing information extraction. The info extraction tech used under the hood is based on ClearForest's proprietary rule-based info extraction language called DIAL. Like SRA's and Inxight's (now Business Objects and SAP) similar tools of the trade, these languages are tailor-made for general-purpose info extraction. The biggest asset of working with such tools is that they allow the developer to go as deep as she can and extend the original NLP basis in order to meet specific customer data and requirements.

However, tools like CALAIS shift the focus heavily from the underlying NLP/IE technology to the web-service and I/O front and related bells and whistles. They even offer "bounties" for innovative web service applications built for CALAIS. All this while the single most important and most attractive element of this tool is its NLP extensibility! That part remains concealed and under wraps, with a roadmap promise of a release by the end of the year. Until then the tool runs with its out-of-the-box IE capabilities, which are (arguably) pretty limited and only impressive to those with little prior NLP/IE experience. Does someone have their priorities screwed up?
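
For the sake of contrast, here is the kind of thing that extensibility buys you: writing your own extraction rules against your own data. The snippet below is not DIAL (whose syntax I won't reproduce here), just a generic, regex-based illustration of a hand-written rule for a made-up acquisition event:

import re

# A hand-written rule for "<Company> acquired/bought <Company>".
# Real rule languages (DIAL and friends) add gazetteers, syntactic
# constraints and cascades; this only illustrates the general idea.
ACQUISITION = re.compile(
    r"(?P<buyer>[A-Z][\w&.]*(?:\s+[A-Z][\w&.]*)*)\s+"
    r"(?:acquired|bought)\s+"
    r"(?P<target>[A-Z][\w&.]*(?:\s+[A-Z][\w&.]*)*)"
)

def extract_acquisitions(text):
    # Return (buyer, target) pairs matched by the rule.
    return [(m.group("buyer"), m.group("target"))
            for m in ACQUISITION.finditer(text)]

print(extract_acquisitions("Reuters acquired ClearForest in 2007."))
# [('Reuters', 'ClearForest')]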

Sunday, February 24, 2008

Semantic Web, the Wikipedia, TextLinguistics and Information Extraction

To paraphrase an article published on AI3 on Feb 18, 2008: there is finally a recognized, collective wave of information extraction tools and sources that mine Wikipedia in order to help enrich the Semantic Web.

Here are a few instances that showcase the successful marriage of text-linguistics and information extraction:
-First paragraphs: Wikipedia articles are being mined for term definitions. It is standard text structure (even in Wikipedia articles) that the first paragraph outlines the main terms discussed in the article. For this reason, initial paragraphs are good places to look up terms in the document.
-Redirects: Mining for synonymous terms, spelling variations, abbreviations and other "equivalents" of a term is just par for the course.
-Document Title: This is where we locate named entities and domain-specific terms or semantic variants.
-Subject Line and Section Headings: For category identification (topic classification).
-Full text: Whereas in the first paragraph new terms are being defined, it is in the rest of the document that one will find a full description of the definition/meaning, along with related terms, translations and other collocations (linguistic context).
-Embedded article links: Links to and from other pages provide more related terms, potential synonyms, and clues for disambiguation and categorization.
-Embedded Lists (and other Hierarchies): Look here for hyponyms, meronyms and other semantic relationships among related terms.

Notice that all of the above are overt structural elements in Wikipedia articles. This type of structure is not unique to Wikipedia articles, although Wikipedia standards impose a conscious effort toward homogeneity. However, detecting such structural clues in text is nothing new in the field of text-linguistics (for seminal work in the field check here).
What's new here is the application of text-linguistics analysis techniques to the Web (and Wikipedia in particular) for purposes of Web mining, Information Extraction and the Semantic Web initiative.

The output of such analyses and metrics helps populate ontologies and taxonomies, as well as link records. Areas of focus for these types of applications are:
-subcategorization
-WSD, NER and NED
-semantic similarity and relatedness analysis and metrics
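
As a rough illustration of how overt those structural clues are, here is a sketch that pulls a few of them (title, first paragraph, section headings, internal links) straight out of raw wikitext. A real pipeline would use a proper wikitext parser or the MediaWiki API; the regular expressions below are deliberate simplifications:

import re

def mine_wikitext(title, wikitext):
    # Pull a few overt structural elements out of raw wiki markup.
    # Section headings (== Heading ==): clues for topic classification.
    headings = re.findall(r"^=+\s*(.+?)\s*=+\s*$", wikitext, re.MULTILINE)
    # Internal links ([[Target]] or [[Target|anchor]]): related terms,
    # synonym candidates, disambiguation clues.
    links = re.findall(r"\[\[([^\]|#]+)", wikitext)
    # First paragraph: the first non-empty line that is not a heading or
    # other markup -- typically where the main terms are defined.
    first_para = next((line for line in wikitext.splitlines()
                       if line.strip()
                       and not line.startswith(("=", "{", "[", "*"))), "")
    return {"title": title,          # candidate named entity / domain term
            "definition": first_para,
            "headings": headings,
            "related_terms": links}

sample = """'''Information extraction''' (IE) is the task of automatically
extracting structured information from unstructured text.

== Approaches ==
Rule-based systems use hand-written patterns; see [[regular expression]]s
and [[named-entity recognition]].
"""
print(mine_wikitext("Information extraction", sample))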

Tools of the trade

Try NoteTab Light, a freeware editor with some commercial features available for a 31-day trial. It allows embedded scripting (HTML, Perl, Gawk). It looks pretty loaded compared with Notepad.