Friday, February 29, 2008

Keeping subjectivity out of CL...

Reading the announcement about the COLING workshop on "human judgements in Computational Linguistics":


Human judgements play a key role in the development and the assessment of linguistic resources and methods in Computational Linguistics. [...]
We invite papers about experiments that collect human judgements for Computational Linguistic purposes, with a particular focus on linguistic tasks that are controversial from a theoretical point of view (e.g., some coding tasks having to do with semantics or pragmatics). Such experimental tasks are usually difficult to design and interpret, and they typically result in mediocre inter-rater reliability.


So let me think. Coding tasks "having to do with semantics or pragmatics" "typically result in mediocre inter-rater reliability". Seriously? So are we back to the 50s and Chomsky's concept of "ungrammaticality"? The dominance of syntax and the marginalization of semantics and pragmatics as "subjective"?
Now that we have finally made it through enough of Chomsky 50+ years later, now that CL is finally breaking free of attempts to formalize semantics, now that we have finally figured out how to relate language and information theory, why willingly take a turn back and look at "human judgements"? Language is definitely not created in a vacuum. Virtually every level of natural language (and hence also of CL) is potentially subjective, in that it inevitably reflects the 'theory' of the linguist who looks at it. There is no way around this. Claiming that "some coding" is subjective implies that some other "coding" is not. Well, the point is that if it is not, then it has nothing to do with *natural* language.
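To make "inter-rater reliability" concrete: below is a minimal sketch of Cohen's kappa, the usual chance-corrected agreement statistic behind such claims. The annotators, labels and numbers are made up for illustration.

    from collections import Counter

    def cohens_kappa(rater_a, rater_b):
        # kappa = (p_o - p_e) / (1 - p_e): observed agreement corrected
        # for the agreement expected by chance from each rater's label
        # distribution.
        n = len(rater_a)
        p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
        dist_a, dist_b = Counter(rater_a), Counter(rater_b)
        labels = set(dist_a) | set(dist_b)
        p_e = sum((dist_a[l] / n) * (dist_b[l] / n) for l in labels)
        return (p_o - p_e) / (1 - p_e)

    # Two made-up annotators coding the same 10 utterances for a
    # pragmatic category:
    a = ["irony", "literal", "irony", "literal", "literal",
         "irony", "irony", "literal", "irony", "literal"]
    b = ["irony", "literal", "literal", "literal", "irony",
         "irony", "irony", "literal", "irony", "irony"]
    print(cohens_kappa(a, b))  # 0.4 -- "mediocre" agreement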

Mental exercise of the day

Focus on the negative and you will immediately be inundated by an avalanche of negative experiences.

Focus on the positive, and more positive will turn up out of the blue...

On idiotic management...

Managing people is hard enough. Managing smart people is definitely harder.
Managing smart people who constantly blabber about new technology must scare the hell out of most managers.

What takes the cake is:
Managing people and technology and making decisions without listening to your experts.
Managing people and technology and being too scared to make any decisions.

Thursday, February 28, 2008

words....

Interesting neologisms of the day (to be read only with a sense of humor):

celebritology, noun:
1. the study of the lives of celebrities
2. the endless gossip about Britney's life
3. the main subject of attention of People magazine


chatological, adj. (as in "chatological humor", reminiscent of "eschatological"):
1. of or relating to the system or theory of online chats, online chat rooms, and any other online life species
2. of or relating to the branch of logic dealing with the same...


Interesting syntactic phenomenon of the day:

Clapton Invited to Play North Korea*

Lucky North Korea will be played by Clapton...

* To confirm the intended meaning of this construction, look at the article

Monday, February 25, 2008

If it's text processing, it is also batch processing...

Here's a consequence of isolating web services from real text processing:

What about batch processing? A service that accepts one document per request is of little help when you have a whole corpus to push through.
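A minimal sketch of the loop everyone ends up writing, assuming a hypothetical one-document-per-request endpoint (the URL and rate limit below are made up):

    import time
    import urllib.request

    # Placeholder endpoint: the service accepts one document per call,
    # so "batch" degenerates into a rate-limited loop of HTTP requests.
    SERVICE_URL = "http://example.com/extract"  # hypothetical, not a real API

    def process_corpus(documents, delay_seconds=1.0):
        results = []
        for doc in documents:
            req = urllib.request.Request(
                SERVICE_URL,
                data=doc.encode("utf-8"),
                headers={"Content-Type": "text/plain"},
            )
            with urllib.request.urlopen(req) as resp:
                results.append(resp.read().decode("utf-8"))
            time.sleep(delay_seconds)  # respect the service's rate limit
        return results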

Tired of superficial text processing

I've been looking at CALAIS, the free web service from Reuters for performing information extraction. The extraction technology under the hood is based on ClearForest's proprietary rule-based information extraction language, DIAL. Like SRA's and Inxight's (now Business Objects and SAP) similar tools of the trade, languages of this type are tailor-made for general-purpose information extraction. The biggest asset of working with such tools is that they allow the developer to go as deep as she can and extend the original NLP basis to meet specific customer data and requirements.

However, tools like CALAIS shift the focus heavily from the underlying NLP/IE technology to the web services and I/O front, and related bells and whistles. They even offer "bounties" for innovative web service applications built on CALAIS. All this while the single most important and most attractive element of the tool, its NLP extensibility, remains concealed and under wraps, with a roadmap promise of a release by the end of the year. Until then the tool runs with its out-of-the-box IE capabilities, which are, arguably, pretty limited and only impressive to those with little prior NLP/IE experience. Does someone have their priorities screwed up?
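DIAL's syntax is not public, but the flavor of rule-based extraction is easy to sketch. Below is a toy Python stand-in (emphatically not DIAL) for the kind of rule a developer would add when extending such a system:

    import re

    # A toy stand-in for a rule-based IE rule (not DIAL syntax): match
    # "Title Firstname Lastname" and emit a typed entity.
    PERSON_RULE = re.compile(
        r"\b(Mr\.|Ms\.|Dr\.|President)\s+([A-Z][a-z]+(?:\s+[A-Z][a-z]+)+)")

    def extract_persons(text):
        return [{"type": "Person", "title": m.group(1), "name": m.group(2)}
                for m in PERSON_RULE.finditer(text)]

    print(extract_persons("Dr. Jane Smith met President Lee Myung Bak."))
    # [{'type': 'Person', 'title': 'Dr.', 'name': 'Jane Smith'},
    #  {'type': 'Person', 'title': 'President', 'name': 'Lee Myung Bak'}]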

Sunday, February 24, 2008

Semantic Web, Wikipedia, Text-Linguistics and Information Extraction

Paraphrasing an article on AI3 published on Feb 18, 2008: there is finally a recognized, collective wave of information extraction tools and sources that mine Wikipedia in order to help enrich the Semantic Web.

Here are a few instances that showcase the successful marriage of text-linguistics and information extraction (a sketch of the first two follows the list):
-First paragraphs: Wikipedia articles are mined for term definitions. It is standard text structure (even in Wikipedia articles) for the first paragraph to outline the main terms discussed in the article. For this reason, initial paragraphs are good places to look up term definitions.
-Redirects: Mining redirects for synonymous terms, spelling variations, abbreviations and other "equivalents" of a term is just par for the course.
-Document Title: This is where we locate named entities, domain-specific terms and semantic variants.
-Subject Line and Section Headings: For category identification (topic classification).
-Full text: Whereas the first paragraph defines new terms, it is in the rest of the document that one finds a full description of their meaning, along with related terms, translations and other collocations (linguistic context).
-Embedded article links: Links to and from other pages provide more related terms, potential synonyms, and clues for disambiguation and categorization.
-Embedded Lists (and other Hierarchies): Look here for hyponyms, meronyms and other semantic relationships among related terms.
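Here is the promised sketch for the first two cues, assuming raw wikitext as input; the heuristics are deliberately crude:

    import re

    def first_paragraph(wikitext):
        # Crude heuristic: skip leading blank lines, templates (and their
        # bodies), images and category links; the first blank line after
        # that ends the opening paragraph.
        collected = []
        for line in wikitext.splitlines():
            stripped = line.strip()
            if not collected and (not stripped or stripped.startswith(
                    ("{{", "|", "}}", "[[Image:", "[[Category:"))):
                continue
            if collected and not stripped:
                break
            collected.append(stripped)
        return " ".join(collected)

    REDIRECT = re.compile(r"#REDIRECT\s*\[\[([^\]|#]+)", re.IGNORECASE)

    def redirect_target(wikitext):
        # "#REDIRECT [[Target]]" pages map synonyms, abbreviations and
        # spelling variants onto a canonical title.
        m = REDIRECT.match(wikitext.lstrip())
        return m.group(1).strip() if m else None

    print(redirect_target("#REDIRECT [[Eric Clapton]]"))  # Eric Clapton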

Notice that all of the above are overt structural elements of Wikipedia articles. This type of structure is not unique to Wikipedia, although Wikipedia's standards impose a conscious effort toward homogeneity. Detecting such structural clues in text is nothing new in the field of text-linguistics (for seminal work in the field check here).
What's new here is the application of text-linguistic analysis techniques to the Web (and Wikipedia in particular) for purposes of Web mining, Information Extraction and the Semantic Web initiative.

The output of such analyses and metrics helps populate ontologies and taxonomies, as well as link records. Areas of focus for these types of application are:
-subcategorization
-WSD, NER and NED (word sense disambiguation, named entity recognition and named entity disambiguation)
-semantic similarity and relatedness analysis and metrics (see the sketch below)
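For the last item, one simple relatedness measure over Wikipedia's link structure is the Jaccard overlap of two articles' outgoing links. A minimal sketch; the link sets are made up:

    def link_jaccard(links_a, links_b):
        # Relatedness of two articles as the overlap of their outgoing
        # link sets (a simple stand-in for fancier link-based measures).
        a, b = set(links_a), set(links_b)
        return len(a & b) / len(a | b) if (a | b) else 0.0

    # Made-up link sets for two articles:
    print(link_jaccard(
        {"Guitar", "Blues", "Grammy Award", "London"},
        {"Guitar", "Blues", "Piano", "Chicago"},
    ))  # 2 shared links / 6 distinct links = 0.33...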

Tools of the trade

Try NoteTab Light, a freeware text editor with some commercial features available for a 31-day trial. It allows embedded scripting (HTML, Perl, Gawk). It looks pretty loaded in comparison with Notepad.

Saturday, February 23, 2008

The hound that won the race...

Dedicated to my "Phriends" in the throes of the dissertation journey.
You can do it!

What is it with students?

OK -- you go to the trouble of contacting me with a question about my doctoral dissertation, which you have sitting in front of you. How flattering! Someone was actually interested enough to buy (and hopefully read?) my brainchild.
Now don't spoil the good news with 1) incredibly bad manners and 2) incredibly thick questions.
So, to attend to 1: do follow proper email etiquette. By this I mean use a basic "Hi (FirstName)" salutation when you address someone you don't know over email. And while we are at it, please resist the temptation of (inadvertently) offending me by ascribing my brainchild to someone else... Thank you.
To attend to 2: just do not expect people to give you ready answers to questions they belabored over for a while! It is called a doctoral dissertation for a reason! I didn't spend one day, one month or even one year on it, dude. Since you have bought it, do me a favor and actually read it, or at least browse through its pages. What else can I say? Then, once you are in a place to form intelligent and respectful questions, come back to me. It is called "research" for a reason!

watching the news...

The next big revolution will be in exercising judgment when it comes to information. Do I really need a daily dose of depression from CNN and the newspapers in order to "get it" that the economy is bad, that people are losing their jobs and, most crucially, that the government isn't doing much about it? I say stop watching all this negativity and start doing something about it in your daily life: vote, question political practices, live "in the present" and tune in to what's happening around you (CNN won't tell you how many of your neighbors lost their jobs in the last X months). Open your eyes, be present and use judgment and common sense when it comes to "mass media". Above all, THINK. Yes, use the substance in your skull that promotes intelligent life. Mass media perpetuate negativity. That is how it is, and they are not to blame; it's up to you to "buy it" or not. I say use your own mind and separate yourself from the "mass". Then you have better chances of staying positive and doing something in your life that promotes a change for the better.

two months later, I resurface...