Ramifications of a linguist's life: 2008

Wednesday, December 31, 2008

Down with linguistic purity!

If you surround yourself with people who overuse such language, kindly do yourself and all of us a favor and simply remove yourself from the particular linguistic environment. Why keep nagging? Language is inevitable and slang or various linguistic fads are part of life (and language). If it weren't for those, various mainstream NLP techniques would have a hard time programming in probabilities for single-word transitions (okay, that's a little NLP joke). Besides, most of the language listed in the link above is teenage-speak.

New Year's wish

[cartoon by Nick Galifianakis for the Washington Post]

Tuesday, December 30, 2008

Read someone the riot act

Here's the gist of the Riot Act [enforced by the British government in 1715]:

"Our sovereign Lord the King chargeth and commandeth all persons, being assembled, immediately to disperse themselves, and peaceably to depart to their habitations, or to their lawful business, upon the pains contained in the act made in the first year of King George, for preventing tumults and riotous assemblies. God save the King."

In other words (aka the British way): 'you noisy louts, don't you know there are people here trying to sleep?'
OR (the un-cut and slightly un-kind and definitely non-British version):
tell someone(s) to "Shut the F*&#$% Up!"...

[Source: The Phrase Finder]

Sunday, December 28, 2008

5 things

Getting to know me:

5 Things I was Doing 10 years Ago:

1. Moved across the Atlantic (round #1) to BigCity, VeryImportantState, USA, to start a PhD.
2. Broke up with Dutch bf.
3. Swimming in deep waters as I was trying something completely different academically.
4. Obtained a Fulbright scholarship.
5. Was online dating.

5 Things On My To-Do List Today:

1. Have a hot bath.
2. Keep reading that FunReading NonFiction book.
3. Work for 1hr on the proposal for the collaborative paper.
4. Do online apartment hunting (looking to let).
5. Do stretching exercises.

5 Snacks I Love:

1. Soups (even the instant ones)
2. Ramen noodles (bad habit leftover from my student yrs)
3. Anything crunchy (cucumbers, carrots, dried fruits and seeds)
4. Tea and cookies
5. Nuts

5 Things I Would Do If I Were A Millionaire:

1. Move to place near sea/beach, with plenty of sunlight and warm temperatures all year long.
2. Adopt a pet.
3. Fund a disabilities group.
4. Build my dream house.
5. Pay off my parents' debt.

5 Places I've Lived:

1. Several cities in various U.S. states.
2. City in Cameroon (Africa).
3. London, UK.
4. Athens, Greece.
5. Amsterdam, The Netherlands.

5 Jobs I've Had:

1. Sales girl for Procter & Gamble.
2. Freelance programmer.
3. Reporter and Interpreter for a one-off private interview with a big-name Greek literature author in Athens (paid assignment).
4. Teacher.
5. Consulting Analyst and Computational Linguist for a global company.

5 blogs I'd like to tag:

Young Female Scientist
See Jane compute
Joel on software
TechCrunch
Language hat

RegEx's

Here's a very good point about using regexes for those who basically can't imagine using them! I love using regexes and am constantly amazed by how misunderstood they tend to be. Yes, regexes won't clean your kitchen, but they're darn good for some quick (though somewhat dirty) data mining tasks. Their power lies in the programmer's creativity.
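To make the "quick and dirty data mining" point concrete, here is a minimal Python sketch. The sample text, the e-mail addresses and both patterns are made up for illustration; the e-mail pattern in particular is deliberately loose and is no validator.

```python
import re

# Hypothetical sample text (addresses and dates are invented for this sketch).
text = (
    "Paper submission: 11 January 2009. "
    "Questions to alice@example.org or bob@example.net before 13 February 2009."
)

# Day, capitalized month name, four-digit year (e.g. "11 January 2009").
DATE = re.compile(r"\b(\d{1,2}) ([A-Z][a-z]+) (\d{4})\b")

# Loose e-mail pattern: fine for a quick scan of a text dump, not for validation.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

dates = DATE.findall(text)    # list of (day, month, year) tuples
emails = EMAIL.findall(text)  # list of matched address strings

for day, month, year in dates:
    print(f"date: {day} {month} {year}")
for address in emails:
    print(f"email: {address}")
```

Two short patterns already pull structured records out of free text, which is about as far as "quick and dirty" should be pushed before reaching for a real parser.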

Thursday, December 25, 2008

Merry Xmas!

Monday, December 22, 2008

SpinVox (again)

SpinVox's voicemail/speech-to-text service is becoming increasingly important for my mobile communications. Today I noticed that although they had screwed up the names (a person name and a company name) on the message, they had actually just spelled them phonetically. The person who was calling was a stranger so I had no way to know the correct name. However, I took a few educated guesses about the possibly correct spellings based on the assumption that the resulting spelling was a phonetic representation of the input string. And I was right! Within seconds of googling, I found both the person's name and this person's affiliated company name.
A suggestion for SpinVox: Maybe run Soundex on your interface and pick the most frequently correct spelling for the names the engine recognizes. I think that would considerably improve performance.
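For the curious, here is roughly what I have in mind, as a minimal Python sketch of standard American Soundex. The `best_spelling` helper and the `known_counts` frequency table are my own hypothetical names, and I have no idea how SpinVox's engine actually works, so treat this as an illustration of the suggestion and nothing more.

```python
# Standard American Soundex digit classes.
CODES = {}
for digit, group in enumerate(["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1):
    for ch in group:
        CODES[ch] = str(digit)


def soundex(name: str) -> str:
    """Return the 4-character Soundex key for a name ('' if it has no ASCII letters)."""
    letters = [c for c in name.lower() if c.isascii() and c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not break a run of identical codes
        code = CODES.get(ch, "")  # vowels and y get the empty code
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so the same code may be emitted again
    return (result + "000")[:4]


def best_spelling(heard: str, known_counts: dict) -> str:
    """Pick the most frequent known name sharing the Soundex key of `heard`.

    `known_counts` is a hypothetical name -> frequency table; if nothing
    matches, the engine's original spelling is returned unchanged.
    """
    key = soundex(heard)
    candidates = [name for name in known_counts if soundex(name) == key]
    return max(candidates, key=known_counts.get) if candidates else heard


counts = {"Smith": 120, "Smyth": 7, "Schmidt": 30, "Jones": 95}
print(soundex("Robert"), soundex("Rupert"))
print(best_spelling("Smithe", counts))
```

One known weakness: Soundex always keeps the first letter, so spellings that differ in their initial (say, a C versus a K) never share a key. A production system would probably want a more forgiving phonetic code.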

Wednesday, December 17, 2008

Automated Information Extraction in Media Production

2nd International Workshop on Automated Information Extraction in Media Production (AIEMPro09)

Special Session at WIAMIS 2009

London, 6-8 May 2009

After the successful exordium at DEXA 2008, the second edition of AIEMPro will have the form of a special session at WIAMIS 2009 (The International Workshop on Image Analysis for Multimedia Interactive Services)

Tentative deadlines:

Paper submission: 11 January 2009
Notification of reviews: 1 February 2009
Final camera ready (this is a STRICT DEADLINE): 13th February 2009

Areas of Interest (not limited to):

· Efficient and real-time audiovisual indexing in acquisition
· Automated repurposing of archived material on new media channels
· Automated news production
· Efficient indexing and retrieval of multimedia streams
· Automatic speech recognition and personality identification
· Collaborative systems for media production
· Information Retrieval systems from Multimedia Archives
· Automated material copyright infraction detection and material fingerprinting
· Content summarisation (e.g., sports highlights)
· Audiovisual genre and editorial format detection and characterisation
· Cross-media indexing and integration
· Content segmentation tools (e.g., shot and scene segmentation)
· Evaluation methods for multimedia analysis tools

Prospective authors must submit their work following the WIAMIS formatting instructions (http://wiamis2009.qmul.net/submissions.php) and send the paper in PDF format DIRECTLY to the organisers by e-mail.

Organisers:
Alberto Messina (RAI CRIT) a.messina@rai.it
Jean-Pierre Evain (European Broadcasting Union) evain@ebu.ch
Robbie De Sutter (VRT medialab) robbie.desutter@vrt.be

SpinVox

I signed up to try SpinVox's voicemail-to-text service on my mobile phone. They quickly set that up and I was impressed that it's a complimentary demo. Performance seemed to be lacking in proper name entity recognition whereas not so much in catching exotic accents. I had a Danish friend leave a voicemail for me with the details of our following day's meeting. SpinVox caught everything my Danish friend said but for her (Danish) name, my (Greek) name, and the name of the place (unfortunately for SpinVox we were meeting at a local Starbucks, so no excuses for not catching that! /wahaha.......... /hmm).
To SpinVox's credit, the call was placed in the middle of the street, a lot of noise in the background and the caller had an accent.
However, in actual conditions (if I really depended on that converted to text voicemail for my meeting) their performance was poor and the text I got useless as unfortunately the proper names SpinVox missed were critical information. For instance, I wouldn't know who was calling since they screwed up their name, and I wouldn't know where she wanted to meet because SpinVox didn't catch the place name. So in that respect, although an admirable effort, it leaves a lot to be desired.
Maybe users could build their own local dictionaries of names based on their -say- address books. They could upload to the SpinVox's server a dictionary of names pronounced with the particular user's accent to help augment SpinVox's central server's dictionary of names and accents. Still, a lot of real-time speech comes with a high unpredictability factor as various callers are expected to call the particular user. Only an adaptive speech recognition system could actually learn from ad hoc input in order to improve itself. Imagine for instance if every time I had a new caller, SpinVox could learn to memorize and subsequently recognize their accent and linguistic model; so, if your boss has an American accent and usually talks about project XYZ and meeting you at Room 234B in Building ABC, SpinVox could learn to expect this type of "talk" (and accent) when he next calls you. That would of course improve speech recognition accuracy and it would involve the successful marriage of a memory (lexicon/vocabulary + accents) with an adaptive learning algorithm.
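A toy Python sketch of that "memory plus learning" idea follows. Everything in it is hypothetical: the `UserLexicon` class, the per-caller correction table and the 0.6 similarity cutoff are my own assumptions, and plain string similarity (Python's `difflib`) stands in for real acoustic modelling.

```python
from collections import defaultdict
from difflib import get_close_matches


class UserLexicon:
    """Per-user name memory: an address book plus corrections learned per caller."""

    def __init__(self, address_book):
        self.names = list(address_book)
        # caller -> {misrecognized text (lowercased): confirmed text}
        self.corrections = defaultdict(dict)

    def learn(self, caller, heard, confirmed):
        """Remember that this caller's `heard` text really meant `confirmed`."""
        self.corrections[caller][heard.lower()] = confirmed
        if confirmed not in self.names:
            self.names.append(confirmed)

    def resolve(self, caller, heard):
        """Prefer a learned correction, then a fuzzy address-book match, else `heard`."""
        remembered = self.corrections[caller].get(heard.lower())
        if remembered:
            return remembered
        # cutoff=0.6 is an arbitrary assumption for this sketch.
        matches = get_close_matches(heard, self.names, n=1, cutoff=0.6)
        return matches[0] if matches else heard


lexicon = UserLexicon(["Starbucks", "Katerina"])
print(lexicon.resolve("friend", "Star Bucks"))        # fuzzy address-book hit

lexicon.learn("boss", "Room to 34 b", "Room 234B")    # user confirms a correction
print(lexicon.resolve("boss", "room to 34 B"))        # remembered next time
```

The interesting part is `learn`: every confirmed correction makes the next call from the same person a little easier, which is the adaptive behaviour described above in miniature.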

Saturday, December 13, 2008

Stargazer

Party time in the sky! It started on Nov 24th with a Venus, Jupiter and crescent Moon Conjunction and continues tonight Dec 13th with a Venus, Jupiter and Full Moon conjunction.
Astrologically speaking, this triple conjunction of the planet of Love and Money (Venus), of abundance and generosity of spirit (Jupiter) and the Moon (emotions, feelings, psychological make-up) is in general an auspicious aspect. However, it doesn't tell the entire story. The Dec 12th Full Moon in Gemini forms a rather threatening square with Mars in Sagittarius and together form a second square with the opposition of Saturn in Virgo and Uranus in Pisces. Two oppositions squaring each other form a "Grand Cross". Anger (Mars), accidents (Uranus), health (Gemini) and money (Saturn) generate issues boiling under pressure. Take care, everybody!

Friday, December 12, 2008

Riots in Greece

What's up with that?
People outside Greece wonder whether the horrendous ("accidental") police shooting of a teenager can really cause such a fuss. If anyone knows Greeks and their (recent) history, one knows that Greeks just don't tolerate fools. At the same time, Greeks don't tolerate anarchy and, yes, part of this has been taken advantage of by the anarchist groups in Exarheia (Athens). But if you know Athens like I do, what's new!
The police shooting in Greece brings to mind a similar police shooting in London a few years ago. Again, an accident by the police. Again, no apologies given initially as the Brazilian guy was thought to be one of the terrorists the police were going after. But because he was not a UK citizen, the matter could not close as easily (or "accidentally") as it opened. There were protests and international intervention and inquiries. One must wonder how the UK government would have handled it if the police had accidentally killed a UK citizen...

Sunday, December 7, 2008

CFP CoNLL-2009

The Thirteenth Conference on Computational Natural Language Learning (CoNLL-2009) will be held in Boulder, CO, USA on June 4 and 5, 2009.

Saturday, December 6, 2008

And because I love good ol' Greek music...

Here's a treat...

And a personal attempt at translating the lyrics into English:

If you are a small cranky rock
in a desert
if you are a lonely cyclamen
on the top of a mountain
if you are a forgotten star
in the sky
how do you want me to know?

If you are night-time rain
falling in the ocean
if you are boat smoke
in the sea
if you are an old drawing
on a church wall
how do you want me to know?

If you are a thorn
in my heart
Me loving you so
How do you want me to know?

(Greek lyrics by Kostas Karidis, music by Spanos and singing by Karalis)

Tuesday, November 18, 2008

Temple Bar

Temple Bar is the reason people from all over the world visit Dublin.
Besides the innumerable pubs, bars and clubs, Temple Bar is simply gorgeous. Check my pictures and you will see what I mean :).

Here's a typical corner in the Old Town (aka Temple Bar). Watch the gorgeous wall colors:

In this picture you can actually see the canal at the far end of the cobblestone street:

I just love how almost all Temple Bar streets end at the face of beautiful old buildings:

Friday, November 7, 2008

Saint Patrick's Cathedral

Although I posted pictures of the buildings in the vicinity of Saint Patrick's Cathedral, I realized I didn't post a photo of the Cathedral itself. I promise to post a good one soon. /blush

Thursday, November 6, 2008

and one more...

One of the Christchurch Square buildings:

Christ Church Cathedral

I find this architecture food for the eye :):

The Christchurch Cathedral (dating back to the eleventh century):

Wednesday, November 5, 2008

Dublin Ireland

/hihi Last month I mentioned that I moved to Dublin. /shock
I wanted to show you some pictures so here goes.
This picture shows both the Christ Church (or "Christchurch" as I see written here) Cathedral and the surrounding buildings (and traffic) of the square:

No clue what this guy is doing in my photo. He was just passing by when I pressed the button /hmm.

The gorgeous old building below is close to Saint Patrick's, a couple of blocks down the road from the Christchurch Cathedral shown above.

Here's another gorgeous building at the corner of Patrick Street and Bull Alley Street in the vicinity of St. Patrick's Cathedral:

Check out the lamp post in the above! /omg I think it's very pretty!

More to follow soon! :)

John Grisham

The man knows how to write! His legal-world thrillers -categorized under Crime in the popular Irish Hodges Figgis bookstore in Dawson St- are not the only genre he can work on apparently! I recently had the pleasure to finish 'Playing for Pizza'

and I was very pleased. It made me laugh out loud (I mean really /wahaha) and a la typical Grisham way, I didn't want to leave it aside until it was done and over with! My current Grisham adventure includes 'Skipping Christmas' and so far, I'm sold. :D

Political landslide

Fancy that! /blur
Democrat Obama made history today.
I care not for the color of his skin but for the fresh air he brings in with him. Something went terribly wrong in the last 8yrs. Time to build things anew. Onward!

Saturday, October 25, 2008

From Silicon Valley to Dublin Ireland...

Crossing over the Atlantic one more time.
It's really beautiful here. Looks to me even more gorgeous than London where I lived for 4yrs. There is a slight betterment in people too: a bit more open to strangers, a bit more joyful and humorous and lighthearted.
Here's to new beginnings!

Thursday, September 4, 2008

New Google Browser: Chrome

Released in beta on Sept 2, 2008.
Absolutely fantastic by a quick look at things.
It's not only the detachable tabs, now you can make your gmail and google calendar shortcuts on your desktop! I bet we all are going to use them more often now.
Curious about what could crash it if anything; it seems pretty robust! Well done, Google!

Saturday, July 26, 2008

Word Vine (game)

Word Vine

Place word on the vine and link every word to finish the level!

Play this free game now!!

If you wonder how ontologies are built, try this game! Very educational.

Wednesday, July 23, 2008

What's your favorite language?

Huh? Seriously?
You mean, programming languages are not just tools?
I say, show me the project and I'll tell you which language I prefer for it.
You won't call in a French interpreter to translate an English-Korean dialog, will you? In a similar fashion, specific programming tasks almost require specific types of programming tools. The right language for the job may not be your favorite but then this only means that this programming job is not your favorite thing to do either. Also, a French interpreter could possibly re-train and learn to do Korean interpretations. He will still be using a language to do his job and the basic methodology will remain the same; only the language "rules" will have changed. Similarly, regardless of what your current programming expertise is, what counts is that you are familiar with and eventually applying the same intrinsic methodologies and principles of programming. So, what's really a "favorite language"?

Tuesday, July 22, 2008

Text Analytics Java development (job post)

Text Analytics Developer - Java - Computational Linguistics
in US-IL-Chicago

Company: CyberCoders Engineering
Location: Chicago, IL

What you need to apply:

# A strong background in Computational Linguistics or Natural Language Processing.
# Proven ability to deliver text mining or information retrieval solutions for real world applications
# Demonstrates and applies thorough understanding of software development methodology and protocols.
# Excellent programming skills in Java
# Expert-level understanding of machine learning and statistical techniques as applied to text analytics, e.g., information extraction, summarization, classification, clustering, tone/sentiment analysis, relevance ranking
# Experience in the creation and exploitation of domain and task ontologies in text analytics are a plus
# A background check will be required; holding an active security clearance is a plus.

What you will be doing:

# Developing and implementing commercial software applications.
# Identifying and modifying existing algorithms, code implementation, testing, and maintenance.
# Development will be done in the Java programming language and will integrate with the D2K analytics development environment unless specified otherwise.
# Work within the Analytics Group in planning and executing the creation and delivery of text analytic features.

What's in it for you:

# An attractive compensation plan including cash, stock and bonus is available to the right candidate.

Required Skills
Text Analytics Developer, Java, Computational Linguistics, Natural Language Processing, Text Analytics, Information Extraction, Summarization, Classification, Clustering, Tone Analysis, Sentiment Analysis, Relevance Ranking, D2K

The following job types are relevant:
Information Technology, Engineering, Professional Services

How to apply:
Through the company's website.
Recruiter's Name: Reggie Landicho
Job ID: RL-TextAnalyticsDev-IL6

Friday, July 18, 2008

Diigo

A new way to keep track of information you find on the web and share it with friends. Also a new way to connect with people all over the world who share your bookmarks and therefore interests. Visit Diigo and start marking up and sharing the web with your like-minded buddies. Learn more about the Nevada-based start-up at: About Diigo.

Wednesday, July 16, 2008

start-up effort in sentiment detection and analysis

"Semantic analysis" or "semantic measurement" or "sentiment detection" are very popular with startups. Take for instance ScoutLabs and SkyGrid. The former capitalizes on the early warning detection of security-related events in news and the latter is watching negative and positive sentiment in news about business in order to inform stock market trends. In both cases, the sentiment analysis technology relies heavily on automatically analyzing natural language input in unstructured form and filling records of a database with the extracted/tagged information. One has to wonder about the limitations of database systems for successfully undertaking such a task.

Thursday, June 12, 2008

4th Annual Text Analytics Summit in Boston (June 16-17 2008)

An extremely interesting 2-day menu of Text Analytics activities in Boston coming up this Sunday. It includes (the highlights below are my selection from the online menu):

1. Pre-conference workshops (Text Analytics for dummies and MarketPlace Overview),

2. Keynote by Microsoft CompLing Research Labs on sentiment detection,

3. An Industry Panel including a Sr. Product Manager from Business_Objects/SAP (previously Inxight), the CTOs of Attensity and Clarabridge and the VP of SPSS,

4. Text mining and evaluation in blogs,

5. A presentation in Speech Analytics by the SVP of CallMiner, Inc. and

6. A presentation in Visual Analytics in Pharma data by the rep of Merck KGaA.

Tuesday, June 10, 2008

Euro 2008

The Greeks didn't get a head start today as they lost the game against Sweden. We're all awaiting Saturday now when they fight against the Russians!

Glottopedia

Check out Computational Linguistics in Glottopedia*. Of course you can edit the lemmas just like in Wikipedia. A nice little corner for CompLinguists.

*Apologies if the link to the Glottopedia site doesn't work; I've found that the site is down periodically.

Monday, June 9, 2008

iPhone

Well, okay, it's now a bit cheaper.
However, there are still a couple of glaring omissions in its features specifically in terms of accessibility:
Whereas most BlackBerry devices are hearing-aid compatible (HAC), the iPhone isn't. This simply means that about 20 million Americans who were born with (or have acquired) varying degrees of hearing loss are automatically excluded from the iPhone's clientele, since the phone cannot be used with a hearing aid's microphone (M) or telecoil (T) switches.
The iPhone also does not support multiple languages, i.e. languages other than English that the phone can be set to display its menus and text in. This also means that the iPhone -perhaps inadvertently- excludes e.g. Spanish-speaking audiences from its clientele, unless of course they know how to read and write in English.
It's high time that Apple opened its tech doors to the world. As long as it remains "exclusive" and in that sense esoteric to the masses, it may never attain the success and popularity of a Blackberry.

Saturday, June 7, 2008

South and North California dichotomy

It's one thing to hear it and totally another to actually live it.
Californians seem to have a good reason for being displeased with the stereotype.
Upon arrival in Northern Cali, I was welcomed by numerous friends and acquaintances who didn't know my exact location with a "Welcome to LA!". Ha! Not all Cali is LA and thank God for that! Unlike my American friends, my only reason for discontent with LA is the fact that it's a big city. I love almost every suburb around LA but I am not a big fan of LA itself. So when I correct my well-wishing friends, I have my reasons and they have theirs for looking at me with concern. I'm a few light years away from the local stereotypes but it's funny how assumptions work, especially when they are so obvious.

Saturday, May 31, 2008

Resurfacing for good later in June! Enjoy summer, everyone!

May has been a real long month.
Today the 31st I'm finally wrapping up business and moving on with my life.
Apologies for leaving this blog an orphan for so long.
I'll be back soon with daily bites.

Saturday, May 3, 2008

resurfacing...

Coming up.......

2008 North American Computational Linguistics Olympiad

What is the Computational Linguistics Olympiad?

The North American Computational Linguistics Olympiad (NACLO) is modeled after similar Linguistics Olympiads held in Eastern Europe since 1965. In these events, hundreds of high school age students have participated, challenged by interesting linguistic problems from dozens of the world's languages. In solving the problems, students learn about the richness, diversity and systematicity of language, while exercising natural logic and reasoning skills. No prior knowledge of particular languages or of linguistics is necessary, but the competitions have proven very successful in attracting top students to study and choose careers in fields of linguistics, computational linguistics and language technologies.

Professional linguists and other specialists in natural language processing technologies cooperate to create stimulating and engaging problems that represent cutting edge theoretical and practical issues in their fields. This is truly an opportunity for young people to experience a taste of what natural language processing in the 21st century is all about.

For details and past CL Olympiads visit: NACLO 2008

Friday, March 14, 2008

The unbearable lightness of being

See the movie if you can. Better yet read the book (by Kundera).
A bit post-modernistic by some standards, I found it agrees with me these days.

Tuesday, March 4, 2008

semantic computing and "vaguely-formulated human intentions"

The field of Semantic Computing (SC) brings together those disciplines concerned with connecting the (often vaguely-formulated) intentions of humans with computational content. This connection can go both ways: retrieving, using and manipulating existing content according to user's goals ("do what the user means"); and creating, rearranging, and managing content that matches the author's intentions ("do what the author means").

There is no such thing as "vaguely-formulated intentions of humans". Humans have intentions. Only the human who has expressed the specific intentions knows how vague or not these intentions were "formulated". How? Usually vague intentions bring about vague results or bring about undesired (unintended) results. It's all in the eye (and mind) of the beholder.

What's pertinent for CL and SemComputing here is the fact that intentions are hard to detect. We will never be 100% sure that we matched the intention of the author, nor should we even try for that. Regardless of what the author "meant" (which can be very imprecise and gauged only by the author's own judgement), a document bears evidence of such intentions. If we focus on the intentions evidently expressed in the document we can happily dispense with the "intentions of humans".

Friday, February 29, 2008

Keeping subjectivity out of CL...

Reading the announcement about the COLING workshop on "human judgements in Computational Linguistics":

Human judgements play a key role in the development and the assessment of linguistic resources and methods in Computational Linguistics. [...]
We invite papers about experiments that collect human judgements for Computational Linguistic purposes, with a particular focus on linguistic tasks that are controversial from a theoretical point of view (e.g., some coding tasks having to do with semantics or pragmatics). Such experimental tasks are usually difficult to design and interpret, and they typically result in mediocre inter-rater reliability.

So let me think. "Coding tasks having to do with semantics and pragmatics" "typically result in mediocre inter-rater reliability". Seriously? So are we back to the 50s and Chomsky's concept of "ungrammaticality"? The dominance of syntax and the marginalization of semantics and pragmatics as "subjective"?
Now that we finally made it through enough of Chomsky 50+ years later, now that CL is finally breaking free of attempts to formalize semantics, now that we have finally figured out how to relate language and information theory, we now willingly take a turn back and look at "human judgements"? Why? Language is definitely not created in a vacuum. Virtually every level of natural language (and hence also of CL) is potentially subjective in that it inevitably reflects the 'theory' of the linguist who looks at it. There is no way around this. Claiming that "some coding" is subjective implies that some other "coding" is not. Well, the point is that if it is not, then it has nothing to do with *natural* language.
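For readers who haven't met "inter-rater reliability" before: it is usually reported as a chance-corrected agreement score. The announcement doesn't name a metric, so as an assumption I'll illustrate with Cohen's kappa for two annotators, in a small Python sketch with made-up labels.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two annotators who labelled the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and
    p_e the agreement expected by chance from each rater's label frequencies.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty item list")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Invented annotations for ten sentences (p = positive, n = negative).
annotator_1 = ["p", "p", "p", "p", "p", "p", "n", "n", "n", "n"]
annotator_2 = ["p", "p", "p", "p", "n", "n", "n", "n", "n", "p"]

kappa = cohens_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.2f}")
```

In this invented example the raters agree on 7 of 10 items, chance alone would give 5 of 10, and kappa comes out at about 0.4, the sort of figure that gets called "mediocre". Note what the number measures: agreement between two coders, not whether either coding is free of the linguist's theory.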

Mental exercise of the day

Focus on the negative and you will be immediately inundated by an avalanche of negative experience.

Focus on the positive, and more positive will turn up out of the blue...

On idiotic management...

Managing people is hard enough. Managing smart people is definitely harder.
Managing smart people who constantly blabber about new technology must scare the hell out of most managers.

What takes the cake is:
Managing people and technology and making decisions without listening to your experts.
Managing people and technology and being too scared to make any decisions.

Thursday, February 28, 2008

words....

Interesting neologisms of the day (only read with a sense of humor):

celebritology, noun:
1. the study of the lives of celebrities
2. the endless gossip about Britney's life
3. the main subject of attention of People magazine

chatological, adj. (as in "chatological humor", reminiscent of "eschatological"):
1. the system or theory concerning online chats, online chat rooms, and any other online life species
2. the branch of logic dealing with the same...

Interesting syntactic phenomenon of the day:

Clapton Invited to Play North Korea*

Lucky North Korea will be played by Clapton...

* to confirm the meaning of this schema look at the article

Monday, February 25, 2008

If it's text processing it is also batch processing...

Here's a consequence of isolating web services from real text processing:

What about batch processing?

Tired of superficial text processing

I've been looking at CALAIS, the open-source web services for performing information extraction on Reuters data. The info extraction tech used under the hood is based on ClearForest's proprietary rule-based info extraction language called DIAL. Like SRA's and Inxight's (now Business Objects and SAP) similar tools of the trade, languages of this type are tailor-made for general purpose info extraction. The biggest asset of working with such tools is that they allow the developer to go as deep as she can and extend the original NLP basis in order to meet specific customer data and requirements. However, tools like CALAIS shift the focus heavily from the underlying NLP/IE technology to the web services and I/O front and related bells and whistles. They even offer "bounties" for innovative web service applications built for CALAIS. All this while the single most important and most attractive element of this tool is its NLP extensibility power! This remains concealed and under wraps with a roadmap promise to be released by the end of the year. Until then the tool runs with the out-of-the-box IE capabilities, which are -arguably- pretty limited and only impressive to those with limited prior NLP/IE experience. Does someone have their priorities screwed up?

Sunday, February 24, 2008

Semantic Web, the Wikipedia, TextLinguistics and Information Extraction

Paraphrasing an article on AI3 published on Feb 18, 2008, there is finally a recognized collective wave of information extraction tools and sources that mine Wikipedia in order to help enrich the Semantic Web.

Here are a few instances that show-case the successful marriage of text-linguistics and information extraction:
-First paragraphs: Wikipedia articles are being mined for term definitions. It is standard text structure (even in Wikipedia articles) that the first paragraph outlines the main terms discussed in the article. For this reason, initial paragraphs are good places for looking up terms in the document.
-Redirects: Mining for synonymous terms, spelling variations, abbreviations and other "equivalents" of a term is just par for the course.
-Document Title: This is where we locate named entities and domain-specific terms or semantic variants
-Subject Line and Section Headings: For category identification (topic classification)
-Full text: Whereas in the first paragraph new terms are being defined, it is in the rest of the document that one will find a full description of the definition/meaning, along with related terms, translations and other collocations (linguistic context).
-Embedded article links: Links to and from external pages provide more related terms, potential synonyms, clues for disambiguation and categorization.
-Embedded Lists (and other Hierarchies): Look here for hyponyms, meronyms and other semantic relationships among related terms.

Notice that all of the above are overt structural elements in Wikipedia articles. This type of structure is not unique to Wikipedia articles, although Wikipedia standards impose a conscious effort for homogeneity. However, detecting such structural clues in text is no news in the field of text-linguistics (for seminal work in the field check here).
What's new here is the application of text-linguistics analysis techniques to the Web (and Wikipedia in particular) for purposes of Web mining, Information Extraction and the Semantic Web initiative.
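To show how overt these structural cues are, here is a minimal Python sketch that pulls a few of them out of a hand-written wikitext snippet. The snippet, the `mine_article` function and the regular expressions are all illustrative assumptions; real Wikipedia markup is far messier and deserves a proper parser.

```python
import re

# Hypothetical, hand-written wikitext (not a real article).
WIKITEXT = """'''Widget''' is a [[placeholder name]] for a small [[gadget]].

== Usage ==
Widgets appear in examples.

== Related terms ==
* [[Gizmo]]
* [[Doohickey|Doodad]]
"""


def mine_article(wikitext):
    """Collect a few structural cues: defined term, headings, links, list items."""
    first_paragraph = wikitext.strip().split("\n\n")[0]
    bold = re.search(r"'''(.+?)'''", first_paragraph)
    return {
        # First paragraph: the bolded term is usually the one being defined.
        "defined_term": bold.group(1) if bold else None,
        "first_paragraph": first_paragraph,
        # Section headings: clues for topic classification.
        "headings": re.findall(r"^==\s*(.+?)\s*==$", wikitext, flags=re.M),
        # Embedded links: link targets (the text before any '|') as related terms.
        "links": re.findall(r"\[\[([^\]|]+)(?:\|[^\]]*)?\]\]", wikitext),
        # Embedded lists: linked items under a heading, candidates for hyponyms etc.
        "list_items": re.findall(r"^\*\s*\[\[([^\]|]+)", wikitext, flags=re.M),
    }


cues = mine_article(WIKITEXT)
for name, value in cues.items():
    print(f"{name}: {value}")
```

None of this is deep NLP. The text-linguistic insight is knowing where in a document to look, and the extraction itself is then almost trivial.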

The output of such analyses and metrics helps populate ontologies and taxonomies, as well as link records. Areas of focus for these types of application are:
-subcategorization
-WSD, NER and NED
-semantic similarity and relatedness analysis and metrics

Tools of the trade

Try NoteTab Light, a freeware editor with some commercial features available for a 31-day trial. It allows embedded scripting (HTML, Perl, Gawk). It looks pretty loaded in comparison with Notepad.

Saturday, February 23, 2008

The hound that won the race...

Dedicated to my "Phriends" in the throes of the dissertation journey.
You can do it!

What is it with students?

OK -- you go to the trouble of contacting me with a question related to my doctoral dissertation that you have sitting in front of you. How flattering! Someone actually was interested enough to buy (and hopefully read?) my brainchild.
Now don't spoil the good news with 1) incredibly bad manners and 2) unbelievably thick questions.
So to attend to 1, do follow proper email etiquette. By this I mean use a basic "Hi (FirstName)" salutation when you address someone you don't know over email. And while we are at it, please resist the temptation of (inadvertently) offending me by ascribing my brainchild to someone else... Thank you.
To attend to 2, just do not expect people to give you ready answers to q's they belabored for a while! It is called a doctoral dissertation for a reason! I didn't spend one day, one month or even one year on it, dude. Since you have bought it, do me a favor and actually read it or at least browse through its pages. What else can I say. Then, once you are in a place to form intelligent and respectful questions, come back to me. It is called "research" for a reason!

watching the news...

The next big revolution will be in the direction of exercising judgment when it comes to information. Do I really need a daily depression dose by CNN and newspapers in order to "get it" that the economy is bad and people are losing their jobs and -most crucially- that the government isn't doing much about it? I say stop watching all this negativity and start doing something about it in your daily life: from voting, to questioning political practices, to living "in the present" and tuning in to what's happening around you (CNN won't tell you how many of your neighbors lost their jobs in the last X months). Open your eyes, be present and use judgment and common sense when it comes to "mass media". Above all THINK. Yes, use the substance in your skull that promotes intelligent life. Mass media perpetuate negativity. This is how it is. They're not to blame. It's up to you to "buy it" or not. I say use your own mind and separate yourself from the "mass". Then you have better chances of staying positive and doing something in your life that promotes a change for the better.