Version 1 of art descriptions output by my super ghetto natural language processing algorithm, which parses text from 277 modern art exhibits in 4 New York City art galleries: David Zwirner, Gagosian, Gladstone, and Hauser and Wirth. As expected, it’s laughably bad. But it’s a start. Now that I can see what it’s generating, I can work on specific improvements. The genesis of this idiotic project can be found in an earlier blog post.
How the sausage is made…
Step 1: Develop intake scripts to programmatically build up a corpus of text
My scripts crawl 4 New York City art galleries, gathering content from…
483 individual pieces of art described in…
9,467 sentences with…
281,800 words composed of…
1,550,999 characters
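To make the nesting above concrete, here is a minimal sketch of how counts like these could be tallied. It is not the actual intake script: the function name `corpus_stats`, the sample descriptions, and the naive sentence splitter (break after `.`, `!`, or `?`) are all assumptions, and whether the real character count includes whitespace is unknown.

```python
import re

def corpus_stats(descriptions):
    """Count pieces, sentences, words, and characters in a list of description strings."""
    # Naive sentence split (an assumption): break after ., !, or ? followed by whitespace.
    sentences = [
        s
        for text in descriptions
        for s in re.split(r"(?<=[.!?])\s+", text.strip())
        if s
    ]
    # Words are whitespace-separated tokens, punctuation included.
    words = [w for s in sentences for w in s.split()]
    return {
        "pieces": len(descriptions),
        "sentences": len(sentences),
        "words": len(words),
        # Characters here include spaces and punctuation.
        "characters": sum(len(text) for text in descriptions),
    }

# Made-up sample data, not text from the galleries.
sample = [
    "The artist paints light. Her canvases glow!",
    "A sculpture of steel and glass.",
]
stats = corpus_stats(sample)
print(stats)
```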
Step 2: Sift through the corpus to locate sentences with 1 or more stem words
Stem words are simply words that share the same word stem, e.g., run, runner, and running all have the same stem: run
I am not doing any fancy lookups with correlated words with weighted affinities
I generate sentences from a string of related stem words
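A rough sketch of what Step 2 amounts to, assuming a toy suffix-stripping stemmer in place of whatever stemmer the real script uses. The suffix list, `toy_stem`, `sentences_with_stems`, and the sample sentences are all hypothetical; a real stemmer such as Porter's has far more rules.

```python
import re

# Hypothetical suffix list, checked in order; longer suffixes come first.
SUFFIXES = ("ing", "ers", "er", "ed", "es", "s")

def toy_stem(word):
    """Lowercase a word and strip one known suffix, keeping at least 3 letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Undo consonant doubling: "runn" -> "run".
            if stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    return word

def sentences_with_stems(sentences, stems):
    """Return the sentences containing at least one word whose stem is in `stems`."""
    wanted = {toy_stem(s) for s in stems}
    hits = []
    for sentence in sentences:
        tokens = re.findall(r"[A-Za-z]+", sentence)
        if any(toy_stem(t) in wanted for t in tokens):
            hits.append(sentence)
    return hits

# Made-up corpus, not text from the galleries.
corpus = [
    "The painter yearns for a lost city.",
    "Steel beams cross the gallery floor.",
    "Her quests end in quiet rooms.",
]
matches = sentences_with_stems(corpus, ["quest", "yearn"])
print(matches)
```

Matching is purely on the stripped stem, with no weighting or related-word lookup, which mirrors the "nothing fancy" approach described above.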
Step 3: Run transformations on my generated sentences to produce novel output
Replace all Names (proper nouns) with generated names of the same gender
Replace all nouns, adjectives, verbs, and adverbs with their shortest synonym/counterpart
Print out 4 sentences
Stems = grapple, quest, yearn…
Stems = youth, glory, immortality
Stems = digital, internet, technology…
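The "shortest synonym" transformation from Step 3 can be sketched as follows. This is only an illustration: the hand-written `SYNONYMS` table and the helper names are hypothetical, the post does not say where the real synonyms come from, and the sketch skips part-of-speech tagging and name replacement entirely.

```python
import re

# Hypothetical synonym table, keyed by lowercase word.
SYNONYMS = {
    "enormous": ["huge", "big", "vast"],
    "contemplates": ["ponders", "mulls"],
    "sculpture": ["statue", "figure"],
}

def shortest_synonym(word):
    """Return the shortest option among the word itself and its listed synonyms."""
    candidates = [word] + SYNONYMS.get(word.lower(), [])
    # min() keeps the earliest candidate when lengths tie.
    return min(candidates, key=len)

def shrink(sentence):
    """Replace every alphabetic token with its shortest known synonym."""
    return re.sub(r"[A-Za-z]+", lambda m: shortest_synonym(m.group()), sentence)

result = shrink("The artist contemplates an enormous sculpture.")
print(result)  # The artist mulls an big statue.
```

Note the "an big" in the output: blind word swapping breaks article agreement, which is one reason the generated sentences come out so mangled.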
A few directions I’m interested in pursuing
Getting simple stuff out of the way:
Article-noun agreement – She ate a apple.
Subject-verb agreement – She go to the movies.
Tense agreement – He goes to the movies with his friend to saw the film.
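Of the three, article agreement is the easiest to patch after the fact. Here is a minimal sketch, assuming a purely letter-based rule ("an" before a vowel letter, "a" otherwise); the function name `fix_articles` is made up for illustration, and subject-verb and tense agreement are not attempted.

```python
import re

def fix_articles(sentence):
    """Rewrite "a"/"an" so it agrees with the first letter of the following word."""
    def repl(match):
        article, space, first = match.group(1), match.group(2), match.group(3)
        correct = "an" if first.lower() in "aeiou" else "a"
        # Preserve the capitalization of the original article.
        if article[0].isupper():
            correct = correct.capitalize()
        return correct + space + first

    # \b ensures we only match the standalone words "a" and "an".
    return re.sub(r"\b(an?)(\s+)([A-Za-z])", repl, sentence, flags=re.IGNORECASE)

print(fix_articles("She ate a apple"))  # She ate an apple
```

Because the rule looks at letters rather than sounds, it gets cases like "an hour" and "a university" wrong; a pronunciation lookup would be needed for those.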
Establish context better before fetching content from the corpus – don’t just rely on word stemming. Definitely look into lemmatizing. Also, with WordNet, you can use hyponyms and hypernyms to get word substitutes that more closely align with the sense of a word in the context of the sentence from which it was plucked (pretty cool!)
Dynamically deriving my own set of context-free grammars from the corpus – I’ve been reading up on CFGs, and how these are used in simple client-side JavaScript libraries. The cool thing about them is that when they’re expanded, they can be recursive. So a sentence can consist of a noun phrase + a verb phrase. But any noun phrase might consist of another noun phrase + verb phrase. For example… this is the house that Jack built => this is the lake that lies next to the house that Jack built.
Hooking up a frontend to a backend to make the corpus interactive 😃
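To illustrate the recursive context-free grammar idea from the list above, here is a small sketch with a hand-written toy grammar. The rules in `GRAMMAR`, the `expand` function, and the depth cap are all assumptions made for the example; nothing here is derived from the corpus.

```python
import random

# Hypothetical toy grammar. Keys are nonterminals; anything else is literal text.
# The first rule for each nonterminal is non-recursive, so expansion can always stop.
GRAMMAR = {
    "S": [["this is", "NP"]],
    "NP": [["the", "N"], ["the", "N", "that", "VP"]],
    "VP": [["Jack built"], ["lies next to", "NP"]],
    "N": [["house"], ["lake"]],
}

def expand(symbol, rng, depth=0, max_depth=4):
    """Recursively expand a symbol; at max_depth and beyond, always take the first rule."""
    if symbol not in GRAMMAR:
        return symbol  # terminal text
    rules = GRAMMAR[symbol]
    rule = rules[0] if depth >= max_depth else rng.choice(rules)
    return " ".join(expand(part, rng, depth + 1, max_depth) for part in rule)

sentence = expand("S", random.Random(0))
print(sentence)
```

The recursion lives in the NP and VP rules: a noun phrase can contain a verb phrase that contains another noun phrase, which is how this grammar can produce "this is the lake that lies next to the house that Jack built". The depth cap is just a guard so the random expansion always terminates.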