The Collective Biographies of Women project (CBW) investigates cultural representations of women through a large corpus of British and American biographical texts from the nineteenth and twentieth centuries. The texts belong to the genre of collective biography: each volume contains several chapter-length biographies of different women, organized around a common theme. The CBW project is supported by Dr. Alison Booth, Director of the Scholars' Lab at the University of Virginia.
CBW seeks to annotate these biographies at the paragraph level, labeling each paragraph along a set of literary-critical dimensions. These dimensions are defined in a controlled vocabulary known as BESS, the Biographical Elements and Structure Schema. BESS consists of tags such as Stage of Life, Persona, Event, Topos, and Discourse, and each tag has its own labels; the Event tag, for example, has labels such as marriage, birth, and death. Ultimately, the goal of the project is to develop a complete annotated corpus drawn from 1,270 known books, comprising around 13,000 chapters about some 8,000 women.
Each paragraph in a biography will be classified with its corresponding BESS annotation. The model features will include textual features such as bag-of-words and TF-IDF representations, along with linguistic features such as semantic and syntactic parameters. Our initial approach to classification will use a variety of machine learning models, including logistic regression, tree-based models, and SVMs. The results of these models will serve as the baseline for our deep learning results.
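As a rough illustration of the TF-IDF features mentioned above, the sketch below computes term weights for a handful of paragraphs using only the Python standard library. The example paragraphs, the simple tokenizer, and the function names are assumptions made for illustration; they are not taken from the CBW corpus or pipeline.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep runs of letters (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(paragraphs):
    """Return (vocabulary, vectors) with one TF-IDF weight dict per paragraph."""
    tokenized = [tokenize(p) for p in paragraphs]
    n_docs = len(tokenized)
    # Document frequency: number of paragraphs that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1
        vectors.append({
            # Term frequency times inverse document frequency (unsmoothed).
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return sorted(doc_freq), vectors


# Hypothetical paragraphs, invented for illustration only.
paragraphs = [
    "She was born in London in the spring.",
    "Her marriage took place in London.",
    "After her marriage she travelled abroad.",
]
vocabulary, vectors = tfidf_vectors(paragraphs)
```

Words that occur in few paragraphs (here "born") receive higher weights than words shared across paragraphs (here "london"). Weight vectors of this kind could then be passed to a baseline classifier such as logistic regression.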
With baseline scores from the machine learning models in hand, the biographies will be annotated using a recurrent neural network approach. A multi-layered bidirectional LSTM model will be used to capture the context and theme of a paragraph and identify the corresponding BESS annotation. Every word will be initialized with its pretrained GloVe embedding, which will then be trained further so that its meaning aligns with the context of the biographies. Different architectures will be tried and tested before selecting the one best suited to this use case. Because biographies have received little attention in natural language processing, arriving at an optimal architecture should be a challenging task.
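One possible shape for such a model is sketched below in PyTorch. It is a minimal illustration under stated assumptions, not the project's architecture: the class name, layer sizes, vocabulary size, and number of labels are all hypothetical, and the embedding layer is randomly initialized here rather than loaded from GloVe.

```python
import torch
from torch import nn


class ParagraphTagger(nn.Module):
    """Hypothetical multi-layer bidirectional LSTM paragraph classifier."""

    def __init__(self, vocab_size, embed_dim, hidden_dim, num_labels, num_layers=2):
        super().__init__()
        # In practice the weights could be initialized from pretrained GloVe
        # vectors, e.g. nn.Embedding.from_pretrained(weights, freeze=False),
        # so that they continue to train on the biographies.
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(
            embed_dim,
            hidden_dim,
            num_layers=num_layers,
            bidirectional=True,
            batch_first=True,
        )
        self.classifier = nn.Linear(2 * hidden_dim, num_labels)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)  # (batch, seq_len, embed_dim)
        _, (hidden, _) = self.lstm(embedded)  # (num_layers * 2, batch, hidden_dim)
        # Concatenate the top layer's final forward and backward states.
        summary = torch.cat([hidden[-2], hidden[-1]], dim=1)
        return self.classifier(summary)  # (batch, num_labels)


# Assumed sizes, chosen only to make the sketch runnable.
model = ParagraphTagger(vocab_size=1000, embed_dim=50, hidden_dim=32, num_labels=5)
batch = torch.randint(1, 1000, (4, 20))  # 4 paragraphs of 20 token ids each
logits = model(batch)
```

The sketch assumes fixed-length inputs; variable-length paragraphs would need padding and packed sequences so that the final hidden states are not computed over padding tokens.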
The next objective is to find the common events associated with each woman in a biography. This is done by drawing a parallel with market basket analysis, which estimates the probability that an item is present in a cart given that another item is present. Here, a paragraph plays the role of the cart and the words in the paragraph are the items. The non-trivial words with the highest probability of being associated with each woman are then identified, which gives a quick summary of her life events. With these objectives, we identify both what each paragraph in a biography is about and which life events are important for each woman.
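The market basket analogy can be made concrete with a small sketch that treats each paragraph as a basket of non-trivial words and computes the support and confidence of word pairs. The baskets, thresholds, and function name below are assumptions chosen for illustration; the abstract does not specify how the project computes these probabilities.

```python
from collections import Counter
from itertools import combinations


def association_rules(baskets, min_support, min_confidence):
    """Return (antecedent, consequent, support, confidence) for word pairs."""
    n_baskets = len(baskets)
    item_counts = Counter(item for basket in baskets for item in basket)
    pair_counts = Counter(
        pair for basket in baskets for pair in combinations(sorted(basket), 2)
    )
    rules = []
    for (first, second), count in sorted(pair_counts.items()):
        support = count / n_baskets  # share of paragraphs containing both words
        if support < min_support:
            continue
        for antecedent, consequent in ((first, second), (second, first)):
            # Confidence: estimated P(consequent present | antecedent present).
            confidence = count / item_counts[antecedent]
            if confidence >= min_confidence:
                rules.append((antecedent, consequent, support, confidence))
    return rules


# Hypothetical paragraphs, already reduced to sets of non-trivial words.
baskets = [
    {"marriage", "london", "church"},
    {"marriage", "london"},
    {"london", "school"},
    {"marriage", "church"},
]
rules = association_rules(baskets, min_support=0.5, min_confidence=0.8)
print(rules)  # [('church', 'marriage', 0.5, 1.0)]
```

In this toy data, every paragraph that mentions "church" also mentions "marriage", so that rule has confidence 1.0, while the reverse rule (confidence 2/3) falls below the threshold.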
You need this ticket from Eventbrite to sign up:
Applied Machine Learning Conference.