The second part of this year's project is on named-entity recognition (NER), the task of detecting the segments of a sentence that mention a person name, a company name, a location name, etc. This task will be introduced at the end of lecture 07; see also this link.
For this project you need to experiment with zero-shot techniques for NER, using any chatbot of your choice. I am asking you to start from the paper 'Empirical Study of Zero-Shot NER with ChatGPT' by Xie et al., 2023, which you can find at this link and which will be discussed in class, and to reproduce/adapt some of the methods reported there.
More precisely, you are required to go through the following steps.
1. Choose an NER dataset from the web. This can be a general-domain dataset, using entities such as person, company, location, time, quantity, etc., or a specialized-domain dataset, using entities from medicine, law, economics, etc.
2. Choose an instruction-tuned LLM/chatbot that can be queried automatically through an API. ChatGPT, Gemini, and Llama 3 can all be used in this way.
3. Implement the baseline method from (Xie et al., 2023), together with two other methods among those presented in that paper or inspired by them. Refine your methods and zero-shot prompts on a development portion of your dataset.
4. Evaluate your methods on a test portion of your dataset, kept separate from the development portion. Important: do not change your methods after the test set evaluation. You can use any of the evaluation measures traditionally used in NER; see for instance the paper 'A Brief History of Named Entity Recognition' by Munnangi, 2024, at this link. Finally, provide a careful, critical discussion of your results.
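As a rough illustration of the prompting and evaluation steps above, here is a minimal Python sketch. It is only one possible setup, under several assumptions. The label set `LABELS`, the 20% development fraction, and all function names are hypothetical. The prompt wording is a simplified paraphrase, not the exact prompt of (Xie et al., 2023). The `ask_llm` argument stands in for whatever API client you choose. The score is a micro-averaged precision/recall/F1 with exact match on (mention, label) pairs, which ignores token positions and repeated mentions within a sentence.

```python
import json
import random

# Hypothetical label set; replace with the entity types of your dataset.
LABELS = ["person", "company", "location"]

def split_dev_test(examples, dev_fraction=0.2, seed=0):
    """Shuffle once with a fixed seed and return disjoint (dev, test) lists."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * dev_fraction)
    return shuffled[:cut], shuffled[cut:]

def build_prompt(sentence, labels=LABELS):
    """Zero-shot prompt: label set, instruction, output format, input text."""
    return (
        f"Given entity label set: {', '.join(labels)}\n"
        "Recognize the named entities in the text below. "
        'Answer only with a JSON list of objects with keys "text" and "label".\n'
        f"Text: {sentence}\n"
        "Answer:"
    )

def parse_entities(response, labels=LABELS):
    """Turn the model reply into a set of (mention, label) pairs.
    Malformed replies and labels outside the label set are dropped."""
    start, end = response.find("["), response.rfind("]")
    if start == -1 or end < start:
        return set()
    try:
        items = json.loads(response[start:end + 1])
    except json.JSONDecodeError:
        return set()
    return {
        (item["text"], item["label"])
        for item in items
        if isinstance(item, dict)
        and isinstance(item.get("text"), str)
        and item.get("label") in labels
    }

def micro_prf(gold, pred):
    """Micro-averaged precision, recall, F1 over per-sentence entity sets."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))
    n_pred = sum(len(p) for p in pred)
    n_gold = sum(len(g) for g in gold)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

def evaluate(examples, ask_llm):
    """examples: list of (sentence, gold_entities); ask_llm: prompt -> reply."""
    gold = [set(entities) for _, entities in examples]
    pred = [parse_entities(ask_llm(build_prompt(s))) for s, _ in examples]
    return micro_prf(gold, pred)

# Toy run with a canned reply in place of a real API call.
def fake_llm(prompt):
    return ('[{"text": "Alice", "label": "person"}, '
            '{"text": "Paris", "label": "company"}]')

data = [("Alice works for Acme in Paris.",
         [("Alice", "person"), ("Acme", "company"), ("Paris", "location")])]
p, r, f1 = evaluate(data, fake_llm)
print(f"P={p:.2f} R={r:.2f} F1={f1:.2f}")  # P=0.50 R=0.33 F1=0.40
```

In a real project, `fake_llm` would be replaced by a function that sends the prompt to your chosen model. Prompts would be tuned using only the development list returned by `split_dev_test`, and the test list would be scored once at the end.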
Your project should be presented as a notebook, to be uploaded to the shared Google folder; see the 'Project registration' post in this forum. Use a file name prefixed with your names. The notebook should be organized as follows.
- Student names, school registration numbers (matricola), and master program should be reported at the top of the notebook.
- Domain and dataset should be introduced and described in a dedicated section.
- Each of the zero-shot methods that you use should be carefully described in its own section. If a method requires additional annotation, as in the case of the syntactic augmentation method described in (Xie et al., 2023), any off-the-shelf libraries that you use should be introduced and briefly documented.
- If you use additional methods inspired by similar projects publicly available on the web, you should add a proper acknowledgement section to your notebook.
If you have any questions about the second part of the project, please post a message in this thread.