Project part II: named-entity recognition

by Giorgio Satta -

The second part of this year's project is on named-entity recognition (NER), the task of detecting the segments of a sentence in which a person name, a company name, a location name, etc., are mentioned. The task will be introduced at the end of lecture 07; see also this link.

For the development of the project you need to experiment with zero-shot techniques for NER, using any chatbot of your choice. I am asking you to start from the paper 'Empirical Study of Zero-Shot NER with ChatGPT' by Xie et al., 2023, which you can find at this link and which will be discussed in class, and to reproduce/adapt some of the methods reported there.

More precisely, you are required to go through the following steps.

  1. Choose a NER dataset from the web. This can be a general-domain dataset, using entities such as person, company, location, time, quantity, etc., or a specialized-domain dataset, using entities from the domain of medicine, law, economy, etc.

  2. Choose an instruction-tuned LLM/chatbot that can be queried automatically by means of an API. ChatGPT, Gemini, and Llama3 can all be used in this way.

  3. Implement the baseline method from (Xie et al., 2023), and also implement two other methods among those presented in that paper, or inspired by them. Refine your methods and zero-shot prompts on a development portion of your dataset; a minimal baseline sketch is given after this list.

  4. Evaluate your methods on a test portion of your dataset, kept separate from the development portion. Important: do not change your methods after the test-set evaluation. You can use any of the evaluation metrics traditionally adopted in NER; see for instance the paper 'A Brief History of Named Entity Recognition' by Munnangi, 2024, at this link. A sketch of entity-level evaluation is also given after this list. Finally, provide a careful, critical discussion of your results.
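As a concrete starting point for steps 2 and 3, the sketch below shows one way a vanilla zero-shot NER prompt can be sent to an instruction-tuned model through an API, here via the OpenAI Python client. The entity types, the prompt wording, the JSON output format, and the model name are illustrative assumptions only, not the exact setup of (Xie et al., 2023); adapt them to your dataset and to the chatbot you have chosen.

# Minimal sketch of a vanilla zero-shot NER query (illustrative only, not the
# exact prompt of Xie et al., 2023). Assumes the `openai` package (>= 1.0) and
# an API key in the OPENAI_API_KEY environment variable.
import json
from openai import OpenAI

client = OpenAI()

ENTITY_TYPES = ["person", "organization", "location"]  # adapt to your dataset

PROMPT_TEMPLATE = (
    "You are a named-entity recognition system. "
    "Extract all entities of the types {types} from the sentence below. "
    "Answer only with a JSON list of objects with the keys 'text' and 'type'.\n"
    "Sentence: {sentence}"
)

def zero_shot_ner(sentence, model="gpt-4o-mini"):
    """Send one zero-shot prompt and parse the answer into (text, type) pairs."""
    prompt = PROMPT_TEMPLATE.format(types=", ".join(ENTITY_TYPES), sentence=sentence)
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # deterministic output helps reproducibility
        messages=[{"role": "user", "content": prompt}],
    )
    answer = response.choices[0].message.content
    try:
        return json.loads(answer)
    except json.JSONDecodeError:
        return []  # the model did not follow the requested output format

print(zero_shot_ner("Giorgio Satta teaches NLP at the University of Padua."))

The same query function can be reused for the other two methods you implement, since the methods in the paper differ mainly in how the prompt is constructed.

For step 4, entity-level precision, recall, and F1 are the standard metrics. Below is a minimal sketch under the assumption that gold and predicted entities are represented, per sentence, as sets of (start, end, type) triples; if your dataset comes in BIO format, the seqeval library computes the same entity-level scores directly from tag sequences.

# Minimal sketch of entity-level (exact-match) precision/recall/F1.
# Assumes gold and predicted entities are given, per sentence, as sets of
# (start, end, type) triples; adapt the representation to your own pipeline.

def evaluate(gold, predicted):
    """Micro-averaged exact-match scores over all sentences."""
    tp = fp = fn = 0
    for gold_entities, pred_entities in zip(gold, predicted):
        tp += len(gold_entities & pred_entities)   # correctly predicted entities
        fp += len(pred_entities - gold_entities)   # spurious predictions
        fn += len(gold_entities - pred_entities)   # missed gold entities
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Toy example with one sentence: one entity found, one missed, one spurious.
gold = [{(0, 13, "person"), (31, 50, "organization")}]
pred = [{(0, 13, "person"), (17, 20, "location")}]
print(evaluate(gold, pred))  # {'precision': 0.5, 'recall': 0.5, 'f1': 0.5}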

Your project should be presented as a notebook, to be uploaded to the shared Google folder: see the 'Project registration' post in this forum. Use a file name that includes your names as a prefix. The notebook should be organized as follows.

  • Student names, school registration numbers (matricola), and master program should be reported at the top of the notebook.

  • The domain and the dataset should be introduced and described in a dedicated section.

  • Each of the zero-shot methods that you use should be carefully described in a separate section. If a method requires additional annotation, as in the case of the syntactic augmentation method described in (Xie et al., 2023), the off-the-shelf libraries that you use should be introduced and briefly documented; a brief annotation sketch follows this list.

  • If you use additional methods inspired by similar projects publicly available on the web, you should add a proper acknowledgement section to your notebook.
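
For the syntactic augmentation case mentioned above, an off-the-shelf parser is enough to produce the extra annotation. Here is a minimal sketch assuming spaCy and its small English model; how the annotation is then serialized into the prompt is up to you and should follow (Xie et al., 2023).

# Minimal sketch of syntactic annotation with an off-the-shelf library.
# Assumes: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

def syntactic_annotation(sentence):
    """Return (token, POS tag, dependency label, head token) for each token."""
    doc = nlp(sentence)
    return [(token.text, token.pos_, token.dep_, token.head.text) for token in doc]

for row in syntactic_annotation("Giorgio Satta teaches NLP at the University of Padua."):
    print(row)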

If you have any questions about the second part of the project, please post a message in this thread.

In reply to Giorgio Satta

Re: Project part II: named-entity recognition

by Giorgio Satta -

Just an additional note.

The idea of solving NER through prompting, or by reducing it to a question-answering problem combined with prompting, is certainly not new in the literature. Besides the paper (Xie et al., 2023) referred to in the project assignment, you are welcome to further explore the literature for alternative ways of implementing NER through prompting.
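
To make the reduction concrete: in the question-answering formulation the model is asked one question per entity type, instead of being asked for all entities at once. A minimal illustration follows; the wording of the questions is only an illustration and is not taken from any specific paper.

# Illustrative QA-style prompts for NER: one question per entity type.
ENTITY_QUESTIONS = {
    "person": "Which person names are mentioned in the following sentence?",
    "organization": "Which organization names are mentioned in the following sentence?",
    "location": "Which location names are mentioned in the following sentence?",
}

def qa_prompts(sentence):
    """Build one question-answering prompt per entity type."""
    return {
        entity_type: f"{question}\nSentence: {sentence}\n"
                     "Answer with a comma-separated list, or 'none'."
        for entity_type, question in ENTITY_QUESTIONS.items()
    }

for entity_type, prompt in qa_prompts("Giorgio Satta teaches NLP at the University of Padua.").items():
    print(f"--- {entity_type} ---\n{prompt}\n")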

If you explore approaches other than those mentioned in (Xie et al., 2023), always cite the paper you start from.

In reply to Giorgio Satta

Re: Project part II: named-entity recognition

by MANUEL ANTONUTTI -
Dear professor,

During the testing phase of the techniques (on 100 example sentences) we hit the daily request limit of our LLM API. We were wondering if we could try the following solutions:

1) Avoid testing less effective techniques and focus only on the more promising ones.
2) Modify the paper's techniques to reduce the number of requests needed (in particular tool augmentation).
3) Divide the test set into two parts and feed them to two different LLM models, then combine the results.
In reply to MANUEL ANTONUTTI

Re: Project part II: named-entity recognition

by Giorgio Satta -

All three of the suggestions you report are fine.

In addition, you could limit the testing of your prompts to a development set with fewer than 100 sentences; 50 sentences would still be fine for this project.
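
One further way to make a limited request budget go further is to cache every raw model answer on disk, so that repeated runs over the same development sentences do not consume new requests. A minimal sketch follows; query_model stands for whatever function sends a single prompt to your chatbot, and the cache file name is an arbitrary choice.

# Minimal sketch of on-disk caching of model answers, so that repeated
# development runs do not spend new API requests. `query_model` is a
# placeholder for the function that sends one prompt to your chatbot.
import hashlib
import json
from pathlib import Path

CACHE_FILE = Path("llm_cache.json")  # arbitrary file name

def cached_query(prompt, query_model):
    cache = json.loads(CACHE_FILE.read_text()) if CACHE_FILE.exists() else {}
    key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    if key not in cache:
        cache[key] = query_model(prompt)  # the only line that hits the API
        CACHE_FILE.write_text(json.dumps(cache))
    return cache[key]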

In reply to Giorgio Satta

Re: Project part II: named-entity recognition

by SERGEY KHVAN -
Dear Professor,

I have a question about the contents of the notebook.
Do we need to submit all the code that we used to implement the methods and the data processing, or only the analysis of the results?
In reply to SERGEY KHVAN

Re: Project part II: named-entity recognition

by Giorgio Satta -

I expect you to do both: describe your ideas, the prompts you used, and the results, and include any code that will allow me to reproduce your results.