COMP3009J – Information Retrieval
Programming Assignment

This assignment is worth 30% of the final grade for the module.

Due Date: Sunday 28th May 2023 at 23:55 (i.e. before the beginning of Week 15)

Before you begin, download and extract the files ``small_corpus.zip'' and ``large_corpus.zip'' from Brightspace. These contain several files that you will need to complete this assignment. The README.md file in each describes the files contained in the archive and their format [1].

Both corpora are in the same format, except for the relevance judgments. For the small corpus, all documents not included in the relevance judgments have been judged non-relevant. For the large corpus, documents not included in the relevance judgments have not been judged.

You should first try the following tasks using the small corpus. If you can do this successfully, then you should move to the large corpus. The large corpus will require more efficient programs.

Part 1: BM25 Model

For this assignment you are required to implement the BM25 model of Information Retrieval. You must create a program (using Python) that can do the following.

1. Extract the documents contained in the document collections provided. You must divide the documents into terms in an appropriate way. The strategy must be documented in your source code comments.

2. Perform stopword removal. A list of stopwords to use is contained in the stopwords.txt file that is provided in the ``files'' directory.

3. Perform stemming. For this task, you may use the porter.py code in the ``files'' directory.

4. The first time your program runs, it should create an appropriate index so that IR using the BM25 method may be performed. Here, an index is any data structure that is suitable for performing retrieval later. This will require you to calculate the appropriate weights and do as much pre-calculation as you can. This should be stored in an external file in some human-readable format. Do not use database systems (e.g. MySQL, SQL Server, SQLite, etc.) for this.

5. The other times your program runs, it should load the index from this file, rather than processing the document collection again.

6. Run queries according to the BM25 model. This can be done in two ways:

- In "interactive" mode, a user can manually type in queries and see the first 15 results in their command line, sorted beginning with the highest similarity score. The output should have three columns: the rank, the document's ID, and the similarity score. A sample run of the program is contained later in this document. The user should continue to be prompted to enter further queries until they type "QUIT".

- In "automatic" mode, the standard queries should be read from the ``queries.txt'' file (in the ``files'' directory). This file has a query on each line, beginning with its query ID. The results should be printed into a file named "results.txt", which should include four columns: query ID, rank, document ID and similarity score.

It is ESSENTIAL that this can be run as a standalone program, without requiring an IDE such as IDLE, PyCharm, etc.

You can assume that your program will be run in the same directory as the README.md file (i.e. the current directory will have the ``documents'' and ``files'' directories in it).

[1] This is a Markdown file. Although you can open and read it as plain text, a Markdown editor like Remarkable (https://remarkableapp.github.io/ - Windows or Linux) or MacDown (https://macdown.uranusjr.com/ - macOS) is recommended.
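As a rough sketch of how the pre-computed index and BM25 scoring in Part 1 might fit together, the fragment below scores a toy collection. The toy documents, the parameter values K1 = 1.2 and B = 0.75, the particular IDF variant, and all function names are illustrative assumptions, not requirements of the assignment.

```python
import math
from collections import Counter

# Hypothetical toy collection: doc_id -> list of already-stemmed terms
# (in the real assignment these come from the ``documents'' directory
# after stopword removal and Porter stemming).
docs = {
    "d1": ["librari", "inform", "retriev"],
    "d2": ["inform", "confer", "confer", "paper"],
    "d3": ["librari", "book"],
}

K1, B = 1.2, 0.75  # commonly used BM25 defaults (an assumption here)
N = len(docs)
avg_len = sum(len(terms) for terms in docs.values()) / N

# The "index": per-document term counts plus document frequencies,
# pre-computed once so that query time only needs lookups.
tf = {d: Counter(terms) for d, terms in docs.items()}
df = Counter(t for counts in tf.values() for t in counts)

def bm25_score(query_terms, doc_id):
    """Sum the BM25 contribution of each query term for one document."""
    score = 0.0
    for t in query_terms:
        f = tf[doc_id].get(t, 0)
        if f == 0:
            continue
        # One common IDF variant, kept non-negative by the +1.
        idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
        # Term-frequency component with document-length normalisation.
        norm = f * (K1 + 1) / (f + K1 * (1 - B + B * len(docs[doc_id]) / avg_len))
        score += idf * norm
    return score

def run_query(query_terms, k=15):
    """Return up to the top-k (doc_id, score) pairs, best first."""
    ranked = sorted(((d, bm25_score(query_terms, d)) for d in docs),
                    key=lambda pair: pair[1], reverse=True)
    return [(d, s) for d, s in ranked[:k] if s > 0]
```

For example, `run_query(["librari"])` ranks d3 above d1 because the length normalisation favours the shorter matching document. A real solution would additionally serialise `tf`, `df`, document lengths and `avg_len` to a human-readable file and reload them on later runs.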
Do not use absolute paths in your code.

Non-standard libraries (other than the Porter stemmer provided) may not be used.

Users should be able to select the appropriate mode by using command line arguments, e.g.:

- python search_small_corpus.py -m interactive

Or for automatic mode:

- python search_small_corpus.py -m automatic

Part 2: Evaluation

For this part, your program should evaluate your results.txt file (that was created during automatic mode above) to evaluate the effectiveness of the BM25 approach.

The user should be able to run the program using the following command:

- python evaluate_small_corpus.py

Based on the results.txt file, your program should calculate and print the following evaluation metrics (based on the relevance judgments contained in the ``qrels.txt'' file in the ``files'' directory):

- Precision
- Recall
- P@10
- R-precision
- MAP
- bpref

What you should submit

Submission of this assignment is through Brightspace. You should submit a single .zip archive containing:

- Part 1: Python programs to run the queries.
  o search_small_corpus.py
  o search_large_corpus.py (if you have attempted the large corpus)
- Part 2: Python files to perform the evaluation.
  o evaluate_small_corpus.py
  o evaluate_large_corpus.py (if you have attempted the large corpus)
- A README.txt or README.md file that describes what your program can do (in particular it should mention whether the program will work on both corpora or only the small one).

Sample Run (Interactive)

$ python search_small_corpus.py -m interactive
Loading BM25 index from file, please wait.
Enter query: library information conference
Results for query [library information conference]
1 928 0.991997
2 1109 0.984280
3 1184 0.979530
4 309 0.969075
5 533 0.918940
6 710 0.912594
7 388 0.894091
8 1311 0.847748
9 960 0.845044
10 717 0.833753
11 77 0.829261
12 1129 0.821643
13 783 0.817639
14 1312 0.804034
15 423 0.795264
Enter query: QUIT

Note: In all of these examples, the results and similarity scores were generated at random for illustration purposes, so they are not correct scores.

Sample Run (Evaluation)

$ python evaluate_small_corpus.py
Evaluation results:
Precision: 0.138
Recall: 0.412
R-precision: 0.345
P@10: 0.621
MAP: 0.253
bpref: 0.345

Grading Rubric

Below are the main criteria that will be applied for the major grades (A, B, C, etc.). Other aspects will be taken into account to decide minor grades (i.e. the difference between B+, B, B-, etc.).

- Readability and organisation of code (including use of appropriate functions, variable names, helpful comments, etc.).
- Quality of solution (including code efficiency, presence of minor bugs, avoiding absolute paths, etc.).

Questions should be sent to david.lillis@ucd.ie or posted in the Brightspace forum.

Passing Grades

``D'' Grade
Good implementation of the primary aspects of Information Retrieval, using the small corpus. This includes extracting the documents from the document collection, preprocessing (stemming and stopword removal), indexing and retrieval. The solution may contain some implementation errors. It is clear that the student has correctly understood the Information Retrieval process.

``C'' Grade
Good implementation of the primary aspects of Information Retrieval, using the small corpus. The program can also save and load the index to/from an external file, as appropriate.

``B'' Grade
Correct implementation of all sections of the assignment using the small corpus (some minor implementation errors will be tolerated). It is clear that the student has understood both the information retrieval process and the evaluation process. Note: This means that evaluation is only taken into account if your search program can successfully retrieve documents for evaluation.

``A'' Grade
Excellent implementation of all sections of the assignment, including choice of appropriate efficient data structures and efficient programming. The efficiency of the programs will be measured using the large corpus. In particular, a response to a query must be returned in a reasonable amount of time, although efficiency is important in indexing also. Note: This means that working efficiently on the large corpus is only taken into account if your code can successfully work with the small corpus.

Failing Grades

``ABS'' Grade
No submission received.

``NM'' Grade
No relevant work attempted.

``G'' Grade
Wholly unacceptable, little or no evidence of meaningful work attempted.

``F'' Grade
Some evidence of work attempted, but little (if any) functionality operates in the correct manner.

``E'' Grade
Clear evidence that work has been attempted on implementing retrieval using BM25, but there are serious errors in implementation, or in understanding of the process.

Other notes

1. This is an individual assignment. All code submitted must be your own work. Submitting the work of somebody else or generated by AI tools such as ChatGPT is plagiarism, which is a serious academic offence. Be familiar with the UCD Plagiarism Policy and the UCD School of Computer Science Plagiarism Policy.

2. If you have questions about what is or is not plagiarism, ask!

Document Version History

v1.0: 2023-05-08, Initial Version.
v1.1: 2023-05-15, Updated requirements for the output format of automatic mode.
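The evaluation metrics required in Part 2 can be sketched as below, assuming each query's results are a ranked list of document IDs and the judged relevant/non-relevant sets are known. The function names, the toy data, and the particular bpref formulation (one common simplification of the trec_eval definition) are illustrative assumptions, not the assignment's required interface.

```python
def precision(ranked, relevant):
    """Fraction of the returned documents that are relevant."""
    hits = sum(1 for d in ranked if d in relevant)
    return hits / len(ranked) if ranked else 0.0

def recall(ranked, relevant):
    """Fraction of the relevant documents that were returned."""
    hits = sum(1 for d in ranked if d in relevant)
    return hits / len(relevant) if relevant else 0.0

def precision_at(ranked, relevant, k):
    """P@k, e.g. precision_at(ranked, relevant, 10) for P@10."""
    return precision(ranked[:k], relevant)

def r_precision(ranked, relevant):
    """Precision after retrieving |relevant| documents."""
    return precision(ranked[:len(relevant)], relevant)

def average_precision(ranked, relevant):
    """Mean of the precision values at each rank holding a relevant doc.
    MAP is the mean of this value over all queries."""
    hits, total = 0, 0.0
    for rank, d in enumerate(ranked, start=1):
        if d in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def bpref(ranked, relevant, nonrelevant):
    """For each relevant doc retrieved, penalise by the number of *judged*
    non-relevant docs ranked above it (capped at R = |relevant|)."""
    R = len(relevant)
    n_above, total = 0, 0.0
    for d in ranked:
        if d in relevant:
            total += 1 - min(n_above, R) / R
        elif d in nonrelevant:
            n_above += 1  # only judged non-relevant docs count
    return total / R if R else 0.0
```

Note the distinction the assignment draws between the corpora: for the small corpus every unjudged document is non-relevant, while for the large corpus unjudged documents must be skipped in bpref's penalty count rather than treated as non-relevant.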