COMP3009J – Information Retrieval
Programming Assignment

This assignment is worth 30% of the final grade for the module.

Due Date: Sunday 28th May 2023 at 23:55 (i.e. before the beginning of Week 15)

Before you begin, download and extract the files ``small_corpus.zip'' and ``large_corpus.zip'' from Brightspace. These contain several files that you will need to complete this assignment. The README.md file in each describes the files contained in the archive and their format [1].

Both corpora are in the same format, except for the relevance judgments. For the small corpus, all documents not included in the relevance judgments have been judged non-relevant. For the large corpus, documents not included in the relevance judgments have not been judged.

You should first try the following tasks using the small corpus. If you can do this successfully, then you should move to the large corpus. The large corpus will require more efficient programs.

Part 1: BM25 Model

For this assignment you are required to implement the BM25 model of Information Retrieval. You must create a program (using Python) that can do the following.

1. Extract the documents contained in the document collections provided. You must divide the documents into terms in an appropriate way. The strategy must be documented in your source code comments.

2. Perform stopword removal. A list of stopwords to use is contained in the stopwords.txt file that is provided in the ``files'' directory.

3. Perform stemming. For this task, you may use the porter.py code in the ``files'' directory.

4. The first time your program runs, it should create an appropriate index so that IR using the BM25 method may be performed. Here, an index is any data structure that is suitable for performing retrieval later. This will require you to calculate the appropriate weights and do as much pre-calculation as you can. This should be stored in an external file in some human-readable format. Do not use database systems (e.g. MySQL, SQL Server, SQLite, etc.) for this.

5. The other times your program runs, it should load the index from this file, rather than processing the document collection again.

6. Run queries according to the BM25 model. This can be done in two ways:

- In "interactive" mode, a user can manually type in queries and see the first 15 results in their command line, sorted beginning with the highest similarity score. The output should have three columns: the rank, the document's ID, and the similarity score. A sample run of the program is contained later in this document. The user should continue to be prompted to enter further queries until they type "QUIT".

- In "automatic" mode, the standard queries should be read from the ``queries.txt'' file (in the ``files'' directory). This file has a query on each line, beginning with its query ID. The results should be printed into a file named "results.txt", which should include four columns: query ID, rank, document ID and similarity score.

It is ESSENTIAL that this can be run as a standalone program, without requiring an IDE such as IDLE, PyCharm, etc.

You can assume that your program will be run in the same directory as the README.md file (i.e. the current directory will have the ``documents'' and ``files'' directories in it).

[1] This is a Markdown file. Although you can open and read it as plain text, a Markdown editor like Remarkable (https://remarkableapp.github.io/ - Windows or Linux) or MacDown (https://macdown.uranusjr.com/ - macOS) is recommended.
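As a rough sketch of how the pre-computed index and BM25 scoring in Part 1 might fit together, the fragment below scores a toy collection. The toy documents, the parameter values K1 = 1.2 and B = 0.75, the particular IDF variant, and all function names are illustrative assumptions, not requirements of the assignment.

```python
import math
from collections import Counter

# Hypothetical toy collection: doc_id -> list of already-stemmed terms
# (in the real assignment these come from the ``documents'' directory
# after stopword removal and Porter stemming).
docs = {
    "d1": ["librari", "inform", "retriev"],
    "d2": ["inform", "confer", "confer", "paper"],
    "d3": ["librari", "book"],
}

K1, B = 1.2, 0.75  # commonly used BM25 defaults (an assumption here)
N = len(docs)
avg_len = sum(len(terms) for terms in docs.values()) / N

# The "index": per-document term counts plus document frequencies,
# pre-computed once so that query time only needs lookups.
tf = {d: Counter(terms) for d, terms in docs.items()}
df = Counter(t for counts in tf.values() for t in counts)

def bm25_score(query_terms, doc_id):
    """Sum the BM25 contribution of each query term for one document."""
    score = 0.0
    for t in query_terms:
        f = tf[doc_id].get(t, 0)
        if f == 0:
            continue
        # One common IDF variant, kept non-negative by the +1.
        idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
        # Term-frequency component with document-length normalisation.
        norm = f * (K1 + 1) / (f + K1 * (1 - B + B * len(docs[doc_id]) / avg_len))
        score += idf * norm
    return score

def run_query(query_terms, k=15):
    """Return up to the top-k (doc_id, score) pairs, best first."""
    ranked = sorted(((d, bm25_score(query_terms, d)) for d in docs),
                    key=lambda pair: pair[1], reverse=True)
    return [(d, s) for d, s in ranked[:k] if s > 0]
```

For example, `run_query(["librari"])` ranks d3 above d1 because the length normalisation favours the shorter matching document. A real solution would additionally serialise `tf`, `df`, document lengths and `avg_len` to a human-readable file and reload them on later runs.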
Do not use absolute paths in your code.

Non-standard libraries (other than the Porter stemmer provided) may not be used.

Users should be able to select the appropriate mode by using command line arguments, e.g.:

- python search_small_corpus.py -m interactive

Or for automatic mode:

- python search_small_corpus.py -m automatic

Part 2: Evaluation

For this part, your program should evaluate your results.txt file (that was created during automatic mode above) to evaluate the effectiveness of the BM25 approach.

The user should be able to run the program using the following command:

- python evaluate_small_corpus.py

Based on the results.txt file, your program should calculate and print the following evaluation metrics (based on the relevance judgments contained in the ``qrels.txt'' file in the ``files'' directory):

- Precision
- Recall
- P@10
- R-precision
- MAP
- bpref

What you should submit

Submission of this assignment is through Brightspace. You should submit a single .zip archive containing:

- Part 1: Python programs to run the queries.
  o search_small_corpus.py
  o search_large_corpus.py (if you have attempted the large corpus)
- Part 2: Python files to perform the evaluation.
  o evaluate_small_corpus.py
  o evaluate_large_corpus.py (if you have attempted the large corpus)
- A README.txt or README.md file that describes what your program can do (in particular it should mention whether the program will work on both corpora or only the small one).

Sample Run (Interactive)

$ python search_small_corpus.py -m interactive
Loading BM25 index from file, please wait.
Enter query: library information conference
Results for query [library information conference]
1 928 0.991997
2 1109 0.984280
3 1184 0.979530
4 309 0.969075
5 533 0.918940
6 710 0.912594
7 388 0.894091
8 1311 0.847748
9 960 0.845044
10 717 0.833753
11 77 0.829261
12 1129 0.821643
13 783 0.817639
14 1312 0.804034
15 423 0.795264
Enter query: QUIT

Note: In all of these examples, the results and similarity scores were generated at random for illustration purposes, so they are not correct scores.

Sample Run (Evaluation)

$ python evaluate_small_corpus.py
Evaluation results:
Precision: 0.138
Recall: 0.412
R-precision: 0.345
P@10: 0.621
MAP: 0.253
bpref: 0.345

Grading Rubric

Below are the main criteria that will be applied for the major grades (A, B, C, etc.). Other aspects will be taken into account to decide minor grades (i.e. the difference between B+, B, B-, etc.).

- Readability and organisation of code (including use of appropriate functions, variable names, helpful comments, etc.).
- Quality of solution (including code efficiency, presence of minor bugs, avoiding absolute paths, etc.).

Questions should be sent to david.lillis@ucd.ie or posted in the Brightspace forum.

Passing Grades

``D'' Grade
Good implementation of the primary aspects of Information Retrieval, using the small corpus. This includes extracting the documents from the document collection, preprocessing (stemming and stopword removal), indexing and retrieval. The solution may contain some implementation errors. It is clear that the student has correctly understood the Information Retrieval process.

``C'' Grade
Good implementation of the primary aspects of Information Retrieval, using the small corpus. The program can also save and load the index to/from an external file, as appropriate.

``B'' Grade
Correct implementation of all sections of the assignment using the small corpus (some minor implementation errors will be tolerated). It is clear that the student has understood both the information retrieval process and the evaluation process. Note: This means that evaluation is only taken into account if your search program can successfully retrieve documents for evaluation.

``A'' Grade
Excellent implementation of all sections of the assignment, including choice of appropriate efficient data structures and efficient programming. The efficiency of the programs will be measured using the large corpus. In particular, a response to a query must be returned in a reasonable amount of time, although efficiency is important in indexing also. Note: This means that working efficiently on the large corpus is only taken into account if your code can successfully work with the small corpus.

Failing Grades

``ABS'' Grade
No submission received.

``NM'' Grade
No relevant work attempted.

``G'' Grade
Wholly unacceptable, little or no evidence of meaningful work attempted.

``F'' Grade
Some evidence of work attempted, but little (if any) functionality operates in the correct manner.

``E'' Grade
Clear evidence that work has been attempted on implementing retrieval using BM25, but there are serious errors in implementation, or in understanding of the process.

Other notes

1. This is an individual assignment. All code submitted must be your own work. Submitting the work of somebody else or generated by AI tools such as ChatGPT is plagiarism, which is a serious academic offence. Be familiar with the UCD Plagiarism Policy and the UCD School of Computer Science Plagiarism Policy.

2. If you have questions about what is or is not plagiarism, ask!

Document Version History

v1.0: 2023-05-08, Initial Version.
v1.1: 2023-05-15, Updated requirements for the output format of automatic mode.
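The evaluation metrics required in Part 2 can be sketched as below, assuming each query's results are a ranked list of document IDs and the judged relevant/non-relevant sets are known. The function names, the toy data, and the particular bpref formulation (one common simplification of the trec_eval definition) are illustrative assumptions, not the assignment's required interface.

```python
def precision(ranked, relevant):
    """Fraction of the returned documents that are relevant."""
    hits = sum(1 for d in ranked if d in relevant)
    return hits / len(ranked) if ranked else 0.0

def recall(ranked, relevant):
    """Fraction of the relevant documents that were returned."""
    hits = sum(1 for d in ranked if d in relevant)
    return hits / len(relevant) if relevant else 0.0

def precision_at(ranked, relevant, k):
    """P@k, e.g. precision_at(ranked, relevant, 10) for P@10."""
    return precision(ranked[:k], relevant)

def r_precision(ranked, relevant):
    """Precision after retrieving |relevant| documents."""
    return precision(ranked[:len(relevant)], relevant)

def average_precision(ranked, relevant):
    """Mean of the precision values at each rank holding a relevant doc.
    MAP is the mean of this value over all queries."""
    hits, total = 0, 0.0
    for rank, d in enumerate(ranked, start=1):
        if d in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def bpref(ranked, relevant, nonrelevant):
    """For each relevant doc retrieved, penalise by the number of *judged*
    non-relevant docs ranked above it (capped at R = |relevant|)."""
    R = len(relevant)
    n_above, total = 0, 0.0
    for d in ranked:
        if d in relevant:
            total += 1 - min(n_above, R) / R
        elif d in nonrelevant:
            n_above += 1  # only judged non-relevant docs count
    return total / R if R else 0.0
```

Note the distinction the assignment draws between the corpora: for the small corpus every unjudged document is non-relevant, while for the large corpus unjudged documents must be skipped in bpref's penalty count rather than treated as non-relevant.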