[Other] Inverted Index Project

Posted by 那种倒影成月 7 days ago
I haven't spoken much about the class I've been teaching this semester. It's an intro CS course - a programming-heavy intro. I decided to use Python, with a transition to C++ at the end. The transition mirrors Hunter's normal first CS course, which ends with a C++ intro to prepare the students for next semester's CS course: a more intense OOP class using C++, the language we use in our core courses.
  Throughout the semester I've tried to use a variety of interesting application areas to give the students some idea of the possibilities that studying CS will open up for them.
  After covering Python dictionaries and lists I thought we'd play by building an inverted index.
  The basic idea is to map a set of words back to source files. For example, given the following four one-line files:
    files      contents
    file.01    if you prick us do we not bleed
    file.02    if you tickle us do we not laugh
    file.03    if you poison us do we not die and
    file.04    if you wrong us shall we not revenge

  You could build a data structure mapping each word back to the file(s) that contain it (partially shown here):

    Word     Files containing it
    if       file.01 file.02 file.03 file.04
    you      file.01 file.02 file.03 file.04
    prick    file.01
    us       file.01 file.02 file.03 file.04
    do       file.01 file.02 file.03

  You can, of course, store more information - how many times a word appears in a file, where it appears, etc.
  This is a fairly easy structure to build: a dictionary where the keys are the words in the files and the values are lists of the documents containing those words.
    inverted_index = {
        'if': ['file.01', 'file.02', 'file.03', 'file.04'],
        'you': ['file.01', 'file.02', 'file.03', 'file.04'],
        'prick': ['file.01'],
        'us': ['file.01', 'file.02', 'file.03', 'file.04'],
        'do': ['file.01', 'file.02', 'file.03'],
        # ... one entry per remaining word
    }
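  Here is a minimal sketch of how such a dictionary might be built, not necessarily the way we did it in class. The helper names tokenize and build_inverted_index are made up for illustration, and the four "files" are held in a plain dictionary rather than read from disk so the example runs on its own.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and return its words (letters and apostrophes only)."""
    return re.findall(r"[a-z']+", text.lower())


def build_inverted_index(documents):
    """Map each word to the sorted list of document names that contain it.

    `documents` is a dict of {name: text}; reading real files into that
    dict is left to the caller (an assumption made to keep this short).
    """
    index = defaultdict(set)  # sets avoid listing a file twice per word
    for name, text in documents.items():
        for word in tokenize(text):
            index[word].add(name)
    return {word: sorted(names) for word, names in index.items()}


documents = {
    "file.01": "if you prick us do we not bleed",
    "file.02": "if you tickle us do we not laugh",
    "file.03": "if you poison us do we not die and",
    "file.04": "if you wrong us shall we not revenge",
}

inverted_index = build_inverted_index(documents)
print(inverted_index["prick"])  # ['file.01']
print(inverted_index["do"])     # ['file.01', 'file.02', 'file.03']
```

  Collecting names in a set and sorting at the end keeps each file listed once per word, even if the word repeats within a file.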
In addition to letting us work with dictionaries and lists, we can also review file access and even the Python csv module if we want.
  We can immediately write simple queries, such as "what document(s) contain the word 'prick'?", but things get more interesting if you write functions to perform "and" and "or" queries: "what document(s) contain the words 'prick' or 'do'?", for instance.
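  One possible way to write those two query functions is sketched below. The names or_query and and_query are hypothetical, and the small index is a hand-written excerpt of the four-file example: an "or" query is a set union of the document lists, an "and" query is a set intersection.

```python
def or_query(index, words):
    """Return the documents that contain at least one of the words."""
    result = set()
    for word in words:
        result |= set(index.get(word, []))  # unknown words contribute nothing
    return sorted(result)


def and_query(index, words):
    """Return the documents that contain every one of the words."""
    if not words:
        return []
    result = set(index.get(words[0], []))
    for word in words[1:]:
        result &= set(index.get(word, []))  # keep only documents seen so far
    return sorted(result)


# Hand-written excerpt of the four-file example index.
inverted_index = {
    "prick": ["file.01"],
    "do": ["file.01", "file.02", "file.03"],
    "shall": ["file.04"],
}

print(or_query(inverted_index, ["prick", "do"]))      # ['file.01', 'file.02', 'file.03']
print(and_query(inverted_index, ["prick", "do"]))     # ['file.01']
print(and_query(inverted_index, ["prick", "shall"]))  # []
```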
  Why are we building this (besides as a data structure and programming exercise)? I've seen a number of references to using an inverted index when building a web search engine. In fact, I think that's something you do early on in the Udacity MOOC. I just wanted to play with information retrieval.
  I remembered that there was a collection of information, including last statements, from executed offenders in Texas. Someone conveniently converted it into a Google Spreadsheet. The format's a little different from our simple four-file example, but then there's more data. It's straightforward enough to download the spreadsheet as a CSV file and then read it with a Python program that builds an inverted index from it.
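  A rough sketch of that CSV step, using csv.DictReader from the standard library, might look like the following. The function name index_csv, the column names "id" and "text", and the sample rows are all placeholders made up for this example; the real spreadsheet's columns will differ and need to be substituted.

```python
import csv
import io
import re
from collections import defaultdict


def index_csv(csv_file, id_column, text_column):
    """Build {word: sorted list of row ids} from one text column of a CSV."""
    index = defaultdict(set)
    for row in csv.DictReader(csv_file):  # first CSV line supplies column names
        for word in re.findall(r"[a-z']+", row[text_column].lower()):
            index[word].add(row[id_column])
    return {word: sorted(ids) for word, ids in index.items()}


# Placeholder data standing in for the downloaded CSV file.
sample = io.StringIO(
    "id,text\n"
    "a,red green\n"
    'b,"green, blue"\n'
    "c,blue\n"
)

index = index_csv(sample, id_column="id", text_column="text")
print(index["green"])  # ['a', 'b']
print(index["blue"])   # ['b', 'c']
```

  With a real download you would pass a file opened with open(path, newline="") instead of the io.StringIO object.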
  Now we have some interesting data to play with.
  How many offenders used words like "sorry" or "apologize"? How about references to religion? We can do all sorts of "and" and "or" queries.
  We just played with this a bit but I could see all sorts of explorations. What about taking some great work of literature and turning it into an inverted index by chapter? You could query characters or certain words and see where and when they appear in the book. A new and different way of exploring literature.
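  As a sketch of the by-chapter idea (we didn't build this in class), the function below splits a text on chapter headings and maps each word to the chapter numbers where it appears. The name index_by_chapter is hypothetical, the code assumes every chapter starts with a line beginning "CHAPTER", and the tiny "book" is a toy stand-in for a real text.

```python
import re
from collections import defaultdict


def index_by_chapter(book_text, heading_pattern=r"^CHAPTER\b.*$"):
    """Map each word to the sorted list of chapter numbers where it appears.

    Assumes each chapter begins with a line matching `heading_pattern`;
    any text before the first heading is ignored.
    """
    chapters = re.split(heading_pattern, book_text, flags=re.MULTILINE)[1:]
    index = defaultdict(set)
    for number, chapter in enumerate(chapters, start=1):
        for word in re.findall(r"[a-z']+", chapter.lower()):
            index[word].add(number)
    return {word: sorted(numbers) for word, numbers in index.items()}


# Toy text standing in for a real book.
book = (
    "Preface text\n"
    "CHAPTER 1\nAlice meets Bob\n"
    "CHAPTER 2\nBob leaves\n"
    "CHAPTER 3\nAlice returns\n"
)

chapter_index = index_by_chapter(book)
print(chapter_index["alice"])  # [1, 3]
print(chapter_index["bob"])    # [1, 2]
```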
  So, there you have it - an interesting little project we played with this past semester. We did it in an intro Python course but I could see it as an interesting project in AP CS A using hashmaps and lists.