Build an automated electronic information system for monitoring, organizing, and visualizing reports of outbreaks of public health concerns in targeted geographical regions according to location, time, and condition/event.
It processes data from a number of sources, such as social media, news agencies, and personal blogs. It filters the extracted text against large keyword datasets to find items that may be of interest. Such a system would be trained to categorize items as being of bio-surveillance interest (or not) and to prioritize them by level.
The system would have a user interface that presents the analyzed data for human interpretation. The results of that interpretation are fed back into the analysis engine to train it or to improve its algorithm.
What I Built
An automated system that crawls various data sources, such as Twitter and Google's news search, extracts a blob of text from each, and passes it for filtering against large keyword datasets.
The system keeps the extracted text that may be of interest, and it can be trained to categorize items as being of bio-surveillance interest (or not) and to prioritize them by level.
Public Libraries Used
Joomla!:
Provides an MVC framework, user authentication, and back-end administration out of the box.
TwitterAPIExchange:
PHP wrapper for Twitter API v1.1 calls, located at:
YAML UI framework:
Similar in usage to Bootstrap, but much lighter and more cross-platform. In this project, YAML's core files have been compressed, along with some custom CSS, into one file at:
A few of its other files are at:
jQuery and jQuery UI:
First, we held several meetings to understand the full requirements specification and to agree on exceptions where necessary. We also reviewed the project budget, the deployment site, and external factors.
Then, with those points clarified, we worked through the following six milestones.
Milestone 1: Design and specification document
Milestone 2: Development of the data crawler
Milestone 3: Machine learning interface development
Milestone 4: User interface design and development
Milestone 5: Integration and refinements
Milestone 6: Training
Several weighting options were considered, and we finally implemented a weighting algorithm that estimates the average number of people who engage with (and talk about) posted data, based on the authenticity and readership size of the source.
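To make the weighting idea concrete, here is a minimal sketch in Python. It is not the formula the project used. The `Source` fields, the 0 to 1 authenticity scale, the default `engagement_rate`, and the multiplicative formula are all assumptions chosen only to illustrate how authenticity and readership size could combine into one estimate.

```python
from dataclasses import dataclass


@dataclass
class Source:
    """A data source being weighted (all fields are hypothetical)."""
    name: str
    readership: int      # estimated audience size, e.g. followers or subscribers
    authenticity: float  # assumed scale: 0.0 (unverified) to 1.0 (fully trusted)


def estimated_reach(source: Source, engagement_rate: float = 0.02) -> float:
    """Estimate how many people engage with a post from this source.

    Assumed formula: readership * engagement_rate * authenticity.
    """
    if not 0.0 <= source.authenticity <= 1.0:
        raise ValueError("authenticity must be between 0 and 1")
    return source.readership * engagement_rate * source.authenticity


# Illustrative values only, not data from the project.
agency = Source("national news agency", readership=500_000, authenticity=0.9)
blog = Source("personal blog", readership=2_000, authenticity=0.4)

# Rank sources so that items from high-reach, trusted sources come first.
ranked = sorted([blog, agency], key=estimated_reach, reverse=True)
print([s.name for s in ranked])
```

With a score like this, a post from a large, trusted source outranks the same text from a small, unverified one, which is the behaviour described above.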
Topics and Keywords:
Topics are items to be tracked online. Keywords are carefully chosen words or partial words that filter each topic to ensure that only references that are relevant to the application are fetched.
For example, the topic "Yellow Fever" combined with the keyword "outbreak" ensures that references to "Yellow Fever" meaning traffic wardens, or anything else outside our context of interest, are not fetched.
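The topic and keyword pairing can be sketched as a simple filter. This is an illustration in Python, not the project's code. The function name `is_relevant`, the sample posts, and the rule "topic present and at least one keyword present" are assumptions. Keywords are matched as substrings so that partial words work.

```python
from typing import Sequence


def is_relevant(text: str, topic: str, keywords: Sequence[str]) -> bool:
    """Return True if the text mentions the topic and at least one keyword.

    Matching is case-insensitive, and keywords match as substrings so that a
    partial word such as "vaccin" covers "vaccine" and "vaccination".
    """
    lowered = text.lower()
    if topic.lower() not in lowered:
        return False
    return any(keyword.lower() in lowered for keyword in keywords)


# Hypothetical sample posts for illustration.
posts = [
    "Yellow Fever outbreak confirmed in two districts",
    "Yellow Fever officers directing traffic at the junction",
]
keywords = ["outbreak", "vaccin"]

for post in posts:
    print(is_relevant(post, "Yellow Fever", keywords), "-", post)
```

The first post passes because it contains both the topic and a keyword. The second mentions the topic but no keyword, so it is dropped.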
One major challenge was that mobile users often do not enable location on Twitter, and for some users the actual location cannot be determined because Internet Service Providers (ISPs) report the user's location as that of the ISP's headquarters. For this reason, regular expressions were used to determine the location a feed is coming from.
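A minimal sketch of the regular expression approach follows, written in Python rather than the PHP the project ran on. It assumes that location is inferred by matching the text against a list of known place names. The `KNOWN_PLACES` list and the `guess_location` helper are hypothetical, and the real patterns may have been quite different.

```python
import re
from typing import Optional

# Hypothetical gazetteer; a real deployment would load the targeted regions.
KNOWN_PLACES = ["Lagos", "Abuja", "Kano", "Port Harcourt"]

# One alternation of escaped place names, matched on word boundaries.
PLACE_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(place) for place in KNOWN_PLACES) + r")\b",
    re.IGNORECASE,
)


def guess_location(text: str) -> Optional[str]:
    """Return the first known place mentioned in the text, or None."""
    match = PLACE_PATTERN.search(text)
    if match is None:
        return None
    # Normalise to the gazetteer's spelling regardless of the case in the text.
    found = match.group(1).lower()
    return next(place for place in KNOWN_PLACES if place.lower() == found)


print(guess_location("Suspected cholera cases reported in lagos this week"))
print(guess_location("No place mentioned here"))
```

Text matching of this kind works even when the device or the ISP gives no usable location, but it only finds places that the text itself names.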
I learned that this can be implemented on-site using public libraries such as Joomla!, TwitterAPIExchange, the YAML UI framework, and jQuery.
Ordinarily, I would have considered the following:
1. BigQuery: delivers major improvements in speed, cost, and real-time querying compared to other big-data databases.
2. Google's Natural Language Processing API: to get sufficiently good meaning extraction and categorization from our data filters within the short period of this project, by building basic pattern recognition combined with context-aware entity recognition and sentiment analysis.
3. User interface: to be built in HTML5/CSS3/jQuery so that it is light, fast, and accessible across desktop, tablet, and most mobile screens, with its back end built on Python/Flask.
Tips and advice
From my experience on this project, the choice of technology should be driven by clear terms of reference and a clear requirements specification. These determine whether the application should be cloud-based or hosted on-site, and the best technology, tools, and libraries follow from that decision.
The project delivered its phase target and will continue to be developed, with ongoing training and fine-tuning of the system's algorithm to recognize bio-surveillance messages of interest.