Data mining through crowdsourcing on unstructured data sets

Generally, data mining is understood as the process of analyzing data from various perspectives to extract useful information that can be used to improve performance indicators, for example by increasing sales volume or lowering costs.

In practice, it aims to identify correlations between the different dimensions of the existing information.

In most cases, the way the data is collected or generated allows the later use of dedicated data mining programs such as Angoss, RapidMiner, Knime, Weka, and others.

By using machine learning algorithms, such programs may even “understand” certain patterns and apply them in the selection of information.

When such a solution is technically feasible, things are usually simple and the only real problem is cost (the license for the program and the specialist who can operate it).

Very often, however, a company needs to analyze a set of unstructured data that an algorithm cannot process efficiently but that is easy for a human operator to handle.

For example, from a single image of a dress, a person can easily tell what color it is, whether it is long or short, whether it has an open back, whether it has buttons or a zipper, whether it has a print, and so on.

Similarly, a person can watch a video and tell whether or not something happens in it much more easily than an algorithm can.

Through crowdsourcing, data analysts and companies have virtual access to a large number of people who can handle the processing of existing data.

Processing videos from surveillance cameras through crowdsourcing

The beneficiary of the project owns a network of physical stores, in Cork and across the country, specializing in the sale of flowers.

For security reasons, motion-activated surveillance cameras have been installed in each store.

One of these cameras records the cash register area.

The issue: The surveillance camera in question produces about 900 video files per day.

The length of such a file varies between 15 seconds and 4 minutes, and some actions are captured across multiple consecutive files.

Three stores were included in the project, which means approximately 2,700 video files per day to process.

Over one week (with the stores open for 6 days), 16,892 video files were collected.

These recordings had to be reviewed in order to compare the number of transactions recorded in the cash register with the number of transactions actually made in the store.

The objective was to identify any differences, and the resulting losses to the company, caused by sales that were not rung up at the register.
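The comparison described above can be sketched as a simple reconciliation of counts per store and day. This is only an illustration: the function names, the (store, day) key, and the sample data are assumptions, not details from the project. It also assumes that an action spanning several consecutive files has already been counted as a single transaction.

```python
from collections import Counter


def count_by_store_day(transactions):
    """Count transactions per (store, day) from an iterable of (store, day) pairs."""
    return Counter(transactions)


def find_discrepancies(video_counts, register_counts):
    """Return {(store, day): video_count - register_count} for every non-zero difference."""
    diffs = {}
    for key in sorted(set(video_counts) | set(register_counts)):
        diff = video_counts.get(key, 0) - register_counts.get(key, 0)
        if diff != 0:
            diffs[key] = diff
    return diffs


# Made-up illustrative data, not figures from the project.
seen_on_video = count_by_store_day(
    [("store_1", "day_1")] * 3 + [("store_2", "day_1")] * 2
)
in_register = count_by_store_day(
    [("store_1", "day_1")] * 2 + [("store_2", "day_1")] * 2
)

# A positive difference means more transactions were seen on video
# than were recorded in the cash register.
print(find_discrepancies(seen_on_video, in_register))
# → {('store_1', 'day_1'): 1}
```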

The videos had to be checked manually to identify transactions, because no video analysis algorithm was available.

The classic solution to this problem involves assigning a person to view the clips and identify transactions.

Working 8 hours a day, 5 days a week, that person would have to view 3,378 video files daily to keep up with the rate at which they are generated.

In practice, this means viewing 422 videos per hour (7 clips per minute), which is very difficult even with specialized equipment (simultaneous display of multiple video streams).
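The workload figures above follow directly from the weekly total. The short calculation below reproduces them. The variable and function names are illustrative, and the last value (seconds available per clip) is an extra derived figure that the article does not state.

```python
TOTAL_FILES = 16_892   # files collected in one week (from the article)
WORK_DAYS = 5          # working days per week
HOURS_PER_DAY = 8      # working hours per day


def reviewer_workload(total_files, work_days, hours_per_day):
    """Return (files per day, files per hour, files per minute) for one reviewer."""
    per_day = total_files / work_days
    per_hour = per_day / hours_per_day
    per_minute = per_hour / 60
    return per_day, per_hour, per_minute


per_day, per_hour, per_minute = reviewer_workload(TOTAL_FILES, WORK_DAYS, HOURS_PER_DAY)
seconds_per_clip = 3600 / per_hour  # time budget per clip for a single reviewer

print(f"{per_day:.0f} files/day, {per_hour:.0f} files/hour, {per_minute:.1f} files/minute")
# → 3378 files/day, 422 files/hour, 7.0 files/minute
print(f"{seconds_per_clip:.1f} seconds available per clip")
# → 8.5 seconds available per clip
```

Since the shortest clips last 15 seconds, a budget of about 8.5 seconds per clip means a single reviewer cannot watch the footage in real time.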

My solution: Data analysis through a crowdsourcing platform.

Working mode: The 16,892 video files were retrieved from the client, resized, and uploaded to a server, so that each clip had its own URL and could be opened in an online player similar to YouTube.

Work began as soon as the job was entered into the platform, with each video checked by 3 different users to make the result as reliable as possible.

Moreover, only users with a reputation score above 80% were selected for the job, i.e. people who had answered correctly in at least 80% of the projects they had previously worked on.
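The two rules above (a reputation threshold and 3 different reviewers per video) can be sketched as follows. This is a minimal illustration and not the platform's actual mechanism: the function names, the data layout, and the random assignment are assumptions, and the threshold is treated as "at least 80%". On a real platform, workers typically pick up tasks themselves rather than being assigned in advance.

```python
import random

MIN_REPUTATION = 0.80    # assumed reading of the 80% rule: "at least 80%"
REVIEWS_PER_VIDEO = 3    # each video is checked by 3 different users


def eligible_workers(reputations, threshold=MIN_REPUTATION):
    """Keep only workers whose past accuracy meets the threshold."""
    return [worker for worker, score in reputations.items() if score >= threshold]


def assign_reviewers(video_ids, workers, k=REVIEWS_PER_VIDEO, seed=0):
    """Give every video k distinct reviewers, chosen at random (hypothetical scheme)."""
    if len(workers) < k:
        raise ValueError("not enough eligible workers for the requested redundancy")
    rng = random.Random(seed)
    # random.sample draws without replacement, so the k reviewers are distinct.
    return {video_id: rng.sample(workers, k) for video_id in video_ids}


# Made-up reputation scores for illustration.
reputations = {"ana": 0.95, "ben": 0.79, "cat": 0.88, "dan": 0.80, "eve": 0.60}
workers = eligible_workers(reputations)
assignments = assign_reviewers(["clip_001", "clip_002"], workers)

print(workers)  # → ['ana', 'cat', 'dan']
```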

Results obtained through crowd processing

All 16,892 video files were processed by the crowd in 4 hours and 32 minutes (the time between the first and the last response recorded in the system).

The job was assigned to 2,435 people, of whom 2,316 gave at least one answer (processed at least one video). On average, each person processed 7 videos.

The data obtained (50,000+ responses, because each video was analyzed by at least 3 people) was analyzed by our statistics specialist.
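The article does not describe how the redundant responses were combined. One common approach is a majority vote per video, sketched below. The function names and the response format (video ID plus a yes/no answer to "is there a transaction?") are assumptions. With 3 reviews per video, 16,892 videos give at least 16,892 × 3 = 50,676 responses, which matches the "50,000+" figure.

```python
from collections import Counter


def majority_label(answers):
    """Return (label, agreement) for one video; label is None without a strict majority."""
    counts = Counter(answers)
    label, votes = counts.most_common(1)[0]
    agreement = votes / len(answers)
    if votes * 2 <= len(answers):
        # No strict majority, e.g. a 2-2 split when a video received 4 answers.
        return None, agreement
    return label, agreement


def aggregate(responses):
    """Group (video_id, answer) pairs by video and apply the majority vote."""
    by_video = {}
    for video_id, answer in responses:
        by_video.setdefault(video_id, []).append(answer)
    return {video_id: majority_label(answers) for video_id, answers in by_video.items()}


# Made-up responses for illustration: True means "a transaction is visible".
responses = [
    ("clip_001", True), ("clip_001", True), ("clip_001", False),
    ("clip_002", False), ("clip_002", False), ("clip_002", False),
]
results = aggregate(responses)
transactions = [video_id for video_id, (label, _) in results.items() if label is True]

print(transactions)  # → ['clip_001']
```

Videos with low agreement or no strict majority could be sent back for additional review instead of being decided automatically.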

After the information was processed, a list was generated of the video clips in which transactions take place (a customer hands money to the seller), along with the exact date and time (to the minute and second) of each.

The cost of the project was $300 (for the actual processing of the clips by the crowd) plus 200 euro as a one-time setup fee, given the initial technical challenges: automatic downloading from the client's DVR, dedicated server hosting, etc.

Crowdsourcing is the ideal solution when you have a project involving a large volume of tasks that cannot be automated.