Open Source Software

Open Source Big Data Tools and Software

Avro
Avro is a data serialization system based on JSON-defined schemas, with APIs available for C, C++, C# and Java. It is OS Independent.
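
To show what a JSON-defined schema looks like in practice, the sketch below declares a record schema in Avro's JSON style and checks records against it using only the Python standard library. The `User` record, its fields and the `conforms` helper are hypothetical, and the check covers only a few primitive types. It is an illustration of the idea, not the Avro library's API.

```python
import json

# A record schema written in Avro's JSON schema style.
# The "User" record and its fields are made up for illustration.
SCHEMA_JSON = """
{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "age", "type": "int"},
    {"name": "active", "type": "boolean"}
  ]
}
"""

# Map a few Avro primitive type names to Python types (a deliberately partial list).
PRIMITIVES = {"string": str, "int": int, "boolean": bool, "double": float}


def conforms(record: dict, schema: dict) -> bool:
    """Return True if the record has exactly the schema's fields with matching types."""
    fields = schema["fields"]
    if set(record) != {f["name"] for f in fields}:
        return False
    for f in fields:
        value = record[f["name"]]
        expected = PRIMITIVES[f["type"]]
        # bool is a subclass of int in Python, so reject booleans for "int" fields.
        if expected is int and isinstance(value, bool):
            return False
        if not isinstance(value, expected):
            return False
    return True


schema = json.loads(SCHEMA_JSON)
good = {"name": "Ada", "age": 36, "active": True}
bad = {"name": "Ada", "age": "36", "active": True}  # age is a string, not an int
print(conforms(good, schema), conforms(bad, schema))
```

Because the schema is plain JSON, any language with a JSON parser can read it, which is one reason the same schema can be shared across the language APIs listed above.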

BIRT
Co-founded by Actuate, BIRT adds reporting functionality to Java applications. It is OS Independent.

Blazegraph
Blazegraph is a highly scalable, high-performance database available under both open-source and commercial licenses. It is OS Independent.

Cassandra
Originally developed at Facebook, this NoSQL database is now managed by the Apache Software Foundation. It is used by Netflix, Urban Airship, Twitter, Reddit, Constant Contact, Digg and Cisco. It is OS Independent.

Chukwa
Built on top of HDFS and MapReduce, Chukwa collects data from large distributed systems and provides tools for displaying and analyzing the collected data. It works on Linux and OS X.

CouchDB
CouchDB stores web data in JSON documents that can be queried using JavaScript. It also offers distributed scaling and fault-tolerant storage. It works on Windows, Android, Linux and OS X.

DataMelt
DataMelt supports data mining, statistical analysis, mathematical computation and data visualization. It works with Java and related programming languages including Jython, Groovy, JRuby and BeanShell. It is OS Independent.

ECL
ECL (Enterprise Control Language) is the language for working with HPCC. A full set of tools, including an IDE and a debugger, is included in HPCC, and documentation is available on the HPCC website. It runs on Linux.

FlockDB
FlockDB stores Twitter’s social graph (i.e., who is following or blocking whom), with horizontal scaling and fast reads and writes. It is OS Independent.

Flume
An Apache project, Flume collects, aggregates and transfers log data from applications to HDFS. It is a robust, fault-tolerant, Java-based project. It runs on Windows, Linux and OS X.

Gluster
Gluster provides unified file and object storage for large datasets. It can scale to 72 brontobytes and can be used to extend the capabilities of Hadoop. It runs on Linux.

GridGain
GridGain offers in-memory processing for fast analysis of real-time data. It works on Windows, Linux and OS X.

Hadoop
The terms “Hadoop” and “big data” are often used synonymously. The Apache Foundation sponsors multiple projects that extend the capabilities of Hadoop, and multiple vendors provide supported versions of Hadoop and related technologies. It works on Windows, Linux and OS X.

Hadoop Distributed File System
HDFS is the primary storage system for Hadoop. It replicates data across multiple nodes in a cluster to deliver reliable, fast performance. It works on Windows, Linux and OS X.
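
The replication idea can be sketched in a few lines of Python: split the data into blocks, place each block on several distinct nodes, and read from any replica that is still reachable. The node names, the tiny block size, the replication factor of 3 and the round-robin placement are all assumptions made for illustration. This is a toy model, not HDFS code or its actual placement policy.

```python
def place_blocks(data: bytes, nodes: list, block_size: int = 4, replication: int = 3) -> dict:
    """Split data into fixed-size blocks and assign each block to `replication` distinct nodes."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    placement = {}
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    for index, block in enumerate(blocks):
        # Round-robin choice of nodes; a simplification chosen for this sketch.
        replicas = [nodes[(index + r) % len(nodes)] for r in range(replication)]
        placement[index] = {"block": block, "nodes": replicas}
    return placement


def read_block(placement: dict, index: int, failed: set) -> bytes:
    """Read a block from the first replica that is not on a failed node."""
    entry = placement[index]
    for node in entry["nodes"]:
        if node not in failed:
            return entry["block"]
    raise IOError("all replicas of block %d are unavailable" % index)


nodes = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical cluster
placement = place_blocks(b"hello big data", nodes)

# Block 0 is still readable when two of its three replica nodes are down.
print(read_block(placement, 0, failed={"node-a", "node-b"}))
```

Keeping several copies of each block is what lets a read succeed after node failures, at the cost of storing the data multiple times.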

HBase
HBase is an Apache project that provides a non-relational data store for Hadoop. Its features include linear and modular scalability, automatic failover support and more. It is OS Independent.

Hibari
Hibari is a big data store that offers consistency, availability and fast performance, and it is used by many telecom companies. It is OS Independent.

Hive
Hive is Hadoop’s data warehouse, offering data summarization and analysis of big data. It uses a SQL-like language, HiveQL, and is OS Independent.

HPCC Systems
HPCC Systems is a high-performance computing cluster platform that claims better performance than Hadoop. It runs on Linux and is available in a free community version and paid versions.

Hypertable
Hypertable promises efficiency and fast performance, resulting in cost savings. It is open source, with paid support available. It is available on Linux and OS X.

Infinispan
Infinispan is a Java-based, highly scalable data grid platform that is designed for multi-core architectures and offers distributed cache capabilities. It is OS Independent.

InfoBright Community Edition
It is a scalable data warehouse that supports up to 50TB of data and compression of up to 40:1 for improved performance. It works on Windows and Linux.

Jaspersoft
Jaspersoft describes itself as the most flexible, cost-effective and widely deployed BI software in the world. It has both commercial and open-source versions, includes big data reporting solutions and is OS Independent.

Jedox
Jedox includes Palo Web, OLAP Server, Palo for Excel and Palo ETL Server, available as both open-source and commercial tools. It is OS Independent.

KEEL
KEEL helps users evaluate algorithms for data mining problems such as classification, regression, pattern mining and clustering. It includes a large collection of existing algorithms against which new algorithms can be compared. It is OS Independent.

KNIME
KNIME provides user-friendly data processing, integration and analysis. Gartner named KNIME a “Cool Vendor” in 2010 for analytics, BI and performance. It runs on Windows, Linux and OS X.

Lucene
Lucene offers very fast indexing and searching for very large datasets. It can index over 95GB/hour on modern hardware. It is OS Independent.
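
Search libraries of this kind rely on an inverted index: a map from each term to the documents that contain it, so a query only touches the entries for its own terms. The sketch below is a toy version in plain Python. The sample documents and the `build_index` and `search` helpers are hypothetical and are not Lucene's API.

```python
from collections import defaultdict


def build_index(docs: dict) -> dict:
    """Map each lowercase term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index: dict, query: str) -> set:
    """Return ids of documents that contain every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


docs = {
    1: "open source search library",
    2: "distributed search server",
    3: "open source database",
}
index = build_index(docs)
print(sorted(search(index, "open source")))
print(sorted(search(index, "search")))
```

A real engine adds tokenization, ranking and on-disk index segments, but the lookup structure is the same in spirit.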

Mahout
Mahout offers algorithms for classification, clustering and collaborative filtering on Hadoop. The project’s goal is to build scalable machine learning libraries. It is OS Independent.

MapReduce
MapReduce is a programming model and framework for writing applications that rapidly process big data in parallel across clusters of compute nodes. It is used by Hadoop and other data processing applications and is OS Independent.
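
The programming model is easiest to see with the classic word count example. The sketch below runs the three stages (map, shuffle, reduce) sequentially in one Python process. The function names are hypothetical and this is not Hadoop's API. On a real cluster the map and reduce calls would run in parallel on different nodes.

```python
from collections import defaultdict


def map_phase(line: str):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce over the input lines and return word totals."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


counts = word_count(["big data tools", "open source big data"])
print(counts)
```

Because each `map_phase` call sees only one line and each `reduce_phase` call sees only one key, the work can be split across machines without the functions needing to know about each other.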

Neo4j
Neo4j is a graph database that claims performance improvements of 1000x or more over relational databases. More advanced versions are also available. It works on Windows and Linux.

Oozie
Oozie is an Apache project built to coordinate the scheduling of Hadoop jobs. It can trigger jobs at a scheduled time or based on data availability. It works on Linux and OS X.

Orange
Orange provides multiple visualizations and a toolbox of 100+ widgets. It works on Windows, Linux and OS X.

OrientDB
OrientDB can store 150,000 documents per second and load graphs in just a few milliseconds. It supports ACID transactions and fast indexes.

Pentaho
Used by 10,000 companies, Pentaho offers big data analytics tools with data mining, dashboard and reporting capabilities. It runs on Windows, Linux and OS X.

Pig
Pig is an Apache data analysis tool that uses a textual language called Pig Latin, which is compiled into sequences of MapReduce programs. It makes parallel data analysis programs easier to write, understand and maintain. It is OS Independent.

R
R is a programming language and environment for statistical computing and graphics, similar to the S language developed at Bell Laboratories. The environment includes tools that make it easier to manipulate data, perform calculations and create graphs and charts. It runs on Windows, Linux and OS X.

RapidMiner
RapidMiner is a leading open-source system for data and text mining. It is available in an open-source version, with paid support also offered, and is OS Independent.

Rattle
Rattle makes it easier for non-programmers to use the R language by providing a graphical interface for data mining. It can build models, score datasets and draw graphs. It works on Windows, Linux and OS X.

Redis
Redis is an in-memory key-value store that can be saved to disk for persistence. It supports many programming languages and runs on Linux.
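
The combination of in-memory access and on-disk persistence can be sketched as a dictionary that writes a snapshot to a file and reloads it on startup. The `TinyStore` class, its methods and the JSON snapshot format are hypothetical and chosen only to illustrate the idea. This is not how Redis is implemented, nor its client API.

```python
import json
import os
import tempfile


class TinyStore:
    """A toy in-memory key-value store that can snapshot itself to a JSON file."""

    def __init__(self, path: str):
        self.path = path
        self.data = {}
        # Reload the previous snapshot, if one exists.
        if os.path.exists(path):
            with open(path, "r", encoding="utf-8") as f:
                self.data = json.load(f)

    def set(self, key: str, value: str) -> None:
        self.data[key] = value  # writes go to memory only

    def get(self, key: str, default=None):
        return self.data.get(key, default)  # reads are served from memory

    def save(self) -> None:
        """Write the snapshot to a temporary file, then swap it into place."""
        tmp = self.path + ".tmp"
        with open(tmp, "w", encoding="utf-8") as f:
            json.dump(self.data, f)
        os.replace(tmp, self.path)


path = os.path.join(tempfile.mkdtemp(), "snapshot.json")
store = TinyStore(path)
store.set("visits", "42")
store.save()

# A new instance simulates a restart: the value survives because it was saved.
restarted = TinyStore(path)
print(restarted.get("visits"))
```

Anything set after the last `save()` would be lost on a crash, which is the basic trade-off between speed and durability that snapshot-style persistence makes.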

Riak
Riak is a powerful open-source distributed database. Its users include Comcast, Voxer, Yammer, Joyent, Boeing, Kiip.me, SEOMoz, Formspring, DotCloud and the Danish Government. It works on Linux and OS X.

Solr
Solr is an advanced enterprise search tool based on Lucene. It powers search on large websites, including Netflix, CNET, AOL and Zappos. It is OS Independent.

SpagoBI
SpagoBI is a complete open-source business intelligence solution, with commercial services, support and training available. It is OS Independent.

SPMF
SPMF is a Java-based data mining framework that focuses on sequential pattern mining and also has tools for association rule mining, itemset mining and sequential rule mining. It has 46 different algorithms and is OS Independent.

Sqoop
Sqoop transfers data between RDBMSes, Hadoop and data warehouses. It is now a top-level Apache project and is OS Independent.

Storm
Open-sourced by Twitter, Storm provides distributed real-time computation capabilities and is called the “Hadoop of real-time.” It is scalable, fault-tolerant and robust, works with any programming language and runs on Linux.

Terracotta
Terracotta’s “Big Memory” platform allows enterprise applications to store and manage big data in server memory for fast performance. The company offers both open-source and commercial versions of its platform. It is OS Independent.

Terrastore
Terrastore offers scalability, elasticity and consistency. It supports range queries, custom data partitioning, push-down predicates, server-side update functions, event processing and map/reduce querying. It is OS Independent.

Weka
Weka offers data mining algorithms that can be applied directly to data or used from other Java applications. It is part of a larger machine learning project and is sponsored by Pentaho. It runs on Windows, Linux and OS X.

ZooKeeper
ZooKeeper is a centralized service for maintaining configuration information and naming, and for providing distributed synchronization and group services. APIs are available for Java, C, Python, REST and Perl. It works on Linux, Windows (development only) and OS X (development only).