I have been working on this (mostly) annotated collection of tools and articles that I believe would be of help to both the data dabbler and professional. If you are a data scientist, data analyst or data dummy, chances are there is something in here for you.
I included a list of tools, such as programming languages and web-based utilities, data mining resources, some prominent organizations in the field, repositories where you can play with data, events you may want to attend and important articles you should take a look at.
The second segment (BONUS!) of the list includes a number of art and design resources that infographic designers might like, including color palette generators and image searches. There are also some invisible web resources (if you’re looking for something data-related on Google and not finding it) and metadata resources so you can appropriately curate your data.
This is in no way a complete list so please contact me here with any suggestions!
- Google Refine – A power tool for working with messy data (formerly Freebase Gridworks)
- The Overview Project – Overview is an open-source tool to help journalists find stories in large amounts of data, by cleaning, visualizing and interactively exploring large document and data sets. Whether from government transparency initiatives, leaks or Freedom of Information requests, journalists are drowning in more documents than they can ever hope to read.
- Refine, reuse and request data | ScraperWiki – ScraperWiki is an online tool to make acquiring useful data simpler and more collaborative. Anyone can write a screen scraper using the online editor. In the free version, the code and data are shared with the world. Because it’s a wiki, other programmers can contribute to and improve the code.
- Data Curation Profiles – This website is an environment where academic librarians of all kinds, special librarians at research facilities, archivists involved in the preservation of digital data, and those who support digital repositories can find help, support and camaraderie in exploring avenues to learn more about working with research data and the use of the Data Curation Profiles Tool.
- Google Chart Tools – Google Chart Tools provide a perfect way to visualize data on your website. From simple line charts to complex hierarchical tree maps, the chart gallery provides a large number of well-designed chart types. Populating your data is easy using the provided client- and server-side tools.
- 22 free tools for data visualization and analysis
- The R Journal – The R Journal is the refereed journal of the R project for statistical computing. It features short to medium length articles covering topics that might be of interest to users or developers of R.
- CS 229: Machine Learning – A widely referenced course by Professor Andrew Ng, CS 229: Machine Learning provides a broad introduction to machine learning and statistical pattern recognition. Topics include supervised learning, unsupervised learning, learning theory, reinforcement learning and adaptive control. Recent applications of machine learning, such as to robotic control, data mining, autonomous navigation, bioinformatics, speech recognition, and text and web data processing are also discussed.
- Google Research Publication: BigTable – Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers. Many projects at Google store data in Bigtable, including web indexing, Google Earth, and Google Finance. These applications place very different demands on Bigtable, both in terms of data size (from URLs to web pages to satellite imagery) and latency requirements (from backend bulk processing to real-time data serving). Despite these varied demands, Bigtable has successfully provided a flexible, high-performance solution for all of these Google products. In this paper we describe the simple data model provided by Bigtable, which gives clients dynamic control over data layout and format, and we describe the design and implementation of Bigtable.
- Scientific Data Management – An introduction.
- Natural Language Toolkit – Open source Python modules, linguistic data and documentation for research and development in natural language processing and text analytics, with distributions for Windows, Mac OSX and Linux.
- Beautiful Soup – Beautiful Soup is a Python HTML/XML parser designed for quick turnaround projects like screen-scraping.
- Mondrian: Pentaho Analysis – Pentaho’s open-source OLAP analysis server, written in Java. It enables interactive analysis of very large datasets stored in SQL databases without writing SQL.
- The Comprehensive R Archive Network
R is ‘GNU S’, a freely available language and environment for statistical computing and graphics which provides a wide variety of statistical and graphical techniques: linear and nonlinear modelling, statistical tests, time series analysis, classification, clustering, etc. Please consult the R project homepage for further information.
CRAN is a network of ftp and web servers around the world that store identical, up-to-date versions of code and documentation for R. Please use the CRAN mirror nearest to you to minimize network load.
- DataStax – Software, support, and training for Apache Cassandra.
- Machine Learning Demos
- Visual.ly – Infographics & Visualizations. Create, Share, Explore
- Google Fusion Tables
Google Fusion Tables is a modern data management and publishing web application that makes it easy to host, manage, collaborate on, visualize, and publish data tables online.
- Tableau Software
Fast Analytics and Rapid-fire Business Intelligence from Tableau Software.
- WaveMaker is a rapid application development environment for building, maintaining and modernizing business-critical Web 2.0 applications.
- Visualization: Annotated Time Line – Google Chart Tools – Google Code
An interactive time series line chart with optional annotations. The chart is rendered within the browser using Flash.
- Visualization: Motion Chart – Google Chart Tools – Google Code
A dynamic chart to explore several indicators over time. The chart is rendered within the browser using Flash.
Create gorgeous infographics about your iPhone photos, with Photostats.
- Ionz – Ionz will help you craft an infographic about yourself.
- chart builder
Powerful tools for creating a variety of charts for online display.
Online diagramming and design.
- Pixlr Editor – A powerful online photo editor.
- Google Public Data Explorer
The Google Public Data Explorer makes large datasets easy to explore, visualize and communicate. As the charts and maps animate over time, the changes in the world become easier to understand. You don’t have to be a data expert to navigate between different views, make your own comparisons, and share your findings.
Fathom Information Design helps clients understand and express complex data through information graphics, interactive tools, and software for installations, the web, and mobile devices. Led by Ben Fry. Enough said!
- healthymagination | GE Data Visualization
Visualizations that advance the conversation about issues that shape our lives, and so we encourage visitors to download, post and share these visualizations.
ggplot2 is a plotting system for R, based on the grammar of graphics, which tries to take the good parts of base and lattice graphics and none of the bad parts. It takes care of many of the fiddly details that make plotting a hassle (like drawing legends) as well as providing a powerful model of graphics that makes it easy to produce complex multi-layered graphics.
- MATLAB – The Language Of Technical Computing
MATLAB® is a high-level language and interactive environment that enables you to perform computationally intensive tasks faster than with traditional programming languages such as C, C++, and Fortran.
- OpenGL – The Industry Standard for High Performance Graphics
OpenGL.org is a vendor-independent and organization-independent web site that acts as one-stop hub for developers and consumers for all OpenGL news and development resources. It has a very large and continually expanding developer and end-user community that is very active and vested in the continued growth of OpenGL.
- Google Correlate
Google Correlate finds search patterns which correspond with real-world trends.
- Revolution Analytics – Commercial Software & Support for the R Statistics Language
Revolution Analytics delivers advanced analytics software at half the cost of existing solutions. By building on open source R—the world’s most powerful statistics software—with innovations in big data analysis, integration and user experience, Revolution Analytics meets the demands and requirements of modern data-driven businesses.
- 22 Useful Online Chart & Graph Generators
- The Best Tools for Visualization – Visualization is a technique to graphically represent sets of data. When data is large or abstract, visualization can help make the data easier to read or understand. There are visualization tools for search, music, networks, online communities, and almost anything else you can think of. Whether you want a desktop application or a web-based tool, there are many specific tools available on the web that let you visualize all kinds of data.
- Visual Understanding Environment
The Visual Understanding Environment (VUE) is an Open Source project based at Tufts University. The VUE project is focused on creating flexible tools for managing and integrating digital resources in support of teaching, learning and research. VUE provides a flexible visual environment for structuring, presenting, and sharing digital information.
- Bime – Cloud Business Intelligence | Analytics & Dashboards
Bime is a revolutionary approach to data analysis and dashboarding. It allows you to analyze your data through interactive data visualizations and create stunning dashboards from the Web.
- Data Science Toolkit
A collection of data tools and open APIs curated by our own Pete Warden. You can use it to extract text from a document, learn the political leanings of a particular neighborhood, find all the names of people mentioned in a text and more.
BuzzData lets you share your data in a smarter, easier way. Instead of juggling versions and overwriting files, use BuzzData and enjoy a social network designed for data.
- SAP – SAP Crystal Solutions: Simple, Affordable, and Open BI Tools for Everyday Use
- Project Voldemort
- ggplot. had.co.nz
- Weka – Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes. Weka is open source software issued under the GNU General Public License.
- PSPP – PSPP is a program for statistical analysis of sampled data. It is a Free replacement for the proprietary program SPSS, and appears very similar to it with a few exceptions. The most important of these exceptions are that there are no “time bombs”; your copy of PSPP will not “expire” or deliberately stop working in the future. Neither are there any artificial limits on the number of cases or variables which you can use. There are no additional packages to purchase in order to get “advanced” functions; all functionality that PSPP currently supports is in the core package. PSPP can perform descriptive statistics, T-tests, linear regression and non-parametric tests. Its backend is designed to perform its analyses as fast as possible, regardless of the size of the input data. You can use PSPP with its graphical interface or the more traditional syntax commands.
- Rapid-I – Rapid-I provides software, solutions, and services in the fields of predictive analytics, data mining, and text mining. The company concentrates on automatic intelligent analyses on a large-scale base, i.e. for large amounts of structured data like database systems and unstructured data like texts. The open-source data mining specialist Rapid-I enables other companies to use leading-edge technologies for data mining and business intelligence. The discovery and leverage of unused business intelligence from existing data enables better informed decisions and allows for process optimization. The main product of Rapid-I, the data analysis solution RapidMiner, is the world-leading open-source system for knowledge discovery and data mining. It is available as a stand-alone application for data analysis and as a data mining engine which can be integrated into your own products. By now, thousands of applications of RapidMiner in more than 30 countries give their users a competitive edge. Among the users are well-known companies such as Ford, Honda, Nokia, Miele, Philips, IBM, HP, Cisco, Merrill Lynch, BNP Paribas, Bank of America, mobilkom austria, Akzo Nobel, Aureus Pharma, PharmaDM, Cyprotex, Celera, Revere, LexisNexis, Mitre and many medium-sized businesses benefitting from the open-source business model of Rapid-I.
- R Project – R is a language and environment for statistical computing and graphics. It is a GNU project which is similar to the S language and environment which was developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues. R can be considered as a different implementation of S. There are some important differences, but much code written for S runs unaltered under R. R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, …) and graphical techniques, and is highly extensible. The S language is often the vehicle of choice for research in statistical methodology, and R provides an Open Source route to participation in that activity. One of R’s strengths is the ease with which well-designed publication-quality plots can be produced, including mathematical symbols and formulae where needed. Great care has been taken over the defaults for the minor design choices in graphics, but the user retains full control. R is available as Free Software under the terms of the Free Software Foundation’s GNU General Public License in source code form. It compiles and runs on a wide variety of UNIX platforms and similar systems (including FreeBSD and Linux), Windows and MacOS.
- SDM group at LBNL
- Open Archives Initiative
- Code for America | A New Kind of Public Service
- The # DataViz Daily
- Institute for Advanced Analytics | North Carolina State University | Professor Michael Rappa · MSA Curriculum
- BuzzData | Blog, 25 great links for data-lovin’ journalists
- MetaOptimize – Home – Machine learning, natural language processing, predictive analytics, business intelligence, artificial intelligence, text analysis, information retrieval, search, data mining, statistical modeling, and data visualization
- Measuring Measures – Measuring Measures
- Repositories | DataCite
- Data | The World Bank
- Infochimps Data Marketplace + Commons: Download Sell or Share Databases, statistics, datasets for free | Infochimps
- Factual Home – Factual
- Flowing Media: Your Data Has Something To Say
- Public Data Explorer
- 25+ more ways to bring data into R
- Welcome | Visweek 2011
- O’Reilly Strata: O’Reilly Conferences
- IBM Information On Demand 2011 and Business Analytics Forum
- Data Scientist Summit 2011
- IBM Virtual Performance 2011
- Wolfram Data Summit 2011—Conference on Data Repositories and Ideas
- Big Data Analytics: Mobile, Social and Web
- Data Science: a literature review | (R news & tutorials)
- What is “Data Science” Anyway?
- Hal Varian on how the Web challenges managers – McKinsey Quarterly – Strategy – Innovation
- The Three Sexy Skills of Data Geeks « Dataspora
- Rise of the Data Scientist
- dataists » A Taxonomy of Data Science
- The Data Science Venn Diagram « Zero Intelligence Agents
- Revolutions: Growth in data-related jobs
- Building data startups: Fast, big, and focused – O’Reilly Radar
- Periodic Table of Typefaces
- Color Scheme Designer 3
- Color Palette Generator – Generate A Color Palette For Any Image
- Colorbrewer: Color Advice for Maps
- American Memory from the Library of Congress – The home page for the American Memory Historical Collections from the Library of Congress. American Memory provides free access to historical images, maps, sound recordings, and motion pictures that document the American experience. American Memory offers primary source materials that chronicle historical events, people, places, and ideas that continue to shape America.
- Galaxy of Images | Smithsonian Institution Libraries
- Flickr Search
- 50 Websites For Free Vector Images Download
- Design weblog for designers, bloggers and tech users. Covering useful tools, tutorials, tips and inspirational photos.
- Google Images – The most comprehensive image search on the web.
- Trade Literature – a set on Flickr
- Compfight / A Flickr Search Tool
- morgueFile free photos for creatives by creatives
- stock.xchng – the leading free stock photography site
- The Ultimate Collection Of Free Vector Packs – Smashing Magazine
- How to Create Animated GIFs Using Photoshop CS3 – wikiHow
- IAN Symbol Libraries (Free Vector Symbols and Icons) – Integration and Application Network
- best icons
- 10 Search Engines to Explore the Invisible Web
Like the header says…
- Scirus – for scientific information
The most comprehensive scientific research tool on the web. With over 410 million scientific items indexed at last count, it allows researchers to search for not only journal content but also scientists’ homepages, courseware, pre-print server material, patents and institutional repository and website information.
- TechXtra: Engineering, Mathematics, and Computing
TechXtra is a free service which can help you find articles, books, the best websites, the latest industry news, job announcements, technical reports, technical data, full text eprints, the latest research, thesis & dissertations, teaching and learning resources and more, in engineering, mathematics and computing.
- Welcome to INFOMINE: Scholarly Internet Resource Collections
INFOMINE is a virtual library of Internet resources relevant to faculty, students, and research staff at the university level. It contains useful Internet resources such as databases, electronic journals, electronic books, bulletin boards, mailing lists, online library card catalogs, articles, directories of researchers, and many other types of information.
- The WWW Virtual Library
The WWW Virtual Library (VL) is the oldest catalogue of the Web, started by Tim Berners-Lee, the creator of HTML and of the Web itself, in 1991 at CERN in Geneva. Unlike commercial catalogues, it is run by a loose confederation of volunteers, who compile pages of key links for particular areas in which they are expert; even though it isn’t the biggest index of the Web, the VL pages are widely recognised as being amongst the highest-quality guides to particular sections of the Web.
- Intute – Intute is a free online service that helps you to find web resources for your studies and research. With millions of resources available on the Internet, it can be difficult to find useful material. We have reviewed and evaluated thousands of resources to help you choose key websites in your subject. The Virtual Training Suite can also help you develop your Internet research skills through tutorials written by lecturers and librarians from universities across the UK.
- CompletePlanet – Discover over 70,000 databases and specialty search engines
There are hundreds of thousands of databases that contain Deep Web content. CompletePlanet is the front door to these Deep Web databases on the Web and to the thousands of regular search engines — it is the first step in trying to find highly topical information. By tracing through CompletePlanet’s subject structure or searching Deep Web sites, you can go to various topic areas, such as energy or agriculture or food or medicine, and find rich content sites not accessible using conventional search engines. BrightPlanet initially developed the CompletePlanet compilation to identify and tap into many hundreds and thousands of search sources simultaneously to automatically deliver high-quality content to its corporate and enterprise customers. It then decided to make CompletePlanet available as a public service to the Internet search public.
- Infoplease: Encyclopedia, Almanac, Atlas, Biographies, Dictionary, Thesaurus.
Information Please has been providing authoritative answers to all kinds of factual questions since 1938—first as a popular radio quiz show, then starting in 1947 as an annual almanac, and since 1998 on the Internet at www.infoplease.com. Many things have changed since 1938, but not our dedication to providing reliable information, in a way that engages and entertains.
- DeepPeep: discover the hidden web – DeepPeep is a search engine specialized in Web forms. The current beta version tracks 45,000 forms across 7 domains. DeepPeep helps you discover the entry points to content in Deep Web (aka Hidden Web) sites, including online databases and Web services.
Advanced search allows you to perform more specific queries. Besides specifying keywords, you can also search for specific form element labels, i.e., the description of the form attributes.
- IncyWincy: The Invisible Web Search Engine – IncyWincy is a showcase of Net Research Server (NRS) 5.0, a software product that provides a complete search portal solution, developed by LoopIP LLC.
LoopIP licenses the NRS engine and provides consulting expertise in building search solutions.
- Description Schema: MODS (Library of Congress) and Outline of elements and attributes in MODS version 3.4: MetadataObject
This document contains a listing of elements and their related attributes in MODS Version 3.4 with values or value sources where applicable. It is an “outline” of the schema. Items highlighted in red indicate changes made to MODS in Version 3.4. All top-level elements and all attributes are optional, but you must have at least one element. Subelements are optional, although in some cases you may not have empty containers. Attributes are not in a mandated sequence and not repeatable (per XML rules). “Ordered” below means the subelements must occur in the order given. Elements are repeatable unless otherwise noted. “Authority” attributes are either followed by codes for authority lists (e.g., iso639-2b) or “see” references that link to documents that contain codes for identifying authority lists. For additional information about any MODS elements (version 3.4 elements will be added soon), please see the MODS User Guidelines.
- wiki.dbpedia.org : About – DBpedia is a community effort to extract structured information from Wikipedia and to make this information available on the Web. DBpedia allows you to ask sophisticated queries against Wikipedia, and to link other data sets on the Web to Wikipedia data. We hope this will make it easier for the amazing amount of information in Wikipedia to be used in new and interesting ways, and that it might inspire new mechanisms for navigating, linking and improving the encyclopaedia itself.
- Semantic Web – W3C – In addition to the classic “Web of documents,” W3C is helping to build a technology stack to support a “Web of data,” the sort of data you find in databases. The ultimate goal of the Web of data is to enable computers to do more useful work and to develop systems that can support trusted interactions over the network. The term “Semantic Web” refers to W3C’s vision of the Web of linked data. Semantic Web technologies enable people to create data stores on the Web, build vocabularies, and write rules for handling data. Linked data are empowered by technologies such as RDF, SPARQL, OWL, and SKOS.
- RDA: Resource Description & Access | www.rdatoolkit.org – Designed for the digital world and an expanding universe of metadata users, RDA: Resource Description and Access is the new, unified cataloging standard. The online RDA Toolkit subscription is the most effective way to interact with the new standard. More on RDA.
- Cataloging Cultural Objects – Cataloging Cultural Objects: A Guide to Describing Cultural Works and Their Images (CCO) is a manual for describing, documenting, and cataloging cultural works and their visual surrogates. The primary focus of CCO is art and architecture, including but not limited to paintings, sculpture, prints, manuscripts, photographs, built works, installations, and other visual media. CCO also covers many other types of cultural works, including archaeological sites, artifacts, and functional objects from the realm of material culture.
- Library of Congress Authorities (Search for Name, Subject, Title and Name/Title) – Using Library of Congress Authorities, you can browse and view authority headings for Subject, Name, Title and Name/Title combinations; and download authority records in MARC format for use in a local library system. This service is offered free of charge.
- Search Tools and Databases (Getty Research Institute) – Use these search tools to access library materials, specialized databases, and other digital resources.
- Art & Architecture Thesaurus (Getty Research Institute) – Learn about the purpose, scope and structure of the AAT. The AAT is an evolving vocabulary, growing and changing thanks to contributions from Getty projects and other institutions. Find out more about the AAT’s contributors.
- Getty Thesaurus of Geographic Names (Getty Research Institute) – Learn about the purpose, scope and structure of the TGN. The TGN is an evolving vocabulary, growing and changing thanks to contributions from Getty projects and other institutions. Find out more about the TGN’s contributors.
- DCMI Metadata Terms
- The Digital Object Identifier System
- The Federal Geographic Data Committee — Federal Geographic Data Committee
- NSA Extends Label-based Security to Big Data Stores (pcworld.com)
- Cassandra 1.0, the cloud, and the future of big data (rackspace.com)