Introduction to Data Analysis
Before starting any machine learning project, it is highly recommended to sit down and analyze your data. This Introduction to Data Analysis will give you an idea of how to approach data-related problems in your projects.
What is Data Analysis?
Once the data has been collected, coded and organized, we can begin the analysis. Data analysis is commonly defined as “systematic processing of information (data)”. The purpose of data analysis is to categorize the collected data so that we can describe and interpret what we have found.
Here are three relevant questions one should ask when analyzing data:
- What characterizes the data?
- Are there patterns and / or relationships?
- Can the relationship between variables be due to intermediate causal relationships?
The goal of answering the above questions is to simplify and manipulate the data to reveal the underlying patterns and relationships. This will give a broader understanding and help you identify the best way to solve your problem(s) – which might be with machine learning.
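As a rough illustration of the first two questions, here is a minimal Python sketch that summarizes a single variable and measures the linear relationship between two variables. The dataset (hours studied versus exam score) and the helper names describe and pearson are made up for this example; only the standard library is used.

```python
import statistics

# Hypothetical toy dataset (made-up values): hours studied vs. exam score.
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
scores = [52.0, 58.0, 65.0, 70.0, 78.0]


def describe(values):
    """Return a few summary statistics that characterize one variable."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
    }


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


summary = describe(hours)   # What characterizes the data?
r = pearson(hours, scores)  # Are there patterns and / or relationships?

print(summary)
print(f"correlation: {r:.3f}")
```

A correlation close to 1 or -1 suggests a strong linear relationship, but it does not answer the third question: a third, unmeasured variable may still be driving both.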
Limiting Factors in Data Analysis
The approach to your analysis varies from project to project, but for the most part, general domain knowledge is needed to determine the best starting point. For example, let’s say you are trying to make sense of a bunch of high voltage data. You may not need a degree in electrical engineering, but a basic understanding of Ohm’s law will probably be helpful. In addition, factors such as time, money, extended knowledge and resources will often impact the analysis.
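To make the Ohm’s law example concrete, here is a small Python sketch of how a bit of domain knowledge can be turned into a sanity check on the data. The resistance, the tolerance and the readings are all made-up values, and is_consistent is a hypothetical helper written for this illustration.

```python
# Hypothetical setup: a circuit with a nominal 100-ohm resistor.
RESISTANCE_OHMS = 100.0
TOLERANCE = 0.05  # assumed acceptable relative deviation (5 %)

# Made-up measurements: (voltage in volts, current in amperes).
readings = [
    (10.0, 0.100),
    (20.0, 0.201),
    (30.0, 0.450),  # does not fit V = I * R for a 100-ohm resistor
    (40.0, 0.398),
]


def is_consistent(voltage, current, resistance=RESISTANCE_OHMS, tol=TOLERANCE):
    """Check whether a reading agrees with Ohm's law (V = I * R) within tol."""
    expected_voltage = current * resistance
    return abs(voltage - expected_voltage) <= tol * abs(voltage)


# Readings that violate Ohm's law are candidates for a closer look.
suspect = [reading for reading in readings if not is_consistent(*reading)]
print(suspect)
```

Without knowing Ohm’s law, the third reading would look like any other row in the dataset. With it, the reading stands out as a possible measurement error or a sign that the assumed resistance is wrong.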
After considering the above factors, the data collected will often dictate what kinds of analyses we can do. The collected data, the data collection method, the data characteristics and the data availability will be the major limiting factors on what kind of analysis can be done.
Summary and Future Reading
Now that you have a basic idea of what data analysis is, it is time to collect some data and get started! Collecting data can often be tedious and, frankly, not worth your time. Click this link for a guide on how to generate your own data for machine learning with Python.