Data Analyst:
Interpret data, analyze results using statistical techniques, and provide ongoing reports. Develop and implement databases, data collection systems, data analytics, and other strategies that optimize statistical efficiency and quality. Acquire data from primary or secondary data sources and maintain databases/data systems. Data analyst responsibilities include conducting a full lifecycle analysis covering requirements, activities, and design. Data analysts develop analysis and reporting capabilities; they also monitor performance and quality control plans to identify improvements.
Desired Skills:
- Technical expertise regarding data models, data mining and segmentation techniques.
- Proficient with data visualization techniques and tools such as Excel, SSRS, Power BI, Tableau or Marketo.
- Experience with database software such as Access, MS SQL.
- Strong quality assurance and business focus with keen attention to detail and good documentation habits.
- Conceptual, non-technical understanding/familiarity with big data pipelines and flows.
- Familiarity with statistical methods and experimentation (A/B testing) a plus.
- Experience with data profiling and reporting tools.
- Knowledge of reporting packages (e.g., Business Objects), programming languages and frameworks (e.g., XML, JavaScript, or ETL frameworks), and databases (SQL, SQLite, etc.).
- Strong analytical skills, with the ability to collect, organize, analyze, and disseminate big data with accuracy.
- Technical knowledge in database design, data models, data mining and segmentation techniques.
- Strong knowledge of statistical packages for analyzing large datasets, such as SAS, Excel, and SPSS (Statistical Package for the Social Sciences).
The responsibilities of a data analyst include:
- Providing support to all data analysis and coordinating with customers and staff.
- Resolving business associated issues for clients and performing an audit on data.
- Analyzing results and interpreting data using statistical techniques and providing ongoing reports.
- Prioritizing business and information needs and working closely with management.
- Identifying new processes and areas for improvement opportunities.
- Analyzing, identifying and interpreting trends or patterns in complex data sets.
- Acquiring data from primary or secondary data sources and maintaining databases/data systems.
- Filtering and “cleaning” data, and reviewing computer reports.
- Determining performance indicators to locate and correct code problems.
- Securing databases by developing access systems and determining user access levels.
Some of the most useful tools for data analysis:
- Tableau
- RapidMiner
- OpenRefine
- KNIME (Konstanz Information Miner)
- Google Search Operators
- Solver
- NodeXL
- Wolfram Alpha
- Google Fusion Tables
INTERVIEW QUESTIONS:
(General outlines of the answers; adapt them as required.)
Q. What are the criteria for a good data model?
The criteria for a good data model include:
- It should be easily consumed.
- It should scale well when the underlying data grows or changes.
- It should provide predictable performance.
- A good model should adapt to changes in requirements.
Q. How would you create a taxonomy to identify key customer trends in unstructured data?
First of all, I’d consult with the business owner from the outset to understand their objectives in categorizing this data. Then, I would use an iterative process, pulling new samples and modifying the model accordingly, evaluating for accuracy and inclusivity along the way. I’d follow the basic process of mapping the data, creating an algorithm and legend, mining the data, visualizing the data and so forth. However, I would tackle the project in segments in order to solicit feedback from the business stakeholder, and to continue enriching the model to ensure that I’m producing actionable results.
Q. How would you handle the QA process when you’re creating a predictive model to forecast customer churn?
I’d partition the data into three sets: training, testing, and validation. Because the model is tuned on the first two sets and may be biased toward them, I’d show the results from the validation set to the business owner, whose input helps gauge whether the model accurately predicts customer churn and provides actionable results.
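Below is a minimal sketch of such a three-way split using scikit-learn. The data is synthetic, and the 60/20/20 proportions, feature count, and churn rate are assumptions for illustration, not a prescription.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data: 1,000 customers, 4 numeric features, roughly 20% churners.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = (rng.random(1000) < 0.2).astype(int)

# First carve off 20% as the hold-out validation set, then split the remainder
# 75/25 into training and testing, giving a 60/20/20 partition overall.
X_rest, X_val, y_rest, y_val = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=42)

print(len(X_train), len(X_test), len(X_val))   # -> 600 200 200
```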
Q. How often should you retrain or refresh a model?
I’ll work with the business owner to establish an appropriate time period upfront. However, I would retrain a model immediately should the company expand into a new market, consummate an acquisition, or encounter emerging competition. Models must be retrained quickly to adjust for changing customer behavior patterns or shifting market conditions.
Q. What are the various steps in an analytics project?
- Problem definition
- Data exploration
- Data preparation
- Modeling
- Validation of data
- Implementation and tracking
Q. What is data cleansing?
Data cleansing (also called data cleaning) deals with detecting and correcting or removing errors and inconsistencies from data in order to improve data quality.
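As a quick illustration, the pandas sketch below cleans a toy table with typical problems: inconsistent spelling and casing, duplicate rows, and missing values. The column names and cleaning rules are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy customer table with common data-quality problems.
df = pd.DataFrame({
    "customer": ["Alice", "alice ", "Bob", "Carol", None],
    "country":  ["US", "US", "usa", "UK", "UK"],
    "spend":    [120.0, 120.0, np.nan, 80.0, 95.0],
})

df["customer"] = df["customer"].str.strip().str.title()    # normalize text
df["country"] = df["country"].str.upper().replace({"USA": "US"})
df = df.drop_duplicates()                                   # remove duplicate rows
df = df.dropna(subset=["customer"])                         # drop unusable records
df["spend"] = df["spend"].fillna(df["spend"].median())      # impute missing spend
print(df)
```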
Q. What is logistic regression?
Logistic regression is a statistical method for examining a dataset in which one or more independent variables determine an outcome; the outcome is measured with a binary (dichotomous) variable.
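A minimal scikit-learn sketch, assuming a synthetic binary outcome driven by two independent variables:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data: two independent variables and a binary outcome.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=500) > 0).astype(int)

model = LogisticRegression().fit(X, y)
print(model.coef_, model.intercept_)   # fitted log-odds coefficients
print(model.predict_proba(X[:3]))      # predicted class probabilities
```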
Q. What is the difference between data mining and data profiling?
Data profiling: It focuses on instance-level analysis of individual attributes. It gives information on attributes such as value range, discrete values and their frequency, occurrence of null values, data type, length, etc.
Data mining: It focuses on cluster analysis, detection of unusual records, dependencies, sequence discovery, relations between several attributes, etc.
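A small pandas sketch of attribute-level profiling (the toy table and column names are hypothetical):

```python
import numpy as np
import pandas as pd

# Tiny stand-in table; in practice this would be loaded from a real source.
df = pd.DataFrame({
    "country": ["US", "UK", "US", None, "DE"],
    "age":     [34, 29, np.nan, 41, 38],
    "spend":   [120.0, 80.0, 95.0, 60.0, np.nan],
})

# Per-attribute profile: data type, null counts, distinct values, value range.
profile = pd.DataFrame({
    "dtype":    df.dtypes.astype(str),
    "non_null": df.count(),
    "nulls":    df.isna().sum(),
    "unique":   df.nunique(),
    "min":      df.min(numeric_only=True),
    "max":      df.max(numeric_only=True),
})
print(profile)
print(df["country"].value_counts())   # frequency of each discrete value
```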
Q. List out some common problems faced by a data analyst.
- Common misspelling
- Duplicate entries
- Missing values
- Illegal values
- Varying value representations
- Identifying overlapping data
Q. Mention the name of the framework developed by Apache for processing large data sets for applications in a distributed computing environment.
Hadoop is the framework developed by Apache for processing large data sets for applications in a distributed computing environment; MapReduce is its programming model.
Q. What is KNN imputation method?
In KNN imputation, missing attribute values are imputed using the values from the k records that are most similar to the record whose values are missing. The similarity of two records is determined using a distance function.
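A short sketch using scikit-learn's KNNImputer; the array values and the choice of k = 2 are arbitrary illustrations:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Each row is a record; np.nan marks missing attribute values.
X = np.array([
    [1.0, 2.0,    np.nan],
    [1.1, 2.1,    3.0],
    [0.9, np.nan, 2.9],
    [8.0, 9.0,    10.0],
])

# Missing entries are filled from the k nearest records, measured on the
# observed attributes and weighted by distance.
imputer = KNNImputer(n_neighbors=2, weights="distance")
print(imputer.fit_transform(X))
```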
Q. What data validation methods are used by data analysts?
- Data screening
- Data verification
Q. What should be done with suspected or missing data?
- Prepare a validation report that gives information about all suspect data, such as the validation criteria it failed and the date and time of occurrence.
- Experienced personnel should examine the suspicious data to determine their acceptability.
- Invalid data should be flagged and replaced with a validation code.
- To handle missing data, use the most suitable analysis strategy, such as deletion methods, single imputation methods, model-based methods, etc.
Q. How to deal with multi-source problems?
- Restructuring schemas to accomplish schema integration.
- Identifying similar records and merging them into a single record containing all relevant attributes without redundancy.
Q. What is an Outlier?
An outlier is a term commonly used by analysts to refer to a value that appears far away from, and diverges from, the overall pattern in a sample (a simple univariate check is sketched after the list below). There are two types of outliers:
- Univariate
- Multivariate
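A minimal univariate check using the interquartile-range (IQR) rule; the 1.5 multiplier is the conventional choice and the data is made up:

```python
import numpy as np

# Values beyond 1.5 * IQR from the quartiles are flagged as outliers.
data = np.array([10, 12, 12, 13, 12, 11, 14, 13, 15, 102, 12, 14, 17])

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = data[(data < lower) | (data > upper)]
print(outliers)   # -> [102]
```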
Q. What is Hierarchical Clustering Algorithm?
A hierarchical clustering algorithm combines and divides existing groups, creating a hierarchical structure that shows the order in which groups are divided or merged.
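A small agglomerative (bottom-up) sketch with SciPy; the two synthetic blobs and the choice of Ward linkage are illustrative assumptions:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Two synthetic groups of points; linkage() builds the merge hierarchy and
# fcluster() cuts the tree into a chosen number of flat clusters.
rng = np.random.default_rng(2)
points = np.vstack([rng.normal(0, 0.5, (5, 2)), rng.normal(5, 0.5, (5, 2))])

Z = linkage(points, method="ward")               # the hierarchy of merges
labels = fcluster(Z, t=2, criterion="maxclust")  # cut into 2 groups
print(labels)
```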
Q. Explain what is K-mean Algorithm.
K-means is a well-known partitioning method. Objects are classified as belonging to one of K groups, with K chosen a priori (see the sketch after this list). The method assumes that:
- The clusters are spherical: the data points in a cluster are centered around that cluster's centroid.
- The variance/spread of the clusters is similar: each data point is assigned to the closest cluster.
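A minimal scikit-learn sketch on two synthetic, well-separated blobs, with K = 2 chosen up front:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated blobs; K is chosen a priori (here K = 2).
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(3, 0.3, (50, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)   # one centroid per cluster
print(km.labels_[:5])        # each point is assigned to its closest centroid
```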
Q. What is collaborative filtering?
Collaborative filtering is a simple algorithm used to create a recommendation system based on user behavioral data. The most important components of collaborative filtering are users, items, and interests. A good example of collaborative filtering is the “recommended for you” section on online shopping sites, which pops up based on your browsing history.
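A tiny user-based sketch: the ratings matrix is made up, and items the target user has not seen are ranked by similarity-weighted ratings from the other users.

```python
import numpy as np

# Toy user-item matrix (rows = users, columns = items, 0 = not yet seen).
ratings = np.array([
    [5, 4, 0, 0],
    [4, 5, 1, 0],
    [0, 1, 5, 4],
    [1, 0, 4, 5],
], dtype=float)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

target = 0
# Similarity of the target user to every user (self-similarity zeroed out).
sims = np.array([cosine(ratings[target], ratings[u]) for u in range(len(ratings))])
sims[target] = 0.0

# Predicted interest per item: other users' ratings weighted by similarity.
scores = sims @ ratings / sims.sum()
unseen = np.where(ratings[target] == 0)[0]
print(unseen[np.argsort(-scores[unseen])])   # unseen items, best first
```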
Q. What is KPI, the design of experiments and 80/20 rule?
KPI: It stands for Key Performance Indicator; it is a metric that consists of any combination of spreadsheets, reports, or charts about business processes.
Design of experiments: It is the initial process used to split your data, sample it, and set it up for statistical analysis.
80/20 rule: It means that 80 percent of your income comes from 20 percent of your clients.
Q. What is Map Reduce?
MapReduce is a framework for processing large data sets by splitting them into subsets, processing each subset on a different server, and then blending the results obtained from each.
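A conceptual, single-machine sketch of the idea in plain Python (real MapReduce runs the map and reduce steps across a cluster):

```python
from collections import Counter
from functools import reduce

# Word count: map each chunk to partial counts, then reduce (merge) them.
chunks = [
    "the quick brown fox",
    "the lazy dog",
    "the quick dog",
]

def map_chunk(text):
    return Counter(text.split())       # map step: per-chunk counts

def merge(a, b):
    a.update(b)                        # reduce step: combine partial counts
    return a

partials = [map_chunk(c) for c in chunks]   # would run on separate servers
total = reduce(merge, partials, Counter())
print(total.most_common(3))
```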
Q. What is Clustering? What are the properties for clustering algorithms?
Clustering is an unsupervised method of grouping data. A clustering algorithm divides a data set into natural groups, or clusters.
Properties of clustering algorithm are:
- Hierarchical or flat
- Iterative
- Hard and soft
- Disjunctive
Q. What are some of the statistical methods that are useful for data analysts?
- Bayesian method
- Markov process
- Spatial and cluster processes
- Rank statistics, percentiles, outlier detection
- Imputation techniques, etc.
- Simplex algorithm
- Mathematical optimization
Q. What is time series analysis?
Time series analysis can be done in two domains: the frequency domain and the time domain. In time series analysis, the output of a particular process can be forecast by analyzing previous data with the help of various methods such as exponential smoothing and the log-linear regression method.
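A hand-rolled sketch of simple exponential smoothing; the series and the smoothing factor alpha = 0.3 are arbitrary examples:

```python
# Each smoothed value is a weighted blend of the latest observation and the
# previous smoothed value; the last smoothed value serves as the forecast.
series = [112, 118, 132, 129, 121, 135, 148, 148, 136, 119]
alpha = 0.3

smoothed = [series[0]]
for x in series[1:]:
    smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])

print(round(smoothed[-1], 2))   # one-step-ahead forecast for the next period
```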
Q. What is correlogram analysis?
A correlogram analysis is a common form of spatial analysis in geography. It consists of a series of estimated autocorrelation coefficients calculated for different spatial relationships. It can also be used to construct a correlogram for distance-based data, when the raw data is expressed as distances rather than values at individual points.
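A rough sketch that computes simplified (non-bias-corrected) autocorrelation coefficients at the first few lags of a synthetic series; plotting these against lag gives the correlogram:

```python
import numpy as np

rng = np.random.default_rng(4)
x = np.cumsum(rng.normal(size=200))   # toy series with strong autocorrelation
x = x - x.mean()

def autocorr(x, lag):
    # Correlation between the series and a lagged copy of itself.
    return (x[:-lag] @ x[lag:]) / (x @ x)

print([round(autocorr(x, k), 3) for k in range(1, 6)])
```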
Q. What is imputation? List different types of imputation techniques.
During imputation, we replace missing data with substituted values. The main types of imputation techniques are (a minimal example follows the list):
- Single Imputation
- Hot-deck imputation: A missing value is imputed from a randomly selected similar record (historically carried out with punch cards).
- Cold-deck imputation: It works the same way as hot-deck imputation, but it is more advanced and selects donors from another dataset.
- Mean imputation: It involves replacing the missing value with the mean of that variable for all other cases.
- Regression imputation: It involves replacing the missing value with the predicted values of a variable based on other variables.
- Stochastic regression: It is the same as regression imputation, but it adds the average regression variance to regression imputation.
- Multiple Imputation: Unlike single imputation, multiple imputation estimates the values multiple times.
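For a concrete feel, here is a minimal mean-imputation (single imputation) sketch with scikit-learn; the array values are arbitrary:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Each missing value in a column is replaced with that column's mean
# computed over the observed cases.
X = np.array([
    [1.0, 10.0],
    [2.0, np.nan],
    [np.nan, 30.0],
    [4.0, 40.0],
])
print(SimpleImputer(strategy="mean").fit_transform(X))
```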
Q. Which imputation method is more favorable?
Although single imputation is widely used, it does not reflect the uncertainty created by data that are missing at random. So, multiple imputation is more favorable than single imputation when data are missing at random.