Data Analytics Viva Questions
Basics of NumPy
NumPy is a Python library for numerical computations.
It provides support for arrays, matrices, and various mathematical operations.
Key functions:
np.array() - Creates an array.
np.mean(), np.median(), np.std() - Statistical calculations.
np.linspace() and np.arange() - Create sequences.
np.dot() - Dot product; for 2-D arrays this is matrix multiplication (np.matmul() or the @ operator can also be used).
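A minimal sketch of the functions listed above. The sample values and variable names are chosen only for illustration.

```python
import numpy as np

a = np.array([1, 2, 3, 4])        # build an array from a Python list
mean_a = np.mean(a)               # 2.5
median_a = np.median(a)           # 2.5
std_a = np.std(a)                 # population standard deviation (ddof=0)

evens = np.arange(0, 10, 2)       # [0 2 4 6 8]; the stop value is excluded
grid = np.linspace(0.0, 1.0, 5)   # 5 evenly spaced points, both ends included

m = np.array([[1, 2], [3, 4]])
v = np.array([1, 1])
product = np.dot(m, v)            # matrix-vector product: [3 7]
```

np.arange() takes a step size, while np.linspace() takes the number of points, which is the main practical difference between the two.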
---
Basics of Pandas
Pandas is a library for data manipulation and analysis.
Two main structures:
Series: One-dimensional data.
DataFrame: Two-dimensional, like a table.
Key functions:
pd.read_csv() - Reads a CSV file.
df.head() - Displays the first rows (5 by default).
df.describe() - Summary statistics.
df.isnull() - Detects missing values.
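A short sketch of these calls. To keep it self-contained, the CSV content is a made-up string read through io.StringIO instead of a file on disk; the column names and values are assumptions for illustration.

```python
import io
import pandas as pd

# Hypothetical CSV content, used in place of a real file path.
csv_text = "name,age,score\nAsha,21,88.5\nRavi,,92.0\nMeera,23,\n"
df = pd.read_csv(io.StringIO(csv_text))

first_rows = df.head(2)                  # first 2 rows (default is 5)
summary = df.describe()                  # summary statistics for numeric columns
missing_mask = df.isnull()               # True wherever a value is missing
missing_per_column = df.isnull().sum()   # count of missing values per column

ages = df["age"]                         # a single column is a Series
```

With a real file, the call would simply be pd.read_csv() with the file's path.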
---
Feature Scaling
Adjusts the scale of features to make them comparable.
Techniques:
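Standardization (z-score): (x - mean) / std_dev, giving a mean of 0 and a standard deviation of 1.
Normalization (min-max): (x - min) / (max - min), giving values in the range 0 to 1.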
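Both formulas can be applied directly with NumPy. This is a minimal sketch on made-up values; in practice scikit-learn's StandardScaler and MinMaxScaler do the same job per column.

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# Standardization: subtract the mean, divide by the standard deviation.
standardized = (x - x.mean()) / x.std()

# Normalization: rescale so the minimum maps to 0 and the maximum to 1.
normalized = (x - x.min()) / (x.max() - x.min())

print(standardized)
print(normalized)
```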
---
Principal Component Analysis (PCA) and LDA
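PCA: Reduces dimensionality by projecting data onto orthogonal components that capture the most variance. It is unsupervised (class labels are not used).
LDA: Linear Discriminant Analysis reduces dimensionality by finding directions that maximize separation between classes. It is supervised (class labels are required).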
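The difference shows up in how the two are fitted. The sketch below assumes scikit-learn and its bundled iris dataset, neither of which is named in these notes: PCA is fitted on the features alone, while LDA also needs the labels.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)    # 150 samples, 4 features, 3 classes

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)         # labels are not passed

lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)      # labels are passed

# Share of total variance captured by each principal component.
print(pca.explained_variance_ratio_)
```

LDA can produce at most (number of classes - 1) components, so 2 is the maximum for a 3-class dataset.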
---
Linear Regression
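Simple Linear Regression: Predicts a dependent variable (y) using one independent variable (x).
Equation: y = b0 + b1*x + e, where b0 is the intercept, b1 is the slope, and e is the error term.
Multiple Linear Regression: Uses multiple independent variables.
Equation: y = b0 + b1*x1 + b2*x2 + ... + bn*xn + e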
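A minimal sketch of fitting both models with scikit-learn's LinearRegression. The data is generated from known coefficients (an assumption for illustration) so the fitted intercept and slopes can be checked against them.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Simple linear regression: data generated exactly from y = 2 + 3*x.
x = np.array([[1.0], [2.0], [3.0], [4.0]])
y = 2.0 + 3.0 * x.ravel()
simple = LinearRegression().fit(x, y)
print(simple.intercept_, simple.coef_)      # close to 2 and [3]

# Multiple linear regression: data generated from y = 1 + 2*x1 - 1*x2.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y_multi = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1]
multiple = LinearRegression().fit(X, y_multi)
print(multiple.intercept_, multiple.coef_)  # close to 1 and [2, -1]
```

intercept_ holds the intercept and coef_ holds one slope per independent variable.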
---
Handling Missing Values
Imputation:
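Mean: Replace missing values with the mean of the column (sensitive to outliers).
Median: Replace with the column's median (more robust when the data is skewed).
Mode: Replace with the most frequent value (the usual choice for categorical columns).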
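The three strategies above can be sketched with pandas fillna(). The DataFrame and its column names are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [20.0, np.nan, 30.0, 40.0],
    "income": [100.0, 200.0, np.nan, 900.0],
    "city": ["Pune", "Delhi", None, "Pune"],
})

# Mean imputation: the mean of 20, 30, 40 is 30.
df["age_filled"] = df["age"].fillna(df["age"].mean())

# Median imputation: the median of 100, 200, 900 is 200.
df["income_filled"] = df["income"].fillna(df["income"].median())

# Mode imputation: "Pune" is the most frequent city.
df["city_filled"] = df["city"].fillna(df["city"].mode()[0])
```

mean() and median() skip missing values by default, so each fill value is computed only from the observed entries.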
---
Model Selection
Evaluate models based on:
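R²: Proportion of the variance in the dependent variable that the model explains; closer to 1 is better.
Mean Squared Error (MSE): Average of the squared differences between predicted and actual values; lower is better.
Inertia (in clustering): Sum of squared distances from each point to its nearest cluster center; measures compactness, and lower is better.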
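A small worked example of the two regression metrics, assuming scikit-learn's metrics module; the true and predicted values are made up. Inertia is exposed separately, as the inertia_ attribute of a fitted KMeans model.

```python
from sklearn.metrics import mean_squared_error, r2_score

y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.0, 7.5, 9.0]

# Squared errors are 0.25, 0, 0.25, 0, so the mean is 0.125.
mse = mean_squared_error(y_true, y_pred)

# R² = 1 - SS_res / SS_tot = 1 - 0.5 / 20 = 0.975.
r2 = r2_score(y_true, y_pred)

print(mse, r2)
```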
---
Difference Between Regression Types
Simple Linear: One independent variable.
Multiple Linear: Multiple independent variables.
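Ridge: Adds an L2 penalty (lambda * sum of squared coefficients) to prevent overfitting; coefficients shrink but stay non-zero.
Lasso: Adds an L1 penalty (lambda * sum of absolute coefficients) and performs feature selection by shrinking some coefficients to exactly zero.
Elastic Net: Combines Ridge (L2) and Lasso (L1) penalties.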
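A sketch comparing the three penalized models on synthetic data where only two of five features matter. The data, the alpha values, and the l1_ratio are assumptions chosen for illustration; in scikit-learn the penalty strength is the alpha parameter.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
# Only the first two features influence y; the other three are pure noise.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

ridge = Ridge(alpha=1.0).fit(X, y)                    # L2 penalty
lasso = Lasso(alpha=0.1).fit(X, y)                    # L1 penalty
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)  # mix of L1 and L2

print("Ridge:      ", ridge.coef_)
print("Lasso:      ", lasso.coef_)
print("Elastic Net:", enet.coef_)
```

On data like this, Lasso typically drives the coefficients of the noise features to zero, while Ridge leaves them small but non-zero.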
---
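Cosine Function Meaning
The cosine is a trigonometric function. In clustering, cosine similarity measures the cosine of the angle between two vectors: 1 means the same direction, 0 means orthogonal, and -1 means opposite directions. It depends only on direction, not on vector length.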
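Cosine similarity is the dot product of two vectors divided by the product of their lengths. A minimal NumPy sketch follows; the helper name cosine_similarity is defined here for illustration and is not taken from these notes.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (both must be non-zero)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

same_direction = cosine_similarity([1, 2, 3], [2, 4, 6])   # parallel vectors
orthogonal = cosine_similarity([1, 0], [0, 1])             # right angle
opposite = cosine_similarity([1, 0], [-1, 0])              # opposite directions

print(same_direction, orthogonal, opposite)
```

The first pair scores about 1 even though one vector is twice as long as the other, which is why the measure suits data where magnitude should be ignored.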
---
DBSCAN Clustering
DBSCAN (Density-Based Spatial Clustering of Applications with Noise):
Groups points close together and marks outliers.
Parameters:
eps: Maximum distance between two points for them to be considered neighbors.
min_samples: Minimum number of points (including the point itself) within eps for a point to count as part of a dense region.
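A minimal sketch of both parameters in use on a toy dataset (two tight groups and one isolated point, made up for illustration).

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([
    [1.0, 1.0], [1.1, 1.0], [1.0, 1.1],   # dense group A
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],   # dense group B
    [9.0, 0.0],                           # isolated point
])

model = DBSCAN(eps=0.5, min_samples=3).fit(X)
labels = model.labels_    # cluster index per point; -1 marks noise (outliers)

print(labels)             # [ 0  0  0  1  1  1 -1]
```

Unlike K-Means, the number of clusters is not specified in advance; it follows from eps and min_samples.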
---
Elbow Method (for K-Means Clustering)
Used to find the optimal number of clusters (k).
Plot inertia against k. The "elbow" is the value of k after which adding more clusters reduces inertia only slightly.
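A sketch of how the inertia values for such a plot might be collected. The synthetic data from make_blobs, the range of k, and the KMeans settings are assumptions for illustration; the values are printed here, and could equally be passed to a plotting library.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data generated around 3 centers.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)

inertias = {}
for k in range(1, 8):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_   # sum of squared distances to nearest center

for k, value in inertias.items():
    print(k, round(value, 1))
```

Reading down the printed values, the point where the drop from one k to the next becomes small is the elbow.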
---
Key Imports and Functions
1. Lasso Regression:
from sklearn.linear_model import Lasso
2. DBSCAN:
from sklearn.cluster import DBSCAN
3. SVM:
from sklearn.svm import SVC
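A minimal sketch of how the SVC import might be used for classification. The iris dataset, the train/test split, and the kernel and C values are assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

clf = SVC(kernel="rbf", C=1.0)         # these are also the default settings
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)   # fraction of correct test predictions

print(accuracy)
```

Lasso and DBSCAN follow the same pattern: create the object with its parameters, then call fit().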
---
Test Your Understanding
NumPy & Pandas
1. How would you create a NumPy array of numbers from 1 to 10?
2. Which function in Pandas can you use to detect missing values?
Feature Scaling
3. Why is feature scaling important? Name two techniques for scaling.
PCA & LDA
4. What is the primary goal of PCA? How does it differ from LDA?
Linear Regression
5. What is the equation for simple linear regression?
6. How is multiple regression different from simple regression?
Imputation
7. If a dataset has missing values, which techniques can you use to fill them?
Model Selection
8. If two models have the same R² value but different MSE, which should you choose?
Clustering
9. What does the eps parameter in DBSCAN control?
10. What is the purpose of the Elbow Method?
Advanced Regression
11. What is the key difference between Ridge and Lasso regression?