Data Analytics Viva Questions

Basics of NumPy

NumPy is a Python library for numerical computations.

It provides support for arrays, matrices, and various mathematical operations.

Key functions:

np.array() - Creates an array.

np.mean(), np.median(), np.std() - Statistical calculations.

np.linspace() and np.arange() - Create sequences.

np.dot() - Matrix multiplication.
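
For example, a minimal sketch showing these functions in use (the array values are arbitrary):

import numpy as np

a = np.array([1, 2, 3, 4, 5])                 # np.array(): create an array
print(np.mean(a), np.median(a), np.std(a))    # basic statistics
seq = np.linspace(0, 1, 5)                    # 5 evenly spaced values from 0 to 1
nums = np.arange(1, 11)                       # integers 1 to 10
m = np.array([[1, 2], [3, 4]])
print(np.dot(m, m))                           # matrix multiplication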




---

Basics of Pandas

Pandas is a library for data manipulation and analysis.

Two main structures:

Series: One-dimensional data.

DataFrame: Two-dimensional, like a table.


Key functions:

pd.read_csv() - Reads a CSV file.

df.head() - Displays the first rows.

df.describe() - Summary statistics.

df.isnull() - Detects missing values.
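
A short sketch of these functions in use (the file name data.csv is a hypothetical example):

import pandas as pd

df = pd.read_csv("data.csv")     # read a CSV file (hypothetical filename)
print(df.head())                 # first 5 rows
print(df.describe())             # summary statistics for numeric columns
print(df.isnull().sum())         # number of missing values per column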




---

Feature Scaling

Adjusts the scale of features to make them comparable.

Techniques:

Standardization: (x - mean) / std_dev

Normalization: (x - min) / (max - min)
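
A minimal sketch of both techniques using scikit-learn (the sample data is arbitrary):

import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

X = np.array([[1.0], [5.0], [10.0]])           # arbitrary sample data
X_std = StandardScaler().fit_transform(X)      # standardization: (x - mean) / std_dev
X_norm = MinMaxScaler().fit_transform(X)       # normalization: (x - min) / (max - min)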




---

Principal Component Analysis (PCA) and LDA

PCA: Unsupervised; reduces dimensionality by projecting the data onto the components (directions) that explain the most variance.

LDA: Linear Discriminant Analysis is supervised; it reduces dimensionality while maximizing the separation between classes.
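
A brief sketch of both with scikit-learn on arbitrary toy data (note that LDA also needs class labels y):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X = np.random.rand(20, 4)                 # arbitrary toy features
y = np.array([0] * 10 + [1] * 10)         # arbitrary class labels (LDA needs these)

X_pca = PCA(n_components=2).fit_transform(X)                            # unsupervised
X_lda = LinearDiscriminantAnalysis(n_components=1).fit_transform(X, y)  # supervised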



---

Linear Regression

Simple Linear Regression: Predicts a dependent variable (y) using one independent variable (x).

Equation: y = β0 + β1x


Multiple Linear Regression: Uses multiple independent variables.

Equation: y = β0 + β1x1 + β2x2 + ... + βnxn
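
A small sketch of both with scikit-learn (the data here is made up):

import numpy as np
from sklearn.linear_model import LinearRegression

# Simple: one independent variable
X1 = np.array([[1], [2], [3], [4]])
y = np.array([2.1, 4.2, 6.1, 8.3])
simple = LinearRegression().fit(X1, y)
print(simple.intercept_, simple.coef_)      # β0 and β1

# Multiple: several independent variables
X2 = np.array([[1, 5], [2, 3], [3, 8], [4, 1]])
multiple = LinearRegression().fit(X2, y)
print(multiple.intercept_, multiple.coef_)  # β0, then β1 and β2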




---

Handling Missing Values

Imputation:

Mean: Replace missing values with the mean of the column.

Median: Replace with the column's median.

Mode: Replace with the most frequent value.
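
A short sketch of all three with Pandas fillna (the DataFrame and its "age" column are hypothetical):

import pandas as pd

df = pd.DataFrame({"age": [25, None, 30, None, 40]})      # hypothetical column with gaps
mean_filled   = df["age"].fillna(df["age"].mean())        # mean imputation
median_filled = df["age"].fillna(df["age"].median())      # median imputation
mode_filled   = df["age"].fillna(df["age"].mode()[0])     # mode imputation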




---

Model Selection

Evaluate models based on:

R²: Proportion of variance in the target explained by the model; higher (closer to 1) is better.

Mean Squared Error (MSE): Average squared difference between predicted and actual values; lower is better.

Inertia (in clustering): Sum of squared distances from points to their cluster centers; measures compactness, and lower is better.
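
A sketch of computing R² and MSE with scikit-learn metrics (y_true and y_pred are placeholder values):

from sklearn.metrics import r2_score, mean_squared_error

y_true = [3.0, 5.0, 7.0, 9.0]    # placeholder actual values
y_pred = [2.8, 5.1, 7.3, 8.7]    # placeholder model predictions
print(r2_score(y_true, y_pred))             # closer to 1 is better
print(mean_squared_error(y_true, y_pred))   # lower is better
# for clustering, a fitted KMeans model exposes its inertia as .inertia_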




---

Difference Between Regression Types

Simple Linear: One independent variable.

Multiple Linear: Multiple independent variables.

Ridge: Adds an L2 penalty (λ × sum of squared coefficients) to prevent overfitting.

Lasso: Adds an L1 penalty (λ × sum of absolute coefficients) and performs feature selection by shrinking some coefficients to zero.

Elastic Net: Combines Ridge and Lasso penalties.
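
A minimal sketch of the three penalized variants in scikit-learn (the alpha values are arbitrary):

from sklearn.linear_model import Ridge, Lasso, ElasticNet

ridge = Ridge(alpha=1.0)                      # L2 penalty on coefficient sizes
lasso = Lasso(alpha=0.1)                      # L1 penalty; can shrink coefficients to exactly zero
enet  = ElasticNet(alpha=0.1, l1_ratio=0.5)   # mix of L1 and L2 penalties
# each is then used like plain LinearRegression: model.fit(X, y), model.predict(X_new)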



---

Cosine Function Mean

The cosine is a trigonometric function of an angle. In clustering and text analysis, cosine similarity, cos(θ) = (A · B) / (||A|| ||B||), is often used to measure the angular similarity between two vectors; values near 1 mean the vectors point in nearly the same direction.
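
A tiny sketch computing cosine similarity with NumPy (the vectors are arbitrary):

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])
cos_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(cos_sim)   # 1.0 here, because b points in the same direction as a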



---

DBSCAN Clustering

DBSCAN (Density-Based Spatial Clustering of Applications with Noise):

Groups points that lie close together in dense regions and marks points in low-density regions as outliers (noise).

Parameters:

eps: Maximum distance between two points for them to be considered neighbors.

min_samples: Minimum points to form a dense region.
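
A short sketch of DBSCAN on toy 2-D points (the eps and min_samples values are illustrative):

import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1.0, 1.0], [1.2, 1.1], [0.9, 1.0],
              [8.0, 8.0], [8.1, 7.9],
              [25.0, 80.0]])                        # two dense groups plus one outlier
labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(X)
print(labels)   # cluster ids; -1 marks noise/outliers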





---

Elbow Method (for K-Means Clustering)

Used to find the optimal number of clusters (k).

Plot inertia vs. k. The "elbow" is the point where the decrease slows.
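
A minimal sketch of the elbow plot with KMeans and matplotlib (X is arbitrary toy data):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.random.rand(100, 2)                 # arbitrary toy data
inertias = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)           # compactness for each k

plt.plot(range(1, 11), inertias, marker="o")
plt.xlabel("k")
plt.ylabel("Inertia")
plt.show()                                  # look for the "elbow" where the drop levels off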



---

Key Imports and Functions

1. Lasso Regression:

from sklearn.linear_model import Lasso


2. DBSCAN:

from sklearn.cluster import DBSCAN


3. SVM:

from sklearn.svm import SVC
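
Each of these estimators follows the same pattern: instantiate, fit, predict. A hedged sketch with SVC on made-up toy data:

import numpy as np
from sklearn.svm import SVC

X = np.array([[0, 0], [1, 1], [0, 1], [1, 0]])   # toy features
y = np.array([0, 1, 0, 1])                        # toy labels
model = SVC(kernel="rbf").fit(X, y)               # instantiate, then fit
print(model.predict([[0.9, 0.9]]))                # then predict on new data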




---

Test Your Understanding

NumPy & Pandas

1. How would you create a NumPy array of numbers from 1 to 10?
2. Which function in Pandas can you use to detect missing values?


Feature Scaling
3. Why is feature scaling important? Name two techniques for scaling.



PCA & LDA
4. What is the primary goal of PCA? How does it differ from LDA?


Linear Regression
5. What is the equation for simple linear regression?
6. How is multiple regression different from simple regression?


Imputation
7. If a dataset has missing values, which techniques can you use to fill them?


Model Selection
8. If two models have the same R² value but different MSE, which should you choose?


Clustering
9. What does the eps parameter in DBSCAN control?
10. What is the purpose of the Elbow Method?


Advanced Regression
11. What is the key difference between Ridge and Lasso regression?
