Top 100 Data Science Interview Questions
Data science is a multidisciplinary field that combines statistics, mathematics, artificial intelligence, machine learning, and more. Given the massive amount of data generated today, data scientists have become essential to any company's functioning. In this blog, we will cover data science interview questions for freshers, intermediates, and advanced professionals to help you ace your interviews.
You can also opt for an online data science course to increase your proficiency in the field.
Data Science Interview Questions for Freshers
Here are some of the important data science interview questions for freshers.
1. What is data science?
Data science is a combination of domains, such as statistics, machine learning, mathematics, artificial intelligence, programming, and analytics. It focuses on uncovering actionable insights from an organization’s data, which are then used for strategic planning and decision-making.
2. What is a data science life cycle?
A data science life cycle is a set of steps that involve the utilization of data science techniques, tools, and processes to gather insights from a set of data. The following are the steps involved in a data science life cycle:
- Problem Identification
- Data Collection
- Data Preparation
- Data Analysis
- Model Building
- Model Evaluation
- Model Deployment
- Report Preparation
3. According to you, what is the most important skill one should have to become a data scientist?
The skills required to become a successful data scientist include:
- Expertise in mathematics, probability, and statistical analysis.
- Proficiency in cloud computing.
- Familiarity with deep learning models and technologies.
- Advanced data visualization skills.
- Strong programming skills.
- Competence in data processing.
4. Name some data science libraries.
Following are some data science libraries:
- NumPy
- SciPy
- TensorFlow
- Scrapy
- Librosa
- Matplotlib
- Pandas
5. How will you select the right variables?
To choose the appropriate variables, we use feature selection. There are two main approaches to feature selection (illustrated in the sketch after the list):
- Filter Method: This method involves selecting features based on their statistical properties. It is computationally efficient and can be used to quickly identify the most important features in a dataset. Examples: Linear Discriminant Analysis, ANOVA, and Chi-Square.
- Wrapper Methods: This method selects features based on how well they improve the performance of a specific machine learning model. The model is trained and tested on different subsets of features, and the subset that gives the best performance is kept. Examples: Forward Selection, Backward Selection, and Recursive Feature Elimination.
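As an illustration, here is a minimal sketch of both approaches with scikit-learn, using the toy iris dataset; the dataset and the choice of keeping two features are for demonstration only.

```python
# Filter vs. wrapper feature selection (illustrative sketch)
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE, SelectKBest, chi2
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Filter method: keep the 2 features with the highest chi-square scores
filter_selector = SelectKBest(score_func=chi2, k=2)
X_filtered = filter_selector.fit_transform(X, y)

# Wrapper method: recursive feature elimination around a logistic regression model
wrapper_selector = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=2)
X_wrapped = wrapper_selector.fit_transform(X, y)

print("Filter kept feature indices:", filter_selector.get_support(indices=True))
print("Wrapper kept feature indices:", wrapper_selector.get_support(indices=True))
```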
6. If your data set is missing a lot of values, what will you do?
If the dataset is large, the rows containing missing values can simply be dropped, since enough data remains to work with; this is the easiest and quickest way to deal with the problem. When the data is small and limited, the missing values in a Pandas DataFrame can instead be substituted with the average (or another summary statistic) of the values that are present.
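As a quick sketch, both options look like this in Pandas; the DataFrame below is a made-up example.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with missing values
df = pd.DataFrame({
    "age": [25, np.nan, 32, 41, np.nan],
    "salary": [50000, 60000, np.nan, 80000, 75000],
})

# Option 1: drop rows containing missing values (fine when plenty of data remains)
df_dropped = df.dropna()

# Option 2: impute missing values with each column's mean (useful when data is limited)
df_imputed = df.fillna(df.mean(numeric_only=True))

print(df_dropped)
print(df_imputed)
```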
7. What is data cleaning in data science?
Data cleaning is the process of correcting or removing incorrect, duplicate, corrupted, incomplete, and incorrectly formatted data from a dataset. This is done to ensure that no data is duplicated or mislabeled during processing or analysis. In Python, Pandas and NumPy are commonly used libraries for data cleaning.
8. How is data science different from data analytics?
Both data science and data analytics focus on getting insights from a specific dataset. While data science utilizes data to build models that predict future outcomes, data analytics focuses more on analyzing historical data to make informed decisions.
9. What is dimensionality reduction?
Dimensionality reduction is a method that transforms a dataset with a large number of dimensions (features) into one with fewer dimensions while retaining as much of the original information as possible. The benefit is that the reduced dataset requires less storage and less computation time.
10. In what ways would you maintain a deployed model?
The following ways can be used to maintain a deployed model:
- Monitor
- Evaluate
- Compare
- Build
11. How would you define recommender systems?
Recommender systems refer to a class of algorithms and techniques used to make personalized recommendations to users based on their past behavior. These systems are widely used in e-commerce, social media, and other industries to help improve customer engagement and satisfaction.
12. Explain logistic regression.
Logistic regression is used to measure the relationship between a dependent variable and one or more independent variables by estimating probability using its underlying logistic function. In this method, the relationship measured is used to predict the value of one of the variables based on the other. The prediction usually has a binary outcome, like yes or no.
13. How will you select k for k-means?
The optimal number of clusters found from the data is denoted as ‘k’ in k-means. You can use the elbow method to select ‘k’: run k-means clustering for a range of k values and, for each k, calculate the within-cluster sum of squares (the average distance of the data points to their centroid). Plot this value against k; the ‘elbow’ point where the curve stops decreasing sharply indicates a suitable k, as illustrated below.
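A minimal sketch of the elbow method with scikit-learn, using synthetic data generated with make_blobs (the number of samples and the range of k are arbitrary):

```python
# Elbow method: plot within-cluster sum of squares (inertia) against k
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

k_values = range(1, 10)
inertias = []
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(model.inertia_)  # sum of squared distances to the closest centroid

plt.plot(k_values, inertias, marker="o")
plt.xlabel("k")
plt.ylabel("Inertia (WCSS)")
plt.title("Elbow method for choosing k")
plt.show()
```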
14. How will you treat outlier values?
Outlier values can be dropped only if they are garbage values. Otherwise, you can try the following:
- Choose an alternate model.
- Attempt to normalize the data.
- Use algorithms that are less affected by outliers.
15. What difference lies between traditional application programming and data science?
In traditional application programming, the rules that convert input into output are written explicitly by the programmer. In data science, the rules are produced automatically from the data. This is the most important difference between the two.
16. What is deep learning?
Deep learning uses artificial neural networks whose structure is loosely inspired by the human brain. Deep learning models can recognize patterns in text, pictures, sounds, and other data to produce accurate insights and predictions. These methods are used to automate tasks that usually require human intelligence.
17. What is machine learning?
Machine learning is a subset of artificial intelligence that focuses on the creation of algorithms used to train computers or machines to learn without explicitly being programmed. It uses data and algorithms to imitate human learning, gradually improving its accuracy.
18. When should an algorithm be updated?
An algorithm should be updated when:
- The underlying data source is changing.
- There is non-stationarity in the data.
- If you want the model to evolve as data streams through infrastructure.
- Due to underperformance or lack of efficiency in an algorithm.
19. What information about the customer should be there in the SQL query?
The following information should be there in the SQL query:
- Order Table
  - OrderID
  - CustomerID
  - OrderNumber
  - TotalAmount
- Customer Table
  - ID
  - FirstName
  - LastName
  - City
  - Country
20. What is long format data and wide format data?
Long Format Data: Each row represents a single measurement for a subject (one variable or one time point per row), so identifier values repeat down the first column and a subject can span many rows.
Wide Format Data: Each row represents a single observation (subject), and each column represents a single variable or feature. It is used in datasets where the number of variables is relatively small compared to the number of observations.
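A short Pandas sketch of converting between the two formats; the subject and score columns are invented for illustration.

```python
import pandas as pd

# Wide format: one row per subject, one column per measurement
wide = pd.DataFrame({
    "subject": ["A", "B"],
    "score_2022": [70, 80],
    "score_2023": [75, 85],
})

# Wide -> long: each row becomes one subject/variable/value combination
long = wide.melt(id_vars="subject", var_name="year", value_name="score")

# Long -> wide again
wide_again = long.pivot(index="subject", columns="year", values="score")

print(long)
print(wide_again)
```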
21. What is cross-validation?
Cross-validation is a model validation technique to evaluate how the outcomes of a statistical analysis will generalize to an independent data set. It is used to estimate how accurately a model will work.
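For example, a 5-fold cross-validation sketch with scikit-learn on the toy iris dataset (the model and fold count are chosen only for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Train and evaluate the model on 5 different train/validation splits
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)

print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())
```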
22. What is root cause analysis?
Root cause analysis is a method for identifying the root cause of problems in order to find appropriate solutions. Businesses use this method to uncover the underlying reasons for problems. Root cause analysis is considered more effective because it aims at a long-term solution by addressing root causes rather than just treating the symptoms.
23. What is resampling? When is it conducted?
Resampling is a method used to draw new samples from the existing or original data sample. It is done to gather more information about the sample and improve accuracy. Resampling is also helpful in quantifying the uncertainty of population parameters. This ensures that the model is able to handle variations in data. Further, resampling is used in cases where the models are verified using random subsets or when substituting labels on data points while performing tests.
24. What is imbalance data?
Data is considered imbalanced when it is distributed unequally across different categories. Models trained on imbalanced datasets tend to favor the majority class, which leads to errors and inaccurate results for the minority class.
25. What is expected and mean value?
Expected value is a long-run average value of random variables. It indicates the probability-weighted average of all possible values. Mean value is the average value of a given sample, i.e., raw data that is already collected.
26. What are confounding variables?
Confounding variables, also known as confounders, are variables that influence both the dependent and independent variables, creating a spurious mathematical relationship between them. Confounders affect variables that are associated but are not causally related to each other.
27. What are eigenvectors and eigenvalues?
Eigenvectors are vectors whose direction does not change when a linear transformation (matrix) is applied to them; they are often normalized to unit length and are also known as right vectors. Eigenvalues are the scalars by which the corresponding eigenvectors are stretched or shrunk under that transformation.
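A quick NumPy sketch on a made-up 2x2 matrix, verifying the defining relation A·v = λ·v:

```python
import numpy as np

A = np.array([[4.0, 1.0],
              [2.0, 3.0]])

# Columns of `eigenvectors` are the (unit-length) eigenvectors of A
eigenvalues, eigenvectors = np.linalg.eig(A)

v = eigenvectors[:, 0]
print(eigenvalues)
print(np.allclose(A @ v, eigenvalues[0] * v))  # True: A v equals lambda v
```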
28. What is a p-value?
A p-value is the probability of obtaining results at least as extreme as those actually observed, assuming the null hypothesis is true. A small p-value (commonly below 0.05) indicates that the observed result is unlikely to have occurred by chance alone.
29. What is the law of large numbers?
According to the law of large numbers, the mean of a sample grows closer to the average of the whole population as the sample size increases. This implies that the sample is more representative of the population as the sample becomes larger.
30. What is data visualization?
Data visualization is the graphical representation of information or data using charts, plots, infographics, and animations. It helps in communicating complex data relationships and data-driven insights visually.
31. What are two foundational functions used in data science?
The following are two foundational functions used in data science:
- Cost Function: This function quantifies the overall disparity between predicted and actual values across the entire training set, typically as the average of the individual losses. It is the quantity that is optimized and is also known as the objective function.
- Loss Function: This function evaluates the discrepancy or error between the predicted value and the actual label for a single training example. It is central to supervised learning scenarios.
32. Between R and Python, which is the better programming language for analyzing text?
Python will perform better than R when analyzing text for the following reasons:
- Python has a Pandas module that facilitates high-performance data analysis.
- Pandas support simple-to-use data structures.
- Python performs data analytics faster than R.
33. Give reasons why TensorFlow is a preferred library.
Here are some of the reasons why TensorFlow is a preferred library:
- Provides C++ and Python APIs for efficiency.
- Faster compilation speed than Keras and PyTorch.
- Supports both GPU and CPU computing devices.
34. What is a dropout in data science?
In data science, dropout is a regularization technique that randomly deactivates (drops) visible and hidden network units during training. Eliminating a fraction of the nodes, commonly up to 20%, helps avoid overfitting and gives the network the space it needs to converge iteratively.
Data Science Interview Questions for Intermediates
Now, let’s look at some of the data science interview questions for intermediates.
35. Which machine learning algorithm do you usually prefer and why?
To answer this question, you can talk about the following machine learning algorithms:
- Linear Regression
- Logistic Regression
- Decision Tree
- Support Vector Machine
- K-Means
You can also explain these ML algorithms, followed by their features, and some benefits because of which you prefer working with the particular algorithm.
36. How would you make a decision tree?
The following steps can be taken to make a decision tree:
- Turn data into input.
- Use split to separate the data into classes.
- Apply the split method to the data.
- Apply the last two steps on the data again to further divide the data.
- Stop when a stopping criterion is met.
- If you have split too far, clean up the tree.
37. Explain pruning in a decision tree algorithm.
Decision trees can grow exceptionally large, which promotes overfitting and reduces generalization capacity. Pruning reduces this complexity by removing branches (rules) that add little predictive value, which often improves the tree’s accuracy on unseen data.
38. What is entropy in a decision tree algorithm?
Entropy is the measure of disorder or randomness in a set of observations. It is used to check the homogeneity of the given data: if entropy is zero, the data is completely homogeneous, and if entropy is one (for a two-class sample), the data is split equally between the classes. The decision tree algorithm uses entropy to decide how to split the data.
39. Explain the building of a random forest model.
A random forest builds many decision trees on different subsets (bootstrap samples) of the data and combines their predictions by voting or averaging. The following are the steps to build a random forest model (a scikit-learn sketch follows the list).
- Select ‘k’ features from the total number of ‘m’ features. Make sure that k<m.
- Calculate node ‘d’ from among the ‘k’ features while using the best-split point.
- Split the nodes into daughter nodes.
- Finalize the leaf nodes.
- Build the random forest model by repeating all the steps to create the trees.
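The steps above describe the algorithm conceptually; in practice, libraries handle them internally. Here is a minimal sketch using scikit-learn’s RandomForestClassifier on the toy iris dataset (hyperparameter values are illustrative only):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# n_estimators = number of trees; max_features controls the 'k' features sampled per split
model = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=42)
model.fit(X_train, y_train)

print("Test accuracy:", model.score(X_test, y_test))
```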
40. What will you do if your model is overfitting?
A model is overfitted when it learns the noise and specific details of the training data rather than the underlying pattern, so it performs well on the training data but generalizes poorly and fails to provide useful insights on new data. There are ways to avoid overfitting your model, which include:
- Take a small number of variables into account and keep the model as simple as possible.
- Use the technique of cross-validation.
- Use LASSO regularization techniques to avoid issues like overfitting.
41. Explain TPR and FPR.
TPR: The true-positive rate is the proportion of actual positives that the model correctly predicts as positive, i.e., TP / (TP + FN).
FPR: The false-positive rate is the proportion of actual negatives that the model incorrectly predicts as positive, i.e., FP / (FP + TN).
42. What is ROC Curve?
The receiver operating characteristic (ROC) curve is the graph of the True Positive Rate (y-axis) against the False Positive Rate (x-axis). It shows the performance of a classification model at all classification thresholds.
43. What is a variance?
Variance measures how spread out the individual values in a dataset are: it is the average of the squared deviations of each number from the mean. Data scientists use variance to get an idea of the distribution of a dataset.
44. What is the normal distribution?
In a normal distribution, data is symmetrically distributed around the mean, with most values close to the mean and frequencies tapering off toward the extremes. It appears as a bell-shaped curve on a graph, and its mean, median, and mode coincide.
45. What do you mean by feature vectors?
In machine learning, a feature vector is an n-dimensional vector of numerical values that represents the measurable properties (features) of an object, making it easy for data scientists and algorithms to analyze those features.
46. What is the objective of A/B testing?
A/B testing is used to compare two versions of a web page or app to determine which one performs better against a chosen metric, so that the outcome of a strategy can be maximized. The test is conducted by presenting each version to users at random and analyzing the resulting interactions.
47. What are the advantages and disadvantages of linear models?
The following are the advantages of linear models:
- They are easy to implement and interpret, and efficient to train.
- They are ideal for linearly separable data.
- They allow the use of cross-validation and regularization to handle overfitting.
The following are the disadvantages of linear models:
- The assumption of linearity between variables restricts them to modeling linear relationships.
- They cannot directly handle binary outcomes (logistic regression is used for that).
- They are prone to overfitting.
48. What is regularization in machine learning?
Regularization is a machine learning approach that prevents overfitting, which occurs when a model is overly complicated and fits the training data too closely, resulting in poor generalization to new data. Regularization does this by introducing a penalty term into the loss function, which pushes the model to have smaller weights.
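As a sketch, L2 (Ridge) and L1 (Lasso) regularization in scikit-learn on synthetic data; the alpha value is the penalty strength and is chosen arbitrarily here.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=100, n_features=20, noise=10, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)  # L2 penalty: shrinks weights toward zero
lasso = Lasso(alpha=1.0).fit(X, y)  # L1 penalty: can set some weights exactly to zero

print("Non-zero Ridge weights:", (ridge.coef_ != 0).sum())
print("Non-zero Lasso weights:", (lasso.coef_ != 0).sum())
```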
49. What are some sampling techniques?
Sampling is the practice of analyzing a subset of all data to uncover meaningful information in a larger dataset. There are two types of sampling, probability and non-probability sampling.
Probability Sampling: It is a method that uses random selection methods, providing everyone in a population an equal chance of selection. It allows you to form a representative sample. Examples: Clustered sampling, Stratified sampling, and Simple random sampling.
Non-Probability Sampling: It is a method that uses a subjective method rather than a random method to select clusters from a population. In non-probability, not all members of a population have a chance of being selected. It saves time and is a cost-effective method of selection. Examples: Snowball sampling, Quota sampling, and Convenience sampling.
50. Explain different types of bias in sampling.
Following are the different types of bias in sampling:
Selection Bias: It occurs when the sample is not drawn randomly from the population, introducing an error that makes the sample unrepresentative.
Survivorship Bias: It occurs when a data set only considers surviving or existing observations and fails to consider previous observations that now cease to exist.
Undercoverage Bias: It occurs when a significant part of the population is excluded from or not represented well in your sample. This makes the sample no longer representative of the target population.
51. What is gradient and gradient descent?
Gradient: The gradient measures how much a function’s output changes in response to a small change in its inputs. In model training, it represents the change in the error with respect to a change in the weights.
Gradient Descent: It is an iterative optimization algorithm that minimizes a loss (cost) function by repeatedly moving the weights in the direction opposite to the gradient.
52. What is an exploding gradient and a vanishing gradient?
Exploding Gradient: It is a problem where large gradients accumulate and result in very large updates to the weights of a neural network model. Due to this, the model is unable to learn and its behavior becomes unstable.
Vanishing Gradient: This occurs during the training of neural networks, where the gradients used to train the network become extremely small or vanish as they shift from output layers to the earlier layers.
53. What are RMSE and MSE in a linear regression model?
Root Mean Square Error (RMSE): It measures the average difference between values predicted by a model and the actual values. It gives an estimate of the accuracy of the target value prediction. It is one of the ways used to test the performance of the machine learning model.
Mean Squared Error (MSE): It measures how close a regression line is to a set of data points. In this method, the distances from the points to the regression line are squared. The total is divided by the number of data points to calculate the mean squared error (MSE).
54. What are support vectors in support vector machines (SVM)?
Support vector machine is a machine learning algorithm that aims to find a hyperplane that separates the two classes. Support vectors are the data points that are closest to the hyperplane. The separating line in the SVM is determined with the help of these data points.
55. What is the significance of C in support vector machines?
The parameter C in the support vector machine is a hyperparameter that controls the trade-off between maximizing the margin and minimizing the misclassification error. It decides the penalty for misclassifying a training example.
If C has a smaller value, the penalty for misclassification is smaller, so the model prioritizes maximizing the margin even if it misclassifies some examples; this gives a more rigid, simpler model that is prone to underfitting. If C has a larger value, the penalty for misclassification is larger, so the model tries hard to classify every training example correctly; this gives a more flexible model that is prone to overfitting. Therefore, the value of C should be chosen carefully.
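A small sketch of how C affects an SVM in scikit-learn on synthetic data; the C values and dataset are arbitrary, and cross-validated accuracy simply shows that the choice matters.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

for C in [0.01, 1, 100]:
    scores = cross_val_score(SVC(kernel="rbf", C=C), X, y, cv=5)
    print(f"C={C}: mean cross-validated accuracy = {scores.mean():.3f}")
```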
56. What is a computational graph?
Computational graphs are a type of graph used in data science to represent mathematical expressions. This graph has a network of nodes where each node corresponds to a mathematical operation. It is also known as the dataflow graph. The popular deep learning library TensorFlow is based on computational graphs.
57. What are auto-encoders?
Auto-encoders are neural networks that learn to reconstruct their input at the output with minimal error, so the output is almost equal to, or close to, the input. Multiple layers are added between the input and output layers, and the hidden (bottleneck) layers in between are typically smaller than the input layer, which forces the network to learn a compressed representation.
58. Explain KPI, lift, model fitting, robustness, and DOE.
Key Performance Indicator (KPI): These indicators measure how well a business performs across different parameters. The aim is to understand if the business is achieving its objectives.
Lift: It is used to measure the performance of a target model against performance of a randomly selected model. Lift helps in understanding how good the model is with prediction versus if there was no model.
Model Fitting: It indicates how well the target model aligns with given observations.
Design of Experiments (DOE): It refers to the systematic design of experiments or tasks that describe and explain how the information of interest varies under hypothesized conditions, so that the effect of each variable can be measured.
59. What is correlation and covariance?
Correlation: It measures both the strength and direction of the relationship between two variables on a standardized scale from -1 to 1.
Covariance: It measures the extent to which two variables change together: a positive covariance means they tend to increase together, while a negative covariance means one tends to increase as the other decreases. Unlike correlation, it is not standardized, so its magnitude depends on the units of the variables.
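A short NumPy sketch on made-up values showing both measures:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.0, 6.2, 8.1, 10.3])

print("Covariance matrix:\n", np.cov(x, y))        # scale depends on the variables' units
print("Correlation matrix:\n", np.corrcoef(x, y))  # values always lie between -1 and 1
```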
60. What is kernel trick?
A kernel trick is a method for solving a nonlinear problem with a linear classifier by implicitly mapping linearly inseparable data into a higher-dimensional space where it becomes separable, without explicitly computing the coordinates in that space.
61. What is a box plot and histogram?
A box plot and a histogram are visualizations used to represent data distributions and communicate information. The box plot depicts numerical data through its quartiles. It shows how tightly the data is grouped, whether the data is skewed, and how symmetric the data is.
Histograms are bar chart representations of information that indicate the frequency of numerical values. They are useful in estimating variations, probability distributions, and outliers.
62. What are tuning strategies? Explain grid search and random search tuning strategies.
Tuning strategies are used to find the right set of hyperparameters or properties that are fixed before the model is trained on the dataset. Both grid search and random search are optimization techniques used to search for efficient hyperparameters.
Grid Search: Candidate values for each hyperparameter are laid out in a grid (matrix), and every combination is trained and its accuracy evaluated. After all the combinations are tested, the model with the highest accuracy is selected.
Random Search: In this method, random hyperparameter sets are tried and tested to find the best solution. The function is tested at random configurations in parameter space to optimize the search.
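For illustration, both strategies with scikit-learn on the toy iris dataset; the parameter grids and distributions are arbitrary examples.

```python
from scipy.stats import uniform
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Grid search: every combination of the listed values is evaluated
grid = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}, cv=5)
grid.fit(X, y)

# Random search: 10 random configurations drawn from the given distributions
rand = RandomizedSearchCV(
    SVC(),
    param_distributions={"C": uniform(0.1, 10), "gamma": uniform(0.01, 1)},
    n_iter=10, cv=5, random_state=0,
)
rand.fit(X, y)

print("Grid search best params:", grid.best_params_)
print("Random search best params:", rand.best_params_)
```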
63. What is the central limit theorem?
The central limit theorem is a concept in statistics, according to which, the distribution of the sample mean will approach a normal distribution with the increase in the sample size. This is correct regardless of the underlying distribution of the population from which the sample is taken. So, even if the individual data points are not normally distributed in a sample, we can use normal distribution-based methods to draw inferences about the population.
64. What are the two types of target variables for predictive modeling?
The two types of target variables for predictive modeling are:
- Numerical or Continuous Variables: These are variables that can take any value within a given range. Examples: height, weight, age, and income.
- Categorical Variables: These variables can take one or a limited number of possible values. Here, each individual or unit of observation is assigned to a particular group based on a qualitative property. Example: Exam result (Pass or Fail), blood type, and gender.
65. What is the confidence interval in statistics?
The confidence interval is the range within which the results are expected to fall if the experiment is repeated. It measures the degree of certainty or uncertainty in a sampling method and is also used in hypothesis testing and regression analysis. Confidence intervals are typically constructed at a confidence level such as 95% or 99%.
66. What is hypothesis testing? Why is it useful for data scientists?
Hypothesis testing is a statistical method used in data science to test the validity of a statement or hypothesis about a population. It is used to determine if there is enough evidence to support a hypothesis and to examine the statistical significance of the results.
Hypothesis testing is essential for data scientists because it facilitates informed decision-making based on data rather than assumptions. It allows data scientists to draw conclusions from data and present their findings clearly and reliably.
67. How is Bayes’ theorem used in data science?
Bayes’ theorem is a mathematical formula that calculates the probability of an occurring event based on prior knowledge of event-related conditions. In data science, it is used in Bayesian statistics and machine learning to perform tasks, such as prediction, classification, and estimation.
68. What is a parametric test and non-parametric test?
A parametric test is a statistical test that assumes that the population’s parameters follow a specific probability distribution. A non-parametric test does not make any assumptions about the underlying probability distribution of the data. Non-parametric tests are also known as distribution-free tests.
69. Explain the benefits of NumPy in data science.
The following are the benefits of NumPy in data science:
- It provides fast and efficient tools for working with arrays and matrices of numerical data.
- It integrates well with other scientific computing libraries, such as SciPy and Pandas. This makes it easier for the user to perform more complex data science tasks.
Advanced Data Science Interview Questions and Answers
Given below are advanced data science interview questions for data science professionals with experience:
70. What are some popular types of neural networks?
The following are some of the popular types of neural networks:
- Recurrent Neural Network (RNN): It processes sequential data and is used in language translation, voice recognition, image captioning, and more. Google’s voice search and Apple’s Siri use RNN-based algorithms.
- Convolutional Neural Network (CNN): It consists of multiple layers that process and extract features from data. CNNs are mainly used for image processing and object detection.
- Long Short-Term Memory Network: It can learn and memorize long-term dependencies. They are used for time-series predictions as they can remember previous inputs.
- Multilayer Perceptrons (MLPs): They have multiple layers of perceptrons with activation functions. A multilayer perceptron has an input layer and an output layer and can also have multiple hidden layers. They are usually used for speech recognition and machine-translation software.
71. What is a generative adversarial network? What are its two essential components?
Generative adversarial networks (GANs) are a robust class of neural networks that are used for unsupervised learning. It trains two neural networks in a way that they compete with each other to generate more authentic and accurate new data from a specific training dataset.
The two essential components of GAN are:
- Generator: It learns to generate plausible data. The generated instances become negative training examples for the discriminator.
- Discriminator: It learns to distinguish the generator’s fake data from the authentic data. Also, the discriminator penalizes the generator for producing implausible results.
72. What is univariate, bivariate, and multivariate analysis?
- Univariate Analysis: It takes into consideration only one variable to find various patterns in the data. For example, the marks of all students in French. Once the marks are fed into the system, the conclusions can be taken out by using methods, such as mean, median, mode, etc.
- Bivariate Analysis: This type of analysis takes into account two variables to highlight the relationship, and possible causes, between them. For example, the sales of woolen clothes in winter versus the temperature: when the temperature is lower, woolen clothes sell more, and when it is higher, sales drop. Bivariate analysis draws such conclusions from the relationship between the two variables.
- Multivariate Analysis: This type of analysis involves more than two variables to draw conclusions and study the relationships among them. For example, predicting house prices based on area, location, number of rooms, and age of the property.
73. Explain the Confusion Matrix. How will you calculate accuracy by using the confusion matrix?
A confusion matrix is a table that summarizes the predictions of a problem that has been defined. It is also used to evaluate the performance of classification models.
Accuracy is calculated from the confusion matrix as the proportion of correct predictions (true positives and true negatives) out of all observations:
Accuracy = (True Positive + True Negative) / Total Observations
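A minimal sketch with made-up labels, computing accuracy both by hand and with scikit-learn:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

# For binary labels, ravel() returns TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy = (tp + tn) / (tp + tn + fp + fn)

print("Accuracy (manual):", accuracy)
print("Accuracy (sklearn):", accuracy_score(y_true, y_pred))
```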
74. With the help of an equation, calculate the precision and the recall rate.
The formula for precision is as follows:
Precision = (True positive) / (True Positive + False Positive)
The formula for recall rate is as follows:
Recall Rate = (True Positive) / (True Positive + False Negative)
75. Explain the concept of the bias-variance tradeoff in machine learning.
The bias-variance tradeoff is a fundamental machine learning concept that refers to the tradeoff between a model’s ability to match the training data (low bias) and its ability to generalize to new data. A model with a high bias cannot capture the complexity of the data and hence underfits the training data, whereas a model with a high variance overfits the training data and thus fails to generalize to new data.
76. What is ensemble learning in machine learning?
Ensemble learning is a machine learning approach that integrates many models to increase the accuracy and resilience of predictions. The core principle underlying ensemble learning is to combine the predictions of numerous weak learners (models that perform just marginally better than random guessing) to generate a strong learner (model that performs significantly better than random guessing).
77. What is a bagging ensemble? How do you choose the number of models to use in this ensemble?
A bagging ensemble is used to reduce variance by averaging predictions from models trained on different subsets of data. It is also known as bootstrap aggregation.
The number of models used in an ensemble is decided by the trade-off between performance and computational cost: increasing the number of models improves the ensemble’s performance, but the computational cost increases too. In practice, cross-validation can be used to determine the optimal number of models based on the chosen evaluation metric.
78. What is Boosting?
Boosting is an ensemble learning method that combines a set of weak classifiers to build a strong classifier and thereby reduce training errors. The process involves selecting a random data sample, fitting a model to it, and then training subsequent models sequentially so that each one tries to compensate for the weaknesses of the previous model. It is most useful with simple, high-bias learners such as shallow decision trees, since each new model focuses on the examples the earlier ones got wrong.
79. Differentiate between supervised and unsupervised learning.
The following are the differences between supervised and unsupervised learning.
| Parameters | Supervised Learning | Unsupervised Learning |
| --- | --- | --- |
| Input | Uses labeled data as input. | Uses unlabeled data as input. |
| Objective | The goal is to predict outcomes for new data. | The goal is to get insights from large volumes of new data. |
| Algorithm | Commonly used algorithms include decision trees, support vector machines, and logistic regression. | Commonly used algorithms include hierarchical clustering, k-means clustering, and the Apriori algorithm. |
| Feedback | It has a feedback mechanism. | It has no feedback mechanism. |
| Applications | Sentiment analysis, pricing predictions, spam detection, and weather forecasting. | Medical imaging, customer personas, and anomaly detection. |
| Complexity | It is simpler compared to unsupervised learning. | It is more complex because powerful tools are needed to work with large amounts of unclassified data. |
80. What is KNN? What will happen if the number of neighbors increases in KNN?
The k-nearest neighbors algorithm or KNN is a non-parametric, supervised learning classifier. This algorithm uses proximity to make classifications about a grouping of an individual data point.
If the number of neighbors increases in KNN, the classifier becomes more conservative and the decision boundary becomes smoother. Though this helps tackle overfitting, it makes the classifier less sensitive to subtle patterns in the training data and can lead to underfitting. Therefore, it is important to choose an appropriate value of k.
81. What are label encoding and one-hot encoding?
Label encoding and one-hot encoding are two different techniques used for encoding categorical variables as numerical values. They are used in machine learning models during preprocessing.
Label encoding is used for categorical variables that have a natural order. This method assigns a unique integer value to each category. These integer values are determined by the natural order of the categories.
One-hot encoding is used for categorical variables that do not have a natural order or ranking. It creates new binary columns for each category. In this method, a value of 1 indicates the presence of the category, whereas a value of 0 indicates the absence of the category. This is useful in preserving the uniqueness of each category and preventing the model from assuming any ordinal relationships between them.
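Here is a small sketch of both encodings on an invented DataFrame; note that scikit-learn’s LabelEncoder assigns integers alphabetically, so for a truly ordinal column you may prefer an explicit mapping or OrdinalEncoder.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    "size": ["small", "large", "medium", "small"],  # ordinal-style column
    "city": ["Delhi", "Mumbai", "Delhi", "Pune"],   # nominal column
})

# Label encoding (integers assigned alphabetically: large=0, medium=1, small=2)
df["size_encoded"] = LabelEncoder().fit_transform(df["size"])

# One-hot encoding: one binary column per city
df = pd.get_dummies(df, columns=["city"])

print(df)
```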
82. Why is label encoding not ideal for nominal data?
Label encoding is not ideal for nominal data because of the following reasons:
- It can create an ordinal relationship between categories where none exists. This is problematic because the model may treat the assigned integers as having a meaningful order or magnitude.
- It can lead to unexpected results if you have an imbalanced dataset.
83. What are some problems encountered with one-hot encoding?
Here are some of the issues you may encounter while using one-hot encoding:
- It can create a large number of new columns in the dataset, making it complex to work with the data.
- It can lead to overfitting. This is especially true if you have a small dataset and a large number of categories.
- It makes it challenging to add new categories to the dataset at a later stage.
84. Which encoding techniques can we use when dealing with a large number of categorical values in a column?
Frequency encoding and target encoding are ideal to use when dealing with a large number of categorical values in a column. Frequency encoding replaces each category with the frequency of that category in the dataset. It is effective when the categories have a natural ordinal relationship based on their frequency. Target encoding replaces each category with the mean of the target variable for that category. It is effective when the categories have a clear relationship with the target variable.
85. What are linear discriminant analysis and principal component analysis?
Linear Discriminant Analysis (LDA): It is a supervised technique used to find a lower-dimensional subspace that maximizes the separation between different classes of data. It is used as a dimensionality reduction technique for classification problems.
Principal Component Analysis (PCA): It is a technique used to reduce the dimensionality of large datasets. It transforms a large set of variables into a smaller set that still contains most of the information of the large set.
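A short PCA sketch with scikit-learn, reducing the 4-feature iris dataset to 2 principal components (the component count is arbitrary):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print("Reduced shape:", X_reduced.shape)
print("Variance explained by each component:", pca.explained_variance_ratio_)
```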
86. What will happen if the mean, median, and mode are the same for a dataset?
If the mean, median, and mode of a dataset are the same, the distribution is symmetric and unimodal, with no skew; the normal distribution is the classic example. A trivial case is a dataset consisting of a single repeated value, such as 4, 4, 4, 4, 4, where the mean, median, and mode are all 4.
The mean, median, and mode are influenced differently by outliers or extreme values. If the dataset is skewed or contains extreme values, the mean (and, to a lesser extent, the median) will differ noticeably from the mode, so equality of the three measures indicates the absence of such skew.
87. What are lambda functions in Python? Give an example.
A lambda function in Python is a small anonymous function used when you do not want to define a function using the def keyword. Lambda functions are often used in combination with higher-order functions, such as filter(), reduce(), and map().
Example:
```python
# Regular function to square a number
def square(x):
    return x ** 2

# Equivalent lambda function
square_lambda = lambda x: x ** 2

# Using the lambda function
result = square_lambda(5)
print(result)  # Output: 25
```
Here, the ‘square_lambda’ is a lambda function that squares its input ‘x’.
88. What are the different ways to find outliers in the data?
Outliers are data points caused by errors, anomalies, or unusual scenarios. They have a significant impact on statistical analysis and machine learning models. You can use the following ways to find outliers in the data:
- Visual Inspection: You can visually inspect the data for outliers using plots like histograms, box plots, and scatter plots.
- Summary Statistics: You can calculate the summary statistics, such as mean, median, or interquartile range, and compare them to the data to find outliers.
- Z-Score: It is a measure of how many standard deviations a data point is from the mean. Data points that have a z-score more than a specific limit can be considered outliers.
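As an example of the z-score approach, here is a small sketch on made-up values; the threshold of 2 standard deviations is a common but arbitrary choice.

```python
import numpy as np

data = np.array([10, 12, 11, 13, 12, 11, 95])  # 95 is an obvious outlier

z_scores = (data - data.mean()) / data.std()
outliers = data[np.abs(z_scores) > 2]  # flag points more than 2 standard deviations away

print("Outliers:", outliers)
```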
89. What is skewness in statistics? Explain its types.
Skewness is a measure of the symmetry of a distribution. A distribution is symmetrical if it is in the shape of a bell curve where the data points are concentrated around the mean. If a distribution is not symmetrical, it is considered skewed. Here, the data points are concentrated on one side of the mean rather than the other.
The two types of skewness are as follows:
- Positive Skewness: It occurs when the distribution has a long tail on the right side and the majority of the data points are concentrated on the left side of the mean. It indicates that there are few extreme values on the right side of the distribution.
- Negative Skewness: It occurs when the distribution has a long tail on the left side and the majority of the data points are on the right side of the mean. It indicates that there are few extreme values on the left side of the distribution.
90. What are decorators in Python?
Decorators in Python are a way to modify the functionality of a method, function, or class without changing its source code. They are implemented as functions that take another function as an argument and return a new function. A decorator is applied with the @ symbol placed immediately before the method, function, or class it decorates.
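A minimal sketch of a decorator; the timing_decorator name and behavior are just an illustration.

```python
import time

def timing_decorator(func):
    """Log how long the wrapped function takes to run."""
    def wrapper(*args, **kwargs):
        start = time.time()
        result = func(*args, **kwargs)
        print(f"{func.__name__} took {time.time() - start:.4f} seconds")
        return result
    return wrapper

@timing_decorator
def slow_sum(n):
    return sum(range(n))

print(slow_sum(1_000_000))
```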
91. From the given dataset, extract only those rows where the ‘grade’ value is greater than 80 and the ‘gender’ is female.
| Name | Gender | Grade |
| --- | --- | --- |
| Alice | Female | 85 |
| Bob | Male | 70 |
| Charlie | Female | 90 |
| David | Male | 75 |
| Eva | Female | 95 |
Here is the code to extract rows where the ‘grade’ value is greater than 80 and the ‘gender’ is female.
```python
import pandas as pd

# Reading the dataset from a file (assuming it's a CSV file named 'student_details.csv')
student_details = pd.read_csv('student_details.csv')

# Extracting rows where Grade > 80 and Gender is Female (column names match the table above)
filtered_students = student_details[
    (student_details['Grade'] > 80) & (student_details['Gender'] == 'Female')
]

print(filtered_students)
```
92. Create a scatter plot using ggplot to visualize the relationship between ‘height’ and ‘weight’ for a group of individuals. Plot ‘weight’ on the y-axis, ‘height’ on the x-axis, and color the points based on ‘gender’.
The following code will create a scatter plot using ggplot to visualize the relationship between ‘height’ and ‘weight’ for a group of individuals:
```r
library(ggplot2)

# Sample dataset
data <- data.frame(
  height = c(160, 165, 170, 175, 180),
  weight = c(60, 70, 80, 90, 100),
  gender = c('female', 'female', 'male', 'male', 'female')
)

# Create scatter plot
ggplot(data, aes(x = height, y = weight, color = gender)) +
  geom_point() +
  labs(x = 'Height', y = 'Weight', color = 'Gender')
```
93. Write a function to calculate Euclidean distance between two points P1 and P2 that represent the coordinates (x1, y1) and (x2, y2) respectively.
The following is the code to calculate Euclidean distance between P1 and P2:
```python
def euclidean_distance(P1, P2):
    # Square the differences in x and y coordinates, sum them, and take the square root
    return (((P1[0] - P2[0]) ** 2) + ((P1[1] - P2[1]) ** 2)) ** 0.5
```
The above code calculates the difference between x coordinates and y coordinates, squares each difference, sums them, and then takes the square root of the sum to find the distance.
94. Write code to calculate the root mean square error (RMSE) given the lists of values representing the actual and predicted values.
```python
def rmse(actual, predicted):
    errors = [actual[i] - predicted[i] for i in range(len(actual))]
    squared_errors = [x ** 2 for x in errors]
    mean_squared_error = sum(squared_errors) / len(squared_errors)
    return mean_squared_error ** 0.5
```
In the above code, the mean of the squared errors is computed and the square root of this mean is taken to calculate the root mean square error.
95. How will you detect if the time series data is stationary?
A time series is considered stationary when its statistical properties, such as the mean, variance, and autocovariance, remain constant over time. You can check this visually by plotting the series, or formally with statistical tests such as the Augmented Dickey-Fuller (ADF) test.
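For example, a sketch of the ADF test using statsmodels on a simulated random walk (which is non-stationary by construction):

```python
import numpy as np
from statsmodels.tsa.stattools import adfuller

np.random.seed(0)
series = np.random.normal(size=200).cumsum()  # random walk: non-stationary

adf_stat, p_value = adfuller(series)[:2]
print("ADF statistic:", adf_stat)
print("p-value:", p_value)  # a p-value above 0.05 suggests the series is non-stationary
```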
96. From the given confusion matrix, calculate the precision and recall.
|  | Predicted Positive | Predicted Negative |
| --- | --- | --- |
| Actual Positive | 20 | 5 |
| Actual Negative | 10 | 15 |
Here is how you can calculate precision and recall:
```python
# Values from the confusion matrix
true_positive = 20   # Actual Positive, Predicted Positive
false_positive = 10  # Actual Negative, Predicted Positive
false_negative = 5   # Actual Positive, Predicted Negative

# Calculate precision
precision = true_positive / (true_positive + false_positive)

# Calculate recall
recall = true_positive / (true_positive + false_negative)

print("Precision:", precision)  # 0.666...
print("Recall:", recall)        # 0.8
```
97. What is TF/IDF vectorization?
The Term Frequency-Inverse Document Frequency (TF/IDF) is a numerical measure that allows one to determine how important a word is to a document in a collection of documents called a corpus. This measure is often used in text mining and information retrieval.
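A quick sketch with scikit-learn’s TfidfVectorizer on a tiny made-up corpus:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "data science is fun",
    "machine learning is a part of data science",
    "deep learning is a part of machine learning",
]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())
print(tfidf_matrix.toarray().round(2))
```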
98. What are large language models?
Large language models (LLMs) are AI models designed to process and generate text that resembles human language based on the input received. They use deep learning and other advanced techniques to comprehend and produce language patterns that enable them to engage in conversations and answer questions. They undergo training using extensive sets of textual data from diverse sources to gain the ability to recognize patterns, understand context, and generate appropriate responses.
99. What is a transformer in machine learning?
A transformer in machine learning is a neural network architecture that has garnered significant acclaim, especially in the natural language processing (NLP) domain. It was designed to overcome the limitations encountered by recurrent neural networks (RNNs) when working with sequential data. Transformers rely on a self-attention mechanism rather than sequential processing, which allows computations to be parallelized and facilitates greater scalability and efficiency.
100. Write the code to build an ROC curve.
Here is how you can build a ROC curve using the ‘roc_curve’ function from the scikit-learn library:
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

# Example predicted probabilities and true labels
predicted_probabilities = np.array([0.1, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9])
true_labels = np.array([0, 0, 0, 1, 1, 1, 1])

# Compute ROC curve
fpr, tpr, thresholds = roc_curve(true_labels, predicted_probabilities)

# Compute ROC area under the curve (AUC)
roc_auc = roc_auc_score(true_labels, predicted_probabilities)

# Plot ROC curve
plt.figure()
plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC curve (area = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc='lower right')
plt.show()
```
Conclusion
By reviewing the data science interview questions, you can practice your responses and confidently navigate the interview process to demonstrate your knowledge in the domain. These questions help you grasp common data science concepts like machine learning, data visualization, statistical analysis, data cleaning, and data science tools. Apply both your technical knowledge and analytical thinking skills to answer the questions confidently and ace your data science interview.
Did you find this blog helpful? Let us know in the comments below. If you are seeking to pursue a career in data science, consider opting for a data science course with a placement guarantee. From technical knowledge to mock interview practice, you will get mentorship to achieve a successful career in the field of data science.