10 Scikit-learn Functions Every ML Beginner Should Know

You've cleaned your data. You've visualized it. Now it's time to build models.
Scikit-learn is one of the most widely used machine learning libraries in Python. Here are 10 functions and classes you'll reach for in most ML projects, with examples and use cases.
1. train_test_split() — Split Your Data
Always split your data before training. Never train on your test data.
Example:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
Use case: Keeping 20% of your data aside to test how well your model performs on unseen data.
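For classification, it can help to keep the class ratio the same in both splits. Here is a minimal sketch using the `stratify` parameter; the toy data (100 rows, 20% positives) is made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (made up): 100 rows, 2 features, 20% positive class
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 2))
y = np.array([1] * 20 + [0] * 80)

# stratify=y keeps the class ratio the same in the train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)    # (80, 2) (20, 2)
print(y_train.mean(), y_test.mean())  # 0.2 0.2
```

Without `stratify`, a small test set can end up with very few positives by chance.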
2. StandardScaler() — Scale Your Features
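Many ML algorithms, especially distance-based and gradient-based ones such as k-nearest neighbors, SVMs, and logistic regression, perform better when features are on a similar scale. StandardScaler rescales each feature to zero mean and unit variance. Tree-based models like random forests generally don't need it.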
Example:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
Use case: Scaling age, salary, and experience columns so no single feature dominates the model.
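To see what the scaler actually does, here is a small sketch on made-up age, salary, and experience values. The numbers are invented; only the StandardScaler calls come from the example above.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up columns: age (years), salary (dollars), experience (years)
X_train = np.array([
    [25, 40_000, 1],
    [35, 60_000, 8],
    [45, 80_000, 15],
    [55, 100_000, 30],
], dtype=float)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# After scaling, every column has mean ~0 and standard deviation ~1
print(X_train_scaled.mean(axis=0).round(6))
print(X_train_scaled.std(axis=0).round(6))

# The statistics learned from the training data are stored on the scaler
print(scaler.mean_)  # per-column means: age 40, salary 70000, experience 13.5
```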
3. LabelEncoder() — Encode Categories
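ML models don't understand text, so categories must be converted to numbers. LabelEncoder assigns each category an integer and is intended for the target column (y).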
Example:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df["gender_encoded"] = le.fit_transform(df["gender"])
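# Classes are sorted alphabetically: Female → 0, Male → 1
Use case: Converting a text target column (such as "yes"/"no" or species names) into integers. For input features like city or department, integer codes imply an order that doesn't exist, so one-hot encoding is usually the safer choice.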
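For feature columns with no natural order, a one-hot encoding sketch might look like the following. OneHotEncoder is a different scikit-learn class from the LabelEncoder shown above, and the city values are made up.

```python
from sklearn.preprocessing import OneHotEncoder

# Made-up feature column: one city per row (the encoder expects 2D input)
cities = [["Paris"], ["Tokyo"], ["Lima"], ["Tokyo"]]

# handle_unknown="ignore" encodes categories unseen during fit as all zeros
encoder = OneHotEncoder(handle_unknown="ignore")
city_encoded = encoder.fit_transform(cities).toarray()

print(encoder.categories_[0])  # categories in sorted order: Lima, Paris, Tokyo
print(city_encoded)            # one column per category, a single 1 per row
```

Each city gets its own 0/1 column, so the model never sees a fake ranking between cities.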
4. LogisticRegression() — Classify Yes or No
One of the simplest classification models and a solid baseline to start with. It handles multiclass targets too, not just yes/no.
Example:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
Use case: Predicting if a customer will churn, if an email is spam, if a patient has a disease.
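Sometimes you want a probability rather than a hard yes/no. Here is a minimal sketch using `predict_proba()`, with synthetic data from `make_classification` standing in for a real churn or spam dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary data standing in for a real dataset
X, y = make_classification(n_samples=200, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LogisticRegression()
model.fit(X_train, y_train)

# predict() returns hard labels; predict_proba() returns one probability per class
labels = model.predict(X_test[:3])
probabilities = model.predict_proba(X_test[:3])

print(labels)
print(probabilities.round(3))  # each row sums to 1; column order follows model.classes_
```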
5. LinearRegression() — Predict Numbers
Use when your target is a number, not a category.
Example:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
Use case: Predicting house prices, salary based on experience, sales based on ad spend.
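A fitted linear model is easy to inspect. The sketch below uses made-up, noise-free data (salary = 30000 + 5000 × years of experience) so you can check that the learned slope and intercept match.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data: salary = 30000 + 5000 * years_of_experience (no noise)
experience = np.array([[1], [3], [5], [7], [10]])
salary = 30_000 + 5_000 * experience.ravel()

model = LinearRegression()
model.fit(experience, salary)

print(model.coef_[0])        # slope, close to 5000
print(model.intercept_)      # intercept, close to 30000
print(model.predict([[6]]))  # prediction for 6 years, close to 60000
```

Real data is noisy, so the fit won't be this exact, but `coef_` and `intercept_` are read the same way.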
6. RandomForestClassifier() — Powerful Classification
An ensemble of decision trees. It often beats logistic regression when the relationships in your data are non-linear, and it doesn't need feature scaling.
Example:
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
Use case: Customer churn prediction, fraud detection, disease classification.
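A fitted random forest also tells you which features it relied on. Here is a minimal sketch on synthetic data; the `feature_0` to `feature_4` names are placeholders, not real columns.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 5 features, of which only 2 carry signal
X, y = make_classification(
    n_samples=300, n_features=5, n_informative=2, n_redundant=0,
    random_state=42,
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X, y)

# One importance score per feature; the scores sum to 1
for index, importance in enumerate(model.feature_importances_):
    print(f"feature_{index}: {importance:.3f}")
```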
7. accuracy_score() — Measure Your Model
How often is your model correct? Accuracy is the fraction of predictions that match the true labels.
Example:
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, predictions)
print(f"Accuracy: {accuracy * 100:.2f}%")
Use case: Checking how well your classification model performs on the test set.
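Accuracy can look great on imbalanced data even when the model is useless. A quick sketch with made-up labels, where a "model" always predicts the majority class:

```python
from sklearn.metrics import accuracy_score

# Made-up labels: 95 negatives and 5 positives
y_true = [0] * 95 + [1] * 5

# A "model" that always predicts the majority class
y_pred = [0] * 100

accuracy = accuracy_score(y_true, y_pred)
print(f"Accuracy: {accuracy * 100:.2f}%")  # Accuracy: 95.00%
```

It scores 95% while missing every single positive case.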
8. confusion_matrix() — See Where Your Model Fails
Accuracy alone doesn't tell the full story. This shows you exactly where your model goes wrong.
Example:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
cm = confusion_matrix(y_test, predictions)
ConfusionMatrixDisplay(cm).plot()
Use case: Seeing how many predictions were correct, and what type of errors your model is making.
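For a binary problem, you can also read the four counts straight out of the matrix. A small sketch with made-up labels:

```python
from sklearn.metrics import confusion_matrix

# Made-up binary labels for illustration
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0, 0]

cm = confusion_matrix(y_true, y_pred)
print(cm)
# [[3 1]
#  [2 2]]

# Rows are true classes, columns are predicted classes
tn, fp, fn, tp = cm.ravel()
print(tn, fp, fn, tp)  # 3 1 2 2
```

Here the model raised 1 false alarm (fp) and missed 2 real positives (fn).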
9. cross_val_score() — Test More Reliably
One train-test split isn't always enough. Cross-validation gives a more honest score.
Example:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(model, X, y, cv=5)
print(f"Average Accuracy: {scores.mean() * 100:.2f}%")
Use case: Getting a reliable accuracy estimate by training and testing on 5 different data splits.
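If your model needs scaled features, one way to keep cross-validation honest is to wrap the scaler and the model in a pipeline. This is a sketch on synthetic data; `make_pipeline` is an addition not used elsewhere in this article.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for a real dataset
X, y = make_classification(n_samples=300, n_features=8, random_state=42)

# The pipeline refits the scaler on the training part of every fold,
# so the held-out part never influences the scaling statistics
pipeline = make_pipeline(StandardScaler(), LogisticRegression())

scores = cross_val_score(pipeline, X, y, cv=5)
print(scores.round(3))
print(f"Average Accuracy: {scores.mean() * 100:.2f}%")
```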
10. classification_report() — Full Model Report
Gives you precision, recall, and F1-score in one shot.
Example:
from sklearn.metrics import classification_report
print(classification_report(y_test, predictions))
Use case: Evaluating your model beyond just accuracy — especially important for imbalanced datasets.
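If you want the numbers rather than a printed table, `output_dict=True` returns the report as a dictionary. A minimal sketch with made-up imbalanced labels:

```python
from sklearn.metrics import classification_report

# Made-up imbalanced labels: 8 negatives, 2 positives
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

report = classification_report(y_true, y_pred, output_dict=True)

# Keys are the class labels as strings, plus "accuracy" and the averages
print(report["accuracy"])        # 0.8
print(report["1"]["precision"])  # 0.5
print(report["1"]["recall"])     # 0.5
```

Accuracy is 80%, but the model only catches half of the positives, which is exactly what the per-class numbers reveal.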
Quick Reference — Save This
| Function | What It Does |
|---|---|
| train_test_split() | Split data into train and test sets |
| StandardScaler() | Standardize features to zero mean and unit variance |
| LabelEncoder() | Convert category labels to integers |
| LogisticRegression() | Classification model (binary or multiclass) |
| LinearRegression() | Numeric prediction model |
| RandomForestClassifier() | Powerful classification model |
| accuracy_score() | Measure model accuracy |
| confusion_matrix() | See prediction errors visually |
| cross_val_score() | Reliable accuracy with CV |
| classification_report() | Full precision, recall, F1 report |
One Important Rule
Always scale your test data using the scaler fitted on training data.
# Correct ✅
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Wrong ❌
X_test_scaled = scaler.fit_transform(X_test)
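Fitting the scaler on test data leaks test-set statistics into your preprocessing and scales the two sets differently. Your test score will no longer reflect how the model performs on truly unseen data.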
Save this article and come back to it every time you start a new ML project.





