10 Scikit-learn Functions Every ML Beginner Should Know

You've cleaned your data. You've visualized it. Now it's time to build models.

Scikit-learn is one of the most widely used machine learning libraries in Python. Here are 10 functions and classes you'll reach for in almost every ML project, each with an example and a use case.

1. train_test_split() — Split Your Data

Always split your data before training. Never train on your test data.

Example:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

Use case: Keeping 20% of your data aside to test how well your model performs on unseen data.
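
When the classes are imbalanced, a plain random split can leave the test set with too few examples of the rare class. Here is a small sketch of a stratified split. The toy arrays are made up for illustration, and stratify=y is an optional argument the example above does not use.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 3 features, 80/20 class imbalance
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = np.array([0] * 80 + [1] * 20)

# stratify=y keeps the class ratio the same in the train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)  # (80, 3) (20, 3)
print(y_train.mean(), y_test.mean())  # share of class 1 in each split
```

Both splits end up with the same share of positive rows, which makes the test score easier to trust on small datasets.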

2. StandardScaler() — Scale Your Features

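Many ML algorithms, especially distance-based and gradient-based ones such as k-nearest neighbors, SVMs, and logistic regression, perform better when features are on a similar scale. Tree-based models such as random forests are largely unaffected by scaling.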

Example:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

Use case: Scaling age, salary, and experience columns so no single feature dominates the model.
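
To see what the scaler actually does, here is a minimal sketch on a made-up table of age, salary, and experience. The numbers are invented for illustration. After scaling, every column should have a mean close to 0 and a standard deviation close to 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical columns: age (years), salary, experience (years)
X_train = np.array([
    [25, 30000.0, 1],
    [35, 60000.0, 8],
    [45, 90000.0, 15],
    [55, 120000.0, 25],
])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# Each column is now centered and has unit variance
print(X_train_scaled.mean(axis=0))
print(X_train_scaled.std(axis=0))

# The statistics learned from the training data are stored on the scaler
print(scaler.mean_)
```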

3. LabelEncoder() — Encode Categories

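Most scikit-learn models need numeric input, so text categories must be converted to numbers. LabelEncoder is designed for the target column (y). For input features, OneHotEncoder or OrdinalEncoder is usually the better fit.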

Example:

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
df["gender_encoded"] = le.fit_transform(df["gender"])
# Classes are sorted alphabetically, so with only these two values: Female → 0, Male → 1

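Use case: Converting a text target column, or a two-valued column such as gender, into numeric values before training. For multi-valued features such as city or department, the integer codes imply an order that does not exist.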
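
For a feature with more than two categories, one-hot encoding is a common alternative because it does not invent an ordering. The sketch below uses OneHotEncoder on a made-up city column. The DataFrame and its values are assumptions for illustration only.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical feature column with three distinct cities
df = pd.DataFrame({"city": ["Delhi", "Mumbai", "Pune", "Mumbai"]})

# handle_unknown="ignore" encodes unseen categories as all zeros at predict time
encoder = OneHotEncoder(handle_unknown="ignore")
city_encoded = encoder.fit_transform(df[["city"]]).toarray()

print(encoder.categories_[0])  # categories in sorted order
print(city_encoded)            # one column per city, one 1 per row
```

Each row gets a single 1 in the column for its city, so no city is treated as "larger" than another.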

4. LogisticRegression() — Classify Yes or No

One of the simplest classification models, and a strong baseline to try before anything more complex.

Example:

from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

Use case: Predicting whether a customer will churn, an email is spam, or a patient has a disease.
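
Besides hard yes/no predictions, logistic regression can return probabilities through predict_proba(), which lets you choose your own decision threshold. This is a rough sketch on synthetic data from make_classification. The 0.3 threshold is an arbitrary value picked for illustration, not a recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (stand-in for a real dataset)
X, y = make_classification(n_samples=200, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Column 1 holds the predicted probability of the positive class
probabilities = model.predict_proba(X_test)[:, 1]

# Lowering the threshold below 0.5 flags more rows as positive
custom_predictions = (probabilities >= 0.3).astype(int)
print(custom_predictions.sum(), model.predict(X_test).sum())
```

A lower threshold can make sense when missing a positive case (a sick patient, a fraudulent payment) costs more than a false alarm.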

5. LinearRegression() — Predict Numbers

Use when your target is a number, not a category.

Example:

from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

Use case: Predicting house prices, salary based on experience, sales based on ad spend.
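
A fitted linear model is easy to inspect: coef_ holds the slope for each feature and intercept_ holds the baseline value. The sketch below uses invented salary data that follows an exact straight line, so the model should recover the slope and intercept almost perfectly.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: salary = 5000 * years_of_experience + 30000
years = np.arange(1, 11).reshape(-1, 1)  # features must be 2D
salary = 5000 * years.ravel() + 30000

model = LinearRegression()
model.fit(years, salary)

print(model.coef_[0])      # learned increase in salary per year
print(model.intercept_)    # learned salary at zero years
print(model.score(years, salary))  # R^2 on the data used for fitting
```

Real data is noisy, so expect the R^2 score to fall below 1 and always check it on held-out data as well.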

6. RandomForestClassifier() — Powerful Classification

An ensemble of decision trees that can capture non-linear patterns logistic regression misses. It often performs well on tabular data with little tuning.

Example:

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
predictions = model.predict(X_test)

Use case: Customer churn prediction, fraud detection, disease classification.
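
A fitted random forest also exposes feature_importances_, which gives a rough idea of which columns the trees relied on most. This is a minimal sketch on synthetic data. The feature indices stand in for real column names.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 6 features, only some of them informative
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=3, random_state=42
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X, y)

# Importances are non-negative and sum to 1 across features
importances = model.feature_importances_
for index in importances.argsort()[::-1]:
    print(f"feature {index}: {importances[index]:.3f}")
```

Treat these numbers as a hint rather than proof. Impurity-based importances can be biased toward features with many distinct values.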

7. accuracy_score() — Measure Your Model

How often is your model correct?

Example:

from sklearn.metrics import accuracy_score

accuracy = accuracy_score(y_test, predictions)
print(f"Accuracy: {accuracy * 100:.2f}%")

Use case: Checking how well your classification model performs on the test set.
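
Accuracy can look impressive even when a model is useless. The sketch below uses made-up labels where only 5% of rows are positive, and a fake "model" that always predicts the majority class.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical imbalanced labels: 95 negatives, 5 positives
y_true = np.array([0] * 95 + [1] * 5)

# A "model" that always predicts the majority class
always_negative = np.zeros(100, dtype=int)

baseline_accuracy = accuracy_score(y_true, always_negative)
print(f"Baseline accuracy: {baseline_accuracy * 100:.2f}%")  # 95.00%
```

This predictor never finds a single positive case, yet it scores 95%. Always compare your accuracy against a baseline like this one.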

8. confusion_matrix() — See Where Your Model Fails

Accuracy alone doesn't tell the full story. This shows you exactly where your model goes wrong.

Example:

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

cm = confusion_matrix(y_test, predictions)
ConfusionMatrixDisplay(cm).plot()

Use case: Seeing how many predictions were correct, and what type of errors your model is making.
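
For a binary problem, the four cells of the matrix can be unpacked into named counts. A small sketch with hand-written labels (invented for illustration) shows how to read them. Rows are the true classes and columns are the predicted classes.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0, 0]

# For binary labels 0/1, ravel() returns the cells in this order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

print(f"True negatives:  {tn}")  # 3
print(f"False positives: {fp}")  # 1
print(f"False negatives: {fn}")  # 2
print(f"True positives:  {tp}")  # 2
```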

9. cross_val_score() — Test More Reliably

A single train-test split can give a lucky or unlucky score. Cross-validation averages over several splits for a more stable estimate.

Example:

from sklearn.model_selection import cross_val_score

scores = cross_val_score(model, X, y, cv=5)
print(f"Average Accuracy: {scores.mean() * 100:.2f}%")

Use case: Getting a reliable accuracy estimate by training and testing on 5 different data splits.
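
If your model needs scaled features, one common pattern is to wrap the scaler and the model in a pipeline so the scaler is refit on the training part of every fold. This is a sketch on synthetic data. The choice of StandardScaler plus LogisticRegression is an assumption for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data (stand-in for a real dataset)
X, y = make_classification(n_samples=200, n_features=5, random_state=42)

# The pipeline fits the scaler only on each fold's training portion
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

scores = cross_val_score(pipeline, X, y, cv=5)
print(scores)
print(f"Average Accuracy: {scores.mean() * 100:.2f}%")
print(f"Spread across folds: {scores.std() * 100:.2f}%")
```

Scaling X once before calling cross_val_score() would let every fold see statistics from its own validation rows, which the pipeline avoids.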

10. classification_report() — Full Model Report

Gives you precision, recall, and F1-score in one shot.

Example:

from sklearn.metrics import classification_report

print(classification_report(y_test, predictions))

Use case: Evaluating your model beyond just accuracy — especially important for imbalanced datasets.
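
If you want the numbers rather than printed text, for example to log them or compare models, classification_report() accepts output_dict=True. The labels below are invented for illustration.

```python
from sklearn.metrics import classification_report

# Hypothetical true labels and predictions
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0, 0]

# output_dict=True returns a nested dict keyed by class label (as a string)
report = classification_report(y_true, y_pred, output_dict=True)

positive_class = report["1"]
print(f"Precision: {positive_class['precision']:.2f}")  # 2 of 3 predicted positives are right
print(f"Recall:    {positive_class['recall']:.2f}")     # 2 of 4 actual positives are found
print(f"F1-score:  {positive_class['f1-score']:.2f}")
```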

Quick Reference — Save This

Function	What It Does
train_test_split()	Split data into train and test sets
StandardScaler()	Standardize features to mean 0 and standard deviation 1
LabelEncoder()	Convert categories to numbers
LogisticRegression()	Binary classification model
LinearRegression()	Numeric prediction model
RandomForestClassifier()	Powerful classification model
accuracy_score()	Measure model accuracy
confusion_matrix()	See prediction errors visually
cross_val_score()	Reliable accuracy with CV
classification_report()	Full precision, recall, F1 report

One Important Rule

Always scale your test data using the scaler fitted on training data.

# Correct ✅
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Wrong ❌
X_test_scaled = scaler.fit_transform(X_test)

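Fitting the scaler on test data uses information from the test set and scales it differently from the data the model was trained on. Your evaluation then no longer reflects how the model will behave on truly unseen data.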

Save this article and come back to it every time you start a new ML project.
