Pandas is a powerful Python library for data manipulation and analysis.

Basic Terms

Axis

In Pandas, axis refers to the direction along which an operation is applied:

Axis Direction Refers To
0 Vertical Rows (downward)
1 Horizontal Columns (across)

axis=0 means “go down the rows” (operate column-wise) axis=1 means “go across the columns” (operate row-wise)

Core Data Structures

Series

A one-dimensional labeled array holding data of any type such as integers, strings, Python objects etc.

data = {'a': 1, 'b': 2, 'c': 3}
s = pd.Series(data)
print(s)

DataFrame

A two-dimensional data structure that holds data like a two-dimension array or a table with rows and columns. Think of it like a spreadsheet or SQL table.

data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 28],
        'City': ['New York', 'London', 'Paris']}
df = pd.DataFrame(data)
print(df)

Reading and Writing Data

Pandas can read and write data from various formats like CSV, Excel, JSON, SQL databases, and more.

# Reading from CSV
df = pd.read_csv('data.csv')

# Writing to CSV
df.to_csv('output.csv', index=False)  # index=False prevents writing row indices

# Other formats:
# pd.read_excel('data.xlsx')
# pd.read_json('data.json')
# ...

Accessing Data

Columns

names = df['Name']  # Access the 'Name' column as a Series
ages = df.Age       # Alternative way to access a column

Rows

first_row = df.loc[0]  # Access the first row by label
second_row = df.iloc[1] # Access the second row by integer position

Slicing

subset = df[1:3]   # Rows 2 and 3

Data Manipulation

Filtering:

young_people = df[df['Age'] < 30]

Sorting:

df_sorted = df.sort_values(by='Age', ascending=False)

Adding Columns:

df['NewColumn'] = df['Age'] * 2

deleting Columns:

DataFrame.drop(labels=None, *, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise')[source]

Remove rows or columns by specifying label names and corresponding axis, or by directly specifying index or column names.

Applying Functions:

df['NameLength'] = df['Name'].apply(len)

Grouping:

grouped = df.groupby('City')['Age'].mean()

Combining DataFrames:

merged_df = pd.merge(df1, df2, on='KeyColumn') # Merge two DataFrames based on a common column
concatenated_df = pd.concat([df1, df2])        # Concatenate DataFrames

Handling Missing Data

df.dropna()       # Remove rows with missing values
df.fillna(0)      # Fill missing values with 0

DataFrame describe

Understand the return from describe()

Statistic Meaning
count Number of non-missing values
mean Average value
std Standard deviation (spread)
min Minimum value
25% 1st quartile (25th percentile)
50% 2nd quartile = median
75% 3rd quartile (75th percentile)
max Maximum value

mean vs. 50%

Feature Mean Median (50%)
What it is The average of all values The middle value when sorted
Formula Sum of values ÷ number of values Middle value (or average of two middle ones)
Sensitive to outliers? ✅ Yes – pulled by extreme values ❌ No – more resistant to outliers
Good for Symmetrical distributions Skewed distributions (with outliers)
  • Mean: looks at the arithmetic average — good for clean, normal data
  • 50% (median): finds the middle value — better for messy or skewed data. The median value tell that half the data is below this number, half above.
  • 25% value is the 25% item value (or average of two ones)!

DataFrame sample

DataFrame.sample(n=None, frac=None, replace=False, weights=None, random_state=None, axis=None, ignore_index=False):

Return a random sample of items from an axis of object.

  • n: Number of items from axis to return.
  • frac: Fraction of axis items to return. Cannot be used with n.
  • axis: Default is row for DataFrame type.
  • random_state: is used to control the randomness of an operation so that you get the same result every time you run your code. Just like a seed value passed to the internal random number generator. For example, by setting random_state (e.g., to 42), you ensure the same output every time your code runs.

References