Pandas is a powerful Python library for data manipulation and analysis.
Basic Terms
Axis
In Pandas, axis
refers to the direction along which an operation is applied:
Axis | Direction | Refers To |
---|---|---|
0 | Vertical | Rows (downward) |
1 | Horizontal | Columns (across) |
axis=0 means “go down the rows” (operate column-wise) axis=1 means “go across the columns” (operate row-wise)
Core Data Structures
Series
A one-dimensional labeled array holding data of any type such as integers, strings, Python objects etc.
data = {'a': 1, 'b': 2, 'c': 3}
s = pd.Series(data)
print(s)
DataFrame
A two-dimensional data structure that holds data like a two-dimension array or a table with rows and columns. Think of it like a spreadsheet or SQL table.
data = {'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 28],
'City': ['New York', 'London', 'Paris']}
df = pd.DataFrame(data)
print(df)
Reading and Writing Data
Pandas can read and write data from various formats like CSV, Excel, JSON, SQL databases, and more.
# Reading from CSV
df = pd.read_csv('data.csv')
# Writing to CSV
df.to_csv('output.csv', index=False) # index=False prevents writing row indices
# Other formats:
# pd.read_excel('data.xlsx')
# pd.read_json('data.json')
# ...
Accessing Data
Columns
names = df['Name'] # Access the 'Name' column as a Series
ages = df.Age # Alternative way to access a column
Rows
first_row = df.loc[0] # Access the first row by label
second_row = df.iloc[1] # Access the second row by integer position
Slicing
subset = df[1:3] # Rows 2 and 3
Data Manipulation
Filtering:
young_people = df[df['Age'] < 30]
Sorting:
df_sorted = df.sort_values(by='Age', ascending=False)
Adding Columns:
df['NewColumn'] = df['Age'] * 2
deleting Columns:
DataFrame.drop(labels=None, *, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise')[source]
Remove rows or columns by specifying label names and corresponding axis, or by directly specifying index or column names.
Applying Functions:
df['NameLength'] = df['Name'].apply(len)
Grouping:
grouped = df.groupby('City')['Age'].mean()
Combining DataFrames:
merged_df = pd.merge(df1, df2, on='KeyColumn') # Merge two DataFrames based on a common column
concatenated_df = pd.concat([df1, df2]) # Concatenate DataFrames
Handling Missing Data
df.dropna() # Remove rows with missing values
df.fillna(0) # Fill missing values with 0
DataFrame describe
Understand the return from describe()
Statistic | Meaning |
---|---|
count | Number of non-missing values |
mean | Average value |
std | Standard deviation (spread) |
min | Minimum value |
25% | 1st quartile (25th percentile) |
50% | 2nd quartile = median |
75% | 3rd quartile (75th percentile) |
max | Maximum value |
mean vs. 50%
Feature | Mean | Median (50%) |
---|---|---|
What it is | The average of all values | The middle value when sorted |
Formula | Sum of values ÷ number of values | Middle value (or average of two middle ones) |
Sensitive to outliers? | ✅ Yes – pulled by extreme values | ❌ No – more resistant to outliers |
Good for | Symmetrical distributions | Skewed distributions (with outliers) |
-
Mean
: looks at the arithmetic average — good for clean, normal data -
50%
(median): finds the middle value — better for messy or skewed data. Themedian
value tell that half the data is below this number, half above. -
25%
value is the 25% item value (or average of two ones)!
DataFrame sample
DataFrame.sample(n=None, frac=None, replace=False, weights=None, random_state=None, axis=None, ignore_index=False)
:
Return a random sample of items from an axis of object.
-
n
: Number of items from axis to return. -
frac
: Fraction of axis items to return. Cannot be used withn
. -
axis
: Default is row for DataFrame type. -
random_state
: is used to control the randomness of an operation so that you get the same result every time you run your code. Just like a seed value passed to the internal random number generator. For example, by settingrandom_state
(e.g., to 42), you ensure the same output every time your code runs.