Model Code Guideline

Below is a set of guidelines on how to structure the codebase so that we can consistently transform raw data into production-ready features and quickly deploy models to production environments. The primary goal is to ensure reusability, maintainability, and consistency of feature transformations.

1. Data Processing

  • Please Keep Data Transformations in Their Own Module

    • Why?
      • Separating transformations from model training logic makes it easier to reuse these transformations for inference.
      • This separation also keeps your code cleaner and more maintainable.
    • How to do it?
      • If you write Python scripts, create a dedicated Python module (e.g., data_processing.py or feature.py) containing functions or classes for each transformation.
      • If you use notebooks, create a separate section in the notebook that contains functions or classes for each transformation.
      • Each data-processing function or class should have a clearly defined input (e.g., a DataFrame or multiple DataFrames) and output (e.g., a transformed DataFrame).
      • Example:
    # data_processing.py or data processing section
    import pandas as pd

    def transform(df: pd.DataFrame) -> pd.DataFrame:
        """
        Applies developer-specific transformations to the DataFrame and returns the result.
        """
        # ...
        return transformed_df
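
    For example, the clean_text and encode_categorical helpers imported in the training and inference snippets in the next subsection might look like the sketch below. The specific cleaning and encoding rules shown here are placeholders, not required implementations:

    # data_processing.py or data processing section (continued)

    def clean_text(text: pd.Series) -> pd.Series:
        """Placeholder cleaning rules: lowercases text and strips surrounding whitespace."""
        return text.str.lower().str.strip()

    def encode_categorical(categories: pd.Series) -> pd.DataFrame:
        """One-hot encodes a categorical column into indicator columns."""
        return pd.get_dummies(categories, prefix=categories.name)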
    
  • Please Use the Same Code for Training and Inference

    • Why?
      • If you rewrite transformations separately for training and for inference, you risk introducing mismatches (training-serving skew) or errors.
      • Defining the transformations in one place and calling them from both your training code and your inference code avoids this duplication.
    • How to do it?
      • During training, call the transformations from your data-processing module before training the model.
      • During inference, import the same data-processing functions and apply them to incoming data.
    # train.py or train section
    from data_processing import clean_text, encode_categorical
    
    def train_model(training_data, model):
        df = load_data(training_data)          # load_data: your own data-loading helper
        df['text_clean'] = clean_text(df['text'])
        df_encoded = encode_categorical(df['category'])
        # ...
        # Train your model using the processed features
        model.fit(df_encoded, df['target'])
        return model
    
    # inference.py or inference section
    from data_processing import clean_text, encode_categorical
    
    def predict(model, new_data):
        df = load_data(new_data)
        df['text_clean'] = clean_text(df['text'])
        df_encoded = encode_categorical(df['category'])
        # ...
        return model.predict(df_encoded)
    

2. Model Development

Please separate your training logic from your inference (i.e., prediction) logic by developing at least two distinct scripts/modules (or notebook sections):

  • Training Script / Section
    • Purpose: Orchestrates data loading (training set), applies any data preprocessing or feature engineering, and trains the model.
    • Outputs:
      • A trained model artifact (e.g., a .pth, .pt, .h5, or .joblib/.pkl file, depending on your framework); see the save/load sketch after this list.
      • Optional: logs or metrics for evaluation (e.g., validation accuracy, loss).
  • Inference (Prediction) Script / Section
    • Purpose: Loads the trained model artifact, applies the same data preprocessing steps (in the same order and format) as in training, then runs forward passes to get predictions.
    • Inputs: The trained model artifact plus any new (unlabeled) input data.
    • Outputs: Predicted labels, numeric values, or other model-specific outputs.
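
For example, building on the training and inference sketch from Section 1, the training entry point can finish by persisting the model artifact and the inference entry point can start by loading it. The sketch below assumes a scikit-learn-style estimator serialized with joblib; for PyTorch or Keras models you would instead use torch.save/torch.load (.pt/.pth) or model.save/load_model (.h5). load_data and the artifact path are illustrative placeholders:

    # train.py or train section
    import joblib
    from data_processing import clean_text, encode_categorical

    def train_model(training_data, model, artifact_path='model.joblib'):
        df = load_data(training_data)          # load_data: your own data-loading helper
        df['text_clean'] = clean_text(df['text'])
        df_encoded = encode_categorical(df['category'])
        model.fit(df_encoded, df['target'])
        joblib.dump(model, artifact_path)      # persist the trained model artifact
        return model

    # inference.py or inference section
    import joblib
    from data_processing import clean_text, encode_categorical

    def predict(new_data, artifact_path='model.joblib'):
        model = joblib.load(artifact_path)     # load the trained model artifact
        df = load_data(new_data)
        df['text_clean'] = clean_text(df['text'])
        df_encoded = encode_categorical(df['category'])
        return model.predict(df_encoded)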

3. What to upload

Archive your project folder into a .zip containing:

  • If you use notebooks, include the following sections in the notebook. If you write Python scripts, include the following script files.
    • Data processing module
      • This ensures no training-serving skew in data transformations.
    • Model training module
    • Model inference/prediction module that handles model loading and inference logic (e.g., inference.py or predict.py):
      • Loads the model artifacts.
      • Applies the same data transformations used in training (at least, the ones necessary for inference).
      • Exposes an entry point (e.g., a function) to receive input data and produce predictions.
  • Dependency Definitions
    • List all Python libraries and versions in a requirements.txt file (see the example after this list).
  • Model Artifacts (weights/files) - Optional, as long as your results are reproducible by running your code
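
A minimal requirements.txt might look like the sketch below; the package names and version pins are illustrative only. List whatever your project actually uses, for example by running pip freeze > requirements.txt in your project environment:

    # requirements.txt (illustrative packages and versions)
    pandas==2.1.4
    scikit-learn==1.3.2
    joblib==1.3.2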

By following these guidelines, you’ll end up with a more robust, maintainable, and production-friendly workflow—allowing us to integrate the models quickly while ensuring stable, consistent predictions.