Model Code Guideline

Below is a set of guidelines on how to structure the codebase so that we can consistently transform raw data into production-ready features and quickly deploy models to production environments. The primary goal is to ensure reusability, maintainability, and consistency of feature transformations.

1. Data Processing

  • Please Keep Data Transformations in Their Own Module

    • Why?
      • Separating transformations from model training logic makes it easier to reuse these transformations for inference.
      • This separation also keeps your code cleaner and more maintainable.
    • How to do it?
      • If you write Python scripts, create a dedicated Python module (e.g., data_processing.py or feature.py) containing functions or classes for each transformation.
      • If you use notebooks, create a separate section in the notebook that contains functions or classes for each transformation.
      • Each data-processing function or class should have a clearly defined input (e.g., a DataFrame or multiple DataFrames) and output (e.g., a transformed DataFrame).
      • Example:
    # data_processing.py or data processing section
    import pandas as pd

    def transform(df: pd.DataFrame) -> pd.DataFrame:
        """
        Applies developer-specific transformations to the DataFrame and returns the result.
        """
        # ...
        return transformed_df
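
    For example, the clean_text and encode_categorical helpers imported in the training and inference snippets in the next subsection might look like the sketch below. The specific cleaning and encoding rules shown here are placeholders, not required implementations:

    # data_processing.py or data processing section (continued)

    def clean_text(text: pd.Series) -> pd.Series:
        """Placeholder cleaning rules: lowercases text and strips surrounding whitespace."""
        return text.str.lower().str.strip()

    def encode_categorical(categories: pd.Series) -> pd.DataFrame:
        """One-hot encodes a categorical column into indicator columns."""
        return pd.get_dummies(categories, prefix=categories.name)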
    
  • Please Use the Same Code for Training and Inference

    • Why?
      • If you rewrite transformations separately for training and for inference, you risk introducing mismatches (training-serving skew) or errors.
      • Defining the transformations in one place and calling them from both your training code and your inference code avoids this duplication.
    • How to do it?
      • During training, call the transformations from your data-processing module before training the model.
      • During inference, import the same data-processing functions and apply them to incoming data.
    # train.py or train section
    from data_processing import clean_text, encode_categorical
    
    def train_model(training_data, model):
        df = load_data(training_data)          # load_data: your own data-loading helper
        df['text_clean'] = clean_text(df['text'])
        df_encoded = encode_categorical(df['category'])
        # ...
        # Train your model using the processed features
        model.fit(df_encoded, df['target'])
        return model
    
    # inference.py or inference section
    from data_processing import clean_text, encode_categorical
    
    def predict(model, new_data):
        df = load_data(new_data)
        df['text_clean'] = clean_text(df['text'])
        df_encoded = encode_categorical(df['category'])
        # ...
        return model.predict(df_encoded)
    

2. Model Development

Please separate your training logic from your inference (i.e., prediction) logic by developing at least two distinct scripts/modules (or notebook sections):

  • Training Script / Section
    • Purpose: Orchestrates data loading (training set), applies any data preprocessing or feature engineering, and trains the model.
    • Outputs:
      • A trained model artifact (e.g., a .pth, .pt, .h5, or .joblib/.pkl file, depending on your framework); see the save/load sketch after this list.
      • Optional: logs or metrics for evaluation (e.g., validation accuracy, loss).
  • Inference (Prediction) Script / Section
    • Purpose: Loads the trained model artifact, applies the same data preprocessing steps (in the same order and format) as in training, then runs forward passes to get predictions.
    • Inputs: The trained model artifact plus any new (unlabeled) input data.
    • Outputs: Predicted labels, numeric values, or other model-specific outputs.
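
For example, building on the training and inference sketch from Section 1, the training entry point can finish by persisting the model artifact and the inference entry point can start by loading it. The sketch below assumes a scikit-learn-style estimator serialized with joblib; for PyTorch or Keras models you would instead use torch.save/torch.load (.pt/.pth) or model.save/load_model (.h5). load_data and the artifact path are illustrative placeholders:

    # train.py or train section
    import joblib
    from data_processing import clean_text, encode_categorical

    def train_model(training_data, model, artifact_path='model.joblib'):
        df = load_data(training_data)          # load_data: your own data-loading helper
        df['text_clean'] = clean_text(df['text'])
        df_encoded = encode_categorical(df['category'])
        model.fit(df_encoded, df['target'])
        joblib.dump(model, artifact_path)      # persist the trained model artifact
        return model

    # inference.py or inference section
    import joblib
    from data_processing import clean_text, encode_categorical

    def predict(new_data, artifact_path='model.joblib'):
        model = joblib.load(artifact_path)     # load the trained model artifact
        df = load_data(new_data)
        df['text_clean'] = clean_text(df['text'])
        df_encoded = encode_categorical(df['category'])
        return model.predict(df_encoded)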

3. What to upload

Archive your project folder into a .zip containing:

  • If you use notebooks, include the following sections in the notebook. If you write Python scripts, include the following script files.
    • Data processing module
      • This ensures no training-serving skew in data transformations.
    • Model training module
    • Model inference/prediction module that handles model loading and inference logic (e.g., inference.py or predict.py):
      • Loads the model artifacts.
      • Applies the same data transformations used in training (at least, the ones necessary for inference).
      • Exposes an entry point (e.g., a function) to receive input data and produce predictions.
  • Dependency Definitions
    • List all Python libraries and versions in a requirements.txt file (see the example after this list).
  • Model Artifacts (weights/files) - Optional, as long as your results are reproducible by running your code
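
A minimal requirements.txt might look like the sketch below; the package names and version pins are illustrative only. List whatever your project actually uses, for example by running pip freeze > requirements.txt in your project environment:

    # requirements.txt (illustrative packages and versions)
    pandas==2.1.4
    scikit-learn==1.3.2
    joblib==1.3.2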

By following these guidelines, you’ll end up with a more robust, maintainable, and production-friendly workflow—allowing us to integrate the models quickly while ensuring stable, consistent predictions.