https://doi.org/10.5281/zenodo.14318846
17 December 2024, 12:45:03 UTC
    To reference or cite the objects present in the Software Heritage archive, permalinks based on SoftWare Hash IDentifiers (SWHIDs) must be used.

    • content: swh:1:cnt:f1194455a28f8e83f0d38c585f689128a86d92ed
    • directory: swh:1:dir:7e02716be0f7d098f9aea3075eb51d7842c5a320
    • snapshot: swh:1:snp:05a2af42b588522ca08f036c1f785d8457dcf25e
    • release: swh:1:rel:aa35d9e39d94cbf3f73362c2c2b5cd04c355e955

    functions.py
    # -*- coding: utf-8 -*-
    """
    Created on Tue Oct 27 08:44:18 2020
    
    Various functions used by the filtering scripts. The functions are imported and
    applied in other scripts.
    
    Functions and purposes:
        convert_coords - convert coordinates from WGS84 to the UTM 32N projection
        calc_distance - calculate distance between successive points
        calc_vel - calculate velocity
        add_discharge - adds discharge to the nearest 10 m^3/s. Requires the
                        discharge data to be hourly. This code is a prerequisite to
                        filtering if in the river or not (as that code relies on
                        river shapefiles for every 10 m^3/s discharge)
        add_discharge_cont - same as above but discharge as a continuous variable
        filter_if_in_river_all_discharges - filters if points are within river
                                            shapefiles that exist for every 10 m^3/s
                                            discharge, output is a df with points
                                            outside the river dropped
        if_in_shapefile - function to identify if point in shapefile or not
        add_temp - add hourly temperature to the data (can easily be done in R now,
                   and likely will be since it is needed for SSF data).
                   Temperature also needs to be extracted from the receivers
                   instead, so this code is unused in later data processing
        filter_time_gap - filter data based on time between successive points and
                          remove points under a defined threshold
        convert_decimal_hours - convert time to decimal hours. Also defunct, as
                                this can be done more easily in R at a later stage.
    
    @author: Rachel
    """
    
    
    #------ getting set up ---------------#
    import os
    import pandas as pd
    import pyproj
    import numpy as np
    import shapefile
    #import matplotlib.pyplot as plt
    
    import shapely.geometry
    #from shapely.geometry import shape, Point
    from shapely.geometry import Point
    
    
    #%% 
    #function to convert coordinates
    def convert_coords(data_frame,longitude_heading,latitude_heading):
        '''
        function to convert coords from wgs to utm 32N
    
        Parameters
        ----------
        data_frame : fish data frame
        longitude_heading : column name of longitude to be converted
        latitude_heading : column name of latitude to be converted
    
        Returns
        -------
        adds new columns to df as x and y
    
        '''
        #transformer from the input projection WGS84 (epsg:4326)
        #to the output projection UTM 32N (epsg:32632)
        #always_xy=True keeps the (long, lat) order, otherwise long and lat are switched around
        transformer = pyproj.Transformer.from_crs("epsg:4326", "epsg:32632",
                                                  always_xy=True)
        #transform original coords x1,y1 to x2,y2
        x2, y2 = transformer.transform(data_frame[longitude_heading].values,
                                       data_frame[latitude_heading].values)
        
        #assign converted coordinates to new columns in the data frame
        data_frame['x'], data_frame['y'] = x2, y2
        return
    
    #so to summarise this function
    #takes input and output projections
    #transforms the coordinates
    #saves new coordinates to the specified data frame as a new column
    
    
    #%%
    #function for distance between 2 points
    
    def calc_distance(data_frame,x_coord,y_coord):
        '''
        calculates distance between successive points and saves output as new column
    
        Parameters
        ----------
        data_frame : data frame
        x_coord : column name
        y_coord : column name
    
        Returns
        -------
        adds new column 'distance_travelled' to df
    
        '''
        #calculate differences between value and preceding entry for x and y
        delta_x = data_frame[x_coord].diff()
        delta_y = data_frame[y_coord].diff()
        #formula to calculate distance
        distance = (delta_x**2+delta_y**2)**0.5
        #assign distance to data frame
        data_frame['distance_travelled'] = distance
        return
    
    
    #%%
    #function for velocity
    
    def calc_vel(data_frame,date_time_source,distance_heading):
        '''
        function to calculate velocity + add as column to the dataframe
    
        Parameters
        ----------
        data_frame : data frame
        date_time_source : DatetimeIndex of the data frame, e.g. data_frame.index
        distance_heading : column name of distance column
    
        Returns
        -------
        None.
    
        '''
        #convert index to a series that can be used in calculations
        time = date_time_source.to_series()
        #work out difference in times
        delta_time = time.diff()
        #convert to seconds
        time_s = delta_time/np.timedelta64(1,'s')
        #calculate velocity
        velocity = data_frame[distance_heading]/time_s
        #assign velocity to column
        data_frame['fish_velocity'] = velocity
        #also keep the time difference (in seconds) as a column
        data_frame['time_diff'] = time_s
        return 
    
    
    
    #%%
    #function to split based on discharge; hourly
    #flow_data_hourly = pd.read_csv('hourly_flow.csv', sep=',', index_col='date')
    #fish_df = manip_data.copy()
    #discharge_df = flow_data_hourly
    
    
    def add_discharge(discharge_df,fish_df_og):
        '''
        Function to add corresponding hourly discharge values (to nearest 10) 
        to every fish location. Assign output to dataframe.
    
        Parameters
        ----------
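        discharge_df : data frame of hourly discharge values (column 'Q'), indexed
                       by date strings formatted dd/mm/YYYY HH:MM (the merge key)
        fish_df_og : data frame of fish data, indexed by time stamp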
    
        Returns
        -------
        merged : merged data frame
    
        '''
        #create separate date column so can use as reference merger
        discharge_df['Date'] = discharge_df.index
        #round discharges to nearest 10 
        discharge_df['discharge_nearest_10'] = discharge_df['Q'].round(-1)
        #copy the fish data so the original data frame is not altered
        fish_df = fish_df_og.copy()
        #first need to convert from time stamps into date + hour
        #this makes it compatible with the hourly discharge
        #below assigns the index to a variable; not in the dataframe yet as that did not work
        hourly = fish_df.index
        hourly = pd.to_datetime(hourly)
        #code below to round to nearest hour
        hourly = hourly.round('H')
        #now convert to string and get rid of extras; adds to dataframe as a column
        fish_df['Date'] = hourly.strftime('%d/%m/%Y %H:%M')
        #save index as new column as merging it with flow will get rid of it
        fish_df['Time'] =fish_df.index
        #merge the two based on date
        merged = pd.merge(fish_df,discharge_df,on='Date')
        #recreate the index
        merged.index = merged['Time']
        #drop unnecessary columns
        merged = merged.drop(['Date','Time','Q'],axis=1)
    
        return merged
    
    #different version, single output
    #multiple outputs generated in next step
    #%% add discharge as a continuous variable
    
    def add_discharge_cont(discharge_df,fish_df):
        '''
        Function to add corresponding hourly discharge values 
        to every fish location. Assign output to dataframe.
    
        Parameters
        ----------
        discharge_df : file containing discharge values
        fish_df : file with fish data
    
        Returns
        -------
        merged : merged data frame
    
        '''
        #create separate date column so can use as reference merger
        discharge_df['Date'] = discharge_df.index
        #keep discharge as a continuous variable (no rounding)
        discharge_df['discharge'] = discharge_df['Q']
        #first need to convert from time stamps into date + hour
        #this makes it compatible with the hourly discharge
    
        hourly = fish_df.index
        hourly = pd.to_datetime(hourly)
        #code below to round to nearest hour
        hourly = hourly.round('H')
        #now convert to string and get rid of extras; adds to dataframe as a column
        fish_df['Date'] = hourly.strftime('%d/%m/%Y %H:%M')
        #save index as new column
        fish_df['Time'] =fish_df.index
        #merge the two based on date
        merged = pd.merge(fish_df,discharge_df,on='Date')
        #recreate the index
        merged.index = merged['Time']
        #drop unnecessary columns
        merged = merged.drop(['Date','Time','Q'],axis=1)
    
        return merged
    
    #different version, single output
    #multiple outputs generated in next step
    
    
    
    
    #%%
    #function to filter if in river or not
    #based on discharge - currently set for daily
    #this function will go inside the "split by discharge" one
    
    #this has been edited to use the 80 m^3/s discharge shapefile to check if in river
    #when discharge >84.9 (i.e. when it rounds to more than 80)
    #not tested yet
    
    
    def filter_if_in_river_all_discharges(data_frame, shp_file_folder,lon_location,lat_location):
        '''
        A function to filter if a point is in the river or not. Rows containing
        points outside the shapefile are removed. This function takes a dataframe
        with multiple discharge values and uses a separate shapefile for each
        discharge value present
        
        Parameters
        ----------
        data_frame : data frame of fish points with multiple discharge values
        shp_file_folder : folder where all shapefiles are located (include the trailing separator)
        lon_location : column name for the longitude: currently using UTM format so 'x'
        lat_location : column name for the latitude: currently using UTM format so 'y'
    
        Returns
        -------
        filtered_df : the filtered data frame with points dropped.
    
        '''
        #create copy of dataframe to avoid manipulating original
        data_frame_copy = data_frame.copy()
        #discharges over 80 use the 80 shapefile, so cap the values at 80
        discharges = data_frame_copy['discharge_nearest_10'].astype(int).clip(upper=80).to_numpy()
        #coordinates as arrays so rows are read by position, whatever the index is
        x_values = data_frame_copy[lon_location].to_numpy()
        y_values = data_frame_copy[lat_location].to_numpy()
        
        #read each shapefile once (not once per row) and keep its first polygon
        polygons = {}
        for discharge_value in np.unique(discharges):
            #create shapefile name using discharge value
            shp_file_location = shp_file_folder+'wet_'+str(discharge_value)+'m3s.shp'
            #read shapefile
            shp_file = shapefile.Reader(shp_file_location)
            #create polygon
            polygons[discharge_value] = shapely.geometry.shape(shp_file.shapes()[0])
        
        #check each point against the polygon for its discharge
        in_river = [polygons[q].contains(Point(x, y))
                    for q, x, y in zip(discharges, x_values, y_values)]
        
        #this filters the data and keeps only points in the river
        filtered_df = data_frame_copy[np.array(in_river, dtype=bool)]
    
        return filtered_df
    
    #function creates a filtered df
    
    
    
    #%%
    #function to determine if in a shapefile e.g. area near ladder, within ladder
    
    def if_in_shapefile(data_frame, shp_file_name,lon_location,lat_location,
                        output_col_name, number):
        '''
        function to determine if a point is within a specified shapefile. It
        assigns a True or False value for points within and outwith the specified
        shapefile. 
    
        Parameters
        ----------
        data_frame : data frame containing the points.
        
        shp_file_name : opened shapefile reader (shapefile.Reader), not a file name
        
        lon_location : column name containing longitude (in this case x as utm shapefile)
        
        lat_location : column name containing latitude (in this case y as utm shapefile)
        
        output_col_name : name of column where want true/false saved
        
        number : index of the shape to use within the shapefile;
                    for 'rectangle' it is 1; for 'ladder' it is 0.
    
        Returns
        -------
        alters the supplied dataframe, no returns
    
        '''
        shp_shapes = shp_file_name.shapes()
        #the index of the wanted shape depends on the shapefile, hence 'number'
        shp_polygon = shapely.geometry.shape(shp_shapes[number])
        #coordinates as arrays so rows are read by position, whatever the index is
        x_values = data_frame[lon_location].to_numpy()
        y_values = data_frame[lat_location].to_numpy()
        #mark if each point is in the shape or not
        data_frame[output_col_name] = [shp_polygon.contains(Point(x, y))
                                       for x, y in zip(x_values, y_values)]
        return
    
    #%%
    #function to assign temperature
    
    def add_temp(temp_data,fish_df):
        '''
        function to add hourly temperature as a column to the dataframe
    
        Parameters
        ----------
        temp_data : source of temperature data.
        
        fish_df : fish data frame
    
        Returns
        -------
        data_with_temp : copy of the fish data frame with the hourly temperature added
    
        '''
        #copy data frame for manipulation
        fish_df = fish_df.copy()
        #code to merge temp to time, based off the discharge code
        #create separate date column so can use as reference merger
        temp_data = temp_data.copy()
        temp_data['Date'] = temp_data.index
        #first need to convert from time stamps into date + hour
        #this makes it compatible
        #below assigns the index to a variable; not in the dataframe yet as that did not work
        hourly = fish_df.index
        #code below to round to nearest hour
        hourly = hourly.round('H')
        #now convert to string and get rid of extras; adds to dataframe as a column
        fish_df['Date'] = hourly.strftime('%d/%m/%Y %H:%M')
        #save index as new column
        fish_df['Time'] = fish_df.index
        #merge the two based on date
        data_with_temp = pd.merge(fish_df,temp_data,on='Date')
        #recreate the index
        data_with_temp.index = data_with_temp['Time']
        #drop unnecessary columns
        data_with_temp = data_with_temp.drop(['Date','Time'],axis=1)
        return data_with_temp
    
    #%%
    #function to filter time points based on difference between them
    
    
    
    def filter_time_gap(data_frame,upper_limit):
        '''
        function to filter dataframe to remove points that follow the previous point too closely in time.
    
        Parameters
        ----------
        data_frame : data frame of fish locations between which want to filter
        
        upper_limit : time range (in seconds) below which we want to discard points
    
        Returns
        -------
        data_frame : a data frame for assigning to a variable
        '''
        #copy df
        data_frame = data_frame.copy()
        #time difference between points in rectangle
        #convert index to a series that can be used in calculations
        time = data_frame.index
        time = time.to_series()
        #work out difference in times
        delta_time = time.diff()
        #convert to seconds
        delta_time = delta_time/np.timedelta64(1,'s')
        #assign to sheet
        data_frame['time_between_points_in_rect'] = delta_time
        #filter locations within specified limit
        data_frame = data_frame.loc[data_frame['time_between_points_in_rect'].isnull() | (data_frame['time_between_points_in_rect']>=upper_limit)]
        return data_frame
    
    #%%
    
    #function to convert time of day to decimal time
    #makes it easier to use as a predictor in models when in decimal hours
    
    def convert_decimal_hours(data_frame, input_column):
        '''
        function to convert a time stamp (in hh:mm:ss format) to decimal hours. The
        result is saved to the input data frame, as a new column 'decimal_hour'
    
        Parameters
        ----------
        data_frame : data_frame needed
        
        input_column : time column, index e.g. data.index or data[column]
    
    
        '''
        #convert input to a DatetimeIndex so that an index or a column both work
        date_time = pd.DatetimeIndex(pd.to_datetime(input_column))
        #do the maths on the hour, minute and second components
        dec_hour = date_time.hour+(date_time.minute/60)+(date_time.second/3600)
    
        #assign by position
        data_frame['decimal_hour'] = np.asarray(dec_hour)
    
        return
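
The movement metrics computed by calc_distance, calc_vel and filter_time_gap can be checked by hand on a small track. The sketch below is only an illustration: the four fixes and the 5 s threshold are made up, and the formulas are re-typed inline rather than imported from functions.py so that the snippet runs on its own.

```python
import numpy as np
import pandas as pd

# Hypothetical track: four fixes in UTM metres, indexed by time stamp.
track = pd.DataFrame(
    {"x": [0.0, 3.0, 3.0, 6.0], "y": [0.0, 4.0, 4.0, 8.0]},
    index=pd.to_datetime(
        ["2020-10-27 08:00:00", "2020-10-27 08:00:05",
         "2020-10-27 08:00:06", "2020-10-27 08:00:16"]
    ),
)

# Step length between successive fixes (same formula as calc_distance).
track["distance_travelled"] = np.hypot(track["x"].diff(), track["y"].diff())

# Seconds between successive fixes (same conversion as calc_vel).
track["time_diff"] = track.index.to_series().diff() / np.timedelta64(1, "s")
track["fish_velocity"] = track["distance_travelled"] / track["time_diff"]

# Keep the first fix and any fix at least 5 s after the previous one
# (same rule as filter_time_gap with upper_limit=5).
kept = track[track["time_diff"].isnull() | (track["time_diff"] >= 5)]
print(kept[["distance_travelled", "time_diff", "fish_velocity"]])
```

The first row has no predecessor, so its distance, time difference and velocity are NaN; filter_time_gap keeps such rows through the isnull() test. The time gaps are computed once, before filtering, so they are not recalculated after a point has been dropped.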
    
    

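
The discharge-dependent river filter in filter_if_in_river_all_discharges can be tried without any shapefiles. The sketch below is a minimal illustration under stated assumptions: the two square polygons stand in for the wet_10m3s.shp and wet_80m3s.shp outlines, and the points and the helper name in_river are hypothetical. Only the lookup logic follows the function: one polygon per rounded discharge, with anything above 80 falling back to the 80 m^3/s polygon.

```python
import pandas as pd
from shapely.geometry import Point, Polygon

# Hypothetical wetted areas: the river outline grows with discharge.
wetted = {
    10: Polygon([(0, 0), (10, 0), (10, 10), (0, 10)]),
    80: Polygon([(0, 0), (20, 0), (20, 20), (0, 20)]),
}

# Hypothetical fish positions with their rounded discharge.
points = pd.DataFrame({
    "x": [5.0, 15.0, 15.0, 25.0],
    "y": [5.0, 15.0, 15.0, 5.0],
    "discharge_nearest_10": [10, 10, 90, 80],
})

def in_river(row):
    # Discharges above 80 fall back to the 80 m^3/s polygon.
    key = min(int(row["discharge_nearest_10"]), 80)
    return wetted[key].contains(Point(row["x"], row["y"]))

# True where the point lies inside the polygon for its discharge.
mask = points.apply(in_river, axis=1).astype(bool)
filtered = points[mask]
print(filtered)
```

The same position (15, 15) is dropped at a discharge of 10 but kept at 90, which is why the filter has to choose the polygon row by row instead of using a single river outline.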