xkcd comic about getting python high

Python 101

  1. Python is a programming language where whitespace matters!! Indentation delimits blocks. File portability between Windows and Unix can also become a problem, due to the use of CR/LF vs LF line endings on these platforms.
  2. Python 3 is a newer version of the language that supersedes Python 2. It adds constructs and changes some existing ones, so a program written for Python 3 won't necessarily run in Python 2.x (and vice versa).
  3. ; is largely unneeded. Statements end at the newline. ; can be used to separate statements when trying to write a one-liner in the interactive shell.
  4. Anaconda: a distribution of Python, with managed core modules and some 195 libraries. It is especially useful for getting Python on Windows and Mac.
  5. Miniconda is a lightweight version of Anaconda. It doesn't come with Anaconda's huge library; it lets the user pull in only what is needed.
  6. PyPI - Python Package Index, intended to be a comprehensive catalog of all open source Python packages.
  7. pip - the package installer for PyPI ("pip installs packages"). As of 2015, mostly just use pip.
  8. IPython: interactive shell for Python (and other languages now).
    pip install ipython
    pip install ipython[notebook]
  9. Jupyter/IPython notebook: This allows writing text interspersed with Python code. Good for testing ideas, data crunching and visualization types of projects. Anaconda comes with this, and the server typically runs at http://localhost:8888
    Ref: nbviewer
    And this OUseful blog describes 7 ways of running IPython notebooks. I particularly liked the CoLaboratory implementation on Google Drive. Authorea's support for IPython notebooks (as part of its web authoring platform) on the cloud was pretty neat too.
  10. No pointers?? see below
  11. The Python2 FAQ is a good read once beyond the syntactic sugar and needing to know more internals in a real programming project.

Tools from Python Libraries

Python idiosyncrasies

If running a .py script gives an error of " : File not found", check that the lines of the python script do not end with DOS ^M (CR) characters: the interpreter name on the #! line is then read with the ^M attached and cannot be found. If needed, cat | dos2unix > and run the new script. It is a weird error, and I thought most programs can handle ^M these days...

#!/usr/bin/env python
Many python scripts start with that line. It effectively searches the user's PATH environment variable to find out which (where) python is defined and runs it as the interpreter. Calling #!/home/username/bin/python may not always work, as PYTHONLIB won't be set up (unless done in the code).

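Python source is compiled into byte-code, which the interpreter then executes.
Typically a .pyc file is now produced automatically when a module is first imported, so the first run incurs a byte-code compilation delay (this is not JIT compilation; later runs reuse the cached .pyc).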

Python 2 vs Python 3

Things to watch out for to write code that is more portable between python2 and python3

avoid has_key(), which exists only in python2.
ie, avoid dict.has_key(k)
instead use k in dict, which works in both python2/3 (k in dict.iterkeys() works in python2 only)

use print(x) rather than print x, the latter does not work in python 3.

float division: eg, 27/3 in python3 automatically does floating point division (9.0); python2 does integer division (9) when both operands are int, unless using:
from __future__ import division
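
A quick sketch of the two division operators, assuming python3 (the helper name divide_both_ways is made up for illustration). In python2 the / results would match only after the __future__ import above.

```python
def divide_both_ways(a, b):
    """Return (true division, floor division) of a by b."""
    return a / b, a // b

print(divide_both_ways(27, 3))   # python3: (9.0, 9)
print(divide_both_ways(7, 2))    # python3: (3.5, 3)
```

// is the portable way to ask for integer (floor) division in both python2 and python3.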

python3 (3.6 and later) can use _ as a thousand separator in numeric literals (instead of comma). The groups don't have to be 3 digits; the underscores are for human reading and are stripped by the interpreter. Also note that it is *NOT* a "decimal point" like cents and dollars when only two digits are in a group. (Why do Spanish and some other languages reverse the roles of comma and period?) eg:
>>> print( 5_000_111_000_222_021 - 4_20010 )
>>> print( 5_4321 + 100_00 )

Distributions of Python

  1. avail for Windows, Linux, ...
  2. anaconda, by Win, OS X, Linux. Free for commercial use too. Include 300 popular packages. Additional packages can be installed via pip (eg awscli).
  3. ActiveState (mostly for windows? personally try to avoid, even though nothing wrong with it really, just kinda non standard).


  • PYTHONPATH - module search path
  • sys.path - despite the name, this is not the system PATH; it is the list of directories python searches for modules, built from the script's directory, PYTHONPATH, and the installation defaults.
  • PYTHONSTARTUP - interactive startup file (commands listed here would be run as if they were typed in interactive shell).
    # one good way to use the environment's value, but if not set, can at least have some default.
    import os
    import sys
    if 'SPARK_HOME' not in os.environ:
        os.environ['SPARK_HOME'] = '/srv/spark'
    SPARK_HOME = os.environ['SPARK_HOME']
    sys.path.insert(1, os.path.join( SPARK_HOME, "python", "build"))
    if 'PYTHONPATH' not in os.environ:
        os.environ['PYTHONPATH'] = "/home/system_web/local_python_2.7.9/lib/python2.7/site-packages/"
    PYTHONPATH = os.environ['PYTHONPATH']
    # os.path.join() discards everything before an absolute component,
    # so add each directory as its own sys.path entry instead of joining them
    sys.path[1:1] = ["/home/system_web/local_python_2.7.9/lib/python2.7/site-packages/", "/opt/python/2.7.9/lib/python2.7/site-packages/"]
    ## sys.path does not check for duplicates,
    ## and it does not complain if a directory doesn't exist (it is just skipped at import time).

    Setting up the program on NixOS

    System-wide, edit /etc/nixos/configuration.nix
    environment.systemPackages = with pkgs; [
        python27Packages.ipython 			# does NOT provide python
        python27Full 				# ipython does NOT depend on this package
    ];
    As user:
    nix-env -i python-2.6.9

    Setting up the program on Windows has a step-by-step install, including how to set up environment variables. The download page for Anaconda includes a version for Windows (and OS X, Linux).


    PyPI = Python Package Index - equiv of CPAN.
    Allows installation of python libraries using "pip"
    sudo yum install python-pip
        -or- python install 
    pip install easybuild
    pip list
    pip show      easybuild
    pip uninstall enum
    pip   install enum34
    Libraries are installed to /usr/lib/pythonN.N/site-packages/
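    List installed modules/libraries/packages
    pip list
    # pip.get_installed_distributions() was removed in pip 10; on python 3.8+ importlib.metadata gives the same listing
    from importlib.metadata import distributions
    sorted(["%s==%s" % (d.metadata["Name"], d.version) for d in distributions()])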

    Python Environment/Virtual Environment

    (Contrast this with Continuum's Conda, which does this and some more)
    python3.6 -m venv ./venv 		# create a virtual env (python 3.6)
    source ./venv/bin/activate 
    pip install -r requirements.txt		# list all pip install packages in a requirement file
    ## below are old school, obsolete way to create/invoke virtual env
    pyvenv ~/local_python_3.4		# create a virtual env (python 3.4)
    					# in python 2.7, use virtualenv ~/local_python_2.7 instead
    					# create vir env once for each version of python being used
    source ~/local_python_3.4/bin/activate	# activate virt env (change path accordingly for diff version of python)
    pip install  scipy			# install module into virtual env using pip  (eg for installing scipy)
    easy_install scipy			# install module into virtual env using easy_install (alternate of pip, don't need to do both)

    Python Module

    Python libraries are provided as modules, which can be imported. Python searches for modules in the directories listed in sys.path, which includes those in the environment variable PYTHONPATH.
    Types of modules:
    When writing python modules, it is best if they do not output anything. Otherwise, when a consumer does import module_foo it would essentially execute the code in that module and output things that may not be desirable. has a good overview of modules:
  • __name__ contains the name of the module itself
  • if __name__ == '__main__' : # module is being executed directly, can place desired execution statements in here
  • when importing a module, all statements and definitions will be executed when the import is run (once). This is why if there are any print statements in the main body of the module, they will be printed out at the import
  • dir( __builtin__ ) lists all the names defined by python built-ins (the module is named builtins in python3). Can use dir() with any module.
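
A minimal sketch of a module that stays quiet on import, assuming it is saved under the made-up name module_foo.py. The function names are hypothetical too.

```python
# module_foo.py (hypothetical file name)

def greet(name):
    """Return a greeting; importing the module runs nothing else."""
    return "hello, %s" % name

def main():
    # Side effects (printing, file writes) live here, not at module level.
    print(greet("world"))

if __name__ == "__main__":
    # True only when run as `python module_foo.py`, not on `import module_foo`.
    main()
```

With this layout, import module_foo only defines greet() and main(); nothing is printed until the file is run directly.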

    Packages - python dotted modules name. ie, provide hierarchy.

    Python Language

    0 is the index of the first element (like perl.  unlike awk, which is 1).
    [ ] = list.  ordered items.  think of array in most languages.
          in python, it can also behave somewhat like a stack.  ie think push/pop.  
          [].append() adds items (push).
          in reality (CPython) it is a dynamic array, not a linked-list.  items can still be removed from the middle of the list, at the cost of shifting the later items.
          	myList = [ 'a', 'b', 'c', 'c' ]
    	myList[1] 	# evals to 'b'		# array syntax, 0-based index
    	myList[-2]				# -ve counts from the end, returns 2nd from last item
    	myList[1:3]				# slice
    	myList.append( 'e' )			# add item to list
    	del myList[2]				# shrink list
    	L2 = [ 'ab', ['cde', 'fgh' ] ]		# nested list
    	len(L2)					# length of list, in this eg returns 2
    	L2[i][j]				# 2D array index for nested array.
    	for x in myList :			# x takes the value of each item in turn
    ( ) = tuple, contains ordered elements.  *immutable*  
          Strings are not tuples, but like tuples they are immutable sequences.
    	point = (x, y)
    	t1 = ( 2, 2 )
    	t2 =   2, 4, 2, 8  			# () syntax is optional when there is no ambiguity
    	t3 = ('xy', ('abc', 'def', 'ghi') )	# nested tuples
    	t3[i][j]				# 2D array syntax works on nested tuples as well
    	t2[1:3]					# slice syntax works on tuples too (but not on sets, which are unordered)
    	for x in t2 :	 			# x takes the value of each item in turn
    	items = set( myList )			# set: unordered, unique items.  dup 'c' will be removed
    { } = dictionary/hash.  key -> value list.   eg ENV['HOME']  = '/nfshome/tin'      # Perl: $ENV{HOME} = "/nfshome/tin"
    	dictionaries are mappings, not sequences.
    	codon['ATG'] = 'methionine'
    	codon.keys()				# python2 also has iterkeys(); python3 has only keys(), which returns a view
    	len( codon )
    	resultTable['species']['homo sapiens'] = 1
    	resultTable = { 'species' : { 'homo sapiens': 1} }	# nesting, what is really happening for 2D hash above
    	for k in codon : 			# same as for k in codon.keys()
    		print( codon[k] )		# iteration for hash is automatically on hash key 
    Additional container datatypes, see 
    python3 collections
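
A short runnable sketch of the list / tuple / set / dict notes above. All variable names and sample values are made up for illustration.

```python
my_list = ['a', 'b', 'c', 'c']
my_list.append('e')          # push onto the end
del my_list[2]               # removes the first 'c'

point = (3, 4)               # tuple: ordered and immutable
try:
    point[0] = 10
except TypeError as err:
    print("tuples are immutable:", err)

unique_items = set(['a', 'b', 'c', 'c'])   # duplicates collapse

codon = {'ATG': 'methionine'}              # dict: key -> value mapping
codon['AAA'] = 'lysine'
result_table = {'species': {'homo sapiens': 1}}   # nested ("2D") dict

print(my_list[-2], sorted(unique_items), len(codon))
print(result_table['species']['homo sapiens'])
```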
    Things that evaluate to False: 
    False    	# built-in boolean. but does not take FALSE
    0		# int
    0.0		# float
    0j		# complex
    bool( 0j )	# type-cast, returns False
    ""		# empty string
    []		# empty list
    {}		# empty dictionary
    ()		# empty tuple (an empty set is set())
    None		# types.NoneType  a function that should return an object but just issues "return" will get None. kinda like NULL
    when __len__() returns 0		# eg object of a user-defined class reporting 0 length
    when __bool__() returns False	# eg object of a user-defined class defining its own boolean value (__nonzero__() in python2)
    Other built-in constants:
    True		# built-in bool type, does not take TRUE. 
    Ellipsis	# used in slicing syntax
    __debug__ 	# true if not started as python -O 
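
The falsy values can be checked directly with bool(). The two classes below are made-up examples of the __len__ / __bool__ hooks.

```python
class EmptyBasket(object):
    def __len__(self):
        return 0               # zero length -> falsy

class AlwaysNo(object):
    def __bool__(self):
        return False
    __nonzero__ = __bool__     # python2 spelling of the same hook

falsy = [False, 0, 0.0, 0j, "", [], {}, (), set(), None, EmptyBasket(), AlwaysNo()]
print(all(not bool(v) for v in falsy))

# non-empty containers and strings are truthy, whatever they contain
print(bool("0"), bool([0]))
```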
    * and **
    *  eg *args 	# tuple of positional args
    ** eg **kwargs	# dictionary (key, val) variable list of named args
    def fn1( *args ) :
    	for (i, arg) in enumerate( args ) :
    		print( i, arg )
    	# *args  is for variable number of arguments
    def fn2( **kwargs ) :
    	for (name,val) in kwargs.items() :
    		print( name )
    	# **kwargs is for variable number of named arguments
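
A small sketch showing both directions of * and **: collecting arguments in a definition and unpacking them at a call. The function name describe is made up.

```python
def describe(*args, **kwargs):
    """Return the positional and named arguments as received."""
    return args, kwargs

args, kwargs = describe(1, 2, rank="genus")
print(args)      # a tuple of the positional arguments
print(kwargs)    # a dict of the named arguments

# * and ** also unpack a sequence / dict at the call site
coords = [3, 4]
options = {"rank": "species"}
print(describe(*coords, **options))
```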
    % is the old formatting magic in python (C printf style).  the newer one is {} with str.format()
    print( "most c-style string %s works" % stringvar )
    print( "num %d, fixed point %8.2f, exponent %12e"  % ( 123, 6.1234, 0.0000123 ) )
    print( "Total rows processed: {:,}".format(rows )  )  # {:,} provides thousand separators
    strings, like tuples, are immutable.
    if 'abc' in StringVariable : 	# searching string to see if it contains a substring
    if 'abc' == StringVariable :	# the two strings are the same
    if  Foo  is  Bar :		# see if two names refer to the same object (which would then mean same string, but this is OBJECT comparison!)
    "" vs ''	# no difference in python, pick whichever avoids escaping the quotes inside.  It is NOT like shell where variables are not evaluated inside ''
    [] vs ()	# really depends if function is expecting a list or a tuple
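
A few of the string points above as runnable lines. The values are made up for illustration.

```python
rows = 1234567
with_commas = "Total rows processed: {:,}".format(rows)
fixed = "%8.2f" % 6.1234                # width 8, 2 decimals, right-aligned
centered = "{0: ^11}".format("mid")     # centered in an 11-character field

text = "abcdef"
print(with_commas)
print(repr(fixed), repr(centered))
print("abc" in text, text == 'abcdef')  # substring test, equality; quote style is irrelevant

# == compares values, `is` compares object identity
a = [1, 2]
b = [1, 2]
print(a == b, a is b)
```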
    Strings examples:
    description = """Topspin NMR software (data processing option only)"""
    """this can be multiline string
    and can serve as 
    comment out code
    """
    '''here is another multiline string
    that includes line break'''
    Note that while a multi-line string can be treated as a multi-line comment, the indentation matters!
    the quotes must start at the right indent level of the preceding line.  
    if it starts flush at the left margin, it can break code
    source = [ 'topspin.%s.tgz' % version ]
    '%s %s' % (path,version)
    install_cmd = [ 'tar xfzp %s/%s' % (source_urls,sources) ]     ## file:/// screws up tar
    sanity_check_paths = {
        'files': ['bin/%s' % x for x in ['moe', 'moebatch', 'chemcompd', 'rism3d', 'sdwash']],
        'dirs': [],
    }
    postinstallcmds = [ 'pwd', 'ls -la', 'touch TesT.txt', 'mkdir %(installdir)s/prog/curdir/wongja7', 'chgrp emv-structchem prog/curdir', 'chmod g+w prog/curdir' ]
    modextravars = {
        #'TOPSPIN_HOME': '/usr/prog/topspin/3.5pl2',
        'TOPSPIN_HOME': '%(installdir)s',
    }
    toolchain = {'name': 'dummy', 'version': 'dummy'}

    Globals, Module's var

    Best way to make module variables?
    This may be one way, which is what I used in taxo reporter.
    Define the variable at the top of the module, and comment that others who import it may change it?
    Similar in spirit to __debug__ and __builtin__
    import mydb = bar
    this way, bar could be set from cli args (eg parsed by argparse, so many file paths can be set at run time, yet have some defaults defined as the module's global var)
    cross-module var discussion suggested: = bar
    which may seem to be enough, but new versions of python may run into conflicts.
    Note __builtin__ is the global counterpart that needs to be imported before use.  python3 renamed it to builtins.
    Python2 FAQ 
    recommends the creation of a global module for the project, calling it or, 
    put all variables there,
    and have all consumers refer to it.
    For a large project with multiple, cross-module references, this avoids a spaghetti of "globals" in each module .py file.
    OOP's use of mutator/constructor to set them isn't necessary.  Just modify the var; python doesn't offer protection, just conventions.

    Scoping rules

    LEGB Rule.
    Local    (function)
    Enclosing function locals
    Global   (module)
    Built-in (python)
    Before changing a global var inside a fn, must first declare var as global 
    Python3 added a nonlocal statement (to rebind a var of the enclosing function)
    Python Scoping rules discussion
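
A sketch of the scoping rules in action, assuming python3 (nonlocal does not exist in python2). The names counter, bump_global and make_accumulator are made up.

```python
counter = 0                      # Global (module) scope

def bump_global():
    global counter               # without this, `counter += 1` raises UnboundLocalError
    counter += 1

def make_accumulator():
    total = 0                    # Enclosing scope for add()
    def add(n):
        nonlocal total           # rebind the enclosing function's variable
        total += n
        return total
    return add

bump_global()
acc = make_accumulator()
acc(5)
print(counter, acc(10), len("abc"))   # len is found in the Built-in scope
```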

    Snippets in stand-alone program

    # tab nanny (python2; python3 always raises TabError on inconsistent tab/space mixing)
    python -t   # display warnings
    python -tt  # display errors
    # use SPACES in python!!  
    # avoid TAB, which is treated as 8 spaces.
    # space is what delineate a block.  
    # code indented 4 spaces is at diff block level than those with 2 spaces !!
    # also note the use of :  after evaluation of condition, and after the else clause
    if ( A < B and C < D) :
        print( "and short-circuits: C < D is evaluated only if A < B is True" )
        print( "python &, | are bitwise operators" )
        print( "this is still part of the if-block" )
    elif ( P == Q ):
        print( "string and numeric equality is tested by ==" )
    elif ( P != Q ):
        print( "!= can be used to test whether two objects are different" )
    elif ( 1.5 < X < 4.8 or 178 < Y < 188 ):
        print( "range test can be carried out as condition evaluation" )
    else :
        print( "final else part" )
    print( "this line is beyond the end of the if/elif/else block" )
    # note there are no brackets or endif command to delineate the block !!
    eg for-loop
    for x in list :
    use `continue` to jump to next iteration
    while( X < 10 ) :
    # logical operators: and, or, not
    # just simple words, no all caps, no use of && ||   (editor will color these reserved words differently)
    # string equality comparison using == 
    txt = "abc"
    if( txt == "abc" ):
        print( "match" )
    # import regular expression (regex) lib
    import re
    # this is closest to perl re search
    m = re.search( r"(\w+)(Jul)(\w+)", "foo_Jul_bar" )
    if m : # ie execute only when a match is found
    	print( "YES match found" )
    	print( m.group(0) ) # "foo_Jul_bar", ie the whole regex match
    	print( m.group(1) ) # "foo_", perl's \1
    	print( m.group(2) ) # "Jul" , perl's \2
    	print( m.group(3) ) # "_bar", perl's \3
    else : 
    	print( "NO  match found" )
    #re.match(...) matches only starting from the beginning of the string
    # get command line argument
    import sys
    option1 = sys.argv[1]
    # argv[0] is the name of the script as invoked (the path of the python interpreter is in sys.executable)
    # example for enumeration and 2D hash 
    # enum functional style, need python3 
    from enum import Enum
    RankSet = Enum( 'Rank',   'species genus family order superkingdom' )
    RankSet = Enum( 'Rank',  ['species', 'genus', 'family', 'order', 'class', 'phylum', 'superkingdom', 'no rank', 'NoLineageData']  )       
    def example2Dhash( giList ) :
        # 2D dictionary is really a hash nested inside another hash
        # simple usage can use a decent format.  
        # but initialization is pretty hairy, 
        # Under some circumstances may not need to init the 2D dictionary, 
        # but in this eg the 2D hash is evaluated before it is set
        # in the line "if lineage in resultTable2[rank]:"
        # therefore init is needed (or add more test conditions before the if-line).
        # may really want to create a class, and go with OOP for at least this data structure...
        # 2D hash ref: 
        resultTable2 = {"species": {}}
        resultTable2 = {"NoLineageData": {}}
        for nom in RankSet.__members__ :
                    resultTable2.update( { nom: {"NoLineageData":0} } )     # seed both hash keys, may not need this complication for init sake
                    resultTable2.update( { nom: {} } )                      # seed only first hash key
        # other example of init elements of the 2D hash:
        #resultTable2.update( { "NoLineageData": {"NoLineageData":0} } )
        #resultTable2 =       { "species": {"homo":0} }
        #resultTable2.update( { "species": {},         "genus":   {},          "family":  {} } )
        #resultTable2.update( { "species": {"HOmo":0}, "genus":   {"GEnus":0}, "family":  {} } )
        # if did not initialize the 2D hash above, assignment below would fail.
        for gi in giList:
              for rank_item in RankSet:
                    print( rank_item.name )      # .name ref
                    rank = str( rank_item.name )
                    lineage = getLineageByGi( gi, rank )
                    dbg( "%s \t %s \t %s " % (gi, rank, lineage) )
                    if lineage in resultTable2[rank]:               # python3 changed  has_key to "KEY in"
                        resultTable2[rank][lineage] += 1		
                    else:
                        resultTable2[rank][lineage]  = 1
                        ###resultTable2.update( { rank }: {[lineage]: 1} )	# don't really need this convoluted syntax!
              #for-end rank_set
        #for-end gi
        print( resultTable2["species"]["Aedes pseudoscutellaris reovirus"] )
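
An alternative sketch for the 2D hash that avoids the seeding step. It swaps in collections.defaultdict from the standard library, which is not what the code above uses. The observations list is made-up sample data standing in for the lookup results.

```python
from collections import defaultdict

# rank -> lineage -> count; inner dicts and zero counts are created on first use
result_table = defaultdict(lambda: defaultdict(int))

# hypothetical (rank, lineage) observations
observations = [
    ("species", "HBV"),
    ("species", "HBV"),
    ("genus", "Orthohepadnavirus"),
]
for rank, lineage in observations:
    result_table[rank][lineage] += 1     # no "if lineage in ..." test needed

print(result_table["species"]["HBV"])
print(dict(result_table["genus"]))
```

One trade-off: merely reading a missing key also creates it, so use `in` or .get() when probing without wanting to insert.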
    Ref for Enumeration :
  • tech blog cover .name, .value, Enum( n ), Enum['name']
  • python3 standard library autonumber eg
    # Auto numbering Enumeration in a class, so as to be able to define functions
    # It demo some construct, but a hash maybe simpler and less overhead
    # Enum was a python3 feature, thus
    # in python2, need to "pip install enum34"  (which is diff than enum module, out of fashion now)
    from enum import Enum
    class AutoNumber( Enum ) :
            def __new__( cls ) :
                    value = len( cls.__members__ ) + 1
                    obj = object.__new__( cls )
                    obj._value_ = value
                    return obj
    class RankSet( AutoNumber ) :
            species         = ()    # order in this list matter!!
            genus           = ()    #
            family          = ()    # can add other ranks in middle if desired
            superkingdom    = ()    # code expects sk to be highest
            #'no rank'       = ()   # can't do this, but new class-way of RankSet should not need this anyway
            @classmethod            # needed so that RankSet.getLowest() can be called on the class itself
            def getLowest( cls ) :
                    #(name, member) = cls.species
                    #return cls.species     # return RankSet.species  (what is needed programatically for getParent() etc
                    return cls( 1 )         # know that lowest rank in Enum class starts with 1
                    #return        # return species  #
                    # below would do the equivalent, but much slower
                    #for (name, member) in cls.__members__.items() :
                    #        if member.value == 1 :
                    #                return name
                    #return cls.__members__
            @classmethod
            def getHighest( cls ) :
                    return cls.superkingdom      # how to use value=max ??
            # getParent(species)
            @classmethod
            def getParent( cls, rank ) :
                    if( rank == cls['superkingdom']  ) :
                            return None         # return None, as no parent for sk
                    return cls( rank.value + 1)
                    # below would do the equivalent, but much slower
                    #for (name, member) in cls.__members__.items() :
                    #        if member.value == rank.value + 1 :
                    #                return member
            @classmethod
            def getChild( cls, rank ) :
                    if( rank.value == 1 ) :
                            return None     # return None, as no child for species
                    return cls( rank.value - 1)
                    # below would do the equivalent, but much slower
                    #for (name, member) in cls.__members__.items() :
                    #        if member.value == rank.value - 1 :
                    #                return member
    # RankSet class end
    # RankSet is meant to be a static class, not to be instantiated.
    # support calls like these:
    #        RankSet.getLowest()                            # RankSet.species
    #        RankSet.getLowest().name                       # species
    #        RankSet.getParent( RankSet.getLowest() )       # RankSet.genus
    #        r = RankSet.getChild( RankSet.getLowest() )    # get None when "out of range"
    #        if r is None :					# r == None works, but may break when == gets overloaded
    #                print( "got None from RankSet fn call..." )
    #        RankSet(1).value  RankSet(3).name   RankSet['genus']  are valid attributes
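
A compact, runnable variant of the same idea, assuming python 3.4+ and using explicit values instead of the AutoNumber trick. The class and method names here (Rank, lowest, parent) are made up and are not the ones used above.

```python
from enum import Enum

class Rank(Enum):
    species = 1
    genus = 2
    family = 3
    superkingdom = 4

    @classmethod
    def lowest(cls):
        return cls(1)

    @classmethod
    def parent(cls, rank):
        # highest member found by value, so no rank name is hard-coded
        if rank is max(cls, key=lambda r: r.value):
            return None
        return cls(rank.value + 1)

print(Rank.lowest().name)
print(Rank.parent(Rank.species))
print(Rank.parent(Rank.superkingdom))
print(Rank['genus'].value, Rank(3).name)
```

max(cls, key=...) is one possible answer to the "how to use value=max ??" question: iterating an Enum class yields its members in definition order.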
    Snippet with example of namedtuple ::
    from collections import namedtuple
    GiNode = namedtuple( 'GiNode', ['Freq', 'Taxid'] )      # define the type once, outside the loop
    def eg_of_create_namedtuple( filename ) :
        giList = { }                
        f = open( filename, 'r' )
        for line in f:
            lineList = line.split( '|' )
            g = lineList[1]                         # python list index start at 0
            if g not in giList :
                    taxid = getTaxidByGi( g )
                    giList[g] = GiNode( Freq=1, Taxid=taxid)
            else :
                    #(freq, taxid) = giList[g]  # this works
                    gin = giList[g]		    # but this keeps to the spirit of namedtuple as an entity
                    giList[g] = GiNode( gin.Freq+1, gin.Taxid )
        f.close()
        return giList
    def egConsumer_of_namedtuple( giList ) :
            for g in giList :
                    print( "looking at gi:%s \t with freq: %s \t and taxid=%s" % (g, giList[g].Freq, giList[g].Taxid) )
                    if giList[g].Taxid not in resultTable4[currentRank] :
    			# the "key" to the namedtuple is available here even when it is not defined here
                            parentTaxid = getParentByTaxid( giList[g].Taxid )
                            rankName = getLineageByTaxid( giList[g].Taxid, currentRank )
                            node = TaxoNode( parentTaxid, rankName, giList[g].Freq, giList[g].Taxid )
                            resultTable4[lowestRank][giList[g].Taxid] = node
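
A stand-alone namedtuple sketch (the taxid value is just sample data). Because tuples are immutable, "updating" a field means building a new tuple, which _replace() does in one call.

```python
from collections import namedtuple

GiNode = namedtuple('GiNode', ['Freq', 'Taxid'])

node = GiNode(Freq=1, Taxid='9606')          # '9606' is sample data
freq, taxid = node                           # unpacks like a plain tuple
bumped = node._replace(Freq=node.Freq + 1)   # new tuple with one field changed

print(node.Freq, bumped.Freq, bumped.Taxid)
print(bumped._asdict())
```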
    # reading text file
    f = open( filename, 'r' )
    print( f.read() )		# print whole file (print( f ) would only show the file object)
    f.seek( 0 )			# rewind before reading again
    for line in f:
            print( line )
            lineList = line.split( ',' )
    # write to file
    outFH = open( outfile, 'w' )
    outFH.write( "typical write method\n" )
    print( "print redirect write method, need to add 'from __future__ import print_function' in python2 to work" , file = outFH )
    displayText = '{0: ^50}'.format( entry )
    print( "%5d \t %8.4f %% \t %s" % (intNum, floatNum, stringVar) , file = outFH )
    outFH.close()
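
The same read/write steps can be sketched with the with statement, which closes the file automatically. This assumes python3; the temporary file and its contents are made up for the demo.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.csv")   # throwaway file for the demo

with open(path, "w") as out_fh:                        # closed automatically on exit
    out_fh.write("gi,taxid\n")
    print("%5d,%s" % (42, "9606"), file=out_fh)        # print redirected to the file

with open(path) as in_fh:
    rows = [line.rstrip("\n").split(",") for line in in_fh]

print(rows)
```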
    The python Sort HowTo is a concise read on how to sort iterables by specifying which field to use as key. __repr__ ...

    mylist.sort() sorts in-place, so it saves space and is slightly faster; it returns None. sorted(mylist) returns a new sorted list, so it is a tad slower, but said to be not too significant.
    By default, Python uses timsort, an optimized mergesort hybrid. It is heavily optimized for already-sorted input, where it finishes in O(N) time (N-1 comparisons). The worst case is O(N log N), ie about lg(N!) comparisons.
    More info at
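
A small sketch of sorting by a chosen field with key=, on made-up (name, count) rows.

```python
from operator import itemgetter

rows = [("HIV", 14), ("BK", 28), ("HBV", 13)]   # (name, count) sample data

by_count = sorted(rows, key=itemgetter(1))                      # new list, ascending by count
by_name_desc = sorted(rows, key=lambda r: r[0], reverse=True)   # new list, names descending

result = rows.sort(key=itemgetter(1), reverse=True)             # in place, returns None
print(by_count)
print(by_name_desc)
print(rows, result)
```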

    Functional Programming in python

    Python supports imperative(procedural), OOP, as well as functional style. Since it is not dictated/required, hybrid approach is possible. A few observations:
    1. avoiding side effects (core of functional programming) may not always be possible. eg. printing a message to screen, writing to a file.
    2. functional programming centers on being stateless. This is easier to achieve for a function that takes specific input and produces output deterministically, but the inside of the function may need to be stateful for more complex tasks. Well, sorting is complex and can be done procedurally (bubble sort), but it can also be done by divide and conquer without state, as in merge sort.
    3. by focusing on statelessness, functional programming is in this sense at the opposite end of the spectrum from OOP, which is about objects with methods that change internal state.
    The following slide from my colleague Wes provide the gist of FP in Python:
    lambda functions: create anonymous functions
        	addFive = lambda x: x + 5
        	addFive(8) 		# result: 13
        	map(func, sequence) 	# Applies func() to every element of sequence.
    	filter(func, sequence) 	# Returns elements where func() returns True.
    	reduce(func, sequence) 	# Reduces a list to a single value.  python3: from functools import reduce
    	total = reduce(lambda x,y: (x+y), [1,2,3]) 	# result: 6  (naming it sum would shadow the built-in sum())
    list comprehensions: syntactic sugar, clearer than map() or filter()
    	[x.upper() for x in seq]              vs.    map(lambda x: x.upper(), seq)
    	[x         for x in seq if x > 0]     vs. filter(lambda x: x > 0,     seq)
    collect() 	# Return a list of all elements
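
A runnable version of the map/filter/reduce lines, assuming python3, where map() and filter() return lazy iterators and reduce() lives in functools. The sample list is made up.

```python
from functools import reduce   # built-in in python2, moved to functools in python3

seq = [3, -1, 4, -5]

doubled = list(map(lambda x: x * 2, seq))        # list() forces the lazy iterator
positives = list(filter(lambda x: x > 0, seq))
total = reduce(lambda x, y: x + y, seq)

# the comprehension spellings of the same two operations
doubled_lc = [x * 2 for x in seq]
positives_lc = [x for x in seq if x > 0]

print(doubled, positives, total)
```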
    maybe useful books...
    concise intro to functional programming.  likely using python as construct.
    about 45 pages.  maybe better than some web stuff?
    talks about lambda fn, map/reduce/filter, then go into recursion, comprehension, generators.
    the above is Section III of Treading on Python Volume 2: Intermediate Python
    (seems like I don't like either).
    start out with procedural/functional hybrid.
    maybe easier to follow to get more functional code into programs.
    some 330 pages.  dive into many specific of iter(), where they are used, etc.
    worthwhile if start programming a lot in python.
    PS.  LISP is the early functional language.  code was pretty hard to read.  Erlang is more modern.  CouchDB is coded in Erlang.


    even if don't really want to get fully functional, understanding iterators goes a long way in understanding many procedural constructs.
    eg.  for X in Y is really for X in iter(Y) 
    lists and dictionaries are iterable.
    dictionaries especially!
    m = { 'jan': 1, 'feb': 2, 'mar': 3 }
    for key in m: 		# same as for key in iter(m)   
        print( key, m[key] )
    			# side note: python2 allowed 
    			# if m.has_key( k ) 
    			# the has_key() is no longer avail in python3
    			# so use the syntax of
    			# if k in m  
    iter( m )		# create iterator from dictionary.  see 
    m.items()		# python2 use m.iteritems()
    in 2D dictionaries... ??
    table = { 'species': { 'HBV': 13, 'BK': 28, 'HIV': 14 }, 
              'genus':   { 'H':   27, 'B':  28 },
              'family':  { 'tot': 55 } }
    ranks = table.keys()				# python2: table.iterkeys()
    familySum = sum( table['genus'].values() )	# python2: itervalues()
    table.items()					# (rank, inner dict) pairs; python2: iteritems()
    there are the 
    that help understand list/tuple generations/conversion. 
    Ref: Python 2 - Functional - iterators
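
A sketch of what the for-loop does behind the scenes, plus one way to walk a 2D dictionary. It assumes python3, and the counts are sample numbers.

```python
table = {
    'species': {'HBV': 13, 'BK': 28, 'HIV': 14},
    'genus':   {'H': 27, 'B': 28},
    'family':  {'tot': 55},
}

it = iter(table)            # a for-loop calls iter() and then next() repeatedly
first_key = next(it)        # dicts iterate over their keys

# items() yields (key, value) pairs; here each value is the inner dict
totals = {rank: sum(counts.values()) for rank, counts in table.items()}
print(first_key, totals)

exhausted = iter([])
print(next(exhausted, 'done'))   # a default avoids the StopIteration exception
```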

    Generator and Comprehension

    This is probably key to wrap head around functional programming.
    () is for generator expression ... returns an iterator
    [] is for list comprehension  ... returns a list
    the content inside the parentheses and brackets (the for clause) tells that it is not a plain tuple or list literal
    ( obj.count for obj in list_all_objects() )

    Ref: Python 2 - Functional - generator...
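
A side-by-side sketch of a list comprehension, a generator expression and a generator function. The function name squares is made up.

```python
def squares(n):
    """Generator function: yields values lazily, one per next() call."""
    for i in range(n):
        yield i * i

as_list = [i * i for i in range(5)]     # list comprehension: built immediately
as_gen = (i * i for i in range(5))      # generator expression: nothing computed yet

print(as_list)
first, second = next(as_gen), next(as_gen)   # values are produced on demand
rest_sum = sum(as_gen)                       # consumes whatever is left
leftover = list(as_gen)                      # a generator can be consumed only once
print(first, second, rest_sum, leftover)
print(list(squares(4)))
```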

    Object Oriented Programing in Python

    OOP, especially data structures with functions to modify their state, is like the opposite of Functional Programming. GUIs are probably natural with OOP, but biz logic is probably better with FP, and a Procedural approach is often good enough. Python modules provide encapsulation and separation, yielding some benefits of OOP w/o the altered logic imposed by classes. see
    class myClass(parentClass1, parentClass2) : 
    	classwideVar = "this is shared by all object/instances of this class. "		# be careful with this, not like Java!
    	def __init__( self ) :
    		self.instanceVar = "this is instance specific"
    	def fn( self ) : 
    		print( "hello world" )
    		# super() refers to parent class
    parentClass can be blank if not inheriting anything. This is defined in the class statement; object creation need not state anything about it. Standard data types can be used for parentClass, eg object, Enum. Multiple parent classes can be listed, comma delimited.
    x = myClass()
    xf = x.fn	# this is valid!  a method is an attribute too... this binds the method to another name
    xf()		# actually calls x.fn()
    myClass.fn(x) 	# this is what is happening when calling x.fn(), which is why first param of fn is called self.
    Data attributes override method attributes with the same name !!
    Use some standard to avoid bugs, eg verbs for methods, nouns for data.
    well, Java says data should be private and accessed via methods provided by the class...
    class Pizza(object):
        shape = "round"					# ie, all pizzas will have the same shape.  
        favoriteIngredient = "pepperoni"
        def __init__(self, ingredients):
                self.ingredients = ingredients		# instance variables come to life when they are first assigned
        @classmethod					# define a class-static method, ie not variable by object instantiation
        def getFavoriteIngredient( cls ):
        	return cls.favoriteIngredient
    p = Pizza( "pineapple" )
    print( p.ingredients )					# attributes in python are "public" in the C++/Java nomenclature
    							# nothing in python enforces data hiding, it is all done by convention!!
    print( Pizza.shape ) 
    # class is object too in python!
    # Exceptions are ... ??
    ref: static class in python tutorial
    Python 3 tutorial on classes
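
A sketch of the class-wide vs instance variable pitfall and of bound methods. This Pizza class is a made-up variant for illustration, not the one defined above.

```python
class Pizza(object):
    toppings_shared = []            # class attribute: one list shared by every instance

    def __init__(self, name):
        self.name = name            # instance attribute
        self.toppings = []          # a fresh list per instance

    def add(self, item):
        self.toppings.append(item)
        self.toppings_shared.append(item)   # mutates the shared class-level list

a, b = Pizza("a"), Pizza("b")
a.add("pineapple")
print(b.toppings)            # b's own list is untouched
print(b.toppings_shared)     # but the class-level list was changed through a

order = a.add                # bound method stored under another name
order("ham")
Pizza.add(b, "olive")        # what b.add("olive") does under the hood
print(a.toppings, b.toppings)
```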

    DataFrame (Pandas), DataSeries

  • DataFrame is essentially a table (2D).
  • Operations (methods) work on all elements of a given column, so this avoids having to write iterative loops.
  • Series is a different data structure (1D) and has different methods.
    A single column or a single row of a DataFrame is a Series.
    import pandas as pd
    unemployment = pd.read_csv("data.csv")
    myTable.to_csv("path/result.csv")  # save result, export to csv
    slices (return another dataframe) vs loc/iloc (return a Series when selecting a single row or column)
    pretty confusing here.
    Selection does not mutate the dataframe.
    It returns a new object that is displayed by jupyter notebook, but otherwise discarded if not assigned to a variable.
    Note that for loc (label based), the ending index IS included.  But it is NOT included in iloc (position based, like python slices). !! 
    [...] is for slicing rows, or for selecting one column (returns a Series)
    [[...]] selects a list of columns (returns a DataFrame)
    Used to join two dataframes.  
    This is essentially a JOIN in datababase parlance.
    Left/Right inner/outer applies, which may generate really strange looking tables.  RTFM.
    inplace=True # edit table in place (the method then returns None)
    inplace=False  # good for transient display?   doesn't save into existing table, saving needs assignment to a new table
    unemployment.drop(... , inplace=True, ... )   # drop column.  do not assign the result back when inplace=True, since it is None
    dropna()  # drop rows with missing values (by default).
    unemployment['en_name'].unique()  # return unique country names
    unemployment['en_name'].nunique()  #  think of count( ...unique() )
    unemployment['unemployment_rate'].isnull().sum()  # give a count of number of rows where column unemployment_rate is null (ie missing data).
    .reset_index(drop=True, inplace=True)    # eg before plotting, good to reset the row index if done work to remove data.
    index usually used as x values in plots, thus sequential indexing would be nice (or else get gap?)
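
    A sketch tying the cleanup calls above together on a tiny made-up table (the "note" column and all values are hypothetical). It also shows why the result of an inplace=True call should not be assigned back.

```python
import pandas as pd

unemployment = pd.DataFrame({
    "en_name": ["Austria", "Austria", "Belgium", "Belgium"],
    "unemployment_rate": [4.9, None, 8.5, 8.4],
    "note": ["a", "b", "c", "d"],
})

missing = unemployment["unemployment_rate"].isnull().sum()   # count of rows with no rate

trimmed = unemployment.drop(columns=["note"])                # returns a new table, original untouched
result = unemployment.drop(columns=["note"], inplace=True)   # edits the original, returns None

cleaned = unemployment.dropna()            # rows with a missing value are dropped, old index labels kept
cleaned = cleaned.reset_index(drop=True)   # renumber rows 0..n-1, eg before plotting

print(missing, result)
print(cleaned)
print(cleaned["en_name"].nunique())
```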
    pd.to_datetime('1868/3/23')  # in yyyy/m/dd format!! :)
    pd.to_datetime('3/23/1868', format='%m/%d/%Y')  # specify format
    Both calls return a pandas Timestamp object.
    Ref: Berkeley D-Lab
  • Introduction to pandas
  • PySpark

    PySpark, Python, SparkSQL and submitting a job to a Cloudera YARN cluster
    (More info about these technologies in the BigData page.)
    from pyspark import SparkContext
    from pyspark.sql import SQLContext, Row
    from pyspark.sql.types import *
    def main() :
            sc = SparkContext( appName='pyspark_yarn_app' )
            #sc = SparkContext( 'local', 'pyspark_local_app' )
            sqlContext = SQLContext(sc)
            lines = sc.textFile("ncbi.taxo.dump.csv") 
            parts = lines.map(lambda l: l.split("\t"))   # split each tab-separated line into fields
            acc_taxid = parts.map(lambda p: (p[0], p[1].strip(), p[2].strip(), p[3].strip() ))
            schemaString = "acc acc_ver taxid gi"
            fields = [StructField(field_name, StringType(), True) for field_name in schemaString.split()]
            schema = StructType(fields)
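            schemaAccTaxid = sqlContext.createDataFrame(acc_taxid,schema)
            schemaAccTaxid.registerTempTable("acc_taxid")   # register the table name used in the SQL below (createOrReplaceTempView in spark 2.0+)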
            sqlResult = sqlContext.sql( "SELECT taxid from acc_taxid WHERE acc_ver = 'T02634.1' " )  # sparkSQL does NOT allow for 
            myList = sqlResult.collect()            # need .collect() to bring the result back as a list of "Row"
            print( myList[0].taxid )                # taxid is the name of the column specified in select
            # note that std out is typically mixed with a lot of hadoop job output, best to print to a file
    # main()-end
    # ref:
    To submit to cluster, run spark-submit from the command line, with or without explicit job parameters:
    spark-submit --master yarn --deploy-mode cluster
    spark-submit --master yarn --deploy-mode cluster --driver-memory 8G --executor-memory 16G --total-executor-cores 32
    (--total-executor-cores is honored only by standalone and Mesos masters; on YARN use --num-executors and --executor-cores instead.)

    If the python program (app) resides in HDFS, then it can be specified as
    spark-submit --master yarn --deploy-mode cluster "hdfs:///user/tin/"

    YARN adds quite a number of wrapping layers, so much of the standard output and std err gets lost. To see those, it is better to run in local mode instead of cluster mode. Use one of:
    spark-submit --master local
    spark-submit --master local[4]

    Common location to hunt for spark-submit and pyspark:

    IMHO, it is best to specify the job parameters on the command line as arguments to spark-submit. However, they can be coded in the python app itself by setting them on a SparkConf that is passed to the SparkContext, see code below for example.
    The settings defined in the python code trump the cli arguments for spark-submit.
    from pyspark import SparkContext, SparkConf
    from pyspark.sql import SQLContext, Row
    from pyspark.sql.types import *
    def main() :
            conf = SparkConf()
            conf.set( "spark.app.name", "spark_app")     # key was blank in these notes; spark.app.name is assumed from the value
            #conf.set( "spark.master", "local" )                                 
            conf.set( "spark.master", "yarn" )
            conf.set( "spark.submit.deployMode", "cluster" )
            conf.set( "spark.eventLog.enabled", True )                        
            conf.set( "spark.eventLog.dir", "file:///home/tin/spark" )       
            sc = SparkContext( conf=conf )                               # conf= is needed for spark 1.5 and older
            sqlContext = SQLContext(sc)

    Parallel Programming in python

    1. Python Global Interpreter Lock (GIL) ensures only 1 thread runs python bytecode at a time, thus threads in a python program cannot run python code in parallel on multiple cores. A running thread is asked to release the GIL every 5 ms (the default switch interval) so other threads can be scheduled. NOTE: multiple processes are completely independent (ie they have their own GIL).
    2. Network IO functions typically release the GIL while they transfer data.
    3. Threads are still available (from threading import Thread, Event) but suitable mostly for doing async io stuff. With the GIL in the current implementation it is hard to get high perf parallel code out of threads.
    4. NumPy/SciPy, zlib, bz2, and many high perf math libs can run in parallel with other threads because they are implemented in C. The Python interface to them releases the GIL while running.
    5. Parallelization for AI work: TensorFlow and PyTorch (SciKit-Learn?) are implemented in C++ as python extensions, and code there does not depend on the GIL either. multi-core CPU and GPU code works fine in this space.
    6. PySpark, but have to use the Hadoop/Spark framework.
    7. mpi4py: the parallel paradigm of MPI (message passing).
    8. Child process based approach: Process and Pool classes (import multiprocessing). Because each process has its own GIL, this tends to be higher performance for CPU-bound work. But there is overhead for inter-process communication: serialize-deserialize. (If fork()-based, the child starts with a copy-on-write snapshot of the parent memory/data, so it can read it, but changes are not shared back.)
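
    A rough sketch of items 1, 3 and 8 using only the standard library. fake_io and count_primes are made-up stand-ins for real work: time.sleep releases the GIL much like blocking network IO does, while count_primes is pure python and holds it.

```python
import sys
import time
from concurrent.futures import ThreadPoolExecutor
from multiprocessing import Pool


def fake_io(seconds):
    """Stand-in for blocking IO: sleeping releases the GIL."""
    time.sleep(seconds)
    return seconds


def count_primes(limit):
    """CPU-bound work in pure python: count the primes below limit."""
    count = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count


def run_io_in_threads(delays):
    """Threads overlap the waits, so total time is close to the longest delay."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=len(delays)) as pool:
        results = list(pool.map(fake_io, delays))
    return results, time.perf_counter() - start


def run_cpu_in_processes(limits, workers=2):
    """Each worker process has its own interpreter and its own GIL."""
    with Pool(processes=workers) as pool:
        return pool.map(count_primes, limits)


switch_interval = sys.getswitchinterval()   # how often a running thread is asked to give up the GIL

# Typical use from a script (the guard matters where child processes re-import the script):
# if __name__ == "__main__":
#     print(run_cpu_in_processes([10_000, 20_000]))
```

    Arguments and results of run_cpu_in_processes are pickled on the way in and out, which is the serialize-deserialize overhead mentioned in item 8.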

    Fluent Python. Ch 20: Concurrency Models in Python

    Concurrency is about keeping track of many things that are happening at the same time, and structure is needed to keep track of this. However, a concurrent solution may not always be parallelizable. Parallelism deals with execution.

    processes communicate via pipes, which carry raw bytes, so the two ends can be written in different languages

    threads are within the same program, thus they share memory, and thus the language and data structure format. Much easier to code for simpler tasks such as array sharing.
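
    A minimal sketch of the contrast, standard library only. The child process here happens to be another python interpreter, but since only raw bytes cross the pipe it could be a program in any language. Function names are made up for illustration.

```python
import subprocess
import sys
import threading


def upper_via_child(data: bytes) -> bytes:
    """Send raw bytes to a child process over a pipe and read back its reply."""
    child_code = "import sys; sys.stdout.buffer.write(sys.stdin.buffer.read().upper())"
    done = subprocess.run(
        [sys.executable, "-c", child_code],
        input=data,                 # written to the child's stdin pipe
        stdout=subprocess.PIPE,     # read back from the child's stdout pipe
        check=True,
    )
    return done.stdout


def fill_shared_list(n_threads: int, per_thread: int) -> list:
    """Threads append to the very same list object; a lock guards each update."""
    shared = []
    lock = threading.Lock()

    def worker(tag):
        for i in range(per_thread):
            with lock:
                shared.append((tag, i))

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared


print(upper_via_child(b"hello pipe"))
print(len(fill_shared_list(4, 100)))
```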

    Dask is a parallel lib that can farm work out to a cluster of machines (think HPC). It offers APIs with routines that resemble (but are not identical to?) NumPy, Pandas and Scikit-Learn. For a large parallel program, especially at the start, Dask would be a good platform.

    Dask has a scheduler. While one can run it on a laptop, on HPC it needs to launch its pieces (scheduler and workers) as batch jobs.
    On the user end, the dask/python code needs the dask-jobqueue library installed. Write a declaration of how big the dask cluster job will be, and also write a slurm job submit script requesting the desired resources (number of nodes, running time, etc). Slurm would just run the job like any multi-node job. ref:

    Dask with JupyterLab: SSH tunneling was used so the web browser on the laptop can reach the cluster. Maybe can use OOD (Open OnDemand) to circumvent the same-network requirement.
    Interactive data analysis using Dask and HPC is possible, but heterogeneous node job scheduling needs more work.



    Python vs your favorite language

    As explained by the folks at toggl
    (Yes, Python is the real thing! -- well, so is Perl :)

    Doc URL

    (cc) Tin Ho. See main page for copyright info.